Tuesday 26 March 2013

vSphere wrongly allows a RAW-mapped logical drive to be used by two VMs

I haven't written anything on my blog for quite some time, mostly because nothing really interesting has happened.

Recently, however, I ran into some weird vSphere behaviour, after which I had to recover two file servers. vSphere allowed me to RAW-map an FC-connected logical drive to a VM even though it was already mapped to another one.

Well, it was our fault that we hadn't labelled this logical drive properly: when I looked into DS Storage Manager, the drive was named something unhelpful like "DATA3". When I asked my colleagues, they told me it was probably an unused piece of the storage.

I was actually going to create another logical drive in order to extend an LVM volume on one of our Linux file servers. So, being completely sure that vSphere would not allow me to RAW-map an already-used logical drive, I mapped this one to the VM and extended the file system, adding those 2 TB.

Because this was a production server, I performed an online resize (reiserfs), after re-mounting the volume read-only first, of course.
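
For reference, the extension itself was the usual LVM routine. Here is a minimal sketch of the steps, assuming the new RDM showed up as /dev/sdd and the volume is /dev/vg_data/lv_share mounted at /srv/share (all names are hypothetical):

    pvcreate /dev/sdd                       # initialize the new RDM disk as an LVM physical volume
    vgextend vg_data /dev/sdd               # add it to the volume group
    lvextend -L +2T /dev/vg_data/lv_share   # grow the logical volume by 2 TB
    mount -o remount,ro /srv/share          # re-mount the file system read-only, just in case
    resize_reiserfs /dev/vg_data/lv_share   # grow reiserfs to fill the enlarged LV (online)
    mount -o remount,rw /srv/share          # back to read-write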

Two days later, our monitoring system notified me that there were serious FS errors on this server. I was quite surprised, because I had extended reiserfs partitions online before with no issues. Still, assuming it was a bug in a partially-supported (?) file system triggered by resizing a relatively big partition (3 TiB to 5 TiB), I unmounted the volume and ran reiserfsck.
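
The first pass was the ordinary check-and-fix routine, roughly like this (same hypothetical device name as in the sketch above):

    umount /srv/share
    reiserfsck --check /dev/vg_data/lv_share        # report corruption
    reiserfsck --fix-fixable /dev/vg_data/lv_share  # repair what can be fixed without a tree rebuild
    mount /srv/share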

After the check completed (a few errors were fixed), I mounted it back and was happy... until the next day...


The next day I was notified about errors again. WTF? I ran reiserfsck again... and when it finished, it recommended that I... rebuild the whole tree... s**t! Almost 3 TB of data, several million small files, on shared SATA storage...
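
The tree rebuild is reiserfsck's last-resort mode: it scans the whole device and reconstructs the internal tree from whatever leaf blocks it finds, which is why it takes so long on a volume this size. Roughly (hypothetical device name again):

    umount /srv/share
    reiserfsck --rebuild-tree /dev/vg_data/lv_share  # scans the entire device and rebuilds the FS tree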


OK, needed means needed... I started the rebuild. At the same time I began restoring data from backup to another partition, in order to compare the differences. And at that moment...


- "Hello, I cannot access home folders. Can you check please?"
- "One moment..... Uhm.... what the...?! Seems like we have a H/W failure, I'll call you back.."

The home folders on a Windows file server had disappeared... Disk Management showed a 2 TiB drive with an "Unknown" partition...

It hit me like a blood-freezing bolt of lightning!


I switched to the vSphere client console: Windows VM settings -> Hard disk 2 -> Manage Paths... Error!


vCenter had somehow "forgotten" about this RAW mapping and treated the logical drive as free! S**T!


Recovery procedure

By this moment the FS tree rebuild had completed. I switched the recovered partition to read-only and started an rsync to a new one. Luckily we had enough free space.
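
The copy itself was nothing fancy; a sketch of the idea, assuming the rebuilt volume is mounted at /mnt/old and the fresh one at /mnt/new (hypothetical paths):

    mount -o remount,ro /mnt/old                 # freeze the source
    rsync -aHAX --progress /mnt/old/ /mnt/new/   # preserve permissions, hard links, ACLs and xattrs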

The restore from backup took a lot of time, because (of course) the data was spread across different tapes, so most of the time was spent watching the status "changing tape"...

The Windows server... Disaster... Just a few days earlier (in order to optimize our backups) the old backup had been completely removed, and the new configuration was still being prepared. Shame on our backup administrators...

So, after the rsync on the Linux server had completed, I disconnected the "problematic" drive from the Linux server and began scanning it on the Windows server with a tool named Restorer2000 Ultimate. Even though I chose to search only for NTFS partitions, the finished scan gave me a few thousand potential FS structures. It took some time to find the proper one, and then the recovery process began.

I like NTFS for its recovery possibilities. I was able to recover almost all of the 1.6 TB of data, except for several damaged files.

It was the same on the Linux file server: a checksum comparison with the backup copy showed no differences.
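
The comparison was a plain checksum pass over both trees, something like this (paths are hypothetical):

    cd /mnt/new && find . -type f -print0 | sort -z | xargs -0 md5sum > /tmp/new.md5
    cd /mnt/backup && md5sum --quiet -c /tmp/new.md5   # prints only the files that differ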

In the end we lost far less data than we could have. I can't say it was a new experience for me, because I have been through even worse disasters in my practice. But (again) I was reminded of the necessity of proper labels!

PS: Unfortunately, under heavy pressure from users I had no time to record everything, so I currently don't have enough data to submit a bug to VMware. And I have no spare resources to try to reproduce it.