How to recover from a broken RAID5


In this article I will describe an experience that began with the failure of some RAID5 disks at the Hospital of Pediatric Specialties, where I work. While I wouldn’t wish such an event on my worst enemy, it taught me about the power of knowledge: the deep kind of knowledge that is so important in the hacking culture.

Friday, April 29, 2005

This article has downloads!

A 5-disk RAID5 (18GB per disk) was mounted on an HP Netserver Rack Storage/12. Due to a power outage yesterday, the rack would no longer recognize the RAID. As a matter of fact, there were two more RAIDs on the rack that were recovered... but this one (holding about 60GB of data) just wouldn’t work.

The IT manager decided to call in some “gurus” to try to get the data back on-line. I (the only GNU/Linux user at the IT department) thought that something could be done with GNU/Linux. My first thought was: “If I get images of the separate disks, maybe I can start a software RAID on GNU/Linux. All I need is enough disk space to handle all of the images”. I told my crazy (so far) idea to the IT manager and he decided to give it a try... but only once the gurus gave up.

Monday, May 2, 2005

The gurus are still trying to get the data back on-line.

Tuesday, May 3, 2005

The gurus are still trying to get the data back on-line.

Wednesday, May 4, 2005

These guys are stubborn, aren’t they?

Thursday, May 5, 2005

The IT manager called me late in the afternoon. I was given the chance to Save the Republic. One of the disks of the array had been removed. I put the remaining disks in a computer as separate disks (no RAID) and booted it with Knoppix. (The environment of the IT department is Windows based, apart from my desktop, which has the XP that came with the HP box plus Mandriva, the system it normally runs.) Then I made images of the four disks left from the original five:

# for i in a b c d; do dd if=/dev/sd$i of=image$i.dat bs=4k; done

I copied all the image files onto a single HD and left the office.

Friday, May 6, 2005

I wanted to start a software RAID, fooling the kernel into thinking that the files were HDs. Just having the images was not enough to bring the RAID on-line. RAID5 has a number of options: the algorithm (left/right parity, symmetric/asymmetric), the chunk (strip) size and, most important, the order of the images in the RAID. I had to tell the kernel how the RAID controller had arranged them so it could replicate the RAID.

I had already been given the hint that the chunks were 64KB long. By the end of the day, the software RAID idea hadn’t worked at all. I started thinking about rebuilding the data the “hard” way: making a single image of the RAID from the separate images.

Weekend, May 7 and May 8, 2005

I did some research during the weekend, plus a little study of the images. The images didn’t look encrypted at all. The first “chunk” of each of the four images looked like garbage, but one of the disks showed a partition table right on the second chunk, and the other chunks appeared to hold other kinds of data:

# fdisk -lu discoa1
You must set cylinders.
You can do this from the extra functions menu.

Disk discoa1: 0 MB, 0 bytes
255 heads, 63 sectors/track, 0 cylinders, total 0 sectors
Units = sectors of 1 * 512 = 512 bytes

  Device Boot   Start     End   Blocks  Id System
discoa1p1       63  142175249  71087593+  7 HPFS/NTFS
Partition 1 has different physical/logical endings:
   phys=(1023, 254, 63) logical=(8849, 254, 63)

fdisk was complaining because it was a 64KB file, not the expected 72GB one (written in the partition table). I studied the images and noticed that the data chunks and the parity chunks were distinguishable from each other, and that they seemed to follow a plain RAID5 distribution and algorithm... I was hopeful.

Disk 1  Disk 2  Disk 3  Disk 4  Disk 5
     1       2       3       4       P
     5       6       7       P       8
     9      10       P      11      12
    13       P      14      15      16
     P      17      18      19      20
    21      22      23      24       P
    25      26      27       P      28

Table 1 - RAID5’s chunk disposition (in a 5-disk array)
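The disposition in Table 1 can be generated mechanically. What follows is a minimal Python sketch, not the code used in the recovery: the function names parity_disk and layout_row are made up for illustration, and it assumes the plain left-parity layout of the table, where parity starts on the last disk and moves one disk to the left on every row while the data chunks fill the remaining disks from left to right.

```python
def parity_disk(row, n_disks):
    """Index (0-based) of the disk holding parity on a given row.

    Assumes the layout of Table 1: parity starts on the last disk
    and moves one disk to the left on each following row.
    """
    return (n_disks - 1) - (row % n_disks)


def layout_row(row, n_disks):
    """Return the cells of one row: chunk numbers (1-based) or 'P'."""
    p = parity_disk(row, n_disks)
    next_chunk = row * (n_disks - 1) + 1  # first data chunk of this row
    cells = []
    for disk in range(n_disks):
        if disk == p:
            cells.append("P")
        else:
            cells.append(next_chunk)
            next_chunk += 1
    return cells


if __name__ == "__main__":
    # Print the first seven rows of a 5-disk array, as in Table 1.
    for row in range(7):
        print(" ".join(str(cell).rjust(3) for cell in layout_row(row, 5)))
```

Other RAID5 layouts (right parity, or data that wraps around after the parity chunk) would need a different parity_disk and a different fill order, which is why the algorithm had to be guessed along with the disk order.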

I made a Java class that could rebuild the RAID content from the separate images (had I used C/C++, I would still be coding!). It was all about placing the right chunk from the right disk (image of disk) at the right place of the final image. I was missing one image, but it could be calculated with the help of the parity chunks spread all over the disks (see Textbox 1). The class was no big deal: selecting the right chunks from the disks, and using XORs to calculate the missing chunks. I guess it took about three or four hours at most to code it. I was finally ready to give it a try. The problem I hit was that while testing the software RAID at home I had damaged the images. So, I had to wait until Monday to test the class with the images of the RAID.

RAID stands for Redundant Array of Independent Disks. All it does is make a number of disks “look” like they are one to improve throughput or fault-tolerance. There are a number of ways to put them together. Some of them are:

Mirroring: in this case, each disk has exactly the same content. Size of the array: the size of the smallest disk. Redundancy: at least one disk must be working for the data to remain intact.

Linear: one disk follows the other. The size of each disk doesn’t matter at all. Size of the array: the sum of the size of the separate disks. Redundancy: If you remove one disk, you will lose the information on that disk and potentially all of the data in the array.

RAID5: The information is spread across all of the available devices in a manner different from linear. Size of the array: the size of the smallest disk multiplied by (the number of disks minus one); for five 18GB disks, that is 4 × 18GB = 72GB. Redundancy: At most one disk can be removed/replaced from the array without data loss. Instead of having disks that follow each other, the information is written in “chunks” of data, one disk at a time (see Table 1).

In Table 1, the numbers represent the order in which the chunks are written on the disks (in this example, it’s left parity, asymmetric). There is one parity chunk for every n – 1 chunks of data, where n is the number of disks. That is done for redundancy.

It works like this: parity is calculated by XORing the n – 1 chunks of data in a row. This logical operator has a very interesting property for redundancy. If you remove one of the data chunks and use the parity chunk instead for the XOR operation, you will get the missing chunk of data:

a xor b xor c xor d xor e = p

If you remove c, then:

a xor b xor d xor e xor p = c

What does this mean for the RAID? It means that if you remove a whole disk from the array, the RAID can still work... though with a little overhead to calculate the missing chunks. Furthermore, if you replace a missing disk with a new one, the data that was in the removed disk can be rewritten to the new disk. There will be no loss of data (provided that no more than a single disk is missing at any given moment).
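The XOR property can be checked in a few lines. This is only an illustrative Python sketch: the helper xor_chunks and the five tiny 4-byte chunks standing in for a, b, c, d and e are made up, whereas real chunks would be 64KB blocks read from the disks.

```python
from functools import reduce


def xor_chunks(chunks):
    """XOR equally sized byte strings together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*chunks))


# Five made-up 4-byte data chunks standing in for a, b, c, d, e.
a, b, c, d, e = b"AAAA", b"BBBB", b"CCCC", b"DDDD", b"EEEE"

p = xor_chunks([a, b, c, d, e])          # parity of the row
recovered = xor_chunks([a, b, d, e, p])  # c is "lost"; p takes its place

print(recovered == c)  # prints True
```

The same call recovers any one of the chunks, including the parity itself, as long as all the others in the row are available.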

The process of making a RAID image wasn’t complicated. I started the Java class by telling it the conditions of the run: algorithm, images, order of the disks, chunk size, skipped chunks (remember there were 64KB of garbage at the beginning of every image), and output file.
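The run described above can be sketched roughly as follows. This is not the Java class from the article but a minimal Python illustration under stated assumptions: the layout of Table 1, at most one missing image (passed as None), and made-up names (rebuild, xor_chunks). The default chunk size and skipped bytes are the 64KB values mentioned in the text.

```python
def xor_chunks(chunks):
    """XOR equally sized byte strings together."""
    size = len(chunks[0])
    acc = 0
    for chunk in chunks:
        acc ^= int.from_bytes(chunk, "big")
    return acc.to_bytes(size, "big")


def rebuild(images, out, chunk_size=64 * 1024, skip=64 * 1024):
    """Write the array content to `out` from per-disk images.

    `images` holds binary file objects in array order, with None for
    the one missing disk. Assumes the layout of Table 1 (parity starts
    on the last disk and moves left), which is only one of the
    combinations that may have to be tried.
    """
    n = len(images)
    for img in images:
        if img is not None:
            img.seek(skip)  # jump over the leading garbage
    row = 0
    while True:
        chunks = [img.read(chunk_size) if img is not None else None
                  for img in images]
        present = [c for c in chunks if c is not None]
        if any(len(c) < chunk_size for c in present):
            break  # end of the images (a partial last row is dropped)
        if len(present) < n:
            # The missing chunk is the XOR of all the others in the row.
            chunks[chunks.index(None)] = xor_chunks(present)
        parity = (n - 1) - (row % n)
        for disk, chunk in enumerate(chunks):
            if disk != parity:
                out.write(chunk)  # data chunks only, in disk order
        row += 1
```

With hypothetical file names, a call could look like rebuild([open("imagea.dat", "rb"), None, open("imageb.dat", "rb"), ...], open("RAID.dat", "wb")), where the position of each image in the list is exactly the disk order that has to be guessed.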

Monday, May 9, 2005

I made some attempts at rebuilding the RAID content. Each try took roughly two or three hours. After a run, I had a RAID.dat file (about 72GB in size) that was the “supposed” image of a HD, just like doing:

# dd if=/dev/hda of=ata.dat

Please notice the lack of a partition number in the input file (a raw HD block device).

Then I had to use that image as a hard drive. First, I had to use fdisk to know the “partitioning” of the hard drive (it had no problem handling the file at all). At that point, just as I had thought, I discovered that the file was the image of a HD and I could see a partition starting from sector 63. I was more than happy! There were no complaints from fdisk this time. Unfortunately, I can’t give you console output from now on, because the files have already been erased. Instead, I’ll show the commands that were involved:

# fdisk -lu RAID.dat
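What fdisk reports here comes from the partition table stored in the first sector of the image. As a rough Python sketch (not a replacement for fdisk), the start sector of the first partition and its byte offset inside the image can be read like this. The function names are made up, and the code assumes a classic MBR partition table and 512-byte sectors, which is what the fdisk output shown earlier suggests.

```python
import struct

SECTOR_SIZE = 512  # bytes per sector, as reported by fdisk


def first_partition(mbr):
    """Return (type_id, start_sector, sector_count) of the first MBR entry.

    `mbr` is the first 512 bytes of the disk image.
    """
    if mbr[510:512] != b"\x55\xaa":
        raise ValueError("no MBR signature found")
    entry = mbr[446:462]  # first of the four 16-byte partition entries
    type_id = entry[4]    # 0x07 is what fdisk lists as HPFS/NTFS
    start, count = struct.unpack("<II", entry[8:16])
    return type_id, start, count


def byte_offset(start_sector):
    """Offset of the partition inside the image, in bytes."""
    return start_sector * SECTOR_SIZE


if __name__ == "__main__":
    # A partition starting at sector 63 begins this many bytes into the image.
    print(byte_offset(63))  # prints 32256
```

That byte offset is the number a loop device needs in order to expose the partition rather than the whole disk image.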

Then mounting. How could I make the kernel think that this file was a hard drive? Well... it took me some more research to learn that losetup is used to link loop devices to files. It felt like the solution was at hand! I had to link the file to a loop device starting from byte 32256 (I had to skip the first 63 sectors of 512 bytes each, according to fdisk):

# losetup -o 32256 /dev/loop0 RAID.dat

It linked, no problem! Then mounted:

# mount -t ntfs /dev/loop0 /mnt/tmp

There was no complaint when mounting. All of the pieces were fitting together after all.

I just forgot to take into consideration one very important factor in the IT world: Murphy’s Law. The RAID was not going to give itself away so easily after all.

When I ran ls in the mount point, I could see a few of the directories, but the information wasn’t usable. I couldn’t cd to those directories and dmesg said there were problems with the NTFS indexes. I guessed I must have made a mistake ordering the disks... or used the wrong algorithm. I tried twice (with different options), but failed.

Tuesday, May 10, 2005

I had left another attempt running when I left the office. That one failed too. I was getting frustrated by then. Three of the developers at the IT department offered their help and started analysing the whole thing with me. I made another class that rebuilt the missing image, which I felt would help us in the analysis: no matter what the algorithm, order or strip size, according to RAID theory, the missing image’s content would always be the same.
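The reason the missing image does not depend on any of those options is that every row holds n – 1 data chunks plus their parity, so the XOR of all n disks at any given offset is zero, and any one disk is simply the XOR of the others at the same offset. A minimal Python sketch of that idea follows; it is not the class mentioned above, and the name rebuild_missing_image is made up.

```python
def rebuild_missing_image(images, out, block_size=64 * 1024):
    """Write the XOR of all surviving images to `out`, block by block.

    `images` are binary file objects for the surviving disks, in any
    order; `block_size` only affects how much is read at a time.
    """
    while True:
        blocks = [img.read(block_size) for img in images]
        size = min(len(block) for block in blocks)
        if size == 0:
            break  # at least one image is exhausted
        acc = 0
        for block in blocks:
            acc ^= int.from_bytes(block[:size], "big")
        out.write(acc.to_bytes(size, "big"))
```

Since XOR is commutative, passing the images in a different order produces the same output, which makes the result a stable reference while the other parameters are still unknown.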

We noticed that I had indeed made a mistake when ordering the disks! (Hey, I can’t always be right, can I?) We studied the images a little further to make sure, and started the whole thing again. It was already getting late, so we had to wait until the next morning to see the results.

Wednesday, May 11, 2005

First thing in the morning (and I didn’t sleep very well because of the wait), I ran ls and...

Eureka! It worked.

All of the directories were there (otherwise, I wouldn’t have written this article in the first place, right?). I tried to work with some of the files in the partition... and it was perfect.

I suddenly became the spoiled kid of the IT department! I got a big chocolate cake—that’s what I call a bargain!

Even better, the experience caused some of the guys from the IT department to install GNU/Linux on their own personal computers. That’s quite an achievement!

Conclusion

I want to finish saying that I did nothing miraculous... but definitely clever! I certainly used the resources I had at hand... plus Knoppix. I also got a lot of help from the GNU/Linux community (through www.linuxquestions.org mostly). Thank you people!

It’s very important that you make sure backups are made on a regular basis to avoid this kind of situation. I don’t think it’s likely you will find yourself in the same situation we got ourselves into. But, if you do find yourself in the same boat, I hope this information allows you to not lose the data the Microsoft way (just format the disk, and forget about your data). Don’t freak out, get a Knoppix CD (if you can get a GNU/Linux guru along with it, all the better!), and with a little programming you will most likely solve the problem.

Thanks

I’d like to thank Simon Carreno, Heberto Ramos and Javier Machado, for their help in analysing the way the images of the RAID had to be put together. I’d also like to thank the IT crew as a whole for their support.

Comments

From: MAd MAco
Url:
Date: 2005-08-10
Subject: Don't fight against paleo-toys, buy a new one!

Ok.. you are fighting against this Baby-Dino-Disk... but have you ever thought about buying a cheap laptop with a 200GB HDD??? .. .. all these toys at your server have less info than my Digi-cam!!!

...please, don't be selfish... buy a real one!

From: melissa
Url:
Date: 2005-08-10
Subject: :)

although i'd say 89% of this article made no sense to me.....je je je.....i must say....nice work, sir ;)

From: Tormak
Url: lineak.sourceforge.net
Date: 2005-08-26
Subject: Device order?

How did you determine the order that the disks were in?

From: Redbox
Url: www.pv.com.pl
Date: 2005-08-27
Subject: I must say...

GREAT JOB!!!

From: Moxy
Url:
Date: 2005-08-27
Subject: More info required.

How about publishing your Java class to see how you analysed and rebuilt the RAID image?

From: rich gregory
Url:
Date: 2005-08-28
Subject: disk based backup

Here is a simple way to add a disk based incremental backup to a production file system.

http://www.people.virginia.edu/~rtg2t/samba/system.admin.html#backup

It MUST be integrated with a full backup system to tape or disk.

It is a simple way to use an older PC and an 80-120GB IDE drive to give sys admins of big raid systems some peace of mind.

cheers,

rich

From: Adrin
Url:
Date: 2005-10-03
Subject: RAID 5 failure, Maybe you should look at RAID 1

While it is great that you got the data back, I have some questions. The RAID died on April 29, and you didn't get it back until May 15. I hope the practice wasn't down that long.

I hope you were able to restore from backup and get them going in the meantime.

Perhaps you should think about the RAID setup. While RAID 5 is great for making one large disk out of smaller disks, you are screwed when there is a major failure. Perhaps you should think about a mirrored RAID. Yeah, you lose a lot of space, but one disaster and it is paid for.

You brought up the NTFS partition in your article. Another reason for me to hate Windows crash recovery. I have yet to see a good disk recovery tool like the one I use in Unix, but this is not the place to spam it and it is not free.

From: Anonymous visitor

lol 18 gig drives, those would have to be old.. I'd go ahead and replace the server or at the very _minimum_ implement a backup system of some kind. although with a server that old, you could be looking at other hardware possibly failing in the future, so you'd want to make sure to back up the data to a medium that another machine definitely will read. such as, don't trust old HP dat tape drives!!

From: Edmundo Carmona

I see there were some comments posted.... I think I replied to some in due time, but I'm not so sure... my memory sucks.

I can see there were some people talking about getting new hardware. Of course, that's one answer to the problem.... but when you are working in a public institution (in Venezuela, should I add) things are not that simple here (are they anywhere?). We have to work on a very very very tight budget and crack our heads to get the most out of everything we have at hand (even dino-hardware).

No, the practice wasn't affected for that long. The main system of the hospital (though it was affected) was restored in basically no time. The data that was affected was "Users' Docs". Users had local copies of documents... but perhaps not all documents... and certainly not old documents. They carried on with what they had at hand at the time and waited for us to (hopefully) get the whole thing back.

Coming to a more technical side of the article: how did I (we) determine the order of the disks? That took a little binary math to achieve. When we studied the images we noticed that there were markers that could be read in fixed positions in different chunks of the images of the disks (things like the string "FILE*" that were part of the structure of the NTFS partition). There were 5 disks in the array. For every chunk "row", 4 of the disks would have data and one disk would have the XOR checksum of the other 4 chunks (remember that checksums are spread among all the disks). If you calculate the XOR of 4 times the same value (like "FILE*") you will get a beautiful 0 (for each byte), and so we were able to see where the data chunks and the checksum chunks were. Luckily there were 5 disks: had there been an even number of disks, it would have taken a little more trickery, because the XOR of an odd number of copies of "FILE*" would be "FILE*" again. We already knew where the partition table of the disk was (so, the first data chunk) and could see where that image had its first checksum chunk. Together with the position of the checksums in the other disks, that tells you which algorithm was applied and the order of the disks. I'm wondering if that was clear enough. :-? If you have questions, ask Tony for my email address. I'm sure he will kindly provide it. ;-)
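The marker trick described above can be sketched in Python. This is only an illustration, not the analysis code that was used: the name find_parity_chunk is made up, the marker and its position are parameters, and it assumes that all the chunks of a row are at hand (for instance, after the missing image has been rebuilt by XOR) and that the number of data chunks per row is even.

```python
def find_parity_chunk(row_chunks, marker=b"FILE*", offset=0):
    """Guess which chunk of a row holds parity, using a known marker.

    `row_chunks` are the chunks found at the same position on each disk.
    If every data chunk carries `marker` at `offset` and there is an even
    number of them, their XOR (the parity chunk) has zero bytes there.
    Returns the index of the parity chunk, or None if the row does not
    match that pattern.
    """
    end = offset + len(marker)
    zeros = bytes(len(marker))
    with_marker = [i for i, chunk in enumerate(row_chunks)
                   if chunk[offset:end] == marker]
    with_zeros = [i for i, chunk in enumerate(row_chunks)
                  if chunk[offset:end] == zeros]
    if len(with_zeros) == 1 and len(with_marker) == len(row_chunks) - 1:
        return with_zeros[0]
    return None
```

Running this over many consecutive rows gives the sequence of parity positions per image, and comparing that sequence with a pattern like the one in Table 1 is what reveals both the disk order and the algorithm.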

The source code? It's there, right? It's a Free Software Magazine after all, isn't it? ;-)

Cheers!

From: mcontestabile

Hello everyone...
My name is Marco, and I had the same experience...
2 weeks ago my server go down...all disks whit red led on...

The first thing I thought "i've lost all data...all db..all users folders..."...the last backup was a few months ago!!

i tried to change the controller, but nothing to do...the controller don't recover the raid configuration from the disk.

Searching in internet i found this guide...and reading it i've seen that what Edmundo write is the samne thing that happen in my office...i think "ok...i want to try..."...
My raid is a single raid5 array with 8 disks...in the array 2 logical drives.

After three days of work...and with the valuable help of Edmundo (How much patience has this guy? :)))) )...i've recovered all data from the disks...recovere all db and users folders...in few words...all i need to restore the office software.

I want to again thank Edmundo for his help... :))

p.s. the first thing that i've installed in my office after the recovery of the data...is a very good system backup :))

sorry for my english...i hope that who read this can understand :))))

Marco

From: yoavsil

Hi Edmundo,

apparently the company Raid-Recovery-Online.com can recover all RAID arrays remotely in no time (providing 24/7 services)... So next time you need to Save the Republic , just contact them without all that fuss... and it's not even that expensive.

Cheers,
Yoav

From: Edmundo Carmona

Hello, everybody.

Wanting to get rid of Java for rebuilding the images, I have decided to translate the library to Python (and correct it... I think I found a couple of problems).

The library is here:
https://code.launchpad.net/~eantoranz/+junk/raidpycovery

I've already written a couple of articles on my blog about it:
http://maratux.blogspot.com/2010/11/broken-raid5-you-said-dont-use-java.html
http://maratux.blogspot.com/2010/11/testing-raidpycovery-through-mdadm.html

There you go!

From: ShaneW

Hey man nice article. I was looking for a bit of advice, if possible.

I have a 3 disk hardware raid 5 that has collapsed. The backup was incomplete :) so I have been asked to recover it. With a bit of messing around with the hardware I have 2 full disk images and 1 half disk image. What I need advice with is determining the chunk size and algorithm. I had a go rebuilding the 3rd drive myself with left-sync and 512k chunk size then compared what I got with the partial image. I found 50k-ish chunks of identical data in identical locations but the rest was junk. Then I saw this article, what should I be looking for to determine chunk size? what should I be looking for to determine the algorithm?

Would it be worth my time to create the first say 200M of the third disk with your script (using a variety of variables) and compare to the other disk I have?

I am of course working in knoppix.

Cheers
Shane

From: Edmundo Carmona

Hi!

First, I had said that it was translated to Python and life was beautiful... but in Python the recovery process was horrendously slow. Last night I migrated it to C++ and let me tell you that it's MUCH faster!

http://maratux.blogspot.com/2011/11/remember-times-with-i-used-python-for.html

Cheers!

Author information

Edmundo Carmona

Biography

Edmundo is a Venezuelan Computer Engineer. He has very recently started working as a freelance Java developer in Colombia. He has also been a GNU/Linux user and consultant for several years.

After years of being retired from music, he's working right now to regain his classical flute skills.