Oh well, at least it's

different
Everything here is my opinion. I do not speak for your employer.
September 2008
October 2008

2008-09-08 »

SATA vs. SCSI reliability

Here's a guy who discusses SATA vs. SCSI disk reliability. Short conclusion: actual disk failures (MTBF) are almost exactly as likely wtih cheap SATA disks as expensive SCSI disks. But the bit error rate of SATA is much higher. In other words, the likelihood of not being able to read a sector because it got corrupted on SATA is vastly higher than SCSI. By his calculation, on a 1TB disk, you have about a 56% chance of not being able to read every single sector, which means rebuilding your RAID correctly in the case of a failed disk is usually impossible.

It's true, and I learned this the hard way myself. Back at NITI, we ran into exactly this problem. Back in the days when we introduced our software RAID support, typical disk sizes were around 20 GB, about 50x smaller than they are now. (Wow!) The bit error rates now are about the same as they were then, which means, assuming the failure percentage declines linearly(1), about a 1.1% failure rate in recovering a RAID.

In general, that 1.1% failure rate isn't so bad. Remember, it's 1.1% on top of the rather low chance that your RAID failed in the first place, and even then it doesn't result in total data loss - just some inconvenience and the loss of a sector here or there. Anyway, the failure rate was small enough that nobody knew about it, including us. So when we had about a 1.1% rate of weird tech support cases involving RAID problems, we looked into it, but blamed it on bad luck with hard drive vendors.

By the time disks were 200GB and failure rates were more like 10%, we were having some long chats with those hard drive vendors. Um, guys? Your disks. They're dropping dead at a pretty ridiculous pace, here.

You see, we were still proceeding under the assumption that IDE disk are either entirely good, or they're bad. That is, if you get a bad sector on an IDE disk, it's supposed to be the beginning of the end. That's because modern disks have a spare sector remapping feature that's supposed to automatically (and silently) stop using sectors when the disk finds that they're bad. The problem, though, is it has to discover this at write time, not at read time. If you're writing a sector, you can just read it back, make sure it was written correctly, and if not, write it to your spare sector area. But if you read it back and it fails the checksum - what then?

This is the "bit error rate" problem. It's not nice to think about, but modern disks just plain lose data over time. You can write it one day, and read-verify it right afterwards without a problem, and then the data can be missing again tomorrow. Ugh.

And the frequency - per bit - with which this happens is the same as ever. With SCSI it's less than with SATA, but as we have more bits per disk, the frequency per disk is getting ridiculous. A 56% chance that you can't read all the data from a particular disk now.

There are two reasons you probably haven't heard about this problem. First, you probably don't run a RAID. Let's face it, if your home disk has a terabyte of stuff on it, you just probably aren't accessing all that data. Most of the files on your disk, you will probably never access again. Face it! It's true. If you filled up a 1TB disk, you probably filled it with a bunch of movies, and most of those movies suck and you will never watch them again. Stasticially speaking, the part of your disk that loses data is probably in the movies that suck, not the movies that are good, simply because the vast majority of movies suck.

But if you're using a RAID, you occasionally need to read the entire disk, so the system will find those bad sectors, even in files you don't care about. Maybe the system will be smart enough not to report those bad sectors to you, but it'll find them.

Secondly, even when you lose a sector here and there, you usually don't even care. Movies, again: MPEG streams are designed to recover from occasional data loss, because they're designed to be delivered over much less reliable streams than hard disks. What happens if you get a corrupt blob of data in your movie? A little sprinkle of digital junk on your screen. And within a second or so, the MPEG decoder hits another keyframe and the junk is gone. Whatever, just another decoder glitch, right? Maybe. Maybe not. But you don't really care either way.

The Solution

At NITI, we eventually settled on a clever solution to this that won't lose data on RAIDs. Of course we can't protect data on a non-RAID disk in any direct sense, but we strongly recommended for our customers to do frequent incremental backups instead.

But on a RAID, the problem is actually easier: simply catch the problem before a disk fails. In the background, we would be constantly, but slowly, reading through the contents of all your disks, averaging about one pass per week. If your RAID is still intact but we find a bad sector, no data has been lost yet: the other disks in the RAID can still be used to reconstruct it. So that's exactly what we would do! Reconstruct the bad sector, and write it back to the failing disk which could then automatically use its sector remapping code to make the bad sector disappear forever.

The read-reconstruction part was never open sourced, so if you want that, you'd have to write it yourself. Luckily, it was easy, and now that we have ionice you don't have to be nearly as careful to do it slowly in the background.

The other part was to make sure Linux's software RAID could recover in case it ran into individual bad sectors. You see, they made the same bad assumption that we did: if you get a bad sector, the disk is bad, so drop it out of the RAID right away and use the remaining good disks. The problem is that nowadays, every disk in the RAID is likely to have sector errors, so it will be impossible to rebuild the RAID under that assumption. Not only that, but throwing the disk out of the RAID is the worst thing you can do, because it prevents you from recovering the bad sectors on the other disks!

A co-worker of mine at the time, Peter Zion(2), modified the Linux 2.4 RAID code to do something much smarter: it would keep a list of bad sectors it found on each disk, and simply choose to read that sector from the set of other disks whenever you tried to read it. Of course it would then report the problem back to userspace through a side channel, where we could report a warning about your disks being potentially bad, and accelerate the background auto-recovery process.

Sadly, while the code for this must be GPL as it modified the Linux kernel, the old svn repository at svn.nit.ca seems to have disappeared. I imagine it's an accident, albeint a legally ambiguous one. But I can't point you to it right now.

I also don't know if the latest Linux RAID code is any smarter out of the box than it used to be. As we learned, it used to be pretty darn dumb. But I don't have to know; I don't work there anymore. Still, please feel free to let me know if you've got some interesting information about it.

Footnote

(1) Of course the failure rate is not exactly a linear function of the disk size, for simple reasons of probability. The probability of a 1TB disk having no errors (0.44, apparently) is actually the same as the probability that all of a set of 50x 20GB disks has no errors. The probability of no failures on any one disk is thus the 50th root of that, or 98.4%. In other words, the probability of failure was more like 1.6% back in the day, not 1.1%.

(2) Peter is now a member of The Navarra Group, a software contracting group which may be able to solve your Linux kernel problems too. (Full disclosure: I'm an advisor and board member at Navarra.)

Update 2008/10/08: Andras Korn wrote in to tell me that the calculations are a bit wrong... by about an order of magnitude, if you assume a 1TB RAID. Though many modern RAIDs will be bigger than that, since the individual disks are around 1TB. Do with that information what you will :)

Update 2008/10/30: I should also note that my calculations are wrong because I misread what the Permabit guy calculated in the first place, and didn't bother to redo the math myself. His calculations are not off by an order of magnitude, although there is some disagreement about whether the correct number is 0.44 or 0.56.

I'm CEO at Tailscale, where we make network problems disappear.

Why would you follow me on twitter? Use RSS.

apenwarr on gmail.com