2008-08-01
ext2resize sucks, but apparently so do I, and also a lot of other things too
This has not been a happy week for my desktop workstation. It started off good: two brand new SATA 750 GB disks to replace my old 80 GB one. What a difference! Plus, this was finally the inspiration I needed to get organized and install Linux as my base system and Windows XP in VMware, instead of the other way around.
So the first thing I did was take out my 80 GB disk and put it aside for later, in order to avoid screwing up, which is what people tend to do during such activities. I booted a Debian Etch CD and proceeded to set up my two disks in a RAID-1 mirror, just like I had planned.
This is where I stop to complain about Debian Etch's installer. What the heck were you people thinking? Steal Ubuntu's, already! And the default mode is to install with LVM but not with a RAID device backing it? And in order to just set up a RAID on two identical disks, you have to go through like 25 steps in your awful installer UI? This is really, really disgusting.
pphaneuf's rule for flaming is that you should shut up and not complain unless you at least know a better way to do it. And better still, you should have done it better. Well, I have. Nitix's disk configuration stuff is fully automatic and pretty much foolproof. Sometimes it can be a little tricky to wipe out a disk that already has stuff on it. But that was on purpose. Other than that, it's just plain easy, and your disk gets configured with RAID and LVM layers included even if there's only one disk, because that makes things a heck of a lot more consistent, let alone making it really easy to add a RAID later.
Anyway, I got through the painful disk installer UI after a few tries (you have to reboot if you get it wrong! Genius!), and installed my system, and rebooted, and all seemed okay, for a while.
That's when I installed Firefox 3 and my system suddenly became prone to huge disk grinding delays. Every time a program started writing to the disk, Firefox would freeze, sometimes for many seconds at once. And Firefox 3, you see, also writes to disk, so this was no rare occurrence.
You see, Debian had installed my new ext3 filesystem using the kernel's default data=ordered option. data=ordered is one of those things that I'm shocked was allowed into the kernel at all, let alone made the default. Basically, it means that all relevant data will always be flushed to disk before its journal (metadata) entries can be flushed, so your files will never contain stale leftover data if your computer crashes between the metadata and data updates. Sounds great, right? Well yes, until you think about how you actually have to implement that. The journal is always flushed sequentially, so if someone calls fsync() on a file, we try to flush its data first, then all the metadata changes that were made on disk up to and including this file's metadata. But that metadata leading up to this change, of course, can't be flushed until we include all the data related to all that metadata. This is a long story, but essentially, it turns an fsync() of a 4k file into a sync() of your entire disk. Result? Firefox 3, which fsync()s like 300 times a minute for no good reason, grinds to a halt. And your system performance is craptastic because basically your disk's write cache is gone.
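If you want to see this effect for yourself, here is a minimal sketch (not anything I actually ran at the time) that writes a single 4k block and times only the fsync() call. The function name time_small_fsync and the idea of passing in a scratch path are my own inventions for illustration. Run it once on an idle system and once while something else is writing heavily to the same filesystem; on ext3 with data=ordered, the second number should be dramatically worse.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: writes one 4 KiB block to `path`, fsync()s it,
 * and returns how long the fsync() call alone took, in seconds.
 * Returns -1.0 on any error. The scratch file is removed afterwards. */
double time_small_fsync(const char *path)
{
    char block[4096];
    struct timespec start, end;
    int fd, rc;

    memset(block, 'x', sizeof block);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1.0;

    if (write(fd, block, sizeof block) != (ssize_t)sizeof block) {
        close(fd);
        unlink(path);
        return -1.0;
    }

    /* Time only the fsync(), not the open() or write(). */
    clock_gettime(CLOCK_MONOTONIC, &start);
    rc = fsync(fd);
    clock_gettime(CLOCK_MONOTONIC, &end);

    close(fd);
    unlink(path);
    if (rc != 0)
        return -1.0;

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Call it from a small main() with a path on the filesystem you care about and print the result. The scratch file has to live on the ext3 partition in question, since a tmpfs path will tell you nothing.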
After reading the mostly retarded discussion in the Bugzilla case, I gave up on the Firefox guys actually fixing the problem anytime soon. fsync() slightly less often?? That's a fix?? What part of "every time I fsync() it syncs ALL outstanding transactions immediately to disk" do you not understand?
So, I made a shared library called libnofsync.so that just makes fsync() do nothing, and used LD_PRELOAD to load that into firefox. My bookmarks are not a bloody ACID database! Nobody cares if they get corrupted! I only ever visit like three sites, and two of them are GMail! Get over it. With this hack, I suppose my Linux Firefox 3 is probably the fastest one around, because data=ordered or not, everybody else is doing multiple synchronous disk writes every time they try to load a page. Good grief.
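For the curious, a library like that can be tiny. What follows is a sketch of the idea rather than the exact source of my libnofsync.so: it defines fsync() so that the dynamic linker picks this version over libc's. Overriding fdatasync() as well is an assumption on my part that a program might use that call instead, and the filename nofsync.c is made up.

```c
/* nofsync.c (hypothetical filename): a do-nothing fsync() shim.
 *
 * Build (assuming gcc):
 *   gcc -shared -fPIC -o libnofsync.so nofsync.c
 * Use:
 *   LD_PRELOAD=/path/to/libnofsync.so firefox
 */
#include <unistd.h>

/* Pretend the data reached the disk. It did not; the kernel will write
 * it out whenever it gets around to it. */
int fsync(int fd)
{
    (void)fd;
    return 0;
}

/* Assumption: also neuter fdatasync(), in case the program calls that
 * instead of fsync(). */
int fdatasync(int fd)
{
    (void)fd;
    return 0;
}
```

Obviously, only preload this into programs whose data you truly don't care about. Anything that relies on fsync() for crash safety will happily believe the lie.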
The next task was to get rid of the completely obnoxious data=ordered setting, which involved messing in my grub configuration file(s). Grub, if you haven't heard me rant about it before, is a total complete pukefest. It does absolutely nothing of value that lilo doesn't do, but it does it in a way that's about 1000x more complicated. Then Debian layers another pile of crud on top. The long and the short of this is that I COULD NOT FIND A WAY to add "rootflags=data=writeback" to my kernel command line in any sort of permanent fashion. Now, you can edit the kernel command line during boot, and there's an option right there that says "savedefaults" that sounds promising but does absolutely nothing. And (at least) one of the 1000 layers of garbage shoveled on top of grub overwrote and/or ignored my config file changes no matter how I tried to make them.
So I uninstalled grub and installed lilo, after which things were trivially easy because lilo was not written by morons.
Around this time I tried out a 2.6.25 kernel from backports.org, which happily crashed my system repeatedly. What the heck was I thinking? The kernel hasn't actually been missing any significant features for something like five years. I don't know why I upgraded it, but I sure won't make that mistake again. Anyway, the crashes served mostly to teach me that if you crash your system while the RAID is rebuilding, it might not come back all by itself. Instead, it silently kicks the second disk from the RAID and leaves it idle, without actually telling anyone about this INCREDIBLY SERIOUS RELIABILITY LOSS. But that's par for the course by now. I added it back into my array and off I went.
Fast forward to a few days later; it took some suffering before I bothered to fix the firefox nonsense and got sick enough of the random kernel crashes to downgrade my kernel back to Debian's stable version.(1)
That was about when I noticed that my new, fancy-pants RAID was rebuilding at 80 MB/sec, which is fabulously quick, and yet had still not finished rebuilding, after being left uninterrupted for more than a day. What? Well, let's "watch cat /proc/mdstat". Hey, check it out! It's almost done! 98%... 99%... 100%... 100%... 100%... 0%? Hey! What the heck is this! Okay, look at dmesg. Aha, it had a bad sector right at the very end of the disk!
And so it decided to start rebuilding the RAID from scratch!
About once an hour since I installed it at the beginning of the week!(2)
HA HA HA HA!
Now, that's not actually something that would solve the problem even if I had bad sectors. But of course, I don't really have any bad sectors. What I do have is some partitions that apparently Debian's installer screwed up while creating, so they happily run right past the end of the disk.
Now okay, Debian's partitioning thingy has an excuse for being buggy; probably nobody ever uses it to make a RAID, because it's sure the heck not easy to do. But here's the thing: mdadm let me create a RAID on a partition with inaccessible sectors. Then mke2fs let me create a filesystem on that broken RAID. Did it not occur to anyone to sample a few of the sectors before you decided to actually use them? Didn't any of you ever wonder why Windows does that weird thing about "testing sector accessibility" whenever you make a partition? No, apparently not. Gargle.
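To be clear about how little I'm asking for, here is a rough sketch of the kind of sanity check I mean: open the device, find out how big it claims to be, and try to read its very last sector. The function name last_sector_readable and the 512-byte sector size are assumptions for illustration. This is not code from mdadm, mke2fs, or anything else.

```c
#define _XOPEN_SOURCE 700
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR_SIZE 512  /* assumed; real tools should ask the device */

/* Hypothetical check: returns 1 if the last full sector of `path` can
 * be read, 0 if the read fails or comes up short, and -1 if the device
 * can't be opened or is smaller than one sector. */
int last_sector_readable(const char *path)
{
    char buf[SECTOR_SIZE];
    off_t size, last;
    ssize_t n;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* On Linux, seeking to the end of a block device reports the size
     * the kernel believes it has. */
    size = lseek(fd, 0, SEEK_END);
    if (size < SECTOR_SIZE) {
        close(fd);
        return -1;
    }

    /* Offset of the last complete sector. */
    last = (size / SECTOR_SIZE - 1) * SECTOR_SIZE;
    n = pread(fd, buf, SECTOR_SIZE, last);
    close(fd);

    return n == SECTOR_SIZE ? 1 : 0;
}
```

On a partition that runs past the end of the physical disk, I'd expect the pread() to fail and the function to return 0, which is exactly the warning I never got. Sampling a few more sectors near the end would be even better.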
Okay, so obviously I need to make my partition a little smaller. Apparently it's possible to resize ext3 partitions and RAID devices now. That's good news, right? Well sure! Let's try that!
So I switched down to single user mode, remounted my rootfs read only, and ran "ext2resize /dev/md0". It told me a magic number, which is the number of sectors it's currently using. I reduced that number by an overly large factor (hey, I've got whole gigabytes of data to waste here!) and ran "ext2resize -v /dev/md0 NUMBER". It ground away for a while, giving me impressive yet scary messages about how it was moving inodes around, and so on. I figured it couldn't really do anything too harmful, since obviously the space at the end of the disk was nowhere near any of my actual data.
Boy, was I ever wrong.
I foolishly then ran "ext2resize /dev/md0" again to see if it would print out the new size. Except, it seems, that's not what it does. What it does is try to resize the partition again, this time to the maximum size. The maximum size is, as you recall, a size that involves some nonexistent sectors at the end of my partition.
So it moved a few more inodes around and then errored out. Ironically, ext2resize does apparently access that area at the end of the disk, even if mke2fs doesn't. Sadly however, it moves a bunch of crap around before erroring out and aborting midstream.
You might have imagined, as I might have, once, that ext2resize would maybe do a "hypothetical resize" operation, going all the way through the disk and confirming that everything would work - like the last sector it was about to resize into, for example - before it actually starts moving crap around. Or you might have imagined that it would undo those changes before it aborts due to an unexpected error. But if you thought that, then you, like me, would have been completely wrong. ext2resize does no such thing. Instead, it moves a bit of data around, and then aborts when it gets confused, halfway through the process.
As I learned, this makes your filesystem completely unusable. It turns out that, for no reason I can possibly imagine, moving around the completely empty unused section at the very end of my disk also involves rewriting inode 2 (which turns out to be the root directory), as well as an impressive number of other I-woulda-thought unrelated files and inodes. Of course, when you do this wrong, your filesystem stops mounting.
Time to boot the rescue disk one more time, pray a little, run e2fsck, and pray a little more.
e2fsck was not impressed. It correctly noted that inode 2 was, if I recall, "conflicted." Also lots of other horrible things. I asked it to repair everything. It crashed. Well, of course, it didn't crash exactly. It printed messages about "programmer error??" and then restarted the game all over again. I actually went through the game a few times before I finally caught on to the fact that it was the same every single time.
Luckily, when I bravely answered "no" to all the questions about whether it should fix things, it finally fixed things(3), complained that Oh God Your Filesystem is Still Broken Though, and exited. But whatever, my filesystem finally mounted again. I had to recover all my root-level folders from their new homes in /lost+found, but oh well, at least I still had my data.
And so I rebooted, and my RAID promptly started rebuilding itself again, and my filesystem was still corrupt, and it couldn't actually be fsck'd because e2fsck would still go into an infinite loop if I tried. Back to square one.
And now, this is the part where I flame myself, because I forgot something I already learned a long time ago and then sold to thousands of people:
Why the heck are you using a RAID in the first place, when you only have two disks, idiot? If you only have two disks, just back up your files to the second disk occasionally using rsync or something. That way, when you ext2resize or just delete a file by accident, the other disk doesn't reflect your idiotic mistakes until a while later. Remember?
So there you have it. All those things were dumb, but I was the dumbest of them all.
I dropped the second disk out of my RAID, repartitioned it correctly, ran mke2fs, copied all the files to the second disk, and booted from that. Done.
Actual Nice Things to Say
Lest it appear that I only hate things:
(1) Yes, Debian's stable 2.6.18 kernel actually works, and thus there's actually a reason they don't carry the highest-numbered one. Good job, guys.
(2) The guys who implemented the CFQ disk scheduler are my heroes. Unlike in the 2.4 kernel, where rebuilding your RAID hugely degraded your system performance, in 2.6 this operation happens at "idle" disk priority so it can go at pretty much full speed and yet have zero impact on your disk performance. That's why I didn't even notice for days that the rebuild was going on. Related tool: ionice -c3. Use it for your background compiles and stuff. It's awesome.
(3) You know what? I'm a big fan of e2fsck, even though it's supremely un-user-friendly and obviously had some bugs here. But despite those bugs, it didn't abort when it thought it had a "programmer error," and it saved my data from what I was sure by then was certain death. No program is perfect, but at least this one was written by sane people.
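One more thing about (2): you can get the same effect as "ionice -c3" from inside your own program, by making the ioprio_set system call directly. This is a minimal Linux-only sketch. glibc has no wrapper for this call, so the constants are copied by hand from the kernel's ioprio interface, and the function name set_idle_io_priority is made up.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Values from the kernel's ioprio interface, spelled out here because
 * glibc does not provide a wrapper or header for them. */
enum {
    WHO_PROCESS = 1,   /* IOPRIO_WHO_PROCESS */
    CLASS_IDLE  = 3,   /* IOPRIO_CLASS_IDLE  */
    CLASS_SHIFT = 13   /* IOPRIO_CLASS_SHIFT */
};

/* Hypothetical helper: puts the calling process in the "idle" I/O
 * class, like running it under "ionice -c3". Returns 0 on success,
 * -1 on failure (with errno set by the kernel). */
int set_idle_io_priority(void)
{
    int prio = CLASS_IDLE << CLASS_SHIFT;

    /* A target of 0 means "the caller". */
    return (int)syscall(SYS_ioprio_set, WHO_PROCESS, 0, prio);
}
```

Call it at the start of your background job and everything it reads or writes afterwards should get out of the way of your interactive stuff, assuming your disk scheduler honours I/O priorities the way CFQ does.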