Re: Reproducible filesystem corruption in lenny



On 2009-07-28 15:42 +0200, Josh Kelley wrote:

We use Debian for some embedded devices that use off-the-shelf flash
drives for their primary storage. Since upgrading from etch to lenny
and tweaking our partition layout, we've started seeing filesystem
corruption occur very rapidly after we clone the filesystem (via
partimage and resize2fs). While investigating, I've been able to
reproduce the corruption with both etch's and lenny's partimage, with
both etch's and lenny's e2fsprogs, with both the realtime-patched
kernel we used under etch and lenny's stock amd64 kernel, with flash
drives of different sizes, with different flash drive partition
layouts, and with one of our embedded devices, an off-the-shelf lenny
server, and an off-the-shelf etch server. This doesn't make any sense
to me.

While trying to figure all of this out, I've found that I can
reproduce filesystem corruption 100% of the time simply by executing
these commands:

mke2fs -O has_journal,resize_inode,dir_index,filetype,sparse_super,large_file /dev/sdb2
tune2fs -c 29 /dev/sdb2 # /dev/sdb is an external flash drive
mount /dev/sdb2 /mnt/image
cd /mnt/image
tar xf ~/data.tar # data.tar is a 71MB archive of the /var partition
cd
umount /mnt/image
e2fsck -f /dev/sdb2

At this point, e2fsck starts complaining with errors like this:
Symlink /lib/python-support/python2.5/_dbus_glib_bindings.so (inode
#113416) is invalid.
Clear<y>?

Turning off has_journal or adding -o data=journal fixes the
immediately preceding problem. (I haven't tested it for our cloning
procedure.) However, I don't want to go back to ext2, and
data=journal seems to be barely documented. (What exactly does it
do?)

Quoting mount.8:

All data is committed into the journal prior to being
written into the main file system.

In other words, your data are written to disk twice.
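
To make that mode permanent rather than passing -o data=journal by hand, it can go into the options column of /etc/fstab. A minimal sketch, assuming the device and mount point from your commands above and an ext3 filesystem type; the dump/pass values are only placeholders:

```
# /etc/fstab fragment: same effect as "mount -o data=journal /dev/sdb2 /mnt/image"
# <file system>  <mount point>  <type>  <options>              <dump>  <pass>
/dev/sdb2        /mnt/image     ext3    defaults,data=journal  0       2
```

mount.8 lists the two other modes, data=ordered (described there as the default) and data=writeback, in the same section.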

We've seen other errors after cloning (subdirectories that point to
their parents, "resize inode not valid", etc.), but these particular
errors are completely reproducible. The corruption occurs on more
than one flash drive. badblocks -w /dev/sdb reports no errors
(although I seem to remember one of the disks being bigger running
badblocks - do flash drives remap bad sectors?).

I think so.

I can't imagine that Linux or Debian would be released with this sort
of potentially severe reproducible bug but am at a loss to figure out
what I might be doing wrong or what's specific to my setup. And I
can't figure out why we're only seeing it since upgrading to lenny
when I can currently reproduce the problem under etch.

Any help would be greatly appreciated. Thanks.

I would suggest testing the flash drives with different filesystems
under different operating systems. Fill it up completely, re-plug the
device, read the data back and compare to the original.
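
A rough sketch of such a test, assuming GNU coreutils. The function names, the fill.N file names and the manifest location are made up for illustration; the manifest must be stored somewhere other than the stick under test and passed as an absolute path.

```shell
# write_testfiles DIR COUNT SIZE_KB MANIFEST
# Write COUNT files of SIZE_KB KiB of random data into DIR and record a
# checksum of each file's intended content in MANIFEST.
write_testfiles() {
    dir=$1 count=$2 size_kb=$3 manifest=$4
    : > "$manifest"
    i=0
    while [ "$i" -lt "$count" ]; do
        # Hash the data on its way to the device, so the checksum reflects
        # what was meant to be written, not what is later read back.
        sum=$(head -c "$((size_kb * 1024))" /dev/urandom |
              tee "$dir/fill.$i" | sha256sum | cut -d' ' -f1)
        printf '%s  %s\n' "$sum" "fill.$i" >> "$manifest"
        i=$((i + 1))
    done
}

# verify_testfiles DIR MANIFEST   (MANIFEST must be an absolute path)
# Re-read every file and compare it against the recorded checksum.
# Returns non-zero if any file differs or is missing.
verify_testfiles() {
    (cd "$1" && sha256sum -c --quiet "$2")
}
```

Pick COUNT and SIZE_KB so that the total comes close to the reported capacity, then unmount, re-plug and mount the stick again before calling verify_testfiles; otherwise the reads may be served from the page cache instead of the flash.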

There have been cases of USB memory sticks with manipulated controllers
produced by fraudulent manufacturers. These sticks reported a higher
capacity than they really had. They never reported read or write
errors, but once you filled more than half of the reported capacity, all
further writes would go to the same sectors, producing massive data and
filesystem corruption. I bought such a scam product myself, and it
cost me many hours of grief.
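
One way to check for that kind of wrap-around directly is to stamp every block with its own number and see whether the stamps survive. This is only a sketch under my own assumptions: the function names and the tag format are invented here, and running it against a raw device such as /dev/sdb destroys everything on it.

```shell
# stamp_blocks TARGET COUNT BS
# Write a distinct 18-byte tag at the start of each of COUNT blocks of
# BS bytes. WARNING: overwrites TARGET.
stamp_blocks() {
    target=$1 count=$2 bs=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'block-%012d' "$i" |
            dd of="$target" bs="$bs" seek="$i" conv=notrunc 2>/dev/null
        i=$((i + 1))
    done
}

# check_blocks TARGET COUNT BS
# Print the number of every block whose tag no longer matches, and
# return non-zero if there was at least one such block.
check_blocks() {
    target=$1 count=$2 bs=$3
    bad=0
    i=0
    while [ "$i" -lt "$count" ]; do
        want=$(printf 'block-%012d' "$i")
        got=$(dd if="$target" bs="$bs" skip="$i" count=1 2>/dev/null |
              head -c 18 | tr -d '\0')
        if [ "$got" != "$want" ]; then
            echo "$i"
            bad=1
        fi
        i=$((i + 1))
    done
    return "$bad"
}
```

If later writes land on sectors that were already stamped, the early blocks come back carrying the wrong tag and check_blocks lists them. As with any read-back test, re-plug the stick between stamping and checking so that the data really comes from the flash.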

Sven

