Re: Reproducible filesystem corruption in lenny



On 2009-07-28 15:42 +0200, Josh Kelley wrote:

We use Debian for some embedded devices that use off-the-shelf flash
drives for their primary storage. Since upgrading from etch to lenny
and tweaking our partition layout, we've started seeing filesystem
corruption occur very rapidly after we clone the filesystem (via
partimage and resize2fs). While investigating, I've been able to
reproduce the corruption with both etch's and lenny's partimage, with
both etch's and lenny's e2fsprogs, with both the realtime-patched
kernel we used under etch and lenny's stock amd64 kernel, with flash
drives of different sizes, with different flash drive partition
layouts, and with one of our embedded devices, an off-the-shelf lenny
server, and an off-the-shelf etch server. This doesn't make any sense
to me.

While trying to figure all of this out, I've found that I can
reproduce filesystem corruption 100% of the time simply by executing
these commands:

mke2fs -O has_journal,resize_inode,dir_index,filetype,sparse_super,large_file /dev/sdb2
tune2fs -c 29 /dev/sdb2 # /dev/sdb is an external flash drive
mount /dev/sdb2 /mnt/image
cd /mnt/image
tar xf ~/data.tar # data.tar is a 71MB archive of the /var partition
cd
umount /mnt/image
e2fsck -f /dev/sdb2

At this point, e2fsck starts complaining with errors like this:
Symlink /lib/python-support/python2.5/_dbus_glib_bindings.so (inode #113416) is invalid.
Clear<y>?

Turning off has_journal or adding -o data=journal fixes the
immediately preceding problem. (I haven't tested it for our cloning
procedure.) However, I don't want to go back to ext2, and
data=journal seems to be barely documented. (What exactly does it
do?)

Quoting mount.8:

All data is committed into the journal prior to being
written into the main file system.

In other words, file contents go through the journal as well as the
metadata, so your data is written to disk twice.
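
A minimal sketch of how the option is usually passed, in case it helps.
The function names mount_journaled and set_default_journaled are made up
for this mail; the device and mount point are the ones from your example,
and I am assuming the filesystem is mounted as ext3.

```shell
# Hypothetical helpers; only mount(8) and tune2fs(8) are real tools here.

# mount_journaled DEVICE MOUNTPOINT
# Mount an ext3 filesystem with full data journaling for this mount only.
mount_journaled() {
    mount -t ext3 -o data=journal "$1" "$2"
}

# set_default_journaled DEVICE
# Record data=journal as a default mount option in the superblock,
# so later mounts use it without an explicit -o.
set_default_journaled() {
    tune2fs -o journal_data "$1"
}

# Example (as root):
#   mount_journaled /dev/sdb2 /mnt/image
#   grep /mnt/image /proc/mounts    # the options should list data=journal
```

The tune2fs variant may be the more convenient one for cloned images,
since the setting travels with the filesystem instead of living in fstab.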

We've seen other errors after cloning (subdirectories that point to
their parents, "resize inode not valid", etc.), but these particular
errors are completely reproducible. The corruption occurs on more
than one flash drive. badblocks -w /dev/sdb reports no errors
(although I seem to remember one of the disks being bigger before
running badblocks - do flash drives remap bad sectors?).

I think so.

I can't imagine that Linux or Debian would be released with this sort
of potentially severe reproducible bug but am at a loss to figure out
what I might be doing wrong or what's specific to my setup. And I
can't figure out why we're only seeing it since upgrading to lenny
when I can currently reproduce the problem under etch.

Any help would be greatly appreciated. Thanks.

I would suggest testing the flash drives with different filesystems
under different operating systems. Fill it up completely, re-plug the
device, read the data back and compare to the original.
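
Something along these lines would do for the fill-and-compare test. It is
only a sketch: fill_drive and verify_drive are names I made up, the chunk
size is arbitrary, and I am assuming GNU dd and sha256sum are available.
Keep the manifest on a different disk, and unplug and re-plug the drive
between the two steps so that the read comes from the flash and not from
the page cache.

```shell
# Hypothetical helpers for a fill-and-compare test of a mounted drive.

# fill_drive DIR MANIFEST [CHUNK_MB] [MAX_FILES]
# Write CHUNK_MB-sized files of random data into DIR until a write fails
# (disk full) or MAX_FILES is reached (0 = no limit). One checksum line
# per file is appended to MANIFEST.
fill_drive() {
    dir=$1; manifest=$2; chunk_mb=${3:-16}; max_files=${4:-0}
    : > "$manifest"
    i=0
    while [ "$max_files" -eq 0 ] || [ "$i" -lt "$max_files" ]; do
        f="$dir/fill.$i"
        if ! dd if=/dev/urandom of="$f" bs=1M count="$chunk_mb" 2>/dev/null; then
            rm -f "$f"    # drop the partial file left by the failed write
            break
        fi
        sha256sum "$f" >> "$manifest"
        i=$((i + 1))
    done
    sync
    echo "wrote $i files"
}

# verify_drive MANIFEST
# Re-read every file listed in MANIFEST and compare checksums.
# Returns non-zero if any file is missing or differs.
verify_drive() {
    sha256sum -c "$1"
}

# Example (as root, drive mounted on /mnt/image):
#   fill_drive /mnt/image /root/manifest.sha256
#   umount /mnt/image     # then unplug, re-plug and mount at the same path
#   verify_drive /root/manifest.sha256
```

The manifest stores the paths as given, so the drive has to be mounted at
the same place for the verify step.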

There have been cases of USB memory sticks with manipulated controllers
produced by fraudulent manufacturers. These sticks reported a higher
capacity than they really had. They never reported read or write
errors, but once you filled more than half of the reported capacity, all
writes would go to the same sectors, producing massive data and
filesystem corruption. I bought such a scam product myself, and it
cost me many hours of grief.
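
To check for that kind of aliasing below the filesystem, one could stamp
every Nth sector with its own number and read the stamps back after
re-plugging. A rough sketch follows; stamp_blocks and check_blocks are
made-up names, the 512-byte sector size is an assumption, and
stamp_blocks destroys whatever is on the target, so only point it at a
drive you are prepared to wipe.

```shell
# Hypothetical helpers; stamp_blocks OVERWRITES data on TARGET.

# stamp_blocks TARGET COUNT [STRIDE]
# Write a distinct 512-byte record ("block N", space padded) into sector
# N*STRIDE of TARGET for N = 0 .. COUNT-1.
stamp_blocks() {
    target=$1; count=$2; stride=${3:-1}
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%-511s\n' "block $i" |
            dd of="$target" bs=512 seek=$((i * stride)) count=1 conv=notrunc 2>/dev/null ||
            return 1
        i=$((i + 1))
    done
    sync
}

# check_blocks TARGET COUNT [STRIDE]
# Read the stamped sectors back and report every sector whose record is
# not the one written to it. If writes wrapped around onto the same
# sectors, earlier stamps show up replaced by later ones.
check_blocks() {
    target=$1; count=$2; stride=${3:-1}
    i=0; bad=0
    while [ "$i" -lt "$count" ]; do
        got=$(dd if="$target" bs=512 skip=$((i * stride)) count=1 2>/dev/null |
              head -n 1 | sed 's/ *$//')
        if [ "$got" != "block $i" ]; then
            echo "mismatch at block $i: $got"
            bad=$((bad + 1))
        fi
        i=$((i + 1))
    done
    [ "$bad" -eq 0 ]
}

# Example (as root; a stride of 2048 sectors stamps one sector per MiB):
#   stamp_blocks /dev/sdb 4000 2048
#   (unplug and re-plug the drive)
#   check_blocks /dev/sdb 4000 2048
```

The count and stride in the example are placeholders; they need to be
chosen so that the stamps cover the whole reported capacity.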

Sven

