Re: [patch] ext2/3: document conditions when reliable operation is possible



On Sun, 30 Aug 2009, Pavel Machek wrote:

I thought the reason for that was that if your metadata is horked, further
writes to the disk can trash unrelated existing data because it's lost track
of what's allocated and what isn't. So back when the assumption was "what's
written stays written", then keeping the metadata sane was still darn
important to prevent normal operation from overwriting unrelated existing
data.

Then Pavel notified us of a situation where interrupted writes to the disk can
trash unrelated existing data _anyway_, because the flash block size on the 16
gig flash key I bought retail at Fry's is 2 megabytes, and the filesystem thinks
it's 4k or smaller. It seems like what _broke_ was the assumption that the
filesystem block size >= the disk block size, and nobody noticed for a while.
(Except the people making jffs2 and friends, anyway.)
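To put rough numbers on that mismatch, here is a small sketch using the two sizes quoted above (a 2 megabyte erase block and a 4k filesystem block). The function names are made up for illustration, and it assumes that filesystem block 0 starts on an erase block boundary and that the device does no remapping, neither of which is guaranteed on a real key.

```python
ERASEBLOCK_BYTES = 2 * 1024 * 1024  # 2 megabyte erase block, as quoted above
FS_BLOCK_BYTES = 4 * 1024           # 4k filesystem block, as quoted above


def fs_blocks_per_eraseblock(eraseblock_bytes=ERASEBLOCK_BYTES,
                             fs_block_bytes=FS_BLOCK_BYTES):
    """How many filesystem blocks live inside one erase block."""
    return eraseblock_bytes // fs_block_bytes


def neighbours_at_risk(fs_block_no,
                       eraseblock_bytes=ERASEBLOCK_BYTES,
                       fs_block_bytes=FS_BLOCK_BYTES):
    """Return (first, last) filesystem block numbers that share an erase
    block with fs_block_no.

    Assumption: filesystem block 0 is aligned to an erase block boundary.
    """
    per = fs_blocks_per_eraseblock(eraseblock_bytes, fs_block_bytes)
    first = (fs_block_no // per) * per
    return first, first + per - 1
```

With those sizes, one erase block holds 512 filesystem blocks, so an interrupted write to a single 4k block can in principle take 511 unrelated blocks with it.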

Today we have cheap plentiful USB keys that act like hard drives, except that
their write block size isn't remotely the same as hard drives', but they
pretend it is, and then the block wear levelling algorithms fuzz things
further. (Gee, a drive controller lying about drive geometry, the scsi crowd
should feel right at home.)

actually, you don't know if your USB key works that way or not. Pavel has
some that do, but that doesn't mean that all flash drives do

when you do a write to a flash drive you have to do the following:

1. allocate an empty eraseblock to put the data on

2. read the old eraseblock

3. merge the incoming write to the eraseblock

4. write the updated data to the flash

5. update the flash translation layer to point reads at the new location
instead of the old location.
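The five steps above can be sketched as a toy model. This only illustrates the ordering and is not any real controller's firmware: the `ToyFTL` class, the `PowerLoss` exception, the 4-sector erase block and the dict standing in for the translation table are all assumptions made for the example.

```python
SECTORS_PER_ERASEBLOCK = 4  # assumption: tiny erase block to keep this readable


class PowerLoss(Exception):
    """Raised to simulate power being cut part-way through a write."""


class ToyFTL:
    """Hypothetical flash translation layer following steps 1-5 above."""

    def __init__(self, physical_blocks):
        self.flash = [None] * physical_blocks     # None means "erased"
        self.free = list(range(physical_blocks))  # erased blocks ready for use
        self.map = {}                             # logical block -> physical block

    def read(self, logical, sector):
        return self.flash[self.map[logical]][sector]

    def write(self, logical, sector, data, fail_before_step=None):
        def step(n):
            # Simulate a power cut just before step n runs.
            if fail_before_step == n:
                raise PowerLoss(f"power lost before step {n}")

        step(1)
        new = self.free.pop(0)                    # 1. allocate an empty eraseblock
        step(2)
        old = self.map.get(logical)
        if old is None:
            merged = [b""] * SECTORS_PER_ERASEBLOCK
        else:
            merged = list(self.flash[old])        # 2. read the old eraseblock
        step(3)
        merged[sector] = data                     # 3. merge the incoming write
        step(4)
        self.flash[new] = merged                  # 4. write the updated data
        step(5)
        self.map[logical] = new                   # 5. point reads at the new copy
        # Only now is the old copy stale, so only now is it erased and recycled.
        if old is not None:
            self.flash[old] = None
            self.free.append(old)
```

The point of the ordering is that the old eraseblock stays intact until the translation table has been switched over, so a power cut at any earlier step leaves reads pointing at the complete old data. The toy does not try to reclaim the half-written new block after a failure; a real device would need some recovery pass for that.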


That would need two erases per single sector written, no? Erase is in
millisecond range, so the performance would be just way too bad :-(.

no, it only needs one erase

if you don't have a pool of pre-erased blocks, then you need to do an erase of the new block you are allocating (before step 4)

if you do have a pool of pre-erased blocks, then you don't have to do any erase of the data blocks until after step 5 and you do the erase when you add the old data block to the pool of pre-erased blocks later.

in either case the requirements of wear leveling require that the flash translation layer update its records to show that an additional write took place.
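A rough way to check the "only one erase" claim is to count erases in both cases. The sketch below is a bookkeeping model only; the `EraseCounter` class and its fields are hypothetical names, and it assumes that every stale block eventually gets erased exactly once.

```python
class EraseCounter:
    """Counts erases for a sequence of writes, with or without a pool."""

    def __init__(self, pre_erased, dirty):
        self.pre_erased = list(pre_erased)  # blocks erased ahead of time
        self.dirty = list(dirty)            # stale blocks not yet erased
        self.erases_on_write_path = 0       # erases the writer has to wait for
        self.erases_deferred = 0            # erases done later, off the write path
        self.write_count = {}               # per-block tally kept for wear leveling

    def write(self, old_block):
        if self.pre_erased:
            # Pool available: no erase is needed before step 4.
            new_block = self.pre_erased.pop()
        else:
            # No pool: erase the block being allocated, before step 4.
            new_block = self.dirty.pop()
            self.erases_on_write_path += 1
        # Either way the wear leveling records note one more write.
        self.write_count[new_block] = self.write_count.get(new_block, 0) + 1
        # After step 5 the old copy is stale and waits for an erase.
        self.dirty.append(old_block)
        return new_block

    def background_erase(self):
        """Erase stale blocks later and return them to the pool."""
        while self.dirty:
            self.pre_erased.append(self.dirty.pop())
            self.erases_deferred += 1
```

Whichever path a write takes, the total comes to one erase per write; the pool only changes whether the writer has to wait for it.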

what appears to be happening on some cheap devices is that they do the following instead

1. allocate an empty eraseblock to put the data on

2. read the old eraseblock

3. merge the incoming write to the eraseblock

4. erase the old eraseblock

5. write the updated data to the flash

I don't know where in (or after) this process they update the wear-leveling/flash translation layer info.

with this algorithm, if the device loses power between step 4 and step 5 you lose all the data on the eraseblock.
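The failure window can be shown with a toy version of that erase-before-write ordering. This is a guess at the behaviour described, not a reverse-engineered algorithm: `unsafe_write`, `PowerLoss` and the dict used as flash are hypothetical, and the merged data is assumed to live only in the controller's RAM between steps 3 and 5.

```python
class PowerLoss(Exception):
    """Raised to simulate power being cut part-way through a write."""


def unsafe_write(flash, free, old, sector, data, power_fails_after_erase=False):
    """Write one sector using the erase-before-write ordering.

    flash: dict mapping physical block number -> list of sectors, or None if erased
    free:  list of erased block numbers
    """
    new = free.pop()             # 1. allocate an empty eraseblock
    merged = list(flash[old])    # 2. read the old eraseblock (into RAM only)
    merged[sector] = data        # 3. merge the incoming write
    flash[old] = None            # 4. erase the old eraseblock
    if power_fails_after_erase:
        # The only remaining copy was in RAM, and it is gone now.
        raise PowerLoss("power lost between step 4 and step 5")
    flash[new] = merged          # 5. write the updated data to the flash
    free.append(old)
    return new
```

If the power fails in that window, both the old block and the new block are empty, so every sector in the eraseblock is lost, including the ones the write never touched.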

with deferred erasing of blocks, the safer algorithm is actually the faster one (up until you run out of your pool of available eraseblocks, at which time it slows down to the same speed as the unreliable one).
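A back-of-the-envelope model of that speed argument, for what it's worth. The function name and parameters are invented for the example, the timing values passed in are placeholders rather than measurements, and it assumes a burst of writes with no idle time in which the pool could be refilled.

```python
def total_write_time(n_writes, pool_size, erase_ms, program_ms, deferred):
    """Time spent on the write path for a burst of n_writes.

    deferred=True  models the safer ordering with a pool of pre-erased blocks.
    deferred=False models erase-before-write, which erases on every write.
    """
    pool = pool_size if deferred else 0
    elapsed = 0
    for _ in range(n_writes):
        if pool > 0:
            pool -= 1            # pre-erased block available: skip the erase
        else:
            elapsed += erase_ms  # must erase before the data can be written
        elapsed += program_ms    # writing the data itself
    return elapsed
```

For example, with placeholder costs of 2 for an erase and 1 for a write, a burst of 10 writes against a pool of 4 takes 22 units with deferred erasing and 30 without; with an empty pool both come out at 30.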

most flash drives are fairly slow to write to in any case.

even the Intel X25-M drives are in the same ballpark as rotating media for writes. as far as I know only the X25-E SSD drives are faster to write to than rotating media, and most flash drives are _far_ slower.

David Lang


