Re: [PATCH 0/4] (RESEND) ext3[34] barrier changes



Theodore Tso wrote:
> On Fri, May 16, 2008 at 11:03:15PM +0100, Jamie Lokier wrote:
> > The MacOS X folks decided that speed is most important for fsync().
> > fsync() does not guarantee commit to platter. *But* they added an
> > fcntl() for applications to request a commit to platter, which SQLite
> > at least uses. I don't know if MacOS X uses barriers for filesystem
> > operations.
>
> Out of curiosity, exactly *what* semantics did MacOS X give fsync(),
> then? Did it simply start the process of staging writes to disk, but
> not wait for the writes to hit the platter before returning? That's
> basically the equivalent of ext3's barrier=0.

I haven't read the code and don't use MacOS myself.

From its fcntl() man page:

Note that while fsync() will flush all data from the host to the
drive (i.e. the "permanent storage device"), the drive itself may
not physically write the data to the platters for quite some time
and it may be written in an out-of-order sequence.

Specifically, if the drive loses power or the OS crashes, the
application may find that only some or none of their data was
written. The disk drive may also re-order the data so that later
writes may be present while earlier writes are not.

This is not a theoretical edge case. This scenario is easily
reproduced with real world workloads and drive power failures.

For applications that require tighter guarantees about the
integrity of their data, MacOS X provides the F_FULLFSYNC
fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered
data to permanent storage. Applications such as databases that
require a strict ordering of writes should use F_FULLFSYNC to
ensure their data is written in the order they expect. Please see
fcntl(2) for more detail.

Some notable things:

1. Para 2 says "if the drive loses power __or the OS crashes__".
Does this mean some drives will abandon cached writes when reset
despite retaining power?

2. Para 3 to be re-read by the skeptical.

3. Para 4 perpetuates the confused idea that write ordering is what
it's all about, for things like databases. In fact, sometimes
ordering barriers are all that's needed and flush is unnecessary
performance baggage. But sometimes an fsync() which only
guarantees ordering is insufficient. An "ideal"
database-friendly block layer would offer both.

I doubt if common unix mail transports use F_FULLFSYNC on Darwin
instead of fsync(), before reporting a mail received safely, but they
probably should. I recall SQLite does use it (unless I'm confusing
it with some other database).
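
For what it's worth, here is a rough sketch of what a mail transport
or database could do before acknowledging data as safely stored. It
only illustrates the fsync() vs. F_FULLFSYNC distinction from the man
page text above; it is not taken from any real MTA or from SQLite. The
names full_sync() and write_durably() are made up for this example,
and falling back to plain fsync() when the fcntl fails is my own
assumption about a reasonable policy, not something the man page
prescribes.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: push fd's data as far toward stable storage as
 * the platform allows.  Returns 0 on success, -1 with errno set. */
int full_sync(int fd)
{
#ifdef F_FULLFSYNC
    /* Darwin: ask the drive to flush its own write cache as well. */
    if (fcntl(fd, F_FULLFSYNC) != -1)
        return 0;
    /* Assumption: if the fcntl is refused, plain fsync() is still
     * better than nothing, so fall through to it. */
#endif
    return fsync(fd);
}

/* Hypothetical helper: write len bytes to path and report success
 * only after full_sync() and close() have both succeeded. */
int write_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing; retry */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    if (full_sync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

On systems without F_FULLFSYNC this compiles down to an ordinary
fsync(), so whether the data reaches the platter there still depends
on the filesystem and barrier settings discussed in this thread.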

-- Jamie