Re: libata in 2.4.24?

From: Greg Stark (gsstark_at_mit.edu)
Date: 12/02/03

    To: Jeff Garzik <jgarzik@pobox.com>
    Date:	02 Dec 2003 15:10:19 -0500
    
    

    Jeff Garzik <jgarzik@pobox.com> writes:

    > So, today, no acknowledgement occurs until the data _really_ is in the
    > drive's buffers.

    The drive's buffers aren't good enough. If power is lost, the write will be
    lost and the database corrupted. The data needs to be on the platters.

    > > This doesn't happen with SCSI disks where multiple requests can be pending so
    > > there's no urgency to reporting a false success. The request doesn't complete
    > > until the write hits disk. As a result SCSI disks are reliable for database
    > > operation and IDE disks aren't unless write caching is disabled.
    >
    > This is not really true.
    >
    > Regardless of TCQ, if the OS driver has not issued a FLUSH CACHE (IDE)
    > or SYNCHRONIZE CACHE (SCSI), then the data is not guaranteed to be on
    > the disk media. Plain and simple.

    That doesn't agree with people's experience. People seem to find that SCSI
    drives never cache writes. This sort of makes sense since there's just not
    much reason to report a write success before the write can be performed.
    There's no performance advantage as long as more requests can be queued up.

    > If fsync(2) returns without a flush-cache, then your data is not
    > guaranteed to be on the disk. And as you noted, flush-cache destroys
    > performance.

    It's my understanding that fsync doesn't issue a flush-cache. There was some
    discussion in the past month about making the drivers issue syncs for
    journalled filesystems, but even then adding it to fsync or O_SYNC files
    wasn't the motivation.
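
For readers unfamiliar with the O_SYNC case mentioned above, here is a minimal sketch of what an application does. The helper name `write_record_sync` and the file path are hypothetical. With O_SYNC, each write() is supposed to return only once the kernel considers the data written to the device; whether that includes flushing the drive's own write cache is exactly the open question in this thread.

```python
import os

def write_record_sync(path: str, record: bytes) -> int:
    """Write one record through a descriptor opened with O_SYNC.

    Hypothetical helper for illustration. O_SYNC is looked up with getattr
    because not every platform defines it; on such platforms this falls back
    to an ordinary buffered write.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        written = 0
        # os.write may write fewer bytes than requested, so loop until done.
        while written < len(record):
            written += os.write(fd, record[written:])
    finally:
        os.close(fd)
    return written
```

Nothing in this code can tell the application whether the drive acknowledged the write from its cache or from the media.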

    > There are three levels:
    >
    > a) Data is successfully transferred to the controller/drive queue (TCQ).
    > b) Data is successfully transferred to the drive's internal buffers.
    > c) The drive successfully transfers data to the media.

    Only the third is of interest to Postgres or other databases. In fact, I
    suspect only the third is of interest to other systems that are supposed to be
    reliable, such as MTAs. I think Wietse and others would be shocked if they
    were told fsync wasn't guaranteed to have waited until the writes had actually
    hit the media.
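
As an illustration of the assumption databases and MTAs make, here is a minimal sketch of a commit-style append. The function name `durable_append` and the log file name are hypothetical, not taken from Postgres or any MTA. The application treats the return of fsync as level (c), data on the media; per the discussion above, on a drive with write caching enabled it may only have reached level (b).

```python
import os
import tempfile

def durable_append(path: str, payload: bytes) -> int:
    """Append payload to path and call fsync before reporting success.

    Hypothetical helper. The caller assumes the data is on the media once
    this returns; that holds only if fsync reaches past the drive's cache.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        written = 0
        # os.write may be partial, so loop until the whole payload is out.
        while written < len(payload):
            written += os.write(fd, payload[written:])
        # Ask the kernel to push the file's dirty data to the device.
        os.fsync(fd)
    finally:
        os.close(fd)
    return written

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        log = os.path.join(d, "commit.log")
        durable_append(log, b"commit 1\n")
        durable_append(log, b"commit 2\n")
        with open(log, "rb") as f:
            print(f.read())
```

An application written this way acknowledges a transaction or accepts a mail message right after `durable_append` returns, which is why a falsely reported write is so damaging.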

    -- 
    greg
    
