Re: [Question] Does the kernel ignore errors writing to disk?

From: Richard B. Johnson (linux-os_at_analogic.com)
Date: 04/27/05

  • Next message: Randy.Dunlap: "Re: [Fastboot] Re: Kdump Testing"
    Date:	Wed, 27 Apr 2005 15:12:59 -0400 (EDT)
    To: mike.miller@hp.com
    
    

    On Wed, 27 Apr 2005 mike.miller@hp.com wrote:

    > Hello All,
    > I have observed some behavior under certain failure conditions that seems
    > as if the kernel may be ignoring write errors to disk.
    > During very heavy read/write I/O, if we force a disk to fail, requests
    > continue to be submitted until the controller's queue is full.
    > Ultimately, the requests are timed out by the controller. When this
    > happens we see filesystem corruption. Sometimes it's the file data,
    > other times it's filesystem metadata that has been timed out and
    > failed. Either way it's obviously undesirable behavior.
    > It looks like the OS/filesystem (ext2/3 and reiserfs) does not
    > wait for a successful completion. Is this assumption correct?
    >
    > Thanks,
    > mikem

    It depends. Obviously, if you disconnect your hard drive, the writes
    will fail with a time-out. But they fail only after a number of retries
    (how many depends upon the type of disk and its driver). So, if you
    "force" a timeout by disconnecting a drive, you don't have
    the same situation as a normally failed write.

    Disk/file writes go like this (assuming no sync() or fsync()).

    (1) File data gets flushed to a queue.
    (2) When the queue gets nearly full, based upon an LRU mechanism,
          data are written to the disk.
    (3) If the disk-write fails, the driver retries the write.
    (4) If the write continues to fail, e.g., timeout, no disk, etc.,
          the kernel gives up and does not hang forever. If you have
          disconnected the drive, the syslog can't be written to
          the device either, so your next boot won't show the event. It
          looks as though it was ignored.
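
The steps above assume no sync() or fsync(). For contrast, here is a minimal sketch of how a user-space program can ask to be told about a failed write-back instead of relying on the deferred path. The helper name write_durably() and its negative-errno return convention are made up for illustration; only open(), write(), fsync() and close() are real calls. How reliably fsync() reports a deferred error depends on the kernel version and filesystem, so treat this as a sketch rather than a guarantee.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original post): write len bytes to
 * path and ask the kernel to push them to the device before returning.
 * Returns 0 on success or a negative errno value on failure.
 */
int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -errno;

    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, try again */
            int err = errno;
            close(fd);
            return -err;
        }
        p += n;                 /* handle short writes */
        left -= (size_t)n;
    }

    /*
     * A successful write() normally means only that the data was queued
     * in memory. fsync() waits for the write-back, and it is the point
     * where a device error (typically EIO) can be reported to the caller.
     */
    if (fsync(fd) < 0) {
        int err = errno;
        close(fd);
        return -err;
    }

    if (close(fd) < 0)
        return -errno;
    return 0;
}
```

A caller would check the result, for example `if (write_durably("data.bin", buf, len) != 0)`, and treat any non-zero value as "the data may not be on disk". Without the fsync() step, the write() calls can all succeed even though the later write-back in steps (2) to (4) fails.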

    You can observe the behavior by mounting a floppy disk and
    then removing it while it is being written. There are many
    attempts to write to the device and then that write is discarded.

    Cheers,
    Dick Johnson
    Penguin : Linux version 2.6.11 on an i686 machine (5537.79 BogoMips).
      Notice : All mail here is now cached for review by Dictator Bush.
                      98.36% of all statistics are fiction.

