Re: Buffered I/O slowness

From: Jesse Barnes (jbarnes_at_engr.sgi.com)
Date: 10/29/04

    To: linux-kernel@vger.kernel.org
    Date:	Fri, 29 Oct 2004 10:46:48 -0700
    
    
    

    On Monday, October 25, 2004 6:14 pm, Jesse Barnes wrote:
    > I've been doing some simple disk I/O benchmarking with an eye towards
    > improving large, striped volume bandwidth. I ran some tests on individual
    > disks and filesystems to establish a baseline and found that things
    > generally scale quite well:
    >
    > o one thread/disk using O_DIRECT on the block device
    > read avg: 2784.81 MB/s
    > write avg: 2585.60 MB/s
    >
    > o one thread/disk using O_DIRECT + filesystem
    > read avg: 2635.98 MB/s
    > write avg: 2573.39 MB/s
    >
    > o one thread/disk using buffered I/O + filesystem
    > read w/default (128) block/*/queue/read_ahead_kb avg: 2626.25 MB/s
    > read w/max (4096) block/*/queue/read_ahead_kb avg: 2652.62 MB/s
    > write avg: 1394.99 MB/s
    >
    > Configuration:
    > o 8p sn2 ia64 box
    > o 8GB memory
    > o 58 disks across 16 controllers
    > (4 disks for 10 of them and 3 for the other 6)
    > o aggregate I/O bw available is about 2.8GB/s
    >
    > Test:
    > o one I/O thread per disk, round robined across the 8 CPUs
    > o each thread did ~450MB of I/O depending on the test (ran for 10s)
    > Note: the total was > 8GB so in the buffered read case not everything
    > could be cached

    More results here. I've run some tests on a large dm striped volume formatted
    with XFS. It had 64 disks with a 64k stripe unit (XFS was made aware of this
    at format time), and I explicitly set the readahead to 524288 blocks using
    blockdev. The results aren't as bad as my previous runs, but I think they
    are still much slower than they ought to be, given the direct I/O results
    above. This was after a fresh mount, so the pagecache was empty when I
    started the tests.
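
    To put the geometry above in bytes, here is a rough sketch of the
    arithmetic. It assumes that "64k" means 64 KiB and that the readahead
    "blocks" are the 512-byte sectors that blockdev --setra counts in; neither
    unit is spelled out above, and the function names are made up for
    illustration.

```python
# Assumption: "blocks" means 512-byte sectors, the unit blockdev --setra uses.
SECTOR_BYTES = 512
KIB = 1024
MIB = 1024 * KIB


def full_stripe_bytes(disks, stripe_unit_bytes):
    """Bytes needed to touch every disk in the stripe set exactly once."""
    return disks * stripe_unit_bytes


def sectors_to_bytes(sectors, sector_bytes=SECTOR_BYTES):
    """Convert a sector count to bytes."""
    return sectors * sector_bytes


stripe = full_stripe_bytes(64, 64 * KIB)   # 64 disks, 64 KiB stripe unit (assumed KiB)
readahead = sectors_to_bytes(524288)       # readahead setting from the test

print(stripe // MIB, "MiB per full stripe")
print(readahead // MIB, "MiB of readahead")
print(readahead // stripe, "full stripes per readahead window")
```

    Under those assumptions the readahead window is many full stripes wide, so
    the window size itself should not be what limits the striped reads.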

    o one thread on one large volume using buffered I/O + filesystem
      read (1 thread, one volume, 131072 blocks/request) avg: ~931 MB/s
      write (1 thread, one volume, 131072 blocks/request) avg: ~908 MB/s
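
    For scale, the buffered numbers can be set against the earlier O_DIRECT
    block device averages. This is only a rough comparison: the earlier run
    used one thread per disk on 58 disks, while this one is a single thread on
    a 64-disk volume, and the buffered figures are approximate. The helper
    name is made up for illustration.

```python
# Figures copied from the results in this thread (MB/s).
DIRECT_BLOCKDEV = {"read": 2784.81, "write": 2585.60}
BUFFERED_STRIPED = {"read": 931.0, "write": 908.0}  # reported as approximate


def fraction_of_baseline(measured, baseline):
    """Return measured throughput as a fraction of the baseline."""
    return measured / baseline


for op in ("read", "write"):
    frac = fraction_of_baseline(BUFFERED_STRIPED[op], DIRECT_BLOCKDEV[op])
    print(f"{op}: {frac:.0%} of the O_DIRECT block device average")
```

    Both buffered figures come out at roughly a third of the direct I/O
    averages.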

    I'm intentionally issuing very large reads and writes here to take advantage
    of the striping, but it looks like both the readahead and regular buffered
    I/O code will split the I/O into page-sized chunks? The call chain is pretty
    long, but it looks to me like do_generic_mapping_read() will split the reads
    up by page and issue them independently to the lower levels. In the direct
    I/O case, up to 64 pages are issued at a time, which seems like it would help
    throughput quite a bit. The profile seems to confirm this. Unfortunately I
    didn't save the vmstat output for this run (and now the FC switch is
    misbehaving, so I have to fix that before I run again), but iirc the system
    time was pretty high given that only one thread was issuing I/O.
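
    If that reading of the call chain is right, the difference in the number
    of submissions per request is large. A back-of-the-envelope sketch,
    assuming 512-byte blocks (so a 131072-block request is 64 MiB) and a
    16 KiB page size, which is a common ia64 configuration but was not stated
    for this box; the function name is hypothetical:

```python
KIB = 1024
MIB = 1024 * KIB

REQUEST_BYTES = 131072 * 512   # assumption: 512-byte blocks
PAGE_BYTES = 16 * KIB          # assumption: 16 KiB pages on this ia64 kernel


def submissions(request_bytes, page_bytes, pages_per_submission):
    """Number of separate submissions needed to cover one request."""
    pages = -(-request_bytes // page_bytes)        # ceiling division
    return -(-pages // pages_per_submission)       # ceiling division


per_page = submissions(REQUEST_BYTES, PAGE_BYTES, 1)    # one page at a time
batched = submissions(REQUEST_BYTES, PAGE_BYTES, 64)    # up to 64 pages at a time

print(REQUEST_BYTES // MIB, "MiB per request")
print(per_page, "submissions when issued page by page")
print(batched, "submissions when issued 64 pages at a time")
```

    With a different page size the absolute counts change, but the 64x ratio
    between the two cases stays the same.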

    So maybe a few things need to be done:
      o set readahead to larger values by default for dm volumes at setup time
        (the default was very small)
      o maybe bypass readahead for very large requests?
        if the process is doing a huge request, chances are that readahead won't
        benefit it as much as it would a process doing small requests
      o not sure about writes yet; I haven't looked at that call chain much
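
    The first two ideas in the list could look something like the sketch
    below. This is not kernel code and neither policy exists as written: the
    function names, the "two full stripes" default, and the bypass threshold
    are all assumptions picked for illustration.

```python
KIB = 1024
MIB = 1024 * KIB


def default_readahead_bytes(disks, stripe_unit_bytes, stripes=2):
    """Hypothetical default for a striped volume: a whole number of full
    stripes, so readahead keeps every disk busy."""
    return disks * stripe_unit_bytes * stripes


def should_bypass_readahead(request_bytes, readahead_bytes):
    """Hypothetical policy: skip readahead once a single request already
    covers at least the whole readahead window."""
    return request_bytes >= readahead_bytes


window = default_readahead_bytes(64, 64 * KIB)   # 64 disks, 64 KiB stripe unit

print(window // MIB, "MiB default window")
print("64 MiB request bypasses:", should_bypass_readahead(64 * MIB, window))
print("16 KiB request bypasses:", should_bypass_readahead(16 * KIB, window))
```

    A size-based threshold like this is only one possible trigger; where the
    cutoff should sit would have to come from measurements.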

    Does any of this sound reasonable at all? What else could be done to make the
    buffered I/O layer friendlier to large requests?

    Thanks,
    Jesse

    
    



