Re: Random file I/O regressions in 2.6

From: Ram Pai (linuxram_at_us.ibm.com)
Date: 05/11/04

  • Next message: Andrew Morton: "Re: 2.6.6-mm1"
    To: Andrew Morton <akpm@osdl.org>
    Date:	10 May 2004 15:39:28 -0700
    
    

    On Mon, 2004-05-10 at 13:21, Andrew Morton wrote:
    > Ram Pai <linuxram@us.ibm.com> wrote:
    > >
    > > > Ram, can you take a look at fixing that up please? Something clean, not
    > > > more hacks ;) I'd also be interested in an explanation of what the extra
    > > > page is for. The little comment in there doesn't really help.
    > >
    > >
    > > The reason for the extra page read is as follows:
    > >
    > > Consider 16k random reads i/os. Reads are generated 4pages at a time.
    > >
    > > the readahead is triggered when the 4th page in the 'current-window' is
    > > touched.
    >
    > Right. We've added two whole unsigned longs to the file_struct to track
    > the access patterns. That should be sufficient for us to detect when the
    > access pattern is random, and to then not perform readahead due to a
    > current-window miss *at all*.
    >
    > So that extra page can go away, and:
    >
    > --- 25/mm/readahead.c~a Mon May 10 13:16:59 2004
    > +++ 25-akpm/mm/readahead.c Mon May 10 13:17:22 2004
    > @@ -492,21 +492,17 @@ do_io:
    > */
    > if (ra->ahead_start == 0) {
    > /*
    > - * if the average io-size is less than maximum
    > + * If the average io-size is less than maximum
    > * readahead size of the file the io pattern is
    > * sequential. Hence bring in the readahead window
    > * immediately.
    > - * Else the i/o pattern is random. Bring
    > - * in the readahead window only if the last page of
    > - * the current window is accessed (lazy readahead).
    > */
    > unsigned long average = ra->average;
    >
    > if (ra->serial_cnt > average)
    > average = (ra->serial_cnt + ra->average) / 2;
    >
    > - if ((average >= max) || (offset == (ra->start +
    > - ra->size - 1))) {
    > + if (average >= max) {
    > ra->ahead_start = ra->start + ra->size;
    > ra->ahead_size = ra->next_size;
    > actual = do_page_cache_readahead(mapping, filp,
    >
    > _
    >
    >
    > That way, we read the correct amount of data, and we only start I/O when we
    > know the application is going to actually use the data.
    >
    > This may cause problems when the application transitions from seeky-access
    > to linear-access.
    >
    > Does it sound feasible?
    I am nervous about this change. You are totally getting rid of
    lazy-readahead and that was the optimization which gave the best
    possible boost in performance.
    Let me see how this patch does with a DSS benchmark.

    >
    > >
    > > Probably we may see marginal degradation of this optimization with 16k
    > > i/o but the amount of wastage avoided by this optimization (hack)
    > > is great when random i/o is of larger size. I think it was 4% better
    > > performance on DSS workload with 64k random reads.
    >
    > 64k sounds unusually large. We need top performance at 8k too.
    >
    > > Do you still think its a hack?
    >
    > yup ;)
    >

    :-(

    > > Also I think with sysbench workload and Andrew's ra-copy patch, we
    > > might be loosing some benefits of some of the optimization because
    > > if two threads simulteously work with copies of the same ra structure
    > > and update it, the optimization effect reflected in one of the
    > > ra-structure is lost depending on which ra structure gets copied back
    > > last.
    >
    > hm, maybe. That only makes a difference if two threads are accessing the
    > same fd at the same time, and it was really bad before the patch. The IO
    > patterns seemed OK to me with the patch. Except it's reading one page too
    > many.

    In the normal large random workload this extra page would have
    compesated for all the wasted readaheads. However in the case of
    sysbench with Andrew's ra-copy patch the readahead calculation is not
    happening quiet right. Is it worth trying to get a marginal gain
    with sysbench at the cost of getting a big hit on DSS benchmarks,
    aio-tests,iozone and probably others. Or am I making an unsubstantiated
    claim? I will get back with results.

    RP

    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/


  • Next message: Andrew Morton: "Re: 2.6.6-mm1"

    Relevant Pages