RE: generic strncpy - off-by-one error

From: Peter Kjellerstedt (peter.kjellerstedt_at_axis.com)
Date: 08/20/03

  • Next message: CaT: "[PATCH 0/16] C99: 2.6.0-t3-bk7"
    To: "'Timothy Miller'" <miller@techsource.com>, Daniel Forrest <forrest@lmcg.wisc.edu>
    Date:	Wed, 20 Aug 2003 09:43:10 +0200
    
    

    [ I combined all my answers to your four mails into this one. ]

    > -----Original Message-----
    > From: Timothy Miller [mailto:miller@techsource.com]
    > Sent: Monday, August 18, 2003 18:07
    > To: Peter Kjellerstedt
    > Cc: 'Willy Tarreau'; linux-kernel mailing list
    > Subject: Re: generic strncpy - off-by-one error
    >
    > Peter Kjellerstedt wrote:
    > > For loops 2.867568 5.620561 8.128734 28.286289
    > > Multi byte fill 2.868031 5.670782 6.312027 11.336015
    > >
    > > And here are the numbers for my P4:
    > >
    > > For loops 3.060262 5.927378 8.796814 30.659818
    > > Multi byte fill 3.126607 5.898459 7.096685 13.135379
    > >
    > > So there is no doubt that the multi byte version is a clear
    > > winner (which was expected, I suppose).
    >
    > Cool! Hey, is this just an exercise, or are we actually going to
    > use this? I would be very happy to have something I contributed
    > to put into the kernel. :)

    I have no idea. But it sure seems like the generic strncpy() could
    use some improvement (and probably strcpy() too). However, I guess
    one can argue that it is better to keep the generic implementations
    as simple as possible, and then have each architecture implement
    their own optimized versions.

    > > Here is the code that I used:
    > >
    > > char *strncpy(char *dest, const char *src, size_t count)
    > > {
    > > char *tmp = dest;
    > >
    > > while (count && *src) {
    > > *tmp++ = *src++;
    > > count--;
    > > }
    > >
    > > if (count) {
    >
    > Good idea... bad to do so many checks if count is zero. On the
    > other hand, if count is rarely zero, then it's a loss. Maybe
    > benchmark with and without?

    Actually, I think this change would be lost in the noise in
    the measurements.

    > > size_t count2;
    > >
    > > while (count & (sizeof(long) - 1)) {
    > > *tmp++ = '\0';
    > > count--;
    > > }
    > >
    > > count2 = count / sizeof(long);
    >
    > I know that a good compiler should migrate code to help the CPU
    > pipeline, but how about moving this "count2 = " line up to before
    > the first fill loop. See if that helps any. Always good to
    > precompute well in advance.

    Cannot do that as the first loop modifies count.

    > > while (count2) {
    > > *((long *)tmp)++ = '\0';
    > > count2--;
    > > }
    > >
    > > count &= (sizeof(long) - 1);
    >
    > And move this to before the middle fill loop.

    In my later implementations this is not possible any longer.

    > > while (count) {
    > > *tmp++ = '\0';
    > > count--;
    > > }
    > > }
    > >
    > > return dest;
    > > }

    > Daniel Forrest wrote:
    > > On Sat, Aug 16, 2003 at 10:15:14AM +0200, Peter Kjellerstedt wrote:
    >
    > > Shouldn't this be:
    > >
    > > while (tmp & (sizeof(long) - 1)) {
    > >
    > >
    > >> *tmp++ = '\0';
    > >> count--;
    > >> }
    >
    > Oh, yeah! That's right. We need to check the address. Also need to
    > cast tmp to (int) or something (doesn't matter what it's cast to,
    > because we only care about the lower 2 or 3 bits).
    >
    > Peter, please see if this makes any speed difference. But it
    > definately needs this fix.

    Yes, I added it in my later versions.

    > Frankly, I'm surprised it works. In fact, it might not, but
    > it's hard to tell from the tests just benchmarks.

    I actually added verification to my benchmarking program too,
    so the later versions I mailed are verified to work the same
    as the standard glibc implementation at least.

    > Also, if you're doing dense addressing on Alpha, and you do byte
    > accesses the addresses for char are byte addresses, but the code
    > does read-modify-write to memory for byte accesses, because in
    > that mode, you can only do 32-bit and 64-bit accesses. The
    > performance increase could be even greater for Alpha than for x86.
    >
    > For Sparc, we might be able to do something with VIS instructions,
    > although I don't know what the setup overhead is. Sun's memcpy and
    > memset only use VIS when the size is greater than 512, because
    > otherwise, it's not worth it.
    >
    > I don't know enough about PowerPC other than the proper use of the
    > "eieio" instruction. :)

    Remember that many architectures already have their own architecture
    specific implementations. Also note that most of them have not been
    modified to do the null-padding... The following architectures have
    their own implementations: alpha, h8300, i386, m68k, m68knommu, mips,
    ppc, ppc64, s390 and sh. The following use the generic implementation:
    arm, arm26, cris, ia64, mips64, parisc, sparc, sparc64, um, v850 and
    x86_64.

    > Daniel Forrest wrote:
    >
    > >
    > > - if (count) {
    > > + if (count >= sizeof(long)) {
    > > size_t count2;
    > >
    >
    > I like this size check here, but the comparison should be to
    > some number greater than sizeof(long). There is a threshold
    > below which it's not worth it to do all of the extra loops.
    > If you're going to fill only four bytes, it's probably best
    > to do it just using the last loop.
    >
    > Maybe through some trial and error, we could determine what that
    > threshold is. I'm betting it's something around 2* or 3* word size.

    The problem is that this probably varies a lot between different
    architectures, and also processor speeds. Probably best left for
    the architecture specific implementations.

    > Peter Kjellerstedt wrote:
    > > For unaligned source or destination the "Multi copy & fill"
    > > would degenerate into "Multi byte fill". However, for
    > > architectures like ix86 and CRIS that can do unaligned long
    > > access, it would be a win to remove the UNALIGNED() check,
    > > and use long word copying all the time.
    >
    > In fact, it's possible to do the copy even if the source and
    > dest are not aligned. It requires holding pieces of source in
    > a register and doing shifts. If the CPU is much faster than
    > the memory, this can be a huge win.

    This is true. However, this too is probably best left for the
    architecture specific implementations.

    > > Then whether using memset() or your filling is a win depends
    > > on the architecture and how many bytes needs to be filled.
    > > For a slow processor with little function call overhead (like
    > > CRIS), using memset seems to be a win almost immediately.
    > > However, for a fast processor like my P4, the call to memset
    > > is not a win until some 1500 bytes need to be filled.
    >
    > What is in memset that isn't in the fill code I suggested?

    In the CRIS case it sets 12 registers to zero and uses the
    movem instruction to copy them to memory at once (movem is
    normally used to store/restore registers to/from the stack
    on function entry/exit). Thus it can clear 48 bytes each
    time through the loop. And of course the rest of the function
    is hand crafted to give the best performance for any count.

    //Peter
    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/


  • Next message: CaT: "[PATCH 0/16] C99: 2.6.0-t3-bk7"

    Relevant Pages

    • OoO VAX (was: Code density and performance?)
      ... What architectural feature of the VAX would be a problem in the ... >some UNIX, they just couldn't compete. ... 386 architecture, too. ... >> either with pre-decode bits (as used in various 386 implementations), ...
      (comp.arch)
    • Re: PHP site development
      ... of application implementations for this architecture. ... a feel for the front-end design using your architecture. ... "Tony Marston" wrote in message ...
      (comp.lang.php)
    • Re: Implementations of VLIW
      ... Sounds like a homework problem to me. ... Anyhow, if the implementations aren't binary-compatible with each other, ... they aren't the same architecture, ... expect two implementations of the same VLIW architecture to be binary ...
      (comp.arch.embedded)
    • Re: Peer to peer replication on SQL 2000?
      ... can not close the loop with 2000. ... You can build a peer-to-peer architecture in 2005, ... The first conflict that you introduce will irrevocably damage ... > Paul Ibison SQL Server MVP, ...
      (microsoft.public.sqlserver.replication)
    • Re: Implementations of VLIW
      ... Anyhow, if the implementations aren't binary-compatible with each other, ... they aren't the same architecture, ... implementations have SSE3 instructions, as one example. ... expect two implementations of the same VLIW architecture to be binary ...
      (comp.arch.embedded)