Re: amd64 bitops fix for -Os

From: Randy.Dunlap (rdunlap_at_xenotime.net)
Date: 10/30/05

  • Next message: Steven Rostedt: "Re: [ketchup] patch to allow for moving of .gitignore in 2.6.14"
    Date:	Sat, 29 Oct 2005 16:41:49 -0700
    To: Alexandre Oliva <aoliva@redhat.com>
    
    

    On Sat, 29 Oct 2005 19:56:02 -0200 Alexandre Oliva wrote:

    > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=171672
    >
    > This patches fixes a bug that comes up when compiling the kernel for
    > x86_64 optimizing for size. It affects 2.6.14 for sure, but I'm
    > pretty sure many earlier kernels are affected as well.
    >
    > The symptom is that, as soon as some change is made to the root
    > filesystem (e.g. dmesg > /var/log/dmesg), the kernel mostly hangs. It
    > was not the first time I'd run into this symptom, but this time I
    > could track the problem down to enabling size optimizations in the
    > kernel build. It took some time to narrow down the culprit source
    > with a binary search, compiling part of the kernel sources with -Os
    > and part with -O2, but eventually it was clear that bitops itself was
    > to blame, which should have been clear from the soft lockup oops I
    > got.
    >
    > The problem is that find_first_zero_bit() fails when called with an
    > underflown size, because its inline asm assumes at least one iteration
    > of scasq will run. When this does not hold, the conditional branch
    > that follows it uses flags from instructions prior to the asm
    > statement.
    >
    > When optimizing for speed, the generated code is such that the flags
    > will have the correct value, because of the side effects on flags of
    > the right shift of the size, that survive through to the asm
    > statement. When optimizing for size, however, the mov instruction
    > used to initialize %rax with -1 is replaced with a smaller or
    > instruction, that modifies the flags and thus breaks the
    > zero-trip-count case.
    >
    > Obviously the asm statement must not rely on the compiler setting up
    > flags by chance, so we have to either force the flags to be set
    > properly or make sure we run scasq at least once. In teh
    > find_first_zero_bit case, this comes at pretty much no cost, since we
    > already test size for non-zero, but we used to do that adjusting it
    > from bits to words; changing it should have no visible effect on
    > performance.
    >
    > As for find_first_bit, it's quite likely that the same bug is present
    > when it's called by find_next_bit in the same conditions, but
    > find_first_bit doesn't even test for zero. AFAICT, it has just been
    > luckier, so I went ahead and added the same guard code to it. This
    > unfortunately adds a test to the fast path, but I don't see how to
    > avoid that without auditing all callers.
    >
    > I actually introduce means to guard against these cases in the public
    > wrappers, but the BUG_ONs are disabled by default. I've left a kernel
    > running with them enabled for a bit, and they never hit, which is a
    > good sign, but I haven't tested it thoroughly or anything. We could
    > probably do away with these new tests by modifying the find_next*bit
    > functions so as to not call the find_first*bit functions if they've
    > already exhausted the range implied by the size argument. I'm not
    > sure whether that's worth doing, though, so I didn't.
    >
    > While staring at the code and trying to figure out what the problem
    > was, I removed some needless casts from find_next_zero_bit, by
    > constifying the automatic pointer properly, and also moved the actual
    > code from find_first_zero_bit to a separate internal function, such
    > that we could add the bug-check to the public interface only.
    >
    > I also noticed find_first_zero_bit was less efficient than
    > find_first_bit in that the former saved and restored rbx, because GCC
    > chose that to hold (addr) within the asm statement, instead of using
    > the readily-available and caller-saved rsi. I've thus changed the
    > code to prefer rsi, although in a perfect world the compiler would be
    > able to figure that out by itself.
    >
    > The compiler could do a bit better in find_first_zero_bit: if the
    > initial size turns out to be zero, it could return, like it does in
    > find_first_bit, but instead of sets rdx to zero and jumps to the end
    > of the function where rdx is copied to rax before the return
    > statement. This is a negative effect of the assignment of variable
    > res to rdx instead of rax, which gets the register allocator to map
    > the pseudo register representing the return value to rdx, requiring a
    > copy at the end and preventing (as far as the dumb compiler can see
    > :-) the direct use of a return in the zero-size case. I've verified
    > that this is not caused by the additional inline function that I
    > introduced.
    >
    > I tried to change the use of registers so as to enable the better code
    > for this path, but I couldn't come up with anything that was as
    > efficient, so I figured I wouldn't try to optimize the exceptional
    > path in expense of the common fast path and left it alone. If anyone
    > can come up with something better, please go ahead.
    >
    >
    > Anyhow, with this patch I could run 2.6.14, as in the Fedora
    > development tree, except for the change to optimize for size.
    >
    > Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
    >
    > --- arch/x86_64/lib/bitops.c~ 2005-10-27 22:02:08.000000000 -0200
    > +++ arch/x86_64/lib/bitops.c 2005-10-29 18:24:27.000000000 -0200
    > @@ -1,5 +1,11 @@
    > #include <linux/bitops.h>
    >
    > +#define BITOPS_CHECK_UNDERFLOW_RANGE 0
    > +
    > +#if BITOPS_CHECK_UNDERFLOW_RANGE
    > +# include <linux/kernel.h>
    > +#endif
    > +
    > #undef find_first_zero_bit
    > #undef find_next_zero_bit
    > #undef find_first_bit
    > @@ -13,11 +19,21 @@
    > * Returns the bit-number of the first zero bit, not the number of the byte
    > * containing a bit.
    > */
    > -inline long find_first_zero_bit(const unsigned long * addr, unsigned long size)
    > +static inline long
    > +__find_first_zero_bit(const unsigned long * addr, unsigned long size)
    > {
    > long d0, d1, d2;
    > long res;
    >
    > + /* We must test the size in words, not in bits, because
    > + otherwise incoming sizes in the range -63..-1 will not run
    > + any scasq instructions, and then the flags used by the je
    > + instruction will have whatever random value was in place
    > + before. Nobody should call us like that, but
    > + find_next_zero_bit() does when offset and size are at the
    > + same word and it fails to find a zero itself. */
    > + size += 63;
    > + size >>= 6;
    > if (!size)
    > return 0;
    > asm volatile(
    > @@ -30,11 +46,22 @@
    > " shlq $3,%%rdi\n"
    > " addq %%rdi,%%rdx"
    > :"=d" (res), "=&c" (d0), "=&D" (d1), "=&a" (d2)
    > - :"0" (0ULL), "1" ((size + 63) >> 6), "2" (addr), "3" (-1ULL),
    > - [addr] "r" (addr) : "memory");
    > + :"0" (0ULL), "1" (size), "2" (addr), "3" (-1ULL),
    > + /* Any register here would do, but GCC tends to
    > + prefer rbx over rsi, even though rsi is readily
    > + available and doesn't have to be saved. */
    > + [addr] "S" (addr) : "memory");
    > return res;
    > }
    >
    > +long find_first_zero_bit(const unsigned long * addr, unsigned long size)
    > +{
    > +#if BITOPS_CHECK_UNDERFLOW_RANGE
    > + BUG_ON (size + 63 < size);
    > +#endif
    > + return __find_first_zero_bit (addr, size);
    > +}
    > +
    > /**
    > * find_next_zero_bit - find the first zero bit in a memory region
    > * @addr: The address to base the search on
    > @@ -43,7 +70,7 @@
    > */
    > long find_next_zero_bit (const unsigned long * addr, long size, long offset)
    > {
    > - unsigned long * p = ((unsigned long *) addr) + (offset >> 6);
    > + const unsigned long * p = addr + (offset >> 6);
    > unsigned long set = 0;
    > unsigned long res, bit = offset&63;
    >
    > @@ -63,8 +90,8 @@
    > /*
    > * No zero yet, search remaining full words for a zero
    > */
    > - res = find_first_zero_bit ((const unsigned long *)p,
    > - size - 64 * (p - (unsigned long *) addr));
    > + res = __find_first_zero_bit (p, size - 64 * (p - addr));
    > +
    > return (offset + set + res);
    > }
    >
    > @@ -74,6 +101,17 @@
    > long d0, d1;
    > long res;
    >
    > + /* We must test the size in words, not in bits, because
    > + otherwise incoming sizes in the range -63..-1 will not run
    > + any scasq instructions, and then the flags used by the jz
    > + instruction will have whatever random value was in place
    > + before. Nobody should call us like that, but
    > + find_next_bit() does when offset and size are at the same
    > + word and it fails to find a one itself. */
    > + size += 63;
    > + size >>= 6;
    > + if (!size)
    > + return 0;
    > asm volatile(
    > " repe; scasq\n"
    > " jz 1f\n"
    > @@ -83,8 +121,7 @@
    > " shlq $3,%%rdi\n"
    > " addq %%rdi,%%rax"
    > :"=a" (res), "=&c" (d0), "=&D" (d1)
    > - :"0" (0ULL),
    > - "1" ((size + 63) >> 6), "2" (addr),
    > + :"0" (0ULL), "1" (size), "2" (addr),
    > [addr] "r" (addr) : "memory");
    > return res;
    > }
    > @@ -99,6 +136,9 @@
    > */
    > long find_first_bit(const unsigned long * addr, unsigned long size)
    > {
    > +#if BITOPS_CHECK_UNDERFLOW_RANGE
    > + BUG_ON (size + 63 < size);
    > +#endif
    > return __find_first_bit(addr,size);
    > }

    Thanks. I'll test this early next week.
    I always use -Os and (on x86_64) I always get a SOFT LOCKUP
    during boot or when loading X. For me, it's always in ext3fs,
    doing some flavor of bitops, so I have high hopes for this fixing
    that problem.

    ---
    ~Randy
    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at  http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at  http://www.tux.org/lkml/
    

  • Next message: Steven Rostedt: "Re: [ketchup] patch to allow for moving of .gitignore in 2.6.14"

    Relevant Pages

    • Re: Kernel ARCH i586 i686 - need clarification
      ... The more interesting answer will be found in the source for the kernel ... The compiler flags are interesting. ... T o m M i t c h e l l spam unwanted email. ...
      (Fedora)
    • Lava Octopus 550 kernel crash
      ... Jul 14 22:40:34 haruhi kernel: IRQ handler type mismatch for IRQ 16 ... 00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory Controller Hub ... Flags: bus master, fast devsel, latency 0 ...
      (Fedora)
    • amd64 bitops fix for -Os
      ... filesystem, the kernel mostly hangs. ... When optimizing for speed, the generated code is such that the flags ... Obviously the asm statement must not rely on the compiler setting up ... -inline long find_first_zero_bit(const unsigned long * addr, ...
      (Linux-Kernel)
    • Re: [patch 0/6] AMD C1E aware idle support
      ... Are you referring to one of the HPET related bugzillas at Kernel Bug ... Flags: bus master, 66MHz, medium devsel, latency 64 ...
      (Linux-Kernel)
    • Newbie Q: spca5xx webcam driver installation problem
      ... with the included default makefile. ... Building SPCA5XX driver for 2.5/2.6 kernel. ... # Setup the tools ... # Setup compiler warnings ...
      (comp.os.linux.setup)