Re: Random packets loss under x86_64 - routing?

From: linux-os (linux-os_at_analogic.com)
Date: 01/14/05

  • Next message: Michal Ludvig: "Re: [PATCH 1/2] CryptoAPI: prepare for processing multiple buffers at a time"
    Date:	Fri, 14 Jan 2005 11:37:47 -0500 (EST)
    To: Peter Kruse <pk@q-leap.com>
    
    

    On Fri, 14 Jan 2005, Peter Kruse wrote:

    > kernel: 2.4.28 smp x86_64
    >
    > Hello,
    >
    > We experience a problem in our amd64 beowulf clusters and could need
    > some help.
    > When ping'ing other machines in a cluster on the same
    > subnet, it fails for some machines. But only right after boot
    > and after a day or so of idle time. After some time (a few minutes) the
    > ping packets go through.
    >
    > Other things we observed:
    >
    > 1. it is not always the same machines that fail
    > 2. if it fails then no packets are sent or received (checked with
    > tcpdump on sending and target host) although all hosts are up.
    > 3. There is no difference if using a 64bit or 32bit ping
    > 4. It does not depend on the network adapter or other hardware, we have
    > machines with different NICs connected to different switches with the
    > same problem.
    > 5. It does however only happen on amd64 (biarch) systems and not on
    > pure i386 systems so it looks like related to the kernel.
    > 6. I have to reboot to reproduce the problem, it's not enough to
    > unload and load the network module.
    > 7. It only happens with ping, not with ssh.
    >
    > The ping always succeeds when running with the "-r" switch,
    > that bypasses "the normal routing tables and send directly to a host
    > on an attached interface". This makes us think that it indeed it is
    > related to routing - but how?
    >
    > I can provide an strace output if you think that could help.
    > What else can I do to gather more information?
    >
    > Please cc to me, as I'm not subscribed, thanks.
    >
    > Peter
    >

    When they 'disappear', use `arp -d hostname` to delete the
    entry from the ARP tables. Then see if you can ping it.
    It is possible that the destination machine got re-routed
    and the new router's HW address wasn't updated in the
    ARP tables. If this is the case, I don't know hot to 'fix'
    it, but it's a new data-point. When you have dynamic routing,
    there needs to be some way to update the ARP tables even though
    they eventually expire.

    The fact that `ping -r` works seems to show that the ARP table
    has stale entries in it.

    Cheers,
    *** Johnson
    Penguin : Linux version 2.6.10 on an i686 machine (5537.79 BogoMips).
      Notice : All mail here is now cached for review by Dictator Bush.
                      98.36% of all statistics are fiction.
    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/


  • Next message: Michal Ludvig: "Re: [PATCH 1/2] CryptoAPI: prepare for processing multiple buffers at a time"