TOE brain dump

From: Werner Almesberger (werner_at_almesberger.net)
Date: 08/02/03

    Date:	Sat, 2 Aug 2003 14:04:44 -0300
    To: netdev@oss.sgi.com, linux-kernel@vger.kernel.org
    
    

    At OLS, there was a bit of discussion on (true and false *) TOEs
    (TCP Offload Engines). In the course of this discussion, I've
    suggested what might be a novel approach, so in case this is a
    good idea, I'd like to dump my thoughts on it, before someone
    tries to patent my ideas. (Most likely, some of this has already
    been done or tried elsewhere, but it can't hurt to try to err on
    the safe side.)

    (*) The InfiniBand people unfortunately also call their TCP/IP
        bypass "TOE" (for which they promptly get shouted down
        every time they use that word). This is misleading, because
        no TCP is being offloaded; TCP is simply never done. I would
        consider it more accurate to view this as a separate
        networking technology, with semantics different from TCP/IP,
        similar to ATM and AAL5.

    While I'm not entirely convinced about the usefulness of TOE in
    all the cases it's been suggested for, I can see value in certain
    areas, e.g. when TCP per-packet overhead becomes an issue.

    However, I consider the approach of putting a new or heavily
    modified stack, which duplicates a considerable amount of the
    functionality in the main kernel, on a separate piece of hardware
    questionable at best. Some of the issues:

     - if this stack is closed source or generally hard to modify,
       security fixes will be slowed down

     - if this stack is closed source or generally hard to modify,
       TOE will not be available to projects modifying the stack,
       e.g. any of the research projects trying to make TCP work at
       gigabit speeds

     - this stack either needs to implement all administrative
       interfaces of the regular kernel, or such a system would have
       non-uniform configuration/monitoring across interfaces

     - in some cases, administrative interfaces will require a
       NIC/TOE-specific switch in the kernel (netlink helps here)

     - route changes on multi-homed hosts (or any similar kind of
       failover) are difficult if the state of TCP connections is
       tied to specific NICs (I've discussed some issues when
       "migrating" TCP connections in the documentation of tcpcp,
       http://www.almesberger.net/tcpcp/)

     - new kernel features will always lag behind on this kind of
       TOE, and different kernels will require different "firmware"

     - last but not least, keeping TOE firmware up to date with the
       TCP/IP stack in the mainstream kernel will require - for each
       such TOE device - a significant and continuous effort over a
       long period of time

    In short, I think such a solution is either a pain to use, or
    unmaintainable, or - most likely - both.

    So, how to do better? Easy: use the Source, Luke. Here's my
    idea:

     - instead of putting a different stack on the TOE, a
       general-purpose processor (probably with some enhancements,
       and certainly with optimized data paths) is added to the NIC

     - that processor runs the same Linux kernel image as the host,
       acting like a NUMA system

     - a selectable part of TCP/IP is handled on the NIC, and the
       rest of the system runs on the host processor

     - instrumentation is added to the mainstream kernel to ensure
       that as little data as possible is shared between the main
       CPU and such peripheral CPUs. Note that such instrumentation
       would be generic, outlining possible boundaries, and not tied
       to a specific TOE design.

     - depending on hardware details (cache coherence, etc.), the
       instrumentation mentioned above may even be necessary for
       correctness. This would have the unfortunate effect of making
       the design very fragile with respect to changes in the
       mainstream kernel. (Performance loss in the case of imperfect
       instrumentation would be preferable.)

     - further instrumentation may be needed to let the kernel switch
       CPUs (i.e. host to NIC, and vice versa) at the right time

     - since the NIC would probably use a CPU design different from
       the host CPU, we'd need "fat" kernel binaries:

       - data structures are the same, i.e. word sizes, byte order,
         bit numbering, etc. are compatible, and alignments are
         chosen such that all CPUs involved are reasonably happy

       - kernels live in the same address space

       - function pointers become arrays, with one pointer per
         architecture. When comparing pointers, the first element is
         used.

     - if one should choose to also run parts of user space on the
       NIC, fat binaries would also be needed for this (along with
       other complications)
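
    The "fat" kernel binary points above can be sketched in C. This is
    only an illustration under stated assumptions, not something taken
    from an existing kernel: the names fat_arch, shared_conn, fat_rx_fn,
    fat_call and fat_equal are all hypothetical, and both "per
    architecture" functions are ordinary host code standing in for the
    same source compiled once per instruction set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical architecture slots; the proposal names no specific CPUs. */
enum fat_arch { FAT_ARCH_HOST = 0, FAT_ARCH_NIC = 1, FAT_ARCH_COUNT };

/* Shared data uses fixed-width fields so every CPU sees one layout. */
struct shared_conn {
    uint32_t snd_nxt;   /* next sequence number to send */
    uint32_t rcv_nxt;   /* next sequence number expected */
    uint64_t bytes_rx;  /* payload bytes received so far */
};
_Static_assert(offsetof(struct shared_conn, bytes_rx) == 8,
               "layout must be identical on all CPUs");
_Static_assert(sizeof(struct shared_conn) == 16, "no hidden padding");

typedef void (*rx_fn)(struct shared_conn *c, uint32_t seg_len);

/* A "fat" function pointer: one entry point per architecture. */
struct fat_rx_fn {
    rx_fn impl[FAT_ARCH_COUNT];
};

/* Counts calls per slot, only to make the dispatch observable here. */
static unsigned fat_calls[FAT_ARCH_COUNT];

/* In a real fat binary these would be one source function compiled
 * twice, once per instruction set; here both are ordinary host code. */
static void tcp_rx_host(struct shared_conn *c, uint32_t seg_len)
{
    fat_calls[FAT_ARCH_HOST]++;
    c->rcv_nxt += seg_len;
    c->bytes_rx += seg_len;
}

static void tcp_rx_nic(struct shared_conn *c, uint32_t seg_len)
{
    fat_calls[FAT_ARCH_NIC]++;
    c->rcv_nxt += seg_len;
    c->bytes_rx += seg_len;
}

/* Hypothetical instances used for illustration. */
static const struct fat_rx_fn tcp_rx = { { tcp_rx_host, tcp_rx_nic } };
static struct shared_conn demo_conn;

/* Each CPU calls through its own slot. */
static void fat_call(const struct fat_rx_fn *f, enum fat_arch arch,
                     struct shared_conn *c, uint32_t seg_len)
{
    assert(arch < FAT_ARCH_COUNT);
    f->impl[arch](c, seg_len);
}

/* Identity is decided by the first element only, as described above. */
static bool fat_equal(const struct fat_rx_fn *a, const struct fat_rx_fn *b)
{
    return a->impl[0] == b->impl[0];
}
```

    Code running on the NIC would call
    fat_call(&tcp_rx, FAT_ARCH_NIC, &demo_conn, seg_len), while the host
    would pass FAT_ARCH_HOST. Because fat_equal() looks only at slot 0,
    a pointer comparison gives the same answer on either CPU.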

    Benefits:

     - putting the CPU next to the NIC keeps data paths short, and
       allows for all kinds of optimizations (e.g. a pipelined
       memory architecture)

     - the design is fairly generic, and would apply equally to
       areas of the kernel other than TCP/IP

     - using the same kernel image eliminates most maintenance
       problems, and encourages experimenting with the stack

     - using the same kernel image (and compatible data structures)
       guarantees that administrative interfaces are uniform in the
       entire system

     - such a design would likely allow TCP state to be moved to a
       different NIC, if necessary

    Possible problems that may kill this idea:

     - it may be too hard to achieve correctness

     - it may be too hard to switch CPUs properly

     - it may not be possible to express copy operations efficiently
       in such a context

     - there may be no way to avoid sharing of hardware-specific
       data structures, such as page tables, or to emulate their use

     - people may consider the instrumentation required for this,
       although fairly generic, too intrusive

     - all this instrumentation may eat too much performance

     - nobody may be interested in building hardware for this

     - nobody may be patient enough to pursue such long-term
       development with an uncertain outcome

     - something I haven't thought of

    I lack the resources (hardware, financial, and otherwise) to
    actually do something with these ideas, so please feel free to
    put them to some use.

    - Werner

    -- 
      _________________________________________________________________________
     / Werner Almesberger, Buenos Aires, Argentina     werner@almesberger.net /
    /_http://www.almesberger.net/____________________________________________/
