Re: 2.6.11-rc1-mm1

From: Karim Yaghmour (karim_at_opersys.com)
Date: 01/15/05

  • Next message: Karim Yaghmour: "Re: 2.6.11-rc1-mm1"
    Date:	Fri, 14 Jan 2005 23:18:52 -0500
    To: Roman Zippel <zippel@linux-m68k.org>
    
    

    Hello Roman,

    Roman Zippel wrote:
    > This doesn't mean everything has to be put into a single call. Several
    > parameters can still be set after creation.

    I don't have a problem with that. If that's preferable, then we can do
    it this way too.

    > Why should a subsystem care about the details of the buffer management?

    Because it wants to enforce a data format on buffer boundaries.

    Let me explain how this applies in the case of LTT, but this easily
    generalizes itself to any sort of subsystem that needs to transfer
    large amounts of information between the kernel and user-space. And
    to avoid any confusion, let me repeat that relayfs is not intended
    just for conveying debug/performance/trace info.

    Basically, in the case of LTT at least, the kernel tracing infrastructure
    must provide a stream of data to the user-space tools that they will in
    turn process and display to the user. At this point it must be said that
    what you write and you how write it in the trace depends largely on a
    few key issues. Namely:
    - How much data you expect to be generating.
    - What you intend to do with it.

    Given ltt's target audience (mainstream developers, sysadmins, and power-
    users), one of the goals was to have a trace format that provided
    easy browsing forward and backwards, and random access. Initially,
    this was implemented using two 1MB buffers, one that was being written to
    while the other one was being written to disk. So, in essence, we had
    random access at 1MB boundaries. For reading backwards, the size of the
    event is written at the end of the event and we just need to read
    2 bytes prior to the current event to know where the previous event
    started.

    Eventually we found that this format was rather bulky, and that it
    recorded superfluous data. Amongst other things we relied on a single
    buffer, so with each event we logged the CPU-ID of the processor on
    which the event occured. So, in order to reduce the amount of data
    recorded and in trying to obtain better performance at runtime by
    avoiding a call to do_gettimeofday for every event, we did the
    following:
    - Eliminate the CPU-ID => use per-cpu buffers instead.
    - Stop calling do_gettimeofday when possible => instead write a
      complete time-stamp at sub-buffer boundaries (begining and end;
      because of clock drift) and only read the lower-half of the TSC
      for each event. Determining an event's actual time is done in
      post-mortem in user-space.

    So how does this translate in practice? Here's the trace header. This
    is written only once at the start of the trace:
    /* Information logged when a trace is started */
    typedef struct _ltt_trace_start {
            u32 magic_number;
            u32 arch_type;
            u32 arch_variant;
            u32 system_type;
            u8 major_version;
            u8 minor_version;

            u32 buffer_size;
            ltt_event_mask event_mask;
            ltt_event_mask details_mask;
            u8 log_cpuid;
            u8 use_tsc;
            u8 flight_recorder;
    } LTT_PACKED_STRUCT ltt_trace_start;

    This is written in the begining of every new sub-buffer:
    /* Start of trace buffer information */
    typedef struct _ltt_buffer_start {
            struct timeval time; /* Time stamp of this buffer */
            u32 tsc; /* TSC of this buffer, if applicable */
            u32 id; /* Unique buffer ID */
    } LTT_PACKED_STRUCT ltt_buffer_start;

    This is written at the end of every sub-buffer:
    typedef struct _ltt_buffer_end {
            struct timeval time; /* Time stamp of this buffer */
            u32 tsc; /* TSC of this buffer, if applicable */
    } LTT_PACKED_STRUCT ltt_buffer_end;

    As you can see, we can't just dump this information in an event channel.
    This is really intrinsic to how the trace data is going to be read
    later on. Removing this data would require more data for each event to
    be logged, and require parsing through the trace before reading it in
    order to obtain markers allowing random access. This wouldn't be so
    bad if we were expecting users to use LTT sporadically for very short
    periods of time. However, given ltt's target audience (i.e. need to
    run traces for hours, maybe days, weeks), traces would rapidely become
    useless because while plowing through a few hundred KBs of data and
    allocating RAM for building internal structures as you go is fine,
    plowing through tens of GBs of data, possibly hundreds, requires that
    you come up with a format that won't require unreasonable resources
    from your system, while incuring negligeable runtime costs for generating
    it. We believe the format we currently have achieves the right balance
    here.

    So what happens now is that ltt tells relayfs when creating a channel
    how much space it needs for these basic structures, and provides it
    with callbacks which are invoked at boundaries for filling the actual
    reserved space. In all other circumstances, here's what we are writing
    into the relayfs buffer for each event:
    - Event ID (1 byte)
    - Time delta (4 bytes) => this the low 32-bits from the TSC or a
      diff between the current do_gettimeofday and the one at buffer start.
    - Event details (variable length, see include/linux/ltt-events.h)
    - Event size (2 bytes)

    Of course there are possible improvements. For one thing, we've
    discussed dropping the "event size" altogether and rely on smaller
    buffers and dynamically create sub-buffer indexing tables for reading
    backwards. This is still part of a work in progress which aims at
    creating an even better and more flexible format. Of course in an
    ideal world this new format and the corresponding user tools would
    be available as we speak, but there's only so much that can be done
    without having an existing solid base to work off on. As usual,
    we're open to any other outside suggestions.

    > You could move all this into the relay layer by making a relay channel
    > an event channel. I know you want to save space, but having a magic
    > event_struct_size array is not a good idea. If you have that much events,
    > that a little more overhead causes problems, the tracing results won't be
    > reliable anymore anyway.

    I hope what I said above explains why this isn't possible.

    > Simplicity and maintainability are far more important than saving a few
    > bytes, the general case should be fast and simple, leave the complexity to
    > the special cases.

    I agree. I also realize that not all relayfs clients will have the
    same requirements as ltt. Already, ltt uses a few things from relayfs
    that others are unlikely to need. For example, it directly invokes
    relay_lock_channel() to directly lock a channel and relay_write_direct()
    to directly write to the buffers without relying on the usual
    relay_write() which takes care of both. This allows LTT to do
    zero-copy (i.e. no need to pack a buffer before comiting it.) Other
    subsystems may actually not use any relayfs function to write, but
    instead write directly to a channel as if it was an allocated buffer
    (which in fact it is). In all cases, though, the open(), mmap(),
    write() semantic makes it very simple for user-space applications
    to process channeled data.

    So here's a suggested change. Instead of the current relay_open()
    API, here are three replacement functions (inspired by Tim's input
    and your comments above):
    relay_open(channel_path, mode, bufsize, nbufs);
    relay_set_property(property, value);
    relay_get_property(property, &value);

    Is this more palatable?

    Karim

    -- 
    Author, Speaker, Developer, Consultant
    Pushing Embedded and Real-Time Linux Systems Beyond the Limits
    http://www.opersys.com || karim@opersys.com || 1-866-677-4546
    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at  http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at  http://www.tux.org/lkml/
    

  • Next message: Karim Yaghmour: "Re: 2.6.11-rc1-mm1"

    Relevant Pages

    • [NEWS] How to Remotely and Automatically Exploit a Format Bug
      ... Exploiting a format bug remotely is not as difficult as one would think. ... We will use very minimalist server along this paper. ... Since the buffer is directly available to a malicious user, ... Guessing the address of the shellcode in the stack ...
      (Securiteam)
    • Re: quick overview of the perfmon2 interface
      ... The initial perfmon interface design as implemented in the 2.4 kernel series ... at their sampling buffer, they were aggregating samples instead of storing ... Our default sampling format stores samples sequentially, ...
      (Linux-Kernel)
    • [patch] x86, ptrace: support for branch trace store(BTS)
      ... Support for Intel's last branch recording to ptrace. ... access to this hardware feature and allows them to show an execution trace ... trace of the running application without instrumentation. ... cyclic buffer given to it by the OS. ...
      (Linux-Kernel)
    • [Patch 1/2] Trace code and documentation (updated)
      ... I updated the example code to eliminate the use of the temp print buffer ... +This small sample module creates a trace channel. ... +The trace interface provides a single layer in a complete tracing ...
      (Linux-Kernel)
    • RE: [patch] x86, ptrace: in-kernel BTS interface
      ... to having a well-specified internal interface to build from. ... Not all processors support all variants. ... Just have two bits: trace kernel ... no longer clear the BTS buffer. ...
      (Linux-Kernel)