Re: "Enhanced" MD code available for review

From: Justin T. Gibbs (gibbs_at_scsiguy.com)
Date: 03/19/04

    Date:	Fri, 19 Mar 2004 13:19:13 -0700
    To: linux-raid@vger.kernel.org
    
    

    [ CC trimmed since all those on the CC line appear to be on the lists ... ]

    Let's take a step back and focus on a few of the points on which we
    can hopefully all agree:

    o Any successful solution will have to have "meta-data modules" for
      active arrays "core resident" in order to be robust. This
      requirement stems from the need to avoid deadlock during error
      recovery scenarios that must block "normal I/O" to the array while
      meta-data operations take place.

    o It is desirable for arrays to auto-assemble based on recorded
      meta-data. This includes the ability to have a user hot-insert
      a "cold spare", have the system recognize it as a spare (based
      on the meta-data resident on it) and activate it if necessary to
      restore a degraded array.

    o Child devices of an array should only be accessible through the
      array while the array is in a configured state (bd_claim'ed).
      This avoids situations where a user can subvert the integrity of
      the array by performing "rogue I/O" to an array member.
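
    To make the second and third points concrete, here is a minimal toy
    model, not kernel code. The names Member, Array, hot_insert and
    open_for_raw_io are hypothetical and exist only for this
    illustration; the "holder" field stands in for the effect of
    bd_claim. It shows a hot-inserted spare being recognized from its
    meta-data and activated for a degraded array, and raw access to a
    claimed member being refused.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    name: str
    array_uuid: str        # hypothetical field read from on-disk meta-data
    role: str              # "active" or "spare"
    holder: object = None  # who has claimed the device, if anyone


@dataclass
class Array:
    uuid: str
    width: int             # number of active members when healthy
    active: list = field(default_factory=list)
    spares: list = field(default_factory=list)

    def claim(self, dev):
        # Stand-in for an exclusive claim: one holder per device.
        if dev.holder is not None and dev.holder is not self:
            raise PermissionError(f"{dev.name} is claimed by another holder")
        dev.holder = self

    def degraded(self):
        return len(self.active) < self.width

    def hot_insert(self, dev):
        # Recognize the device purely from the meta-data it carries.
        if dev.array_uuid != self.uuid:
            raise ValueError("meta-data does not match this array")
        self.claim(dev)
        if dev.role == "spare" and self.degraded():
            dev.role = "active"   # a real array would start a rebuild here
            self.active.append(dev)
        elif dev.role == "spare":
            self.spares.append(dev)
        else:
            self.active.append(dev)


def open_for_raw_io(dev, requester):
    # "Rogue I/O" to a claimed member is refused.
    if dev.holder is not None and dev.holder is not requester:
        raise PermissionError(f"{dev.name} is busy")
    return dev
```

    In this sketch, inserting a spare into a two-wide array that has
    only one active member promotes the spare immediately, and any
    later open_for_raw_io() call from a requester other than the array
    raises PermissionError.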

    Concentrating on just these three, we come to the conclusion that
    whether the solution comes via "early user fs" or kernel modules,
    the resident size of the solution *will* include the cost for
    meta-data support. In either case, the user is able to tailor their
    system to include only the support necessary for their individual
    system to operate.

    If we want to argue the merits of either approach based on just the
    sheer size of resident code, I have little doubt that the kernel
    module approach will prove smaller:

     o No need for "mdadm" or some other daemon to be locked resident in
       memory. This alone saves you having a locked copy of klibc or
       any other user libraries core resident. The kernel modules
       leverage kernel APIs that already have to be core resident to
       satisfy the needs of other parts of the kernel, which also
       helps reduce their size.

     o Initial RAM disk data can be discarded after modules are loaded at
       boot time.

    Putting the size argument aside for a moment, let's explore how a
    userland solution could satisfy just the above three requirements.

    How is meta-data updated on child members of an array while that
    array is on-line? Remember that these operations occur with some
    frequency. MD includes "safe-mode" support where redundant arrays
    are marked clean any time writes cease for a predetermined, fairly
    short, amount of time. The userland app cannot access the component
    devices directly since they are bd_claim'ed. Even if that mechanism
    is somehow subverted, how do we guarantee that these meta-data
    writes do not cause a deadlock? In the case of a transition from
    Read-only to Write mode, all writes are blocked to the array (this
    must be the case for "Dirty" state to be accurate). It seems to
    me that you must then provide extra code to not only pre-allocate
    buffers for the userland app to do its work, but also provide a
    "back-door" interface for these operations to take place.

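    The safe-mode behaviour described above can be sketched as a small
    state machine. This is an illustrative toy model only, not the MD
    implementation: the class SafeModeArray, its fields, and the
    injectable clock are all hypothetical. It shows why a meta-data
    write sits in the path of the first write after an idle period, and
    why writes must be held until that update completes.

```python
import time


class SafeModeArray:
    """Toy model of a redundant array with "safe-mode" style marking."""

    def __init__(self, idle_timeout=0.2, clock=time.monotonic):
        self.idle_timeout = idle_timeout
        self.clock = clock
        self.state = "clean"
        self.last_write = None
        self.metadata_writes = 0  # number of meta-data updates issued
        self.blocked = []         # writes held while meta-data is in flight

    def _update_metadata(self, new_state):
        # Stand-in for rewriting the meta-data on every child device.
        self.metadata_writes += 1
        self.state = new_state

    def write(self, data):
        if self.state == "clean":
            # The array must be marked dirty before any data reaches the
            # members, so the write is held until the update is done.
            self.blocked.append(data)
            self._update_metadata("dirty")
            released, self.blocked = self.blocked, []
        else:
            released = [data]
        self.last_write = self.clock()
        return released

    def tick(self):
        # Called periodically: mark the array clean once writes have
        # ceased for the configured idle period.
        if (self.state == "dirty"
                and self.clock() - self.last_write >= self.idle_timeout):
            self._update_metadata("clean")
```

    Every clean-to-dirty and dirty-to-clean transition in this model
    costs one meta-data update, which is the traffic a userland daemon
    would have to issue against devices it cannot open directly.
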
    The argument has also been made that shifting some of this code out
    to a userland app "simplifies" the solution and perhaps even makes
    it easier to develop. Comparing the two approaches we have:

    UserFS:
          o Kernel Driver + "enhanced interface to userland daemon"
          o Userland Daemon (core resident)
          o Userland Meta-Data modules
          o Userland Management tool
             - This tool needs to interface to the daemon and
               perhaps also the kernel driver.

    Kernel:
          o Kernel RAID Transform Drivers
          o Kernel Meta-Data modules
          o Simple Userland Management
            tool with no meta-data knowledge

    So two questions arise from this analysis:

    1) Are meta-data modules easier to code up or more robust as user
       or kernel modules? I believe that doing these outside the kernel
       will make them larger and more complex while also losing the
       ability to have meta-data modules weigh in on rapidly occurring
       events without incurring performance tradeoffs. Regardless of
       where they reside, these modules must be robust. A kernel Oops
       or a segfault in the daemon is unacceptable to the end user.
       Saying that a segfault is less harmful in some way than an Oops
       when we're talking about the user's data completely misses the
       point of why people use RAID.

    2) What added complexity is incurred by supporting both a core
       resident daemon as well as management interfaces to the daemon
       and potentially the kernel module? I have not fully thought
       through the corner cases such an approach would expose, so I
       cannot quantify this cost. There are certainly more components
       to get right and keep synchronized.

    In the end, I find it hard to justify inventing all of the userland
    machinery necessary to make this work just to avoid roughly 2K
    lines of code per meta-data module from being part of the kernel.
    The ASR module for example, which is only required by those that
    need support for this meta-data type, is only 19K with all of its
    debugging printks and code enabled, unstripped. Are there benefits
    to the userland approach that I'm missing?

    --
    Justin
    
