Re: [PATCH 1b/7] dlm: core locking

From: Daniel Phillips (phillips_at_istop.com)
Date: 05/05/05

    To: "Stephen C. Tweedie" <sct@redhat.com>
    Date:	Thu, 5 May 2005 15:29:31 -0400
    
    

    Hi Stephen,

    On Thursday 05 May 2005 08:25, Stephen C. Tweedie wrote:
    > Hi,
    >
    > On Sat, 2005-04-30 at 10:09, Daniel Phillips wrote:
    > > As you know, this is how I currently determine ownership of such
    > > resources as cluster snapshot metadata and ddraid dirty log. I find the
    > > approach distinctly unsatisfactory. The (g)dlm is rather verbose to use,
    > > particularly taking into the account the need to have two different state
    > > machine paths, depending on whether a lock happens to master locally or
    > > not, and the need to coordinate a number of loosely coupled elements:
    > > lock status blocks, asts, the calls themselves. The result is quite a
    > > _long_ and opaque program to do a very simple thing.
    >
    > Why on earth do you need to care where a lock is mastered?

    That is just my point. I wish I did not have to care. But gdlm behaves
    differently - returns status in different ways - depending on whether a lock
    is mastered locally or not.

    > Use of ASTs
    > etc. should be optional, too --- you can just use blocking variants of
    > the lock primitives if you want. There's a status block, sure, but you
    > can call the lock grant function synchronously and the status block is
    > guaranteed unambiguously to be filled on return.

    Writing non-trivial code that is supposed to perform well under parallel loads
    is practically impossible with the blocking variants. As far as I can see,
    for non-trivial applications, the blocking calls are just "training wheels"
    for the real api.

    > So the easy way to use the DLM for metadata ownership is simply to have
    > a thread which tries, synchronously, to acquire an EX lock on a
    > resource. You get it or you stay waiting; when you get it, you own the
    > metadata. Pretty simple. (The only real complication in this
    > particular case is how to deal with the resource going away while you
    > wait, eg. unmount.)

    The complication arises from the fact that you then need to advise the rest of
    the cluster that you own the metadata. How? LVB, obviously. But then you
    run smack into the whole culture of LVB semantics and oddball limitations.
    For example, what happens when the owner of an LVB dies? What is the value of
    the LVB then? Can we prove that our metadata ownership scheme is still
    raceless?

    Honestly, using the dlm is a perverse way to establish metadata ownership when
    we have a _far_ more direct way to do it. As a fringe benefit, we lose the
    dlm dependency from a whole batch of cluster components, for example, all the
    cluster block devices. This is clearly the Right Thing to Do[tm].

    > > And indeed, instinct turns out to be correct: there is a far simpler way
    > > to handle this: let the oldest member of the cluster decide who owns the
    > > metadata resources.
    >
    > Deciding who owns it is one thing. You still need the smarts to work
    > out if recovery is *already* in progress somewhere, and to coordinate
    > wakeup of whoever you've granted the new metadata ownership to, etc.

    That is part of the service group recovery protocol ((re)start recovery/halt
    recovery/recovery success).
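
    A minimal sketch of the oldest-member rule, as a toy in-memory model and
    not the cman interface: the class `OwnershipTable` and its method names
    are hypothetical, and membership events are assumed to arrive as plain
    calls in a stable join order. It shows only the decision itself; the
    (re)start/halt/success recovery steps are left out.

```python
class OwnershipTable:
    """Toy model: the oldest live member decides who owns each resource.

    All names here are illustrative assumptions, not a real cluster API.
    """

    def __init__(self):
        self.join_order = []  # node names, oldest first
        self.owners = {}      # resource name -> owning node

    def node_joined(self, node):
        # Membership event: a new node is always the youngest.
        self.join_order.append(node)

    def node_left(self, node):
        # Membership event: forget the node and any ownership it held,
        # so that the resource can be granted again.
        self.join_order.remove(node)
        self.owners = {r: n for r, n in self.owners.items() if n != node}

    def decider(self):
        # The oldest surviving member makes ownership decisions.
        return self.join_order[0]

    def request(self, resource, requester):
        # Decision the oldest member would make: keep the current owner
        # if there is one, otherwise grant ownership to the requester.
        if requester not in self.join_order:
            raise ValueError("requester is not a cluster member")
        return self.owners.setdefault(resource, requester)
```

    Every node that asks gets the same answer, and a departed owner is
    replaced by whoever asks next. In this toy the table survives the loss
    of the decider for free; a real implementation would have to rebuild it
    on the next-oldest member.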

    > Using a lock effectively gets you much of that for free, once you've
    > done the work to acquire the EX lock in the first place.

    I'm afraid that "free" is an illusion here. The metadata ownership code is
    very, very ugly using the dlm approach, but is very nice using the cman event
    interface, and will become even nicer after we fix cman a little.

    The reason for this is, cman membership events fit the problem better. The
    dlm api just isn't suited to this. It can be made to fit, but the result is
    predictably ugly. We should not lose sight of the fact that the dlm is
    actually just implemented on top of a set of cluster synchronization
    messages, as is cman. In fact, cman provides the messaging layer that dlm
    uses (or it does in the code that I am running here, I have not seen the
    rumoured new version of cman yet). So the reason for not using cman directly
    is, what exactly?

    Or putting it another way, what value does the dlm add to the metadata
    ownership code? It sure does not simplify it. And in my opinion, it does
    not make it more obviously correct either, quite the contrary.

    > > Good instinct. In fact, as I've said before, you don't necessarily need
    > > a dlm in a cluster application at all. What you need is _global
    > > synchronization_, however that is accomplished. For example, I have
    > > found it simpler and more efficient to use network messaging for the
    > > cluster applications I've tackled so far.
    >
    > Yes, there is definitely room for both. In particular, the more your
    > application looks like client/server, the less appropriate a DLM is.

    True, very true. And metadata ownership problems tend to look just like
    client/server problems, even if we break the metadata up into multiple parts
    and distribute it around the cluster.

    Anyway, it turns out that a significant number of the cluster components have
    ended up looking like client/server architectures. For the time being, the
    single - and major - exception is the distributed filesystem itself. Oh, and
    gdlm and cman, which are self-fulfilling prophecies.

    So we have, from the bottom up:

      - cman: distributed
      - block device export: client/server
      - cluster raid: client/server
      - cluster snapshot: client/server
      - gdlm: distributed
      - gfs: distributed
      - applications: need more data

    We can be fairly sure that the current crop of gfs applications is _not_ using
    the dlm, for the simple reason that the dlm has not existed long enough. How
    do existing applications synchronize then? The truth is, I do not know,
    because I have not conducted any survey. However, the one major application
    I have looked at porting to gfs already has its synchronization working fine,
    using point-to-point socket connections. The only major bit it needs to
    become a distributed cluster application is a distributed filesystem. In the
    end, there will be no dlm anywhere to be seen at the application level. The
    application is, in my humble opinion, better because of this.

    > > This suggests to me that the dlm is going to end up pretty much as
    > > a service needed only by a cfs, and not much else.
    >
    > But once you've got a CFS, it suddenly becomes possible to do so much
    > more in user-space in a properly distributed fashion, rather than via
    > client/server. Cluster-wide /etc/passwd? Just lock for read, access
    > the file, unlock. Things like shared batch/print queues become easier.
    > And using messaging is often completely the wrong model for such things,
    > simply because there's no server to send the message to. A DLM will
    > often be a far better fit for such applications.

    I think I can do a much better, cleaner job of distributed passwords by
    directly using cman service group events. Do you want to see some code for
    that?
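
    For reference, the "lock for read, access the file, unlock" pattern
    quoted above might look like the sketch below. It uses `flock` as a
    stand-in for a read lock; whether that lock is actually coherent across
    the cluster depends on the filesystem, which is an assumption here. The
    helper name `read_locked` is hypothetical.

```python
import fcntl


def read_locked(path):
    """Read a whole file while holding a shared (read) lock on it."""
    with open(path, "r") as f:
        # Blocks while another process holds an exclusive lock on the file.
        fcntl.flock(f, fcntl.LOCK_SH)
        try:
            return f.read()
        finally:
            # Release explicitly; closing the file would also drop the lock.
            fcntl.flock(f, fcntl.LOCK_UN)
```

    A writer would take `fcntl.LOCK_EX` around its update in the same way.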

    > > The corollary of that is,
    > > we should concentrate on making the dlm work well for the cfs, and not
    > > get too wrapped up in trying to make it solve every global
    > > synchronization problem in the world.
    >
    > Trouble is, I think you're mixing problems here. There are two
    > different problems: whether the DLM locking model is a good primitive to
    > use for a given case; and whether the specific DLM API in question is a
    > good fit for the model itself.

    Yes.

    > And your initial complaints about needing to know local vs. remote
    > master, dealing with ASTs etc. are really complaints about the API, not
    > the model.

    Those complaints were about the api. Other complaints are about the model,
    and I have more complaints about the model than the ones I have already
    mentioned. For example: the amount of data you can pass around together with
    a lock grant is pathetically limited. (Other complaints can wait for the
    appropriate thread to pop up.)

    > Using blocking, interruptible APIs gets rid of the AST issue
    > entirely for applications that don't need that level of complexity. And
    > you obviously want to have an API variant that doesn't care where the
    > lock gets mastered --- for one thing, a remotely mastered lock can turn
    > into a locally mastered one after a cluster membership transition.

    We should just lose the variant that cares. There is no efficiency argument
    for including that brokenness in the api.

    > So let's keep the two separate. Sure, there will be cases where a DLM
    > model is more or less appropriate; but given that there are cases where
    > the model does work, what are the particular unnecessary complications
    > that the current API forces on us? Remove those and you've made the DLM
    > model a lot more attractive to use for non-CFS applications.

    Yes. I would be perfectly happy to put aside the "alternatives to dlm" thread
    and concentrate purely on fixing the dlm api. Please do not misinterpret my
    position: we do need a dlm in the cluster stack. Now please let us ensure
    that _our_ dlm is a really, really nice dlm with a really, really nice api.

    Regards,

    Daniel
