Re: [PATCH 0/7] dlm: overview

From: Steven Dake (sdake_at_mvista.com)
Date: 04/27/05

  • Next message: Benjamin Herrenschmidt: "Re: pci-sysfs resource mmap broken"
    To: Mark Fasheh <mark.fasheh@oracle.com>
    Date:	Tue, 26 Apr 2005 15:34:42 -0700
    
    

    On Tue, 2005-04-26 at 11:48, Mark Fasheh wrote:
    > On Tue, Apr 26, 2005 at 01:39:30PM +0800, David Teigland wrote:
    > > On Mon, Apr 25, 2005 at 01:39:52PM -0700, Wim Coekaerts wrote:
    > > > > This is a distributed lock manager (dlm) that we'd like to see added to
    > > > > the kernel. The dlm programming api is very similar to that found on
    > > > > other operating systems, but this is modeled most closely after that in
    > > > > VMS.
    > > >
    > > > do you have any performance data at all on this ? I like to see a dlm
    > > > but I like to see something that will also perform well.
    > >
    > > No. What kind of performance measurements do you have in mind? Most dlm
    > > lock requests involve sending a message to a remote machine and waiting
    > > for a reply. I expect this network round-trip is the bulk of the time for
    > > a request, which is why I'm a bit confused by your question.
    > Resource lookup times, times to deliver events to clients (asts, basts,
    > etc) for starters. How long does recovery take after a node crash? How does
    > all of this scale as you increase the number of nodes in your cluster?
    > Sure, network speed is a part of the equation, but it's not the *whole*
    > equation and I've seen dlms that can get downright nasty when it comes to
    > recovery speeds, etc.
    >

    Reading these requests and information about how locking works and
    performs, I have a suggestion to improve performance dramatically.

    Use implicit acknowledgement, self delivery, and message packing. With
    these approaches I think it is possible to support acquisition of
    15,000-30,000 locks per second on modern 3ghz cpus with 100mbit
    networking.

    Most important of these is implicit acknoweldgement, whereby instead of
    sending a request and waiting for a response from one node, the request
    is sent to all processors. Then self delivery is used to deliver the
    lock request to the lock service which processes it and accepts it if
    the lock can be taken or rejects it if the lock cannot be granted (or
    puts it in a list to be granted later, or whatever). This does require
    that all processors agree upon the order of the messages sent through
    implicit acknowledgement. But it removes the *reply* step which reduces
    latency and allows a processor to obtain a lock as soon as the message
    is self-delivered to the requesting processor. It also creates a
    redundant copy of the lock state that all processors maintain, likely
    improving availability in the face of stop or restart faults.

    Message packing is an important improvement, since I assume the size of
    the lock request structures are small. the grant time on ethernet is
    about the same for small or large packets, so reducing the number of
    times access to the ethernet must be granted can make a huge
    improvement. This allows packing a grouping of lock requests from one
    processor into a full MTU sized packet for 8x-10x performance gains for
    150 byte messages on 1500 mtu networks...

    Without these approaches, I'd expect performance somewhere in the
    1,000-2,000 grants per second... How does the implementation in this
    patch perform? I'd also be interested to know how long it takes from
    the time the request is made on the processor to the time it is
    completed (perhaps as measured by gettimeofday, or something similiar).

    Mark, how does oracle's dlm perform for your questions above?

    > > Now, sometimes there are two remote messages (when a resource directory
    > > lookup is needed). You can eliminate that by not using a resource
    > > directory, which will soon be a configurable option.
    > >
    > >
    > > > My main concern is that I have not seen anything relying on this code do
    > > > "reasonably well". eg can you show gfs numbers w/ number of nodes and
    > > > scalability ?
    > >
    > > I'd suggest that if some cluster application is using the dlm and has poor
    > > performance or scalability, the reason and solution lies mostly in the
    > > app, not in the dlm. That's assuming we're not doing anything blatantly
    > > dumb in the dlm, butI think you may be placing too much emphasis on the
    > > role of the dlm here.
    > Well, obviously the dlm is only one component of an entire system, but for a
    > cluster application it can certainly be an important component, one whose
    > performance is worth looking into. I don't think asking for this
    > information is out of the question.
    > --Mark
    >
    > > > I think it's time we submit ocfs2 w/ it's cluster stack so that folks
    > > > can compare (including actual data/numbers), we have been waiting to
    > > > stabilize everything but I guess there is this preemptive strike going
    > > > on so we might just as well. at least we have had hch and folks comment,
    > > > before sending to submit code.
    > >
    > > Strike? Preemption? That sounds frightfully conspiratorial and
    > > contentious; who have you been talking to? It's obvious to me that ocfs2
    > > and gfs each have their own happy niche; they're hardly equivalent (more
    > > so considering all the flavors of local file systems.) This is surely a
    > > case of "different", not "conflict"!
    > >
    > >
    > > > Andrew - we will submit ocfs2 so you can have a look, compare and move
    > > > on. we will work with any stack that eventuslly gets accepted, just want
    > > > to see the choice out there and an educated decision.
    > > >
    > > > hopefully tomorrow, including data comparing single node and multinode
    > > > performance.
    > >
    > > I'd really like to see ocfs succeed, but good heavens, why do we need to
    > > study an entire cluster fs when looking at a dlm!? A cluster fs may use a
    > > dlm, but a dlm is surely a stand-alone entity with _many_ applications
    > > beyond a cluster fs (which is frankly a rather obscure app.)
    > >
    > > We've made great effort to make the dlm broadly useful beyond the realm of
    > > gfs or cluster file systems. In the long run I expect other cluster apps
    > > will out-use the dlm by far.
    > >
    > > Dave
    > >
    > > -
    > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    > > the body of a message to majordomo@vger.kernel.org
    > > More majordomo info at http://vger.kernel.org/majordomo-info.html
    > > Please read the FAQ at http://www.tux.org/lkml/
    > --
    > Mark Fasheh
    > Senior Software Developer, Oracle
    > mark.fasheh@oracle.com
    >
    > -
    > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    > the body of a message to majordomo@vger.kernel.org
    > More majordomo info at http://vger.kernel.org/majordomo-info.html
    > Please read the FAQ at http://www.tux.org/lkml/

    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/


  • Next message: Benjamin Herrenschmidt: "Re: pci-sysfs resource mmap broken"

    Relevant Pages

    • Re: [PATCH 1b/7] dlm: core locking
      ... > cluster snapshot metadata and ddraid dirty log. ... > depending on whether a lock happens to master locally or not, ... So the easy way to use the DLM for metadata ownership is simply to have ...
      (Linux-Kernel)
    • Advice sought on how to lock multiple pages in ->prepare_write and ->writepage
      ... I am looking for advice on how to lock multiple pages in ->prepare_write ... We have a mounted NTFS volume with a cluster, i.e. logical block, size ... This means that the NTFS driver will receive a write request either via ...
      (Linux-Kernel)
    • Re: [Linux-cluster] redhat cluster running on debian 6.
      ... You *must* configure fencing for the cluster to work properly. ... Node 2 requests a DLM lock, is refused because the lock is out to Node A. ... The clyster will call a fence and DLM will block. ...
      (Debian-User)
    • Re: network wide lock
      ... > I have a bunch of servers in a cluster. ... > and i want to propagate the change to the other servers. ... > of the servers to process a request whilst the cluster is ... > I was thinking that if i could have some sort of network wide lock then ...
      (comp.lang.java.programmer)
    • Re: [PATCH 1b/7] dlm: core locking
      ... > I have always thought that an distributed application could use the DLM ... > until after nodes that are no longer a member of the cluster are safely ... depending on whether a lock happens to master locally or not, ... node membership age, ...
      (Linux-Kernel)