Re: [PATCH 00/16] DRBD: a block device for HA clusters



On Tue, 2009-05-05 at 10:21 +0200, Philipp Reisner wrote:
When you do asynchronous replication, how do you ensure that implicit
write-after-write dependencies in the stream of writes you get from
the file system above are not violated on the secondary?

Are you telling me drbd doesn't currently do this?


No I am not. DRBD does exactly this!
But I am wondering how that is achieved in the MD/NBD stack when running
in async mode.

The explanation is below.

DRBD has covered this issue since its early days (back in 2000).
The issue, and the solution we have in DRBD, are described in this paper:

http://www.drbd.org/fileadmin/drbd/publications/drbd_paper_for_NLUUG_2001.pdf

The way nbd does it (in the updated tools) is to use DIRECT_IO and
fsync.
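As a rough userspace illustration of the fsync half of that approach (not the actual nbd tools code), the sketch below replays writes strictly in order and does not start the next write until fsync() has returned for the previous one. The function names are made up for this example, and O_DIRECT is left out because it needs aligned buffers, which would only clutter the sketch.

```python
import os
import tempfile


def write_durably(fd, offset, data):
    """Write data at offset and return only after fsync() has completed."""
    view = memoryview(data)
    while view:
        # pwrite may write fewer bytes than requested, so loop until done.
        written = os.pwrite(fd, view, offset)
        offset += written
        view = view[written:]
    # Ask the kernel to push the data to stable storage before we go on.
    os.fsync(fd)


def replay_in_order(path, writes):
    """Apply (offset, data) pairs one at a time; each is flushed before the next."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for offset, data in writes:
            write_durably(fd, offset, data)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "backing.img")
        replay_in_order(path, [(0, b"journal"), (7, b"commit")])
        with open(path, "rb") as f:
            print(f.read())
```

Flushing after every single write is the simplest way to keep a later write from overtaking an earlier one, at the cost of giving up all reordering.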

Is that available in the existing tools? Or are the updated tools
something that will be available in the future?

It's in the existing tools.

Are you telling me md/nbd (async) doesn't currently do this?

I just described how it does this ... I don't quite see how that
translates into telling you it doesn't do this.

There might be a disk scheduler on the secondary.

There usually is a disk scheduler ... you just have to take the required
action to persuade it to preserve ordering ... a simplistic way of doing
this is to switch to the noop scheduler.
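For reference, the active scheduler of a block device is shown in brackets in /sys/block/<dev>/queue/scheduler, and writing a scheduler name to that file selects it. The sketch below only reads and parses the file; the helper names and the device name "sda" are placeholders for this example.

```python
import re


def active_scheduler(sysfs_text):
    """Return the bracketed (active) scheduler name, or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None


def read_scheduler(device):
    """Read the active scheduler for a device such as 'sda' (placeholder name)."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return active_scheduler(f.read())


if __name__ == "__main__":
    # Typical file contents; the active entry is the one in brackets.
    print(active_scheduler("noop anticipatory deadline [cfq]\n"))
```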

The issue actually goes further down the stack. Not only might the
in-kernel disk scheduler reorder requests; the driver and, finally, the
drive itself might do so too.

What we have in DRBD boils down to:

* We obey all possible write-after-write dependencies in the stream of
writes we get from the upper layers, and generate DRBD-internal
reorder barriers for the packet stream.
* On the secondary node we impose these barriers onto the stream of writes
submitted to the stack below us by either:

- Let previously submitted write-IO drain before we submit write-IO after
such a DRBD barrier. (We have had that since 2000 or so.)

- Additionally issue a blkdev_issue_flush()

- Use write requests with BIO_RW_BARRIER. This method has two advantages:
We can continue to submit writes after the DRBD internal barrier
immediately, and the number of requests with BIO_RW_BARRIER can be
further reduced.
See section 6 of
http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf
for more details, and nice illustrations.
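The drain variant can be illustrated with a small Python simulation. This is a toy model, not DRBD code, and it rests on one assumption about how barriers are placed: a write B may depend on a write A whenever A's completion was reported before B was submitted, so a barrier is put in front of the first write submitted after any completion. All function names are invented for the sketch.

```python
import random


def split_into_epochs(events):
    """Group writes into epochs; a barrier precedes the first submit after a completion.

    events is a list of ("submit", id) / ("complete", id) pairs in the order
    the primary saw them. The barrier rule is an assumption of this sketch.
    """
    epochs = [[]]
    completion_seen = False
    for kind, wid in events:
        if kind == "complete":
            completion_seen = True
        elif kind == "submit":
            if completion_seen and epochs[-1]:
                epochs.append([])  # barrier: start a new epoch
            completion_seen = False
            epochs[-1].append(wid)
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return epochs


def replay_on_secondary(epochs, rng):
    """Model the drain method: reorder freely inside an epoch, never across one."""
    order = []
    for epoch in epochs:
        batch = list(epoch)
        rng.shuffle(batch)   # scheduler, driver or drive may reorder these
        order.extend(batch)  # whole batch drains before the next epoch starts
    return order


def dependencies(events):
    """Pairs (a, b) where a had completed before b was submitted."""
    done, deps = [], set()
    for kind, wid in events:
        if kind == "complete":
            done.append(wid)
        else:
            deps.update((a, wid) for a in done)
    return deps


def respects(order, deps):
    """True if every dependency (a, b) has a before b in the given order."""
    pos = {wid: i for i, wid in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in deps)


if __name__ == "__main__":
    events = [("submit", "A"), ("submit", "B"), ("complete", "A"),
              ("submit", "C"), ("complete", "B"), ("complete", "C"),
              ("submit", "D")]
    epochs = split_into_epochs(events)
    order = replay_on_secondary(epochs, random.Random(0))
    print(epochs, order, respects(order, dependencies(events)))
```

In this model A and B were in flight together, so they may land in either order on the secondary, while C and D each wait for everything before them.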

There's a slight error in there ... we don't use ordered tags for
barriers (yet). I don't think it will really matter because the main
domain of ordering problems is the scheduler, which REQ_BARRIER does
cope with, it just means the queue drains for a barrier.

Unfortunately only high-end SAN devices seem to benefit from this
method. For most in-machine disk controllers this method does not
achieve the highest throughput.

Expressed in other words:
We allow reordering on the secondary node to an extent such that we can
guarantee that no implicit write-after-write dependencies are violated.

Coming back to the idea of disabling the Linux IO scheduler: it might
solve the issue for some devices, but it is not guaranteed to solve it.

I think you'll find the dio/fsync method above actually does solve all
of these issues (mainly because it enforces the semantics from top to
bottom in the stack). I agree one could use more elaborate semantics
like you do for drbd, but since the simple ones worked efficiently for
md/nbd, there didn't seem to be much point.

James

