Re: [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29)



On 03/25/2009 10:31 PM, Eric Sandeen wrote:
Jeff Garzik wrote:
Eric Sandeen wrote:

What about when you're running over a big raid device with
battery-backed cache, and you trust the cache as much as much as the
disks. Wouldn't this unconditional cache flush be painful there on any
of the callers even if they're rare? (fs unmounts, freezes, unmounts,
etc? Or a fat filesystem on that device doing an fsync?)
What exactly do you think sync_blockdev() does? :)

It used to push os cached data to the storage. Now it tells the storage
to flush cache too (with your patch). This seems fine in general,
although it's not a panacea for all the various data integrity issues
that are being tossed about in this thread. :)

It is used right before a volume goes away. If that's not a time to
flush the cache, I dunno what is.

The _whole purpose_ of sync_blockdev() is to push out the data to
permanent storage. Look at the users -- unmount volume, journal close,
etc. Things that are OK to occur after those points include: power off,
device unplug, etc.

Sure. But I was thinking about enterprise raids with battery backup
which may last for days.

But, ok, I wasn't thinking quite right about the unmount situations etc;
even on enterprise raids like this, flushing things out on unmount makes
sense in the case where you lose power post-unmount and can't restore
power before the battery backup dies.

Enterprise raid systems don't have this issue. They all have sufficient battery power to safely destage the volatile cache to persistent storage on power outage (i.e., they keep enough drives spinning and so on to empty the cache).

Hardware RAID cards that you have in some servers (not external RAID arrays) would have this issue though and are probably fairly common.

I also wondered if a cache flush on one lun issues a cache flush for the
entire controller, or just for that lun. Hopefully the latter, in which
case it's not as big a deal.

Probably just for that LUN - LUN's are usually independent in almost all ways. Firmware of course could do anything, I would assume that most high end arrays would probably ignore a flush command in any case.

A secondary purpose of sync_blockdev() is as a hack, for simple/ancient
bdev-based filesystems that do not wish to bother with barriers and all
the associated complexity to tracking what writes do/do not need flushing.

Yep, I get that and it seems reasonable except ....

xfs, reiserfs, ext4 all avoid the blkdev flush on fsync if barriers are
not enabled, I think for that reason...
Enabling barriers causes slowdowns far greater than that of simply
causing fsync(2) to trigger FLUSH CACHE, because barriers imply FLUSH
CACHE issuance for all in-kernel filesystem journalled/atomic
transactions, in addition to whatever syscalls userspace is issuing.

The number of FLUSH CACHES w/ barriers is orders of magnitude larger
than the number of fsync/fdatasync calls.

I understand all that.

My point is that the above filesystems (xfs, reiserfs, ext4) skip the
blkdev flush on fsync when barriers are explicitly disabled.

They do this because if an admin disables barriers, they are trusting
that the write cache is nonvolatile and will be able to destage fully
even if external power is lost for some time.

In that case you don't need a blkdev_issue_flush on fsync either (or are
at least willing to live with the diminished risk, thanks to the battery
backup), and on xfs, ext4 etc you can turn it off (it goes away w/ the
barriers off setting). With this change to the simple generic fsync
path, you can't turn it off for those filesystems that use it for fsync.

But I suppose it's rare that anybody ever uses a filesystem which uses
this generic sync method on any sort of interesting storage like I'm
talking about, and it's not a big deal... (or maybe that interesting
storage just ignores cache flushes anyway, I dunno).

My main concerns were that these extra cache flushes for fsync aren't
tunable, and that flushes on one lun might affect other luns. I guess
I've talked myself out of those concerns in a couple different ways now. ;)

-Eric

I do enthusiastically agree that we should not be doing barriers and the blkdev flush for file systems that do barriers correctly.

ric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • Re: [dm-devel] Re: [PATCH] Implement barrier support for single device DM devices
    ... Using working barriers is important for normal users when you really ... write cache that will survive power outages... ... And second, "surprisingly", battery-backed RAID write caches tends to fail ... such a battery is enough to keep the data ...
    (Linux-Kernel)
  • Re: vfs: Add MS_FLUSHONFSYNC mount flag
    ... cache should be flush or not on fsync/fdatasync. ... it's built with a nonvolatile write cache or if it has no write cache. ... The main reason I decided to go for the mount option approach is to be ... consistent with what we do when it comes to write barriers. ...
    (Linux-Kernel)
  • Re: [PATCH 1/7] block: Add block_flush_device()
    ... some random filesystem shouldn't disable it. ... this whole insane belief that "write cache" has anything to do with "unable to flush" is just bogus. ... The way the barriers work does absolutely give you full ordering. ...
    (Linux-Kernel)
  • Re: vfs: Add MS_FLUSHONFSYNC mount flag
    ... cache should be flush or not on fsync/fdatasync. ... it's built with a nonvolatile write cache or if it has no write cache. ... The main reason I decided to go for the mount option approach is to be ... consistent with what we do when it comes to write barriers. ...
    (Linux-Kernel)
  • flush cache range proposal (was Re: ide errors in 7-rc1-mm1 and later)
    ... If bit 0 is set to one, the mandatory FLUSH CACHE and FLUSH CACHE EXT ... commands support the RANGE bit, and user-supplied LBA ... and sector count specifying the limits of the cache flush. ...
    (Linux-Kernel)