A question on modules, caching and file systems - probably.

From: Steve H (netspam_at_shic.co.uk)
Date: 05/17/04


Date: 17 May 2004 04:38:24 -0700

Firstly, I apologise for sounding vague... though, from my
perspective, this is preferable to being misleading by sounding too
specific.

I am at the very early stages of planning a project (which may well
turn out to be of little interest to anyone but myself) as an
investigation into alternative long-term storage strategies for
dynamic web-content. The basis for this question is my belief that
modern file-systems do not readily offer high-performance,
high-reliability support for the types of interaction I would like. I
have a back-of-a-napkin design for a storage API, and playing with
some user-land implementations has prompted me to question my original
ideas to simply layer my solution on top of the file-system.
Anecdotally, mainframe file-systems from the yesteryear of
'networked-database' programming, with OS-level support for ISAM and
DBMS-like transactions, seem a far better (if rather antique) fit for
my requirements. I've looked at various file systems (and file-system
proposals), for example ReiserFS, Ext3, Tux2 (previously discussed by
W. Phillips) and Spin-disk, and at user-level libraries like the
DBM/SleepyCat DB libraries. To my mind, the generic file-system API
does not offer the features I want. For example, I will know the size
of my files when I create them, but contemporary FS interfaces assume
all files are arbitrary-length streams, hence, I guess, precluding
many advantageous optimisations.
While user-land ISAM libraries and their ilk offer many of the
features I would like to see, I am concerned about their increased
complexity from a reliability perspective. For example, if my
application (by way of the library) holds open files, I am concerned
that an event such as a process-abort may leave files in an
inconsistent state, and (depending upon the underlying file system)
possibly an unpredictable one. In the event of an OS abort, I
understand that I have very few guarantees about the state of
pending write operations.
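
To make the "I know the size up front" point concrete: POSIX does
offer one hook for it, posix_fallocate(), which reserves space for a
file at creation time. The sketch below is only an illustration under
that assumption; the helper name create_fixed_size() is hypothetical,
and whether the reservation yields contiguous or otherwise optimised
layout depends entirely on the underlying file system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create `path` with `size` bytes reserved up front.
 * Returns an open descriptor, or -1 on failure. */
int create_fixed_size(const char *path, off_t size)
{
    assert(size > 0);

    /* O_EXCL: fail rather than reuse an existing file. */
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    /* posix_fallocate() returns an error number directly; it does not
     * set errno. On success the file is `size` bytes long and later
     * writes inside that range should not fail for lack of space. */
    int err = posix_fallocate(fd, 0, size);
    if (err != 0) {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
        close(fd);
        unlink(path);
        return -1;
    }
    return fd;
}
```

This only tells the file system the final size; it does not change
the stream-oriented read/write interface itself.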

I have a hunch that it might be advantageous to implement at least
some of the functionality I require as if it were a file-system (with
a richer API than that used by generic file systems). I am interested in
investigating my options to write a module to implement an alternative
API to interact with persistent storage. I am not sure, as yet, if it
would be better to develop a kernel module offering a lower-level
interaction than currently offered by file-systems – or to go the
other way and embed a richer API to support the high level operations
I wish were supported by file-systems. When looking at the
possibility of writing a user-land library for this project I am faced
with two particularly "thorny" issues:

        1. What can be guaranteed about the durability and atomicity of write
operations? I can synchronise with msync(), but its exact behaviour
seems a little hard to pin down and appears to be hardware-dependent.
A typical interaction will result in a number of write
operations which can be processed in any order, followed by a
distinguished write operation which must only start after the previous
set of write operations has completed. Only the final write needs to
be atomic.
        2. Both read and write caching will be required for large volumes of
data. While it is realistic to mmap(), say, 1GB of data into VM space,
support for realistic multi-terabyte volumes would require explicit
mapping and un-mapping of a working set of page-ranges, which
introduces several new caching complexities. It feels as if managing
this VM space at user-level is re-inventing the caching already used
to great effect by file systems. Furthermore, I am concerned that any
mmap() solution at user-level may adversely affect utilisation of RAM
on servers with lots of memory (say >4GB) and as such act as a
performance bottleneck.
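
The write pattern in issue 1 (unordered writes, then one
distinguished write) can be sketched in user-land as below. Note the
swap: this uses pwrite() and fdatasync() as the barrier rather than
msync(). All names (struct update, commit_updates, commit_demo,
BLOCK_SIZE) are hypothetical, and the sketch assumes that a single
aligned 512-byte write is atomic and that fdatasync() really reaches
stable storage. Neither is guaranteed: both depend on the file
system, the drive and its write cache.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 512  /* ASSUMED atomic write unit; hardware dependent */

struct update { off_t offset; const void *data; size_t len; };

/* Write the whole buffer, retrying on short writes and EINTR. */
static int write_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        p += n; len -= (size_t)n; off += n;
    }
    return 0;
}

/* Apply `n` updates in any order, then publish them with one commit block. */
int commit_updates(int fd, const struct update *u, size_t n,
                   off_t commit_off, uint64_t generation)
{
    assert(commit_off % BLOCK_SIZE == 0);

    for (size_t i = 0; i < n; i++)
        if (write_all(fd, u[i].data, u[i].len, u[i].offset) != 0)
            return -1;

    /* Barrier: the commit block must not start before these complete. */
    if (fdatasync(fd) != 0)
        return -1;

    /* Commit record: generation number in host byte order, zero padded. */
    unsigned char block[BLOCK_SIZE];
    memset(block, 0, sizeof block);
    memcpy(block, &generation, sizeof generation);
    if (write_all(fd, block, sizeof block, commit_off) != 0)
        return -1;
    return fdatasync(fd);
}

/* Usage sketch: two data updates published by generation 7 at offset 0. */
int commit_demo(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    struct update u[2] = { { 4096, "alpha", 5 }, { 8192, "beta", 4 } };
    int rc = commit_updates(fd, u, 2, 0, 7);
    uint64_t gen = 0;
    if (rc == 0 && pread(fd, &gen, sizeof gen, 0) != (ssize_t)sizeof gen)
        rc = -1;
    close(fd);
    return (rc == 0 && gen == 7) ? 0 : -1;
}
```

A reader that finds an old generation number simply ignores the newer
data blocks, which is why only the final write has to be atomic.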

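The explicit mapping and un-mapping described in issue 2 might look
roughly like the following single-window sketch. The names
(struct window, window_at, window_demo) are hypothetical, and it
assumes the file already covers every offset that is touched
(accessing a mapped page beyond end-of-file raises SIGBUS). A real
working set would hold many windows with an eviction policy, which is
exactly the caching logic that feels like re-invention.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

struct window {
    int    fd;
    void  *base;   /* current mapping, or NULL if nothing is mapped */
    off_t  start;  /* page-aligned file offset of `base` */
    size_t len;    /* length of the mapping in bytes */
};

/* Return a pointer to the byte at file offset `off`, remapping a window of
 * `win_len` bytes when `off` falls outside the current one. Only the bytes
 * up to the end of the window are valid through the returned pointer. */
void *window_at(struct window *w, off_t off, size_t win_len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0 && win_len >= (size_t)page);

    if (w->base != NULL && off >= w->start && off < w->start + (off_t)w->len)
        return (char *)w->base + (off - w->start);

    if (w->base != NULL) {          /* drop the old window first */
        munmap(w->base, w->len);
        w->base = NULL;
    }
    off_t aligned = off - (off % page);  /* mmap offsets must be page aligned */
    void *p = mmap(NULL, win_len, PROT_READ | PROT_WRITE, MAP_SHARED,
                   w->fd, aligned);
    if (p == MAP_FAILED)
        return NULL;
    w->base = p;
    w->start = aligned;
    w->len = win_len;
    return (char *)p + (off - aligned);
}

/* Usage sketch: touch two distant pages through a one-page window. */
int window_demo(const char *path)
{
    long page = sysconf(_SC_PAGESIZE);
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, 4 * page) != 0) { close(fd); return -1; }

    struct window w = { fd, NULL, 0, 0 };
    int ok = 0;
    char *a = window_at(&w, 10, (size_t)page);
    if (a != NULL) {
        *a = 'A';
        char *b = window_at(&w, 3 * page + 10, (size_t)page); /* remaps */
        if (b != NULL) {
            *b = 'B';
            ok = 1;
        }
    }
    if (w.base != NULL)
        munmap(w.base, w.len);

    char ca = 0, cb = 0;
    if (pread(fd, &ca, 1, 10) != 1 || pread(fd, &cb, 1, 3 * page + 10) != 1)
        ok = 0;
    close(fd);
    return (ok && ca == 'A' && cb == 'B') ? 0 : -1;
}
```
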
So, with respect to the above rambling, I have several more direct
questions:

        1. Does anyone know of similar or related projects for Linux (2.4 or
2.6 kernels) or BSD based systems either to offer a richer API (with
ISAM, transaction and historic-journal-like capabilities) or a lower
level API (permitting detailed control of concurrency, atom-size, etc.
etc.)?
        2. Do kernel hackers think I'm misguided in thinking that
kernel-level support would significantly improve performance and
reliability?
        3. Bearing in mind that most of my experience is at user-process
level, can anyone point me at minimalist samples that:
                a. Interact with in-memory caching of files for read and write to
disk from within a file-system module?
                b. Implement a skeletal Linux (or BSD) file-system module?

And, of course, I'm interested in any comments and/or constructive
criticism...

Thanks in advance,

Steve


