Designing Another File System

From: John Richard Moser (nigelenki_at_comcast.net)
Date: 11/30/04

  • Next message: Rusty Russell: "[PATCH] sys_sched_setaffinity() on UP should fail for non-zero CPUs."
    Date:	Mon, 29 Nov 2004 23:32:05 -0500
    To: linux-kernel@vger.kernel.org
    
    

    -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA1

    Greetings.

    I've been interested in file system design for a while, and I think I
    may be able to design one. Being poor, I would like to patent/sell it
    when done; however, whatever the result, I'd assuredly give a free
    license to implement the resulting file system in GPL2 licensed code (if
    you're not making money off it, I'm not making money off it). This is
    similar in principle to how Hans Reiser sells his file system I think.

    I've examined briefly the overall concepts which go into several file
    systems and pooled them together to create my basic requirements. In
    these include things such as:

    - - a new journaling method that scales up and down to high loads and
    small devices

    - - localization of Inodes and related meta-data to prevent disk thrashing

    - - a scheme which allows Inodes to be dynamically allocated and
    deallocated out of order

    - - 64 bit indices indicating the exact physical location on disk of
    Inodes, giving a O(1) seek to the Inode itself

    - - A design based around preventing information leaking by guaranteed
    secure erasure of data (insofar that the journal even will finish wiping
    out data when replaying a transaction)

    I have 6 basic questions.

    1) Can Unix utilities in general deal with 64 bit Inodes? (Most
    programs I assume won't care; ls -i and df -i might have trouble)

    2) Are there any other security concerns aside from information leaking
    (deleted data being left over on disk, which may pop up in incompletely
    written files)?

    3) What basic information do I absolutely *need* in my super block?

    4) What basic information do I absolutely *need* in my Inodes? (I'm
    thinking {type,atime,dtime,ctime,mtime,posix_dac,meta_data_offset,size,\
    links}

    5) What basic information do I absolutely *need* in my directory
    entries? (I'm thinking {name,inode})

    6) How do I lay out a time field for atime/dtime/ctime/mtime?

    A few more specifics, as Block size grows, the file system does not lose
    efficiency. Fragment Blocks should skillfully subdivide huge Blocks
    with a third level index, increasing seek time O(+1) for the information
    stored in the Fragment Block:

    - -Inode
    |-Meta-data
    ||-Data Block
    |||-Data (2 seeks)
    ||-Fragment Block
    |||-Fragment
    ||||-Data (3 seeks)

    Fragment Blocks can store anything but Inodes, so it would be advisable
    to avoid huge Blocks; if a Block is 25% of the disk, for example, one
    Block must be dedicated to serving up Inode information. Also, with
    64KiB blocks, a 256TiB file system can be created. Larger Blocks will
    allow larger file systems; a larger file system will predictably house
    more files (unless it houses one gargantuan relational data base-- the
    only plausible situation where my design would require more, smaller
    blocks to be more efficient).

    Directories will have to be some sort of on disk BsomethingTree. . . B*,
    B+, something. I have no idea how to implement this :D I'll assume I
    treat the directory data as a logically contiguous file (yes I'm gonna
    store directory data just like file data). I could use a string hash of
    the Directory Entry name with disambiguation at the end, except my
    options here seem limited:

    - - Use a 32 bit string hash, 4 levels, 8 bits per level. 256 entries of
    32 bit {offset} for the next level per level, 1024B per level on unique
    paths. Worst case 1 path per file, 65536 files in the worst case
    measure 1 root (1KiB) + 256 L2 (256KiB) + 65536 L3 (64M) + 65536 L4
    (64M), total 128MiB, collision rate is probably low. Ouch.

    - - Use a 16 bit string hash, 2 levels, 8 bits per level. 256 entries of
    32 bit {offset) for the next level per level, 1024B per level on unique
    path. Worst case 1 path per file, 65536 files in the worst case measure
    1 root (1KiB) + 256 L2 (256KiB), no disambiguation, though collision
    rate is probably higher than that. Beyond 65536 creates collisions.
    Hitting a disambiguation situation is going to be fairly common.

    I guess the second would be better? I can't locate any directories on
    my drive with >2000 entries *shrug*. The end key is just the entry
    {name,inode} pair.

    Any ideas?

    - --
    All content of all messages exchanged herein are left in the
    Public Domain, unless otherwise explicitly stated.

    -----BEGIN PGP SIGNATURE-----
    Version: GnuPG v1.2.6 (GNU/Linux)
    Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

    iD8DBQFBq/fEhDd4aOud5P8RAofxAJ4hBzZfFiLAyU413zNj2jVqlLzXWACfcaqd
    4iwnewSp2xKrO94F9TF6dSY=
    =ZKm6
    -----END PGP SIGNATURE-----
    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/


  • Next message: Rusty Russell: "[PATCH] sys_sched_setaffinity() on UP should fail for non-zero CPUs."

    Relevant Pages

    • Re: SMP on FreeBSD 6.x and 7.0: Worth doing?
      ... It'd be easy to rewrite it from scratch if IFS were recovered. ... In the slightly less intrusive Arla view of the world, cache files do appear in the UFS name space, but an independent namespace is maintained by the cache manager, each with two file system names: a normal path, and its NFS file handle, which can be used to open, stat, etc, the file without a normal file system namespace operation. ... The user application can allocate a set of inodes in some arbitrary directory tree using normal operations, but when it does so also query the NFS file handles for the files using getfh. ...
      (freebsd-stable)
    • Re: How many directories can I have ?
      ... >>many directories am I likely to be able to create on a file system ... >>there will be some limit that cuts in before disk space is exhausted. ... >>who replied he thought the system was out of inodes. ... I ran into an inode limit on a Linux server running innd (news ...
      (comp.sys.sun.admin)
    • Re: journaling fs and large mailbox format
      ... newfs'ing of the file system will be necessary. ... create a new data disk, formatted to fit your specs, then copy the data ... I'm not convinced that increasing the number of inodes in this way would be ... without seeing the files), that the inode problem is a symptom, and that ...
      (freebsd-hackers)
    • GPFS synchronous mirroring not working, maybe?
      ... nsd's using mmchdisk, when I created a file on the file system, it was ... sfs -r, ... LIVDB2# lslpp -L | grep gpfs ... Number of allocated inodes: 6168094 ...
      (comp.unix.aix)
    • SUMMARY: no space left on device
      ... The solution is that my file system was out of i-nodes so I took ... it increased the no of inodes and my file system is alright now. ... Since you have a mirror, you should be able to break ... >advise us immediately if you are not the named addressee or the ...
      (SunManagers)