Re: Starting a grad project that may change kernel VFS. Early research
- From: Jeff Shanab <jshanab@xxxxxxxxxxxxx>
- Date: Mon, 24 Aug 2009 21:23:35 -0700
Bryan Donlan wrote:
> On Mon, Aug 24, 2009 at 10:05 PM, Jeff Shanab <jshanab@xxxxxxxxxxxxx> wrote:
>>> On Mon, Aug 24, 2009 at 04:54:52PM -0700, Jeff Shanab wrote:
>>>> I was thinking that a good way to handle this is that it starts with
>>>> a file change in a directory. The directory entry already contains a sum
>>>> for itself and all the subdirs, and an adjustment is made immediately to
>>>> that; it should be in the cache. Then we queue up the change to be sent
>>>> to the parent(s?). These queued-up events should be low priority, on a
>>>> more human time scale like 1 second. If a large number of changes come to a
>>>> directory, multiple adjustments hit the queue with the same (directory
>>>> name, inode #?) and early ones are thrown out. So levels above would see
>>>> at most one low-priority update per second.
>>>
>>> Is this something that you want to be stored in the file system, or
>>> just cached in memory? If it is going to be stored on disk, which
>>> seems to be implied by your description, and it is only going to be
>>> updated once a second, what happens if there is a system crash? Over
>>> time, the values will go out of date. Fsck could fix this, sure, but
>>> that means you have to do the equivalent of running "du -s" on the root
>>> directory of the filesystem after an unclean shutdown.
>>
>> Could this be done at low priority in the background, long after fsck
>> and the boot process are done?
>> There will probably be a cutoff point where running du -s after a command
>> is better than going file by file, like when we recursively move a
>> directory. But I was gonna run tests and see how that went. Mv may
>> actually be easier than cp; it is a tree grafting.
>
> cp is easier than mv - in that it requires no explicit support from
> your layer. 'cp' really just loops doing read() and write() - there
> are some experimental copy-on-write ioctls for btrfs, I think, but
> nothing standard there yet.

Easier was a bad choice of words. I really meant that a move is less expensive.
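
Going back to the queueing idea described near the top of this message (adjust the directory's own sum immediately, then send coalesced, delayed updates to the parents), here is a rough userspace sketch in Python. Everything in it is an assumption made for illustration: the class name SizeUpdateQueue, the parents map keyed by path, and the injectable clock are all hypothetical, and nothing here is kernel code. One deliberate difference from the description: pending adjustments for the same directory are summed rather than the early ones being thrown out, because discarding only works if each event carries the directory's new absolute total instead of a delta.

```python
import time
from collections import OrderedDict


class SizeUpdateQueue:
    """Coalesces size deltas per directory and forwards them upward after a delay."""

    def __init__(self, parents, interval=1.0, clock=time.monotonic):
        self.parents = parents          # hypothetical map: directory -> parent (None for root)
        self.interval = interval        # delay before a queued update is applied
        self.clock = clock              # injectable so the sketch can be tested without sleeping
        self.totals = {}                # cached subtree total per directory
        self.pending = OrderedDict()    # directory -> (due time, coalesced delta)

    def file_changed(self, directory, delta):
        """Adjust the directory's own total now, then queue the change for its parent."""
        self.totals[directory] = self.totals.get(directory, 0) + delta
        self._enqueue(self.parents.get(directory), delta)

    def _enqueue(self, directory, delta):
        if directory is None:           # the root has no parent to notify
            return
        due, queued = self.pending.get(directory, (self.clock() + self.interval, 0))
        self.pending[directory] = (due, queued + delta)   # merge with any queued delta

    def flush(self):
        """Apply every queued delta whose delay has expired; return how many were applied."""
        now = self.clock()
        ready = [d for d, (due, _) in self.pending.items() if due <= now]
        for directory in ready:
            _, delta = self.pending.pop(directory)
            self.totals[directory] = self.totals.get(directory, 0) + delta
            self._enqueue(self.parents.get(directory), delta)   # keep moving up the tree
        return len(ready)
```

With this shape, a burst of writes in one directory costs its parent a single update per interval, and each further level up lags by one more interval. Nothing here survives a crash, which is exactly the concern raised in the quoted reply.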
> Also, directories aren't 'recursively moved' - if you're moving within
> a mount, you just rename() the directory, and it's moved in what is on
> most filesystems an O(1) operation.

I should have been clear, that is what I meant by tree grafting :-)

> If you're moving between mounts,
> the kernel gives you no help whatsoever - it's up to the 'mv' program
> to copy the directory, then delete the old one.

Now that is interesting. I am sure I would have realized that eventually;
I have certainly seen it in action. Just hadn't thought of that this
time. Thanks.
So does mv essentially become a copy when moving between mounts?
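
The two cases described above (a single rename() within a mount, copy and then delete across mounts) can be sketched in userspace like this. It is only an illustration, not how mv itself is written: the function name move_tree is made up, it handles directories only, and Python's own shutil.move already does a similar fallback. The one hard fact relied on is that rename(2) fails with EXDEV when source and destination are on different filesystems.

```python
import errno
import os
import shutil


def move_tree(src, dst):
    """Rename a directory within one filesystem; copy then delete across mounts."""
    try:
        os.rename(src, dst)             # one rename(2) call, no per-file work
        return "renamed"
    except OSError as exc:
        if exc.errno != errno.EXDEV:    # EXDEV: src and dst are on different filesystems
            raise
    # Cross-mount case: every file is read and rewritten, then the source is removed.
    shutil.copytree(src, dst, symlinks=True)
    shutil.rmtree(src)
    return "copied"
```

For a cached subtree total, the "renamed" path means one subtraction at the old parent and one addition at the new parent, while the "copied" path looks like an ordinary stream of file creations followed by deletions.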
>>> You could write the size changes in a journal, but that blows up the
>>> size of information that would need to be stored in a journal. It
>>> also slows down the very common operation of writing to a file, all for
>>> the sake of speeding up the relatively uncommon "du -s" operation.
>>> It's not at all clear it's a worthwhile tradeoff.
>>
>> Yeah, fsck is an interesting scenario.
>> Databases have had to deal with this, and maybe there are hints, like
>> two-phase commit and a WAL just for the size updates.
>> Maybe we set a flag in the directory entry when we update it, because we
>> are writing this update to disk anyway.
>> Then when the update completes at the parent, the flag is cleared. Now this
>> makes two writes for each directory, but the process is resumable during fsck.
>
> No. Updating the size at the same time as the main inode write is far
> cheaper than opening a second transaction just for the size update -
> unless computing the new size is an expensive operation as well.

But the size of a subdirectory is not stored in the inode in this
scenario; it is stored in the directory entry.
Or is it? There is an inode for the directory file; maybe just adjust
the inode and return the subdir size if the type is direntry.
Maybe this is on a flag and the directory can look like this ...
...
-rw-r--r-- 1 root root 14347 Jan 24 2009 thickbox.js~
-rw-r--r-- 1 root root 18545 Jun 10 18:56 unofficalTranscript.txt
-rw-r--r-- 1 root root 322183635 Aug 11 20:20 uw_mm_inflamm_ipodv.m4v
drwxr-xr-x 2 root root 440(56093) Nov 23 2007 varicaddemo
drwxr-xr-x 2 root root 144(10298) Oct 23 2007 varicaddemos
TOTAL 322217111 (322282918)
Where the number in parentheses is the subdir total.
A total at the end of the dir listing, just like du or anything else that
uses stat, now becomes practical. (Ever used filelight?)
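
For comparison, this is roughly what a tool has to do today to produce the two columns in that listing, sketched in Python. The function names subtree_size and listing are hypothetical, and the choice to count a directory's own st_size inside its subtree total is an assumption about what the parenthesized number means. Hard links are counted once per name here, so the totals can overcount.

```python
import os


def subtree_size(path):
    """Sum st_size of a directory and everything below it, without following symlinks."""
    total = os.lstat(path).st_size
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                total += subtree_size(entry.path)      # recurse: this is the expensive part
            else:
                total += entry.stat(follow_symlinks=False).st_size
    return total


def listing(path):
    """Return (name, own size, subtree size or None) rows plus the plain and deep totals."""
    rows, total, deep_total = [], 0, 0
    with os.scandir(path) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            own = entry.stat(follow_symlinks=False).st_size
            deep = subtree_size(entry.path) if entry.is_dir(follow_symlinks=False) else None
            rows.append((entry.name, own, deep))
            total += own
            deep_total += own if deep is None else deep
    return rows, total, deep_total
```

The recursion in subtree_size is the full tree walk that a stored per-directory sum would replace with a single lookup.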
>> I need to look at the caching and how we handle changes already. Do we
>> write things immediately all the time? Then why must I "sync" before
>> unmount? Hmmm.
>
> You don't need to sync before umount. umount automatically syncs the
> filesystem it's applied on after it's removed from the namespace, but
> before the umount completes. Additionally, dirty buffers and pages are
> written back automatically based on memory pressure and timeouts - see
> /proc/sys/vm/dirty_* for the knobs for this.

I know it now does the sync for you, but the fact that a sync must be done
indicates there are buffers not written, correct?
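
A quick way to look at the writeback knobs mentioned above is to read them straight out of /proc. This is a small Python sketch; the function name read_dirty_knobs and its base parameter are made up for the example, and it assumes a Linux system with /proc mounted. It returns whatever dirty_* entries it finds rather than assuming a fixed set.

```python
import glob
import os


def read_dirty_knobs(base="/proc/sys/vm"):
    """Return {knob name: value} for every readable dirty_* file under base."""
    knobs = {}
    for path in sorted(glob.glob(os.path.join(base, "dirty_*"))):
        try:
            with open(path) as fh:
                knobs[os.path.basename(path)] = fh.read().strip()
        except OSError:
            continue                    # skip entries this user cannot read
    return knobs


if __name__ == "__main__":
    for name, value in read_dirty_knobs().items():
        print(f"{name} = {value}")
```

On the systems I know of, the timeout-related entries are dirty_expire_centisecs and dirty_writeback_centisecs, both in hundredths of a second.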
>>> In addition, how will you handle hard links? An inode can have
>>> multiple hard links in different directories, and there is no way to
>>> find all of the directories which might contain a hard link to a
>>> particular inode, short of doing a brute force search. Hence if you
>>> have a file living in src/linux/v2.6.29/README, and it is a hard link
>>> to ~/hacker/linux/README, and a program appends data to the file
>>> ~/hacker/linux/README, this would also change the result of running du
>>> -s src/linux/v2.6.29; however, there's no way for your extension to
>>> know that.
>
> ^^^ don't skip this part, it's absolutely critical, the biggest
> problem with your proposal, and you can't just handwave it away.

I will sleep on the hard link issue. There must be an answer, as du must
handle this.
I can see the problem if I can't distinguish which is the hard link
and which is not, because they are implemented the same.
First thing is to run an experiment in the morning:
test/foo/bar/file
test/bar/foo/file
where file is the same file, close to the disk block size.
Does 'du -s' in foo + 'du -s' in bar = 'du -s' in test?
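
The experiment above can be scripted. The sketch below is in Python and makes a few assumptions worth stating: du_bytes is a made-up helper that sums apparent sizes (st_size) rather than allocated blocks so the numbers are deterministic, it counts each (device, inode) pair once per invocation the way du does, and 4000 bytes stands in for "close to the disk block size". To keep the size of the test directory itself out of the comparison, the combined figure walks foo and bar in one invocation instead of walking test.

```python
import os
import tempfile


def du_bytes(*roots):
    """Sum apparent sizes under the given roots, counting each (device, inode) once."""
    seen, total = set(), 0
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for path in [dirpath] + [os.path.join(dirpath, f) for f in filenames]:
                st = os.lstat(path)
                key = (st.st_dev, st.st_ino)
                if key not in seen:         # a second hard link to a seen inode adds nothing
                    seen.add(key)
                    total += st.st_size
    return total


def run_experiment(size=4000):
    """Build test/foo/bar/file and test/bar/foo/file as hard links; return three sums."""
    with tempfile.TemporaryDirectory() as test:
        a = os.path.join(test, "foo", "bar")
        b = os.path.join(test, "bar", "foo")
        os.makedirs(a)
        os.makedirs(b)
        with open(os.path.join(a, "file"), "wb") as fh:
            fh.write(b"x" * size)
        os.link(os.path.join(a, "file"), os.path.join(b, "file"))
        foo = du_bytes(os.path.join(test, "foo"))
        bar = du_bytes(os.path.join(test, "bar"))
        both = du_bytes(os.path.join(test, "foo"), os.path.join(test, "bar"))
        return foo, bar, both
```

The separate sums each include the file, while the combined walk includes it once, so foo + bar exceeds the combined figure by exactly the file size. A per-directory cached sum would face the same question of which directory "owns" the file.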
> One thing you may want to look into is the new fanotify API[1] - it
> allows a userspace program to monitor and/or block certain filesystem
> events of interest. You may be able to implement a prototype of your
> space-usage-caching system in userspace this way without needing to
> modify the kernel. Or implement it as a FUSE layered filesystem. In
> the latter case you may be able to make a reverse index of sorts for
> hardlink handling - but this carries with it quite a bit of overhead.

FUSE is an option I was keeping open.
Since I can dedicate a mountpoint to a file system, mount and umount
it, and load and unload a kernel module, FUSE seemed like extra work with
little benefit.
That does sound like a lot of overhead.
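
To get a feel for what such a reverse index would hold, here is a brute-force Python sketch that builds one by walking a tree. The name build_reverse_index and the choice to index only files with a link count above one are assumptions for illustration. A layered filesystem would have to maintain the same mapping incrementally on every link, unlink and rename instead of rebuilding it, which is where the overhead comes from.

```python
import os
from collections import defaultdict


def build_reverse_index(root):
    """Map (device, inode) -> set of directories holding a link, for multiply-linked files."""
    index = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if st.st_nlink > 1:             # singly-linked files need no reverse lookup
                index[(st.st_dev, st.st_ino)].add(dirpath)
    return index
```

Given the inode of a file that just grew, the index answers "which directories need their sums adjusted?" without a search, but only for links that live under the indexed root.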
> PS - it's normal to keep all CCs when replying to messages on lkml
> (that is, use reply to all), as some people may not be subscribed, or
> may prefer to get extra copies in their inbox. I personally don't mind
> either way, but there are some who are very adamant about this point.

OK. The other lists I am on are insistent that I only send to the list
address.
[1] - http://lwn.net/Articles/339399/