Re: Ocfs2 performance bugs of doom
- From: Mark Fasheh <mark.fasheh@xxxxxxxxxx>
- Date: Fri, 3 Mar 2006 16:53:16 -0800
Hi Daniel,
On Fri, Mar 03, 2006 at 02:27:52PM -0800, Daniel Phillips wrote:
> Hi ocfs2 guys,
>
> Actually, ocfs2 is not doomed - far from it! I just wanted to grab a few
> more of those eyeballs that normally glaze over each time they see a
> cluster mail run by on lkml[1]. Today's news: you are still not finished
> with global lock systime badness, and I already have more performance bugs
> for you.

Thanks for continuing to profile this stuff. We've definitely got lots of
profiling to do, so any help is welcome.
> So the ocfs2 guys switched from double word to single word hash table heads
> and quadrupled the size of the hash table. Problem solved, right? No!

Heh, well if you remember from irc I told you that there was more to do -
specifically that I was looking at increasing the memory dedicated to the
hash.
> I included two patches below. The first vmallocs a megabyte hash table
> instead of a single page. This definitively squishes away the global
> lock hash table overhead so that systime stays about twice as high as
> ext3 instead of starting at three times and increasing to four times
> with the (new improved) 4K hash table.

I completely agree that we need to increase the memory dedicated to the
hash, but I fear that vmallocing a 1 megabyte table per domain (effectively
per mount) is going overboard. I will assume that this was a straw man patch
:)
> Of course the Right Think To Do is kmalloc a vector of pages and mask the
> hash into it, but as a lazy person I would rather sink the time into
> writing this deathless prose.

Right. That's one possible approach - I was originally thinking of
allocating a large global lockres hash table at module init time.
This way we don't have to make large allocs for each domain.
Before anything else though, I'd really like to get an idea of how large we
want things. This might very well dictate the severity of the solution.
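
For illustration, the "vector of pages, mask the hash into it" idea can be
sketched in userspace C roughly as follows. This is a sketch under
assumptions, not the patch under discussion: every name here
(sketch_hash_init, sketch_bucket, SKETCH_HASH_PAGES) and both sizes are made
up for the example, and calloc() stands in for whatever page allocator the
kernel code would really use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed sizes for the sketch; the real values are exactly what is
 * still being debated in this thread. */
#define SKETCH_PAGE_SIZE  4096u
#define SKETCH_HASH_PAGES 4u    /* must be a power of two */

/* Single-word bucket head, in the spirit of the kernel's hlist_head. */
struct bucket {
    void *first;
};

#define BUCKETS_PER_PAGE (SKETCH_PAGE_SIZE / sizeof(struct bucket))
#define SKETCH_BUCKETS   (SKETCH_HASH_PAGES * BUCKETS_PER_PAGE)

/* One table shared by everyone, set up once (think module init time),
 * built from page-sized chunks instead of one large allocation. */
static struct bucket *hash_pages[SKETCH_HASH_PAGES];

int sketch_hash_init(void)
{
    size_t i;

    /* Masking only works when the bucket count is a power of two. */
    assert((SKETCH_BUCKETS & (SKETCH_BUCKETS - 1)) == 0);

    for (i = 0; i < SKETCH_HASH_PAGES; i++) {
        hash_pages[i] = calloc(1, SKETCH_PAGE_SIZE);
        if (!hash_pages[i]) {
            /* Undo the partial allocation on failure. */
            while (i > 0) {
                free(hash_pages[--i]);
                hash_pages[i] = NULL;
            }
            return -1;
        }
    }
    return 0;
}

/* Mask the hash down to a bucket index, then split the index into
 * (page, offset within page). */
struct bucket *sketch_bucket(unsigned int hash)
{
    size_t idx = hash & (SKETCH_BUCKETS - 1);

    return &hash_pages[idx / BUCKETS_PER_PAGE][idx % BUCKETS_PER_PAGE];
}
```

Changing SKETCH_HASH_PAGES is the only knob here, which is the point of
settling "how large" first: the lookup path stays the same whatever size is
picked.
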
> The __dlm_lookup_lockres rewrite shows a measurable improvement in
> hash chain traversal overhead. Every little bit helps.

Yes, this patch seems much easier to swallow ;) Thanks for doing that work.
I have some comments about it below.
> I failed to quantify the improvement precisely because there are other
> glitchy things going on that interfere with accurate measurement. So now
> we get to...

Definitely if you get a chance to show how much just the lookup optimization
helps, I'd like to know. I'll also try to gather some numbers.
> Things to notice: with the still-teensy hash table, systime for global
> lock lookups is still way too high, eventually stabilizing when lock
> purging kicks in at about 4 times ext3's systime. Not good (but indeed,
> not as excruciatingly painful as two days ago).
>
> With the attached patches we kill the hash table overhead dead, dead,
> dead. But look at real time for the untar and sync!

Indeed. The real time numbers are certainly confusing. I actually saw real
time decrease on most of my tests (usually I hack things to increase the
hash allocation to something like a few pages). I want to do some more
consecutive untars though.
Btw, for what it's worth, here's some single untar numbers I have lying
around. Sync numbers are basically the same.
Before the hlist patch:
real 0m21.279s
user 0m3.940s
sys 0m18.884s
With the hlist patch:
real 0m15.717s
user 0m3.924s
sys 0m13.324s
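
As a quick check on what those two runs say, the relative drop can be
computed directly. percent_drop is a made-up helper name; the inputs are the
times quoted above.

```c
#include <assert.h>

/* Relative reduction from `before` to `after`, in percent. */
double percent_drop(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}

/* percent_drop(18.884, 13.324) is about 29.4 (sys time)
 * percent_drop(21.279, 15.717) is about 26.1 (real time) */
```

So for this single untar the hlist patch takes a bit under a third off
system time and roughly a quarter off wall-clock time.
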
> Ocfs2 starts little or no writeout during the untar itself. Ext3 does.
> Solution: check ext3 to see what secret incantations Andrew conjured up
> to fiddle it into doing something reasonable.

Good idea. As you probably already know, we've never been beyond checking to
see how ext3 handles a given problem :)
> Ocfs2 sometimes sits and gazes at its navel for minutes at a time, doing
> nothing at all. Timer bug? A quick glance at SysRq-t shows ocfs2 waiting
> in io_schedule. Waiting for io that never happens? This needs more
> investigation.

Did you notice this during the untar? If not, do you have any reproducible
test case?
> Delete performance remains horrible, even with a 256 meg journal[4] which
> is unconscionably big anyway. Compare to ext3, which deletes kernel trees
> at a steady 2 seconds per, with a much smaller journal. Ocfs2 takes more
> like a minute.

This doesn't surprise me - our unlink performance leaves much to be desired
at the moment. How many nodes did you have mounted when you ran that test?
Off the top of my head, the two things which I would guess are hurting delete
the most right now are node messaging and lack of directory read ahead. The
first will be solved by more intelligent use of the DLM so that the network
will only be hit for those nodes that actually care about a given inode --
unlink, rename and mount/unmount are the only things left that still use the
slower 'vote' mechanism. Directory readahead is much easier to solve, it's
just that nobody has gotten around to fixing that yet :/ I bet there's more
to figure out with respect to unlink performance.
> I would chase down these glitches too, except that I really need to get
> back to my cluster block devices, without which we cannot fully bake
> our cluster cake. After all, there are six of you and only one of me, I
> had better go work on the thing nobody else is working on.

Hey, we appreciate the help so far - thanks.
> [2] It is high time we pried loose the ocfs2 design process from secret
> irc channels and dark tunnels running deep beneath Oracle headquarters,
> and started sharing the squishy goodness of filesystem clustering
> development with some more of the smart people lurking here.

We're not trying to hide anything from anyone :) I'm always happy to talk
about design. We've been in bugfix (and more recently, performance fix) mode
for a while now, so there hasn't been much new design yet.
> -	bucket = &(dlm->lockres_hash[hash % DLM_HASH_BUCKETS]);
> -
> -	/* check for pre-existing lock */
> -	hlist_for_each(iter, bucket) {
> -		tmpres = hlist_entry(iter, struct dlm_lock_resource, hash_node);
> -		if (tmpres->lockname.len == len &&
> -		    memcmp(tmpres->lockname.name, name, len) == 0) {
> -			dlm_lockres_get(tmpres);
> -			break;
> -		}
> -
> -		tmpres = NULL;
> +		if (likely(res->lockname.name[0] != name[0]))

Is the likely() here necessary? It actually surprises me that the check even
helps so much - if you check the way OCFS2 lock names are built, the first
few bytes are very regular - the first char is going to almost always be one
of M, D, or W. Hmm, I guess if you're hitting it 1/3 fewer times...

> +			continue;
> +		if (likely(res->lockname.len != len))

Also here, lengths of OCFS2 locks are also actually fairly regular, so
perhaps the likely() isn't right?
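
To make the ordering of the checks concrete, here is a minimal userspace
sketch of a bucket walk that tries the two cheap rejects (first byte, then
length) before paying for memcmp(). It is not the patch itself: the struct,
the function name and the lock names in the usage note are hypothetical, a
plain singly linked list stands in for the kernel hlist, and the branch
hints are left out since whether likely() is right is the open question.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a lock resource on a hash chain. */
struct sketch_lockres {
    struct sketch_lockres *next;  /* next entry in the same bucket */
    unsigned int len;             /* length of name, assumed > 0 */
    const char *name;             /* not necessarily NUL-terminated */
};

/* Walk one bucket; return the matching entry or NULL. */
struct sketch_lockres *sketch_lookup(struct sketch_lockres *head,
                                     const char *name, unsigned int len)
{
    struct sketch_lockres *res;

    assert(len > 0);  /* name[0] must be readable */

    for (res = head; res; res = res->next) {
        /* Cheap reject 1: first byte differs. */
        if (res->name[0] != name[0])
            continue;
        /* Cheap reject 2: lengths differ. */
        if (res->len != len)
            continue;
        /* Only now pay for the full comparison. */
        if (memcmp(res->name, name, len) == 0)
            return res;
    }
    return NULL;
}
```

With a chain holding the made-up names "M00010", "M0002" and "W0001", a
lookup of "M0001" gets past the first-byte test on the two M entries, is
rejected on length by the first and by memcmp() on the second, and skips the
W entry on its first byte. How much each test saves depends on how varied
the first bytes and lengths really are, which is the concern raised above.
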
--Mark
--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@xxxxxxxxxx