Re: Profiling multithreaded C++ code

From: David Schwartz (davids_at_webmaster.com)
Date: 07/13/03


Date: Sun, 13 Jul 2003 12:10:25 -0700


"Bernd Strieder" <strieder@informatik.uni-kl.de> wrote in message
news:bemdr0$gok$1@news.uni-kl.de...

> > Your function to shutdown a network connection uses a slow path when
> > the
> > connection is in use during the call to shutdown. But did you realize
that
> > almost all calls to 'shutdown' occur while processing a command received
> > over that connection? Maybe you should just set a 'shutdown' flag and
> > check the flag and do the shutdown when you spin down the connection?

> Isn't that design a bit problematic, different threads doing I/O over the
> same channel possibly at the same time.

    No, it's not a problem at all. But in any event, different threads doing
I/O is not needed to cause the problem I mentioned. All that's needed is a
'CloseConnection' function that can be called either by a thread currently
servicing that connection are an unrelated thread. The 'CloseConnection'
function might have to defer the close if it can't do it immediately.

> > Your cryptographically-strong random number generator is fairly slow
> > and
> > being called while locks are held. Perhaps you should cache a bunch of
> > random numbers somewhere so that the code that needs the numbers doesn't
> > hold the lock for so long.

> A function with meaty children calling synchronisation functions is
> suspicious in the first place. This can be recognized even in a
> single-threaded case. If the locks are held for short times, the code can
> usually be refactored to seperate the locks from meaty code.

    In a single-threaded case, the locks never block and how long you hold
locks doesn't matter. You'd be very unlikely to see this problem in a
single-threaded profile. You might, it's not impossible, but it's unlikely.
Or, to put it another way, it's much harder to see, from a profile, that
you're holding a lock for too long if that lock is uncontended. On the other
hand, lock contention jumps out at you.

> > Your multithreaded application performs badly when X, Y, and Z happen
> > at
> > the same time. What could be better than a profile of the application's
> > usage of CPU time *while* it's suffering?

> This is true, in principle, but usually it is not good design. If there
are
> threads, then it should be clear from design, what interaction is
possible,
> and what is to be avoided. If these problems are not ruled out, then the
> effect of all optimizations gets really dependent on the machine, is it
UP,
> 2, 4, ... -way SMP. There is more redesign necessary than optimizing
> locally.

    All the local optimizations in the world won't find the interaction
problems. In fact, many interaction optimizations are local pessimizations.
I'll make the most strong statement I can make: Other than algorithmic
optimizations, optimizations should not be made to individual elements that
are intended to work together until it is seen how well or how badly they
work together.

    First, local optimizations can be major global pessimizations. Second,
until you see how your code works in the real world under real load, you
don't know what to optimize.

> > This type of support is not helpful, though it's often available. If
I
> > have a pool of ten worker threads that do most of the work, why do I
care
> > which of those threads happened to be the one that wasted the CPU? What
> > difference could it possibly make to me?

> The coincidences of one thread, one task, are hidden.

    One thread is not one task. One thread is one execution vehicle that
performs some subset of the work the application needs to do.

> > I'm really not talking about micro-optimizations either. I'm mostly
> > talking about the large region of cases where individually good pieces
go
> > bad when they work together. Call it high-level algorithmic failure if
you
> > like. See my real-world examples above.

> IOW, it's more like debugging. No tool support can be good enough. Every
bit
> of information you can get might be useful. Well, it's the way most
> software has to be developped these days, write some code, test and debug,
> deliver.

    It is somewhat a cross between debugging and micro-optimizing. But my
point is that there are many different categories of optimization and often
the biggest gains are often made at the module interaction level. In fact,
I'd argue that algorithmic optimizations are probably the most important.
For multi-threaded code, module interaction optimizations come second.
Micro-optimizations are a distant third, unless they apply to extremely
low-level code (like the memory allocator, locking primitives, and so on).

> > Here's an example -- you have a hash table that resizes itself when
it
> > finds it's too big or too small. Its searches are very fast, but add and
> > delete can be slow if they require a resize. Rapidly alternating adds
and

> A shared and expensive to be modified data structure should not have
> happened in the first place, but this can always be said when the
situation
> covers up.

    Well, that's the point. Reorganizing the hash table is a local speed up.
You can benchmark the reorganizing table and show that it performs better
than one that doesn't reorganize. The problem shows up when concurrent
threads need access to the table and one triggers a reorganization.

    Interestingly, in this case (which is based on an actual example), the
ideal solution was to replace their hash table with one that can re-organize
in smaller pieces (see Per-Ake Lawson's article, Dynamic Hash Tables). So,
in a sense, this was an algorithmic optimization. But it only became clear
that the algorithm was deficient by profiling it in real use by multiple
concurrent threads.

> > You may be missing major, quick, high-level optimizations.

> Maybe quick in LOC to be changed, but not uncommonly changing behaviour a
> lot and creating other problems.

    Changing the hash algorithm from one type of hash to another is not
likely to create other problems. Seriously, I have found a very large number
of "Oh, duh!" type instant optimizations this way.

> > Generally, the fixes you make as a result of the profiles are "oh,
> > duh!"
> > type fixes that are obviously right. See my examples above.

> So in the hash table case I can see two options, make it large enough to
> avoid resizings, or create thread local hash tables for those entries that
> will soon be deleted. Since the table was resizable in the first place,
> there are no numbers for a maximum size, so the first method just reduces
> the number of problematic cases. But you might slow down other cases,
> because the large hashtable might not perform as well as a better fitting
> one. The second method requires two lookups, where one was sufficient
> before, but saves lots of resizings.

    The solution (as I already wrote above) was to replace the hash table
with one whose average performance is a bit lower but that can resize
continuously without large reorganizations. If you're not familiar with
Per-Ake Larson's hash tables (as the person who originally wrote that code
wasn't) you should check it out.

> > There can be major performance issues involving the interaction with
> > I/O
> > and other modules. Maybe a high-level module is sending data to the I/O
> > modules in small chunks, causing the I/O module to handle things a very
> > inefficient way.

> This problem will happen without threads, too.

    Nope, it won't. Because all the small chunks will be one big chunk by
the time the high-level module is finished running. It's only if the I/O
module is running concurrently that each of the small chunks gets
individually processed. (Again, this is a real-world example.)

> > There's a huge area of performance optimization -- an absolutely
> > critical area -- that can rapidly be detected by profiling the full
> > application under real-world conditions. No joke, if a real-world
> > multithreaded application has never been profiled in this way, I can
> > generally get a 30-40% performance improvement in a day or two just by
> > scanning the profiles.

> That's about the differences in performance you can get with different
> compiler versions, different compiler options, different basic load of the
> machines. Change the machine, and the improvement might not be any more.
> Well, often enough life is not too bad, and it works.

    Sure, but that's 30-40% independent of those things, which you can still
mess with. More than once, I've been hired by companies that wanted to add a
CPU-intensive feature to their product but couldn't do it and still make
their performance targets. A 30% performance improvement in, say, a web
server, creates plenty of room for a new feature.

    Yes, more major optimizations could double or triple speed. But they're
not likely to be a four hour 'Oh, duh!' type optimization.

> > Case in point, I once profiled a multithreaded Solaris server
>
> > Fifteen minutes of looking at the profile revealed that the code was
> > creating and destroying a lot of objects. The code keeps counts of the
> > numbers of each type of object, which requires obtaining a lock,
>
> > It took about an hour to code a proper spinlock and rewrite the

> Even in a single-threaded testcase the number of mutex calls should have
> been suspicious.

    I don't think it would be possible to create the case with a single
thread. And I don't think the expense of the mutexes would have been quite
as visible with just one thread. Sure, it's possible someone would have
caught it, but in the multithreaded profile, it jumped right out at you.

    More importantly, if the application is multithreaded, you cannot
reproduce the actual circumstances in which the application has performance
problems with a single thread.

> I'm not convinced that profiling is the best way to get these kinds of
> problems out of code, but it is clear that the tools to help finding them
> in the multithreaded case by testing cannot be good enough. It is still a
> lot harder to create multithreaded testcases to show something, you might
> have to hope to get about one in time.

    You don't have to create testcases. Just put the program in the
real-world actual situation in which performance is an issue. Then look at
the profile. The problems will likely jump right out at you.

    DS