RE: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?

-----Original Message-----
From: J. Bruce Fields [mailto:bfields@xxxxxxxxxxxx]
Sent: Monday, June 16, 2008 12:44 PM
To: Weathers, Norman R.
Cc: Jeff Layton; linux-kernel@xxxxxxxxxxxxxxx;
linux-nfs@xxxxxxxxxxxxxxx; Neil Brown
Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?

On Fri, Jun 13, 2008 at 05:53:20PM -0500, Weathers, Norman R. wrote:


-----Original Message-----
From: J. Bruce Fields [mailto:bfields@xxxxxxxxxxxx]
Sent: Friday, June 13, 2008 5:04 PM
To: Weathers, Norman R.
Cc: Jeff Layton; linux-kernel@xxxxxxxxxxxxxxx;
linux-nfs@xxxxxxxxxxxxxxx; Neil Brown
Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?

On Fri, Jun 13, 2008 at 04:53:31PM -0500, Weathers, Norman R. wrote:


The big one seems to be the __alloc_skb. (This is with 16 threads, and
it says that we are using up somewhere between 12 and 14 GB of memory,
about 2 to 3 GB of which is disk cache). If I were to put any more
threads out there, the server would become almost unresponsive (it was
bad enough as it was).

At the same time, I also noticed this:

skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170

Don't know for sure if that is meaningful or not....
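
For anyone wanting to rank entries like the one above, here is a minimal
Python sketch. It assumes every line has the same shape as the sample
("<cache>: <count> <symbol>+<offset>/<size>"); the function name
top_allocators and the regular expression are illustrative, not part of
any kernel tool.

```python
import re
from collections import Counter

# Assumed line shape, taken from the single sample quoted in this thread:
#   <cache>: <count> <symbol>+0x<offset>/0x<size>
LINE_RE = re.compile(
    r"^(?P<cache>\S+):\s+(?P<count>\d+)\s+"
    r"(?P<caller>\S+?)\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+"
)


def top_allocators(lines, n=5):
    """Sum object counts per (cache, caller) and return the n largest."""
    totals = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip blank or unrecognised lines
        key = (match.group("cache"), match.group("caller"))
        totals[key] += int(match.group("count"))
    return totals.most_common(n)


# The only real data point here is the line quoted in the message.
sample = [
    "skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170",
    "",
]

for (cache, caller), count in top_allocators(sample):
    print(f"{cache:24s} {caller:20s} {count}")
```

On a live system the lines would come from the slab leak report rather
than from the hard-coded sample list.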

OK, so, starting at net/core/skbuff.c, this means that this memory was
allocated by __alloc_skb() calls with something nonzero in the third
("fclone") argument. The only such caller is alloc_skb_fclone().
Callers of alloc_skb_fclone() include:

	sk_stream_alloc_skb:
		do_tcp_sendpages
		tcp_sendmsg
	tcp_fragment
	tso_fragment

Interesting you should mention the tso... We recently went through and
turned on TSO on all of our systems, trying it out to see if it helped
with performance... This could be something to do with that. I can try
disabling the tso on all of the servers and see if that helps with the
memory. Actually, I think I will, and I will monitor the situation. I
think it might help some, but I still think there may be something else
going on in a deep corner...

I'll plead total ignorance about TSO, and it sounds like a long
shot--but sure, it'd be worth trying, thanks.


Tried it, not sure if I like the results yet or not... It didn't seem
to make a huge difference, but here is something that will really make
you want to drink: the 2.6.25.4 kernel does not go into the size-4096
hell.

Remind me what the most recent *bad* kernel was of those you tested?
(2.6.25?)


The kernel that we were really seeing the problem with was 2.6.25.4, but
I think we may have figured out the 4096 problem, and it was probably a
mistake on my part, but it is important for the NFS users to see it so
they don't make the same mistake. I had found some performance tuning
guides, and in trying some of the suggestions, found that the setting
changes did seem to help on some things, but of course I never got to
run a check under full load (800+ clients). A suggestion was to change
the tcp_reordering tunable under /proc/sys/net/ipv4 from the default 3
to 127. We think that this was actually causing the issue. I was able
to trace back through all of the changes, and I changed this setting
back to the default 3, and it immediately fixed the size-4096 hell. It
appears that the reordering just eats into the memory, especially in
high demand situations, and I guess that should make perfect sense if we
are actually buffering up packets for reorder, and we are slamming the
box with thousands of requests per minute.
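
A minimal Python sketch for checking whether a box still has the default
setting. The path and the default of 3 are taken from the message above;
the helper names parse_tunable and is_default are illustrative only.

```python
from pathlib import Path

# Path and default value as quoted in the message above.
TCP_REORDERING = Path("/proc/sys/net/ipv4/tcp_reordering")
DEFAULT_REORDERING = 3


def parse_tunable(text):
    """Convert the text read from a /proc/sys integer tunable to an int."""
    return int(text.strip())


def is_default(value):
    """True when the value matches the default quoted in the thread."""
    return value == DEFAULT_REORDERING


if __name__ == "__main__":
    if TCP_REORDERING.exists():
        current = parse_tunable(TCP_REORDERING.read_text())
        state = "default" if is_default(current) else "NOT default"
        print(f"tcp_reordering = {current} ({state})")
    else:
        print("tcp_reordering tunable not found on this system")
```

Running this on each server before a load test would show whether a
tuning-guide change like the one described here is still in place.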

We still have other performance issues now, but they appear to be more
of a bottleneck: the nodes do not appear to back off when the servers
become congested.


Nothing jumped out at me in a quick skim through the commits from 2.6.25
to 2.6.25.4.

The largest users of slab there are the size-1024 and still the
skbuff_fclone_cache. On a box with 16 threads, it will cache up about 5
GB of disk data, and still use about 6 GB of slab to put the information
out there (without TSO on), but at least it is not causing the disk
cache to be evicted, and it appears to be a little more responsive. If
I up it to 32 or more threads, however, it gets very sluggish, but then
again, I am hitting it with a lot of nodes.


	tcp_mtu_probe
	tcp_send_fin
	tcp_connect
	buf_acquire:
		lots of callers in tipc code (whatever that is).

So unless you're using tipc, or you have something in userspace going
haywire (perhaps netstat would help rule that out?), then I suppose
there's something wrong with knfsd's tcp code. Which makes sense, I
guess.


Not sure what tipc is either....

I'd think this sort of allocation would be limited by the number of
sockets times the size of the send and receive buffers.
svc_xprt.c:svc_check_conn_limits() claims to be limiting the number of
sockets to (nrthreads+3)*20. (You aren't hitting the "too many open
connections" printk there, are you?) The total buffer size should be
bounded by something like 4 megs.

--b.


Yes, we are getting a continuous stream of the "too many open
connections" messages scrolling across our logs.

That's interesting! So we should probably look more closely at the
svc_check_conn_limits() behavior. I wonder whether some pathological
behavior is triggered in the case where you're constantly over the limit
it's trying to enforce.

(Remind me how many active clients you have?)



We currently are hitting it with somewhere around 600 to 800 nodes, but
it can go up to over 1000 nodes. We are artificially starving it with a
limited number of threads (2 to 3) right now on the older 2.6.22.14
kernel because of that memory issue (which may or may not be tso
related)...

So with that many clients all making requests to the server at once,
we'd start hitting that (serv->sv_nrthreads+3)*20 limit when the number
of threads was set to less than 30-50. That doesn't seem to be the
point where you're seeing a change in behavior, though.
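
The arithmetic behind that estimate can be sketched in Python. This
assumes one TCP connection per client, which is a simplification, and
the helper names conn_limit and min_threads_for are illustrative; only
the (nrthreads+3)*20 formula comes from the thread.

```python
def conn_limit(nrthreads):
    """Socket limit quoted in the thread: (nrthreads + 3) * 20."""
    return (nrthreads + 3) * 20


def min_threads_for(clients):
    """Smallest thread count whose limit is at least `clients`.

    Assumes one connection per client (a simplification).
    """
    needed = -(-clients // 20) - 3  # ceiling division, then undo the +3
    return max(0, needed)


for clients in (600, 800, 1000):
    threads = min_threads_for(clients)
    print(f"{clients} clients -> {threads} threads "
          f"(limit {conn_limit(threads)})")
# 600 clients -> 27 threads (limit 600)
# 800 clients -> 37 threads (limit 800)
# 1000 clients -> 47 threads (limit 1000)
```

Under that assumption the thresholds fall at 27, 37 and 47 threads,
roughly in line with the 30-50 range given above. With only 2 or 3
threads the limit is 100 or 120 connections, far below the client count.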


We were estimating that between 40 and 50 threads was the cutoff for
being able to service all of the (current) requests at once. I haven't
ramped back up to that level yet. I wasn't comfortable yet with letting
it all hang back out just in case we get into that hellish mode again;
it can be a pain to try and get into those systems once they are
overloaded (even over serial, sometimes the login can just time out).
We had to actually bring a second option online to help alleviate some
of the back congestion because the servers couldn't handle the workload.


I really want to move forward to the newer kernel, but we had an issue
where clients all of a sudden wouldn't connect, yet other clients
could, to the exact same server NFS export. I had booted the server
into the 2.6.25.4 kernel at the time, and the other admin set us back to
the 2.6.22.14 to see if that was it. The clients started working again,
and he left it there (he also took out my options in the exports file,
no_subtree_check and insecure). I know that we are running over the
number of privileged ports, and we probably need the insecure, but I am
having a hard time wrapping my head around all of the problems at
once....

The secure ports limitation should be a problem for a client that does a
lot of nfs mounts, not for a server with a lot of clients.



Ah, OK. That makes sense.

--b.

