Re: why sort is soo slow on RH9?
From: Ed Blackman (news_at_edgewood.to)
Date: 07/08/03
- In reply to: Yi Jin: "Re: why sort is soo slow on RH9?"
- Next in thread: Villy Kruse: "Re: why sort is soo slow on RH9?"
- Reply: Villy Kruse: "Re: why sort is soo slow on RH9?"
Date: Tue, 8 Jul 2003 16:02:42 -0400
In article <beetgv$gpf@rac2.wam.umd.edu>, Yi Jin wrote:
> Thank you. I got your point. I am not sure what was the original
> default LANG setting.
Issuing "locale" in a command window where you haven't changed the
LANG variable will show you your locale settings.
> But after I set the environment LANG to C, the time for running sort
> shortened to a few seconds, from a few hours, for key sorting one
> file (150000 lines, 12 fields each line).
I'm pretty sure that you could just change LC_COLLATE (the locale
variable responsible for sorting behavior) and see the performance
improvement without affecting other language settings. I just use
LANG for testing because it's shorter than LC_COLLATE, and I'm lazy.
<grin>
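A minimal sketch of the LC_COLLATE-only approach, for a POSIX-style shell. One thing to watch: LC_ALL, if it is set and non-empty, takes precedence over LC_COLLATE, so the sketch clears it. The three sample keys are made up for illustration.

```shell
# Show the settings in effect before changing anything (skipped if the
# locale utility is not installed).
command -v locale >/dev/null 2>&1 && locale

# Override collation for a single command only; the shell's own
# environment is left untouched. An empty LC_ALL is treated as unset.
one_off=$(printf 'b\nB\na\n' | LC_ALL= LC_COLLATE=C sort)
printf '%s\n' "$one_off"

# Or override it for the rest of the session.
export LC_COLLATE=C
unset LC_ALL   # assumption: nothing else in this session relies on LC_ALL
session=$(printf 'b\nB\na\n' | sort)
printf '%s\n' "$session"
```

The one-off form is probably the safer habit, since it can't leak into other programs started from the same shell.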
For a numeric sort it doesn't matter, but if you have non-numeric
keys to sort, you might not want to use the C locale. Compare the
output of "echo -e 'b\na\nB\nAa\nA\na.\naa' | LC_COLLATE=C sort" and
"echo -e 'b\na\nB\nAa\nA\na.\naa' | LC_COLLATE=en_US sort"
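Here is the same comparison as a small script, using printf instead of "echo -e" since not every shell's echo understands -e. The C-locale result is shown in the comments because it follows directly from ASCII byte values. The en_US result is not shown, because it depends on which locales are installed and on the libc version (the name may need a suffix such as en_US.UTF-8 on your system).

```shell
# Same seven keys as in the echo example above.
keys='b\na\nB\nAa\nA\na.\naa\n'

# Byte-value collation: every uppercase ASCII letter sorts before every
# lowercase one, and '.' sorts before any letter.
c_sorted=$(printf "$keys" | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
# A
# Aa
# B
# a
# a.
# aa
# b

# Locale-aware collation. If the named locale is not installed, sort
# typically falls back to C-locale behavior, so check with "locale -a" first.
en_sorted=$(printf "$keys" | LC_ALL= LC_COLLATE=en_US sort)
printf '%s\n' "$en_sorted"
```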
> Why the LANG setting plays such a huge difference?
I'm pretty sure it has to do with UTF8 processing. The reason I
suspected UTF8 is because you said it happened on RH9 and not RH8 and
earlier, and I'm pretty sure that RedHat changed the default locale to
be UTF8-based in RH9.
Without looking at the sort and libc source code, I'd guess that this
is what happens:
1) sort passes input lines as byte arrays into the libc sort routine,
2) the routine converts the byte arrays into characters based on the
current locale,
3) compares them,
4) discards the converted characters and
5) returns a value that depends on the collation order of the arrays.
Then the process begins again with another two lines. Step 2 is
required because in UTF8 a character can be composed of more than one
byte, and the locale's collation order is defined on characters, so
you can't just sort on byte value. In the C locale, the collation
order is based on byte value, so no conversion is necessary.
I'm guessing that the overhead of those conversions is what you're
seeing. Of course, this is just speculation: if someone else knows
better, I'd appreciate it if they'd speak up and enlighten us.
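Short of reading the source, one way to check the guess is to time the same sort under both settings. This is only a rough sketch: the input is made-up random text keys (only the 150000 line count comes from the post), time_sort is a hypothetical helper, and "date +%s" has whole-second resolution, so small differences won't show up.

```shell
# Throwaway input: 150000 made-up 8-letter keys.
tmp=$(mktemp)
awk 'BEGIN {
    srand(1)
    letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    for (i = 0; i < 150000; i++) {
        key = ""
        for (j = 0; j < 8; j++)
            key = key substr(letters, 1 + int(rand() * 52), 1)
        print key
    }
}' > "$tmp"

# Hypothetical helper: sort file $2 with LC_ALL set to $1 and print the
# elapsed wall-clock time in whole seconds.
time_sort() {
    start=$(date +%s)
    LC_ALL=$1 sort "$2" > /dev/null
    end=$(date +%s)
    echo $((end - start))
}

current_secs=$(time_sort "${LC_ALL-}" "$tmp")   # whatever locale is in effect now
c_secs=$(time_sort C "$tmp")                    # byte-value collation
echo "current locale: ${current_secs}s, C locale: ${c_secs}s"

lines=$(wc -l < "$tmp")
rm -f "$tmp"
```

The numbers will vary with the machine, the libc version and the keys themselves, so treat any gap as a hint about where the time goes rather than as proof of the mechanism.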
Ed