Re: why sort is soo slow on RH9?

From: Ed Blackman (news_at_edgewood.to)
Date: 07/08/03


Date: Tue, 8 Jul 2003 16:02:42 -0400

In article <beetgv$gpf@rac2.wam.umd.edu>, Yi Jin wrote:
> Thank you. I got your point. I am not sure what was the original
> default LANG setting.

Issuing "locale" in a command window where you haven't changed the
LANG variable will show you your locale settings.
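
As a rough illustration (a sketch, assuming the usual "locale" utility is on your PATH; the variable name is my own), these commands show the settings that matter here. Keep in mind that LC_ALL, if set and non-empty, overrides LC_COLLATE, which in turn overrides LANG.

```shell
# Show every locale category in effect for the current shell.
locale

# Show only the collation category, which is the one sort consults.
collate_line=$(locale | grep '^LC_COLLATE=')
echo "$collate_line"

# List the locales installed on this machine.
locale -a
```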

> But after I set the environment LANG to C, the time for running sort
> shortened to a few seconds, from a few hours, for key sorting one
> file (150000 lines, 12 fields each line).

I'm pretty sure that you could change just LC_COLLATE (the locale
variable that controls sorting behavior) and get the same performance
improvement without affecting your other language settings. I just use
LANG for testing because it's shorter than LC_COLLATE, and I'm lazy.
<grin>

For a numeric sort it doesn't matter, but if you have non-numeric
keys to sort, you might not want to use the C locale. Compare the
output of "echo -e 'b\na\nB\nAa\nA\na.\naa' | LC_COLLATE=C sort" and
"echo -e 'b\na\nB\nAa\nA\na.\naa' | LC_COLLATE=en_US sort"
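
Here is the same comparison as a small script, in case the one-liners are awkward to paste. It is only a sketch: it uses printf rather than echo -e for portability, the variable names are my own, and it assumes the en_US locale is installed (check with "locale -a"). LC_ALL is emptied for the sort commands because a non-empty LC_ALL would override LC_COLLATE.

```shell
# The same seven test lines as above, one per line.
input=$(printf '%s\n' b a B Aa A a. aa)

# Byte-value collation. An empty LC_ALL counts as unset, so LC_COLLATE applies.
c_sorted=$(printf '%s\n' "$input" | LC_ALL= LC_COLLATE=C sort)
echo "--- LC_COLLATE=C ---"
printf '%s\n' "$c_sorted"

# Locale-aware collation; the exact order depends on your libc's en_US tables.
en_sorted=$(printf '%s\n' "$input" | LC_ALL= LC_COLLATE=en_US sort)
echo "--- LC_COLLATE=en_US ---"
printf '%s\n' "$en_sorted"
```

In the C locale the result is strictly by byte value: A, Aa, B, a, a., aa, b. All uppercase letters come before all lowercase ones, and "a." comes before "aa" because "." has a lower byte value than "a".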

> Why the LANG setting plays such a huge difference?

I'm pretty sure it has to do with UTF8 processing. I suspected UTF8
because you said it happened on RH9 and not on RH8 or earlier, and
I'm pretty sure that RedHat changed the default locale to be
UTF8-based in RH9.

Without looking at the sort and libc source code, I'd guess that this
is what happens:
1) sort passes two input lines as byte arrays into the libc string
comparison routine,
2) the routine converts the byte arrays into characters based on the
current locale,
3) compares them,
4) discards the converted characters and
5) returns a value that depends on the collation order of the arrays.

Then the process begins again with another two lines. Step 2 is
required because in UTF8, characters can be composed of more than one
byte, and you can't just sort on byte value. In the C locale, the
collation order is based on byte value, so no conversion is necessary.
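
To make the byte-value point concrete, here is a small sketch (the sample words and variable names are mine, not from the original problem). It sorts three one-character lines, one of which is a two-byte UTF8 character, under the C locale.

```shell
# \303\251 is the two-byte UTF8 encoding of "é" (U+00E9).
words=$(printf 'z\n\303\251\ne\n')

# C locale: lines are compared byte by byte, with no character decoding.
byte_order=$(printf '%s\n' "$words" | LC_ALL=C sort)
printf '%s\n' "$byte_order"
# Byte order: e (0x65), z (0x7A), then é (0xC3 0xA9).

# For comparison, a UTF8 locale (assumes en_US.UTF-8 is installed).
printf '%s\n' "$words" | LC_ALL=en_US.UTF-8 sort
```

Under the C locale "é" lands after "z" because its first byte, 0xC3, is larger than any ASCII letter. A UTF8 locale has to decode the bytes first in order to place it near "e", which is the extra work described in step 2.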

I'm guessing that the overhead of those conversions is what you're
seeing. Of course, this is just speculation: if someone else knows
better, I'd appreciate it if they'd speak up and enlighten us.
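
If you want to measure the difference yourself, something like the following should do. It is a sketch under stated assumptions: the generated data is made up (only the line and field counts match the file described above), the sort key is arbitrary, en_US.UTF-8 must be installed for the second run to mean anything, and the timings will vary a great deal with the libc and sort versions.

```shell
# Build hypothetical test data: 150000 lines, 12 space-separated fields each.
data=$(mktemp)
awk 'BEGIN {
    srand(1)
    for (i = 0; i < 150000; i++) {
        line = ""
        for (f = 0; f < 12; f++) {
            # Each field is a lowercase letter, an uppercase letter, and a number.
            field = sprintf("%c%c%d", 97 + int(rand() * 26), 65 + int(rand() * 26), int(rand() * 1000))
            line = line (f ? " " : "") field
        }
        print line
    }
}' > "$data"
lines=$(wc -l < "$data")

# Time a key sort on the third field under each locale (whole seconds only).
for loc in C en_US.UTF-8; do
    start=$(date +%s)
    LC_ALL=$loc sort -k 3,3 "$data" > /dev/null
    end=$(date +%s)
    echo "LC_ALL=$loc: $((end - start)) s for $lines lines"
done

rm -f "$data"
```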

Ed


