Re: NFS Character Issues on Suse 9.1

From: Arthur Hagen (art_at_broomstick.com)
Date: 09/18/04


Date: Sat, 18 Sep 2004 12:10:59 -0400

Jo Schulze <antispam@feuersee.de> wrote:
> Arthur Hagen wrote:
>
>> It also breaks a great deal of existing utilities that rely on an
>> exact 1:1 ratio between characters and bytes,
>
> A lot of GNU utils have been upgraded to support UTF-8. Relying on a 1
> byte/char ratio for text processing is (always has been) obsolete, see
> below.

Text processing is one thing. Character processing is another, and for
character processing, a 1:1 ratio (whether it's 8 bits or 16 bits) has
always been a given.

>> is a catalyst for buffer
>> overflows,
>
> Myth. A buffer is a thing measured in bytes, this has nothing to do
> with how these bytes are interpreted. The only thing that changes is
> that you may need to increase the size of the buffer in order to
> allow the same # of chars (_not_ bytes) to be entered. The overflow
> check stays the same.

Which overflow check? The buf=alloc(strlen((char *)string)+1) construct is
bred into generations of programmers, and if your strings get converted to
Unicode, they will no longer fit. Or, your app may have an input field that
only allows 78 characters. A buffer of length 80 is then enough. Or was,
until Unicode.
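
A minimal sketch of the 78-character field case, assuming valid UTF-8
input. The helper names utf8_char_count and copy_checked are made up for
illustration and are not from any particular library. It shows that the
character count and the byte count strlen() reports can differ, so a
buffer sized by characters needs a check in bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count characters (code points) in a UTF-8 string
 * by skipping continuation bytes, which all look like 10xxxxxx.
 * Assumes the input is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: copy src into a fixed-size buffer.
 * The check is done in bytes (strlen), not in characters.
 * Returns 0 on success, -1 if src plus its terminator does not fit. */
int copy_checked(char *dst, size_t dst_size, const char *src)
{
    size_t need;

    assert(dst != NULL && src != NULL);
    need = strlen(src) + 1;   /* bytes, including the terminating NUL */
    if (need > dst_size)
        return -1;
    memcpy(dst, src, need);
    return 0;
}
```

With these, a string of 78 copies of "Å" (two bytes each in UTF-8) counts
as 78 characters but 156 bytes, so copy_checked() into an 80-byte buffer
refuses it, while the same 78 characters in an 8-bit encoding would have
fit.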

>> and make certain functions (like search) orders of
>> magnitude slower.
>
> UTF-8 has been well designed with sorting in mind, eg. cyrillic is a
> lot easier to sort in UTF-8 than in any other encoding.

It's still slower for western alphabets (which can easily mask off case),
and as stated, searches are REALLY much slower, among other things because
you can't jump to an arbitrary place in the string and get a letter -- the
byte you land on may be the middle of a letter that started one or more
bytes earlier.
It's somewhat akin to the problem that DaNo users face when trying to treat
"Aa" and "Å" as equivalent (a problem that Unicode doesn't solve either).

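A small sketch of the random-access point above, again assuming valid
UTF-8. The function name utf8_char_start is hypothetical. After jumping
to an arbitrary byte offset, the code has to back up over continuation
bytes before it can read a whole letter, a step that a fixed-width
encoding does not need.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given a byte offset pos into a valid UTF-8
 * string, return the offset of the first byte of the character that
 * contains pos. Continuation bytes have the bit pattern 10xxxxxx, so
 * we step backwards until we reach a byte that is not one. */
size_t utf8_char_start(const char *s, size_t pos)
{
    assert(s != NULL);
    while (pos > 0 && ((unsigned char)s[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}
```

For the four-byte string "a", "Å", "b" (with "Å" encoded as the bytes
0xC3 0x85), offset 2 lands in the middle of "Å" and the function returns
1; offsets 0, 1 and 3 already point at the start of a character and are
returned unchanged.
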
> But you would agree that (after having ironed that out) it was worth
> to decide _not_ to stick to US-ASCII 'til eternity?

Indeed. But I also think that the early implementers got fewer rewards
than problems, which they had to sort out for the rest of us.
In a production situation, I don't want to be an early adopter. That's
reckless.

Silicon Graphics did a Smart Thing with IRIX -- they release their IRIX
distribution minor version and revision upgrades as two different
versions -- one "maintenance" version that contains well-tested software,
and one "feature" version that contains the newest software. The install
tool won't let you install a "feature" application to a production system
without reading warnings and switching over to feature mode, after which the
OS is identified as 6.5.15f instead of 6.5.15.
In this case, having Unicode as default definitely belongs in a "feature"
mode.

Regards,

-- 
*Art

