Re: _mbslen vs strlen

From: Chris Vine (chris_at_cvine--nospam--.freeserve.co.uk)
Date: 08/19/05


Date: Fri, 19 Aug 2005 11:24:08 +0100

stork wrote:

> I am porting a C++ Windows application to Linux. For various reasons,
> many foolish, I did not use C++ strings and have C++ classes but with
> things like strlen or wcslen, etc. My Windows application was unicode
> but I have read that wchar_t in Linux is not commonly used because it
> is 32 bytes and everything in Linux is UTF-8. So I am, as an initial
> step, making my Windows stuff work with UTF-8 and do conversions behind
> the scenes calls to the W functions of Windows calls in order to do so.
> I change everything from wchar_t to char and now have to deal with the
> fallout of multibyte strings.
>
> Microsoft has a set of functions like _mbslen for multibyte strings.
> The only reference I have seen to those is in Wine, which says to me
> that such is not the GNU/Linux way. Under GNU, what does strlen
> return? The length in characters, or the length in bytes? I have read
> that NULL checking works under UTF8 because of the way the multibyte
> characters are mapped, so I think I get strlen returning the length of
> bytes.

C++ on Linux has wide characters (wchar_t) as well as single-byte
characters (char) - both are mandated by the standard (although the size of
wchar_t is not; with gcc on Linux it is 32 bits, the same size as int). So
in terms of the language you can happily use 32-bit characters and so
accommodate UCS-4.
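
As a small sketch of the point about wchar_t size, the following reports the width on whatever platform it is compiled for. The helper names (wchar_bits, wide_units) are made up for illustration and are not part of any library.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cwchar>

// Width of wchar_t in bits on the platform this is compiled for.
// The standard does not fix this value; it differs between systems.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// wcslen counts wchar_t elements, not bytes.
std::size_t wide_units(const wchar_t* s) {
    return std::wcslen(s);
}

// Illustrative self-check (hypothetical helper).
void wide_demo() {
    assert(wide_units(L"abc") == 3);
    // The byte size of the same text depends on sizeof(wchar_t).
    assert(sizeof(L"abc") == 4 * sizeof(wchar_t));
}
```

Code that assumes a particular wchar_t width will not port cleanly, which is one reason to keep the internal representation independent of it.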

However, on Linux most Unicode-aware GUI libraries use UTF-8 for their user
interfaces (although internally they may implement their Unicode support
using wide characters), whereas the Windows "W" interfaces use UTF-16, with
a 16-bit wchar_t.

Unicode-aware libraries for Linux would normally have functions for
converting from one codeset to another - glib, as used by GTK+ and GNOME,
provides this, for example. In view of this, for programs using such
libraries, it is usually best to code everything in terms of narrow
characters (UTF-8 held in char) so that you only have to convert, if at
all, for input and output from outside the program, but it is not a
requirement to do so.
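
To give an idea of what such a conversion at the program boundary involves, here is a hand-written sketch that decodes UTF-8 into 32-bit code points. It is not glib's API, and the function name utf8_to_utf32 is invented for this example; it assumes well-formed input and does no validation, so a real program should use the library's own converters instead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode a UTF-8 byte string into UTF-32 code points.
// Assumption: the input is well-formed UTF-8; nothing is validated.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes to read
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low six bits.
        for (; extra > 0 && i < in.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

// Illustrative self-check: the euro sign is three bytes in UTF-8.
void convert_demo() {
    std::u32string r = utf8_to_utf32("\xE2\x82\xAC");
    assert(r.size() == 1 && r[0] == char32_t(0x20AC));
}
```

Keeping strings as UTF-8 internally means a routine like this is only needed where the program meets an interface that wants a different encoding.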

strlen() will always return the number of bytes in a null-terminated string,
not counting the terminating null. This has nothing to do with "GNU" but is
required by the standard. glib has g_utf8_strlen() to provide the number of
characters (rather than bytes) in a UTF-8 null-terminated string. Other
libraries will have something similar.
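
The difference between the two counts can be sketched without any library. The function below is a stand-in written for this example (it is not g_utf8_strlen() itself); it counts code points by skipping UTF-8 continuation bytes, and assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Count the code points in a null-terminated UTF-8 string.
// Continuation bytes have the bit pattern 10xxxxxx, so every byte
// that does not match that pattern starts a new character.
// Assumption: the input is valid UTF-8.
std::size_t utf8_length(const char* s) {
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Illustrative self-check: "naive" with i-diaeresis (U+00EF),
// which takes two bytes in UTF-8.
void length_demo() {
    const char* text = "na" "\xC3\xAF" "ve";
    assert(std::strlen(text) == 6);   // bytes
    assert(utf8_length(text) == 5);   // characters
}
```

Buffer sizes and copying still need the byte count from strlen(); the character count only matters for things like display width or per-character limits.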

Chris

-- 
To reply by e-mail remove the --nospam-- in the address.

