Re: Encoding issues with literal strings (C++)



Carlos Moreno wrote:

Hi,

I'm a bit puzzled by the following.

My application is a client/server, where the server runs on Linux
and is written in C++. The client runs on Windows and is written
in Borland C++ Builder 6.

Since it is in Spanish (most of the users are hispanophones), I
have many messages that the server sends that include characters
with accents (in HTML, &aacute;, &eacute;, etc.).

Some of these messages come from literal strings, with embedded
\x sequences to represent the special characters in ISO-8859-1
(or rather, Windows-1252).

For instance, in LATIN1 (ISO-8859-1) and in Windows-1252 encodings,
the a with acute accent has the code 0xE1; the o with acute accent
has code 0xF3 ... So I write those just like that (well, \xE1 and
\xF3 in the literal strings), and it works.

But I have two puzzling problems:

1) When I write the i with acute accent (which has code 0xED), that
one doesn't work (shows up as a Greek letter beta on the client,
and the letter after that one doesn't show).

When I do a hexdump -C of the executable, I see that the string
is not the same!!! The \xED character has been replaced by a
0xDF, and the character after the \xED is missing !!! Here:

The literal string is: " ..... espec\xEDficos ..... "

The hexdump output (the relevant line) is:

65 73 20 65 73 70 65 63 df 69 63 6f 73 2e 20 20 |es espec.icos. |

Why did that happen? How do I avoid it (without having to
manually edit the executable, that is)? I have the feeling that
it has to do with UTF-8 encoding, perhaps invalid UTF-8 sequences
that the compiler is "fixing"; but, if that is the case, why?


2) The other thing is that I'm getting a compiler warning of hex
escape sequence out of range for the \xF3 --- yet that character
shows up ok (the o with acute accent).


They're both the same problem: gcc is consuming more than two hex digits when it parses the \x escape inside your string literal. In your example:
"espec\xEDficos"

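Here gcc reads the escape as \xEDF, because the 'f' of "ficos" is itself a valid hex digit. The value 0xEDF does not fit in a char, hence the out-of-range warning. Only the low byte, 0xDF, is kept, which is what shows up in your hexdump, and the 'f' is missing because it was consumed as part of the escape. I confirmed this behavior in gcc 3.4.4.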
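The arithmetic can be checked with a small sketch. It assumes, as the hexdump suggests, that gcc keeps only the low 8 bits of the out-of-range value; the names parsed_escape and stored_byte are made up for illustration.

```cpp
#include <cassert>

// Value of the escape as parsed from "\xEDf": all three hex digits
// (E, D and the 'f' of "ficos") are taken as part of the escape.
constexpr unsigned parsed_escape = 0xEDF;

// Assumption: the compiler keeps only the low 8 bits of the value.
constexpr unsigned char stored_byte =
    static_cast<unsigned char>(parsed_escape & 0xFF);

// 0xEDF & 0xFF == 0xDF, the byte seen in the hexdump.
static_assert(stored_byte == 0xDF, "low byte of 0xEDF is 0xDF");
```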

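I had always thought C used only two digits for \x escapes, but the C and C++ standards place no limit on the number of hex digits: the escape runs until the first character that is not a hex digit, so gcc is conforming here. You can work around it by splitting the literal into two adjacent strings, so that the escape ends at the closing quote: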
"espec\xED""ficos"

This is valid C and C++ syntax. The compiler concatenates the two adjacent literals into a single string and produces the intended bytes.
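A minimal sketch of the workaround, for checking the resulting bytes. The function names are hypothetical, and the octal form is an alternative not mentioned above: octal escapes stop after at most three digits, so \355 (which equals 0xED) can be followed directly by 'f'.

```cpp
#include <cassert>
#include <string>

// Split literal: the \xED escape ends at the closing quote, so the
// following 'f' is an ordinary character.
inline std::string especificos_split()
{
    return "espec\xED" "ficos";
}

// Alternative: octal escape \355 == 0xED, limited to three digits.
inline std::string especificos_octal()
{
    return "espec\355ficos";
}

// Checks that both spellings produce the same 11 bytes, with 0xED
// at index 5 followed by 'f'.
inline void verify_workarounds()
{
    const std::string s = especificos_split();
    assert(s.size() == 11);
    assert(static_cast<unsigned char>(s[5]) == 0xED);
    assert(s[6] == 'f');
    assert(s == especificos_octal());
}
```

Calling verify_workarounds() from main() should pass silently if the compiler emits the expected bytes.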

Cheers,
John