Re: [opensuse] Another slightly OT c question, howto handle extended ascii chars? >127



On Monday 31 August 2009 05:15:58 Michal Hrusecky wrote:
David C. Rankin - 17:11 29.08.09 wrote:
Listmates,

Hi,

I'm parsing output that has the degree symbol in it in C. The character
code for the symbol is 167, but of course the ASCII character set is
limited to 0-127.

Well, as the degree symbol is non-ASCII, its code depends on the encoding
you are using.

Believe it or not, using a cut-n-paste into the strtok delimiter set
works, but that just feels like a kludge. Example:

const char delimiters[] = " +°,;:!-";
<snip 1st call to strtok>
token = strtok (NULL, delimiters);

Will break the string on ° correctly. But looking at how C is handling
this kludge causes concern:

for (i=0;i<strlen(delimiters);i++) {
printf("%c %d\n", delimiters[i],delimiters[i]);
}

Yields:

32
+ 43
??? -62
??? -80
, 44
; 59
: 58
! 33
- 45

Hmm, any time I see the little ??? character, that's a bad sign. So, is
there any trick to handling the chars that are outside of the normal
ASCII chars, but that we seem to run into all the time? Is this the area
of the thinly defined wchar_t?

You see these ??? because your terminal doesn't know how to handle this
character. I would guess that you are using iso8859-1 somewhere
(probably in sources) and utf-8 somewhere else. You don't have to worry,
as C can handle non-ASCII characters very well: char can handle 8 bits,
so it can handle all 256 characters in all one-byte encodings (like
iso8859-1). wchar_t is used for characters that do not fit in one byte.

--
Michal Hrusecky

Package Maintainer
SUSE LINUX, s.r.o
e-mail: mhrusecky@xxxxxxx

Hmmm. Michal suggests you might be using iso8859-1 and utf-8 in different
places. I suspect something like that too, but I also wonder if you
transposed digits of the character code you reported above. "man iso_8859-1"
shows the DEGREE SIGN code to be 176, not 167. A byte value of 176 (hex B0)
stored in a signed char would be printed as -80.

The fact that your program output included -62 (hex C2) immediately before
the -80 leads me to believe that the encoding for the degree symbol in the C
source is UTF-8. The byte sequence 0xc2 0xb0 would be the UTF-8 encoding for
the degree symbol.

If you want to reduce some of your confusion when dealing with 8-bit character
codes in C, you might consider always using "unsigned char" in place
of "char". That way greater than or less than comparisons will work in a
less confusing manner (though not necessarily exactly the same as lexical
order, depending on character coding).
--
Jim
