Re: Document encoding



sebzzz@xxxxxxxxx staggered into the Black Sun and said:
I didn't know that when I save a simple text document, it [gets saved]
with an encoding.

New to computers?

I encountered problems when I was designing [a] web page with Nvu
telling Nvu that my documents were in iso-8859-1, and [then I] changed
the charset to utf-8 after that in [another] editor. All the special
characters [were] missing [after that].

I came to the conclusion that depending on what encoding the file is
saved [in], there [are] special codes inside the file invisible to us

Who's "us", white man? :-) It's easy to tell what encoding is being
used if you're using a real text editor, or a hex editor. utf-8 is
going to be the winner in the long run for various reasons. But
encoding is less important if you're using HTML or XML, since non-ASCII
chars in those formats can be written as entities like &eacute; anyway.
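If you don't have a hex editor handy, a few lines of Python show the same thing. This is only a sketch; the sample string "café" is made up for illustration. It prints the bytes each encoding produces, then shows what happens when a file is read back with the wrong encoding.

```python
# Hypothetical sample text; any string with a non-ASCII char would do.
text = "café"

latin1_bytes = text.encode("iso-8859-1")
utf8_bytes = text.encode("utf-8")

print(latin1_bytes.hex(" "))  # 63 61 66 e9
print(utf8_bytes.hex(" "))    # 63 61 66 c3 a9

# UTF-8 bytes read as ISO-8859-1: each extra byte shows up as its own char.
mojibake = utf8_bytes.decode("iso-8859-1")
print(mojibake)  # cafÃ©

# ISO-8859-1 bytes read as UTF-8: the lone 0xE9 is not a valid sequence,
# so a lenient decoder swaps in U+FFFD and the special char is lost.
lost = latin1_bytes.decode("utf-8", errors="replace")
print(lost == "caf\ufffd")  # True
```

The second case is probably what happened with the web page: the bytes on disk were ISO-8859-1, but the charset label said utf-8.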

What are the differences between iso-8859-1 and utf-8?

ISO-8859-1 defines chars 0-127 and 160-255, one byte per char; it's the
Western Europe code page and has practically every char you need to
write in western European languages. UTF-8 defines ... well, almost
every char that exists. ASCII chars stay as single bytes; anything else
is encoded as a sequence of 2 to 4 bytes, where the first byte says how
long the sequence is. Check out utf-8 on PickyWeedia for the full scoop.
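To see how UTF-8 lays its bytes out, here is a small Python sketch. The helper name utf8_layout is made up for this example. ASCII chars come out as one byte starting with 0; everything else is a lead byte starting with 110, 1110 or 11110 (for 2, 3 or 4 bytes total), followed by continuation bytes starting with 10.

```python
def utf8_layout(ch):
    """Return the UTF-8 bytes of one character as 8-bit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]

# Sample chars picked for illustration: 1, 2, 3 and 4 bytes in UTF-8.
for ch in ["A", "é", "€", "😀"]:
    bits = utf8_layout(ch)
    print(f"U+{ord(ch):04X}", len(bits), "byte(s):", " ".join(bits))
```

In ISO-8859-1 the same "é" is the single byte 0xE9, which is why the two encodings agree on plain ASCII text and disagree on everything else.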

How can I check the encoding of a file?

/usr/bin/file generally works. It guesses from the file's contents, so
treat the answer as a hint rather than a guarantee.
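The same kind of guess can be sketched in Python. This is not how file does it; it's just a rough heuristic of my own, and guess_encoding is a made-up name. The idea: text that decodes as strict UTF-8 almost certainly is UTF-8, while 8-bit encodings like ISO-8859-1 can't be told apart from each other by looking at bytes alone.

```python
import codecs

def guess_encoding(data):
    """Rough guess for a bytes object; a heuristic, not a guarantee."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 (with BOM)"
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Any byte string is valid ISO-8859-1, so this cannot be narrowed down.
        return "unknown 8-bit (iso-8859-1 or similar)"

# Usage, with a hypothetical file name:
#   with open("page.html", "rb") as f:
#       print(guess_encoding(f.read()))
```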

How can I change the encoding of a file? What encoding I should use to
write text files?

/usr/bin/recode can convert among tons of text encodings. UTF-8 is
probably the most future-proof format.
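If recode isn't installed, the conversion itself is short enough to sketch in Python. The function name and its defaults are assumptions for this example. You have to know (or have guessed correctly) what the source encoding is, because the bytes won't tell you.

```python
def convert_encoding(src_path, dst_path, src_enc="iso-8859-1", dst_enc="utf-8"):
    """Re-encode a text file. newline="" leaves the EOL markers untouched."""
    with open(src_path, "r", encoding=src_enc, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_enc, newline="") as dst:
        dst.write(text)

# Usage, with hypothetical file names:
#   convert_encoding("old-latin1.txt", "new-utf8.txt")
```

Writing to a separate output file means a wrong guess about the source encoding doesn't destroy the original.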

What about those CR LF end of line [markers]? What does this mean?

Unix: \n means EOL
old Mac: \r means EOL (not used in the modern world AFAICT)
DOS: \r\n means EOL

....this is mostly historical, unless you have a program that only
expects a certain type of EOL and will barf if it sees the other.
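To find out which EOL style a file actually uses, counting the markers is enough. A minimal Python sketch, with count_eols being a name invented for this example:

```python
def count_eols(data):
    """Count each EOL style in a bytes object."""
    crlf = data.count(b"\r\n")
    # Bare \r and bare \n are whatever is left after removing the \r\n pairs.
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf
    return {"dos": crlf, "old_mac": cr, "unix": lf}

# Usage, with a hypothetical file name:
#   with open("notes.txt", "rb") as f:
#       print(count_eols(f.read()))
```

A file with more than one non-zero count has mixed line endings, which usually means it has been edited on several systems.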

What should be used and what difference it makes in a document?

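In general, \r\n is the safer choice for plain documents, because while
many Unix programs can handle DOS EOL, DOS programs may not handle Unix
EOL properly. The exception is anything the Unix side executes or
parses strictly (shell scripts, for one), where a stray \r causes
trouble.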
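Converting between the styles is a couple of byte replacements. A Python sketch, with to_unix and to_dos as made-up helper names; it works on raw bytes so the text encoding doesn't matter, assuming an ASCII-compatible encoding such as ISO-8859-1 or UTF-8.

```python
def to_unix(data):
    """Turn \\r\\n and bare \\r into \\n. Order matters: pairs go first."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

def to_dos(data):
    """Normalize to \\n first so existing \\r\\n pairs are not doubled."""
    return to_unix(data).replace(b"\n", b"\r\n")
```

Going through to_unix first is what keeps to_dos safe to run on a file that is already in DOS format.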

--
You have me mixed up with more creative ways of being stupid.
--MegaHAL, trained on random gibberish
Matt G|There is no Darkness in Eternity/But only Light too dim for us to see