Re: Apache unicode question

From: prg (rdgentry1_at_cablelynx.com)
Date: 04/19/05


Date: 19 Apr 2005 09:13:58 -0700

alex wrote:
> If anyone here can answer this, I would appreciate it. Failing that,
> please point me to a better place to ask - I couldn't find one, nor a
> FAQ which deals with this.
>
> I am running Apache from Fedora Core 3. Some of the pages that have
> been created for the server use unicode characters. ...

What do you mean by "use unicode characters"? Hard-coded numeric
character references? Raw non-ASCII characters in the files?

> ... I don't know why
> they were created this way, we get lots of contributors to this site,

And thus a number of authoring platforms, tools, etc. all with their
own quirks/deficiencies/shortcuts :(

> and the person who was in charge of this died recently. But in any
> event, they were. The problem comes up with punctuation marks, which
> come up correctly in some browsers, but not in others. ...

Which browsers/versions? Do you really want to support every browser
quirk currently known to "force" a particular rendering?

> ... One of the
> pages that has the problem is
>
> http://diamond.boisestate.edu/gas/whowaswho/G/GrossmithGeorge.htm
>
> You should see, in some browsers, odd marks that are supposed to be
> quotes, apostrophes, etc.

"odd marks" means what? Squares? Asterisks? Platform or font
substitutes? Blanks?

Please give us the specific character encodings used (i.e., charset,
codepage, OS platform, etc.). None is declared in the above page, so
each browser is likely to "do its best" according to how the user has
configured the browser.
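
To check what a page actually declares, something like the following
Python sketch may help. It is only an illustration: the function names
are made up here, and it assumes you already have the Content-Type
header value and the page source as strings. It looks for a charset
in the HTTP header and in an HTML 4 style meta tag, and returns None
when nothing is declared (which is the case where browsers guess).

```python
import re

# Patterns are deliberately simple; this is a sketch, not an HTML parser.
CHARSET_RE = re.compile(r"""charset\s*=\s*["']?([^\s;"'>]+)""", re.IGNORECASE)
META_RE = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
HTTP_EQUIV_RE = re.compile(r"""http-equiv\s*=\s*["']?content-type""", re.IGNORECASE)


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    match = CHARSET_RE.search(header_value)
    return match.group(1).lower() if match else None


def charset_from_meta(html_text):
    """Return the charset from a <meta http-equiv="Content-Type"> tag, or None."""
    for tag in META_RE.findall(html_text):
        if HTTP_EQUIV_RE.search(tag):
            found = charset_from_content_type(tag)
            if found:
                return found
    return None
```

If both functions return None for a page, neither the server nor the
document says how the bytes should be read.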

> Is there some switch I can throw so that these get interpreted
> "correctly", or at least differently, as they are viewed? ...

What do you mean by "correctly"? Do you want _typographical_
apostrophe and quote? Just ASCII?

> ... Looking at
> the file under od, these "problem characters" are the only ones that
> have extended unicode encodings - all the others are straight ascii.

Do you mean running the file as stored on the server through od? As
delivered to some clients? What char codes does it show?
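
If od output is awkward to read, a few lines of Python can list just
the bytes outside the ASCII range. This is a rough sketch with
hypothetical function names, not a finished tool. As a rule of thumb,
lone bytes such as 0x91-0x94 point to Windows-1252 punctuation, while
three-byte runs such as E2 80 99 point to UTF-8.

```python
def non_ascii_bytes(data):
    """Return (offset, byte value) pairs for every byte above 0x7F."""
    return [(offset, value) for offset, value in enumerate(data) if value > 0x7F]


def report(path):
    """Print offset, hex and octal value of each non-ASCII byte in a file."""
    with open(path, "rb") as handle:
        data = handle.read()
    for offset, value in non_ascii_bytes(data):
        print(f"{offset:08d}  0x{value:02X}  0o{value:03o}")
```

Run report() on the file as stored on the server, and again on a copy
saved from a browser, to see whether the bytes change in transit.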

> I have a feeling the page in question may have been created in
> Europe, while most of our pages were created in the US, if that is
> relevant.

The chars mentioned are rendered differently by my browser (Konqueror
on Linux) in "automatic modes". In "manual" modes (View|Set Encoding)
it varies even more depending on what I select ;)

Depending on the authoring tools and how they were configured (or what
their default behavior is), some chars in the source files may be
encoded inconsistently. How they will be interpreted may be a
crapshoot ;)

There are ways to _attempt_ to force client browsers to render them as
"extended, typographical" chars, but the users' browsers may:

-- use a different, user-configured default charset (encoding)
-- not have an appropriate font (or substitute) to render them
-- simply ignore "instructions" from the server despite all your
efforts
-- have a bug

These and some other chars are particularly difficult to handle,
especially without some MS TrueType fonts on the computer. On Windows
they map into the extended "ANSI" range (0x80-0x9F), at code page
positions peculiar to Windows. But even the Unicode standard has
trouble with these ;)

I did not look, but I seem to recall some scripts running around that
will "sanitize/scrub" source files looking for problematic chars and
inserting the site "standard". Perl?
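
I don't have such a script at hand, but the idea could be sketched
like this, in Python rather than Perl. It is an illustration only:
the table and function name are made up, and it assumes the files are
single-byte Windows-1252. Do not run it on UTF-8 files, since it would
mangle their multi-byte sequences. It rewrites the usual problem bytes
as numeric character references, which do not depend on the charset
the browser picks.

```python
# Windows-1252 punctuation bytes mapped to HTML numeric character references.
CP1252_PUNCTUATION = {
    0x85: b"&#8230;",  # horizontal ellipsis
    0x91: b"&#8216;",  # left single quotation mark
    0x92: b"&#8217;",  # right single quotation mark (apostrophe)
    0x93: b"&#8220;",  # left double quotation mark
    0x94: b"&#8221;",  # right double quotation mark
    0x96: b"&#8211;",  # en dash
    0x97: b"&#8212;",  # em dash
}


def scrub(data):
    """Replace known Windows-1252 punctuation bytes; leave all others alone."""
    out = bytearray()
    for value in data:
        out += CP1252_PUNCTUATION.get(value, bytes([value]))
    return bytes(out)
```

If the site "standard" is plain ASCII instead, swap the table values
for b"'", b'"', b"-" and so on.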

Here are some links that _may_ help somewhat, or at least provide some
ideas on how to achieve site-wide consistency (without CSS).

http://www.w3.org/TR/REC-html40/charset.html
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
http://glasnost.beeznest.org/articles/139
http://httpd.apache.org/docs/mod/core.html#adddefaultcharset
http://httpd.apache.org/docs/content-negotiation.html
http://www.i18nguy.com/markup/serving.html

Google searches:
http://www.google.com/search?&q=html%20encode%20unicode
http://www.google.com/search?&q=apache+html+encode+unicode
http://www.google.com/search?&q=apache+html+charset

good luck,
prg


