Fog Creek Software
Discussion Board




character encodings

Thanks, Joel, for this nice article.

foo
Friday, October 10, 2003

Joel,
the link you provided points to another page at

http://ca3.php.net/manual/en/function.utf8-encode.php

so I really don't know what the fuss is about. Stating that PHP had "almost complete ignorance of character encoding issues" stretches the problem a little bit too far. For instance, Python doesn't have transparent Unicode handling for the string type either, but the language's Unicode support is good enough.

bar
Friday, October 10, 2003

I enjoyed your character encodings article and hope you will write more on the topic.  I noticed one thing that seemed a bit odd though.

When you give examples of non-ASCII characters (like "Gimel") you use an image instead of allowing the browser to render that character.

In the article you say that modern browsers can handle UTF-8...  and your page has the correct charset encoding set.  So why not allow the browser to render the character itself?

Sukotto
Friday, October 10, 2003

I've not learnt much from this article, but I wish I could have read it 18 months ago, when I had to figure this out. I may pass this to some of my colleagues who are still in a complete muddle over these issues (try to explain that an encoding is not a character set is not a code page and a code point is not a glyph...).
(Be prudent: on the Web, even when the Content-type is given, it may well be inaccurate.)
Will the next essay deal with fonts, collation, comparison, or "bidi" ?
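
The caution above about inaccurate Content-type headers can be sketched in Python 3. This is only an illustration under stated assumptions: `decode_body` and its fall-back-to-UTF-8 policy are hypothetical, not something taken from the article or from any standard.

```python
def decode_body(raw: bytes, declared: str) -> str:
    """Decode raw bytes with the declared charset, falling back to UTF-8.

    Hypothetical helper: the fallback policy is an assumption made for
    this sketch, not a standard algorithm.
    """
    try:
        return raw.decode(declared)
    except (UnicodeDecodeError, LookupError):
        # The declared charset could not decode the bytes, or is unknown.
        # Fall back to UTF-8 and replace undecodable bytes with U+FFFD.
        return raw.decode("utf-8", errors="replace")


body = "déjà vu".encode("utf-8")
print(decode_body(body, "ascii"))            # label cannot decode these bytes
print(decode_body(body, "no-such-charset"))  # label is not a known codec
```

A wrong label that happens to accept every byte sequence (latin-1, for instance) decodes without raising and simply yields garbage, so a check like this catches only some mislabelled pages.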

GP
Friday, October 10, 2003

Sukotto - IE cannot always simply install new character packages.  The OS must be set to support right-to-left character sets, and in my case, must also be set up to allow Indic character sets, before IE can use them.  For more esoteric character sets (like Elvish), you can't just assume people will be able to support them, as very few people will have fonts available to render those Unicode code points.

That's the beauty of "Arial Unicode MS", if you got it...

Ankur
Friday, October 10, 2003

BTW, you may be interested to find out that UCS-2, the Windows- and Java-style 16-bit Unicode encoding, may be in a little bit of trouble in the future.

http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF notes: 'In the early days of the Unicode consortium, there was some thought that Unicode would be a sixteen-bit design, and the notion of a "16-bit Unicode character" is still often encountered. ... While this ... notion is fundamentally wrong (because of the extra material in the "astral planes") it's hard to stamp out because it's almost right. I've never had the need to deal with a character outside of the BMP, and such beasts are likely to remain rare at least in the near term.'

For a bit more data, http://www.t-ide.com/2003/tcli18n.htm notes 'From Unicode version 3.1, the 16-bit limit was transcended for some rare writing systems, but also for the CJK Unified Ideographs Extension B - apparently, even 65536 code positions are not enough. The total count in Unicode 3.1 is 94,140 encoded characters, of which 70,207 are unified Han ideographs; the next biggest group are over 14000 Korean Hangul. And the number is growing. Unicode 4.0.0 is the latest version, reported to contain 96,248 Graphic characters, 134 format characters, 65 Control characters, 137,468 "private use", 2,048 surrogates, 66 noncharacters.  878,083 code points are reserved for what the future will bring. From www.unicode.org/versions/Unicode4.0.0 : "1,226 new character assignments were made to the Unicode Standard, Version 4.0 (over and above what was in Unicode 3.2). These additions include currency symbols, additional Latin and Cyrillic characters, the Limbu and Tai Le scripts; Yijing Hexagram symbols, Khmer symbols, Linear B syllables and ideograms, Cypriot, Ugaritic, and a new block of variation selectors (especially for future CJK variants)."'
Java is having to deal with this; see JSR 204: http://www.jcp.org/en/jsr/detail?id=204 .

wchar_t is now 4 bytes long on modern UNIX platforms like Linux glibc 2.3.

(I must thank Tim Bray at http://www.tbray.org/ongoing/When/200x/2003/04/30/JavaStrings for pointing all this out BTW.)

So, IMO, using a UTF, like UTF-8, as the native string rep makes more sense; *currently*, we're OK, as most of the Asian languages can be expressed without using these "astral plane" code points.  But to express *all* languages, from now on, in a future-proof manner, a UTF is more reliable IMO.  It'll expand to fit, as Unicode grows ever and ever larger... which, hopefully it won't have to, but I'm not making any bets either way ;)
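
To make the "expand to fit" point concrete, here is a small Python 3 sketch. The sample characters are my own picks for illustration (U+20000 is the first code point of CJK Extension B, which lives outside the BMP); nothing here comes from the linked pages.

```python
# Sample characters chosen for illustration only.
samples = {
    "LATIN CAPITAL LETTER A": "\u0041",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "EURO SIGN": "\u20ac",
    "CJK Extension B ideograph": "\U00020000",
}

for name, ch in samples.items():
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark
    print(f"U+{ord(ch):04X} {name}: "
          f"{len(utf8)} UTF-8 bytes, {len(utf16) // 2} UTF-16 code units")

# Byte and code-unit counts per sample, in insertion order.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples.values()]
utf16_units = [len(ch.encode("utf-16-be")) // 2 for ch in samples.values()]
```

The last sample is the interesting one: it no longer fits in a single 16-bit unit, while UTF-8 just spends a fourth byte on it.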

--j. ( http://taint.org/ )

Justin Mason
Friday, October 10, 2003

We use TCHARs where I work, but after reading the article I wonder whether we shouldn't just switch over to wchar_t.  God knows what would happen if we ever actually flipped the UNICODE switch on and tried to build.
    Of course, it probably doesn't matter for strings that will never see the light of day.  We use strings simply to allow our components to communicate, thus they'll never actually be displayed to the user.

Ken
Friday, October 10, 2003


Joel,
I believe,
"be strict in what you emit and liberal in what you accept" is from
Jon Postel, not Larry Wall, at least originally.

And I very much agree it's not a great engineering principle.
Matt

Matthew Hannigan
Friday, October 10, 2003

My own ignorance on character encodings has stemmed from an utter lack of good explanations like this.  The other attempts to explain it that I've come across just manage to confuse the issue.

Patrick Lioi
Friday, October 10, 2003

A few articles on Unicode from well-known Python programmers which are general enough to be useful with other languages. For those who don't know it, Python has a very intuitive implementation of Unicode (the codecs module, u'a string'.encode('encoding'), and 'a string'.decode('encoding') are enough for most programming tasks).

- Unicode for Programmers,  Jason Orendorff
http://www.jorendorff.com/articles/unicode/index.html

- Unicode Support in Python, Marc-André Lemburg
http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf

more Python-centric:

- Dive Into Python, chp 6.4. Unicode, Mark Pilgrim
http://diveintopython.org/xml_processing/unicode.html
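
For anyone who wants to see the encode/decode calls in action, here is a minimal sketch. It assumes Python 3, where `str` is already Unicode and `bytes` holds the encoded form, so the `u'...'` prefix from the Python 2 spelling above is no longer needed. The sample word is just an illustration.

```python
import codecs

text = "cliché"                        # a Unicode string (str)

utf8_bytes = text.encode("utf-8")      # str -> bytes
latin1_bytes = text.encode("latin-1")  # same text, different encoding

# Decoding with the matching encoding gives the original text back.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with the wrong encoding does not raise here; it quietly
# produces different (wrong) text.
wrong = utf8_bytes.decode("latin-1")

# The codecs module resolves encoding names and aliases to codec objects.
print(codecs.lookup("latin1").name)
print(len(utf8_bytes), len(latin1_bytes), round_trip == text, wrong == text)
```

The accented letter takes two bytes in UTF-8 and one in Latin-1, which is why the two byte strings differ in length.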

ludo
Friday, October 10, 2003

This is perfect timing, I have been scouring the Internet for a clear explanation of encodings and mostly failing.

For those that are interested, I did find one reasonable explanation at http://www.webreference.com/dlab/books/html/39-0.html.

printed and filed
Friday, October 10, 2003

Hey, guess what?  Not all of us code for the web.  Such a blanket statement about character sets is pretty broad and, in my case, pretty useless.  I dunno, maybe the scientific community might want me to learn this, but then again, they may want the algorithms that actually solve their problems written.  Given a choice, worrying about character encoding seems like more of a priority when you deal with breadth (i.e., reaching a lot of people) than with depth (being in a very small market with very directed goals).

Grumpy Scientific Coder
Friday, October 10, 2003

A very useful and well-written article, which I'll encourage my staff to read, but the following statement is so obviously from someone who has never programmed outside the Windows world that I can't just let it go without comment:

"EBCDIC is not relevant to your life. We don't have to go that far back in time."

EBCDIC is still being used today.  Some software does actually need to run on big IBM mainframes and IBM's AS-400 boxes as well as Windows and Unix servers.  And, in fact, the EBCDIC world has the same encoding problems as ASCII -- multiple code pages for different languages.
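
A quick way to see the EBCDIC code page problem, sketched with Python 3's built-in `cp037` and `cp500` codecs. Picking those two code pages and the `[` character is my own choice for illustration.

```python
# Letters agree across these two EBCDIC code pages...
letter_cp037 = "A".encode("cp037")
letter_cp500 = "A".encode("cp500")

# ...but some punctuation does not.
bracket_cp037 = "[".encode("cp037")
bracket_cp500 = "[".encode("cp500")

print(letter_cp037.hex(), letter_cp500.hex())
print(bracket_cp037.hex(), bracket_cp500.hex())

# Neither matches ASCII, where "A" is 0x41.
ascii_letter = "A".encode("ascii")
```

So a byte stream moved between two EBCDIC systems with different code pages can get its brackets scrambled, just like high-bit characters between ASCII-based code pages.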

Yoda
Friday, October 10, 2003

Interesting article Joel.  I remember covering this in my Data Fundamentals class.  We spent nearly a whole week learning ASCII, EBCDIC, UNICODE and the origins of character encoding.  Very interesting stuff.

Dave B.
Friday, October 10, 2003

It's fundamentally solving the wrong problem.

The problem isn't character encodings, it's that not everyone speaks English. ;)

Sum Dum Gai
Friday, October 10, 2003

i posted a response to joel at:

http://weblog.randomchaos.com/index.php?date=2003-10-11&title=PHP+does+have+limited+unicode+support

in which i concluded "i get the impression joel hasn't actually tried to develop an international web application with PHP before declaring it 'darn near impossible'."

scott reynen
Friday, October 10, 2003

Two points regarding EBCDIC.  EBCDIC was at least spared the problem of people using the high bit for their own purposes, since it is an 8-bit code.  Also, ASCII is actually older than EBCDIC, having originated in the telegraph industry.  That's what all those strange control characters are about.

Also we need to remember that high on the list of "wicked" "dim bulbs" was the guy who wrote the C-shell and vi, who used the high bit to indicate quoting.  Vi was internationalized only after years of struggle and the C-shell never was.  Oh, that was Bill Joy.

Hank Cohen
Saturday, October 11, 2003

Every line of code dealing with strings while ignoring character encoding issues is a bug waiting to be noticed. Encoding incompatibility issues are not a "research problem", but frustrating reality for anyone speaking a language that has "funny characters" in it - which includes English, as well.

Dear American programmers: intentionally misspelling words like "déjà vu", "cliché" and "über" is NOT an excuse for writing lousy code...  ;-)

Great article, Joel.

Florian Häglsperger
Saturday, October 11, 2003

For those interested in the broader issue of writing code for an international audience, there's the excellent, albeit out of print, "Developing International Software" (Nadine Kano, MS Press).

www.amazon.com/exec/obidos/tg/detail/-/1556158408/

A 2nd edition is available, but I haven't read it:

www.amazon.com/exec/obidos/tg/detail/-/0735615837/

In one company where I worked and was looking into how to support Japanese and other Far-East languages, it was just easier to have developers read this instead of having to give them a class on the Japanese language :-)

Frederic Faure
Saturday, October 11, 2003

The references to ANSI are misleading. Microsoft used the magic word 'ANSI' to avoid having to explain why they'd created a completely incompatible and nonsensical encoding system. If you ask the people involved which ANSI standard their codes refer to they'll probably either laugh, or look uncomfortable and shuffle away.

So an “ANSI code page” means a “MS DOS code page” or a “MS Windows code page” (the two aren't even interchangeable) and indeed the Unicode consortium considers them to be vendor legacy encodings just like MacRoman.
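
The "not even interchangeable" part is easy to demonstrate. A minimal Python 3 sketch, assuming code page 1252 as the Windows page and code page 437 as the DOS page (my choice of examples, using Python's built-in codecs):

```python
raw = b"\xe9"  # one byte, as it might sit in a legacy file

as_windows = raw.decode("cp1252")  # read as a Windows "ANSI" code page
as_dos = raw.decode("cp437")       # read as an MS-DOS code page

print(repr(as_windows), repr(as_dos))

# Going the other way, the same character gets different byte values.
e_acute_windows = "é".encode("cp1252")
e_acute_dos = "é".encode("cp437")
```

The same byte is an accented Latin letter on one side and a Greek capital theta on the other, so text saved by a DOS program and opened by a Windows one comes out wrong unless something converts it.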

The Tamil joelonsoftware is broken BTW. It claims to be UTF-8 but it's actually using a font-specific encoding and then re-encoding that as Latin-1 characters in UTF-8. The resulting documents might as well have been printed out and then scanned as bitmaps for all the practical use they are. Please do the Right Thing™

I believe the dual API approach to Unicode in Windows was a fundamental mistake. My gut feeling is that UCS-2 was a bad choice too, even before Unicode expanded beyond the BMP. As with LP64[*], it looks more and more as though Microsoft is suffering delayed complications from "shot in foot" disease.

[*] 32-bit Windows has the same basic C types as 32-bit Unix, but 64-bit Windows has different types from 64-bit Unix. As you can imagine, with a 10 year head start practically all the existing 64-bit C and C++ software is for Unix...

Nick Lamb
Sunday, October 12, 2003

I'm not sure about the Win32 Unicode API but the .NET Framework uses strings with UTF-16 encoding that do support multi-word characters, i.e. Unicode characters that use more than one System.Char of size 16 bit.

Granted, this brings us back to the situation where you can't use Char to (reliably) store any character, which is obviously not ideal, but .NET strings (if not chars) can indeed contain all Unicode code points.

And since System.Char is a separate data type it might get expanded to 32 bit at some point in the future, with full backwards compatibility to programs that don't use naughty typecasts.
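
To illustrate what "more than one 16-bit unit" means in UTF-16, here is a sketch of the surrogate pair arithmetic in Python 3 rather than .NET. The helper name `to_surrogate_pair` is hypothetical and the musical G clef (U+1D11E) is just a convenient character outside the BMP.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates.

    Hypothetical helper written for this illustration.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000       # 20 bits remain to be encoded
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


clef = "\U0001D11E"                     # MUSICAL SYMBOL G CLEF
pair = to_surrogate_pair(ord(clef))
encoded = clef.encode("utf-16-be")      # big-endian, no byte order mark

print([hex(unit) for unit in pair], encoded.hex())
```

The two values from the helper match the four bytes the built-in codec produces, which is why a single 16-bit char can hold only half of such a character.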

Chris Nahr
Sunday, October 12, 2003

Joel,

In the article you say: "So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits) ...."

I'm not too experienced with character encodings, but it was my understanding that UCS-2 and UTF-16 are not the same, and that UTF-16 is a 2-byte encoding for UCS-4.  i.e., UTF-8:UCS-2::UTF-16:UCS-4.  Is this not correct?

James
Wednesday, October 15, 2003
