Fog Creek Software
Discussion Board




character encoding questions

Hi,

I've a few questions regarding the latest article on character encoding issues.

1. Does the concept of code pages apply to Unicode code points? Do different languages use the same code point for different characters, as with 8-bit ASCII code pages? For example, 0xABCD might be one character in Japanese but another in Chinese. Has this happened?

2. I can understand simple encodings like UTF-8 and UCS-2, but what are those language encodings in IE (in the View->Encoding menu)?

3. How do font files fit into the picture?

BY
Saturday, October 11, 2003

1. No. No. No. No. :) Code points are unique. Code pages are a completely different deal (they're different ways of slicing up the limited character space, pre-Unicode).

2. Those are code pages.

3. Some fonts are Unicode compliant (like Arial Unicode MS, I think it's called), but most are not. They are designed to be used with specific code pages. Unicode is much simpler than the code page mess.
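
To make the code page point in answers 1 and 2 concrete, here is a minimal Python sketch. The byte value 0xE0 and the three Windows code pages are picked purely for illustration; the names `raw` and `decoded` are not from the thread. The same byte decodes to three different Unicode code points depending on which code page you assume, whereas each code point means one thing only.

```python
import unicodedata

# One byte value, interpreted under three different Windows code pages.
raw = bytes([0xE0])

# Map each code page name to the character that byte decodes to.
decoded = {cp: raw.decode(cp) for cp in ("cp1252", "cp1251", "cp1255")}

for cp, ch in decoded.items():
    # Print the code point and its official Unicode name (ASCII-only output).
    print(cp, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The byte is ambiguous until you know the code page; the resulting code points are not.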

Brad Wilson (dotnetguy.techieswithcats.com)
Saturday, October 11, 2003

By "compliant", I guess you mean that some fonts hold signs (don't know the correct term. "Rendering"?) for the entire set, while "non-compliant" means that the font only contains a subset, e.g., a Unicode font sold in Japan would only contain the English + Japanese signs, nothing more.

Frederic Faure
Saturday, October 11, 2003

Yeah, I was simplifying some.

There are very few fonts that contain glyphs for all the known Unicode code points (at least, whatever version of Unicode is supported by the NT-kernel Windows OSes).

Brad Wilson (dotnetguy.techieswithcats.com)
Saturday, October 11, 2003

Code points are unique, but characters are not; some characters are repeated several times for completeness. Thus, it's possible to encode a string that has the same visual representation (and essentially the same human interpretation) in more than one way.

The Hebrew letter 'Aleph' (which looks like "א"; hopefully your browser and Joel's ASP script will collaborate to display this properly) is also a mathematical symbol, used in set theory for the cardinality of infinite sets. It has two codes: one in the math symbol area (U+2135) and one in the Hebrew area (U+05D0).
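
A quick way to see the two alephs side by side is Python's standard `unicodedata` module. This is only a sketch; the variable names are mine. It also shows that compatibility normalization (NFKC) folds the math symbol onto the Hebrew letter, which is one way an application can treat them as the same.

```python
import unicodedata

hebrew_alef = "\u05d0"  # the letter, from the Hebrew block
alef_symbol = "\u2135"  # the math symbol, from the Letterlike Symbols block

# Official names of the two code points.
names = [unicodedata.name(c) for c in (hebrew_alef, alef_symbol)]

# As raw code points the two are different strings...
same_code_point = hebrew_alef == alef_symbol

# ...but NFKC maps the symbol to the letter.
same_after_nfkc = unicodedata.normalize("NFKC", alef_symbol) == hebrew_alef

print(names, same_code_point, same_after_nfkc)
```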

The Russian character 'C' is actually associated with the sound that 'S' makes in English, and Russian 'P' with the sound that 'R' makes in English. Both 'C' and 'P' (and many others) are repeated in the Cyrillic area of the Unicode set, even though there is no visual difference from the Latin letters.
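
Here is a small sketch of the Latin/Cyrillic look-alikes, again with `unicodedata` (the helper `describe` and the sample strings are just for illustration). Unlike the aleph case, normalization does not unify these: they are different letters that merely look the same. Plain code point ordering also puts the Cyrillic string after every Latin one.

```python
import unicodedata

latin = "CP"                # U+0043, U+0050
cyrillic = "\u0421\u0420"   # Cyrillic look-alikes of C and P

def describe(s):
    """List each character as 'U+XXXX NAME'."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in s]

print(describe(latin))
print(describe(cyrillic))

# The strings differ, and no normalization form makes them equal.
equal_raw = latin == cyrillic
equal_nfkc = unicodedata.normalize("NFKC", cyrillic) == latin

# Naive sorting by code point puts the Cyrillic name last.
ordering = sorted(["CQ", cyrillic, "CA"])
```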

Unicode also has several normalized forms and many de-normalized forms. For example, an 'o' with an umlaut (two small dots above it) can be represented as a precomposed character (having one code), or as a composition: o (a code of its own) followed by a combining umlaut (a code of its own).

This could lead to very subtle bugs. E.g., a user saves a file called 'CP' with both characters being from the Russian set; later, when she tries to open it, she can select it in Explorer, but can't open it by typing the name with Latin letters. Furthermore, the sort order in Explorer will look wrong.

Or a "find" feature in an editor will look for the string as it was typed (precomposed) and silently miss the decomposed string, even though the two are visually the same and have the same meaning.

Proper support for Unicode is extremely hard, not because of the spec, but because of the many language-specific details that need to be taken care of.
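
The "find" problem can be sketched in a few lines of Python. The function `normalized_find` is a hypothetical helper, not something from any editor: it normalizes both the text and the query to the same form before searching. Note that the index it returns refers to the normalized text, which a real editor would have to map back to the original.

```python
import unicodedata

def normalized_find(haystack, needle, form="NFC"):
    """Search after bringing both strings to the same normalization form.

    Returns an index into the *normalized* haystack, or -1 if not found.
    """
    h = unicodedata.normalize(form, haystack)
    n = unicodedata.normalize(form, needle)
    return h.find(n)

# Text stored decomposed: 'o' followed by U+0308 COMBINING DIAERESIS.
text = "sehr scho\u0308n"
# Query typed precomposed: U+00F6.
query = "sch\u00f6n"

naive = text.find(query)              # the plain search misses the match
fixed = normalized_find(text, query)  # the normalized search finds it
print(naive, fixed)
```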

Ori Berger
Saturday, October 11, 2003

There are also sometimes many ways to get the same character. For example, Unicode mostly follows Latin-1 for the code points from 128-255, so there are a lot of legacy accented characters in that region. However, Unicode has a separate way of accenting *any* character, by combining the letter with a combining accent from the "Combining Diacritical Marks" section of the BMP. Any Unicode-compliant font is supposed to know that when the accent follows the letter, the two are combined into a single character. So, you can get á with U+00E1 or by using U+0061 followed by U+0301. The latter is more general, the former is currently more common. Your application has to treat both occurrences the same.
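
As a rough illustration of "treat both occurrences the same", here is a Python sketch using the standard `unicodedata` module. The helper name `same_text` is made up for this example. NFC composes U+0061 U+0301 into U+00E1, NFD does the reverse, and comparing after normalizing both sides makes the two spellings equal.

```python
import unicodedata

precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE
combining = "a\u0301"    # 'a' followed by COMBINING ACUTE ACCENT

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

raw_equal = precomposed == combining                       # different code points
composed = unicodedata.normalize("NFC", combining)         # composes to U+00E1
decomposed = unicodedata.normalize("NFD", precomposed)     # splits to U+0061 U+0301

print(raw_equal, same_text(precomposed, combining))
print([f"U+{ord(c):04X}" for c in decomposed])
```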

Ankur
Sunday, October 12, 2003
