Fog Creek Software
Discussion Board




UTF confusion.

Hi,

I'm hoping someone can help (actual question at the bottom if you want to miss the background).

I am playing about with UTF-8 in C#.  I am writing a string of 5 chars where the last char is £ (0xa3). When I convert the string to bytes it is 6 bytes long, so "Hell£" becomes

Hex 0x48, char H, denary 72
Hex 0x65, char e, denary 101
Hex 0x6c, char l, denary 108
Hex 0x6c, char l, denary 108
Hex 0xc2, char Â, denary 194
Hex 0xa3, char £, denary 163

If however I change the last char to é (0xe9) then the output I get is ...

Hex 0x48, char H, denary 72
Hex 0x65, char e, denary 101
Hex 0x6c, char l, denary 108
Hex 0x6c, char l, denary 108
Hex 0xc3, char Ã, denary 195
Hex 0xa9, char ©, denary 169

Now I probably MISTAKENLY thought that 0xc3 would be the marker byte, but it appears not because in my first example it is 0xc2.

My question I guess is, "How does UTF-8 mark 16-bit chars?"

R
Thursday, October 23, 2003

For the benefit of anyone else reading:

UTF-8 is a prefix code that represents each character using a variable number of bytes, with the (North American) advantage that ASCII characters encode to themselves.

We can see how this works by looking at decoding.

A byte whose bit pattern is 0xxxxxxx decodes to itself: it is a plain 7-bit ASCII character.

Now set the most significant bit to a 1.  The next code is

10xxxxxx

This is used as a "continuation code".  The advantage of using continuation codes is that given an arbitrary position in a UTF-8 string, one can tell if the position is at the "start" of a new character, or "in the middle" of a character.
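The "start or middle" test can be sketched in a few lines of Python. This is only an illustration: the helper names is_continuation and char_start are made up here, and the input is assumed to be valid UTF-8.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the bit pattern 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000


def char_start(data: bytes, pos: int) -> int:
    """Step back from pos to the first byte of the character containing it.

    Assumes data is valid UTF-8; no error checking is done.
    """
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos


# "Hell" followed by the pound sign: bytes 48 65 6c 6c c2 a3
data = "Hell\u00a3".encode("utf-8")

# Index 5 holds 0xa3, a continuation byte, so the character starts at index 4.
print(char_start(data, 5))
```

Any byte that is not of the form 10xxxxxx is therefore the first byte of a character, whatever its exact value.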

The next code is:

110xxxxx 10xxxxxx

This two-byte sequence uses one continuation code, and can encode any 11-bit character (5 bits in the first byte, 6 in the second).  Most European characters will fit in there.
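Applied to the two characters in the question, the bit packing looks roughly like this. The function name encode_two_byte is hypothetical, and the sketch handles only the two-byte range U+0080 to U+07FF.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as 110xxxxx 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point does not fit the two-byte form")
    lead = 0b1100_0000 | (code_point >> 6)            # top 5 of the 11 bits
    cont = 0b1000_0000 | (code_point & 0b0011_1111)   # low 6 bits
    return bytes([lead, cont])


# U+00A3 is the pound sign, U+00E9 is e-acute.
for cp in (0x00A3, 0x00E9):
    encoded = encode_two_byte(cp)
    # Cross-check against Python's built-in encoder.
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encoded.hex(' ')}")
```

So the first byte is not a fixed marker: it is 110 followed by the top five bits of the character. For U+00A3 those bits are 00010, giving 0xc2; for U+00E9 they are 00011, giving 0xc3.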

Next,

1110xxxx 10xxxxxx 10xxxxxx

This three-byte sequence can encode 16 bits (4 + 6 + 6).
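Putting the one-, two- and three-byte forms together, a decoder only needs to look at the leading bits of the first byte to know how many bytes follow. A minimal sketch, with hypothetical names sequence_length and decode, no validation of continuation bytes or overlong forms, and the euro sign (U+20AC) chosen here purely as a three-byte example:

```python
def sequence_length(lead: int) -> int:
    """Number of bytes in the sequence that starts with this lead byte."""
    if (lead & 0b1000_0000) == 0b0000_0000:
        return 1                      # 0xxxxxxx
    if (lead & 0b1110_0000) == 0b1100_0000:
        return 2                      # 110xxxxx
    if (lead & 0b1111_0000) == 0b1110_0000:
        return 3                      # 1110xxxx
    raise ValueError("not a 1-, 2- or 3-byte lead byte")


def decode(data: bytes) -> list:
    """Decode UTF-8 bytes (up to 3-byte forms) into a list of code points."""
    payload_mask = (0x7F, 0x1F, 0x0F)  # payload bits of the lead byte, by length
    code_points = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        cp = data[i] & payload_mask[n - 1]
        for b in data[i + 1:i + n]:
            cp = (cp << 6) | (b & 0x3F)  # each continuation byte adds 6 bits
        code_points.append(cp)
        i += n
    return code_points


print([hex(cp) for cp in decode(bytes([0x48, 0x65, 0x6C, 0x6C, 0xC2, 0xA3]))])
print([hex(cp) for cp in decode(bytes([0xE2, 0x82, 0xAC]))])
```

The first call recovers the five characters of the original six-byte string; the second recovers a single 16-bit character from three bytes.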

And so on...

David Jones
Thursday, October 23, 2003
