Fog Creek Software
Discussion Board





Joel on Software

UTF-8 confusion

Hi,

I'm hoping someone can help (actual question at the bottom if you want to skip the background).

I am playing about with UTF-8 in C#. I am writing a string of 5 chars where the last char is £ (0xa3). When I convert the string to bytes it is 6 bytes long, so "Hell£" becomes

Hex 0x48, char H, denary 72
Hex 0x65, char e, denary 101
Hex 0x6c, char l, denary 108
Hex 0x6c, char l, denary 108
Hex 0xc2, char Â, denary 194
Hex 0xa3, char £, denary 163

If, however, I change the last char to é (0xe9), then the output I get is ...

Hex 0x48, char H, denary 72
Hex 0x65, char e, denary 101
Hex 0x6c, char l, denary 108
Hex 0x6c, char l, denary 108
Hex 0xc3, char Ã, denary 195
Hex 0xa9, char ©, denary 169

Now I probably MISTAKENLY thought that 0xc3 would be the marker byte, but it appears not because in my first example it is 0xc2.

My question, I guess, is: "How does UTF-8 mark 16-bit chars?"

R
Thursday, October 23, 2003


Ooops.
If in doubt ... bottom of page 3, RFC 2279.
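
For code points from U+0080 to U+07FF, the two-byte form is 110xxxxx 10xxxxxx: the lead byte carries the top five bits and the trailing byte the low six. There is no single fixed marker byte, which is why £ (U+00A3) gets 0xc2 and é (U+00E9) gets 0xc3. A minimal sketch of just that two-byte case; the class and method names are made up for illustration:

```csharp
using System;

static class Utf8TwoByteSketch
{
    // Hypothetical helper: encodes a code point in U+0080..U+07FF as 110xxxxx 10xxxxxx.
    public static byte[] EncodeTwoByte(int codePoint)
    {
        if (codePoint < 0x80 || codePoint > 0x7FF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        byte lead = (byte)(0xC0 | (codePoint >> 6));     // 110xxxxx: top five bits
        byte trail = (byte)(0x80 | (codePoint & 0x3F));  // 10xxxxxx: low six bits
        return new[] { lead, trail };
    }

    static void Main()
    {
        foreach (int cp in new[] { 0xA3, 0xE9 })
        {
            byte[] b = EncodeTwoByte(cp);
            // Prints "U+00A3 -> 0xc2 0xa3" then "U+00E9 -> 0xc3 0xa9".
            Console.WriteLine("U+{0:X4} -> 0x{1:x2} 0x{2:x2}", cp, b[0], b[1]);
        }
    }
}
```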

R
Thursday, October 23, 2003

Why bother? What's wrong with:
byte[] bytes = System.Text.Encoding.UTF8.GetBytes(text);
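
As a rough sketch of that call in a complete program, something like the following should reproduce the byte dump from the first post. The class name, the ToUtf8 wrapper and the output format are only illustrative:

```csharp
using System;
using System.Text;

static class GetBytesDemo
{
    // Hypothetical wrapper around the built-in encoder.
    public static byte[] ToUtf8(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    static void Main()
    {
        string text = "Hell\u00A3"; // \u00A3 is the pound sign

        // Five chars in, six bytes out: the pound sign takes two bytes.
        foreach (byte b in ToUtf8(text))
            Console.WriteLine("Hex 0x{0:x2}, denary {1}", b, b);
    }
}
```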

Duncan Smart
Thursday, October 23, 2003


Because I am interested?

R
Friday, October 24, 2003

UTF-8 is a good spec to read because of how elegant the solution is, and it's always good to learn about how the computer does something.

Looks like you found one of the very few cases where one of the unencoded values happens to show up in the encoded output. A good exercise would be to figure out all situations where this happens.
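
One way to start on that exercise, reading "shows up" as "the last encoded byte equals the code point value" (that reading is an assumption, and the names below are made up), is a brute-force check over the values that fit in one byte:

```csharp
using System;
using System.Text;

static class SameByteSketch
{
    // True when the final UTF-8 byte of the code point equals the code point itself.
    public static bool LastByteMatches(int codePoint)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(char.ConvertFromUtf32(codePoint));
        return bytes[bytes.Length - 1] == codePoint;
    }

    static void Main()
    {
        // Only values up to 0xFF can equal a single byte.
        // ASCII (below 0x80) is skipped because it is trivially unchanged.
        for (int cp = 0x80; cp <= 0xFF; cp++)
        {
            if (LastByteMatches(cp))
                Console.WriteLine("U+{0:X4}", cp);
        }
    }
}
```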

mb
Friday, October 24, 2003


To be honest, this all started when we were writing large text blobs into a database, and when we pulled them out we had lots of EM codes (0x19). Figured the original author must have been doing something strange like writing in as UTF-8 and reading out as ASCII. But then I just got sidetracked :)

R
Friday, October 24, 2003

>> UTF-8 is a good spec to read because of how elegant the solution is

Yes, it reminds me of how IP addressing works - i.e. you determine whether an address is class A, B or C from looking at the first few bits...
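
To make the comparison concrete, here is a small sketch that reads the sequence length off the leading bits of the first byte. The names are made up, it stops at four-byte sequences (RFC 2279 also defines five- and six-byte forms), and it does no validation of overlong or otherwise illegal bytes:

```csharp
using System;

static class LeadByteSketch
{
    // Returns the sequence length implied by a lead byte,
    // or 0 for a continuation byte or anything this sketch does not handle.
    public static int SequenceLength(byte b)
    {
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx: plain ASCII
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return 0;                         // 10xxxxxx continuation, or a longer form
    }

    static void Main()
    {
        foreach (byte b in new byte[] { 0x48, 0xC2, 0xC3, 0xA3 })
        {
            // Prints lengths 1, 2, 2 and 0 for these four bytes.
            Console.WriteLine("0x{0:x2} -> {1}", b, SequenceLength(b));
        }
    }
}
```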

Duncan Smart
Thursday, October 30, 2003
