Fog Creek Software
Discussion Board




on utf-8 encoding

I have enjoyed reading the nice article on Unicode and character sets (http://www.joelonsoftware.com/articles/Unicode.html), but I have a simple question regarding this point from the article:

... In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes. ...

The question is: how does the system know that a character uses 2 bytes and not 1? I mean, two adjacent 1-byte characters could look the same as one 2-byte character. How are they distinguished?
Thanks

Andrea
Wednesday, April 28, 2004

In the first byte, the high (most significant) bit is used as a signal that the byte is part of a longer multi-byte character.

So: FBBB BBBB

If F = 0 then it's a one-byte character; if F = 1 the byte belongs to a UTF-8 character that uses more than 1 byte. I believe all subsequent bytes signal further usage in similar (although not identical) ways.

Li-fan Chen
Wednesday, April 28, 2004

As always, please correct me if I am wrong... thanks

Li-fan Chen
Wednesday, April 28, 2004

Hence, the byte values 128-255, which you normally see as IBM special characters or the extended characters used by Windows code pages, no longer stand for a character on their own. A set high bit has no meaning except to say that the byte is part of a longer sequence that expands the reference space. So you'll generally see single bytes representing the regular English alphabet, digits, basic symbols and control characters, while extended characters such as accented letters take two or more bytes.

Li-fan Chen
Wednesday, April 28, 2004

If the high bit of a byte is set, the byte is part of a multi-byte sequence representing a codepoint > 127. A more detailed explanation is here:

http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

John C.
Wednesday, April 28, 2004
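
To make the high-bit rule concrete, here is a minimal Python sketch of how a decoder could tell where each character starts. The function names (sequence_length, split_characters) are made up for illustration and are not from the article or the linked page. It only handles the 1- to 4-byte forms; the original definition in RFC 2279 also had 5- and 6-byte forms, which the later RFC 3629 removed. It does not validate the continuation bytes either.

```python
from typing import List


def sequence_length(lead: int) -> int:
    """Return the number of bytes in the sequence that starts with `lead`."""
    if lead & 0b1000_0000 == 0:              # 0xxxxxxx: plain ASCII, one byte
        return 1
    if lead & 0b1110_0000 == 0b1100_0000:    # 110xxxxx: start of a 2-byte sequence
        return 2
    if lead & 0b1111_0000 == 0b1110_0000:    # 1110xxxx: start of a 3-byte sequence
        return 3
    if lead & 0b1111_1000 == 0b1111_0000:    # 11110xxx: start of a 4-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; anything else is not handled here.
    raise ValueError(f"0x{lead:02x} cannot start a character")


def split_characters(data: bytes) -> List[bytes]:
    """Cut a UTF-8 byte string into one chunk per encoded character."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks
```

Because a continuation byte always looks like 10xxxxxx and an ASCII byte always looks like 0xxxxxxx, two adjacent 1-byte characters can never be mistaken for one 2-byte character: their bit patterns do not overlap.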

I suggest reading the UTF-8 encoding RFC http://www.faqs.org/rfcs/rfc2279.html . The answer to your question is in section 2 (UTF-8 definition.)

GinG
Thursday, April 29, 2004
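
For the other direction, the bit-pattern table in that section can be turned into a short encoder. This is only a sketch under some assumptions: the name encode_code_point is hypothetical, it stops at the 4-byte form (code points up to U+10FFFF), and it does not reject surrogate code points the way a strict encoder would.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point by filling the UTF-8 bit patterns by hand."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:          # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:      # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range for this sketch")


# Compare against Python's built-in encoder for a few sample code points.
for sample in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert encode_code_point(sample) == chr(sample).encode("utf-8")
```

Every byte after the first starts with the bits 10, which is what lets a reader tell continuation bytes apart from the start of a new character.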
