On UTF-8 encoding
I have enjoyed reading the nice article on Unicode and character sets ( http://www.joelonsoftware.com/articles/Unicode.html ), but I have a simple question regarding this point from the article:
Andrea
On the first byte, the very last bit (the high bit) is used as a signal that the subsequent byte is part of the same letter, or "word".
Li-fan Chen
As always, please correct me if I am wrong... thankies
Li-fan Chen
Hence, the bit that you normally see used for IBM control codes or the special characters used by Windows is now either 0 or 1, and when it's 1 it has no meaning except to say that the next byte expands the reference space. So you'll generally see single bytes representing most of the regular alphabet and numerals, basic symbols, control characters, and most common printable extended characters.
Li-fan Chen
If the high bit of a byte is set, the byte is part of a multi-byte sequence representing a codepoint > 127. A more detailed explanation is here:
John C.
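The high-bit rule described above can be checked with a short Python sketch. The helper name `describe_bytes` is made up for illustration; it only looks at the leading bits of each byte and does not validate the sequence.

```python
def describe_bytes(text):
    """Return (hex byte, role) pairs for the UTF-8 encoding of text.

    Hypothetical helper for illustration only: it classifies each byte
    by its leading bits and performs no validation.
    """
    roles = []
    for b in text.encode("utf-8"):
        if b & 0x80 == 0:
            # High bit clear: the byte stands alone (code point <= 127).
            role = "single-byte (ASCII)"
        elif b & 0xC0 == 0x80:
            # Bits 10xxxxxx: a continuation byte inside a sequence.
            role = "continuation"
        else:
            # Bits 11xxxxxx: the first byte of a multi-byte sequence.
            role = "lead"
        roles.append((f"{b:02X}", role))
    return roles


# "a", then U+00E9 (two bytes), then U+20AC (three bytes).
for pair in describe_bytes("a\u00e9\u20ac"):
    print(pair)
```

Every byte belonging to a code point above 127 has its high bit set, so plain ASCII bytes can never be mistaken for part of a longer sequence.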
I suggest reading the UTF-8 encoding RFC http://www.faqs.org/rfcs/rfc2279.html . The answer to your question is in section 2 (UTF-8 definition.)
GinG
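For anyone who wants to see the bit layout from that definition in action, here is a rough Python sketch of an encoder written by hand. The function name `utf8_encode` is an assumption made for this example. The RFC's definition goes on to five- and six-byte sequences, but this sketch stops at four bytes, which covers U+0000 through U+10FFFF, and it does not reject surrogate code points.

```python
def utf8_encode(cp):
    """Encode one code point (0..0x10FFFF) to UTF-8 bytes by hand.

    Illustrative sketch only: stops at four-byte sequences and does
    not reject surrogates (U+D800..U+DFFF).
    """
    if cp < 0x80:
        # 0xxxxxxx: 7 payload bits, one byte.
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx: 11 payload bits.
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: 16 payload bits.
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    if cp <= 0x10FFFF:
        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 21 payload bits.
        return bytes([
            0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    raise ValueError("code point out of range for this sketch")


# Compare against Python's built-in encoder for a few sample code points.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")
```

The number of leading 1 bits in the first byte tells a decoder how many bytes the sequence has, and each following byte carries six payload bits behind a `10` prefix.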