Fog Creek Software
Discussion Board




UTF-8,MBCS

I'd like to know how UTF-8 compression converts Unicode characters/strings to MBCS. Also, how is "isleadbyte()" implemented for a particular code page? How is it different from ismbblead()? I work on Windows 2000/XP.

This weekend I was doing some studying on MBCS, WCS, DBCS and Unicode. What are the differences between the last three? Simple explanations or links to easily understandable articles would help. Thanks a lot.

John
Sunday, January 26, 2003

Simple explanations:
MBCS - multi-byte character set, i.e. a character is encoded as one or more bytes.
DBCS - double-byte character set. Languages with thousands of characters (Japanese, Chinese, Korean) encode each of those characters in 2 bytes.

UNICODE -
Earlier, each OS or organisation had its own language encodings, e.g. Windows has code pages, the Mac has script codes, and so on, and they were not compatible with one another. So the big vendors came together and agreed on a single standard: Unicode, which uses 2 bytes for every character.
Unicode is organised around glyphs, i.e. it assigns a code to a character image. Characters that Japanese, Chinese and Korean have in common therefore share one code, whereas earlier each had a different code depending on the code page of that language.

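UTF-8 -
Since Unicode is 2 bytes, text in it contains zero (NULL) bytes wherever a character's code is below 256, because the high byte of such a character is 0. UTF-8 was introduced so that text could be transmitted across the net: every character is encoded as a sequence of one or more 8-bit units. It is a form of MBCS.
To learn more about UTF-8 and Unicode, visit www.unicode.org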
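
To make the byte layout concrete, here is a minimal sketch of how one Unicode code point becomes 1 to 4 UTF-8 bytes. The function name utf8_encode is made up for this illustration and is not a Windows or CRT API. Note that no output byte is zero unless the character itself is U+0000.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a library function): encode one Unicode
 * code point as UTF-8. Writes up to 4 bytes into out and returns the
 * number of bytes written, or 0 if cp is a surrogate or above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* outside the Unicode range */
}
```

For example, U+0041 ('A') stays the single byte 41, while U+3042 (hiragana 'a') becomes the three bytes E3 81 82. On Windows the real conversion is normally done with WideCharToMultiByte and the CP_UTF8 code page rather than by hand.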

isleadbyte()
If you set your locale to the code page of the language you are processing, it tells you whether the byte you pass is a lead byte of a double-byte character in that locale's code page.

ismbblead()
A little better, as it takes an unsigned value instead of a signed one.

Use this instead:
IsDBCSLeadByteEx, which lets you specify the code page instead of depending on the locale setting. Before calling it, check that the particular code page is installed (for example with IsValidCodePage).
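
As for how such a test works for one particular code page, here is a rough sketch for code page 932 (Japanese Shift-JIS), whose lead bytes are 0x81-0x9F and 0xE0-0xFC. This is not the actual CRT or Win32 implementation, and the names is_lead_byte_cp932 and cp932_char_count are made up for the illustration. Taking an unsigned char avoids the problem of bytes above 0x7F turning negative when held in a signed char.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if b is a DBCS lead byte in code page 932.
 * The ranges are hard-coded here; a real library looks them up per
 * code page. */
bool is_lead_byte_cp932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Hypothetical helper: count characters (not bytes) in a
 * NUL-terminated CP932 string by skipping the trail byte after
 * every lead byte. */
size_t cp932_char_count(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p != 0) {
        if (is_lead_byte_cp932(*p) && p[1] != 0)
            p += 2;   /* lead byte plus trail byte form one character */
        else
            p += 1;   /* single-byte character, or a truncated lead byte */
        n++;
    }
    return n;
}
```

With this, the three-byte string 0x82 0xA0 0x41 (hiragana 'a' followed by 'A') counts as two characters.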

I hope this helps.

R K
Monday, January 27, 2003

Hi RK,
Thanks. Sounds very interesting.

John
Monday, January 27, 2003

Interesting, but slightly wrong. Unicode does NOT fit into 2 bytes - see
http://www.unicode.org/faq/utf_bom.html#6

or just browse through this website...
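
To illustrate the point, here is a small sketch of how a code point above U+FFFF is written in UTF-16 as two 16-bit units (a surrogate pair). The function name utf16_encode is made up for this example and is not a library call.

```c
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is
 * a surrogate value or above U+10FFFF. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                 /* reserved for surrogate pairs */
        out[0] = (uint16_t)cp;        /* fits in a single 16-bit unit */
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;                     /* outside the Unicode range */

    cp -= 0x10000;                    /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, U+1D11E (musical G clef) needs the pair D834 DD1E, i.e. 4 bytes rather than 2.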

John Styles
Wednesday, January 29, 2003
