Fog Creek Software Discussion Board




UCS-2 means no Unicode 3.1

I have greatly enjoyed Joel's writings, and the Unicode article is no exception. That is, until I got to the part about CityDesk using UCS-2 internally.

So I guess this means if we want to use the Unicode Western Musical Symbols, or anything else added in Unicode 3.1 or later, we're out of luck? We're a music software company looking at changing some of our Web tools, so that would be bad news.

I know a lot of software doesn't support Unicode 3.1 yet, but we're hoping to turn that around, program by program. Unicode doesn't fit in 16 bits anymore, and you need UTF-8, UTF-16, or UCS-4 to deal with that.
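
To make the 16-bit limit concrete, here is a minimal Python sketch. It is only an illustration, not anything taken from CityDesk; the helper names `encoded_sizes` and `fits_in_16_bits` are made up for this example. It uses U+1D11E MUSICAL SYMBOL G CLEF, one of the Unicode 3.1 musical symbols.

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
G_CLEF = "\U0001D11E"


def fits_in_16_bits(ch):
    """True if the code point fits in a single 16-bit unit (hypothetical helper)."""
    return ord(ch) <= 0xFFFF


def encoded_sizes(ch):
    """Byte length of one character in each encoding (hypothetical helper)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),  # "-be" avoids a byte-order mark
        "utf-32": len(ch.encode("utf-32-be")),
    }


if __name__ == "__main__":
    print(hex(ord(G_CLEF)), fits_in_16_bits(G_CLEF))
    print(encoded_sizes(G_CLEF))
```

The G clef needs four bytes in all three encodings, so a strictly 16-bit representation cannot hold it in one unit.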

If I misunderstood this limitation in CityDesk, please enlighten me.

Thanks so much for these excellent articles!

Music software guy
Tuesday, October 14, 2003

You ought to be banging on the door of the tools vendors and compiler writers!

As it is, most modern compilers support a 'widestring' type of some sort, or equiv.  So using UCS-2 is an easy declaration change.

If compilers equally supported a 'verywidestring' type of some sort, or equiv, then getting your music symbols in would also be an easy declaration change.

(And a test run, of course!)

i like i
Tuesday, October 14, 2003

It was already known, back in 1993 when Microsoft released its first Unicode-enabled OS, that 16-bit wasn't enough. So, bitch at Microsoft about it. There's nothing Joel (or me or you) can realistically do about it.

Brad Wilson (dotnetguy.techieswithcats.com)
Tuesday, October 14, 2003

Are you sure this "UCS-2" label means that the application doesn't support surrogate sequences (is that the right term?) of 16-bit characters?

I know that I could never keep UCS-2 and UTF-16 apart. As I wrote in another thread, the .NET Framework, which is built on the Windows Unicode facility, does support multi-character sequences to represent every Unicode code point.

Chris Nahr
Tuesday, October 14, 2003

UCS-2 has no surrogate characters and only supports characters in the Basic Multilingual Plane (BMP).  UTF-16 is an encoding mechanism that allows for 21-bit Unicode characters (that is, it doesn't support all the possible Unicode values that UCS-4 does, either, however the intention is to not assign characters beyond the 21bit point).  UCS-2 data can be interpreted as UTF-16.

UCS = Universal Character Set
UTF = UCS Transformation Format
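
As a rough sketch of the difference, this is how UTF-16 splits a code point above the BMP into a surrogate pair, which is exactly what UCS-2 cannot do. The function name `to_surrogate_pair` is an assumption for illustration; the arithmetic follows the standard UTF-16 definition.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000        # 20 bits remain after removing the BMP
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1D11E)  # MUSICAL SYMBOL G CLEF
    print(f"{high:04X} {low:04X}")
    # Cross-check against Python's own UTF-16 encoder.
    print("\U0001D11E".encode("utf-16-be").hex())
```

A BMP code point is passed through as a single 16-bit unit in both UCS-2 and UTF-16, which is why UCS-2 data can be read as UTF-16.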

Joe
Tuesday, October 14, 2003

Okay, so... can CityDesk really only handle UCS-2 (no code point above 65,535, i.e. U+FFFF) or can it handle UTF-16 (with surrogate pairs) and they just miswrote the documentation or advertisement? That's what I was wondering.

Chris Nahr
Tuesday, October 14, 2003

I'm 99.9999% sure that "Unicode string" in NT parlance means UCS-2, not UTF-16.

Brad Wilson (dotnetguy.techieswithcats.com)
Tuesday, October 14, 2003

Looks like you're right. I dug up the Platform SDK, and it refers to Unicode 2.0, says that Unicode can handle 65,536 different characters, and there don't seem to be any functions that can handle 16-bit surrogate pairs. So this was apparently added entirely by the .NET Framework.

Chris Nahr
Wednesday, October 15, 2003

According to this article  http://www.microsoft.com/globaldev/DrIntl/columns/002/default.mspx  Windows switched from UCS-2 to UTF-16 for its internal encoding starting with Windows 2000.

However it doesn't look like surrogate support (in Windows) is that widespread yet:  http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_9i79.asp

Rod Martens
Thursday, October 16, 2003

Thanks! But I think your second URL should be this one (the Surrogates link directly below the one you gave):

http://msdn.microsoft.com/library/en-us/intl/unicode_192r.asp

This page clears up the whole ugly mess about 16-bit surrogate pairs on Windows.

Your first URL came up with an error, could you please check again? Microsoft may have moved the page location.

(Why is MS always moving around pages, anyway? They seem to change all their internal URLs at least once per year. Perhaps so that they don't get moldy or something.)

Chris Nahr
Thursday, October 16, 2003

"Your first URL came up with an error, could you please check again? Microsoft may have moved the page location."

No, it is this board's dreaded buggy URL regular expression again, which swallows a trailing space. Come on, Joel, this is a very public and very annoying bug hanging around in plain sight in the middle of your shop window. Spend the five minutes necessary to clear this up instead of wasting another week on PHP.

Just me (Sir to you)
Thursday, October 16, 2003

"UTF-16 is an encoding mechanism that allows for 21-bit Unicode characters (that is, it doesn't support all the possible Unicode values that UCS-4 does, either, however the intention is to not assign characters beyond the 21bit point)."

- Unicode characters are 21-bit.

UCS-4 was defined by ISO 10646, not Unicode, and allowed 31 of the 32 bits in each unit to be used. Indeed, it once provided for private-use code positions in groups 60 to 7F (0x60000000 through 0x7FFFFFFF) and in planes E0 to FF (0x00E00000 through 0x00FFFFFF). This provision has been removed. Together with the JTC1/SC2/WG2 "Principles and Procedures" statement that they will not assign characters beyond plane 16 (that is, above 0x10FFFF), this means that UCS-4 is essentially the same as UTF-32 (the Unicode-defined encoding of Unicode characters in 32-bit code units), except that UTF-32, being Unicode rather than ISO 10646, has additional semantics.

Hence UTF-16 can encode any Unicode character present or future.
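
A quick way to check that claim, as a sketch only: decode the largest possible surrogate pair and see where it lands. The name `decode_pair` is hypothetical; the surrogate ranges are the ones defined for UTF-16.

```python
HIGH_SURROGATES = range(0xD800, 0xDC00)  # 1024 high (lead) surrogates
LOW_SURROGATES = range(0xDC00, 0xE000)   # 1024 low (trail) surrogates


def decode_pair(high, low):
    """Combine a UTF-16 surrogate pair back into a code point."""
    if high not in HIGH_SURROGATES or low not in LOW_SURROGATES:
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest pair decodes to the last code point of plane 16.
MAX_VIA_UTF16 = decode_pair(HIGH_SURROGATES[-1], LOW_SURROGATES[-1])

if __name__ == "__main__":
    print(hex(MAX_VIA_UTF16), MAX_VIA_UTF16.bit_length())
```

The 1024 x 1024 pairs cover exactly the 16 planes above the BMP, so nothing up to 0x10FFFF is out of reach for UTF-16.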

Jon Hanna
Thursday, October 23, 2003
