Fog Creek Software
Discussion Board





Suppress UTF-8

In my current situation the UTF-8 (of CD 2) coding is disturbing me. Is there any way (other than to go back to CD 1) to have my old fashioned "ISO..." charset back again?

Volker

Volker Halstenbach
Thursday, May 08, 2003

No, CityDesk 2.0 is UTF-8 only. What's the problem with UTF-8?

(I've been using it for almost a year now on the global versions of my site, in 25-odd languages and with thousands of daily readers, and not a single person has reported a problem reading web pages composed in UTF-8. That's why we finally decided it is the best and simplest way to encode all output.)

Joel Spolsky
Thursday, May 08, 2003

The problem is not UTF-8. It's the old browsers out there owned by non-technical people. If they constitute a significant part of your audience you might have a problem. I suppose that not many on this forum read Danish, but I can assure you that it's an even uglier experience if your browser can't read UTF-8 ... ;-)

Jorgen Brenting
Thursday, May 08, 2003

Which browsers don't support it?

Netscape 3 does. IE 4 does, I'm not sure about IE 3 but very few people installed that.

Anything older, and you are talking about a vanishingly small number of people, well under 0.1% even in non-technical audiences.

Joel Spolsky
Thursday, May 08, 2003

Open CityDesk.exe in a hex editor. Locate the meta tag that includes the UTF-8 directive - it's in plain text. Change it to whatever you like. Save the file.

No warranty. You break it, you didn't hear it from me.

Mike Gunderloy
Thursday, May 08, 2003

One issue I have seen mentioned is to do with CGI - which I gather does not support UTF-8. See for example:

http://lists.template-toolkit.org/pipermail/templates/2002-November/003992.html

I haven't looked deeply into this, so can't verify it (too tired gotta fly tomorrow).

However if correct could raise a few probs in some circumstances...

MeJ

James Roberts
Thursday, May 08, 2003

Joel, the browser doesn't have to be that old. You only need to have Auto-Detect turned off in Character Coding and the default set to something other than UTF-8. That will turn the pages into something from Outer Space ... and I can assure you that the words 'Character Coding' are enough to make many ordinary users go blank.

Well, maybe I just worry too much, but I have found that it is hard to underestimate many users' willingness to solve problems on other people's web sites. Better sites are just a click away.

Jorgen Brenting
Thursday, May 08, 2003

I don't get it... it seems to work fine here after setting the
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
on my templates.

In what cases don't my Swedish characters display correctly?

http://www.jacobsson.nu/hattrick

Fredrik Jacobsson
Thursday, May 08, 2003

Mike, just changing the UTF directive won't change the fact that CityDesk itself goes to great effort to actually encode all the text that it outputs in the UTF-8 encoding. (The string you're finding in CityDesk.EXE is probably just the default contents of a blank template and has nothing to do with the fact that we call a function converting the internal Unicode to UTF-8 before emitting it.)

Jorgen, as long as the HTML file specifies that it is encoded in UTF-8, the browsers shouldn't try to autodetect; they'll just interpret it as UTF-8. At least that is my understanding of web browsers, although of course my understanding could be wrong.

If you can show me an example of a web browser that doesn't display, for example, danish.joelonsoftware.com correctly, I'll be happy to change my opinion :) So far, hundreds of people go to that site every day and not one has complained about it not displaying correctly.

You can read a great deal more about this here:
http://www.cs.tut.fi/~jkorpela/www.html

Joel Spolsky
Thursday, May 08, 2003

On the one hand it's really great and easy to have Unicode/UTF-8 support in CD 2.

On the other hand I use CD to create a website and send the Index.html as a Newsletter. I tried this with different Listserver services around. Esp. those located in the US (Topica, ConstantContact) just take away my HTML META tag (UTF-8) - thus, my Newsletter looks rubbish...

Volker Halstenbach
Friday, May 09, 2003

... more on the Newsletter / E-Mail issue:
For reasons of convenience I have to create HTML and TEXT-only newsletters - I want to use CD for both variants (perhaps create multipart TEXT/HTML webpages). If I get the TEXT part encoded in UTF-8 my readers will have a problem, because it's interpreted as ASCII at their end.

The nicest feature would perhaps be: give us a variable {$.Encoding=XXXX$} which we can place anywhere... (Default might be UTF-8).

Volker Halstenbach
Friday, May 09, 2003

And then there's the problem of CityDesk not being able to create a plaintext version of an article... they're all HTML encoded unless you do some post-processing on them.

www.marktaw.com
Friday, May 09, 2003

Joel, I think you are right about UTF-8. In my efforts to be on the safe side I tried all possible combinations of foolish preferences in various browsers and got very strange results. But it seems now that you have to ACTIVELY do something stupid to ruin the browser's detection of character coding. But it IS possible – also on 'Joel on Software' (nice site by the way) and that was what worried me.

In theory I'm all for Unicode, but I can see from other mails in this thread that there seems to be other things to consider than just the appearance in browsers. Sometimes it must be a bit too interesting to make software.

Jorgen Brenting
Friday, May 09, 2003

For plain text email messages you just need to add the header

Content-Type: text/plain;charset=UTF-8

Joel Spolsky
Friday, May 09, 2003
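
For the newsletter case raised earlier in the thread, here is a minimal sketch in Python of a multipart text/HTML message where both parts carry an explicit charset=UTF-8 in their own Content-Type header. It is not something CityDesk or any list server from the thread does; the function name build_newsletter and the sample strings are made up for illustration, and it only shows what the mail headers would look like if you assemble the message yourself.

```python
from email.message import EmailMessage


def build_newsletter(subject, text_body, html_body):
    """Build a multipart/alternative mail whose parts both declare UTF-8."""
    msg = EmailMessage()
    msg["Subject"] = subject
    # Plain-text part: ends up with a "text/plain; charset=utf-8" header.
    msg.set_content(text_body, subtype="plain", charset="utf-8")
    # HTML part: the charset travels in the MIME header of the part,
    # so it does not depend on a META tag surviving inside the HTML.
    msg.add_alternative(html_body, subtype="html", charset="utf-8")
    return msg


# Hypothetical sample content with Danish characters.
msg = build_newsletter(
    "Nyhedsbrev",
    "Smørrebrød til alle",
    "<p>Smørrebrød til alle</p>",
)
plain_part = msg.get_body(preferencelist=("plain",))
print(plain_part["Content-Type"])
```

Whether a given list service keeps these headers intact when it relays the message is a separate question that this sketch cannot answer.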

Joel, that is fine for HTTP, but how do server-side scripts handle that...  when they expect PHP, ASP or perl?
For 95% of the situations (html files) it is perfect, much better than anything we had. But I'm not sure the other 5% can be neglected.

Suggestion: allow text/iso on a per-template basis (as a template setting, overriding the default UTF-8). This will make most people happy; those who want very basic HTML, text only or plain scripts.

Adriaan van den Brand
Friday, May 09, 2003

Oh boy, Adriaan, you are right. I'm back to CD1 again!

Jorgen Brenting
Friday, May 09, 2003

Ah well, hacking the executable was just a theory...of course for 90% of pages the encoding won't matter because there won't be any encoded characters. So there might be an opportunity for some post-processing here.

Just thinking (?) out loud...

Mike Gunderloy
Friday, May 09, 2003

I don't see the problem with PHP or ASP scripts. In both ASP and PHP anything that is sent directly to the browser (between %> and <%) is just sent directly to the browser. Nothing about UTF-8 will prevent that.

Once again, I don't mean to be argumentative, but I'm waiting to hear someone say, "I can't use UTF-8 because of the following situation" and then describe an exact situation where UTF-8 doesn't work. UTF-8 has been reasonably standard on the web for 5 years now and is a lot better than the mishmash of incompatible encodings we used to have (there were 4 different encodings just for Cyrillic!) It's also a lot better than what CityDesk 1 did, which was, completely ignore the issue and hope that it magically works OK :)

For us to support arbitrary encodings other than UTF-8 is a ton of work, and it makes our product harder to use by confusing people with the issue of character encodings when there is a nice standard that everyone has been using for 5 years successfully that makes it unnecessary to confuse people...

Joel Spolsky
Friday, May 09, 2003

When PHP or ASP is included in html (e.g. in <% %>) then you're right. However... if the script is entirely to be executed (e.g. script only, for instance an include file used by php) then the header is not read by the browser...

I'm very short on time. I have found iconv http://sourceforge.net/project/showfiles.php?group_id=25167

My idea is to make a small script: convert_utf <startdir> <extension[|extension][..]> <charset>

convert_utf (perl or anything) will then recurse all directories starting from startdir and convert any files with extension in set to the characterset (param3). It can just call iconv.

Is there anyone with some time on his/her hands? It would solve the problem for the very few of us who (fear to) have a problem, saving Joel tons of time which can be spent on the other wishes we have... ;-)

As I said: in 95% of the situations UTF-8 is the best characterset for the world...

Adriaan van den Brand
Saturday, May 10, 2003
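
A rough sketch of the convert_utf idea above, written in Python instead of calling iconv. The function name convert_tree, its parameters and the choice to rewrite files in place are all assumptions made for this example; nobody in the thread posted such a script.

```python
import os


def convert_tree(start_dir, extensions, target_charset, source_charset="utf-8"):
    """Re-encode matching files under start_dir in place.

    Returns the list of paths that were converted.
    """
    wanted = tuple("." + ext.lower().lstrip(".") for ext in extensions)
    converted = []
    for folder, _subdirs, files in os.walk(start_dir):
        for name in files:
            if not name.lower().endswith(wanted):
                continue
            path = os.path.join(folder, name)
            # Read raw bytes so line endings are left exactly as published.
            with open(path, "rb") as handle:
                text = handle.read().decode(source_charset)
            # Characters the target charset lacks (for example the euro sign
            # in ISO-8859-1) become numeric references such as &#8364;.
            # That is only sensible for HTML output, not for script source.
            data = text.encode(target_charset, errors="xmlcharrefreplace")
            with open(path, "wb") as handle:
                handle.write(data)
            converted.append(path)
    return converted
```

A call would look like convert_tree("published", ["html", "php"], "iso-8859-1"), where "published" is a hypothetical output folder. Note that the sketch does not touch the charset named in any META tag, so that declaration would still have to be changed in the template.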

OK, wait a sec...

If you just want to publish an ASP/PHP script with CityDesk, not an article, you just drop the file into CityDesk and it will be published byte-for-byte the way it looked when you dropped it in. No UTF-8 if you didn't encode it that way. This is just like dropping a picture in -- the file that is published is binary equivalent to what you dropped in.

The only place CityDesk will actually convert to UTF-8 output is when you have templates + articles.

Now, if for some reason you are using ASP/PHP as your templates and combining them with articles, I would guess that your article itself is going in between the %> and the <%, where UTF-8 is fine.

One thing to remember about UTF-8 is that all characters with codes < 128 are unchanged. Since the syntax of the ASP and PHP scripting languages never requires or even tolerates characters above 127, there's no problem writing ASP and PHP scripts themselves in UTF-8, because all the characters < 128 are the same in UTF-8 as they are in any ASCII-compatible encoding -- it's ASCII.

The one remaining tiny problem I can see is if you have a script with a string literal, where the string literal comes out of a CityDesk article. In ASP:

Dim a : a = "{$ .body $}"

We would encode that string literal in UTF-8. So you would need some way of telling the ASP or PHP engine that your script itself is encoded in UTF-8. In ASP, that's trivial -- there is a @CODEPAGE directive that does it (I believe you have to set it to 65001 to get UTF-8). I assume there is an equivalent for PHP.

In any case I still don't see a reason why UTF-8 is not going to just work for everybody. Please forgive me for dragging on this thread; it's not because I'm being argumentative. On the contrary I genuinely want to hear in advance if there is a real problem with our UTF-8 strategy and so far I just don't think there is one.

Joel Spolsky
Saturday, May 10, 2003
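
The point about codes below 128 can be checked with a few lines of Python. This is only an illustration: the helper name encodings_agree and the two sample script lines are invented here, and the three encodings compared are just common examples.

```python
def encodings_agree(text, encodings=("utf-8", "iso-8859-1", "cp1252")):
    """True when text encodes to identical bytes in every listed encoding."""
    encoded = {text.encode(name) for name in encodings}
    return len(encoded) == 1


# A script line made only of ASCII characters: same bytes everywhere.
ascii_line = 'Dim a : a = "hello"'
# The same line with a non-ASCII string literal: the bytes now differ.
accented_line = 'Dim a : a = "Café"'

print(encodings_agree(ascii_line))
print(encodings_agree(accented_line))
# 'é' is one byte in ISO-8859-1 but two bytes in UTF-8.
print(len("é".encode("iso-8859-1")), len("é".encode("utf-8")))
```

So only the string literal case needs the script engine to know the file is UTF-8; the script syntax itself is byte-identical either way.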

I think ASP is one step ahead on this to PHP, but the developers seem to be aware of this.

As long as the output is at some point intended for the browser, it should be OK, even if strings are UTF-8 encoded, as long as the resulting page is a UTF-8 page.

For plain english the problem is relatively small. However: don't we all use left/right quotes rather than simple ones?

For plain western-european languages the problem is bigger:
- Euro sign = &euro; = &0128
- ëïöü are quite common in German, and to a lesser extent (mostly ë, and some è and é) also in Dutch
- áàâ etc. are the basis of French
- the script looks weird if edited in a text editor (since these are not generally aware of UTF-8)

Again, I think it is very good that Citydesk works UTF-8 internally. UTF-8 can be transcoded to any characterset.

I would very much like a simple script tag:
{$setCharacterSet ISO-8859-1$}
{$setCharacterSet UTF-8$}
{$setCharacterSet$} (to return to Citydesk default which happens to be UTF-8 at the time)

which would force Citydesk to use this characterset for all following statements in {$$}
This would do, since if it is the first statement in the file, it is 'plain english'.

Is it difficult to do this? If PHP has utf8_decode and utf8_encode, there must be something around like that in your runtime environment? That should just be called before any substitution is made...

Adriaan van den Brand
Sunday, May 11, 2003

I've checked it
áàäâéèêëìíîïoòóôöùúüû€
comes out as
áàäâéèêëìíîïoòóôöùúüû€
when viewed in a text editor....

Maybe this is not a normal text:
Café would be CafÃ©
If someone then changed it in the text editor to Café (encoded in the default ISO...)
It would show in Citydesk as
Caf

Adriaan van den Brand
Sunday, May 11, 2003
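
Both effects described above can be reproduced in Python. This is a sketch under the assumption that the text editor reads the file as ISO-8859-1; the two helper names are made up for the example.

```python
def misread_as_latin1(text):
    """What UTF-8 output looks like in an editor that assumes ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def reread_as_utf8(text):
    """What ISO-8859-1 bytes look like to a program that expects UTF-8.

    Invalid byte sequences are replaced with U+FFFD, the replacement
    character; other software may drop or mangle them differently.
    """
    return text.encode("iso-8859-1").decode("utf-8", errors="replace")


print(misread_as_latin1("Café"))    # the two UTF-8 bytes of é show as two characters
print(repr(reread_as_utf8("Café")))  # the lone 0xE9 byte is not valid UTF-8
```

The second case matches the truncated "Caf" above: a single 0xE9 byte is not a complete UTF-8 sequence, so the reader has nothing valid to display for it.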

I had trouble with this same problem.  Here's how I fixed it.

To test if your server is screwing up your UTF-8, you can use lynx.  Here's my output.  Notice the "Content-type" is ISO-8859-1:

------------------------------------------------------------------------
[root@samana conf]# lynx -mime_header http://www.buddhasasana.com | head --lines=14
HTTP/1.1 200 OK
Date: Fri, 16 Apr 2004 19:54:13 GMT
Server: Apache/2.0.40 (Red Hat Linux)
Last-Modified: Mon, 12 Apr 2004 16:25:39 GMT
ETag: "18c136-1bee-322486c0"
Accept-Ranges: bytes
Content-Length: 7150
Connection: close
Content-Type: text/html; charset=ISO-8859-1

<html>
<HTML><HEAD>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="generator" content="Fog Creek CityDesk 2.0.19" />
------------------------------------------------------------------------

If you have access to the configs, change AddDefaultCharset:

------------------------------------------------------------------------

#AddDefaultCharset ISO-8859-1
AddDefaultCharset utf-8

------------------------------------------------------------------------

Restart apache.  Close your Browser, and clear your browser cache, just in case.

The header then changes to:

Content-Type: text/html; charset=utf-8

And, the unicode is readable.

Edward Holtz
Friday, April 16, 2004
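
The mismatch in the lynx output above (server header says ISO-8859-1, META tag says UTF-8) can also be spotted with a small Python check. This is a hedged sketch: the function names are invented for the example, and the regular expressions are deliberately simple and will not handle every way a charset can be declared. When both are present, browsers give the HTTP header priority over the META tag, which is why the server setting matters.

```python
import re


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(html):
    """Extract the charset declared in the first matching META tag."""
    match = re.search(
        r"""<meta[^>]+charset\s*=\s*["']?([\w.:-]+)""", html, re.IGNORECASE
    )
    return match.group(1).lower() if match else None


def charsets_conflict(content_type, html):
    """True when header and META both name a charset and they differ."""
    header = charset_from_header(content_type)
    meta = charset_from_meta(html)
    return header is not None and meta is not None and header != meta


# Values copied from the lynx output above.
header_value = "text/html; charset=ISO-8859-1"
page_head = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
print(charsets_conflict(header_value, page_head))
```

After the AddDefaultCharset change the header value becomes "text/html; charset=utf-8" and the same check reports no conflict.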
