Fog Creek Software
Discussion Board




Python unicode strangeness

Running on a Windows 2000 machine (little endian) with Cygwin.

I have a file (web.txt) which contains 3 Arabic characters in Unicode.  When opening the file from the Python interactive command line, I get the following:

>>> import sys
>>> file = open("./web.txt", "r")
>>> file.readline()
'\xff\xfe-\x060\x06A\x06\r\x00\n'

Which is correct, and what I expect.  When I execute the following (same) lines in a script:
file = open("./web.txt","r")
line = file.readline()

The output is:
■-♠0♠A♠

If I replace the readline with
  file = codecs.open("./web.txt", "r", "utf-16le")
  file.readline()
the output is:
  File "c:\Python23\lib\encodings\utf_16_le.py", line 26, in readline
    raise NotImplementedError, '.readline() is not implemented for UTF-16-LE'

Finally, reading the line like this:
  file = codecs.open(args[0],"r", "utf-16le")
  line = file.read()
Gives:
      return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-3: character maps to <undefined>

Does anyone have any idea what's going on?  I'm a bit confused...

Furious George
Monday, July 26, 2004

Your file begins with a BOM, so it's not UTF-16-LE, it's UTF-16.

Iago
Monday, July 26, 2004
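
To illustrate the BOM point, here is a minimal sketch. It assumes a modern Python 3 interpreter (the thread itself used Python 2.3) and reuses the bytes from the interactive session above, with one extra 0x00 appended so that the final newline is a complete UTF-16 code unit. The variable names are only for illustration.

```python
import codecs

# Bytes shown in the question, plus the trailing 0x00 that completes
# the UTF-16 newline (an assumption about what the file contains).
raw = b"\xff\xfe-\x060\x06A\x06\r\x00\n\x00"

# The first two bytes are the little-endian byte order mark.
has_bom = raw.startswith(codecs.BOM_UTF16_LE)

# "utf-16-le" treats the BOM as ordinary data, so it survives as U+FEFF.
with_bom = raw.decode("utf-16-le")

# "utf-16" reads the BOM to pick the byte order and then drops it.
without_bom = raw.decode("utf-16")

print(has_bom)
print(ascii(with_bom))
print(ascii(without_bom))
```

The only difference between the two results is the leading U+FEFF character that "utf-16-le" leaves in place.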

Thanks for the reply.

Changing the argument to "utf-16" yields this result:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>
Note the----------------^^^
(0-2: character maps to <undefined>, as opposed to 0-3).

Furious George
Monday, July 26, 2004
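
The shift from "0-3" to "0-2" is consistent with the BOM no longer being part of the decoded string. The sketch below reproduces both ranges under some assumptions: Python 3, cp437 as a stand-in for whatever limited code page the console was using, and a hypothetical helper named unencodable_span.

```python
def unencodable_span(text, encoding="cp437"):
    """Return (first, last) index of the leading run of characters that
    the given encoding cannot represent, or None if everything encodes."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as exc:
        # exc.end is exclusive; the error message prints an inclusive range.
        return exc.start, exc.end - 1
    return None


# Bytes from the question, with the final newline completed by 0x00.
raw = b"\xff\xfe-\x060\x06A\x06\r\x00\n\x00"

span_le = unencodable_span(raw.decode("utf-16-le"))  # BOM kept as U+FEFF
span_bom = unencodable_span(raw.decode("utf-16"))    # BOM consumed

print(span_le, span_bom)
```

With "utf-16-le" the run covers U+FEFF plus the three Arabic letters; with "utf-16" it covers only the three letters. Either way the failure happens when encoding for output, not when reading the file.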

Works for me - given a file containing the string you specify, and the Python code you specify, I get a "truncated character" as expected on the \n; appending another null byte to the file, the string reads in correctly.

I'm not familiar enough with Python to be able to guess what's happening differently when you run it.

Iago
Monday, July 26, 2004
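
The "truncated character" can be reproduced directly from the bytes that readline() returned: a byte-oriented readline stops at the 0x0A byte, which is only the first half of the UTF-16 newline. A small sketch, assuming Python 3:

```python
# Exactly what readline() returned in the question: the line ends at the
# 0x0A byte, so the second byte of the UTF-16 newline is missing.
line = b"\xff\xfe-\x060\x06A\x06\r\x00\n"

try:
    line.decode("utf-16")
    truncated = False
except UnicodeDecodeError:
    # An odd number of bytes after the BOM cannot be valid UTF-16.
    truncated = True

# Appending the missing null byte makes the data decodable.
fixed = (line + b"\x00").decode("utf-16")

print(truncated)
print(ascii(fixed))
```

This is why reading the whole file and decoding it afterwards is safer than splitting raw bytes on 0x0A.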

>>> file = open("./web.txt", "r")

This is not correct. You should never manually open a Unicode file in text mode. Always use binary mode ("rb").

>>> When I execute the following (same) lines in a script:
>>> The output is: ■-♠0♠A♠

You're probably using a Unicode-aware console, or writing it to a file and using a Unicode-aware editor.

>>> file = codecs.open("./web.txt", "r", "utf-16le")
>>> file.readline()
>>> raise NotImplementedError, '.readline() is not implemented for UTF-16-LE'

Same here. No luck in Python 2.3.3

>>>  file = codecs.open(args[0],"r", "utf-16le")
>>>  line = file.read()

It works. Try print repr(line). I bet you're trying to print it on a limited text console. Try using a graphical Unicode-aware console like IDLE (which comes with the Python distribution). Then you don't need to use repr(line).

Serge
Tuesday, July 27, 2004
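
Putting the advice above together, here is a rough sketch of reading the file in binary mode, decoding it in one go, and falling back to an escaped form when the console encoding cannot show the text. It assumes Python 3, where ascii() plays the role that repr() played in Python 2. The helper names read_utf16_lines and console_safe are hypothetical, cp437 is only an assumed console code page, and a temporary file stands in for web.txt.

```python
import os
import tempfile


def read_utf16_lines(path):
    """Read the whole file as bytes, decode it as UTF-16 (honouring the
    BOM), and only then split it into lines."""
    with open(path, "rb") as fh:
        text = fh.read().decode("utf-16")
    return text.splitlines()


def console_safe(text, encoding):
    """Return text unchanged if the encoding can represent it, otherwise
    an ASCII-escaped form that any console can print."""
    try:
        text.encode(encoding)
        return text
    except UnicodeEncodeError:
        return ascii(text)


# Stand-in for web.txt: BOM, three Arabic letters, CR LF in UTF-16-LE.
sample = b"\xff\xfe-\x060\x06A\x06\r\x00\n\x00"
tmp = tempfile.NamedTemporaryFile(delete=False)
try:
    tmp.write(sample)
    tmp.close()
    lines = read_utf16_lines(tmp.name)
finally:
    os.remove(tmp.name)

shown = console_safe(lines[0], "cp437")
print(shown)
```

On a console that can display Arabic, passing its real encoding (for example "utf-8") to console_safe returns the text unchanged.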
