Fog Creek Software
Discussion Board




Unix File Encoding

Hello!

After reading the following article:

http://www.joelonsoftware.com/articles/Unicode.html


I got curious about how files are encoded.


Questions:

(1)  How do I find out how a file is encoded in Solaris or Mandrake Linux?  (In Windows using TextPad, we can see document properties.)

(2)  Can file encodings change during transfer?  Would a Windows-based hex editor tell me whether special characters are encoded correctly?

Thank You!
Matt

Matt
Monday, November 03, 2003

With really old UNIX systems you might have to worry about endianness. Otherwise most plain text is assumed to be 7-bit ASCII. Read Joel's article on internationalization. Basically, it depends on the file format.

Li-fan Chen
Monday, November 03, 2003

CRLF might be translated into LF and vice versa. Try FTP in binary mode, then try it in text mode. You'll see what happens.
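
A minimal sketch of how one might check for this after a transfer, assuming Python 3 is available. The function name count_line_endings is made up for illustration; it only counts line-ending styles in the raw bytes, so running it on the file before and after the transfer shows whether text mode rewrote them.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LFs that are not part of a CRLF
    cr = data.count(b"\r") - crlf   # CRs that are not part of a CRLF
    return {"CRLF": crlf, "LF": lf, "CR": cr}


# Simulate what a text-mode transfer from Unix to Windows may do.
unix_copy = b"one\ntwo\n"
dos_copy = unix_copy.replace(b"\n", b"\r\n")

print(count_line_endings(unix_copy))
print(count_line_endings(dos_copy))
```

To use it on a real file, read the file in binary mode, e.g. open(path, "rb").read(), so that Python itself does not translate the line endings.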

Li-fan Chen
Monday, November 03, 2003

On unix(ish) systems, use 'file'.  On my OSX box:

ahurst@sdcsi4 Desktop$ file filetest.txt
filetest.txt: ASCII text
ahurst@sdcsi4 Desktop$ file 03-CFO-Final.pdf
03-CFO-Final.pdf: PDF document, version 1.2
ahurst@sdcsi4 Desktop$ file fma9_0_2_10_1.zip
fma9_0_2_10_1.zip: Zip archive data, at least v1.0 to extract

It works rather well.  I haven't run across any Unicode files on this system to test with yet, though...
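
For text files, a rough version of this kind of guess can be sketched in Python 3. This is only an illustration under simple assumptions, not how 'file' itself works: the name guess_text_encoding is hypothetical, and the check ignores UTF-16, byte-order marks, and everything else a real detector would look at.

```python
def guess_text_encoding(data: bytes) -> str:
    """Very rough guess: ASCII, UTF-8, or some unknown 8-bit encoding."""
    if all(byte < 0x80 for byte in data):
        return "ASCII"            # no byte has the high bit set
    try:
        data.decode("utf-8")      # strict decoding rejects invalid sequences
        return "UTF-8"
    except UnicodeDecodeError:
        return "unknown 8-bit"    # e.g. one of the iso-8859-X encodings


print(guess_text_encoding(b"plain text"))
print(guess_text_encoding("caf\u00e9".encode("utf-8")))
print(guess_text_encoding(b"caf\xe9"))
```

Note that the last case can only say "not ASCII and not valid UTF-8"; it cannot tell which 8-bit encoding was intended.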

Andrew Hurst
Monday, November 03, 2003

What do you mean by file encoding? In UNIX a file is just a sequence of bytes. It is entirely up to applications how to treat those bytes.

Passater
Monday, November 03, 2003

Matt,

Seems like we don't know what you're asking.

Are you imagining a plain text file consisting of unicode characters and wondering how to tell which encoding it uses?

Tony Chang
Tuesday, November 04, 2003

Hello!

Thank you all for your comments, I am very grateful!


To clarify the question, I was wondering whether files can be encoded in, say, UTF-8 instead of ASCII?


Perhaps that question does not make sense, I've just started learning about all of this stuff, and am trying to understand an end-to-end solution:
 
incoming files -> processing -> database <- retrieval <- web-display

How does an ASCII file store characters above 127 then? I assume (based on Passater's comment) that in ASCII a 2-byte character would be stored as 2 ASCII characters, and it is left up to your viewer (shell, vi, whatever) to display it correctly?

My real interest is in ensuring that we are receiving our files with multi-language characters correctly, and in coding a way to test that.

The confusion begins when I log into our Solaris box using PuTTY: I see the same thing displayed differently depending on whether I use telnet or ssh.  (telnet shows <e with acute> and ssh shows \350.  We are expecting <e with acute>.)

Therefore, I cannot trust my display...

Thank You!
Matt

Matt
Wednesday, November 05, 2003

Matt,

your clarifications don't really clarify... Of course files can be stored in UTF-8 instead of ASCII.

I think part of the problem is that people dealing with foreign character sets often aren't aware that they are. We'll call a text file encoded in iso-8859-1 an "ASCII" file although that's not the encoding actually used for the file.  "ASCII" is basically used as a synonym for "plain text file".

2-byte characters don't exist in ASCII. ASCII is only 7 bits and never stores any characters above 127. Any text file you find with the high bit set in any byte may be called "ASCII", but the encoding actually used will most likely be UTF-8 or 8859-X.
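
A small Python 3 sketch of the difference, using e-acute (U+00E9) as the example character. The variable names are just for illustration.

```python
e_acute = "\u00e9"  # e with acute accent

latin1_bytes = e_acute.encode("iso-8859-1")  # a single byte with the high bit set
utf8_bytes = e_acute.encode("utf-8")         # two bytes, both with the high bit set

print(len(latin1_bytes), latin1_bytes)
print(len(utf8_bytes), utf8_bytes)

# ASCII has no code for this character at all, so encoding fails.
try:
    e_acute.encode("ascii")
except UnicodeEncodeError:
    print("e-acute cannot be represented in 7-bit ASCII")
```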

Since you seem to be dealing with French text, the encoding you're probably using is iso-8859-1.  \350 is e-accent-grave.

Unfortunately you can only tell from context which character set is being used, e.g. you can't automatically tell whether the \350 character is meant to be an e-accent or a theta (as it is in 8859-7).
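
The ambiguity is easy to reproduce. In this Python 3 sketch the same single byte, octal 350, is decoded under both character sets; the variable names are illustrative only.

```python
raw = bytes([0o350])  # the byte that ssh displayed as \350 (hex E8)

as_latin1 = raw.decode("iso-8859-1")  # Western European reading
as_greek = raw.decode("iso-8859-7")   # Greek reading

print(as_latin1)  # e with grave accent
print(as_greek)   # Greek small letter theta
```

Nothing in the byte itself says which of the two readings is the intended one.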

It's necessary to establish the character set you're using with whoever is providing you with the files. In case you need to change the encoding, check the documentation for your database or, if on Unix, look at the man pages for 'recode'.
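
If 'recode' is not at hand, the same kind of conversion can be sketched in Python 3. This is a minimal illustration, assuming the source encoding has already been agreed on; the function name transcode is hypothetical.

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Re-encode bytes from an agreed source encoding into the target encoding."""
    text = data.decode(source)   # raises UnicodeDecodeError on invalid input
    return text.encode(target)   # raises UnicodeEncodeError if not representable


latin1_data = b"caf\xe9"         # "cafe" with e-acute in iso-8859-1
print(transcode(latin1_data, "iso-8859-1"))
```

One caveat: iso-8859-1 assigns a character to every byte value, so decoding with it never fails. A wrong guess about the source encoding produces wrong characters rather than an error.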

  -tim

a2800276
Thursday, November 06, 2003

Thanks Tim!

What you have said has cleared up my understanding of files.
I've also read a few articles on the topic and think I have my head around it.

Thank you for your help!
Matt

Matt
Thursday, November 06, 2003
