Fog Creek Software
Discussion Board




Reverse strings/words and i18n

We've started asking candidates to write a function that reverses strings in C (one of our developers started after reading it here).  Nearly everyone gets it close to right, but some people goof up the end condition.  Others end up including the final null in the swap.  Nothing surprising there.

The thing that I find interesting is that nobody considers that the characters might be multi-byte.  Actually, the only person who asked the question was a junior developer that we already hired (he came over from our support group and was never asked the question in his interview).  He had recently goofed something up while fixing a bug and someone beat it into him.

So now, after the candidate gets the answer right, we ask them if there are any inputs that this function won't work with.  Most people don't have any idea, which I find amazing.  These are usually people with 5-10 years of C/C++ development experience.  The ones who do get it usually don't really know how to fix the problem.  The best answer we received was that we should use UNICODE (which would work, but might force you to redesign your app).

Joel had an article that discussed this issue.  The string/word reversal problem is a good way to see if people understand the issue (and at what level).

Allen Hadden
Wednesday, February 09, 2005
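
For reference, a minimal sketch of the byte-wise reversal the interview question above is after. The name reverse_bytes is hypothetical, and the code assumes one byte per character, which is exactly the assumption the thread goes on to question. The two mistakes mentioned above are avoided by starting the right-hand pointer one byte before the terminator and by stopping as soon as the pointers meet.

```c
#include <stddef.h>
#include <string.h>

/* Reverse a NUL-terminated string in place, one byte at a time.
   ASSUMPTION: every character occupies exactly one byte. */
void reverse_bytes(char *s)
{
    size_t len = strlen(s);
    if (len < 2)
        return;                  /* empty or one-byte string: nothing to swap */

    char *left = s;
    char *right = s + len - 1;   /* last byte BEFORE the terminating NUL */

    while (left < right) {       /* end condition: pointers meet or cross */
        char tmp = *left;
        *left++ = *right;
        *right-- = tmp;
    }
}
```

Called on a writable buffer holding "blue is true", this leaves "eurt si eulb" in the buffer. Passing a string literal directly would be undefined behavior, since the function writes to its argument.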

What I find annoying is that the question may not be precise.

I would be one of the programmers that wouldn't consider the multilingual case if asked to provide a string inverter.  I'd get really annoyed if the interviewer then said my solution fails for multibyte input when the question never specified that multibyte could be a possibility.

The programmer may be wrong to assume 8-bit characters, but I wouldn't fault them if the question did nothing to correct this.

Define the problem properly, get a proper answer.

zekaric
Thursday, February 10, 2005

Interviewer: "That's a fine answer, except that you ...  Now here's a follow up, does this work with Unicode?"

That should generate an interesting conversation, rather than smacking the candidate upside the head.

"Function no work with Unicode.  Bad candidate, go back to college!"

anon
Friday, February 11, 2005

Well, we definitely "don't smack them upside the head" if they don't get it right.  We just would give someone bonus points if they brought it up themselves.  If they don't think of the issue themselves, we give them a hint, which usually starts a conversation where we can assess whether they understand the issue or not.

The point is that developers dealing with char* strings should automatically assume that multi-byte characters are possible, but very few do.  It is my belief that the whole issue is a little off-topic for entry-level programming courses/books and a little too mundane for advanced ones.  So few people consider the issue by default unless they've been hit by it.

Of course, the reason it's such an issue for us is that our system wasn't originally designed to use UNICODE.  If you use UNICODE (wchar_t), this stuff isn't an issue because characters are always the same size and you can swap them to your heart's content.

Allen

Allen Hadden
Friday, February 11, 2005

Why would char* ever mean multibyte?  Wouldn't it be wchar* in that case?

Michael H. Pryor
Fog Creek Software
Saturday, February 12, 2005

Multi-byte strings are typically represented using char*.  UNICODE strings are represented using wchar_t* (this is actually a bit of a simplification, but good enough for most discussions).

A very good explanation, from Joel:

http://www.joelonsoftware.com/articles/Unicode.html

Allen

Allen Hadden
Sunday, February 13, 2005

I personally hate this kind of interviewer. You are clearly incompetent in your own area. You ask the "wrong" question (it should be wchar * and not char *, as already pointed out), you give the wrong answer to your own question, and you clearly don't quite understand the right answer when one is provided. So,

my_func(char *str) - is supposed to deal with simple C-style zero terminated string. 1byte - 1 char.
It is correct that C allows you to pass a multibyte character string here, because it's C and you can cast it any way you like.

The right way, of course, is to use my_func(wchar *str), unless you would like to reinvent the wheel and use your own tricks instead.

What does it have to do with an interview anyway? Do you also ask "what linker flags are needed on HPUX for all this to work"? I will not be surprised if you do.

OnlyOne
Monday, February 14, 2005

Hey OnlyOne, you really have no clue what you're talking about.  The thing that's amazing, is that you don't seem to understand your limitations.  You might want to do some homework on multi-byte strings.  Start off by reading this:

http://www.joelonsoftware.com/articles/Unicode.html

Each character in a multi-byte string can be one or more bytes.  Programmers who are not familiar with internationalization issues usually think that one character equals one byte, but they are wrong.  You are an example of our computer science curriculum failing us.

If you still don't believe me, look at the wcstombs documentation (either UNIX man page or MSDN).  There's a function to convert from wide characters (wchar_t*) to multi-byte characters (char*).  Think about what happens if a wchar_t cannot be represented in 1 byte.

The purpose of the "reverse strings" question is to determine, firstly, if they understand C well enough to reverse a string.  A secondary purpose is to determine if the person understands internationalization issues (OnlyOne clearly does not).

Allen Hadden
Monday, February 14, 2005
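
To make the point above concrete, here is a hedged sketch of a reversal that keeps multi-byte characters intact. It assumes the char* holds valid UTF-8, which the thread never specifies; a legacy code page such as Shift-JIS would need different boundary rules, or a round trip through wide characters with mbstowcs and wcstombs as the post suggests. The names reverse_range and reverse_utf8 are hypothetical. The idea is to reverse every byte first and then flip each multi-byte sequence back into its original order.

```c
#include <stddef.h>
#include <string.h>

/* Reverse the bytes s[lo..hi], both ends inclusive. */
static void reverse_range(char *s, size_t lo, size_t hi)
{
    while (lo < hi) {
        char tmp = s[lo];
        s[lo++] = s[hi];
        s[hi--] = tmp;
    }
}

/* Reverse a NUL-terminated string in place, code point by code point.
   ASSUMPTION: the bytes are valid UTF-8. Combining marks are treated as
   separate characters, so they end up attached to a different base letter. */
void reverse_utf8(char *s)
{
    size_t len = strlen(s);
    if (len < 2)
        return;

    /* Pass 1: reverse all bytes. Every multi-byte sequence is now backwards. */
    reverse_range(s, 0, len - 1);

    /* Pass 2: in the reversed buffer a sequence shows up as its continuation
       bytes (bit pattern 10xxxxxx) followed by its lead byte. Flip each run. */
    size_t i = 0;
    while (i < len) {
        size_t j = i;
        while (j + 1 < len && ((unsigned char)s[j] & 0xC0) == 0x80)
            j++;                 /* skip continuation bytes up to the lead byte */
        reverse_range(s, i, j);
        i = j + 1;
    }
}
```

With the four bytes 61 C3 A9 7A ("a", "é", "z") as input, a plain byte swap would leave A9 C3 in the middle, which is not a valid character, whereas this version produces 7A C3 A9 61.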

The wchar_t type will solve the problem for the most part, but you may still need multiple wchar_ts to represent some of the more obscure letters.  According to the spec, at least.

Using the 32 bit representation you can assume one character per 32 bit value.

http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G19273

zekaric
Monday, February 14, 2005

zekaric, that is exactly right.  Under Windows, a character can be made up of multiple wchar_t too (just like with chars).  Under Windows, wchar_t is 2 bytes and UTF-16 is used to encode characters.  See:

http://www.microsoft.com/globaldev/DrIntl/columns/002/default.mspx#Q2

Note that under Solaris (and Linux, I think), wchar_t is 4 bytes, which can absolutely contain every character in use on Earth past and present.  So under Solaris (and probably Linux) you can finally assume that one character equals one wchar_t.

Having said all that, those characters that can't be represented in 2 bytes are extremely rare and are generally not in use except for academia (that's my understanding, anyway).

Many legacy applications (including ours) still use char*, which is why it continues to be a problem for us.  I suspect that many people are in the same boat.

Allen Hadden
Monday, February 14, 2005
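
The same trap exists with 2-byte wide characters, as described above: a character outside the 16-bit range is stored in UTF-16 as a surrogate pair, a high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF). The sketch below is only an illustration under that assumption. It uses uint16_t as a stand-in for a 2-byte wchar_t so that it behaves the same on every platform, and reverse_utf16 is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse n UTF-16 code units in place while keeping surrogate pairs intact.
   ASSUMPTION: the input is well-formed UTF-16; uint16_t stands in for a
   2-byte wchar_t. */
void reverse_utf16(uint16_t *s, size_t n)
{
    if (n < 2)
        return;

    /* Pass 1: reverse all code units. */
    for (size_t lo = 0, hi = n - 1; lo < hi; lo++, hi--) {
        uint16_t tmp = s[lo];
        s[lo] = s[hi];
        s[hi] = tmp;
    }

    /* Pass 2: a pair now reads low surrogate, then high surrogate.
       Swap each such pair back into high, low order. */
    for (size_t i = 0; i + 1 < n; i++) {
        if (s[i] >= 0xDC00 && s[i] <= 0xDFFF &&
            s[i + 1] >= 0xD800 && s[i + 1] <= 0xDBFF) {
            uint16_t tmp = s[i];
            s[i] = s[i + 1];
            s[i + 1] = tmp;
            i++;                 /* step over the pair just repaired */
        }
    }
}
```

For example, the units 0x0041, 0xD834, 0xDD1E, 0x0042 ("A", U+1D11E, "B") come out as 0x0042, 0xD834, 0xDD1E, 0x0041. Where wchar_t is 4 bytes, a plain swap of wchar_t values is enough at the code point level.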

Let me elaborate a little more on this. Yes, I did ignore the fact that you store multibyte characters as (char *).
But, quoting from "The Single Most Important Fact About Encodings":

"It does not make sense to have a string without knowing what encoding it uses".

So, since you don't say anything about the encoding, I am assuming we're talking about vanilla ANSI chars. So don't get started about "wow, they are clueless about i18n". It is not they (or I) who are clueless, but you.
State your question right, or you're working on the "garbage in, garbage out" principle.

OnlyOne
Wednesday, February 16, 2005

OnlyOne, nice recovery, except for the fact that you clearly indicated your lack of understanding in your first flame-ridden post when you said "my_func(char *str) - is supposed to deal with simple C-style zero terminated string. 1byte - 1 char."  That is simply wrong.  No amount of back-pedaling is going to change that.  Nor will trying to imply that I don't understand I18N.  Perhaps you made a mistake.  That's fine...it happens to all of us, but learn from it and move on.

The whole problem is that developers are assuming too much about char*.  Developers shouldn't have to be told, because in the real world you won't be told until someone tries to use your product in Asia and it doesn't work.  Then you have a ton of re-work.  You can blame it on faulty requirements because you weren't told your product needed to work with multi-byte strings, but that's a cop-out.  It is your responsibility to understand how to manipulate strings.

This will be my final post on the topic, regardless of the flame-bait that OnlyOne throws my way.  I apologize for continuing this thread to all those who don't care about this.  :)

Allen Hadden
Wednesday, February 16, 2005

Sorry in advance for my ignorance; I don't understand why Unicode etc. are being discussed here.
To the gurus: I have just one year of experience, so please bear with me.

When the string is input, I believe it should be stored in a 2D array: move to the next row whenever a space is read, append a null to the end of every word, and treat a null input as the end of the string.

So the string:
blue is true

is saved as:
blue
is
true

Then print it from the bottom up, adding a space each time you move up a row:
true is blue

Please correct me if I am wrong, since I am still at interviewee level.

Sri Harsha
Wednesday, March 30, 2005
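
The word-order reversal described above can also be done without the 2D array. The sketch below swaps in a different, in-place technique: reverse the whole string, then reverse each word back. It assumes single-byte characters and words separated by spaces, and the names reverse_span and reverse_words are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Reverse the bytes from lo to hi, both ends inclusive. */
static void reverse_span(char *lo, char *hi)
{
    while (lo < hi) {
        char tmp = *lo;
        *lo++ = *hi;
        *hi-- = tmp;
    }
}

/* Reverse the order of the words in a NUL-terminated string, in place.
   ASSUMPTION: one byte per character, words separated by ' '. */
void reverse_words(char *s)
{
    size_t len = strlen(s);
    if (len < 2)
        return;

    /* Step 1: reverse the whole string: "blue is true" -> "eurt si eulb". */
    reverse_span(s, s + len - 1);

    /* Step 2: reverse each word back:   "eurt si eulb" -> "true is blue". */
    char *start = s;
    for (char *p = s; ; p++) {
        if (*p == ' ' || *p == '\0') {
            if (p > start)
                reverse_span(start, p - 1);   /* word is start .. p-1 */
            if (*p == '\0')
                break;
            start = p + 1;                    /* next word begins after the space */
        }
    }
}
```

This needs no extra storage, at the cost of touching every byte twice.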

Sri, if you reverse "blue is true" you will get "eurt si eulb".

Dave
Sunday, April 03, 2005
