Fog Creek Software
Discussion Board

256 Entries to Count Bits in a Byte?


In your Guerilla Guide to Interviewing, you say "For #3, you can see how well they learned the bitwise operators in C.... but this is a skill, not an aptitude, so you can help them with these. The interesting thing is to watch them write a subroutine that counts all the bits in a byte, then ask them to make it much, much faster. Really smart candidates will create a lookup table (after all, it's only got 256 entries) that they only have to create once."

Why do you need 256 entries to count the number of bits on in a byte? Don't you just need 8?

00 00 00 01
00 00 00 10
00 00 01 00
00 00 10 00
00 01 00 00
00 10 00 00
01 00 00 00
10 00 00 00

Bitwise-AND against each of these entries and increment the count each time the bitwise-AND equals the entry. I don't understand why you would want to bitwise-AND against a byte with more than one bit on.
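The eight-entry approach described above can be sketched as follows (a minimal sketch; `countbits_loop` is an illustrative name, not from the thread):

```c
#include <assert.h>

typedef unsigned char BYTE;

/* Count set bits by testing each of the 8 single-bit masks in turn. */
int countbits_loop(BYTE b)
{
    int count = 0;
    int i;
    for (i = 0; i < 8; i++) {
        if (b & (1 << i))   /* one bitwise-AND and compare per bit */
            count++;
    }
    return count;
}
```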

Monday, February 23, 2004

Yeah, ok, you don't get the job :)

My 256 entry table looks like this:


Joel Spolsky
Fog Creek Software
Monday, February 23, 2004

jw, bitwise AND followed by increment, repeated 8 times, is slow.

the point of the 256-byte table is to do the whole thing with one addition and one lookup:

BYTE JoelsTable[256];

int countbits(BYTE b) {  return JoelsTable[b];  }
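The table still has to be filled in once before use; the post doesn't show that step, but one hedged sketch (doing the slow per-bit count exactly once per entry, at startup) could be:

```c
typedef unsigned char BYTE;

BYTE JoelsTable[256];

/* Fill the table once, e.g. at program startup.  Each entry is
   computed the slow way, but only ever computed one time. */
void InitJoelsTable(void)
{
    int i, bit;
    for (i = 0; i < 256; i++) {
        BYTE count = 0;
        for (bit = 0; bit < 8; bit++)
            if (i & (1 << bit))
                count++;
        JoelsTable[i] = count;
    }
}

/* After initialization, counting the bits in a byte is one lookup. */
int countbits(BYTE b) { return JoelsTable[b]; }
```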
Monday, February 23, 2004

It's like cutting the cake of the problem in an entirely different dimension.  In Joel's solution he's storing information about a static state, whereas your solution is calculating the state.

It's precisely the analogue of sine and cosine tables being stored in games and 3D applications.  Yes, you can calculate them, but they aren't going to change for the application, so it's quicker to have a static table.

Question: if you change the original requirements to an arbitrary number of bits, does that change the nature of the solution?

Hint: how does one set of 256 patterns differ from any other set of 256 patterns?

Simon Lucy
Monday, February 23, 2004

Bah, just do it this way...

int NumBitsSet(unsigned long x)
{
    unsigned long n = x;
    n = (n & 0x55555555) + ((n & 0xAAAAAAAA) >> 1);  /* sum adjacent bits       */
    n = (n & 0x33333333) + ((n & 0xCCCCCCCC) >> 2);  /* sum adjacent 2-bit sums */
    n = (n + (n >> 4)) & 0x0F0F0F0F;                 /* sum adjacent nibbles    */
    n = (n + (n >> 16));                             /* fold high half down     */
    n = (n + (n >> 8)) & 0x3F;                       /* final fold; max is 32   */

    return (int)n;
}

Monday, February 23, 2004

Beautiful... It took me 10 minutes :)
Monday, February 23, 2004

I'd like to get pedantic for a second :)

A read that incurs an L1 and L2 cache miss takes (judging by my PC) about 1 microsecond. (I tried FLD [esi+0] then FSTP ST(0). There was 1MB of reads between each one, since I'm not sure offhand how big the L2 cache on my PC is and Windows won't let you do a WBINVD.)

It looks like a sine (FSIN, with each call operating on the output of the last) takes about 270 cycles. That's less than half a microsecond on my PC. (650MHz Pentium III.)

So if you're not doing too many sines and cosines, and/or they're interspersed with enough memory accesses in the right (wrong!) places to evict the trig table, you should perhaps do it the inefficient way.

The results may be different if you're doing several at once, but cache considerations still apply depending on how close in memory successive table lookups will be.

So, for the bit counting thingy, a function that computes the thing each time may actually be quicker.

For more bit-twiddling madness, the reader may like to visit this page:

Interestingly, I've a couple of times been asked at interviews to make a function run fast. The fastest version, according to the interviewer, generally involves a table lookup, or unrolling the loop entirely. But I'm not convinced this is the case on modern CPUs.
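Whether the table actually wins is easy to check empirically. A rough, machine-dependent harness might look like this (a sketch only; `compare_methods`, the array size, and the test pattern are arbitrary choices, not from the thread, and the clock granularity is coarse):

```c
#include <stdio.h>
#include <time.h>

typedef unsigned char BYTE;

static BYTE table[256];

static int count_loop(BYTE b)
{
    int c = 0, i;
    for (i = 0; i < 8; i++)
        if (b & (1 << i))
            c++;
    return c;
}

/* Time both methods over the same data; returns 0 if their totals agree. */
int compare_methods(void)
{
    enum { N = 1 << 20 };
    static BYTE data[N];
    long sum1 = 0, sum2 = 0;
    clock_t t0, t1, t2;
    int i;

    for (i = 0; i < 256; i++)
        table[i] = (BYTE)count_loop((BYTE)i);
    for (i = 0; i < N; i++)
        data[i] = (BYTE)(i * 37);          /* arbitrary test pattern */

    t0 = clock();
    for (i = 0; i < N; i++)
        sum1 += count_loop(data[i]);       /* compute each time */
    t1 = clock();
    for (i = 0; i < N; i++)
        sum2 += table[data[i]];            /* table lookup */
    t2 = clock();

    printf("loop: %ld clocks, table: %ld clocks\n",
           (long)(t1 - t0), (long)(t2 - t1));
    return sum1 == sum2 ? 0 : 1;           /* sanity check: same totals */
}
```

The relative timings will vary with the CPU and with whatever else is competing for the cache, which is exactly the point being made above.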

Insert half smiley here.
Monday, February 23, 2004

Whoops, I should add that (in case it's not obvious) this applies for compilers that will magically turn your sin() call into an inline FSIN.

And I forgot to set the FPU to single-precision mode, so expect a further speedup if you do that.

Insert half smiley here.
Monday, February 23, 2004

An interviewer who judges code speed by looking at it (vs running and profiling it) might not be worth working for...

Dan Maas
Monday, February 23, 2004

Given the optimisations that compilers perform even on non-optimised builds, the length of pre-fetch queues on different processors, and the different paths through execution engines, it would be tricky to know whether an unrolled loop, or some fancy-schmancy set-this-flag-loop-forever thing, was actually changed or improved.

Getting the code on an even word boundary could give you a performance difference.

Simon Lucy
Monday, February 23, 2004

I was going to nitpick about this once, but never got around to it.

I agree that it is not clear at all that the lookup table is a good solution.  Memory access is slow.  In a tight loop, looking up sine values in a separate table can thrash the cache, whereas computing an approximation on the fly could be much faster.

Monday, February 23, 2004

To build the lookup table without errors, most programmers would write a script to generate the table, but the script could be used to count the number of bits!
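A hedged sketch of such a generator (`bits_of` and `emit_table` are illustrative names): the counting loop at its core is, of course, already a working bit counter in its own right.

```c
#include <stdio.h>

/* The counting loop the generator relies on -- which is itself
   the very function the table is meant to replace. */
int bits_of(int i)
{
    int count = 0, bit;
    for (bit = 0; bit < 8; bit++)
        if (i & (1 << bit))
            count++;
    return count;
}

/* Emit a C source fragment containing the 256-entry table. */
void emit_table(FILE *out)
{
    int i;
    fprintf(out, "unsigned char bits_in[256] = {\n");
    for (i = 0; i < 256; i++)
        fprintf(out, "    %d,\n", bits_of(i));
    fprintf(out, "};\n");
}
```

Paste the generator's output into the program once, and the table never needs to be computed again.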

Monday, February 23, 2004

You just have to use a profiler to determine whether it's worth it to use a look-up table. In my experience, it nearly always is. Especially if the table isn't too large and it is used frequently, in which case it will be in the cache.

Frederik Slijkerman
Tuesday, February 24, 2004

There are some lovely solutions to this at

Tuesday, February 24, 2004

I ran into this problem a long time ago while answering someone's post on DevShed:

I immediately came up with a lookup table solution, but after thinking about it for a bit, I realized the bit patterns allowed for a very quick computation of this lookup table.  The counts in the last half of any power-of-two-sized table are the counts in the first half plus one, since the second half's indices are just the first half's with the top bit set.  I thought it was kind of neat.  I am pretty sure that these are the patterns Joel speaks about in his article.
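One concrete recurrence this pattern suggests: each power-of-two block of the table repeats all the earlier entries plus one (for the newly set top bit), so the whole table can be built with no inner bit loop at all. A hedged sketch (`build_table` is an illustrative name, not from the thread):

```c
typedef unsigned char BYTE;

/* Build the table by doubling: once the first 2^k entries are known,
   the next 2^k entries are those same values plus one, because their
   indices differ only in the new top bit being set. */
void build_table(BYTE table[256])
{
    int size, i;
    table[0] = 0;
    for (size = 1; size < 256; size *= 2)
        for (i = 0; i < size; i++)
            table[size + i] = table[i] + 1;
}
```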

Jason Doucette
Saturday, May 01, 2004
