Fog Creek Software
Discussion Board

How relevant is optimization? You judge

I want to walk through an example. If you're interested, read the post called "how relevant is optimization" for background. I started a new topic because I think this example is interesting in its own right.

I'm using pseudo code that is somewhat like C++, I hope it doesn't need explanation, but feel free to ask.

Let's say I want to compute the mean and standard deviation of fields across many records. Assume all fields are float. Let's further assume that this can't be done in advance for some reason, but has to be done online.

I'll start with the simplest, ugliest way to do that: A static, two dimensional array; One index is the record number and the other is the field number. It would be something like:

float data[n_records][n_fields]

for x in field:
.  sum[x] = 0
.  sumsqr[x] = 0
for x in field:
.  for y in record:
.    item = data[y][x]
.    sum[x] += item
.    sumsqr[x] += item*item
for x in field:
.  mean[x] = sum[x] / n_fields
.  stddev[x] = sumsqr[x] / n_fields - (mean[x]*mean[x])

Now, believe it or not, if the number of fields is large, or, if we need to swap to disk for some reason, then exchanging the order of the "for" loops for x and y can change the running time by two or three orders of magnitude. If the array storage order is C like, then an internal field loop will be perfectly cache coherent and predicted by any working set algorithm, whereas an external field loop (as written above) will be not have horrible cache coherence, and won't be predicted as easily. (If the number of fields is a multiple of the cache line size, and the cache is 2-way or 4-way associative, the cache will be completely ineffective).

"Well, exchange the loop order then" I hear you say. Fair enough; that's a luxury all Fortran programmers and early C programmers had. But now I'll evolve the example and make it conform to standard "structured programming" practice:

struct record { float field[n_fields] }
record data[n_records]
. (snipped)
for x in field:
. for each record y:
.  sum[field] += y.field[x]
.  sumsqr[field] += y.field[x]
. (snipped)

It's still possible to reorder the loops, but it's not as natural now. All our routines process complete "records", so the program structure would probably favor iterating by record rather than by field. Let's go further, and enumerate the fields by name, and make this object oriented...

struct record { float field_1, field_2, ... field_n }
class field_selector {
. int i;
. field_selector(index) { i = index }
. operator get(record r) { return r->field_#i }
float mean_stddev(field_selector sel, record rec[]) {
. sum = sumsqr = 0
. for r in rec:
.  item = sel->get(r)
.  sum += item
.  sumsqr += item*item
. return sum/len(rec), sumsqr/len(rec) - (sum/len(rec))**2
do_all() {
for i in field:
. collect mean_stddev(field_selector(i), data)

It's object oriented, so it must be better, right? It actually does do things we couldn't do before, such as "virtual" columns; If I wanted to calculate the mean and deviation of, say, field1*field2, all I'd have to do is write a new field selector that returns that in response to get().

Hmmpf. Now it's not so simple to change the iteration order. In fact, if "rec" is abstract (we don't use it directly - only through "get"), it's impossible - and because we made it abstract in the first place (A probable OO design decision), Joey does the hall already wrote 200,000 lines that rely on it being abstract.

Badness doesn't end there, though - if the records are allocated independently (as they usually are in OO programs), then the cache use will be effectively unused and unpredictable.

Most OO people I know solve things closer to the "OO" way I gave above. That's fine, but don't delude yourself that it is optimizable to the first case because in practice it isn't.

Also, consider - how would you even detect that the program runs orders of magnitude slower than it could, and that it's because of memory access patterns?

"But that's only for number crunching stuff". Well, newsflash: *everything* a computer does is number crunching. Yep, you wrap it in 250 layers of abstraction, and try to forget about it, but every bottleneck is a number crunching bottleneck because everything eventually reduces to computing numbers and moving them around.

Whenever I design a solution, I measure the complexity of teh problem by counting the byte moving and number crunching that (eventually) has to be done to solve it. For a proposed solution, I do a rough estimate of the byte moving / number crunching that it will do; If it takes more than 10 times as much as really needed, I seek to improve or replace it. If I can't give a rough estimate of the byte moving / number crunching that a solution requires, then I seek to simplify it.

It works wonders - it keeps the solution simple, and the performance predictable. Being able to count byte moves also means you're not overabstracting to the point that you can't exchange loop indices.

And I would like, again, to point you to the wonder that is K [ ]. It's a language that's entirely interpreted (no JIT or anything), in which the natural way to write programs makes them faster by an order of magnitude, and shorter by two. The magic is selecting the right primitives, and the right memory layout. Yep, it's unlike anything else you've used (unless you've used APL / J), but it does deliver everything it promises.

Another example - you've got your class hierarchy all neatly developed, and now you need a few thousand objects (out of a few million) in memory for producing a report. Unfortunately, you can't fit the whole million in memory for some reason, so you resort to a database. You could have, in fact, used memory mapped files and not have to deal with any persistence issues (which often take 1000 times as much CPU as your processing). But because objects require proper construction, and have pointers to one another, you can't just mmap/MapViewOfFile and forget about it. You'll have to find some clever trick like caching computations on the database, moving them to batch processes etc. But if you used pointerless structs in advance, you could have; And it's possible that the code wouldn't have looked much different.

To sum up - my point is that you have to plan in advance where to leave a place for optimization, or there's a chance you won't have where to optimize.

Thank you for your time :)

Ori Berger
Monday, July 28, 2003

Maybe it's just me, but your "object oriented" solution seems backwards.  I'd think you would have a class representing a record, which contains all the fields.  If you set it up like this you can get the same performance characteristics of having the record as the outer loop

That said, I do understand what you're saying and it is an important point.  Algorithmic/structural changes are by far the most significant things in optimizations.  The real question is: When do you do this optimization?  The problem with your suggestion that everything is studied and done in an "optimal" manner to begin with (which I'm not sure is even possible in general, since you can't necessarily figure out the optimal way to do something without the framework of the rest of the program), you waste a lot of time optimizing code that has no business being optimized.  Take your example.  What if I had a small enough data set that the whole thing fit in cache?  Then your optimization means nothing.  Or what if the only time you do that calculation, you also do some reading from disk, so the time spent on the operation is completely dominated by waiting for the IO.  In that case, this optimization is completely unnoticable to the user.

The point is, if you attempt to optimize too early, two bad things happen:  1) You spend a lot of time optimizing things that shouldn't be optimized.  2) You (potentially) make your code harder to work with.  Almost all abstractions come at a cost of performance.  However, the gain from these abstractions often far exceeds the cost.  If you try to optimize too early, you will do away with the many benefits of abstraction in search for some performance which is very likely to not matter at all in the final product

You are absolutely right that your design can back you into a corner performance-wise, and it *is* a good idea to think about that through the entire process.  You should absolutely take into consideration access patterns of your data when designing data structures.  We do this all the time.  Nobody claims you should be totally ignorant of performance while designing your program (ie, few people would suggest to use a list instead of a map on anything larger than a few elements).

Mike McNertney
Monday, July 28, 2003

I have a set of business rules, which have a 25% chance of needing optimization at some point.

Implementing them in code module "A" in a logical, well-thought-out manner would take four weeks.

Implementing them in code module "B" in a logical, well-thought-out manner ensuring room for future optimizations would take eight weeks.

Refactoring "A" to optimize it would take four weeks.
Refactoring "B" to optimize it would take one week.

Doing "A" results in a delivery time of four weeks, and a 50% chance of needing four weeks additional work later. Doing "B" results in a delivery time of eight weeks, and a 50% chance of needing two weeks additional work later.

Now you could do the actual analysis and figure out what the real lengths of time and probabilities are, weighing them and determining the best path. Or you can start work.

My personal opinion - you can worry, or you can do. People who do get things done. People who worry produce five-volume specifications of over-designed code while the people who do have code in production.

Yeah, sometimes you get bitten. But I think you get bitten no matter which direction you go, so it's better to get bitten when you have code in testing than when you're still designing well after the delivery date. ;-)


Monday, July 28, 2003

Oops. Those 50's should be 25's. I never said you don't need testing. ;-)


Monday, July 28, 2003

"The problem with your suggestion that everything is studied and done in an "optimal" manner to begin with ..."

That's not what I wanted to say; I summed up by saying "my point is that you have to plan in advance where to leave a place for optimization, or there's a chance you won't have where to optimize."

Which I'll restate: DO NOT OPTIMIZE CODE UNTIL YOU'VE GOT SOMETHING WORKING. However, *DO* optimize structure and architecture so that you will be able to optimize should the need arise.

About "studied" - a carpenter should be able to give a rough estimate, within a factor of 10, of the amount of wood needed for a project. I believe same goes for a programmer, both about no. of lines of code and about no. of CPU cycles. It's not a matter of "study", it's part of the trade.

And I'll try to clarify what I was aiming at (and, upon rereading 30 minutes later, see I didn't express well enough): That all too often, thinking about performance in advance would lead you to a solution that is of comparable effort, but can be improved at a later stage, whereas the "natural" solution can't, and may not offer anything in return (it may, but it does not necessarily).

About the OO being backwards - that depends on what you expect to change more, and what you have control over. Your "forward" is the natural SmallTalk / Python solution - if you need a new virtual field, you just add it to Object (or the most common ancestor), and use it from there.

However, this doesn't work well as well for C++ / Java. If you want to add a virtual field to existing objects, you'll have to change the base class (often not possible), or create a new class that supports the virtual field, and an object of that new class for every object of the original class (possibly holding just a pointer to the original, but still, you'd have to construct it if you follow the language spec). On the other hand, if you use an adapter object like I did, it can "adapt" any class to the required functionality (providing fields, real or virtual), and if used properly this results in much less object creation / destruction and much less code than subclassing. But YMMV. In my experience, adapters work better than simple inheritence (this is also the approach taken by the STL; However, I do not consider the STL A good example of adapter use).

One of the problems of OO is that it lacks a good definition, so one's forward is another backward. [ ] sums it up nicely.

Ori Berger
Monday, July 28, 2003

Philo - Totally agree.

You've done the analysis, and took the efficient (business wise) choice. I'm all for it.

But what usually happens is that plan B is never even considered (you know. "We'll look at it later if there's a problem"). And it often turns out that Plan B is the same length as Plan A, only using somewhat different principles, but optimizing plan A will take 4 times as much or won't be possible.

P.S: Everything I say about running time, I apply to coding time and no. of lines as well. Which is compatible with how you look at it.

Know thy alternatives, and pick wisely, according to case. Don't discard performance estimations and alternatives, just as you don't discard schedule estimations and alternatives or complexity estimations and alternatives.

Ori Berger
Monday, July 28, 2003

> a carpenter should be able to give a rough
> estimate, within a factor of 10, of the amount
> of wood needed for a project.

I hope you mean 10% not a factor of ten (x10) or do I misunderstand the factor of ten?

Sanger B
Tuesday, July 29, 2003

Ori - sure you can.  Can't you coerce objects of type list-of-records to array-of-records?  The fact that languages like Java let you declare multidimensional arrays, suggests that such optimizations are common.  I'd treat this as a similar problem to caching.

Tayssir John Gabbour
Tuesday, July 29, 2003

Ah I understand what you're saying now Ori.  I pretty much agree with you.  You should definitely think about different possibilities for doing the same thing, and whether any would be drastically faster for the same (or similar) effort.  Drawing the line between that and doing premature optimization can be a bit hard at times, though

Mike McNertney
Tuesday, July 29, 2003

" I hope you mean 10% not a factor of ten (x10) or do I misunderstand the factor of ten? "

Well, for a carpenter it better be closer to 10% than 900%, but I did mean a factor of 10 for software. With todays caches, instruction sets and compilers, I don't think that it's possible to make a precise estimate.

"Can't you coerce objects of type list-of-records to array-of-records?"

In C++, you can't. You can write a templace function that accepts either, but if you have a non-template function you want to use, then coercion is effectively constructing the container anew; This will always take O(n), even if the routine you use only requires O(log n) or O(1) processing.

"Drawing the line between that and doing premature optimization can be a bit hard at times, though"

Couldn't agree more.

I'm being a bit extreme on the performance side in an attempt to balance the common view that performance doesn't matter. Whenever you're doing something nontrivial, it DOES.

Ori Berger
Tuesday, July 29, 2003

*  Recent Topics

*  Fog Creek Home