Fog Creek Software
Discussion Board




C++ objects, optimization

I discovered Joelonsoftware.com only recently, got to say great site.

Anyway, on to the topic in my head (there is a question towards the end, if you want to skip my introductory rant). Sorry, this is a bit convoluted...

I am working on a program (C++, MFC, Win32) which involves lots of objects all in memory.

Basically there is a process which involves

Step1:
Create Objects Type A - various complicated data calculations
Create Objects Type B - more complicated data + include arrays of As
Create Objects Type C - some complicated data + include arrays of Bs
Create Object Type D -  some complicated data + include arrays of Cs

Step2:
Do lots of work on these objects starting at D, and mostly working inwards

Optimizing (and this isn't bad code to start with), I have gone from 145 seconds total (less than 15 in step 1, 130+ in step 2) to 6-7 seconds (5-6 in step 1)


Some points encountered along the way

1. I started with what I "knew" was slow - all the calculations. This was a lot of work and only saved 1-2 seconds! As everybody says, unless you measure before optimizing, you're wasting your time (fortunately 1-2 seconds is quite significant at the point I am now).

...So then I started measuring...

2. One very interesting thing I discovered: the code had a progress bar in the app's status line (based on CProgressCtrl).

What it did was, roughly (simplified): calculate the number of B-objects, and as each was done in step 2, calculate the B-objects done as a % of the total, and use this to update the progress bar. Now because there were a lot of B-objects (more than 100) and the progress bar doesn't have the resolution, it gets told to display a given % more than once... Putting in a simple guard not to update the bar (just return immediately) if the value to be displayed was already displayed saved about 45 seconds in itself... (and reduced the progress bar update time to negligible)

...a few optimizations later - and this was one of the slowest parts...

3. There is another part of the code, where there is sort of a class like this

class SomeClass
{
protected:
  // lots of member variables
public:
  SomeClass()
  {
  }
} ;

This class gets constructed, and a pointer to it gets passed around to various functions.  Then the object is destroyed.  A bit later the whole process repeats.

Just constructing the object seemed to take a long time. Then I had an epiphany... There is only 1 of these objects at a time!

Solution 1: Re-use the same object. Downside: I would have to rewrite lots of functions not to take this object as a parameter - a good idea in the long run, but not something I want to do right now.

Solution 2: Re-use the member data, by changing the member variables to statics and making the ctor just reset their values. I tried it gingerly - it worked. Major time savings - with no impact on the rest of the code.


...a few optimizations later...

One of the remaining issues that I want to optimize is a particular function called AddObject (in class-D). It takes a reference to a C which is added to the array in D. The C contains an array of Bs which get copied, and the Bs contain As which get copied.

The arrays are defined as CArray< obj, obj& >. This makes the code real simple, and prior to my other optimizations, the overhead from this was a tiny % of the total - now it's a big %

So the questions
1. Any better way to do this? One obvious idea would be to make the arrays contain pointers, ref counted (see next point). Of course as this is a major change I am worried that doing all that work may make little difference (or make it worse!). Any thoughts?
2. After calling AddObject, the object being copied from eventually gets destroyed. Is there some way that I could construct it in place in the array to avoid the copy?
3. At the bottom level of the hierarchy, some (not a lot of) identical As exist. I'm wondering if I could re-use these (ref counting) rather than process the same A more than once in step 2.

S Tanna
Monday, May 12, 2003

On the last, I have no comment. Nothing immediately jumped to mind.

However, on the object construction/destruction issue you mentioned, making the members static seems really ugly to me. The reason is this: sure, you know you may never instantiate more than one at a time, but what if this ever has to be multithreaded, or whatever? Or you just want to change it?

The better way that occurs to me is simple object pooling: You can have a collection of objects in a pool, and one is only constructed if the pool is empty. The object is returned to the pool when finished. It's more complex, but:

1) It will allow N objects to exist at once without horribly breaking.

2) It should perform similarly (after all, if there are always enough objects in the pool, you'll never instantiate another one).

3) It is transparent. It wouldn't require changing every routine that uses the object, but instead only the one that instantiates it.

Now, thinking back on your first question, a few things come to mind.

On question 3: lookup the Gang of Four Flyweight pattern.

On number 2: I don't know, can you? :) In C++ it is possible to construct an object in-place in an array (i.e. I believe the vector class just calls the default constructor for each of its members to create them). Your code may vary, maybe you engineered it so you can't 8-}

Now, having had much fun going backwards, returning to 1, I think pointers are a good idea, provided you can keep track of them satisfactorily. As you say, that can be work.

Here's a question: can you omit the copy step? Not omit it and have it work, but will the code not crash if you just don't do it? I'm thinking, measure the time with and without the copy. Or only copy the first element, or something.

Take all this with a planetoid of salt, seeing as how I've made some rather large assumptions about your code 8-}

Hope this helps.

Mike Swieton
Monday, May 12, 2003

I agree the static thing is a hack - but fortunately it's a safe hack. The multithread issue doesn't arise in the app. And the debug code will nicely alert me if the assumptions get broken (I use a flag to track it) - leading to comments which tell you how to fix it and make it general again.

I just thought it was interesting (and surprising) that in Visual C++ constructing an object containing all statics (ints in this case) appears to be substantially faster than constructing an object containing ordinary members. In my app, and I guess in others too, it's a useful thing to know if you are constructing/destroying the same object repeatedly.

Thanks for the other tips, I'll look into them.

Ideally, I'm aiming to get this thing down to a couple of seconds.

S Tanna
Monday, May 12, 2003

That should be true for other compilers as well -- statics are allocated in the start-up code before main() is called, and thus the memory allocator isn't called; the default memory allocator on any system is going to take a while to find a block, especially a big one - it might even go off and start doing heap compaction or paging to disk or who knows what.

X. J. Scott
Tuesday, May 13, 2003

Is your application going to exit as soon as it gets its job done?

If yes, you can skip freeing memory. Just leave everything allocated and let the operating system clean up the mess after the application has exited.

Skip ref counting, too. Since everything stays allocated, you do not need ref counts.

Write a custom memory allocator. See the example below. It should be good enough for an application that is going to exit after a few seconds of running time:

#include <cstddef>   // size_t
#include <cstdlib>   // abort

void* operator new(size_t size)
{
    const size_t pool_size = 1000000;
    static char pool[pool_size];
    static char* mark = pool;
    // round up so every allocation stays suitably aligned
    size = (size + 7) & ~(size_t)7;
    if(mark + size > pool + pool_size)
    {
        // Out of memory -> increase pool_size and recompile.
        abort();
    }
    char* result = mark;
    mark += size;
    return result;
}

// matching no-op delete - without it, the default operator delete
// would try to free pointers that never came from the default new
void operator delete(void*)
{
}

jannek
Tuesday, May 13, 2003

Using VC++?

Use the profiler.  It will break down, on a per-function basis, where your app is spending all its time.

vc++ monkey
Tuesday, May 13, 2003

There is something very curious about a compiler that is faster at executing a static member function than a non-static member function.  Even if it were virtual, the difference should be a handful of cycles - nothing a human would notice.  Might be best to double-check your measurements...

Nat Ersoz
Tuesday, May 13, 2003

Unfortunately I can't rely on the OS freeing memory when the app dies... the user keeps using the app.

It's a graphics app.  The user hits some button/changes some settings... and waits for the result to display, and then does the same thing again ad infinitum.  As this is a key part of the app, I don't want users waiting around for the result to display.

A programmer-world experience of the same thing might be: you edit one line of code, and then you have to wait 2.5 minutes (okay, now a few seconds) before you can start editing the next line. I can't just say "hit a button after you have made all the changes to run the long process", as users want to see the effect of each edit before making the next one.

Yes I use profiling.

To answer the question about statics vs. non-statics: it's not calling a static vs. a non-static function, it's static vs. non-static member data, the constructor, and allocating the object. If you do the same thing a million times, even a tiny difference between one method and the other accumulates.

Off-topic aside: this isn't the question, but on static vs. non-static member functions: I would expect a tiny difference, as the non-statics have an extra (hidden) parameter - so they must generate slightly different assembly.  In a totally different project a while ago, I needed to make the program much smaller for extremely complicated (but important - it is something lots of customers asked for all the time) reasons I won't go into.  One trick I discovered: making member functions static (plus playing around with different calling conventions) can make the code a few bytes smaller (that hidden parameter again).  In many classes there tend to be helper functions which don't need access to the rest of the member data... so I saved about 10% of the program size just by doing that to as many functions as possible.

Going back to the original question, I have thought of another optimization in the algorithm. 

There is actually a step 3 I didn't mention earlier (but it's relatively tiny), and some kinds of edits would only affect step 3, so for these kind of edits I keep the results of step 2 around. I already had this.

I have now realized for some edits I can keep the results of step 1 around, so sometimes I can skip step 1.

However the problem remains open, as about 50% of the possible edits would still require me to go through all three steps.  I think the answer here has to be something like creating the objects in place to reduce the number of copies, which are taking most of the time.

S Tanna
Tuesday, May 13, 2003

One thing that I'm confused about... You said that the reason you made the class use static data is because you didn't want to rewrite many functions to not take the object as a parameter.  Why would this be necessary?  You don't have to make it a global, just pass the same pointer in at the top level (wherever the thing gets allocated).  Even if you do make it a global you don't have to change the methods that take it as a parameter, just pass in the global pointer.

Mike McNertney
Tuesday, May 13, 2003

Making that object global would work, and would probably have the same (or very close, probably marginally better) effect on speed as making the member data static.  However, in either case there is an "evil" assumption - only 1 of these objects at a time. This evil assumption holds now, but it is possible it might not at some point in the future.

With my method, if I need to switch back, the debug code will assert, telling me to change one #define - i.e. all the changes are within the module making the assumption.

With the global method I would also need to change the code that creates the object (in a separate module) to call some reset method separate from the ctor - in other words, it impacts code outside the module making the assumption.

S Tanna
Tuesday, May 13, 2003

I wasn't really supporting the "make it global" idea.  Mostly I was wondering why you would have to change the functions that take it as a parameter, rather than simply changing where it is allocated.  If it wouldn't be easy to re-use the same object, you could also go with the "pool" idea mentioned earlier.

I'm not trying to criticize your decision, just trying to get a better idea of the factors that were involved in your choice

Mike McNertney
Tuesday, May 13, 2003

You're right I don't have to change the functions in this case, and I misspoke if I said I did.

Basically what I'm trying to do is add the sizzle (which is speed - a competitor's app takes several minutes to do the same step, and requires a whole bunch of user interface options about when to see the update) and close out the remaining issues before shipping the app.

I am sort of in the last mile. When I speed something up by adding this type of assumption, I prefer to do so in a way that is easily reversible, in case it needs to be reversed in the next version.

The object pooling idea is reasonable, but it's unnecessary additional complexity (potential for bugs) for the problem I have right now.

I have just added the last optimization I mentioned - it works pretty well.

The in-place creation of objects is the last thing.  If I can solve this one, it'll be fast enough to see changes in real time vs. a 10-minute wait per change in the competition.

S Tanna
Tuesday, May 13, 2003

Given what you said in your first message, try the following optimizations: minimize the number of times that you call operator new; and avoid copying objects after constructing them.

If you know how many objects will be in the array by the time you have finished, use CArray::SetSize before you start.

To avoid copying objects accidentally, declare but don't define a copy constructor for A.

To minimize the number of times that you call new, allocate the memory-space for e.g. 1000 objects at a time.

If you don't know how big the array will be (so that SetSize is impossible), and want to avoid copying objects, then if you merely need sequential (iterating) but not random access to the elements, use a list of objects instead of an array. Note that a C-style list like the one below uses fewer memory allocations than a CList template (because in the C-style list, the pointer-to-next-object is contained in the same memory block as the object itself).

class A {
  // copy ctor declared but not defined -
  // any accidental copy becomes a link error
  A(const A&);
  // pointer to next A object in a list
  A* m_next_A_ptr;
  // other member data and methods ...
  ...
};

Christopher Wells
Tuesday, May 13, 2003

I tried a couple of things

1. SetSize, on the outermost level - no measurable difference in most cases. Once in a while it goes crazy and makes it substantially worse.  SetSize on the inner levels definitely makes it worse, even when I know how many items there will be.  I have no explanation for this behaviour, but it's measurable and quite obvious.

2. Construct in place. The difference, if any, was within the bounds of statistical error. While I guess this should be faster, it ain't, and I'm guessing this is some artifact of CArray.

3. The code was/is very good for const correctness. I managed to find 3 functions where the parameter is a const reference. Changing them to non-const seems to save an implied call to a copy ctor (darn C++), I think in a template class some place. As they get triggered a whole bunch of times, this adds up to quite a bit.

To explain the array sizes:

D - contains variable amount of Cs - in my test example about 5X3+1+1=17

Each C - contains lots of Bs - in my test example several hundreds

Each B - contains usually 4 As, except some which contain a few dozen

S Tanna
Tuesday, May 13, 2003

Point 3 surprises me. Anyway, you realise that unless you need a copy ctor, you could just declare it (without defining it), and then you will find out when these implicit copy calls are being made?


Wednesday, May 14, 2003

I'm unclear how many objects you have running around your system at any one time: it looks like a C could easily be 2000-4000 objects on its own (17*200). If you have a number of Cs running around, keep an eye out - NT's memory manager starts to get sluggish when you have huge numbers of objects in play (somewhere over 10,000-20,000, if I remember my testing of several years ago). It's not assured, but if you are allocating a lot of objects on the fly, you could run into a memory manager issue. (Note: my original finding of this problem happened on an NT4 system several years ago. I don't know about 2K or XP in this regard (I never re-tested with 2K, and XP didn't exist at the time), but I can't believe that MS would have bothered to re-work the memory manager, as it works fine in most cases.)

As I remember, I proved this by using calls to the performance counter to tell how long my calls to new were taking.

In any event, if you are running into memory manager issues, you might want to try class-level new and delete routines. That would allow you to localize the object pooling in the classes that need it. As I remember, that is what I used in my system.

Michael Kohne
Wednesday, May 14, 2003

Very rough, back of envelope. This is absolute worst case
1 D
17 Cs
Say 200 Bs per C = total 3400 Bs
For each C's 200 Bs, approximately (it isn't this simple really): 198 of the Bs contain 4 As, 2 contain say 70 As. Total As = 17 x 198 x 4 + 17 x 2 x 70 = 13464 + 2380 = 15844 As

SUB TOTAL = 1 + 17 + 3400 + 15844 (approx). <= 20,000 objects

Additionally there can be another copy which has (this is fast enough not to have been part of the optimization issue, except any side effects of fixing the other stuff)
1 D
7 Cs
Say 50 Bs per C = 350 Bs
For each C's 50 Bs, approximately (again, it isn't this simple really): 48 contain 4 As, 2 contain say 12 As. Total As = 7 x 48 x 4 + 7 x 2 x 12 = 1344 + 168 = 1512 As (approx)

SUB TOTAL = 1 + 7 + 350 + 1512 (approx) <= 2,000 objects

Grand total = 20,000 + 2000 = 22,000 objects

I think the 2 approx numbers are probably somewhat over-estimated, so we are probably somewhat less than 22K

I have not yet encountered any problem with memory management of these objects. I should point out that the As are very simple (about 50 bytes), and they aren't being explicitly new'd anywhere in the program - just constructed as locals, then copied into the CArrays.

There is an unrelated memory issue elsewhere in the program: principally, there are a number of large graphics images (up to 10) floating around, allocated as blocks with new (so you need enough memory for this).  The program is very stable for very long periods provided you have enough memory.

S Tanna
Wednesday, May 14, 2003

Regarding the copy ctor: it's not easy to take it out, as way too much code depends on it. I get huge numbers of compile errors if I try.

S Tanna
Wednesday, May 14, 2003

ray tracer?

Anonymous
Thursday, May 15, 2003

Not quite, but close.

S Tanna
Thursday, May 15, 2003
