Fog Creek Software
Discussion Board




Commercial Software: Why No CPU-Specific Releases?

Modern compilers (GCC, Microsoft, Intel, etc.) can all tailor compiled code for specific CPUs within an architecture, such as Athlons, P4s, and so on.  Looking at the comparative benchmarks for various open-source applications (Mozilla, LAME, video encoders/decoders) built with compiler settings tuned for specific CPU types, it seems clear that some applications can enjoy large performance gains from optimized compilation settings.

My question is this: why don't we see commercially-released software offering a choice of binaries?  At install time, the installer could detect the computer's CPU type and install optimized binaries or, of course, fall back on standard i386 or Pentium-compatible "lowest common denominator" binaries.  The support/testing cost seems as though it would be minimal; in most cases these performance gains are only a couple of compiler switches away.

I realize that for most applications (word processing, etc.) any modern CPU is "fast enough", but you'd think that an extra 10-30% of performance would be a great selling point in highly competitive and performance-hungry arenas such as databases or multimedia software.  In a lot of cases, a performance boost from optimized binaries could forestall the need for hardware upgrades, translating into a very real cost advantage for potential purchasers.

John Rose
Monday, July 12, 2004

Added testing complexity? Increased time-to-market? More support headaches? Those are a few I can think of.

dude
Monday, July 12, 2004

To be specific about these performance gains... I tried an AthlonXP-optimized build of Firefox and it was literally three times faster on some JavaScript benchmarks than the "official" Firefox build, which runs on any Pentium-level computer or higher.  I've also seen speedups of 10-50% on various media encoding/decoding benchmarks with optimized compilation settings.

(Of course, nobody needs to run JavaScript three times faster.  It's just an example of what sorts of gains can be enjoyed in some cases.)

You'd think that database software would be a no-brainer for optimized compilation settings.  The guys at Oracle and Microsoft bash their brains out trying to outdo each other on those TPC-C benchmarks, eking out another 1% of performance wherever they can.  Multimedia software would be another obvious choice - we're working on some big Flash applications here, and they chug even on P4s and AthlonXPs.  Video software is obvious as well - our render jobs often need to run overnight, and faster rendering means we could get by with fewer render boxes.

Games, of course, would be another obvious avenue - both the games themselves and the video drivers released in the hypercompetitive, performance-sensitive world of 3D accelerators.

(I do realize that not all software would automatically benefit from a simple compile)

John Rose
Monday, July 12, 2004

"Added testing complexity? Increased time-to-market? More support headaches? Those are a few I can think of."
----

True, it would take a little more testing.  Do compatibility headaches often arise from using anything but the failsafe default compiler arguments, though?  (Honest question, not rhetorical)

Another thing to consider - what's cheaper... paying skilled coders to squeeze another 20% of performance out of an existing codebase, and the extra testing that would result?  Or simply compiling several different binaries and dealing with that extra testing?

I realize that for something like Outlook or Microsoft Word, the costs of optimized builds would outweigh the gains.  But for multimedia software it seems like a no-brainer.

I suppose that, in the world of "real" databases like Oracle/MSSQL and above, stability has got to be priority #1, even over performance.  So maybe that's not the total no-brainer candidate I made it out to be in my previous post.

John Rose
Monday, July 12, 2004

That's supposed to be one reason for shipping software in an intermediate language, which is then run on the target machine by a Just-In-Time compiler (for example .NET ... and perhaps Java).

Intel's compiler can optimize for all of (Intel's) CPU versions simultaneously: see "Processor dispatch" on http://www.intel.com/software/products/compilers/cwin/cwindows.htm

Christopher Wells
Monday, July 12, 2004

I suppose that a compiler which produces a single binary optimized for all (or at least multiple) architectures would be more desirable than having multiple binaries.

If you follow the conventional wisdom that 10% of the code takes up 90% of the execution time, you could have multiple CPU-specific codepaths within a single binary.  Even if your binary is, say... 40% bigger, that would only mean a couple extra MB of space in memory and on disk.  Not a factor in most cases these days. 

You wouldn't have to worry about people running the wrong binary, but you'd have to do some testing on each of the codepaths.
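
To make that concrete, here's a rough sketch - all names made up for illustration - of how a single binary could carry two builds of one hot routine and pick between them once at startup:

    // Minimal sketch: one binary carrying a generic and a CPU-tuned build
    // of the same hot routine, chosen once at startup.
    #include <cstddef>

    // Generic build of the hot loop (compiled for the baseline CPU).
    static void mix_samples_generic(float* dst, const float* src, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            dst[i] += src[i];
    }

    // CPU-tuned build of the same loop (in a real product this would be
    // compiled with architecture-specific settings; here it's a stand-in).
    static void mix_samples_tuned(float* dst, const float* src, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            dst[i] += src[i];
    }

    // Placeholder for whatever detection the product uses (CPUID, OS query).
    static bool cpu_supports_tuned_path() { return false; }

    // The rest of the program calls through this pointer and never knows
    // which codepath it got.
    void (*mix_samples)(float*, const float*, std::size_t) =
        cpu_supports_tuned_path() ? mix_samples_tuned : mix_samples_generic;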

John Rose
Monday, July 12, 2004

And that's definitely a good point about shipping software in an intermediate language like .NET or Java, Christopher.

Do you know if .NET or Java actually perform this sort of optimization, or if it's simply something they could conceivably do?

John Rose
Monday, July 12, 2004

A lot of games (and probably other software that is CPU-heavy) already do this, just in a way that you don't notice.

It is fairly common for a modern game engine to have DLLs that are specifically optimized for 3dNow or MMX or SSE2, and the main executable will dynamically load the optimized DLL that fits best after doing a CPUID check to see what the processor supports.
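
Roughly, the pattern looks something like this (a sketch only - the DLL names and the RenderFrame export are made up, and the __cpuid intrinsic is MSVC-specific):

    // Sketch: pick the best-optimized DLL at runtime via a CPUID check.
    // DLL names and the "RenderFrame" export are invented for illustration.
    #include <windows.h>
    #include <intrin.h>

    typedef void (*RenderFrameFn)(void* frame);

    HMODULE LoadBestRenderDll()
    {
        int info[4] = {0};
        __cpuid(info, 1);                           // CPUID leaf 1: feature bits
        bool hasSSE2 = (info[3] & (1 << 26)) != 0;  // EDX bit 26 = SSE2

        // Try the optimized build first, fall back to the baseline build.
        HMODULE dll = hasSSE2 ? LoadLibraryA("render_sse2.dll") : NULL;
        if (!dll)
            dll = LoadLibraryA("render_generic.dll");
        return dll;
    }

    int main()
    {
        HMODULE dll = LoadBestRenderDll();
        if (!dll)
            return 1;

        RenderFrameFn renderFrame =
            (RenderFrameFn)GetProcAddress(dll, "RenderFrame");
        if (renderFrame)
            renderFrame(0);                         // call whichever build loaded

        FreeLibrary(dll);
        return 0;
    }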

Mr. Fancypants
Monday, July 12, 2004

> Do you know if .NET or Java actually perform this sort of optimization, or if it's simply something they could conceivably do?

I don't know. I do know that MS has touted it as an advantage: that when they improve their compiler in the future, I won't need to rebuild my app then to take advantage of it.

Christopher Wells
Monday, July 12, 2004

I assume it's because the vast majority of users have no clue what CPU is in their machine.

And many of those who really care about such things  are using open source and compiling their own anyway.

Tom H
Monday, July 12, 2004

We do it all the time with Pocket PC applications (different processors), and it is done quite commonly with customized and specifically CPU-intensive tasks, such as mathematical and scientific apps.  The gains you might get from a CPU-specific optimization for something with as across-the-board a resource profile as Word might not be that great, but for something that just hammers some particular registers, you might get gains worth noting.

sir_flexalot
Monday, July 12, 2004




Until you can have a simple select or dropdown to select the CPU (or unknown) at install time, this will not happen.

Most people don't have a clue about their CPU-specific benchmarks, let alone how to compile them.

Hell, I know what they are and I don't even use them for anything.

KC
Monday, July 12, 2004

"Until you can have a simple select or dropdown to select the CPU (or unknown) at install time, this will not happen."

If only there were some way to have the program identify the CPU, and run the corresponding code.

Oh, wait.  That's exactly how programs that already implement this idea work.


Monday, July 12, 2004

"That's supposed to be a reason for shipping software in an intermediate language, that's run on the target machine by a Just In Time compiler (for example .NET ... and perhaps Java)."

So, why is Java slower than VB 6?  Why is .net so slow?

If the above implication is true, shouldn't they be FASTER than c++ (compiled for maximum compatibility, average speed)?


And, regarding "When MS releases a new compiler, you don't need to recompile your app"... No, but you have to redistribute a 20 $%^ MB .net Runtime.  Recompiling my app is EASY.  Distributing it is the hard part. But my app is only 10 MB.

I would RATHER redistribute my app than have to redistribute the .net runtime.

Mr. Analogy
Monday, July 12, 2004

"Until you can have a simple select or dropdown to select the CPU (or unknown) at install time, this will not happen."
----

The user doesn't have to do ANYTHING.  Software can easily determine the CPU type by querying the OS or by using even lower-level methods.
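
For example (Windows-specific, and just a sketch), the OS will happily tell you what the processor supports without the user lifting a finger:

    // Sketch: asking Windows what the CPU supports - no user input needed.
    #include <windows.h>
    #include <cstdio>

    int main()
    {
        BOOL mmx   = IsProcessorFeaturePresent(PF_MMX_INSTRUCTIONS_AVAILABLE);
        BOOL sse   = IsProcessorFeaturePresent(PF_XMMI_INSTRUCTIONS_AVAILABLE);
        BOOL sse2  = IsProcessorFeaturePresent(PF_XMMI64_INSTRUCTIONS_AVAILABLE);
        BOOL now3d = IsProcessorFeaturePresent(PF_3DNOW_INSTRUCTIONS_AVAILABLE);

        std::printf("MMX: %d  SSE: %d  SSE2: %d  3DNow!: %d\n",
                    mmx, sse, sse2, now3d);

        // An installer could branch on these flags to decide which binaries
        // to copy, or the app could branch on them at startup.
        return 0;
    }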

John Rose
Monday, July 12, 2004

And, unless you install every binary for every CPU (and query the CPU type before running the app), you'll run into the problem of the user moving the hard drive between different machines (with different CPU's). Or installing the app to a network drive where it's accessed by many different CPU types at the same time.

If the end user has the source and compiles it for a specific CPU, then it's his problem when it breaks. If the vendor does it and it breaks, then it's a support call.

RocketJeff
Monday, July 12, 2004

> Why is .net so slow?

Several reasons.

> If the above implication is true, shouldn't they be FASTER than c++ (compiled for maximum compatibility, average speed)?

Perhaps it could, in theory. In practice, even after it's compiled, .NET is doing more than C++ does (for example, it checks that array indexes aren't out of bounds at run-time, etc.) (and there's a lot of "etc." in a 20 MB run-time).
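
(As a rough illustration of the "doing more" part - this is just a C++ analogue of the per-access bounds check, not how the CLR is actually implemented - think of every array access going through something like this:)

    // Illustration only: a checked array access in the spirit of what a
    // managed runtime does on every indexing operation.
    #include <cstddef>
    #include <stdexcept>

    template <typename T>
    class CheckedArray {
    public:
        CheckedArray(T* data, std::size_t length) : data_(data), length_(length) {}

        T& operator[](std::size_t i)
        {
            if (i >= length_)                   // the extra work on every access
                throw std::out_of_range("index out of range");
            return data_[i];
        }

    private:
        T*          data_;
        std::size_t length_;
    };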

What MS were implying is that e.g. 3 years from now, my .NET code will be JIT compiled by a compiler version that will know about hardware that doesn't even exist yet.

> No, but you have to redistribute a 20 $%^ MB .net Runtime.  ... I would RATHER redistribute my app than have to redstribute the .net runtime.

Yes I know you would. But I'm writing software for corporations ...

Christopher Wells
Monday, July 12, 2004

In my experience, and in general business applications (especially 'commercial' databases, which are my speciality), the speed of the application has little to do with the speed of the CPU these days.

It's much more to do with
(a) the design of the app itself (both low level code optimisation and high-level functionality stuff)
(b) the design of the database
(c) the speed of the LAN/WAN and
(d) (at a pinch) the speed of the hard disks.

When did you last have something running so slowly that an end-user noticed? :-)

Miranda (UK)
Monday, July 12, 2004

It is ridiculous that some people are posting things in here as to why this can't/shouldn't be done when it is being done all the time and they'd just need to read a couple of the previous posts in this thread to see how it is done.
(Hint:  See my post higher up about using CPUID and loading DLLs for specific processors.)

The only reason 'why not' to do this is simply that, as someone mentioned previously, most software isn't all that demanding on the CPU, and doing this for software that isn't CPU-bound would result in no noticeable gains.

But for software/games that are CPU-bound, this is easy to implement, and the user never has to do anything or even know anything tricky is going on.  And no, it doesn't suffer from the user moving the install to another system, since the CPUID detection is done at runtime each time the program is executed.

This is a solved problem.  End of discussion!

Mr. Fancypants
Monday, July 12, 2004

Also,

"If the end user has the source and compiles it for a specific CPU, then it's his problem when it breaks. If the vender does it and it breaks then it's a support call."

If the end user has the source and compiles it for a specific CPU and it breaks, that user is probably going to make a support call anyway.  The support tech can just tell the user "sorry, but we don't support that", which is all well and good, but just by handling the initial call and answering the phone, the company is already paying the bulk of what it has to pay for support, even if all they do is tell the user they can't help them.  The damage is already done, price-wise.

Mr. Fancypants
Monday, July 12, 2004

Of course those of us in the Linux world get to be smug about this...

Because we can do this if we want to. And in fact it does have an IMMENSE effect on performance; especially if you compile bottleneck components like the kernel specifically for the CPU. (Normally they come compiled for 386s, although many distributions are now coming with variants for 586s as well.)

I don't think there's much point building a processor specific "ls", though...

Personally, I don't bother that much for my machines, but I know people that do, and the performance they get is impressive.

Katie Lucas
Monday, July 12, 2004

The largest performance increases seen in a program are when a slow performing algorithm is replaced with a fast performing algorithm.

Compiling the same algorithm for different CPU's generally will not make a great amount of difference.  Of course that will vary according to what the algorithm is doing and how the CPU hardware can help it (or how the programmer lets the hardware help).

Thus it's not worth the time invested in compiling for different CPU's and troubling the user on what to install.  Instead it's best to use the algorithm that is best suited for the purpose and one that has the desired performance characteristics.

Dave B.
Monday, July 12, 2004

"When did you last have something running so slowly that an end-user noticed? :-)"
---------------------

Today, actually - I was struggling to squeeze some more performance out of some fairly involved SQL code I'd written. 

It's for a custom messageboard with close to 100,000 posts, on a cheapo shared server, so it takes a fair amount of (very educational) work for me to keep things running quickly.

As I was trying to figure out how to squeeze some extra performance out of the database, I was thinking "the code for MSSQL was probably compiled for a 386, which shares virtually no performance characteristics with the P4 the server's running on, aside from a common instruction set architecture.  Wouldn't it be nice if SQL could be recompiled for the P4?  Even a 'free' performance gain of maybe 10% would help me out quite a bit!"

Obviously, for non-multimedia, single-user applications performance is really not an issue these days.  However, aren't multimedia or multi-user scenarios quite common these days?

John Rose
Tuesday, July 13, 2004

"Thus it's not worth the time invested in compiling for different CPU's and troubling the user on what to install.  Instead it's best to use the algorithm that is best suited for the purpose and one that has the desired performance characteristics."
-----------------------

No.  No, no, no.  The user would never have to consciously select a specific CPU architecture.  This is easily accomplished through OS or CPUID checks, as Mr. Fancypants noted repeatedly.  So, no.  To repeat: no.

Also... no.  Nobody's suggesting that the best algorithm shouldn't be used for the job.  Yes, that's the most important factor when it comes to performance.  Compiling for specific architectures would provide, in most cases, a performance gain *on top of* the one enjoyed when using the best possible algorithm.

The statement "Thus it's not worth the time invested in compiling for different CPU's" is also false.  Optimization is tricky and challenging.  What is the time of a skilled programmer worth?  Also, optimization often (but not always, of course) sacrifices code clarity which invokes additional costs down the road.  In the "real world", we usually don't have the time or resources to optimize everything.  In these cases a "free" performance gain from proper compilation is appreciated.

John Rose
Tuesday, July 13, 2004

Fancypants is right and most of the rest are arguing against the wind. Most multimedia software, particularly at the high end, runs code optimized for the CPU by loading in runtime-linked libraries. Quite often you'll even see the names of these libraries, with the name of your processor in them, flash by on the splash screen during startup.

I've never seen it on open-source stuff; it's only professional-quality commercial applications that take the time to do things right like this.

Dennis Atkins
Tuesday, July 13, 2004

"My question is this: why don't we see commercially-released software offering a choice of binaries?"

You won't see that, because the software that actually does install different binaries based on the CPU won't let you know that it is doing that.  To prompt the user would confuse and piss them off.

T. Norman
Tuesday, July 13, 2004


My bad, I didn't realize it (CPU Type) could be detected that easily at such a high level.

KC
Tuesday, July 13, 2004

I guess the short answer to your question is that it would be a maintenance and support nightmare.

Multimedia optimization is different from CPU-specific releases.  MMX is useful in applications that have:

- Small native data types (such as 8-bit pixels, 16-bit audio samples)
- Compute-intensive recurring operations performed on these data types
- A lot of inherent parallelism

Obviously multimedia vendors take the time to code their algorithms using MMX instructions or they use libraries that have already been coded using these instructions.  (MMX generically speaking.)  The problem is that coding MMX usually involves hand coding the assembly which takes more time.
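
To give a flavor of it (a hedged sketch - using intrinsics rather than raw assembly, and SSE2 rather than original MMX), here's a saturating add of 8-bit pixels next to the plain loop a baseline build would use:

    // Sketch: saturating add of 8-bit pixels, plain C++ vs. SSE2 intrinsics.
    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstddef>

    // Baseline version: what a generic build does, one byte at a time.
    void add_pixels_plain(unsigned char* dst, const unsigned char* src, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) {
            unsigned int sum = dst[i] + src[i];
            dst[i] = (unsigned char)(sum > 255 ? 255 : sum);   // saturate at 255
        }
    }

    // SSE2 version: 16 pixels per instruction, saturation done in hardware.
    void add_pixels_sse2(unsigned char* dst, const unsigned char* src, std::size_t n)
    {
        std::size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m128i a = _mm_loadu_si128((const __m128i*)(src + i));
            __m128i b = _mm_loadu_si128((const __m128i*)(dst + i));
            _mm_storeu_si128((__m128i*)(dst + i), _mm_adds_epu8(a, b));
        }
        add_pixels_plain(dst + i, src + i, n - i);   // handle leftover pixels
    }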

If there were automated tools to help maintain, install and support the various CPU-specific releases, this practice might become more popular for non-multimedia apps, but I doubt it, simply because it's not worth the overall cost to maintain and support.

The root of your MSSQL problem is that you are using a shared server.  Do you know how much RAM it has?  Do you know how many people are hosted on that server?  Did you use the Query Optimizer?  Is your database designed correctly?  Are you properly using the recordset?

There are a ton of things you can do that are much simpler than a company supporting and maintaining many different EXE's for CPU specific releases.

Imagine if you released a DBMS v1.0 with 10 different CPU-specific exe's.  Imagine the support and maintenance nightmare.  Now suppose you release v2.0.  Now you have 20 different EXE's to support.  You might be thinking that because the compiler is generating the exe's there are no bugs in them... well, I think you would find otherwise.

The best you can do is look at your algorithms, look at your hardware, look at the context of the situation you are operating in and go from there.

These really are the reasons you don't see CPU-specific releases, and why you do see software labeled with the various MMX-style implementations - 3DNow! etc.

*(Linux is special: because of its open-source nature, you can compile the applications for whatever CPU you see fit.)

Dave B.
Tuesday, July 13, 2004

Also, if this were a bigger selling point to the customer, it might make sense to try.  As it stands, though, it's simply not worth the effort.

And you are correct: the user wouldn't have to select the CPU version they have - it could be detected.

Dave B.
Tuesday, July 13, 2004

Thank you for the detailed response!

----------------
"Obviously multimedia vendors take the time to code their algorithms using MMX instructions or they use libraries that have already been coded using these instructions.  (MMX generically speaking.)  The problem is that coding MMX usually involves hand coding the assembly which takes more time"
----------------

Recompilation doesn't equal recoding via hand-optimization, though.  Obviously the biggest potential gains come through painstaking hand-optimization of low-level code, but that's not what I'm talking about.

Intel's (and presumably some other?) compilers are able to automatically detect some parallelism and compile accordingly, although of course automated compiler detection of SIMD-ready (MMX, SSE, etc.) code would not be as good as tight hand-coded assembly.
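
For instance, a trivial loop like this (names made up) has exactly the kind of independent iterations a vectorizing compiler can turn into SIMD code on its own when told to target a newer CPU - no source changes required:

    // The sort of loop an auto-vectorizing compiler can turn into MMX/SSE
    // code when targeting a newer CPU; the source itself stays plain C++.
    #include <cstddef>

    void scale_samples(float* samples, std::size_t n, float gain)
    {
        for (std::size_t i = 0; i < n; ++i)
            samples[i] *= gain;     // independent iterations: inherent parallelism
    }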

There's also the simple fact that compilers target 386-level CPUs by default.  That was a single-issue, non-superscalar, non-pipelined CPU... I think it could do one floating-point op (assuming you didn't have a 386SX!) or one integer op at a time.  Today's CPUs have multiple instruction pipelines and various penalties and efficiencies related to instruction ordering that are *completely* independent of SIMD instructions such as MMX.

Look at the gains possible simply by changing compilers or compiler options:

http://www.willus.com/ccomp_benchmark.shtml?p10

Obviously, that's a "best case" scenario for the effect of compiler options on application performance - LAME is an mp3 encoder and a purely CPU-bound program which isn't the case for most programs.  Keeping that in mind, however, the gains to be had through smart compilation are fairly spectacular.  You can see gains of 7x between compilers and 2x-3x even with the same compiler and different compilation options.

John Rose
Tuesday, July 13, 2004

Small note to the link I provided above...

They have compiler benchmarks for other software as well, none of which show the performance variance you see with the LAME benchmarks.

But also note that the minimum level of optimization used is /G6 in the Intel compiler tests; this targets 6th-generation CPUs like the PentiumPro/Pentium2 and above.  They don't do any tests with "barebones" 386-targeted compilations like a lot of commercial software seems to use.  :)

John Rose
Tuesday, July 13, 2004
