Fog Creek Software
Discussion Board




Memory caching and synchronization


I was recently bitten by a nasty multi-processor synchronization bug.  The destructor for a custom semaphore object had the code:

if (--mThreadCount == 0)  mThreadId = 0;

This has worked fine for years on x86 machines, but the code in question failed rapidly on an IA64 machine.

The problem was the decremented value of mThreadCount being written to memory after mThreadId, and after another thread had already claimed the semaphore based on mThreadId being 0.

The solution was to use InterlockedDecrement and InterlockedExchange calls to modify the variables.  An article on MSDN talks about this:  http://msdn.microsoft.com/library/en-us/dllproc/base/synchronization_and_multiprocessor_issues.asp
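For readers following along, here is a minimal sketch of that fix, using C++ std::atomic as a portable stand-in for the Win32 Interlocked calls (std::atomic postdates this discussion; the member names follow the post, everything else is illustrative):

```cpp
#include <atomic>

// Hypothetical reconstruction of the semaphore release path.
// std::atomic's fetch_sub/store play the role of
// InterlockedDecrement/InterlockedExchange in the original fix.
struct SemaphoreLike {
    std::atomic<long> mThreadCount{0};
    std::atomic<long> mThreadId{0};

    void release() {
        // fetch_sub returns the value *before* the decrement, so the
        // count reached zero exactly when the old value was 1.
        if (mThreadCount.fetch_sub(1) == 1) {
            // This store is ordered after the decrement; another thread
            // that sees mThreadId == 0 also sees the updated count.
            mThreadId.store(0);
        }
    }
};
```

With plain non-atomic writes, the compiler (and, on IA64, the hardware) is free to make the mThreadId store visible before the count update, which is exactly the failure described above.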

But, I think this article is incorrect in that it talks about a second processor reading a value from memory while it is still in the cache of the first processor.  I thought bus snooping caches, et al. solved that problem.

As I see it the problem is with the transfer of data between processor registers and memory (cache or otherwise).  In this case, the compiler decided to reorder the writes for efficiency.

The problem I have is I don't see where the guarantees are.  Why should calling an API function ensure that all the processor registers are flushed?

But, how am I supposed to write thread safe code in this case?  Even if I use a critical section, what is to say that a value is written back to memory and not just cached in a register somewhere until later, when the compiler decides it needs to reuse that register?

In a number of places I have code that can avoid the need for (and associated performance overhead of) explicit synchronization primitives if the order of writes to memory is the same as is given in the code.

Am I missing something obvious here or is all my multithreaded code now suspect? 

Rob Walker
Monday, December 16, 2002

If the problem is with the compiler as you think, Rob, there really isn't a guarantee that your code is thread-safe at any time. Which is unfortunate, because the leak in the abstraction is at the compiler level, which is like knocking out a support beam in software.

Have you looked at the assembly code to see if things were switched as you say?

Giampiero
Monday, December 16, 2002

This is an unfortunate consequence of how C is defined. In order to give a compiler the maximum possible flexibility in what optimizations can be performed, very little is specified about the order in which operations are performed.

Do a Google search for "Sequence Points", or take a look at http://www.embedded.com/story/OEG20020625S0041 and you'll find out a lot more about how this works.

As far as how calling an API helps? Well, it's probably (!) safe to assume that the synchronization primitives in whatever API you're using were written at least partially in assembler, so as to make use of the special machine instructions needed to effectively synchronize two processors.

There is still an opportunity for your code to go wrong in not putting a value out to memory right away, even inside your critical section. There isn't an easy solution here, but the proper use of the "volatile" keyword in C can be helpful to ensure that writes actually make it out to memory in a timely manner.
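To illustrate what "volatile" does and does not buy you, here is a sketch (names are invented for illustration): the keyword forces the load and the store to actually happen, but the decrement is still three separate steps, so it does not make the operation atomic.

```cpp
// volatile forces the compiler to emit the load and the store rather
// than caching the value in a register, but the decrement is still
// three separate steps: read, subtract, write. Nothing stops another
// processor from sneaking in between them.
volatile long gCount = 2;

bool decrement_and_test() {
    // Equivalent to "--gCount == 0": still a non-atomic read-modify-write.
    long tmp = gCount;   // load (cannot be cached across calls)
    tmp = tmp - 1;       // modify, in a register
    gCount = tmp;        // store (cannot be deferred)
    return tmp == 0;
}
```

Single-threaded this behaves as expected; under contention, two threads can both read 2 and both write back 1.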

Ultimately, you shouldn't try to implement semaphores or mutual exclusion on your own. Use whatever is provided by your environment/API.

-Mark

Mark Bessey
Monday, December 16, 2002

>> if (--mThreadCount == 0)  mThreadId = 0;

Wasn't  mThreadCount  guarded by a CRITICAL_SECTION or semaphore?  If not, then this code could have failed on a single processor machine as well.

Though I don't recall the Interlocked set of MSFT functions all that well, I think all they did was guard the variables with critical sections.

Of course, all I see is this one line of code, but even in this context, based on the var names, I'm guessing it's a simple multithreading bug - not a multi CPU bug.  It just had not shown up yet (??).

Nat Ersoz
Monday, December 16, 2002

> As far as how calling an API helps? Well, it's probably (!)
> safe to assume that the synchronization primitives in
> whatever API you're using were written at least partially in
> assembler, so as to make use of the special machine
> instructions needed to effectively synchronize two
> processors.

Specifically, the Interlocked* calls use the LOCK opcode, and are __fastcall (parameters in registers instead of stack), so they should be pretty much safe in this case.

KJK::Hyperion
Monday, December 16, 2002

If you aren't using processor locking then all your code is suspect.  The APIs wrap/abstract this for you (it was even more useful back when NT ran on Alpha as well), which is why you ought to be using them.  It will probably work on 1 proc, may work on two, but throw 3-32 in the mix and it's a crapshoot.

SR

MT Dabbler
Monday, December 16, 2002

Thank you all for your comments.

Giampiero wrote:
> Have you looked at the assembly code to see if things
> were switched as you say?

Not in any detail.  This happens in the Win64 build of our software, and I haven't dug out an IA64 instruction manual yet.  I'll try and find time to do it in the next day or so.

I was guessing that the problem cropped up now because the IA64 architecture suddenly has a *lot* more registers available, and the compiler has to be aggressive with instruction reordering to take advantage of the parallelism available on the chip.

Also, this problem only occurred in release builds not debug ones.

Nat Ersoz wrote:
> Wasn't  mThreadCount  guarded by a CRITICAL_SECTION
> or semaphore?  If not, then this code could have
> failed on a single processor machine as well.

Unfortunately the snippet of code I gave doesn't give you the full context.  There is no need for a critical section around mThreadCount.  The only thread that is allowed to manipulate this value is the one that owns the semaphore.

The lock method of the semaphore has always used InterlockedCompareExchange on mThreadId to ensure only one thread can gain access at once.

The code has been run under very high stress on 8-way 32 bit boxes without any locking problems before.
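For what it's worth, that lock path could be sketched like this, with std::atomic's compare_exchange_strong standing in for InterlockedCompareExchange (the recursive-count handling is a guess at the full context, not the actual code):

```cpp
#include <atomic>

// Sketch of the lock path described above: a CAS on mThreadId ensures
// only one thread can claim the semaphore, after which that thread
// alone manipulates mThreadCount.
struct CustomSem {
    std::atomic<long> mThreadId{0};
    long mThreadCount = 0;  // only the owning thread touches this

    bool try_lock(long myId) {
        long expected = 0;
        // Succeeds only if mThreadId was 0; installs myId atomically.
        if (mThreadId.compare_exchange_strong(expected, myId)) {
            ++mThreadCount;  // safe: we now own the semaphore
            return true;
        }
        if (expected == myId) {  // recursive acquisition by the owner
            ++mThreadCount;
            return true;
        }
        return false;
    }
};
```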

KJK::Hyperion wrote:
> Specifically, the Interlocked* calls use the LOCK opcode,
> and are __fastcall (parameters in registers instead of
> stack), so they should be pretty much safe in this case

True, but on an IA64 there are a lot more registers to worry about.  I don't know what the calling conventions are on that architecture.  I also want to avoid the cost of locking the bus associated with the LOCK opcode where it isn't absolutely necessary.


The original IA32 implementation of this routine was written (almost entirely) in assembler.  Perhaps it is time to learn some IA64 assembler as well ...

Rob Walker
Monday, December 16, 2002

Mark Bessey wrote:
> Do a Google search for "Sequence Points"

I'm not sure sequence points help much though.  The article you referred to says:

"Sequence points also appear at the end of the controlling expression of an if or switch statement"

So there is already a sequence point in the 'right place'.

I'll try the volatile modifier to see if it makes any difference to the compiler's output.

> Ultimately, you shouldn't try to implement semaphores
> or mutual exclusion on your own. Use whatever
> is provided by your environment/API

Unfortunately the use of some shared data structures low down in our code means that the difference in performance between our custom semaphore and the standard CRITICAL_SECTION is noticeable.  Whilst the difference is not huge, I hate to give up performance without good reason (and, of course, correctness is a *very* good reason!)

Rob Walker
Monday, December 16, 2002

MT Dabbler wrote:
> If you arent using processor locking then all you
> code is suspect.

I agree with the need for processor locking, but still feel it shouldn't be necessary in this case.

I guess I just have to expand the list of cases where interlocked access is required to include ones where I need to ensure the order of write back to memory.

My concern is that this opens up a whole new class of bugs I hadn't considered before - and which aren't visible just by reading the source code without thinking "and what if the compiler does it in this order".

Rob Walker
Monday, December 16, 2002

If it really is a compiler optimisation problem and there's no pragma to switch off optimisation for that code, you might need to make it explicit to the compiler what it is you want to do.

One simple way is to decompose compound statements.  Instead of something like variable = dosomethingwith(--variable); write:

--variable;  // or if pathological, variable = variable - 1;
variable = dosomethingwith(variable);

At a guess one of the reasons for the difference between debug and release code behaviour is the depth of stack.  Not for the usual reason that things work in debug but fail in release, that some automatic variable is screwing a value in the stack, but that it just takes a smidgin longer for things to unwind off and wind onto the stack.

If the above still didn't do it, I might even get into writing inline code, but I doubt it's worth it.

Simon Lucy
Tuesday, December 17, 2002

Not only do you have to make sure the decrement is atomic, you also have to include the branch in that same atom. (otherwise you could have two processors that both think the counter is zero). All multiprocessor-capable CPUs provide special assembly instructions for this kind of thing. In Linux you'd use atomic_dec_and_test(), I don't know what the Win32 equivalent is.
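A sketch of that idea in portable C++ terms (std::atomic standing in for the platform's atomic instructions; the harness function is purely illustrative):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// atomic_dec_and_test in portable form: the decrement and the value
// you branch on come from one indivisible fetch_sub, so no two threads
// can both observe the counter reaching zero.
bool dec_and_test(std::atomic<long>& counter) {
    return counter.fetch_sub(1) == 1;  // pre-decrement value was 1
}

// Hammer the counter from several threads; exactly one must "win".
int count_winners(int nThreads) {
    std::atomic<long> counter{nThreads};
    std::atomic<int> winners{0};
    std::vector<std::thread> pool;
    for (int i = 0; i < nThreads; ++i)
        pool.emplace_back([&] { if (dec_and_test(counter)) ++winners; });
    for (auto& t : pool) t.join();
    return winners.load();
}
```

The pre-decrement values handed out are nThreads down to 1, and exactly one caller sees 1, no matter how the threads interleave.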

Dan Maas
Tuesday, December 17, 2002

Most programmers are not aware of the fact that many languages do not guarantee the order in which variable values become visible to other threads, other than at specific sequence points.

I'm most familiar with this issue in Java in the form of double-check locking. It's common, but not quite safe, to write code like this:

private Object m_v;

void init() {
  if( m_v == null ) {
    synchronized(this) {
      if( m_v == null ) {
        m_v = new Foo();
        ....
      }
    }
  }
}

This has a measurable performance gain, but may not work on some machines or JVMs. The level to which you will be affected by these problems depends on the specific hardware platform and on how aggressive your compiler is in optimizing memory access.

You should use OS (or language) provided synchronization primitives that ensure that variable values become visible to other threads.
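For comparison, a sketch of a lazy-init pattern with the right guarantees, using C++'s std::call_once as a stand-in for an OS-provided primitive (this API postdates the discussion; the names are illustrative):

```cpp
#include <mutex>
#include <string>

// A safe replacement for check-lock-check: std::once_flag guarantees
// the initializer runs exactly once, and that its writes are visible
// to every thread that passes through the call_once.
std::once_flag gInitFlag;
std::string* gValue = nullptr;

std::string& get_value() {
    std::call_once(gInitFlag, [] { gValue = new std::string("initialized"); });
    return *gValue;
}
```

Every caller gets the same fully constructed object; there is no window where another thread can observe a non-null pointer to a half-initialized value.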

igor
Tuesday, December 17, 2002

Why are you writing your own semaphore code?  What's wrong with CreateSemaphore, or if you need something lighter, a CRITICAL_SECTION?

Brian
Tuesday, December 17, 2002

Don't know if this will help, but uC++ (Micro C++) is a good tool for doing concurrency.

http://plg.uwaterloo.ca/~usystem/uC++.html

Giampiero
Tuesday, December 17, 2002

To answer the original poster's question:

Situations like this are what the "volatile" keyword is for. It tells the compiler that this variable may change for reasons other than the compiled code itself, so don't do anything funky like cache it in a register or reorder the writes. It should fix your problem.

Chris Tavares
Tuesday, December 17, 2002

>>Situations like this are what the "volatile" keyword is for. It tells the compiler that this variable may change for other reasons that the compiled code itself, so don't do anything funky like cache it in a register or reorder the writes. It should fix your problem.

Well...  C++ (and I assume C, but I don't have a reference handy to double-check) is silent on the subject of threading, so it's not possible for a language feature (like the "volatile" keyword) to have any semantics vis-a-vis threading.  So you really need an extra-linguistic solution - such as the Interlocked* solutions mentioned above for Win32.  However, a compiler implementation is free to attach thread semantics to "volatile", but is not required to, so your statement may be true for some implementations (and not others).
In practice, "volatile" will often do the trick.  But if you really want to sleep well at night, either verify that your compiler guarantees that it will work, or use Interlocked*-type solutions.

Brian
Wednesday, December 18, 2002

(re: check-lock-check)

Don't use it.  It is fundamentally broken in both Java and C++.  I don't have pointers handy, but I will post them when I find them.  news:comp.lang.c++.moderated is a good source, as is news:comp.programming.threads.

Brian
Wednesday, December 18, 2002

Making it 3 posts in a row to this forum:

Here is a thread that was just started yesterday on this very topic: http://groups.google.com/groups?threadm=1040163326.525628%40master.nyc.kbcfp.com

Here is an enormous thread from a while back.  It should be cued up to a really good post on why check-lock-check is not sufficient on multi-processor machines.
http://groups.google.com/groups?threadm=3A66A5A8.17960D70%40sea.ericsson.se&rnum=1

Brian
Wednesday, December 18, 2002

Rob wrote:
"My concern is that this opens up a whole new class of bugs I hadn't considered before - and which aren't visible just by reading the source code without thinking "and what if the compiler does it in this order". "

Yep, that's part of why writing multi-threaded code in C is so hard. Having experienced just these sorts of problems on a number of projects, I'm not convinced that the average programmer is really up to the challenge.

There are too many little details you need to keep track of, and it's simply too easy to get screwed by a bug that's so rarely reproducible that you effectively can't track it down.

When you consider that the same concerns apply to any library code that you might use as well, it starts looking very bad, indeed.

Threading - just say no. (unless your language and runtime explicitly support it)

-Mark

Mark Bessey
Wednesday, December 18, 2002


Thanks for the pointers to the usenet discussions, Brian.  They were very helpful. 

Unfortunately, just giving up on threading in a multi-user server app that runs on 4- and 8-way boxes isn't an option!

It looks like the only way to get performance when I know what semantics I want is to drop down into assembler to ensure things happen in the right order.  But maybe it is time to relax the performance requirements a little ....

It is just annoying to have to give up performance because I can't specify what I want.

The uC++ link that Giampiero posted looks interesting, but wouldn't be practical for this product (for a start it is windows based).

Rob Walker
Thursday, December 19, 2002

>>Unfortunately just giving up on threading in a mutli-user server app that runs on 4 and 8 way boxes isn't an option!

You don't have to...  I think the moral is that writing (correct) multi-threaded software is harder than most people realize, but not impossible.  Only use constructs that have guaranteed threading semantics -- don't write "++s_count;" when you mean "InterlockedIncrement(&s_count);"
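A sketch of that rule, with std::atomic's fetch_add as the portable spelling of InterlockedIncrement (the harness function is illustrative):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// "++s_count" done safely: the read-modify-write is one indivisible
// operation, so the total is exact regardless of interleaving. With a
// plain int and "++s_count", increments could be lost.
long increment_from_threads(int nThreads, int perThread) {
    std::atomic<long> s_count{0};
    std::vector<std::thread> pool;
    for (int i = 0; i < nThreads; ++i)
        pool.emplace_back([&] {
            for (int j = 0; j < perThread; ++j)
                s_count.fetch_add(1);  // InterlockedIncrement equivalent
        });
    for (auto& t : pool) t.join();
    return s_count.load();
}
```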

>>It looks like the only way to get performance when I know what semantics I want is to drop down into assembler to ensure things happen in the right order.

But you can't even do that*!  The problem is not with compiler instruction reordering, it's with hardware memory caching.  Check out the MSDN article "Synchronization and Multiprocessor Issues".  You MUST use a Critical Section, Mutex/Semaphore, or an Interlocked function.  It's extra work and takes some thought and planning, but you get guarantees with it.

*I'm not an assembler expert.  I'm sure it is possible.  But the correct assembly code will include proper memory barriers, not just un-reordered/de-optimized compiler output.  The crux of the matter is memory synchronization, not instruction ordering.
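The barrier idea in miniature, sketched with C++'s release/acquire orderings (which postdate this thread) standing in for explicit memory-barrier instructions:

```cpp
#include <atomic>
#include <thread>

// Release/acquire message passing: the release store to "ready" cannot
// become visible before the store to "payload", and the acquire load
// cannot be reordered after the read of "payload". Without the
// barriers, a second processor could see ready == true yet read a
// stale payload - the memory-synchronization problem, not just
// compiler instruction ordering.
int payload = 0;
std::atomic<bool> ready{false};

int pass_message() {
    std::thread producer([] {
        payload = 42;                                  // ordinary write
        ready.store(true, std::memory_order_release);  // publishes the write above
    });
    while (!ready.load(std::memory_order_acquire)) {}  // waits, then orders reads below
    int seen = payload;                                // guaranteed to see 42
    producer.join();
    return seen;
}
```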

Brian
Thursday, December 19, 2002

There are other ways to exploit multiple processors than multithreading. Many applications can be just as efficient with multiple processes, sharing little or no memory, as with threads (especially on UNIX or UNIX-like OSes).

There are also non-blocking I/O APIs so you don't necessarily have to use threads for concurrent disk or network I/O.

Dan Maas
Thursday, December 19, 2002
