Fog Creek Software
Discussion Board




Waking up thread blocked on recv or WSARecv

I think I know the answer here, but I was wondering if anyone knows how to wake up a thread that is blocked on a recv or on a Win32 WSARecv call. 

In my application it is possible that another thread could determine that waiting for I/O is hopeless before the timeout expires.  That thread could notify the blocked thread and wake it up.

I'm not sure if I could use WaitForMultipleObjects() as the problem is similar to waiting for multiple events.  My guess right now is no. 

In the absence of being able to wake up the thread, the only option I can think of is a not-so-busy polling loop such as:

totalTimeout = 0
while(totalTimeout < timeout){
  if(signaled by other thread)
    break;

  recv with shortTimeout
  if got data
    break
  totalTimeout += shortTimeout
}
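
A runnable sketch of that loop in Python, with a threading.Event standing in for the cross-thread signal (the function and parameter names here are mine, purely illustrative):

```python
import socket
import threading

def recv_with_cancel(sock, timeout, cancel_event, short_timeout=0.05):
    # Poll recv with a short timeout so another thread can cancel the wait
    # by setting cancel_event.  Returns received bytes, or None if cancelled
    # or the total timeout expires.
    sock.settimeout(short_timeout)
    waited = 0.0
    while waited < timeout:
        if cancel_event.is_set():        # signalled by the other thread
            return None
        try:
            return sock.recv(4096)       # got data (b'' means orderly close)
        except socket.timeout:
            waited += short_timeout      # no data yet; keep polling
    return None                          # total timeout expired
```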

any ideas?

christopher baus
Saturday, August 23, 2003

Hmm, I just had another idea.  How about asynchronously closing the socket?  That might work, since I want to close the socket after waking up the blocking thread anyway.  I could just switch the order, but that seems like a bad idea. 

That would look like this

Thread A blocks on recv
Thread B closes socket
Thread A wakes up since the socket is closed.

This just doesn't sit well with me...
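
For what it's worth, a small Python sketch of the idea. Calling shutdown() first makes the wake-up dependable; a bare close() while another thread is inside recv is platform-dependent. All names here are illustrative:

```python
import socket
import threading

results = []

def blocked_reader(sock):
    # Thread A: blocks in recv until data arrives or the socket is shut down.
    try:
        results.append(sock.recv(4096))   # b'' signals the shutdown
    except OSError as exc:
        results.append(exc)

a, b = socket.socketpair()
reader = threading.Thread(target=blocked_reader, args=(a,))
reader.start()

# Thread B: shutdown() wakes the blocked recv; closing the descriptor out
# from under a blocked thread behaves differently across systems.
a.shutdown(socket.SHUT_RDWR)
reader.join(timeout=5)
a.close()
b.close()
```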
 

christopher baus
Saturday, August 23, 2003

Look for WSACancelBlockingCall() or something with a similar name. It was there when I last tried it on a win32s project.

Personally, I recommend not using threads for anything. But it looks like it's too late in your project for that now.

Ori Berger
Saturday, August 23, 2003

use "select" instead of "WSARecv" (that's the old way, before we had threads, of writing a server that communicated on multiple sockets)
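
A minimal sketch of that select()-style wait in Python (a real server would keep all of its sockets in the read list and dispatch whichever come back ready; the function name is mine):

```python
import select
import socket

def wait_readable(sock, timeout):
    # select() blocks until the socket is readable (data or EOF pending)
    # or the timeout expires; returns True if the socket became readable.
    readable, _, _ = select.select([sock], [], [], timeout)
    return bool(readable)
```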

Joel Spolsky
Saturday, August 23, 2003

You can use WaitForMultipleObjects (WFMO) if you'd like.  Look up WSAEventSelect (you could also use MsgWaitForMultipleObjects if you'd rather signal termination with a message than a separate event).  You should be aware of a fundamental limitation in the number of events you can wait on for each call to WFMO (the MAXIMUM_WAIT_OBJECTS macro has that information).  You certainly should not be creating one thread per connection.

If your network operations require high volume data-transfer and large numbers of connections, use IO completion ports and pool your connections.

K
Saturday, August 23, 2003

Ori, am I understanding correctly that you suggest to people that they never use threads in any context in any application?  If so, why is that?

K
Saturday, August 23, 2003

I've thought about ditching threads altogether and using old-style event demultiplexing with select() or similar.  Yes, that would solve my problem, but it would be a completely different architecture, and one that would be difficult to implement.  I think I have something like 150 returnable states.  Ouch. 

christopher baus
Saturday, August 23, 2003

WSAEventSelect is exactly what I am looking for.  If used in combination with WaitForMultipleObjects I can wake up either if data is waiting or the other thread signals another event that I am also waiting on.  Perfect!

Now I need to figure it out for POSIX threads. 
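
On POSIX, one common equivalent is the self-pipe trick: put a pipe's read end in the select() set alongside the socket, and have the other thread write a byte to wake the waiter. A sketch in Python as shorthand (the same shape works in C with pipe(2) and select(2); the names are mine):

```python
import os
import select
import socket
import threading

def wait_data_or_wakeup(sock, wake_r, timeout=None):
    # Block until the socket is readable OR another thread writes a byte to
    # the wake pipe -- a rough POSIX analogue of waiting on a socket event
    # plus a separate wake-up event with WaitForMultipleObjects.
    readable, _, _ = select.select([sock, wake_r], [], [], timeout)
    if wake_r in readable:
        os.read(wake_r, 1)      # drain the wake-up byte
        return "wakeup"
    if sock in readable:
        return "data"
    return "timeout"
```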

christopher baus
Saturday, August 23, 2003

"That would look like this

Thread A blocks on recv
Thread B closes socket
Thread A wakes up since the socket is closed.

This just doesn't sit well with me..."

It ought to sit well with you. It's a very, very common way of dealing with this type of problem.

Brad Wilson (dotnetguy.techieswithcats.com)
Saturday, August 23, 2003

K, yes, it is usually my advice that people use threads only as a last resort, and it is very rarely needed.

Threads seem like they simplify programs, but they eventually make them much more complicated than they should be. A quick hack is simpler with threads than an async solution, but when you write robust software, the threaded solution is usually ten times as complex.

In my experience, most people who advocate using threads do not know how to do it any other way. Many times, especially with respect to sockets, people say "but there is no other way!". Well, there is, and it's simpler.

If you can read Python, I urge you to read the sources of Medusa (you can probably pick up Python just by reading it - Python is executable, but closer to pseudocode than any other language I know). It's a beautiful example of an asynchronous system.

It is very concise, extremely robust, very high performance yet very easy to use (once you make the switch in your head that "everything is a state machine").

Most people would probably find building the "basic" thing easier in a threaded environment. When you try, however, to build the robust, high-performance version using threads, you'll see that the async version is significantly simpler to write, use and maintain.
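
To make the "everything is a state machine" idea concrete, here is a minimal single-threaded echo handler in that style, sketched with Python's standard selectors module (Medusa itself builds on asyncore; the class and function names below are mine):

```python
import selectors
import socket

class EchoConnection:
    # Each connection is a tiny state machine: it alternates between reading
    # and draining its output buffer; peer close is the terminal state.
    def __init__(self, sel, sock):
        self.sel, self.sock, self.outbuf = sel, sock, b""
        sock.setblocking(False)
        sel.register(sock, selectors.EVENT_READ, self)

    def handle(self, mask):
        data = self.sock.recv(4096)
        if not data:                      # peer closed: terminal state
            self.sel.unregister(self.sock)
            self.sock.close()
            return
        self.outbuf += data
        # Simplification: echo immediately.  A fuller version would register
        # EVENT_WRITE while outbuf is non-empty instead of assuming send()
        # takes everything in one call.
        sent = self.sock.send(self.outbuf)
        self.outbuf = self.outbuf[sent:]

def run_once(sel, timeout=1.0):
    # One iteration of the event loop: dispatch every ready connection.
    for key, mask in sel.select(timeout):
        key.data.handle(mask)
```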

There are reasons to use threads, but they are rarely used for the right reasons.

Ori Berger
Saturday, August 23, 2003

I forgot the pointer to the Medusa source:

[ http://oedipus.sourceforge.net/medusa/ ]

Ori Berger
Saturday, August 23, 2003

I'm not sure that is all that common.  I read this on comp.unix.programmer:

> Less portable mechanisms are closing the socket from another thread and using pthread_cancel(). On some systems, if another thread closes a socket that you're blocked on, then your system call will fail with EBADF. But systems behave differently when this happens: on some the system call fails immediately, while on others the call won't fail until data arrives.

I really want this to work on Linux, Win32, and hopefully Mac OSX.  I am most familiar with Win32's threading model and that is why I threw out the WaitForMultipleObjects() example.

christopher baus
Saturday, August 23, 2003

Ori,

I agree with you.  To create really high-throughput systems, a single-threaded system will always outperform if written correctly, although I think developing the logic of a single-threaded server is at least an order of magnitude more complicated than that of a multithreaded server. 

In complex applications the number of states can quickly get out of control.  I don't know of any single threaded, multiconnection, implementation of HTTP 1.1 for instance.  If it is out there, please point me toward it. 

I am going to check out your python example. 

christopher

christopher baus
Saturday, August 23, 2003

One more comment.  Doug Schmidt has good support for single threaded/multi-connection servers in his ACE C++ framework with the Reactor architecture.  But it doesn't construct your application's state machine for you.  Again I still believe this is non-trivial.

http://www.cs.wustl.edu/~schmidt/ACE.html

christopher baus
Saturday, August 23, 2003

Medusa does single-thread multi-connection very well.  I'm sure it does "connection: keep-alive" on HTTP/1.0; I think it also does HTTP/1.1, and I recall it does chunked encoding, among other things.

The state machine can be, and should be, broken down into several semi-independent machines. I've been doing this for years now, and I don't think any of my "basic" state machines ever had more than 8 states. I *did* have 3 independent state machines, so you could say it's 8*8*8=512 states, but the complexity is really closer to 8+8+8=24 states.

And when you incorporate all of the exception handling logic, mutual exclusion logic, thread pool logic, etc, do you still think that the threaded logic is simpler? The potential for race conditions in your simple "one thread closes another's connection" really is astounding, if you properly analyze it.

And once you place all the mutual exclusion/locking code, it's usually too coarse (resulting in threads needlessly locking each other out), or too fine-grained (resulting in a horrible performance drop). Or, more commonly, things aren't properly locked and you get a wonderful mix of unreproducible bugs (but you only get them in _production_, because testing hardly ever produces the right mixture of load and random timing to make them show).

It's been years since I saw a non-reproducible bug in single-threaded code. I see them all too often in multithreaded code.

Ori Berger
Saturday, August 23, 2003

Hi Ori,

I think that it's possible to understand the purpose of state machines very well and still find some practical uses for threads.  The thread manager is essentially implemented by a state machine behind the scenes (for that matter, the processor can be described that way too).  People typically use threads in a single process where they might otherwise use a system of processes except that it would add complication or detrimentally affect performance.

Network servers, I agree, are almost always better implemented with a single-threading model.

How about a program that downloads large audio files from a slow server, analyzes them for particular patterns, and updates a display with the ongoing results of that analysis, with the requirement that non-technical home users will put the priority on the consistency of the user interface's behavior?  With a multi-threading model you can easily implement each task in a different thread with very little intercommunication, and still be (relatively) certain that the consistency of the user interface's behavior will be limited by the resolution of the average timeslice multiplied by the number of running processes (rather than that same number plus the worst-case time of any of the systems described earlier).  What do you think?

K
Saturday, August 23, 2003

High performance single threaded network servers in Windows?  Nuh uh.

Well, kinda.  A single-threaded async Windows server compared to a thread-pooled I/O completion port Windows server is like a juiced-up Civic compared to an NSX.  It may get you there fast, but it ain't an NSX.

You want that x-platform?  Check out how Apache does it.  They use the thread pool-I/O comp port model.

Johnny Simmson
Saturday, August 23, 2003

K, that depends on how long the "analysis" phase takes, and how much control you have over it.

If it comes in small chunks (1ms or so, absolutely less than 10ms per "chunk"), then multiple threads are not helpful in any way.

Regardless of how slow the download is, in an async download, it takes constant time to receive a packet. So just receive your packets, and when enough data has been collected, apply your analysis, and update the display.

Threads do not make any guarantee about consistency of UI response or CPU time; If your analysis thread needs 150% CPU to do its stuff, other threads WILL feel it. It's generally better use of resources (assuming one CPU) to cut your computation to small chunks and co-operatively give up your CPU often.
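
A sketch of that chunking idea in Python, where the generator stands in for the analysis and the loop stands in for the GUI event loop (all names here are mine):

```python
def analyze_in_chunks(samples, chunk=1000):
    # Do one small unit of analysis, then yield control back to the caller
    # so the single-threaded event loop can service the UI between chunks.
    total = 0
    for i in range(0, len(samples), chunk):
        total += sum(samples[i:i + chunk])
        yield total

def run_loop(task, on_progress):
    # Single-threaded "event loop": alternate work chunks with UI updates.
    result = None
    for result in task:
        on_progress(result)   # the display stays responsive between chunks
    return result
```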

Since ~1990, all GUIs have effectively been event-driven and single-threaded. If you discard earlier PC attempts (and you should), then this goes back to 1985, when the Mac (Mac OS), Amiga (Intuition) and Atari ST (GEM) came out. The only exceptions I'm aware of are the Java UIs (Swing, AWT and friends), and I suspect BeOS used multiple threads for UI, but I'm not sure.

Compare:
read_input("What's your favorite color", &color)

with the dialogs/controls you'd use in a Win32 program. It _looks_ much simpler, and was popular in the 80s, yet no one is doing it anymore. The reason is that when you want to make it robust and full featured, nothing beats the single threaded event driven model. (Think about how you'd implement context sensitive help, a "cancel" button, and a close-this-program button).

Sockets are not any different than interacting with the user - they interact with another computer, but they are not as predictable as, say, reading from a file. The bandwidth and reliability are not predictable, the processing on the other side is not predictable,  and various error conditions can appear at any time. For all you know, the entity at the other side of the socket might be a human that uses "telnet".

Yet for some reason the industry at large fails to apply the lessons learned from UI.  Oh well.

Ori Berger
Sunday, August 24, 2003

OT:

If you use the async functions in .NET, the runtime automatically does thread pooling and I-O completion ports junk for you.

.net rocks
Sunday, August 24, 2003

Hmm... Ori, you seem to be assuming there is only 1 CPU. For any server app I've ever worked on, the minimum would be 4, more typically 12+ CPUs. How do you intend to fully utilize the machine with only 1 thread?

dude
Sunday, August 24, 2003

I'm with dude - except that a lot of the multi-threaded apps I've worked with ran on 64-way Sparc boxes.

My first real exposure to multi-threaded apps on Windows, however, was to pull the multi-threading out of an application. The developer spun off threads all the time and nothing was properly locked.  It ran fine on a single-CPU box (which is how he developed it) but it had errors as soon as there was more than one CPU.

I remember a magazine article I read at that same time about the troubles that a lot of Windows programmers had with multi-threading - the author pointed out quite a few commercial packages that had warnings like 'only run on a single CPU machine'. His point was that they all failed on multiple CPU machines due to improper threading.

RocketJeff
Sunday, August 24, 2003

whenever someone tells me they've developed multithreaded code, I always ask if it was on a single processor.  I developed really bad habits writing multithreaded single-proc code, then had to unlearn it all once I moved on to a "real" (multi-proc) server environment. 

I think the fact that IIS was written multithreaded with I/O completion ports speaks volumes about the "proper" way to design high-performance Windows servers. 

some other dude
Sunday, August 24, 2003

The race conditions in my situation aren't really bad at all.  Just close the socket.  If it is already closed, big deal.  Closing a closed socket isn't a problem.

If a system runs on a single-proc box but not on a multi-proc box, then the developers should be fired.  Writing multithreaded code isn't nearly as bad as everyone makes it out to be.  Yes, it is hard.  Sorry, programming is hard.  Read Doug Lea's Concurrent Programming in Java.  I don't care what platform you are on. 

christopher baus
Monday, August 25, 2003

Dude, multiple processes are often much more useful than multiple threads; share explicitly (through shared memory etc.) rather than implicitly (through shared thread address space). The multi-process solution scales to more machines, whereas the multi-thread solution doesn't. If you saturate your 12 CPU machine, what are you going to do?

And if an application you write needs more than one CPU -- any application except perhaps image/video processing and heavy number crunching -- you're probably doing it wrong.

A 1GB/3GHz machine of today has roughly 1000 times the processing power of a 1990 AT, and memory to match. I've never seen a business application that needs that much power. That much power does allow people to write inefficiently, which they do. And if you write inefficiently enough, no amount of CPU or memory will ever be sufficient.

I'd really like to hear what this 12/64 CPU application does, and I urge you all to read:

Phillip Greenspun on application servers:
[ http://philip.greenspun.com/wtr/application-servers.html ]

Dan Kegel on handling lots of connections (unix centric, but has some calculations and tips that should be relevant to everyone):
[ http://www.kegel.com/c10k.html ]

Ori Berger
Monday, August 25, 2003

Ori, given those 2 links, I guess I'm talking about a different space.

I'm talking about _writing_ things similar to AOLServer, not _using_ things similar to AOLServer.

The C10K link backs up the thread pool & I/O completion port model, that's where I first started reading about it!

From C10K:

"""Under Windows, asynchronous I/O is associated with the terms "Overlapped I/O" and IOCP or "I/O Completion Port". Microsoft's IOCP combines techniques from the prior art like asynchronous I/O (like aio_write) and queued completion notification (like when using the aio_sigevent field with aio_write) with a new idea of holding back some requests to try to keep the number of running threads associated with a single IOCP constant.  """


Monday, August 25, 2003

"I'd really like to hear what this 12/64 CPU application does, and I urge you all to read:"

Options trading system handling over 600 symbols (stock symbols, with anywhere from 20 to 100+ options per symbol).

The system _did_ make use of multiple machines and processes - there were (IIRC) 6 or 7 various Sparc boxes running as servers with several other machines acting as trader workstations.

I know it wasn't a normal application but it did make good use of mult-threading and multi-processor machines.

RocketJeff
Monday, August 25, 2003

RocketJeff -  Let's do a small back of the envelope calculation with your numbers:

You've got 600 symbols, with 100 options on each; that makes 60,000 instruments. If each was traded once per second, you'd have 1GHz/60,000 = 16K cycles/trade; I doubt that's the case, because there aren't 60K trades/second even on NYSE. Just for the record, NYSE has ~3,000,000 trades/day [ http://www.nysedata.com/factbook/viewer_edition.asp?mode=tables&key=6&category=3 ], which averages (over 8 hours) to ~400,000/hour, or less than 200 per second. Let's assume that the peak is 10 times the average, meaning 2000 per second. So on a 1GHz machine that would leave us 500K cycles per trade, without saturating.

But even 16K cycles is still an awful lot. If done properly, all of the risk management, order matching, etc. can be done in a few hundred cycles.  It does require intricate planning and using the right data structures and topology, but if you can allow yourself 16K cycles, you can go for a much simpler solution and still be well within the soft real time limits. If you can allow yourself 500K cycles, you can do 30 times more. [ For those of you who had an old 4.77Mhz PC, that's roughly equivalent to the computing power that PC had per second. That's a little more than an Apple ][ had per second. And if you think an Apple ][ couldn't do one trade per second, then you should really pull it out of the attic and play some games on it. It can probably do ten or a hundred ]
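
The arithmetic above checks out mechanically, taking the quoted figures at face value:

```python
CPU_HZ = 1_000_000_000                       # the 1GHz machine assumed above
trades_per_day = 3_000_000                   # NYSE, per the fact book link
avg_per_second = trades_per_day / 8 / 3600   # averaged over an 8-hour day
peak_per_second = 10 * 200                   # peak = 10x the <200/sec bound
cycles_per_trade_at_peak = CPU_HZ / peak_per_second
instruments = 600 * 100                      # 600 symbols x 100 options each
cycles_if_each_trades_per_second = CPU_HZ / instruments
```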

Reiteration: These numbers are for NYSE (which trades ~10000 instruments), during peak hours, and a 1Ghz machine. Now, why would anyone need multiple threads or multiple machines?

Mostly because things are done inefficiently. E.g., because the database and/or network is accessed through a blocking API, because the code is written with 60 levels of abstraction, each with some cost, and because programmers are lazy.

(Yes, I'm a programmer; Yes, I'm lazy; No, not all of my programs are as efficient as they could be - in fact, most aren't. But I always do the math, and if I'm more than an order of magnitude worse than a back-of-the-envelope calculation, I know I screwed and try to fix it)

One reply I keep getting to these rough calculations is "it doesn't work that way in practice". Well, news flash: It does and always has done for me; Naturally, if you're building 60 layers of abstraction, you can't account for cycles. But what you should do is drop the abstractions that make things unpredictable, rather than give up your ability to predict.

An example: Since you're in the trading business, you may have heard of the Kx guys [ http://kx.com / http://www.firstderivatives.com ]. They market a database system and high-level 'scripting' language with perfectly predictable cycle-counting based performance. It does 50,000 ACID transactions per second, with multiple hot backups, on commodity hardware.

Ori Berger
Monday, August 25, 2003

Ori,

You're talking about a trade a second. I'm talking about generating quotes for each individual option.  We were generating over 3 million quotes each day for just the ISE (the firm I was with is a PMM on the ISE). That doesn't include the Amex, Philx or PCoast.

An option's theoretical value needs to be recalculated much more often than just when the underlying value changes.

There's also the part of the system that handles risk management and offsetting the risk by buying/selling futures or equities. It also has to handle the manual trades and parameter tweaking by over 50 people (traders and financial engineers).

This isn't a small system for individual traders that can be run on a little hardware.

RocketJeff
Monday, August 25, 2003

Jeff, I'm not bashing your system. I have no doubt that it was well thought out, and did marvelous things. I value robustness and correctness of a system above almost anything else.

But performance-wise, I'm not impressed. It's possible to code, e.g., a complete Black & Scholes routine that outputs call, put, delta, vega, etc. and runs in a few hundred cycles, which means it can calculate millions of option prices per second on a 1GHz PC. Standard risk management and generation of offset trades is also very simple -- unless, of course, you used algorithms much more complex than the standard.
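
For reference, the textbook Black-Scholes closed form is just a handful of log/exp/sqrt evaluations per option. A sketch (this is the standard formula, not any particular trading system's code; the put comes from put-call parity):

```python
from math import erf, exp, log, pi, sqrt

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes(S, K, T, r, sigma):
    # Returns (call, put, call delta, vega) for a European option.
    d1 = (log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    call = S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)
    put = call - S + K * exp(-r * T)              # put-call parity
    delta = norm_cdf(d1)
    vega = S * sqrt(T) * exp(-0.5 * d1 * d1) / sqrt(2.0 * pi)
    return call, put, delta, vega
```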

Did you measure where exactly the CPU time went? From my experience, in most systems that are not designed with cycle-accounting in mind, one finds that after eliminating the three largest bottlenecks and speeding the system up two- or threefold, you can't really tell where and why time is spent - it's obscured (and often consumed) by the many layers of abstraction.

How was processing time distributed among modules?
What data store were you using?
What programming language?
What was your target average and peak rates of quotes/second? What was your achieved rate?

Phillip Greenspun's article details a setup for a webserver that would sound fantastic to many: negligible CPU usage for a very dynamic, quite popular web site, on one machine. Most three-tier / DNA / Java gurus I've talked to dismissed this setup as impossible, mostly because their tools and standards could never result in such performance.

I've been doing software (retail, enterprise, embedded, client server, you name it) for a long, long time now, and I've yet to see a business application that cannot be satisfied with a 1Ghz CPU, 1GB of RAM server. I've seen many that aren't, but it's not because the processing power or I/O bandwidth was insufficient - rather, it's often because no one considered how to fit it into such a setting until it was too late to change it.

Ori Berger
Monday, August 25, 2003

Wow, I wish I lived in Ori's world!

Who would have thought the NYSE ran on a single PC?

Who knew you could run a db that could sync to disk 50k transactions a second on a commodity PC?

Who could have known that for a trading system you could process all risk management, order matching, etc in a few hundred CPU cycles - i.e. for a modern processor, about as fast as you could copy a single string.


Monday, August 25, 2003

Interesting thread.  I apologize about the comment about working on a single processor vs. two processors.  I was too tired to be allowed to post anymore.  ; )

christopher baus
Monday, August 25, 2003

BTW, the product I am working on is a network appliance.  I expect to be able to handle somewhere around 500 concurrent connections at wire speed.  I think it is likely that I will have to scale to multiple CPUs, depending on the size of the rule database. 

On Windows, and now Linux, threads and procs are nearly identical as far as context-switch overhead goes.  The difference is that processes don't share a common address space. 

When implementing a single-threaded/multi-connection server, yes, just starting a second process is an easy solution.  But it is likely that the two or more processes will have to communicate with each other.  On Win32 that requires a process synchronization object such as a Mutex instead of a CriticalSection.  Mutexes take an order of magnitude longer to acquire than CriticalSections on Win32.  Plus you have to worry about race conditions, etc., anyway.

IIS and Apache now use a hybrid process/thread model.  My guess is that the reason for this is that at a certain point, heap contention becomes a bottleneck when enough threads are active.  Easy solution: start another proc and get a new heap.

The problem with this approach is developers are forced into using shared memory (or MMFs on Win32) ALL THE TIME.  Hence inter-connection communication has a much higher overhead....

christopher baus
Monday, August 25, 2003

A nameless person responded:

"Wow, I wish I lived in Ori's world!"

Just to clear up - I felt repeating the context was redundant, but apparently, it isn't, so here goes for the nameless people who would rather mock than consider the facts:

My thesis / claim (call it whatever you will) is that, CPU-wise, there is NO justification for a multiple-processor-single-process job in today's business world, and -- by extension -- there is therefore almost no technical justification for the use of threads. Furthermore, using very conservative estimates based on real numbers, I show that to be the case.

I do not claim that NYSE can run on a one processor PC on commodity hardware. I do claim that, computation-wise, it is possible if done right. The reason NYSE requires hundreds of computers to run is NOT a result of a lack of CPU power -- it is a result of many things not having anything to do with CPU power, such as robustness, backups, legacy systems, i/o channels and the like.

A multi-processor machine DOES NOT improve robustness of a system compared to a single processor machine. For a system which does not saturate a single CPU, the only added value of a multi processor machine is ease of management.

I'll sum up my thesis:

a) There's no real need for more than one CPU, for 99% of business applications (there's never enough CPU for graphics and games, though). I demonstrate this by estimating how much CPU power is _really_ needed to run NYSE. You might not agree with the numbers; If so, I'll be glad if you can tell me why.

b) If you're using one CPU, a _robust_ system is simpler to build on a single threaded, asynchronous foundation, than on a multi-threaded, synchronous foundation.

Everyone is invited to disagree, but if we are to maintain a meaningful discussion, it should have some rigour.

Nameless person, keep living in your own world. I'll keep living in mine. And no, I don't run NYSE at home, just in case you had a doubt.

Ori Berger
Monday, August 25, 2003

John Ousterhout has an interesting talk on "Why Threads are a bad idea (for most purposes)":

http://home.pacbell.net/ouster/threads.pdf

Ori, I'd question your assertion that "If you're using one CPU, a _robust_ system is simpler to build on a single threaded, asynchronous foundation, than on a multi-threaded, synchronous foundation", at least as a blanket statement. I've tried just about every way of writing a Winsock server, and found the easiest to get right, and the easiest to understand several months later,  is the thread per connection model. People always say it doesn't scale, but I've found it works fine in many cases. Ori, which model would you favour?

as
Monday, August 25, 2003

a) I disagree with your numbers. Simple function call overhead is in the tens of cycles, an L2 cache _hit_ on a P4 costs 9 cycles, a miss around 250. The idea that you can do anything significant in a few hundred cycles is silly. As I said before, copying a string will normally take a few hundred cycles.

b) Is largely true, but not useful. Almost any realistic system will have to communicate with external servers / dbs, and rightly or wrongly, these typically won't have async interfaces.


Monday, August 25, 2003

a) OK, then our experiences disagree. Store your data in Fortran/column-major order rather than C/row-major order, and use simple arrays rather than objects with pointers to each other, and you'll see your cache hit ratios skyrocket. Also, if you write your routines to work on batches, you eliminate most of the call overhead (which is generally insignificant, but does bother you).

Yes, it's different from what they teach you everywhere, but it's not more complex or anything. It's just different, but also much, much faster on modern architectures (and most older high-performance ones as well).
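
A toy illustration of the layout point, using Python's array module as a stand-in for flat Fortran-style column storage (in C you'd see the cache effect directly; Python can only show the shape of the code, and the field names are mine):

```python
from array import array

# "Objects with pointers to each other": one dict per quote, scattered on
# the heap (the row-major / C-struct style).
quotes_aos = [{"bid": 10.0 + i, "ask": 10.5 + i} for i in range(1000)]

# Column-major / struct-of-arrays: one flat, contiguous array per field.
bids = array("d", (10.0 + i for i in range(1000)))
asks = array("d", (10.5 + i for i in range(1000)))

def mid_prices(bids, asks):
    # A batch routine: one call walks every element sequentially, which
    # amortizes call overhead and keeps memory access contiguous.
    return array("d", ((b + a) / 2.0 for b, a in zip(bids, asks)))
```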

b) What I do is either choose packages that have async interfaces natively, or wrap them (using threads) to provide an async interface for the rest of the program to use.

YMMV; this is my experience going back nearly two decades now (back from the days when it was impossible to get anything done without cycle counting; it never stopped working, but it seems the art has been lost).

Ori Berger
Tuesday, August 26, 2003

as:

How many threads have you tried to run concurrently, and with what kind of network vs. CPU load? If you got to thousands of concurrent connections and profiled your code, you'd probably see that the synchronization primitives take a non-negligible part of the time. (In some cases, if there are a few shared resources that every thread needs often, the synchronization can actually dominate the running time - for example, standard malloc()s often serialize across threads, so that if you constantly allocate/deallocate, you'd be spending lots of time synchronizing and not utilizing more than one CPU no matter how many you have - see, e.g., [ http://www.hoard.org ].)

You're also unlikely to be able to test your system properly at that scale; not because it's hard to generate the load (it's easier than writing the server), but because the system state at any given moment is effectively random, since it includes the system scheduler's state, which you can't query and can't modify except by, e.g., blocking all threads.

And when you use threads, the potential for race conditions increases exponentially with every thread, unless you've synchronized _everything_. It's enough to have one unsynchronized variable with race-condition potential to cause problems, yet it might only become visible under very high random load. And when it does become visible, it won't be reproducible.

My recommendation is to use threads only as a last resort (e.g., when you need to use an API that can't be used asynchronously), and mostly to give it an asynchronous interface. Your main program should be a single-threaded server, never blocking, regardless of the number of connections.

Ori Berger
Tuesday, August 26, 2003
