Fog Creek Software
Discussion Board




Question for TCP Gurus

First, I apologize if this is a little off-topic for this forum; if you can point me to a more appropriate one, that would be great too.

I've been working on a simple database system where the time from the client asking the question to receiving the answer is critical. (Client and server are on the same LAN.)

I wrote a simple test program that just does this in a loop:
select() on write
send() 20 bytes
select() on read
recv() 20 bytes
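
Roughly, each iteration looks like this (a sketch; error handling omitted, and sock is an already-connected TCP socket):

fd_set fds;
char buf[20];

for (;;) {
    FD_ZERO(&fds);
    FD_SET(sock, &fds);
    select(sock + 1, NULL, &fds, NULL, NULL);   /* wait until writable */
    send(sock, buf, 20, 0);                     /* 20-byte request */

    FD_ZERO(&fds);
    FD_SET(sock, &fds);
    select(sock + 1, &fds, NULL, NULL, NULL);   /* wait until readable */
    recv(sock, buf, 20, 0);                     /* 20-byte reply (TCP may
                                                   deliver fewer bytes than
                                                   asked; real code loops) */
}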

I ran two tests on each system, one with both the server and client on the same machine, and one with them on separate machines talking over the LAN:

Linux
K7-850 talking to itself: 20000 transactions/sec
K7-850 talking to K7-850: 7500/sec

Win2k
P3-800 talking to itself: 6900 transactions/sec
P3-800 talking to K7-1200: 4000/sec

The platform my stuff will be running on is Win2k, so I'm a little dismayed to see such a big performance disparity. Any ideas why this is happening, or solutions to fix it? I know Windows has its own custom functions for waiting on IO (as opposed to select) but I haven't investigated them yet.

thanks!

b
Wednesday, June 05, 2002

> I've been working on a simple database system where the time from the client asking the question to receiving the answer is critical.

Real-time protocols use UDP, not TCP.

Another fix is to batch your transactions: one big transaction per second instead of 5000 little transactions per second puts less strain on the network (fewer, larger data packets and fewer acknowledgements). TCP is intended for streaming data.

Your best LAN result (7500 x 40 bytes = 300 KB/sec) works out to about 2.4 Mbit/sec, whereas a single TCP connection can easily carry 10 to 60 Mbit/sec if the data is streaming and not "ping-pong".

> I know Windows has its own custom functions for waiting on IO (as opposed to select) but I haven't investigated them yet.

What does Task Manager say about CPU utilization when you're running the test? If it's less than 100%, then you could get better performance using multiple:
* threads
* TCP connections
* UDP ports
* overlapped data buffers

> select() on write
> send() 20 bytes
> select() on read
> recv() 20 bytes

It's fewer function calls to just do send and recv without select, using blocking/synchronous sockets.

It's more efficient if you have recv() already called before the data is received, so that the OS has your data buffer into which it can copy the data.
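
With a blocking socket the loop shrinks to something like this (a sketch; the inner loop is needed because recv may return fewer bytes than asked for):

for (;;) {
    send(sock, buf, 20, 0);      /* blocks until all 20 bytes are queued */

    int got = 0;
    while (got < 20) {           /* collect the whole 20-byte reply */
        int n = recv(sock, buf + got, 20 - got, 0);
        if (n <= 0) break;       /* connection closed, or an error */
        got += n;
    }
}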

Christopher Wells
Wednesday, June 05, 2002

Excellent suggestions from Christopher.

I would also look at using the Windows APIs directly. They are optimized compared with the basic select/recv APIs.

Regs

James Ladd
Wednesday, June 05, 2002

The Winsock lame list has this to say about select() in the Windows world:

<quote>
select(). Self abusively lame.
Reason: Consider the steps involved in using select(). You need to use the macros to clear the 3 fd_sets, then set the appropriate fd_sets for each socket, then set the timer, then call select().
Then after select() returns with the number of sockets that have done something, you need to go through all the fd_sets and all the sockets using the macros to find the event that occurred, and even then the (lack of) resolution is such you need to infer the event from the current socket state.
Alternative: Use asynchronous operation mode (e.g. WSAAsyncSelect() or WSAEventSelect()).
</quote>
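
For what it's worth, the WSAEventSelect() route looks roughly like this for a single already-connected socket (a sketch; error checking omitted):

WSAEVENT ev = WSACreateEvent();
WSAEventSelect(sock, ev, FD_READ | FD_CLOSE);   /* tie events to the socket */

for (;;) {
    WSAWaitForMultipleEvents(1, &ev, FALSE, WSA_INFINITE, FALSE);
    WSANETWORKEVENTS what;
    WSAEnumNetworkEvents(sock, ev, &what);      /* also resets the event */
    if (what.lNetworkEvents & FD_READ) {
        /* recv() the data here */
    }
    if (what.lNetworkEvents & FD_CLOSE)
        break;                                  /* peer closed the connection */
}
WSACloseEvent(ev);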


Your best course of action depends partly on the number of clients you are expecting to deal with. For smallish numbers the easiest thing is probably to spawn a new thread for each connection, and this may give perfectly acceptable performance. If you've got thousands of simultaneous connections, and don't mind locking yourself into the Windows world completely, it's worth looking at using completion ports. There is, or was, an introductory article on these to be found at:

msdn.microsoft.com/msdnmag/issues/1000/Winsock/print.asp

Good luck.

Andrew Simmons
Wednesday, June 05, 2002

I should also have mentioned that the Winsock FAQ may be found at:

http://tangentsoft.net/wskfaq/

and describes itself as follows:

"This FAQ answers the most commonly-asked questions on the alt.winsock.programming and comp.os.ms-windows.programmer.tools.winsock newsgroups. The FAQ also contains a growing repository of Winsock programming information and links useful for all levels of programmers. Please email me if you have any corrections or additions for the list."

Andrew Simmons
Wednesday, June 05, 2002

Winsock lame list?  Amusing...

If you're utilizing one thread per socket read or write, then the "overhead" for setting up select is very low. It turns out that in most implementations a "set" is merely an array of bits (one unsigned long per 32 fds). So all you're doing on a set is the equivalent of:

fds.fds_bits[fd / 32] |= 1UL << (fd % 32);   /* what FD_SET(fd, &fds) boils down to */

And the macros for this are inline assembler - just to make it fast(er).

Additionally, "setting the timer" amounts to:

struct timeval tv0 = { 1, 0 };   /* e.g. a one-second timeout, kept pristine */
struct timeval tv = tv0;         /* the copy you actually hand to select() */

The equivalent of two 32-bit assignments. Big overhead for setting up select, right? The copy would not even be necessary except that select modifies your tv value, so you need to keep the original around (tv0).

By far, the larger performance hit is select itself, not the set up.
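
Put together, the entire per-call set-up the lame list complains about - one socket, reusing tv0 from above - is just:

fd_set rd;
FD_ZERO(&rd);                        /* clear the set */
FD_SET(sock, &rd);                   /* watch one socket for readability */
struct timeval tv = tv0;             /* refresh the timeout */
if (select(sock + 1, &rd, NULL, NULL, &tv) > 0 && FD_ISSET(sock, &rd)) {
    /* data is waiting - recv() it */
}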

What is so bizarre about Windows Sockets 2.0 (as I recall) is that WSAAsyncSelect() employs a Windows message loop for async notification. So here you are, in the most unlikely place to be drawing GUIs, but you've got this window message being passed around along with an (invisible?) window. Weird...

I have no idea why the Win32 machine is seemingly slower. Perhaps the TCP window on Win32 defaults to a smaller value, which forces more ACKs per data sent?

I don't know...

Nat Ersoz
Thursday, June 06, 2002

I'm impressed that Nat actually bothered to look at the Winsock implementation of select, and I'm sure he's right. I was lazily passing on the comments of people who I assumed knew what they were on about. To be honest, when writing my own sockets applications I've always got perfectly adequate performance by creating a new thread per connection. Although some people swear by the Winsock asynchronous sockets mechanism, it does, as Nat says, seem weird.

On the other hand, my needs for custom sockets software are modest, and if I needed a high-performance database system I would buy one from some one who knows better than I what they are doing.

If you really must build your own, chapter 27 of Stevens's "Unix Network Programming" has a discussion of design alternatives. As to why the Windows implementation is slower, beats me. TCP performance tuning seems to be a black art to some extent.

Re the lame list, I've just found a possibly relevant quote from Jon Snader's book "Effective TCP/IP Programming": "One might wonder where the lameness lies when a system call does not perform according to its published specification".

Andrew Simmons
Thursday, June 06, 2002

:)  I don't know about Winsock - I looked up the Linux implementation; I have no idea what Winsock does. I haven't written Win-anything since leaving the empire (I formatted my laptop hard drive when I left - just to be sure I didn't take anything out).

On another note, it does show a limitation of select, which is that you can easily get false hits when applying it against multiple fds. Which I never even considered - as you say, use a thread per socket, and everything works fine.

Nat Ersoz
Thursday, June 06, 2002

If you want high performance Winsock apps, one thread per socket won't scale. Use async Winsock and I/O completion ports.

This:  http://www.amazon.com/exec/obidos/ASIN/0735615799/
and the Apache 2.0 source code for Windows make good refs.
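
In outline, the completion-port pattern is something like this (a sketch only; one connected socket and one worker thread shown, error handling and shutdown omitted):

/* Create a port and associate the socket with it. */
HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
CreateIoCompletionPort((HANDLE)sock, port, (ULONG_PTR)sock, 0);

/* Post an overlapped read up front, so the stack has a buffer to fill. */
char buf[20];
WSABUF wb = { sizeof(buf), buf };
OVERLAPPED ov = { 0 };
DWORD flags = 0;
WSARecv(sock, &wb, 1, NULL, &flags, &ov, NULL);

/* Worker thread: block until any posted I/O completes. */
for (;;) {
    DWORD bytes;
    ULONG_PTR key;
    OVERLAPPED *pov;
    GetQueuedCompletionStatus(port, &bytes, &key, &pov, INFINITE);
    /* key says which socket completed; process the bytes, send the
       reply, then post another WSARecv to keep the pipeline full. */
}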

those who know me have no need of my name
Thursday, June 06, 2002

I have some sample multi-threaded asynchronous Winsock2 code on my website that uses I/O completion ports, etc. The code provides a (hopefully) reusable C++ class that deals with all the grungy IO issues and leaves you free to worry about the application level logic.

It may be overkill for what you're trying to do, but if you're interested, take a look:

http://www.jetbyte.com/portfolio-showarticle.asp?articleId=37&catId=1&subcatId=2

Len Holgate
Thursday, June 06, 2002

> If you want high performance Winsock apps, one thread per socket won't scale.

I know people say this, and after a certain point it's no doubt true, but for a large class of applications one thread per socket works just fine. It's important to consider factors such as number of simultaneous connections, frequency of requests on each one, amount of data transferred per request, and so on, before rushing into an over-complicated solution.

Andrew Simmons
Thursday, June 06, 2002

Thanks for the feedback. I/O completion ports look like the way to go.

There will probably be around 1000-2000 clients, each of which will only average a few requests per minute, but when they ask a question, they want the answer right away. (Mainly so I can avoid multithreading the clients, so don't try to convince me to multithread the server :)

b
Thursday, June 06, 2002

'select() on write
send() 20 bytes
select() on read
recv() 20 bytes'

If they absolutely must have the answer as fast as possible, try disabling the Nagle algorithm, which holds back small (sub-MSS) packets while earlier data is still unacknowledged - with delayed ACKs on the other end, that can mean stalls of around 200 milliseconds - in the interests of bandwidth efficiency. The actual option is TCP_NODELAY.
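
Turning it off is one setsockopt call per socket:

int on = 1;   /* disable Nagle: send small packets immediately */
setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, (const char *)&on, sizeof on);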

The default TCP configuration (MTU and whatnot) on Windows 2000 is rather inefficient; try the registry changes suggested on speedguide.net.

If requests are infrequent for a given client, one-shot UDP requests will probably be faster than going to all the trouble of either maintaining or doing setup/teardown of TCP connections.
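
A one-shot exchange is only a few lines (a sketch, assuming s is a UDP socket and server is a filled-in sockaddr_in; no retry logic shown - see the caveats below):

char req[20], rep[20];
sendto(s, req, sizeof req, 0, (struct sockaddr *)&server, sizeof server);

fd_set rd;
FD_ZERO(&rd);
FD_SET(s, &rd);
struct timeval tv = { 0, 250000 };              /* give the reply 250 ms */
if (select(s + 1, &rd, NULL, NULL, &tv) > 0)
    recvfrom(s, rep, sizeof rep, 0, NULL, NULL);
/* else: timed out - retransmit or give up */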

Jason McCullough
Thursday, June 06, 2002

<<< There will probably be around 1000-2000 clients, each of which will only average a few requests per minute, but when they ask a question, they want the answer right away. (Mainly so I can avoid multithreading the clients, so don't try to convince me to multithread the server :) >>>

While you probably should work to provide answers as soon as possible, you can't rely on the clients receiving them quickly enough after asking - you should either multithread the clients or, preferably, make them asynchronous.

If you have 2000 clients, you WILL have collisions, you WILL have exponential backoff kicking in, you WILL have lost packets and retransmits. This means that, occasionally, a client might wait a few seconds even though the server handles it within a millisecond.

P.S.: TCP does packet accounting for you, but UDP doesn't. If you use it, you have to do the bookkeeping yourself to make sure that every request/reply is processed exactly once - packets can be lost, duplicated, arrive out of order, etc. Robust UDP protocols, though not complicated, are not as simple as they initially sound, and many end up implementing a large portion of TCP (usually retransmit, and often ordering; if you expect to scale, exponential backoff is also required, which puts you back with TCP except you have implemented it yourself).

Ori Berger
Thursday, June 06, 2002

I'm puzzled to see people recommending UDP for a database server, where presumably it matters that clients get accurate data when they request it. As Ori points out, UDP is an unreliable protocol which provides none of the book-keeping that TCP does. This means that to provide a reliable service the client and server are going to have to do things such as re-transmitting lost data, re-assembling datagrams in the correct order, discarding duplicate datagrams, checking for corrupt data, flow control, and so on. Given the number of person-hours that have gone into doing these things in TCP, why would you even think of re-inventing the wheel? UDP may have its uses, but I can't see that this is one of them.

Andrew Simmons
Thursday, June 06, 2002

<<If you have 2000 clients, you WILL have collisions, you WILL have exponential backoff kicking in, you WILL have lost packets and retransmits. This means that, occasionally, a client might wait a few seconds even though the server handles it within a millisecond.>>

Yep. The question is how to minimize this problem. My client requests come in three groups:

1. Frequent, easy to answer queries. Handled Async.
2. Rare, slow queries (may involve the server getting data from other servers, etc.) Handled Async.
3. Rare, easy to answer queries. Handled Sync.

Not surprisingly, most kinds of queries are type 3, so I'm hoping that with a fast server CPU and some optimization I can avoid debugging a bunch of asynchronous logic.

b
Thursday, June 06, 2002

If the amount of data transferred is not too large (a few KB), why couldn't you just include a CRC checksum with each UDP request / response and retry the request if the checksum doesn't match, or if there is no response within a small time frame? Doesn't sound too complicated to me, or is there a catch?

Frederik Slijkerman
Friday, June 07, 2002

Frederik - a checksum is already included in UDP (and even plain IP) packets. It's not a CRC, but it gets the job done (especially since there's a checksum at every level - no single checksum is reliable enough on its own).

UDP's (lack of) reliability is not about getting corrupt data; it's about packets getting lost, getting duplicated, or arriving out of order. There's also the fragmentation issue, which sometimes (more often than you'd expect, though rarely on LANs) boils down to a limit on the size of UDP packet that will actually reach a destination - try to get a 64K UDP packet (the maximum size) to a machine on the other side of the globe through 30 hops or so, and see how many get there (usually, none).

Let's say you use UDP.

1. If your packets are large, you have to fragment them yourself. (Or switch to TCP and let it do it for you).

2. If you're worried about packet loss, you'd have to add some retransmit mechanism. (Or let TCP do it for you.) But if you do add one, you need to be very careful about when you retransmit, because if all clients use the same deterministic (or not random enough) strategy, you'll get a phenomenon known as bunching - if two packets collide once, their retransmits are likely to collide again (and again, and again...). An extremely robust solution is to use exponentially increasing, but randomized, delays between retransmits - known as "exponential backoff", which TCP will gladly do for you if you let it. (There's a sketch of this after the list.)

3. Once we've got packet loss out of the way, there's the problem of packet duplication. You have to make sure that if a packet is received twice or more, it doesn't do any harm like getting server and client out of sync. To do that, you either make sure that the actions done on packets ON BOTH SERVER AND CLIENT are idempotent (meaning, doing them any number of times is equivalent to doing them exactly once), passing the complexity to the protocol design; or you add identifiers and discard duplicates (which TCP will gladly do for you, if you use it).

4. Finally, there's the issue of packet order - again, you can pass the complexity to the app by making sure that request order does not matter - i.e., making the system completely stateless - but that's rarely easy, because nothing perfectly stateless at this level is useful. Alternatively, you can reorder packets on arrival back to the order the client sent them (which is what TCP does).
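
To give a flavour of points 2 and 3, a client-side retransmit loop might look like this (a sketch; next_request_id, send_request and wait_for_reply are hypothetical helpers that stamp outgoing packets with an ID and match replies against it):

unsigned long id = next_request_id++;   /* carried in the packet header */
int wait_ms = 100;                      /* first retry after 100-200 ms */
int tries;
for (tries = 0; tries < 6; tries++) {
    send_request(sock, id, req, req_len);
    if (wait_for_reply(sock, id, wait_ms + rand() % wait_ms))
        break;                          /* a reply with a matching id arrived */
    wait_ms *= 2;                       /* randomized exponential backoff */
}
/* On the server, remember recently seen ids per client and resend the
   cached reply for a duplicate rather than re-executing the request. */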

Costs of maintaining an open TCP connection have gone down significantly over the past few years. TCP connection lists inside kernels went from linear searches (sufficient when you have ~20 connections open on average) to hashes (great average performance) and btrees (great worst-case performance and good average-case performance). The Berkeley API and modern TCP stacks are still not up to the job of handling 60000 simultaneous TCP connections for the most part, but doing a better job using UDP is an extremely nontrivial task, and usually requires relaxing two of the issues above (when most applications can afford to relax one, or none at all).

Unless you are already experienced enough with network code to assert that you can use UDP more effectively than TCP, go with the latter, or you will gain this experience - but probably at the cost of late, low-quality delivery and much, much frustration. Especially with networks, nothing behaves in real life as it does in the QA lab.

Ori Berger
Friday, June 07, 2002

Thanks for the explanation. Personally, I'm using UDP only when I need to do broadcasts on local networks -- I was merely thinking out loud.

Frederik Slijkerman
Friday, June 07, 2002


Check out ACE (ADAPTIVE Communication Environment). It is a mature, cross-platform C++ library for network and distributed applications.

http://www.cs.wustl.edu/~schmidt/ACE.html

Zwarm Monkey
Friday, June 07, 2002

I'll second the ACE endorsement.  Very nice design & in use by some big name companies. 

I heard a rumor that ACE might be included in the next standard.  Anybody have any knowledge?

those who know me have no need of my name
Friday, June 07, 2002

I have to admit that someone told me about ACE a while back, but it didn't pass my simpleminded comprehension test. I clicked around on the ACE website for 20 minutes and I couldn't figure out what problem of mine ACE would solve. (I'm not saying it wouldn't solve any, just that I couldn't understand what the heck they were talking about! :)

b
Friday, June 07, 2002

Don't use UDP. Period. Unless you want to implement reliability, flow control, reassembly, etc. yourself. The only reason for using UDP is if you're sending streams where retransmission is not a good idea (e.g., streaming realtime audio: if you lose a packet, you don't want it one second later, because you can't use it). [NFSv3 now uses TCP because there just isn't any significant performance advantage of UDP over TCP.]

I'm surprised at the speed difference between Linux and Windows; I had heard that Windows uses the BSD stack, so performance ought to be similar. But my observation is that Windows sucks at networking anyway (do a machine-to-machine copy and watch your CPU utilisation go to 100%, where a Unix or Linux machine would barely register).

As far as I know, all the high performance Web servers use select(), not multiple threads. However, multiple threads are easier to program. If you want to know more about how select() works under the covers, get Stevens & Wright's book, which documents the source code of the (older) BSD TCP stack... things have probably been improved since then.

Disabling the Nagle algorithm is usually a bad idea (again, see Stevens's book). However, if you're sending short packets, you probably do want to disable it, because you'll wait 200 ms for each packet anyway and there'll be no change in network load. "ethereal" is your friend (http://www.ethereal.com).

Also:
Effective TCP/IP Programming: 44 Tips to Improve Your Network Programs
by Jon C. Snader

PeterL
Sunday, June 09, 2002

AFAIK Nagle is roughly the equivalent of stdio buffering for files... If you are writing a few bytes to a socket here and there, leave it on; but if you do your own buffering and coalesce it all into one big write, then turn Nagle off...
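
In code terms (hdr and body standing in for whatever small pieces your protocol produces):

/* Instead of several little sends, which Nagle exists to batch up... */
send(sock, hdr, hdr_len, 0);
send(sock, body, body_len, 0);

/* ...coalesce them yourself, and Nagle has nothing useful left to do: */
memcpy(buf, hdr, hdr_len);
memcpy(buf + hdr_len, body, body_len);
send(sock, buf, hdr_len + body_len, 0);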

Dan Maas
Sunday, June 09, 2002

BTW I am not surprised at all that Linux turned out to be faster, especially on the localhost test. The basic overhead for a system call in Linux is much lower, and context switches are much faster than in Windows. If you are just ping-ponging bytes, then you're really just testing syscalls & context switches...

Dan Maas
Sunday, June 09, 2002

A question for Ori - I was under the impression that the checksum for UDP was optional, whereas that for TCP is mandatory. Is this correct, and are there any implementations of UDP nowadays that don't in fact include the checksum? All right, two questions.

Re ACE, we use it here, and I don't really care for it myself, but it does have a large and devoted following. There's a recently published book on it:

"C++ Network Programming Volume 1 - Mastering complexity with ACE and Patterns" by Doug Schmidt and Stephen Huston.

which is worth browsing to see if you might find ACE helpful. It is also supported by one of the most helpful and responsive newsgroups I have ever encountered - comp.soft-sys.ace.

Andrew Simmons
Sunday, June 09, 2002

UDP packets, like TCP packets (and ICMP packets, and just about every other useful packet), have a checksum over both data and header - see, e.g., [ http://www.freesoft.org/CIE/RFC/768/index.htm ] - a fine example of how specifications for general use should be written (it's not semantically precise enough to validate an implementation against, but it's concise, and a damn good reference).

Although most TCP stacks in this world started out as the original BSD network stack, they're far from equivalent. The Linux networking code in 2.4 has neat tricks like zero-copy and significantly improved connection management that give it an advantage over just about any other general purpose TCP implementation. Other implementations will surely catch up soon (perhaps they already have while I wasn't looking). When ZDNet tried a year ago to saturate Tux, a webserver for Linux that takes advantage of all of this functionality (and was, in fact, an important factor in its development), they couldn't, and had to downgrade their test system because they couldn't generate enough load - they had no problem saturating either IIS or Apache in the same test.

BTW, I thought I'd mention that TCP works well for many applications, and it's probably the best choice for session/connection based protocols. However, if there are multiple receivers, or group communication is otherwise important, you have better alternatives - UDP broadcast and IP multicast for non-reliable transmission, or Spread [ http://www.spread.org ], which gives you all the reliability and synchronicity guarantees you could hope for (at the cost of more CPU cycles than you intended to spend on networking, but usually worth it).

Ori Berger
Sunday, June 09, 2002

Thanks for the RFC link - it's excellent, & very readable.

I had a vague impression that somewhere in one of Stevens's books he made that statement about the UDP checksum, but I can't lay my hands on it. The RFC could be read that way, when it says "An all zero transmitted checksum value means that the transmitter generated no checksum (for debugging or for higher level protocols that don't care)."

Andrew Simmons
Sunday, June 09, 2002

Dan Maas has it backwards ... there's no point in turning Nagle off if you're sending large chunks down the wire. The packets will get sent as soon as they're full, Nagle or not.

Nagle's algorithm works by holding back a partially full packet until the data already in flight has been acknowledged - which, with the peer's delayed ACKs, can take roughly 200 ms - to see if you're going to put any more bytes out first. It tries to coalesce small packets into larger ones while still not waiting too long (just imagine what running an interactive editor such as vi or emacs would be like if the buffering tried to wait until there were a full 1500 bytes to send).

As Snader's book says, if you think you need to turn off Nagle, you're probably wrong. But there are a few occasions when you do need to turn it off, which is why the TCP_NODELAY option exists. You might get to know your network administrator better if you do this unnecessarily and it results in a huge number of tiny packets.

BTW, Ori Berger mentions that Linux 2.4 has zero-copy... other TCP stacks have this also: do a Google search for "TCP zero-copy".

PeterL
Sunday, June 16, 2002

------------------------------
I ran two tests on each system, one with both the server and client on the same machine, and one with them on separate machines talking over the LAN:

Linux
K7-850 talking to itself: 20000 transactions/sec
K7-850 talking to K7-850: 7500/sec

Win2k
P3-800 talking to itself: 6900 transactions/sec
P3-800 talking to K7-1200: 4000/sec
------------------------------

Linux short-circuits localhost connections so that they pretty much don't touch the network stack. Windows doesn't, because it doesn't use sockets for IPC (it has other, better mechanisms), so the only reason you'd want a localhost connection is to debug a network problem.

As for the other disparity... uncertain. Windows is designed to give the best bang for the buck when you use mechanisms other than select(), such as overlapped I/O.

Simon

Simon Cooke
Thursday, June 12, 2003
