Fog Creek Software Discussion Board

Windows & Time

Perhaps some of you have a solution, or a direction towards a solution.

I have a distributed system that needs to do things with low (<10ms) average latency. Unfortunately, I'm forced to run Windows in some parts of it. I'm using TCP sockets for communication, and even though the Windows system is consistently below 2% CPU, doing _nothing_ but wait for an event and respond, I see delays in the 100ms range.

I have turned off Nagle, and tried playing with the almost-undocumented TcpAckDelay registry key, but to no avail.

Does anyone have any idea where the problem could lie?

This is a 'soft realtime' system, not a 'hard' one - I can tolerate delays every once in a while as long as the average is consistently low. And I'm forced to stay with Windows for now because a component I use runs only on Windows and does not come with source.

I'm considering switching to UDP, under the (possibly false) assumption that UDP has much less buffering to interfere with timing; However, I would then have to implement much of TCP above it, which is something I'm trying to avoid.

Also, if anyone knows how to get 1ms-accurate time of day on Windows, please respond. (And no - the standard GetTimeOfDay calls provide 10ms or 15ms precision, even though the documentation implies nanoseconds. They only report the value on a nanosecond _scale_.)

Philo, if you ever go by the place where they keep the kernel developers, please tell them that 1ms time of day matters to some people. 1ms ping time resolution is important to _everyone_.

<rant>And this is even more frustrating because on Linux and Solaris these things just work as one expects</rant>

Ori Berger
Tuesday, June 29, 2004

Did you not rant on this sometime earlier?

http://discuss.fogcreek.com/joelonsoftware/default.asp?cmd=show&ixPost=85520

KayJay
Tuesday, June 29, 2004

QueryPerformanceCounter and QueryPerformanceFrequency will give you a high-resolution timer. Unfortunately it isn't synchronised to the time of day, but if all you need is "how much time has elapsed since I last asked", then they might help you.
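
For example, a minimal sketch of timing an interval with these two calls (error handling omitted):

#include <windows.h>
#include <cstdio>

int main() {
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);          // counts per second
    QueryPerformanceCounter(&start);

    Sleep(5);                                  // the work being timed

    QueryPerformanceCounter(&stop);
    double elapsedMs =
        (stop.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart;
    printf("elapsed: %.3f ms\n", elapsedMs);
    return 0;
}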

Adrian
Tuesday, June 29, 2004

According to http://dream.eng.uci.edu/TMO/pdf/isorc2002_clock.pdf ("Realization of a Distributed OS Component for Internal Clock Synchronization in a LAN Environment") it should be possible, as they write: "In the current implementation of TMOSM on Windows XP/2000/NT, the resolution of DCSage TMOSM pretends to support is 1 microsecond but to the application TMO programmer, the precision of the time value that can be trusted is at the level of a millisecond."

Find more of this work starting at http://dream.eng.uci.edu/ , with descriptions of (dated?) implementations on NT. Unfortunately it seems their software is no longer (temporarily?) downloadable ( http://dream.eng.uci.edu/TMO/tools/Default.htm ).

Just me (Sir to you)
Tuesday, June 29, 2004

> Does anyone have any idea where the problem could lie?

Have you boosted your process and thread priorities?
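
If not, here's a minimal sketch of what I mean (HIGH_PRIORITY_CLASS rather than REALTIME_PRIORITY_CLASS, which can starve the rest of the machine):

#include <windows.h>

// Raise scheduling priority so the network thread preempts normal work as
// soon as data arrives.
void BoostCurrentThread() {
    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
}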

How is your receive implemented: select? notification after asynch read? or a blocking (synchronous) read? The blocking read should have the lowest latency (though not the best throughput for multiple connections, as it requires a dedicated thread for each socket that you're reading from).

Perhaps http://www.sysinternals.com/ntw2k/freeware/tdimon.shtml with timestamps enabled will tell you something.

> I'm considering switching to UDP, under the (possibly false) assumption that UDP has much less buffering to interfere with timing; However, I would then have to implement much of TCP above it, which is something I'm trying to avoid.

That should be a true assumption; and it shouldn't take much code to test it. The way fax-over-UDP works, by the way, is simply to always send multiple UDP packets (with a sequence number), expecting that at least one of the packets will get through, and to discard received duplicates. This may be simpler to implement (but more expensive in bandwidth) than the TCP "retransmit if you don't get a timely ACK" algorithm.
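
A rough sketch of that duplicate-and-discard scheme over Winsock might look like this (the 4-byte sequence header and the helper names are just for illustration; error handling omitted):

#include <winsock2.h>
#include <cstring>

// Prefix every datagram with a sequence number and send it kCopies times;
// the receiver keeps the highest sequence seen and silently drops duplicates.
static const int kCopies = 2;

int SendRedundant(SOCKET s, const sockaddr_in& dest,
                  unsigned long seq, const char* payload, int len) {
    char buf[1500];
    unsigned long nseq = htonl(seq);
    memcpy(buf, &nseq, sizeof(nseq));
    memcpy(buf + sizeof(nseq), payload, len);
    int total = (int)sizeof(nseq) + len;
    for (int i = 0; i < kCopies; ++i)
        sendto(s, buf, total, 0, (const sockaddr*)&dest, sizeof(dest));
    return total;
}

// Returns the payload length, or -1 for a duplicate/old datagram.
int RecvDedup(SOCKET s, unsigned long* lastSeq, char* out, int outLen) {
    char buf[1500];
    int n = recv(s, buf, sizeof(buf), 0);
    if (n < (int)sizeof(unsigned long)) return -1;
    unsigned long seq;
    memcpy(&seq, buf, sizeof(seq));
    seq = ntohl(seq);
    if (seq <= *lastSeq) return -1;            // already seen (or very old)
    *lastSeq = seq;
    n -= (int)sizeof(seq);
    if (n > outLen) n = outLen;
    memcpy(out, buf + sizeof(seq), n);
    return n;
}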

Christopher Wells
Tuesday, June 29, 2004

Do you have an anti-virus program running? Our anti-virus program becomes active whenever a read or write is done.  If you look at the machine while it's idle it sits at 1%; hit a read/write and it can easily add 100ms or more.

Anonanonanon
Tuesday, June 29, 2004

It works on a PC platform under Linux?  Good to know.

I didn't think Windows HAD a 'hard real-time' philosophy.  If something happens in 100 mSec, that's 10 times per second.  Fast enough for humans...

I've moved to PIC and AVR processor-based single-board computers, tied to the PC over a serial port, for these kinds of applications.

Originally, the PC's internal timer chip was programmed to overflow at around 55 mSec, which supported the TimeOfDay clock tick IRQ.  This could be reprogrammed (under DOS) so that reading the timer along with TOD would give you a clock accurate to 900 uSec.

I know that Windows 2000, when it starts up, reprograms the timer chip to provide 'faster' ticks -- 10 mSec? 7 mSec?  not sure.

The statements about the performance counter are true -- it gives you 1 mSec accuracy, but it is NOT synchronized with the TOD clock.

I don't think UDP will improve your performance.  You'll be replacing an optimized TCP/IP stack (kernel level?) with your own code at user level.  I'd expect performance to be worse.

AllanL5
Tuesday, June 29, 2004

Thanks for the answers everyone!

KayJay: I did rant on the time issue before, and possibly on the tcp issue before (I don't remember and the search function here is broken). That was over 6 months ago, and I keep finding new problems related to the same issues of broken design and implementation (not sure which), so I decided to ask again.

Adrian: Indeed, these do provide excellent precision for short measurements, but they do not help me in this case: I have machine A receive an event through the ether, generate a network transmission to machine B, which then has to act on it. I get unreasonable latency, and I'm trying to figure out why. I've measured my code, and it's less than 1ms between receiving the event at A and sending it, and also less than 1ms between receiving the event at B and acting on it. Which means the time is spent either in OS network buffers or in task switches. Neither is under my control, so no matter how precisely I measure, it doesn't really help me.

Christopher: I've tried boosting priorities, to no avail. Notification is received by a blocking select() call, because it needs to monitor multiple sources. A blocking recv() would require multiple threads, which I might try, but that would complicate many things. I must say it would leave me nothing short of astonished if a blocking select() call were the culprit behind a 100ms delay, though!

I'm not sure what times TDImon is reporting (packet receipt? packet processing?), and I can't correlate them properly between two Windows machines with the poor clock synchronization.

I may adopt the fax-over-UDP solution eventually, if I don't find any other solution.

Anonanonanon: Thanks, the A/V is already disabled. But it's still loaded as a driver. As far as I know this shouldn't impact anything -- but perhaps someone knows better?

AllanL5: Neither Windows nor Linux has a 'hard real-time' philosophy. But I'm unable to get reasonable 'soft real-time' from Windows either. It's capable of soft realtime in SOME forms -- e.g. in arcade games -- but apparently not good enough for me in the network area. Linux manages to properly keep track of time with microsecond accuracy, and has well-understood and predictable network timing patterns.

The tick speed isn't the issue - Linux 2.0 and 2.2 provided microsecond time precision even with a 100 Hz interrupt clock. And a machine with under 2% load shouldn't introduce a latency greater than 20ms under any circumstance, even if the clock interrupt comes once an hour.

Win2K reprograms the chip to produce either 10ms or 15ms, depending on the weather, the day of the week, your sexual preference and a random number (usually 17).

As Christopher mentioned, UDP may help me if I don't try to replace the entire TCP packet; I have bandwidth and throughput to spare that I'm willing to trade for latency.

Ori Berger
Tuesday, June 29, 2004

s/entire TCP packet/entire TCP stack/

Ori Berger
Tuesday, June 29, 2004

To get that level of accuracy in Windows, you really need to load a real-time add-in.  You can't get anything faster than the timer tick, which is much coarser than 1ms.

I've dealt with customers in similar situations (we make real-time embedded systems).  There's not really a reliable solution.

Here's a good write-up on it:
http://www.flounder.com/time.htm

There's a number of add-ins for Windows that give you real-time support.  http://www.windowsfordevices.com/articles/AT5376962137.html

I've had customers speak highly of the VentureCom solution.

Myron A. Semack
Tuesday, June 29, 2004

Also, I'm assuming you've looked at the MIDI timers.

http://www.sysinternals.com/ntw2k/info/timer.shtml

Myron A. Semack
Tuesday, June 29, 2004

Here, I dug this out of my customer support e-mails:
http://www.cl.cam.ac.uk/users/kw217/useful/win32time.html

Myron A. Semack
Tuesday, June 29, 2004

> I must say it would leave me nothing short of astonished if a blocking select() call were the culprit behind a 100ms delay, though!

It would surprise me too. Something else that might work better than select-followed-by-read is a single overlapped ReadFile, perhaps attached to a completion port ... the idea being that, when the network stack receives a packet, it should already have a buffer of yours into which it should copy the data.
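
Very roughly, and untested - the Conn structure and helper names here are made up for illustration, and WSARecv would work just as well as ReadFile:

#include <winsock2.h>
#include <windows.h>

// Keep one overlapped read posted per connection, so the stack already owns
// one of our buffers when a packet arrives.
struct Conn {
    SOCKET     s;
    OVERLAPPED ov;
    char       buf[4096];
};

void Attach(HANDLE iocp, Conn* c) {
    // Associate the socket with the completion port, keyed by the Conn*.
    CreateIoCompletionPort((HANDLE)c->s, iocp, (ULONG_PTR)c, 0);
}

void PostRead(Conn* c) {
    ZeroMemory(&c->ov, sizeof(c->ov));
    ReadFile((HANDLE)c->s, c->buf, sizeof(c->buf), NULL, &c->ov);
}

void Pump(HANDLE iocp) {
    for (;;) {
        DWORD bytes; ULONG_PTR key; OVERLAPPED* ov;
        if (GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE)) {
            Conn* c = (Conn*)key;
            // ... handle c->buf[0 .. bytes) here ...
            PostRead(c);            // immediately re-arm the read
        }
    }
}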

> I'm not sure what times TDImon is reporting (packet receipt? packet processing?), and I can't correlate them properly between two Windows machines with the poor clock synchronization.

I'm not sure either. "TDI" is the kernel-level network protocol stack ... so the times it's reporting should be the times of various kernel-level events. I was wondering whether it might ever show 100 msec being consumed between the receipt of the packet and the sending of the reply.

Also if you ever experience a 1ms roundtrip you could use that event to synch the TDI logs from your two machines.

It wouldn't fix the problem, but might provide an insight into what's happening, when, in the kernel.

Christopher Wells
Tuesday, June 29, 2004

> I may adopt the fax-over-UDP solution eventually, if I don't find any other solution.

FYI it transmits e.g. two packets for every one, and uses no ACKs (it's streaming data in one direction) ... but the protocol has a higher-level mechanism to cope with lost packets (i.e. it either allows fax scanlines to be lost, or it retransmits entire frames (consisting of several packets) if fax error-correction is enabled).

SFAIK, VoIP (which needs to be fixed-maximum-bandwidth soft-real-time, and doesn't need to recover lost packets) is similar.

When I implemented reliable-transport-over-UDP (and IPX), I did it by retransmitting every 10 msec until I received an ACK: note that unlike TCP I had no "exponential backoff", and I also had a "window size" of 1 ... so, I consumed more network bandwidth than TCP, and had lower maximum throughput (because of the ACK ping-ponging and small window size) as the price of reducing latency. I found it important to have not one but two simultaneous overlapped reads outstanding on each connection (so that the kernel still had a buffer of mine for the next packet, even while it was returning to me my buffer containing a packet).

This mechanism worked OK for low-bandwidth traffic over a dumb LAN (better with plain Ethernet than with token-ring), but not well with routers and dial-up connections in the loop.
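
In outline it was a stop-and-wait scheme along these lines (simplified sketch - the 4-byte sequence header and the one-second give-up are illustrative, not what I actually shipped):

#include <winsock2.h>
#include <cstring>

// Stop-and-wait: one packet in flight, retransmit every 10 ms until the
// peer echoes the sequence number back as an ACK (no exponential backoff).
bool SendReliable(SOCKET s, const sockaddr_in& dest,
                  unsigned long seq, const char* data, int len) {
    char pkt[1500];
    unsigned long nseq = htonl(seq);
    memcpy(pkt, &nseq, sizeof(nseq));
    memcpy(pkt + sizeof(nseq), data, len);

    for (int attempt = 0; attempt < 100; ++attempt) {      // give up after ~1 s
        sendto(s, pkt, len + (int)sizeof(nseq), 0,
               (const sockaddr*)&dest, sizeof(dest));

        fd_set rd;
        FD_ZERO(&rd);
        FD_SET(s, &rd);
        timeval tv = { 0, 10000 };                         // 10 ms
        if (select(0, &rd, NULL, NULL, &tv) > 0) {
            unsigned long ack;
            if (recv(s, (char*)&ack, sizeof(ack), 0) == sizeof(ack) &&
                ntohl(ack) == seq)
                return true;                               // ACKed
        }
    }
    return false;                                          // peer unreachable
}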

Christopher Wells
Tuesday, June 29, 2004

Have you hooked up a sniffer to see when the packets are leaving and entering the machine? Then you can get a feel for the processing overhead on each side.

An interesting test would be to bypass the stack completely and measure performance.

On real-time systems we have event triggers through the entire system to measure latencies etc.; that doesn't appear to be available on Windows.

Though I agree that adding a real-time module would be necessary to get guaranteed performance.

son of parnas
Tuesday, June 29, 2004

I don't know what might happen if you put a network time server on your LAN: how accurately it might be able to synch the time-of-day clocks of your machines ... which might help with your testing/logging.

Christopher Wells
Tuesday, June 29, 2004

Just a random thought: if you do "ping" between these machines, what response time do you get? Also try "ping -l sizeofyourdata"...

WildTiger
Tuesday, June 29, 2004

We've done a lot of work synchronizing Windows systems over a network for distributed data acquisition, and we've written a lot of Windows drivers for high-speed devices.

With Windows, it appears that there is nothing you can do about the occasional times the system goes away on you for 10ms+. We've never tracked down the source of the problem; since our software has to run on customer-provided systems, we have little control. We have done the usual things - tried changing priorities, killing services, using UDP instead of TCP, and so on.

I'm not surprised that you're seeing 100ms intervals occasionally. I've heard of a study (which I never tracked down) done by an academic computer scientist in Germany or Switzerland that documents this behavior in more detail. His results (as told to me) are consistent with your measurements.

I assume you've noticed the problem with Windows adjusting time occasionally, which means that tagging events with the time of day produces interesting artifacts.

Recently, we've switched almost entirely to developing USB-based equipment, so we keep track of time in the device, not in Windows. Of course, synchronizing multiple devices requires an additional hardware connection for that purpose.

Dan Brown
Tuesday, June 29, 2004

One option is to see if your Windows program will run under Linux with Wine.  No idea what kind of latency you get there though.  Could be better, could be worse.

The Pentium has a time-stamp counter (read with the rdtsc instruction) that counts clock ticks irrespective of anything your program is doing.  If you have a 2 GHz CPU, it counts to 2 billion every second, period.  You may need to write a device driver to read it.  I have code that will calibrate its readings by way of least-squares linear regression; it works pretty nicely.  Calibrate it, get the current time of day, then just use the register contents from there on out.  It's 64 bits so rollover isn't a problem.
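
A simplified user-mode sketch of the calibration idea (using the compiler's __rdtsc intrinsic and a single interval against QueryPerformanceCounter rather than a full least-squares fit):

#include <windows.h>
#include <intrin.h>     // __rdtsc on MSVC; GCC/Clang have it in <x86intrin.h>

// Estimate TSC ticks per millisecond once at startup, then convert raw
// __rdtsc() deltas to milliseconds from then on.
double TscTicksPerMs() {
    LARGE_INTEGER f, t0, t1;
    QueryPerformanceFrequency(&f);

    unsigned __int64 c0 = __rdtsc();
    QueryPerformanceCounter(&t0);
    Sleep(100);                                   // calibration interval
    unsigned __int64 c1 = __rdtsc();
    QueryPerformanceCounter(&t1);

    double elapsedMs = (t1.QuadPart - t0.QuadPart) * 1000.0 / f.QuadPart;
    return (c1 - c0) / elapsedMs;
}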

Finally, if you're on a private, quiet network, then modify your program to toggle a line on either a printer port or an RS232 port (yep, legacy devices still have their uses).  Hook a scope or logic analyzer to the Ethernet wire and this line, and you can see exactly where the latency is.

Post your email here if you want the code.  I'd post mine but I'm afraid of spam harvesters.

Snotnose
Tuesday, June 29, 2004

Maybe this paper can help:

http://www.microsoft.com/windows2000/techinfo/howitworks/communications/networkbasics/tcpip_implement.asp

Lots of tips about registry configuration of TCP/IP parameters; worth a shot.

R
Tuesday, June 29, 2004

I'd be curious if the same thing happened using blocking sockets or I/O completion ports.  My feeling is that WSASelect() was hacked onto NT's I/O subsystem, and probably has the lowest performance of the 3, but I don't have any evidence to back that up. 

BTW -- this is off the subject I know -- Ori, you are adamant about using single-threaded servers.  I took your advice not because I think they are less complicated than thread-per-connection servers, but because if you really want Linux to scream and behave predictably under load that’s really the only way to go.

But I’ve run into what I consider a serious problem using this model: program tracing and logging.  I’ve been forced to limit logging to one log entry per request in production mode because of the overhead of blocking logging.  I know you are going to say: I’m doing it wrong and should use non-blocking logging, but this adds enormously to the complexity of the application.  If you print a trace message at the beginning of a function, the number of states required becomes insane.  The only solution I could think of is, in production mode, to limit the logging to one write per request.  That one write can be non-blocking without exploding the complexity of the state machine.

christopher baus.net
Tuesday, June 29, 2004

Your problem is probably occurring because you are dribbling small amounts of data onto the network which Windows is coalescing to save on network bandwidth. Take a look at setsockopt() and TCP_NODELAY (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winsock/winsock/setsockopt_2.asp)
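
i.e. something like (assuming a connected SOCKET s):

#include <winsock2.h>

// Disable the Nagle algorithm so small sends go out immediately instead of
// being coalesced into larger segments.
int DisableNagle(SOCKET s) {
    BOOL on = TRUE;
    return setsockopt(s, IPPROTO_TCP, TCP_NODELAY,
                      (const char*)&on, sizeof(on));
}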

xyster
Tuesday, June 29, 2004

The OP said:

> I have turned off Nagle

christopher baus.net
Tuesday, June 29, 2004

There may be an outside chance that it's worth trying WSAEventSelect() instead of select(). It's supposed to be more efficient, although I can't see it accounting for 100ms.

as
Tuesday, June 29, 2004

> I know you are going to say: I’m doing it wrong and should use non-blocking logging, but this adds enormously to the complexity of the application.

Can you not simply implement logging as a thread-safe queue of messages to be logged: when you log a message you just enqueue it (which takes almost no time), and have a background thread which dequeues messages and writes them to disk?
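
Something like this sketch, in modern C++ (the class and names are illustrative):

#include <condition_variable>
#include <deque>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>

// Foreground threads enqueue log lines (cheap); a background thread drains
// the queue and does the blocking file I/O.
class AsyncLogger {
public:
    explicit AsyncLogger(const std::string& path)
        : out_(path, std::ios::app), worker_(&AsyncLogger::Run, this) {}
    ~AsyncLogger() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    void Log(std::string line) {
        { std::lock_guard<std::mutex> lk(m_); q_.push_back(std::move(line)); }
        cv_.notify_one();
    }
private:
    void Run() {
        std::unique_lock<std::mutex> lk(m_);
        while (!stop_ || !q_.empty()) {
            cv_.wait(lk, [this] { return stop_ || !q_.empty(); });
            while (!q_.empty()) {
                std::string line = std::move(q_.front());
                q_.pop_front();
                lk.unlock();              // do the slow write outside the lock
                out_ << line << '\n';
                lk.lock();
            }
        }
    }
    std::ofstream           out_;
    std::mutex              m_;
    std::condition_variable cv_;
    std::deque<std::string> q_;
    bool                    stop_ = false;
    std::thread             worker_;
};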

Christopher Wells
Tuesday, June 29, 2004

> Can you not simply implement logging as a thread-safe queue of messages to be logged: when you log a message you just enqueue it (which takes almost no time), and have a background thread which dequeues messages and writes them to disk?

I thought about that.  But what happens if the queue is full?  You can't grow the queue indefinitely.  Under high load it would be very likely to grow faster than it is drained.  When the queue is full you end up having to block everything until it can be processed.

Under high load the logging becomes the bottleneck either way.

christopher baus.net
Tuesday, June 29, 2004

> But what happens if the queue is full?

a) It doesn't become full (it has no maximum size; is limited only by sufficiently-infinite RAM)

b) or the enqueue becomes a blocking operation (as you suggested)

c) or you discard log messages (perhaps low-priority or duplicate messages, or message details).

> Under high load the logging becomes the bottleneck either way.

Yes: with a high enough load (i.e. when your hardware is "overloaded" for a sufficiently long period) your logging can saturate your hardware (first your log I/O bandwidth, and eventually your queue of pending log messages), in which case see above.

When the system isn't overloaded (and even when it is overloaded, unless you choose option (b) above), you have decoupled the performance of your logging from the performance of your foreground thread, which is good, without affecting the complexity of your foreground code (compared with putting asynch I/O awareness in the foreground thread, which would affect its complexity).

Christopher Wells
Tuesday, June 29, 2004

Lemme guess, you're writing a distributed file system.

ronk!
Tuesday, June 29, 2004

For queue-full conditions you may want to consider aggregation: combine logging messages into one message where possible. Then consider the concept of priority so you can decide what to drop. Consider knowing the type of the messages to make better decisions - perhaps a create followed by a delete can be dropped, or the create and delete can be combined into one message. Consider serializing and compressing messages in RAM so they take less space than the allocated messages. Consider doing local writes to disk so you can ride out any constipation at the headend. Consider filters pushed to the distributed nodes so you are only logging things you care about. Consider a pull-based system so you are only pulling logging messages when the headend has CPU, memory, and network bandwidth.

In the end there is always loss.

son of parnas
Wednesday, June 30, 2004

> a) It doesn't become full (it has no maximum size; is limited only by sufficiently-infinite RAM)

I don't like that solution.  If logs are piling up faster than they can be written, you will eventually run out of memory, and the process will probably just thrash itself to death.

I do see your point though.  If you don't drop log messages and log writing is the bottleneck, at some point you are going to have to throttle requests to service the log file.

My plan is to use blocking I/O for tracing, and if the performance isn't there then just disable it.

christopher baus (www.baus.net)
Wednesday, June 30, 2004

Thanks for the replies. The last link Myron gave seems like it's going to solve at least my time measurement problems, which is great help towards solving (at the very least, understanding) the latency problem.

I've looked into multimedia timers before and wasn't too impressed, but apparently I should look again. I was using timeBeginPeriod(), which is supposedly the documented way to call the undocumented NtSetTimerResolution. I'll try the NTDLL path, though - perhaps there is a difference.
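
For reference, the documented path is just the following (winmm); whether the resolution you actually end up with is really 1ms still depends on the hardware and the OS:

#include <windows.h>
#include <mmsystem.h>
#pragma comment(lib, "winmm.lib")

int main() {
    // Ask the multimedia timer for 1 ms resolution for as long as we run.
    timeBeginPeriod(1);

    // ... latency-sensitive work goes here ...
    Sleep(2);   // with 1 ms resolution this sleeps close to 2 ms, not 10-15 ms

    timeEndPeriod(1);   // always pair with timeBeginPeriod
    return 0;
}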

Regarding blocking/non-blocking logging - I agree that attempting to log completely asynchronously is a nightmare. The solution depends on how important it is to never lose a logged message. If you queue it and handle it at a convenient time, or buffer it in any other way, it becomes extremely simple.

Blocking when resources are exhausted in an otherwise asynchronous system is reasonable - if you do not have enough resources for what you're trying to do, something's got to give.

With my limited knowledge of your case, what I'd suggest is adding a thread-safe queue, and creating another thread to just dequeue messages off that queue and log them wherever it is you need. Limit the size of the queue, and if more messages cannot be posted, either drop them (but log that 'xyz messages have been dropped') or block while waiting for queue space to become available.

Have a look at mod_log_spread - it's a module for Apache that uses Spread (a group communication toolkit) to distribute logs relatively easily. While it doesn't directly solve the problem, using it (or something similar) will let you put up several logging backends, and you'll still manage to log as long as one of them can accept the log message and the network is reachable.

Ori Berger
Wednesday, June 30, 2004

Ori, after these topics my curiosity is piqued. What exactly is the experimental setup here? What are you measuring?

Just me (Sir to you)
Wednesday, June 30, 2004

A thread-safe queue may add unacceptable latency into your system. It may be better to create a nonblocking solution - use a circular queue of fixed-size buffers, for example.
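
For example, a single-producer/single-consumer ring along these lines (sketch; the 256-byte slot size is arbitrary):

#include <atomic>
#include <cstddef>
#include <cstring>

// Non-blocking single-producer/single-consumer log ring.  The worker thread
// Push()es; the logger thread Pop()s; a full ring drops the message (and
// counts it) instead of blocking the worker.
template <size_t N>
class LogRing {
public:
    bool Push(const char* msg) {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t next = (head + 1) % N;
        if (next == tail_.load(std::memory_order_acquire)) {
            dropped_.fetch_add(1, std::memory_order_relaxed);
            return false;                                // full: drop
        }
        strncpy(slots_[head].text, msg, sizeof(slots_[head].text) - 1);
        slots_[head].text[sizeof(slots_[head].text) - 1] = '\0';
        head_.store(next, std::memory_order_release);
        return true;
    }
    bool Pop(char* out, size_t outLen) {
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                                // empty
        strncpy(out, slots_[tail].text, outLen - 1);
        out[outLen - 1] = '\0';
        tail_.store((tail + 1) % N, std::memory_order_release);
        return true;
    }
    size_t Dropped() const { return dropped_.load(std::memory_order_relaxed); }
private:
    struct Slot { char text[256]; };
    Slot                slots_[N];
    std::atomic<size_t> head_{0};
    std::atomic<size_t> tail_{0};
    std::atomic<size_t> dropped_{0};
};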

son of parnas
Wednesday, June 30, 2004

I put some more thought into the logging problem last night, and the solution I came up with is to just use another process which handles queueing and writing the log files.  On Unix the solution is simple:

syslogd. 

This is what thttpd does.  Here are the author's notes on the topic:

http://www.acme.com/software/thttpd/notes.html#syslog

Syslog is the standard Unix logging mechanism. It's very flexible, and lets you do things like log different programs to different files and get real-time notification of critical events. However, many programmers avoid using it, possibly because they worry it adds too much overhead. Well, I did some simple benchmarks comparing syslog logging against the stdio logging that other web servers use. Under conditions approximating an extremely busy web server, I found that syslog was slower by only a few percent. Under less strenuous conditions there was no measurable difference.

Another concern about syslog is security against external attacks. It's written somewhat casually, using a fixed-size internal buffer without overflow checking. That makes it vulnerable to a buffer-overrun attack such as used by the Morris Worm against fingerd. However, it's easy to call syslog in such a way that this attack becomes impossible - just put a maximum size on all the string formatting codes you use. For instance, use %.80s instead of %s. Thttpd does this.
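
i.e. something along these lines (the function and field widths are illustrative):

#include <syslog.h>

// Bounded field widths (%.80s) keep an attacker-supplied string from
// overflowing syslog's fixed internal buffer, as described above.
void log_request(const char* method, const char* url) {
    syslog(LOG_INFO, "%.10s %.80s", method, url);
}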

christopher baus.net
Wednesday, June 30, 2004

Just Me: Unfortunately, I can't disclose the details of the project I'm working on now.

An older project I was involved with that needed this kind of thing, though, is network event correlation - given sniffer streams of just about everything on the network, construct a chain of events.

Without millisecond-accurate timing, you can't tell, e.g., in a worm infection, which machine was the entry point.

Ori Berger
Thursday, July 01, 2004

Then in the general case I think you are screwed. The time won't be well enough synced across a network to determine relationship and cause/effect merely from time. You'll have to use the semantics of the data, with time as an input.

son of parnas
Thursday, July 01, 2004

Linux and Solaris have no problem keeping time synchronized to a few microseconds on a local 100Mb/s network. And this kind of precision is usually only interesting within a LAN.

And even on Windows one _CAN_ do better - it's just extremely hard (see the last link Myron gave above for what apparently needs to be done).

Ori Berger
Thursday, July 01, 2004

"Without millisecond-accurate timing, you can't tell, e.g., in a worm infection, which machine was the entry point. "

OK, I haven't thought this through, but wouldn't it be better to follow the infection paths than try to get exact timings on the infection results? I mean, suppose you have machine A infected at time t and machine B infected at time t+1: is B infected through A, or is B itself a new entry point infected by an outside machine? You can tell by just looking at the timestamps, no matter how accurate they are.

Just me (Sir to you)
Thursday, July 01, 2004

that should be "you can't tell" of course

Just me (Sir to you)
Thursday, July 01, 2004
