Fog Creek Software
Discussion Board

Somebody, please explain voice-over-IP to me.

I'm not much of a techie, but something has me puzzled.

It's very common when surfing the web to notice distinct delays (or worse) when attempting to load a web page.  Presumably, somewhere between you and the web server there's some kind of bottleneck.  When you do a traceroute, you can often find out where the latency is the highest.

You might think that these types of problems would wreak havoc with voice-over-IP.  Even a half-second delay would make it virtually impossible to have a normal conversation.  And yet, people even manage to make trans-continental IP calls.  So... what gives?  How do Internet telephones manage to get around the usual problems of sending packets across the 'Net?  I've run some Google searches on this, but I couldn't find anything that addresses the issue directly.


Can you hear me now?
Saturday, September 27, 2003

Voice over IP is typically designed to be run over a "managed" network. What this usually means is that the network (e.g., an enterprise LAN or a cable company's HFC network to your home) is engineered with enough bandwidth to handle X number of calls without jitter (variation in packet arrival times, which can also leave packets out of sequence) or delay being an issue.

When you run VoIP over the public Internet, you are using some sort of provider that will try to simulate the "perfect" network as much as they can. They will use RSVP or MPLS to give your voice packets a higher priority than non-voice traffic and will route your calls with minimal hops.

Hope this helps. I can tell you more if you ask.

Saturday, September 27, 2003

I am just speculating here... when you load a web page, you're using HTTP obviously, which is not the protocol that a VOIP system would use.  HTTP runs on top of TCP/IP, which is a reliable protocol on top of UDP, an unreliable protocol, which is faster, but doesn't make as strong guarantees about the data arriving.  Audio data is different than regular data in that e.g. 1% of bytes being 1% off isn't going to matter at all.  So this fact can be exploited to make the latency less than that of loading a web page.  That's the general idea, I'm sure someone else can explain this in more detail.

Sunday, September 28, 2003

TCP is not implemented over UDP.

Both TCP and UDP are implemented over IP.

UDP is a sort of "IP with port numbers" (IP doesn't know about port numbers, TCP and UDP do).

An IP datagram, or data packet, contains a field which tells it which high-level protocol has generated the datagram.

So, when the IP layer of a computer receives an IP datagram, it passes it to whatever protocol can handle it: TCP, UDP, ICMP, etc.

Voice over IP usually uses UDP.
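
To make the protocol field concrete, here is a rough Python sketch of how a receiver could read it from an IPv4 header. The function names and the hand-built header are made up for illustration; the protocol numbers (1, 6, 17) are the standard assigned values.

```python
import struct

# Assigned protocol numbers for the IPv4 "Protocol" header field.
PROTOCOLS = {1: "ICMP", 6: "TCP", 17: "UDP"}


def ip_protocol(datagram: bytes) -> str:
    """Return the name of the upper-layer protocol carried by an IPv4 datagram."""
    if len(datagram) < 20:
        raise ValueError("too short for an IPv4 header")
    if datagram[0] >> 4 != 4:
        raise ValueError("not an IPv4 datagram")
    proto = datagram[9]  # byte offset 9 of the header holds the protocol number
    return PROTOCOLS.get(proto, f"unknown ({proto})")


def make_header(proto: int) -> bytes:
    """Build a minimal 20-byte IPv4 header for illustration (checksum left at 0)."""
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45,                   # version 4, header length 5 words
        0,                      # type of service
        20,                     # total length (header only)
        0,                      # identification
        0,                      # flags and fragment offset
        64,                     # time to live
        proto,                  # protocol number
        0,                      # header checksum (not computed here)
        bytes([10, 0, 0, 1]),   # hypothetical source address
        bytes([10, 0, 0, 2]),   # hypothetical destination address
    )


print(ip_protocol(make_header(17)))
```

A real IP stack does this dispatch in the kernel; the sketch only shows which byte it looks at.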

Celmai Taredin Parcare
Sunday, September 28, 2003

In fact all streaming voice/video tends to go over UDP or a more specialized protocol above IP because there is no point in resending packets if the action has moved on. 

There's no point in TCP's overhead slowing you down if you're just going to discard packets that come in late anyway.
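
As a minimal sketch of "discard packets that come in late", assuming each voice packet carries a sequence number: the receiver plays whatever is newest and silently drops anything older than what it has already played. The class name is hypothetical, and sequence-number wrap-around is ignored to keep it short.

```python
class VoiceReceiver:
    """Plays packets in arrival order and drops any that show up too late."""

    def __init__(self):
        self.last_played = -1  # highest sequence number played so far
        self.dropped = 0       # count of packets discarded as late

    def receive(self, seq, payload):
        """Return the payload to play, or None if the packet is stale."""
        if seq <= self.last_played:
            # The conversation has moved on; there is no use replaying old audio.
            self.dropped += 1
            return None
        self.last_played = seq
        return payload


receiver = VoiceReceiver()
for seq, audio in [(0, "a"), (1, "b"), (3, "d"), (2, "c"), (4, "e")]:
    print(seq, receiver.receive(seq, audio))
```

Packet 2 arrives after packet 3 has already been played, so it is thrown away rather than requested again, which is exactly the behaviour TCP's retransmission would fight against.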

Thomas David Baker
Sunday, September 28, 2003

In managed networks, the physical and data-link (i.e., MAC) layers are well known and their properties can be engineered for reliable data delivery (e.g., on a cable company's fiber/coax network or an enterprise LAN's switched Gig-E). Since IP is not very reliable, it is enhanced through protocols like RSVP and MPLS to ensure there is enough bandwidth and that voice traffic (like video) has the highest priority. VoIP typically uses UDP at the transport layer because it is a lightweight protocol. TCP is too heavy and has unnecessary features like retransmission.

Above UDP, the Real-Time Transport Protocol (RTP) is often used (on managed networks) because it facilitates streaming data like voice and video. At the session layer, SIP is often used for call control. There are many implementations of VoIP and each one does what it can to address the inherent problems with IP-based networks.
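
For the curious, a hedged sketch of what RTP adds on top of UDP: a fixed 12-byte header carrying a sequence number, a timestamp, and a stream identifier. The helper names below are invented for illustration, and the sketch handles only the fixed header (no padding, extensions, or CSRC entries).

```python
import struct

RTP_VERSION = 2
PT_PCMU = 0  # static payload type 0: G.711 mu-law at 8 kHz


def build_rtp(seq, timestamp, ssrc, payload, payload_type=PT_PCMU):
    """Prefix a voice payload with a 12-byte fixed RTP header."""
    first = RTP_VERSION << 6      # version 2, no padding, no extension, 0 CSRCs
    second = payload_type & 0x7F  # marker bit clear
    header = struct.pack(
        "!BBHII", first, second,
        seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF,
    )
    return header + payload


def parse_rtp(packet):
    """Split a packet back into its fixed header fields and payload."""
    first, second, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": first >> 6,
        "payload_type": second & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[12:],
    }


# 160 bytes is 20 ms of 8 kHz, 8-bit audio (an assumed packet size).
packet = build_rtp(seq=7, timestamp=1120, ssrc=0x1234, payload=b"\xff" * 160)
print(len(packet), parse_rtp(packet)["seq"])
```

The sequence number and timestamp are what let the receiving phone notice missing or reordered packets and schedule playback, since UDP itself provides neither.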

<soapbox>The traditional phone network is circuit switched. That is, each call you make reserves a complete circuit through the entire network and uses this bandwidth whether you're talking or not. Obviously, packet switching is more efficient for data (e.g., web surfing). The advent of VoIP was marketed by saying that VoIP calls are more efficient than old, circuit switched calls. However, in a typical VoIP call, you are still going to encode voice into a 64Kbps stream (like the old way), but then you have to add the MAC header, the IP header, the UDP header, and the RTP header. This means your 64Kbps call is now a 115Kbps call. Then, you have to add the overhead of encryption. Some implementations can compress your call and strip out redundant headers (voice stream packets tend to have the same headers), but this reduction is more than compensated by the fact that you have to reserve bandwidth through the network to maintain quality of service. Thus, we are essentially recreating the circuit switched world on our packet switched networks. Considering that our old phone networks are already paid off, this doesn't seem economically feasible. Nonetheless, nobody thinks legacy voice calls are exciting and IP is the future.</soapbox>
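
The header overhead is easy to estimate. The exact figure depends on details the post does not state, mainly how many milliseconds of audio go into each packet and which link layer is used, so the numbers below are a sketch under assumed values (IPv4, Ethernet with a 14-byte header and 4-byte frame check, no encryption or header compression).

```python
RTP_HEADER = 12         # bytes
UDP_HEADER = 8          # bytes
IPV4_HEADER = 20        # bytes, no options
ETHERNET_OVERHEAD = 18  # bytes: 14-byte MAC header + 4-byte frame check sequence


def call_bandwidth_bps(codec_bps=64_000, packet_ms=20,
                       link_overhead=ETHERNET_OVERHEAD):
    """Estimate on-the-wire bits per second for one direction of a call."""
    packets_per_second = 1000 / packet_ms
    payload_bytes = codec_bps / 8 / packets_per_second
    frame_bytes = (payload_bytes + RTP_HEADER + UDP_HEADER
                   + IPV4_HEADER + link_overhead)
    return round(frame_bytes * 8 * packets_per_second)


for ms in (10, 20):
    print(f"{ms} ms packets: {call_bandwidth_bps(packet_ms=ms)} bps")
```

Under these assumptions a 64Kbps stream costs about 87Kbps with 20 ms packets and about 110Kbps with 10 ms packets. Smaller packets mean lower delay but proportionally more header overhead, which is the same order of magnitude as the figure quoted above.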

Sunday, September 28, 2003

Things like Net2Phone work because you can afford to lose a fair proportion of voice data and still get a clear message; when you are dealing with a load of noisy lines then the quality declines and can reach unintelligibility.

The real double whammy is that the places with good telephone lines, and thus clear communication, are the cheapest to call; so a call to the UK, US or Australia costs about 5 cents a minute and is crystal clear, whilst a call from Saudi to a Sri Lanka mobile over Net2Phone costs 81 cents a minute, and the quality is so bad you give up and pay the $1.35 a minute the normal phone line costs.

Stephen Jones
Sunday, September 28, 2003

I just found out about a new, free, voice-over-IP program called Skype:

I tried it on my local network, and the quality was quite good.  (Even just using my notebook's crappy built-in microphone.)  Don't know how well it holds up in real use over a distance, though, when the packets have to bounce through multiple networks.

Robert Jacobson
Sunday, September 28, 2003

It would be interesting to know what the internet phone equivalent of this would be:

"Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae.

The rset can be a total mses and you can sitll raed it wouthit porbelm.

Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe."

UI Designer
Sunday, September 28, 2003

When I call my girlfriend in Colombia, it's about 5c a minute if I call her, 30c a minute if she calls me.  When we talk over the computer (AIM), it's about 3c a minute for her, free for me.  So, it's a good deal.

Sunday, September 28, 2003

I use Yahoo IM voice chat to talk to my parents in NY and sister in CA...A good mic is the key (and cable/dsl on both ends :)

Monday, September 29, 2003

Robert: Actually, voice compression has been a hot topic for years and there are many codecs in use that can compress 64Kbps voice down to even an 8Kbps stream. Some of these are used in wireless communications where bandwidth conservation is a good thing. However, your written jumble analogy doesn't play very well in voice compression. Most of us are willing to live with lousy cell phone quality, but there is really only so much you can do when the compression is so high. You see, in order to "compress" the signal, they use a linear predictor that attempts to "predict" the next few bits of your voice using the characteristics of the last few bits. While the "noise" this algorithm creates can be acceptable, just try whistling (if you're a good whistler) or playing some sort of sine wave into your cell phone. Odds are, the person at the other end will hear a broken sine wave.

VoIP calls will probably be just as good (bad?) and cheaper prices will be the driver, not quality. My hope is that broadband will further penetrate homes and we start to get higher quality voice (and features) rather than bargain basement voice. Once you see (and use) the new Cisco IP phones with their XML programmability, you will start to imagine what the network of the future will be like. I'd rather talk in luxury (at decent rates) than simulate making a call in the jungle from a satellite phone.

Monday, September 29, 2003

I have had hit and miss results with MSN, Yahoo (P.O.S.), and others.  So far MSN (when it works) has done a decent job.  I use the video as well (very cool).

But MSN will not let you do audio/video to multiple people at once (WTF?!  P.O.S).  So maybe someone can recommend a "decent" replacement (commercial or not) that is best for tele-video-conferencing.

Monday, September 29, 2003

I can't add anything to all the fancy talk of compression and TCP and UDP, but I can speak of my home VOIP service from

They gave me a free Cisco ATA that I plugged into my home network.  The Cisco box grabs an IP address from the router, connects to the Vonage network, and has a phone jack on the back.  They gave me a local phone number, and my phone service has been fantastic!

If you're paying attention, it's not quite as good as regular phone service, but it's much better than most cell phones I've used.  Nobody on the other end can tell it's VOIP.

At $25/month, I get all the features (caller ID, call forwarding, voice mail, etc), free calling within my state, and all the long distance I ever use.  And, there are none of those taxes that get tacked onto your regular phone bill each month.

I've heard that Vonage has had some growing pains as a company (support/service), but I've never had a problem with them...

Monday, September 29, 2003

Skype is fantastic. Check it out...

Crystal clear audio, full duplex too !

Monday, September 29, 2003

To answer the original question, you use a jitter buffer.  The phone keeps 50-100ms of audio. It plays audio at a constant real time rate. Each packet it receives from the network contains (usually for RTP) 30 ms of audio.
Think of it as playing from the front and putting packets at the end. This absorbs jitter caused by traffic. Of course, with too much delay you will hear it.
The amount of audio to buffer is a tricky thing to play with. Too much and you cannot have a conversation, not enough and the jitter kills you.
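
Here is a toy sketch of that trade-off. The arrival times are invented for illustration, and the model is deliberately simple: playback starts a fixed time after the first packet arrives and then consumes one 30 ms packet at a constant rate.

```python
def late_packets(arrival_ms, buffer_ms, packet_ms=30):
    """Count packets that arrive after their scheduled playout time.

    arrival_ms[i] is when packet i arrives, measured from the arrival of
    packet 0. Playback starts buffer_ms after packet 0 arrives and plays
    one packet every packet_ms.
    """
    late = 0
    for i, arrived in enumerate(arrival_ms):
        playout_time = buffer_ms + i * packet_ms
        if arrived > playout_time:
            late += 1  # missed its slot; the listener hears a gap
    return late


# Made-up arrivals: packets sent every 30 ms, with a traffic burst
# delaying packet 3 by 50 ms.
arrivals = [0, 35, 62, 140, 145, 155, 185]

for depth in (0, 30, 60):
    print(f"{depth} ms buffer: {late_packets(arrivals, depth)} late packets")
```

With no buffer almost every packet misses its slot, 30 ms of buffering leaves one gap, and 60 ms hides the burst entirely, at the cost of adding 60 ms to the delay of the conversation.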

VoIP Monkey
Monday, September 29, 2003
