Over the past few months I’ve written a lot about the signaling aspects of IP communications, and although I’ve mentioned media as the result of a SIP session, I haven’t really gone into much detail about the different types of media. Well, today I plan on rectifying that and spending some time on audio as a media type.
Before I go into the different media codecs I need to lay down the groundwork. For instance, what is that codec thing I just mentioned? Simply put, a codec is a CODer and a DECoder used to convert analog media to packetized IP and vice versa. In other words, a codec can take human speech, convert it to a stream of IP packets, and then eventually convert those IP packets back to something the human ear can hear. As part of that process you need transducers. A microphone is a transducer that turns sound into electrical signals and a speaker is a transducer that takes electrical signals and turns them back into sound.
When it comes to IP communications there is a plethora of audio codecs to choose from. Each one has its strengths and weaknesses. Some codecs are designed to accurately reproduce voice and aren’t concerned with the number of bytes it takes to do that. Others are designed to be as efficient as possible byte-wise while delivering acceptable voice. The codec that you use depends on the type of experience you want to create given the parameters of your network and the processing power of your communications devices.
It would take pages to cover every codec that’s out there, so I will stick with the ones that you will most commonly encounter.
G.711 is a very common codec and has been around since 1972. G.711 is what a traditional telephone call sounds like. In fact, it is commonly referred to as toll quality voice. You will also hear G.711 called Pulse Code Modulation (PCM). This is the technical way of saying 8-bit non-uniform quantization with 8000 samples per second, which I guess is even more technical than saying PCM. The most important things to know are that G.711 consumes around 90 kilobits per second of network bandwidth for a single call (64 kilobits per second of voice plus packet overhead) and it sounds pretty good. Unless there are network problems, people do not complain about G.711 voice calls. It’s what they’ve been used to for the past 40+ years.
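To make "8-bit non-uniform quantization" a little more concrete, here is a minimal Python sketch of the idea. It uses the textbook continuous μ-law curve, which is an assumption on my part: the real G.711 codec uses a piecewise-linear approximation of it (and has an A-law variant as well), and the function names here are purely illustrative.

```python
import math

MU = 255  # mu-law constant (the variant used in North America and Japan)

# 8000 samples per second at 8 bits each gives the codec's raw bit rate.
G711_BITS_PER_SECOND = 8000 * 8  # 64000

def mu_law_compress(x):
    """Map a sample in [-1.0, 1.0] onto a logarithmic curve in [-1.0, 1.0]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)

def encode_8bit(x):
    """Quantize the companded sample to one of 256 levels (0..255)."""
    y = mu_law_compress(x)
    return min(255, int((y + 1) / 2 * 256))

def decode_8bit(code):
    """Turn an 8-bit code back into an approximate sample value."""
    y = (code + 0.5) / 256 * 2 - 1  # centre of the quantization step
    return mu_law_expand(y)

if __name__ == "__main__":
    for sample in (0.01, 0.5):
        restored = decode_8bit(encode_8bit(sample))
        print(sample, encode_8bit(sample), round(restored, 5))
```

The point of the logarithmic curve is that quiet samples get much finer steps than loud ones, which is where the ear notices the difference most.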
While you often see G.729 written just as I did, that’s not technically accurate because nobody implements it that way. You see, there are a lot of different flavors of G.729 and each one is slightly different from the others. The two flavors, or annexes, that you will commonly see are G.729A and G.729B. Of these I will take G.729A over G.729B any day. G.729B employs something called silence suppression, which causes problems when you are a very quiet talker. Instead of suppressing just the silence on a telephone call, G.729B can suppress the voice itself, leaving you with a very choppy, or clipped, conversation.
Every annex of G.729 will require less bandwidth than G.711. Typically, G.729 will use 32 kilobits per second of network bandwidth per call. This means that you can get nearly three times as many G.729 calls on a network connection as you can if those calls used G.711. The voice quality of those G.729 calls won’t be nearly as good as those that use G.711, but if your concern is reducing bandwidth usage then it’s a perfectly acceptable choice.
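If you are wondering where per-call figures like these come from, the arithmetic is easy to sketch, with results in kilobits per second. The defaults below are my assumptions rather than anything universal: 20 milliseconds of audio per packet, 40 bytes of IPv4, UDP, and RTP headers, and 18 bytes of Ethernet framing. The function name is just for illustration.

```python
def call_bandwidth_kbps(codec_kbps, ptime_ms=20, ip_udp_rtp_bytes=40, l2_bytes=18):
    """Estimate one-way bandwidth for a single call, in kilobits per second.

    codec_kbps        -- raw codec bit rate (64 for G.711, 8 for G.729)
    ptime_ms          -- milliseconds of audio carried in each packet (assumed 20)
    ip_udp_rtp_bytes  -- IPv4 (20) + UDP (8) + RTP (12) header bytes
    l2_bytes          -- link-layer framing per packet (assumed Ethernet, 18)
    """
    packets_per_second = 1000 / ptime_ms
    payload_bytes = codec_kbps * 1000 / 8 * ptime_ms / 1000
    packet_bytes = payload_bytes + ip_udp_rtp_bytes + l2_bytes
    return packet_bytes * 8 * packets_per_second / 1000

if __name__ == "__main__":
    g711 = call_bandwidth_kbps(64)
    g729 = call_bandwidth_kbps(8)
    print(f"G.711: {g711:.1f} kbit/s")
    print(f"G.729: {g729:.1f} kbit/s")
    print(f"G.729 calls per G.711 call: {g711 / g729:.1f}")
```

Notice how much of a G.729 packet is header rather than voice. Change the packetization time or the link type and the numbers shift, so treat these as ballpark figures.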
Lastly, G.729 codecs cannot reliably transmit DTMF (telephone touch tones) in the audio stream itself. For that you will need to use G.711 or an out-of-band transmission mechanism like RFC 4733 (formerly RFC 2833), but that’s a blog for another day.
Like G.729, G.726 is a compressed codec that uses significantly less bandwidth than G.711, but produces voice quality similar to that of G.711. For you technical people out there, G.726 uses something called Adaptive Differential Pulse Code Modulation (ADPCM). The most important thing to know about ADPCM is that it uses the differences between voice samples to create its media stream. In other words, instead of sending information about each voice sample, it will send one full sample followed by how the next samples differ from that one. That lowers the bandwidth required by G.726 to about 55 kilobits per second for each call. So, not as low as G.729, but less than G.711 with comparable voice quality.
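The "send the differences" idea can be sketched in a few lines of Python. To be clear, this is plain delta encoding and not the G.726 algorithm: real ADPCM also predicts the next sample and adapts its quantizer step size as the signal changes. The function names and sample values are made up for illustration.

```python
def delta_encode(samples):
    """Return the first sample followed by sample-to-sample differences."""
    if not samples:
        return []
    deltas = [samples[0]]
    for previous, current in zip(samples, samples[1:]):
        deltas.append(current - previous)
    return deltas

def delta_decode(deltas):
    """Rebuild the original samples by accumulating the differences."""
    samples = []
    running_total = 0
    for delta in deltas:
        running_total += delta
        samples.append(running_total)
    return samples

if __name__ == "__main__":
    samples = [1000, 1004, 1009, 1011, 1008]
    deltas = delta_encode(samples)
    print(deltas)                # [1000, 4, 5, 2, -3]
    print(delta_decode(deltas))  # [1000, 1004, 1009, 1011, 1008]
```

Because neighboring voice samples tend to be close together, the differences are small numbers that fit in fewer bits than the samples themselves. That is where the savings come from.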
G.722 takes the opposite approach of G.729 and G.726. Instead of focusing on lowering bandwidth usage, G.722 is concerned with improving voice quality. G.711 may be called toll quality voice, but it was invented quite a long time ago (heck, I was still in high school) and with bandwidth becoming cheaper and more plentiful, why not make a voice call sound better than it did in 1972? That’s exactly what G.722 does. Instead of G.711’s 8000 samples per second, it doubles the rate to 16,000 samples per second. Because G.722 also uses ADPCM technology, the bandwidth usage isn’t double that of G.711 even though the sample rate is. A typical G.722 call consumes about 90 kilobits per second, the same as G.711.
G.722 is still fairly new in the world of codecs, but it or another “wideband audio codec” will most likely replace G.711 in the not too distant future.
There are more audio codecs out there, but I will stop with these four since they are the ones you will most likely run into. However, stay tuned for a further look at codecs where I will tackle such beasts as iLBC and Microsoft’s RTAudio.