How does Audio over IP work and what are the reasons for a delay?

Whenever you transport audio over an IP network, you will incur a delay. That is true for every vendor and every technology, and it is unavoidable. Why? This page explains the general concept of how Audio over IP works and, specifically, the reasons for this inherent delay.

CURRENTLY WORK IN PROGRESS - NEEDS TO BE REVIEWED

Concept of Audio over IP

Sampling

To transmit analog audio over an IP network, it first needs to be sampled: a measurement of the signal is taken at regular intervals, at the sample rate. These samples can then be handled in the digital domain and transferred over the network to the decoder, which ultimately converts them back to an analog voltage on the audio output, again at the sample rate. The original signal, as it was present at the input, is reproduced on the output.

Sending every sample over the network

In theory, every sample could be sent directly as its own block over the IP network with minimum delay, but doing this generates a huge amount of traffic (48,000 blocks per second for a 48 kHz sample rate) and a large bandwidth requirement: a minimum Ethernet frame is 60 bytes, so you would actually be using more than 23 Mbps!
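
The arithmetic behind that figure can be checked with a minimal sketch. The function name and the constant are illustrative, not part of any Barix software; the 60-byte frame size is the one used in the text above.

```python
# Assumption: 60 bytes is the minimum Ethernet frame size used in the text
# (the 4-byte frame check sequence is not counted).
MIN_ETHERNET_FRAME_BYTES = 60


def per_sample_load_mbps(sample_rate_hz, frame_bytes=MIN_ETHERNET_FRAME_BYTES):
    """Network load in Mbit/s if every sample travels in its own minimum-size frame."""
    bits_per_second = sample_rate_hz * frame_bytes * 8
    return bits_per_second / 1_000_000


if __name__ == "__main__":
    # One frame per sample at 48 kHz
    print(f"{per_sample_load_mbps(48_000):.2f} Mbit/s")
```

Note that the result is in megabits per second; in bytes it is roughly 2.9 MB per second.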

EtherSound actually does this: all channel samples at a given point in time (one per channel) go together in one frame on the network, one frame per sample set. This generates a practically constant network load of 100 Mbps (not over IP, but directly over Ethernet).

For standard applications, where the audio stream needs to coexist on the network with other services, something must be done to generate less network load.

Collecting samples, sending packets

Let us assume the network load, expressed in packets per second, needs to be limited to no more than 100 blocks per second. To achieve that, the device needs to collect 10 ms worth of samples before sending them out together in one IP block.

With a 48 kHz sample frequency, that means 480 samples (960 bytes at 16 bits per sample, mono) have to be collected before the block can be sent. That makes perfect sense, as one Ethernet block can carry up to about 1400 bytes of data.
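
As a rough sketch of this calculation, the following computes samples per block, payload size and collection time for a given block rate. The function name and parameters are illustrative; 16-bit mono PCM is an assumption that matches the 960-byte figure above.

```python
def packet_size(sample_rate_hz, packets_per_second, bytes_per_sample=2, channels=1):
    """Return (samples per block, payload bytes, collection time in ms).

    Assumes uncompressed PCM; the defaults correspond to 16-bit mono.
    """
    samples = sample_rate_hz // packets_per_second
    payload_bytes = samples * bytes_per_sample * channels
    collect_ms = 1000 * samples / sample_rate_hz
    return samples, payload_bytes, collect_ms


if __name__ == "__main__":
    for pps in (100, 50):
        samples, payload, ms = packet_size(48_000, pps)
        print(f"{pps} blocks/s: {samples} samples, {payload} bytes, {ms:.0f} ms collected")
```

At 50 blocks per second the 16-bit mono payload grows to 1920 bytes, which no longer fits into a single block of about 1400 bytes.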

Receiving and decoding

At the receiving end (the "decoder"), a constant stream of samples must be generated at the sample frequency. To do this, the decoder always needs samples "in storage" that it can use. If it ever runs out of samples to generate the stream, an "underrun" condition exists, which causes dropouts and other issues. Consequently, a buffer of samples must be maintained, large enough that it is always replenished before the D/A runs out of data.

Delays introduced in the chain

Packetizing at the encoder

As already discussed above, to limit the bandwidth used to a reasonable level, samples must be collected and sent as blocks to the destination via the network. Here is the first delay, and it is pretty obvious: even with optimum system performance, the delay incurred depends directly on the number of samples per block or, equivalently, on the number of blocks per second. If 100 blocks per second is the target, 10 ms of samples need to be collected; if 50 blocks per second are allowed, 20 ms worth of samples need to be collected before they can be sent.

With MP3, not only is a frame of samples collected first (20..50 ms, depending on the sample rate setting), but the collected samples, once complete, must then be processed with computationally intensive functions (DSP). This takes significant time, although on average less than it took to collect the samples, otherwise the system could not work reliably.
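
To illustrate where the frame collection time comes from, here is a small sketch based on the MPEG audio frame sizes: a Layer III frame holds 1152 samples at the MPEG-1 sample rates (32, 44.1 and 48 kHz) and 576 samples at the lower MPEG-2 and MPEG-2.5 rates. The function name is illustrative, and the exact range reached in practice depends on which sample rates a given device offers.

```python
def mp3_frame_ms(sample_rate_hz):
    """Duration of one MP3 (MPEG Layer III) frame in milliseconds.

    MPEG-1 rates (32/44.1/48 kHz) carry 1152 samples per frame;
    the lower MPEG-2 and MPEG-2.5 rates carry 576 samples per frame.
    """
    samples_per_frame = 1152 if sample_rate_hz >= 32_000 else 576
    return 1000 * samples_per_frame / sample_rate_hz


if __name__ == "__main__":
    for rate in (48_000, 44_100, 32_000, 24_000, 16_000, 12_000):
        print(f"{rate} Hz: {mp3_frame_ms(rate):.1f} ms per frame")
```

This is only the time needed to collect one frame; the encoder's processing time comes on top of it.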

Processing at the encoder

While putting PCM samples in a packet and sending them needs hardly any resources, MP3 encoding introduces a significant processing delay in addition to the delay produced by "holding up" the samples to packetize them. Count on about 20 ms for MP3.

Network delay "in transit"

By the time a block goes out to the network, some delay has already been introduced (to stick with the example, 10 ms for PCM and 60 ms for MP3). The block now needs to be sent over the network, potentially fighting with other traffic for bandwidth. On a local LAN, the delay will typically be quite low (maybe a few milliseconds), but beware: if there is "sometimes" a fight over bandwidth or buffering in a switch or router, you may see very low average delays while peak delays are substantially higher. Why is that a problem? The receiving side always needs to be fed with samples before its buffers run empty. If a block is delayed by, let's say, 30 ms, the receiver must have buffers configured so that it can live with that delay before running into an error condition (empty buffer).

The difference between the minimum and the maximum delay of a network block arriving at the decoder is commonly called "jitter". Jitter can be significant, especially on Wi-Fi networks, as there are invisible retries happening in a lower-level protocol. The ping command shows all this: you might see an average delay of only 5 ms and zero block loss, but a maximum delay of 200 ms. Any device with a receive buffer configured for less than 200 ms will then incur dropouts. Period.
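
The jitter definition used here (maximum delay minus minimum delay) can be sketched as follows. The function names and the delay values are hypothetical and only serve as an illustration; real values would come from ping output or similar measurements.

```python
def jitter_stats(delays_ms):
    """Summarize measured delays: min, average, max and jitter (max - min)."""
    lo, hi = min(delays_ms), max(delays_ms)
    return {
        "min": lo,
        "avg": sum(delays_ms) / len(delays_ms),
        "max": hi,
        "jitter": hi - lo,
    }


def buffer_covers_jitter(delays_ms, buffer_ms):
    """True if the receive buffer holds at least as much audio as the observed jitter."""
    return buffer_ms >= jitter_stats(delays_ms)["jitter"]


# Hypothetical measurements in ms: mostly fast, with one Wi-Fi retry spike.
SAMPLE_DELAYS_MS = [2, 3, 2, 4, 2, 200, 3, 2, 2, 3]

if __name__ == "__main__":
    print(jitter_stats(SAMPLE_DELAYS_MS))
    for buffer_ms in (50, 100, 200):
        ok = buffer_covers_jitter(SAMPLE_DELAYS_MS, buffer_ms)
        print(f"{buffer_ms} ms buffer sufficient: {ok}")
```

A single slow block is enough to dominate the result, which is why the average delay says little about the buffer size that is needed.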

Buffers in the receiver

Now, let's take a close look at the decoding side. As already stated, buffering is a must in order to maintain a constant, consistent stream of samples (which are then converted to an analog value and sent to the audio interface). The buffers must be able, and configured, to hold as much data as necessary to survive the longest possible "dry period" in which no block comes in from the source, for whatever reason. If, as in the example above, the source sends a block every 10 ms with very precise timing, and the network introduces zero jitter (an unlikely scenario), then in theory a buffer of one block is sufficient. When the block arrives from the network, it is copied into the buffer (holding, let's say, 480 samples in our 10 ms example) and the output can start converting to analog. The buffer now drains, but right when it is about to become empty, the next block comes in from our perfect encoder through the perfect network infrastructure. Unfortunately, that is not a real-life scenario: jitter is always introduced somewhere on the way. A realistic setup uses a buffer holding several blocks.
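
The effect of the receive buffer can be illustrated with a deliberately simplified model: blocks are sent every 10 ms, each arrives after its own network delay, and playback starts a configurable pre-buffer time after the first block arrives. The function name, the delay values and the model itself are assumptions for illustration; a real decoder also has to deal with lost or reordered blocks and with recovery after an underrun.

```python
def count_underruns(network_delays_ms, packet_ms=10.0, prebuffer_ms=0.0):
    """Count blocks that arrive after the moment the D/A needs them.

    network_delays_ms[i] is the transit delay of block i, which is sent at
    time i * packet_ms. Playback starts prebuffer_ms after block 0 arrives.
    """
    playout_start = network_delays_ms[0] + prebuffer_ms
    underruns = 0
    for i, delay in enumerate(network_delays_ms):
        arrival = i * packet_ms + delay
        needed_at = playout_start + i * packet_ms
        if arrival > needed_at:
            underruns += 1
    return underruns


if __name__ == "__main__":
    delays = [2, 2, 2, 32, 2, 2]  # hypothetical: one block is held up by 30 ms
    for prebuffer in (0, 10, 30):
        print(f"pre-buffer {prebuffer} ms: {count_underruns(delays, prebuffer_ms=prebuffer)} underrun(s)")
```

In this model the pre-buffer has to cover the full 30 ms of extra delay before the late block stops causing an underrun, and every millisecond of pre-buffer is added to the end-to-end delay.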

Processing in the receiver

Oh yes, and for MP3 you have to add another source of delay on the decoder side. Once a frame is received, it cannot be output immediately. It first needs to be decoded, which is resource intensive and can take several ms, and the decoded samples then need an output buffer as well.

Sample buffer for the D/A

The last addition of delay is not strictly necessary from a technology point of view, but it is a fact in Barix devices. We use a main CPU for network tasks and a DSP driving the D/A, which turns the samples back into analog audio. The DSP is necessary for MP3 (and AAC etc.) decoding; for PCM it mainly works as a pass-through. The interface between the main CPU and the DSP also introduces a sample buffer at the DSP output side, which is counted in bytes and can introduce quite significant delays for streams with a low bitrate or sample rate. Why? As I said, the buffer is counted in bytes. Let's assume for these examples that it is 2 kBytes (2000 bytes). A PCM 48 kHz stream uses 96 bytes per ms, so the buffer holds roughly 20 ms. If you send an 8 kHz PCM stream, you get six times that delay.
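
The same calculation as a minimal sketch. The 2000-byte buffer is the example value assumed in the text, not a documented device parameter, and 16-bit mono PCM is assumed to match the 96 bytes per ms figure.

```python
def buffer_delay_ms(buffer_bytes, sample_rate_hz, bytes_per_sample=2, channels=1):
    """Time in ms that a byte-counted buffer holds for an uncompressed PCM stream."""
    bytes_per_ms = sample_rate_hz * bytes_per_sample * channels / 1000
    return buffer_bytes / bytes_per_ms


if __name__ == "__main__":
    for rate in (48_000, 8_000):
        print(f"{rate} Hz: {buffer_delay_ms(2000, rate):.1f} ms")
```

The lower the data rate of the stream, the longer the same number of bytes lasts, and the longer the audio sits in the buffer.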

Conclusion

So, in the end you have several sources of delay. The buffering for network jitter is often the most significant one, but a "base delay", depending on sample rate, encoding format etc., is always present. As you can see from the above examples, if you have the bandwidth, it often makes sense to configure streams with higher sample rates and bitrates, because the constant (byte-wise) buffers in the chain then hold less time and add less delay.

Barix devices

One last comment, specific to Barix devices: the Exstreamer 1xx and 2xx decoder devices use a DSP with ample buffering. There are reasons for that, which are not detailed further here. In contrast, the Exstreamer 1000 uses a different DSP with smaller buffers, and we are currently beta testing a DSP software patch which reduces the buffers much further, to almost nothing! With the Exstreamer 1000, you can currently achieve delays below 50 ms, even though the software has not been optimized for very low delay. With optimized software, you can get the delay down to well below 20 ms; that has been proven in our labs (for a specific project). We are in the process of bringing this down further (goal: 5 ms?). Obviously, this can only be done by sending many more blocks over the network, for example one per ms, which is 1000 blocks per second (ask your Wi-Fi router what it thinks about that), so it needs optimized software.