The Playout Buffer


Data packets are extracted from their input queue and inserted into a source-specific playout buffer, sorted by their RTP timestamps. Frames are held in the playout buffer for a period of time to smooth timing variations caused by the network. Holding the data in a playout buffer also allows the pieces of fragmented frames to be received and grouped, and it allows any error correction data to arrive. The frames are then decompressed, any remaining errors are concealed, and the media is rendered for the user. Figure 6.7 illustrates the process.

Figure 6.7. The Playout Buffer


A single buffer may be used to compensate for network timing variability and as a decode buffer for the media codec. It is also possible to separate these functions, using separate buffers for jitter removal and decoding. However, there is no strict layering requirement in RTP: Efficient implementations often mingle related functions across layer boundaries, a concept termed integrated layer processing. 65

Basic Operation

The playout buffer comprises a time-ordered linked list of nodes. Each node represents a frame of media data, with associated timing information. The data structure for each node contains pointers to the adjacent nodes, the arrival time, RTP timestamp, and desired playout time for the frame, and pointers to both the compressed fragments of the frame (the data received in RTP packets) and the uncompressed media data. Figure 6.8 illustrates the data structures involved.

Figure 6.8. The Playout Buffer Data Structures

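As an illustration of these structures, here is a minimal C sketch; the type and field names (pbuf_node, fragment, and so on) are hypothetical, chosen for this example rather than taken from any particular implementation, and a real receiver would add whatever bookkeeping its codecs require:

#include <stddef.h>
#include <stdint.h>

typedef struct fragment {
    struct fragment *next;        // Next compressed fragment of this frame
    uint16_t         seq;         // RTP sequence number of the fragment
    int              marker;      // RTP marker bit from the packet header
    uint8_t         *data;        // Compressed payload data
    size_t           len;
} fragment;

typedef struct pbuf_node {
    struct pbuf_node *prev, *next;   // Time-ordered doubly linked list
    uint32_t          rtp_ts;        // RTP timestamp of the frame
    uint32_t          arrival_time;  // Local time the first fragment arrived
    uint32_t          playout_time;  // Desired playout time, local timeline
    fragment         *fragments;     // Compressed data, as received
    uint8_t          *decoded;       // Uncompressed frame, once decoded
    size_t            decoded_len;
} pbuf_node;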

When the first RTP packet in a frame arrives, it is removed from the input queue and positioned in the playout buffer in order of its RTP timestamp. This involves creating a new playout buffer node, which is inserted into the linked list of the playout buffer. The compressed data from the recently arrived packet is linked from the playout buffer node, for later decoding. The frame's playout time is then calculated, as explained later in this chapter.

The newly created node resides in the playout buffer until its playout time is reached. During this waiting period, packets containing other fragments of the frame may arrive and are linked from the node. Once it has been determined that all the fragments of a frame have been received, the decoder is invoked and the resulting uncompressed frame linked from the playout buffer node. Determining that a complete frame has been received depends on the codec:

  • Audio codecs typically do not fragment frames, so there is a single packet per frame (MPEG Audio Layer 3, MP3, is a common exception);

  • Video codecs often generate multiple packets per video frame, with the RTP marker bit being set to indicate the RTP packet containing the last fragment.

Receiving a video packet with the marker bit set does not necessarily mean that the complete frame has been received, because packets may be lost or reordered in transit. Instead, it identifies the highest RTP sequence number belonging to the frame. Once all RTP packets with the same timestamp but lower sequence numbers have been received, the frame is complete. Completeness is easy to determine if the packet carrying the marker bit for the previous frame was received. If that packet was lost, as revealed by a timestamp change that appears without the marker bit, and if only one packet was lost according to the sequence numbers, then the first packet after the loss is the first packet of the new frame. If multiple packets are lost, it is typically not possible to tell whether those packets belonged to the new frame or the previous frame (knowledge of the media format may make it possible to determine the frame boundary in some cases, but that ability depends on the specific codec and payload format).
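As a sketch of this logic, a receiver that tracks the highest sequence number of the previous frame can test completeness as follows; the function and parameter names are hypothetical, and num_received is assumed to count distinct fragments of the current frame (duplicates excluded):

#include <stdint.h>

// Returns nonzero if a fragmented frame is complete. Sequence arithmetic
// is modulo 2^16, so wrap-around is handled automatically.
//   prev_frame_last_seq - highest sequence number of the previous frame
//   have_marker         - nonzero once the packet with the marker bit arrives
//   marker_seq          - sequence number of that marker packet
//   num_received        - distinct fragments of this frame received so far
static int frame_is_complete(uint16_t prev_frame_last_seq, int have_marker,
                             uint16_t marker_seq, unsigned num_received)
{
    if (!have_marker) {
        return 0;                 // Last fragment not yet seen
    }
    // Number of packets the frame should contain, given where the previous
    // frame ended and where the marker packet sits.
    return num_received == (uint16_t)(marker_seq - prev_frame_last_seq);
}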

The decision of when to invoke the decoder depends on the receiver and is not specified by RTP. Frames can be decoded as soon as they arrive or kept compressed until the last possible moment. The choice depends on the relative availability of processing cycles and storage space for uncompressed frames, and perhaps on the receiver's estimate of future resource availability. For example, a receiver may wish to decode data early if it knows that an index frame is due and it will shortly be busy.

Eventually the playout time for a frame arrives, and the frame is queued for playout as discussed in the section Decoding, Mixing, and Playout later in this chapter. If the frame has not already been decoded, at this time the receiver must make its best effort to decode the frame, even if some fragments are missing, because this is the last chance before the frame is needed. This is also the time when error concealment (see Chapter 8) may be invoked to hide any uncorrected packet loss.

Once the frame has been played out, the corresponding playout buffer node and its linked data should be destroyed or recycled. If error concealment is used, however, it may be desirable to delay this process until the surrounding frames have also been played out because the linked media data may be useful for the concealment operation.

RTP packets arriving late and corresponding to frames that have missed their playout point should be discarded. The timeliness of a packet can be determined by comparison of its RTP timestamp with the timestamp of the oldest packet in the playout buffer (note that the comparison should be done with 32-bit modulo arithmetic, to allow for timestamp wrap-around). It is clearly desirable to choose the playout delay so that late packets are rare, and applications should monitor the number of late packets and be prepared to adapt their playout delay in response. Late packets indicate an inappropriate playout delay, typically caused by changing network delays or skew between clocks at the sending and receiving hosts.
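A wrap-around-safe version of that comparison can be written by subtracting the two 32-bit timestamps and interpreting the result as a signed value; the following sketch, with illustrative names, shows how an arriving packet might be tested against the oldest frame still buffered:

#include <stdint.h>

// Nonzero if timestamp a precedes timestamp b, modulo 2^32. Valid as long
// as the two timestamps are within 2^31 ticks of each other.
static int ts_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

// Illustrative lateness test: a packet whose timestamp precedes that of
// the oldest frame awaiting playout has missed its playout point.
static int packet_is_late(uint32_t packet_ts, uint32_t oldest_buffered_ts)
{
    return ts_before(packet_ts, oldest_buffered_ts);
}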

The trade-off in playout buffer operation is between fidelity and delay: An application must decide the maximum playout delay it can accept, and this in turn determines the fraction of packets that arrive in time to be played out. A system designed for interactive use, such as video conferencing or telephony, must keep the playout delay as small as possible because it cannot afford the latency incurred by the buffering. Studies of human perception point to a limit in round-trip time of about 300 milliseconds as the maximum tolerable for interactive use; if the system is symmetric, this limit implies an end-to-end delay of only 150 milliseconds in each direction, including network transit time and buffering delay. A noninteractive system, such as streaming video, television, or radio, can instead allow the playout buffer to grow to several seconds, enabling it to absorb much larger variations in packet arrival time.

Playout Time Calculation

The main difficulty in designing an RTP playout buffer is determining the playout delay: How long should packets remain in the buffer before being scheduled for playout? The answer depends on various factors:

  • The delay between receiving the first and last packets of a frame

  • The delay before any error correction packets are received (see Chapter 9, Error Correction)

  • The variation in interpacket timing caused by network queuing jitter and route changes

  • The relative clock skew between sender and receiver

  • The end-to-end delay budget of the application, and the relative importance of reception quality and latency

The factors under the application's control (the spacing of packets within a frame, and the delay between the media data and any error correction packets) are both determined by the sender. The effects of these factors on the playout delay calculation are discussed in the section titled Compensation for Sender Behavior later in this chapter.

Outside the control of the application is the behavior of the network, and the accuracy and stability of the clocks at sender and receiver. As an example, consider Figure 6.9, which shows the relationship between packet transmission time and reception time for a trace of RTP audio packets. If the sender clock and receiver clock run at the same rate, as is desired, the slope of this plot should be exactly 45 degrees. In practice, sender and receiver clocks are often unsynchronized and run at slightly different rates. For the trace in Figure 6.9, the sender clock is running faster than the receiver clock, so the slope of the plot is less than 45 degrees (Figure 6.9 is an extreme example, to make it easy to see the effect; the slope is typically much closer to 45 degrees). The section titled Compensation for Clock Skew later in this chapter explains how to correct for unsynchronized clocks.

Figure 6.9. Packet Send Time versus Receive Time, Illustrating Clock Skew


If the packets have a constant network transit time, the plot in Figure 6.9 will be an exactly straight line. However, the network typically induces some jitter in the interpacket spacing, due to variation in queuing delays, and this is observable in the figure as deviations from the straight-line plot. The figure also shows a discontinuity, resulting from a step change in the network transit time, most likely due to a route change in the network. Chapter 2, Voice and Video Communication over Packet Networks, has a more detailed discussion of these effects, and the section titled Compensation for Jitter later in this chapter explains how to correct for them. Correcting for more extreme variations is discussed in the sections Compensation for Route Changes and Compensation for Packet Reordering.

The final point to consider is the end-to-end delay budget of the application. This is mainly a human-factors issue: What is the maximum acceptable end-to-end delay for the users of the application, and how long does this leave for smoothing in the playout buffer after the network transit time has been factored out? As might be expected, the amount of time available for buffering does affect the design of the playout buffer; the section titled Compensation for Jitter discusses this subject further.

A receiver should take these factors into account when determining the playout time for each frame. The playout calculation follows several steps:

  1. The sender timeline is mapped to the local playout timeline, compensating for the relative offset between sender and receiver clocks, to derive a base time for the playout calculation (see Mapping to the Local Timeline later in this chapter).

  2. If necessary, the receiver compensates for clock skew relative to the sender by adding to the base time a skew compensation offset that is adjusted periodically (see Compensation for Clock Skew).

  3. The playout delay on the local timeline is calculated according to a sender-related component of the playout delay (see Compensation for Sender Behavior) and a jitter-related component (see Compensation for Jitter).

  4. The playout delay is adjusted if the route has changed (see Compensation for Route Changes), if packets have been reordered (see Compensation for Packet Reordering), if the chosen playout delay causes frames to overlap, or in response to other changes in the media (see Adapting the Playout Point).

  5. Finally, the playout delay is added to the base time to derive the actual playout time for the frame.

Figure 6.10 illustrates the playout calculation, noting the steps of the process. The following sections give details of each stage.

Figure 6.10. The Playout Calculation


MAPPING TO THE LOCAL TIMELINE

The first stage of the playout calculation is to map from the sender's timeline (as conveyed in the RTP timestamp) to a timeline meaningful to the receiver, by adding the relative offset between sender and receiver clocks to the RTP timestamp.

To calculate the relative offset, the receiver tracks the difference, d(n), between the RTP timestamp of the nth packet, T_R(n), and the arrival time of that packet, T_L(n), measured in the same units:

d(n) = T_L(n) - T_R(n)


The difference, d(n), includes a constant factor because the sender and receiver clocks were initialized at different times with different random values, a variable delay due to data preparation time at the sender, a constant factor due to the minimum network transit time, a variable delay due to network timing jitter, and a rate difference due to clock skew. The difference is a 32-bit unsigned integer, like the timestamps from which it is calculated, and because the sender and receiver clocks are unsynchronized, it can have any 32-bit value.

The difference is calculated as each packet arrives, and the receiver tracks its minimum observed value to obtain the relative offset:

offset = min[ d(1), d(2), ..., d(n) ]


Because of the rate difference between T_L(n) and T_R(n) due to clock skew, the difference, d(n), will tend to drift larger or smaller. To prevent this drift, the minimum offset is calculated over a window, w, of the differences since the last compensation for clock skew. Also note that an unsigned comparison is required because the values may wrap around:

offset(n) = min[ d(n-w+1), d(n-w+2), ..., d(n) ]


The offset value is used to calculate the base playout point, according to the timeline of the receiver:

base_playout_time(n) = T_R(n) + offset(n)


This is the initial estimate of the playout time, to which are applied additional factors compensating for clock skew, jitter, and so on.
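A minimal C sketch of this mapping is shown below. It assumes the local arrival time has already been converted into the media clock's units, uses an illustrative fixed-size window, and ignores the reset that a full implementation would perform when skew compensation adjusts the offset:

#include <stdint.h>

#define OFFSET_WINDOW 64   // Illustrative window size, in packets

static uint32_t diff_hist[OFFSET_WINDOW];   // Recent values of d(n)
static unsigned diff_count = 0;

// Record d(n) = T_L(n) - T_R(n) for a newly arrived packet and return the
// base playout time of that packet on the local timeline.
uint32_t base_playout_time(uint32_t rtp_ts, uint32_t local_arrival_ts)
{
    uint32_t d_n = local_arrival_ts - rtp_ts;   // Wraps modulo 2^32
    uint32_t offset;
    unsigned i, n;

    diff_hist[diff_count % OFFSET_WINDOW] = d_n;
    diff_count++;

    // Minimum over the window, using a wrap-around-aware comparison
    n      = (diff_count < OFFSET_WINDOW) ? diff_count : OFFSET_WINDOW;
    offset = diff_hist[0];
    for (i = 1; i < n; i++) {
        if ((int32_t)(diff_hist[i] - offset) < 0) {
            offset = diff_hist[i];
        }
    }
    return rtp_ts + offset;   // Base playout time, before other adjustments
}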

COMPENSATION FOR CLOCK SKEW

RTP payload formats define the nominal clock rate for a media stream but place no requirements on the stability and accuracy of the clock. Sender and receiver clocks commonly run at slightly different rates, forcing the receiver to compensate for the variation. A plot of packet transmission time versus reception time, as in Figure 6.9, illustrates this. If the slope of the plot is exactly 45 degrees, the clocks have the same rate; deviations are caused by clock skew between sender and receiver.

Receivers must detect the presence of clock skew, estimate its magnitude, and adjust the playout point to compensate. There are two possible compensation strategies: tuning the receiver clock to match the sender clock, or periodically adjusting playout buffer occupancy to regain alignment.

The latter approach accepts the skew and periodically realigns the playout buffer by inserting or deleting data. If the sender is faster, the receiver will eventually have to discard some data to bring the clocks back into alignment; otherwise its playout buffer will be overrun. If the sender is slower, the receiver will eventually run out of media to play and must synthesize some data to fill the gap that is left. The magnitude of the clock skew determines the frequency of playout point adjustments, and hence the quality degradation experienced.

Alternatively, if the receiver clock rate is finely adjustable, it may be possible to tune its rate to exactly match that of the sender, avoiding the need for playout buffer realignment. This approach can give higher quality because data is never discarded due to skew, but it may require hardware support that is not common (audio systems may be able to resample in software to match the desired rate).

Estimating the amount of clock skew present initially appears to be a simple problem: Observe the rate of the sender clock (the RTP timestamp) and compare it with the local clock. If T_R(n) is the RTP timestamp of the nth packet received, and T_L(n) is the value of the local clock at that time, then the clock skew can be estimated as follows:

skew = (T_R(n) - T_R(1)) / (T_L(n) - T_L(1))


with a skew of less than unity meaning that the sender is slower than the receiver, and a skew of greater than unity meaning that the sender clock is fast compared to the receiver. Unfortunately, the presence of network timing jitter means that this simple estimate is not sufficient; it will be directly affected by variation in interpacket spacing due to jitter. Receivers must look at the long-term variation in the packet arrival rate to derive an estimate for the underlying clock skew, removing the effects of jitter.

There are many possible algorithms for managing clock skew, depending on the accuracy and the sensitivity to jitter that are required. In the following discussion I describe a simple approach to estimating and compensating for clock skew that has proven suitable for voice-over-IP applications, 79 and I give pointers to algorithms for more demanding applications.

The simple approach to clock skew management continually monitors the average network transit delay and compares it with an active delay estimate. Increasing divergence between the active delay estimate and the measured average delay denotes the presence of clock skew, eventually causing the receiver to adapt its playout. As each packet arrives, the receiver calculates the instantaneous one-way delay for the nth packet, d_n, based on the reception time of the packet and its RTP timestamp:

d_n = T_R(n) - T_L(n)


On receipt of the first packet, the receiver sets both the active delay, E, and the average delay estimate, D, equal to the measured value of d_n for that packet. With each subsequent packet the average delay estimate, D_n, is updated by an exponentially weighted moving average:

D_n = (31 × D_(n-1) + d_n) / 32


The factor 31/32 controls the averaging process, with values closer to unity making the average less sensitive to short-term fluctuation in the transit time. Note that this calculation is similar to the calculation of the estimated jitter, but it retains the sign of the variation, and it uses a time constant chosen to capture the long-term variation and reduce the response to short-term jitter.

The average one-way delay, D_n, is compared with the active delay estimate, E, to measure the divergence since the last adjustment:

divergence = E - D_n


If the sender clock and receiver clock are synchronized, the divergence will be close to zero, with only minor variations due to network jitter. If the clocks are skewed, the divergence will increase or decrease until it exceeds a predetermined threshold, causing the receiver to take compensating action. The threshold depends on the jitter, the codec, and the set of possible adaptation points. It has to be large enough that false adjustments due to jitter are avoided, and it should be chosen such that the discontinuity caused by an adjustment can easily be concealed. Often a single framing interval is suitable, meaning that an entire codec frame is inserted or removed.

Compensation involves growing or shrinking the playout buffer as described in the section titled Adapting the Playout Point later in this chapter. The playout point can be changed by up to the measured divergence, in RTP timestamp units (for audio, the divergence typically gives the number of samples to add or remove). After compensating for skew, the receiver resets the active delay estimate, E, to equal the current delay estimate, D_n, resetting the divergence to zero in the process (the estimate for base_playout_time(n) is also reset at this time).

In C-like pseudocode, the algorithm performed as each packet is received becomes this:

int adjustment_due_to_skew(rtp_packet p, uint32_t curr_time)
{
    static int      first_time = 1;
    static uint32_t delay_estimate;
    static uint32_t active_delay;
    int             adjustment = 0;
    uint32_t        d_n = p->ts - curr_time;

    if (first_time) {
        first_time     = 0;
        delay_estimate = d_n;
        active_delay   = d_n;
    } else {
        delay_estimate = (31 * delay_estimate + d_n) / 32;
    }
    // View the wrapped difference as a signed quantity when comparing
    if ((int32_t)(active_delay - delay_estimate) > SKEW_THRESHOLD) {
        // Sender is slow compared to receiver
        adjustment   = SKEW_THRESHOLD;
        active_delay = delay_estimate;
    }
    if ((int32_t)(active_delay - delay_estimate) < -SKEW_THRESHOLD) {
        // Sender is fast compared to receiver
        adjustment   = -SKEW_THRESHOLD;
        active_delay = delay_estimate;
    }
    // Adjustment will be 0, -SKEW_THRESHOLD, or SKEW_THRESHOLD. It is
    // appropriate that SKEW_THRESHOLD equals the framing interval.
    return adjustment;
}

The assumptions of this algorithm are that the jitter distribution is symmetric and that any systematic bias is due to clock skew. If the distribution of delay values is asymmetric for reasons other than clock skew, this algorithm will cause spurious skew adaptation. Significant short-term fluctuations in the network transit time also might confuse the algorithm, causing the receiver to perceive network jitter as clock skew and adapt its playout point. Neither of these issues should cause operational problems: The skew compensation algorithm will eventually correct itself, and any adaptation steps would likely be needed in any case to accommodate the fluctuations.

Another assumption of the skew compensation described here is that it is desirable to make step adjustments to the playout point (for example, adding or removing a complete frame at a time) while concealing the discontinuity as if a packet were lost. For many codecs, in particular frame-based voice codecs, this is appropriate behavior because the codec is optimized to conceal lost frames and skew compensation can leverage this ability, provided that care is taken to add or remove unimportant, typically low-energy, frames. In some cases, however, it is desirable to adapt more smoothly, perhaps interpolating a single sample at a time.

If smoother adaptation is needed, the algorithm by Moon et al. 90 may be more suitable, although it is more complex and has correspondingly greater requirements for state maintenance. The basis of their approach is to use linear programming on a plot of observed one-way delay versus time, to fit a line that lies under all the data points, as closely to them as possible. An equivalent approach is to derive a best-fit line under the data points of a plot such as that in Figure 6.9, and use this to estimate the slope of the line, and hence the clock skew. Such algorithms are more accurate, provided the skew is constant, but they clearly have higher overheads because they require the receiver to keep a history of points, and perform an expensive line-fitting algorithm. They can, however, derive very accurate skew measurements, given a long enough measurement interval.

Long-running applications should take into account the possibility that the skew might be nonstationary, and vary according to outside effects. For example, temperature changes can affect the frequency of crystal oscillators and cause variation in the clock rate and skew between sender and receiver. Nonstationary clock skew may confuse some algorithms (for example, that of Moon et al. 90 ) that use long-term measurements. Other algorithms, such as that of Hodson et al., 79 described earlier, work on shorter timescales and periodically recalculate the skew, so they are robust to variations.

When choosing a clock skew estimation algorithm, it is important to consider how the playout point will be varied, and to choose an estimator with an appropriate degree of accuracy. For example, applications using frame-based audio codecs may adapt by adding or removing a single frame, so an estimator that measures skew to the nearest sample may be overkill. The section titled Adapting the Playout Point later in this chapter discusses this issue in more detail.

Although they are outside the scope of this book, the algorithms of the Network Time Protocol may also be of interest to implementers. RFC 1305 5 is recommended reading for those with an interest in clock synchronization and skew compensation (the PDF version, available from the RFC Editor Web site, at http://www.rfc-editor.org, is considerably more readable than the text-only version).

COMPENSATION FOR SENDER BEHAVIOR

The nature of the sender's packet generation process can influence the receiver's playout calculation in several ways, causing increased playout buffering delay.

If the sender spreads the packets that make up a frame in time across the framing interval, as is common for video, there will be a delay between the first and last packets of a frame, and receivers must buffer packets until the whole frame is received. Figure 6.11 shows the insertion of additional playout delay, T_d, to ensure that the receiver does not attempt to play out the frame before all the fragments have arrived.

Figure 6.11. Buffering Delay, to Group Packets into Frames


If the interframe timing and number of packets per frame are known, inserting this additional delay is simple. Assuming that the sender spaces packets evenly, the adjustment will be as follows:

adjustment_due_to_fragmentation = (packets_per_frame - 1)
                                x (interframe_time / packets_per_frame)
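For example, with hypothetical numbers, a 25 frames-per-second video stream (an interframe time of 40 milliseconds) sent as four evenly spaced packets per frame would need an additional (4 - 1) x (40 / 4) = 30 milliseconds of buffering delay.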

Unfortunately, receivers do not always know these variables in advance. For example, the frame rate may not be signaled during session setup, the frame rate may vary during a session, or the number of packets per frame may vary during a session. This variability can make it difficult to schedule playout because it is unclear how much delay needs to be added to allow all fragments to arrive. The receiver must estimate the required playout delay, and adapt if the estimate proves inaccurate.

The estimated playout compensation could be calculated by a special-purpose routine that looked at the arrival times of fragments to calculate an average fragmentation delay. Fortunately, this is not necessary; the jitter calculation performs the same role. All packets of a frame have the same timestamp, representing the time of the frame rather than the time the packet was sent, so fragmented frames cause the appearance of jitter (the receiver cannot differentiate between a packet delayed in the network and one delayed by the sender). The strategies for jitter compensation discussed in the next section can therefore be used to estimate the amount of buffering delay needed to compensate for fragmentation, and there is no need to account for fragmentation separately in the playout delay calculation.

Similar issues arise if the sender uses the error correction techniques described in Chapter 9. For error correction packets to be useful, playout must be delayed so that the error correction packets arrive in time to be used. The presence of error correction packets is signaled during session setup, and the signaling may include enough information to allow the receiver to size the playout buffer correctly. Alternatively, the correct playout delay must be inferred from the media stream. The compensation delay needed depends on the type of error correction employed. Three common types of error correction are parity FEC (forward error correction), audio redundancy, and retransmission.

The parity FEC scheme discussed in Chapter 9 32 leaves the data packets unmodified, and sends error correction packets in a separate RTP stream. The error correction packets contain a bit mask in their FEC header to identify the sequence numbers of the packets they protect. By observing the mask, a receiver can determine the delay to add, in packets. If packet spacing is constant, this delay translates to a time offset to add to the playout calculation. If the interpacket spacing is not constant, the receiver must use a conservative estimate of the spacing to derive the required playout delay.
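As a rough sketch, assuming an RFC 2733-style FEC header in which a 24-bit mask marks the packets (relative to a sequence number base) covered by each FEC packet, the buffering requirement might be derived as follows; the helper names and the fixed packet spacing are assumptions for illustration only:

#include <stdint.h>

// Number of media packets spanned by one FEC packet: bit i of the mask
// set means the packet with sequence number (SN base + i) is protected.
static unsigned fec_span_in_packets(uint32_t mask)
{
    unsigned span = 0;
    unsigned i;

    for (i = 0; i < 24; i++) {
        if (mask & (1u << i)) {
            span = i + 1;
        }
    }
    return span;
}

// Illustrative conversion to a playout delay, assuming a constant (or
// conservatively estimated) spacing between media packets in milliseconds.
static unsigned fec_playout_delay_ms(uint32_t mask, unsigned packet_spacing_ms)
{
    return fec_span_in_packets(mask) * packet_spacing_ms;
}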

The audio redundancy scheme 10 discussed in Chapter 9, Error Correction, includes a time offset in redundant packets, and this offset may be used to size the playout buffer. At the start of a talk spurt, redundant audio may be used in two modes: Initial packets may be sent without the redundancy header, or they may be sent with a zero-length redundant block. As Figure 6.12 shows, sizing the playout buffer is easier if a zero-length redundant block is included with initial packets in a talk spurt. Unfortunately, including these blocks is not mandatory in the specification, and implementations may have to guess an appropriate playout delay if it is not present (a single packet offset is most common and makes a reasonable estimate in the absence of other information). Once a media stream has been determined to use redundancy, the offset should be applied to all packets in that stream, including any packets sent without redundancy at the beginning of a talk spurt. If a complete talk spurt is received without redundancy, it can be assumed that the sender has stopped redundant transmission, and future talk spurts can be played without delay.

Figure 6.12. Effects of Audio Redundancy Coding on the Playout Buffer


A receiver of either parity FEC or redundancy should initially pick a large playout delay, to ensure that any data packets that arrive are buffered. When the first error correction packet arrives, it will cause the receiver to reduce its playout delay, reschedule, and play out any previously buffered packets. This process avoids a gap in playout caused by an increase in buffering delay, at the expense of slightly delaying the initial packet.

When packet retransmission is used, the playout buffer must be sized larger than the round-trip time between sender and receiver, to allow time for the retransmission request to reach the sender and be serviced. The receiver has no way of knowing the round-trip time, short of sending a retransmission request and measuring the response time. This does not affect most implementations because retransmission is typically used in noninteractive applications in which the playout buffering delay is larger than the round-trip time, but it may be an issue if the round-trip time is large.

No matter what error correction scheme is used, the sender may be generating an excessive amount of error correction data. For example, when sending to a multicast group, the sender might choose an error correction code based on the worst-case receiver, which will be excessive for other receivers. As noted by Rosenberg et al., 101 it may then be possible to repair some fraction of loss with only a subset of the error correction data. In this case a receiver may choose a playout delay smaller than that required for all of the error correction data, instead just waiting long enough to repair the loss it chooses. The decision to ignore some error correction data is made solely by the receiver and based on its view of the transmission quality.

Finally, if the sender interleaves the media stream, as will be described in Chapter 8, Error Concealment, the receiver must allow for this in the playout calculation so that it can sort interleaved packets into playout order. Interleaving parameters are typically signaled during session setup, allowing the receiver to choose an appropriate buffering delay. For example, the AMR payload format 41 defines an interleaving parameter that can be signaled in the SDP a=fmtp: line, denoting the number of packets per interleaving group (and hence the amount of delay, in packets, that should be inserted into the playout buffer to compensate). Other codecs that support interleaving should supply a similar parameter.

To summarize, the sender may affect the playout buffer in three ways: by fragmenting frames and delaying sending fragments, by using error correction packets, or by interleaving. The first of these will be compensated for according to the usual jitter compensation algorithm; the others require the receiver to adjust the playout buffer to compensate. This compensation is mostly an issue for interactive applications that use small playout buffers to reduce latency; streaming media systems can simply set a large playout buffer.

COMPENSATION FOR JITTER

When RTP packets flow over a real-world IP network, variation in the interpacket timing is inevitable. This network jitter can be significant, and a receiver must compensate by inserting delay into its playout buffer so that packets held up by the network can be processed. Packets that are delayed too much arrive after their playout time has passed and are discarded; with suitable selection of a playout algorithm, this should be a rare occurrence. Figure 6.13 shows the jitter compensation process.

Figure 6.13. Network Jitter Affects Reception Time and Is Corrected in the Playout Buffer


There is no standard algorithm for calculating the jitter compensation delay; most applications will want to calculate the playout delay adaptively and may use different algorithms, depending on the application type and network conditions. Applications designed for noninteractive scenarios would do well to pick a compensation delay significantly larger than the expected jitter; an appropriate value might be several seconds. More complex is the interactive case, in which the application wants to keep the playout delay as small as possible (values on the order of tens of milliseconds are not unrealistic, given network and packetization delays). To minimize the playout delay, it is necessary to study the properties of the jitter and use them to derive the minimum suitable playout delay.

In many cases the network-induced jitter is essentially random. A plot of interpacket arrival times versus frequency of occurrence will, in this case, be somewhat similar to the Gaussian distribution shown in Figure 6.14. Most packets are only slightly affected by the network jitter, but some outliers are significantly delayed or are back-to-back with a neighboring packet.

Figure 6.14. Distribution of Network Jitter


How accurate is this approximation? That depends on the network path, of course, but measurements taken by me and by Moon et al. 91 show that the approximation is reasonable in many cases, although real-world data is often skewed toward larger interarrival times and has a sharp minimum cutoff value (as illustrated by the "actual distribution" in Figure 6.14). The difference is usually not critical, because the number of packets in the discard region is small.

If it can be assumed that the jitter distribution approximates a Gaussian normal distribution, then deriving a suitable playout delay is straightforward. The standard deviation of the jitter is calculated, and from probability theory we know that more than 99% of a normal distribution lies within three times the standard deviation of the mean (average) value. An implementation could therefore choose a playout delay equal to three times the standard deviation of the interarrival times and expect to discard less than 0.5% of packets because of late arrival. If this delay is too long, a playout delay of twice the standard deviation gives an expected discard rate due to late arrival of less than 2.5%, again from probability theory.

How can we measure the standard deviation? The jitter value calculated for insertion into RTCP receiver reports tracks the average variation in network transit time, which can be used to approximate the standard deviation. On the basis of these approximations, the playout delay required to compensate for network jitter can be estimated as three times the RTCP jitter estimate for a particular source. The playout delay for a new frame is set to at least

T_playout = T_current + 3 × J

where J is the current estimate of the jitter, as described in Chapter 5, RTP Control Protocol. The value of T_playout may be modified in a media-dependent manner, as discussed later. Implementations using this value as a base for their playout calculation have shown good performance in a range of real-world conditions.
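As a concrete sketch, with J kept in RTP timestamp units as RTCP requires, the baseline playout time and the equivalent delay in milliseconds might be computed as follows; the function names and the use of the media clock rate for conversion are illustrative assumptions:

#include <stdint.h>

// Baseline playout time for a frame: the base time plus three times the
// jitter estimate, all in RTP timestamp units. Media-specific adjustments
// are applied on top of this value.
static uint32_t playout_time(uint32_t base_time, uint32_t jitter)
{
    return base_time + 3 * jitter;
}

// The same buffering delay expressed in milliseconds, given the media
// clock rate (for example, 8000 Hz for telephone-quality audio).
static uint32_t playout_delay_ms(uint32_t jitter, unsigned clock_rate)
{
    return (uint32_t)(((uint64_t)jitter * 3 * 1000) / clock_rate);
}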

Although the RTCP jitter estimate provides a convenient value to use in the playout calculation, an implementation can use an alternative jitter estimate if that proves a more robust base for the playout time calculation (the standard jitter estimate must still be calculated and returned in RTCP RR packets). In particular, it has been suggested that the phase jitter, the difference between the time a packet arrived and the time it was expected, is a more accurate measure of network timing, although this has not yet been tested in widely deployed implementations. Accurate jitter prediction for interactive playout buffering is a difficult problem, with room for improvement over current algorithms.

The jitter distribution depends both on the path that traffic takes through the network and on the other traffic sharing that path. The primary cause of jitter is competition with other traffic, resulting in varying queuing delays at the intermediate routers; clearly, changes in the other traffic will also affect the jitter seen by a receiver. For this reason receivers should periodically recalculate the amount of playout buffering delay they use, in case the network behavior has changed, adapting if necessary. When should receivers adapt? This is not a trivial question, because any change in the playout delay while the media is playing will disrupt the playout, either causing a gap where there is nothing to play or forcing the receiver to discard some data to make up lost time. Accordingly, receivers try to limit the number of times they adapt their playout point. Several factors can be used as triggers for adaptation:

  • A significant change in the fraction of packets discarded because of late arrival

  • Receipt of several consecutive packets that must be discarded because of late arrival (three consecutive packets is a suitable threshold)

  • Receipt of packets from a source that has been inactive for a long period of time (ten seconds is a suitable threshold)

  • The onset of a spike in the network transit delay

With the exception of spikes in the network transit delay, these factors should be self-explanatory. As shown in Figure 2.12 in Chapter 2, Voice and Video Communication over Packet Networks, the network occasionally causes "spikes" in the transit delay, when several packets are delayed and arrive in a burst. Such spikes can easily bias a jitter estimate, causing the application to choose a larger playout delay than is required. In many applications this increase in playout delay is acceptable, and applications should treat a spike as any other form of jitter and increase their playout delay to compensate. However, some applications prefer increased packet loss to increased latency; these applications should detect the onset of a delay spike and ignore packets in the spike when calculating the playout delay. 91, 96

Detecting the start of a delay spike is simple: If the delay between consecutive packets increases suddenly, a delay spike likely has occurred. The scale of a "sudden increase" is open to some interpretation: Ramjee et al. 96 suggest that twice the statistical variance in interarrival time, plus 100 milliseconds, is a suitable threshold; another implementation that I'm familiar with uses a fixed threshold of 375 milliseconds (both are voice-over-IP systems using 8kHz speech).

Certain media events can also cause the delay between consecutive packets to increase, and should not be confused with the onset of a delay spike. For example, audio silence suppression will cause a gap between the last packet in one talk spurt and the first packet in the next. Equally, a change in video frame rate will cause interpacket timing to vary. Implementations should check for this type of event before assuming that a change in interpacket delay implies that a spike has occurred.

Once a delay spike has been detected, an implementation should suspend normal jitter adjustment until the spike has ended. As a result, several packets will probably be discarded because of late arrival, but it is assumed that the application has a strict delay bound, and that this result is preferable to an increased playout delay.

Locating the end of a spike is harder than detecting the onset. One key characteristic of a delay spike is that packets that were evenly spaced at the sender arrive in a burst after the spike, meaning that each packet has a progressively smaller transit delay, as shown in Figure 6.15. The receiver should maintain an estimate of the "slope" of the spike, and once it is sufficiently close to flat, the spike can be assumed to have ended.

Figure 6.15. Network Transit Time during a Delay Spike


Given all these factors, pseudocode to compensate playout delay for the effects of jitter and delay spikes is as follows:

int adjustment_due_to_jitter(...)
{
    delta_transit = abs(transit - last_transit);
    if (delta_transit > SPIKE_THRESHOLD) {
        // A new "delay spike" has started
        playout_mode = SPIKE;
        spike_var    = 0;
        adapt        = FALSE;
    } else {
        if (playout_mode == SPIKE) {
            // We're within a delay spike; maintain slope estimate
            spike_var = spike_var / 2;
            delta_var = (abs(transit - last_transit) +
                         abs(transit - last_last_transit)) / 8;
            spike_var = spike_var + delta_var;
            if (spike_var < spike_end) {
                // Slope is flat; return to normal operation
                playout_mode = NORMAL;
            }
            adapt = FALSE;
        } else {
            // Normal operation; significant events can cause us to
            // adapt the playout
            if (consecutive_dropped > DROP_THRESHOLD) {
                // Dropped too many consecutive packets
                adapt = TRUE;
            }
            if ((current_time - last_header_time) > INACTIVE_THRESHOLD) {
                // Silent source restarted; network conditions have
                // probably changed
                adapt = TRUE;
            }
        }
    }
    desired_playout_offset = 3 * jitter;
    if (adapt) {
        playout_offset = desired_playout_offset;
    } else {
        playout_offset = last_playout_offset;
    }
    return playout_offset;
}

The key points are that jitter compensation is suspended during a delay spike, and that the actual playout time changes only when a significant event occurs. At other times the desired_playout_offset is stored to be instated at a media-specific time (see the section titled Adapting the Playout Point).

COMPENSATION FOR ROUTE CHANGES

Although infrequent, route changes can occur in the network because of link failures or other topology changes. If a change occurs in the route taken by the RTP packets, it will manifest itself as a sudden change in the network transit time. This change will disrupt the playout buffer because either the packets will arrive too late for playout, or they will be early and overlap with previous packets.

The jitter and delay spike compensation algorithms should detect the change in delay and adjust the playout to compensate, but this approach may not be optimal. Faster adaptation can take place if the receiver observes the network transit time directly and adjusts the playout delay in response to a large change. For example, an implementation might adjust the playout delay if the transit delay changes by more than five times the current jitter estimate. The relative network transit time is used as part of the jitter calculation, so such observation is straightforward.
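A hedged sketch of such a check might look like this, using illustrative names and the factor of five suggested above; transit and avg_transit are the relative transit times already maintained for the jitter calculation, in RTP timestamp units:

#include <stdint.h>

// Nonzero if the latest relative transit time differs from its running
// average by more than five times the current jitter estimate, suggesting
// a route change rather than ordinary jitter.
static int probable_route_change(uint32_t transit, uint32_t avg_transit,
                                 uint32_t jitter)
{
    int32_t  delta     = (int32_t)(transit - avg_transit);  // Signed, wrap-safe
    uint32_t magnitude = (delta < 0) ? (uint32_t)(-delta) : (uint32_t)delta;

    return magnitude > 5 * jitter;
}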

COMPENSATION FOR PACKET REORDERING

In extreme cases, jitter or route changes can result in packets being reordered in the network. As discussed in Chapter 2, Voice and Video Communication over Packet Networks, this is usually a rare occurrence, but it happens frequently enough that implementations need to be able to compensate for its effects and smoothly play out a media stream that contains out-of-order packets.

Reordering should not be an issue for correctly designed receivers: Packets are inserted into the playout buffer according to their RTP timestamp, irrespective of the order in which they arrive. If the playout delay is sufficiently large, they are played out in their correct sequence; otherwise they are discarded as any other late packet. If many packets are discarded because of reordering and late arrival, the standard jitter compensation algorithm will take care of adjusting the playout delay.


