Adapting the Playout Point


There are two basic approaches to adapting the playout point: receivers can either slightly adjust the playout time for each frame, making continual small adjustments to the playout point, or they can insert or remove complete frames from the media stream, making a smaller number of large adjustments as they become necessary. No matter how the adjustment is made, the media stream is disrupted to some extent. The aim of the adaptation must be to minimize this disruption, which requires knowledge of the media stream; accordingly, audio and video playout adaptation strategies are discussed separately.

Playout Adaptation for Audio with Silence Suppression

Audio is a continuous media format, meaning that each audio frame occupies a certain amount of time, and the next is scheduled to start immediately after it finishes. There are no gaps between frames unless silence suppression is used, and hence there is no convenient time to adapt the playout delay. For this reason the presence of silence suppression has a significant effect on the design of audio playout buffer algorithms.

For conversational speech signals, an active speaker will generate talk spurts several hundred milliseconds in duration, separated by silent periods. Figure 6.16 shows the presence of talk spurts in a speech signal, and the gaps left between them. The sender detects frames representing silent periods and suppresses the RTP packets that would otherwise be generated for those frames. The result is a sequence of packets with consecutive sequence numbers, but with a jump in the RTP timestamp that depends on the length of the silent period.

Figure 6.16. Talk Spurts in a Speech Signal

Adjusting the playout point during a talk spurt will cause an audible glitch in the output, but a small change in the length of the silent period between talk spurts will not be noticeable.[92] This is the key point to remember in the design of a playout algorithm for an audio tool: if possible, adjust the playout point only during silent periods.

It is usually a simple matter for a receiver to detect the start of a talk spurt, because the sender is required to set the marker bit on the first packet after a silent period, providing an explicit indication of the start of a talk spurt. Sometimes, however, the first packet in a talk spurt is lost. It is usually still possible to detect that a new talk spurt has started, because the sequence number/timestamp relationship will change as shown in Figure 6.17, providing an implicit indication of the start of the talk spurt.

Figure 6.17. Implicit Indication of the Start of a Talk Spurt
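
The implicit indication can be sketched in C as follows. This is an illustrative helper rather than part of any RTP library: the name silence_gap_ticks and the parameter ticks_per_packet (the RTP timestamp increment carried by one packet) are assumptions, and the packets are assumed to be compared in sequence order. Unsigned arithmetic is used so that sequence number and timestamp wraparound do not produce a false indication.

```c
#include <stdint.h>

/* Returns the number of RTP timestamp ticks of suppressed silence between
 * two packets, or 0 if the timestamp advanced at the normal per-packet rate.
 * A nonzero result is an implicit indication of a new talk spurt, even if
 * the packet carrying the marker bit was lost. */
uint32_t silence_gap_ticks(uint16_t prev_seq, uint32_t prev_ts,
                           uint16_t curr_seq, uint32_t curr_ts,
                           uint32_t ticks_per_packet)
{
    /* Differences are taken modulo 2^16 and 2^32, so wraparound is harmless */
    uint16_t delta_seq = (uint16_t)(curr_seq - prev_seq);
    uint32_t delta_ts = curr_ts - prev_ts;
    uint32_t expected = (uint32_t)delta_seq * ticks_per_packet;

    return (delta_ts > expected) ? delta_ts - expected : 0;
}
```

For 20 ms packets of 8 kHz audio (160 ticks per packet), a sequence number step of 2 with a timestamp step of 4320 gives a gap of 4000 ticks, that is, 500 ms of suppressed silence with one packet lost.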

Once the start of the talk spurt has been located, you may adjust the playout point by slightly changing the length of the silent period. The playout delay is then held constant for all of the packets in the talk spurt. The appropriate playout delay is calculated during each talk spurt and used to adapt the playout point for the following talk spurt, under the assumption that conditions are unlikely to change significantly between talk spurts.
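
A minimal sketch of this scheme, assuming that times are expressed in RTP timestamp units and that the delay estimate itself is computed elsewhere. The type playout_state and the function playout_time are hypothetical names introduced for illustration, and the mapping from RTP time to the local clock is omitted.

```c
#include <stdint.h>

typedef struct {
    uint32_t active_delay;  /* delay applied to the talk spurt being played */
    uint32_t target_delay;  /* latest estimate, held for the next talk spurt */
} playout_state;

/* Returns the playout time of a packet. The delay in use changes only on
 * the first packet of a talk spurt; within a talk spurt it is held constant
 * while the estimate for the following talk spurt is updated. */
uint32_t playout_time(playout_state *s, uint32_t rtp_ts,
                      uint32_t estimated_delay, int new_talk_spurt)
{
    if (new_talk_spurt) {
        /* Adopt the estimate calculated during the previous talk spurt */
        s->active_delay = s->target_delay;
    }
    s->target_delay = estimated_delay;
    return rtp_ts + s->active_delay;
}
```

Keeping the active and target delays separate means that a changing estimate never shifts packets in the middle of a talk spurt.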

Some speech codecs send low-rate comfort noise frames during silent periods so that the receiver can play appropriate background noise to achieve a more pleasant listening experience. The receipt of a comfort noise packet indicates the end of a talk spurt and a suitable time to adapt the playout delay. The length of the comfort noise period can be varied without significant effects on the audio quality. The RTP payload type does not usually indicate the comfort noise frames, so it is necessary to inspect the media data to detect their presence. Older codecs that do not have native comfort noise support may use the RTP payload format for comfort noise,[42] which is indicated by RTP payload type 13.

In exceptional cases it may be necessary to adapt during a talk spurt, for example, if multiple packets are being discarded because of late arrival. These cases are expected to be rare because talk spurts are relatively short and network conditions generally change slowly.

Combining these features produces pseudocode to determine an appropriate time to adjust the playout point, assuming that silence suppression is used, as follows:

    /* Returns TRUE if this is an appropriate time to adjust the playout point.
     * inter_packet_gap is the RTP timestamp increment between packets. */
    int should_adjust_playout(rtp_packet *curr, rtp_packet *prev, int *contdrop) {
        uint16_t delta_seq;
        uint32_t delta_ts;

        if (curr->marker) {
            return TRUE; // Explicit indication of new talk spurt
        }
        delta_seq = curr->seq - prev->seq; // Unsigned, so wraparound is safe
        delta_ts  = curr->ts - prev->ts;
        if (delta_seq * inter_packet_gap != delta_ts) {
            return TRUE; // Implicit indication of new talk spurt
        }
        if (curr->pt == COMFORT_NOISE_PT || is_comfort_noise(curr)) {
            return TRUE; // Between talk spurts
        }
        if (*contdrop > CONSECUTIVE_DROP_THRESHOLD) {
            *contdrop = 0; // Reset the caller's counter
            return TRUE; // Something has gone badly wrong, so adjust
        }
        return FALSE;
    }

The variable contdrop counts the number of consecutive packets discarded because of inappropriate playout times, for example, if a route change causes packets to arrive too late for playout. An appropriate value for CONSECUTIVE_DROP_THRESHOLD is three packets.

If the function should_adjust_playout() returns TRUE, the receiver is either in a silent period or has miscalculated the playout point. If the calculated playout point has diverged from the currently used value, the receiver should adjust the playout points for future packets by changing their scheduled playout time. There is no need to generate fill-in data, only to continue playing silence/comfort noise until the next packet is scheduled (this is true even when adjustment has been triggered by multiple consecutive packet drops, because this indicates that playout has stopped).

Care needs to be taken when the playout delay is being reduced because a significant change in conditions could bring the start of the next talk spurt so far forward that it would overlap with the end of the previous talk spurt. The amount of adaptation that may be performed is thus limited, because clipping the start of a talk spurt is not desirable.
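
The limit follows from simple arithmetic: if the previous talk spurt was played with delay old_delay and the sender suppressed silence_gap ticks of silence, the next talk spurt starts after the previous one ends only if the new delay is at least old_delay minus silence_gap. The sketch below applies that bound; the function name and parameters are hypothetical, and all values are assumed to be in the same time units.

```c
#include <stdint.h>

/* Returns the playout delay to use for the next talk spurt, limiting any
 * reduction so that the talk spurt cannot overlap the end of the previous
 * one. An increase needs no limit because it only lengthens the silence. */
uint32_t limit_delay_reduction(uint32_t old_delay, uint32_t wanted_delay,
                               uint32_t silence_gap)
{
    uint32_t reduction;

    if (wanted_delay >= old_delay) {
        return wanted_delay;
    }
    reduction = old_delay - wanted_delay;
    if (reduction > silence_gap) {
        reduction = silence_gap; /* Use up the whole gap, but no more */
    }
    return old_delay - reduction;
}
```

A receiver that needs a larger reduction than one silent period allows can spread the remainder over the following talk spurts.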

Playout Adaptation for Audio without Silence Suppression

When receiving audio transmitted without silence suppression, the receiver must adapt the playout point while audio is being played out. The most desirable means of adaptation is tuning the local media clock to match that of the transmitter so that data can be played out directly. If this is not possible, because the necessary hardware support is lacking, the receiver will have to vary the playout point either by generating fill-in data to be inserted into the media stream or by removing some media data from the playout buffer. Either approach inevitably causes some disruption to the playout, and it is important to conceal the effects of adaptation to ensure that it does not disturb the listener.

There are several possible adaptation algorithms, depending on the nature of the output device and the resources of the receiver:

  • The audio can be resampled in software to match the rate of the output device. A standard signal-processing text will provide various algorithms, depending on the desired quality and resource trade-off. This is a good, general-purpose solution.

  • Sample-by-sample adjustments in the playout delay can be made on the basis of knowledge of the media content. For example, Hodson et al.[79] use a pattern-matching algorithm to detect pitch cycles in speech that are removed or duplicated to adapt the playout (pitch cycles are much shorter than complete frames, so this approach gives fine-grained adaptation). This approach can perform better than resampling, but it is highly content-specific.

  • Complete frames can be inserted or deleted, as if packets were lost or duplicated. This algorithm is typically not of high quality, but it may be required if a hardware decoder designed for synchronous networks is used.
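
As an illustration of the first option, the following sketch stretches or shrinks a block of samples by linear interpolation. It is deliberately the simplest possible resampler, not one of the higher-quality band-limited algorithms a signal-processing text would recommend, and the function name resample_linear is an assumption made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Maps in[0..in_len) onto out[0..out_len) by linear interpolation, so that
 * the first and last samples of the two blocks coincide. Both lengths must
 * be at least 2. Interpolated values are truncated toward zero. */
void resample_linear(const int16_t *in, size_t in_len,
                     int16_t *out, size_t out_len)
{
    for (size_t i = 0; i < out_len; i++) {
        /* Position of output sample i on the input time axis */
        double pos = (double)i * (double)(in_len - 1) / (double)(out_len - 1);
        size_t k = (size_t)pos;

        if (k >= in_len - 1) {
            out[i] = in[in_len - 1]; /* Last sample: nothing to interpolate */
            continue;
        }
        double frac = pos - (double)k;
        out[i] = (int16_t)((1.0 - frac) * in[k] + frac * in[k + 1]);
    }
}
```

Producing slightly more or fewer output samples than input samples lets the receiver absorb clock skew gradually instead of inserting or deleting whole frames.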

In the absence of silence suppression, there is no obvious time to adjust the playout point. Nevertheless, a receiver can still make intelligent choices regarding playout adaptation, by varying the playout at times where error concealment is more effective (for example, during a period of relative quiet or a period during which the signal is highly repetitive), depending on the codec and error concealment algorithm. Loss concealment strategies are described in detail in Chapter 8, Error Concealment.
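
One way to find a period of relative quiet is to compare the energy of the frames waiting in the playout buffer. The sketch below assumes decoded 16-bit samples stored contiguously, one frame after another; the function name quietest_frame is hypothetical, and a real receiver might prefer a measure suited to its codec.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the buffered frame with the lowest energy (sum of
 * squared samples). Returns 0 if num_frames is 0. */
size_t quietest_frame(const int16_t *samples, size_t frame_len,
                      size_t num_frames)
{
    size_t best = 0;
    uint64_t best_energy = UINT64_MAX;

    for (size_t f = 0; f < num_frames; f++) {
        uint64_t energy = 0;
        for (size_t i = 0; i < frame_len; i++) {
            int32_t s = samples[f * frame_len + i];
            energy += (uint64_t)(s * s); /* s * s fits in 32 bits */
        }
        if (energy < best_energy) {
            best_energy = energy;
            best = f;
        }
    }
    return best;
}
```

The receiver would then insert or remove data at the returned frame, where concealment is least likely to be heard.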

Playout Adaptation for Video

Video is a discrete media format in which each frame samples the scene at a particular instant in time and the interval between frames is not recorded. The discrete nature of video provides flexibility in the playout algorithm, allowing the receiver to adapt the playout by slightly varying the interframe timing. Unfortunately, display devices typically operate at a fixed rate and limit possible presentation times. Video playout becomes a problem of minimizing the deviation between the intended and possible frame presentation instant.

For example, consider the problem of displaying a 50-frame-per-second video clip on a monitor with an 85 Hz refresh rate. In this case the monitor refresh times will not match the video playout times, causing unavoidable variation in the time at which frames are presented to the user, as shown in Figure 6.18. Only a change in the frame rate of the capture device or the refresh rate of the display can address this problem. In practice, the problem is often insoluble because video capture and playback devices often have hardware limits on the set of possible rates. Even when capture and playback devices have nominally the same rate, it may be necessary to adapt the playout according to the effects of jitter or clock skew.

Figure 6.18. Mismatch between Media Frame Times and Output Device Timing

There are three possible occasions for adaptation to occur: (1) when the display device has a higher frame rate than the capture device, (2) when the display device has a lower frame rate than the capture device, and (3) when the display and capture devices run at the same nominal rate.

If the display device has a higher frame rate than the capture device, possible presentation times will surround the desired time, and each frame can be mapped to a unique display refresh interval. The simplest approach is to display frames at the refresh interval closest to their playout time. One can achieve better results by moving frames in the direction of any required playout adjustment: displaying a frame at the refresh interval following its playout time if the receiver clock is relatively fast, or at the earlier interval if the receiver clock is slow. Intermediate refresh intervals, when no new frames are received from the sender, can be filled by repetition of the previous frame.
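
The simplest approach, choosing the closest refresh interval, reduces to integer rounding. The sketch below assumes that frame 0 and refresh 0 coincide and that both rates are whole numbers per second; the function names are hypothetical. It does not implement the directional refinement described above.

```c
#include <stdint.h>

/* Index of the display refresh closest to the playout time of frame n.
 * Computes round(n * refresh_rate / frame_rate); ties go to the later
 * refresh. */
uint64_t nearest_refresh(uint64_t n, uint32_t frame_rate,
                         uint32_t refresh_rate)
{
    return (2 * n * refresh_rate + frame_rate) / (2 * (uint64_t)frame_rate);
}

/* Presentation error in microseconds for frame n: positive if the frame is
 * shown later than its playout time, negative if earlier. */
int64_t presentation_error_us(uint64_t n, uint32_t frame_rate,
                              uint32_t refresh_rate)
{
    uint64_t k = nearest_refresh(n, frame_rate, refresh_rate);
    int64_t shown = (int64_t)(k * 1000000 / refresh_rate);
    int64_t wanted = (int64_t)(n * 1000000 / frame_rate);

    return shown - wanted;
}
```

For the 50-frame-per-second clip on an 85 Hz display, frame 1 is shown at refresh 2, about 3.5 ms late, and frame 2 at refresh 3, about 4.7 ms early; frame 10 falls exactly on refresh 17.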

If the display device has a lower frame rate than the capture device, displaying all frames is not possible, and the receiver must discard some data. For example, the receiver may calculate the difference between the frame playout time and the display times, and choose to display the subset of frames closest to possible display times.
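
A sketch of this selection, under the same assumptions as before (frame 0 and refresh 0 coincide, whole-number rates, hypothetical function names). Each refresh shows the frame closest to it in time, and a frame that no refresh selects is discarded.

```c
#include <stdint.h>

/* Index of the frame whose playout time is closest to display refresh k.
 * Computes round(k * frame_rate / refresh_rate); ties go to the later
 * frame. */
uint64_t frame_for_refresh(uint64_t k, uint32_t frame_rate,
                           uint32_t refresh_rate)
{
    return (2 * k * frame_rate + refresh_rate) / (2 * (uint64_t)refresh_rate);
}

/* Returns 1 if frame n is selected by some refresh, 0 if it is discarded.
 * Assumes refresh_rate <= frame_rate, so only the refreshes immediately
 * before and after the frame can select it. */
int frame_is_displayed(uint64_t n, uint32_t frame_rate, uint32_t refresh_rate)
{
    uint64_t k = n * refresh_rate / frame_rate; /* refresh at or before n */

    return frame_for_refresh(k, frame_rate, refresh_rate) == n ||
           frame_for_refresh(k + 1, frame_rate, refresh_rate) == n;
}
```

With a 60-frame-per-second source on a 50 Hz display, this discards one frame in every six (frames 3, 9, 15, and so on).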

If the display device and capture device run at the same rate, the playout buffer can be slipped so that frame presentation times align with the display refresh times, with the slippage providing some degree of jitter buffering delay. Exactly matched rates are uncommon in practice: clock skew is common, and periodic jitter adjustments may upset the timeline. Depending on the direction of the required adjustment, the receiver must then either insert or remove a frame to compensate.

One can insert a frame into the playout sequence simply by repeating it for two intervals. Likewise, removing a frame from the sequence is a straightforward matter. (Note that other frames may be predicted from a nondisplayed frame, so it is often impossible to completely discard a frame without decoding, except for the last predicted frame before a full frame.)

No matter how the adjustment is made, note that the human visual system is somewhat sensitive to nonuniform playout, and the receiver should seek to keep the interframe presentation times as uniform as possible to prevent disturbing artifacts. Inserting or removing frames should be considered a last-resort operation, after smaller playout adjustments (choosing an earlier or later display frame time) have proven insufficient.



RTP: Audio and Video for the Internet
ISBN: 0672322498
Year: 2003
Pages: 108
Authors: Colin Perkins
