Receiver Behavior | RTP: Audio and Video for the Internet

A receiver is expected to determine which media streams should be synchronized, and to align their presentation, on the basis of the information conveyed to it in RTCP packets.

The first part of the process, determining which streams are to be synchronized, is straightforward. The receiver synchronizes those streams that the sender has given the same CNAME in their RTCP source description packets, as described in the previous section. Because RTCP packets are sent every few seconds, there may be a delay between receipt of the first data packet and receipt of the RTCP packet that indicates that particular streams are to be synchronized. The receiver can play out the media data during this time, but it is unable to synchronize them because it doesn't have the required information.
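As an illustration, the grouping step can be sketched in a few lines of Python. This is a minimal sketch, not a complete RTCP implementation: the function name `group_by_cname` and the input format (a list of SSRC and CNAME pairs already extracted from source description packets) are assumptions made for the example.

```python
from collections import defaultdict


def group_by_cname(sdes_items):
    """Group stream identifiers (SSRCs) that share a CNAME.

    sdes_items is an iterable of (ssrc, cname) pairs; this input format is
    hypothetical and stands in for parsed RTCP source description packets.
    Returns a dict mapping each CNAME to a sorted list of its SSRCs,
    keeping only CNAMEs that have at least two streams to synchronize.
    """
    groups = defaultdict(set)
    for ssrc, cname in sdes_items:
        groups[cname].add(ssrc)
    return {cname: sorted(ssrcs)
            for cname, ssrcs in groups.items()
            if len(ssrcs) > 1}
```

Streams whose CNAME has not yet been received simply do not appear in the result, which matches the behavior described above: they can be played out, but not yet synchronized.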

More complex is the actual synchronization operation, in which the receiver time-aligns audio and video for presentation. This operation is triggered by the reception of RTCP sender report packets containing the mapping between the media clock and a reference clock common to both media. Once this mapping has been determined for both audio and video streams, the receiver has the information needed to synchronize playout.

The first step of lip synchronization is to determine, for each stream to be synchronized, when the media data corresponding to a particular reference time is to be presented to the user. Because of differences in the network behavior or other reasons, it is likely that data from two streams that was captured at the same instant will not be scheduled for presentation at the same time if the playout times are determined independently according to the methods described in Chapter 6, Media Capture, Playout, and Timing. The playout time for one stream therefore has to be adjusted to match the other. This adjustment translates into an offset to be added to the playout buffering delay for one stream, such that the media are played out in time alignment. Figures 7.4 and 7.5 illustrate the process.

Figure 7.4. Lip Synchronization at the Receiver

Figure 7.5. Mapping between Timelines to Achieve Lip Synchronization at the Receiver

The receiver first observes the mapping between the media clock and reference clock as assigned by the sender, for each media stream it is to synchronize. This mapping is conveyed to the receiver in periodic RTCP sender report packets, and because the nominal rate of the media clock is known from the payload format, the receiver can calculate the reference clock capture time for any data packet once it has received an RTCP sender report from that source. When an RTP data packet with media timestamp M is received, the corresponding reference clock capture time, T_S (the RTP timestamp mapped to the reference timeline), can be calculated as follows:

$$T_S = T_{Ssr} + \frac{M - M_{sr}}{R}$$

where M_sr is the media (RTP) timestamp in the last RTCP sender report packet, T_Ssr is the corresponding reference clock (NTP) timestamp in seconds (and fractions of a second) from the sender report packet, and R is the nominal media timestamp clock rate in hertz. (Note that this calculation is invalid if more than 2^32 ticks of the media clock have elapsed between M_sr and M, but that this is not expected to occur in typical use.)
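The calculation can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the timestamp difference is interpreted as a signed 32-bit value so that wraparound of the RTP timestamp, and data packets sampled slightly before the sender report, are both handled. That signed interpretation is a design choice of this example and narrows the valid window to 2^31 ticks on either side of M_sr.

```python
def ntp_to_seconds(ntp_msw, ntp_lsw):
    """Convert the 64-bit NTP timestamp of a sender report to seconds.

    ntp_msw is the 32-bit integer part (seconds), ntp_lsw the 32-bit
    fractional part, as carried in the RTCP sender report.
    """
    return ntp_msw + ntp_lsw / 2**32


def capture_time(m, m_sr, t_ssr, rate):
    """Return T_S = T_Ssr + (M - M_sr) / R, in seconds.

    m     -- RTP timestamp of the data packet (M)
    m_sr  -- RTP timestamp from the last sender report (M_sr)
    t_ssr -- reference (NTP) time from that sender report, in seconds
    rate  -- nominal media clock rate in hertz (R)
    """
    # Assumption of this sketch: treat the difference as signed 32-bit.
    diff = (m - m_sr) & 0xFFFFFFFF
    if diff >= 0x80000000:
        diff -= 0x100000000
    return t_ssr + diff / rate
```

For example, with an 8 kHz audio clock, a packet whose timestamp is 160 ticks after M_sr maps to a capture time 20 ms after T_Ssr.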

The receiver also calculates the presentation time for any particular packet, T_R, according to its local reference clock. This is equal to the RTP timestamp of the packet, mapped to the receiver's reference clock timeline as described earlier, plus the playout buffering delay in seconds and any delay due to the decoding, mixing, and rendering processes. It is important to take into account all aspects of the delay until actual presentation on the display or loudspeaker, if accurate synchronization is to be achieved. In particular, the time taken to decode and render is often significant and should be accounted for as described in Chapter 6, Media Capture, Playout, and Timing.

Once the capture and playout times are known according to the common reference timeline, the receiver can estimate the relative delay between media capture and playout for each stream. If data sampled at time T_S according to the sender's reference clock is presented at time T_R according to the receiver's reference clock, the difference between them, D = T_R - T_S, is the relative capture-to-playout delay in seconds. Because the reference clocks at the sender and receiver are not synchronized, this delay includes an offset that is unknown but can be ignored because it is common across all streams and we are interested in only the relative delay between streams.

Once the relative capture-to-playout delay has been estimated for both audio and video streams, a synchronization delay, D = D_audio - D_video, is derived. If the synchronization delay is zero, the streams are synchronized. A nonzero value implies that one stream is being played out ahead of the other, and the synchronization delay gives the relative offset in seconds.

For the media stream that is ahead, the synchronization delay (in seconds) is multiplied by the nominal media clock rate, R, to convert it into media timestamp units, and then it is applied as a constant offset to the playout calculation for that media stream, delaying playout to match the other stream. The result is that packets for one stream reside longer in the playout buffer at the receiver, to compensate for either faster processing in other parts of the system or reduced network delay, and are presented at the same time as packets from the other stream.
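The following Python sketch pulls these steps together: it takes the capture-to-playout delay of each stream, derives the synchronization delay, and converts it into a playout offset in media timestamp units for whichever stream is ahead. The function name `sync_offsets` and its return convention are assumptions made for this example. The stream with the smaller capture-to-playout delay is the one being played out ahead, so it is the one that receives the extra delay.

```python
def sync_offsets(d_audio, d_video, audio_rate, video_rate):
    """Return (audio_offset, video_offset) in media timestamp units.

    d_audio, d_video -- capture-to-playout delay of each stream in seconds,
                        D = T_R - T_S; both include the same unknown clock
                        offset, which cancels in the subtraction below
    audio_rate, video_rate -- nominal media clock rates in hertz
    Exactly one offset is nonzero unless the streams are already aligned.
    """
    sync_delay = d_audio - d_video
    if sync_delay > 0:
        # Audio takes longer from capture to playout, so video is ahead:
        # delay the video to match.
        return 0, round(sync_delay * video_rate)
    # Otherwise audio is ahead (or the streams are aligned): delay the audio.
    return round(-sync_delay * audio_rate), 0
```

With an 8 kHz audio clock and a 90 kHz video clock, an audio delay of 120 ms against a video delay of 200 ms yields an audio offset of 640 timestamp units (80 ms) and no video offset. Which stream is actually adjusted in practice is a policy decision; this sketch always delays the stream that is ahead.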

The receiver can choose to adjust the playout of either audio or video, depending on its priorities, the relative delay of the streams, and the relative playout disturbance caused by an adjustment to each stream. With many common codecs, the video encoding and decoding times are the dominant factors, but audio is more sensitive to playout adjustments. In this case it may be appropriate to make an initial adjustment by delaying the audio to match the approximate video presentation time, followed by small adjustments to the video playout point to fine-tune the presentation. The relative priorities and delays may be different in other scenarios, depending on the codec, capture, and playout devices, and each application should make a choice based on its particular environment and delay budget.

The synchronization delay should be recalculated when the playout delay for any of the streams is adjusted because any change in the playout delay will affect the relative delay of the two streams. The offset should also be recalculated whenever a new mapping between media time and reference time, in the form of an RTCP sender report packet, is received. A robust receiver does not necessarily trust the sender to keep jitter out of the mapping provided in RTCP sender report packets, and it will filter the sequence of mappings to remove any jitter. An appropriate filter might track the minimum offset between media time and reference time, to avoid a common implementation problem in which the sender uses the media time of the previous data packet in the mapping, instead of the actual time when the sender report packet was generated.
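One way such a minimum-tracking filter might look is sketched below in Python. The class name `MinOffsetFilter` and its methods are hypothetical. The sketch assumes that the RTP timestamps passed in have already been extended so that they do not wrap, and it ignores clock skew between the media and reference clocks; a long-running receiver would need to handle both, for example by restarting the filter periodically.

```python
class MinOffsetFilter:
    """Track the minimum offset between reference time and media time.

    The offset for one sender report is T_Ssr - M_sr / R, that is, the
    reference time corresponding to media time zero. A sender that puts a
    stale media timestamp into the report makes this sample too large, so
    keeping the minimum discards such samples.
    """

    def __init__(self, rate):
        self.rate = rate      # nominal media clock rate in hertz (R)
        self.offset = None    # smallest offset seen so far, in seconds

    def update(self, m_sr, t_ssr):
        """Feed one sender report mapping; return the filtered offset."""
        sample = t_ssr - m_sr / self.rate
        if self.offset is None or sample < self.offset:
            self.offset = sample
        return self.offset

    def capture_time(self, m):
        """Map an (unwrapped) media timestamp to reference time in seconds."""
        return self.offset + m / self.rate
```

A receiver would call `update` for each sender report and use `capture_time` in place of the direct per-report mapping, so that a single report with a stale media timestamp does not move the playout point.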

A change in the mapping offset causes a change in the playout point and may require either insertion or deletion of media data from the stream. As with any change to the playout point, such changes should be timed with care, to reduce the impact on the media quality. The issues discussed in the section titled Adapting the Playout Point in Chapter 6, Media Capture, Playout, and Timing, are relevant here.