A receiver is expected to determine which media streams should be synchronized, and to align their presentation, on the basis of the information conveyed to it in RTCP packets.
The first part of the process, determining which streams are to be synchronized, is straightforward. The receiver synchronizes those streams that the sender has given the same CNAME in their RTCP source description packets, as described in the previous section. Because RTCP packets are sent every few seconds, there may be a delay between receipt of the first data packet and receipt of the RTCP packet that indicates that particular streams are to be synchronized. The receiver can play out the media data during this time, but it is unable to synchronize them because it doesn't have the required information.
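The grouping step can be sketched as follows. This is an illustrative fragment, not part of any particular RTP implementation; the `(ssrc, cname)` pairs and the function name are assumptions for the example.

```python
from collections import defaultdict

def group_streams_by_cname(sdes_entries):
    """Group RTP streams for synchronization by the CNAME reported in
    RTCP source description (SDES) packets.  `sdes_entries` is a
    hypothetical list of (ssrc, cname) pairs extracted from received
    RTCP packets."""
    groups = defaultdict(list)
    for ssrc, cname in sdes_entries:
        groups[cname].append(ssrc)
    return dict(groups)

# Streams sharing a CNAME are candidates for synchronized playout;
# streams with different CNAMEs are treated independently.
sync_sets = group_streams_by_cname([
    (0x1234, "user@host.example"),   # audio stream
    (0x5678, "user@host.example"),   # video stream from the same sender
    (0x9abc, "other@host.example"),  # unrelated stream
])
```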
More complex is the actual synchronization operation, in which the receiver time-aligns audio and video for presentation. This operation is triggered by the arrival of RTCP sender report packets, which convey the mapping between each stream's media clock and the sender's common reference clock, as illustrated in Figure 7.4.
The first step of lip synchronization is to determine, for each stream to be synchronized, when the media data corresponding to a particular reference time is to be presented to the user.
Figure 7.4. Lip Synchronization at the Receiver
Figure 7.5. Mapping between Timelines to Achieve Lip Synchronization at the Receiver
The receiver first observes the mapping between the media clock and reference clock as assigned by the sender, for each media stream it is to synchronize. This mapping is conveyed to the receiver in periodic RTCP sender report packets, and because the nominal rate of the media clock is known from the payload format, the receiver can calculate the reference clock capture time for any data packet once it has received an RTCP sender report from that source. When an RTP data packet with media timestamp M is received, the corresponding reference clock capture time, T_S (the RTP timestamp mapped to the reference timeline), can be calculated as

    T_S = T_SR + (M - M_SR) / R

where M_SR is the media (RTP) timestamp in the last RTCP sender report packet, T_SR is the corresponding reference clock (NTP) timestamp in seconds (and fractions of a second) from the sender report packet, and R is the nominal media timestamp clock rate in hertz. (Note that this calculation is invalid if more than 2^32 ticks of the media clock have elapsed since the last sender report, because the 32-bit RTP timestamp will have wrapped around more than once.)
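The mapping above can be sketched in code. This is a minimal illustration, assuming 32-bit modular RTP timestamp arithmetic; the function and parameter names are invented for the example.

```python
def rtp_to_reference_time(m, m_sr, t_sr, rate):
    """Map RTP timestamp `m` onto the sender's reference (NTP) timeline:
    T_S = T_SR + (M - M_SR) / R.

    m    -- RTP timestamp of the data packet
    m_sr -- RTP timestamp from the last RTCP sender report
    t_sr -- corresponding NTP time from that report, in seconds
    rate -- nominal media clock rate in hertz

    The 32-bit RTP timestamp wraps, so the difference is taken modulo
    2**32 and interpreted as signed; this is valid as long as fewer
    than 2**31 ticks separate m and m_sr.
    """
    delta = (m - m_sr) & 0xFFFFFFFF
    if delta >= 1 << 31:          # m actually precedes the sender report
        delta -= 1 << 32
    return t_sr + delta / rate

# Example: 8 kHz audio clock, packet timestamped 1,600 ticks (0.2 s)
# after the last sender report, which mapped to NTP time 1000.0 s.
t_s = rtp_to_reference_time(161600, 160000, 1000.0, 8000)  # 1000.2 seconds
```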
The receiver also calculates the presentation time for any particular packet, T_R, according to its local reference clock. This is equal to the RTP timestamp of the packet, mapped to the receiver's reference clock timeline as described earlier, plus the playout buffering delay in seconds and any delay due to the decoding, mixing, and rendering processes. It is important to take into account all aspects of the delay until actual presentation on the display or loudspeaker, if accurate synchronization is to be achieved. In particular, the time taken to decode and render is often significant and should be accounted for as described in Chapter 6, Media Capture, Playout, and Timing.
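The presentation-time calculation is a simple sum of delay components. The sketch below is illustrative; the breakdown into these particular named components is an assumption, and a real receiver would measure or estimate each one for its own pipeline.

```python
def presentation_time(t_mapped, playout_delay, decode_delay, render_delay):
    """Estimate T_R, the instant the media in a packet actually reaches
    the display or loudspeaker, on the receiver's reference clock.

    t_mapped      -- packet's RTP timestamp mapped onto the receiver's
                     reference timeline, in seconds
    playout_delay -- playout (jitter) buffering delay, in seconds
    decode_delay  -- time to decode the frame, in seconds
    render_delay  -- time to mix and render to the output device, seconds
    """
    return t_mapped + playout_delay + decode_delay + render_delay

# Example: 100 ms of buffering plus 20 ms decode and 10 ms render.
t_r = presentation_time(1000.2, 0.100, 0.020, 0.010)  # ~1000.33 seconds
```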
Once the capture and playout times are known according to the common reference timeline, the receiver can estimate the relative delay between media capture and playout for each stream. If data sampled at time T_S according to the sender's reference clock is presented at time T_R according to the receiver's reference clock, the difference between them, D = T_R - T_S, is the relative capture-to-playout delay in seconds. Because the reference clocks at the sender and receiver are not synchronized, this delay includes an offset that is unknown but can be ignored because it is common across all streams and we are interested in only the relative delay between streams.
Once the relative capture-to-playout delay has been estimated for both audio and video streams, a synchronization delay, D = D_audio - D_video, is derived. If the synchronization delay is zero, the streams are synchronized. A nonzero value implies that one stream is being played out ahead of the other, and the synchronization delay gives the relative offset in seconds.
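The two steps above, per-stream capture-to-playout delay and the audio/video difference, can be sketched as follows. The function name and the example numbers are illustrative.

```python
def synchronization_delay(t_s_audio, t_r_audio, t_s_video, t_r_video):
    """Return D = D_audio - D_video, the relative offset between the
    audio and video playout, in seconds.  Each stream's capture-to-
    playout delay (T_R - T_S) includes the same unknown offset between
    the sender's and receiver's clocks, which cancels in the
    subtraction."""
    d_audio = t_r_audio - t_s_audio
    d_video = t_r_video - t_s_video
    return d_audio - d_video  # > 0: audio lags video; < 0: audio is ahead

# Example: audio presented 100 ms after capture, video 140 ms after
# capture (both measured against the common reference timeline), so
# the audio is 40 ms ahead of the video.
offset = synchronization_delay(0.0, 0.100, 0.0, 0.140)  # about -0.040 s
```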
For the media stream that is ahead, the synchronization delay (in seconds) is multiplied by the nominal media clock rate, R, to convert it into media timestamp units, and then it is applied as a constant offset to the playout calculation for that media stream, delaying playout to match the other stream. The result is that packets for one stream reside longer in the playout buffer at the receiver, to compensate for either faster processing elsewhere in that stream's path or a shorter network transit time.
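The unit conversion is a single multiplication; a minimal sketch, continuing the hypothetical 40 ms example with an 8 kHz audio clock:

```python
def playout_offset_ticks(sync_delay_seconds, clock_rate):
    """Convert a synchronization delay in seconds into media timestamp
    units for the stream that is ahead, so it can be added as a
    constant offset to that stream's playout calculation.  Rounded to
    the nearest tick, since RTP timestamps are integers."""
    return round(abs(sync_delay_seconds) * clock_rate)

# Audio (8 kHz media clock) is 40 ms ahead of video: hold each audio
# packet an extra 320 timestamp ticks in the playout buffer.
extra_ticks = playout_offset_ticks(-0.040, 8000)  # 320 ticks
```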
The receiver can choose to adjust the playout of either audio or video, depending on its priorities, the relative delay of the streams, and the relative playout disturbance caused by an adjustment to each stream. With many common codecs, the video encoding and decoding times are the dominant factors, but audio is more sensitive to playout adjustments. In this case it may be appropriate to make an initial adjustment by delaying the audio to match the approximate video presentation time, followed by small adjustments to the video playout point to fine-tune the presentation. The relative priorities and delays may be different in other scenarios, depending on the codec, capture, and playout devices, and each application should make a choice based on its particular environment and delay budget.
The synchronization delay should be recalculated when the playout delay for any of the streams is adjusted, because any change in the playout delay will affect the relative delay of the two streams. The offset should also be recalculated whenever a new mapping between media time and reference time, in the form of an RTCP sender report packet, is received. A robust receiver does not apply each new mapping as an immediate step change, however, because measurement jitter in the reports could otherwise cause spurious playout adjustments; changes in the offset should instead be smoothed.
A change in the mapping offset causes a change in the playout point and may require either insertion or deletion of media data from the stream. As with any change to the playout point, such changes should be timed with care, to reduce the impact on the media quality. The issues discussed in the section titled Adapting the Playout Point in Chapter 6, Media Capture, Playout, and Timing, are relevant here.