9.2 MPEG Transports


MPEG transports are unbounded sequences of bits (i.e., bit-streams) grouped into 188-byte packets, each with a header and a payload section (see Figure 9.2) [MPEG2]. These bits are extracted from a physical modulation such as Vestigial Sideband (VSB), QAM, or QPSK.

Figure 9.2. The relationship between MPEG bits, packets, and PIDs [MPEG2].

Each 188-byte packet carries a packet header that contains a PID. The PID is not a unique identifier of the packet, but rather an identifier of the data stream the packet is part of. For example, packets having the PID value 0x0000 are always assembled, in the order in which they are received, into MPEG sections, tables, and other structures. Generic MPEG equipment enables management at the PID level; that is, it can identify components of the transport by their PID values. More specialized equipment may be able to manipulate the data within a PID, namely, inspect the structure of the data carried within the collection of packets sharing a common PID.
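Since generic MPEG equipment identifies transport components by PID, extracting the PID from the 4-byte packet header is the basic operation. The following sketch follows the MPEG-2 Systems field layout; the sample packet bytes are fabricated for illustration:

```python
# Sketch: extracting the PID from a 188-byte MPEG transport packet.
# Field offsets follow the MPEG-2 Systems layout; the sample packet is fabricated.

def parse_ts_header(packet: bytes) -> dict:
    """Parse the 4-byte transport packet header."""
    assert len(packet) == 188 and packet[0] == 0x47, "bad sync byte"
    pid = ((packet[1] & 0x1F) << 8) | packet[2]        # 13-bit PID
    return {
        "payload_unit_start": bool(packet[1] & 0x40),  # a new section/PES packet starts here
        "pid": pid,
        "continuity_counter": packet[3] & 0x0F,        # 4-bit wrap-around counter
        "is_null": pid == 0x1FFF,                      # Null packets carry no content
    }

# A minimal fabricated packet: sync byte, PID 0x0000 (PAT), padding payload.
pkt = bytes([0x47, 0x40, 0x00, 0x10]) + bytes(184)
hdr = parse_ts_header(pkt)
```

Grouping packets by the returned `pid` value is exactly the PID-level management described above.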

In a broadcast network environment, data is broadcast without interruption, and packets are transmitted regardless of whether or not they carry content. When there is no content to send, a special type of packet is transmitted, called the Null packet, whose PID is always 0x1FFF.

The notion of Null packets is important for understanding transport management. Transport bandwidth is usually measured in bits broadcast per second. Assume that broadcast equipment is configured to transmit 19.2 million bits per second (Mbps), but the content of a single channel requires only 8 Mbps. In this case, for every 5 non-Null packets transmitted, 7 Null packets are transmitted. To transmit two 8-Mbps channels, the transmission includes 2 Null packets for every 5 packets carrying the content of each of the two channels. Although on average, over periods of seconds, the ratios between the packet PIDs should be invariant, the exact sequence of packet PIDs may vary (see Figure 9.3).

Figure 9.3. Example packet sequences for two 8 Mbps channels carried in a 19.2 Mbps transport.
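The Null-packet arithmetic above can be sketched as an exact ratio (using `fractions` to avoid floating-point noise; function name and figures are illustrative):

```python
# Sketch of the Null-packet arithmetic: a 19.2 Mbps transport carrying
# 8 Mbps channels must pad the unused bandwidth with PID 0x1FFF packets.

from fractions import Fraction

def null_ratio(transport_mbps: str, content_mbps: str) -> Fraction:
    """Null packets transmitted per content packet, as an exact ratio."""
    t, c = Fraction(transport_mbps), Fraction(content_mbps)
    return (t - c) / c

one_channel = null_ratio("19.2", "8")    # 7 Null per 5 content packets
two_channels = null_ratio("19.2", "16")  # 2 Null per 10 content packets
```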

In addition to content, packets containing information about the program are transmitted in the PSI tables. The PSI includes the PAT, the PMT, and other tables. The PAT contains a data structure specifying which PIDs carry the PMTs, and each PMT contains a data structure specifying which PIDs carry the video, audio, and data streams that collectively constitute the broadcast program. Various MPEG-based standards place various constraints on the bandwidth and frequency of appearance associated with transmission of MPEG PSI.
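Since the PAT (always carried on PID 0x0000) maps program numbers to the PIDs carrying each PMT, a short sketch of reading those pairs may be helpful. The section layout follows MPEG-2 Systems; the sample bytes (one program, number 1, whose PMT is on PID 0x0100, with a placeholder CRC) are fabricated for illustration:

```python
# Sketch: reading program_number -> PMT PID pairs out of a PAT section.
# Layout per MPEG-2 Systems; the sample bytes below are fabricated.

def parse_pat(section: bytes) -> dict:
    """Map each program_number to the PID carrying its PMT."""
    assert section[0] == 0x00, "not a PAT (table_id must be 0x00)"
    section_length = ((section[1] & 0x0F) << 8) | section[2]
    programs = {}
    # Entries start after the 8-byte header; the last 4 bytes are the CRC32.
    entries = section[8:3 + section_length - 4]
    for i in range(0, len(entries), 4):
        program_number = (entries[i] << 8) | entries[i + 1]
        pid = ((entries[i + 2] & 0x1F) << 8) | entries[i + 3]
        if program_number != 0:            # program 0 points at the network PID
            programs[program_number] = pid
    return programs

# One program (number 1) whose PMT is carried on PID 0x0100.
pat = bytes([0x00, 0xB0, 0x0D, 0x00, 0x01, 0xC1, 0x00, 0x00,
             0x00, 0x01, 0xE1, 0x00,
             0x00, 0x00, 0x00, 0x00])     # placeholder CRC32
```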

Multiplexers decide which transport packets to insert and when. They are responsible for ensuring that the average bandwidth allocated to the PSI and to each program element (i.e., packets having a common PID) is as specified by the applicable standards and content characteristics.

Whereas transport packets are the basic transport encapsulation, the PES is the basic streaming structure, depicted in Figure 9.4; 188-byte transport packets carry segments of PES packets in their payload. PES is used for carrying both video and audio frames in the Data Bytes field of the packet.

Figure 9.4. PES Packet structure.

PES is rarely used for data carriage; DSM-CC (ISO/IEC 13818-6) is used to carry data (see Chapter 11). With PES, synchronization information, as well as other metadata critical for the decoding and rendering of content, is carried in the header of a PES packet. A PES packet can be up to 65,542 bytes long, and thus its carriage may require up to 349 transport packets (each 188 bytes long). The MPEG-2 standard has some peculiar and detailed PES encapsulation rules (see the MPEG standard for details). For example, whereas video frames are required to be aligned with PES packets, audio frames have no such requirement.
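The fixed portion of the PES header described above (start-code prefix, stream id, and a 16-bit length field counting the bytes that follow it) can be sketched as below; the stream-id ranges follow MPEG-2 Systems, and the sample bytes are fabricated:

```python
# Sketch: reading the fixed 6-byte PES packet header.
# Stream-id ranges follow MPEG-2 Systems; the sample bytes are fabricated.

def parse_pes_header(data: bytes) -> dict:
    assert data[:3] == b"\x00\x00\x01", "missing PES start-code prefix"
    stream_id = data[3]
    length = (data[4] << 8) | data[5]   # bytes following this field; 0 = unbounded (video only)
    kind = ("video" if 0xE0 <= stream_id <= 0xEF
            else "audio" if 0xC0 <= stream_id <= 0xDF
            else "other")
    return {"stream_id": stream_id, "length": length, "kind": kind}

hdr = parse_pes_header(b"\x00\x00\x01\xE0\x00\x00")   # video stream, unbounded length
```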

9.2.1 Audio Content

Sound consists of pressure differences in air. When picked up by a microphone and fed through an amplifier, these become voltage levels. The voltage is sampled a number of times per second and converted into digital form using analog-to-digital conversion. CD-audio quality requires sampling 44,100 times per second, with each sample having a resolution of 16 bits. Recording in stereo thus requires about 1.4 Mbits per second, raising the need for compression.
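The CD-audio figure follows directly from the sampling parameters (44,100 samples/s, 16 bits per sample, 2 channels):

```python
# The CD-audio arithmetic above: sample rate x bits per sample x channels.

def pcm_bitrate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Uncompressed PCM bitrate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

cd_stereo = pcm_bitrate(44_100, 16, 2)   # 1,411,200 bits/s, i.e. ~1.4 Mbps
```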

When delivered using MPEG transports, audio data is carried in PES packets and broken into frames. It is synchronized using a portion of the frame header called the frame sync. Synchronization between audio and video may depend on the performance of the video and audio decoders (see the later discussion of the PTS_DTS_flag in the PES header). To achieve smooth performance, both encoder and decoder must comply with the MPEG delivery contract called the Transport stream System Target Decoder (T-STD) buffer model.

One critical aspect of media streaming, and a key weakness of Internet-based streaming, is the real-time presentation of content. The concept of real-time is often mistaken for the concept of fast, but the two concepts are quite different. Real-time refers to timing guarantees regardless of performance. For example, there needs to be a synchronization guarantee constraining the difference between the reception times of video frame X and audio frame Y to be less than some dt seconds. In particular, every MPEG packet needs to be received and decoded before the time specified by the DTS and rendered onto a display at the time specified by the PTS. The Internet, in which TCP/IP (rather than, e.g., UDP/IP) is the most popular protocol, is not able to provide such guarantees regardless of the bandwidth available between the emitter and the receiver.
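MPEG PTS and DTS values count ticks of a 90 kHz clock, so the dt-style synchronization guarantee above can be sketched as a simple comparison. The 40 ms default threshold below is an illustrative value, not one mandated by any standard:

```python
# Sketch: MPEG presentation/decoding time stamps (PTS/DTS) count ticks of a
# 90 kHz clock; the check below mirrors the |video - audio| < dt guarantee.

PTS_CLOCK_HZ = 90_000

def pts_to_seconds(pts_ticks: int) -> float:
    return pts_ticks / PTS_CLOCK_HZ

def in_sync(video_pts: int, audio_pts: int, dt_seconds: float = 0.040) -> bool:
    """True when the two streams are within dt seconds of each other.
    The 40 ms default is an illustrative threshold, not a standard value."""
    return abs(pts_to_seconds(video_pts) - pts_to_seconds(audio_pts)) < dt_seconds

ok = in_sync(video_pts=900_000, audio_pts=901_800)   # 1,800 ticks = 20 ms apart
```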

9.2.2 IP-based Media Streaming

RTSP and RTP together specify the client-server interaction necessary to stream multimedia presentations.

9.2.2.1 RTSP

RTSP, specified in RFC 2326, is a client-server multimedia presentation control protocol, designed for efficient multimedia streaming over IP networks [RTSP]. It is not concerned with the actual delivery of streams; that is achieved by RTP (see below). It leverages existing Web infrastructure (e.g., utilizing authentication and PICS from HTTP) and works well both for large audiences and for single-viewer media on demand. The protocol was jointly developed by RealNetworks, Netscape Communications, and Columbia University within the Multiparty Multimedia Session Control (MMUSIC) working group of the Internet Engineering Task Force (IETF). In April 1998, RTSP was published as a Proposed Standard by the IETF.

RTSP is designed to work with time-based media, such as streaming audio and video, as well as any application where time-based delivery is essential. It has mechanisms for time-based seeks into media clips, and has compatibility with many time stamp formats, such as SMPTE time codes [TIME-CODE]. In addition, RTSP is designed to control multicast delivery of streams and is ideally suited to full multicast solutions. It also supports multicast-unicast hybrid solutions for heterogeneous networks like the Internet.

The protocol is intentionally similar in syntax and operation to HTTP/1.1 so that extension mechanisms to HTTP can, in most cases, also be added to RTSP. However, RTSP differs in a number of important aspects from HTTP:

  • RTSP introduces a number of new methods and has a different protocol identifier.

  • An RTSP server needs to maintain state by default in almost all cases, as opposed to the stateless nature of HTTP.

  • Both an RTSP server and client can issue requests.

  • Data is carried out-of-band by a different protocol (with some exceptions).

  • RTSP is defined to use ISO 10646 (UTF-8) rather than ISO 8859-1, consistent with current HTML internationalization efforts.

  • The Request-URI always contains the absolute URI; for backward compatibility with a historical blunder, HTTP/1.1 carries only the absolute path in the request and puts the host name in a separate header field.

The RTSP protocol supports the following operations:

  • Retrieval of media from a media server: The client can request a presentation description via HTTP or some other method. If the presentation is being multicast, the description contains the multicast addresses and ports to be used for the continuous media. If the presentation is to be sent only to the client via unicast, the client provides the destination, for security reasons.

  • Invitation of a media server to a conference: A media server can be invited to join an existing conference, either to play back media into the presentation or to record all or a subset of the media in a presentation. This mode is useful for distributed teaching applications.

  • Addition of media to an existing presentation: Particularly for live presentations, it is useful if the server can tell the client about additional media becoming available.

Each presentation and media stream may be identified by an RTSP URL. The overall presentation and the properties of the media that make up the presentation are defined by a presentation description file, the format of which is outside the scope of the RTSP specification. The presentation description file may be obtained by the client using HTTP or other means, such as e-mail, and need not necessarily be stored on the media server.

The presentation description file contains a description of the media streams making up the presentation, including their encoding, language, and other parameters that enable the client to choose the most appropriate combination of media. In this presentation description, each media stream that is individually controllable by RTSP is identified by an RTSP URL, which points to the media server handling that particular media stream and names the stream stored on that server. Several media streams can be located on different servers; for example, audio and video streams can be split across servers for load sharing. The description also enumerates which transport methods the server is capable of.

To correlate RTSP requests with a stream, RTSP servers need to maintain session state, whose transitions are described in RFC 2326 and reproduced in Table 9.1. The following commands are central to the allocation and use of stream resources on the server:

  • Setup: Causes the server to allocate resources for a stream and start an RTSP session.

  • Play and Record: Start data transmission on a stream allocated via SETUP.

  • Pause: Temporarily halts a stream without freeing server resources.

  • Teardown: Frees resources associated with the stream. The RTSP session ceases to exist on the server.

Note that the RTSP scheme requires that these commands be issued via a protocol that provides delivery guarantees, such as TCP. Despite what its name may imply, RTSP does not provide real-time delivery guarantees (it is possible to provide either delivery guarantees or real-time guarantees, but not both). For real-time guarantees, the RTP protocol is used.

Table 9.1. RTSP Session State Transitions

state        message sent    state after response
Init         Setup           Ready
             Teardown        Init
Ready        Play            Playing
             Record          Recording
             Teardown        Init
             Setup           Ready
Playing      Pause           Ready
             Teardown        Init
             Play            Playing
             Setup           Playing (changed transport)
Recording    Pause           Ready
             Teardown        Init
             Record          Recording
             Setup           Recording (changed transport)
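The transitions of Table 9.1 can be sketched as a simple lookup table; for brevity, the "(changed transport)" variants collapse to the state itself:

```python
# Sketch of Table 9.1 (RFC 2326) as a transition map: (state, request) -> next state.
# "(changed transport)" variants are collapsed to the state itself for brevity.

TRANSITIONS = {
    ("Init", "SETUP"): "Ready",
    ("Init", "TEARDOWN"): "Init",
    ("Ready", "PLAY"): "Playing",
    ("Ready", "RECORD"): "Recording",
    ("Ready", "SETUP"): "Ready",
    ("Ready", "TEARDOWN"): "Init",
    ("Playing", "PAUSE"): "Ready",
    ("Playing", "PLAY"): "Playing",
    ("Playing", "SETUP"): "Playing",
    ("Playing", "TEARDOWN"): "Init",
    ("Recording", "PAUSE"): "Ready",
    ("Recording", "RECORD"): "Recording",
    ("Recording", "SETUP"): "Recording",
    ("Recording", "TEARDOWN"): "Init",
}

def next_state(state: str, request: str) -> str:
    return TRANSITIONS[(state, request)]

# A typical session: SETUP -> PLAY -> PAUSE -> TEARDOWN ends back in Init.
state = "Init"
for req in ("SETUP", "PLAY", "PAUSE", "TEARDOWN"):
    state = next_state(state, req)
```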

9.2.2.2 RTP

Complementing RTSP is RTP, both an IETF standard, RFC 1889 [RTP], and an International Telecommunication Union (ITU) standard, H.225.0. As opposed to RTSP, RTP (delivered on top of UDP rather than TCP) does provide real-time delivery guarantees. It is a packet format for multimedia data streams. RTP is used in conjunction with many standard protocols, such as RTSP, H.323, the Session Initiation Protocol (SIP) for IP telephony applications, and the Session Announcement Protocol (SAP) over the Session Description Protocol (SDP) for pure multicast applications.

An RTP session is the association among a set of participants communicating with RTP. For each participant, the session is defined by a transport address, that is, a pair of a network IP address plus a port. The destination transport address may be common to all participants, as in the case of IP multicast, or may be different for each, as in the case of individual unicast.

The Real-Time Control Protocol (RTCP) is part of RTP and helps with synchronization and QoS management. RTCP does not deliver media data; instead, participants periodically exchange RTCP reports carrying reception-quality feedback and the timing information needed to synchronize streams. RTP and RTCP work well together, but RTP can be used without RTCP; the RTP specification contains a section on the use of RTP with RTCP.

RTP relies on a Synchronization SouRCe (SSRC), the source of a stream of RTP packets, identified by a 32-bit numeric SSRC identifier that is chosen randomly and is unique within an RTP session. The identifier is carried in the RTP header so as not to be dependent on the network address. All packets from a synchronization source form part of the same timing and sequence number space, so a receiver groups packets by synchronization source for playback. A synchronization source may change its data format (e.g., audio encoding) over time. A participant need not use the same SSRC identifier for all the RTP sessions in a multimedia session; the binding of the SSRC identifiers is provided through RTCP. If a participant generates multiple streams in one RTP session (e.g., from separate video cameras), each must be identified as a different SSRC.

In contrast, a Contributing SouRCe (CSRC) is the source of a stream of RTP packets that has contributed to the combined stream produced by an RTP mixer. The mixer inserts into the RTP header of each outgoing packet a list of the SSRC identifiers of the sources that contributed to the generation of that packet. This list is called the CSRC list. An example application is audio conferencing, where a mixer indicates all the talkers whose speech was combined to produce the outgoing packet, allowing the receiver to identify the current talker even though all the audio packets carry the same SSRC identifier (that of the mixer).
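The SSRC and CSRC list sit at fixed offsets in the RTP header (the CC field in the first byte gives the CSRC count), so extracting them is straightforward. The layout below follows RFC 1889; the sample packet bytes are fabricated for illustration:

```python
# Sketch: pulling the SSRC and CSRC list out of an RTP fixed header
# (RFC 1889 layout); the sample packet bytes are fabricated.

import struct

def parse_rtp_header(pkt: bytes) -> dict:
    version = pkt[0] >> 6
    assert version == 2, "not RTP version 2"
    csrc_count = pkt[0] & 0x0F                 # CC field: number of CSRC entries
    payload_type = pkt[1] & 0x7F
    seq, timestamp, ssrc = struct.unpack(">HII", pkt[2:12])
    csrcs = list(struct.unpack(f">{csrc_count}I",
                               pkt[12:12 + 4 * csrc_count])) if csrc_count else []
    return {"pt": payload_type, "seq": seq, "timestamp": timestamp,
            "ssrc": ssrc, "csrcs": csrcs}

# Mixer output: SSRC 0x11111111 with one contributing source 0x22222222.
pkt = (bytes([0x81, 0x60])                       # version 2, CC=1, payload type 96
       + struct.pack(">HII", 7, 1000, 0x11111111)
       + struct.pack(">I", 0x22222222))
hdr = parse_rtp_header(pkt)
```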

An RTP mixer (roughly equivalent to a re-multiplexer in the MPEG world) is an intermediary that receives RTP packets from one or more sources, possibly changes the data format, combines the packets in some manner, and then forwards a new RTP packet. Because the timing among multiple input sources is not generally synchronized, the mixer makes timing adjustments among the streams and generates its own timing for the combined stream. This enables all data packets originating from a single mixer to be identified as having that mixer as their synchronization source.

In contrast, an RTP Translator is an intermediary that forwards RTP packets with their SSRC identifier intact. Examples of translators include devices that convert encodings without mixing, replicators from multicast to unicast, and application level filters in firewalls.

9.2.2.3 ITU H.323

H.323 is an ITU conferencing framework standard [H323]. It is used for peer-to-peer, two-way delivery of audio and video telephony data. It is designed to interface both with standard telephones and with Internet telephony gateways.

H.323 and RTSP are complementary in function. H.323 is best for setting up conferences in moderately sized peer-to-peer groups, whereas RTSP is best for large-scale broadcasts and VOD. One could view H.323 as offering services equivalent to a telephone with three-way calling, whereas RTSP offers services like a video store with delivery service.

Both H.323 and RTSP use RTP as their standard means of actually delivering the multimedia data. This data-level compatibility makes efficient gateways between the protocols possible, since only control messages need to be translated.



ITV Handbook: Technologies and Standards
ISBN: 0131003127
Year: 2003