9.5 MPEG-2 Video Streaming


As is the case for audio, video streaming requires capture, transmission, and rendering. The term video often refers to the combination of moving pictures and sound, both encoded as elementary streams carried within PES packets. The pictures are associated with the sound through time stamps. The conversion from the elementary stream format into a transport format, typically performed by a multiplexer, essentially breaks down the data, carried in large data structures or packets, into small transport packets. In the case of MPEG, the PES packets carrying the audio and video streams are broken down into 188-byte transport packets, each carrying a 4-byte header and 184 bytes of data [MPEG2]. Finally, to enable transmission, the digital transport stream is modulated onto an RF signal (see Figure 9.14).
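
As a rough illustration of this packetization step, the following Python sketch splits a PES packet payload into 188-byte transport packets, each with a minimal 4-byte header. The header layout shown (sync byte, PID, continuity counter) is simplified, and the PID value is an arbitrary example chosen for the sketch; a real multiplexer also handles adaptation fields and stuffing rules defined in [MPEG2].

```python
def packetize_pes(pes_bytes: bytes, pid: int) -> list[bytes]:
    """Split a PES packet into 188-byte MPEG-2 transport packets.

    Simplified sketch: a real multiplexer also inserts adaptation
    fields, PCR samples, and proper stuffing of the final packet.
    """
    TS_PACKET_SIZE = 188
    HEADER_SIZE = 4
    PAYLOAD_SIZE = TS_PACKET_SIZE - HEADER_SIZE   # 184 bytes of data

    packets = []
    continuity = 0
    for offset in range(0, len(pes_bytes), PAYLOAD_SIZE):
        chunk = pes_bytes[offset:offset + PAYLOAD_SIZE]
        # Pad the last chunk so every transport packet is exactly 188 bytes.
        chunk = chunk.ljust(PAYLOAD_SIZE, b"\xff")

        pusi = 1 if offset == 0 else 0            # payload_unit_start_indicator
        header = bytes([
            0x47,                                 # sync byte
            (pusi << 6) | ((pid >> 8) & 0x1F),    # PUSI + PID high bits
            pid & 0xFF,                           # PID low bits
            0x10 | (continuity & 0x0F),           # payload only + continuity counter
        ])
        packets.append(header + chunk)
        continuity = (continuity + 1) % 16
    return packets

ts_packets = packetize_pes(b"\x00" * 1000, pid=0x101)
print(len(ts_packets), len(ts_packets[0]))        # 6 packets, 188 bytes each
```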

Figure 9.14. A simplified typical video transmission scenario.

Reception of a video stream follows the reverse path (see Figure 9.14). The signal is demodulated using reception equipment, demultiplexed, and decoded to produce a signal that can be rendered. Note that transmission of the signal may involve several stages, in which a demultiplexer-multiplexer or receiver-transmitter pair is used.

Similar to audio, the two basic parameters for video capture are again sample rate and resolution. As opposed to audio, however, video samples are two-dimensional, giving rise to a wide range of issues not present in one dimension. The basic sample area unit is the pixel, which is typically sampled at several locations, and the results of the samples are averaged. Because the human eye is more sensitive to intensity than to color, when capturing each pixel, four Y samples are averaged, as opposed to a single U and a single V sample (see Figure 9.15).

Figure 9.15. An example technique for sampling a single video pixel.
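
The sketch below mimics this idea for a dense grid of source samples, assuming the averaging scheme suggested by Figure 9.15: for each output pixel, four luminance (Y) samples are averaged, while only one U and one V sample are kept. The array shapes and the choice of which chroma sample is kept are illustrative assumptions, not the normative MPEG-2 sampling filter.

```python
import numpy as np

def capture_pixels(y_samples: np.ndarray, u_samples: np.ndarray, v_samples: np.ndarray):
    """Illustrative capture of a pixel grid in the spirit of Figure 9.15.

    For each output pixel, four Y samples (a 2 x 2 group) are averaged,
    while only one U and one V sample per pixel are retained (here the
    top-left sample of each group, an assumption made for the sketch).
    """
    h2, w2 = y_samples.shape                            # dense luminance grid
    y = y_samples.reshape(h2 // 2, 2, w2 // 2, 2).mean(axis=(1, 3))
    u = u_samples[::2, ::2]                             # one chroma sample per pixel
    v = v_samples[::2, ::2]
    return y, u, v

def rand_plane(shape=(960, 1440)) -> np.ndarray:
    """Random 8-bit sample plane used only to exercise the sketch."""
    return np.random.randint(0, 256, shape).astype(float)

y, u, v = capture_pixels(rand_plane(), rand_plane(), rand_plane())
print(y.shape, u.shape, v.shape)                        # (480, 720) for each plane
```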

A summary of sample rate and resolution combinations defined by various standards is given in Table 9.4. To achieve HDTV quality, a matrix of 1080 interlaced rows of 1920 pixels (DVD uses 480 rows) needs to be sampled 60 times per second (two fields of 540 lines, each sampled 30 times per second), where each sampled pixel contains 24 bits. Without appropriate compression, such data is not practical, as it amounts to 540 x 1920 x 24 x 60 bits per second! State-of-the-art compression techniques are able to deliver HDTV video at less than 19.2 Mbps, a common DTV broadcast bandwidth.
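
For concreteness, the short calculation below works out the raw rate quoted above and the compression factor implied by a 19.2 Mbps broadcast channel; the figures are simply those given in the text.

```python
# Raw HDTV rate from the text: two interlaced fields of 540 lines,
# 1920 pixels per line, 24 bits per pixel, 60 fields per second.
raw_bps = 540 * 1920 * 24 * 60
print(f"raw rate   : {raw_bps / 1e6:.0f} Mbps")          # ~1493 Mbps

broadcast_bps = 19.2e6                                    # common DTV channel
print(f"compression: {raw_bps / broadcast_bps:.0f}:1")    # roughly 78:1
```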

Table 9.4. Summary of Video Format Standards

Standard                  Resolution                       Frames Per Second
Film                      -                                24/23.98
Analog NTSC               720/704/640 x 480/486/512        29.97/30/59.94/60
Analog PAL                720/704/640 x 576/612            50/25
Analog ITU-R BT.601-4     864/858/720 x 625i/525i/483i     50/30/25
HDTV SMPTE 260M           1920 x 1035i (interlaced)        30/29.97
HDTV SMPTE 296M           1280 x 720p (progressive)        60/59.94
HDTV SMPTE 274M           1920 x 1080i (interlaced)        60/59.94/30/29.97/25/24/23.98
HDTV SMPTE 274M           1920 x 1080p (progressive)       30/29.97/25/24

The ITU-R Digital Terrestrial Television Broadcasting Task Group 11/3 and the ATSC DTV Standard A53B [DTV] specify a model in which the digital television system can be seen to consist of three subsystems:

  • Coding and compression: This refers to the bitrate reduction methods, also known as data compression, appropriate for application to the video, audio, and ancillary digital data streams. The ancillary data input includes control data, conditional access control data, and data associated with the program audio and video services, such as closed captioning. Ancillary data can also refer to independent program services. The purpose of the coder is to minimize the number of bits needed to represent the audio and video information. The digital television system employs the MPEG-2 video stream syntax for the coding of video and the AC-3 standard for the coding of audio.

  • Service multiplex and transport: This refers to the means of dividing the digital data stream into packets of information, the means of uniquely identifying each packet or packet type, and the appropriate methods of multiplexing video data stream packets, audio data stream packets, and ancillary data stream packets into a single data stream. In developing the transport mechanism, interoperability among digital media, such as terrestrial broadcasting, cable distribution, satellite distribution, recording media, and computer interfaces, was a prime consideration. The digital television system employs the MPEG-2 transport stream syntax for the packetization and multiplexing of video, audio, and data signals for digital broadcasting systems. The MPEG-2 transport stream syntax was developed for applications where channel bandwidth or recording media capacity is limited and the requirement for an efficient transport mechanism is paramount. It was also designed to facilitate interoperability with the Asynchronous Transfer Mode (ATM) transport mechanism.

  • RF and transmission: This refers to channel coding and modulation. The channel coder takes the data bitstream and adds additional information that can be used by the receiver to reconstruct the data from the received signal, which, due to transmission impairments, may not accurately represent the transmitted signal. The modulation (or physical layer) uses the digital data stream information to modulate the transmitted signal. The North American (and South Korean) modulation subsystem offers two VSB transmission modes: a terrestrial broadcast mode (8 VSB) and a high data rate mode (16 VSB). In Europe, Coded Orthogonal Frequency Division Multiplexing (COFDM) is used.

9.5.1 Video Stream Structure

An MPEG video stream contains a video elementary stream encapsulated in PES packets, each typically carried in about 348 transport stream packets. The structure of PES packets is complex and requires multilevel decoding. The PES header contains information about the content of the PES packet data bytes. PES packets can be of variable length, up to 64 KB. Key fields in this structure are the PTS and DTS, which allow the decoder to correctly synchronize the reconstruction of the video frames (see the description later in this chapter).
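
As an illustration of how the time stamps are carried, the sketch below extracts the 33-bit PTS from the optional PES header fields. It assumes the buffer starts at the PES start code and does minimal validation; a real demultiplexer must also check the full flag set and handle the DTS, which uses the same bit layout.

```python
def parse_pts(pes: bytes) -> int | None:
    """Extract the 33-bit Presentation Time Stamp from a PES header.

    Sketch only: assumes `pes` begins with the 00 00 01 start code and
    performs no further validation of the optional header fields.
    """
    if pes[0:3] != b"\x00\x00\x01":
        raise ValueError("not a PES packet")
    flags = pes[7]                      # PTS_DTS_flags are the top two bits
    if not (flags & 0x80):
        return None                     # no PTS present
    p = pes[9:14]                       # five bytes carrying PTS plus marker bits
    return (((p[0] >> 1) & 0x07) << 30 |
            p[1] << 22 |
            ((p[2] >> 1) & 0x7F) << 15 |
            p[3] << 7 |
            ((p[4] >> 1) & 0x7F))

# The PTS/DTS count 90 kHz ticks: a PTS of 900000 is 10 seconds of media time.
```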

9.5.1.1 Discrete Cosine Transform (DCT)

The PES packet data bytes contain a sequence of still pictures or frames that, without compression, would require a data rate far too high. To achieve compression, each frame is divided into an array of macroblocks, each 16 x 16 pixels in size and comprising 4 blocks of Y (luminance), and 1 block each of U and V (color) information. The color information therefore has half the horizontal and vertical resolution of the luminance information. The Y, U and V information in each macroblock is compressed using DCT encoding and motion compensation.

A waveform can be represented as a weighted sum of cosines whose frequencies are multiples of a base frequency. The DCT of a signal is the list of those weights. The DCT is central to many kinds of signal processing, especially image and video compression. Given data A(i), where i is an integer in the range 0 to N-1, the forward DCT (which would be used by an encoder) is:

B(k) = sum_{i=0}^{N-1} A(i) cos( pi k (2i + 1) / (2N) )

where B(k) is defined for all values of the frequency-space variable k, but only values in the range 0 to N-1 are meaningful. The inverse DCT (which would be used, e.g., by a decoder) is:

AA(i) = (1/N) sum_{k=0}^{N-1} (2 - delta(k,0)) B(k) cos( pi k (2i + 1) / (2N) )

where delta(k,0) is the Kronecker delta. Mathematically, this transform pair is exact, i.e., AA(i) = A(i), resulting in lossless coding; compression occurs only when some of the coefficients are approximated.

The main difference between the DCT and the discrete Fourier transform (DFT) is that the DFT assumes the data A(i) is periodically continued with a period of N, whereas the DCT assumes that the data is continued with its mirror image and then periodically continued with a period of 2N. A fast DCT algorithm is analogous to the Fast Fourier Transform (FFT).
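
A direct, unoptimized implementation of the transform pair above is shown below. It follows the formulas literally (including the 1/N and (2 - delta) factors) and is intended only to demonstrate that the pair reconstructs the input; it is not a fast DCT.

```python
import math

def dct(a):
    """Forward DCT: B(k) = sum_i A(i) cos(pi k (2i+1) / (2N))."""
    n = len(a)
    return [sum(a[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i in range(n))
            for k in range(n)]

def idct(b):
    """Inverse DCT: A(i) = (1/N) sum_k (2 - delta(k,0)) B(k) cos(pi k (2i+1) / (2N))."""
    n = len(b)
    return [sum((2 - (1 if k == 0 else 0)) * b[k]
                * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for k in range(n)) / n
            for i in range(n)]

a = [52, 55, 61, 66, 70, 61, 64, 73]        # one row of 8 luminance samples
print([round(x, 6) for x in idct(dct(a))])  # reproduces the input exactly
```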

9.5.1.2 Interlacing Fields

One of the key goals of the MPEG-2 standard was to introduce support for interlaced video sources. The MPEG-1 standard was targeted at bitrates around 1.5 Mbps. Therefore, it was assumed that the source video signal would be digitized at around 352 x 240 for 60 Hz systems (e.g., in the U.S.) and 352 x 288 for 50 Hz systems (e.g., in Europe). Standard video signals carry twice as many scan lines as these sampling resolutions, with an interlaced scanning order. Therefore, the simplest way of creating a half-size digital picture was to sample only one field from each frame; the other field was always discarded. Because only one field from every frame is used, the sampled fields form a progressively-scanned video sequence. MPEG-1 therefore addressed the coding parameters and algorithms for progressively-scanned sequences only. It should be noted, however, that the syntax of the MPEG-1 standard does not constrain the bitrate or the picture size to these values.

In MPEG-2, the term picture refers to either a frame or a field. Therefore, a coded representation of a picture may be reconstructed to a frame or a field. During the encoding process, the encoder has a choice of coding a frame as one frame picture or as two field pictures. If the encoder decides to code the frame as field pictures, each field is coded independently of the other; that is, the two fields are coded as if they were two different pictures, each with one-half the vertical size of a frame.

9.5.1.3 Motion Prediction Based Compression

In frame pictures, each macroblock can be predicted (using motion compensation) on a frame or field basis. Frame-based prediction uses one motion vector per direction (forward or backward) to describe the motion relative to the reference frame. In contrast, field-based prediction uses two motion vectors, one from an even field and the other from an odd field. Therefore, there can be up to four vectors per macroblock (two per direction, forward and backward). In field pictures, the prediction is always field-based, but the prediction may be relative to either an even or an odd reference field.
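
The following sketch illustrates the idea behind motion-compensated prediction with a brute-force block match: for one 16 x 16 macroblock it searches a small window in the reference frame for the best-matching block and returns the corresponding motion vector. The sum-of-absolute-differences cost and the search range are conventional choices made for the illustration; real encoders use far more elaborate (and faster) search strategies, and MPEG-2 additionally allows half-pel vectors and the field-based prediction described above.

```python
import numpy as np

def motion_vector(ref: np.ndarray, cur: np.ndarray, top: int, left: int,
                  search: int = 8) -> tuple[int, int]:
    """Brute-force motion estimation for one 16 x 16 macroblock.

    Returns the (dy, dx) displacement into the reference frame that
    minimizes the sum of absolute differences (SAD). Illustrative only.
    """
    block = cur[top:top + 16, left:left + 16].astype(int)
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + 16 > ref.shape[0] or x + 16 > ref.shape[1]:
                continue                           # candidate falls outside the frame
            sad = np.abs(ref[y:y + 16, x:x + 16].astype(int) - block).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dy, dx)
    return best_mv

ref = np.random.randint(0, 256, (64, 64))
cur = np.roll(ref, shift=(2, 3), axis=(0, 1))      # current frame = reference shifted by (2, 3)
print(motion_vector(ref, cur, top=16, left=16))    # (-2, -3): the block is found up/left in ref
```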

9.5.1.4 Frame Types

This motion-prediction based compression scheme gives rise to the following frame types:

  • I-frames: Intra-coded frames use DCT encoding to compress a single frame without reference to any other frame in the sequence. Typically, I-frames are encoded with 2 bits per pixel on average. The initial data comprises 4 bytes of Y, 1 byte of U and 1 byte of V (a total of 6 bytes = 48 bits) per pixel, which implies a compression ratio of 24:1. I-frames are inserted every 12-15 frames and are used to start a sequence; because decoding can start only at an I-frame, they allow video to be played from random positions and enable fast forward or reverse.

  • P-frames: Predicted frames are coded as differences from the last I- or P-frame. The new P-frame is first predicted by taking the last I- or P-frame and predicting the values of each new pixel. P-frames use motion prediction and DCT encoding. As a result, P-frames provide a better compression ratio than I-frames, depending on the amount of motion present. The differences between the predicted and actual values are encoded. Most prediction errors are small, because pixel values do not change greatly within a small area; therefore, the error values compress better than the values themselves. Quantization of the prediction errors further reduces the information.

  • B-frames: Bidirectional frames are coded as differences from the previous or next I- or P-frame. B-frames use prediction as for P-frames, except that for each macroblock either the previous or the next I- or P-frame may be used as the reference. B-frames use motion prediction and DCT encoding, and require both the previous and the subsequent reference frame for correct decoding. This gives improved compression compared with P-frames, because it is possible to choose for every macroblock whether the previous or the next frame is used for comparison. Therefore, the order of transmission and reception of MPEG frames is not the same as the decoding and display order.

The decoding of a new video sequence starts by locating an I-frame, which is therefore often referred to as the first frame. I-frames are compressed using only information in the picture itself, excluding motion information.

Storing differences between frames yields a massive reduction in the amount of information needed to reproduce the sequence. An I-frame is therefore followed by one or more P-frames. The P-frame data is based on its preceding I-frame. For example, in the case of a moving car, the P-frame specifies how the position of the car has changed since the previous I-frame. The description of changes requires a fraction of the space that would be required for encoding the entire frame. Shape or color changes are also encoded in the P-frame. Each subsequent P-frame is based on its predecessor. Because a small margin of error creeps in with each P-frame, only a few P-frames are allowed before a new I-frame is introduced into the sequence as a new reference point.

Between I-frames and P-frames are B-frames, which are based on the nearest I-frame or P-frame both before and after them. In the moving car example, the B-frame stores the difference between the car's image in the previous I-frame or P-frame and in the following I-frame or P-frame. To recreate the B-frame when playing back the sequence, the MPEG decoder uses a combination of the two references. There may be a number of B-frames between I-frames or P-frames; no other frame is based on a B-frame. Typically, there are two or three B-frames between I- or P-frames, and perhaps three to five P-frames between subsequent I-frames. A shorthand description uses the letters I, B, and P to form a sequence string, such as IBBPBBP or IBPBPBPBP. The former is more difficult to encode but provides a higher compression ratio than the latter (for details see [MPEG2]).
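
Because B-frames reference a future anchor frame, that anchor must be transmitted (and decoded) before the B-frames that depend on it. The small sketch below reorders a display-order GOP string such as IBBPBBP into the order in which the frames would typically be sent; the reordering rule shown (each anchor frame is moved ahead of the run of B-frames that precedes it) is the conventional one, simplified here to a single closed GOP.

```python
def transmission_order(display_order: str) -> str:
    """Reorder a display-order GOP string into transmission/decode order.

    Simplified sketch for a single closed GOP: every I or P (anchor)
    frame is emitted before the run of B-frames that precedes it in
    display order.
    """
    out, pending_b = [], []
    for frame in display_order:
        if frame == "B":
            pending_b.append(frame)   # hold B-frames until their next anchor arrives
        else:
            out.append(frame)         # anchor (I or P) goes out first...
            out.extend(pending_b)     # ...followed by the deferred B-frames
            pending_b = []
    return "".join(out + pending_b)

print(transmission_order("IBBPBBP"))  # IPBBPBB
```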

9.5.1.5 Encoding

Video encoders are expected to process various types of input formats (see Table 9.5). There are two domains within the encoder in which a related set of frequencies is used: the source coding domain and the channel coding domain. The source coding domain, represented schematically by the video, audio, and transport encoders, uses a family of frequencies based on a 27 MHz clock. The channel coding domain is represented by the FEC/Sync Insertion subsystem and the VSB modulator (see ATSC A53B for details).

Table 9.5. Summary of Standard Video Formats

Standard                  Resolution                       Frames Per Second
Film                      -                                24/23.98
Analog NTSC               720/704/640 x 480/486/512        29.97/30/59.94/60
Analog PAL                720/704/640 x 576/612            50/25
Analog ITU-R BT.601-4     864/858/720 x 625i/525i/483i     50/30/25
HDTV SMPTE 260M           1920 x 1035i (interlaced)        30/29.97
HDTV SMPTE 296M           1280 x 720p (progressive)        60/59.94
HDTV SMPTE 274M           1920 x 1080i (interlaced)        60/59.94/30/29.97/25/24/23.98
HDTV SMPTE 274M           1920 x 1080p (progressive)       30/29.97/25/24

9.5.1.6 Decoding

When decoding streaming video, the overall concatenation of chunks over time must be consistent with the combinations of syntactic elements described in ISO/IEC 13818-2 in order to form a legal MPEG-2 video stream. Typically, MPEG file decoders are expected to fully decode I-frames, but are not always expected to decode P-frames and B-frames. Some standards (e.g., MHP) also require support for an MPEG-2 video drip-feed mode that requires handling I-frames and P-frames (but not B-frames).

9.5.2 Video On Demand (VOD)

VOD systems consist of server and client software applications. VOD client applications run on the digital set-top box as native applications, or they can be downloaded when the subscriber accesses the VOD application via the EPG home page.

9.5.2.1 Architecture

The VOD architecture is simple (see Figure 9.16). To select the movie to be viewed, the client interacts with the head-end through a cable connection, or with a Web server through an ISP connection. Once the movie is selected, an RTSP session is established against a file containing the transport stream packets for the selected video. The packets are transferred to the client over RTP or MPEG-2 transport.

Figure 9.16. A typical VOD architecture.

Typically, a VOD service requires the coordination of a broadcast service with an interactive service. A management server (i.e., program scheduler, configuration and billing manager) coordinates the broadcaster and the ISP. Viewers select the movie to view using an interactive guide application delivered through the interaction network, which is typically IP-based (e.g., using the HTTP and RTSP protocols) but could also be DSM-CC based. Once a movie is selected, the media for that movie is delivered from a video streaming server. Receivers interface with both the interactive and broadcast networks and use local buffering and disk storage as needed.

The relationship between the Multiple (cable) System Owner (MSO), or broadcast service provider, and the ISP is depicted in Figure 9.17. The client set-top box has a unidirectional video-in interface and a bidirectional IP-based interface. The video interface connects to a downstream network typically controlled by an MSO. The bidirectional interaction interface connects to an IP-based network typically maintained by a number of ISPs and backbone infrastructure organizations. The iTV broadcaster is responsible for coordinating both the unidirectional broadcast and the interaction channel.

Figure 9.17. A simplified VOD architecture; typically either RTP or DSM-CC is used, not both.

The movie selection process can be generalized and simplified into four steps (see Figure 9.18). In the first step, the client uses an EPG-style application to search for a movie; this step often relies on the HTTP request-response mechanism. Once a candidate movie is selected, a trailer may be made available; while this step relies on HTTP for the delivery of the trailer page, the video content may be delivered as a file or as a low-bitrate MPEG stream over IP multicast. Next, once a movie is selected, an RTSP session is established against an RTSP server. Finally, the RTSP server controls the transmission of the media stream from the RTP or MPEG/DSM-CC server.

Figure 9.18. Typical VOD scenario.
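
A minimal sketch of the RTSP exchange in the last two steps is shown below. The server address, port, and media URL are placeholders invented for the example; the sketch simply sends DESCRIBE, SETUP, and PLAY requests over a raw TCP socket and prints whatever the server returns. A production client would parse the SDP, issue SETUP per track, echo the Session header, and keep the session alive.

```python
import socket

SERVER, PORT = "vod.example.net", 554                # hypothetical RTSP server
URL = f"rtsp://{SERVER}/movies/sample.mpg"           # hypothetical asset URL

def rtsp_request(sock: socket.socket, method: str, cseq: int, extra: str = "") -> str:
    """Send one RTSP/1.0 request and return the raw response text."""
    req = f"{method} {URL} RTSP/1.0\r\nCSeq: {cseq}\r\n{extra}\r\n"
    sock.sendall(req.encode("ascii"))
    return sock.recv(4096).decode("ascii", errors="replace")

with socket.create_connection((SERVER, PORT), timeout=5) as s:
    print(rtsp_request(s, "DESCRIBE", 1, "Accept: application/sdp\r\n"))
    print(rtsp_request(s, "SETUP", 2,
                       "Transport: RTP/AVP;unicast;client_port=5000-5001\r\n"))
    # The Session header returned by SETUP would normally be echoed on PLAY.
    print(rtsp_request(s, "PLAY", 3, "Range: npt=0-\r\n"))
```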

VOD servers typically provide both RTSP and RTP or MPEG streaming support. VOD server applications integrate these capabilities and provide a common front-end user interface to them. Typically, the functions of the VOD server application include the following:

  • Importing asset information from the VOD system

  • Publishing availability information for assets and packages to the VOD client application. This can be handled using HTML over HTTP.

  • Connectivity to Remote Authentication Dial-In User Service (RADIUS) and billing servers for accepting or rejecting subscriber purchase requests.

9.5.2.2 Infrastructure Expectations

Most of the infrastructure needed for VOD is already in place. There are a number of network storage options; one needs to select among centralized, distributed, or hybrid approaches. The list of issues to consider includes storage costs (trading off content duplication for performance), content distribution and management, security considerations, network infrastructure type and bandwidth, rack space, power, and even cooling requirements.

Storage Issues

With the distributed approach, video servers are typically housed at unmanned hubs, and content is duplicated at each hub and distributed to the video servers via tape, FTP, or multicast distribution. This approach can be deployed over IP networks, but requires expensive physical maintenance; the primary limits to scalability, however, are rack space, power, cooling, and storage capacity.

With the centralized approach, all video servers are located at headends from which all customers are served. This approach simplifies content management and security, and often reduces costs. For a given storage size, much larger content libraries can be supported. With this approach, the primary limit to scalability is the network bandwidth for streaming from headends to clients; typically, as the number of subscribers grows, the need for bandwidth requires the use of Dense Wavelength Division Multiplexing (DWDM).

The hybrid approach to storage combines the centralized and distributed approaches, and often achieves the best cost trade-off. It utilizes distributed servers for frequently accessed content and a centralized server for archived content. Content not on a hub server is streamed from the central server, and rebalancing is performed in the background during off-peak hours.

The hybrid approach has a number of advantages. It optimizes the use of network bandwidth because most access requests can be satisfied locally; the distributed servers simply serve as a cache of the content. Content distribution can occur at off-peak hours, and content can be redistributed based on access trends and history. This approach also delivers a higher quality of service because the central and distributed servers back each other up. The obvious disadvantage of the hybrid approach is the complexity of content management: more servers are required, resulting in increased maintenance and support costs.

An initial launch could use the centralized approach; as traffic grows, a gradual migration to the distributed approach can be achieved by adding local servers. Such a strategy can be supported with either analog (with DWDM) or digital technologies. The HFC DTV architecture foundation needed is already deployed in many markets. Available are 750+ MHz of optical node-level addressable bandwidth, capable of delivering MPEG-2 digital video over 64- or 256-QAM. The bidirectional IP-based communications infrastructure is well entrenched. State-of-the-art equipment has been developed for high-bandwidth DOCSIS cable modem support. Set-top box Web browsers exist that support rich Web-based navigation and user interfaces. Low-cost IP stacks implementing HTTP and RTSP are available to both consumer-electronics manufacturers and server developers.

A simplified sample infrastructure map is depicted in Figure 9.19. Typically, a node should be capable of servicing about 500 homes (each receiving an MPEG-2 stream of about 3.8 Mbps), of which about 70% receive conditional access content (typically using 256-QAM over a 6 MHz RF channel to deliver about 38 Mbps). The simultaneous peak rate is assumed to be around 20%, implying that a total of 2 or 3 RF channels should be sufficient to support a VOD service.
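
To make the dimensioning concrete, the short calculation below works out how many 3.8 Mbps streams fit into one 38 Mbps 256-QAM channel and how many channels a given number of concurrent sessions would need. The stream and channel rates are those quoted above; the concurrency figure plugged in (about 25 simultaneous VOD sessions on a 500-home node) is only an illustrative assumption consistent with the 2-3 channel conclusion.

```python
import math

STREAM_MBPS = 3.8           # one MPEG-2 stream, per the text
CHANNEL_MBPS = 38.0         # one 256-QAM, 6 MHz RF channel, per the text

streams_per_channel = int(CHANNEL_MBPS // STREAM_MBPS)
print(f"streams per RF channel: {streams_per_channel}")    # 10

# Illustrative assumption: ~25 concurrent VOD sessions on a 500-home node.
concurrent_sessions = 25
channels_needed = math.ceil(concurrent_sessions / streams_per_channel)
print(f"RF channels needed    : {channels_needed}")        # 3
```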

Figure 9.19. Simplified infrastructure map.

In a typical installation, the headend receives RF signals from a satellite. The signals pass through an RF to Intermediate Frequency (IF) converter and onto a digital transport multiplexer. A number of such multiplexers are interconnected through an Optical Carrier (OC) or SONET Transmission Manager (STM) using Packet-over-SONET over Synchronous Digital Hierarchy (SDH) optical services (see Figure 9.21).

Figure 9.21. Example of a Headend-to-Hub MPEG-2 MPTS over ATM AAL 1/5 installation.

The interfacing between the video, Web, and database servers and the RF combiner is depicted in Figure 9.20. Typically, video servers can produce output streams in various formats, including Small Computer System Interface (SCSI), ATM AAL5, DVB-ASI, or QAM. If the SCSI output is used, the data is passed through rack-mounted MPEG decoders and onto an RF modulator to feed the RF combiner. If the ATM AAL5 or DVB-ASI interface is used, the data is passed through an MPEG-2 multiplexer and a QAM modulator to feed the RF combiner. If the video server outputs a QAM signal, that signal can be fed directly to the RF combiner.

Figure 9.20. Simplified system component interconnectivity map.

The movie selection and RTSP play commands are received from the RF combiner and targeted to the video server, Web server, or database server. These signals are fed into a CMTS and propagate through an IP network that may be open to the Internet.


