4.2 Platform Architecture


Interactive TV platforms essentially extend traditional TV receivers with a data service engine (see Figure 4.3). This engine receives input data demultiplexed out of an MPEG-2-based transport, as well as input from an IP-based transport, and produces graphics as output. The output graphics are then blended with the audio and video to produce the complete integrated rendering of the viewer's experience.

Figure 4.3. Combining audio, video, and data service rendering.

4.2.1 Hardware Abstraction

Personal computer architectures are often simplified by describing a CPU communicating with RAM and a display. The architecture needed to implement iTV receivers, although more complex, is essentially similar. In addition to the usual components found in a PC, an iTV platform includes a graphics processor minimally supporting the blending of multiple planes having different pixel formats, a video processor minimally supporting scaling and repositioning, and an MPEG-2 system decoder (a component of the data service engine) minimally capable of extracting data from an MPEG-2 transport stream (see Figure 4.4). A graphics output integration capability is also needed to align and blend the video with the graphics planes managed by the graphics processor. The video-processing engine is similar to that found in DVD players; it is responsible for decompressing the video and generating the frame pixels.

Figure 4.4. Conceptual block diagram of receiver's hardware components.

The obvious challenge is to achieve interoperability among sophisticated hardware components. A common method used to address this challenge is the introduction of a hardware abstraction layer. This is a software interface layer that decouples the details of operating each hardware component from its functionality. Hardware abstraction can, for example, construct a single hierarchy of files that corresponds to the data transported into the receiver through multiple networks utilizing different protocols. Hardware abstraction can also ensure alignment of the video and (possibly overlaid) graphics planes by automatically matching the graphics configuration requested by an application to the closest configuration supported by the receiver.
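The configuration-matching idea can be sketched in a few lines of Java. The GraphicsConfig type and the distance heuristic below are hypothetical illustrations, not part of any iTV standard:

```java
// Hypothetical graphics configuration descriptor; field names are
// illustrative, not taken from any particular middleware API.
final class GraphicsConfig {
    final int width, height, bitsPerPixel;
    GraphicsConfig(int width, int height, int bitsPerPixel) {
        this.width = width; this.height = height; this.bitsPerPixel = bitsPerPixel;
    }
}

final class GraphicsHal {
    private final GraphicsConfig[] supported;
    GraphicsHal(GraphicsConfig[] supported) { this.supported = supported; }

    // Return the supported configuration "closest" to the request.
    // The distance metric here is a simple illustrative heuristic.
    GraphicsConfig match(GraphicsConfig requested) {
        GraphicsConfig best = supported[0];
        long bestScore = Long.MAX_VALUE;
        for (GraphicsConfig c : supported) {
            long score = Math.abs((long) c.width - requested.width)
                       + Math.abs((long) c.height - requested.height)
                       + Math.abs((long) c.bitsPerPixel - requested.bitsPerPixel);
            if (score < bestScore) { bestScore = score; best = c; }
        }
        return best;
    }
}
```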

4.2.2 Broadcast Protocol Stack

Interactive TV receivers differ radically from noninteractive DTV receivers in that they can interpret data services embedded in a DTV program. These data services are broadcast using multi-layered MPEG-2 transport encapsulations (see Figure 4.5). At the bottom of the protocol stack is the Radio Frequency (RF) modulation layer, which demodulates RF signals into bit sequences. Those bit sequences are organized into MPEG-2 packets, each exactly 188 bytes long. The packets are organized into MPEG-2 sections of varying lengths. These sections are then used to assemble complex data structures. These structures typically carry some signaling and a collection of files, whereby the signaling contains all the information needed to locate and acquire the files.
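Because every transport packet is exactly 188 bytes with a fixed header layout (per ISO/IEC 13818-1), the first demultiplexing step is straightforward to sketch. The class below is illustrative; only the bit layout comes from the standard:

```java
final class TsPacketHeader {
    static final int PACKET_SIZE = 188;
    static final int SYNC_BYTE = 0x47;

    final boolean payloadUnitStart; // a new section or PES packet starts here
    final int pid;                  // 13-bit packet identifier
    final int continuityCounter;    // 4-bit counter for loss detection

    TsPacketHeader(byte[] packet) {
        if (packet.length != PACKET_SIZE || (packet[0] & 0xFF) != SYNC_BYTE)
            throw new IllegalArgumentException("not an MPEG-2 transport packet");
        payloadUnitStart  = (packet[1] & 0x40) != 0;
        pid               = ((packet[1] & 0x1F) << 8) | (packet[2] & 0xFF);
        continuityCounter = packet[3] & 0x0F;
    }
}
```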

Figure 4.5. Simplified iTV broadcast protocol stack.

The transport is a sequence of MPEG-2 packets that are assembled into MPEG-2 sections according to the MPEG-2 standard. MPEG-2 sections either specify PSI (Program Specific Information) or are private. Sections are called private because the syntax and semantics of the content they carry are not specified in the MPEG-2 systems standard. It is expected that other standards organizations, such as ATSC, SCTE, ETSI, and ARIB, as well as vendors, specify the syntax and semantics of the bit-streams encapsulated within these sections.

One of the roles of the PSI is to group packets and sections into an MPEG program. Sections are transmitted by encapsulating them into transport packets (188 bytes each). Transmission of a section should complete before the next section on the same elementary stream (PID) is transmitted. However, packets carrying sections belonging to different streams may be interleaved.

Private sections are used to encapsulate DSM-CC download protocol messages. Some of these sections, such as the Download Data Block (DDB) sections, are used to carry modules. Modules are used to encapsulate serialized objects of the DSM-CC object carousel. Each of these objects encapsulates files, directories, and other components (see Chapter 11).

All the information needed to acquire DDBs and assemble them into modules is delivered through Download Info Indication (DII) sections. Private sections can also be used to deliver data streams (in addition to bounded files) by encapsulating IP datagrams.
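A minimal sketch of the reassembly step, assuming the DII has already supplied the module's size and block size; the class and method names are illustrative, while the field names (moduleSize, blockNumber) follow DSM-CC usage:

```java
import java.util.BitSet;

// Reassembles one DSM-CC module from DownloadDataBlock (DDB) payloads.
// moduleSize and blockSize come from the Download Info Indication (DII).
final class ModuleAssembler {
    private final byte[] module;
    private final int blockSize;
    private final BitSet received;
    private final int blockCount;

    ModuleAssembler(int moduleSize, int blockSize) {
        this.module = new byte[moduleSize];
        this.blockSize = blockSize;
        this.blockCount = (moduleSize + blockSize - 1) / blockSize;
        this.received = new BitSet(blockCount);
    }

    // Accept a DDB payload; blocks may arrive repeatedly (carousel
    // retransmission) and in any order.
    void onDownloadDataBlock(int blockNumber, byte[] data) {
        System.arraycopy(data, 0, module, blockNumber * blockSize, data.length);
        received.set(blockNumber);
    }

    boolean isComplete() { return received.cardinality() == blockCount; }

    byte[] module() {
        if (!isComplete()) throw new IllegalStateException("blocks missing");
        return module;
    }
}
```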

The binding of channels is also delivered using private sections. DTV channel numbers have major and minor numbers. While the major number is often referred to as the channel number, only the combination of major and minor numbers uniquely identifies a virtual channel.

An iTV program is usually delivered over a virtual channel; it is possible, however, for a program to be coordinated across multiple channels. Each program usually contains video, audio, and data stream(s), delivered via a sequence of 188-byte packets. The video and audio streams are usually extracted by the video-processing hardware and directly rendered on the display. The data stream usually contains data transmitted repeatedly, to allow acquisition by receivers that may tune to a channel at unpredictable times.

4.2.3 Broadcast Decoders

Decoding is the process of reading the bit representation of the content and constructing a representation that can be processed by the receiver for the purpose of display. The nature of that runtime representation may vary across implementations.

A complex architecture is needed to decode the MPEG-2 transport and flow the data to the platform components. The decoder often includes numerous hardware modules with complex interactions (see Figure 4.6). Buffers should be dedicated to support the decoding of the various MPEG-2 sections, also referred to as tables, such as the PAT and the PMTs it references. Also needed, for each video stream that is to be displayed, is a video buffer whose behavior is compliant with the MPEG-2 System Target Decoder (STD). That STD is also used to decode DTV Closed Captioning (CC) data. A buffer and decoder are also needed for at least one audio stream per STD.

Figure 4.7. A simplification of a common multi-layered software stack.

The input packets are fed into transport buffers TB1, ..., TBn, where state-of-the-art chips support 48 or 64 buffers. These buffers accumulate data arriving within packets at an unpredictable rate. For each transport buffer there is a need for several smoothing section buffers, SB1, ..., SBn, that compensate for instantaneous bit-rate variations, accumulate sections in a predictable manner, recover an average bit rate, and decouple the timing of section reception from the timing of section consumption. Once in the SBs, data is processed in a fashion specific to its type.
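A minimal sketch of this decoupling, using a bounded queue; the capacity and types are illustrative and not mandated by the MPEG-2 buffer model:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A section buffer that decouples bursty section arrival (producer:
// the transport demultiplexer) from steady consumption (consumer:
// the section decoder). The capacity is an illustrative value.
final class SectionBuffer {
    private final BlockingQueue<byte[]> sections = new ArrayBlockingQueue<>(32);

    // Called from the demultiplexer at the (bursty) arrival rate.
    void onSectionArrived(byte[] section) throws InterruptedException {
        sections.put(section); // blocks if the decoder has fallen behind
    }

    // Called from the decoder thread at its own consumption rate.
    byte[] nextSection() throws InterruptedException {
        return sections.take();
    }
}
```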

4.2.3.1 Video Decoders

Sections carrying iTV video are processed and the result is stored in a frame buffer. It is expected that receivers have at least one video decoding pipeline capable of decoding at least I-frames, P-frames, and B-frames.

  • I-frames: I-frames (intracoded frames) use only intra-frame DCT encoding to compress a single frame without reference to any other frame in the sequence. In 4:2:0 sampling, every four pixels comprise 4 bytes of Y, 1 byte of U, and 1 byte of V (6 bytes = 48 bits, i.e., 12 bits per pixel); typical I-frame encoding at roughly 1 bit per pixel therefore gives a compression ratio of about 12:1. For random access into MPEG video, the decoder must start decoding from an I-frame, not a P-frame. I-frames are inserted every 12 to 15 frames and are used to start a sequence, allowing video to be played from random positions and enabling fast forward or reverse.

  • P-frames: P-frames (predicted frames) are coded as differences from the last I-frame or P-frame. A new P-frame is formed by first predicting the value of each pixel from the last I-frame or P-frame, using motion prediction and DCT encoding. As a result, P-frames achieve a better compression ratio than I-frames, depending on the amount of motion present. The differences between the predicted and actual values are encoded; these difference values compress better than the pixel values themselves. Quantization of the prediction errors further improves compression.

  • B-frames: B-frames (bidirectional frames) are coded as differences from the last or next I-frame or P-frame. B-frames use prediction as P-frames do, but for each block either the previous or the next I-frame or P-frame is used, together with motion prediction and DCT encoding. Because B-frames require both previous and subsequent frames for correct decoding, the order in which MPEG frames are read is not the same as the order in which they are displayed. B-frames give improved compression compared with P-frames, because for every macroblock the encoder can choose whether the previous or the next frame is used for comparison.

The decoding of a new video stream must start by locating an I-frame, often referred to as the first frame. I-frames are compressed using only information in the picture itself, without any motion information.

Storing differences between frames yields a massive reduction in the amount of information needed to reproduce the sequence. Following an I-frame will be one or more P-frames. The first P-frame's data is based on the preceding I-frame. For example, in the case of a moving car, the P-frame specifies how the position of the car has changed from the previous I-frame. This description of changes requires a fraction of the space that would be needed to encode the entire frame. Shape or color changes are also encoded in the P-frame. Each subsequent P-frame is based on its predecessor.

Between I-frames and P-frames are B-frames, based on the nearest I-frame or P-frame both before and after them. In the moving car example, the B-frame stores the difference between the car's image in the previous I-frame or P-frame and in the following I-frame or P-frame. To recreate the B-frame when playing back the sequence, the MPEG decoder uses a combination of the two references. There may be a number of B-frames between I-frames or P-frames. No other frame is based on a B-frame, avoiding the propagation of errors found in P-frame sequences. Typically, there are two or three B-frames between consecutive I- or P-frames, and perhaps three to five P-frames between subsequent I-frames. A shorthand description of these interleaved sequences uses the letters I, B, and P to form a sequence string, such as IBBPBBP or IBPBPBPBP. The former is more difficult to encode but provides a higher compression ratio than the latter.
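A minimal sketch of how display order maps onto coded (transmission) order: each anchor frame (I or P) is transmitted before the B-frames that precede it in display order. The single-character frame representation is purely illustrative:

```java
import java.util.ArrayList;
import java.util.List;

final class GopOrder {
    // Map display order (e.g., "IBBPBBP") to coded order ("IPBBPBB").
    // B-frames at the very end of the string would need the next GOP's
    // anchor frame and are omitted in this sketch.
    static String codedOrder(String displayOrder) {
        List<Character> pendingBs = new ArrayList<>();
        StringBuilder coded = new StringBuilder();
        for (char frame : displayOrder.toCharArray()) {
            if (frame == 'B') {
                pendingBs.add(frame);  // a B waits for its following anchor
            } else {                   // I or P: emit it, then the waiting Bs
                coded.append(frame);
                for (char b : pendingBs) coded.append(b);
                pendingBs.clear();
            }
        }
        return coded.toString();
    }

    public static void main(String[] args) {
        System.out.println(codedOrder("IBBPBBP")); // prints IPBBPBB
    }
}
```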

Typically, MPEG decoders are expected to fully decode MPEG-2 I-frames, but are not always expected to decode P-frames and B-frames. Some standards (e.g., MHP) also require support for an MPEG-2 video drip-feed mode, which requires handling only I-frames and P-frames (but not B-frames).

When decoding streaming video, the overall concatenation of chunks over time must be consistent with the combinations of syntactic elements described in ISO/IEC 13818-2 to build a legal MPEG-2 video stream.

The outputs of the video decoder are fed into a display model, which scales the video and blends it with the background and graphics planes. This display model might support the rendering of multiple video frames scaled and clipped according to some dynamic configuration. This model should support alpha blending of all its inputs, even if those inputs have different pixel formats (i.e., different resolution, pixel dimensions, and color maps).
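A minimal sketch of the per-pixel "source-over" blending rule, assuming both planes have already been converted to a common packed-ARGB format:

```java
final class AlphaBlend {
    // Blend a graphics pixel over a video pixel, both packed as 0xAARRGGBB.
    static int over(int src, int dst) {
        int sa = (src >>> 24) & 0xFF;  // source alpha, 0..255
        int r = blend((src >>> 16) & 0xFF, (dst >>> 16) & 0xFF, sa);
        int g = blend((src >>> 8)  & 0xFF, (dst >>> 8)  & 0xFF, sa);
        int b = blend( src         & 0xFF,  dst         & 0xFF, sa);
        return 0xFF000000 | (r << 16) | (g << 8) | b;
    }

    // Weighted average of source and destination channel values.
    private static int blend(int s, int d, int alpha) {
        return (s * alpha + d * (255 - alpha)) / 255;
    }
}
```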

4.2.3.2 Data Service Decoders

iTV data service binding, modules, and files are passed into the buffer model, which, in turn, forwards detected events and data to the event manager, the AEE (in ATSC DASE, the procedural application execution environment), and the Presentation Engine (PE) (in ATSC DASE, the declarative application execution environment). This event forwarding and data transfer occurs through the RTOS and JVM layers, which make the information accessible to the application execution context through the JavaTV API (see Figure 4.6) [JavaTV].

Figure 4.6. Broadcast decoding architecture.

iTV MPEG-2 PSI comprises, at a minimum, a PAT, a PMT, and a CAT. The PAT, the starting point for the acquisition of all programs, points to a list of PMTs, each binding together the components of a single program. The CAT contains conditional access information, including descrambling information. The PAT, PMT, and CAT pass through to an MPEG PSI decoder, which is responsible, among other tasks, for generating events whenever a new or updated PAT, PMT, or CAT is received. The PSI decoder passes these events to the event manager, and they eventually find their way to the application via the JavaTV API.
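A minimal sketch of extracting the program-to-PMT mapping from a PAT section; the field offsets follow ISO/IEC 13818-1, while the class itself is illustrative (CRC checking and error handling are omitted):

```java
import java.util.HashMap;
import java.util.Map;

final class PatDecoder {
    // Returns a map from program_number to the PID carrying its PMT.
    static Map<Integer, Integer> parse(byte[] section) {
        if ((section[0] & 0xFF) != 0x00)  // table_id 0x00 identifies a PAT
            throw new IllegalArgumentException("not a PAT section");
        int sectionLength = ((section[1] & 0x0F) << 8) | (section[2] & 0xFF);
        Map<Integer, Integer> programs = new HashMap<>();
        // The program loop starts after the 8-byte header (3 bytes through
        // section_length plus 5 more) and ends before the 4-byte CRC_32.
        for (int i = 8; i < 3 + sectionLength - 4; i += 4) {
            int programNumber = ((section[i] & 0xFF) << 8) | (section[i + 1] & 0xFF);
            int pid = ((section[i + 2] & 0x1F) << 8) | (section[i + 3] & 0xFF);
            if (programNumber != 0)  // program_number 0 points to the NIT
                programs.put(programNumber, pid);
        }
        return programs;
    }
}
```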

iTV audio enters an audio codec and proceeds, in a path separate from the video, to an audio output module. An audio module, possibly controlled through the JMF API, is used to control the reception of audio streams from multiple sources, the mixing of these streams, and their routing to output devices. This module is responsible for blending audio that is part of a video program with audio generated by a software application running on the iTV receiver. It enables, for example, accompanying a video with audio in a language other than the one provided by the broadcast.

Synchronization is performed by extracting a Decoding Time Stamp (DTS) and a Presentation Time Stamp (PTS) from the transport. These time stamps are scoped to the PES packets (variable length, up to 64 KB) starting at the transport packets (188 bytes) specifying those time stamps. The DTS is the deadline for receivers to complete decoding, and hence a deadline for emitters and multiplexers to complete emission of the PES packet; it is essentially a requirement for multiplexers to ensure that packets are inserted into the transport sufficiently early, before the DTS.

Typically, a PTS is scoped to the set of frames encapsulated within the first PES packet carried by the transport packets following the appearance of the PTS. However, because audio encapsulation is not required to be aligned with PES packet boundaries, it is possible for a PTS to be scoped to a non-integer number of audio frames. Decoders often have non-interoperable, implementation-specific solutions to this problem.
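A minimal sketch of decoding the 33-bit time stamp from the five-byte PTS/DTS field of a PES header; the bit layout follows ISO/IEC 13818-1, and the 90 kHz conversion shows how the value maps to seconds:

```java
final class PesTimestamp {
    // Decode a 33-bit PTS or DTS from the 5-byte field in a PES header.
    // The 33 bits are split 3/15/15, separated by marker bits.
    static long decode(byte[] b, int off) {
        return ((long) (b[off]     & 0x0E) << 29)
             | ((long) (b[off + 1] & 0xFF) << 22)
             | ((long) (b[off + 2] & 0xFE) << 14)
             | ((long) (b[off + 3] & 0xFF) << 7)
             | ((long) (b[off + 4] & 0xFE) >>> 1);
    }

    // PTS and DTS count ticks of a 90 kHz clock.
    static double toSeconds(long ticks) { return ticks / 90000.0; }
}
```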

The output of the audio and video codecs is sent directly to the display, as well as to the AEE and the PE through the RTOS and JVM layers, and is forwarded to the application execution context through the JMF and HAVi GUI APIs.

To indicate its availability or imminent arrival, a data service is signaled in the transport using special-purpose MPEG-2 sections. Signaling of data services enables the acquisition of an application's components as well as the data they consume. Usually, the PMT points to a table (i.e., an MPEG section) that defines the data service. In ATSC, these tables are the DST and the NRT; through the NRT, the ATSC data broadcast serves as a bridge between MPEG and IP transports. In DVB MHP, this table is the Application Information Table (AIT). These, in turn, point to data streams used by applications. The three types of data streams used, asynchronous, synchronized, and synchronous, are often acquired through a buffer and decoder dedicated to each of the three stream types. Asynchronous data delivery, often performed through repeated carousel retransmission, is most often used for transferring files, modules, or other bounded data resources. Synchronized data transfers are used to deliver triggers and other data that have a critical presentation time.

4.2.4 Software Stack

All interactive TV receivers implement a complex software stack. A simplified layered software stack is depicted in Figure 4.7, which highlights only components directly impacted by the specifications of iTV standards; numerous other components are needed. At the bottom is the hardware layer, which is assumed to contain video decoding pipelines. Above the hardware layer is the Hardware Abstraction Layer (HAL), depicted with the transport decoder abstraction. Above the hardware abstraction layer are the transient and persistent file systems. Whereas the persistent file system is similar to the traditional file systems commonly found in computer systems, the transient file system is tied to and populated by the broadcast transport and effectively serves to mirror what is available from the broadcast. Above the file system layer is the RTOS layer, whose role in an iTV receiver is similar to its role in a computer system. Above the RTOS layer are the internal APIs, which provide the programmatic means to integrate with the RTOS; in a Windows-CE-based platform, this API layer corresponds to the Win32 API layer. Above the internal RTOS API layer is a layer of native host libraries. These libraries implement infrastructure services, such as security, utilizing the RTOS internal APIs.

One major component is the event module, sometimes referred to as the messaging module, which receives events from the transport decoder abstraction and the transient file system and propagates them to all registered listeners. All non-Java applications are implemented on top of the native libraries.

Above the layer of native libraries is the Java Native Interface (JNI) layer, which interfaces with the Java Native Methods (JNM) and serves as the foundation for the execution of the JVM. The JVM serves as a middle layer beneath interoperable applications, decoupling their execution from the particularities of the native environment. Quality JVM implementations include bytecode-to-bytecode run-time space optimizers, bytecode-to-JNM run-time speed optimizers, and dynamic bytecode-to-JNM (just-in-time or other types of) compilers.

Above the JVM layer are the Java middleware libraries (i.e., the AEE). For interactive TV receivers, this minimally includes an application manager, PersonalJava, JavaTV, JMF, and HAVi Level 2 GUI. The application manager, specified by middleware standards such as DVB MHP, OCAP, and DASE, is responsible for registering (e.g., using Java listener interfaces) to receive application-layer events (as opposed to transport signaling events) produced by the event module, and for merging those events into application execution events that establish a compliant application life cycle. The PersonalJava environment includes basic Java libraries such as the Abstract Windows Toolkit (AWT), Java reflection, Java beans, and Java net. The JavaTV environment includes APIs for scanning transports, establishing service contexts, accessing broadcast files through the transient file system, and accessing information in support of an EPG.
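The registration pattern can be sketched as follows; the event and listener types are hypothetical, since the actual interfaces differ across MHP, OCAP, and DASE:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical application-layer event listener; real middleware defines
// richer signaling (e.g., the MHP/OCAP application life-cycle events).
interface AppEventListener { void onAppEvent(String appId, String event); }

final class EventModule {
    private final List<AppEventListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(AppEventListener l) { listeners.add(l); }

    // Called when transport signaling announces or updates an application.
    void dispatch(String appId, String event) {
        for (AppEventListener l : listeners) l.onAppEvent(appId, event);
    }
}

// The application manager registers for these events and maps them onto
// life-cycle transitions (load, init, start, pause, destroy).
final class ApplicationManager implements AppEventListener {
    ApplicationManager(EventModule events) { events.addListener(this); }

    @Override public void onAppEvent(String appId, String event) {
        // e.g., "AUTOSTART" -> load the Xlet and drive it to the started state
    }
}
```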

The middleware's AEE implements the API layer defined by the various iTV standards. A receiver compliant with a certain iTV standard must implement the APIs specified by that standard. This enables applications downloaded to the receiver through the broadcast or an Internet connection (i.e., a return channel) to execute uniformly (in theory) across multiple receivers, as expected by authors of content who use compliant authoring stations.

Actual software stacks are significantly more complex. A more detailed view of the components and their relationships is illustrated in Figure 4.8. The lower layer can be further divided into hardware, driver, system services, operating system, and porting API sublayers, much of which is covered by the Open Services Gateway Initiative (OSGI) [OSGI]. The upper layer, also known as the application layer, contains interoperable (downloadable) applications as well as non-interoperable applications, all utilizing the middleware APIs. The non-interoperable applications may include a TV executive controlling all the functions of a TV (e.g., channel change, volume, and display settings), an EPG application, a Personal Video Recorder (PVR) application, a DVD player application, an email client, and a chat or instant messaging tool.

Figure 4.8. A more detailed view of a common software stack.

The dashed rectangle in Figure 4.8 marks those components that are subject to standardization, covering procedural and declarative content downloadable from an iTV broadcast or a return channel. These two types of content have the following characteristics:

  • Procedural content: Procedural content includes Java Xlets (rather than Applets), which are Java bytecode delivered in Java class files, as well as additional data files (e.g., images). This code is executed by a JVM that is part of the procedural AEE. The code may invoke any of the standard APIs specified by the standard with which the receiver complies (e.g., DVB MHP, ATSC DASE). Common APIs include PersonalJava, JavaTV, JMF, HAVi L2 GUI, and the Document Object Model (DOM) API. A minimal Xlet skeleton appears after this list.

  • Declarative content: Declarative content, sometimes referred to as a declarative application, comprises markup content such as HTML pages. The decoder of declarative content is often referred to as a user agent, browser, or declarative application execution environment. A distinction is often made between content available on the wide-open Internet and content that is compliant with iTV standards. This content includes styling information such as Cascading Style Sheets (CSS), as well as non-declarative ECMAScript code with access to the DOM (see Chapter 6). Further, a declarative application may include, via the <object> element, procedural JavaTV Xlets, and may invoke the procedural AEE.
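A minimal Xlet skeleton against the standard javax.tv.xlet interfaces; the class body and comments are illustrative:

```java
import javax.tv.xlet.Xlet;
import javax.tv.xlet.XletContext;
import javax.tv.xlet.XletStateChangeException;

// Skeleton of a downloadable procedural application. Unlike an Applet,
// an Xlet is driven through an explicit life cycle by the application
// manager: loaded -> initXlet -> startXlet -> (pauseXlet) -> destroyXlet.
public class HelloXlet implements Xlet {
    private XletContext context;

    public void initXlet(XletContext ctx) throws XletStateChangeException {
        this.context = ctx;  // acquire resources, but remain invisible
    }

    public void startXlet() throws XletStateChangeException {
        // become visible and start presenting (e.g., via the HAVi L2 GUI)
    }

    public void pauseXlet() {
        // release scarce resources; may be resumed via startXlet()
    }

    public void destroyXlet(boolean unconditional) throws XletStateChangeException {
        // final cleanup; if unconditional is false, the Xlet may refuse
        // destruction by throwing XletStateChangeException
    }
}
```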

4.2.5 The Java OS Option

Developers of iTV receivers must cope with a wide range of implementation issues inside the set-top box, a bewildering array of connectivity issues outside of it, and a virtual blizzard of programming environments. To cut through this complexity, the industry is moving on several fronts to deliver on the concept of a Java OS, which effectively eliminates several middle layers and itself serves as the middleware (see Figure 4.9). With the JavaOS approach, a single vendor could deliver much of the infrastructure required for the AEE, thinning it to the point of extinction.

Figure 4.9. Simplified software stack utilizing the "Java OS" concept.


