Content representation plays a critical role in determining content adaptability. Content representation as used in this chapter refers to the media encoding types, the presentation format, the storage format and the metadata format used. The media encoding type specifies the encoding format of the elementary streams in the content, e.g., MPEG-2 video, MPEG-1 audio, JPEG images, etc. The encoding format determines the basic media encoding properties: the encoding algorithm, scalability, bitrate, spatial resolution and temporal resolution (frame rate). Not all media types are characterized by all of these parameters; e.g., a JPEG image does not have a temporal resolution. The presentation format specifies the inter-relationships among the objects in the presentation and determines the presentation order of the content. These inter-relationships are described using technologies such as the MPEG-4 Binary Format for Scenes (BIFS) and the Synchronised Multimedia Integration Language (SMIL). The storage format specifies the stored representation of the content. The MPEG-4 file format specifies a format for efficient storage of object-based presentations and lends itself nicely to content adaptation and UMA. Metadata specifies additional information about the content that can be used to process the content efficiently, including for content adaptation. MPEG-7 specifies the formats for content metadata descriptions. In this section we give a brief overview of content representation, primarily covering MPEG-4 and MPEG-7 technologies.
Traditionally, the visual component of multimedia presentations has mainly been rectangular video, graphics and text. Advances in image and video encoding and representation techniques [9,10,11] have made it possible to encode and represent audio-visual scenes with semantically meaningful objects. Traditional rectangular video can now be coded and represented as a collection of arbitrarily shaped visual objects. The ability to create object-based scenes and presentations opens up many possibilities for a new generation of applications and services. The MPEG-4 series of standards specifies tools for such object-based audio-visual presentations. While the Audio and Video parts of the standard specify new and efficient algorithms for encoding media, the Systems part of MPEG-4 makes the standard radically different by specifying the tools for object-based representation of presentations. The most significant addition in MPEG-4, compared with MPEG-1 and MPEG-2, is scene composition at the user terminal. The individual objects that make up a scene are transmitted as separate elementary streams and composed upon reception according to the composition information delivered along with the media objects. These new representation techniques make flexible content adaptation possible.
An object-based presentation consists of objects that are composed to create a scene; a sequence of scenes forms a presentation. There is no clear-cut definition of what constitutes an object. When used in the sense of audio-visual (AV) objects, an object can be defined as something that has semantic and structural significance in an AV presentation. An object can thus be broadly defined as a building block of an audio-visual scene. When composing a scene of a city, buildings can be the objects in the scene. In a scene that shows the interior of a building, the furniture and other items in the building are the objects. The granularity of objects in a scene depends on the application and context. The main advantage of breaking up a scene into objects is the coding efficiency gained by applying appropriate compression techniques to the different objects in a scene. In addition to coding gains, there are several other benefits of object-based representation: modularity, reuse of content, ease of manipulation, object annotation, as well as the possibility of playing appropriate objects based on the network and receiver resources available. To appreciate the efficiency of object-based presentations, consider a home shopping channel such as those currently available on TV. The information on the screen consists mostly of text, images of products, audio and video (mostly quarter-screen and sometimes full screen). All of this information is encoded using MPEG-2 video/audio at 30 fps. If this content were created using object-based technology, static information such as text and graphics would be transmitted only at the beginning of a scene, and the rest of the transmission would consist only of audio, video, and text and image updates, which take up significantly less bandwidth. In addition, the ability to interact with individual objects makes applications such as e-commerce possible.
The key characteristic of the object-based approach to audio-visual presentations is the composition of scenes from individual objects at the receiving terminal, rather than during content creation in the production studio (as in, e.g., MPEG-2 video). This allows prioritising objects and delivering each object with the QoS it requires. Multiplexing tools such as FlexMux allow objects with similar QoS requirements to be multiplexed into the same FlexMux stream. Furthermore, static objects such as a scene background are transmitted only once, resulting in significant bandwidth savings. The ability to dynamically add and remove objects from scenes at the individual user terminal, even in broadcast systems, makes a new breed of applications and services possible. Frame-based systems do not have this level of sophistication and sometimes use makeshift methods such as image mapping to simulate simple interactive behaviour. This paradigm shift, while creating new possibilities for applications and services, makes content creation and delivery more complex. The end-user terminals that process object-based presentations are now more complex but also more capable.
Figure 36.2 shows a scene with four visual objects: a person, a car, the background and the text. In object-based representation, each of these visual objects is encoded separately with a compression scheme that gives the best quality for that object. The final scene as seen on a user terminal would show a person running across a road with the text at the bottom of the scene, just as in a frame-based system. To compose a scene, object-based systems must also deliver the composition data that a terminal uses for the spatio-temporal placement of objects in the scene. The scene may also have audio objects associated with (or independent of) visual objects. The compressed objects are delivered to a terminal along with the composition information. Since scenes are composed at the user end of the system, users may be given control over which objects are played. If a scene has two audio tracks (in different languages) associated with it, users can choose the track they want to hear. Whether the system continues to deliver both audio tracks even though only one is played is system dependent; broadcast systems may deliver all the available tracks, while remote interactive systems with upstream channels may deliver objects as and when required. Since even text is treated and delivered as a separate object, it requires far less bandwidth than transmitting the encoded image of the rendered text. However, the delivered text object now has to include font information, and the user terminals have to know how to render fonts. User terminals could be designed to download the fonts or decoders necessary to render the objects received.
Figure 36.2: Example of an object-based scene.
Scene composition can simply be defined as the spatio-temporal placement of objects in a scene. Spatial composition determines the position of objects in a scene, while temporal composition determines their position over time. Operations such as object animation, addition and removal can be accomplished by dynamically updating the composition parameters of objects in a scene. All the composition data that is provided to a terminal can itself be treated as a separate object.
Since composition is the most critical part of object-based scenes, the composition data stream has very strict timing constraints and is usually not loss tolerant. Any lost or even delayed composition information could distort the content of a presentation. Treating the composition data as a separate data object allows the system to deliver it over a reliable channel. Figure 36.3 shows the parameters for spatial composition of objects in a scene. The gray lines are not part of the scene; x and y are the horizontal and vertical displacements from the top left corner. The cube is rotated by an angle of θ radians. The relative depths of the ellipse and the cylinder are also shown. The ellipse is closer to the viewer (z < z') and hence is displayed on top of the cylinder in the final rendered scene. Even audio objects may have spatial composition associated with them. An object can be animated by continuously updating the necessary composition parameters.
Figure 36.3: Composition parameters.
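As a concrete sketch of these composition parameters (illustrative Python data structures, not MPEG-4 syntax), each object can carry a translation (x, y), a rotation θ and a depth z, with the renderer drawing objects in back-to-front depth order:

```python
import math
from dataclasses import dataclass

@dataclass
class Composition:
    x: float       # horizontal displacement from the top-left corner
    y: float       # vertical displacement from the top-left corner
    theta: float   # rotation in radians
    z: float       # depth; smaller z is closer to the viewer

def transform_point(c: Composition, px: float, py: float):
    """Rotate a local point by theta, then translate it into the scene."""
    rx = px * math.cos(c.theta) - py * math.sin(c.theta)
    ry = px * math.sin(c.theta) + py * math.cos(c.theta)
    return (c.x + rx, c.y + ry)

def render_order(objects: dict):
    """Draw the farthest objects first so closer ones end up on top."""
    return sorted(objects, key=lambda name: objects[name].z, reverse=True)

# Hypothetical scene echoing Figure 36.3: the ellipse has the smallest z,
# so it is rendered last and appears on top of the cylinder.
scene = {
    "cylinder": Composition(x=120, y=80, theta=0.0, z=5.0),
    "ellipse":  Composition(x=130, y=90, theta=0.0, z=2.0),
    "cube":     Composition(x=40,  y=40, theta=math.pi / 6, z=8.0),
}
print(render_order(scene))   # ['cube', 'cylinder', 'ellipse']
```

Animation, as described above, would amount to updating the `Composition` fields of an object over time.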
The MPEG-4 committee has specified tools to encode individual objects, compose presentations with objects, store these object-based presentations and access them in a distributed manner over networks. The MPEG-4 Systems specification provides the glue that binds the audio-visual objects in a presentation [12,13]. The basis for the MPEG-4 Systems architecture is the separation of the media and data streams from the stream descriptions. The scene description stream, also referred to as BIFS, describes a scene in terms of its composition and evolution over time and includes the scene composition and scene update information. The other data stream that is part of MPEG-4 Systems is the object descriptor (OD) stream, which describes the properties of the data and media streams in a presentation. The description contains a sequence of object descriptors, which encapsulate stream properties such as scalability, the QoS required to deliver the stream, and the decoders and buffers required to process it. The object descriptor framework is an extensible framework that separates an object from the object's properties. This separation allows different QoS to be provided for different streams; for example, scene description streams, which have very low or no loss tolerance, versus the associated media streams, which are usually loss tolerant. These individual streams are referred to as elementary streams at the system level. The separation of media data and metadata also makes it possible to use different media data (e.g., MPEG-1 or H.263 video) without modifying the scene description. The media and the media descriptions are communicated to receivers on separate channels. A receiver can first receive the media descriptions and then request the appropriate media for delivery based on its capabilities.
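The separation of streams from their descriptions can be sketched as follows (simplified, hypothetical data structures; real object descriptors are binary-coded MPEG-4 syntax). A receiver inspects the descriptors first and requests only the streams whose decoder type it supports:

```python
from dataclasses import dataclass, field

@dataclass
class ESDescriptor:
    es_id: int
    decoder_type: str      # e.g. "mpeg4-video", "mpeg1-audio"
    avg_bitrate: int       # bits per second
    buffer_size: int       # decoder buffer size in bytes

@dataclass
class ObjectDescriptor:
    od_id: int
    es_descriptors: list = field(default_factory=list)

def streams_to_request(ods, supported_types):
    """Pick the elementary streams this terminal can actually decode."""
    chosen = []
    for od in ods:
        for esd in od.es_descriptors:
            if esd.decoder_type in supported_types:
                chosen.append(esd.es_id)
    return chosen

# Hypothetical OD stream contents for a three-object presentation.
ods = [
    ObjectDescriptor(1, [ESDescriptor(101, "mpeg4-video", 384_000, 65536)]),
    ObjectDescriptor(2, [ESDescriptor(102, "mpeg1-audio", 128_000, 8192)]),
    ObjectDescriptor(3, [ESDescriptor(103, "h263-video", 64_000, 32768)]),
]
print(streams_to_request(ods, {"mpeg4-video", "mpeg1-audio"}))  # [101, 102]
```

Because the descriptions arrive separately from the media, the terminal in this sketch never opens a channel for the H.263 stream it cannot decode.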
An elementary stream is composed of a sequence of access units (e.g., frames in an MPEG-2 video stream) and is carried across the Systems layer as sync-layer (SL) packetized access units. The sync layer is configurable, and the configuration for a specific elementary stream is specified in its elementary stream (ES) descriptor. The ES descriptor for an elementary stream can be found in the object descriptor for that stream, which is carried separately in the OD stream. The sync layer contains the information necessary for inter-media synchronization. The sync-layer configuration indicates the mechanism used to synchronize the objects in a presentation, using either timestamps or implicit media-specific timing. Unlike MPEG-2, MPEG-4 Systems does not specify a single clock for the elementary streams. Each elementary stream in an MPEG-4 presentation can potentially have a different clock speed. This puts an additional burden on a terminal, as it now has to support the recovery of multiple clocks. In addition to the scene description and object descriptor streams, an MPEG-4 session can contain Intellectual Property Management and Protection (IPMP) streams to protect media streams, Object Content Information (OCI) streams that describe the contents of the presentation, and a clock reference stream. All data flows between a client and a server are SL-packetized.
The data communicated to the client from a server includes at least one scene description stream. The scene description stream, as the name indicates, carries the information that specifies the spatio-temporal composition of objects in a scene. The MPEG-4 scene description is based on the VRML specification. VRML was intended for 3D modelling and is a static representation (a new object cannot be dynamically added to the model). MPEG-4 Systems extended the VRML specification with additional 2D nodes, a binary representation, dynamic updates to scenes, and new nodes for server interaction and flex-timing. A scene is represented as a graph with media objects associated with the leaf nodes. The elementary streams carrying media data are bound to these leaf nodes by means of BIFS URLs. A URL can either point to an object descriptor in the object descriptor stream or point directly to media data at the specified location. The intermediate nodes in the scene graph correspond to functions such as transformations, grouping, sensors and interpolators.
The VRML event model adopted by MPEG-4 Systems has a mechanism called ROUTEs that propagates events in a scene. A ROUTE is a data path between two fields; a change in the value of the source field effects a change in the destination field. Through intermediate nodes, this mechanism allows user events such as mouse clicks to be translated into actions that transform the content displayed on the terminal. In addition to VRML functionality, MPEG-4 includes features for server interaction, polling of terminal capabilities, binary encoding of scenes, animation and dynamic scene updates. MPEG-4 also specifies a Java interface for accessing a scene graph from an applet. These features make possible content with a wide range of functionality, blurring the line between applications and content.
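The ROUTE mechanism can be approximated with a toy event model (illustrative only; BIFS ROUTEs operate on typed node fields within a binary-coded scene graph). When a source field changes, every route registered on it forwards the new value to its destination field, possibly cascading further:

```python
class Node:
    """A bare-bones scene node: just a bag of named fields."""
    def __init__(self, **fields):
        self.fields = dict(fields)

class Router:
    """Toy version of BIFS ROUTEs: source field -> destination field."""
    def __init__(self):
        self.routes = []   # (src_node, src_field, dst_node, dst_field)

    def add_route(self, src, src_field, dst, dst_field):
        self.routes.append((src, src_field, dst, dst_field))

    def set_field(self, node, name, value):
        node.fields[name] = value
        # Propagate the change along every matching route.
        for s, sf, d, df in self.routes:
            if s is node and sf == name:
                self.set_field(d, df, value)   # may cascade further

# A sensor's output routed to a transform's rotation field, so that a
# user event on the sensor rotates the object it controls.
sensor = Node(fraction=0.0)
transform = Node(rotation=0.0)
router = Router()
router.add_route(sensor, "fraction", transform, "rotation")
router.set_field(sensor, "fraction", 1.57)
print(transform.fields["rotation"])   # 1.57
```

Chaining routes through interpolator-like intermediate nodes, as the text describes, would simply add more entries to the route table.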
Figure 36.4 shows the binding of elementary streams in MPEG-4 Systems. The figure shows a scene graph with a group node (G), a transform node (T), an image node (I), an audio node (A) and a video node (V). Elementary streams are shown in the figure with a circle enclosing the components of the stream. The scene description forms a separate elementary stream. The media nodes in a scene description are associated with media objects by means of object descriptor IDs (OD IDs). The object descriptors have a unique ID in the scene and are carried in an object descriptor stream. An object descriptor is associated with one or more elementary streams. The elementary streams are packetized and carried in separate channels. A receiver processes the scene description stream first and determines the objects it needs to render the scene. The receiver then retrieves the corresponding object descriptors from the object descriptor stream and determines the elementary streams it has to request from the sender. Since the receiver knows the types of the objects before it requests them, it can avoid requesting any object it cannot process, or it can initiate content adaptation negotiation to request the object in a format that it can process.
Figure 36.4: Stream Association in MPEG-4 Systems.
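The binding in Figure 36.4 can be sketched as follows (hypothetical structures): the receiver walks the scene graph, collects the OD IDs referenced by the media leaf nodes, and resolves each ID against the object descriptor stream to find the elementary-stream channels it must open:

```python
# Scene graph: intermediate nodes hold children; media leaves hold an OD ID.
scene_graph = ("G", [                 # group node
    ("T", [                           # transform node
        ("I", 10),                    # image node -> OD 10
        ("V", 11),                    # video node -> OD 11
    ]),
    ("A", 12),                        # audio node -> OD 12
])

# Object descriptor stream: OD ID -> elementary-stream channel(s).
od_stream = {10: ["es-image"], 11: ["es-video"], 12: ["es-audio"]}

def collect_od_ids(node):
    """Depth-first walk; leaves carry an OD ID, inner nodes carry children."""
    tag, payload = node
    if isinstance(payload, int):
        return [payload]
    ids = []
    for child in payload:
        ids += collect_od_ids(child)
    return ids

def channels_to_open(graph, ods):
    return [ch for od_id in collect_od_ids(graph) for ch in ods[od_id]]

print(channels_to_open(scene_graph, od_stream))
# ['es-image', 'es-video', 'es-audio']
```

A real terminal would consult each object descriptor's stream properties at this point to skip or renegotiate objects it cannot process, as described above.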
The MP4 file format that is part of the MPEG-4 specification offers a versatile container for storing content in a form that is suitable for streaming. The MP4 file format can be used to store multiple variations of the same content and to select an appropriate version for delivery. The hint track mechanism supported by the MP4 file format makes it possible to support multiple transport formats, e.g., RTP and MPEG-2 TS, with minimal storage overhead. In summary, MPEG-4 specifies a flexible framework for developing content and applications suitable for UMA services.
The description of multimedia content is an extremely important piece of the UMA puzzle, as it provides an essential understanding of the source material to be distributed. MPEG-7 provides a standardized set of description tools that allow for the description of rich multimedia content. Among the many tools available, a subset is particularly targeted towards supporting the UMA concept and framework. These tools are highlighted here and their use in the context of UMA applications is described. In particular, we describe the tools for data abstraction, tools that facilitate the description of multiple versions of the content, tools that provide transcoding hints and tools that describe user preferences. Complete coverage of the MPEG-7 standard is provided in Chapter 29 as well as in the standard itself.
In general, summaries provide a compact representation, or an abstraction, of the audio-visual content to enable discovery, browsing, navigation and visualization. The Summary Description Schemes (DS) in MPEG-7 enable two navigation modes: hierarchical and sequential. In the hierarchical mode, the information is organized into successive levels, each describing the audio-visual content at a different level of detail. Levels closer to the root of the hierarchy provide coarser summaries, while levels further from the root provide more detailed summaries. The sequential summary, on the other hand, provides a sequence of images or video frames, possibly synchronized with audio, which may compose a slide show or audio-visual skim. There are many existing methods for key frame extraction and visual summarization; some examples can be found in [18,57,58,59].
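The two navigation modes can be sketched as follows (simplified; MPEG-7 Summary DSs are XML-based descriptions, and the level names and frame files here are hypothetical). A hierarchical summary exposes coarser detail near the root, while a sequential summary is simply an ordered list of key frames:

```python
# Hierarchical summary: each level refines the one above it.
hierarchical = {
    "level0": ["highlight.jpg"],                       # coarsest summary
    "level1": ["goal1.jpg", "goal2.jpg"],
    "level2": ["goal1.jpg", "save1.jpg", "goal2.jpg", "interview.jpg"],
}

def summary_at(levels, depth):
    """Return the summary at the requested depth in the hierarchy."""
    keys = sorted(levels)[: depth + 1]
    return levels[keys[-1]]

# Sequential summary: an ordered slide show of key frames.
sequential = ["frame_0010.jpg", "frame_0450.jpg", "frame_0900.jpg"]

print(summary_at(hierarchical, 0))   # ['highlight.jpg']
print(summary_at(hierarchical, 1))   # ['goal1.jpg', 'goal2.jpg']
```

A terminal short on bandwidth or user time would request a shallow depth; a capable terminal could descend to the most detailed level.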
The description of summaries can be used for adaptive delivery of content in a variety of cases in which limitations exist on the processing power of a terminal, the bandwidth of a network, or even an end-user's time. For example, a system has been proposed for delivering content to multiple clients with varying capabilities, in which video summarization was performed prior to transcoding based on inputs from the user.
Variations provide information about different variations of audio-visual programs, such as summaries and abstracts; scaled, compressed and low-resolution versions; and versions in different languages and modalities, e.g., audio, video, image, text and so forth. One of the targeted functionalities of MPEG-7's Variation DS is to allow a server or proxy to select the most suitable variation of the content for delivery according to the capabilities of terminal devices, network conditions or user preferences. The Variation DS describes the different alternative variations. The variations may be newly authored content, or correspond to content derived from another source. A variation fidelity value gives the quality of the variation compared to the original. The variation type attribute indicates the type of variation, such as summary, abstract, extract, modality translation, language translation, colour reduction, spatial reduction, rate reduction, compression and so forth. Further details on the use of variations for adaptable content delivery can be found in the literature.
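Selection among variations can be sketched as follows (illustrative structures and values; the actual Variation DS is an XML description carrying fidelity and variation-type attributes). A server or proxy picks the highest-fidelity variation that fits the terminal's bandwidth and supported modalities:

```python
from dataclasses import dataclass

@dataclass
class Variation:
    name: str
    modality: str     # "video", "image", "text", ...
    bitrate: int      # bits per second needed to deliver it
    fidelity: float   # quality relative to the original, 0.0..1.0

def select_variation(variations, max_bitrate, modalities):
    """Best-fidelity variation the terminal can actually consume."""
    feasible = [v for v in variations
                if v.bitrate <= max_bitrate and v.modality in modalities]
    return max(feasible, key=lambda v: v.fidelity, default=None)

# Hypothetical variation set for one program.
catalog = [
    Variation("original",      "video", 1_500_000, 1.00),
    Variation("low-res video", "video",   300_000, 0.60),
    Variation("slide show",    "image",    64_000, 0.30),
    Variation("transcript",    "text",      4_000, 0.10),
]
best = select_variation(catalog, max_bitrate=400_000,
                        modalities={"video", "image"})
print(best.name)   # low-res video
```

The same program served to a text-only terminal would fall through to the transcript, illustrating the modality-translation variations described above.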
Transcoding hints provide a means to specify information about the media to improve the quality and reduce the complexity for transcoding applications. The generic use of these hints is illustrated in Figure 36.5.
Figure 36.5: Generic illustration of the transcoding hints application.
Among the transcoding hints that have been standardized by MPEG-7 are the Difficulty Hint, Shape Hint, Motion Hint and Coding Hint. The Difficulty Hint describes the bit-rate coding complexity. This hint can be used for improved bit-rate control and bit-rate conversion, e.g., from constant bit rate (CBR) to variable bit rate (VBR). The Shape Hint specifies the amount of change in a shape boundary over time and is intended to overcome the composition problem when encoding or transcoding multiple video objects with different frame rates. The Motion Hint describes i) the motion range, ii) the motion uncompensability and iii) the motion intensity. This metadata can be used for a number of tasks, including anchor frame selection, encoding mode decisions, frame-rate and bit-rate control, as well as bit-rate allocation among several video objects for MPEG-4 object-based transcoding. The Coding Hint provides generic information contained in a video bitstream, such as the distance between anchor frames and the average quantization parameter used for coding. These transcoding hints, especially the search range hint, aim to reduce the computational complexity of the transcoding process. Further information about the use of these hints may be found in [20,21,53].
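As one illustration of how such hints might be consumed (hypothetical values; the actual MPEG-7 descriptors are XML and the allocation policy is a design choice, not part of the standard), a transcoder could split its output bit budget across video segments in proportion to each segment's difficulty hint:

```python
def allocate_bits(difficulties, total_bits):
    """Split a bit budget proportionally to per-segment difficulty hints."""
    total = sum(difficulties)
    return [round(total_bits * d / total) for d in difficulties]

# Hypothetical difficulty hints for four segments of a program
# (higher value = harder to code at a given quality).
hints = [0.2, 0.4, 0.1, 0.3]
print(allocate_bits(hints, 1_000_000))   # [200000, 400000, 100000, 300000]
```

Pre-computed hints like these let the transcoder skip re-analysing the content, which is precisely the complexity reduction the hints are designed for.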
The UserInteraction DSs defined by MPEG-7 describe preferences of users pertaining to the consumption of the content, as well as usage history. MPEG-7 content descriptions can be matched to the preference descriptions in order to select and personalize content for more efficient and effective access, presentation and consumption. The UserPreference DS describes preferences for different types of content and modes of browsing, including context dependency in terms of time and place. The UserPreference DS also describes the weighting of the relative importance of different preferences, the privacy characteristics of the preferences and whether preferences are subject to update, such as by an agent that automatically learns through interaction with the user. The UsageHistory DS describes the history of actions carried out by a user of a multimedia system. The usage history descriptions can be exchanged between consumers, their agents, content providers and devices, and may in turn be used to determine the user's preferences with regard to content.
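Preference-based selection can be sketched as follows (illustrative structures; the UserPreference DS is an XML description, and the genre labels and weights here are hypothetical). Content items are scored against weighted genre preferences and the best matches are served first:

```python
def score(item_genres, preferences):
    """Sum the user's weights for every genre the item carries."""
    return sum(preferences.get(g, 0.0) for g in item_genres)

def personalize(items, preferences):
    """Order content items by how well they match the preferences."""
    return sorted(items, key=lambda it: score(it["genres"], preferences),
                  reverse=True)

# Weighted preferences, e.g. learned from the user's usage history.
preferences = {"news": 0.9, "sports": 0.5, "drama": 0.1}
items = [
    {"title": "Evening News", "genres": ["news"]},
    {"title": "Cup Final",    "genres": ["sports"]},
    {"title": "Period Drama", "genres": ["drama"]},
]
print([it["title"] for it in personalize(items, preferences)])
# ['Evening News', 'Cup Final', 'Period Drama']
```

An agent updating the weights from the UsageHistory descriptions would close the loop between observed consumption and future personalization.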
Such descriptions of the user are clearly useful for customizing and personalizing content for a particular person, which is a key target of the UMA framework. Extensions to these descriptions of the user are being explored within the context of MPEG-21. Further details may be found in Section 4 of this chapter.