3. Personalized Video

We now describe a multimedia personalization service that automatically extracts multimedia content segments (i.e., clips) based on individual preferences (key terms or words, content sources, etc.) that the user identifies in a profile stored on the service platform. The service can alternatively be thought of as an electronic clipping service for streaming media that allows content to be captured, analyzed, and stored on a platform. A centralized server records the video streams of interest to the target audiences (broadcast television, training videos, presentations, executive meetings, etc.) as they are being broadcast. It can also include content that has been prerecorded. The content is preprocessed as described in Section 2.3.

User profiles are individually created data structures (stored in XML format) containing the key terms or words of interest to the user. The profiles are compared against the content database using information retrieval techniques, and if a match is found, multimodal story segmentation algorithms determine the appropriate length of the video clip as described above. The service then delivers the video clips matching the user profile via streaming, which eliminates long downloads (assuming that there is sufficient bandwidth for streaming). The user may then play these segments.

A key feature of the service is that it automatically aggregates clips from diverse sources, providing a multimedia experience that revolves around the user's profile. Users therefore have direct access to their specific items of interest. Because the segments are created dynamically, the user receives recent information, possibly including content from live broadcasts earlier in the day. The profiles make searching for personalized content easier, since the user does not need to retype search strings every time the service is used. The service also supports e-mailing and archiving (bookmarking) clips.

3.1 System Architecture

The architecture is described from both the hardware and the software point of view. For each component, we describe its function as well as the relevant hardware and software details.

3.1.1 Hardware Architecture

The Hardware Architecture is shown in Figure 43.3. The Client (A) requests hypertext markup language (HTML) pages related to the service from the Web/Index Server (C). The retrieved HTML pages contain JavaScript that is responsible for most of the interaction between the Client and the Web/Index Server. From the user profile, the JavaScript builds queries to the Web/Index Server to obtain clip information for the clips that match the search terms in the profile. After the Web/Index Server has returned the clip information to the Client, the user can navigate the clips by selecting them. The Video Acquisition Server (D) has already captured and recorded past broadcast TV programming. These recorded video files are shipped to the Video Server (B), from which they will be streamed. When the Client selects a clip, the video is streamed from the Video Server.

Figure 43.3: Hardware architecture.

3.1.2 Software Architecture

The Software Architecture is shown in Figure 43.4. The architecture employs well-defined interface protocols and XML schema for data storage to maximize interoperability and extensibility. Various JavaScript libraries are used in the Client to access the XML databases, read XML query responses from the Server, and stream the video to the Player Web page. The Client makes extensive use of the JavaScript XML Document Object Model (DOM), of Dynamic HTML (DHTML), and of Cascading Style Sheets (CSS). In addition to JavaScript on the Client side, Perl on the Server side is also used to access the XML databases.

Figure 43.4: Software architecture.

The profile database contains the search topics and associated search terms. This includes restrictions on shows (for example, the search may be limited to certain broadcast TV programs). The show database contains the list of potential shows to be used in search queries. The archive database is simply the clips that the user has saved.

The Web Server handles the client requests for HTML pages related to the service. The Perl Common Gateway Interface (CGI) scripts that the client navigates to perform the functions of the service handle the login/registration pages, the home page, and the profile, archive, and player pages. The player script is launched in a separate Player Web page. The Streaming Server streams the video from the Video Storage to the Player Web page.

The Index Server handles query requests from the Client (how many clips, which shows have clips, etc.) and requests for clip content (metadata that describes clip content including pointers to video content). The Index Server finds the video clips by searching the previously indexed metadata content associated with the video content. The Index Server also determines the start and stop times of the clips. The Index Server generates XML responses to the queries, which are parsed by the Client JavaScript.
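The two kinds of query described above (how many clips match, and which shows have clips) can be illustrated with a toy inverted index. This is only a sketch under stated assumptions: the clip records, the function names `build_index`, `query`, and `summarize`, and the match-any-term rule are hypothetical and are not taken from the actual Index Server.

```python
from collections import defaultdict

# Hypothetical indexed clips: (clip_id, show, transcript text).
CLIPS = [
    ("c1", "Nightly News", "President Bush goes head-to-head with California's governor"),
    ("c2", "Nightly News", "Energy crunch deepens in California"),
    ("c3", "Evening Report", "Bush outlines energy plan"),
]

def build_index(clips):
    """Map each lowercased word to the set of clip ids whose text contains it."""
    index = defaultdict(set)
    for clip_id, _show, text in clips:
        for word in text.lower().split():
            index[word.strip(".,'\"")].add(clip_id)
    return index

def query(index, clips, terms, shows=None):
    """Return sorted clip ids matching any term, optionally limited to some shows."""
    show_of = {cid: show for cid, show, _text in clips}
    hits = set()
    for term in terms:
        hits |= index.get(term.lower(), set())
    if shows is not None:
        hits = {cid for cid in hits if show_of[cid] in shows}
    return sorted(hits)

def summarize(clips, hits):
    """Answer 'how many clips' and 'which shows have clips' for a hit list."""
    show_of = {cid: show for cid, show, _text in clips}
    return {"count": len(hits), "shows": sorted({show_of[c] for c in hits})}
```

Passing a `shows` set corresponds to the show restriction stored in the profile database; a production text IR engine would add ranking, stemming, and phrase matching, which this sketch omits.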

The Video Acquisition Server also performs various other functions including the content preprocessing, multimodal story segmentation, and compression. The Meta Data is shipped to the Meta Data database on the Index Server. Thumbnails are also included as part of the Meta Data. The recorded video files are shipped to the Video Storage.

The Web/Index Server (Figure 43.3) includes a Web (HTTP) interface, a relational database function (e.g., for selecting content from a particular broadcaster), and a full text information retrieval component (Text IR) as shown in Figure 43.5. The lower layer provides the text IR and relational database functions. Each acquisition also has properties such as the program title or date that can be queried against. The MM Specific layer provides the synchronization between multimedia elements (such as thumbnail images and media streams) and the text and other metadata elements. It also includes modules for XML generation and supports dynamically generating clip information. The responses from the MM DB are in standard XML syntax, with application-specific schema as shown in Figure 43.6.

Figure 43.5: Multimedia database.

  <?xml version="1.0" ?>
  <clip start="58.475" duration="268.719">
    <program>
      <title>Nightly News</title>
      <owner>nbc</owner>
      <date>Tuesday, May 29, 2001 18:30 GMT</date>
      <path>/nbc-data/nn/2001/05/29</path>
    </program>
    <keyword>bush</keyword>
    <keywordtimes>
      58.475, 140.974, 214.915, 220.704, 265.849
    </keywordtimes>
    <img src="/cgi-bin/upi?/nbc-data/nn/2001/05/29+0012"/>
    <text>Power play -- President Bush goes head-to-head with California's governor over the energy crunch.</text>
  </clip>

Figure 43.6: XML representing a clip.
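A clip description of the kind shown in Figure 43.6 can be read with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` purely for illustration (the service itself parses these responses in client-side JavaScript); the function name `parse_clip` and the dictionary layout are assumptions, and the sample omits the thumbnail element for brevity.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the clip description from Figure 43.6.
CLIP_XML = """<?xml version="1.0" ?>
<clip start="58.475" duration="268.719">
  <program>
    <title>Nightly News</title>
    <owner>nbc</owner>
    <date>Tuesday, May 29, 2001 18:30 GMT</date>
    <path>/nbc-data/nn/2001/05/29</path>
  </program>
  <keyword>bush</keyword>
  <keywordtimes>
    58.475, 140.974, 214.915, 220.704, 265.849
  </keywordtimes>
  <text>Power play -- President Bush goes head-to-head with California's governor over the energy crunch.</text>
</clip>"""

def parse_clip(xml_text):
    """Extract the playback window and metadata from one <clip> element."""
    root = ET.fromstring(xml_text)
    start = float(root.get("start"))
    duration = float(root.get("duration"))
    # The keyword hit times are a comma-separated list of seconds.
    times = [float(t) for t in root.findtext("keywordtimes").split(",")]
    return {
        "title": root.findtext("program/title"),
        "owner": root.findtext("program/owner"),
        "path": root.findtext("program/path"),
        "start": start,
        "end": start + duration,
        "keyword": root.findtext("keyword"),
        "keyword_times": times,
        "text": root.findtext("text"),
    }

clip = parse_clip(CLIP_XML)
```

Note that in this example the first keyword time coincides with the clip start, and all keyword times fall inside the interval from `start` to `end`.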

3.2 User Interface

Our goal is not merely to create topical video clips, but rather to create personalized video presentations. As such, the user interface is of critical importance, since it is the user who will ultimately be the judge of the quality and performance of the system. The user interface must be designed to gracefully accommodate the errors in story segmentation that any automated system will inevitably produce. Even if the story segmentation is flawless, the determination of which stories are of interest to the user is largely subjective and therefore impossible to fully automate. Further, we cannot know a priori which stories the user is already familiar with (possibly having heard or read about them from other sources) and therefore wishes to skip over. It is often the case that a given news event is covered in several versions by several different news organizations. Should we filter the clip list to include only one version, perhaps the one from the user's favorite broadcaster, or, if the topic is of sufficient interest, should we offer multiple versions? To deal with these issues, tools for browsing the stories will be a key component of the user interface. It is important to balance the need for flexibility and functionality of the interface against the added complexity that such control brings.

We have chosen to have the system create a clip play list, which is used as a baseline for the personalized video presentation. From this baseline, the user is given simple controls to navigate among the clips. We move some of the complexity of clip selection to an initial configuration phase. For example, when users set up their profile, they choose which content sources will be appropriate for a given topic. This makes the initial setup more complex, but yields more accurate clip selection, thus simplifying the user interaction when viewing the personalized presentations. The approach of using a baseline clip play list is of even greater importance when we consider viewing this multimedia content on client devices with limited user interaction capabilities (described in detail in Section 3.4). In these cases the "user interface" may consist of one or two buttons on a TV remote control device. We might have one button for skipping to the next clip within a set of clips about a particular topic and another for skipping to the next topic.

There are fundamental decisions to make in developing the player interface for a web browser:

Separate player page vs. integrated player page: If we choose a separate player page then we must make sure that there are sufficient controls on the player page to prevent the end user from having to go back and forth often. If we choose an integrated player page then we must make sure that the set of controls on the single page does not overwhelm the customer. (If the player is on a digital set-top box whose display is the TV, then this is a significant issue.)
Which of the three standard players (MS Media Player, Real, QuickTime) to use: The choice of the player not only specifies the available features and application programmer's interface (API) but also dictates the streaming protocol and servers used since the available players are largely not interoperable.
Embed the player into a web page or launch it via mime-type mapping: If we do not embed the player then we will have only the controls that are available on the player. If it is embedded, we may decide which controls to expose to the end user, and we may add customized controls.
Support for additional features:
- Personal Clip Archive: Maintain an extracted set of video files or a set of pointers to the full video.
- E-mail clips: Raises additional security issues and has long-term video storage implications.
- Peer-to-peer sharing of clips, lists, profiles, etc.

We chose to develop a browser-based player with the software architecture of the overall system as shown in Figure 43.7. We form a clip play list data structure as the result of the data exchange between the Client and the multimedia database (MM DB). This is a list of program video stream file names with clip start times, clip durations, and metadata such as the program title and date. These data are converted from XML format and maintained in JavaScript variables in the client. Playback is done entirely by streaming. During playback, we use JavaScript to implement control functionality, to iterate through the clip play list items, and to display clip metadata such as clip time remaining, etc. JavaScript on the client also generates HTML from the clip metadata for display of the thumbnails and text and for the archival and email features of the player.
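The clip play list and its navigation controls can be sketched as a small data structure. The version below is written in Python for illustration only (the actual client keeps these data in JavaScript variables); the class and field names are hypothetical, and the policy of staying on the last clip at the end of the list is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    topic: str
    file: str        # program video stream file name
    start: float     # clip start time, in seconds from the start of the file
    duration: float  # clip duration in seconds
    title: str       # program title (metadata)

class PlayList:
    def __init__(self, clips):
        self.clips = list(clips)
        self.pos = 0

    def current(self):
        return self.clips[self.pos]

    def next_clip(self):
        """Advance one clip; stay on the last clip if already at the end."""
        if self.pos < len(self.clips) - 1:
            self.pos += 1
        return self.current()

    def next_topic(self):
        """Skip to the first later clip whose topic differs from the current one."""
        topic = self.current().topic
        for i in range(self.pos + 1, len(self.clips)):
            if self.clips[i].topic != topic:
                self.pos = i
                break
        return self.current()

    def time_remaining(self, player_position):
        """Seconds left in the current clip, given the player position in the file."""
        clip = self.current()
        return max(0.0, clip.start + clip.duration - player_position)
```

Because each entry stores a start time and a duration rather than a separate video file, the same recorded program file can back many clips, and a "play beyond end of clip" control only needs to ignore the stored duration.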

Figure 43.7: Client functional modules.

The end user interface for the desktop client is depicted in Figure 43.8 and Figure 43.9. The users are assumed to have registered with the service and to have set up a profile containing topics of interest and search terms for those topics. After authentication (login) the individualized home page for the user is displayed (see Figure 43.8) which indicates the number of clips available for each of the topics. The user can select a topic, or may choose to play all topics. The player (see Figure 43.9) displays the video and allows navigation among the clips and topics through intuitive controls.

Figure 43.8: Individualized home page for desktop clients.

Figure 43.9: Player page for desktop clients.

3.3 Parameters

There are many possible variations of the system parameters that would make the system suitable for different applications. One successful application uses North American broadcast and cable television content sources. That instance of the system allows users to search the most recent 14 days of broadcasts from 23 video program sources, which corresponds to about 300 hours of content. The archive itself maintains over 10,000 hours of indexed video content going back several years, but for this application we only return clips from content less than two weeks old. In addition, the user interface displays a maximum of 10 clips per topic. The media encoding parameters are shown in the "Desktop Client" column of Table 43.1.
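The two result-limiting parameters above (a 14-day window and at most 10 clips per topic) amount to a simple filter over the matching clips. The following is a minimal sketch, assuming each match is a hypothetical `(topic, broadcast_time, clip_id)` tuple and that the most recent clips are preferred when a topic has more than 10 matches; the source does not state which clips are kept.

```python
from datetime import datetime, timedelta

def select_clips(clips, now, max_age_days=14, max_per_topic=10):
    """Keep clips newer than the cutoff, newest first, capped per topic.

    clips: iterable of (topic, broadcast_time, clip_id) tuples (hypothetical layout).
    Returns a dict mapping each topic to its list of selected clip ids.
    """
    cutoff = now - timedelta(days=max_age_days)
    by_topic = {}
    for topic, aired, clip_id in sorted(clips, key=lambda c: c[1], reverse=True):
        if aired <= cutoff:
            continue  # older than the search window, although still in the archive
        bucket = by_topic.setdefault(topic, [])
        if len(bucket) < max_per_topic:
            bucket.append(clip_id)
    return by_topic
```

Exposing `max_age_days` and `max_per_topic` as arguments reflects the point of this section: the same archive can serve different applications simply by changing these values.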

3.4 Alternative Clients

In addition to the desktop client version described above, other user interface solutions were developed for handheld Personal Digital Assistants (PDAs), for a WebPad that was conceived as a home video control device, and for the telephone. In this section we describe the challenges of implementing a multimedia service on each of these devices, some of which have very limited capabilities.

3.4.1 Handheld Client

A personalized automated news clipping service for handheld devices can take many forms, depending on the networking capabilities and performance of the device.

The three main networking possibilities for handheld devices are:

1. No networking: All content is stored and accessed locally on the device.
2. Low bit-rate: bit-rates of up to 56 Kbps, e.g., CDPD @ 19.2 Kbps
3. High bit-rate: bit-rates over 56 Kbps, e.g., IEEE 802.11b @ 11 Mbps

The display and computational capabilities of the device limit its ability to render multimedia content and can be classified as follows:

A. Text only
B. Text and images
C. Text, images, and audio
D. Text, images, audio, and video

We implemented option D for both the local content and the high bit-rate scenarios (options 1 and 3). See Table 1 in the Appendix for information detailing the differences between the desktop and the handheld implementations. Some of the key limitations for handheld devices include:

Storage: limits the total number of minutes of playback time for non-networked applications
Processing power: imposes limitations on video decoding and possibly on client player control functionality (e.g., JavaScript may not be supported)
Screen Size and Input Modality: restricts graphical user interface options, and lack of keyboard makes keyword searches difficult.

The limited capabilities of handheld devices require major architectural changes. The goal of these changes is to offload much of the computation from the client to the server. We assume that the user interface for creating and modifying the user profile database is done using a desktop PC with keyboard and mouse. However, since the user interface is based on HTML and HTTP, it is possible (although cumbersome) to modify the profile from a handheld device.

The desktop client architecture (see Figure 43.10) makes extensive use of the JavaScript Extensible Markup Language Document Object Model (XML DOM), Dynamic HTML (DHTML), and of CSS. The server is responsible for fielding queries and generating XML representations in response. For the Player Page, the client uses the Microsoft Media Player ActiveX control to play the video stream and generates HTML for the clip thumbnail and text. The user has considerable user interface control over the video content (skipping selected clips, selecting topics, etc.).

Figure 43.10: Client-server architecture for desktop clients.

For handheld clients, much of the complexity must be moved from the client to the server (see Figure 43.11). The new architecture makes no use of JavaScript, DHTML, or CSS on the client. The clip HTML generation has been moved to the server, and HTML (with no use of CSS) rather than XML is sent to the client. As the server assembles this page, it also creates a clip playlist file in Advanced Stream Redirector (ASX) format. The "Play" button is linked to this dynamically created file and launches the media player using a Multipurpose Internet Mail Extensions (MIME) type mapping. The GUI for navigating clips is limited to the Media Player capabilities. Figure 43.12 and Figure 43.13 show the user interface.
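Server-side generation of the ASX playlist can be sketched as follows. This is an illustration in Python rather than the Perl used by the service, and it rests on several assumptions: the clip dictionaries and the `mms://` URLs are hypothetical, and the use of the ASX `STARTTIME` and `DURATION` elements with `hh:mm:ss.ff` values is our reading of the ASX metafile format, not a detail confirmed by the system description.

```python
import xml.etree.ElementTree as ET

def seconds_to_asx(t):
    """Format a time in seconds as hh:mm:ss.ff (hundredths of a second)."""
    hours, rem = divmod(int(round(t * 100)), 360000)
    minutes, rem = divmod(rem, 6000)
    secs, hundredths = divmod(rem, 100)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{hundredths:02d}"

def build_asx(clips):
    """Build an ASX playlist with one ENTRY per clip in the play list."""
    asx = ET.Element("ASX", {"version": "3.0"})
    for c in clips:
        entry = ET.SubElement(asx, "ENTRY")
        ET.SubElement(entry, "TITLE").text = c["title"]
        ET.SubElement(entry, "REF", {"HREF": c["url"]})
        # Play only the clip's window within the full program file.
        ET.SubElement(entry, "STARTTIME", {"VALUE": seconds_to_asx(c["start"])})
        ET.SubElement(entry, "DURATION", {"VALUE": seconds_to_asx(c["duration"])})
    return ET.tostring(asx, encoding="unicode")

# Hypothetical play list with placeholder stream URLs.
playlist = build_asx([
    {"title": "Nightly News", "url": "mms://video.example.com/nn.asf",
     "start": 60.0, "duration": 268.719},
    {"title": "Evening Report", "url": "mms://video.example.com/er.asf",
     "start": 3661.5, "duration": 90.0},
])
```

Since the playlist carries the clip boundaries itself, the handheld player needs no scripting support: skipping between entries is handled by the media player's own controls, which is exactly the limitation noted above.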

Figure 43.11: Client-server architecture for handheld clients.

Figure 43.12: Individualized home page for handheld clients.

Figure 43.13: Clip play list display for handheld clients.

3.4.2 WebPad Video Controller

We have also implemented the service using a WebPad (or tablet PC) Video Controller and a custom standalone Video Decoder appliance in order to play the video content with full broadcast television quality on standard TV monitors. This implementation is contrasted with the PC implementation in Table 43.1. The WebPad (essentially a pen-based Windows PC) uses a browser such as MS Internet Explorer, like the standard Client PC, but the video is not streamed to the tablet. Rather than an MPEG-4 stream being displayed by the Media Player ActiveX control on the WebPad, an MPEG-2 stream is sent to a special-purpose video endpoint, which decodes it and sends the video directly to the television monitor in analog form. The video screen part of the Player Page on the WebPad is replaced with a still frame graphic (the thumbnail) and pause/volume controls. The other standard playback controls (skip back a clip, stop, play, skip forward a clip, skip topic, play beyond end of clip) work as before. The navigation among clips remains the same (click on thumbnails, play selected clips, etc.), but the video stream is played on the TV instead of on the WebPad. Note that hybrid solutions are possible in which video may also be displayed on the tablet, perhaps for preview purposes.

The WebPad implementation is achieved by removing the Media Player ActiveX control from the client and adding a JavaScript object that mimics the Player object. The client causes the video stream to be redirected from the Video Server to the video endpoint. The video endpoint is a small embedded Linux computer with a 100BaseT IP network interface and a hardware MPEG-2 decoder. It includes an HTTP server with custom software. When the user interacts with the Player Page controls on the WebPad, CGI scripts on the video endpoint carry out the corresponding commands (stop, play, pause, increase/decrease volume, etc.). This implementation supports a range of high-quality video from digital cable quality through full DVD quality (4-10 Mbps MPEG-2).

3.4.3 Telephone Client

Lastly, we have implemented the service over the phone by using a front-end VoiceXML (VXML) Gateway (see Figure 43.14). The client device in this case is any standard touch-tone telephone. During the registration process (using the PC), a new user creates a login and a 4-digit identifier, which is needed when using the telephone interface to the service. As part of the acquisition process, the audio content is extracted from the video content, resampled to 8 kHz, and quantized to 8-bit μ-law, as required by the VXML platform. Similar to the handheld client architecture, the server is responsible for fielding queries and generating XML representations in response. However, instead of generating HTML, here we generate the VXML for the call session. Thus, the VXML is dynamically created using knowledge of the user's profile.

After the user calls into the service, Dual-Tone Multi-Frequency (DTMF) input or speech input can be used to navigate the clips. The user can navigate the topics in the profile or the clips within a topic. In one instance of the service, the DTMF interface requires the user to enter "#" to skip to the next clip or "##" to skip to the next topic. The speech interface requires the user to speak the topic name in order to jump to a topic, or the user can say "next clip" or "next topic". At any time, the user can ask for clip details (enter "5" or say "clip details"), which consist of the show name, the show date, and the duration of the clip. Because all the segmentation work has already been done during the processing stage, it is relatively straightforward to play clips using a telephone interface. Enhancements to the telephone interface could allow the user to hear more than the clip (play content before the start point or play content beyond the end point).

Figure 43.14: Architecture for Telephone Clients.
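The 8-bit μ-law quantization applied to the telephone audio can be illustrated with the continuous companding formula F(x) = sgn(x) ln(1 + μ|x|) / ln(1 + μ), with μ = 255. The sketch below is an assumption-laden illustration, not the acquisition code: it operates on samples already normalized to [-1, 1], it does not perform the resampling to 8 kHz, and its code values are not bit-exact with the segmented encoding used by telephony codecs.

```python
import math

MU = 255  # companding parameter conventionally used for 8-bit mu-law

def mulaw_compress(x):
    """Map a sample in [-1, 1] to [-1, 1] using the continuous mu-law curve."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_quantize(x):
    """Quantize a sample in [-1, 1] to an integer code in 0..255."""
    y = mulaw_compress(x)
    return int(round((y + 1.0) / 2.0 * 255))

def mulaw_expand(code):
    """Approximately invert mulaw_quantize, returning a sample in [-1, 1]."""
    y = code / 255 * 2.0 - 1.0
    return math.copysign((math.pow(1 + MU, abs(y)) - 1) / MU, y)
```

The logarithmic curve spends more of the 256 codes on quiet samples than on loud ones, which is why 8 bits per sample are adequate for speech over a telephone channel.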