4. Evaluation and User Studies

The Informedia Project at Carnegie Mellon University has created a multi-terabyte digital video library consisting of thousands of hours of video, segmented into tens of thousands of documents. Since Informedia's inception in 1994, numerous interfaces have been developed and tested for accessing this library, including work on multimedia surrogates that represent a video document in an abbreviated manner [11, 56]. The video surrogates are built from automatically derived descriptive data, i.e., metadata, such as transcripts and representative thumbnail images produced by speech recognition, image processing and language processing. Through human-computer interaction evaluation techniques and formal user studies, the surrogates are tested for their utility as indicative and informative summaries, and iteratively refined to better serve users' needs. This section discusses evaluations for text titles, thumbnail images, storyboards and skims.

Various evaluation methods are employed with the video surrogates, ranging from interviews and freeform text feedback, to "discount" usability techniques on prototypes, to formal empirical studies conducted with a completed system. Specifically, transaction logs are gathered and analyzed to determine patterns of use. Freeform text comments and interview feedback allow users to report directly on their experiences and provide additional anecdotal remarks. Formal studies allow facets of surrogate interfaces to be compared for statistically significant differences in dependent measures such as success rate and time on task.

Discount usability techniques, including heuristic evaluation, cognitive walkthrough and think-aloud protocol, allow for quick evaluation and refinement of prototypes [41]. Heuristic evaluation lets usability specialists review an interface and categorize and justify problems based on established usability principles, i.e., heuristics. With cognitive walkthrough, the specialist takes a task, simulates a user's problem-solving process at each step through the interaction and checks if the simulated user's goals and memory content can be assumed to lead to the next correct action. With think-aloud protocol, a user's interaction with the system to accomplish a given task is videotaped and analyzed, with the user instructed to "think aloud" while pursuing the task.

4.1 Text Titles

A great deal of text can be associated with a video document in the news and documentary genres. The spoken narrative can be deciphered through speech recognition and the resulting text time-aligned to the video [56]. Additional text can be generated through "video OCR" processing, which detects the text overlaid on video frames and translates it into ASCII format [48]. From this set of words for a video, a text title can be automatically extracted, enabling interfaces like the one shown in Figure 9.8, where the title for the third result is displayed.

Figure 9.8: Informedia results list, with thumbnail surrogates and title shown for third video result.

These text titles act as informative summaries, a label for the video that remains constant across all query and browsing contexts. Hence, a query on OPEC that matches the third result shown in Figure 9.8 would present the same title.

Initially, the most significant words, as determined by the highest TF-IDF values, were extracted from a video's text metadata and used as the title. Feedback from an initial user group of teachers and students at a nearby high school showed that the text title was referred to often, and used as a label in multimedia essays and reports, but that its readability needed to be improved. This feedback took the form of anecdotal reporting through email and a commenting mechanism located within the library interface, shown as the pull-down menu "Comments!" in Figure 9.8. Students and teachers were also interviewed for their suggestions, and timed transaction logs were kept of all activity with the system, including the queries issued, which surrogates were viewed and which videos were watched [8].

The title surrogate was improved by extracting phrases with high TF-IDF scores, rather than individual words. Such a title is shown in Figure 9.8. The title starts with the highest scoring TF-IDF phrase. As space permits, the highest remaining TF-IDF phrase is added to the list, and when the list is complete the phrases are ordered by their associated video time (e.g., dialogue phrases are ordered according to when they were spoken). In addition, user feedback noted the importance of reporting the copyright/production date of the material in the title, along with the video's duration in standard hours:minutes:seconds format. The modified titles were well received by the users, and phrase-based titles remain in the Informedia video library today, improved by new work in statistical analysis and named entity extraction.
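As an illustration of this scheme, the following minimal Python sketch greedily selects the highest-TF-IDF phrases that fit a character budget and then reorders the selected phrases by video time. The scores, space budget, separator convention and sample phrases are assumptions for illustration, not details of the Informedia implementation.

```python
# Sketch of phrase-based title assembly: take the highest-TF-IDF phrases
# that fit a space budget, then reorder them by when they occur in the video.

def make_title(phrases, max_chars=80):
    """phrases: list of (text, tfidf_score, start_time_seconds) tuples."""
    chosen, used = [], 0
    # Consider phrases in descending TF-IDF order, adding as space permits.
    for text, score, start in sorted(phrases, key=lambda p: -p[1]):
        cost = len(text) + (2 if chosen else 0)   # ", " separator
        if used + cost > max_chars:
            continue
        chosen.append((text, start))
        used += cost
    # Re-order the selected phrases by their time in the video.
    chosen.sort(key=lambda p: p[1])
    return ", ".join(text for text, _ in chosen)

# Hypothetical phrases with made-up TF-IDF scores and start times:
print(make_title([
    ("Colin Powell's Mideast trip", 9.1, 12.0),
    ("cease-fire negotiations", 7.4, 45.5),
    ("White House reaction", 5.2, 3.0),
]))  # phrases appear in spoken order, not score order
```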

Figure 9.8 shows how the text title provides a quick summary in readable form. Being automatically generated, it has flaws: for example, it would be better to resolve pronoun references such as "he" and to handle upper and lower case more consistently. As with the surrogates reported later, though, the flaws of any particular surrogate can be compensated for by a number of other surrogates or alternate views into the video. For example, in addition to the informative title summary, it would be useful to have an indicative summary showing which terms match for any particular video document.

In Figure 9.8, the results are ordered by relevance as determined by the query engine. For each result, there is a vertical relevance thermometer bar that is filled in according to the relevance score for that result. The bar is filled in with colors matching the colors used for the query terms. In this case, with "Colin" colored red, "Powell" colored violet and "trip" colored blue, the bar shows at a glance which terms match a particular document, as well as their relative contribution to that document. The first six results of Figure 9.8 all match on all three terms, with the next ten in the page matching only on "Colin" and "Powell." Hence, the thermometer bar provides an indicative summary showing matching terms at a glance. It would be useful to show not only which terms match, but also match density and distribution within the document, as is done with TileBars [26]. Such a more detailed surrogate is provided along with storyboards and the video player, since those interfaces include a timeline presentation.
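A minimal sketch of how such a thermometer bar could be computed follows. The per-term scores, pixel budget and normalization are illustrative assumptions; the query engine's actual scoring is not specified here.

```python
# Sketch of the query-term "thermometer" bar: the filled height reflects
# the document's overall relevance, and colored bands within the fill show
# each matching term's relative contribution.

def thermometer_segments(term_scores, bar_pixels=50, max_score=10.0):
    """term_scores: dict mapping query term -> assumed score contribution."""
    total = sum(term_scores.values())
    if not total:
        return []                                  # no matching terms
    filled = int(bar_pixels * min(total / max_score, 1.0))
    segments, offset = [], 0
    for term, score in term_scores.items():        # insertion order kept (3.7+)
        height = int(filled * score / total)       # band height per term
        segments.append((term, offset, offset + height))
        offset += height
    return segments  # (term, start_px, end_px) per colored band

print(thermometer_segments({"Colin": 2.0, "Powell": 3.0, "trip": 1.0}))
```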

4.2 Thumbnail Image

The interface shown in Figure 9.8 makes use of thumbnail images, i.e., images reduced to a quarter of their original MPEG-1 pixel resolution of 352 by 240 in each dimension. A formal empirical study was conducted with high school scholarship students to investigate whether this thumbnail interface offered any advantages over simply using a text menu with all of the text titles [9]. Figure 9.9 illustrates the three interfaces under investigation: text menus, "naive" thumbnails in which the key frame for the first shot of the video document is used to represent the document, and query-based thumbnails. Query-based thumbnails use the key frame for the highest scoring shot for the query, as described earlier in Section 3.

click to expand
Figure 9.9: Snapshots of the 3 treatments used in thumbnail empirical study with 30 subjects.

The results of the experiment showed that a version of the thumbnail menu had significant benefits for both performance time and user satisfaction: subjects found the desired information in 36% less time with certain thumbnail menus than with text menus. Most interestingly, the manner in which the thumbnail images were chosen was critical. If the thumbnail for a video segment was taken to be the key frame for the first shot of the segment (treatment B), then the resulting pictorial menu of "key frames from first shots in segments" produced no benefit compared to the text menu. Only when the thumbnail was chosen based on usage context, i.e., treatment C's query-based thumbnails, was there an improvement: when the thumbnail was the key frame for the shot producing the most matches for the query, pictorial menus produced clear advantages over text-only menus [9].

This empirical study validated the use of thumbnails for representing video segments in query result sets, as shown in Figure 9.8. It also provided evidence that leveraging multiple processing techniques leads to digital video library interface improvements. Thumbnails derived from image processing alone, such as choosing the image for the first shot in a segment, produced no improvements over the text menu. However, by combining speech recognition, natural language processing and image processing, improvements can be realized. Via speech recognition, the spoken dialogue words are tightly aligned to the video imagery. Through natural language processing, the query is compared to the spoken dialogue, and matching words are identified in the transcript and scored. Via the word alignment, each shot can be scored for a query, and the thumbnail can then be the key frame from the highest scoring shot for that query. Result sets showing such query-based thumbnails offer advantages over text-only result presentations and serve as useful indicative summaries.
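The following minimal sketch traces that pipeline: query terms are matched against time-aligned transcript words, each shot is scored by the matches falling within it, and the best shot's key frame becomes the thumbnail. The data structures are assumed stand-ins for the library's actual metadata.

```python
# Sketch of query-based thumbnail selection: score each shot by the query
# words whose aligned spoken times fall inside it, then return the key
# frame of the best-scoring shot.

def query_based_thumbnail(shots, word_times, query_terms):
    """
    shots: list of (start_s, end_s, keyframe) tuples covering the video.
    word_times: list of (word, time_s) pairs from the aligned transcript.
    query_terms: the user's normalized query words.
    """
    terms = {t.lower() for t in query_terms}
    best_shot, best_score = shots[0], -1
    for start, end, keyframe in shots:
        # Count transcript matches spoken during this shot.
        score = sum(1 for w, t in word_times
                    if start <= t < end and w.lower() in terms)
        if score > best_score:
            best_shot, best_score = (start, end, keyframe), score
    return best_shot[2]   # key frame of the highest-scoring shot
```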

4.3 Storyboards

Rather than using only a single image to summarize a video, another common approach presents an ordered set of representative thumbnails simultaneously on a computer screen [23, 31, 54, 58, 60]. This storyboard interface, referred to in the Informedia library as a "filmstrip," is shown for a video clip in Figure 9.10. Storyboards address certain deficiencies of the text title and single-image surrogates. Text titles and single thumbnail images (see Figure 9.8) are quick communicators of video segment content, but do not present any temporal details. The storyboard surrogate communicates information about every shot in a video segment. Each shot is represented in the storyboard by a single image, or key frame. Within the Informedia library interface, the shot's middle frame is assigned by default to be the key frame. If camera motion is detected and that motion stops within the shot, then the frame where the camera motion ends is selected instead. Other image processing techniques, such as those that detect and avoid low-intensity images and those favoring images where faces or overlaid text appear, further refine the selection process for key frames [52].
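A minimal sketch of this default-plus-camera-motion rule follows, with the motion detector abstracted behind an assumed motion_end_frame argument; the further refinements for low-intensity images, faces and overlaid text are omitted.

```python
# Sketch of the key-frame heuristic: default to the middle frame of a
# shot, but if detected camera motion ends within the shot, take the
# frame where the motion settles.

def select_keyframe(shot_start, shot_end, motion_end_frame=None):
    """Frame numbers; motion_end_frame is None when no camera motion was
    detected, else the (assumed, detector-supplied) frame where it stops."""
    if motion_end_frame is not None and shot_start <= motion_end_frame < shot_end:
        return motion_end_frame          # frame where camera motion settles
    return (shot_start + shot_end) // 2  # default: middle frame of the shot
```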

Figure 9.10: Storyboard, with overlaid match "notches" following query on "man walking on the moon."

As with the thermometer bar of Figure 9.8, the storyboard can indicate which terms match by drawing color-coded match "notches" at the top of shots. The locations of the notches indicate where the matches occur, showing match density and distribution within the video. Should the user mouse over a notch, as is done in the twelfth shot of Figure 9.10, the matching text for that notch is shown, in this case "moon." As the user mouses over the storyboard, the time corresponding to that storyboard location is shown in the information bar at the top of the storyboard window; e.g., the time corresponding to the mouse location inside the twelfth shot in Figure 9.10 is 44 seconds into the 3 minute and 36 second video clip. The storyboard summary facilitates quick visual navigation: to jump to a location in the video, e.g., to the mention of "moon" in the twelfth shot, the mouse can be clicked at that point in the storyboard.
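Both notch placement and click-to-seek reduce to a single mapping between video time and storyboard position. A minimal sketch follows, assuming a fixed thumbnail width in pixels.

```python
# Sketch of placing match "notches": a match at time t is drawn over the
# thumbnail of the shot containing t, offset horizontally in proportion
# to where t falls within that shot.

def notch_position(match_time, shots, thumb_width=88):
    """shots: list of (start_s, end_s); returns (shot_index, x_offset_px)."""
    for i, (start, end) in enumerate(shots):
        if start <= match_time < end:
            frac = (match_time - start) / (end - start)
            return i, int(frac * thumb_width)
    return None  # match time falls outside all shots

# Inverting the same mapping supports click-to-seek: a click at pixel x
# within shot i seeks to start_i + (x / thumb_width) * (end_i - start_i).
```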

One major difficulty with storyboards is that there are often too many shots to display on a single screen. An area of active research attempts to reduce the number of shots represented in a storyboard to decrease screen space requirements [6, 34, 58]. In Video Manga [55, 6], the interface presents thumbnails of varying resolutions, with more screen space given to the shots of greater importance. In the Informedia storyboard interface, the query-based approach is again used: the user's query context indicates which shots to emphasize in an abbreviated display. Consider the same video document represented by the storyboard in Figure 9.10. By showing only the shots containing query match terms, just 11 of the 34 shots need to be kept. By reducing the resolution of each shot image, screen space is reduced further. In order to see visual detail, the top information bar can show the storyboard image currently under the mouse pointer in greater resolution, as illustrated by Figure 9.11. In this figure, the mouse is over the ninth shot at 3:15 into the video, with the storyboard showing only matching shots at 1/8 resolution in each dimension.
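A minimal sketch of this query-based reduction, reusing shot and match-time structures like those assumed above; the 352 by 240 base resolution is the MPEG-1 source size noted in Section 4.2.

```python
# Sketch of query-based storyboard reduction: keep only the shots that
# contain at least one query match, and shrink each kept thumbnail.

def reduced_storyboard(shots, match_times, shrink=8):
    """shots: list of (start_s, end_s, keyframe); match_times: times (s)
    of query-term matches; shrink: resolution divisor per dimension."""
    kept = [(s, e, k) for s, e, k in shots
            if any(s <= t < e for t in match_times)]
    # e.g., the 34-shot storyboard of Figure 9.10 reduces to 11 shots,
    # each rendered at 1/shrink resolution in each dimension.
    return [(s, e, k, 352 // shrink, 240 // shrink) for s, e, k in kept]
```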

Figure 9.11: Reduced storyboard display for same video represented by full storyboard in Figure 9.10.

4.4 Storyboard Plus Text

Another problem with storyboards is that their effectiveness varies with video genre [32]. For visually rich genres like travelogues and documentaries, they are very useful. For classroom lectures or conference presentations, they are far less important. For a genre such as news video, in which the information is conveyed both through visuals (especially field footage) and audio (such as the script read by the newscaster), a mixed presentation of synchronized shot images and transcript text extracts may offer benefits over image-only storyboards. Such a "storyboard plus text" surrogate is shown in Figure 9.12.

Figure 9.12: Scaled-down view of storyboard with full transcript text aligned by image row.

This storyboard plus text surrogate led to a number of questions:

  • Does text improve the navigation utility of storyboards?

  • Is less text better than complete dialogue transcripts?

  • Is interleaved text better than block text?

For example, the same video represented in Figure 9.12 could have a concise storyboard-plus-text interface in which the text corresponding to the video for an image row is collapsed to at most one line of phrases, as shown in Figure 9.13.
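One possible condensation policy is sketched below: keep the row's transcript phrases in time order until a one-line character budget fills. This policy is an assumption for illustration; a TF-IDF phrase ranking like that used for titles in Section 4.1 could be substituted.

```python
# Sketch of the condensed-text row: collapse a row's transcript to at
# most one line by keeping phrases until the row's character budget is
# exhausted. Budget and separator are assumed values.

def condense_row(phrases, row_chars=60):
    """phrases: transcript phrases (in time order) for one image row."""
    line, used = [], 0
    for p in phrases:
        cost = len(p) + (3 if line else 0)   # " - " separator
        if used + cost > row_chars:
            break                            # one line only
        line.append(p)
        used += cost
    return " - ".join(line)
```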

Figure 9.13: Storyboard plus concise text surrogate for same video clip represented by Figure 9.12.

To address these questions, an empirical study was conducted in May 2000 with 25 college students and staff [12]. Five interface treatments were used for the experiment: image-only storyboard (like Figure 9.10), interleaved full text (Figure 9.12), interleaved condensed text (Figure 9.13), block full text (in which image rows are all on top, with text in a single block after the imagery) and block condensed text.

The results from the experiment showed that text clearly improved the utility of storyboards for known-item search in news video, with statistically significant results [12]. This agrees with prior studies showing that the presentation of captions with pictures can significantly improve both recall and comprehension, compared to presenting either pictures or captions alone [43, 27, 33]. Interleaving text with imagery was not always best: even though interleaving the full text by row as in Figure 9.12 received the highest satisfaction ratings from subjects, their task performance was relatively low with that interface.

The fastest task times were achieved with the interleaved condensed text and the block full text. Hence, reducing the text does not by itself guarantee faster task accomplishment. However, given that subjects preferred the interleaved versions and that condensed text takes less display space than blocking the full transcript text beneath the imagery, the experiment concluded that the ideal storyboard-plus-text interface condenses the text into phrases and interleaves that condensed text with the imagery.

4.5 Skim

While storyboard surrogates represent the temporal dimension of video, they do so in a static way: transitions and pace may not be captured; audio cues are ignored. The idea behind a "video skim" is to capture the essence of a video document in a collapsed snippet of video, e.g., representing a 5 minute video as a 30 second video skim that serves as an informative summary for that longer video. Skims are highly dependent on genre: a skim of a sporting event might include only scoring or crowd-cheering snippets, while a skim of a surveillance video might include only snippets where something new enters the view.

Skims of educational documentaries were studied in detail by Informedia researchers and provided in the digital library interface. Users accessed skims as a comprehension aid, to understand quickly what a video was about. They did not use skims for navigation, e.g., to jump to the first point in a space documentary where the moon is discussed. Storyboards serve as much better navigation aids because they require no temporal investment from the user; a skim must be played and watched.

For documentaries, the audio narrative contains a great deal of useful information. Early attempts at skims did not preserve this information well. Snippets of audio for an important word or two were extracted and stitched together in a skim, which was received poorly by users, much as early text titles composed of the highest TF-IDF words were rejected in favor of more readable concatenated phrases. By extracting audio snippets marked by silence boundaries, the audio portion of the skim was greatly improved, becoming more comprehensible and less choppy.

A formal study was conducted to investigate the importance of aligning the audio with visuals from the same area of the video, and the utility of different sorts of skims as informative summaries. Specifically, it was believed that skims composed of larger snippets of dialogue would work better than ones with shorter snippets, the equivalent of choosing phrases over words. A new skim was developed comprising snippets of audio bounded by significant silences, as detected through audio signal-power segmentation [10]. The transcript text for the audio snippets was ranked by TF-IDF values and the highest valued audio snippets were included in the skim, with the visual portion for the skim snippets drawn from the close neighborhood of the audio.
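A minimal sketch of this assembly step follows, assuming the silence-bounded snippets and their transcript TF-IDF scores have already been computed; the 30-second budget is an illustrative parameter.

```python
# Sketch of skim assembly: rank silence-bounded audio snippets by the
# TF-IDF of their transcript text, fill the skim budget with the top
# snippets, and play them in original time order.

def build_skim(snippets, skim_seconds=30):
    """snippets: list of (start_s, end_s, tfidf_score) audio snippets
    bounded by significant silences."""
    chosen, total = [], 0.0
    for start, end, score in sorted(snippets, key=lambda s: -s[2]):
        if total + (end - start) > skim_seconds:
            continue                      # snippet would overflow budget
        chosen.append((start, end))
        total += end - start
    chosen.sort()                         # restore original time order
    # Video for each snippet is drawn from the same neighborhood of the
    # source, preserving audio/visual synchronization (cf. RND below).
    return chosen
```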

Five treatments were seen by each of 25 college students:

  • DFS: a default skim using short 2.5 second components, e.g., comprising seconds 0–2.5 from the full source video, then seconds 18.75–21.25, seconds 37.5–40, etc.

  • DFL: a default skim using long 5 second components, e.g., consisting of seconds 0–5, then seconds 37.5–42.5, seconds 75–80, etc.

  • NEW: the new skim outlined above, discussed in more detail in [10]

  • RND: the same audio as NEW but with the video reordered, to test synchronization effects

  • FULL: complete source video, with no information deleted or modified

These treatment derivations are illustrated in Figure 9.14.

Figure 9.14: Skim treatments used in empirical study on skim utility as informative summary.
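For contrast with NEW, the component times listed above imply that both default skims subsample mechanically at a 7.5-to-1 compaction: DFS keeps 2.5 seconds of every 18.75, and DFL keeps 5 seconds of every 37.5. A minimal sketch of such subsampling:

```python
# Sketch of the default subsampled skims (DFS/DFL): keep one component
# of length component_s at the start of every compaction period.

def subsample_windows(duration_s, component_s, compaction=7.5):
    period = component_s * compaction   # source seconds per kept component
    windows, t = [], 0.0
    while t < duration_s:
        windows.append((t, min(t + component_s, duration_s)))
        t += period
    return windows

print(subsample_windows(60, 2.5))  # DFS: (0, 2.5), (18.75, 21.25), (37.5, 40.0), ...
print(subsample_windows(90, 5.0))  # DFL: (0, 5.0), (37.5, 42.5), (75.0, 80.0)
```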

After playing either a skim or the full video, each subject was asked which of a series of images had been seen in the video just played, and which of a series of text summaries would make sense as representing the full source video. As expected, the FULL treatment performed best, i.e., watching the full video is the ideal way to determine the information content of that video. The subjects preferred the full video to any of the skim types. However, subjects favored the NEW skim over the other skim treatments, as indicated by subjective ratings collected as part of the experiment. These results are encouraging, showing that incorporating speech, language and image processing into skim video creation produces skims that are more satisfactory to users.

The RND skim distinguished itself as significantly poorer than NEW on the text-phrase gisting instrument, despite the fact that both RND and NEW use identical audio information. This result shows that the visual content of a video skim does have an impact on its use for gisting. The DFS and DFL skim treatments did not particularly distinguish themselves from one another, leaving open the question of the proper component size for video skims. The larger component size, when used with signal-power audio segmentation, produced the NEW skim that did distinguish itself from the other skims. If the larger component size is used only for subsampling, however, it yields no clear objective or subjective advantage over short component size skims, such as DFS. In fact, both DFS and DFL often rated similarly to RND, indicating perhaps that any mechanistically subsampled skim, regardless of granularity, may not do notably well.

While very early Informedia skim studies found no significant differences between a subsampled skim and a best audio and video skim, this study uncovered numerous statistically significant differences [10]. The primary reasons for the change can be traced to the following characteristics of the audio data in the skim:

  • Skim audio is less choppy due to setting phrase boundaries with audio signal-processing rather than noun-phrase detection.

  • Synchronization with visuals from the video is better preserved.

  • Skim component average size has increased from three seconds to five.

Although the NEW skim established itself as the best design under study, considerable room for improvement remains. It received mediocre scores on most of the subjective questions, and its improvement over the other skims may reflect more on their relatively poor evaluations than on its own strengths. NEW did distinguish itself from RND for the image recognition and text-phrase gisting tasks, but not from the DFS and DFL skims. The NEW skim under study achieved smoother audio transitions but still suffered abrupt visual changes between image components. Transitions between video segments should also be smoothed, through dissolves, fades or other effects, when segments are concatenated to form a better skim.

4.6 Lessons Learned From Single Document Video Surrogates

In summary, usage data, HCI techniques and formal experiments have led to the refinement of single document video surrogates in the Informedia digital video library over the years. Thumbnail images are useful surrogates for video, especially as indicative summaries chosen based on query context. The image selection for thumbnails and storyboards can be improved via camera motion data and corpus-specific rules. For example, in the news genre, shots of the anchorperson in the studio and the weather reporter in front of a map typically contribute little to the visual understanding of the news story. Such shots can be de-emphasized or eliminated completely from consideration as single image surrogates or for inclusion in storyboards.

Text is an important component of video surrogates. Text titles act as identifying labels for a video document and serve as quick identifiers. Adding synchronized text to storyboards helps when it is interleaved with the imagery. Assembling surrogates from phrases (longer chunks) works better than assembling them from words (shorter chunks).

Showing distribution and density of match terms is useful, and can naturally be added to a storyboard or a video player's play progress bar. The interface representation for the match term can be used to navigate quickly to that point in the video where the match occurs.

Finally, with a temporal summary like skims, transition points between extracted snippets forming the skim are important. When the audio for a skim breaks at silence points, the skim is received much better than skims with abrupt, choppy transitions between audio snippets.



