Video is being hailed, especially in circles prone to an easy technophilia, as the next frontier of the Internet. The premise on which this assessment rests is not unreasonable: the network bandwidth available to the most advanced users is increasing to the point where the transmission of video is no longer a long-term goal. The promises are also rather substantial: as I write these words, I am sitting in my office in San Diego watching the news on the third channel of Italian television, albeit in a format just a little larger than a large postage stamp.
Yet, there are many problems to be solved before the symbiosis of video and the Internet can be successful, many of them connected to the different, almost incompatible characteristics of the two carriers: where video is a more passive background process, which we can absorb for long periods of time without interruption, the Internet is characterized by a more rapid interaction cycle, and by shorter times of passive absorption. In the case of my Italian news, for instance, I find that almost invariably after a few minutes I put the video window in the background and start doing something else while listening to the news: looking at other Internet sites or, as is the case now, working on an overdue book chapter.
As has been observed, one source of incompatibility is that the video screen is transparent while the computer screen is opaque. That is, the video screen opens a virtual window that reveals a world behind it: one doesn't look at the screen, but at what lies behind it. The computer screen, on the other hand, is the location of the tools of interactivity (buttons, menus, etc.) and, as such, a user operates on the screen and looks at it. This opacity makes the computer screen (and, therefore, the Internet, on which the computer screen is displayed) an unlikely site for the lengthy contemplation of a prolonged stream.
In addition to this, a source of contrast is the clash between the syntagmatic relations at work in video and the paradigmatic relations at work in an interface. A syntagmatic relation holds between units that are placed next to one another in a meaningful structure, for example the words cat and mat in the expression "the cat is on the mat." On the other hand, one could also construct sentences by choosing possible alternatives to the word "cat," such as "the dog is on the mat," or "the table is on the mat." The relation between the word "cat" and "dog" or "table" is paradigmatic.
In video, the signification units are related by a syntagmatic relation: they are given meaning by the fact that they follow one another, and this is why the only way of fully understanding a video is to sit down and watch it. Interfaces, on the other hand, are paradigmatic: they present a set of alternatives through which one can navigate, and each act will result in the presentation of another series of alternatives. It is true that the actual act of navigation will create a temporal (therefore syntagmatic) sequence, but this sequence is implicit: what is made explicit in an interface is the paradigmatic relations between the possible alternatives.
In order to facilitate access to video in a highly interactive medium, the generally accepted solution is to expose its structure, allowing users to navigate it, to receive information on the general content of the video and, on demand, to watch short video segments on topics they deem interesting. This is the road followed, for instance, by the French station TV5 for its Internet news: the video window contains a list of the topics covered in the news broadcast together with a brief synopsis, and the user can select which one to watch.
Observations like these have spurred a great interest in the means of defining, describing, and displaying the structure of a video, and in integrating a suitable display of the video structure with the display of part of the video stream. The structure is used, in most cases, to facilitate access to specific parts of video, much in the same way in which a directory structure is used to facilitate access to an otherwise undifferentiated set of files. In order to do this the structure must rely on some organizational principles of video.
The principles around which the structure of a video is encoded are caught in a double bind between contrasting requirements. On one hand, they must be semantically meaningful: it would be useless to organize a video along lines that make no sense to the user, much as it would be quite useless to divide a group of files into directories based, say, on their length or on the value of their fifth byte. On the other hand, if the structure is so complex, or the video library so large and mutable, that an automatic structuring mechanism is necessary, then the organizational principles must be derivable from the video data; that is, they must have an easily identifiable syntactic referent in the video data.
Because of their semantic connection, the organizational principles of video depend on the general structure, purpose, and cultural location of the video being analyzed. As a start, one can define two broad categories of video: semiotic and phenœstetic, to which I will now turn.
Semiotic video is, loosely speaking, video specifically designed, constructed, or assembled to deliver a message. Produced videos, from films to TV programs, news broadcasts, and music videos, are examples of semiotic video: they carry an explicit message (or multiple messages), which is encoded not only in their content, but also in the syntax of the video itself.
The syntax of the most common forms of video expression derives from that of cinema. Cinema has been, from many points of view, the form of expression most characteristic of the twentieth century. Not only did cinema provide a new form of artistic expression, but the particular version of narrative it created became the main modality of communication of the century.
The language of cinema was developed in a span of roughly three decades, from the turn of the century to the end of the 1920s. At the beginning of the century, cinema was essentially recorded theatre, while at the end of the 1920s its language was so sophisticated that many people saw the introduction of the spoken word as a useless trick, which added little or nothing to the expressive power of the medium.
The main contribution to the development of the cinematic language came from the Russian constructivists, most notably people like Eisenstein and Vertov. Vertov's The Man with the Movie Camera (1929) can be seen as the ultimate catalogue of the expressive possibilities of the film language and of its principal syntactic construct: the montage.
The language of montage has extended beyond cinema through its adoption by other cultural media like TV or music videos. Its relative importance as a carrier of the semiosis of a video depends, of course, on the general characteristics of the form of expression in which it is employed, or even on the personal characteristics of the creator of a given message: we have at one extreme film directors who make a very spartan use of montage and, at the other extreme, music videos in which the near totality of the visual message is expressed by montage.
Attempts at the automatic analysis of video have long relied on the structure imposed by montage, due in part to the relative ease with which montage artefacts can be detected.
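The most easily detected montage artefact is the cut between two shots. As a concrete illustration, the following sketch declares a cut wherever the intensity histograms of consecutive frames differ by more than a threshold; the frame representation, the bin count, and the threshold value are toy assumptions, not part of the chapter's argument.

```python
# A minimal sketch of shot-boundary (cut) detection via histogram
# differences. Frames are toy grayscale images (flat lists of pixel
# intensities); real systems work on decoded video frames.

def histogram(frame, bins=4, max_val=256):
    """Count the pixel intensities of a frame into a few coarse bins."""
    hist = [0] * bins
    for pixel in frame:
        hist[pixel * bins // max_val] += 1
    return hist

def hist_distance(h1, h2):
    """L1 distance between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def detect_cuts(frames, threshold):
    """Return indices i such that a cut occurs between frames i-1 and i."""
    cuts = []
    for i in range(1, len(frames)):
        d = hist_distance(histogram(frames[i - 1]), histogram(frames[i]))
        if d > threshold:
            cuts.append(i)
    return cuts

# Two "shots": a run of dark frames followed by a run of bright frames.
dark = [10, 20, 30, 15]
bright = [200, 220, 240, 210]
frames = [dark, dark, dark, bright, bright]
print(detect_cuts(frames, threshold=4))  # -> [3]
```

The same scheme, with richer features and adaptive thresholds, underlies much of the shot-detection literature; the point here is only that the cut has a clear syntactic referent in the data.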
Montage is not the only symbolic system at work in semiotic video, although it is the most common and the easiest to detect. Specific genres of video use other means that long use has promoted to the status of symbols for expressing certain ideas. In cartoons, for instance, characters running very fast always leave a cloud behind them and, whenever they are hit on the head, they vibrate. These messages are, to an extent, iconic (certain materials do vibrate when they are hit), but their use is now largely symbolic and is being extended, sometimes rather crudely, to other genres (such as action movies) that rely on visuals more than on dialog to describe an action; thanks to the possibilities offered by sophisticated computer techniques, the language of action movies is moving away from realism and getting closer and closer to a cartoonish symbolism.
It is quite likely that in the near future, as this type of symbolism becomes even more common, analysis techniques will follow suit and, just as today they try to extract part of the meaning of a film through an analysis of the montage, they will begin analyzing the visual symbols with which directors express their messages.
In addition to these, which we might call intramedial messages, there are in many videos extramedial messages: shared cultural messages that provide meaning not because of the presence of a certain object in a video, but because of common connotations attached to the object. So, the presence of Adolf Hitler in a video may evoke a meaning of xenophobia, violence, or war, not because of the specific characteristics of the video, but because of the general connotations attached to the image of the German dictator. Directors use these connotations in concert with more strictly cinematic techniques, sometimes to create agreement, sometimes to create contrast, either between media (e.g., traditional Jewish chants or peace anthems in the background of a Nazi gathering) or in time, through techniques like collage or pastiche.
All these semantic modalities, although not yet used to the fullest extent of their possibilities, can provide an invaluable aid to the task of identifying the meaning of a piece of semiotic video, at least, within the boundaries of a culture in which certain codes and connotations are shared.
By contrast, the kind of video that I have called phenœstetic is characterized by the almost absolute absence of cultural references or expressive possibilities that rest on a shared cultural background. It is, to use an evocative, albeit simplistic, image, video that just happens to be. Typical examples of phenœstetic video are given by security cameras or by the more and more common phenomenon of web cameras [6,7].
In this type of video, there is in general no conscious effort to express a meaning except, possibly, through the actions of the people in the video. Expressive means with a simple syntax, like montage, are absent and, in most cases, the identities of the people present in the video will afford no special connotation. In short, phenœstetic video is a stream of undifferentiated, uninterrupted narrative, a sort of visual stream of consciousness of a particular situation.
It should be apparent that imposing a structure on this kind of video for the purpose of Internet access is much more problematic than with semiotic video, since all the syntactic structures to which the semantic structure is anchored are absent. It would be futile, for instance, to try to infer the character of a phenœstetic video by looking at the average length of its shots or at the semiotic characteristics of its color distribution, for there are no shots to be detected and colors are not purposefully selected.
The position that I will try to carry forward in this chapter is that the proper way to organize phenœstetic video is around meaningful events, that is, that the structure of phenœstetic video is given by the interesting events that take place in it and by the structural relations between these events.
The "catch" of this definition is that there is no absolute or syntactic way to define an event: an event of interest is a purely semantic entity whose nature and location depend on the context in which the video is analyzed. So, for phenœstetic video, there will not be in general one structure around which the exploration of the video can proceed, but as many structures as there are contexts in which the video is used.
At this level, video is, in general, no longer alone: the events of interest are associated with information coming from other media, which completes and complements the video information. It is important, in fact, to realize that on a generalist and interactive carrier like the Internet, media do not, by and large, stand by themselves, but are collected and associated in wider experience units. Informally, an experience unit is a complete view of an event of interest, regardless of the specific medium in which the experience is carried, and complete with all the connections between the different sources of information and with other experience units.
It should be stated from the outset that the concept of experience unit is a practical approximation and that, as a principle, it is indefensible. There are, in experience, no Leibnizian windowless monads: an experience unit derives its expressivity from the interaction with other experience units. A better way to regard an aggregation of media in a highly interactive carrier is as a graph of units connected by semantically structured edges. Each unit is, in itself, a collection of fragments derived from different media that can, individually, be connected to other fragments belonging to other units. A schematic view of such a graph is shown in Figure 22.1.
Figure 22.1: Semantic view of a graph.
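To make the notion concrete, such a graph could be represented as follows. The class names, fields, and the edge label are illustrative assumptions; the chapter prescribes no particular data structure.

```python
# A hypothetical data structure for a graph of experience units: each
# unit collects fragments from different media, and semantically
# labelled edges connect units to one another.

class Fragment:
    """One piece of one medium belonging to an experience unit."""
    def __init__(self, medium, content):
        self.medium = medium      # e.g. "video", "text", "audio"
        self.content = content    # a description or reference to the data

class ExperienceUnit:
    """A view of an event of interest, aggregated across media."""
    def __init__(self, name):
        self.name = name
        self.fragments = []
        self.edges = []           # list of (semantic label, target unit)

    def add_fragment(self, medium, content):
        self.fragments.append(Fragment(medium, content))

    def link(self, label, other):
        """Connect this unit to another with a semantically labelled edge."""
        self.edges.append((label, other))

# Two units about the same event, carried by different media.
clip = ExperienceUnit("news clip")
clip.add_fragment("video", "segment 12:00-12:45")
story = ExperienceUnit("written story")
story.add_fragment("text", "transcript and commentary")
clip.link("elaborated-by", story)
```

The essential point captured here is that meaning lives as much in the labelled edges between units as in the fragments inside any single unit.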
Architecturally, the dependence of events on context entails a general scheme like that of Figure 22.2. The video stream is first analyzed by a syntactic detector module, which detects syntactically determined events. These occurrences then form the input of a number of context-dependent semantic detectors, each of which encodes a particular application domain.
Figure 22.2: A general scheme showing the dependence of events on context.
In the semantic detectors, events are seen within the context of a particular domain: the data and relations relevant to that domain are determined and, possibly, sought in the other media available to the system.
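The shape of this architecture can be rendered schematically in a few lines: one shared syntactic stage feeds any number of domain-specific semantic stages. The event representation, the confidence field, and the detector logic below are all toy assumptions used only to show the fan-out.

```python
# A minimal sketch of the two-stage scheme: a syntactic detector runs
# once over the stream, and its output fans out to context-dependent
# semantic detectors, each encoding one application domain.

def analyze(stream, syntactic_detector, semantic_detectors):
    """Run the shared syntactic stage, then each domain-specific stage."""
    events = syntactic_detector(stream)
    return {name: detector(events)
            for name, detector in semantic_detectors.items()}

# Toy instantiation: the "stream" is a list of raw observations.
syntactic = lambda stream: [x for x in stream if x["confidence"] > 0.5]
semantic = {
    "domain_a": lambda evs: [e for e in evs if e["kind"] == "a"],
    "domain_b": lambda evs: [e for e in evs if e["kind"] == "b"],
}
stream = [
    {"kind": "a", "confidence": 0.9},
    {"kind": "b", "confidence": 0.8},
    {"kind": "a", "confidence": 0.2},  # discarded at the syntactic stage
]
print(analyze(stream, syntactic, semantic))
```

Note that the expensive syntactic analysis is performed once, while adding a new application domain only requires registering another semantic detector.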
Consider, as an example, a set of cameras installed in the different rooms of a museum. Let us assume that a suitable video processing unit is capable of determining the video regions corresponding to the people who walk around and, by some suitable features (the color of the clothes, possibly), of assigning an identifier to each region, and that a region maintains its identifier across the different rooms of the museum.
Syntactic events in this model can include the following:
A visitor (that is, a moving region) goes near a painting.
A visitor goes from one room to another.
A large assembly of visitors is formed somewhere.
A visitor goes into the souvenir shop and buys something (detected, of course, by noticing that the visitor spends some time near the cash register).
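Under the assumptions above, these syntactic events could be produced from the tracked regions roughly as follows. The data layout (snapshots mapping visitor identifiers to room and position), the distance threshold, and the crowd size are toy assumptions; in particular, the "buy" event is simplified here to mere presence near the cash register, without the dwell-time test mentioned above.

```python
# A sketch of a syntactic detector for the museum example: it turns
# tracked regions into the four event types listed in the text.

def near(p, q, radius=1.0):
    """True if two (x, y) positions are within `radius` of each other."""
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 <= radius

def syntactic_events(frames, paintings, register_pos, crowd_size=3):
    """frames: list of {visitor_id: (room, (x, y))} snapshots over time."""
    events = []
    prev_room = {}
    for t, frame in enumerate(frames):
        occupancy = {}
        for vid, (room, pos) in frame.items():
            occupancy.setdefault(room, []).append(vid)
            if vid in prev_room and prev_room[vid] != room:
                events.append((t, "room_change", vid, room))
            for name, painting_pos in paintings.items():
                if near(pos, painting_pos):
                    events.append((t, "near_painting", vid, name))
            if near(pos, register_pos):
                events.append((t, "at_register", vid, None))
            prev_room[vid] = room
        for room, vids in occupancy.items():
            if len(vids) >= crowd_size:
                events.append((t, "assembly", room, tuple(sorted(vids))))
    return events

frames = [
    {"v1": ("hall", (0.0, 0.0)), "v2": ("hall", (5.0, 5.0))},
    {"v1": ("room1", (2.0, 2.0)), "v2": ("hall", (5.0, 5.0))},
]
paintings = {"mona": (2.5, 2.0)}
events = syntactic_events(frames, paintings, register_pos=(9.0, 9.0))
print(events)
```

Everything this detector emits is purely syntactic: it knows about regions and positions, not about what any of these occurrences might mean.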
Several people may be interested in determining meaningful events based on these syntactic occurrences. For instance:
A security guard will be interested in generically "suspicious" behavior: visitors coming in at unusual times (the security guard will certainly be interested in a "visitor" arriving in the middle of the night!) or, of course, a visitor passing by a small artifact just before that artifact disappears.
A museum curator might be interested in people who spend unusual amounts of time in different rooms: she might want to see which artifacts these visitors look at, and possibly plan a different arrangement of the rooms or of the artifacts inside a room.
The person in charge of the gift shop might be interested in people who spend a lot of money (the "buy" event would be, in this case, connected to the cash register to see how much people spend) and, possibly, in tracking the same people during their visit to the museum to see whether the art they are interested in influences their buying and, possibly, to plan for a better offering in the gift shop.
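These three readings can be phrased as context-dependent semantic detectors running over the same list of syntactic events. The event representation (dictionaries with a type, an hour, and role-specific fields) and all thresholds below are illustrative assumptions.

```python
# Three semantic detectors, one per professional role, filtering the
# same syntactic events according to different notions of "interesting".

def guard_detector(events):
    """Security context: any presence outside opening hours (8-18)."""
    return [e for e in events if not 8 <= e["hour"] < 18]

def curator_detector(events):
    """Curator context: unusually long stays near a painting."""
    return [e for e in events
            if e["type"] == "near_painting" and e["minutes"] > 15]

def shop_detector(events):
    """Gift-shop context: large purchases at the cash register."""
    return [e for e in events
            if e["type"] == "buy" and e["amount"] > 100]

events = [
    {"type": "near_painting", "visitor": "v1", "hour": 14, "minutes": 40},
    {"type": "buy", "visitor": "v1", "hour": 15, "amount": 250},
    {"type": "near_painting", "visitor": "v2", "hour": 2, "minutes": 1},
]
# The same syntactic events yield three different structures:
print([e["visitor"] for e in guard_detector(events)])    # -> ['v2']
print([e["visitor"] for e in curator_detector(events)])  # -> ['v1']
print([e["visitor"] for e in shop_detector(events)])     # -> ['v1']
```

Each detector induces its own structure over the raw stream, which is precisely the context dependence the text argues for.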
One could continue with a wealth of other professional figures involved in the administration of a museum. The point is that the same syntactic events detected in the same video of people visiting the museum will offer radically different ways of organizing the video, depending on the particular interests of the people who are looking at it: it is the role of the viewer that determines what constitutes an event, how events should be connected, and what relevant information should be attached to them. The video data provide only the raw material around which these events are built.