Section IX: Panel of Experts: The Future of Video Databases | Handbook of Video Databases: Design and Applications (Internet and Communications)

Part Overview

In this chapter, world-renowned experts answer fundamental questions about the state of the art and future research directions in video databases and related topics.

The following four intriguing questions were answered by each panelist.

What do you see as the future trends in multimedia research and practice?
What do you see as major present and future applications of video databases?
What research areas in multimedia would you suggest to new graduate students entering this field?
What kind of work are you doing now and what is your current major challenge in your research?

EXPERT PANEL PARTICIPANTS

(alphabetically)

Dr. Al Bovik received the Ph.D. degree in Electrical and Computer Engineering in 1984 from the University of Illinois, Urbana-Champaign. He is currently the Robert Parker Centennial Endowed Professor in the Department of Electrical and Computer Engineering at the University of Texas at Austin, where he is the Director of the Laboratory for Image and Video Engineering (LIVE) in the Center for Perceptual Systems. Al's main interests are in combining the sciences of digital multimedia processing and computational aspects of biological perception.

Al Bovik's mission and professional statement

I would really like to know, and be able to predict (at least statistically), why people look where they do! If I knew this, then I think I would be able to change the way people think about Multimedia Processing. I have been technically inspired since early on by such thinkers as Arthur C. Clarke, Isaac Asimov, Carl Sagan, in general science; by my advisors in graduate school, Dave Munson and Tom Huang for their enthusiasm and storehouse of knowledge, guidance, and intuition; by my many peer colleagues in the field, such as Ed Delp and Rama Chellappa, for their professionalism, knowledge, and humor; but most of all, by the many brilliant students that I have been lucky enough to work with. Although I love my work, I escape from it regularly through my practices of enjoying my wife Golda and daughter Dhivya (one year old at this writing), my practice of yoga, my hobbies of deep-sky astronomy and close-up magic, and by near-daily dips and mile-long swims in Austin's wonderful Barton Springs Pool.

Dr. Alberto Del Bimbo graduated with Honors in Electronic Engineering from the University of Florence, Italy, in 1977. From graduation to 1988, he worked at IBM Italia SpA. In 1988, he was appointed Associate Professor of Computer Engineering at the University of Florence. He was then appointed, in 1994, Professor at the University of Brescia and, shortly thereafter, in 1995, returned to the University of Florence as Professor of Computer Engineering, in charge of Research and Innovation Transfer.

Prof. Del Bimbo's scientific interests and activities have dealt with the subject of Image Technology and Multimedia. In particular, they address content-based retrieval from 2D and 3D image databases, automatic semantic annotation and retrieval by content from video databases, and advanced man-machine interaction based on computer vision technology.

He authored over 180 publications and was the guest editor of many special issues of distinguished journals. He authored the monograph "Visual Information Retrieval," published by Morgan Kaufmann Publishers Inc., San Francisco, in 1999. He was the General Chairman of the 9th IAPR International Conference on Image Analysis and Processing, Florence 1997, and the General Chairman of the 6th IEEE International Conference on Multimedia Computing and Systems, Florence 1999.

Dr. Nevenka Dimitrova is a Principal Member Research Staff at Philips Research. She obtained her Ph.D. (1995) and MS (1991) in Computer Science from Arizona State University (USA), and BS (1984) in Mathematics and Computer Science from University of Kiril and Metodij, Skopje, Macedonia. Her main research passion is in the areas of multimedia content management, digital television, content synthesis, video content navigation and retrieval, and content understanding.

Dr. Dimitrova's mission and professional statement

I believe that the advancement of multimedia information systems can help improve quality of life (and survival). My professional inspiration is drawn from colleagues on this panel, particularly Prof. Ramesh Jain. I believe that our inspiration for moving from Hume-multimedia processing to Kant-multimedia processing should come from philosophy and psychology, but the research should be firmly grounded on a formal mathematical basis.

Dr. Shahram Ghandeharizadeh received his Ph.D. degree in computer science from the University of Wisconsin, Madison, in 1990. Since then, he has been on the faculty at the University of Southern California. In 1992, Dr. Ghandeharizadeh received the National Science Foundation Young Investigator's Award for his research on the physical design of parallel database systems. In 1995, he received an award from the School of Engineering at USC in recognition of his research activities.

From 1994 to 1997, he led a team of graduate students to build Mitra, a scalable video-on-demand system that embodied several of his design concepts. This software prototype functioned on a cluster of PCs and served as a platform for several of his students' dissertation topics. In 1997, Matsushita Information Technology Laboratory, MITL, purchased a license for Mitra for research and development purposes.

Dr. Ghandeharizadeh's primary research interests are in the design and implementation of multimedia storage managers and parallel database management systems. He has served on the organizing committees of numerous conferences and was the general co-chair of ACM-Multimedia 2000. His activities are supported by several grants from the National Science Foundation, Department of Defense, Microsoft, BMC Software, and Hewlett-Packard. He is the director of the Database Laboratory at USC.

Dr. David Gibbon is currently a Technology Consultant in the Voice Enabled Services Research Laboratory at AT&T Labs - Research. He received the M.S. degree in electronic engineering from Monmouth University (New Jersey) in 1990 and he joined Bell Laboratories in 1985. His research interests include multimedia processing for searching and browsing of video databases and real-time video processing for communications applications.

David and his colleagues are engaged in an ongoing project aimed at creating the technologies for automated and content-based indexing of multimedia information for intelligent, selective, and efficient retrieval and browsing. Such multimedia sources of information include video programs, TV broadcasts, video conferences, and spoken documents. Lawrence Rabiner is one of the people who inspired David professionally. Rabiner's unwavering work ethic, pragmatism, positive attitude, and research excellence serve as models that David strives to emulate. David is a member of the IEEE, SPIE, IS&T, and the ACM and serves on the Editorial Board for the Journal of Multimedia Tools and Applications. He has twenty-eight patent filings and holds six patents in the areas of multimedia indexing, streaming, and video analysis.

Dr. Forouzan Golshani is a Professor of Computer Science and Engineering at Arizona State University and the director of its Multimedia Information Systems Laboratory. Prior to his current position, he was with the Department of Computing at the Imperial College, London, England, until 1984. His areas of expertise include multimedia computing and communication, particularly video content extraction and representation, and multimedia information integration.

Forouzan is the co-founder of Corporate Enhancement Group (1995) and Roz Software Systems (1997), which was funded and became Active Image Recognition, Inc. in 2002. He has worked as a consultant with Motorola, Intel Corp., Bull Worldwide Information Systems, iPhysicianNet, Honeywell, McDonnell Douglas Helicopter Company, and Sperry. He has successfully patented nine inventions, and others are in the patent pipeline. Forouzan's more than 150 technical articles have been published in books, journals, and conference proceedings.

He is very actively involved in numerous professional services, mostly in the IEEE Computer Society, including serving as chair of the IEEE Technical Committee on Computer Languages and in the IEEE Distinguished Speaker program. His conference activities include chairing six international conferences, serving as program chair for five others, and serving on the program committees of over 70 conferences. Forouzan is currently the Editor-in-Chief of IEEE Multimedia and has served on its editorial board since its inception. He wrote the Review column and the News column at different times. He also serves on the editorial boards of a number of other journals. He received a PhD in Computer Science from the University of Warwick in England in 1982.

Dr. Thomas S. Huang is William L. Everitt Distinguished Professor of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign. He received his Sc.D. from the Massachusetts Institute of Technology in 1963. His research interests include image processing, computer vision, pattern recognition, machine learning, multimodal (esp. visual and audio) human computer interaction, and multimedia (esp. images and video) databases.

Dr. Ramesh Jain is Rhesa Farmer Distinguished Chair in Embedded Experiential Systems, and Georgia Research Alliance Eminent Scholar, School of Electrical and Computer Engineering and College of Computing, Georgia Institute of Technology, Atlanta, GA 30332-0250. He received his Ph.D. from the Indian Institute of Technology, Kharagpur, in 1975.

Dr. Jain's mission and professional statement

My professional goal is to address some challenging real problems and develop systems that are useful. People that inspired me professionally include among others Prof. Azriel Rosenfeld and Prof. Hans-Helmut Nagel.

I enjoy taking research out of my lab to practice, particularly to develop products. Working with some bright research students to develop products and build companies has been a rewarding experience. I am also trying to develop devices to take computing to the masses - to people in third world countries who are illiterate. It is time that we try to bridge the digital divide by developing 'folk computing'. Folk computers will be multimedia systems that will use text only when essential - exactly the opposite of what happens today.

Dr. Dragutin Petkovic obtained his Ph.D. at UC Irvine, in the area of biomedical image processing. He spent over 15 years at IBM Almaden Research Center as a scientist and in various management roles. His projects ranged from the use of computer vision for inspection to multimedia databases and content-based retrieval for image and video. Dr. Petkovic received numerous IBM awards for his work and is one of the founders of the content-based retrieval area and IBM's QBIC project.

Dr. Petkovic's mission and professional statement

My passion is for multimedia systems that are very easy to use by nontechnical users, offer automatic indexing and data summarization, and exhibit some form of intelligence. My inspiration comes from technical, scientific, and business leaders who are successful but at the same time compassionate and people oriented.

Dr. Rosalind Picard is Associate Professor of Media Arts and Science, MIT Media Lab and Co-director MIT Things That Think Consortium. She received her Doctorate from MIT in Electrical Engineering and Computer Science in 1991.

Dr. Picard's mission and professional statement

The mission statement for my research group can be found at http://affect.media.mit.edu. Basically, our research focuses on creating personal computational systems with the ability to sense, recognize, and understand human emotions, together with the skills to respond in an intelligent, sensitive, and respectful manner toward the user. We are also developing computers that aid in communicating human emotions, computers that assist and support people in development of their skills of social-emotional intelligence, and computers that "have" emotional mechanisms, as well as the intelligence and ethics to appropriately manage, express, and otherwise utilize such mechanisms. Embracing the latter goal is perhaps the most controversial, but it is based on a variety of scientific findings from neuroscience and other fields, which suggest that emotion plays a crucial role in enabling a resource-limited system to adapt intelligently to complex and unpredictable situations.

In short, we think mechanisms of emotion will be needed to build human-centered machines, which are able to respond intelligently and sensitively to the complex and unpredictable situations common to human-computer interaction. Affective computing research aims to make fundamental improvements in the ability of computers to serve people, including reducing the frustration that is prevalent in human-computer interaction.

With respect to personal challenges, my greatest challenge (and delight) is raising my sons. People that inspired me professionally are Fran Dubner, Anil Jain, Nicholas Negroponte, Alan Oppenheim, John Peatman, Alex Pentland, Ron Schafer, and many other teachers, colleagues, and friends.

Expert Panel Discussion

What do you see as the future trends in multimedia research and practice?

Rosalind Picard

With respect to long-term needs and challenges, I see a demand for:

(a) Development of better pattern recognition, machine learning, and machine understanding techniques, especially for enabling semantic and conversational inquiries about content, and (b) Interfaces that recognize and respond to user state during the interaction (detecting, for example, if the user is displeased, pleased, interested, or bored, and responding intelligently to such).

Neither of these is entirely new, but I would like to see a change in their emphasis: from being centered around cool new algorithms and techniques, to being centered around what people are good at. How can multimedia systems enable users to not only feel successful at the task, but also to enjoy the experience of interacting with the system? One way of thinking about this is to build the system in a way that shows respect for what people like to do naturally, vs. expecting people to become trained in doing things the system's way. If a person assumes something is common sense, the system should be capable of understanding this. If the person communicates to the system that it is not functioning right or it is confusing, then the system should acknowledge this, and even be able to take the initiative to offer the user an alternative. If the person communicates that one approach is enjoyable, and another is annoying, then the system could consider adapting to make the experience more enjoyable for that user. If the user communicates in a way that is natural, that another person would understand, but the system still doesn't get it, then we haven't succeeded yet. Clearly we have a long way to go, since today's computers really "don't get it" with respect to understanding most of what we communicate to them.

Ramesh Jain

Ten years ago, people had to plan their budget to create a multimedia environment for acquiring, processing, and presenting audio, video, and related information. Now computers below $1,000 are all equipped with a powerful multimedia environment. Mobile phones are coming equipped with video cameras. The trend to make everything multimedia exists and will continue.

This will give rise to some new and interesting applications. I believe that one of the very first will be that e-mail will become more and more audio and video. This will be used all over the world by people who do not know English, or who don't even know how to read and write in their own languages. Text mail will remain important, but audio and video mail will become popular and someday overtake text in its utility. This has serious implications for multimedia researchers because this will force people to think beyond the traditional notion of everything being a page of text. Storage, retrieval, and presentation of true multimedia will emerge. Current thinking about audio and video being an appendix to text will not take us too far.

Multimedia research is not keeping pace with the technology, however. In research, multimedia has to become multimedia. We all know that 3D is not the same as 3 × 1D. Current multimedia systems rarely address multimedia issues. Look at all the conferences and books; they usually have sections on images, audio, video, and text. Each section is more or less independent. This is definitely not multimedia. Multimedia must increasingly address the Gestalt of the situation.

Multimedia research must address how the semantics of the situation emerges out of individual data sources independent of the medium represented by the source. This issue of semantics is critical to all the applications of multimedia.

Alberto Del Bimbo

High speed networking will emphasize multimedia applications on the Internet that include 3D digital data and video. It will also enhance interactivity and enrich the type of combinations of media employed. The range of future applications is definitely broad. Applications that will interest a large public will include, among others, interactive TV, multimedia database search, and permanent connectivity to the net. Niche applications will include new modalities of man-machine interaction, based on gestures, emotions, speech, and captured motion. We expect that these applications will drive most research on multimedia in the coming years. It will develop along different lines: Operating System Support - including network and service management, database management, quality of service; Network Architectures; Multimedia Databases - including digital library organization and indexing, search engines, content based retrieval; Multimedia Standards; Man-machine Interaction - including new terminals for multimedia information access and presentation, multimodal interactivity, simultaneous presence in virtual and augmented environments.

Nevenka Dimitrova

The trend in multimedia research so far has been to use existing methods in computer vision, audio analysis, and databases. However, in the future, multimedia research will have to break new frontiers and extend the parent fields. First of all, we need to rely on context and memory. Context is the larger environmental knowledge that includes the laws of biology and physics and common sense. In philosophical terms, we have so far been using what I call the "Hume" model of signal processing, where only the things that exist in the present frame are real, and we should transcend to the "Kant" model, where there is a representation that accounts for common sense knowledge and assumptions about the expected behavior of the entities that are sought. Memory is an important aiding factor in analysis with longer term goals. In this respect our methods have severe anterograde amnesia, and we keep only very localized information about the current computations. In detection of dissolves, we keep a buffer of frames. In computing scenes we keep a few minutes' worth of data. However, in multimedia processing we need to keep more information for longer periods of time, such as full programs, episodes, and genres.

The practice will have to include wide applications that use this technology in support of "normal" activities of the users: their everyday life, work, and entertainment. In all three categories they could be served by storage, communications, and productivity activities.

Thomas Huang

In the past 10 years, researchers have been working on various components of Video Databases, such as shot segmentation and key frame extraction, genre classification, and retrieval based on sketching. In the future, we shall see more work in integrating these components, and in searching for important and meaningful applications.

Al Bovik

I was quite the reader of science fiction when I was in college. Among the most famous works that I liked very much is, of course, Isaac Asimov's Foundation Trilogy. While I was in school, Asimov published a fourth volume in the series entitled Foundation's Edge, which wasn't as good as the original trilogy, but part of it still sticks in my mind. What has this to do with future trends in Multimedia? Well, Asimov's work, like much of good science fiction, contains good speculative science writing. One of the things that I remember from Foundation's Edge (and little else, actually) is a chapter where the hero of the moment (whose exotic name I forget) is piloting a small starship. The way in which he interacted with the shipboard computer is what I remember most: it was almost a merging of the human with the computer at all levels of sensory input: visual, tactile, aural, olfactory, etc. My recollection (25 years later; please don't write to correct me on the details!) is that the hero and computer communicated efficiently, synergistically, and bidirectionally. The impression was given that the spaceship deeply analyzed the operator's sensory and manual capacities and, using this information, delivered maximal sensory information to the operator, which, simply stated, gave the operator the sensation of being part of the ship itself, with awareness of all of its sensory data, control over its mechanics, and complete capability for interaction with every aspect of the ship and its environment. This makes things sound both very mysterious and technical, whereas the writer intended to convey that human and machine were experiencing one another, at all levels involving sensory apparatus. Perhaps an efficient way to encapsulate this is to say that the fictional multimedia engineers that had designed the systems on this spaceship were, generally, masters of human-computer interface design, and, specifically, what I would call Multimedia Ergonomics.

So, this is my answer regarding which future trends will eventually dominate multimedia research and practice: Multimedia Ergonomics! I am sure that there are many important answers to this question that are more immediate and accessible. However, since many of the most relevant to near-term practice will likely be addressed by other authors writing in this space, I have decided to try to take a longer view and be speculative.

So, again, Multimedia Ergonomics. What do I mean by this? Well, first, what is meant by ergonomics? Most of us think in terms of back-supporting office chairs and comfy keyboard and mouse supports. Dictionary.com gives us this: "The applied science of equipment design, as for the workplace, intended to maximize productivity by reducing operator fatigue and discomfort." Well, I would propose to take the definition and application further, meaning that ergonomics for multimedia should be closed-loop, active, sensor-driven, and configurable in software, all of these leading to a more relaxed, productive, data-intensive, and yes, comfy multimedia experience.

Certainly, the ultimate multimedia system could deliver an all-enveloping, sense-surround audio-visual-tactile-olfactory-etc. experience; indeed, much has been accomplished in these directions already: visit your local IMAX theatre, advanced video game, or museum. But regardless of the degree of immersion in the multimedia experience, I propose that future emphasis be placed on truly closing the loop; rather than just surrounding the participant with an onslaught of sensory data, multimedia systems should familiarize themselves with the human(s) involved, by measuring the parameters, positions, and responses of their sensory organs; by measuring the complete three-dimensional physical form, position, movements, and perhaps even state of health and fitness of the participant(s); by providing exceptionally rich and absorbable sensory data across the range of sensory capacity, and means for delivering it to the full range of sensory organs. In other words, completely merge the human and machine at the interface level, but in as noninvasive a way as possible.

Since all this is highly speculative, I cannot say how to best do all of these things. However, I do observe that much work has been done in this direction already (with enormously more to do): methods for visual eyetracking, head-tracking, position sensing, body and movement measurement, human-machine interfacing, assessing human sensory response to data; immersive audio-visual displays, and even olfactory sensing and synthesis are all topics that are currently quite active, and are represented by dedicated conferences, meetings, and so on. However, most of these modes of interactive sensing still have a long way to go. For example, visual eyetracking, or the means for determining the gaze direction of the human participant(s), has become mostly non-invasive, but still requires fairly expensive apparatus, a cooperative subject, and a tradeoff between calibration time taken and accuracy in each session. Visual algorithms that use visual eyetracking data are still few and far between. Methods for assessing human sensory response to data of various types have advanced rapidly, owing primarily to the efforts of sensory psychologists. Our own contribution to this book details our current inability to even predict whether visual data will be considered "of good quality" to a human observer. Advanced methods for human-machine interfacing are high priorities with researchers and funding agencies, but my desktop computing environment doesn't look much different than it did ten years ago.

So, how to approach these problems and make progress? I think that the answer lies in looking within, meaning to understand better the sensory functioning and sensory processing capacities of human beings. Sensory psychologists are, of course, energetically engaged in this effort already, but rarely with an eye towards the engineering applications, and specifically, towards implications for Multimedia Ergonomics that I view as critical for advancing this art.

For my part, I believe that the answer is for Multimedia Engineers to work closely with sensory psychologists, for their common ground, I have discovered, is actually quite wide when recognized. I plan to continue working with my colleagues in both fields as I have been doing for some time now. Come to think of it, I think I may soon return to Asimov's Foundation Series for a reread and some new ideas.

Forouzan Golshani

Let me point out two areas, each with a completely different emphasis, as trends that we should follow. One is the general area of "security, biometrics, and information assurance," which is now receiving considerably more attention, and the other is media technologies and arts. I elaborate on each one separately.

Whereas the premise that digital information is no longer limited to the traditional forms of numbers and text is well accepted, why is it that multimedia information is not specifically addressed in such areas as information assurance, security and protection? With the ubiquitous nature of computing and the Internet, the manner in which information is managed, shared and preserved has changed drastically. Such issues as intellectual property (IP) protection, security, privacy rights, confidentiality, authorization, access control, authentication, integrity, non-repudiation, revocation and recovery are at the heart of many tasks that computer users have to face in different degrees. These issues significantly impact both the technology used for information processing (namely hardware, software and the network) and the people who perform a task related to handling of information (i.e., information specialists). Thus, the major elements are information, technology and people, with information being the prime element. This departure from system orientation and emphasis on information itself implies that information must be considered in its most general form, which would encompass multimedia, multi-modal, and multi-dimensional information for which the traditional methods may not be sufficient. In addition, information fusion is at the heart of the matter. Take, for example, the general security area and one of its most essential elements, biometrics. Whether it is 2D or 3D face images, voice samples, retina scans or iris scans, the multimedia field plays a major role in this area. Similarly, information protection, say intellectual property preservation, clearly overlaps significantly with our field. Examples range from watermarking of objects to content-based comparison of digital documents for identifying significant overlap. Despite these obvious examples, the current discussions on information assurance still center around securing the "information system" with a clear emphasis on protection against unauthorized access and protection from malicious modifications, both in storage and in transit. Shouldn't parameters that our community holds important, say, quality of service, be part of the general area of assurance?

With respect to media and the arts, I would like to focus on a discontinuum that currently exists between multimedia technologies and multimedia content. Arts are mostly about experiences. Many (including Ramesh Jain, a leading multimedia pioneer) have argued for inserting yet another layer, namely "experience," into the traditional hierarchy of "data, information, knowledge, wisdom, …", where data and information are used to bring about "experiences" which may lead to knowledge. For example, a movie at the data level would be identified by pixels, colors, transitions, etc., and at the information level by recognized scenes and characters. All these, however, must convey a certain experience that the artist intended to communicate. Looking at the state of multimedia technology, it seems that sensory data processing, transmission and fusion, data abstraction and indexing, and feedback control provide new paradigms to artists for engaging their audience in experiences that were not possible before. But how easy are these tools to use for an artist who is not a programmer? In other words, the tool builders of the multimedia community have certainly mastered data and information processing, but the leap to creating new experiences has been left to those artists who are technically astute and have been able to master the digital multimedia technology. When I spoke with a friend, who is an internationally recognized composer, about the gap between multimedia tools and artistic content, he commented that in order to bridge the gap, a phenomenon similar to what happened in nineteenth-century Europe is needed in the digital age. He explained that in those days, the piano was the center of all social activities (from homes to schools and other public places), and families and friends gathered around their piano at night to enjoy music by the best composers of the time. Those composers knew the instruments well and communicated directly with the instrument builders. On the other hand, the instrument builders knew the work of the composers well and built instruments that could make their art better. For example, by knowing the intention of the composer in using "fortissimo", the piano builders perfected certain features that enabled such variations in sound. He concluded that such a dialog between today's tool builders and artists would be of utmost importance. Well… in the 21st century, the computer has become the centerpiece of activities in homes, schools, and many public places. The question is, do we have master artists who can effectively use computers in creating new experiences that were not possible before? Are the existing tools and techniques adequate for artists who would want to make use of the digital technology in their work, or are they limiting factors when it comes to communicating experiences? Some believe that digital arts in general are not ready to claim a leadership role in the aesthetic and emotional development of artistic endeavors across the society. Should the two groups - multimedia technologists and the artists - begin true collaboration in a manner similar to that of the 19th century, which led to the creation of the piano, we may begin to see a transition… a transition away from media content that uses forms with origins in the pre-digital age!

Dragutin Petkovic

Basic technologies and market trends that will significantly help the development of multimedia information systems but which will be mainly addressed by industry and market forces are as follows:

The volume of information is rapidly increasing, both in size and richness. Media data is growing much faster than other data. At the same time, the abilities, time, and patience of potential creators, editors, and users of such information are staying constant, and the cost of human labor is not going down.
Cost/performance of all hardware is advancing rapidly: CPU, storage, memory and even network infrastructure can be assumed to be free and not an obstacle.
Capture and delivery devices as well as sensors are showing rapid advances as well. They range from digital cameras, multimedia cell phones, handheld devices, PCs, to small wireless cameras, and every year they are smaller, more powerful and cheaper.
Basic multimedia software (operating systems, compression, media players and tools) is reaching maturity.
Software interface standards are also progressing well and are largely being resolved by international standards bodies and by cooperation of academia, Government and industry. For example: MPEG-4 and MPEG-7 (related to video metadata organization), XML, industry-specific metadata standards, WWW services and various wireless standards promise to enable easy connectivity between various building blocks in terms of software compatibility.
Media, gaming and entertainment applications are part of our daily lives. New forms of entertainment (interactive TV, real video on demand) are receiving increased attention from major media corporations.
Integration of media devices, PCs, and WWW at home (or anywhere) offers yet another new frontier for PC, software, and electronic manufacturers.

However, there are some fundamental challenges inhibiting wider proliferation of multimedia information systems that must be addressed by the research community:

Successful multimedia indexing, cross-indexing, summarizing and visualization are today done only by humans. This is a slow and expensive process, and it is feasible only for a few applications where the cost of this process can be amortized over a large number of paying customers, such as major entertainment, gaming, and media applications.
On the usage side, people have only a limited amount of time to devote to multimedia information systems, while at the same time the volume of data is growing rapidly. Users are impatient, want easy-to-use, robust and intuitive systems, expect timely data and fairly accurate indexing, and want the ability to navigate and quickly browse multimedia, rather than use simple linear play (except for entertainment).

David Gibbon

Over the past few years, there has been much emphasis on algorithms and systems for managing structured video such as broadcast television news. While this was a logical starting point for algorithm development, we must now focus on less structured media. Media processing algorithms such as speech recognition and scene change detection have been proven to work well enough on clean, structured data, but are known to have poorer performance in the presence of noise or lack of structure. Professionally produced media amounts to only the tip of the iceberg — if we include video teleconferencing, voice communications, video surveillance, etc. then the bulk of media has very little structure that is useful for indexing in a database system. This latter class of media represents a potentially rich application space.

Accuracy of the individual media processing algorithms is steadily improving through research as well as through microprocessor advances. Multimodal processing algorithms will benefit from the improvements in these constituent algorithms and they will improve intrinsically through intelligent application of domain specific knowledge. The infrastructure for many advanced multimedia systems is available today and will be widespread in a few years. This comprises broadband networking to consumers, terminals (PCs) capable of high quality video decoding and encoding, affordable mass storage devices suitable for media applications, and, importantly, media players (some even with limited media management capability) bundled with operating systems. All of these trends favor the diffusion of multimedia and motivate multimedia research.

Shahram Ghandeharizadeh

Generally speaking, a multimedia system employs human senses to communicate information. Video communication, for example, employs human sight. Its essence is to display 30 pictures a second to fool human visual perception into observing motion. It is typically accompanied by sound to stimulate our auditory sense. This communication might be either real-time or delayed. An example of real-time communication is a video teleconference involving multiple people communicating at the same time. This form of communication is almost always interactive. Delayed communication refers to a pre-recorded message. While this form of communication is not necessarily interactive, it is both commonplace and popular, with applications in entertainment, education, news dissemination, etc. Movies, news clips, and television shows are examples of delayed communication.

Humans have five senses: in addition to our visual and auditory senses, we use our touch, taste, and smell senses. Assuming that all these five senses can be exercised by a system, it may produce a virtual world that is immersive. Touch along with human motor skills is starting to receive attention from the multimedia community. Its applications include entertainment, education, scientific applications, etc. As an example, consider students in a dental program. Their education entails training to obtain sufficient dexterity to perform a procedure such as filling a cavity. This training is typically as follows: a student observes an instructor and then tries to repeat the instructor's movements. An alternative to this is to employ a haptic glove equipped with sensors that record the movement of the instructor when performing a procedure. Next, the student wears the glove and the glove controls the movement of the student's hand to provide him or her with the dexterity of the instructor. The glove may operate in a passive mode where it records the student while performing the operation and then gives feedback as to what the student might have done incorrectly when performing the procedure. A challenge of this research area is to develop mechanical devices that do not result in injuries while training students, data manipulation techniques that compensate for different hand sizes, etc.

Taste and smell are two senses that are at the frontiers of multimedia research. At the time of this writing, it is unclear how one would manipulate these two senses. An intersection of Computer Science, Biomedical Engineering, and Neuroscience may shape the future of this area. This intersection currently offers devices such as cochlear implants for hearing-impaired individuals, organ transplants such as human hands with nerve re-attachments, and devices that feed data directly into the visual cortex of blind people to provide them with some sight. Given advances in computer hardware technology, it is not far-fetched to envision future MEMS devices that interface with our neurons to stimulate our senses in support of taste and smell.

In passing, note that video required approximately a century of research and development, e.g., Thomas Edison authored the first camcorder patent. Maybe in the 21st century, we will be able to develop true multimedia systems that exercise all five of our senses.

What do you see as major present and future applications of video databases?

Nevenka Dimitrova

With the storage capacity doubling every year, current consumer storage devices should reach the terabyte range by 2005. That means that the professional range will be reaching exabytes. We should look at the generators of the content, not only the 3500 movies that Hollywood and its equivalents around the world produce every year, but also all the camera devices in surveillance, mobile communication, live event streaming, conferencing and personal home video archives.

In creating video databases we travel the round trip: from brains to bits and back. In film production, the trip starts with an idea expressed in a script, followed by production and capture of this idea into bits. Accessing this information in a video database requires enabling the return trip from the bits to consumption and playback. The applications lie in enabling this path from bits back to brains in the enterprise, in the home environment, and in access to public information.

Forouzan Golshani

The major application areas for video databases continue to be education, law enforcement (surveillance, access control, criminal investigations), and entertainment. Here I am including such areas as medicine within the general category of education, and entertainment is broadly used to include sports as well as various forms of arts. Some other applications include: sales and marketing support, advertising, manufacturing and industrial monitoring. Another interesting application area, not tapped to its full potential, is rehabilitation.

The biggest obstacle facing the video database area is the lack of a sound business model by which an enterprise can maintain its ownership of the intellectual property and have a reasonable return on its investment. Clearly, we know how to do this with traditional media, e.g., cinema and video rental outlets. Ideally, we will soon have comparable business models for the digital era.

Thomas Huang

Useful real-world video databases, in which the retrieval modes are flexible and based on a combination of keywords and visual and audio contents, do not yet exist. We hope that such flexible systems will in the future find applications in: sports events, broadcast news, documentaries, education and training, home videos, and above all biomedicine.

Shahram Ghandeharizadeh

Video-on-demand is an application of video databases that is long overdue. This application faces challenges that are not necessarily technical. These include how to enforce copyright laws, pricing techniques and frameworks that benefit all parties, etc. There are some signs of progress in these directions. As an example, Time Warner is offering four channels of video-on-demand with its digital cable set-top box in Southern California; TiVo enables one household to email an episode of their favorite show to another household, etc.

Another immediate application is to provide storage and bandwidth for households to store their in-home video libraries on remote servers. This service would empower users to quickly search and discover their recorded video clips. A user may re-arrange video segments similar to shuffling pictures in a picture album. A user may organize different video clips into different albums where a sequence might appear in multiple albums. Such a digital repository must provide privacy of content (authentication, and perhaps encryption of content), fast response times, and high throughputs.

Ramesh Jain

Corporate video is one area that I feel has not received as much attention from researchers as it should. Corporate, training, and educational videos are ready for modernization. Unfortunately, tools to manage them are lacking at this time.

What the field needs is a 'videoshop' that will allow people to quickly enhance and manipulate their videos. A powerful editing environment for presenting videos will also help. But the most important need is for a simple database that will allow videos and photographs from cameras to be stored on our systems and accessed easily. Current research is addressing problems that may be useful about 10 years from now, and is ignoring important problems that must be solved to make video databases a reality today.

David Gibbon

There are several major application areas for video databases. We must first look to the entertainment industry if we are searching for commercially viable applications of these technologies. Video-on-demand systems have been studied and developed for a long time, but we have yet to see widespread availability of these systems due to economic constraints. For several years pundits have been predicting that VoD services were just around the corner. Given this history, it would be unwise to predict this yet again. However, it is clear that storage, compression, and delivery technologies are on a steady trend that reduces the overall cost of operating these systems. One would conclude that the economics should eventually turn favorable. Another driving force is that consumer expectations in this area are being raised by the availability of advanced PVRs (digital video recorders with personalization).

A second major application area for video databases is in the video production process. There is a slow but steady trend to automate the production process through the use of digital asset management systems. There is clearly a need for more research here, since the bulk of the video footage is "raw" field material with little structure. In fact, the production process can be thought of as one of creating structure. Video producers will also continue to enhance the quality and availability of the archives of the results of their work to extract value and possibly to repurpose it, but the issues in this case are more of engineering and standardization rather than signal processing research.

We are also beginning to see significant application-specific video databases for academic and pedagogical purposes. Two examples of these are the Survivors of the Shoah Visual History Foundation's visual histories archive, and the Museum of Television and Radio. This trend is encouraging to and should be supported by video database researchers. One challenge here is to balance the dichotomy of the need for customization with the desire for standardization and interoperability.

Finally, it is interesting to consider peer-to-peer file sharing systems as the antithesis of video on demand systems in several respects (video quality, structure of metadata, homogeneity of content, rights management, centralization, economic model, terminal device, etc.). These systems can be thought of as loosely organized distributed video databases. Methods developed to manage this data to promote ease of use and to add features should be applicable to other application areas as well.

Dragutin Petkovic

The compelling possibilities for multimedia information systems are numerous, such as:

Entertainment and gaming over TV, cable, wireless, or WWW including future radically novel forms that are now in early research stage.
Training, education, and visual communication where presentations, important talks, meetings and training sessions are captured, indexed, cross-indexed, and summarized automatically and in real or near real time.
Surveillance and security, where large numbers of video and other sensors deliver and process huge amounts of information that has to be adequately analyzed in real time for alerts and threats (including predicting them) and presented so that people can digest it effectively.
Monitoring and business or government intelligence where vast resources on the WWW and on file systems are constantly monitored in search and prediction of threats (in real or near real time).
Multimedia information that can be quickly found and delivered to the user anywhere (medical, bio-hazard, troubleshooting, etc.).
Military applications involving unmanned vehicles with a variety of sensors and remote controls.
Consumer and end-user created applications using home entertainment, WWW and other systems.

We also expect that a few years from now we will see some totally new applications that we have not even thought of today.

Alberto Del Bimbo

Video database applications will allow faster search of information units ranging from a few seconds of video to entire movies (up to 2 hours). The use of this information ranges from the composition of news services by the broadcast industry to private entertainment. Private entertainment applications will probably be available soon to the masses and require fast and effective access to large quantities of information. For enterprise applications the focus will be on the possibility of automatic annotation of video streams. We must distinguish between live logging (real-time automatic annotation for short-term reuse) and posterity logging (automatic annotation for storage and later reuse) applications. In the near future, it is possible that live logging will be much more interesting than posterity logging for many enterprises and, certainly, for the broadcast industry. Live logging will reduce the time-to-broadcast and, unlike posterity logging, has little impact on the company organization.

Al Bovik

For this question I will be somewhat less speculative! But, perhaps, equally nonspecific. My reason for this is that I think that visual databases (not just video) are going to become quite ubiquitous. In some forms, they already are. The Internet is a gigantic, completely heterogeneous visual (and other data forms) database. Most home PCs either have simple resident visual databases (files of photos that people have taken or have been sent) or similar databases at commercial websites. I know that I do, and so do other members of my family. Nearly every scientific enterprise today involves imaging in one form or another, and this usually means cataloging huge amounts of data. This is especially true in such visual sciences as remote sensing, medical, geophysical, astronomical/cosmological, and many other fields. And of course this is all obvious and is what is driving the field of visual database design. In my view, the two largest problems for the farther future are, in order of challengingness: (1) visual display and communication of the data, for I assume that the data is not only to be computer-analyzed, but also viewed, by all types of people, of all ages and educational levels, with all types of computing, network, and display capabilities (implying complete scalability of all design); answers for this will, in my opinion, be found along the lines in my answer to the first question. (2) Capability for content-based visual search, meaning without using side or indexing information which, ultimately, cannot be relied upon. 
The answer to this second question is difficult, but again, I think the clues that are to be found reside within: humans are still better at general visual search than any algorithm promises to be in the near future, viz., ask a human to find unindexed photos containing "elephants," then ask an algorithm … yet I believe that the processes for search that exist in humans are primarily computational and logical and can be emulated and improved on, with vastly greater speed. Yet, current visual search algorithms do not even utilize such simple and efficient mechanisms as visual fixation and foveation. I believe also that the answers to be found in this direction will be found by working with experts in visual psychology.

What research areas in multimedia would you suggest to new graduate students entering this field?

Ramesh Jain

I would suggest the following areas:

How to bring domain semantics, identify all other sources of knowledge and use audio and video together to build a system? I would deemphasize the role of image processing and would emphasize closed captions, keyword spotting, and similar techniques when possible. In fact, I would make information the core, and the medium just the medium.

The assimilation of information from disparate sources to provide a unified representation and model for indexing, presentation, and other uses of 'real' multimedia data (audio, video, text, graphics, and other sensors) is another interesting problem that I would like to explore.

Another very interesting research issue at the core of multimedia systems is how to deal with space and time in the context of information systems.

Rosalind Picard

I'd suggest learning about pattern recognition, computer vision and audition, machine learning, human-centered interface design, affective interfaces, human social interaction and perception, and common sense learning systems. These are areas that are likely to have a big impact not only on multimedia systems, but in the design of a variety of intelligent systems that people will be interacting with increasingly.

With respect to pattern recognition and machine learning there is particular need for research in improving feature extraction, and in combining multiple models and modalities, with the criterion of providing results that are meaningful to most people. I am happy to see a new emphasis on research in common sense learning, especially in grounding abstract machine representations in experiences that people share. The latter involves learning "common sensory information" — a kind of learned common sense — in a way that helps a machine see which features matter most to people.

Al Bovik

I would encourage them to seek opportunities for cross disciplinary research in Multimedia Processing and in Computational Psychology and Perception.

David Gibbon

In addition to being a rewarding field to pursue, multimedia is also very broad. If a student has a background and interest in signal processing, then such topics as audio segmentation (detecting acoustic discontinuities for indexing purposes) or video processing for foreground/background segmentation may be appropriate. On the other hand, if data structures and systems are of interest, then students can work to benefit the community by developing indexing and storage algorithms for generic labeled interval data, or for searching phonetic lattices output from speech recognizers. Finally, for those with a communications bent, there is clearly room for improvement in techniques for dealing with loss, congestion, and wireless impairments in multimedia communications systems.
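As one hedged illustration of what an index for "generic labeled interval data" might look like, the Python sketch below keeps labeled time intervals (for instance, shots or speaker turns) sorted by start time and answers overlap queries. The class name `IntervalIndex`, its methods, and the sample labels are hypothetical, chosen only for this example; a system working at database scale would more likely use an interval tree or a similar structure.

```python
import bisect
from dataclasses import dataclass, field


@dataclass(order=True)
class LabeledInterval:
    start: float
    end: float
    label: str = field(compare=False)  # the label does not affect ordering


class IntervalIndex:
    """Hypothetical in-memory index of labeled, closed time intervals."""

    def __init__(self):
        self._items = []  # kept sorted by (start, end)

    def add(self, start, end, label):
        if end < start:
            raise ValueError("end must not precede start")
        bisect.insort(self._items, LabeledInterval(start, end, label))

    def overlapping(self, q_start, q_end):
        """Return the labels of all intervals that overlap [q_start, q_end]."""
        # Intervals that start after q_end cannot overlap; binary search
        # finds where they begin so they are never examined.
        probe = LabeledInterval(q_end, float("inf"), "")
        hi = bisect.bisect_right(self._items, probe)
        return [it.label for it in self._items[:hi] if it.end >= q_start]


index = IntervalIndex()
index.add(0, 10, "anchor")
index.add(8, 20, "field report")
index.add(25, 30, "weather")
print(index.overlapping(9, 12))
```

The linear filter on the end time keeps the sketch short; it is the part an interval tree would replace to make queries logarithmic in the number of stored intervals.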

Alberto Del Bimbo

Pattern recognition, speech analysis, computer vision, and their applications are important research areas for Multimedia. They support new man-machine interaction with multimedia information, based on gesture or motion capture or affective computing. In video databases, they can provide the means for automatic annotation of video streams, and, supported by knowledge models, the understanding of video contents.

Forouzan Golshani

As stated above, our field has been mostly about technology development. I would encourage students entering this area to consider the end results much more seriously than before. Engineering and computer science curricula are dominated by the question "HOW": how to solve a problem. We must think about the "WHY" question too! By working in interdisciplinary teams that include other students specializing in arts, bioengineering, or sports, engineering students immediately become aware of the problems that the end user faces. The best scenario will be to have the creativity from the domain (arts), and the methodology from science. An added benefit is that the entire team will be exposed to "problem solving in the large", since interdisciplinary projects bring out some real engineering issues. These issues are not generally encountered when working on a small scale project.

Thomas Huang

Important research areas include:

Video understanding, based on visual and audio contents as well as closed captions or even transcripts if available. Audio scene analysis is an important subtopic.
Interplay of video processing and pattern recognition techniques with data structure and management issues, when the database is HUGE.
Flexible user interface. More generally, how best to combine human intelligence and machine intelligence to create a good environment for video data annotation, retrieval, and exploration.

Dragutin Petkovic

Given the above trends and challenges, we suggest the following four basic areas that the research community (in collaboration with industry and Government) must work on in order to fulfill the promise of multimedia information systems:

Automated Real-Time Indexing, Cross-indexing, Summarization, and Information Integration: While we can capture, store, move, and render "data bits", we still lack the ability to get the right meaning out of them, especially in the case of video, sound and images. Issues of automated and real-time indexing, cross-indexing of heterogeneous data items, information integration and summarization are largely unresolved. Real-time or near-real-time performance is also critical, since in many cases the value of information is highest at the time close to its creation. One particular building block that offers great promise but requires more work is robust speech recognition that would adapt to many domain-specific vocabularies (e.g. computer science, medical), noise, speaker accents, etc.
Search, Browse, and Visualization: On the receiving side, users have only so much time (and patience) to devote to using such systems. At the same time the volume of information they can or have to assimilate as part of their work or entertainment is rapidly growing. Therefore, the second challenge for the research community is to provide solutions for easy search, and automated presentation and visualization of large volumes of heterogeneous multimedia information, optimized for human users and related delivery devices.
Making Systems more "Intelligent": We need to make systems that are aware of general context, that can adapt, learn from previous experience, predict problems, correct errors and also have the ability to manage themselves to a large extent. Systems should adapt their search and indexing algorithms based on context (e.g., the topic of the talk or the speaker's accent). In the case of very large systems, in times of crisis, resources should be intelligently focused on the parts of the data space or the servers where the data being searched for is most likely to be found in the shortest possible time. Systems should also learn and adapt to the users, "work" with users and present the information in such a way as to optimize information exchange with particular users and their devices. In case of failure or problems in delivery of media data, such systems should be able to adapt, correct and do the best job possible, given the constraints.
New applications: We need to explore totally new applications, usage and user interfaces for multimedia, benefiting from smaller and cheaper devices, abundant hardware and the ability to have all of them interconnected all the time with sufficient bandwidth.

Nevenka Dimitrova

I would delineate the areas into content analysis, feature extraction, representation and indexing, and potential applications.

Multimodal content understanding for all four areas above is the least explored area right now. We have a fair understanding of computer vision technology, its applications, and its limitations. We also have a fair understanding of speech recognition, but to a much lesser extent of audio scene content analysis and understanding. However, we still have only rudimentary approaches to a holistic understanding of video based on audio, visual and text analysis, and synthesis of information.

We need to consolidate the theoretical foundations of multimedia research so that it can be considered on an equal footing with the parent fields. We need to go beyond the simple application of pattern recognition to all the features that we can extract in the compressed and uncompressed domains. We need new pattern recognition techniques that will take into account context and memory. This requires meditating more on the new domain of audio-visual content analysis, on selecting mid-level and high-level features that stem from multimodality, and on scaling up indexing to tera-, peta-, and exabyte databases. The simplest application is retrieval; however, we need to invest brain cycles in other applications, in user interaction modalities, and in query languages. An alternative to active retrieval and browsing is content augmentation, where the query is given by a user profile and is run against all the abstracted information for personalization and filtering applications in a lean-back mode. We have to expand into applications that serve the purpose of content protection, streaming, adaptation, and enhancement, using the same techniques as in multimodal content analysis. In addition, in many circumstances storage serves as either the source or the target for these other applications, and it makes sense to investigate the synergy. For example, video streaming can use the knowledge of important scenes in order to provide a variable bit rate based on the content; content adaptation can rely on multimodal analysis to use the best features for transcoding and downscaling higher-rate video to be presented on mobile devices with limited resources.
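The streaming example can be made concrete with a minimal sketch. It assumes that content analysis has already assigned each scene a non-negative importance score, and it distributes a fixed average bit-rate budget so that more important scenes receive more bits. The function name `allocate_bitrates`, the floor rate, and the sample numbers are assumptions made for illustration, not part of any particular streaming system.

```python
def allocate_bitrates(scenes, avg_kbps, floor_kbps=100.0):
    """Assign a bit rate to each scene in proportion to its importance.

    scenes: list of (duration_seconds, importance) pairs, importance >= 0.
    Returns per-scene rates in kbps whose duration-weighted mean is avg_kbps.
    """
    total_duration = sum(d for d, _ in scenes)
    if total_duration <= 0:
        raise ValueError("total duration must be positive")
    if floor_kbps > avg_kbps:
        raise ValueError("floor rate cannot exceed the average rate")

    # Kilobits that remain once every scene has been given the floor rate.
    spare_kbits = (avg_kbps - floor_kbps) * total_duration
    weight = sum(d * w for d, w in scenes)
    if weight == 0:
        # No scene is marked important: fall back to a constant bit rate.
        return [avg_kbps] * len(scenes)
    return [floor_kbps + spare_kbits * w / weight for _, w in scenes]


# Two 10-second scenes; the second is judged three times as important.
rates = allocate_bitrates([(10, 1.0), (10, 3.0)], avg_kbps=500)
print(rates)
```

Because the spare bits are divided by the duration-weighted sum of importance scores, the total number of bits always equals the budget regardless of how the scores are scaled.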

Shahram Ghandeharizadeh

I encourage graduate students to work on challenging, high-risk research topics. Typically, research that investigates an intersection of multiple disciplines is most fruitful. As one example, consider material scientists who investigate light-sensitive material that produces electrical current in response to light to stimulate an eye's retina (for the visually impaired). A natural question is how these devices should evolve in order to increase a visually impaired individual's perception. One may start to investigate techniques to make recordings from these devices for subsequent playback. One may manipulate the digital recordings to guide material scientists towards an answer. One may even hypothesize about the design of MEMS-based pre-processors to personalize these devices for each individual. This is one example of an intersection of multiple disciplines to develop a high-risk research agenda. There are many others.

What kind of work are you doing now, and what is your current major challenge in your research?

Alberto Del Bimbo

In the field of video databases, we are presently working on automatic annotation of video streams, interpreting video highlights from visual information. The area of application is sports videos. In this research, we distinguish between different sports, model a priori knowledge using finite state machines, recover the real player position on the playing field from the imaged view, and distinguish significant highlights in soccer games, such as shots on goal, corner kicks, and turnovers. The ultimate goal is to obtain this semantic annotation in real time.
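To illustrate the general idea of encoding a priori knowledge as a finite state machine, the sketch below recognizes a "shot on goal" pattern in a sequence of low-level visual cues. The states, cue names, and transition table are entirely hypothetical and are not the models used in the research described above; they only show how a highlight can be declared when cues arrive in an expected order.

```python
# Hypothetical transition table: (current state, observed cue) -> next state.
TRANSITIONS = {
    ("idle", "ball_near_goal_area"): "attack",
    ("attack", "fast_ball_motion_toward_goal"): "shot",
    ("shot", "goalkeeper_close_up"): "highlight",
    ("shot", "ball_in_net"): "highlight",
}


def detect_highlights(cues):
    """Return the indices of the cues at which a highlight is recognized."""
    state = "idle"
    hits = []
    for i, cue in enumerate(cues):
        nxt = TRANSITIONS.get((state, cue))
        if nxt is None:
            # Unexpected cue: restart, letting this cue open a new pattern.
            nxt = TRANSITIONS.get(("idle", cue), "idle")
        state = nxt
        if state == "highlight":
            hits.append(i)
            state = "idle"
    return hits


cues = [
    "midfield_play",
    "ball_near_goal_area",
    "fast_ball_motion_toward_goal",
    "ball_in_net",
    "midfield_play",
]
print(detect_highlights(cues))
```

In a real-time setting the same loop could consume cues one at a time as they are extracted from the stream, since the machine only needs its current state.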

We are also working on content-based retrieval from 3D image databases, addressing automatic recovery of the 3D shape of objects from a 2D uncalibrated view, 3D feature extraction, and local and global similarity measures for 3D object retrieval. Finally, research is also being conducted on new interfaces for advanced multimedia applications, based on gestures or motion capture. In gesture-based interfaces the user points at and selects, with simple hand movements, multimedia information displayed on a large screen. With motion-capture interfaces, the 3D motion of the full human body or of its parts is captured and replicated in a virtual environment, for collaborative actions and eventually for training.

Ramesh Jain

I am trying to develop experiential environments for different applications. Experiential environments allow a user to directly utilize his senses to observe data and information of interest related to an event and to interact with the data based on his interests in the context of that event. In this setting, different data sources are just that: data sources.

Two major interesting problems here are correct representations to capture application semantics and assimilating data from different sources into that representation.

In addition to these technical challenges, a major socio-technical challenge is to change the current academic research culture, in which the motivation is to publish papers rather than to solve problems.

Al Bovik

I am working closely with visual psychologists on the problems of (a) visual quality assessment of images and video, both full-reference and no-reference; (b) methods for foveated visual processing and analysis; (c) visual ergonomics, especially visual eyetracking, with application to visual processing of eyetracked data; and (d) analyzing human low-level visual search using natural scene statistics. We feel that our major challenge is this: "why do humans direct their gaze and attention where they do, and how can we predict it?" If we could answer this, then there are many wonderful new algorithms that we could develop that would change the face of Multimedia Processing.

Shahram Ghandeharizadeh

My research conceptualized a video clip as a stream with real-time requirements. Recently, peer-to-peer networks such as Kazaa have become popular. Once the peer-to-peer software is installed on a desktop, that desktop may download audio and video clips that are requested by a user. At the same time, the desktop becomes an active participant in the peer-to-peer network by advertising its content for download by other desktops. I am interested in developing streaming techniques that enable peer-to-peer networks to stream video to a desktop while minimizing the observed latency. Obviously, the system must approximate certain bandwidth guarantees in order to ensure a hiccup-free display.
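The trade-off between observed latency and a hiccup-free display can be made concrete with a small calculation. The sketch below is only an illustration under simplifying assumptions that are not from the text: time is divided into whole seconds, the display consumes data at a constant rate, and data received during one second becomes displayable at the start of the next. The function name and parameters are hypothetical.

```python
from itertools import accumulate
from typing import Optional, Sequence


def min_startup_delay(bandwidth: Sequence[float],
                      display_rate: float,
                      clip_seconds: int) -> Optional[int]:
    """Smallest whole-second startup delay that gives a hiccup-free display.

    bandwidth[t] is the amount of data received during second t (assumed trace).
    Returns None if no delay within the trace can sustain the whole display.
    """
    # received[t] = total data available at the start of second t.
    received = [0.0] + list(accumulate(bandwidth))
    last = len(bandwidth)  # after the trace ends, no more data arrives
    for delay in range(len(received)):
        hiccup_free = True
        for k in range(1, clip_seconds + 1):
            # The k-th second of display starts at absolute second delay + k - 1
            # and needs k * display_rate units to be buffered by then.
            t = min(delay + k - 1, last)
            if received[t] < k * display_rate:
                hiccup_free = False
                break
        if hiccup_free:
            return delay
    return None


if __name__ == "__main__":
    # Slow start followed by a faster period: the display must wait 2 seconds.
    print(min_startup_delay([1, 1, 4, 4], display_rate=2, clip_seconds=3))
```

In a peer-to-peer setting the bandwidth trace is not known in advance, which is why the system can only approximate such guarantees; the calculation merely shows what a given trace would require.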

I am also interested in ad hoc networks and techniques to stream video clips from a source node to a destination node. An example is an inexpensive wireless device that one may purchase for a vehicle. These devices monitor the status of different road stretches and communicate with one another to provide drivers with up-to-date (a few seconds old) congestion updates. This network is ad hoc because all vehicles are mobile and the network is always changing. Given the monitors in minivans and other vehicle types, passengers of a vehicle may request a movie that is located on a remote server. The remote server must stream a video clip across this ad hoc network to deliver the content to a specific vehicle. Strategies for such streaming that minimize response time while maximizing the number of simultaneous displays are among my current research interests.

I am also studying haptic devices and their data streams. Similar to a video clip, a stream produced by a haptic glove is continuous and pertains to the movement of a human joint as a function of time. I am collaborating with several USC colleagues to investigate techniques to process these streams. Our target application is a framework to translate hand signs to spoken English for the hearing impaired. The idea is as follows. A hearing-impaired person wears a haptic glove and performs hand signs pertaining to a specific sign language, say American Sign Language (ASL). A computer processes these streams and translates them into spoken English words to facilitate communication between this individual and a person not familiar with ASL. We employ a hierarchical framework that starts by breaking streams into primitive hand postures. The next layer combines several simultaneous hand postures to form a hand sign. Either a static hand sign or a temporal sequence of these signs may represent an English letter. Letters come together to form words, etc. Two key characteristics of our framework are: a) buffers at each level of the hierarchy, used in support of delayed decisions, and b) use of context to resolve ambiguity. Our framework may maintain multiple hypotheses on the interpretation of incoming streams of data at each layer of the hierarchy. These hypotheses are refuted or verified in either a top-down or a bottom-up manner. In bottom-up processing, the future sequence of streams generated by the glove selects among alternative hypotheses. In top-down processing, context is used to choose between competing hypotheses.
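The interplay of delayed decisions, bottom-up evidence, and top-down context can be illustrated with a deliberately small sketch covering only the letter-to-word layer. Everything in it is an assumption made for illustration: the per-sign candidate letters and their likelihoods stand in for the output of the lower layers, a plain vocabulary stands in for context, and the function name is hypothetical. It is not the framework described above.

```python
from typing import Dict, List, Sequence, Set


def interpret(candidates_per_sign: Sequence[Dict[str, float]],
              vocabulary: Set[str]) -> List[str]:
    """Return the vocabulary words consistent with the signs, best score first.

    candidates_per_sign[i] maps each candidate letter for the i-th sign to a
    likelihood (assumed to come from the lower layers of the hierarchy).
    """
    # Buffer of live hypotheses: partial word -> accumulated score.
    hypotheses = {"": 1.0}
    for candidates in candidates_per_sign:
        extended = {}
        for prefix, score in hypotheses.items():
            # Bottom-up: each new sign extends or eliminates earlier hypotheses.
            for letter, likelihood in candidates.items():
                word = prefix + letter
                # Top-down: context refutes prefixes that no known word completes.
                if any(v.startswith(word) for v in vocabulary):
                    extended[word] = score * likelihood
        hypotheses = extended
    # Delayed decision: commit only once all signs have been buffered.
    complete = [w for w in hypotheses if w in vocabulary]
    return sorted(complete, key=lambda w: hypotheses[w], reverse=True)


if __name__ == "__main__":
    signs = [{"c": 0.6, "o": 0.4}, {"u": 1.0}, {"t": 1.0}]
    # The first sign looks most like "c", but later signs and context favor "out".
    print(interpret(signs, {"cat", "out"}))
```

Note that committing to the most likely letter after the first sign would have produced the wrong word in this example, which is the motivation for keeping several hypotheses alive.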

Forouzan Golshani

I am currently working on two somewhat inter-related projects. They are: distributed media and arts, and multimedia information fusion.

The distributed media and arts is an interdisciplinary Media and Arts research and education program, where engineers and domain experts are educated collectively. It focuses on finding new paradigms for creating human-machine experiences, with the goal of addressing societal needs and facilitating knowledge. An example of what we hope to bring about is the ability to go from motion capture and analysis to:

free movements within shared space
organizing gestures into a phrase
communicating an emotion through movement.

Currently, on the campus of Arizona State University, we have a highly equipped performance area called the Intelligent Stage. It was created with the aim of enabling artists to interact with their environment and each other. It is equipped with many types of sensors and actuators, video tracking, Vicon 3D motion capture, and novel navigation and interaction mechanisms that bring the performer and the audience closer to each other.

Multimedia information fusion is about working with the semantics of data and information objects, and going beyond the low-level, machine-oriented modes of search and retrieval. The forty-year-old tradition of keyword (or textual description) search is completely inadequate, particularly given the dismal results it produces. The challenge in multimedia information fusion is that we do not have an effective mechanism that can cross the artificial barriers created by media-specific tools. What is needed is an ontological^[1] representation of domain knowledge, which maps features to topic-specific themes and maps themes to user-specific concepts. Once we can create domain-specific audiovisual ontologies, analogous to what exists for natural language terms, we will make great strides in the fusion and integration of all types of information, regardless of the type of media they come in.
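A toy version of such a two-level mapping, from media-specific features to themes and from themes to concepts, might look like the sketch below. The feature names, themes, and concepts are invented for illustration, and intersecting the themes suggested by each feature is just one simple way to combine evidence across media; it is not claimed to be the mechanism used in this project.

```python
from typing import Dict, List, Set

# Hypothetical mapping from media-specific features to topic-specific themes.
FEATURE_TO_THEMES: Dict[str, Set[str]] = {
    "image:green_foliage": {"botany"},
    "audio:machinery_noise": {"manufacturing"},
    "text:plant": {"botany", "manufacturing"},  # ambiguous on its own
}

# Hypothetical mapping from themes to concepts.
THEME_TO_CONCEPTS: Dict[str, Set[str]] = {
    "botany": {"living plant"},
    "manufacturing": {"industrial plant"},
}


def concepts_for(features: List[str]) -> Set[str]:
    """Keep the themes shared by all known features, then map them to concepts."""
    theme_sets = [FEATURE_TO_THEMES[f] for f in features if f in FEATURE_TO_THEMES]
    if not theme_sets:
        return set()
    themes = set.intersection(*theme_sets)
    return set().union(*(THEME_TO_CONCEPTS.get(t, set()) for t in themes))


if __name__ == "__main__":
    print(concepts_for(["text:plant"]))                         # both readings remain
    print(concepts_for(["text:plant", "image:green_foliage"]))  # image evidence disambiguates
```

The keyword alone leaves both interpretations open, while evidence from a second medium narrows the result, which is the kind of cross-media behavior that media-specific tools cannot provide on their own.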

Nevenka Dimitrova

My major challenge currently is to figure out how to improve quality of life in the domains of people's TV/media access and personal memory preservation. My dream is to insert non-invasive technology for content access everywhere, anytime, supported by multimodal video content analysis, indexing, and applications, to empower people to easily access the "fiction" memory and the "private" memories. This non-invasive technology requires breaking new ground in multimodal content analysis, blending pattern recognition, signal processing, and knowledge representation. This technology has to be pervasive, part of the environment, the "ambient," and perceived as "naturally intelligent" by the users. This requirement translates into important breakthroughs in scaling down the algorithms into silicon and affordably inserting them into recording, storage, and communication devices. So, in the end, personal video databases for consumers will include a diverse, distributed multitude of content "processors" and "storers" at home and away from home.

Thomas Huang

I am practicing what I am preaching. So the problems listed before are the challenging issues we are working on currently.

Rosalind Picard

I direct research in affective computing: computing that relates to, arises from, or deliberately influences emotion. This may be surprising at first - but it actually grows out of my efforts to build better multimedia systems. This work involves several challenges: One challenge is recognizing, in real-time, if a user is interested, bored, confused, frustrated, pleased, or displeased, and figuring out how to respond intelligently to such cues. At the core of this challenge is the problem of representing and recognizing patterns of human behavior from face, voice, posture, gesture, and other direct inputs to the computer.

Another challenge is that of incorporating mechanisms of affect in the computer system in a way that improves its learning and decision-making abilities (it is now known that when people lack such mechanisms, many of their cognitive decision-making abilities become impaired). The latter appear to be critical for enabling the computer to respond in more intelligent ways.

Human pattern recognition and learning are intimately coupled with internal affective state mechanisms that regulate attention and assign value and significance to various inputs. When you are looking for something, a human companion learns how to help with your search not merely by hearing the label you associate with the retrieved item, but also by sensing a lot of other context related to your state: how hurried you are, how serious you are, how playful you are, how pleased you are, and so forth. The label you attach to what you are looking for is only part of the information: a smart companion gathers lots of other context, without burdening you with articulating such things as logical queries. The human companion tries to infer which of the many inputs is likely to be most important; he doesn't demand that you stop what you are doing to rank them.^[2]

Affective "human state" characteristics play an important role not only in disambiguating a query, but also in reinforcing what are proper and improper responses to your quest. If your state signifies pleasure with the results, this indicates a positive reinforcement to the companion, enabling better learning of what works. If you look displeased, then this signals negative reinforcement and the need to learn how to do better. Such cues are important for any system that is to learn continuously. They are related to work on relevance feedback, but the latter techniques currently only scratch the surface of what is possible.

David Gibbon

In keeping with the trend toward focusing on unstructured multimedia processing, we are interested in the domain of video telecommunications. The goal of this work is to improve the ease of use and enrich the user experience of video telecommunications systems through intelligent applications of media processing algorithms. One of the main challenges here relates to media quality. Typical conferencing applications have poor acoustical and lighting conditions in comparison with broadcast television. The quality of the transducers (microphones and cameras) is also subpar. Further, the grammar and fluency of conference participants are far removed from those of typical professionally produced video content.

Our research center is focused on several other areas related to multimedia. There are efforts underway in the areas of multimedia data mining, indexing, and retrieval; wireless delivery of multimedia to a range of terminal devices with various capabilities; and multimedia communications over Internet protocols.

^[1]An ontology is an enriched semantic framework for highlighting relationships between concepts and correlating media-specific features to the application-specific concepts.

^[2]Just think of how many interpretations the word "plant" may have if used as a search parameter! An ontology makes explicit all of the relevant interpretations of a given concept.