Microsoft Research (MSR) | Building Intelligent .NET Applications: Agents, Data Mining, Rule-Based Systems, and Speech Processing

If Microsoft has anything to say about it, we will eventually live in homes straight out of an episode of The Jetsons. The research group envisions a home where the kitchen counter displays a menu that helps you prepare dinner and the refrigerator adds items to your shopping list as you remove them. It may sound like science fiction, but it is not as far away as you think.

In fact, Microsoft hopes to release many products known as smart personal objects. These are everyday items customized for the individual user and able to deliver specific information. Wristwatches, alarm clocks, and key chains are all examples.

The first of these products, a smart watch, is already available for purchase from MSN Direct. The watch currently allows you to access sports, weather, news, and stock quotes. Of course, if the trend catches on (as seems quite likely), it may soon be possible to deliver even more customized functionality via a user's wrist.

The smart watch is just one example of how the world is becoming more mobile and of the need for better software to interface with this new world. Another is Xnav, a prototype device that allows the user to navigate an application using one-handed touch access. This would be useful not just for busy people, but also for people who are blind or have other disabilities.

This section features projects at Microsoft Research (research.microsoft.com) that, if not yet ready, are expected to make their way into products in the near future. The MSR motto is "Turning ideas into reality." The group comprises dozens of subgroups, each working on multiple projects (see Table 9.2). Many of the technical advances seen in current Microsoft products originated from this group.

Speech-Related Technologies

Speech is a major initiative for Microsoft. It is part of a larger concept referred to as the Natural User Interface, or Natural UI, which involves creating natural and expressive interactions with the user. This is primarily accomplished using speech-processing capabilities but can also involve natural language and machine learning. The Natural UI is intended to ease interaction with smart devices: not merely devices like PDAs and Tablet PCs, but also devices installed in your car, Internet television, and screen phones.

Kai-Fu Lee is the corporate vice president of the Natural Interactive Services Division of Microsoft. He recently gave a presentation in which he stated:

Natural UI will arrive as an evolution. . . . But, in 10 years, Natural UI will be viewed as the largest revolution since Graphical UI.

The Speech group hopes to improve human-to-computer interaction by giving computers the ability to recognize spoken words and even to understand their meaning. Of course, this is the tricky part. The group's researchers are hard at work trying to improve speech recognition, grammar understanding, and text-to-speech using several different methods.

One way speech recognition can be improved is through the ability to detect emotion in speech. This is a technique that could be very useful for speech applications that interface with customers. The software will be able to respond appropriately if it can recognize the speaker's emotion.

The work from this group was the basis for the Speech Application Programming Interface (SAPI) and also for the newly released Microsoft Speech Server. In fact, some of the researchers from this group are now working in the Speech Platforms Group, which is responsible for the Microsoft Speech Server product.

Recently, I had the opportunity to speak with James Mastan, director of marketing for the Speech Platforms Group. The profile box titled "The Future of Speech at Microsoft" contains excerpts from that conversation.

One of the group's first prototypes was the MiPad, which stands for multimodal interactive notepad. This device, which was first demonstrated in 2000, combines speech-recognition technology with pen input. The user can choose to use either method when accessing e-mail, schedules, or contact information. Work in this area was the basis for multimodal application development with Speech Server.

The Future of Speech at Microsoft

In late August 2004, I spoke with James Mastan, director of marketing for the Speech Platforms Group at Microsoft. I was able to ask him about some of the technologies we can expect to see coming from the group.

I began by asking about work being done in the area of personalization. For instance, what can be done to improve the recognition of speech for each individual user?

I told him that as a user of the Speech SDK, I would like to see it move away from the need to rely on the grammar. The process of creating a grammar can be quite cumbersome for applications that should anticipate a broad range of responses.

He suggested that, in the near term, more consideration be given to application design. Many well-documented techniques on this subject are available on the Microsoft Speech Server Web site at www.microsoft.com/speech.

He also had the following to say:

The goal is exactly as you say: instead of having [as] in today's case, the user adapts to the system, the reverse is the goal, to have the system adapt to the user. So, we have technologies already in research that are pretty far along that enable self-learning.

In a follow-up to his response, I asked when we might anticipate these self-learning techniques to be implemented in the Speech Server product. His response was:

Absolutely in the next five years, potentially within the next three years.

He also informed me of a project named YODA that is currently under development. In describing the project, he said:

It is a dictation program that interacts with e-mail and makes inferences. For example, if you want to send e-mail to Joe, Tom, Fred, and Harry. If it knows for example that my e-mails in the last four weeks have gone to Joe Thomas and Mary Henry, etc., then it can infer those are the people I want to send e-mail to and populate these names in the To line. . . . It learns from your usage pattern. . . . The goal is to enable self-tuning systems.
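The kind of inference described in the quote can be illustrated with a simple frequency count. The sketch below is not YODA's actual method, which the conversation does not detail; it only ranks recipients by how often they appeared in recently sent messages. The function name `suggest_recipients`, the four-week window, and the sample log are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def suggest_recipients(sent_log, now, weeks=4, top_n=2):
    """Rank recipients by how often they were mailed in the last `weeks` weeks.

    sent_log is a list of (timestamp, [recipient, ...]) pairs. This is an
    illustrative stand-in, not the YODA algorithm.
    """
    cutoff = now - timedelta(weeks=weeks)
    counts = Counter()
    for sent_at, recipients in sent_log:
        if sent_at >= cutoff:
            counts.update(recipients)
    # most_common returns (name, count) pairs, highest count first
    return [name for name, _ in counts.most_common(top_n)]


# Hypothetical usage history
now = datetime(2004, 8, 30)
sent_log = [
    (now - timedelta(days=2), ["Joe Thomas", "Mary Henry"]),
    (now - timedelta(days=9), ["Joe Thomas"]),
    (now - timedelta(days=20), ["Mary Henry", "Joe Thomas"]),
    (now - timedelta(days=90), ["Old Contact"]),  # outside the window
]
print(suggest_recipients(sent_log, now))
```

A real self-tuning system would presumably weigh far more context than raw counts, but the example shows the basic idea of populating the To line from a usage pattern.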

He told me that the Speech Platforms Group has plotted out what it anticipates the recognition error rate will be over the next few years. He said that humans currently have a 2 percent recognition error rate. He expects that the error rate for speech-recognition systems will probably approach human levels by the year 2011.
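The conversation does not say how this error rate is measured. A common metric for speech recognizers is word error rate (WER): the word-level edit distance between the recognizer's output and a reference transcript, divided by the number of words in the reference. The sketch below assumes that metric; the function name and the sample sentences are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word and one dropped word out of five reference words
print(word_error_rate("send mail to joe thomas", "send male to joe"))
```

Under this assumed metric, a 2 percent error rate corresponds to roughly one misrecognized word in every fifty.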

I asked Mr. Mastan what new advances were expected to be implemented in the Speech Server product within the next year. He told me of the following:

International language coverage
VoIP (Voice over IP)
Improved noise filtration
Enhanced grammar authoring and debugging tools
Fine-tuned controls for authoring prompts
Enhanced dialog authoring tools

He also told me that the trend is to go to packaged speech-based applications, such as the ones being produced by Solar Software (www.solarsoftware.com). This company is featured in a profile box in Chapter 3 and currently produces a software package known as Vocal Help Desk. The software allows Windows network administrators to initiate administrative functions using speech. He added:

So, you do not have to create custom applications every time you want to develop something. So, let's make this more like the computer software market where you buy a box of speech-enabled something and install it on your backend and it works.

In closing, I asked Mr. Mastan how near the Speech Platforms Group was to creating a fully speech-enabled computer that could be operated by a user using continuous speech. He responded:

That is a goal that Microsoft has and the odds are that the best way to do that is to speech-enable Windows. You can assume that is a direction we are headed to within the next five years.

There are many exciting things coming out of Microsoft that will support enhanced computing. Speech processing is probably the most important technology in this area. It will be vital to the acceptance of other technologies that strive to improve human-to-computer interaction.

Notification Platform

Created by the Adaptive Systems and Interaction team at Microsoft Research, the Notification Platform project is based around the idea of an intelligent agent. The technology is already the basis for some of the .NET platform and should be part of other upcoming products. Someday the intelligent capabilities of this platform may even be made available to developers through an API.

One of the reasons the field of AI has had such a slow start is that there are so many things humans do that seem to be innate and do not follow the standard rules of logic. On the opposite side, there are many things that computers can do that humans find difficult. For instance, computers are sometimes better than humans at calculating large figures and at remembering things accurately. Computers, moreover, do not suffer from the same limitations as humans. For instance, computers do not have to eat or sleep.

The Notification Platform takes advantage of the things that computers can do better than humans. It consists of programs that assist users in their daily activities. For instance, one program, named Priorities, is used to assign priorities to e-mails and determine which ones the user wants to see. It uses a neural network to help it learn from the user and know which priorities to assign.
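To make the idea of learning priorities from the user concrete, here is a minimal sketch using a single artificial neuron (a perceptron), the simplest kind of neural network. It is not the Priorities implementation, whose internals are not described here; the three message features, the training examples, and the function names are hypothetical.

```python
def predict(weights, bias, features):
    """Return 1 (high priority) or 0 (low priority) for one message."""
    activation = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if activation > 0 else 0


def train(examples, epochs=20, learning_rate=1.0):
    """Perceptron learning rule: weights change only when a prediction is wrong."""
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)
            bias += learning_rate * error
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
    return weights, bias


# Hypothetical features per message:
# [sender is a frequent contact, addressed only to me, subject says "urgent"]
# The label is the user's feedback: 1 = wanted to see it, 0 = did not.
examples = [
    ([1, 1, 0], 1),
    ([1, 0, 1], 1),
    ([0, 0, 0], 0),
    ([0, 1, 0], 0),
    ([1, 1, 1], 1),
    ([0, 0, 1], 0),
]
weights, bias = train(examples)
print(predict(weights, bias, [1, 0, 0]))
```

In this toy data the user only cares about who sent the message, and the trained weights end up reflecting that. A production system would use many more features and a richer model.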

Machine Translation

The MindNet project involves building semantic networks in order to extract meaning from large amounts of data. This knowledge base project has been utilized as a data repository by the Machine Translation project.
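A semantic network can be pictured as a graph whose nodes are words and whose edges are labeled relations. The sketch below shows that data structure in its simplest form; it is not MindNet's design, and the class name, the relation labels (`is_a`, `has_part`), and the sample facts are assumptions made for illustration.

```python
from collections import defaultdict


class SemanticNetwork:
    """A toy graph of labeled relations between words."""

    def __init__(self):
        # edges[source][relation] is the set of target words
        self.edges = defaultdict(lambda: defaultdict(set))

    def add(self, source, relation, target):
        self.edges[source][relation].add(target)

    def related(self, source, relation):
        """Words directly linked from `source` by `relation`, sorted."""
        return sorted(self.edges.get(source, {}).get(relation, set()))

    def is_a(self, word, category):
        """Follow 'is_a' links transitively to test category membership."""
        seen, frontier = set(), [word]
        while frontier:
            current = frontier.pop()
            if current == category:
                return True
            if current in seen:
                continue
            seen.add(current)
            frontier.extend(self.related(current, "is_a"))
        return False


net = SemanticNetwork()
net.add("car", "is_a", "vehicle")
net.add("vehicle", "is_a", "machine")
net.add("car", "has_part", "engine")
print(net.related("car", "has_part"))
print(net.is_a("car", "machine"))
```

Chaining relations in this way is what lets a network answer questions that no single stored fact states directly, such as whether a car is a machine.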

This project is based on machine learning and is used internally at Microsoft to process the company's huge quantities of technical documents. Documents that would take a human months to process can be processed by the Machine Translation project in a single night. The data-driven project parses sentences and assigns them to categories that are later associated. The technology has already been used in the Microsoft Word 97 grammar checker and the natural-language query function of the Microsoft Encarta 98 Encyclopedia.
