Natural language HCD systems allow the user to query a server or to search for information. For example, the user can manage his/her diary or share portfolio, consult the weather forecast, etc. In all of these tasks, the dialogue relates to objects of the world. Speaking about an object (e.g. using nominal groups) constitutes a linguistic reference to that object. In fact, in any dialogue, interlocutors talk about something and thus carry out references. Searle, in his philosophical theory of speech acts ([SEA 69]), goes further. He states that:
When an agent performs an illocutionary act, he thereby performs referring and predicating acts.
This means that each sentence consists of references (to objects in particular) and predications (for example to specify properties of these objects).
Many studies address the comprehension and the generation of linguistic references in dialogue systems.
For the comprehension of referential expressions, spoken systems are confronted with the problems of speech recognition. For single-speaker systems, the recognition tool must be trained on the user's voice. For multi-speaker systems, vocabulary size is limited, and recognition performance is therefore restricted.
This calls for strategies of semantic completion of the recognised propositions [CAD 95], a technique which has its own limits.
Beyond possible speech recognition errors (which disappear with written natural language), the system tries to understand the reference, i.e. to identify the referent. To do so, it traditionally proceeds by constraint satisfaction (see [HEE 95]). The object descriptor refers to a set of potential candidates. Each component of the expression contributes a constraint that reduces this set. Ideally, the process converges towards a single candidate, but this is not always the case (no referent or several acceptable referents may remain). A clarification dialogue is then necessary. Recent work [SAL 01] adopts a wider approach by modelling the mental representations of the situation and the domain of reference.
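The constraint satisfaction process described above can be sketched as follows. This is a minimal illustration, not the algorithm of [HEE 95]: the `Candidate` class, its attributes, the `resolve` function and the three status labels are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Candidate:
    # Hypothetical object description; the attributes are illustrative only.
    ident: str
    category: str
    colour: str


# A constraint is modelled here as a predicate over candidates (an assumption).
Constraint = Callable[[Candidate], bool]


def resolve(candidates: List[Candidate],
            constraints: List[Constraint]) -> Tuple[str, List[Candidate]]:
    """Apply each constraint in turn and report how the candidate set ended up."""
    remaining = list(candidates)
    for constraint in constraints:
        remaining = [c for c in remaining if constraint(c)]
    if len(remaining) == 1:
        return "resolved", remaining
    if not remaining:
        return "no_referent", remaining
    return "ambiguous", remaining  # a clarification dialogue would be needed


world = [
    Candidate("auto_1", "car", "red"),
    Candidate("auto_2", "car", "blue"),
    Candidate("bike_1", "bicycle", "red"),
]


def is_car(c: Candidate) -> bool:
    return c.category == "car"


def is_red(c: Candidate) -> bool:
    return c.colour == "red"


# "the red car": one constraint per component of the expression.
status, referents = resolve(world, [is_car, is_red])
```

With the single constraint `is_car`, two candidates remain and the status is "ambiguous", which corresponds to the case where a clarification dialogue is required.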
For the generation of referential expressions, other problems are encountered: the choice of the descriptors to be used for the reference [APP 85], [DAL 87], [REI 90], [REI 92]; the calculation of unambiguous descriptions [CLA 86]; referential collaboration, which underlines the fact that a reference is often understood only after a succession of several references during a repair dialogue [HEE 91], [EDM 94]; the co-presence of agents and objects [COH 81], [HEE 95]; and the management of focus in dialogue [GRO 86], [REI 92].
Moreover, some studies point out the problems which dialogue systems have to solve to become more user-friendly. Most current dialogue systems do not take into account the spatio-temporal evolution of the world and rely on rather fixed representations of it. Evolving referents (which change in time, in space or in their own nature) raise modelling difficulties that current technologies cannot handle [PIE 97].
We think that Searle's assertion (cf. §3.1.1.) applies to any type of communicative act, whether linguistic, monomodal or multimodal.
Thus, pointing at an object with voice and gesture constitutes a multimodal reference to that object. Multimodal referring should therefore bring additional richness and expressiveness to HCD, in particular by exploiting the properties of the representation modalities.
The following two examples illustrate the potential for expressiveness, comfort and ease of use that multimodal references can bring. The first example presents a sample dialogue in which the user is talking to a computer. The second example illustrates several turns of natural language HCD. The last statement, produced by the computer, is a multimodal one.
Comprehension of a multimodal reference by the communicating agent
Let us take the example of a multimodal statement composed of a natural language (NL) sentence and a gestural (G) designation on the screen. The HCD system allows the online purchase of vehicles from a car dealer. On a touch screen, a user consults the list of available automobiles, presented in the form of small photographs. He/she wants to know the price of a car (Figure 3.2).
This multimodal utterance partly consists of references to an object of automobile type. This object is referred to, on the one hand, by a referential linguistic expression, the demonstrative nominal group "this car", which refers to instance_auto_21, a particular instance of the automobile object category, and, on the other hand, by a gestural deictic reference (gestural designation) which refers to the same object. There is therefore a bimodal reference (or co-reference) to this object.
Generation of a multimodal reference by the communicating agent
A human being is intrinsically limited by his/her means of expression, voice and gesture; at best, he/she can extend these modalities to writing, drawing, using charts, etc. However, these take considerably longer to exploit than natural modalities. The intelligent agent does not suffer from this kind of problem and benefits from very large computer processing and storage capabilities. In a situation where a person would try to make a gesture to illustrate the content of his/her speech, the agent can replace this illustrative gesture, which is difficult to build and sometimes to understand, by a graphic visual representation strongly analogous to the original object or concept in question (a photograph for an object and a diagram for a concept, for example).
Figure 3.3 illustrates this phenomenon in a fictitious example of dialogue between a user (U) and his/her intelligent personal electronic assistant (S).
S: Mr Onetel called a few moments ago. He wants to see you but did not specify why.
U: Who is he? Do I know him?
S: He's the president of 1-TeL. You met him last year!
U: I don't remember him. What does he look like?
S: Here is a photograph of Mr Onetel and his associate. Mr Onetel is on the left of the picture. (The system displays a photograph, where one can see Mr Onetel)
Multimodal dialogue systems, as well as natural language dialogue systems, are confronted with problems directly or indirectly involved in the referring phenomena in multimodal HCD.
First of all, things are not simpler for multimodal references than for linguistic ones. Although intuition suggests that redundancy across input modalities should allow a more robust interpretation and avoid ambiguities, the facts are very different. This is partly due to recognition systems, which do not ensure optimal recognition. Thus, when two parallel messages coming from two different modalities are contradictory, it is difficult to know which one is erroneous. Fortunately, task and context can be helpful in this process.
Then, certain problems relating to linguistic references remain in the multimodal case. This is obviously true of the calculation of unambiguous object descriptions, the identification of the referent (and referential collaboration), the co-presence of agents and the management of focus (all the more difficult as modalities are numerous) [CSI 94].
Other problems appear at the level of reference comprehension, such as confusion between command gestures and communicative gestures, or between unintentional gestures and deictic ones [STR 97]; the display metaphor of the real world (confusion between the object and its representation) [HE 97]; and temporal synchronisation and the ordering of events.
The problems occurring in the generation of multimodal references are the choice of descriptors on the selected modalities (this selection is related to the choice of modalities [REI 97]); the internal inference of attributes on one modality from attributes already known on other modalities or known for the object category [AND 94]; the display metaphor; temporal synchronisation; etc.
The phenomenon of referring should be approached in a general way, and the model of reference used by the agent has to be sufficiently close to the human one that the user can anticipate the reactions of the computer just as he/she would with a human interlocutor, and so continue to refer as he/she usually does. In the same way, the software agent has to be able to anticipate how the user will interpret its reference so as to build it accordingly. The interpretation of a reference is partly related to the knowledge of the recipient agent. This is why the object representation model (and its principle of dynamic construction) has to meet this need.
Damasio [DAM 94] indicates that the mental representations of objects that humans build consist of perceptual elements acquired during sensory experiences of these same objects.
Let us take the example of a person z "known by sight" by a person j. Person j keeps in memory a certain amount of perceptual information about z, such as visual information (e.g. his/her face, size, style of clothes), auditory information (e.g. the sound of his/her voice), etc. Using this information intelligently, j can build references to z in order to lead his/her interlocutor to identify z.
The mental representations are also partly made up of linguistic object descriptors [PAI 69] that describe on the one hand the semantic category of the object, and on the other, its particular properties.
Most of the sensitive and linguistic elements are redundant, because they are encoded twice through the phenomenon of dual coding [PAI 69], [PAI 86]. During perception, dual coding converts sensitive (resp. linguistic) elements into linguistic (resp. sensitive) equivalents and stores everything in memory.
In order to formalise this organisation of object memory, we introduce, for an agent, the concept of multimodal mental representation (MMR). An MMR is a formal entity corresponding to the intuitive idea of the multimodal mental representation that an agent has of an object. An MMR consists of a set of acquisitive object representations (OR), which arise during the agent's linguistic and sensitive perceptions, and a set of productive OR, produced for linguistic and sensitive references. All sensitive OR (acquisitive and productive) constitute the entire sensitive mental image of the object, and all linguistic OR (also acquisitive and productive) constitute its entire linguistic mental image. Together, these two mental images constitute the entire MMR.
OR make it possible to refer to the generic and particular properties of objects. Generic properties are in fact categorical descriptors of objects. Here is an example of linguistic categorical descriptors: "animal" → "mammal" → "dog".
Specific properties make it possible to code the particular attributes of the object. For example: "dog" → "brown" → "Droopy", etc.
While using these OR, the agent will be able to build references.
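The organisation of an MMR into acquisitive and productive OR, and into linguistic and sensitive mental images, can be sketched as a small data structure. This is only one possible encoding under our own assumptions: the class names `Kind`, `Origin`, `ObjectRepresentation` and `MMR`, the `image` method and the string encoding of descriptors are hypothetical and are not part of the model itself.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class Kind(Enum):
    LINGUISTIC = "linguistic"
    SENSITIVE = "sensitive"


class Origin(Enum):
    ACQUISITIVE = "acquisitive"  # obtained through perception
    PRODUCTIVE = "productive"    # produced by the agent when referring


@dataclass(frozen=True)
class ObjectRepresentation:
    kind: Kind
    origin: Origin
    generic: Tuple[str, ...] = ()   # categorical descriptors, most general first
    specific: Tuple[str, ...] = ()  # particular attributes of this object


@dataclass
class MMR:
    ors: List[ObjectRepresentation] = field(default_factory=list)

    def image(self, kind: Kind) -> List[ObjectRepresentation]:
        """Linguistic or sensitive mental image: all OR of the given kind."""
        return [r for r in self.ors if r.kind is kind]


droopy = MMR([
    ObjectRepresentation(Kind.LINGUISTIC, Origin.ACQUISITIVE,
                         generic=("animal", "mammal", "dog"),
                         specific=("brown", "Droopy")),
    # Hypothetical sensory descriptor, encoded here as a plain string.
    ObjectRepresentation(Kind.SENSITIVE, Origin.ACQUISITIVE,
                         specific=("colour:brown",)),
])
```

In this encoding, the two mental images returned by `image` partition the OR of the MMR, which mirrors the statement that the two images together constitute the entire MMR.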
Our model of mental representation is close to that of Appelt and Kronfeld [APP 87], [KRO 90], who used the term individuating set (IS) for mental representation. An IS is composed of intensional object representation(s) (IOR), which represent the referred object, if it exists. Appelt and Kronfeld defined two types of IOR: speech IOR, which result from linguistic acts of referring in the discourse, and perceptive IOR, which result from perceptive acts of referring in the speech.
We do not consider this model sufficiently precise for multimodal reference. The definition of IOR remains too vague and does not sufficiently detail the various possible natures of perceptive IOR (in terms of the various modalities of interaction).
As we just saw, we defined two types of OR, acquisitive OR (input of the agent) and productive OR (produced by the agent).
Acquisitive OR occur after an act of perception. This act of perception can be of two types. The first one corresponds to the sensitive perception of real objects. Seeing an object, for example, makes people perceive its form, its size, its colour, its aspect, etc. These sensory descriptors of the object will constitute acquisitive sensitive OR. The second one corresponds to the perception of linguistic references in the speech. These linguistic descriptors of the object will constitute acquisitive linguistic OR.
Productive OR appear when the agent tries to build a reference. In this case, there are also two possibilities.
The first possibility is the computerised method of dual coding (cf. § 3.2.1.). There is, however, a small difference: this method is only triggered when needed, i.e. for reasoning or referring, and not during perception. To do so, we use a set of categorical and semantic associations between linguistic and sensitive descriptors.
The second possibility makes it possible to generate new traits or properties of the object on the same modality, starting from generic knowledge either on the category of objects or on the domain, in a deductive way.
Both methods can generate linguistic and sensitive OR.
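The two ways of producing OR can be illustrated with a short sketch. It assumes that associations and generic knowledge are stored in simple lookup tables; the tables `ASSOCIATIONS` and `CATEGORY_KNOWLEDGE`, their contents and the functions `dual_code` and `deduce` are hypothetical and serve only to make the distinction concrete.

```python
from typing import Dict, List, Optional, Tuple

Descriptor = Tuple[str, str]  # (modality kind, descriptor value)

# Hypothetical associations between linguistic and sensitive descriptors.
ASSOCIATIONS: Dict[Descriptor, Descriptor] = {
    ("linguistic", "brown"): ("sensitive", "colour_patch_brown"),
    ("sensitive", "colour_patch_brown"): ("linguistic", "brown"),
}

# Hypothetical generic knowledge: properties shared by members of a category.
CATEGORY_KNOWLEDGE: Dict[Descriptor, List[str]] = {
    ("linguistic", "dog"): ["four-legged", "barks"],
}


def dual_code(kind: str, value: str) -> Optional[Descriptor]:
    """First method: translate a descriptor into the other kind, on demand only."""
    return ASSOCIATIONS.get((kind, value))


def deduce(kind: str, category: str) -> List[Descriptor]:
    """Second method: derive new descriptors of the same kind from category knowledge."""
    return [(kind, d) for d in CATEGORY_KNOWLEDGE.get((kind, category), [])]
```

Calling `dual_code` only when a reference has to be built, rather than at perception time, reflects the on-demand triggering described above.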
Our approach makes it possible to combine OR related to potentially different communication modalities into a new multimodal OR (MOR). This gives the agent the capacity to use several OR in one MOR to refer to an object in a multimodal way. Formally, this combination carries out the semantic sum of the OR components. Some rules specify certain characteristics of the MOR (e.g. the temporal layout of the OR components). In order to model this capacity of combination, we propose the formal predicate Mor_combine (Figure 3.4).
Mor_combine(mor, or_1, ..., or_n) is true if and only if the MOR mor is the combination of every OR from or_1 to or_n, which are all related to different modalities.
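A possible reading of this predicate is sketched below. Two assumptions are made and should be read as such: the "semantic sum" is modelled as the set union of the descriptors of the components, and the temporal layout rules are ignored. The classes `OR` and `MOR` and their fields are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class OR:
    modality: str
    descriptors: FrozenSet[str]


@dataclass(frozen=True)
class MOR:
    components: Tuple[OR, ...]
    descriptors: FrozenSet[str]


def mor_combine(mor: MOR, *ors: OR) -> bool:
    """True iff `mor` combines exactly the given OR, each on a distinct modality."""
    modalities = [o.modality for o in ors]
    if not ors or len(set(modalities)) != len(modalities):
        return False
    # Assumption: the semantic sum is the union of the component descriptors.
    combined = frozenset().union(*(o.descriptors for o in ors))
    return set(mor.components) == set(ors) and mor.descriptors == combined


gesture = OR("gesture", frozenset({"deictic"}))
speech = OR("nat_lang", frozenset({"car", "demonstrative"}))
mor = MOR((gesture, speech), gesture.descriptors | speech.descriptors)
```

Here `mor_combine(mor, gesture, speech)` holds, whereas `mor_combine(mor, gesture, gesture)` does not, because the two components share the same modality.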
The preceding prerequisite enables us to introduce a formal model of act of referring which can be integrated into a theory of rational interaction [SAD 91], on which dialoguing agents developed in FT R&D are based.
This theory of rational interaction is founded on an integrated formal model of mental attitudes and rational action, which makes it possible to take into account the various components and capacities involved in communication. This is the case of rational balance, a relation established, on the one hand, between the various mental attitudes of an agent (belief, uncertainty, intention) and, on the other hand, between its mental attitudes and its actions. Communicative actions fit within this framework. They can be recognised and planned like traditional actions by the primitive principles of rational behaviour. Sadek proposed some models of communicative acts in his theory of rational interaction. They characterise, on the one hand, the reasons for which the act was selected (called rational effects) and, on the other, the feasibility preconditions which have to be satisfied for the act to be planned (we will return to these concepts below).
In this theory, the various models of communicative acts (like the inform act for example), are made operational using logical principles of rationality. These principles for example will make it possible for the agent to select the actions which lead to its goals.
The model of the act of referring that we propose fits into the theory of rational interaction. This act can be planned and carried out by the same principles of rational behaviour. To define this model, we partly take up the one proposed in [BRE 95], adapted from the definitions of Appelt and Kronfeld [APP 87] and of Maida [MAI 92]:
An agent refers to an object whenever the agent has a mental representation which represents what the agent believes to be a particular object, and when the agent intends the hearer to reach a mental representation which represents the same object.
We define an act of referring for each relevant type of multimodal reference. The act presented below (Figure 3.5) makes it possible to produce a reference of the same type as the reference to the car in Figure 3.2, i.e. to refer to an MMR using a demonstrative nominal group and a deictic gesture.
The heading of the act expresses that an agent i performs a referring act to an agent j using the MOR mor_1 to refer to the MMR mmr_1.
The preconditions of relevance to the context (ConP) of the act express the context-dependent conditions which must be true for the act to be accomplished. If they are false, the act is unrealisable and will not be selected by the agent. For example, if the system needs a physical connection to the WWW that it does not have, it will not seek to obtain one. The preconditions of relevance to the context presented in Figure 3.5 mean that, to perform a linguistic (nat_lang) and gestural (gesture) reference, these two modalities must be available (Available(m)).
The preconditions of capacity (CapP) relate to the capacity of the agent to perform the act. If these conditions are false, the agent can plan actions which will make them true. The preconditions of capacity presented here mean the following. Conditions (1) are present in every act of referring: the MOR used to refer to an MMR must belong to this MMR (Belong), and an agent i can refer only to one of its own MMR. The rational effect (RE) is also common to all acts of referring (we will return to this point below). Conditions (2) specify that or_1 and or_2 are OR and must belong to the MMR mmr_1. Conditions (3) express that the category of modality (Modality_Cat) of or_1 is gesture, that its semiotic function (SemioticFnct_isa) is deictic, that loc_1 is the location (Location) of the object represented by the MMR mmr_1, that loc_1 is visible and that the destination of the deictic gesture is loc_1 (the location of the object to be referred to). Conditions (4) mean that the category of modality of or_2 is natural language (nat_lang) and that its textual category (Txt_isa) is a demonstrative nominal group (nom_dem_gr). Conditions (5) express that mor_1 is the combination of or_1 and or_2. The Desc_ident predicate expresses that the MOR mor_1 is an identifying description of the MMR mmr_1. A further predicate applied to or_2 specifies additional properties of or_2 which we do not detail here (see [PAN 96]).
To illustrate the fact that the preconditions of capacity can be planned, let us take the example of the visibility of the location loc_1. If loc_1 is not visible, the agent can plan a succession of actions to make it visible (e.g., depending on the context: physically moving the object, moving its representation, or moving (in) the scene of discourse).
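The different treatment of the two kinds of preconditions can be sketched as follows. This is a simplified illustration, not the logical model of Figure 3.5: the `Context` class, the function `evaluate_referring_act` and the goal string `make_visible(...)` are hypothetical, and only the availability and visibility conditions are represented.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Context:
    available_modalities: Set[str]
    visible_locations: Set[str]


def evaluate_referring_act(ctx: Context,
                           required_modalities: Set[str],
                           target_location: str) -> Tuple[bool, List[str]]:
    """Return (selectable, goals_to_plan) for a referring act."""
    # Preconditions of relevance to the context: if false, the act is not selected.
    if not required_modalities <= ctx.available_modalities:
        return False, []
    # Preconditions of capacity: if false, the agent may plan to make them true.
    goals: List[str] = []
    if target_location not in ctx.visible_locations:
        goals.append(f"make_visible({target_location})")
    return True, goals


ctx = Context(available_modalities={"nat_lang", "gesture"},
              visible_locations={"loc_1"})
bimodal = {"nat_lang", "gesture"}
```

For a hidden location, `evaluate_referring_act(ctx, bimodal, "loc_2")` keeps the act selectable and returns a visibility goal to be planned, whereas a missing modality makes the act unselectable.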
The rational effect (RE) (the expected effect) of this act is that its addressee, the agent j, will have an MMR mmr_1' representing the same object as the one represented by mmr_1.
Let us illustrate our act with an example, partly taking up the one of Figure 3.2. The user asks: "What is the price of this car?", but this time the user points at a spot on the screen where there is no vehicle. This spot is actually located between the photographs of two cars.
Thanks to the model of the act of reference shown in Figure 3.5, the agent will note that the reference is a bimodal one with a natural language OR and a gestural OR (white part of Figure 3.2). The natural language OR is of demonstrative nominal group type and the gestural OR is of deictic type. The agent will infer that there is a MOR combining the two preceding OR. Thus, these two OR are the realisation of only one co-reference to a single object.
The demonstrative nominal group OR will make it possible for the agent to identify the object as an instance of the automobile object category. Unfortunately, since the deictic gesture points at no vehicle, the agent will not be able to identify the referent. A possible reaction of the agent (directed by the primitive principles of rational behaviour) will be, for example, to undertake a clarification dialogue in order to obtain an identifying description of the object. The agent can also, thanks to its knowledge manager, elect several possible candidates for this referent. It will then be able to answer the user by using the act of referring: "This vehicle costs 12500 US Dollars" (indicating the vehicle located above the spot indicated by the user) "and this vehicle costs 15000 US Dollars" (indicating the vehicle located below the spot indicated by the user).
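The election of several candidates around an ambiguous pointing gesture can be sketched with a simple distance criterion. This is an assumption made for illustration only: the text does not specify how the knowledge manager elects candidates, and the function `elect_candidates`, the screen coordinates and the radius are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def elect_candidates(point: Point, objects: Dict[str, Point], radius: float) -> List[str]:
    """Names of objects whose centre lies within `radius` of `point`, nearest first."""
    scored = sorted((math.dist(point, centre), name) for name, centre in objects.items())
    return [name for distance, name in scored if distance <= radius]


# Hypothetical screen layout: two photographs near the pointed spot, one far away.
cars = {
    "car_above": (100.0, 40.0),
    "car_below": (100.0, 120.0),
    "car_far": (400.0, 80.0),
}
candidates = elect_candidates((100.0, 85.0), cars, radius=60.0)
```

When more than one candidate is returned, the agent can either open a clarification dialogue or, as in the example, answer for each candidate with its own multimodal reference.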
 Object in the broad sense, i.e. a physical, conceptual or virtual entity. For example, an email address is an object of the world, as is the car of the neighbour across the street.
 An illocutionary act is the act achieved by the production of a succession of signs in a social relation context, which expresses an intention ("to inform", "to request" are illocutionary acts).
 See [BER 97] for an overview of analogue modalities.
 It is known, for example, that a piece of geographical information is better conveyed by graphics than a piece of abstract information, for which text will be more appropriate.
 We take the term representation used by Maida [MAI 92], which we prefer to the original term denotation employed by Appelt and Kronfeld.
 These types of multimodality are so numerous that we eliminated those considered to be irrelevant or useless. They were built from a taxonomy of unimodal modalities, based on [BER 97].
 For reasons of clarity, the model presented here is simplified. In particular, the distinction between productive and acquisitive OR does not appear.
 See [PAN 96] for the definition of textual category.