In human communication, emotions in general, and emotions in the voice in particular, play two fundamental roles. The first is to regulate communication, in the same way as the many non-vocal signs emitted by the interlocutors (in particular, facial expressions and gestures). For example, if one of the interlocutors is astonished or annoyed, he will express this through the intonation of his voice, which will immediately trigger a reaction in his interlocutor. Likewise, a particular intonation will signal that one of the two interlocutors has finished speaking and wishes to give way to someone else.
The second role of emotion in the voice is to allow the meaning of the verbal content to be conveyed precisely, through the pragmatic content that gives access to connotation. As shown in the introduction, the same sentence can, for example, be interpreted in several ways according to the intonation of the voice, which lets listeners understand whether the sentence uttered is a question or an affirmation, and whether this affirmation is meant seriously or ironically. Emotion in the voice thus resolves a great deal of ambiguity in the interpretation of the message.
What we have just said about communication between human beings applies equally to man-machine communication, provided that the machines are able to interpret and generate emotion in the voice. Indeed, a number of studies, including those based on the CASA (computers are social actors) paradigm of Nass et al. [NAS 94], have stressed in the last few years that computer users in the wider sense frequently behave with these machines as they do with humans; they use emotional signs to express contentment or discontent, stress or fatigue, without receiving any reaction in return. This is a form of release, a gratuitous affect with no consequence for the machine, although it may be addressed to the human surroundings if other people are present.
Three observations can be made: man-machine communication has a strong tendency towards anthropomorphism; the technologies that enable these machines to speak (voice synthesis), hear (speech recognition) and reason (dialogue agents) are reaching maturity and can be transferred to a wide range of software platforms; and the CASA paradigm is a reality. It therefore appears that future developments in the field of interfaces and voice technologies will have a major bearing on the production and interpretation of emotional content.
Thus, one might envisage, in five or ten years' time, interfaces, physical or not, that are sensitive to emotions in vocal commands and able to react according to the command itself, of course, but also according to the way in which it has been uttered. Given the increasingly strong resemblance between man-man and man-machine communication just described, such a development would logically improve the usability of systems (in particular by reducing the number of errors of interpretation) while improving their user-friendliness (by making communication more natural), by drawing on the two fundamental roles of emotions in the voice referred to above.
Two major fields of study may be considered to cover emotions in the voice. The first concerns the production, perception and analysis of emotion in the natural voice; the second concerns the generation, through voice synthesis, of the expressions of emotion found in the natural voice. We present these two fields briefly below.
Studies on emotions in the natural voice have mainly centred on identifying the physical correlates of the various expressions of emotion in speech [DAV 64], [SCH 86], [FRA 00]. In this approach, the vocal signal is analysed with a view to explaining the emotional state of the speaker as it is perceived by his audience.
In his detailed review of the literature, Scherer [SCH 86] put forward twelve basic emotions which may be distinguished in speech (happiness/pleasure, joy/gladness, displeasure/disgust, scorn/disdain, sadness/despondency, grief/despair, uneasiness/anxiety, fear/terror, irritation/icy anger, rage/temper, boredom/indifference and timidity/culpability) and reported, for each of these, the principal acoustic parameters identified as being strongly correlated with it. These studies have made it possible to model the relationship between the vocal signal emitted by a speaker and the emotion expressed through his voice.
In a recent study, Maffiolo and Chateau [MAF 01] showed that the emotions perceived in a vocal signal depend closely on the semantic content of the sentence uttered. It may therefore be imagined that, by coupling such models with a speech recognition system and an artificial intelligence system (of the kind used by dialogue agents), it would be possible to identify automatically the various emotions expressed by a speaker in a given semantic field.
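As a purely illustrative sketch of this coupling, the following Python fragment compares a set of measured acoustic parameters with one reference profile per emotion and keeps the closest profile, after optionally restricting the candidates to the emotions that are plausible in the current semantic field. Everything in it is an assumption made for the example: the parameter names, the function `identify_emotion`, the `plausible` filter and above all the numerical values, which are placeholders and are not taken from [SCH 86] or [MAF 01].

```python
import math

# HYPOTHETICAL reference profiles. The numbers are illustrative placeholders,
# not measurements from the literature. Each parameter is expressed relative
# to the speaker's neutral voice (1.0 = unchanged).
PROFILES = {
    "joy":     {"f0_mean": 1.30, "f0_range": 1.5, "intensity": 1.2, "rate": 1.20},
    "sadness": {"f0_mean": 0.90, "f0_range": 0.6, "intensity": 0.8, "rate": 0.80},
    "rage":    {"f0_mean": 1.40, "f0_range": 1.6, "intensity": 1.5, "rate": 1.30},
    "boredom": {"f0_mean": 0.95, "f0_range": 0.5, "intensity": 0.9, "rate": 0.85},
}


def distance(features, profile):
    """Euclidean distance between measured features and a reference profile."""
    return math.sqrt(sum((features[key] - profile[key]) ** 2 for key in profile))


def identify_emotion(features, profiles=PROFILES, plausible=None):
    """Return the name of the closest profile.

    `plausible` stands for the set of emotions that a dialogue agent judges
    compatible with the semantic content; None means "no restriction".
    """
    candidates = {
        name: profile
        for name, profile in profiles.items()
        if plausible is None or name in plausible
    }
    if not candidates:
        raise ValueError("no candidate emotion left after semantic filtering")
    return min(candidates, key=lambda name: distance(features, candidates[name]))


observed = {"f0_mean": 1.35, "f0_range": 1.55, "intensity": 1.4, "rate": 1.25}
print(identify_emotion(observed))                               # rage
print(identify_emotion(observed, plausible={"joy", "sadness"}))  # joy
```

The second call shows the effect of the semantic field: the same acoustic measurements lead to a different label once the dialogue context has ruled some emotions out. A real system would replace the nearest-profile rule by a trained statistical model.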
Once the emotional content of the voice of a user has been correctly identified, it is a matter of responding to it with appropriate emotional content. This raises a fundamental prior question: does the person desire symmetry, dissymmetry, well-meaning neutrality, etc.? When a synthetic voice is used, it is necessary to have algorithms which make it possible to "breathe into" the acoustic signal being constructed the characteristics of the intonations of the human voice [PIN 89], [CAH 90], [MUR 93]. That might be done a priori, by using specific acoustic databases (for example, databases recorded with speaking styles which call upon a variety of pragmatic content), or post facto, by applying particular patterns of prosody to an "emotionally neutral" signal. In the latter case, the models for analysing emotions in the natural voice might be reused to supply the target values of the acoustic parameters which the synthetic voice has to achieve if it is to imitate the natural voice.
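The post facto approach can be sketched minimally as follows, under strong simplifying assumptions: the neutral signal is reduced to its fundamental frequency (F0) contour, given as a list of values in Hz with 0 marking unvoiced frames, and the prosodic pattern is reduced to two hypothetical target values, a scaling of the mean F0 and a scaling of the F0 range. The function name `apply_prosody` and its parameters are invented for the example; this is not the method of [PIN 89], [CAH 90] or [MUR 93].

```python
def apply_prosody(f0_contour, mean_scale, range_scale):
    """Reshape a neutral F0 contour towards hypothetical target values.

    f0_contour  -- F0 per frame in Hz; 0 marks an unvoiced frame
    mean_scale  -- factor applied to the mean F0 of the voiced frames
    range_scale -- factor applied to the excursions around that mean
    """
    voiced = [f0 for f0 in f0_contour if f0 > 0]
    if not voiced:
        # Nothing to reshape: return an unchanged copy.
        return list(f0_contour)

    neutral_mean = sum(voiced) / len(voiced)
    target_mean = neutral_mean * mean_scale

    # Voiced frames are re-centred on the target mean and their excursions
    # are widened or narrowed; unvoiced frames stay at 0.
    return [
        target_mean + (f0 - neutral_mean) * range_scale if f0 > 0 else 0.0
        for f0 in f0_contour
    ]


neutral = [100.0, 0.0, 120.0, 140.0]
print(apply_prosody(neutral, mean_scale=1.25, range_scale=2.0))
# [110.0, 0.0, 150.0, 190.0]
```

In this example the mean of the voiced frames rises from 120 Hz to 150 Hz and the excursions around it are doubled. A real synthesiser would also have to act on duration, intensity and voice quality, and would bound the resulting values to a range the voice can actually produce.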
MIT has proposed the Kismet robot (http://www.ai.mit.edu/projects/humanoid-robotics-group/kismet/kismet.html), which uses a synthetic voice to express the six basic emotions of the Ekman model. However, these emotions are still prototypical, and a more subtle approach is required in order to obtain a more natural synthetic voice, with more realistic and less caricatured intonations.