As the pressure to add value to customers' shopping experience increases, so will the usage of multimedia in m-commerce environments. However, the increased data sizes associated with even relatively low presentation quality multimedia (compared to those of voice and text traffic), as well as the relatively low bandwidth available to m-commerce applications, will make the underlying communication system struggle to provide an optimum QoS, resulting in unwanted effects such as congestion, data loss, and errors.
However, this does not necessarily imply an incompatibility between m-commerce applications and their usage of multimedia. Although a user might be slightly annoyed at the lack of synchronisation between audio and video streams in, say, an advertising clip, it is highly unlikely that (s)he will notice the loss of one video frame out of the 25 which could be transmitted during a second of footage, especially if the video in question is one in which the difference between successive frames is small. In fact, there is a rich body of work studying the perceptual impact of varying multimedia QoS, which shall now be presented, highlighting its relevance to mobile communications.
Media synchronisation refers to the temporal relationship between two or more kinds of media or separate data streams. In a multimedia context this definition can be extended such that synchronisation in multimedia systems comprises content, spatial and temporal relations between media objects.
The most comprehensive results on the perceptual impact of synchronisation skews between media were reported by Blakowski and Steinmetz (1996) and Steinmetz (1996). Firstly, we shall take a look at the synchronisation between the audio and video streams when a human speaks, also known as lip synchronisation.
In the lip synchronisation experiment (Steinmetz, 1996), the test subjects viewed 30-second clips of a speaker in a TV news environment featured in head, shoulder, and body shots. Such shots enabled the viewers not to be disturbed by background information and to concentrate their attention on the gesture, eyes, and lip movement of the speaker. Moreover, the fact that the test scenes had high temporal redundancy makes them ideal for transmission at low frame rates, with consequently relatively low bandwidth requirements, characteristic of wireless transmissions. In an m-commerce environment, such shots could be used, for example, in the case of a tennis pro explaining the features and qualities of a tennis racquet that the customer is considering buying.
Skews were artificially introduced between the video and audio streams of the clip, and the resulting versions were shown to the viewers together with the original recording. Users who noticed something wrong with the synchronisation were asked to quantify their level of annoyance on a 3-point scale (acceptable; annoying; not sure if acceptable or annoying). The main results obtained were the following:
There is an "in sync" region between −80ms and +80ms where lip synchronisation errors were not detected by most of the test subjects, and very few of those who did detect an error said that it affected the quality of the presentation. Indeed, some "out of sync" clips were occasionally classified as being "in sync". The conclusion that can be drawn here is that lip synchronisation is acceptable within these limits.
The "out of sync" portion comprises skews beyond −160ms and +160ms. Here the lack of lip synchronisation was detected by almost everyone. The annoyance that users felt with such presentations often had adverse effects, such as users becoming distracted by the "out of sync" effect and failing to focus on the content of the clip.
In the "transient" area where audio was ahead of video, the closer the shot of the speaker, the easier it was to detect synchronisation errors and the more likely they were to be described as disturbing. Similar considerations applied to the overall resolution: the better the resolution, the more obvious the errors in lip synchronisation became.
Lastly, similar considerations to the above apply to the "transient" area where audio was behind video. An interesting result that surfaced was that audio behind video could be tolerated much better than the opposite case.
A comparison using languages other than English (the language of the newscast) revealed no difference in the results. Similarly, there were no variations between people with different habits regarding the amount of TV and films watched. Lastly, no difference was detected between the same person speaking in a fast, normal, or slow manner.
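The skew regions reported above can be captured in a short sketch; the threshold values come from the text, while the function name and region labels are illustrative:

```python
def classify_lip_sync_skew(skew_ms: float) -> str:
    """Classify an audio-video skew (in ms, positive = audio ahead of video)
    against the perceptual regions reported by Steinmetz (1996).
    Thresholds are from the text; the region labels are illustrative."""
    if -80 <= skew_ms <= 80:
        return "in sync"        # undetected by most test subjects
    if skew_ms < -160 or skew_ms > 160:
        return "out of sync"    # detected by almost everyone
    return "transient"          # detectability depends on shot distance/resolution

for skew in (0, -100, 200):
    print(skew, "ms ->", classify_lip_sync_skew(skew))
```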
Synchronisation between audio and animation is also of potential importance to m-commerce applications, as exemplified by an audio commentary accompanying an animated representation of a product of interest to the user. Here, the perceptual tolerance limits identified were not as stringent as those for lip synchronisation, with a skew of ±80ms being tolerable.
A slide show is the most obvious combination of audio and images. Here a skew of around 1s, equivalent to the time required to advance the projector, is tolerable. Synchronisation of audio with text, also known as audio annotation, was found to have a permissible skew closely related to the pronunciation duration of short words, which is about 500ms, leading to an experimentally verified tolerable skew of 240ms.
In the case of synchronisation between video and text or video and images, two cases can be distinguished. In the overlay mode, the image or text offers additional information to the displayed video sequence, as is the practice of placing subtitles close to the topic of discussion in a multimedia video. Irrespective of video content, a skew of around 240ms has been shown to be sufficient in this case. When no overlay occurs, skew is less serious. In this case one could imagine a drawing detailing the assembly instructions of a product being displayed together with a low-frame-rate video showing the product's appearance when assembled. Here, a synchronisation of around 500ms between the video and image or the video and text is deemed sufficient, which is half the value of the roughly 1s required for human perception of simple images.
The first recorded work on human perception of different frame rates was done by Apteker, Fisher, Kisimov, and Neishlos (1995). They coined the term "human receptivity" to mean not just how the human user perceives multimedia video shown at diverse frame rates, but also more distinct aspects of a user's acceptance of a video message. These include the clarity and acceptability of audio signals, the continuity of visual messages, lip synchronisation during speech, and the general relationship between visual and auditory message components.
Human receptivity is expressed as a percentage measure, with 100% indicating complete user satisfaction with the multimedia data. The authors derived a set of categories of data resulting from a common video classification scheme (VCS). These VCS curves were obtained on the basis of the temporal nature of the data (i.e., its dynamic nature) as well as the importance of the auditory and visual components of the video message. Various video clips were thus classified into eight main categories. A video from each of the eight categories was shown to users in a windowed multitasking environment. Each multimedia video clip in turn was presented at three different frame rates (15, 10, and 5 frames per second, fps) in a randomised order. The end users in these experiments had, however, no knowledge of the particular frame rate at which they were watching the video. The users rated the quality of the multimedia videos on a 7-point scale. A total of 60 people were tested for the 24 types of clips.
The most relevant result to come out of their work was that the dependency between human receptivity and the required bandwidth of multimedia clips is nonlinear. Consequently, for certain ranges of human receptivity, a small variation in receptivity leads to a much larger relative variation in the required bandwidth, a feature referred to as the asymptotic property of the VCS curves.
Bearing in mind the low-frame-rate videos used in their experiments (5fps, with a bit rate well within the reach of 2.5G and 3G mobile technology), this perceptual property can be exploited in a bandwidth-constrained mobile communication environment by sacrificing a small amount of human receptivity in return for the release of a much larger relative amount of bandwidth, which could then be potentially used by future m-commerce sessions.
In contrast to earlier work done by Apteker and Steinmetz, which assumed that the underlying network communication system provided lossless multimedia streams, Wijesekera and Srivastava (1996) and Wijesekera, Srivastava, Nerode, and Foresti (1999) carried out a series of experiments which evaluate user perception in the presence of media losses. Their work is of particular importance bearing in mind the noisy, bandwidth-constrained environment characteristic of mobile communications and the QoS handoff issues which arise in this context.
One of their initial results was that missing a few media units will not be negatively perceived by a user, as long as not too many such units are missed consecutively and such occurrences are infrequent. This is of special importance in a noisy, bandwidth-constrained environment. They also found that media streams could drift in and out of synchronisation without causing considerable human annoyance.
They further evaluated human tolerance of transient continuity and synchronisation losses with respect to audio and video. Media loss was of two types in their study:
Consecutive: this refers to the maximal number of consecutive media data units (e.g., frames or audio packets) whose loss could be tolerated.
Aggregate: here, the number of consecutive media units lost was kept fixed, but the loss was replicated at random intervals during the multimedia presentation.
Rate variation was also a parameter studied in their experiments. It refers to the ideal rate of a media flow and the maximum permissible deviation from it, and was examined both from an intra-stream and from an inter-stream point of view.
Only two video clips of 30s duration were used in their experiments. Both featured bust views of a speaker (a different one in each clip) explaining didactic and academic matters. Users were asked to give their opinions on a 10-point Likert scale (where 1 was poor and 10 was excellent). Similarly to Steinmetz's earlier work, participants were also asked to categorise each clip as "do not mind the effect if there is one", "I dislike it and it's annoying", or "I'm not sure". Their main results can be summarised by the following points (Wijesekera et al., 1999):
The pattern of user sensitivity varies depending on the type of media defect.
Viewer discontent with aggregate video losses increases gradually with the amount of loss, while for other types of losses and for synchronisation defects there is an initial sharp rise in viewer annoyance which afterwards plateaus.
Video rate variations are tolerated much better than rate variations in audio.
Because of the bursty nature of human speech (i.e., talk periods interspersed with intervals of silence), audio loss in this case is tolerated quite well by humans, as it results merely in silence elimination (21% audio loss did not provoke user discontent).
A different approach to evaluating the perceptual impact of varying QoS was adopted by Ghinea and Thomas (1998). Recognising multimedia's infotainment duality, the authors proposed to enhance the traditional view of QoS with a user-level defined quality of perception (QoP). This is a measure which encompasses not only a user's satisfaction with multimedia clips but also his/her ability to perceive, synthesise, and analyse the informational content of such presentations. They subsequently investigated the interaction between QoP and QoS and its implications from both a user perspective as well as from a networking angle.
The approach to evaluating QoP was mainly empirical, as dictated by the fact that its primary focus is on the human side of multimedia computing. Users from diverse backgrounds and ages (12-58 years old) were presented with a set of 12 short (30-45s duration) multimedia clips. As detailed in Table 1, these were chosen to be as varied as possible, ranging from a relatively static news clip to a highly dynamic rugby football sequence. All of them depicted excerpts from real-world programmes and thus represent informational sources which an average user might encounter in everyday life. Each clip was shown with the same set of QoS parameters, unknown to the user. After each clip, the user was asked a series of questions (ranging from 10 to 12 in number) based on what had just been seen, and the experimenter duly noted the answers. Lastly, the user was asked to rate the quality of the clip that had just been seen on a scale of 1-6 (with scores of 1 and 6 representing the worst and best perceived qualities, respectively).
Table 1. The multimedia clip categories used in the QoP experiments:
1 - Action Movie
2 - Animated Movie
3 - Band
4 - Chorus
5 - Commercial
6 - Cooking
7 - Documentary
8 - News
9 - Pop Music
10 - Rugby
11 - Snooker
12 - Weather Forecast
Because of the relative importance of the audio stream in a multimedia presentation (Kawalek, 1995), as well as the fact that it takes up an extremely low amount of bandwidth compared to the video stream, it was decided to transmit audio at full quality during the experiments. Parameters were, however, varied in the case of the video stream. These included both spatial parameters (such as colour depth) and temporal parameters (frame rate). Accordingly, two different colour depths were considered (8- and 24-bit), together with three different frame rates (5, 15, and 25 frames per second, fps). A total of 12 users were tested for each (frame_rate, colour_depth) pair. In summary, the results obtained in the QoP experiments showed that:
A significant loss of frames (that is, a reduction in frame rate) does not proportionally reduce the user's understanding and perception of the presentation. In fact, in some instances (s)he seemed to assimilate more information, resulting in more correct answers to questions. This is because the user has more time to view a frame before it changes (at 25fps, a frame is visible for only 0.04s, whereas at 5fps it is visible for 0.2s), and hence absorbs more information. This observation has implications for resource allocation.
User assimilation of the informational content of clips is characterised by the wys<>wyg (what you see is not what you get) relation. What this means is that users often absorb information correctly yet fail to notice obvious cues in the clip; instead, the reasoning process by which they arrive at their conclusions relies heavily on intuition and past experience.
Users have difficulty in absorbing audio, visual, and textual information concurrently. Users tend to focus on one of these media at any one moment, although they may switch between the different media. This implies that critical and important messages in a multimedia presentation should be delivered in only one type of medium or, if delivered concurrently, should be done so with maximal possible quality.
The link between perception and understanding is a complex one; when the cause of the annoyance is visible (such as lip synchronisation), users will disregard it and focus on the audio message if that is considered to be contextually important.
Highly dynamic scenes, although expensive in resources, have a negative impact on user understanding and information assimilation. Questions in this category obtained the lowest number of correct answers. However, the entertainment value of such presentations seems to be consistent irrespective of the frame rate at which they are shown. The link between entertainment and content understanding is therefore not direct, a point further confirmed by the second observation above.
All these results indicate that quality of service, typically specified in technical terms such as end-to-end delay, must also be specified in terms of perception, understanding, and absorption of content (quality of perception, in short) if multimedia presentations are to be truly effective.
A different angle on the perceptual impact of media loss was taken by Watson and Sasse (1996, 1997), who examined the effect on users of three different audio reconstruction methods. These were: silence substitution (here, any missing audio packets are replaced with silence so as to preserve playback order), waveform substitution (a missing packet is replaced by a copy of the last correctly received packet), and linear predictive coding (LPC, a synthetic quality speech coding algorithm which preserves about 60% of the informational content of the speech signal).
Twenty-four participants took part in the experiments and read out phonetically balanced sentences as given by the Institute of Electrical and Electronics Engineers (IEEE, 1969). These are essentially short, syntactically varied sentences containing five key words each. Up to 40% loss was randomly generated in the list of words considered, while packet sizes were 20, 40, or 80ms in duration.
The results obtained showed that waveform substitution and LPC overall performed better than silence substitution, irrespective of packet size. LPC outperformed waveform substitution for high loss rates and large packet sizes, since the characteristics of the speech signal change over the span of the lost packets, making a repeated waveform a poor match.
Low-bandwidth environments, such as those typical of mobile communications, are not necessarily unsuitable for multimedia presentations, especially if perceptual considerations are taken into account. As was shown in the previous subsections, low video frame rates do not significantly impact upon the perceived quality of multimedia applications; this was true for both the human receptivity and QoP measures of perceptual multimedia quality. Moreover, media loss, as long as it is infrequent and short in duration, is well within the limits of perceptual tolerance.
The question arises, however, of what the cutoff rate is beyond which the quality of transmitted audio and video becomes unacceptable to human users. While Blakowski and Steinmetz (1996) did not explicitly consider the impact of low frame rates on lip synchronisation, Kouvelas, Hardman, and Watson (1996) showed that audio and video are not perceived as being synchronised at frame rates of less than 5fps. This corroborates the findings of Watson and Sasse (1996, 1997), who report that at especially low frame rates (2-3fps) lip synchronisation is lost, and that the video, although helpful to the user, confers mainly a psychological benefit.
Moreover, the results presented by Kouvelas et al. (1996) also corroborate the finding (Jardetzky, Sreenan, & Needham, 1995) that the jitter associated with video frame presentation times (20ms) produces skew that is well within the limits of human perception.
The impact of low frame rates on speech intelligibility was also investigated by Anderson and Blockland (1997), who showed that when speakers can see each other on a low-frame-rate video screen, they articulate more clearly than when they cannot see each other and communicate only over an audio link. This contrasts with the case when speakers can see each other face-to-face, when their speech is less clear.
So, while a frame rate of 5fps would seem to be a perceptual cutoff point beyond which quality is no longer acceptable, in practice the threshold might actually be lower. The reason is that the perceptual impact of multimedia quality as experienced by users is task dependent, as identified in Kawalek (1995): although the general trend of video losses being tolerated much more easily than audio losses is confirmed irrespective of the task at hand, the actual thresholds are heavily task dependent. In fact, a loss of 99% of video frames is still regarded as acceptable quality if the users are engaged in task solving, a situation not unlike those encountered in m-commerce applications. Thus, multimedia presentations, even when subjected to extremely high losses and presented at very low frame rates, need not necessarily impact negatively on the user's multimedia-enhanced m-commerce experience.