5. Validation of Quality Assessment Metrics

Validation is an important step towards successful development of practical image and video quality measurement systems. Since the goal of these systems is to predict perceived image and video quality, it is essential to build an image and video database with subjective evaluation scores associated with each of the images and video sequences in the database. Such a database can then be used to assess the prediction performance of the objective quality measurement algorithms.

In this section, we first briefly introduce two techniques that have been widely adopted in both industry and the research community for the subjective evaluation of video. We then review the quality metric comparison results published in the literature. Finally, we introduce the recent effort by the Video Quality Experts Group (VQEG) [81], which aims to provide industrial standards for video quality assessment.

5.1 Subjective Evaluation of Video Quality

Subjective evaluation experiments are complicated by many aspects of human psychology and viewing conditions, such as observer vision ability, translation of quality perception into a ranking score, preference for content, adaptation, display devices and ambient light levels. The two methods that we present briefly are single stimulus continuous quality evaluation (SSCQE) and double stimulus continuous quality scale (DSCQS). Both have been demonstrated to give repeatable and stable results, provided that viewing configurations and subjective tasks are kept consistent, and have consequently been adopted as parts of an international standard by the International Telecommunication Union (ITU) [84]. If the SSCQE and DSCQS tests are conducted on multiple subjects, the scores can be averaged to yield the mean opinion score (MOS). The standard deviation of the scores may also be useful as a measure of the consistency between subjects.
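The averaging step can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the 0 to 100 score range and the sample ratings are assumptions made for the example and do not come from the ITU recommendation.

```python
from statistics import mean, stdev

def mean_opinion_score(scores):
    """Return (MOS, sample standard deviation) for one stimulus.

    `scores` holds one rating per subject; the 0-100 range used in the
    example below is an assumption of this sketch.
    """
    if len(scores) < 2:
        raise ValueError("need ratings from at least two subjects")
    return mean(scores), stdev(scores)

# Hypothetical ratings from five subjects for a single video sequence.
ratings = [72.0, 65.0, 80.0, 70.0, 68.0]
mos, spread = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f}, standard deviation = {spread:.2f}")
```

A large standard deviation relative to the scale suggests that the subjects disagree about the sequence, so its MOS should be interpreted with more caution.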

Single Stimulus Continuous Quality Evaluation

In the SSCQE method, subjects continuously indicate their impression of the video quality on a linear scale that is divided into five segments, as shown in Figure 41.13. The five intervals are marked with adjectives to serve as guides. The subjects are instructed to move a slider to any point on the scale that best reflects their impression of quality at that instant of time, and to track the changes in the quality of the video using the slider.

Figure 41.13: SSCQE sample quality scale.

Double Stimulus Continuous Quality Scale

The DSCQS method is a discrimination-based method and has the additional advantage that the subjective scores are less affected by adaptation and contextual effects. In the DSCQS method, the reference and the distorted videos are presented one after the other in the same session, in small segments of a few seconds each, and subjects evaluate both sequences using sliders similar to those for SSCQE. The difference between the scores of the reference and the distorted sequences gives the subjective impairment judgement. Figure 41.14 demonstrates the basic test procedure.

Figure 41.14: DSCQS testing procedure recommended by VQEG FR-TV Phase-II test.
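The difference-score computation described for DSCQS can be sketched as follows. The function name, the score range and the sample data are hypothetical, and averaging the per-subject differences into a single value is one common way to summarize them rather than a step prescribed by the text above.

```python
from statistics import mean

def difference_scores(reference_scores, distorted_scores):
    """Per-subject impairment: reference rating minus distorted rating."""
    if len(reference_scores) != len(distorted_scores):
        raise ValueError("each subject must rate both sequences")
    return [r - d for r, d in zip(reference_scores, distorted_scores)]

# Hypothetical ratings (0-100 scale assumed) from four subjects.
reference = [85.0, 90.0, 78.0, 88.0]
distorted = [60.0, 72.0, 55.0, 61.0]

diffs = difference_scores(reference, distorted)
dmos = mean(diffs)  # mean difference score across subjects
print(diffs, dmos)
```

Because each subject rates both sequences in the same session, a subject who tends to score everything high or low contributes the same bias to both ratings, and the bias largely cancels in the difference.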

5.2 Comparison of Quality Assessment Metrics

With so many quality assessment algorithms proposed, the question of their relative merits and demerits naturally arises. Unfortunately, not much has been published comparing these models with one another, especially under strict experimental conditions over a wide range of distortion types, distortion strengths, stimulus content and subjective evaluation criteria. This is compounded by the fact that validating quality assessment metrics comprehensively is time-consuming and expensive, and by the fact that many algorithms are not described explicitly enough in the literature to allow reproduction of their reported performance. Most comparisons of quality assessment metrics are not broad enough to support solid conclusions, and their results should be considered only in the context of their evaluation criteria.

In [3] and [65], different mathematical measures of quality that operate without channel decompositions and masking effect modelling are compared against subjective experiments, and their performance is tabulated for various test conditions. Li et al. compare Daly's and Lubin's models for their ability to detect differences [85] and conclude that Lubin's model is more robust than Daly's given their experimental procedures. In [86], three metrics are compared for JPEG-compressed images: Watson's DCT-based metric [87], Chou and Li's method [50] and Karunasekera and Kingsbury's method [49]. The authors conclude that Watson's method performed best among the three.

Martens and Meesters have compared Lubin's metric (also called the Sarnoff model) with the root mean squared error (RMSE) metric on transformed luminance images [9]. The metrics are compared using subjective experiments based on images corrupted with noise and blur, as well as images corrupted with JPEG distortion. The subjective experiments are based on dissimilarity measurements, where subjects are asked to assess the dissimilarity between pairs of images from a set that contains the reference image and several of its distorted versions. The multidimensional scaling (MDS) technique is used to compare the metrics with the subjective experiments. MDS constructs alternate spaces from the dissimilarity data, in which the positions of the images are related to their dissimilarity (subjective or objective) with the rest of the images in that set. Martens and Meesters then compare RMSE and Lubin's method with subjective experiments, with and without MDS, and report that "in none of the examined cases could a clear advantage of complicated distance metrics (such as the Sarnoff model) be demonstrated over simple measures such as RMSE" [9].
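To make the idea of constructing a space from dissimilarity data concrete, the sketch below implements classical (Torgerson) MDS restricted to one dimension. It is an illustration only: the variant of MDS used in [9] may differ, and the function name, the power-iteration solver and the sample data are all assumptions of this example.

```python
import math

def classical_mds_1d(dissimilarity, iterations=200):
    """Embed items on a line from a symmetric dissimilarity matrix.

    Classical MDS: double-center the squared dissimilarities, then scale
    the dominant eigenvector (found here by power iteration) by the
    square root of its eigenvalue. Assumes the dominant eigenvalue of
    the centered matrix is non-negative.
    """
    n = len(dissimilarity)
    sq = [[d * d for d in row] for row in dissimilarity]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centering; row and column means coincide for symmetric input.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                     for i in range(n))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [scale * x for x in v]

# Hypothetical dissimilarities generated from points at 0, 1, 3 and 6.
points = [0.0, 1.0, 3.0, 6.0]
d = [[abs(p - q) for q in points] for p in points]
coords = classical_mds_1d(d)
print(coords)
```

The recovered coordinates reproduce the pairwise dissimilarities up to a shift and a possible sign flip, which is all that dissimilarity data can determine. Real subjective data would generally need more than one dimension and would not be fitted exactly.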

5.3 Video Quality Experts Group

The Video Quality Experts Group [81] was formed in 1997 to develop, validate and standardize new objective measurement methods for video quality. The group is composed of experts from various backgrounds and organizations around the world. They are interested in full-reference (FR), reduced-reference (RR) and no-reference (NR) quality assessment of video at various bandwidths for television and multimedia applications.

VQEG completed its Phase I test for FR video quality assessment for television in 2000 [10,11]. In the Phase I test, 10 proponent video quality models (including several well-known models and PSNR) were compared with the subjective evaluation results on a video database that contains video sequences with a wide variety of distortion types and stimulus content. A systematic way of evaluating the prediction performance of the objective models was established, which is composed of three components:

Prediction accuracy — the ability to predict the subjective quality ratings with low error. (Two metrics, namely the variance-weighted regression correlation [10] and the non-linear regression correlation [10], were used.)
Prediction monotonicity — the degree to which the model's predictions agree with the relative magnitudes of subjective quality ratings. (The Spearman rank order correlation [10] was employed.)
Prediction consistency — the degree to which the model maintains prediction accuracy over the range of video test sequences. (The outlier ratio [10] was used.)
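The three criteria above can be sketched in plain Python as follows. This is a simplified illustration, not the VQEG procedure: the non-linear regression step is omitted (the predictions are assumed to be already mapped onto the subjective scale), the outlier threshold of twice a per-sequence spread value is an assumption of this sketch, and all function names and data are hypothetical.

```python
import math

def pearson(x, y):
    """Linear correlation coefficient (accuracy after any fitting step)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Rank order correlation (monotonicity)."""
    return pearson(ranks(x), ranks(y))

def outlier_ratio(predicted, subjective, spread, k=2.0):
    """Fraction of sequences whose error exceeds k times the spread."""
    outliers = sum(1 for p, s, e in zip(predicted, subjective, spread)
                   if abs(p - s) > k * e)
    return outliers / len(subjective)

# Hypothetical subjective scores, model predictions and spread values.
subjective = [20.0, 35.0, 50.0, 65.0, 80.0]
predicted = [22.0, 30.0, 55.0, 60.0, 95.0]
spread = [3.0, 3.0, 3.0, 3.0, 3.0]

accuracy = pearson(predicted, subjective)
monotonicity = spearman(predicted, subjective)
consistency = outlier_ratio(predicted, subjective, spread)
print(accuracy, monotonicity, consistency)
```

In this example the predictions preserve the ordering of the subjective scores, so the rank correlation is perfect even though the linear correlation is not, and one of the five sequences counts as an outlier. The three numbers therefore capture different aspects of a model's behaviour.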

The result was, in some sense, surprising: except for one or two proponents that did not perform properly in the test, the proponents, including PSNR, were statistically equivalent in performance [10,11]. Consequently, VQEG did not recommend any method for an ITU standard [10]. VQEG is continuing its work on the Phase II test for FR quality assessment for television, and on RR/NR quality assessment for television and multimedia.

Although it is hard to predict whether VQEG will be able to supply one or a few successful video quality assessment standards in the near future, the work of VQEG is important and unique from a research point of view. First, VQEG establishes large video databases with reliable subjective evaluation scores (the database used in the FR Phase I test is already available to the public [81]), which will prove to be invaluable for future research on video quality assessment. Second, systematic approaches for comparing subjective and objective scores are being formalized. These approaches alone could become widely accepted standards in the research community. Third, by comparing state-of-the-art quality assessment models in different aspects, deeper understanding of the relative merits of different methods will be achieved, which will have a major impact on future improvement of the models. In addition, VQEG provides an ideal communication platform for the researchers who are working in the field.