11.4. Evaluating Previous Research

The majority of academic papers published on keystroke biometrics systems since 1980 have presented independent studies, each collecting its own samples from unique sets of individuals. These samples were collected through diverse methods, varying widely in the mechanics of user input, the granularity of measured data, the amount of input required to train the system and authenticate users, the number of test subjects employed, and the diversity of the typing experience of those subjects. Such nonuniformity alone makes comparison among different studies difficult; add to this the diversity of keystroke pattern classification approaches and the application of these technologies to different domains, and the task becomes more complex still. In this section, we compare results from the field using commonly used metrics for measuring accuracy, and propose new metrics for measuring the usability of keystroke systems. Unfortunately, much of the literature in this area lacks sufficient reported data to measure all of the preceding features. We therefore restrict the results reported in the following sections to the subset of reports reviewed in the earlier section titled "Overview of Previous Research" that do provide the necessary numbers. Notably missing from all these comparisons is the only commercial offering, for which published numbers are scant (although public documents on the BioNet web site[21] do claim that performance is on par with some of the earliest published results).
11.4.1. Classifier Accuracy

Three metrics are commonly used to describe biometrics classifier performance with regard to accuracy:

False Rejection Rate (FRR): The frequency with which the system incorrectly rejects a legitimate user.

False Acceptance Rate (FAR): The frequency with which the system incorrectly accepts an imposter as a legitimate user.

Equal Error Rate (ERR): The rate at which FAR equals FRR, found by varying an independent parameter (typically a decision threshold) and locating the point where the two error curves cross.
Although ERR is a desirable metric in terms of its ability to condense FAR and FRR into one value, the amount of data needed to turn FAR and FRR into curves (usually through the introduction of an independent variable) is often prohibitive, and few researchers report ERR in their published results. For this reason, we present an alternative approach to combining FAR and FRR: averaging the two values. We will call this value the Average False Rate (AFR):

AFR = (FAR + FRR) / 2
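To make the arithmetic concrete, the following Python sketch computes FAR, FRR, and AFR from a set of classifier match scores at a fixed acceptance threshold. The function name, scores, and threshold are hypothetical illustrations of the definitions above, not data or code from any of the surveyed systems.

    # A minimal sketch of computing FAR, FRR, and AFR from classifier output.
    # Assumes a similarity-score classifier: higher scores mean a closer match.
    # The score lists and threshold below are hypothetical illustrations.

    def false_rates(genuine_scores, imposter_scores, threshold):
        """Return (FAR, FRR, AFR) for a given acceptance threshold."""
        # FAR: fraction of imposter attempts wrongly accepted (score >= threshold)
        far = sum(s >= threshold for s in imposter_scores) / len(imposter_scores)
        # FRR: fraction of genuine attempts wrongly rejected (score < threshold)
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        # AFR: the simple average of the two, as defined above
        afr = (far + frr) / 2
        return far, frr, afr

    genuine = [0.91, 0.85, 0.78, 0.95, 0.66]   # hypothetical valid-user scores
    imposter = [0.42, 0.58, 0.31, 0.72, 0.25]  # hypothetical imposter scores
    far, frr, afr = false_rates(genuine, imposter, threshold=0.7)
    print(f"FAR={far:.2%}  FRR={frr:.2%}  AFR={afr:.2%}")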
In 2000 and 2002, the U.K.'s Biometrics Working Group produced guidelines for "Best Practices in Testing and Reporting Performance of Biometric Devices."[23] Going forward, we hope that researchers in the field of keystroke typing patterns will consider these guidelines when reporting results. The main corpus of results we review herein, however, did not have the advantage of access to these standards, and therefore did not present data that we could use to produce such useful evaluation criteria as Receiver Operating Characteristic (ROC) or Detection Error Trade-off (DET) curves.[24] In this light, AFR can be viewed as a useful stopgap for comparing overall classifier accuracy.
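The following sketch illustrates why ERR is more expensive to report than a single FAR/FRR pair: rather than one operating point, it requires sweeping a threshold across the data to trace both error curves and find where they cross. As before, the scores and function name are hypothetical illustrations.

    # A sketch of estimating ERR by threshold sweep. ERR lies where the FAR
    # and FRR curves intersect; with discrete data we take the threshold that
    # brings the two rates closest together. Scores are hypothetical.

    def estimate_err(genuine, imposter):
        """Sweep candidate thresholds; return the FAR/FRR crossover (ERR)."""
        best = None
        for t in sorted(set(genuine) | set(imposter)):
            far = sum(s >= t for s in imposter) / len(imposter)  # imposters accepted
            frr = sum(s < t for s in genuine) / len(genuine)     # valid users rejected
            if best is None or abs(far - frr) < abs(best[1] - best[2]):
                best = (t, far, frr)
        t, far, frr = best
        return (far + frr) / 2, t

    genuine = [0.91, 0.85, 0.78, 0.95, 0.66]   # hypothetical valid-user scores
    imposter = [0.42, 0.58, 0.31, 0.72, 0.25]  # hypothetical imposter scores
    err, t = estimate_err(genuine, imposter)
    print(f"ERR is approximately {err:.2%} at threshold {t}")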
Figure 11-2 graphs the performance of each of the highlighted systems in terms of AFR, FRR, and FAR. Although we make no claim about the validity of a system designed to favor either FAR or FRR over the other (such as password hardening[25]), we feel that in the absence of reported ERR, AFR is a good descriptor of the overall accuracy of a given classifier in terms of discriminating between users.

Figure 11-2. False Rejection Rate, False Acceptance Rate, and Average False Rate for each of several approaches; Average False Rate is the average of FRR and FAR, and is shown by the top axis; systems with lower FRR, FAR, and AFR are more accurate in discriminating between users, and are thus capable of being more secure
This figure demonstrates that the very best reported results are able to achieve an AFR of less than 1%, and roughly one-third are capable of an AFR near 2%, values generally considered to be acceptable for this type of system. The worst performers have average AFR values between 8% and 27%, and are not likely to provide sufficient accuracy for common usage.

11.4.2. Usability

Two other commonly used metrics in the realm of biometrics are:

FTR (failure to enroll rate): The proportion of users whose samples the system cannot successfully enroll.

FTA (failure to acquire rate): The proportion of attempts in which the system cannot capture a usable sample from a user.
These metrics are proposed primarily as a way to measure classifier accuracy, and not as a means of measuring system usability,[26] although they can provide some limited insight into the latter. We found that FTR and FTA are seldom reported; this may stem from the fact that most studies do not use thresholds for rejecting users during enrollment, which in turn may stem from the relatively small groups of users studied in most reports (see "Confidence in Reported Results" later in this chapter). False Rejection Rate (FRR) also partially addresses usability, as it is a measure of how often a user may have to reauthenticate after being misidentified by the system. But none of these metrics addresses usability directly. We therefore propose two new metrics that do quantify usability explicitly:

Cost of User Enrollment (CUE): The amount of input, measured here in keystrokes, that a user must provide to enroll in the system.

Cost of User Authentication (CUA): The amount of input, measured here in keystrokes, that a user must provide in a successful authentication.
CUE and CUA arise from the need to measure usability in terms of how much work an individual must perform when accessing a system successfully, rewarding classifiers that perform well with less input from the user. Unlike FTR/FTA, these metrics ignore the extra work required of users as a result of classifier failures, instead focusing on the usability costs associated with successful enrollment and access. If FTR/FTA data were available from the studies cited in this survey, we could combine them with CUE/CUA data to give a more complete picture of overall system usability.

Figure 11-3 plots CUE and CUA for each of several approaches.[27] The graphs show a wide range of requirements both for enrollment (from 24 required keystrokes to nearly 3,500) and for authentication. Authentication costs in this figure fall into three categories: those that require on the order of 10 keystrokes (these are almost universally the systems that monitor only password patterns), those that require tens of keystrokes, and those that require several hundred keystrokes. Note that if we were to eliminate those systems that require more than 1,000 keystrokes to enroll and/or more than 100 keystrokes to authenticate, we would eliminate a few of the better performers from "Classifier Accuracy," but would retain a large number of systems that perform accurately and have low usability costs.

Figure 11-3. The cost to a user (in keystrokes) to enroll and to authenticate for a given approach; systems that can enroll and authenticate with fewer keystrokes are easier to use
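As a minimal sketch of the screening step just described, the following Python fragment filters a table of systems by the CUE and CUA ceilings discussed above; the system names and keystroke counts are hypothetical placeholders, not figures from the surveyed studies.

    # A minimal sketch of screening systems by usability cost, using the
    # ceilings discussed above (CUE <= 1,000 and CUA <= 100 keystrokes).
    # The entries below are hypothetical placeholders, not survey data.

    systems = [
        # (name, CUE: keystrokes to enroll, CUA: keystrokes to authenticate)
        ("system-a", 24, 10),
        ("system-b", 900, 80),
        ("system-c", 3500, 400),
    ]

    MAX_CUE = 1000  # enrollment cost ceiling, in keystrokes
    MAX_CUA = 100   # authentication cost ceiling, in keystrokes

    usable = [(name, cue, cua) for name, cue, cua in systems
              if cue <= MAX_CUE and cua <= MAX_CUA]
    for name, cue, cua in usable:
        print(f"{name}: CUE={cue} keystrokes, CUA={cua} keystrokes")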
11.4.3. Confidence in Reported Results

There is wide variance in the amount of data researchers collect to perform their studies and to demonstrate the effectiveness of their systems. Is a system that is able to determine the identities of 5 users with 100% accuracy better than a system that is able to determine the identities of 300 users with 99% accuracy? To measure the amount of confidence we can place in reported results, we compare the various studies according to:

The number of test subjects employed

The number of valid (legitimate user) access attempts made against each system

The number of imposter access attempts made against each system
The results shown in Figure 11-4 are not very encouraging; only two of the published studies used more than 50 test subjects, and the majority used fewer than 25. The lack of extensive test data demonstrates an important deficiency: keystroke biometrics will almost certainly be used for groups of users larger than 25, yet only two of the approaches compared in the figure demonstrated competence on samples large enough to validate their results on large systems. Large sample sets are particularly important for web-based systems, the largest of which may scale to millions of users. To be fair, harnessing a significant number of human test subjects is a difficult task. Perhaps these numbers indicate the need for a central repository of input data for keystroke biometrics analysis. Such a repository would also serve as a source for common benchmarking to compare various approaches. Alternatively, researchers could independently make their own data available upon publication.

Figure 11-4. The involvement of more users and more valid/imposter logins lends credence to reported results, but even the largest studies in the keystroke patterns field to date fall short of proving competence on large systems

Figure 11-4 also shows the number of valid and imposter accesses attempted on each system. Valid attempts fall roughly into three categories: those with fewer than 100 attempts, those with close to 200 attempts, and one[28] with close to 500 attempts. Imposter attempts have four divisions: a few with no imposter attempts, many with roughly 100 or roughly 1,000 imposter attempts, and one[29] with over 70,000 imposter attempts. Larger numbers provide more convincing proof of workable, secure systems. It is also worth pointing out that a small handful of the best combined performers from the earlier sections, "Classifier Accuracy" and "Usability," such as Lin,[30] maintain reasonable, if not stellar, performance in relation to the current body of work.
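One way to make this concern concrete: the statistical confidence in a reported error rate depends directly on the number of attempts behind it. The following sketch (an illustration we add here, not an analysis from any of the surveyed studies) computes a 95% Wilson score interval for an observed error rate, showing how wide the uncertainty remains when samples are small.

    import math

    # An illustrative sketch: the 95% Wilson score confidence interval for an
    # observed error rate, showing how the number of attempts behind a
    # reported FAR or FRR governs how much we can trust it.

    def wilson_interval(errors, attempts, z=1.96):
        """Return the (low, high) 95% confidence interval for an error rate."""
        p = errors / attempts
        denom = 1 + z**2 / attempts
        center = (p + z**2 / (2 * attempts)) / denom
        half = z * math.sqrt(p * (1 - p) / attempts + z**2 / (4 * attempts**2)) / denom
        return max(0.0, center - half), min(1.0, center + half)

    # The same observed 2% error rate, backed by very different sample sizes:
    for errors, attempts in [(2, 100), (10, 500), (1400, 70000)]:
        low, high = wilson_interval(errors, attempts)
        print(f"{errors}/{attempts} errors: {low:.2%} .. {high:.2%}")

By this measure, a 2% error rate observed over 100 attempts is consistent with a true rate anywhere from about 0.6% to about 7%, while the same rate observed over 70,000 attempts is pinned down to within about a tenth of a percentage point.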