11.4. Evaluating Previous Research

The majority of academic papers published on keystroke biometrics systems since 1980 have presented independent studies, each collecting its own samples from unique sets of individuals. These samples were collected through diverse methods, varying widely in the mechanics of user input, the granularity of measured data, the amount of input required to train the system and authenticate users, the number of test subjects employed, and the diversity of the typing experience of those subjects. Such nonuniformity alone makes comparison among different studies difficult; add to this the diversity of keystroke pattern classification approaches and the application of these technologies to different domains, and the task becomes more complex still. In this section, we compare results from the field using commonly used metrics for measuring accuracy, and propose new metrics for measuring the usability of keystroke systems. Unfortunately, much of the literature in this area lacks sufficient reported data to measure all of the preceding features. We therefore restrict the results reported in the following sections to the subset of reports reviewed in the earlier section titled "Overview of Previous Research" that do provide the necessary numbers. Notably missing from all these comparisons is the only commercial offering, for which published numbers are scant (although public documents on the BioNet web site[21] do claim that performance is on par with some of the earliest published results).
11.4.1. Classifier Accuracy

Three metrics are commonly used to describe biometrics classifier performance with regard to accuracy:

False Rejection Rate (FRR): The frequency with which the system incorrectly rejects a legitimate user.

False Acceptance Rate (FAR): The frequency with which the system incorrectly accepts an imposter as a legitimate user.

Equal Error Rate (ERR): The rate at which FAR equals FRR, found by varying an independent parameter (typically a decision threshold) and locating the point where the two error curves cross.
Although ERR is a desirable metric in terms of its ability to condense FAR and FRR into one value, the amount of data needed to turn FAR and FRR into curves (usually through the introduction of an independent variable) is often prohibitive, and few researchers report ERR in their published results. For this reason, we present an alternative approach to combining FAR and FRR: averaging the two values. We will call this value the Average False Rate (AFR):

AFR = (FAR + FRR) / 2
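To make the arithmetic concrete, the following Python sketch computes FAR, FRR, and AFR from a set of classifier match scores at a fixed acceptance threshold. The function name, scores, and threshold are hypothetical illustrations of the definitions above, not data or code from any of the surveyed systems.

    # A minimal sketch of computing FAR, FRR, and AFR from classifier output.
    # Assumes a similarity-score classifier: higher scores mean a closer match.
    # The score lists and threshold below are hypothetical illustrations.

    def false_rates(genuine_scores, imposter_scores, threshold):
        """Return (FAR, FRR, AFR) for a given acceptance threshold."""
        # FAR: fraction of imposter attempts wrongly accepted (score >= threshold)
        far = sum(s >= threshold for s in imposter_scores) / len(imposter_scores)
        # FRR: fraction of genuine attempts wrongly rejected (score < threshold)
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        # AFR: the simple average of the two, as defined above
        afr = (far + frr) / 2
        return far, frr, afr

    genuine = [0.91, 0.85, 0.78, 0.95, 0.66]   # hypothetical valid-user scores
    imposter = [0.42, 0.58, 0.31, 0.72, 0.25]  # hypothetical imposter scores
    far, frr, afr = false_rates(genuine, imposter, threshold=0.7)
    print(f"FAR={far:.2%}  FRR={frr:.2%}  AFR={afr:.2%}")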
In 2000 and 2002, the U.K.'s Biometrics Working Group produced guidelines for "Best Practices in Testing and Reporting Performance of Biometric Devices."[23] Going forward, we hope that researchers in the field of keystroke typing patterns will consider these guidelines when reporting results. The main corpus of results we review herein, however, did not have the advantage of access to these standards, and therefore did not present data that we could use to produce such useful evaluation criteria as Receiver Operating Characteristic (ROC) or Detection Error Trade-off (DET) curves.[24] In this light, AFR can be viewed as a useful stopgap for comparing overall classifier accuracy.
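The following sketch illustrates why ERR is more expensive to report than a single FAR/FRR pair: rather than one operating point, it requires sweeping a threshold across the data to trace both error curves and find where they cross. As before, the scores and function name are hypothetical illustrations.

    # A sketch of estimating ERR by threshold sweep. ERR lies where the FAR
    # and FRR curves intersect; with discrete data we take the threshold that
    # brings the two rates closest together. Scores are hypothetical.

    def estimate_err(genuine, imposter):
        """Sweep candidate thresholds; return the FAR/FRR crossover (ERR)."""
        best = None
        for t in sorted(set(genuine) | set(imposter)):
            far = sum(s >= t for s in imposter) / len(imposter)  # imposters accepted
            frr = sum(s < t for s in genuine) / len(genuine)     # valid users rejected
            if best is None or abs(far - frr) < abs(best[1] - best[2]):
                best = (t, far, frr)
        t, far, frr = best
        return (far + frr) / 2, t

    genuine = [0.91, 0.85, 0.78, 0.95, 0.66]   # hypothetical valid-user scores
    imposter = [0.42, 0.58, 0.31, 0.72, 0.25]  # hypothetical imposter scores
    err, t = estimate_err(genuine, imposter)
    print(f"ERR is approximately {err:.2%} at threshold {t}")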
Figure 11-2 graphs the performance of each of the highlighted systems in terms of AFR, FRR, and FAR. Although we make no claim about the validity of a system designed to favor either FAR or FRR over the other (such as password hardening[25]), we feel that in the absence of reported ERR, AFR is a good descriptor of the overall accuracy of a given classifier in terms of discriminating between users.

Figure 11-2. False Rejection Rate, False Acceptance Rate, and Average False Rate for each of several approaches; Average False Rate is the average of FRR and FAR, and is shown by the top axis; systems with lower FRR, FAR, and AFR are more accurate in discriminating between users, and are thus capable of being more secure
This figure demonstrates that the very best reported results are able to achieve an AFR of less than 1%, and roughly one-third are capable of an AFR near 2%, values generally considered to be acceptable for this type of system. The worst performers have average AFR values between 8% and 27%, and are not likely to provide sufficient accuracy for common usage.

11.4.2. Usability

Two other commonly used metrics in the realm of biometrics are:

FTR (failure to enroll rate): The proportion of users whose samples the system cannot successfully enroll.

FTA (failure to acquire rate): The proportion of attempts in which the system cannot capture a usable sample from a user.
These metrics are proposed primarily as a way to measure classifier accuracy, and not as a means of measuring system usability,[26] although they can provide some limited insight into the latter. We found that FTR and FTA are seldom reported; this may stem from the fact that most studies do not use thresholds for rejecting users during enrollment, which in turn may stem from the relatively small groups of users studied in most reports (see "Confidence in Reported Results" later in this chapter). False Rejection Rate (FRR) also partially addresses usability, as it is a measure of how often a user may have to reauthenticate after being misidentified by the system. But none of these metrics addresses usability directly. We therefore propose two new metrics that do quantify usability explicitly:

Cost of User Enrollment (CUE): The amount of input, measured here in keystrokes, that a user must provide to enroll in the system.

Cost of User Authentication (CUA): The amount of input, measured here in keystrokes, that a user must provide in a successful authentication.
CUE and CUA arise from the need to measure usability in terms of how much work an individual must perform when accessing a system successfully, rewarding classifiers that perform well with less input from the user. Unlike FTR/FTA, these metrics ignore the extra work required of users as a result of classifier failures, instead focusing on the usability costs associated with successful enrollment and access. If FTR/FTA data were available from the studies cited in this survey, we could combine them with CUE/CUA data to give a more complete picture of overall system usability.

Figure 11-3 plots CUE and CUA for each of several approaches.[27] The graphs show a wide range of requirements both for enrollment (from 24 required keystrokes to nearly 3,500) and for authentication. Authentication costs in this figure fall into three categories: those that require on the order of 10 keystrokes (these are almost universally the systems that monitor only password patterns), those that require tens of keystrokes, and those that require several hundred keystrokes. Note that if we were to eliminate those systems that require more than 1,000 keystrokes to enroll and/or more than 100 keystrokes to authenticate, we would eliminate a few of the better performers from "Classifier Accuracy," but would retain a large number of systems that perform accurately and have low usability costs.

Figure 11-3. The cost to a user (in keystrokes) to enroll and to authenticate for a given approach; systems that can enroll and authenticate with fewer keystrokes are easier to use
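As a minimal sketch of the screening step just described, the following Python fragment filters a table of systems by the CUE and CUA ceilings discussed above; the system names and keystroke counts are hypothetical placeholders, not figures from the surveyed studies.

    # A minimal sketch of screening systems by usability cost, using the
    # ceilings discussed above (CUE <= 1,000 and CUA <= 100 keystrokes).
    # The entries below are hypothetical placeholders, not survey data.

    systems = [
        # (name, CUE: keystrokes to enroll, CUA: keystrokes to authenticate)
        ("system-a", 24, 10),
        ("system-b", 900, 80),
        ("system-c", 3500, 400),
    ]

    MAX_CUE = 1000  # enrollment cost ceiling, in keystrokes
    MAX_CUA = 100   # authentication cost ceiling, in keystrokes

    usable = [(name, cue, cua) for name, cue, cua in systems
              if cue <= MAX_CUE and cua <= MAX_CUA]
    for name, cue, cua in usable:
        print(f"{name}: CUE={cue} keystrokes, CUA={cua} keystrokes")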
11.4.3. Confidence in Reported Results

There is wide variance in the amount of data researchers collect to perform their studies and to demonstrate the effectiveness of their systems. Is a system that is able to determine the identities of 5 users with 100% accuracy better than a system that is able to determine the identities of 300 users with 99% accuracy? To measure the amount of confidence we can place in reported results, we compare the various studies according to:

The number of test subjects employed

The number of valid (legitimate user) access attempts made against each system

The number of imposter access attempts made against each system
The results shown in Figure 11-4 are not very encouraging; only two of the published studies used more than 50 test subjects, and the majority used fewer than 25. The lack of extensive test data demonstrates an important deficiency: keystroke biometrics will almost certainly be used for groups of users larger than 25, yet only two of the approaches compared in the figure demonstrated competence on samples large enough to validate their results on large systems. Large sample sets are particularly important for web-based systems, the largest of which may scale to millions of users. To be fair, harnessing a significant number of human test subjects is a difficult task. Perhaps these numbers indicate the need for a central repository of input data for keystroke biometrics analysis. Such a repository would also serve as a source for common benchmarking to compare various approaches. Alternatively, researchers could independently make their own data available upon publication.

Figure 11-4. The involvement of more users and more valid/imposter logins lends credence to reported results, but even the largest studies in the keystroke patterns field to date fall short of proving competence on large systems

Figure 11-4 also shows the number of valid and imposter accesses attempted on each system. Valid attempts fall roughly into three categories: those with fewer than 100 attempts, those with close to 200 attempts, and one[28] with close to 500 attempts. Imposter attempts have four divisions: a few with no imposter attempts, many with roughly 100 or roughly 1,000 imposter attempts, and one[29] with over 70,000 imposter attempts. Larger numbers provide more convincing proof of workable, secure systems. It is also worth pointing out that a small handful of the best combined performers from the earlier sections, "Classifier Accuracy" and "Usability," such as Lin,[30] maintain reasonable, if not stellar, performance in relation to the current body of work.
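One way to make this concern concrete: the statistical confidence in a reported error rate depends directly on the number of attempts behind it. The following sketch (an illustration we add here, not an analysis from any of the surveyed studies) computes a 95% Wilson score interval for an observed error rate, showing how wide the uncertainty remains when samples are small.

    import math

    # An illustrative sketch: the 95% Wilson score confidence interval for an
    # observed error rate, showing how the number of attempts behind a
    # reported FAR or FRR governs how much we can trust it.

    def wilson_interval(errors, attempts, z=1.96):
        """Return the (low, high) 95% confidence interval for an error rate."""
        p = errors / attempts
        denom = 1 + z**2 / attempts
        center = (p + z**2 / (2 * attempts)) / denom
        half = z * math.sqrt(p * (1 - p) / attempts + z**2 / (4 * attempts**2)) / denom
        return max(0.0, center - half), min(1.0, center + half)

    # The same observed 2% error rate, backed by very different sample sizes:
    for errors, attempts in [(2, 100), (10, 500), (1400, 70000)]:
        low, high = wilson_interval(errors, attempts)
        print(f"{errors}/{attempts} errors: {low:.2%} .. {high:.2%}")

By this measure, a 2% error rate observed over 100 attempts is consistent with a true rate anywhere from about 0.6% to about 7%, while the same rate observed over 70,000 attempts is pinned down to within about a tenth of a percentage point.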