9.3. Picture Perfect?
In this section, we summarize several analyses that have been performed to evaluate the security and usability of graphical passwords. While the old adage that a picture is worth a thousand words might indeed be true, it is not entirely clear that user-chosen graphical passwords of the type suggested to date offer additional security over text passwords. An alternative may be to utilize only system-chosen passwords, although we might expect that this would sacrifice some degree of memorability; we do not explore this avenue here, as we are unaware of empirical results to evaluate this conjecture in general.
9.3.1. Security
As with usability analyses, which we examine later in this chapter, the most compelling security analyses for graphical password schemes permitting user password selection are those performed on the basis of extensive user experiments. After all, the security weaknesses of text passwords were revealed only by their use in practice. That said, to date there are few such user studies, and so graphical password design efforts have appealed to surrogate analyses in an effort to reason about the security of particular proposals.
9.3.1.1 Key generation
The graphical password scheme that has been the topic of most such analyses is the Draw-a-Secret (DAS) scheme.[19] In the paper that originally proposed this scheme, Jermyn et al. reason about the size of the memorable password space, giving a counting argument that the number of memorable DAS passwords (i.e., those having a simple algorithm to generate them) quickly outpaces the number of text passwords that are commonly chosen, as measured by the size of dictionaries commonly applied to break them. As discussed previously in this chapter, the password space is particularly important when considering the use of this scheme to generate cryptographic keys.
Recently, however, Thorpe and van Oorschot have postulated that the memorable DAS passwords are those that exhibit mirror symmetry. If this is true, they show that the security of DAS against dictionary attacks may be far less than originally hypothesized.[20] They argue that a similar weakness results if users select DAS passwords that are simple by various pattern complexity measures,[21] for example, by selecting only a small number of strokes.[22] If the DAS passwords selected by users in practice are consistent with the hypotheses of Thorpe and van Oorschot, then these works may point to ways to strengthen DAS passwords, perhaps by implementing restrictions on DAS passwords similar to those levied on text passwords today.
9.3.1.2 Authentication
To our knowledge, the only significant user study on the security of graphical passwords for authentication was performed by Davis and the present authors.[23] In that work, we studied the security of two schemes based on image recognition, denoted "Face" and "Story," which are described shortly. This study focused specifically on the impact of user selection of passwords in these schemes, and the security of the passwords that resulted. We recount some of the notable results from this study, and the methodologies used to reach them, as an illustration of some of the challenges that graphical passwords can face. In particular, this study demonstrated that graphical password schemes can be far weaker than textual passwords when users are permitted to choose their passwords.
In the Face scheme, the password is a collection of k faces, each selected from a distinct set of n > 1 faces; for our evaluation we used k = 4 and n = 9. So, while choosing her password, the user is shown four successive 3 x 3 grids containing randomly chosen images (see Figure 9-4(a), for example), and for each, she selects one image from that grid as an element of her password. Images are unique and do not appear more than once for a given user. During the authentication phase, the same sets of images are shown to the user, but with the images permuted randomly. In the Story scheme, a password is a sequence of k unique images selected by the user to make a "story," from a single set of n > k images, each derived from a distinct category of image types. The images are drawn from categories that depict everyday objects, food, automobiles, animals, children, sports, scenic locations, and male and female models. A sample set of images for the story scheme is shown in Figure 9-4(b). We chose to study the Face scheme, in particular, because of a depth of psychological literature that revealed factors that could potentially be sources of bias in password selection. For example, the scientific literature abounds with studies that show that people tend to agree about attractiveness even across cultures,[24] and psychologists have argued for decades that the old adage that "beauty is in the eye of the beholder" may be largely false. A natural question is whether general perceptions of beauty (e.g., facial symmetry, youthfulness, averageness)[25], [26] might influence graphical password choices. Similarly, the "race effect" refers to the innate ability of people to better recognize faces from their own race than faces of people from other races.[27], [28], [29] Again, this raises the question as to whether race might influence a user's choice for graphical passwords in the Face scheme.
Figure 9-4. (a, left) In the Face scheme, a user's password is a sequence of k faces, each chosen from a distinct set of n > 1 faces; (b, right) in the Story scheme, a user's password is a sequence of k unique images selected from one set of n images to depict a "story"; in the above examples, n = 9, and images are placed randomly in a 3 x 3 grid
To study both the Story scheme and the Face scheme, we collected data on graphical password usage by three computer engineering and computer science classes at two universities during the fall semester of 2003 (roughly the four-month period of late August through early December). Each student used one of the graphical password schemes for access to content including his grades, homework, homework solutions, course reading materials, etc., via standard Java-enabled web browsers. For the purposes of the experiment, facial images were classified into nonoverlapping categories, namely those listed in the caption of Figure 9-5. To simplify the analysis, we made the assumption that images in a category are equivalent; that is, the specific images in a category that are available do not significantly influence a user's choice in picking a specific category. If we simply consider the set of images chosen by men and women using the Face scheme (see Figure 9-5), some patterns are apparent immediately: for one, different populations exhibit strong differences in their password choices.
Figure 9-5. Category selection based on gender and race for the Face scheme; the graph shows the distribution of choices from sets of images consisting of typical Asian males, typical Asian females, typical black males, typical black females, typical white males, typical white females, Asian male models, Asian female models, black male models, black female models, white male models, and white female models
Insight into what different groups tend to choose as their passwords in the Face scheme is shown in Tables 9-1 and 9-2, which characterize selections by gender and race, respectively.
As can be seen in Table 9-1, both males and females chose females in Face significantly more often than males, and when males chose females, they almost always chose models (roughly 80% of the time). Moreover, perceptual differences were also observed when we examined image selection across racial categories. In that case, the "race effect" described earlier seemingly influenced the selection of passwords. As depicted in Table 9-2, Asian females and white females chose from within their races roughly 50% of the time; white males chose whites over 60% of the time.
The categories of images chosen by each gender and race in the Story scheme are shown in Figure 9-6. The most significant deviations between males and females are that females chose animals twice as often as males did, and males chose women twice as often as females did. Less pronounced differences are that males tended to select nature and sports images somewhat more than females did, and females tended to select food images more often.
Figure 9-6. Category selection based on gender and race for the Story scheme; the graph shows the distribution of choices from sets of images representing the nine categories: animals, cars, women, food, children, men, objects, nature, and sports
Given these differences across populations in both the Face scheme and the Story scheme, we set out to measure the ability of an attacker to guess the password of a user in each scheme. We summarize our findings in the following discussions. In our analysis, we let p denote a password selected in either the Face scheme or the Story scheme. Then, Pr[p] for any p denotes the probability that the scheme yields the password p, where the probability is taken over both user choice and random choices in the scheme. Given accurate values for Pr[p] for each p, one measure of an attacker's ability to guess passwords is the guessing entropy of passwords.[30] Informally, guessing entropy measures the expected number of guesses an attacker with perfect knowledge of the probability distribution on passwords would need in order to guess a password chosen from that distribution. Guessing entropy supposes that the attacker examines his guesses in an optimal order to minimize his expected number of guesses. So, if we enumerate passwords p1, p2,... in nonincreasing order of Pr[pi], then the guessing entropy is simply:
\[ \sum_{i > 0} i \cdot \Pr[p_i] \]
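As a minimal sketch of this definition, the following Python function computes the expected number of guesses for an attacker who tries passwords in nonincreasing order of probability. The function name and the representation of the distribution as a plain list of probabilities are assumptions made for illustration; they are not taken from the study.

```python
from fractions import Fraction


def guessing_entropy(probabilities):
    """Expected number of guesses when passwords are tried in
    nonincreasing order of probability.

    `probabilities` is an iterable of Pr[p] values, one per password,
    assumed to sum to 1.
    """
    # Optimal attacker order: most likely password first.
    ordered = sorted(probabilities, reverse=True)
    # The i-th guess (1-indexed) succeeds with probability Pr[p_i].
    return sum(i * p for i, p in enumerate(ordered, start=1))


# A skewed distribution needs fewer expected guesses than a uniform one
# over the same number of passwords.
skewed = guessing_entropy([0.25, 0.5, 0.25])
uniform = guessing_entropy([Fraction(1, 3)] * 3)
```

For three passwords, the skewed distribution above gives 1.75 expected guesses, whereas the uniform distribution gives 2, which illustrates how bias in user choice lowers the attacker's expected effort.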
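Because guessing entropy intuitively corresponds closely to the attacker's task in which we are interested (guessing a password), we will mainly consider measures motivated by the guessing entropy. The direct use of the preceding formula to compute guessing entropy is problematic for two reasons: first, the numeric probabilities are sensitive to parameter variations; and second, the passwords a user could have chosen depend on the set of categories available during password selection.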
To account for the first of these issues, we use the probabilities only to determine an enumeration p = (p1,p2,...) of passwords in nonincreasing order of probability.[31] This enumeration is far less sensitive to parameter variations than are the numeric probabilities, and so yields a more robust measure. We use this sequence to conduct tests in which we randomly select a small set of "test" passwords from our data set (20% of the data set), and use the remainder of the data to compute the enumeration p.
We then guess passwords in order of p until each test password is guessed. To account for the second issue identified earlier (namely, the set of available categories during password selection), we first filter from p the passwords that would have been invalid given the available categories when the test password was chosen, so that they are never guessed. By repeating this test with nonoverlapping test sets of passwords, we obtain a number of guesses per test password.
Tables 9-3 and 9-4 present results for the Face scheme and the Story scheme, respectively. Populations with fewer than 10 passwords are excluded from these tables. The results in these tables should be considered in light of the number of available passwords. In particular, the Face scheme (in the configuration we tested) has 9^4 = 6,561 possible passwords (for fixed sets of available images), for a maximum guessing entropy of 3,281. However, our results show that for Face, if the user is known to be a male, then the worst 10% of passwords can be guessed easily on the first or second attempt. This observation is sufficiently surprising as to warrant restatement: an attacker can succeed in merely two guesses for 10% of male users. Similarly, if the user is Asian and his gender is known, then the worst 10% of passwords can be guessed within the first six tries.
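The hold-out test described above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the procedure used in the study: empirical frequency in the training data stands in for the probability model that produces the enumeration, each password is represented as a tuple of category labels with a single set of categories available at selection time, and the function name `holdout_guess_counts` and the data layout are hypothetical.

```python
import random
from collections import Counter


def holdout_guess_counts(samples, folds=5, seed=0):
    """Count guesses needed per test password using nonoverlapping test sets.

    `samples` is a list of (password, available) pairs, where `password`
    is a tuple of category labels and `available` is the frozenset of
    categories offered when that password was chosen (an assumed layout).
    Returns a list of (password, guesses); guesses is None when the
    password never appears in the enumeration.
    """
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)
    results = []
    for fold in range(folds):
        # With folds=5, each test set holds roughly 20% of the data.
        test_idx = set(order[fold::folds])
        train = [samples[i][0] for i in order if i not in test_idx]
        # Enumeration in nonincreasing order of (empirical) frequency.
        ranked = [p for p, _ in Counter(train).most_common()]
        for i in sorted(test_idx):
            target, available = samples[i]
            # Drop guesses that were impossible given the available categories.
            valid = [p for p in ranked if set(p) <= available]
            guesses = valid.index(target) + 1 if target in valid else None
            results.append((target, guesses))
    return results
```

Summary statistics such as the median, the mean, or the number of guesses needed for the weakest 10% of passwords can then be computed from the returned guess counts for each population.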
The Story-based scheme offers far fewer possible passwords, namely 9 x 8 x 7 x 6 = 3,024, yielding a maximum possible guessing entropy of roughly 1,513. Nevertheless, Table 9-4 shows that it is more secure, in that the biases observed in the Face scheme do not tend to be as prominent in the Story scheme.
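The password-space sizes quoted for the two schemes can be checked by enumerating them directly. The sketch below assumes the tested configuration of n = 9 images and k = 4 selections; the variable and function names are illustrative only. The maximum guessing entropy is attained by a uniform distribution, for which the sum of i/N over i = 1, ..., N equals (N + 1)/2.

```python
from itertools import permutations, product

N_IMAGES = 9   # n: images shown per grid
K_CHOICES = 4  # k: selections per password

# Face: one image chosen from each of k distinct grids of n images.
face_passwords = sum(1 for _ in product(range(N_IMAGES), repeat=K_CHOICES))

# Story: an ordered sequence of k distinct images from a single set of n.
story_passwords = sum(1 for _ in permutations(range(N_IMAGES), K_CHOICES))


def max_guessing_entropy(space_size):
    """Guessing entropy of a uniform distribution over `space_size`
    passwords: sum of i / N for i = 1..N, which equals (N + 1) / 2."""
    return (space_size + 1) / 2


face_max = max_guessing_entropy(face_passwords)
story_max = max_guessing_entropy(story_passwords)
```

For the Face configuration this gives (6,561 + 1)/2 = 3,281, and the same function applies to the Story count of 3,024.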
It is also interesting to note that for both schemes the average number of guesses to find a test password is always higher than the median number, implying that there are several good passwords chosen that significantly increase the average number of guesses an attacker would need to perform, but that do not affect the median. The most dramatic example of this is for white males, where the average is 1,260 versus a median of 81 using the Face scheme, and the average is 844 versus a median of 394 for the Story scheme. This seems to imply that with better user education, the passwords selected by users of these schemes might be hardened against online attacks. We hope that larger-scale studies will better evaluate that claim.
9.3.2. Usability
To date, a handful of studies have analyzed, albeit on small populations, the recall rate of authentication systems based on image recognition. For the most part, these studies have shown that memorability is indeed far better for these types of graphical passwords than for their textual counterparts. For instance, Brostoff and Sasse report the results of a three-month trial investigation with 34 students that shows that fewer login errors were made when using Passfaces™ (a commercial scheme based on image recognition) compared to textual passwords, even given significant periods of inactivity between logins.[32] Similarly, other recent studies have confirmed the memorability of other schemes based on image recognition.[33], [34], [35]
For our study, we also evaluated the effect of user choice on the memorability of the chosen passwords. Figure 9-7 shows the percentage of successful logins versus the time since that user's last login attempt. A trend that emerges is that while memorability of both schemes is strong, Story passwords appear to be somewhat harder to remember than Face passwords. One potential reason for users' relative difficulty in remembering their Story passwords is that apparently few of them actually chose stories, despite our suggestion to do so. This contributed very significantly to incorrect password entries resulting from misordering their selections. For example, of the 236 incorrect password entries in Story, over 75% of them consisted of the correct images selected in an incorrect order.
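The distinction drawn here, between an entry with the right images in the wrong order and an entry with wrong images, can be expressed as a simple check. The following is a sketch only; the function name and the three labels are hypothetical and are not taken from the study's analysis code.

```python
from collections import Counter


def classify_entry(entered, actual):
    """Classify a Story login attempt against the stored password.

    Both arguments are sequences of image identifiers. Returns
    'correct', 'misordered' (right images, wrong order), or
    'wrong_images'.
    """
    if tuple(entered) == tuple(actual):
        return "correct"
    # Same images regardless of order means only the sequence was wrong.
    if Counter(entered) == Counter(actual):
        return "misordered"
    return "wrong_images"


attempt = classify_entry(["dog", "car", "pizza", "beach"],
                         ["car", "dog", "pizza", "beach"])
```

Tallying the share of failed attempts labeled 'misordered' is one way to arrive at a figure like the one reported above.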
As such, it seems advisable in constructing graphical password schemes to avoid having users remember an ordering of images. For example, we expect that a selection of k images, each from a distinct set of n images (as in the Face scheme, although with image categories not necessarily of only persons), will generally be more memorable than an ordered selection of k images from one set. If a scheme does rely on users remembering an ordering, then the importance of the story should be reiterated to users, because if the sequence of images has some semantic meaning, it is more likely that the password will be memorable (assuming, of course, that the sequences are not too long[36]).
Figure 9-7. Memorability versus time since last login attempt; each data point represents the average of 90 login attempts; of the 236 incorrect password entries in Story, over 75% of them consisted of the correct images selected in an incorrect order
9.3.3. Discussion
These results demonstrate that graphical password schemes can suffer from drawbacks similar to those of textual password schemes; most notably, they exhibit similar biases in human tendencies to select memorable passwords. Moreover, forthcoming evaluations of which we are aware may further elucidate the depth of this problem. For example, Sasse is exploring the susceptibility of graphical passwords to the spouse-test, where spouses play the role of informed impostors.[37] Early evidence suggests that graphical password schemes of the type we consider here may indeed be vulnerable to such "adversaries." One alternative to strengthen graphical passwords is to prohibit user selection of passwords, so that each user's password is system-generated. However, it is widely considered that such measures have failed in the case of textual passwords as a result of usability concerns, and more research is needed in the context of particular graphical password schemes to ascertain whether this is a reasonable measure.
The memorability studies undertaken thus far have all been limited in a number of respects. For one, the effect of image confusion has yet to be evaluated, and it is unclear whether the impressive recall rates observed thus far (albeit in small user trials) will be adversely affected as we become more exposed to graphical authentication systems. To illustrate the risk of image confusion, let's imagine that graphical password schemes became widespread. Furthermore, suppose, for argument's sake, that the Story scheme becomes the de facto scheme of choice. It remains entirely plausible that as a user becomes exposed to more instances of the scheme (say, for example, for accessing web-based email, online banking services, news subscriptions, etc.) the user will confuse stories used to access each of these services, particularly if the image categories selected by the user for one of her stories appear as distractors during the authentication stage for a different service. To date, no evaluation of graphical password schemes of which we are aware has taken this effect into consideration, for example, by forcing users to create new graphical passwords every so often. In that regard, we believe that exploring the impact of this effect on long-term recall rates remains an area of research that warrants further investigation.