9.3. Picture Perfect?

In this section, we summarize several analyses that have been performed to evaluate the security and usability of graphical passwords. While the old adage that a picture is worth a thousand words might indeed be true, it's not entirely clear that user-chosen graphical passwords of the type suggested to date offer additional security over text passwords. An alternative may be to utilize only system-chosen passwords, although we might expect that this would sacrifice some degree of memorability; we do not explore this avenue here, as we are unaware of empirical results to evaluate this conjecture in general.

9.3.1. Security

As with usability analyses, which we examine later in this chapter, the most compelling security analyses for graphical password schemes permitting user password selection are those performed on the basis of extensive user experiments. After all, the security weaknesses of text passwords were revealed only by their use in practice. That said, to date there are few such user studies, and so graphical password design efforts have appealed to surrogate analyses in an effort to reason about the security of particular proposals.

9.3.1.1 Key generation

The graphical password scheme that has been the topic of most such analyses is the Draw-a-Secret (DAS) scheme.^[19] In the paper that originally proposed this scheme, Jermyn et al. reason about the size of the memorable password space, giving a counting argument that the number of memorable DAS passwords (i.e., those having a simple algorithm to generate them) quickly outpaces the number of text passwords that are commonly chosen, as measured by the size of dictionaries commonly applied to break them. As discussed previously in this chapter, the password space is particularly important when considering the use of this scheme to generate cryptographic keys.

^[19] Jermyn et al.

Recently, however, Thorpe and van Oorschot have postulated that the memorable DAS passwords are those that exhibit mirror symmetry. They show that, if this is true, the security of DAS against dictionary attacks may be far less than originally hypothesized.^[20] They argue that a similar weakness results if users select DAS passwords that are simple by various pattern complexity measures^[21] (for example, by selecting only a small number of strokes).^[22] If the DAS passwords selected by users in practice are consistent with the hypotheses of Thorpe and van Oorschot, then these works may point to ways to strengthen DAS passwords, perhaps by implementing restrictions on DAS passwords similar to those levied on text passwords today.

^[20] J. Thorpe and P. C. van Oorschot, "Graphical Dictionaries and the Memorable Space of Graphical Passwords," Proceedings of the 13th USENIX Security Symposium (Aug. 2004).

^[21] F. Attneave, "Complexity of Patterns," American Journal of Psychology 68 (1955), 209-222.

^[22] J. Thorpe and P. C. van Oorschot, "Towards Secure Design Choices for Implementing Graphical Passwords," Proceedings of the 20th Annual Computer Security Applications Conference (Dec. 2004).

9.3.1.2 Authentication

To our knowledge, the only significant user study on the security of graphical passwords for authentication was performed by Davis and the present authors.^[23] In that work, we studied the security of two schemes based on image recognition, denoted "Face" and "Story," which are described shortly. This study focused specifically on the impact of user selection of passwords in these schemes, and the security of the passwords that resulted. We recount some of the notable results from this study, and the methodologies used to reach them, as an illustration of some of the challenges that graphical passwords can face. In particular, this study demonstrated that graphical password schemes can be far weaker than textual passwords when users are permitted to choose their passwords.

^[23] D. Davis, F. Monrose, and M. K. Reiter, "On User Choice in Graphical Password Schemes," Proceedings of the 13th USENIX Security Symposium (Aug. 2004), 151-164.

In the Face scheme, the password is a collection of k faces, each selected from a distinct set of n > 1 faces; for our evaluation we used k = 4 and n = 9. So, while choosing her password, the user is shown four successive 3 x 3 grids containing randomly chosen images (see Figure 9-4(a), for example), and for each, she selects one image from that grid as an element of her password. Images are unique and do not appear more than once for a given user. During the authentication phase, the same sets of images are shown to the user, but with the images permuted randomly.

In the Story scheme, a password is a sequence of k unique images selected by the user to make a "story," from a single set of n > k images, each derived from a distinct category of image types. The images are drawn from categories that depict everyday objects, food, automobiles, animals, children, sports, scenic locations, and male and female models. A sample set of images for the story scheme is shown in Figure 9-4(b).
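The two definitions above lead to different password-space sizes, which matter for the security analysis later in this section. The following is a minimal sketch of the counting argument, assuming k = 4 and n = 9 for both schemes (the configuration evaluated here); the function names `face_space` and `story_space` are ours, introduced only for illustration.

```python
from math import factorial


def face_space(n: int, k: int) -> int:
    """Face-style scheme: one image picked from each of k distinct sets of n images."""
    return n ** k


def story_space(n: int, k: int) -> int:
    """Story-style scheme: an ordered selection of k unique images from one set of n."""
    return factorial(n) // factorial(n - k)


if __name__ == "__main__":
    print(face_space(9, 4))   # 9^4 = 6561
    print(story_space(9, 4))  # 9 * 8 * 7 * 6 = 3024
```

Because Story draws all k images from a single set without repetition, its space is smaller than that of Face for the same n and k.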

We chose to study the Face scheme, in particular, because of a depth of psychological literature that revealed factors that could potentially be sources of bias in password selection. For example, the scientific literature abounds with studies that show that people tend to agree about attractiveness even across cultures,^[24] and psychologists have argued for decades that the old adage that "beauty is in the eye of the beholder" may be largely false. A natural question is whether general perceptions of beauty (e.g., facial symmetry, youthfulness, averageness)^[25], ^[26] might influence graphical password choices. Similarly, the "race effect" refers to the innate ability of people to better recognize faces from their own race than faces of people from other races.^[27], ^[28], ^[29] Again, this raises the question as to whether race might influence a user's choice for graphical passwords in the Face scheme.

^[24] J. Langlois, L. Kalakanis, A. Rubenstein, A. Larson, M. Hallam, and M. Smoot, "Maxims and Myths of Beauty: A Meta-Analytic and Theoretical Review," Psychological Bulletin 126 (2000), 390-423.

^[25] T. Alley and M. Cunningham, "Averaged Faces Are Attractive, But Very Attractive Faces Are Not Average," Psychological Science 2 (1991), 123-125.

^[26] A. Feingold, "Good-Looking People Are Not What We Think," Psychological Bulletin 111 (1992), 304-341.

^[27] D. Levin, "Race as a Visual Feature: Using Visual Search and Perceptual Discrimination Tasks to Understand Face Categories and the Cross Race Recognition Deficit," Quarterly Journal of Experimental Psychology: General 129 (4), 559-574.

^[28] T. Luce, "Blacks, Whites and Yellows: They All Look Alike to Me," Psychology Today 8 (1974), 105-108.

^[29] P. Walker and W. Tanaka, "An Encoding Advantage for Own-Race Versus Other-Race Faces," Perception 23 (2003), 1117-1125.

Figure 9-4. (a, left) In the Face scheme, a user's password is a sequence of k faces, each chosen from a distinct set of n > 1 faces; (b, right) in the Story scheme, a user's password is a sequence of k unique images selected from one set of n images to depict a "story"; in the above examples, n = 9, and images are placed randomly in a 3 x 3 grid

To study both the Story scheme and the Face scheme, we collected user data during the fall semester (roughly the four-month period of late August through early December) of 2003, of graphical password usage by three computer engineering and computer science classes at two universities. Each student used one of the graphical password schemes for access to content including his grades, homework, homework solutions, course reading materials, etc., via standard Java-enabled web browsers.

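For the purposes of the experiment, facial images were classified into twelve nonoverlapping categories, namely typical Asian males, typical Asian females, typical black males, typical black females, typical white males, typical white females, Asian male models, Asian female models, black male models, black female models, white male models, and white female models.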

To simplify the analysis, we made the assumption that images in a category are equivalent; that is, the specific images in a category that are available do not significantly influence a user's choice in picking a specific category.

If we simply consider the set of images chosen by men and women using the Face scheme (see Figure 9-5), some differences are apparent immediately: for one, different populations exhibit strong differences in their password choices.

Figure 9-5. Category selection based on gender and race for the Face scheme; the graph shows the distribution of choices from sets of images consisting of typical Asian males, typical Asian females, typical black males, typical black females, typical white males, typical white females, Asian male models, Asian female models, black male models, black female models, white male models, and white female models

Insight into what different groups tend to choose as their passwords in the Face scheme is shown in Tables 9-1 and 9-2, which characterize selections by gender and race, respectively. As can be seen in Table 9-1, both males and females chose females in Face significantly more often than males, and when males chose females, they almost always chose models (roughly 80% of the time).

Moreover, perceptual differences were also observed when we examined image selection across racial categories. In that case, the "race effect" described earlier seemingly influenced the selection of passwords. As depicted in Table 9-2, Asian females and white females chose from within their races roughly 50% of the time; white males chose whites over 60% of the time.

Table 9-1. Gender and attractiveness selection in Face; the results show that "beauty" appeared to play a significant role in the choices of images selected by both genders, albeit more so for males
Population	Female model	Male model	Typical female	Typical male
Female	40.0%	20.0%	28.8%	11.3%
Male	63.2%	10.0%	12.7%	14.0%

Table 9-2. Evidence of the "race effect" can be seen in the selection of images for the Face scheme; this effect is quite startling in the case of black males, although the reader is cautioned that there were only three black males in the study, and so any conclusion along those lines requires greater validation
Population	Asian	Black	White
Asian female	52.1%	16.7%	31.3%
Asian male	34.4%	21.9%	43.8%
Black male	8.3%	91.7%	0.0%
White female	18.8%	31.3%	50.0%
White male	17.6%	20.4%	62.0%

The categories of images chosen by each gender and race in the Story scheme are shown in Figure 9-6. The most significant deviations between males and females are that females chose animals twice as often as males did, and males chose women twice as often as females did. Less pronounced differences are that males tended to select nature and sports images somewhat more than females did, and females tended to select food images more often.

Figure 9-6. Category selection based on gender and race for the Story scheme; the graph shows the distribution of choices from sets of images representing the nine categories: animals, cars, women, food, children, men, objects, nature, and sports

USERS' COMMENTS ON THEIR PASSWORD CHOICES

"In order to remember all the pictures for my login (after forgetting my "password" four times in a row), I needed to pick pictures I could easily remember, kind of the same pitfalls when picking a lettered password. So, I chose all pictures of beautiful women."
"I started by deciding to choose faces of people in my own race...specifically, people that looked at least a little like me. The hope was that knowing this general piece of information about all of the images in my password would make the individual faces easier to remember."
"I simply picked the best-looking girl on each page."

Given these differences across populations in both the Face scheme and the Story scheme, we set out to measure the ability of an attacker to guess the password of a user in each scheme. We summarize our findings in the following discussions. In our analysis, we let p denote a password selected in either the Face scheme or the Story scheme. Then, Pr[p] for any p denotes the probability that the scheme yields the password p, where the probability is taken over both user choice and random choices in the scheme.

Given accurate values for Pr[p] for each p, a measure that indicates the ability of an attacker to guess passwords is the guessing entropy of passwords.^[30] Informally, guessing entropy measures the expected number of guesses an attacker with perfect knowledge of the probability distribution on passwords would need in order to guess a password chosen from that distribution. Guessing entropy supposes that the attacker examines his guesses in an optimal order to minimize his expected number of guesses. So, if we enumerate passwords p1, p2,... in nonincreasing order of Pr[pi], then the guessing entropy is simply:

\[ \sum_{i > 0} i \cdot \Pr[p_i] \]

^[30] J. L. Massey, "Guessing and Entropy," Proceedings of the 1994 IEEE International Symposium on Information Theory (1994).

Because guessing entropy intuitively corresponds closely to the attacker's task in which we are interested (guessing a password), we will mainly consider measures motivated by the guessing entropy.
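To make the definition concrete, guessing entropy is the sum, over passwords ranked from most to least likely, of each password's rank (starting at 1) times its probability. The sketch below computes it for a given list of probabilities; the function name `guessing_entropy` and the toy distributions are illustrative assumptions, not data from the study.

```python
from typing import Iterable


def guessing_entropy(probabilities: Iterable[float]) -> float:
    """Expected number of guesses for an attacker who tries passwords in
    nonincreasing order of probability: sum of i * Pr[p_i], with i from 1."""
    ordered = sorted(probabilities, reverse=True)
    return sum(i * p for i, p in enumerate(ordered, start=1))


if __name__ == "__main__":
    # Hypothetical toy distributions over four passwords (illustrative only).
    skewed = [0.1, 0.7, 0.1, 0.1]
    uniform = [0.25] * 4

    print(guessing_entropy(skewed))   # 1*0.7 + 2*0.1 + 3*0.1 + 4*0.1
    print(guessing_entropy(uniform))  # (4 + 1) / 2
```

The skewed distribution yields a lower value than the uniform one, which is the sense in which biased user choice helps an attacker.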

The direct use of the preceding formula to compute guessing entropy is problematic for two reasons:

Pr[p] for each p can be estimated only from the data observed in our experiments. In our experience, the use of these probabilities was sensitive to various parameter settings in our methodology.
An attacker guessing passwords will be offered additional information when performing a guess, such as the set of available categories from which the next image can be chosen. For example, in Face, each image choice is taken from nine images that represent nine categories of images, chosen uniformly at random from the twelve categories. This additional information constrains the set of possible passwords, and the attacker would have this information when performing a guess in many scenarios.

To account for the first of these issues, we use the probabilities only to determine an enumeration p = (p1,p2,...) of passwords in nonincreasing order of probability.^[31] This enumeration is far less sensitive to parameter variations than are the numeric probabilities, and so leads to more robust results. We use this sequence to conduct tests in which we randomly select a small set of "test" passwords (20% of the data set) and use the remainder of the data to compute the enumeration p.

^[31] Davis, Monrose, and Reiter.

We then guess passwords in order of p until each test password is guessed. To account for the second issue identified earlier (namely, the set of available categories during password selection), we first filter from p the passwords that would have been invalid given the available categories when the test password was chosen, and obviously do not guess them. By repeating this test with nonoverlapping test sets of passwords, we obtain a number of guesses per test password.
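The procedure described above can be sketched in code. The sketch below is not the estimator used in the study: it assumes, purely for illustration, that a password is a tuple of category labels, that candidate passwords are ranked by the product of smoothed per-category frequencies observed in the training data, and that folds are taken deterministically rather than at random. It also enumerates Face-style candidates, in which a category may repeat across positions. All names (`category_weights`, `guesses_needed`, `cross_validated_guesses`) are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Sequence, Tuple

Password = Tuple[str, ...]
# A sample pairs a password with the categories that were shown at each position.
Sample = Tuple[Password, Sequence[Sequence[str]]]


def category_weights(training: Sequence[Password],
                     categories: Sequence[str]) -> Dict[str, float]:
    """Add-one-smoothed relative frequency of each category in the training set."""
    counts = Counter(c for pw in training for c in pw)
    total = sum(counts.values()) + len(categories)
    return {c: (counts[c] + 1) / total for c in categories}


def score(pw: Password, weights: Dict[str, float]) -> float:
    """Assumed ranking score: product of the weights of the chosen categories."""
    result = 1.0
    for c in pw:
        result *= weights[c]
    return result


def guesses_needed(target: Password,
                   available: Sequence[Sequence[str]],
                   weights: Dict[str, float]) -> int:
    """Guesses until `target` is found, trying candidates in nonincreasing
    score order. Only passwords valid for the available categories are
    enumerated, which plays the role of the filtering step."""
    candidates = sorted(product(*available),
                        key=lambda pw: score(pw, weights),
                        reverse=True)
    return candidates.index(tuple(target)) + 1


def cross_validated_guesses(samples: Sequence[Sample],
                            categories: Sequence[str],
                            folds: int = 5) -> List[int]:
    """Hold out each nonoverlapping fold (20% of the data when folds=5),
    rank guesses from the remaining samples, and record the guesses
    needed for every held-out password."""
    results: List[int] = []
    for f in range(folds):
        test = samples[f::folds]
        train = [pw for i, (pw, _) in enumerate(samples) if i % folds != f]
        weights = category_weights(train, categories)
        results.extend(guesses_needed(pw, avail, weights) for pw, avail in test)
    return results
```

The returned list holds one guess count per test password, from which averages, medians, and the guess counts for the weakest passwords can then be derived.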

Tables 9-3 and 9-4 present results for the Face scheme and the Story scheme, respectively. Populations with fewer than 10 passwords are excluded from these tables. The results in these tables should be considered in light of the number of available passwords. In particular, the Face scheme (in the configuration we tested) has 9⁴ = 6,561 possible passwords (for fixed sets of available images), for a maximum guessing entropy of 3,281. However, our results show that for Face, if the user is known to be a male, then the worst 10% of passwords can be guessed easily on the first or second attempt. This observation is sufficiently surprising as to warrant restatement: an attacker can succeed in merely two guesses for 10% of male users. Similarly, if the user is Asian and his gender is known, then the worst 10% of passwords can be guessed within the first six tries.
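The maximum guessing entropy quoted for Face corresponds to the case in which every password is equally likely, so that the attacker's guessing order gives no advantage. As a short worked derivation, writing N for the number of possible passwords (a symbol introduced here only for illustration) and assigning each password probability 1/N:

```latex
\[
\sum_{i=1}^{N} i \cdot \frac{1}{N}
  = \frac{1}{N} \cdot \frac{N(N+1)}{2}
  = \frac{N+1}{2},
\qquad
N = 9^4 = 6561 \;\Longrightarrow\; \frac{6561 + 1}{2} = 3281 .
\]
```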

Table 9-3. Per-password guesses required to find test passwords in the Face scheme; the results show that while the chosen configurations yield 6,561 possible passwords, an attacker requires only two guesses to find the passwords of 10% of male users
Population	Average number of guesses	Median number of guesses	Guesses to find weakest 25%	Guesses to find weakest 10%
Overall	1374	469	13	2
Male	1234	218	8	2
Female	2051	1454	255	12
Asian male	1084	257	21	5.5
Asian female	973	445	19	5.2
White male	1260	81	8	1.6

The Story-based scheme offers far fewer possible passwords, namely 9 x 8 x 7 x 6 = 3,024, yielding a maximum possible guessing entropy of 1,513. Nevertheless, Table 9-4 shows that it is more secure, in that the biases observed in the Face scheme do not tend to be as prominent in the Story scheme.

Table 9-4. Guessing entropy for the Story scheme; the results show that if the attacker knows the target user is an Asian male, then an online dictionary attack would succeed in 20 guesses for 10% of these users
Population	Average number of guesses	Median number of guesses	Guesses to find weakest 25%	Guesses to find weakest 10%
Overall	790	428	112	35
Male	826	404	87	53
Female	989	723	125	98
White male	844	394	146	76
Asian male	877	589	155	20

It is also interesting to note that for both schemes the average number of guesses to find a test password is always higher than the median number, implying that there are several good passwords chosen that significantly increase the average number of guesses an attacker would need to perform, but that do not affect the median. The most dramatic example of this is for white males where the average is 1260 versus a median of 81 using the Face scheme, and the average is 844 versus a median of 394 for the Story scheme. This seems to imply that with better user education, the passwords selected by users of these schemes might be hardened against online attacks. We hope that larger-scale studies will better evaluate that claim.
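The gap between average and median can be illustrated with a small example. The guess counts below are hypothetical, chosen only to show the effect, and are not data from the study: a few strong passwords pull the average up sharply while leaving the median untouched.

```python
from statistics import mean, median

# Hypothetical guess counts for ten test passwords (not data from the study):
# most are found quickly, but two strong passwords take thousands of guesses.
guess_counts = [1, 2, 5, 8, 20, 60, 100, 300, 4000, 6000]

if __name__ == "__main__":
    print(mean(guess_counts))    # dominated by the two strong passwords
    print(median(guess_counts))  # reflects the typical password
```

An attacker limited to a small number of online guesses is concerned with the low end of this distribution, which is why the median and the weakest-password columns are more telling than the average.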

9.3.2. Usability

To date, a handful of studies have analyzed, albeit on small populations, the recall rate of authentication systems based on image recognition. For the most part, these studies have shown that memorability is indeed far better for these types of graphical passwords than for their textual counterparts. For instance, Brostoff and Sasse report the results of a three-month trial investigation with 34 students that shows that fewer login errors were made when using Passfaces™ (a commercial scheme based on image recognition) compared to textual passwords, even given significant periods of inactivity between logins.^[32] Similarly, other recent studies have confirmed the memorability of other schemes based on image recognition.^[33], ^[34], ^[35]

^[32] S. Brostoff and M. A. Sasse, "Are Passfaces™ More Usable Than Passwords? A Field Trial Investigation," Proceedings of Human Computer Interaction (2000), 405-424.

^[33] R. Dhamija and A. Perrig, "Déjà Vu: A User Study Using Images for Authentication," Proceedings of the 9th USENIX Security Symposium (Aug. 2000).

^[34] Stubblefield and Simon.

^[35] M. Zviran and W. J. Haga, "A Comparison of Password Techniques for Multilevel Authentication Mechanisms," The Computer Journal 36:3 (1993), 227-237.

For our study, we also evaluated the effect of user choice on the memorability of the chosen passwords. Figure 9-7 shows the percentage of successful logins versus the time since that user's last login attempt. A trend that emerges is that while memorability of both schemes is strong, Story passwords appear to be somewhat harder to remember than Face passwords. One potential reason for users' relative difficulty in remembering their Story passwords is that apparently few of them actually chose stories, despite our suggestion to do so. This contributed very significantly to incorrect password entries resulting from misordering their selections. For example, of the 236 incorrect password entries in Story, over 75% of them consisted of the correct images selected in an incorrect order.
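The misordering category mentioned above is easy to state precisely: an incorrect entry counts as misordered when it contains exactly the images of the password but in a different sequence. The sketch below shows one way such entries might be identified; the function name `is_misordered` and the image labels are illustrative assumptions, not the study's tooling.

```python
from collections import Counter
from typing import Sequence


def is_misordered(entry: Sequence[str], password: Sequence[str]) -> bool:
    """True when the entry has exactly the password's images, in a different order."""
    return list(entry) != list(password) and Counter(entry) == Counter(password)


if __name__ == "__main__":
    password = ["fish", "woman", "girl", "corn"]  # illustrative labels
    print(is_misordered(["corn", "woman", "girl", "fish"], password))  # True
    print(is_misordered(["fish", "woman", "girl", "car"], password))   # False
    print(is_misordered(password, password))                           # False
```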

IMPACT OF ORDERING ON MEMORABILITY

"I had no problem remembering the four pictures, but I could not remember the original order."
"No story, though having one may have helped me to remember the order of the pictures better."
"...but on the third try I found a sequence that I could remember: fish-woman-girl-corn. I would screw up the fish and corn order 50% of the time, but I knew they were the pictures."

As such, it seems advisable in constructing graphical password schemes to avoid having users remember an ordering of images. For example, we expect that a selection of k images, each from a distinct set of n images (as in the Face scheme, although with image categories not necessarily of only persons), will generally be more memorable than an ordered selection of k images from one set. If a scheme does rely on users remembering an ordering, then the importance of the story should be reiterated to users, because if the sequence of images has some semantic meaning, it is more likely that the password will be memorable (assuming, of course, that the sequences are not too long^[36]).

Figure 9-7. Memorability versus time since last login attempt; each data point represents the average of 90 login attempts; of the 236 incorrect password entries in Story, over 75% of them consisted of the correct images selected in an incorrect order

^[36] G. A. Miller, "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information," Psychological Review 63 (1956), 81-97.

9.3.3. Discussion

These results demonstrate that graphical password schemes can suffer from drawbacks similar to those of textual password schemes; most notably, they exhibit similar biases in human tendencies to select memorable passwords. Moreover, forthcoming evaluations of which we are aware may further elucidate the depth of this problem. For example, Sasse is exploring the susceptibility of graphical passwords to the spouse-test, where spouses play the role of informed impostors.^[37] Early evidence suggests that graphical password schemes of the type we consider here may indeed be vulnerable to such "adversaries." One alternative to strengthen graphical passwords is to prohibit user selection of passwords, so that each user's password is system-generated. However, it is widely considered that such measures have failed in the case of textual passwords as a result of usability concerns, and more research is needed in the context of particular graphical password schemes to ascertain whether this is a reasonable measure.

^[37] M. A. Sasse, personal communications, DIMACS Workshop on Usable Privacy and Security (July 2004).

The memorability studies undertaken thus far have all been limited in a number of respects. For one, the effect of image confusion has yet to be evaluated, and it is unclear whether the impressive recall rates observed thus far (albeit in small user trials) will be adversely affected as we become more exposed to graphical authentication systems. To illustrate the risk of image confusion, let's imagine that graphical password schemes became widespread. Furthermore, suppose, for argument's sake, that the Story scheme becomes the de facto scheme of choice. It remains entirely plausible that as a user becomes exposed to more instances of the scheme (say, for accessing web-based email, online banking services, news subscriptions, etc.) the user will confuse the stories used to access each of these services, particularly if the image categories selected by the user for one of her stories appear as distractors during the authentication stage for a different service. To date, no evaluation of graphical password schemes of which we are aware has taken this effect into consideration, for example, by forcing users to create new graphical passwords every so often. In that regard, we believe that the impact of this effect on long-term recall rates remains an area of research that warrants further investigation.