How to Analyze Survey Responses


Survey analysis and interpretation is as much an art as a science. Although it deals with numbers, proportions, and relationships, it also measures statements, and the relationship between statements and actions is notoriously difficult to understand. From the ambiguities of recruiting bias, to misunderstandings in question wording, to the tendency of people to exaggerate, the whole process involves approximations and estimates. Ultimately, every person is different, and the final analysis will miss the subtleties of any single person's perceptions, behavior, or experience. That's all right, though. In most cases, results can be valuable and useful without complete certainty, and often you only need to know the odds to make an informed decision.

Thus the analysis of survey data should strive for accuracy and immediate utility. Although sophisticated techniques can sometimes extract important subtleties from data, simpler analysis is preferable in most cases. Simpler methods reduce both the labor involved and the possibility of error, and they are sufficient to answer the majority of questions that come up in typical product development situations.

The two common analysis techniques can be summarized simply as counting and comparing.

Counting

The easiest, and often the only, thing that you can do with results is to count them (to tabulate them, in survey-speak). Tabulation consists of counting all the response values to each question, and it can reveal simple trends and uncover data entry errors.

You should start by looking through the raw data. The raw information can give you a sense of trends that may be present before you start numerically abstracting the results. How are the results distributed? Is there an obvious way that some responses cluster? Are there any clearly bogus or atypical responses (such as teenagers with $150,000 personal incomes or skateboarding octogenarians)? Spending time with the raw data can give you a gut-level feeling for it, which will prove useful later.

Once you've looked through the raw data, a simple count of all the answers to a given question can be useful. For example, adding up the answers to the question "Which of the following categories includes your household income?" could yield the following table:

Household income      Responses
Less than $20,000             0
$20,001–$29,999               2
$30,000–$39,999               3
$40,000–$49,999              10
$50,000–$59,999              12
$60,000–$69,999              20
$70,000–$79,999              25
$80,000–$99,999              28
$100,000–$119,999            22
$120,000–$149,999            17
$150,000 or over              5
No answer                    10

Displayed as a simple histogram, this reveals some interesting information about how your audience's income is distributed.

From Figure 11.3, it looks like the audience's income peaks somewhere between $80,000 and $100,000, and that people with incomes between $60,000 and $150,000 make up the majority of the users. If your site is aimed at lower-middle-income participants, it's clear that it isn't attracting those people in the way you had hoped.

Figure 11.3: A normal distribution.
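If the raw responses live in a spreadsheet or CSV export, a count like the one in the table can be produced in a few lines of code. This is only a sketch; the response values below are made up for illustration, and an empty string is assumed to stand in for "no answer."

```python
from collections import Counter

# Hypothetical raw answers to the income question, one entry per
# respondent; an empty string stands in for "no answer."
responses = [
    "$80,000-$99,999", "$60,000-$69,999", "$80,000-$99,999",
    "$100,000-$119,999", "", "$40,000-$49,999",
    # ...one value per completed survey
]

counts = Counter(responses)
for category, count in counts.most_common():
    print(f"{category or 'No answer'}: {count}")
```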

Taking the data from the table, it's possible to calculate the mean and mode of the data. The mean is the average of the values, in the traditional algebraic sense. It's calculated by adding all the values of the responses to a given question and then dividing by the number of responses. In the case of ranges of numbers, such as the one above, you can use the midpoint of each range as your starting point. For this example, the mean would be about $86,000, calculated as such:

mean = (2 × $25,000 + 3 × $35,000 + 10 × $45,000 + 12 × $55,000 + 20 × $65,000 + 25 × $75,000 + 28 × $90,000 + 22 × $110,000 + 17 × $135,000 + 5 × $150,000) / 144 ≈ $86,000

where 144 is the number of people who answered the question.

It is "about" $86,000 because the ranges are broad, and the highest range of "$150,000 or over" is unbounded at the upper end (the lower bound was used in calculations). For practical purposes, this is generally enough.

The mean, however, can be easily skewed by a small number of extreme results. It's the "billionaire" problem: if you're sampling the actual values of annual salaries and you happen to survey Bill Gates, your "average" value is likely to be significantly higher than what the majority of people in your sample make. This is where looking at the raw data is important since it will give you a gut-level expectation for the results. Your gut could still be wrong, but if you looked at a bunch of responses where people had $40,000 and $50,000 incomes and your mean turns out to be $120,000, then something is likely pushing the value up. You should start looking for outliers, or responses that are well outside the general variation of data since a few extreme values may be affecting the mean.

The mode, the most common value, can be compared to the mean to see if the mean is being distorted by a small number of extreme values (in our example, it's "$80,000–$99,999," which has 28 responses). When your responses fall into a normal distribution, where the data rise to a single maximum and then symmetrically fall off (forming the so-called bell curve), the mean and mode are the same (as they roughly are in the example). The larger the sample, the more likely you are to have a normal distribution. However, sometimes you don't. If, for some reason, your site manages to attract two different groups of people, the mean and mode may be different numbers. Take, for example, a site that's used extensively by practicing doctors and by medical school students. The income distributions might look something like this:

Household income      Responses
Less than $20,000            20
$20,001–$29,999              17
$30,000–$39,999              14
$40,000–$49,999               6
$50,000–$59,999              10
$60,000–$69,999              12
$70,000–$79,999              18
$80,000–$99,999              22
$100,000–$119,999            20
$120,000–$149,999            15
$150,000 or over              9
No answer                     3

The mean income based on this table is about $70,000, but the mode is about $90,000. This difference is large enough to suggest that the distribution of responses is not a balanced bell curve (in fact, it's what's called a bimodal distribution), which is a tip-off that additional analysis is necessary. A histogram (Figure 11.4) shows this clearly.

Figure 11.4: A bimodal distribution.

Since it's important to know whether you have a single homogeneous population or if you have multiple subgroups within your group of users, looking at the difference between the mode and mean can be an easy, fast check.

Likewise, the median, the value at the halfway point if you sort all the results, can also tell you whether your mean is being affected by extreme values. The median of the second example is about $75,000 and the mode is about $90,000, which tells you that the mean of roughly $70,000 is being pulled down by a large cluster of lower numbers, which the histogram clearly shows. Because it's less affected by outliers than the mean, the median is the statistic typically cited and compared when discussing standard demographic descriptors such as income and age.
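The arithmetic is simple enough to script. Below is a rough sketch that computes the mean, median, and mode from the binned counts of the doctors-and-students example, using range midpoints (and the lower bound of the open-ended top range) as described above; the code is illustrative rather than a prescribed method.

```python
# Binned responses from the doctors-and-students example, using range
# midpoints (and the lower bound of the open-ended "$150,000 or over"
# range); the 3 "no answer" responses are excluded.
bins = [
    (10_000, 20), (25_000, 17), (35_000, 14), (45_000, 6),
    (55_000, 10), (65_000, 12), (75_000, 18), (90_000, 22),
    (110_000, 20), (135_000, 15), (150_000, 9),
]

n = sum(count for _, count in bins)
mean = sum(value * count for value, count in bins) / n

# Mode: the midpoint of the most heavily populated bin.
mode = max(bins, key=lambda b: b[1])[0]

# Median: walk the bins (already sorted by value) until half of the
# responses are accounted for.
half, running = n / 2, 0
for value, count in bins:
    running += count
    if running >= half:
        median = value
        break

print(f"N = {n}, mean = ${mean:,.0f}, median = ${median:,.0f}, mode = ${mode:,.0f}")
# -> N = 163, mean = $71,288, median = $75,000, mode = $90,000
```

A mean that sits well below the mode, as here, is the tip-off that the distribution may not be a single bell curve.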

start sidebar
How to Deal with Missing Data

Not everyone will answer every question. How should you deal with that? The simplest common method is to report the missing elements when tabulating variables and eliminate those responses from calculations that use those variables in comparisons. Elimination creates the problem that different calculations and conclusions will be based on different numbers of responses. If the sample is sufficiently large and the number of eliminated responses is relatively small, the calculations should still be usable and comparable, but it becomes an issue when the amount of missing data overwhelms the margin of error. Regardless, when the number of responses differs, always list the actual number of responses used. This is generally reported as N = x, where x is the number of responses used in the calculation.

end sidebar
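Following the sidebar's approach, here is a minimal sketch in pandas; the file name and column name are hypothetical.

```python
import pandas as pd

# Hypothetical export: one row per respondent, blank cells where a
# question was skipped.
df = pd.read_csv("survey_responses.csv")

# Report the missing answers when tabulating the variable...
print(df["income"].value_counts(dropna=False))

# ...but drop them from calculations that use the variable, and report
# how many responses the calculation is actually based on (N = x).
answered = df["income"].dropna()
print(f"N = {len(answered)} of {len(df)} responses used")
```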

Comparing

Tabulating single variables can be informative and useful, but the real power of survey research lies in comparing the contents of several variables to each other. For example, you may be interested in how the frequency with which people use your site affects what kinds of features they use. Do people who use the site all the time use a different set of features than people who use it occasionally? Knowing this could allow you to better emphasize features and create introductory help. Just looking at the data, this type of relationship is difficult to discern, so you need to start using a comparison technique. The most common comparison technique is cross-tabulation. Cross-tabulation uncovers the relationship between two variables by comparing the value of one to the value of another.

Although there are a number of ways to create a cross-tab, a typical technique works as follows:

  1. Start by identifying the independent variable. This is the factor that you feel is doing the "affecting" and the one that is the subject of your question. In "How is the frequency of visitation affecting the kinds of features our users use?" the independent variable is the frequency of visitation since that is likely affecting the features that people are using (rather than the other way around, where using certain features causes people to visit the site more—this is possible, but not as likely, based on what you know of people's use of the site).

  2. Group the responses to the question according to the values of the independent variable. For example, if your question asked, "How often do you use [the site]?" and the multiple-choice answers were "less than once a month, once a month, several times a month, and so on," then grouping the responses according to the answers is a good place to start.

  3. Tabulate the answers to the other variable, the dependent variable, individually within each independent variable group. Thus, if another survey question asked, "Which of the following features did you use in your last visit to [the site]?" then people's answers to it would form the dependent variable. If the answers to this question were "the Shopping Cart, the News Page, and the Comparison Assistant," you would tabulate how many people in each group checked off each of those answers.

  4. Create a table with the tabulated values. For example, the following table compares the features that various groups of people report using in their last visit to the site:

    Feature                 Less Than       Once a    Several Times
                            Once a Month    Month     a Month          Etc.
    Shopping Cart           5%              8%        20%              ...
    News Page               20%             25%       15%              ...
    Comparison Assistant    2%              10%       54%              ...

At this point, it should be possible to see simple relationships between the two variables, if any are there to be seen. For example, people who use the site multiple times a month use the Comparison Assistant significantly more than people who use the site less frequently, which likely means that the more people use the site, the more that feature becomes valuable to them. (Why? That's a question that surveys can't easily answer.) Likewise, the News Page seems to be somewhat less important to frequent users than to others, which isn't surprising considering they visit enough so that less is new to them with each visit. Additional relationships can be found by comparing other variables to each other (for example, length of use to frequency of use: do people use the site more frequently the longer they've been using it?).
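If the raw responses are in a table with one row per respondent, a library such as pandas can do steps 2 through 4 in a single call. This is only a sketch; the file and column names are assumptions, and it treats each respondent as reporting a single feature.

```python
import pandas as pd

# Hypothetical raw data: one row per respondent, with the independent
# variable (visit frequency) and the dependent variable (feature used).
df = pd.read_csv("survey_responses.csv")

# Cross-tabulate: rows are features, columns are visit-frequency groups,
# cells are the share of each group reporting that feature.
crosstab = pd.crosstab(
    df["feature_used"],       # dependent variable
    df["visit_frequency"],    # independent variable
    normalize="columns",      # counts -> within-group proportions
)
print((crosstab * 100).round(1))
```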

The following table compares the answers to the question "How often do you visit this Web site?" with the answers to "Why are you visiting this site today?" in order to understand whether frequency of visitation affects the reasons people come to a site. It summarizes only the answer categories for "Why are you visiting this site today?" that received 500 responses or more (because that was determined to be the minimum number necessary for statistical significance).

Reason for visiting                                       This Is My   Less Than      Once a   Once a   More Than     Row
                                                          First Time   Once a Month   Month    Week     Once a Week   Total
Looking for information about a specific radio program          260          229        167      129         115       900
Other                                                            220          159        104       56          78       617
Want to listen to a radio program                                344          245        251      298         630      1768
Want to read news or information                                 140          120        106       96         109       571
Column total                                                     964          753        628      579         932      3856

Displayed as percentages, it's a little more informative.

Reason for visiting                                       This Is My   Less Than      Once a   Once a   More Than     Row
                                                          First Time   Once a Month   Month    Week     Once a Week   Total
Looking for information about a specific radio program          29%          25%        19%      14%         13%      100%
Other                                                            36%          26%        17%       9%         13%      100%
Want to listen to a radio program                                19%          14%        14%      17%         36%      100%
Want to read news or information                                 25%          21%        19%      17%         19%      100%
Mean response                                                    25%          20%        16%      15%         24%      100%
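Converting the counts in the first table to the row percentages shown here takes one line once the counts are in a data frame. The sketch below uses the radio-site totals with abbreviated labels; it's an illustration, not a prescribed tool.

```python
import pandas as pd

# Counts from the radio-site cross-tab above (reasons as rows,
# visit-frequency groups as columns), without the total row and column.
counts = pd.DataFrame(
    {
        "First time":   [260, 220, 344, 140],
        "< once/month": [229, 159, 245, 120],
        "Once a month": [167, 104, 251, 106],
        "Once a week":  [129,  56, 298,  96],
        "> once/week":  [115,  78, 630, 109],
    },
    index=["Program info", "Other", "Listen to a program", "Read news"],
)

# Convert each row to percentages of its row total.
row_percent = counts.div(counts.sum(axis=1), axis=0) * 100
print(row_percent.round(0))
```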

A proportion chart, shown in Figure 11.5, however, tells the most complete story.

Figure 11.5: Example cross-tabulation chart.

Just glancing at the chart reveals several observations.

  • Regular visitors come to listen more than casual visitors do. This implies that one of the driving "stickiness" factors may be the fact that the site offers audio streaming capability.

  • Infrequent users tend to look for program information more than regular users. Maybe this is because they don't know that there's more there. If so, this could affect the site's design. Infrequent users also have a tendency to look for "other" things, which may be a further indicator that the site insufficiently communicates what is available, since that value drops only slightly for people who have visited more than once.

Note

If you are using Microsoft Excel for tabulation, much of the grunt work of doing cross-tabs is eliminated by PivotTables. PivotTables allow you to take lists of raw survey data, one complete set of responses per row, and automatically cross-tab one variable against another. This can be a great time-saver, but be careful: it's easy to get lost in comparisons and easy to compare things that—logically—should not be compared.

Of course, there are other conclusions that can be drawn and many more sophisticated ways of manipulating and displaying relationships between variables, but these topics are beyond the scope of this book (for that see the excellent information visualization books by Edward Tufte).

When constructing the table, you should always make it clear how many total responses there are to each independent variable group. The customary way to do this is to add a "total" column (though that's less useful when discussing percentages, in which case the total is usually 100%; in those cases, use the n = notation, where n is the number of responses).

This is also the time when the calculation for a minimum number of responses comes into play. If the total number of responses in any independent variable group is less than the minimum sample size you calculated at the beginning of the survey, you should not draw any conclusions about that variable. The results are insignificant and should be marked as such, or left out of the report entirely. You can, however, merge groups to create larger groups that have the requisite sample size. When you do that, you should label the new "super group" clearly. Thus, if there weren't enough results in "18–24 years old" to draw any conclusions, you could leave it out of the report entirely, or you could merge it with "25–30 years old" and create an "18–30 years old" group. This, of course, works only with groups where it makes sense to combine the categories.

Estimating Error

Since every survey uses a sample of the whole population, every measurement is only an estimate. Without doing a census, it's impossible to know the actual values, and a lot of confusion can come from the apparent precision of numerical data. Unfortunately, precision does not mean accuracy. Just because you can make a calculation to the sixth decimal place does not mean that it's actually that accurate. Fortunately, there are ways of estimating how close your observed data are to the actual data and the precision that's significant. This doesn't make your calculations and measurements any better, but it can tell you the precision that matters.

Standard error is a measurement of uncertainty. It describes the blurriness around your calculated value and is a measure of the precision of your calculations. The smaller the standard error, the more precise your measurement; the larger, the less you know about the exact value. Standard error is calculated from the size of your survey sample and the proportion of your measured value to the whole. It's calculated as

standard error = √(P × Q / n)

where P is the value of the percentage relative to the whole, expressed as a decimal, Q is (1 − P), and n is the number of samples.

So if you sample 1000 people, 400 of whom say that they prefer to shop naked at night to any other kind of shopping, your standard error would be calculated as the square root of (0.4 × 0.6)/1000, or about 0.016. This means that the actual value is probably within a 1.6% spread in either direction of the measured value ("plus or minus 1.6% of the measured value"). This is sufficiently accurate for most situations.

Standard error is also useful for figuring out how much precision matters. Thus, if your calculation is precise to six decimal places (i.e., 0.000001), but your standard error is 1% (i.e., 0.01), then all that precision is for naught since the inherent ambiguity in your data prevents anything after the second decimal place from mattering.

The easiest way to decrease the standard error of your calculations is simply to sample more people. Instead of sampling 1000 people, as in the example above, sampling 2000 gives you a standard error of 1.1%, and asking 5000 people reduces it to 0.7%. However, note that it is never zero—unless you ask everybody, there will always be some uncertainty.
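A small sketch of the calculation, using the naked-shopping example above and showing how the error shrinks as the sample grows:

```python
from math import sqrt

def standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p measured from n responses."""
    return sqrt(p * (1 - p) / n)

p = 400 / 1000  # 40% of respondents prefer shopping naked at night
for n in (1000, 2000, 5000):
    print(f"n = {n}: standard error = {standard_error(p, n):.2%}")
# Prints 1.55%, 1.10%, and 0.69% -- the values the text rounds to
# about 1.6%, 1.1%, and 0.7%.
```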

Standard deviation is a measure of confidence. It tells you the probability that the real answer (which you can never know for sure) is found within the spread defined by the standard error. With a normal "bell curve" distribution, the numbers are standard: the real value has a 68% chance of being within one standard deviation (one standard error spread) on either side of the measured value, a 95% chance of being within two standard deviations, and a 99% chance of being within three.

The range that a standard deviation specifies around the measured value is called the confidence interval (see Figure 11.6). It's how standard error and standard deviation are related. Standard error defines the width of the range where you can expect the value to fall, whereas standard deviation gives you the odds that it's in there at all.

Figure 11.6: Confidence intervals.

Say, for example, a survey measures that 50% of the population is male, with a 3% standard error. This means that you can have 68% confidence (one standard deviation, as shown in Figure 11.7) that the actual male percentage of the population is somewhere between 47% and 53%, 95% confidence that the actual percentage is between 44% and 56% (two standard deviations), and 99% confidence that it's between 41% and 59% (three standard deviations). You can't know where in that range it is (maybe it's 54.967353%), but if you need to make decisions based on that number, at least you'll know how close your guess is to reality.
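As a sketch, the intervals in this example can be recovered directly from the measured value and the standard error:

```python
# Measured value and standard error from the example above.
measured, se = 0.50, 0.03

for k, confidence in ((1, "68%"), (2, "95%"), (3, "99%")):
    low, high = measured - k * se, measured + k * se
    print(f"{confidence} confidence: between {low:.0%} and {high:.0%}")
# 68%: 47%-53%, 95%: 44%-56%, 99%: 41%-59%
```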

Figure 11.7: Confidence interval example.

Measurement Errors

By calculating the standard error and confidence level of a sample, you can get some idea of how close your measured data are to the (often unknowable) objective truth. That doesn't mean, however, that your data actually represent what they're supposed to represent. It's still possible to systematically collect data in a logical, careful, statistically accurate way and still be completely wrong.

As an example of problems that cannot be compensated for or predicted with statistics, witness the crash of NASA's Mars Climate Orbiter. The Orbiter was a state-of-the-art piece of equipment built by some of the smartest people in the world. It was launched from Earth in December 1998 and was scheduled to reach Martian orbit in the fall of 1999. The margin of error for such a trip is exceedingly small since the craft carries almost no fuel for corrections should something go wrong. For most of the journey, everything looked great. It flew the whole way flawlessly, its systems regularly reporting its position and velocity back to Earth. The last critical phase before it began its scientific mission was its entry into Martian orbit, which was to be carried out automatically by its systems. As it approached the planet, it began its automatic deceleration sequence. At first it got closer and closer to the planet, as it should have, but then it went lower and lower, dropping first below its designated orbit, then below any moderately stable orbit, then below any survivable altitude, and finally into the atmosphere, where it disappeared forever. In the investigation that followed, it was discovered that one development team had expressed the spacecraft's thruster impulse in the English unit, pound-force seconds, while another was using the metric unit, newton-seconds, which is about four and a half times smaller. So, although both sets of software were working as designed, the spacecraft thought that it was using one measurement system when, in fact, it was using another, causing a multi-hundred-million-dollar spacecraft to crash into Mars rather than going into orbit around it.

Something similar can easily happen in survey research. A financial site may want to know how often people make new equity investments. The question "How often do you buy new stocks?" may make sense in light of this, but if it's asked of a group of shopkeepers without first preparing them for a financial question, it may be interpreted as referring to their inventory. So although the analyst thinks it's a measurement of one thing, the participants think it's a measure of something entirely different.

This is called systematic error since it affects all the data equally. The Mars Climate Orbiter suffered from an extreme case of systematic error. No matter how accurately the measurements were made, they weren't measuring what the engineers thought they were. However, there is also random error, the natural variation in responses. Measurements of standard error are, in a sense, a way to compensate for random error; they tell you roughly how much random error you can expect based on the number of samples you've collected. Since random errors can appear in any direction, they can cancel each other out, which is why standard error shrinks as the number of samples grows.

Drawing Conclusions

The conclusions you draw from your results should be focused on answering the questions you asked at the beginning of the research, the questions that are most important to the future of the product. Fishing through data for unexpected knowledge is rarely fruitful.

Before you begin drawing conclusions, you need to refresh your understanding of the tables you put together at the beginning and that you are filling out as part of the analysis. What variables do they display? What do those variables measure? Why are those measurements important? Redefine your tables as necessary if your priorities have changed over the course of the analysis.

When you are comparing data, you may want to use numerical tests to determine whether the differences in responses between two groups of responses are significant. In the radio example, the difference between people who came to read news once a week and those who came to read news more than once a week is 2%. Is that a significant difference? Chi-square and the Z-test are two tests that can be used to determine this, though explaining the math behind them is beyond the scope of this book.
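The math is beyond this book's scope, but statistical libraries implement these tests. The sketch below applies a chi-square test of independence to the radio-site counts from the earlier cross-tab; the use of scipy here is an assumption, not something the book prescribes.

```python
from scipy.stats import chi2_contingency

# Counts from the radio-site cross-tab (rows are reasons for visiting,
# columns are visit-frequency groups), excluding the total row and column.
observed = [
    [260, 229, 167, 129, 115],  # looking for program information
    [220, 159, 104,  56,  78],  # other
    [344, 245, 251, 298, 630],  # want to listen to a program
    [140, 120, 106,  96, 109],  # want to read news or information
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.1f}, degrees of freedom = {dof}, p = {p_value:.3g}")
# A very small p-value suggests that reason for visiting and visit
# frequency are not independent; it does not say which cells differ.
```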

When making conclusions from data, there are a number of common problems that you should avoid.

  • Confusing correlation and causation. Just because two things happen close together in time does not mean that one causes the other. A rooster crowing at dawn doesn't make the sun come up, though it does usually precede it. This is one of the most commonly made mistakes (look for it and you'll find it all over the media and in bad research) and is probably one of the strongest reasons surveys and statistics get a bad name. It's a simple problem, but it's insidious and confusing. Just because a group of people both like a product and use it a lot doesn't mean that liking the product makes people use it more or that frequent use makes people like it better. The two phenomena could be unrelated.

  • Not differentiating between subpopulations. Sometimes what looks like a single trend is actually the result of multiple trends in different populations. To see if this may be the case, look at the way answers are distributed rather than just the composite figures. The distributions can often tell a different story than the summary data. For example, if you're doing a satisfaction survey and half the people say they're "extremely satisfied" and the other half say they're "extremely dissatisfied," looking only at the mean will not give you a good picture of your audience's perception.

  • Confusing belief with truth. Survey questions measure belief, not truth. When you ask, "Have you ever seen this banner ad?" on a survey, you'll get an answer about what people believe, but their beliefs may not have any relationship to reality. This is why questions about future behavior rarely represent how people actually behave: at the time they're filling out the survey, they believe that they will act in a certain way, which is rarely how they actually behave.

Even if you draw significant distinctions between responses and you present them with the appropriate qualifications, there are still a number of issues with the nature of people's responses that have to be taken into account when interpreting survey data.

  • People want everything. Given a large enough population, there's going to be a group of people who want every possible combination of features, and given a list of possible features abstracted from an actual product, everyone will pretty much want everything. And why not? What's wrong with wanting it cheap, good, and fast, even though you may know it's impossible? Thus, surveys can't be used to determine which features no one wants—there's no such thing—but a survey can tell you how people prioritize features and which ones they value most highly.

  • People exaggerate. When presenting ourselves—even anonymously—we nearly always present ourselves as we would like ourselves to be rather than how we actually are. Thus, we exaggerate our positive features and deemphasize our failings. Taking people's perspectives on their opinions and their behavior at face value almost always paints a rosier picture than their actual thoughts and actions.

  • People will choose an answer even if they don't feel strongly about it. There's a strong social pressure to have an opinion. When asked to choose from a list of options, even if people feel that their feelings, thoughts, or experiences lie outside the available options, they'll choose an answer. This is one of the failings of the multiple-choice survey, and it's why the choices to a question need to be carefully researched and written, and why providing "None," "Does Not Apply," or "Don't Care" options is so important.

  • People try to outguess the survey. When answering any question, it's common to try to understand why the person asking the question is asking it and what he or she expects to hear. People may attempt to guess the answer that the author of the survey "really" wants to hear. This phenomenon is why it's important to avoid leading questions, but it should also be kept in mind when interpreting people's answers. Pretesting and interviewing survey respondents is a good way to avoid questions that exhibit this kind of ambiguity.

  • People lie. Certainly not all people lie all the time about everything, but people do exaggerate and falsify information when they have no incentive to tell the truth or when they feel uncomfortable. For example, if you ask for an address that you can send a prize to, it's unlikely that people will lie about being able to receive mail there, but if you ask about their household income and they feel that it doesn't benefit them to answer honestly, they're less likely to be as truthful.

Ultimately, the best way to analyze a survey is to hire a professional statistician who has appropriate survey research experience and to work with him or her to answer your questions about your product. Ideally, you can begin working with the statistician before you even write the survey. The more you work with a statistician, the more you will realize what kinds of questions can be asked and what kinds of answers can be obtained.

But don't shy away from running surveys if you don't have access to a statistician. Without a pro, it's still possible to field surveys that produce valid, useful, significant results, but it's important to stay with straightforward questions and simple analyses. A limited survey with a small number of questions fielded to a well-understood group of customers can reveal a lot about your user base, enough to form a foundation from which you can then do other research.



