If you want to find something out about every single customer or employee in your business, you could talk to every single one of them. If you are concerned about the quality of the beer you serve at your bar, you could taste every one before serving. Or, to save time, money, and brain cells, "sample" efficiently instead.
Management thrives on knowing the characteristics of every widget produced, every transaction conducted, and every client helped. Of course, the whole set of all of these widgets, interactions, and people can never be brought together under one microscope and observed and evaluated. No specimen slide is big enough.
The same is true for those of us in social science: researchers interested in people simply cannot measure everybody. As much as we'd like to probe, shock, inject, hassle, embarrass, and generally bother everyone in the world, we just can't do it. We don't have the time, space, or money, and, frankly, no one really wants to get to know so many people.
The problem is, "How can you know about everything, without being able to look at everything?" As is the case with all hacks in this book, the solution is provided by statistics. There are scientifically sound ways to accurately describe any whole set of things by just looking at a small subset of those things.
Using Samples to Make Inferences
Inferential statistics allows us to generalize to a larger population, based on data from a smaller sample. For these generalizations to be valid, though, the sample has to represent the population fairly.
A good sample represents a population. This means that every important characteristic must be distributed in the sample in the same proportions as it is in the population. Much of this hack is about how to construct a good sample, so let's look at one.
Imagine a population of squares, diamonds, and triangles, as shown in Figure 2-4.
Figure 2-4. A sample within a population
A fair sample taken from a population of squares, diamonds, and triangles would contain those shapes in the same proportion as in the population. In our diagram, the outer oval represents a population, and the different shapes are distributed as 40 percent squares, 20 percent triangles, and 40 percent diamonds. The inner oval is the sample, which contains a subgroup of the elements in the population. The shapes in the sample are distributed in exactly the same proportions as in the population: 40 percent squares, 20 percent triangles, and 40 percent diamonds.
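The proportion check described for the figure can be sketched in a few lines of Python. This is only an illustration: the shape counts are invented to match the 40/20/40 split, and the helper name `proportions` is hypothetical.

```python
from collections import Counter

def proportions(items):
    """Return the share of each distinct value in items."""
    counts = Counter(items)
    total = len(items)
    return {value: count / total for value, count in counts.items()}

# Illustrative data only: 50 shapes in the population, 10 in the sample.
population = ["square"] * 20 + ["triangle"] * 10 + ["diamond"] * 20
sample = ["square"] * 4 + ["triangle"] * 2 + ["diamond"] * 4

pop_share = proportions(population)
sample_share = proportions(sample)

# The sample is fair on shape when every shape's share matches the population's.
is_fair = all(
    abs(pop_share[shape] - sample_share.get(shape, 0.0)) < 1e-9
    for shape in pop_share
)
print(pop_share)
print(sample_share)
print("fair on shape:", is_fair)
```

With real data the shares will rarely match exactly, so a looser tolerance than the one used here would be a reasonable choice.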
This sample is fair. It represents the population well, at least in terms of the characteristic of shape. When sampling people or things, samples typically represent a variety of traits. People and things are not entirely triangles or squares, so a sample of people is representative when its mean level on each trait closely matches the population's mean level. Each person will have some level of all the characteristics, and won't be entirely one trait, unlike our shape example. (Though my Uncle Frank is pretty much entirely square, according to my Aunt Heloise.)
If you knew that the sampling methods used to produce this sample (the elements in the inner oval) were correct, you could infer something about the population by just looking at the sample. The procedure is simple and intuitive:
Instead of abstract triangles in a theoretical population, imagine you are interested in checking the quality of the beer you sell in your bar. To get an idea of the beer population, construct a good sample of the beers you sell and taste each of them:
Inference is pretty easy to do, but it works well only when the sample is good. Constructing a good sample is the key.
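As a rough sketch of that kind of inference, the Python below tastes a random sample of servings and uses the sample's share of good beer as an estimate for the whole stock. The population of 1,000 servings and the 90 percent figure are invented for illustration, and the variable names are hypothetical.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 servings, True if the serving tastes fine.
# The 900/100 split is an assumption made up for this example.
beer_population = [True] * 900 + [False] * 100

# Taste a random sample of 50 servings instead of all 1,000.
tasted = random.sample(beer_population, k=50)

# Infer the population share from the sample share.
sample_share_good = sum(tasted) / len(tasted)
true_share_good = sum(beer_population) / len(beer_population)

print(f"estimated share of good beer: {sample_share_good:.2f}")
print(f"actual share of good beer:    {true_share_good:.2f}")
```

In practice the actual share is unknown, which is the whole point of sampling; it is computed here only to show how close a random sample tends to land.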
Constructing the Best Random Sample
A good sample represents the population. Representative sampling begins with defining the universe, or, in other words, the population of things from which a researcher wishes to sample. There are a variety of ways to conceptualize these elements and various levels of grouping that are explicitly or implicitly identified when choosing a population and selecting a sample. You have to know about these ways of organizing your population; otherwise, you cannot create a good sample:
The best sampling strategy, without question, is to sample randomly from a valid sampling frame. Random selection will do the best job of creating a sample that represents all the traits of interest in the population. The real power of random selection, though, is that you are also representing all sorts of variables you haven't even considered that might otherwise have an impact on your observations.
Technically, the term random describes a sampling process that gives every member of a population an equal and independent chance of being selected. Equal means that every sampling unit in the sampling frame has as good a chance of being selected as any other. Independent means that a person's or thing's chances of being selected are unrelated to whether any other particular person or thing has been selected.
So, suppose a selection process calls customers on a client list to ask for participation but stops trying to contact people if they aren't home or in the office during the first attempt. This process does not give all possible participants an equal chance of being selected, because people who aren't easily available are less likely to be chosen. Likewise, if people are not solicited to participate when someone in their office has already been chosen, each member of the population does not have an independent chance of being chosen.
Random sampling can be done by numbering all names on the sampling frame list and using some method of choosing a random number to pick each participant.
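One way to sketch this numbering-and-drawing procedure is with Python's standard `random` module. The function name `draw_random_sample`, the client list, and the sample size are all hypothetical, and drawing without replacement (so nobody is picked twice) is an assumption of this sketch rather than something the text specifies.

```python
import random

def draw_random_sample(sampling_frame, sample_size, seed=None):
    """Pick sample_size entries from the frame, each with an equal chance."""
    rng = random.Random(seed)
    # Number every name on the list: 0, 1, 2, ...
    numbered = list(enumerate(sampling_frame))
    # rng.sample draws without replacement, so no entry is chosen twice.
    chosen = rng.sample(numbered, k=sample_size)
    return [name for _, name in chosen]

# Hypothetical client list standing in for a real sampling frame.
client_list = [f"Customer {i}" for i in range(1, 201)]
participants = draw_random_sample(client_list, sample_size=20, seed=7)
print(participants)
```

Passing a seed makes the draw repeatable, which helps when a selection needs to be audited later; leaving it as `None` gives a different draw each run.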
Sampling Strategies for the Real World
In the real world, it is often difficult or impossible to sample randomly. Here are some sampling strategies that aren't quite as good as random sampling, but are more realistic outside of some imaginary scientific laboratory:
Choosing a Sample Size
If you are able to construct a good sample, as we have defined it, even a small sample can be effective. As with chocolate chip cookies, though, bigger is better. All else being equal, a larger random sample tends to be more representative of the population. Consequently, the observations are more generalizable and you can better trust their accuracy.
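A small simulation can make the effect of sample size concrete. The sketch below assumes a made-up population of 10,000 measurements and compares how far sample means typically fall from the population mean at three sample sizes; the function name `typical_error` and all of the numbers are illustrative choices.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 measurements (values are invented).
population = [random.gauss(100, 15) for _ in range(10_000)]
population_mean = statistics.mean(population)

def typical_error(sample_size, repeats=500):
    """Average absolute gap between a sample mean and the population mean."""
    gaps = []
    for _ in range(repeats):
        sample = random.sample(population, k=sample_size)
        gaps.append(abs(statistics.mean(sample) - population_mean))
    return statistics.mean(gaps)

errors = {n: typical_error(n) for n in (5, 30, 200)}
for n, err in errors.items():
    print(f"n={n:>3}: typical error about {err:.2f}")
```

The typical error shrinks as the sample grows, but with diminishing returns: each additional observation helps less than the one before it.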
Also, if there is some interesting relationship between variables in your observations, you are more likely to find that relationship and be sure that it did not occur by chance when you have observed many elements in your sample than when you have looked at just a few.
Finally, if you do have some social science purpose for your sampling, certain technical statistical requirements must be met before you can perform certain analyses. These standards are easier to meet in larger samples, such as, say, samples consisting of 30 or more widgets.