Chapter 36: Summarizing Data by Using Histograms


Overview

  • People often say that a picture is worth a thousand words. Can I use Excel to create a picture (called a histogram) that summarizes the values in a data set?

  • What are some common shapes of histograms?

  • What can I learn by comparing histograms from different data sets?

The ability to summarize a large data set is important. The three tools used most often to summarize data in Microsoft Office Excel 2007 are histograms, descriptive statistics, and PivotTables. In this chapter, I’ll discuss the use of histograms for summarizing data. I cover descriptive statistics in Chapter 37, “Summarizing Data by Using Descriptive Statistics,” and PivotTables in Chapter 38, “Using PivotTables to Describe Data.”

  • People often say that a picture is worth a thousand words. Can I use Excel to create a picture (called a histogram) that summarizes the values in a data set?

  • A histogram is a commonly used tool to summarize data. Essentially, a histogram tells us how many observations (another term for data points) fall in various ranges of values. For example, a histogram created from monthly Cisco stock returns might show how many monthly returns Cisco had from 0 percent through 10 percent, 11 percent through 20 percent, and so on. The ranges in which you group data are referred to as bin ranges.

  • Let’s look at how to construct and interpret histograms that summarize the values of monthly returns for Cisco and GM stock in the years 1990–2000. You’ll find this data (and returns for other stocks) in the file Stock.xlsx. Figure 36-1 on the next page shows a subset of the data (in the Stockprices worksheet). During March 1990, for example, Cisco stock increased in value by 1.075 percent.

    image from book
    Figure 36-1: Monthly stock returns

  • When constructing histograms with Excel, you can let Excel define the bin ranges or you can define the bin ranges yourself. If Excel defines the bin ranges, you could end up with weird-looking bin ranges, such as –12.53 percent to 4.52 percent. For this reason, I prefer to define the ranges myself.

  • A good way to start defining bin ranges for a histogram (you can think of defining bin ranges as setting boundaries) is to divide the range of values (between the smallest and largest) into 8 to 15 equally spaced categories. All the monthly returns for Cisco are from –30 percent through 40 percent, so I chose bin range boundaries of –30 percent, –20 percent, –10 percent, 0 percent, and so on up to 40 percent.

  • To create our bin ranges, I first enter CSCO, .4, .3, .2,, –.2, –.3 (the boundaries of the bin ranges) in cells H54:H62. Next, on the Data tab of the Ribbon, in the Analysis group, I click Data Analysis to open the Data Analysis dialog box. The dialog box lists the functions of the Analysis Toolpak, which contains many of the statistical capabilities of Excel.

    Note 

    If the Data Analysis command doesn’t appear on the Data tab, click the Microsoft Office Button followed by Excel Options. In the Manage box, click Excel Add-Ins and click Go. In the Add-Ins dialog box, check the Analysis ToolPak box, and then click OK. Now you can access the Analysis ToolPak functions by clicking Data Analysis in the Analysis group on the Data tab.

  • By clicking Histogram in the Data Analysis dialog box, we open the Histogram dialog box shown in Figure 36-2.

    image from book
    Figure 36-2: Histogram dialog box for the Cisco histogram

  • Here’s how to fill in the dialog box as it’s shown:

    • Select the Input Range (F51:F181). (To select the range F51:F181, you can select cell F51 and then press Ctrl+Shift+Down Arrow key. This takes you to the bottom of the column.) This range includes all the data we want to use to create our histogram. I included the label CSCO from cell F51 because when you do not include a label in the first row, the x-axis of the histogram is often labeled with a number, which can be confusing.

    • The Bin Range (H54:H62) includes the boundaries of our bin ranges. Excel will create bins of –30 percent through –20 percent, –20 percent through –10 percent, and so on up to 30–40 percent.

    • I checked the Labels option because the first rows of both our bin range and input range contain labels.

    • I chose to create the histogram in a new worksheet (named histo).

    • Select Chart Output, or Excel will not create a histogram.

  • Click OK in the Histogram dialog box. Our Cisco histogram will look like the one shown in Figure 36-3.

    image from book
    Figure 36-3: Cisco histogram created by using an Excel Analysis ToolPak function

  • When you create the histogram, you’ll see gaps between the bars. To remove these gaps, right-click any bar on the graph and choose Format, Data Series. On the Series Options tab, drag Gap Width to 0%. You will also see that a label does not appear for each bar. If all the labels do not appear, you can make all labels appear. Select the graph and drag any handle that has two arrows to widen the graph. You can also reduce the font size to make a label appear. To reduce the font size, right-click the graph axis, and then right-click Font. Change the font size to 5. You can also change the title of the chart by selecting the text and entering the title you want. After making these changes, the histogram appears as its shown in Figure 36-4.

    image from book
    Figure 36-4: You can change the format of different elements in the chart.

  • Notice that Cisco returns are most likely between 0 and 10 percent per month, and the height of the bars drops off as the graph moves away from the tallest bar. When we create the histogram, we also obtain the bin-range frequency summary shown in Figure 36-5.

    image from book
    Figure 36-5: Cisco bin-range frequencies

  • From the bin-range frequencies, we learn, for example, that for two months, Cisco’s return was greater than –30 percent and less than or equal to –20 percent; in 13 months, the monthly return was greater than –20 percent and less than or equal to –10 percent.

  • What are some common shapes of histograms?

  • For most data sets, a histogram created from the data will be classified as one of the following:

    • Symmetric

    • Skewed right (positively skewed)

    • Skewed left (negatively skewed)

    • Multiple peaks

  • Let’s look at each type in more detail. See the file Skewexamples.xlsx.

    • Symmetric distribution. A histogram is symmetric if it has a single peak and looks approximately the same to the left of the peak as to the right of the peak. Test scores (such as IQ tests) are often symmetric. For example, the histograms of IQs (see cell W42) might look like Figure 36-6. Notice that the height of the bars one bar away from the peak bar are approximately the same, the height of the bars two bars away from the peak bar are approximately the same, and so on. The bar labeled 105 represents all people with an IQ greater than 95 and less than or equal to 105, the bar labeled 65 represents all people having an IQ less than or equal to 65, and so on. Also note that Cisco monthly returns are approximately symmetric.

      image from book
      Figure 36-6: Symmetric histogram

    • Skewed right (positively skewed). A histogram is skewed right (positively skewed) if it has a single peak and the values of the data set extend much farther to the right of the peak than to the left of the peak. Many economic data sets (such as family or individual income) exhibit a positive skew. Figure 36-7 (see cell T24) shows an example of a positively skewed histogram created from a sample of family incomes.

      image from book
      Figure 36-7: A positively skewed histogram created from data about family income.

    • Skewed left (negatively skewed). A histogram is skewed left (negatively skewed) if it has a single peak and the values of the data set extend much farther to the left of the peak than to the right of the peak. Days from conception to birth are negatively skewed. An example is shown in cell Q7 of Figure 36-8 on the next page. The height of each bar represents the number of women whose time from conception to birth fell in the given bin range. For example, two women gave birth fewer than 180 days after conception.

      image from book
      Figure 36-8: A negatively skewed histogram of data plotting days from conception to birth.

    • Multiple peaks. When a histogram exhibits multiple peaks, it usually means that data from two or more populations are being graphed together. For example, suppose the diameter of elevator rails produced by two machines yields the histogram shown in Figure 36-9. See cell Q11 of the file Twinpeaks.xlsx.

      image from book
      Figure 36-9: A multiple-peak histogram

      In this histogram, the data clusters into two separate groups. In all likelihood, each group of data corresponds to the elevator rails produced by one of the machines. If we assume that the diameter we want for an elevator rail is .55 inches, we can conclude that one machine is producing elevator rails that are too short, whereas the other machine is producing elevator rails that are too long. We should follow up with our interpretation of this histogram by constructing a histogram charting the elevator rails produced by each machine. This example shows why histograms are a powerful tool in quality control.

  • What can I learn by comparing histograms from different data sets?

  • We’re often asked to compare different data sets. For example, we might wonder how the monthly returns on GM and Cisco stock differ. To answer a question such as this, you can construct a histogram for GM by using the same bin ranges as for Cisco, and then place one histograms above the other, as shown in Figure 36-10. See the Histograms worksheet of file Stock.xlsx.

    image from book
    Figure 36-10: Using histograms that include the same bin ranges to compare different data sets

  • By comparing these two histograms, we can draw two important conclusions:

    • Typically, Cisco performed better than GM. We know this because the highest bar for Cisco is one bar to the right of the highest bar for GM. Also, the Cisco bars extend farther to the right than the GM bars.

    • Cisco had more variability, or spread about the mean, than GM. Note that GM’s peak bar contains 59 months, whereas Cisco’s peak bar contains only 41 months. This shows that for Cisco, more of the returns are outside the bin that represents the most likely Cisco return. Cisco returns are more “spread out” than GM returns.

    In Chapter 37, we’ll look at more details about the differences between the monthly returns on Cisco and GM.




Microsoft Press - Microsoft Office Excel 2007. Data Analysis and Business Modeling
MicrosoftВ® Office ExcelВ® 2007: Data Analysis and Business Modeling (Bpg -- Other)
ISBN: 0735623961
EAN: 2147483647
Year: 2007
Pages: 200

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net