Chapter 40: Filtering Data and Removing Duplicates


Overview

Joolas is a small company that manufactures makeup. They track each sales transaction in a Microsoft Office Excel spreadsheet. There are times when they want to extract, or “filter out,” a subset of their sales data. For example, they might want to identify sales transactions that answer the following questions:

  • How can I identify all transactions in which Jen sold lipstick in the East region?

  • How can I identify all transactions in which Cici or Colleen sold lipstick or mascara in the East or South region?

  • How can I copy all transactions in which Cici or Colleen sold lipstick or mascara in the East or South region to a different worksheet?

  • How do I clear filters from a column or database?

  • How can I identify all transactions which involved sales of >$280 and >90 units?

  • How can I identify all sales occurring in 2005 or 2006?

  • How can I identify all transactions in the last three months of 2005 or the first three months of 2006?

  • How can I identify all transactions where the salesperson’s first name starts with C?

  • How can I identify all transactions where the cell containing the product name was colored in red?

  • How can I identify all transactions in the Top 30 revenue values where Hallagan or Jen was the salesperson?

  • How can I easily obtain a complete list of salespeople?

  • How can I view every combination of salesperson, product, and location that occurs in the database?

  • If my data changes, how can I reapply the same filter?

  • How can I extract all foundation transactions in the first six months of 2005 for which Emilee or Jen was the salesperson and the average per-unit price was more than $3.20?

Microsoft Office Excel 2007 has filtering capabilities that make identifying any subset of data a snap. Excel also makes it easy to remove duplicate records from a list. Our work for this chapter is in the file Makeupfilter.xlsx. For the 1891 sales transactions listed in this file, we have the following information. (Figure 40-1 shows a subset of the data.)

  • Transaction number

  • Name of the salesperson

  • Date of the transaction

  • Product sold

  • Units sold

  • Dollar amount of the transaction

  • Transaction location

image from book
Figure 40-1: Makeup sales data

Each column (C through I) of our “database” (cell range C4:I1894) is called a field. Each row of the database that contains data is called a record. (Thus, the records in our database are contained in the cell range C5:I1894.) The first row of each field must contain a field name. For example, the name of the field in column F is Product. By using the Excel AutoFilter, you can “query” a database using AND criteria to identify a subset of records. This means that you can use queries of the form “Find all records where Field 1 satisfies certain conditions, and Field 2 satisfies certain conditions, and Field 3 satisfies certain conditions.” Our examples illustrate the capabilities of the Excel AutoFilter.

  • How can I identify all transactions in which Jen sold lipstick in the East region?

  • To begin, we place our cursor anywhere in the database and select Filter in the Sort & Filter group on the Data tab of the Ribbon. As shown in Figure 40-2, each column of the database now has an arrow in the heading row.

    image from book
    Figure 40-2: AutoFilter heading arrows

  • After clicking the arrow for the Name column, you will see the choices shown in Figure 40-3 on the next page.

    image from book
    Figure 40-3: Choices for filtering or sorting the Name column

  • We see that we could choose Text Filters which allow us to filter based on characteristics of the person’s name (more on this later). For now, we just want to work with data for Jen, so we first clear the Select All box, then check the Jen box and click OK. We now see only those records where Jen was the salesperson. Next, we go to the Product column and check the lipstick box, and then to the Location column and check the East box. We now see only those transactions where Jen sold lipstick in the East region (see Figure 40-4 on the next page). Note the arrow has changed to a cone in the columns where we have set up filtering criteria.

    image from book
    Figure 40-4: Jen sells lipstick in the East region.

  • How can I identify all transactions in which Cici or Colleen sold lipstick or mascara in the East or South region?

  • We simply select Cici and Colleen from the Name column list, lipstick and mascara from the Product column list, and East and South from the Location column list. The records meeting our filtering criteria are shown in Figure 40-5.

    image from book
    Figure 40-5: Transactions where Cici or Colleen sold lipstick or mascara in the East or South region

  • How can I copy all transactions in which Cici or Colleen sold lipstick or mascara in the East or South region to a different worksheet?

  • The trick here is to first press F5, click Special, and then select the Visible Cells Only option. Now when you copy, Excel will only include the visible rows (in this case, the rows selected by the filtering criteria). Now we select all the filtered cells in the worksheet (Cici, Colleen, lipstick, and mascara) and paste them into a blank worksheet. We can create a new blank worksheet in our workbook by right-clicking any worksheet tab, clicking Insert, selecting Worksheet, then clicking OK. The worksheet Visible Cells Copied contains the records where Cici or Colleen sold lipstick or mascara in the East.

  • How do I clear filters from a column or database?

  • Clicking Filter on the Data tab will remove all filters. Clicking the cone for any column where you have created a filter displays an option to clear the filter from that column.

  • How can I identify all transactions which involved sales of >$280 and >90 units?

  • After clicking Filter on the Data tab, we first click the Units column arrow to display the options shown in Figure 40-6 on the next page.

    image from book
    Figure 40-6: Filtering options for a numerical column

  • We can check any subset of numerical unit values (for example, all transactions where sales were –10 or –8 units). We click Number Filters to display the choices in Figure 40-7.

    image from book
    Figure 40-7: Number Filters options

  • Most of these are self-explanatory. We choose the Greater Than option and then fill in the dialog box as shown in Figure 40-8.

    image from book
    Figure 40-8: Selecting all records where units sold >90.

  • Next, we go to the Dollars column and click to include only records where $ amount is >$280. We obtain the records shown in Figure 40-9.

    image from book
    Figure 40-9: Transactions where >90 units were sold for a total of >$280.

  • How can I identify all sales occurring in 2005 or 2006?

  • After clicking Filter on the Ribbon, we click the arrow for the Date column and see the choices in Figure 40-10 on the next page.

    image from book
    Figure 40-10: Possible filtering options for the Date column

  • After selecting 2005 and 2006, we obtain only those records involving sales in 2005 or 2006, as shown in Figure 40-11.

    image from book
    Figure 40-11: Sales during 2005 and 2006

  • Note we could also have selected Date Filters to display the options shown in Figure 40-12.

    image from book
    Figure 40-12: Date Filters options

  • Most of these are self-explanatory. The Custom Filter option allows you to select any range of dates as your filtering criteria.

  • How can I identify all transactions in the last three months of 2005 or the first three months of 2006?

  • After clicking the arrow for the Date column, we see the list of years shown in Figure 40-10. Clicking the + sign to the left of the year displays a list of months. We checked October-December of 2005 and then January-March of 2006 to show all sales during those months (see Figure 40-13 on the next page).

    image from book
    Figure 40-13: All sales during October 2005-March 2006

  • How can I identify all transactions where the salesperson’s first name starts with C?

  • Simply click the Name column arrow and choose Text Filters. Next, select Begins With, and from the dialog box shown in Figure 40-14, choose begins with C.

    image from book
    Figure 40-14: Custom AutoFilter dialog box to select all records where salesperson's name begins with C

  • How can I identify all transactions where the cell containing the product name was colored in red?

  • Simply click the Product column arrow and choose Filter By Color. You can now select the Color on which to filter. As shown in Figure 40-15, we chose to include only the rows where the Product is colored in red. Figure 40-16 shows the resulting records.

    image from book
    Figure 40-15: Dialog box for filtering by cell color

    image from book
    Figure 40-16: All records where the product cell color is red

  • How can I identify all transactions in the Top 30 revenue values where Hallagan or Jen was the salesperson?

  • After clicking Filter on the Ribbon, we click the Name column arrow and then check the boxes for Hallagan and Jen. Then we click the Dollar arrow and choose Number Filters. From the Number Filters list, we select the Top 10 option and fill in the dialog box as shown in Figure 40-17. We have now filtered all records in the Top 30 revenue values where Jen or Hallagan was the salesperson. See the results in Figure 40-18 on the next page. Note only 5 of the Top 30 $ sales had Hallagan or Jen as the salesperson. Note we may also select Top 5 percent, Bottom 20 percent, and so on, for any numerical column.

    image from book
    Figure 40-17: Dialog box to select the Top 30 records by $ value

    image from book
    Figure 40-18: Top 30 records by $ value

  • How can I easily obtain a complete list of salespeople?

  • We want a list of all the salespeople, without anybody’s name repeated. We begin by selecting Remove Duplicates on the Data tab of the Ribbon. This displays the Remove Duplicates dialog box shown in Figure 40-19. We check only the Name box and click OK. This will filter out the first record involving each salesperson’s name. See the results in Figure 40-20.

    image from book
    Figure 40-19: Remove Duplicates dialog box

    image from book
    Figure 40-20: List of salespersons' names

  • How can I view every combination of salesperson, product, and location that occurs in the database?

  • Again, we click Remove Duplicates on the Data tab of the Ribbon. Then we fill in the Remove Duplicates dialog box as shown in Figure 40-21.

    image from book
    Figure 40-21: Finding unique salesperson, product, and location combinations

  • Figure 40-22 lists the first record for each combination of person, product, and location occurring in the database. We find that 180 unique combinations occurred.

    image from book
    Figure 40-22: List of unique salesperson, product, and location combinations

  • If my data changes, how can I reapply the same filter?

  • Simply right-click any cell in your filtered results, point to Filter, and then click Reapply. Any changes to your data will now be reflected in the filtered data.

  • How can I extract all foundation transactions in the first six months of 2005 for which Emilee or Jen was the salesperson and the average per-unit price was more than $3.20?

  • The AutoFilter feature (even with the Custom option) is limited to AND queries across columns. This means, for example, that we cannot find all transactions for lipstick sales by Jen during 2005 OR foundation sales by Zaret during 2004. To perform more complex queries such as this one, we need to use the Advanced Filter feature. To use

  • Advanced Filter, we set up a criteria range that specifies the records we want to extract. (This process is described in detail in Chapter 37, “Summarizing Data with Descriptive Statistics.”) After specifying the criteria range, we tell Excel whether we want the records extracted to the current location or to a different location. To identify all foundation transactions in the first six months of 2005 for which Emilee or Jen was the salesperson and for which the average per-unit price was more than $3.20, we use the criteria range shown in cell range O4:S6 in Figure 40-23. See the file Advancedfilter.xlsx.

    image from book
    Figure 40-23: Setting up a criteria range to use with an advanced filter

  • In cells R5 and R6, I entered the formula =(L5/K5)>3.2. Recall from Chapter 39, “Summarizing Data with Database Statistical Functions,” that this formula creates computed criteria that flags each row in which the per-unit price is more than $3.20. Also remember that our heading for a computed criteria must not be a field name, so I used Price as the field heading. Our criteria in O5:S5 flags all records in which Jen is the salesperson, the date is between 1/1/2005 and 6/30/2005, the product sold is foundation, and the per-unit price is more than $3.20. Our criteria in O6:S6 flags all records in which Emilee is the salesperson, the date is between 1/1/2005 and 6/30/2005, the product sold is foundation, and the per-unit price is more than $3.20. The criteria range of O4:S6 flags exactly the records we want. Remember that criteria in different rows are joined by OR.

  • We can now select any cell within the database range and, on the Data tab of the Ribbon, select Advanced from the Sort & Filter group. We fill in the dialog box as shown in Figure 40-24.

    image from book
    Figure 40-24: Advanced Filter dialog box settings

  • With these settings, we are telling Excel to extract all records in the database (cell range G4:M1895) that satisfy the criteria specified in O4:S6. These records should be copied to a range whose upper-left corner is cell O14. The records extracted are shown in the cell range O14:U18. Only four records meet the criteria we defined.

  • If we check the Unique Records Only box in the Advanced Filter dialog box, no duplicate records are returned. For example, if Jen had another foundation transaction in the East region on 3/19/2005 for one unit for $4.88, only one of these transactions would be extracted.




Microsoft Press - Microsoft Office Excel 2007. Data Analysis and Business Modeling
MicrosoftВ® Office ExcelВ® 2007: Data Analysis and Business Modeling (Bpg -- Other)
ISBN: 0735623961
EAN: 2147483647
Year: 2007
Pages: 200

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net