Taxonomies: Ordering a Vocabulary


A taxonomy is a form of organized vocabulary. Classically the organization is hierarchic, based on some attributes of the things being classified. The most common example is the organization of living things into kingdoms, phyla, classes, orders, families, genera, and species, based on similarity of physiologic traits, as shown in Figure 4.5.

Figure 4.5: The biologic taxonomy showing some related species.

Most complex vocabularies have several taxonomies that help users find unfamiliar terms. For example, SNOMED is a medical vocabulary that organizes more than 100,000 terms into a taxonomy that first distinguishes topography (concepts related to functional anatomy), morphology (changes in shape and structure of biology), function (physiology), living organisms, chemicals, and so on. Each major heading then breaks down into subheadings specific to that heading (topography breaks down into the major organs, etc.). We seem to be innately drawn to creating taxonomies. Something about the way we are wired suggests that we want to create hierarchic categories to organize things into simple maps of the world.

Although some of these taxonomies are useful, many just get in the way. Here are a few examples that will give you an idea of the useful and the dysfunctional in taxonomies:

  • North American Industry Classification System (NAICS)—This taxonomy replaces Standard Industrial Codes (SIC).

  • Chart of accounts—Every organization has a chart of accounts.

  • Service codes—A taxonomy of types of professional services used by a procurement group.

NAICS (Formerly SIC)

NAICS is a six-digit code for categorizing kinds of business by product or process used. NAICS is a successor to SIC, which had become dysfunctional: "The present SIC, although designed as a hierarchy, in fact does not provide a hierarchical structure useful for analysis. Past revisions to the SIC seem to have focused on adding (or eliminating) 4 digit SIC's rather than reviewing the overall structure of the classification structure."[13] This underscores the fact that the people who restructured the SIC codes knew what they were up against; alas, the new structure, although an improvement, still has areas where the structure is not useful for analysis. For example, "Reproduction of software" is subsumed under Sector 33, "Manufacturing," which was meant to group things that had similar use of resources and other aspects so that they could be combined.

Chart of Accounts

A chart of accounts separates business expenditures into categories of revenue, expense, asset, and liability so that managers, investors, and tax authorities can review business activity. Other than a default chart of accounts that comes with some accounting systems, there is no standard chart of accounts.

At the highest level the chart of accounts is usually meant to be a taxonomy: All of the accounts under revenue are revenue accounts, all those under expense are expense accounts, and so on. Most large organizations have charts of accounts that are made up of five to eight independent (orthogonal) taxonomies, which gives them a coding block that is usually 60 to 150 characters long. There are potentially billions of combinations, but usually only a few hundred thousand are used.

Orthogonal

The dictionary definition of orthogonal is "at right angles." In computer and taxonomy circles it has come to mean expressing two (or more) things such that they can vary independently of each other. For example, if we were going to create a car taxonomy, "make" and "model" would be part of the same taxonomy, because they are not orthogonal. "Fuel efficiency" and "safety" might be two orthogonal dimensions that could be used for taxonomies.
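To make the arithmetic behind a coding block concrete, here is a minimal sketch in Python. The segment names and values are hypothetical, invented purely for illustration; a real chart of accounts would have five to eight segments with far more values in each. Because the segments are orthogonal, the number of potential coding blocks is simply the product of the segment sizes, while the set actually in use is typically a small fraction of that.

```python
from itertools import product
from math import prod

# Hypothetical segments: each one is its own small taxonomy and can vary
# independently of the others (that is, the segments are orthogonal).
SEGMENTS = {
    "natural_account": ["4000-revenue", "5000-cost-of-sales", "6000-selling", "7000-admin"],
    "department": ["sales", "engineering", "finance"],
    "region": ["east", "west"],
}


def potential_combinations(segments):
    """Count every possible coding block: the product of the segment sizes."""
    return prod(len(values) for values in segments.values())


def coding_blocks(segments):
    """Iterate over every possible coding block, one value per segment."""
    return product(*segments.values())


# Hypothetical set of coding blocks that have actually been posted to.
USED = {
    ("6000-selling", "sales", "east"),
    ("7000-admin", "finance", "west"),
}

if __name__ == "__main__":
    total = potential_combinations(SEGMENTS)
    print(f"{len(USED)} of {total} possible coding blocks are in use")
```

Adding one more value to any segment multiplies the potential combinations rather than adding to them, which is how a handful of modest taxonomies can reach billions of theoretical codes.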

The problems show up three to five levels into the hierarchies (this seems to be a pattern with taxonomies). The first few levels may break expenses into a few broad categories: costs of sales, selling costs, general costs, and administrative costs. As you get deeper into the hierarchy, a tension occurs between the things that seem to be most closely related (e.g., should the printing costs for brochures be near the salary of the graphic artist who designed them?). Eventually the account structure becomes highly compromised. This is because the meaning of the subcoding relationship is not held constant as you descend the hierarchy. (It is held constant in the phylum/order/family hierarchy, which is probably why it is held up as a canonical example.)

Service Codes

Both the NAICS and the chart of accounts are taxonomies that people have put a lot of time and effort into making as useful as possible. However, most of the taxonomies we come across are casually assembled by one or more individuals who think that a particular hierarchy might be useful. From my observations, such taxonomies rarely take long to outgrow their original purpose. We recently came across a hierarchy that had 10 pages of documentation to group professional services into a dozen categories for the purpose of slotting vendors into the type of services they offered. The longer you studied the taxonomy, the less certain you were about what each category meant. One category was "technical architecture," but as you read the subcategories and descriptions, you noticed that they had added in specific technologies that were already covered in other categories, such as project management. As a result the category headings were only vague indicators of heavily overlapping categories. Most vendors tried to get into as many categories as they could, because they couldn't be sure which category a buyer who needed their services would pick for the search.

Figure 4.6 shows a smaller but more typical code set: a taxonomy of order status codes from a well-established office supply vendor.

Status Key

CAN—This item is cancelled.

CBO—This item is on backorder.

CMP—This item has been transmitted to a retail location for delivery.

DLV—This item has been delivered.

INV—This item has been received and is awaiting final in-stock verification.

OPN—This item is in stock and transmitting for fulfillment.

PIP—This item is being packaged for delivery.

STG—This item has been packed and is staged for delivery.

TCL—This item will be delivered by a local retail location.


Figure 4.6: Order codes.

When you first read it, it appears to be sensible. We'll return to this later in the chapter to find out why what sounds like a good taxonomy isn't always very useful. For now, just consider this as a casual taxonomy.

The Trouble with Taxonomies

We build a lot into the analogy between taxonomies and genus/species. The main reason the analogy breaks down and taxonomies don't work as well as we expect them to is that biology has two interesting characteristics not shared by other domains to which we apply taxonomies:

  • Evolution—All biologic organisms descended from common ancestors. The closer any two species are to each other in the descent tree, the more similar they are likely to be. (Modern cladistic analysis[14] has adjusted Linnaeus's original categorization, but not drastically; the similarity of biologic function and morphologic structure tends to re-create their evolutionary history well.)

  • Unique location—There is little ambiguity about where a species belongs. Both bats and birds fly, but a brown bat about the same size as a sparrow is no more closely related to it than a duck is to a dog. However, our made-up taxonomies rarely have this kind of rigor.

The biologic taxonomy is consistent in its use of the link meaning "is a kind of." A taxonomy that is consistent in this way is powerful, because the application can imbue items lower in the taxonomy with attributes higher in the taxonomy.
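As a rough sketch of what "imbuing" lower items with higher-level attributes can look like in an application, consider the Python below. The `Category` class, its method names, and the tiny animal hierarchy are all assumptions made for illustration, not anything prescribed by a particular system. The only link in the structure is "is a kind of," so an attribute lookup can safely walk up the parents.

```python
class Category:
    """A node in a taxonomy whose only link type is "is a kind of"."""

    def __init__(self, name, parent=None, **attributes):
        self.name = name
        self.parent = parent
        self.attributes = attributes  # attributes asserted at this level only

    def ancestors(self):
        """Yield each parent in turn, from nearest to the root."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

    def is_a(self, other):
        """True if this category is `other` or a descendant of it."""
        return other is self or other in self.ancestors()

    def attribute(self, key):
        """Look up an attribute here, then in each ancestor in turn."""
        node = self
        while node is not None:
            if key in node.attributes:
                return node.attributes[key]
            node = node.parent
        raise KeyError(key)


# A deliberately tiny, illustrative hierarchy.
animal = Category("animal", multicellular=True)
mammal = Category("mammal", parent=animal, warm_blooded=True)
bat = Category("bat", parent=mammal, flies=True)
bird = Category("bird", parent=animal, warm_blooded=True, flies=True)

if __name__ == "__main__":
    print(bat.is_a(mammal), bat.is_a(bird))
    print(bat.attribute("warm_blooded"))
```

Here `bat` never states that it is warm blooded; it picks that up from `mammal`. The fact that bats and birds both fly does not make one a kind of the other, which is the same point the bat and sparrow example makes.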

Often the most useful taxonomies are the simplest. My experience is that, with the exception of the biologic taxonomy, as a taxonomy grows its integrity shrinks. So one suggestion is to keep taxonomies small, single purposed, and orthogonal (see definition earlier in this chapter).

A powerful taxonomy has the following characteristics:

  • An item categorized to a lower-level category is also a member of all its category parents.

  • An item categorized to a category may also be treated as one of its supertypes (although it needn't be).

  • There should exist some set of rules that would allow a classifier to determine whether any of the subcategories are appropriate (we call these "rule in/rule out" criteria).
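The characteristics above can be sketched in a few lines of Python. This is only one possible reading, under stated assumptions: the `RuledCategory` class, the `classify` function, and the animal rules are hypothetical names and simplifications of my own. Each subcategory carries a rule that rules an item in or out, and an item classified at a lower level is reported along with every parent it passed through.

```python
class RuledCategory:
    """A category with a "rule in/rule out" test for candidate items."""

    def __init__(self, name, rule=None):
        self.name = name
        # The root accepts everything; subcategories supply their own rule.
        self.rule = rule if rule is not None else (lambda item: True)
        self.children = []

    def add(self, child):
        """Attach a subcategory and return it, so calls can be chained."""
        self.children.append(child)
        return child


def classify(root, item):
    """Return the path from the root to the deepest category that rules
    the item in. Descent stops when no subcategory applies, or when more
    than one does (an ambiguous item stays at the parent level)."""
    path = [root.name]
    node = root
    while True:
        matches = [child for child in node.children if child.rule(item)]
        if len(matches) != 1:
            break
        node = matches[0]
        path.append(node.name)
    return path


# Illustrative rules only; real criteria would be more careful than this.
root = RuledCategory("animal")
vertebrate = root.add(RuledCategory("vertebrate", lambda i: i["backbone"]))
root.add(RuledCategory("invertebrate", lambda i: not i["backbone"]))
vertebrate.add(RuledCategory("mammal", lambda i: i.get("fur", False)))
vertebrate.add(RuledCategory("bird", lambda i: i.get("feathers", False)))

if __name__ == "__main__":
    print(classify(root, {"backbone": True, "fur": True}))
    print(classify(root, {"backbone": True}))
```

The returned path makes the first characteristic visible: an item classified as a mammal is also listed as a vertebrate and an animal. An item with too little information simply stays at the last level where the rules were decisive, which is the second characteristic in action.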

In the linguistic study of semantics, words or concepts that are related by the "inclusion" relationship (the "isa" relationship) are called hyponyms. Hyponymy is a good basis for taxonomies, because all of the subsumed concepts can still be treated as the parent.

However, taxonomies do not have a rich enough semantic for most of our uses. That is where ontologies come in.

[13]"Economic Classification Policy Committee Issues Paper No. 2," Feb. 8, 1993. Available at http://www.census.gov/epcd/naics/issues.

[14]Cladistic analysis is a technique for determining the relatedness of species from their DNA and other clues, rather than from their physiologic features alone.




Semantics in Business Systems: The Savvy Managers Guide (The Savvy Managers Guides)
ISBN: 1558609172
EAN: 2147483647
Year: 2005
Pages: 184
Authors: Dave McComb
