How Categorization Informs Us


Let's consider something closely related to taxonomies and ontologies, but in practice not quite the same thing: categorization. We categorize things all the time. Sometimes we use preexisting taxonomies to categorize things, as in the previous examples, but often we just make up ad hoc categories ("That was a 'boring' movie," or "It was a 'chick flick'"). More interesting still is when such ad hoc categorization becomes codified in an application.

Application systems are littered with categories. Although a typical application system might have dozens of essential business entities, each business entity will have hundreds to thousands of data entities, many of which represent different categories. The real explosion of categories, though, is not formally documented. Almost every "if" statement in a program implies a category of some sort.

The fragment of code in Figure 4.7 states that we have two categories of purchase orders (cheap and expensive, say) and that we treat them differently (one has to be approved, the other does not). Often we have multiple tests to complete before we take some action. For example, we might test whether a task's actual start date is greater than 0 (and maybe less than today, but let's assume that validation was already in place) and whether the task's actual complete date is 0; if so, we categorize this task as "in progress" and deal with it in a particular way. We may make further tests; for example, if the estimated completion date is less than today, we may categorize it as being "late." If the earliest completion time equals the latest completion time, we may categorize the task as being "critical."

 if (purchase_order_amt > 100000) then hold_for_approval = "Yes"

Figure 4.7: Code that implies a category.
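The multi-test task example described above can be sketched in Python. This is a minimal illustration, not code from the book: the field names and the convention that 0 means "not set" are assumptions.

```python
# Hypothetical task record: dates are day numbers, 0 means "not yet set".
def categorize_task(task, today):
    """Derive the implicit categories hiding in the kinds of "if" tests
    described in the text: in progress, late, and critical."""
    categories = []
    if task["actual_start"] > 0 and task["actual_complete"] == 0:
        categories.append("in progress")
        if task["estimated_complete"] < today:
            categories.append("late")
    if task["earliest_complete"] == task["latest_complete"]:
        categories.append("critical")
    return categories

task = {"actual_start": 100, "actual_complete": 0,
        "estimated_complete": 110, "earliest_complete": 120,
        "latest_complete": 120}
print(categorize_task(task, today=115))  # ['in progress', 'late', 'critical']
```

Nothing in the data says "late"; the category exists only in the tests, which is exactly the hidden vocabulary the text describes.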

Check this out yourself. Almost the only if statements that don't represent hidden categories are those that are testing whether a field is blank (in order to move it to another field) or those testing the end of an array or collection.

A complex application will have tens of thousands of such categorizations. This means the business system has a vocabulary, if you will, of tens of thousands of distinctions: things we treat slightly differently. This, by the way, is the main reason that large organizations have more complex systems than small organizations do. At one level, all businesses make things and sell them, or deliver services. You may have wondered why a large company needs a system that is, by any measure, one or two orders of magnitude more complex (more screens, more reports, more lines of code, and so on). Some of that complexity handles volume, performance, and reliability, but by far the greatest share comes from making more and more distinctions of this sort.

Categories and Taxonomies

We can categorize things without taxonomies, as we alluded to in the previous section. However, categorization and taxonomies are complementary: rather than searching a flat list of terms for some sort of match, a well-constructed taxonomy lets us categorize things one layer at a time, which compensates for our limited ability to hold more than a few things in mind simultaneously.
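Layer-at-a-time categorization can be sketched as a walk down a small tree. The taxonomy below is a toy example invented for illustration; the point is that each step considers only one layer's few distinctions instead of every leaf term at once.

```python
# A toy taxonomy: at each layer we make one small distinction rather than
# matching against a flat list of every term in the vocabulary.
taxonomy = {
    "fastener": {
        "bolt": {"carriage bolt": {}, "hex bolt": {}},
        "screw": {"wood screw": {}, "machine screw": {}},
    },
    "bearing": {},
}

def classify(item_terms, tree=taxonomy, path=()):
    """Descend one layer at a time, following whichever child term applies."""
    for term, children in tree.items():
        if term in item_terms:
            return classify(item_terms, children, path + (term,))
    return path

print(classify({"fastener", "bolt", "carriage bolt"}))
# ('fastener', 'bolt', 'carriage bolt')
```

At every level the classifier chooses among a handful of siblings, which is what keeps the task within the limits of human working memory.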

Categories and Inheritance

One of the things that object-oriented methodologies do well is to replace "if-then-else" logic with subtypes, which is a major improvement. The categorization comes out of the code, where it is largely hidden, and into the class hierarchy, where it at least has some visibility. However, there are still some major problems with using class-based inheritance as a categorization mechanism.
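As a hedged sketch of that improvement, here is the Figure 4.7 test recast as subtypes (the class names and the 100,000 threshold are taken from the figure; everything else is illustrative):

```python
# The cheap/expensive distinction moves out of scattered "if" statements
# and into a visible class hierarchy.
class PurchaseOrder:
    def __init__(self, amount):
        self.amount = amount
    def hold_for_approval(self):
        return False  # ordinary orders go straight through

class ExpensivePurchaseOrder(PurchaseOrder):
    def hold_for_approval(self):
        return True   # expensive orders wait for approval

def make_po(amount):
    # One remaining "if" at creation time chooses the subtype...
    cls = ExpensivePurchaseOrder if amount > 100_000 else PurchaseOrder
    return cls(amount)

# ...after which callers dispatch polymorphically instead of re-testing.
print(make_po(250_000).hold_for_approval())  # True
print(make_po(5_000).hold_for_approval())    # False
```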

Object-oriented systems do not deal well with multiple inheritance (a network hierarchy rather than a strict tree). Many object-oriented languages do not support it at all, and even those that do run into trouble: once a class is placed at a point of multiple inheritance, it can no longer freely avail itself of all its parents' specializations.

An even more pressing problem with object-oriented systems is that objects don't change classes. Once an object is created (instantiated), it will remain an instance of the type of class that created it until it is destroyed. As we will see in our discussion of categories, this is not how we want our categorization schemes to work.
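One common workaround, sketched below with invented names, is to model the category as data on the object rather than as the object's class, so an item can be recategorized without being destroyed and recreated:

```python
# Hypothetical sketch: the category is an attribute, not the class,
# so the same object can move between categories over its lifetime.
class Task:
    def __init__(self, name):
        self.name = name
        self.category = "planned"

    def recategorize(self, category):
        self.category = category  # same object, new category

t = Task("pour foundation")
t.recategorize("in progress")
t.recategorize("late")
print(t.category)  # late
```

The trade-off is that behavior tied to the category can no longer be dispatched through the class hierarchy and must be looked up from the category value instead.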

What Categories Tell Us about Items in the System

If all we know about an "item" is that it is a "part" (a manufactured item), we can't make many assumptions about it. We can't even ask many intelligent questions. We can, for example, ask what it weighs. We can ask where it is. Maybe we can ask for its description, or its cost, but not much else.

However, if we categorize it as a "bolt" we suddenly have many more questions we can ask or values we could set. What is its length? Diameter? What material is it made of? How many threads per inch? What is its tensile strength? Head shape? And so on.

If we categorize it as a carriage bolt, we now know even more. We know it has a round head and a square neck beneath the head. We know it is at least 2 inches long. How do we know all this? We use a few properties to classify an item into a category, and then from that category we obtain additional information "for free."

Categorizing things accomplishes two fundamental purposes:

  1. It allows us to infer some information (based on the category) that we never recorded directly.

  2. It allows us to pose and answer new sets of questions relative to the item.
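Both purposes can be sketched with a small lookup table of category defaults. The categories echo the part/bolt/carriage-bolt example above; the particular properties and values are illustrative assumptions.

```python
# Each category contributes (1) inferred defaults and (2) askable questions.
CATEGORY_DEFAULTS = {
    "part":          {"questions": ["weight", "location", "cost"]},
    "bolt":          {"questions": ["length", "diameter", "material",
                                    "threads_per_inch", "tensile_strength"]},
    "carriage bolt": {"head_shape": "round", "neck": "square",
                      "questions": ["length", "diameter"]},
}

def describe(item, category):
    info = dict(item)
    defaults = CATEGORY_DEFAULTS[category]
    # 1. Inferred information we never stored on the item itself:
    for key, value in defaults.items():
        if key != "questions":
            info.setdefault(key, value)
    # 2. New questions it now makes sense to ask:
    info["askable"] = defaults["questions"]
    return info

bolt = describe({"weight": "40 g"}, "carriage bolt")
print(bolt["head_shape"])  # round
```

Classifying the item recorded nothing new about it, yet `describe` can now answer questions (head shape, neck) that were never asked of the item directly.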

Prototype Theory

Another problem with categories is that they don't have the nice crisp edges we wish they did. As Eleanor Rosch and George Lakoff point out, the way we typically create categories is based on what they call "prototypes" (which, unfortunately, have almost nothing to do with what application developers call "prototypes").[18]

People create categories (Rosch and Lakoff dealt extensively with what they call "folk categories": things that almost everyone categorizes, such as birds and dogs), and each category has some members that are exemplars of the category and others that are fringe members. For example, most people have robins and sparrows as exemplars (prototypes) of their "bird" category. They accept that penguins and ostriches are members, but those fringe members do not share all the properties that have accreted to the "bird" category (flies, eats worms and insects, makes nests, etc.). People deal with fringe members using a sort of "fuzzy logic": the less central a member is, the less certain we are about the properties we ascribe to it.
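Graded membership of this sort can be sketched as a similarity score against the prototype. This is a deliberately crude illustration, not Rosch's actual model; the property names are invented.

```python
# Score how prototypical a member is by how many of the category's
# accreted properties it shares with the prototype.
BIRD_PROTOTYPE = {"flies", "eats insects", "makes nests", "has feathers"}

def typicality(member_properties):
    shared = member_properties & BIRD_PROTOTYPE
    return len(shared) / len(BIRD_PROTOTYPE)

robin   = {"flies", "eats insects", "makes nests", "has feathers"}
penguin = {"has feathers", "swims"}

print(typicality(robin))    # 1.0  -> exemplar
print(typicality(penguin))  # 0.25 -> fringe member
```

A robin matches the prototype completely; a penguin is still a bird, but its low score is the "fuzziness" of its membership.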

One area where practitioners and systems developers have made some interesting progress is medical diagnostics.

How We Use Rule In/Rule Out to Adjust Categories

A technique popular with diagnostic physicians is differential diagnosis, or rule in/rule out. For example, a patient may present a set of symptoms that suggests a disease assignment, such as gestational diabetes. (Disease, by the way, is a crude medical categorization of the patient's health status.) As they refine a diagnosis, doctors look for tests that would "rule in" (confirm) or "rule out" (disqualify) the diagnosis. For example, chest pain, shortness of breath, and pain in the left arm may indicate a heart attack; a normal thallium stress test, however, would rule out a heart attack as the cause of the symptoms. As Figure 4.8 indicates, there may also be contraindicating symptoms for severe acute respiratory syndrome (SARS).

Figure 4.8: Diagnosis can be perplexing. (Original artwork created by Brian Loner, Fort Collins, CO, 2003.)

I believe this is the general case, and not the special case. We see it on the business side of health care when certain combinations of diagnosis and procedure are ruled out by the insurance companies. They are, in effect, telling doctors that although they claim to have performed a qualifying operation, it was contraindicated (ruled out) by diagnosis or other tests. I am advocating that all categorizations have this characteristic. In the short term it may still be human-only interpretation of these rules, but eventually we expect that systems will be able to look through a patient's data, rule in or rule out certain categorizations, and suggest tests that would more precisely subcategorize them.
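A system that rules categorizations in and out might be sketched as follows. The disease names and findings are illustrative only (this is not clinical logic); the point is the shape of the mechanism: disqualifying evidence prunes a category outright, supporting evidence keeps it in play.

```python
# Each candidate category carries rule-in and rule-out findings;
# observed findings prune the candidate list.
CANDIDATES = {
    "heart attack": {
        "rule_in":  {"chest pain", "shortness of breath", "left arm pain"},
        "rule_out": {"normal stress test"},
    },
    "panic attack": {
        "rule_in":  {"chest pain", "shortness of breath"},
        "rule_out": {"elevated troponin"},
    },
}

def differential(findings):
    remaining = []
    for disease, tests in CANDIDATES.items():
        if tests["rule_out"] & findings:
            continue  # a disqualifying finding rules the category out
        if tests["rule_in"] & findings:
            remaining.append(disease)  # supporting findings rule it in
    return remaining

print(differential({"chest pain", "normal stress test"}))
# ['panic attack']
```

A natural extension, as the text suggests, is for the system to propose whichever test would best discriminate among the categories still remaining.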

[18]Eleanor Rosch, "Principles of Categorization," in Eleanor Rosch and Barbara B. Lloyd (eds.), Cognition and Categorization, Hillsdale, NJ: Lawrence Erlbaum, 1978; George Lakoff, Women, Fire, and Dangerous Things, Chicago: University of Chicago Press, 1987.




Semantics in Business Systems: The Savvy Manager's Guide
Dave McComb
ISBN: 1558609172