Relational Databases versus Digital Soup | About Face 2.0: The Essentials of Interaction Design

Software that uses database technology makes two simple demands of its users: First, the user must define the form of the data in advance; second, the user must then conform to that definition. There are also two facts about human users of software: First, they never know what they are going to want in advance; and second, even if they did, more often than not they change their minds.

Organizing the unorganizable

Now that we live in the Internet age, we find ourselves more and more frequently confronting information systems that fail the relational database litmus test: We can neither define information in advance, nor can we reliably stick to any definition we might conjure up. In particular, two phenomena, both experiencing exponential growth, exemplify this dilemma.

The first phenomenon is electronic mail. Whereas a record in a database has a specific identity, and thus belongs in a table of objects of the same type, an e-mail message doesn't fit this paradigm very well. We can divide our e-mail into incoming and outgoing, but that doesn't help us much. For example, suppose you receive a piece of e-mail from Jerry about Sally, regarding the Ajax Project and how it relates to Jones Consulting and your joint presentation at the board meeting. You can file this away in the "Jerry" folder, or the "Sally" folder, or the "Ajax" folder, but what you really want to do is to file it in all of them. In six months, you might try to find this message for any number of unpredictable reasons, and you'll want to be able to find it, regardless of your reason.

The second phenomenon is the Web. Like an infinite, chaotic, redundant, unsupervised hard disk, the Web defies being structured. Enormous quantities of information are available on the Internet, but its sheer quantity and heterogeneity guarantee that no regular system could ever be imposed on it (we'll see where the Semantic Web initiatives engaged in by the W3C take us; perhaps there is hope). Even if the Web could be organized, the method would likely have to exist on the outside, because its contents are owned by millions of individuals, none of whom are subject to any authority. Unlike a record in a database, a record on the Internet cannot be expected to carry a predictable identifying mark.

Problems with databases

There's a further problem with databases: All database records are of a single, predefined type, and all instances of a record type are grouped together. A record may represent an invoice or a customer, but it never represents an invoice and a customer. Similarly, a field within a record may be a name or a social security number, but it is never a name and a social security number. This is the fundamental concept underlying all databases—it serves the vital purpose of allowing us to impose order on our storage system. Unfortunately, it fails miserably to address the realities of retrieval for our e-mail problem: It is not enough that the e-mail from Jerry is a record of type "e-mail". Somehow, we must also identify it as a record of type "Jerry", type "Sally", type "Ajax", type "Jones Consulting", and type "Board Meeting". We must also be able to add and change its identity at will, even after the record has been stored away. What's more, a record of type "Ajax" may refer to documents other than e-mail messages—a project plan for example. Because the record format is unpredictable, the value that identifies the record as pertaining to Ajax cannot be stored reliably within the record itself. This is in direct contradiction to the way databases work.

Databases do provide us with retrieval tools with a bit more flexibility than matching simple record types. They allow us to find and fetch a record by examining its contents and matching them against search criteria. For example, we can search for invoice number "77329" or for the customer with the identifying string "Goodyear Tire and Rubber". Yet, this still fails for our e-mail problem. If we allow the user to enter the keywords "Jerry", "Sally", "Ajax", "Jones Consulting", and "Board Meeting" into the message record, we must define such fields in advance. But as we've said, defining things in advance doesn't guarantee that the user will follow that definition later. He may now be looking for messages about the company picnic, for example. Besides, adding a series of keyword fields leads you into one of the most fundamental and universal conundrums of data processing: If you give users ten fields, someone is bound to want eleven.
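The keyword-field conundrum can be made concrete with a small sketch. It assumes a hypothetical "message" table with three keyword columns (three rather than ten, to keep it short), built with Python's standard sqlite3 module; the table layout and the store and find helpers are illustrations only, not anything the text prescribes.

```python
import sqlite3

MAX_KEYWORDS = 3  # fixed in advance by the schema (hypothetical limit)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE message ("
    "id INTEGER PRIMARY KEY, body TEXT, "
    "keyword1 TEXT, keyword2 TEXT, keyword3 TEXT)"
)


def store(body, keywords):
    """Store a message; refuse more keywords than the schema has fields."""
    if len(keywords) > MAX_KEYWORDS:
        raise ValueError(f"schema allows only {MAX_KEYWORDS} keywords")
    padded = list(keywords) + [None] * (MAX_KEYWORDS - len(keywords))
    cur = conn.execute(
        "INSERT INTO message (body, keyword1, keyword2, keyword3) "
        "VALUES (?, ?, ?, ?)",
        [body, *padded],
    )
    return cur.lastrowid


def find(keyword):
    """Return ids of messages carrying the keyword in any keyword field."""
    rows = conn.execute(
        "SELECT id FROM message WHERE ? IN (keyword1, keyword2, keyword3)",
        (keyword,),
    )
    return [row[0] for row in rows]


first = store("Status update", ["Jerry", "Sally", "Ajax"])

try:
    # Five keywords do not fit a definition that was made in advance.
    store(
        "Presentation notes",
        ["Jerry", "Sally", "Ajax", "Jones Consulting", "Board Meeting"],
    )
    overflowed = False
except ValueError:
    overflowed = True
```

In this sketch the message from Jerry cannot be stored with all five of its keywords, and every search has to name each keyword column explicitly.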

The attribute-based alternative

So, if relational database technology isn't right, what is? If users find it hard to define their information in advance as databases require, is there an alternative storage and retrieval system that might work well for them?

Once again, the key is separating the storage system from the retrieval system. If an index were used as the retrieval system, the storage technique could still remain a database. We can imagine the storage facility as a sort of digital soup where we could put our records. This soup would accept any record we dumped into it, regardless of its size, type, or contents. Whenever a record was entered, the program would return a token that could be used to retrieve the record. All we have to do is give it back that token, and the soup instantly returns our record. This is just our storage system, however; we still need a retrieval system that manages all those tokens for us.
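A minimal sketch of such a soup, assuming an in-memory store and random opaque tokens. The DigitalSoup class, its put and get methods, and the token format are all hypothetical; the text only requires that a token come back when a record goes in.

```python
import uuid


class DigitalSoup:
    """Hypothetical storage layer: takes any record, hands back a token."""

    def __init__(self):
        self._records = {}  # token -> record; no schema is imposed

    def put(self, record):
        token = uuid.uuid4().hex  # opaque token; the format is an assumption
        self._records[token] = record
        return token

    def get(self, token):
        return self._records[token]


soup = DigitalSoup()

# Records of different shapes go into the same soup.
mail_token = soup.put({"from": "Jerry", "body": "About Sally and the Ajax Project"})
plan_token = soup.put(b"binary project plan contents")

print(soup.get(mail_token)["from"])
```

Nothing in the soup knows what a record is about. That knowledge lives entirely in whatever keeps track of the tokens.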

Attribute-based retrieval thus comes to our rescue: We can create an index that stores a key value along with a copy of the token. The real magic, though, is that we can create an infinite number of indices, each one representing its own key and containing a copy of the token. For example, if our digital soup contained all our e-mail messages, we could establish an index for each of our old friends, "Jerry", "Sally", "Ajax", "Jones Consulting", and "Board Meeting". Now, when we need to find e-mail pertinent to the board meeting, we don't have to paw manually and tediously through dozens of folders. Instead, a single query brings us everything we are looking for.
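The idea can be sketched as follows, under some simplifying assumptions: the Soup and AttributeIndex classes are hypothetical, tokens are plain integers, and a single mapping from key to a set of tokens stands in for "one index per key".

```python
from collections import defaultdict
from itertools import count


class Soup:
    """Hypothetical token-based store for records of any shape."""

    def __init__(self):
        self._records = {}
        self._next_token = count(1)

    def put(self, record):
        token = next(self._next_token)
        self._records[token] = record
        return token

    def get(self, token):
        return self._records[token]


class AttributeIndex:
    """Hypothetical retrieval layer: each key maps to a set of tokens."""

    def __init__(self):
        self._tokens_by_key = defaultdict(set)

    def tag(self, key, token):
        self._tokens_by_key[key].add(token)

    def untag(self, key, token):
        # Identities can be changed after the record has been stored.
        self._tokens_by_key[key].discard(token)

    def find(self, key):
        return set(self._tokens_by_key.get(key, ()))


soup = Soup()
index = AttributeIndex()

msg = soup.put("Jerry's note about Sally, Ajax, Jones Consulting, and the board meeting")
for key in ("Jerry", "Sally", "Ajax", "Jones Consulting", "Board Meeting"):
    index.tag(key, msg)

# A record of a different type can share a key with the e-mail.
plan = soup.put({"kind": "project plan"})
index.tag("Ajax", plan)

ajax_records = [soup.get(token) for token in sorted(index.find("Ajax"))]
```

Because the keys live outside the records, the same message is reachable under all five names, and a new key such as "Company Picnic" can be added later without touching anything already stored.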

Of course, someone or something must fill those indices, but that is a more mundane exercise in interaction design. There are two components to consider. First, the system needs to be able to read e-mail messages and automatically extract and index information like proper names, Internet addresses, street addresses, phone numbers, and other significant data. Second, the system must make it very easy for a user to add ad hoc pointers to messages. He should be able to explicitly specify that a given e-mail message pertains to a specific value, whether or not that value is quoted verbatim in the message. Typing is okay, but selecting from pick-lists, clicking-and-dragging, and other more advanced user interface idioms can make the task almost painless.
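Both components can be sketched in a few lines. The regular expressions below are deliberately crude assumptions that catch only simple e-mail addresses and one phone number format; recognizing proper names or street addresses would need far more than this. The function names and the integer token are hypothetical.

```python
import re
from collections import defaultdict

# Rough patterns, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

index = defaultdict(set)  # key -> tokens of the messages filed under it


def extract_keys(text):
    """Component one: pull indexable values out of the message text."""
    return set(EMAIL_RE.findall(text)) | set(PHONE_RE.findall(text))


def index_message(token, text):
    for key in extract_keys(text):
        index[key].add(token)


def tag(token, key):
    """Component two: an ad hoc pointer; the key need not appear in the text."""
    index[key].add(token)


message = "Call Sally at 555-867-5309 or write to jerry@example.com about the slides."
index_message(1, message)   # automatic extraction
tag(1, "Board Meeting")     # explicit user choice, e.g. from a pick-list
```

The user interface question the text raises is how tag gets called: typing a key, choosing it from a pick-list, or dragging the message onto it would all end in the same operation.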

Significant advantages arise from a world where the storage system is reduced in importance and the retrieval system is separated from it and significantly enhanced. Some form of digital soup will help us to get control of the unpredictable information that is beginning to make up more and more of our everyday information universe. We can offer users powerful information management tools without demanding that they configure their information in advance or that they conform to that configuration in the future. After all, they can't do it. So why insist?