Archaeology - Part 1: Data Mining


Archaeology—Part 1: Data Mining

Talking to people shouldn't be your only means of accessing semantic information. The existing systems often contain a wealth of data that the users are not aware of or forget to bring up.

The real trick with data (and system) exploration is knowing how much to do before interviews, how much concurrently, and how much later (as well as how much to do at all).

Data Mining and Profiling

Two sources of data are of interest here:

  • Official data—Official data is the data that is in the database. It has passed all the official validation and editing and represents the book of record as far as the company is concerned about what happened.

  • Unofficial data—Unofficial data is all the data that didn't make it into the system, or is in unstructured formats such as memos. A great source of this information is notes annotated on the margin in forms or on reports. This is one of the reasons that systems designers like to work with forms that have been used. You will often need to interview the user to find out what the annotations mean.

For the official data, data profiling and data mining software exist, and if you have the luxury, it is handy for finding data anomalies and patterns. Some of the major vendors offering data profiling software are Evoke,[59] Ascential,[60] and Avellino.[61]

If you don't have data mining or data profiling, you can still do a lot on an ad hoc basis. Here are some strategies:

  • Review existing reports, especially if you can find any that are exhaustive. The closer you can get to a file dump the better. You are looking for the anomalies, which are often excluded from some of the more widely distributed reports.

  • Use the client's report writer. Many systems will have a report writer, which you can use to extract this data. One of the advantages is that while it often takes a lot of effort to get a nice-looking report from a report writer, it is usually simple to get tabular output.

  • Use the client's query system. You may need to rely on their query system, which might include a structured query language (SQL) input if you're lucky, and may have you reviewing pages of data on a screen. Still, you'd be amazed at the patterns you will see just from paging through almost at random.

  • See if they have an ODBC attachment (ODBC is a widely used abstraction popularized by Microsoft that allows many general purpose tools access to a wide variety of relational data bases). Many systems now allow ODBC access to their databases. (From a security and integrity standpoint it is not a good idea to have this sort of "back door" on your system, but if the capability exists you might as well use it.)

What you do is scan for certain cases. If you have managed to download data into Excel or Access you can sort each column, which will make searching for the interesting cases much easier. You are looking for the following:

  • Values that might indicate subtypes—If you scan through a file that has an order-type code field, you might see an "N" in the column followed by a "W." In most cases these indicators will not be self-explanatory, although you may be able to determine the meaning from the context. In this case the "W" meant it was a warranty order (the rest being "normal"). Often blank will be normal.

  • Values that indicate temporal status—This requires the same sort of review, although you'd be looking for values that typically change over the life of the order. You might want to list these next to some key dates, such as order close date, to see if the status correlates with the order close date. Once you find a correlation, you immediately want to look for examples that violate the correlation, which you will want to quiz people on. "I noticed that order 1234 has a closed date of 12/31/2001, and yet the status was ‘P’ (pending)." This will most often bring up some unusual special case.

  • Unusual numeric values—For any measurement or dollar value, look for negative numbers or extremely large numbers ("high values" aren't used much anymore, but you may want to check just in case). Often the presence of negative numbers where you wouldn't expect them to be represents some other special case. Also look for 0 or very small values. (For example, $0.01 shows up with surprising frequency; often there are systems that will not allow $0 transactions, yet users want to enter some sort of memo transaction, and they are willing to have the cash total be off by small amounts in order to accomplish what they are trying to do.)

  • Dates—There are all sorts of odd conventions around dates. You may find many transactions on an obvious border date (e.g., 12/31/2001 or 1/1/2002). The number of them relative to 12/30 or 1/2 should tip you off that they aren't really what they appear to be. Also look for dates out of range, such as dates that should be historical but are in the future. You should look for invalid dates, not so much to bust the validation routine, which obviously isn't working, but to start a question about what this means or how the output copes with it. You also want to compare some dates; for example, it is often a good idea to compare the posted-on date/time with the effective or occurred-on date/time. If they are always the same you either have a system that is completely contemporaneous (highly unlikely) or one that isn't making this distinction. More interesting are the cases where the posted-on date is before the occurred-on date.

  • Unusual text values—You'll want to sort most text values and also search for special characters. You'll get all sorts of weird stuff here; look for patterns. We've had several clients put "*** do not use" as the address line for a vendor that they wanted to take off their approved vendor list, because integrity constraints in the system prevented them from simply deleting vendors with whom they had ever done business.

  • Referential integrity (not!)—Explore fields that should be foreign keys, but aren't always. If you have some ability to execute SQL against the database, or a copy of it, try doing the opposite of a join: Look for tables that don't join (i.e., tables that either have a blank foreign key or one that does not match). We have found cases where there appeared to be referential integrity, but there were fields that didn't match. Sometimes this was because the master file was changed and the integrity was not checked in that direction; in other cases there never was an integrity check, but the users had been mostly very good at keeping them in synch.

Each anomaly is either a bit of semantic information about how the existing system has been used or at least a conversation starter for the user groups.

[59]See http://www.evokesoftware.com/ for further information.

[60]See http://www.ascentialsoftware.com/ for further information.

[61]See http://www.avellino.com/ for further information.




Semantics in Business Systems(c) The Savvy Manager's Guide
Semantics in Business Systems: The Savvy Managers Guide (The Savvy Managers Guides)
ISBN: 1558609172
EAN: 2147483647
Year: 2005
Pages: 184
Authors: Dave McComb

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net