3.10 Link Analysis Limitations

Link analysis is a very labor-intensive method of data mining. In investigations involving a high volume of transactions, such as those in money laundering, link analysis requires an extensive amount of data preparation. Even then, the results of link analysis are often quite limited when compared to other more powerful methods of data mining designed to discover the needles hidden in the haystacks.

Link analysis works best in situations where there is a limited number of observations, such as events (meetings) and entities (suspects). Its functionality begins to deteriorate once a large number of observations or transactions begins to populate a case file. Keep in mind that the key functionality of this technology is to organize the data in the form of a graph or chart. Link analysis is primarily a visualization technology, as are the tools covered in this chapter. This does not diminish their value: these tools are essential to investigators and analysts in resolving open criminal cases and identifying potential threats from dangerous entities.

Link analysis is distinct from other data mining technologies and tools that construct models via neural networks or extract association rules from databases via decision trees employing statistical and machine-learning algorithms. These data mining technologies discover and represent associations based on the aggregate statistical characteristics of a sample of instances drawn from large databases. For example, given the millions of vehicles that enter the United States via its various points of entry along the Southwest border, which makes and types of vehicles are most likely to be used to smuggle contraband? This is a question that a neural network or a machine-learning-based tool is ideally suited to solve, but one that a link analysis tool would be hard pressed to solve.

As mentioned earlier, one of the users of link analysis in detecting crime (money laundering) is the U.S. Treasury Department's Financial Crimes Enforcement Network. FinCEN analysts do not construct profiles of money laundering using conventional data mining tools. Instead, they use a variety of unconventional and largely manual methods to form profiles of money laundering. They use a form of iterative concept refinement—formulating initial profiles, querying FinCEN databases, evaluating the results using their own domain knowledge, and then refining the profile and iterating again. They use historical cases as prototypes, generalizing incidental aspects and finding similar cases. They also devise and test hypothetical schemes, based on domain knowledge and their own conjectures about likely methods for laundering money. Though profile generation is largely manual, two technical approaches do greatly aid analysts' reasoning: data restructuring techniques and link analysis.

FinCEN's data restructuring focuses on three relatively simple operations on local databases: disambiguation identifies when a single token (e.g., Bill Smith) refers to two or more individuals; consolidation identifies when several tokens (e.g., Bill Smith, Bill J. Smyth, and William Smyth, Jr.) refer to the same individual and builds a record of that individual; and aggregation provides useful summaries of transaction-level data.
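To make consolidation and aggregation concrete, the following is a minimal Python sketch, not a description of FinCEN's actual software. It assumes that name variants can be matched by normalizing them and comparing the results with the standard library's difflib. The nickname table, the suffix list, the 0.85 threshold, and the sample transactions are all hypothetical. Matching on names alone would wrongly merge two different people named Bill Smith, which is exactly the problem disambiguation is meant to address.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical lookup tables; real ones would be far larger.
NICKNAMES = {"bill": "william", "will": "william", "bob": "robert"}
SUFFIXES = {"jr", "sr", "ii", "iii"}

def normalize(name):
    """Lowercase, strip punctuation, drop suffixes and initials, expand nicknames."""
    tokens = [t.strip(".,").lower() for t in name.split()]
    tokens = [t for t in tokens if len(t) > 1 and t not in SUFFIXES]
    return " ".join(NICKNAMES.get(t, t) for t in tokens)

def same_person(a, b, threshold=0.85):
    """Treat two name tokens as one individual if their normalized forms are similar."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

def consolidate(names, threshold=0.85):
    """Greedy consolidation: attach each name to the first cluster it matches."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if same_person(name, cluster[0], threshold):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

def aggregate(transactions, clusters):
    """Aggregation: total transaction amount per consolidated individual."""
    owner = {name: i for i, cluster in enumerate(clusters) for name in cluster}
    totals = defaultdict(float)
    for name, amount in transactions:
        totals[owner[name]] += amount
    return dict(totals)

# Illustrative data only.
names = ["Bill Smith", "Bill J. Smyth", "William Smyth, Jr.", "Maria Lopez"]
transactions = [
    ("Bill Smith", 9000.0),
    ("Bill J. Smyth", 9500.0),
    ("William Smyth, Jr.", 8000.0),
    ("Maria Lopez", 1200.0),
]
clusters = consolidate(names)
totals = aggregate(transactions, clusters)
```

In this sketch the three Smith variants fall into one cluster, so their separate transactions are summarized under a single individual.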

Disambiguation, consolidation, and aggregation are simple in principle, but they are more difficult in practice. Disambiguation must make use of multiple identifiers (e.g., name, address, phone, account number) that are often misentered at the Detroit IRS Center or change over time. Disambiguation also makes considerable use of low-level domain knowledge about national and cultural conventions of name ordering and spelling. Consolidation is also surprisingly difficult given that different overlapping consolidations are useful for different analytical purposes. Similarly, useful types of aggregation depend strongly on the inferences they are meant to support.
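The role of multiple identifiers in disambiguation can be illustrated with a small Python sketch. It assumes, purely for illustration, that records carrying the same name belong to the same individual only if they can be chained together through at least one shared secondary identifier. The field names (address, phone, account) and the sample records are hypothetical. Exact equality is a deliberate oversimplification, since the identifiers are often misentered or change over time.

```python
def disambiguate(records, keys=("address", "phone", "account")):
    """Split same-name records into likely individuals.

    Two records are linked when they share a non-missing value for any key;
    linked records are merged into one group of record indices.
    """
    groups = []
    for i, rec in enumerate(records):
        # Find every existing group that shares an identifier with this record.
        matched = [
            g for g in groups
            if any(
                rec.get(k) is not None and rec.get(k) == records[j].get(k)
                for j in g
                for k in keys
            )
        ]
        merged = [i]
        for g in matched:
            merged.extend(g)
            groups.remove(g)
        groups.append(sorted(merged))
    return groups

# Illustrative records that all carry the single token "Bill Smith".
records = [
    {"name": "Bill Smith", "address": "12 Oak St", "phone": "555-0101", "account": "A1"},
    {"name": "Bill Smith", "address": "12 Oak St", "phone": None, "account": "A2"},
    {"name": "Bill Smith", "address": "9 Elm Ave", "phone": "555-0199", "account": "B7"},
    {"name": "Bill Smith", "address": None, "phone": "555-0199", "account": "B8"},
]
individuals = disambiguate(records)
```

Here the four records resolve into two probable individuals: one tied together by a shared address and one by a shared phone number. Missing values are ignored rather than treated as a match.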

FinCEN's use of consolidation and aggregation mirrors related techniques used elsewhere for fraud detection. Consolidation can be viewed as a form of clustering, a common technique for understanding data in the absence of class labels indicating the correct inference (e.g., money-laundering or not-money-laundering). Similarly, aggregation is used in other fraud detection applications to build a profile of "normal" use (e.g., for a credit card or a cell phone account). Deviations from this profile can then be used to indicate fraud. As we shall see, similar techniques exist for intrusion detection systems (IDSs) for identifying hackers.
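The idea of a "normal" profile and deviations from it can be sketched in a few lines of Python. This is one simple possibility, assuming the profile is nothing more than the mean and standard deviation of past transaction amounts and that a deviation is any amount more than three standard deviations from the mean. The function names, the threshold, and the sample history are hypothetical, and production fraud systems use far richer profiles.

```python
from statistics import mean, stdev

def build_profile(history):
    """Aggregate past amounts into a simple profile of normal use."""
    return {"mean": mean(history), "stdev": stdev(history)}

def is_deviation(amount, profile, z_threshold=3.0):
    """Flag an amount that lies far from the profile, measured in standard deviations."""
    if profile["stdev"] == 0:
        return amount != profile["mean"]
    z_score = abs(amount - profile["mean"]) / profile["stdev"]
    return z_score > z_threshold

# Illustrative purchase history for one credit card account.
history = [40.0, 55.0, 60.0, 45.0, 50.0]
profile = build_profile(history)
typical = is_deviation(52.0, profile)    # close to the usual spending level
unusual = is_deviation(900.0, profile)   # far outside the usual range
```

A flagged transaction is only a lead for an analyst to examine; a deviation from the profile suggests fraud but does not establish it.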

As we have seen, the analysis of a network of associations is a visualization technique to reveal structure in sets of related records. Linkage data are typically modeled as a graph with nodes representing entities of interest in the domain and links representing relationships or transactions. As shown, these examples might be a collection of cash transactions to and from bank accounts, a collection of telephone toll data (e.g., numbers, times, and duration) subpoenaed for a criminal investigation, or a collection of sightings of individuals' meetings, their addresses, and other related commercial or social interactions.

Links, as well as nodes, may have attributes specific to the domain or relevant to the method of collection. For example, link attributes might indicate the certainty or strength of a relationship, the dollar value of a transaction, or the probability of a connection. Some linkage data may be simple but voluminous (e.g., telephone calls), with a uniformity of node and link types and a great deal of regularity. Other data may be extremely rich and varied, though sparse (e.g., law enforcement data), with elements possessing many domain-specific attributes, as well as confidence and value, which may change over time.
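The graph model described above, with attributes on both nodes and links, can be sketched as a small Python data structure. This is an illustrative assumption about how such data might be held in memory, not the design of any tool covered in this chapter. The class names, the attribute names (type, amount, strength), and the sample entities are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Link:
    """A relationship or transaction between two entities."""
    source: str
    target: str
    kind: str
    attrs: dict = field(default_factory=dict)  # e.g., strength, amount, certainty

class LinkGraph:
    def __init__(self):
        self.nodes = {}                    # node id -> attribute dict
        self.adjacent = defaultdict(list)  # node id -> links touching that node

    def add_node(self, node_id, **attrs):
        self.nodes.setdefault(node_id, {}).update(attrs)

    def add_link(self, source, target, kind, **attrs):
        for node_id in (source, target):
            self.add_node(node_id)
        link = Link(source, target, kind, attrs)
        self.adjacent[source].append(link)
        self.adjacent[target].append(link)
        return link

    def neighbors(self, node_id, min_strength=0.0):
        """Entities linked to node_id by links at or above a given strength."""
        result = set()
        for link in self.adjacent.get(node_id, []):
            # Links without a strength attribute are treated as full strength.
            if link.attrs.get("strength", 1.0) >= min_strength:
                result.add(link.target if link.source == node_id else link.source)
        return result

# Illustrative case data.
graph = LinkGraph()
graph.add_node("acct:1001", type="account")
graph.add_link("person:A", "acct:1001", "deposit", amount=9500.0, strength=0.9)
graph.add_link("person:A", "person:B", "meeting", strength=0.3)
all_contacts = graph.neighbors("person:A")
strong_contacts = graph.neighbors("person:A", min_strength=0.5)
```

Filtering on a link attribute such as strength is one way an analyst might thin out a chart so that only the better-supported relationships are drawn.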

FinCEN analysts search for patterns of financial transactions and other events and facts that are indicative of money laundering. They sift through millions of reports by banks, tips from law enforcement agencies, and records of cooperating federal agencies to discover patterns that reveal illegal activity. FinCEN receives and processes over 10 million currency transaction reports (CTRs) and thousands of suspicious activity reports (SARs) submitted by banks and other financial institutions. FinCEN can also access dozens of remote, structured databases of postal, business, and travel records, as well as many remote, unstructured databases of news stories and other textual data. However, at present, inducing profiles from link charts is largely a manual process, although methods such as inductive logic programming and other techniques could be used to draw inductive inferences from these types of data.

As we shall see in subsequent chapters, link analysis is distinct from other data mining techniques that construct predictive models from neural networks and rules or decision trees from machine-learning algorithms. These techniques use networks as a model representation and discover associations based on the aggregate statistical characteristics of a sample of uniform instances drawn from some population. In contrast, link analysis uses networks as a data representation and infers useful knowledge based on the relations present in a network of heterogeneous records.

The main drawback to link analysis is that the number of data records that can be presented in most diagrams is limited. The human eye can take in only so much, even for very experienced investigators and analysts. However, a number of AI technologies have the potential to assist investigators in constructing complex networks of voluminous entities and events. These technologies, all covered in the following chapters, include intelligent agents for information retrieval, text mining software for sorting and organizing content from thousands of documents, neural networks for pattern recognition, and machine-learning algorithms for constructing profiles and extracting rules from large databases.