Here, we discuss some aspects of using CMM for text processing. First, we discuss the ability of CMM to reject noise. Then, we analyze the applicability of CMM to the approximate matching task. Next, we propose how to deal with words of different lengths.
Ability to Reject Noise
The key property of CMM is its ability to deal with corrupt input data. That means it can respond correctly even when the input pattern is corrupted. The question is how far the recalled pattern can be from the originally trained one. The difference is measured by the Hamming distance (the number of differing bits) between them. Note that a simple CMM is not able to deal with shifted patterns.
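As a minimal illustration (the function name is ours, not part of the CMM technique), the Hamming distance between two equal-length bit strings simply counts the positions where they differ:

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("patterns must have equal length")
    return sum(x != y for x, y in zip(a, b))

# A pattern corrupted in two bit positions lies at Hamming distance 2
# from the trained one.
print(hamming_distance("101100", "100101"))  # -> 2
```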
The ability to reject noise depends significantly on the characteristics of the trained data. The probability of a correct answer depends on how similar the corrupted pattern is to the trained one, i.e., on their Hamming distance. The input coding method should be designed with this aspect in mind. For example, it could be tailored to the typical mistakes we make when typing text on a keyboard. The coding method could therefore take the keyboard layout into account: neighboring letters would be assigned codes with a smaller Hamming distance, because it is more probable that we strike a neighboring key by mistake.
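One way to build such a layout-aware code can be sketched as follows (this construction and its helper names are ours, for illustration only, and assume a simplified QWERTY layout with a 26-bit code per letter): each letter sets its own bit plus the bits of its horizontal neighbors, so adjacent keys share bits and end up at a smaller Hamming distance.

```python
# Simplified QWERTY rows (an assumption for illustration).
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

LETTERS = "abcdefghijklmnopqrstuvwxyz"
BIT = {ch: i for i, ch in enumerate(LETTERS)}

def neighbors(ch):
    """Letters horizontally adjacent to ch on the simplified layout."""
    for row in ROWS:
        i = row.find(ch)
        if i != -1:
            return {row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)}
    return set()

def encode(ch):
    """26-bit code: set the letter's own bit plus its neighbors' bits,
    so adjacent keys share bits and lie at a smaller Hamming distance."""
    bits = [0] * 26
    bits[BIT[ch]] = 1
    for n in neighbors(ch):
        bits[BIT[n]] = 1
    return bits

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# 'f' and 'g' are adjacent keys; 'f' and 'p' are not.
print(hamming(encode("f"), encode("g")) < hamming(encode("f"), encode("p")))  # -> True
```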
There is also another characteristic of real text that we should take into account. Real text contains many similar words that differ by only one letter (light-night, among-along, etc.). This adversely affects the matrix because it causes many overlaps between patterns, which reduces the capacity.
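This overlap can be illustrated with a simple positional "1 of N" code per letter (a sketch of ours, not necessarily the exact coding used in the experiments): words that differ in a single letter share almost all of their set bits, and therefore address almost the same matrix locations.

```python
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def one_of_n(word):
    """Concatenate a 26-bit '1 of N' code for each letter of the word."""
    bits = []
    for ch in word:
        v = [0] * 26
        v[LETTERS.index(ch)] = 1
        bits.extend(v)
    return bits

def overlap(a, b):
    """Number of positions where both patterns have a bit set."""
    return sum(x & y for x, y in zip(a, b))

# 'light' and 'night' differ in one letter, so four of the five set bits
# coincide -- their entries in the matrix overlap heavily.
print(overlap(one_of_n("light"), one_of_n("night")))  # -> 4
print(overlap(one_of_n("light"), one_of_n("among")))  # -> 0
```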
Conventional techniques for approximate matching work with an exactly defined distance. They use several types of distance, e.g., the Levenshtein distance, the Hamming distance, etc. They work exactly in the sense that they return all words within a given distance as the result.
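For comparison, such an exact matcher can be sketched with the classic Levenshtein dynamic program (the helper names are ours):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# An exact matcher returns every word within a given distance of the query.
words = ["light", "night", "sight", "along", "lights"]
print([w for w in words if levenshtein("light", w) <= 1])
# -> ['light', 'night', 'sight', 'lights']
```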
On the other hand, CMM is not capable of measuring the distance between patterns exactly. The recall process attempts to find the most similar pattern. CMM does not guarantee a correct response for every pattern within a certain Hamming distance of a trained pattern, so we cannot say that CMM corrects all patterns up to a certain Hamming distance. It depends on several aspects. First, it depends on the input code used, which should give similar binary patterns for similar input words. It also depends on the distances to the other trained patterns. Similar patterns tend to overlap at the same locations in the matrix. Traditional pattern-matching techniques produce as their result a set of words (and their positions in the text) that satisfy a given distance. In contrast, CMM finds only the single most similar pattern. One possibility is to take several responses before the recall process chooses the final one.
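The ideas above can be sketched as a minimal binary CMM with a top-k recall step (this is our illustrative simplification, not the exact architecture used in the experiments): training ORs an outer product of binary input/output patterns into the matrix, and recall sums the rows selected by the active input bits and keeps the k highest-scoring outputs instead of a single winner.

```python
class CMM:
    """Minimal binary correlation matrix memory (an illustrative sketch)."""

    def __init__(self, n_in, n_out):
        self.w = [[0] * n_out for _ in range(n_in)]

    def train(self, x, y):
        # Hebbian-style binary learning: set w[i][j] where both bits fire.
        for i, xi in enumerate(x):
            if xi:
                for j, yj in enumerate(y):
                    if yj:
                        self.w[i][j] = 1

    def recall(self, x, k=1):
        # Sum the matrix rows selected by the active input bits, then keep
        # the k highest-scoring output bits instead of a single winner.
        sums = [0] * len(self.w[0])
        for i, xi in enumerate(x):
            if xi:
                for j in range(len(sums)):
                    sums[j] += self.w[i][j]
        return sorted(range(len(sums)), key=lambda j: -sums[j])[:k]

mem = CMM(n_in=6, n_out=4)
mem.train([1, 0, 1, 0, 1, 0], [1, 0, 0, 0])  # pattern A -> output bit 0
mem.train([0, 1, 0, 1, 0, 1], [0, 0, 1, 0])  # pattern B -> output bit 2

# A corrupted version of pattern A (one cleared bit) still recalls bit 0.
print(mem.recall([1, 0, 1, 0, 0, 0], k=1))  # -> [0]
```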
Different Lengths of Words
The correlation matrix has a constant size, while words in a text have different lengths. In our experiments, we have used only words with a constant number of letters. One solution could be to use several matrices, each storing words of a particular length. Another solution is to use one matrix and pad the words to the length of the longest one.
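Both options can be sketched as follows (the padding symbol and helper names are assumptions of ours):

```python
PAD = " "  # an assumed padding symbol outside the letter alphabet

def pad_to(word, length):
    """One-matrix variant: complete a word to the length of the longest one."""
    return word.ljust(length, PAD)

def bucket_by_length(words):
    """Several-matrices variant: group words so each matrix stores one length."""
    buckets = {}
    for w in words:
        buckets.setdefault(len(w), []).append(w)
    return buckets

words = ["light", "along", "sight", "go", "among"]
longest = max(len(w) for w in words)
print([pad_to(w, longest) for w in words])
# -> ['light', 'along', 'sight', 'go   ', 'among']
print(bucket_by_length(words))
# -> {5: ['light', 'along', 'sight', 'among'], 2: ['go']}
```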
The conventional techniques used for pattern matching are designed to solve many types of text-searching problems. The technique based on CMM has some limitations and is suitable only for some of them. CMM is suitable for applications that need an efficient search tool in cases where the precision of the answer is not critical. The advantage of CMM is its ability to work with corrupt data. Another advantage of the technique is its fast processing.
We have shown that the coding of input patterns significantly affects the capacity of the correlation matrix memory. Simple "1 of N" coding does not give good results. We have proposed two coding schemes; both give good results because of their nearly uniform distribution. However, the "random shift" method does not preserve the ability of CMM to deal with corrupt patterns. The speed experiments show competitive results when compared to traditional fast techniques such as Boyer-Moore. The reason is a different approach to searching for the patterns: the CMM-based technique only recalls the pattern from the associative memory and takes its position from the position table. On the other hand, the CMM technique needs to be trained before recalling, and training processes the text linearly in its length.
In the future, we want to study and improve the coding of input patterns. Next, we want to apply this technique to the approximate-search problem. We also want to use a more advanced architecture (multiple CMMs) to obtain better processing results.