Here, we discuss some aspects of using CMM for text processing. First, we discuss the ability of CMM to reject noise. Then, we analyze the applicability of CMM to the approximate matching task. Next, we propose how to deal with words of different lengths.
Ability to Reject Noise
The key property of CMM is its ability to deal with corrupt input data. That means it can respond correctly even when the input pattern is corrupted. The question is how far the recalled pattern can be from the originally trained one. The difference is measured by the Hamming distance (the number of differing bits) between them. Note that a simple CMM is not able to deal with shifted patterns.
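As a minimal illustration (the function name is ours, not part of the CMM technique), the Hamming distance between two equal-length bit strings simply counts the positions where they differ:

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("patterns must have equal length")
    return sum(x != y for x, y in zip(a, b))

# A pattern corrupted in two bit positions lies at Hamming distance 2
# from the trained one.
print(hamming_distance("101100", "100101"))  # -> 2
```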
The ability to reject noise depends significantly on the characteristics of the trained data. The probability of a correct answer depends on how similar the corrupted pattern is to the trained one, i.e., on their Hamming distance. The input coding method should be designed with this aspect in mind. For example, it could be tailored to the typical mistakes we make when typing text on a keyboard. The coding method could therefore take the keyboard layout into account: neighboring letters would be assigned codes with a smaller Hamming distance, because it is more probable that we strike a neighboring key by mistake.
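One way to build such a layout-aware code can be sketched as follows (this construction and its helper names are ours, for illustration only, and assume a simplified QWERTY layout with a 26-bit code per letter): each letter sets its own bit plus the bits of its horizontal neighbors, so adjacent keys share bits and end up at a smaller Hamming distance.

```python
# Simplified QWERTY rows (an assumption for illustration).
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

LETTERS = "abcdefghijklmnopqrstuvwxyz"
BIT = {ch: i for i, ch in enumerate(LETTERS)}

def neighbors(ch):
    """Letters horizontally adjacent to ch on the simplified layout."""
    for row in ROWS:
        i = row.find(ch)
        if i != -1:
            return {row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)}
    return set()

def encode(ch):
    """26-bit code: set the letter's own bit plus its neighbors' bits,
    so adjacent keys share bits and lie at a smaller Hamming distance."""
    bits = [0] * 26
    bits[BIT[ch]] = 1
    for n in neighbors(ch):
        bits[BIT[n]] = 1
    return bits

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# 'f' and 'g' are adjacent keys; 'f' and 'p' are not.
print(hamming(encode("f"), encode("g")) < hamming(encode("f"), encode("p")))  # -> True
```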
There is also another characteristic of real text that we should take into account. Real text contains many similar words that differ by only one letter (light-night, among-along, etc.). This adversely affects the matrix because it causes many overlaps between patterns, which reduces the capacity.
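This overlap can be illustrated with a simple positional "1 of N" code per letter (a sketch of ours, not necessarily the exact coding used in the experiments): words that differ in a single letter share almost all of their set bits, and therefore address almost the same matrix locations.

```python
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def one_of_n(word):
    """Concatenate a 26-bit '1 of N' code for each letter of the word."""
    bits = []
    for ch in word:
        v = [0] * 26
        v[LETTERS.index(ch)] = 1
        bits.extend(v)
    return bits

def overlap(a, b):
    """Number of positions where both patterns have a bit set."""
    return sum(x & y for x, y in zip(a, b))

# 'light' and 'night' differ in one letter, so four of the five set bits
# coincide -- their entries in the matrix overlap heavily.
print(overlap(one_of_n("light"), one_of_n("night")))  # -> 4
print(overlap(one_of_n("light"), one_of_n("among")))  # -> 0
```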
Conventional techniques for approximate matching work with an exactly defined distance. They use several types of distance, e.g., the Levenshtein distance, the Hamming distance, etc. They work exactly in the sense that they return all words within a given distance as the result.
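For comparison, such an exact matcher can be sketched with the classic Levenshtein dynamic program (the helper names are ours):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# An exact matcher returns every word within a given distance of the query.
words = ["light", "night", "sight", "along", "lights"]
print([w for w in words if levenshtein("light", w) <= 1])
# -> ['light', 'night', 'sight', 'lights']
```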
On the other hand, CMM is not capable of measuring the distance between patterns exactly. The recall process attempts to find the most similar pattern. CMM does not guarantee a correct response for every pattern within a certain Hamming distance of a trained pattern, so we cannot say that CMM corrects all patterns up to a certain Hamming distance. It depends on several aspects. First, it depends on the input code used, which should give similar binary patterns for similar input words. It also depends on the distances to the other trained patterns. Similar patterns tend to overlap at the same locations in the matrix. Traditional pattern-matching techniques produce as their result a set of words (and their positions in the text) that satisfy a given distance. In contrast, CMM finds only the single most similar pattern. One possibility is to take several responses before the recall process chooses the final one.
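The ideas above can be sketched as a minimal binary CMM with a top-k recall step (this is our illustrative simplification, not the exact architecture used in the experiments): training ORs an outer product of binary input/output patterns into the matrix, and recall sums the rows selected by the active input bits and keeps the k highest-scoring outputs instead of a single winner.

```python
class CMM:
    """Minimal binary correlation matrix memory (an illustrative sketch)."""

    def __init__(self, n_in, n_out):
        self.w = [[0] * n_out for _ in range(n_in)]

    def train(self, x, y):
        # Hebbian-style binary learning: set w[i][j] where both bits fire.
        for i, xi in enumerate(x):
            if xi:
                for j, yj in enumerate(y):
                    if yj:
                        self.w[i][j] = 1

    def recall(self, x, k=1):
        # Sum the matrix rows selected by the active input bits, then keep
        # the k highest-scoring output bits instead of a single winner.
        sums = [0] * len(self.w[0])
        for i, xi in enumerate(x):
            if xi:
                for j in range(len(sums)):
                    sums[j] += self.w[i][j]
        return sorted(range(len(sums)), key=lambda j: -sums[j])[:k]

mem = CMM(n_in=6, n_out=4)
mem.train([1, 0, 1, 0, 1, 0], [1, 0, 0, 0])  # pattern A -> output bit 0
mem.train([0, 1, 0, 1, 0, 1], [0, 0, 1, 0])  # pattern B -> output bit 2

# A corrupted version of pattern A (one cleared bit) still recalls bit 0.
print(mem.recall([1, 0, 1, 0, 0, 0], k=1))  # -> [0]
```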
Different Lengths of Words
The correlation matrix has a constant size, while words in a text have different lengths. In our experiments, we have used only words with a constant number of letters. One solution could be to use several matrices, each storing words of a particular length. Another solution is to use one matrix and pad the words to the length of the longest one.
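Both options can be sketched as follows (the padding symbol and helper names are assumptions of ours):

```python
PAD = " "  # an assumed padding symbol outside the letter alphabet

def pad_to(word, length):
    """One-matrix variant: complete a word to the length of the longest one."""
    return word.ljust(length, PAD)

def bucket_by_length(words):
    """Several-matrices variant: group words so each matrix stores one length."""
    buckets = {}
    for w in words:
        buckets.setdefault(len(w), []).append(w)
    return buckets

words = ["light", "along", "sight", "go", "among"]
longest = max(len(w) for w in words)
print([pad_to(w, longest) for w in words])
# -> ['light', 'along', 'sight', 'go   ', 'among']
print(bucket_by_length(words))
# -> {5: ['light', 'along', 'sight', 'among'], 2: ['go']}
```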
The conventional techniques used for pattern matching are designed to solve many types of text-searching problems. The technique based on CMM has some limitations and is suitable only for some of them. CMM is suitable for applications that need an efficient search tool in cases where the precision of the answer is not critical. The advantage of CMM is its ability to work with corrupt data. Another advantage of the technique is its fast processing.
We have shown that the coding of input patterns significantly affects the capacity of the correlation matrix memory. Simple "1 of N" coding does not give good results. We have proposed two coding schemes; both give good results because of their nearly uniform distribution. However, the "random shift" method does not preserve the ability of CMM to deal with corrupt patterns. The speed experiments show competitive results when compared to traditional fast techniques such as Boyer-Moore. The reason is a different approach to searching for the patterns: the CMM-based technique only recalls the pattern from the associative memory and takes its position from the position table. On the other hand, the CMM technique needs to be trained before recalling, and training processes the text linearly in its length.
In the future, we want to study and improve the coding of input patterns. Next, we want to apply this technique to the approximate-search problem. We also want to use a more advanced architecture (multiple CMMs) to obtain better processing results.