Recipe5.13.Finding Subsequences

Recipe 5.13. Finding Subsequences

Credit: David Eppstein, Alexander Semenov

Problem

You need to find occurrences of a subsequence in a larger sequence.

Solution

If the sequences are strings (plain or Unicode), Python strings' find method and the standard library's re module are the best approach. Otherwise, use the Knuth-Morris-Pratt algorithm (KMP):

def KnuthMorrisPratt(text, pattern):     ''' Yields all starting positions of copies of subsequence 'pattern'         in sequence 'text' -- each argument can be any iterable.         At the time of each yield, 'text' has been read exactly up to and         including the match with 'pattern' that is causing the yield. '''     # ensure we can index into pattern, and also make a copy to protect     # against changes to 'pattern' while we're suspended by `yield'     pattern = list(pattern)     length = len(pattern)     # build the KMP "table of shift amounts" and name it 'shifts'     shifts = [1] * (length + 1)     shift = 1     for pos, pat in enumerate(pattern):         while shift <= pos and pat != pattern[pos-shift]:             shift += shifts[pos-shift]         shifts[pos+1] = shift     # perform the actual search     startPos = 0     matchLen = 0     for c in text:         while matchLen == length or matchLen >= 0 and pattern[matchLen] != c:             startPos += shifts[matchLen]             matchLen -= shifts[matchLen]         matchLen += 1         if matchLen == length: yield startPos

Discussion

This recipe implements the Knuth-Morris-Pratt algorithm for finding copies of a given pattern as a contiguous subsequence of a larger text. Since KMP accesses the text sequentially, it is natural to implement it in a way that allows the text to be an arbitrary iterator. After a preprocessing stage that builds a table of shift amounts and takes time that's directly proportional to the length of the pattern, each text symbol is processed in constant amortized time. Explanations and demonstrations of how KMP works can be found in all good elementary texts about algorithms. (A recommendation is provided in See Also.)

If text and pattern are both Python strings, you can get a faster solution by suitably applying Python built-in search methods:

def finditer(text, pattern):     pos = -1     while True:         pos = text.find(pattern, pos+1)         if pos < 0: break         yield pos

For example, using an alphabet of length 4 ('ACGU' . . .), finding all occurrences of a pattern of length 8 in a text of length 100000, on my machine, takes about 4.3 milliseconds with finditer, but the same task takes about 540 milliseconds with KnuthMorrisPratt (that's with Python 2.3; KMP is faster with Python 2.4, taking about 480 milliseconds, but that's still over 100 times slower than finditer). So remember: this recipe is useful for searches on generic sequences, including ones that you cannot keep in memory all at once, but if you're searching on strings, Python's built-in searching methods rule.