Recipe 5.13. Finding SubsequencesCredit: David Eppstein, Alexander Semenov ProblemYou need to find occurrences of a subsequence in a larger sequence. SolutionIf the sequences are strings (plain or Unicode), Python strings' find method and the standard library's re module are the best approach. Otherwise, use the Knuth-Morris-Pratt algorithm (KMP): def KnuthMorrisPratt(text, pattern): ''' Yields all starting positions of copies of subsequence 'pattern' in sequence 'text' -- each argument can be any iterable. At the time of each yield, 'text' has been read exactly up to and including the match with 'pattern' that is causing the yield. ''' # ensure we can index into pattern, and also make a copy to protect # against changes to 'pattern' while we're suspended by `yield' pattern = list(pattern) length = len(pattern) # build the KMP "table of shift amounts" and name it 'shifts' shifts = [1] * (length + 1) shift = 1 for pos, pat in enumerate(pattern): while shift <= pos and pat != pattern[pos-shift]: shift += shifts[pos-shift] shifts[pos+1] = shift # perform the actual search startPos = 0 matchLen = 0 for c in text: while matchLen == length or matchLen >= 0 and pattern[matchLen] != c: startPos += shifts[matchLen] matchLen -= shifts[matchLen] matchLen += 1 if matchLen == length: yield startPos DiscussionThis recipe implements the Knuth-Morris-Pratt algorithm for finding copies of a given pattern as a contiguous subsequence of a larger text. Since KMP accesses the text sequentially, it is natural to implement it in a way that allows the text to be an arbitrary iterator. After a preprocessing stage that builds a table of shift amounts and takes time that's directly proportional to the length of the pattern, each text symbol is processed in constant amortized time. Explanations and demonstrations of how KMP works can be found in all good elementary texts about algorithms. (A recommendation is provided in See Also.) If text and pattern are both Python strings, you can get a faster solution by suitably applying Python built-in search methods: def finditer(text, pattern): pos = -1 while True: pos = text.find(pattern, pos+1) if pos < 0: break yield pos For example, using an alphabet of length 4 ('ACGU' . . .), finding all occurrences of a pattern of length 8 in a text of length 100000, on my machine, takes about 4.3 milliseconds with finditer, but the same task takes about 540 milliseconds with KnuthMorrisPratt (that's with Python 2.3; KMP is faster with Python 2.4, taking about 480 milliseconds, but that's still over 100 times slower than finditer). So remember: this recipe is useful for searches on generic sequences, including ones that you cannot keep in memory all at once, but if you're searching on strings, Python's built-in searching methods rule. See AlsoMany excellent books cover the fundamentals of algorithms; among such books, a widely admired one is Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein, Introduction to Algorithms, 2d ed. (MIT Press). |