CS 584/684 S2014: String Matching

String Matching

Bart Massey

String Matching Problem

Given a "pattern string" s of length m, and a much longer "target string" t of length n, find occurrences of s in t.
Obvious applications in text searching; also, DNA analysis, fault analysis, etc
Since n might be as large as 10^12 and m might be as large as 10^4 in some problems, efficient algorithms are important

Topics

Naïve Algorithm
Rabin-Karp
DFA/KMP
Boyer-Moore

Review: Naïve Algorithm

We've already talked about this in class:

 for i <- 0 .. n - m - 1  
     for j <- 0 .. m - 1  
         if t[i + j] != s[j]  
             break  
     if j >= m  
         return i  
 fail

About as good as you can do if m = 1
O(mn) in general, so if m is large ...

Rabin-Karp "Rolling Hash" Algorithm

Idea: For each length-m substring t' of t, compute hash(t'). If hash(t') = hash(s), check to see if t' = *s

 h0 <- hash(s)  
 for i <- 0 .. n - m - 1  
     h <- hash(t[i]..t[i + m - 1])  
     if (h0 != h)  
         continue  
     for j <- 0 .. m - 1  
         if t[i + j] != s[j]  
             break  
     if j >= m  
         return i  
 fail

Stupid idea on the face of it, since same complexity is the same and constants are worse
Saving grace: Use a "rolling hash", that is, a hash that can be adjusted from the previous hash in O(1) time
- Now algorithm is O(n) in average case, and worst case O(mn) only occurs if the hash function is terrible

Rabin-Karp Rolling Hash Function

Note that for prime q
- (a + b) mod q = (a mod q + b mod q) mod q
- Easy corollary: (a * b) mod q = (a mod q * b mod q) mod q
- Yields fast algorithm (O(lg b)) for a^b mod q
The hash
- The dq-hash of a substring t[i]..t[i+m-1] is
  
  h = (d^(m-1)t[i] + d^(m-2)t[i+1] + .. + d^0t[i+m-1]) mod q
- To roll the hash forward, take
  
  h' = ((h - d^(m-1)t[i])d + t[i + m]) mod q*
- It is easy(?) to see that
  
  h' = dq-hash(t[i+1]..t[i+m])

Rabin-Karp Matcher

Now we can build our matcher

 hp <- dq-hash(s)  
 h <- dq-hash(t[0]..t[m-1])  
 for i <- m .. n - 1  
     if h = hp  
         for j <- 0 .. m - 1  
             if t[i + j - m] != s[j]  
                 break  
         if j >= m  
             return i  
     h <- roll(h, t[i-m], t[i])  
 fail

DFA Matcher

Imagine that we are trying to match s = "abc". Then the following pseudocode is sufficient

 j <- 0  
 for i <- 0 .. n - 1  
     if t[i] = s[j]  
         j <- j + 1  
     else  
         j <- 0  
     if j >= m  
         return i - m

Easy to see this code is O(n)
Catch: Doesn't work for matching s = "aab" against t = "aaab"
Problem: Self-similar pattern strings

Fixing the DFA Matcher

Funny little procedure from the book: identify what prefix of the match is still valid, add appropriate back edges

 for q <- 0 .. m  
     for each a in the alphabet  
         k <- min(m - 1, q + 2)  
         repeat  
             k = k - 1  
         until s[0]..s[q] . a has a prefix of s[0]..s[k]  
         delta(q, a) <- k  
 return delta

This is naive O(m^3), can get O(m) by memoization

 j <- 0  
 delta <- compute-delta(s)  
 for i <- 0 .. n - 1  
     j <- delta(j, t[i])  
     if j >= m  
         return i - m

Knuth-Morris-Pratt: Boring, since roughly a special case of DFA matcher

Boyer-Moore

Idea: Maybe we can avoid looking at every character in t!
Example: Matching s = "abcde" against t = "abcdf"
- The matchers we have seen so far search left-to-right, and fail only when hitting "f"
- If we match right-to-left, we will immmediately discover the mismatch
- Can now move forward to after "f" without looking at "abcd"!
- Gives O(n/m) matcher in this case
Problem: Self-similarity again. Match s = "aba" against t = "baba"
Solution is similar to DFA/KMP

Last modified: Tuesday, 29 April 2014, 9:54 AM