String Matching

String Matching

Bart Massey


String Matching Problem

  • Given a "pattern string" s of length m, and a much longer "target string" t of length n, find occurrences of s in t.

  • Obvious applications in text searching; also, DNA analysis, fault analysis, etc

  • Since n might be as large as 10^12 and m might be as large as 10^4 in some problems, efficient algorithms are important

Topics

  • Naïve Algorithm

  • Rabin-Karp

  • DFA/KMP

  • Boyer-Moore

Review: Naïve Algorithm

  • We've already talked about this in class:

     for i <- 0 .. n - m - 1  
         for j <- 0 .. m - 1  
             if t[i + j] != s[j]  
                 break  
         if j >= m  
             return i  
     fail
    
  • About as good as you can do if m = 1

  • O(mn) in general, so if m is large ...

Rabin-Karp "Rolling Hash" Algorithm

  • Idea: For each length-m substring t' of t, compute hash(t'). If hash(t') = hash(s), check to see if t' = *s

     h0 <- hash(s)  
     for i <- 0 .. n - m - 1  
         h <- hash(t[i]..t[i + m - 1])  
         if (h0 != h)  
             continue  
         for j <- 0 .. m - 1  
             if t[i + j] != s[j]  
                 break  
         if j >= m  
             return i  
     fail
    
  • Stupid idea on the face of it, since same complexity is the same and constants are worse

  • Saving grace: Use a "rolling hash", that is, a hash that can be adjusted from the previous hash in O(1) time

    • Now algorithm is O(n) in average case, and worst case O(mn) only occurs if the hash function is terrible

Rabin-Karp Rolling Hash Function

  • Note that for prime q

    • (a + b) mod q = (a mod q + b mod q) mod q
    • Easy corollary: (a * b) mod q = (a mod q * b mod q) mod q
    • Yields fast algorithm (O(lg b)) for a^b mod q
  • The hash

    • The dq-hash of a substring t[i]..t[i+m-1] is

      h = (d^(m-1)t[i] + d^(m-2)t[i+1] + .. + d^0t[i+m-1]) mod q

    • To roll the hash forward, take

      h' = ((h - d^(m-1)t[i])d + t[i + m]) mod q*

    • It is easy(?) to see that

      h' = dq-hash(t[i+1]..t[i+m])

Rabin-Karp Matcher

  • Now we can build our matcher

     hp <- dq-hash(s)  
     h <- dq-hash(t[0]..t[m-1])  
     for i <- m .. n - 1  
         if h = hp  
             for j <- 0 .. m - 1  
                 if t[i + j - m] != s[j]  
                     break  
             if j >= m  
                 return i  
         h <- roll(h, t[i-m], t[i])  
     fail
    

DFA Matcher

  • Imagine that we are trying to match s = "abc". Then the following pseudocode is sufficient

     j <- 0  
     for i <- 0 .. n - 1  
         if t[i] = s[j]  
             j <- j + 1  
         else  
             j <- 0  
         if j >= m  
             return i - m
    
  • Easy to see this code is O(n)

  • Catch: Doesn't work for matching s = "aab" against t = "aaab"

  • Problem: Self-similar pattern strings

Fixing the DFA Matcher

  • Funny little procedure from the book: identify what prefix of the match is still valid, add appropriate back edges

     for q <- 0 .. m  
         for each a in the alphabet  
             k <- min(m - 1, q + 2)  
             repeat  
                 k = k - 1  
             until s[0]..s[q] . a has a prefix of s[0]..s[k]  
             delta(q, a) <- k  
     return delta
    
  • This is naive O(m^3), can get O(m) by memoization

     j <- 0  
     delta <- compute-delta(s)  
     for i <- 0 .. n - 1  
         j <- delta(j, t[i])  
         if j >= m  
             return i - m
    
  • Knuth-Morris-Pratt: Boring, since roughly a special case of DFA matcher

Boyer-Moore

  • Idea: Maybe we can avoid looking at every character in t!

  • Example: Matching s = "abcde" against t = "abcdf"

    • The matchers we have seen so far search left-to-right, and fail only when hitting "f"
    • If we match right-to-left, we will immmediately discover the mismatch
    • Can now move forward to after "f" without looking at "abcd"!
    • Gives O(n/m) matcher in this case
  • Problem: Self-similarity again. Match s = "aba" against t = "baba"

  • Solution is similar to DFA/KMP

Last modified: Tuesday, 29 April 2014, 9:54 AM