## String Matching

# String Matching

**Bart Massey**

### String Matching Problem

Given a "pattern string"

*s*of length*m*, and a much longer "target string"*t*of length*n*, find occurrences of*s*in*t*.Obvious applications in text searching; also, DNA analysis, fault analysis, etc

Since

*n*might be as large as 10^12 and*m*might be as large as 10^4 in some problems, efficient algorithms are important

### Topics

Naïve Algorithm

Rabin-Karp

DFA/KMP

Boyer-Moore

### Review: Naïve Algorithm

We've already talked about this in class:

`for i <- 0 .. n - m - 1 for j <- 0 .. m - 1 if t[i + j] != s[j] break if j >= m return i fail`

About as good as you can do if m = 1

O(mn) in general, so if m is large ...

### Rabin-Karp "Rolling Hash" Algorithm

Idea: For each length-m substring

*t'*of*t*, compute*hash(t')*. If*hash(t') = hash(s)*, check to see if*t' = *s*`h0 <- hash(s) for i <- 0 .. n - m - 1 h <- hash(t[i]..t[i + m - 1]) if (h0 != h) continue for j <- 0 .. m - 1 if t[i + j] != s[j] break if j >= m return i fail`

Stupid idea on the face of it, since same complexity is the same and constants are worse

Saving grace: Use a "rolling hash", that is, a hash that can be adjusted from the previous hash in

*O(1)*time- Now algorithm is
*O(n)*in average case, and worst case*O(mn)*only occurs if the hash function is*terrible*

- Now algorithm is

### Rabin-Karp Rolling Hash Function

Note that for prime

*q**(a + b) mod q = (a mod q + b mod q) mod q*- Easy corollary:
*(a * b) mod q = (a mod q * b mod q) mod q* - Yields fast algorithm
*(O(lg b))*for*a^b mod q*

The hash

The

*dq*-hash of a substring*t[i]..t[i+m-1]*is*h = (d^(m-1)t[i] + d^(m-2)t[i+1] + .. + d^0t[i+m-1]) mod q*To roll the hash forward, take

*h' = ((h - d^(m-1)t[i])*d + t[i + m]) mod q*It is easy(?) to see that

*h' = dq-hash(t[i+1]..t[i+m])*

### Rabin-Karp Matcher

Now we can build our matcher

`hp <- dq-hash(s) h <- dq-hash(t[0]..t[m-1]) for i <- m .. n - 1 if h = hp for j <- 0 .. m - 1 if t[i + j - m] != s[j] break if j >= m return i h <- roll(h, t[i-m], t[i]) fail`

### DFA Matcher

Imagine that we are trying to match s = "abc". Then the following pseudocode is sufficient

`j <- 0 for i <- 0 .. n - 1 if t[i] = s[j] j <- j + 1 else j <- 0 if j >= m return i - m`

Easy to see this code is O(n)

Catch: Doesn't work for matching

*s = "aab"*against*t = "aaab"*Problem: Self-similar pattern strings

### Fixing the DFA Matcher

Funny little procedure from the book: identify what prefix of the match is still valid, add appropriate back edges

`for q <- 0 .. m for each a in the alphabet k <- min(m - 1, q + 2) repeat k = k - 1 until s[0]..s[q] . a has a prefix of s[0]..s[k] delta(q, a) <- k return delta`

This is naive

*O(m^3)*, can get*O(m)*by memoization`j <- 0 delta <- compute-delta(s) for i <- 0 .. n - 1 j <- delta(j, t[i]) if j >= m return i - m`

Knuth-Morris-Pratt: Boring, since roughly a special case of DFA matcher

### Boyer-Moore

Idea: Maybe we can avoid looking at every character in t!

Example: Matching

*s = "abcde"*against*t = "abcdf"*- The matchers we have seen so far search left-to-right, and fail only when hitting "f"
- If we match
*right-to-left*, we will immmediately discover the mismatch - Can now move forward to after "f"
*without looking at*"abcd"! - Gives O(n/m) matcher in this case

Problem: Self-similarity again. Match

*s = "aba"*against*t = "baba"*Solution is similar to DFA/KMP