# String Matching

Bart Massey

### String Matching Problem

• Given a "pattern string" s of length m, and a much longer "target string" t of length n, find occurrences of s in t.

• Obvious applications in text searching; also, DNA analysis, fault analysis, etc

• Since n might be as large as 10^12 and m might be as large as 10^4 in some problems, efficient algorithms are important

### Topics

• Naïve Algorithm

• Rabin-Karp

• DFA/KMP

• Boyer-Moore

### Review: Naïve Algorithm

 for i <- 0 .. n - m - 1
for j <- 0 .. m - 1
if t[i + j] != s[j]
break
if j >= m
return i
fail

• About as good as you can do if m = 1

• O(mn) in general, so if m is large ...

### Rabin-Karp "Rolling Hash" Algorithm

• Idea: For each length-m substring t' of t, compute hash(t'). If hash(t') = hash(s), check to see if t' = *s

 h0 <- hash(s)
for i <- 0 .. n - m - 1
h <- hash(t[i]..t[i + m - 1])
if (h0 != h)
continue
for j <- 0 .. m - 1
if t[i + j] != s[j]
break
if j >= m
return i
fail

• Stupid idea on the face of it, since same complexity is the same and constants are worse

• Saving grace: Use a "rolling hash", that is, a hash that can be adjusted from the previous hash in O(1) time

• Now algorithm is O(n) in average case, and worst case O(mn) only occurs if the hash function is terrible

### Rabin-Karp Rolling Hash Function

• Note that for prime q

• (a + b) mod q = (a mod q + b mod q) mod q
• Easy corollary: (a * b) mod q = (a mod q * b mod q) mod q
• Yields fast algorithm (O(lg b)) for a^b mod q
• The hash

• The dq-hash of a substring t[i]..t[i+m-1] is

h = (d^(m-1)t[i] + d^(m-2)t[i+1] + .. + d^0t[i+m-1]) mod q

• To roll the hash forward, take

h' = ((h - d^(m-1)t[i])d + t[i + m]) mod q*

• It is easy(?) to see that

h' = dq-hash(t[i+1]..t[i+m])

### Rabin-Karp Matcher

• Now we can build our matcher

 hp <- dq-hash(s)
h <- dq-hash(t..t[m-1])
for i <- m .. n - 1
if h = hp
for j <- 0 .. m - 1
if t[i + j - m] != s[j]
break
if j >= m
return i
h <- roll(h, t[i-m], t[i])
fail


### DFA Matcher

• Imagine that we are trying to match s = "abc". Then the following pseudocode is sufficient

 j <- 0
for i <- 0 .. n - 1
if t[i] = s[j]
j <- j + 1
else
j <- 0
if j >= m
return i - m

• Easy to see this code is O(n)

• Catch: Doesn't work for matching s = "aab" against t = "aaab"

• Problem: Self-similar pattern strings

### Fixing the DFA Matcher

• Funny little procedure from the book: identify what prefix of the match is still valid, add appropriate back edges

 for q <- 0 .. m
for each a in the alphabet
k <- min(m - 1, q + 2)
repeat
k = k - 1
until s..s[q] . a has a prefix of s..s[k]
delta(q, a) <- k
return delta

• This is naive O(m^3), can get O(m) by memoization

 j <- 0
delta <- compute-delta(s)
for i <- 0 .. n - 1
j <- delta(j, t[i])
if j >= m
return i - m

• Knuth-Morris-Pratt: Boring, since roughly a special case of DFA matcher

### Boyer-Moore

• Idea: Maybe we can avoid looking at every character in t!

• Example: Matching s = "abcde" against t = "abcdf"

• The matchers we have seen so far search left-to-right, and fail only when hitting "f"
• If we match right-to-left, we will immmediately discover the mismatch
• Can now move forward to after "f" without looking at "abcd"!
• Gives O(n/m) matcher in this case
• Problem: Self-similarity again. Match s = "aba" against t = "baba"

• Solution is similar to DFA/KMP