String Matching
String Matching
Bart Massey
String Matching Problem
Given a "pattern string" s of length m, and a much longer "target string" t of length n, find occurrences of s in t.
Obvious applications in text searching; also, DNA analysis, fault analysis, etc
Since n might be as large as 10^12 and m might be as large as 10^4 in some problems, efficient algorithms are important
Topics
Naïve Algorithm
Rabin-Karp
DFA/KMP
Boyer-Moore
Review: Naïve Algorithm
We've already talked about this in class:
for i <- 0 .. n - m - 1 for j <- 0 .. m - 1 if t[i + j] != s[j] break if j >= m return i fail
About as good as you can do if m = 1
O(mn) in general, so if m is large ...
Rabin-Karp "Rolling Hash" Algorithm
Idea: For each length-m substring t' of t, compute hash(t'). If hash(t') = hash(s), check to see if t' = *s
h0 <- hash(s) for i <- 0 .. n - m - 1 h <- hash(t[i]..t[i + m - 1]) if (h0 != h) continue for j <- 0 .. m - 1 if t[i + j] != s[j] break if j >= m return i fail
Stupid idea on the face of it, since same complexity is the same and constants are worse
Saving grace: Use a "rolling hash", that is, a hash that can be adjusted from the previous hash in O(1) time
- Now algorithm is O(n) in average case, and worst case O(mn) only occurs if the hash function is terrible
Rabin-Karp Rolling Hash Function
Note that for prime q
- (a + b) mod q = (a mod q + b mod q) mod q
- Easy corollary: (a * b) mod q = (a mod q * b mod q) mod q
- Yields fast algorithm (O(lg b)) for a^b mod q
The hash
The dq-hash of a substring t[i]..t[i+m-1] is
h = (d^(m-1)t[i] + d^(m-2)t[i+1] + .. + d^0t[i+m-1]) mod q
To roll the hash forward, take
h' = ((h - d^(m-1)t[i])d + t[i + m]) mod q*
It is easy(?) to see that
h' = dq-hash(t[i+1]..t[i+m])
Rabin-Karp Matcher
Now we can build our matcher
hp <- dq-hash(s) h <- dq-hash(t[0]..t[m-1]) for i <- m .. n - 1 if h = hp for j <- 0 .. m - 1 if t[i + j - m] != s[j] break if j >= m return i h <- roll(h, t[i-m], t[i]) fail
DFA Matcher
Imagine that we are trying to match s = "abc". Then the following pseudocode is sufficient
j <- 0 for i <- 0 .. n - 1 if t[i] = s[j] j <- j + 1 else j <- 0 if j >= m return i - m
Easy to see this code is O(n)
Catch: Doesn't work for matching s = "aab" against t = "aaab"
Problem: Self-similar pattern strings
Fixing the DFA Matcher
Funny little procedure from the book: identify what prefix of the match is still valid, add appropriate back edges
for q <- 0 .. m for each a in the alphabet k <- min(m - 1, q + 2) repeat k = k - 1 until s[0]..s[q] . a has a prefix of s[0]..s[k] delta(q, a) <- k return delta
This is naive O(m^3), can get O(m) by memoization
j <- 0 delta <- compute-delta(s) for i <- 0 .. n - 1 j <- delta(j, t[i]) if j >= m return i - m
Knuth-Morris-Pratt: Boring, since roughly a special case of DFA matcher
Boyer-Moore
Idea: Maybe we can avoid looking at every character in t!
Example: Matching s = "abcde" against t = "abcdf"
- The matchers we have seen so far search left-to-right, and fail only when hitting "f"
- If we match right-to-left, we will immmediately discover the mismatch
- Can now move forward to after "f" without looking at "abcd"!
- Gives O(n/m) matcher in this case
Problem: Self-similarity again. Match s = "aba" against t = "baba"
Solution is similar to DFA/KMP