We say that pattern $P$ occurs with shift $s$ in text $T$ (or, equivalently, that pattern $P$ occurs beginning at position $s+1$ in text $T$ ) if $0 \leqslant s \leqslant n - m$ and $T[s+1 \dots s+m] = P[1 \dots m]$ (that is, if $T[s+j] = P[j]$, for $1 \leqslant j \leqslant m$). If $P$ occurs with shift $s$ in $T$, then we call s a valid shift; otherwise, we call $s$ an invalid shift. The string-matching problem is the problem of finding all valid shifts with which a given pattern $P$ occurs in a given text $T$.
Algorithm | Preprocessing Time | Matching Time |
---|---|---|
Naive | $0$ | $O((n - m + 1)m) |
Rabin-Karp | $\Theta(m)$ | $O((n - m + 1)m) |
Finite automaton | $O(m|\Sigma|)$ | $\Theta(n)$ |
Knuth-Morris-Pratt | $\Theta(m)$ | $\Theta(n)$ |
Notations
Notation | Meaning |
---|---|
$\Sigma$ | Alphabet |
$\Sigma^*$ | Set of all finite-length strings |
$\varepsilon$ | Zero-length empty string |
$|x|$ | Length of string |
$xy$ | Concatenation of strings |
$w \sqsubset x$ | $w$ is prefix of $x$ |
$w \sqsupset x$ | $w$ is suffix of $x$ |
Lemma 32.1 (Overlapping-suffix lemma) Suppose that $x$, $y$ and $z$ are strings that $x \sqsupset z$ and $y \sqsubset z$. If $|x| < |y|$, then $x \sqsupset y$. If $|x| \geqslant |y|$, then $y \sqsupset x$. If $|x| = |y|$, then $x = y$.
32.1 The naive string-matching algorithm
- Time complexity: $O((n - m + 1)m)$,
- Worst-case running time: $\Theta((n - m + 1)m)$
32.2 The Rabin-Karp Algorithm
- Preprocessing: $\Theta(m)$,
- Worst-case: $\Theta((n - m + 1)m)$,
- Expected: $O(n) + O(m(v + n/q))$
32.3 String Matching with Finite Automata
Finite Automata
A finite automaton $M$, is a 5-tuple $(Q, q_0, A, \Sigma, \delta)$ where
- $Q$ is a finite set of states.
- $q_0 \in Q$ is the start state,
- $A \subseteq Q$ is a distinguished set of accepting states,
- $\Sigma$ is a finite input alphabet,
- $\delta$ is a function from $Q \times \Sigma$ into $Q$, called the transition function of $M$.
Whenever the current state $q$ is a member of $A$, the machine $M$ has accepted the string read so fad. An input that is not accepted is rejected.
A finite automation $M$ induces a function $\phi$, called the final-state function, from $\Sigma^*$ to $Q$ such that $\phi(w)$ is the state $M$ ends up in after scanning the string $w$. Thus, $M$ accepts a string $w$ if and only if $\phi(w) \in A$. We define the function $\phi$ recursively, using the transition function:
$$\phi(\varepsilon) = q_0,$$ $$\phi(wa) = \delta(\phi(w), a) \text{ for } w \in \Sigma^*, a \in \Sigma$$
String-matching Automata
In order to specify the string-matching automaton corresponding to a given pattern $P[1..m]$, we first define an auxiliary function $\sigma$, called the suffix function corresponding to $P$. The function $\sigma$ maps $\Sigma^*$ to ${0, 1, \dots, m}$ such that $\sigma(x)$ is the length of the longest prefix of $P$ that is also a suffix of $x$: $$\sigma(x) = \max {k: P_k \sqsupset x}$$
We define the string-matching automaton that corresponds to a given pattern $P[1..m]$ as follows:
- The state set $Q$ is ${0, 1, \dots, m}$. The start state $q_0$ is state $0$, and state $m$ is the only accepting state.
- The transition function $\delta$ is defined by the following equation, for any state $q$ and character $a$: $$\delta(q, a) = \sigma(P_qa)$$
Finite Automoton Mater Algorithm
Lemma 32.2 (Suffix-function inequality) For any string $x$ and character $a$, we have $\sigma(xa) \leqslant \sigma(x) + 1$.
Lemma 32.3 (Suffix-function recursion lemma) For any string $x$ and character $a$, if $q = \sigma(x)$, then $\sigma(xa) = \sigma(P_qa)$.
Theorem 32.4 If $phi$ is the final-state function of a string-matching automaton for a given pattern $P$ and $T[1..n]$ is an input text for the automaton, then $$\phi(T_i) = \sigma(T_i)$$ for $i = 0, 1, \dots, n$.
Computing the Transition Function
- Preprocessing: $O(m|\Sigma|)$,
- Matching: $\Theta(n)$.
32.4 The Knuth-Morris-Pratt Algorithm
Not required.
Loading Comments By Disqus