Problem Statement:
Implement Rabin Karp hashing based Pattern-Matching Algorithm.
Example:
Text (T): rabing karp uses hashing technique
Pattern (P) : hashing
n : length of text
m : length of pattern
Download Source Code
Solution:
Rabin Karp Algorithm finds a match of the pattern in the text by using hashing. Match is found only if hash of Pattern and hash of text of m characters gives same result.
Our objective is to reduce complexity from O(m x n ) as was in case of Brute force. We will have to desigyn a hash function which take O(1) complexity for finding hash of the pattern.
We will consider Horner's rule to calculate the hash of the pattern and hash of text of
'm' length.
Let us denote pattern as polynomial
here we will call 'x' as radix , and above polynomial can also be represented as
So this was about calculating hash by iterative method using the simplified equation.
Now,
The pattern 'P' will match Text 'T' iff T[i , i+1 , .....i+m-1] = P [0,1,......, m-1 ]
ti : represents the hash value from T[i , .....i+m-1]
and ti+1 : represents the hash value from T[i+1 , .....i+m-1 , i+m]
Lets consider a Text containing all numbers as characters , i.e. Character set is {1,2,3,4,5,6,7,8,9,0} and
lets say radix = 10, therefore calculating the numerical value of "35678" which will be 35,678.
Therefore, p = P[m-1] + 10 { P[m-2] + 10 { P[m-3] + 10 { ...... + 10 { P[1] + 10 * P[0] } } } }
for "35678"
p = 8 + 7*10 + 6*102 + 5*103 + 3 *104
= 8 + 10 { 7 + 10 { 6 + 10 { 5 + 10 {3 + 10 * 0} } } }
we assume that to calculate this hash takes O(1) complexity , assume if ti is the hash code value of Text Window does not match with pattern P, we will have to calculate T(i +1 ) next , here we can leverage ti to calculate ti+1.
ti = T[i+m-1] + 10 { T[i+m-2] + 10 { T[i+m-3] +10 { ...... + 10 { T[i+1] + 10 * T[i] } } } }
Now,
ti+1 = 10 (ti – 10m-1 T[i]) + T[i+m]
N.B.: 10m-1 value can also be pre-computed.
Now consider 35678 does not match and next digit to be considered is 4, then
- Remove Most Significant digit 35678 - 3 * 10000 = 5678
- Multiply above value by 10 and add 4 , 5678*10 + 4 = 56784
Below is the implementation:
/**
* Rabin Karp uses hashing technique to search for Strings
* @author PRATEEK Rathore
*/
public class RabinKarp {
//Contant prime number
private static final int CONSTANT_PRIME = 101;
//Hash Value of the pattern which will remain constant through out the problem
private long pathash;
private int radix;
//Input pattern
private String pat;
//Pattern Length
private int patLen;
//Its the place value of Most Significant bit
private long RM1;
//Preprocessing of pattern in the constructor
public RabinKarp( String pat) {
radix = 256;
patLen = pat.length();
this.pat = pat;
RM1 = 1;
for (int i = 1; i < patLen; i++)
RM1 = (radix * RM1) % CONSTANT_PRIME;
//return hash value of pattern
pathash = hash(pat, patLen);
}
//Return hash value of given string from 0 to the input length
private long hash(String pat, int len) {
long h = 0;
for (int j = 0; j < len; j++)
h = (radix * h + pat.charAt(j)) % CONSTANT_PRIME;
return h;
}
/**
* Sub-routine called when hash-function is equal,
* character by character comparison is done.
*/
private boolean doesMatch(String text, int currIndex) {
for (int j = 0; j < patLen; j++)
if (pat.charAt(j) != text.charAt(currIndex + 1 + j))
return false;
return true;
}
/**
* Search for the pattern in the input string
* @param text
* @return: index where pattern is found in text
*/
public int search(String text) {
int textLen = text.length();
if (patLen > textLen)
return -1;
long textHash = hash(text, patLen);
for (int i = 0; i < textLen - patLen; i++) {
textHash = radix * (textHash - RM1 * text.charAt(i))% CONSTANT_PRIME +
CONSTANT_PRIME + text.charAt(i + patLen) % CONSTANT_PRIME;
if (textHash == pathash && doesMatch(text, i))
return i;
}
return -1;
}
public static void main(String[] args) {
String pat = "prateek";
String txt = "Welcome prateek to Bangalore";
RabinKarp rabin = new RabinKarp( pat);
int index= rabin.search(txt);
System.out.println("Text found at "+ index + ": " + txt.substring(index, index+ pat.length()+1));
}
}
Time Complexity:
Worst Case is : O(m ( n- m+1 ))
Best and Average Case is : O (m + n )
References:
1. http://en.wikipedia.org/wiki/Horner's_method
Please comment and post your suggestions.
Happy Coding !! :)