Loading [MathJax]/extensions/TeX/AMSsymbols.js

February 21, 2013

Knuth-Morris-Pratt and Boyer-Moore

For this post I made two scripts one for the Knuth-Morris-Pratt and other for Boyer-Moore, both in python.

Knuth-Morris-Pratt.

This algorithm searches in a text T the occurrences of the pattern P, this is made by employing the observation that when a mismatch occurs, the pattern shifts k positions in the text, k is the number of matches (T[i] == P[j]).

For example:

T: a b a c a a b a c c
P: a b a c a b

A mismatch occurs in T[6] and P[6] so k will be equal to 5 and we shift 5 spaces.


T: a b a c a a b a c c
P:              a b a c a b


This is the script:

import random
import sys
import time
def createWord(size):
word = ''
for i in range(size):
letter = chr(random.randint(97,101))
word+=str(letter)
return word
def createPattern(size):
pattern =''
for i in range(size):
letter = chr(random.randint(97,101))
pattern += str(letter)
return pattern
def kmp(text,pattern):
m = len(text)
n = len(pattern)
matchesPos = []
match = 0
pos = 0#current position in text
comp = 0 #comparisons
while pos < m:
skips = 0
#if length of pattern is greater than the rest of the text the function finishes
if n > len( text[pos:m] ):
return len(matchesPos)
for i in range(n):
if i+pos < m:
comp += 1
if text[i+pos] == pattern[i]:
match+=1
skips +=1
else:
break
if skips > 0:
pos += skips
else:
pos += 1
if skips == n:
matchesPos.append(match)
return comp#len(matchesPos)
def main():
textSize = int(sys.argv[1])
patSize = int(sys.argv[2])
text = createWord(textSize)
pattern = createPattern(patSize)
#text = sys.argv[1]
#pattern = sys.argv[2]
initTime = time.time()
comp = kmp(text,pattern)
endTime = time.time()
timed = endTime-initTime
print comp , ' ', timed
#print text.find(pattern)
if __name__ == "__main__":
main()
view raw kmp.py hosted with ❤ by GitHub

Boyer-Moore

This algorithm is based on two ideas:
  • comparing pattern characters to text from right to left
  • Use of two heuristics
    • bad-symbol that indicates how much to shift based on the text character that caused a mismatch
    • good-suffix that indicates how much to shift based on matched part (suffix) of the pattern


Bad-character
A mismatched character “B” from the text appears only in the beginning of the pattern. Thus we can simply shift the pattern to the right and align both characters B, skipping comparisons.


Good-Suffix

In this case it move the pattern thus the repeated portion must now align with its first occurrence in the pattern.

Again the shift must align the second portion with its first occurrence.

Now it align the left end of the pattern with the rightmost occurrence of “B”.


Here is the script

import random
import sys
def createWord(size):
word = ''
for i in range(size):
letter = chr(random.randint(97,101))
word+=str(letter)
return word
def createPattern(size):
pattern =''
for i in range(size):
letter = chr(random.randint(97,101))
pattern += str(letter)
return pattern
def boyerMoore(text, pattern):
'''lenTxt = length of the text
lenPat = length of the pattern'''
lenTxt = len(text)
lenPat = len(pattern)
#precompute
bch = badCharacter(text,pattern,lenTxt,lenPat)
gsuff = goodSuff(pattern,lenPat)
matchesPos = []
j = 0
comp = 0#comparisons
while (j <= lenTxt - lenPat):
i = lenPat-1
#for i in range(lenTxt-1,-1,-1):
while True:
comp +=1
if i>=0 and pattern[i] == text[i+j]:
i -= 1
else:
break
if i<0:
matchesPos.append(j)
j+= gsuff[0]
else:
j += max(gsuff[i], bch[text[i+j]]-lenPat+1+i)
print matchesPos, ' ', comp
return comp
def badCharacter(text,pattern,t,p):
'''Dictionary that tells us
'''
badChar = {}
for i in range(t-1):
if text[i] not in pattern:
badChar[text[i]] = p
for j in range(p-1):
badChar[pattern[j]] = p-j-1
#badChar.append(p-j-1)
return badChar
def suffixes(pattern, x):
a = x - 1
b = a
suff = []
for i in range(x):
suff.append(0)
suff[x-1] = x
for i in range(x-2,-1,-1):
if i > a and suff[i+x-1-b] < i-a:
suff[i]= suff[i+x-1-b]
else:
if i < a:
a = i
b = i
while (a>=0 and pattern[a] == pattern[a+x-1-b]):
a-=1
suff[i] = b - a
return suff
def goodSuff(pattern,x):
gSuff = []
s = suffixes(pattern,x)
for i in range(x):
gSuff.append(x)
for i in range(x-2,-1,-1):
if (s[i] == i+1):
for j in range(x-2):#x-1-i):
if gSuff[j] == x:
gSuff[j] = x-1-i
for i in range(x-1):
gSuff[x-1-s[i]] = x-1-i
return gSuff
def main():
#textSize = int(sys.argv[1])
#patSize = int(sys.argv[2])
#text = createWord(textSize)
#pattern = createPattern(patSize)
text = sys.argv[1]
pattern = sys.argv[2]
boyerMoore(text,pattern)
#print text.find(pattern)
if __name__ == "__main__":
main()
view raw boyer.py hosted with ❤ by GitHub
Since after doing this scripts I was little of time I just take some times


Conclusions

I made more tests I seeing the results I see that the Boyer-Moore algorithm is faster than Knuth-Morris-Pratt.

I have to say that sometimes Boyer-Moore crashes when I use big patterns, that is something I couldn't fix

1 comment:

  1. Me hubiera gustado ver un script para lo del experimento; 4 pts código. En el reporte esperaba gráficas y una discusión tipo O(...) de complejidad. Van 2 pts por el reporte.

    ReplyDelete