Text preview for : CSL-76-1_A_Fast_String_Searching_Algorithm.pdf part of xerox CSL-76-1 A Fast String Searching Algorithm xerox parc techReports CSL-76-1_A_Fast_String_Searching_Algorithm.pdf



Back to : CSL-76-1_A_Fast_String_Se | Home

A FAST STRING SEARCHING ALGORITHM
BY ROBERT S. BOYER* AND J STROTHER MOORE**




CSL-76-1 JULY, 1976



An Algori thm is presented that searches for the location, i, of the first occurrence
of a character string, pat, in another string, string. During the search operation,
the characters of pat are matched starting wi t.h the last character of pat. The
informa t i on gained by starti ng t he rna teh a t the end of the pa tterll often a llows the
algorithm to proceed in largt~ jumps through the text being searched. Thus, the
algori thm has the ullusual property that, in most cases, not all of the first i
characters of string are inspected. The number of characters actually inspected
(on the average) decreases as a function of the length of pat. For a random
English pattern of length 5, the algori thm wi 11 typically inspect if 4 characters of
string before finding a match at i. Furthermore, the algorithm has been
implemented so that (on the average) fewer than i+patlen machine instructions
are executed. These conclusions are supported with empirical evidence and a
theoretical analysis of the average behavior of the algori thm. The worst case
behavior of the algori thm is linear in i+patlen, assumi ng the availabili ty of array
space for tables linear in patlen plus the size of the alphabet.


KEY wonns
bihliographic search, computational complexity, information retrieval, linear time
bound, pa ttern matching, text-edi ting


cn CATEGORIES

3.74, 4.40, 5.25


*Computer Science Group, Stanford Hesearch Institute, Menlo Park, Ca. 94025.
This work was partially supported by ONU Contract N00014-75-C-OBIB.

**Formerly at the Computer Science Laboratory, Xerox Palo Alto Research Center,
Palo Alto, Ca 94304, currently at Computer Science Group, Stanford Research
Institute, Menlo Park, Ca 94025.\
Introduction

Suppose that pat is a string of length patlen and we wish to find the
position, i, of the leftmost character in the first occurrence of pat in some
string string:
pat: AT-THAT
string: WHICH-FINALLY-IIALTS.--AT-TIIAT-POINT
i: t


The obvious search algorithm considers each character position of string
and determines whether the successive patlen characters of string starting
at that position match the successive patlen characters of pat. Knuth,
Morris, and Pratt [4] have observed that this algorithm is quadratic. That
is, in the worst case, the number of comparisons is on the order of
i*patlen 1.

Knuth, Morris, and Pratt have described a linear search algorithm which
preprocesses pat in time linear in patlen and then searches string in time
linear in i+patlen. In particular, their algorithm inspects each of the first
i+patlen-l characters of string precisely once.

We now present a search algorithm which is usually