Jump to content

Commentz-Walter algorithm

From Wikipedia, the free encyclopedia

Incomputer science,theCommentz-Walter algorithmis astring searching algorithminvented byBeate Commentz-Walter.[1]Like theAho–Corasick string matching algorithm,it can search for multiple patterns at once. It combines ideas from Aho–Corasick with the fast matching of theBoyer–Moore string-search algorithm.For a text of lengthnand maximum pattern length ofm,its worst-case running time isO(mn), though the average case is often much better.[2]

GNUgreponce implemented a string matching algorithm very similar to Commentz-Walter.[3]

History

[edit]

The paper on the algorithm was first published byBeate Commentz-Walterin 1979 through the Saarland University and typed by "R. Scherner".[1]The paper detailed two differing algorithms she claimed combined the idea of theAho-CorasickandBoyer-Moorealgorithms, which she called algorithms B and B1. The paper mostly focuses on algorithm B, however.

How the Algorithm Works

[edit]
A sample Reversed Pattern Tree

The Commentz-Walter algorithm combines two known algorithms in order to attempt to better address the multi-pattern matching problem. These two algorithms are theBoyer-Moore,which addresses single pattern matching using filtering, and theAho-Corasick.To do this, the algorithm implements a suffix automaton to search through patterns within an input string, while also using reverse patterns, unlike in theAho-Corasick.[4]

Commentz-Walter has two phases it must go through, these being a pre-computing phase and a matching phase. For the first phase, the Commentz-Walter algorithm uses a reversed pattern to build a pattern tree, this is considered the pre-computing phase. The second phase, known as the matching phase, takes into account the other two algorithms. Using theBoyer-Moore’s technique of shifting and theAho-Corasick's technique of finite automata, the Commentz-Walter algorithm can begin matching.[4]

The Commentz-Walter algorithm will scan backwards throughout an input string, checking for a mismatch. If and when the algorithm does find a mismatch, the algorithm will already know some of the characters that are matches, and then use this information as an index. Using the index, the algorithm checks the pre-computed table to find a distance that it must shift, after this, the algorithm once more begins another matching attempt.

Time Complexity

[edit]

Comparing theAho-Corasickto the Commentz-Walter Algorithm yields results with the idea of time complexity.Aho-Corasickis considered linearO(m+n+k) where k is the number of matches. Commentz-Walter may be considered quadraticO(mn). The reason for this lies in the fact that Commentz-Walter was developed by adding the shifts within theBoyer–Moore string-search algorithmto theAho-Corasick,thus moving its complexity from linear to quadratic.

According to a study done in “The Journal of National Science Foundation of Sri Lanka 46” Commentz-Walter seems to be generally faster than theAho–Corasick string matching algorithm.This, according to the journal, only exists when using long patterns. However, the journal does state that there is no critical analysis on this statement and that there is a lack of general agreement on the performance of the algorithm.[5]

As seen in a visualization of the algorithm’s running time done in a study by “The International Journal of Advanced Computer Science and Information Technology” the performance of the algorithm increased linearly as the shortest pattern within the pattern set increased.[4]

Alternative Algorithm

[edit]

In the original Commentz-Walter paper, an alternative algorithm was also created. This algorithm, known as B1, operates similarly to the main Commentz-Walter algorithm with the only difference being in the way the pattern tree is used during the scanning phase.

The paper also claims this algorithm performs better at the cost of increasing the running time and space of both the preprocessing phase and search phase. This algorithm has not been formally tested in other studies however, so its actual performance is unknown.[1]

References

[edit]
  1. ^abcCommentz-Walter, Beate (1979).A String Matching Algorithm Fast on the Average(PDF).International Colloquium on Automata, Languages and Programming.LNCS.Vol. 71. Graz, Austria: Springer. pp. 118–132.doi:10.1007/3-540-09510-1_10.ISBN3-540-09510-1.Archived fromthe original(PDF)on 2017-10-10.
  2. ^Watson, Bruce William (1995-09-15).Taxonomies and toolkits of regular language algorithms.Eindhoven University of Technology.doi:10.6100/IR444299.ISBN90-386-0396-7.
  3. ^"grep: remove Commentz-Walter code".GNUgrep.2017-01-18.Retrieved2022-05-15.
  4. ^abcAkinul Islam Jony (2014)."Analysis of Multiple String Pattern Matching Algorithms"(PDF).International Journal of Advanced Computer Science and Information Technology (IJACSIT).3(4): 344–353.ISSN2296-1739.
  5. ^Dewasurendra, SD; Vidanagamachchi, SM (2018)."Average time complexity analysis of Commentz-Walter algorithm".Journal of the National Science Foundation of Sri Lanka.46(4): 547–557.doi:10.4038/jnsfsr.v46i4.8630.