Edit distance

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Because it is the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.^[1]

Types of edit distance

Different types of edit distance allow different sets of string operations. For instance:

The Levenshtein distance allows deletion, insertion and substitution.
The longest common subsequence (LCS) distance allows only insertion and deletion, not substitution.
The Hamming distance allows only substitution; hence, it only applies to strings of the same length.
The Damerau–Levenshtein distance allows insertion, deletion, substitution, and the transposition (swapping) of two adjacent characters.
The Jaro distance allows only transposition.

Some edit distances are defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.

Formal definition and properties

Given two strings $a$ and $b$ on an alphabet $Σ$ (e.g. the set of ASCII characters, the set of bytes [0..255], etc.), the edit distance $d(a, b)$ is the weight of a minimum-weight series of edit operations that transforms $a$ into $b$. One of the simplest sets of edit operations is that defined by Levenshtein in 1966:^[2]

Insertion of a single symbol. If $a = uv$, then inserting the symbol $x$ produces $uxv$. This can also be denoted ε→$x$, using ε to denote the empty string.
Deletion of a single symbol changes $uxv$ to $uv$ ($x$→ε).
Substitution of a single symbol $x$ for a symbol $y ≠ x$ changes $uxv$ to $uyv$ ($x$→$y$).

In Levenshtein's original definition, each of these operations has unit cost (except that substitution of a character by itself has zero cost), so the Levenshtein distance is equal to the minimum number of operations required to transform $a$ to $b$. A more general definition associates non-negative weight functions $w_{\mathrm{ins}}(x)$, $w_{\mathrm{del}}(x)$ and $w_{\mathrm{sub}}(x, y)$ with the operations.^[2]

Additional primitive operations have been suggested. Damerau–Levenshtein distance counts as a single edit a common mistake: transposition of two adjacent characters, formally characterized by an operation that changes $uxyv$ into $uyxv$.^[3]^[4] For the task of correcting OCR output, merge and split operations have been used, which replace a single character with a pair of them or vice versa.^[4]
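One common way to account for adjacent transpositions is to add a fourth case to the usual dynamic-programming table. The sketch below does this under unit costs; the function name is an illustrative assumption. Note that it computes the restricted variant often called optimal string alignment, in which no substring is edited more than once, so its result can exceed the unrestricted Damerau–Levenshtein distance.

```python
def osa_distance(a: str, b: str) -> int:
    """Unit-cost edit distance with adjacent transpositions (restricted variant)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete the first i symbols of a
    for j in range(n + 1):
        d[0][j] = j  # insert the first j symbols of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition: the last two symbols of each prefix are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

For example, `osa_distance("ab", "ba")` is 1, whereas plain Levenshtein distance would count two substitutions.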

Other variants of edit distance are obtained by restricting the set of operations. Longest common subsequence (LCS) distance is edit distance with insertion and deletion as the only two edit operations, both at unit cost.^[1]^: 37 Similarly, by only allowing substitutions (again at unit cost), Hamming distance is obtained; this must be restricted to equal-length strings.^[1] Jaro–Winkler distance can be obtained from an edit distance where only transpositions are allowed.
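Because substitution never changes the length of a string, the substitution-only case reduces to counting mismatched positions. A minimal sketch, with an illustrative function name and the equal-length restriction enforced by an exception (an assumption of this example):

```python
def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))
```

For instance, `hamming_distance("karolin", "kathrin")` is 3, since the strings differ at three positions.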

Example

The Levenshtein distance between "kitten" and "sitting" is 3. A minimal edit script that transforms the former into the latter is:

kitten → sitten (substitute "s" for "k")
sitten → sittin (substitute "i" for "e")
sittin → sitting (insert "g" at the end)

LCS distance (insertions and deletions only) gives a different distance and minimal edit script:

kitten → itten (delete "k" at 0)
itten → sitten (insert "s" at 0)
sitten → sittn (delete "e" at 4)
sittn → sittin (insert "i" at 4)
sittin → sitting (insert "g" at 6)

for a total cost/distance of 5 operations.
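The two edit scripts can be replayed mechanically to confirm that each one turns "kitten" into "sitting". The sketch below encodes every step as a hypothetical `(operation, index, character)` tuple with zero-based indices; this encoding is an assumption of the example, not a standard format. It checks that the scripts are valid, not that they are minimal.

```python
def apply_script(s, script):
    """Apply (op, index, char) steps in order; return all intermediate strings."""
    steps = [s]
    for op, i, ch in script:
        if op == "substitute":
            s = s[:i] + ch + s[i + 1:]
        elif op == "insert":
            s = s[:i] + ch + s[i:]
        elif op == "delete":
            s = s[:i] + s[i + 1:]
        else:
            raise ValueError(f"unknown operation: {op}")
        steps.append(s)
    return steps

# Three operations (Levenshtein) versus five (insertions and deletions only).
levenshtein_script = [("substitute", 0, "s"), ("substitute", 4, "i"), ("insert", 6, "g")]
lcs_script = [("delete", 0, None), ("insert", 0, "s"), ("delete", 4, None),
              ("insert", 4, "i"), ("insert", 6, "g")]

print(" -> ".join(apply_script("kitten", levenshtein_script)))
print(" -> ".join(apply_script("kitten", lcs_script)))
```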

Properties

Edit distance with non-negative cost satisfies the axioms of a metric, giving rise to a metric space of strings, when the following conditions are met:^[1]^: 37

Every edit operation has positive cost.
For every operation, there is an inverse operation with equal cost.

With these properties, the metric axioms are satisfied as follows:

$d(a, b) = 0$ if and only if $a = b$, since each string can be trivially transformed to itself using exactly zero operations.
$d(a, b) > 0$ when $a ≠ b$, since this would require at least one operation at non-zero cost.
$d(a, b) = d(b, a)$ by equality of the cost of each operation and its inverse.
Triangle inequality: $d(a, c) ≤ d(a, b) + d(b, c)$.^[5]

Levenshtein distance and LCS distance with unit cost satisfy the above conditions, and therefore the metric axioms. Variants of edit distance that are not proper metrics have also been considered in the literature.^[1]

Other useful properties of unit-cost edit distances include:

LCS distance is bounded above by the sum of lengths of a pair of strings.^[1]^: 37
LCS distance is an upper bound on Levenshtein distance.
For strings of the same length, Hamming distance is an upper bound on Levenshtein distance.^[1]

Regardless of cost/weights, the following property holds of all edit distances:

When $a$ and $b$ share a common prefix, this prefix has no effect on the distance. Formally, when $a = uv$ and $b = uw$, then $d(a, b) = d(v, w)$.^[4] This allows speeding up many computations involving edit distance and edit scripts, since common prefixes and suffixes can be skipped in linear time.
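The skipping step can be sketched as a small preprocessing helper that removes the shared prefix and then the shared suffix before any distance computation runs. The function name is an illustrative assumption.

```python
def strip_common_affixes(a: str, b: str):
    """Return (a, b) with their common prefix and common suffix removed."""
    limit = min(len(a), len(b))
    start = 0
    while start < limit and a[start] == b[start]:
        start += 1
    # The suffix scan must not overlap the prefix that was already consumed.
    end = 0
    while end < limit - start and a[len(a) - 1 - end] == b[len(b) - 1 - end]:
        end += 1
    return a[start:len(a) - end], b[start:len(b) - end]
```

For "kitten" and "sitten" this leaves only `("k", "s")`, so the quadratic part of the computation runs on single-character strings.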

Computation

The first algorithm for computing minimum edit distance between a pair of strings was published by Damerau in 1964.^[6]

Common algorithm

Using Levenshtein's original operations, the (nonsymmetric) edit distance from $a=a_{1}\ldots a_{m}$ to $b=b_{1}\ldots b_{n}$ is given by $d_{mn}$, defined by the recurrence^[2]

$$
\begin{aligned}
d_{00} &= 0, \\
d_{i0} &= \sum_{k=1}^{i} w_{\mathrm{del}}(a_{k}), && \quad \text{for}\; 1 \leq i \leq m \\
d_{0j} &= \sum_{k=1}^{j} w_{\mathrm{ins}}(b_{k}), && \quad \text{for}\; 1 \leq j \leq n \\
d_{ij} &= \begin{cases}
d_{i-1,j-1} & \text{for}\; a_{i} = b_{j} \\
\min \begin{cases}
d_{i-1,j} + w_{\mathrm{del}}(a_{i}) \\
d_{i,j-1} + w_{\mathrm{ins}}(b_{j}) \\
d_{i-1,j-1} + w_{\mathrm{sub}}(a_{i}, b_{j})
\end{cases} & \text{for}\; a_{i} \neq b_{j}
\end{cases} && \quad \text{for}\; 1 \leq i \leq m,\ 1 \leq j \leq n.
\end{aligned}
$$

This algorithm can be generalized to handle transpositions by adding another term in the recursive clause's minimization.^[3]
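The recurrence (without the transposition term) can be evaluated bottom-up by filling a table of the values $d_{ij}$. The following sketch passes the weight functions as ordinary callables; the function name and the unit-cost defaults are assumptions made for illustration.

```python
def weighted_edit_distance(
    a,
    b,
    w_del=lambda x: 1,
    w_ins=lambda x: 1,
    w_sub=lambda x, y: 1,
):
    """Evaluate the recurrence for d_mn with caller-supplied weights."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + w_del(a[i - 1])  # running sum of deletion weights
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + w_ins(b[j - 1])  # running sum of insertion weights
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(
                    d[i - 1][j] + w_del(a[i - 1]),
                    d[i][j - 1] + w_ins(b[j - 1]),
                    d[i - 1][j - 1] + w_sub(a[i - 1], b[j - 1]),
                )
    return d[m][n]
```

With the defaults this gives 3 for "kitten" and "sitting". Setting `w_sub=lambda x, y: 2` makes a substitution as expensive as a deletion plus an insertion, and the result becomes 5, matching the LCS distance. Unequal deletion and insertion weights illustrate why the distance is nonsymmetric in general.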

The straightforward, recursive way of evaluating this recurrence takes exponential time. Therefore, it is usually computed using a dynamic programming algorithm that is commonly credited to Wagner and Fischer,^[7] although it has a history of multiple invention.^[2]^[3] After completion of the Wagner–Fischer algorithm, a minimal sequence of edit operations can be read off as a backtrace of the operations used during the dynamic programming algorithm starting at $d_{mn}$.
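The backtrace can be sketched as follows for unit costs: the table is filled first, and the script is then recovered by walking from $d_{mn}$ back to $d_{00}$, at each cell choosing a predecessor that explains its value. The function name, the tuple encoding of operations and the tie-breaking order (substitution, then deletion, then insertion) are assumptions of this example; other minimal scripts may exist.

```python
def edit_script(a: str, b: str):
    """Return (distance, operations) for unit-cost Levenshtein distance."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)

    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1  # symbols match: no operation needed
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            ops.append(("substitute", a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("delete", a[i - 1]))
            i -= 1
        else:
            ops.append(("insert", b[j - 1]))
            j -= 1
    ops.reverse()  # the backtrace discovers operations from last to first
    return d[m][n], ops
```

For "kitten" and "sitting" this returns distance 3 with the script substitute "k"→"s", substitute "e"→"i", insert "g".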

This algorithm has a time complexity of Θ($mn$) where $m$ and $n$ are the lengths of the strings. When the full dynamic programming table is constructed, its space complexity is also Θ($mn$); this can be improved to Θ(min($m$, $n$)) by observing that at any instant, the algorithm only requires two rows (or two columns) in memory. However, this optimization makes it impossible to read off the minimal series of edit operations.^[3] A linear-space solution to this problem is offered by Hirschberg's algorithm.^[8]^: 634 A general recursive divide-and-conquer framework for solving such recurrences and extracting an optimal sequence of operations cache-efficiently in space linear in the size of the input is given by Chowdhury, Le, and Ramachandran.^[9]
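The two-row observation can be sketched as below for unit costs. The function name is illustrative; swapping the arguments so that the shorter string indexes the rows relies on the unit-cost distance being symmetric. Only the distance is returned, since the discarded rows are what a backtrace would need.

```python
def levenshtein_two_rows(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance using O(min(m, n)) extra space."""
    if len(b) > len(a):
        a, b = b, a  # make b the shorter string so each row is short
    prev = list(range(len(b) + 1))  # row 0: distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]  # column 0: delete the first i symbols of a
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution or match
            ))
        prev = curr
    return prev[-1]
```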

Improved algorithms

Improving on the Wagner–Fischer algorithm described above, Ukkonen describes several variants,^[10] one of which takes two strings and a maximum edit distance $s$, and returns min($s$, $d$). It achieves this by only computing and storing a part of the dynamic programming table around its diagonal. This algorithm takes time O($s$ × min($m$, $n$)), where $m$ and $n$ are the lengths of the strings. Space complexity is O($s^2$) or O($s$), depending on whether the edit sequence needs to be read off.^[3]
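The idea of restricting the computation to a diagonal band can be sketched as follows for unit costs. This is a simplified illustration rather than Ukkonen's published algorithm: it keeps only cells with |i − j| ≤ $s$, treats everything outside the band as having cost at least $s$, and clamps every stored value to $s$. The function name and the use of dictionaries for the band rows are assumptions of the example.

```python
def bounded_levenshtein(a: str, b: str, s: int) -> int:
    """Return min(s, d(a, b)) for unit-cost Levenshtein distance d."""
    if s < 0:
        raise ValueError("s must be non-negative")
    if len(a) > len(b):
        a, b = b, a  # unit costs are symmetric; iterate over the shorter string
    m, n = len(a), len(b)
    if n - m >= s:
        return s  # the length difference alone already costs at least s
    # Row 0 restricted to the band: d[0][j] = j for j <= s.
    prev = {j: j for j in range(min(n, s) + 1)}
    for i in range(1, m + 1):
        curr = {}
        for j in range(max(0, i - s), min(n, i + s) + 1):
            if j == 0:
                best = i
            else:
                best = min(
                    prev.get(j, s) + 1,                          # deletion
                    curr.get(j - 1, s) + 1,                      # insertion
                    prev.get(j - 1, s) + (a[i - 1] != b[j - 1]), # substitution or match
                )
            curr[j] = min(best, s)  # values of s or more are indistinguishable
        prev = curr
    return prev[n]
```

Each row holds at most 2$s$ + 1 entries, so the sketch does work proportional to $s$ × min($m$, $n$) and keeps two band rows in memory.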

Further improvements by Landau, Myers, and Schmidt give an O($s^2$ + max($m$, $n$)) time algorithm.^[11]

For a finite alphabet and edit costs which are multiples of each other, the fastest known exact algorithm is that of Masek and Paterson,^[12] which has a worst-case running time of O($nm$ / log $n$).

Applications

Edit distance finds applications in computational biology and natural language processing, e.g. the correction of spelling mistakes or OCR errors, and approximate string matching, where the objective is to find matches for short strings in many longer texts, in situations where a small number of differences is to be expected.

Various algorithms exist that solve related problems besides computing the distance between a pair of strings.

Hirschberg's algorithm computes the optimal alignment of two strings, where optimality is defined as minimizing edit distance.
Approximate string matching can be formulated in terms of edit distance. Ukkonen's 1985 algorithm takes a string $p$, called the pattern, and a constant $k$; it then builds a deterministic finite state automaton that finds, in an arbitrary string $s$, a substring whose edit distance to $p$ is at most $k$^[13] (cf. the Aho–Corasick algorithm, which similarly constructs an automaton to search for any of a number of patterns, but without allowing edit operations). A similar algorithm for approximate string matching is the bitap algorithm, also defined in terms of edit distance.
Levenshtein automata are finite-state machines that recognize a set of strings within bounded edit distance of a fixed reference string.^[4]
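To illustrate how approximate string matching reduces to edit distance, the sketch below uses a plain dynamic-programming scan instead of the automaton constructions named above. The only change from the pairwise computation is that the cost of an empty pattern prefix is 0 at every text position, so a match may start anywhere. The function name and the choice to report exclusive end positions are assumptions of the example.

```python
def approximate_match_ends(pattern: str, text: str, k: int):
    """Return end positions e such that some substring of text ending at e
    (exclusive) is within unit-cost edit distance k of pattern."""
    m = len(pattern)
    col = list(range(m + 1))  # distances against the empty text prefix
    ends = [0] if col[m] <= k else []
    for e, ch in enumerate(text, 1):
        new = [0]  # an empty pattern prefix matches at any position for free
        for i in range(1, m + 1):
            new.append(min(
                col[i] + 1,                         # skip a text symbol
                new[i - 1] + 1,                     # skip a pattern symbol
                col[i - 1] + (pattern[i - 1] != ch) # substitution or match
            ))
        col = new
        if col[m] <= k:
            ends.append(e)
    return ends
```

With `k = 0` the scan degenerates to exact search: `approximate_match_ends("abc", "xxabcxx", 0)` returns `[5]`.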

Language edit distance

A generalization of the edit distance between strings is the language edit distance between a string and a language, usually a formal language. Instead of considering the edit distance between one string and another, the language edit distance is the minimum edit distance that can be attained between a fixed string and any string taken from a set of strings. More formally, for any language $L$ and string $x$ over an alphabet $Σ$, the language edit distance $d(L, x)$ is given by^[14] $d(L,x)=\min _{y\in L}d(x,y)$, where $d(x,y)$ is the string edit distance. When the language $L$ is context free, there is a cubic time dynamic programming algorithm proposed by Aho and Peterson in 1972 which computes the language edit distance.^[15] For less expressive families of grammars, such as the regular grammars, faster algorithms exist for computing the edit distance.^[16]
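For a finite language given as an explicit collection of strings, the definition can be evaluated directly by taking the minimum over all members. The sketch below does exactly that with unit-cost Levenshtein distance; it is a brute-force illustration of the formula, not the Aho–Peterson algorithm, and it does not apply to infinite languages described by a grammar. Both function names are assumptions.

```python
def levenshtein(x: str, y: str) -> int:
    """Unit-cost Levenshtein distance, computed row by row."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        curr = [i]
        for j, cy in enumerate(y, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = curr
    return prev[-1]

def language_edit_distance(language, x: str) -> int:
    """d(L, x) = min over y in L of d(x, y), for a finite, non-empty L."""
    return min(levenshtein(x, y) for y in language)
```

For example, `language_edit_distance({"kitten", "mitten", "sitting"}, "sitten")` is 1, attained by "kitten" and "mitten".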

Language edit distance has found many diverse applications, such as RNA folding, error correction, and solutions to the Optimum Stack Generation problem.^[14]^[17]

References

[1] Navarro, Gonzalo (1 March 2001). "A guided tour to approximate string matching" (PDF). ACM Computing Surveys. 33 (1): 31–88. CiteSeerX 10.1.1.452.6317. doi:10.1145/375360.375365. S2CID 207551224. Retrieved 19 March 2015.
[2] Daniel Jurafsky; James H. Martin. Speech and Language Processing. Pearson Education International. pp. 107–111.
[3] Esko Ukkonen (1983). On approximate string matching. Foundations of Computation Theory. Springer. pp. 487–495. doi:10.1007/3-540-12689-9_129.
[4] Schulz, Klaus U.; Mihov, Stoyan (2002). "Fast string correction with Levenshtein automata". International Journal of Document Analysis and Recognition. 5 (1): 67–85. CiteSeerX 10.1.1.16.652. doi:10.1007/s10032-002-0082-8. S2CID 207046453.
[5] Lei Chen; Raymond Ng (2004). On the marriage of L_p-norms and edit distance (PDF). Proc. 30th Int'l Conf. on Very Large Databases (VLDB). Vol. 30. doi:10.1016/b978-012088469-8.50070-x.
[6] Kukich, Karen (1992). "Techniques for Automatically Correcting Words in Text" (PDF). ACM Computing Surveys. 24 (4): 377–439. doi:10.1145/146370.146380. S2CID 5431215. Archived from the original (PDF) on 2016-09-27. Retrieved 2017-11-09.
[7] R. Wagner; M. Fischer (1974). "The string-to-string correction problem". J. ACM. 21: 168–178. doi:10.1145/321796.321811. S2CID 13381535.
[8] Skiena, Steven (2010). The Algorithm Design Manual (2nd ed.). Springer Science+Business Media. Bibcode:2008adm..book.....S. ISBN 978-1-849-96720-4.
[9] Chowdhury, Rezaul; Le, Hai-Son; Ramachandran, Vijaya (July 2010). "Cache-oblivious dynamic programming for bioinformatics". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 7 (3): 495–510. doi:10.1109/TCBB.2008.94. PMID 20671320. S2CID 2532039.
[10] Ukkonen, Esko (1985). "Algorithms for approximate string matching" (PDF). Information and Control. 64 (1–3): 100–118. doi:10.1016/S0019-9958(85)80046-2.
[11] Landau; Myers; Schmidt (1998). "Incremental String Comparison". SIAM Journal on Computing. 27 (2): 557–582. CiteSeerX 10.1.1.38.1766. doi:10.1137/S0097539794264810.
[12] Masek, William J.; Paterson, Michael S. (February 1980). "A faster algorithm computing string edit distances". Journal of Computer and System Sciences. 20 (1): 18–31. doi:10.1016/0022-0000(80)90002-1. ISSN 0022-0000.
[13] Esko Ukkonen (1985). "Finding approximate patterns in strings". J. Algorithms. 6: 132–137. doi:10.1016/0196-6774(85)90023-9.
[14] Bringmann, Karl; Grandoni, Fabrizio; Saha, Barna; Williams, Virginia Vassilevska (2016). "Truly Sub-cubic Algorithms for Language Edit Distance and RNA-Folding via Fast Bounded-Difference Min-Plus Product" (PDF). 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS). pp. 375–384. arXiv:1707.05095. doi:10.1109/focs.2016.48. ISBN 978-1-5090-3933-3. S2CID 17064578.
[15] Aho, A.; Peterson, T. (1972-12-01). "A Minimum Distance Error-Correcting Parser for Context-Free Languages". SIAM Journal on Computing. 1 (4): 305–312. doi:10.1137/0201022. ISSN 0097-5397.
[16] Wagner, Robert A. (1974). "Order-n correction for regular languages". Communications of the ACM. 17 (5): 265–268. doi:10.1145/360980.360995. S2CID 11063282.
[17] Saha, B. (2014-10-01). The Dyck Language Edit Distance Problem in Near-Linear Time. 2014 IEEE 55th Annual Symposium on Foundations of Computer Science. pp. 611–620. doi:10.1109/FOCS.2014.71. ISBN 978-1-4799-6517-5. S2CID 14806359.
