The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let () denote the average edit distance between random, independent strings of n characters from an alphabet of a given size k. An open problem is the exact value of ()=()/. While it is known that, for increasing n, () approaches a limit , the exact value of this limit is unknown, for any ≥2. This paper presents an upper bound to based on the exact computation of some () and a lower bound to based on combinatorial arguments on edit scripts. Statistical estimates of () are also obtained, with analysis of error and of confidence intervals. The techniques are applied to several alphabet sizes k. In particular, for a binary alphabet, the rigorous bounds are 0.1742≤2≤0.3693 while the obtained estimate is 2≈0.2888; for a quaternary alphabet, 0.3598≤4≤0.6318 and 4≈0.5180. These values are more accurate than those previously published.

Bounds and Estimates on the Average Edit Distance

Schimd M.;Bilardi G.
2019

Abstract

The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let () denote the average edit distance between random, independent strings of n characters from an alphabet of a given size k. An open problem is the exact value of ()=()/. While it is known that, for increasing n, () approaches a limit , the exact value of this limit is unknown, for any ≥2. This paper presents an upper bound to based on the exact computation of some () and a lower bound to based on combinatorial arguments on edit scripts. Statistical estimates of () are also obtained, with analysis of error and of confidence intervals. The techniques are applied to several alphabet sizes k. In particular, for a binary alphabet, the rigorous bounds are 0.1742≤2≤0.3693 while the obtained estimate is 2≈0.2888; for a quaternary alphabet, 0.3598≤4≤0.6318 and 4≈0.5180. These values are more accurate than those previously published.
2019
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
978-3-030-32685-2
978-3-030-32686-9
File in questo prodotto:
Non ci sono file associati a questo prodotto.
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11577/3319023
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 3
  • ???jsp.display-item.citation.isi??? ND
social impact