This software is licensed under the BSD 3-Clause License. Please refer to the separate LICENSE file for the exact text of the license. You are obligated to give attribution if you use this code.
pyxDamerauLevenshtein implements the Damerau-Levenshtein (DL) edit distance algorithm for Python in Cython for high performance. Courtesy Wikipedia:
In information theory and computer science, the Damerau-Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a "distance" (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters.
This implementation is based on Michael Homer's pure Python implementation, which implements the optimal string alignment (OSA) distance algorithm. It runs in O(N*M) time using O(M) space and supports Unicode characters.
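The OSA recurrence with O(M) rolling rows can be sketched in pure Python (an illustration only, not the library's actual Cython code; `osa_distance` is a hypothetical name):

```python
def osa_distance(a, b):
    """Optimal string alignment distance in O(len(a)*len(b)) time
    and O(len(b)) space, using rolling rows of the DP matrix."""
    prev = list(range(len(b) + 1))  # row for the empty prefix of a
    prev2 = []                      # row two iterations back (for transpositions)
    for i, ca in enumerate(a, 1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
            # transposition of two adjacent characters
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                curr[j] = min(curr[j], prev2[j - 2] + 1)
        prev2, prev = prev, curr
    return prev[-1]

osa_distance('smtih', 'smith')  # 1 (one transposition)
osa_distance('ca', 'abc')       # 3 (OSA; true Damerau-Levenshtein would be 2)
```

Because comparisons only use `==`, the same sketch works on Unicode strings and on arbitrary sequences such as lists of integers.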
This code requires Python 3.7+ and a C compiler such as GCC. Although the code was written in Cython, Cython is not required for installation.
pyxDamerauLevenshtein is available on PyPI at https://pypi.org/project/pyxDamerauLevenshtein/.
Install using pip:
pip install pyxDamerauLevenshtein
Install from source:
python setup.py install
or
pip install .
The following methods are available:

* Edit distance (`damerau_levenshtein_distance`)
* Normalized edit distance (`normalized_damerau_levenshtein_distance`), where the edit distance is normalized by the longer of the two sequences, `max(string1, string2)`. 0.0 means that the sequences are identical, while 1.0 means that they have nothing in common. Note that this definition is the exact opposite of `difflib.SequenceMatcher.ratio()`.
* Edit distance against a sequence of sequences (`damerau_levenshtein_distance_seqs`), where the sequences may be held in any ordered container (e.g., `list`, `tuple`).
* Normalized edit distance against a sequence of sequences (`normalized_damerau_levenshtein_distance_seqs`), where the sequences may be held in any ordered container (e.g., `list`, `tuple`).

Basic use:
```python
from pyxdameraulevenshtein import damerau_levenshtein_distance, normalized_damerau_levenshtein_distance
damerau_levenshtein_distance('smtih', 'smith')  # expected result: 1
normalized_damerau_levenshtein_distance('smtih', 'smith')  # expected result: 0.2
damerau_levenshtein_distance([1, 2, 3, 4, 5, 6], [7, 8, 9, 7, 10, 11, 4])  # expected result: 7

from pyxdameraulevenshtein import damerau_levenshtein_distance_seqs, normalized_damerau_levenshtein_distance_seqs
array = ['test1', 'test12', 'test123']
damerau_levenshtein_distance_seqs('test', array)  # expected result: [1, 2, 3]
normalized_damerau_levenshtein_distance_seqs('test', array)  # expected result: [0.2, 0.33333334, 0.42857143]
```
Other Python DL implementations:
pyxDamerauLevenshtein differs from other Python implementations in that it is both fast via Cython and supports unicode. Michael Homer's implementation is fast for Python, but it is two orders of magnitude slower than this Cython implementation. jellyfish provides C implementations for a variety of string comparison metrics and is sometimes faster than pyxDamerauLevenshtein.
Python's built-in `difflib.SequenceMatcher.ratio()` performs about an order of magnitude faster than Michael Homer's implementation but is still one order of magnitude slower than this DL implementation. Note, however, that difflib computes a different measure entirely (it uses the Ratcliff/Obershelp algorithm).
Performance differences (on an Intel i7-2600 running at 3.4 GHz):
>>> import timeit
>>> #this implementation:
... timeit.timeit("damerau_levenshtein_distance('e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7')", 'from pyxdameraulevenshtein import damerau_levenshtein_distance', number=500000)
7.417556047439575
>>> #Michael Homer's native Python implementation:
... timeit.timeit("dameraulevenshtein('e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7')", 'from dameraulevenshtein import dameraulevenshtein', number=500000)
667.0276439189911
>>> #difflib
... timeit.timeit("difflib.SequenceMatcher(None, 'e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7').ratio()", 'import difflib', number=500000)
135.41051697731018
@gfairchild this is how I implement installation + type stubs in my projects. It avoids the duplication from https://github.com/lanl/pyxDamerauLevenshtein/pull/34. The only duplicated file is the actual type hint, which can't be avoided when working with C extensions.
Added a stubs project for Python types as laid out in PEP 561. Resolves #31.
I tried to reproduce the benchmarks in your README, but my results are greatly different from the ones you report. Note that I scaled the benchmarks down from 500000 to 5000 iterations, since that is enough to get a good idea of the performance difference without spending all day running the benchmark.
On Python 3.9 I get:
```
import timeit
timeit.timeit("damerau_levenshtein_distance('e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7')", 'from pyxdameraulevenshtein import damerau_levenshtein_distance', number=5000)
0.30914585100072145
timeit.timeit("dameraulevenshtein('e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7')", 'from dameraulevenshtein import dameraulevenshtein', number=5000)
2.0448212559995227
timeit.timeit("difflib.SequenceMatcher(None, 'e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7').ratio()", 'import difflib', number=5000)
0.29983263299982355
```

and on Python 2.7:

```
import timeit
timeit.timeit("damerau_levenshtein_distance('e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7')", 'from pyxdameraulevenshtein import damerau_levenshtein_distance', number=5000)
0.4308760166168213
timeit.timeit("dameraulevenshtein('e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7')", 'from dameraulevenshtein import dameraulevenshtein', number=5000)
1.8721919059753418
timeit.timeit("difflib.SequenceMatcher(None, 'e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7').ratio()", 'import difflib', number=5000)
0.3515639305114746
```
So the performance differences in your benchmarks appear to be off by one order of magnitude for both pyxdameraulevenshtein vs. Michael Homer's implementation and pyxdameraulevenshtein vs. difflib. My best guess is that the Python version you created them on (the benchmarks have existed since the library supported Python 2.4+) was still significantly slower than more recent versions of Python. It would probably make sense to redo these benchmarks on a more recent version of Python.
I've written the needed .pyi files to type the library. It looks like it would take some pretty big changes to the package structure to make it work with side-by-side stubs. Would you prefer I keep trying to get it to work side by side? I could also make a stub-only package and give it to you to upload separately. Or would you prefer something else?
In python-Levenshtein there is a method called `editops` that returns the operations needed to transform the source into the destination:

editops('spam', 'park')
[('delete', 0, 0), ('insert', 3, 2), ('replace', 3, 3)]

I feel that this would be a useful method for the package.
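For reference, here is a plain-Python sketch of how such an `editops` could be computed for the Levenshtein case (no transpositions; the function name and `(op, source_pos, dest_pos)` tuple format follow python-Levenshtein's convention, but this is illustrative code, not part of the package):

```python
def editops(src, dst):
    """Edit operations turning src into dst, via a DP matrix backtrace."""
    n, m = len(src), len(dst)
    # build the full Levenshtein DP matrix so we can walk back through it
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == dst[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    # walk back from the bottom-right corner, recording operations
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i - 1] == dst[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1  # characters match: no operation
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            ops.append(('replace', i - 1, j - 1))
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ops.append(('insert', i, j - 1))
            j -= 1
        else:
            ops.append(('delete', i - 1, j))
            i -= 1
    ops.reverse()
    return ops

editops('spam', 'park')
# [('delete', 0, 0), ('insert', 3, 2), ('replace', 3, 3)]
```

The number of operations returned equals the edit distance; a Damerau-Levenshtein version would additionally emit a `transpose` operation when the backtrace crosses a transposition cell.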
Can we add a parameter like `threshold` or `limit`? If the distance exceeds the threshold, just stop the calculation and return. It would be really helpful in some cases, where you need to compare distances in real time and you don't want to waste time if the distance is too large.
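The idea can be sketched like this (hypothetical `bounded_levenshtein`, plain Levenshtein without transpositions): once every cell in the current DP row exceeds the threshold, the final distance can no longer come back under it, so the loop can bail out early:

```python
def bounded_levenshtein(a, b, max_dist):
    """Return the Levenshtein distance, or max_dist + 1 as soon as the
    distance is known to exceed max_dist (early exit)."""
    if abs(len(a) - len(b)) > max_dist:
        return max_dist + 1  # the length difference alone exceeds the bound
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        # row minima never decrease, so once the whole row is over the
        # threshold, the final distance must be too: stop early
        if min(curr) > max_dist:
            return max_dist + 1
        prev = curr
    return prev[-1]

bounded_levenshtein('kitten', 'sitting', 3)  # 3 (within the bound)
bounded_levenshtein('abc', 'xyz', 1)         # 2, i.e. max_dist + 1 (bailed out)
```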