Talk:Damerau–Levenshtein distance

From Wikipedia, the free encyclopedia
WikiProject Computer science (Rated Start-class, Low-importance)
WikiProject icon This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Start-Class article Start  This article has been rated as Start-Class on the project's quality scale.
 Low  This article has been rated as Low-importance on the project's importance scale.
 

Technical Tag

I added the Technical tag and the context tags to this. The reason is that this is highly technical. That is not necessarily a problem but at least the introduction should say something less technical. If I read it correctly, then the introduction should say something like: Damerau-Levenshtein distance is a distance calculated for human misspellings. The distance calculated is useful for computer software spelling checkers. Note that this is useful only for languages that use alphabetic symbols and not those that use logographic or syllabic symbols. Fanra 01:06, 23 April 2007 (UTC)


Question: are the tranpositions restricted to consecutive characters or any two characters within the strings? —Preceding unsigned comment added by 141.225.10.68 (talk) 15:59, 24 April 2008 (UTC)

Consecutive characters in this case. I believe one of the other algorithms on this page allows for arbitrary transpositions.

Plikarish (talk) 21:04, 4 February 2010 (UTC)


There is an error in this algorithm. String indices start at 0, but the algorithm starts string comparision at index = 1. The first character will not be tested and when i = lenStr1, str1[i] will be out of range. —Preceding unsigned comment added by 68.15.100.248 (talkcontribs) 28 June 2006

There's no error — string indices start at 1, and are properly swept by the main loop. It's the d array which is indexed from zeros. The 0-th row and column are filled by two separate loops, so they can be accessed with d[i-1, j] and d[i, j-1] expressions in main loop. --CiaPan 14:33, 29 March 2007 (UTC)
for j from 1 to lenStr2 should be for j from 0 to lenStr2 See [1] —Preceding unsigned comment added by 173.21.33.192 (talk) 01:28, 3 February 2010 (UTC)
Which one? There are two 'for j' loops... --CiaPan (talk) 11:13, 3 February 2010 (UTC)

I implemented the algo and it took me awhile to figure out what I was doing wrong. It was due to the fact that the algo assumes strings are indexed starting at 1. I think it should be explicitly stated for the reader's benefit that string indices start at 1. (Many (most?) languages treat strings as char arrays which will be 0 indexed. It's a simple fix either to change the algo or to include it as a comment. I've added a comment to this effect on the page. Plikarish (talk) 21:08, 4 February 2010 (UTC)

The routine header states explicitly: char str1[1..lenStr1] and similar for str2. This defines the input, so it is an important part of the whole algorithm definition. I can't help if you read just a random part of the definition, omitting necessary prerequisities. Do read it whole, and you'll understand the algorithm properly. Then you will be able to implement it correctly. --CiaPan (talk) 10:02, 5 February 2010 (UTC)

The last sentence:

An extension of the edit distance algorithm, that does satisfy a triangle inequality is described in the paper: F.J. Damerau. A technique for computer detection and correction of spelling errors, Communications of the ACM, 1964

actually links to this paper:

Source Journal of the ACM (JACM) archive
Volume 22 , Issue 2 (April 1975) table of contents
Pages: 177 - 183
Year of Publication: 1975
ISSN:0004-5411
Authors:
Robert A. Wagner Department of Systems and Information Sciences, Vanderbilt University, Nashville, TN
Roy Lowrance 255 West Squire Drive, Rochester, NY and Vanderbilt University, Nashville, Tennessee
Publisher ACM Press New York, NY, USA

I suspect that this is in fact the intended paper and that the sentence just needs to be re-written. —Preceding unsigned comment added by 151.193.220.27 (talkcontribs) 9 March 2007


It seems to me that the comment

   // note that d is zero-indexed, while a and b are one-indexed.

is wrong: the 2 nested loops index `a` and `b` from 0 to their length-1 included.

   for i := 1 to length(a) inclusive do
       for j := 1 to length(b) inclusive do
           if a[i-1] = b[j-1] then

-- Ambrevar (talk) 11:18, 15 February 2017 (UTC)

@Ambrevar: That error was a result of edit by Bawalou on 25 July 2016 (Special:Diff/731459024). I have reverted it. --CiaPan (talk) 12:28, 15 February 2017 (UTC)

Original research

Please someone check these edits

http://en.wikipedia.org/w/index.php?title=Damerau%E2%80%93Levenshtein_distance&diff=285208890&oldid=284007942

--CiaPan (talk) 18:16, 23 April 2009 (UTC)

I don't think that's what Wikipedia is for, is it? The part I deleted is pasted here for posterity. If it's original research, it doesn't belong in the article, but I'm struggling even to see the relevance of an allegedly superior algorithm, except perhaps as a link in the 'See Also' section. SeanCollins (talk) 05:26, 11 December 2009 (UTC)

Transposition using Lawrence logic:

    for (int swapcheck = 2; swapcheck <= Math.Max(string1.Length, string2.Length); swapcheck++) 
    {
        if ((sNewIdx > swapcheck)
             && (sOldIdx > swapcheck - 1)
             && (sNew[sNewIdx - 1] == sOld[sOldIdx - swapcheck])
             && (sNew[sNewIdx - swapcheck] == sOld[sOldIdx - 1]))
        {
            matrix[sNewIdx, sOldIdx] = (matrix[sNewIdx - 1, sOldIdx - 1] - 1);
        }
    }

where:

  • sNewIdx is the index value of string 1
  • sOldIdx is the index value of string 2

Kindly check out with the matrix format Levenshtein code link available below The above method is the result of experimentation for about one month, it gives lesser edit distance cost for string manipulation.

Please don't delete code. Read the paper.

If no one checked on the edits, that doesn't mean you should delete. Please read the paper. I have just reverted. Having actual code on the Wikipedia page is very useful. If you don't believe me, look at the 10 links at the bottom -- everybody is saying that they implemented the code based on the Wikipedia page. Translating code does not count as original research. It is equivalent to adding to an English Wikipedia article based on a French reference article. Otherwise, my code was a almost-verbatim translation from the paper pseudocode into ActionScript 3.0. If you're not happy with the language, then translate the code into a language you prefer, although you should have a good reason why your language of choice make the article better in some way. Please do not delete the code again. I will revert, and try to get an editor to moderate.Gabiteodoru (talk) 02:39, 14 October 2010 (UTC)

You got it is exactly vice versa. If they implemented code based on wikipedia, this is not a valid reference. Only sources from recognized experts are allowable, not from random geeks and hackers. Adding an English article based on an unreferenced French article is against the rule of wikipedia about verifiability and may be rightfullty turned into a verifiable stub. Please find a rule in wikipedia which says that translation of code is not original research. By the way, you don't provide reference where it is translated from. Last Lost (talk) 23:41, 25 October 2010 (UTC)
The reference for the code is clearly given in reference #4. Please read the article carefully and the references before making deletions.
You are wrong about the translation policy; WP:NONENG clearly states: "Because this is the English Wikipedia, English-language sources are preferred over non-English ones, unless no English sources of equal quality and relevance are available."
When I cited others' implementation of the code, I did it as evidence of relevance, not as a reference; keep in mind that Service to the Public is also a Key Wikimedia Value. Please read my comments carefully.
I am unable to find a rule about code translation; please provide a link to the rule if it exists, or do not use that argument.
I will wait for your reply, or 3 days, whichever comes first, before reverting.Gabiteodoru (talk) 03:39, 26 October 2010 (UTC)

Code

Unsourced text is disallowed in wikipedia. There is no way to prove that the code is correct. And it is not a job of wikipedia to do that. Proofs of correctness of any wikipedia statements are only through references. Please don't restore the deleted text without providing references. Last Lost (talk) 23:36, 25 October 2010 (UTC)

  • I used this code myself to learn about the algorithm. I completely fail to see the gain by removing it on principle, and consider its removal a net loss for Wikipedia and a degradation of the end-user experience. So maybe you're winning on principle here, but everyone else is losing. Why not put a "citation needed" in there instead of axing it? Beej71 (talk) 02:10, 27 October 2010 (UTC)
I do have citations for both pieces of code. So I'm still trying to figure out why they were removed :) Hopefully will get it back up soon. You can check the history in the mean time (the one last edited by me) Gabiteodoru (talk) 02:35, 27 October 2010 (UTC)
Sorry. I didn't notice the citations, since they were well above the code itself. Last Lost (talk) 22:45, 3 November 2010 (UTC)

I'm in the process of implementing a Java version of the C# code provided, and I have to say it's awful. No comments to explain the algorithm, and useless tiny variable names (sd for the SortedDictionary - I already know it's a SortedDictionary, tell me what's in it! similarly DB, i1, and j1 are just as useless). By reference to the base Levenshtein pseudo-code one can figure out most of it, but the page would be greatly improved by having a better implementation or better pseudo-code (note that the current pseudo-code is labeled as being for OSA, not the full Damerau-Levenshtein). — Preceding unsigned comment added by 193.34.186.98 (talk) 13:41, 29 July 2011 (UTC)

I know what you mean. I'm currently trying to implement it in C#, i.e. not even translating it to some other language. The problem is, I don't want to just copy the code, but actually understand it. I'll report back after figuring out what the various parts do and what the variables represent. --Wormbo 18:41, 6 January 2012 (UTC)

I think all the code should be deleted. Way too long for an encyclopedia. I believe there are other Wikimedia projects where it would be a better fit. 5.12.181.134 (talk)

Pseudo-code of function damerauLevenshteinDistance

Who can help me with this specific piece of (pseudo-)code?

DA[a[i-1]] = i;

What I understand:

  • a is an array of characters.
  • a[i-1] is the character at position i-1.
  • that is, DA is indexed by characters? But a few lines above, DA is initialized with integer indices:

for(var d:uint = 0; d<C; ++d) DA[d]=0; — Preceding unsigned comment added by 188.154.155.131 (talk) 10:04, 4 June 2011 (UTC)

uh it's a really confusing algorithm; characters are represented by integers; read the paper if you want to understand itGabi Teodoru 22:11, 23 June 2011 (UTC) — Preceding unsigned comment added by Gabiteodoru (talkcontribs)


Definition vs Pseudo Code (Conflict)

The definition has clearly specified that Damerau Levenshtein Distance allows transpose, whereas the pseudocode in the next section presents more restricted version of the same. The definition and the pseudo code both should point to same thing. This is conflicting. — Preceding unsigned comment added by 2620:10D:C082:1054:2ACF:E9FF:FE1E:7E97 (talk) 21:35, 7 May 2015 (UTC)

Missing ref?

What source is "{{ref|itman}}" at the bottom of the "Algorithm" section referring to? That note doesn't exist in the "References" section. - dcljr (talk) 23:59, 8 February 2012 (UTC)

Yes, the note exists at the very end of "External links" as 'Site devoted to fuzzy searching and information retrieval'. When browsing the article's source, look for {{note|itman}}. --CiaPan (talk) 21:06, 11 February 2012 (UTC)

The terms optimal string alignment and restricted edit distance

I wonder whether the terms optimal string alignment and restricted edit distance originate from scientific literature. Both terms are used in the Wikipedia article for describing the first algorithm since the first version of the article from 2005, but I cannot find older literature which uses these terms. In particular, I cannot find a hint on this terminology in the cited paper of Oommen and Loke, which uses this algorithm for handling generalized transpositions. Only some publications from recent years adopted the unverified terminology from Wikipedia.

Moreover, the concept of alignment does not use transpositions at all in my understanding. In fact, optimal string alignment redirects to Levenshtein distance, and the article on sequence alignment does not consider transpositions.

The terms optimal string alignment and restricted edit distance should be removed (and maybe replace by an original name from scientific literature), if they cannot be verified. A more general question is whether the according algorithm has some purpose where it is more suitable than the “true” Damerau–Levenshtein distance or whether it is just a wrong algorithm for the the Damerau–Levenshtein distance.

-- Chwinter (talk) 20:05, 1 May 2013 (UTC)

Restricted edit sequence and restricted edit distance are used by Ukkonen in his 1985 paper (Algorithms for Approximate String Matching) where he considers generic editing operation sets and generic cost functions (Sect. 4). Probably Ukkonen was the first one who introduced these terms. Hence I am fine with using these terms in the Wikipedia article. But note that Ukkonen used the terms in a more general context. Maybe it is best to use the phrase restricted version of the Damerau–Levenshtein distance together with a reference to Ukkonen's paper.

There is only one problem which arises in Ukkonen's discussion of restricted edit distances: Although he knows the Lowrance–Wagner algorithm, he states that the restricted and the unrestricted version coincide under some natural conditions, which are satisfied in the Damerau–Levenshtein case. This statement (basically Theorem 6) is disproved by the counter-example in the Wikipedia article. However, I would prefer to cite a scientific paper which clarifies this error. Is there such a paper?

I still do not see literature references for optimal string alignment. Hence I still vote for removing this term.

-- Chwinter (talk) 13:33, 12 May 2013 (UTC)

Ukkonen's paper from 1985 states that “the restricted edit distance is a natural measure of similarity” “in some applications in error correction and information retrieval” (p. 115). Hence we have at least a small citable statement on the utility of the restricted edit distance. This can be inserted into the section Applications.

-- Chwinter (talk) 19:59, 4 June 2013 (UTC)

Inline references

Any objection to converting the reference style to a more conventional <ref>....</ref>? That way arbitrary tags do not need to be assigned. Glrx (talk) 04:48, 5 June 2013 (UTC)

Where does the name "Damerau–Levenshtein distance" originate?

Having read quite a bit of string algorithms literature, I've got the idea that the name "Damerau–Levenshtein distance" is a Wikipedia invention. The idea of amending Levenshtein distance by allowing transpositions appears (without reference to Damerau, or any reference for that matter) in, e.g., Ukkonen 1983, but I've never seen the particular name used by this article in any peer-reviewed article. QVVERTYVS (hm?) 21:17, 25 November 2013 (UTC)

I don't know, but Wolfram seems to have adopted the term. http://reference.wolfram.com/mathematica/ref/DamerauLevenshteinDistance.html It may not have an academic source, but it may be the common name. Glrx (talk) 01:36, 27 November 2013 (UTC)

Here are some examples of the use of the term from the string algorithms literature:

  • "Previous error models have all been based on Damerau-Levenshtein distance measures (Damerau 1964; Levenshtein 1966)" Brill and Moore (2000). "An Improved Error Model for Noisy Channel Spelling Correction" (PDF). Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. pp. 286–293. doi:10.3115/1075218.1075255. 
  • "We show how a dictionary can be used with the Damerau-Levenshtein string-edit distance metric to construct a case-insensitive pass-phrase system." Bard (2007). "Spelling-error tolerant, order-independent pass-phrases via the Damerau-Levenshtein string-edit distance metric". Proceedings of the fifth Australasian symposium on ACSW frontiers. 68. pp. 117–124. 
  • "This direction ever evolves from simple Damerau-Levenshtein distance (Damerau, 1964; Levenshtein, 1966) to probabilistic models..." Li; et al. (2006). "Exploring distributional similarity based models for query spelling correction" (PDF). Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. pp. 1025–1032. doi:10.3115/1220175.1220304. 

Accordingly, I removed the "citation needed". Gdr 11:57, 11 December 2013 (UTC)

Algorithms in External links

Rostepher added an external link to some code he put up on github today. A related link was previously removed by Stasek Lem.[2]

I have just spent a lot of time reviewing what happened and looking for information that this article once had.

Rostepher's code runs up against WP:COI, WP:OR and WP:SYN. I'm willing to let that slide on code implementations (WP:CALC), but the linked code is broken and does not show an understanding of what is going on. See, for example, the lines

// still deciphering the alogrithm from the cryptic vairable names...
unsigned db; // no idea what this is

The purpose is the same as in the OSA version. A value for db is calculated but not used (which should trigger a compiler warning). The code itself is not strictly C. The code claims origins from some stackoverflow.com postings, so there may be copyright issues. I see lots of trouble

Consequently, I removed the link to Rostepher's code.

Also troubling is that DL transposition code on the web often references an ActionScript implementation in this WP article. That implementation that was removed from the article in November 2013 as original research.[3] I have restored the ActionScript implementation but not the Csharp transliteration. The code appears to be developed from the original paper. I haven't gone further back in the history to find the original author.

I'm tempted to blow away most if not all of the ELs because the current article already has code implementations and would not contain implementations in more than one language (WP is not a code repository). I'm not going to take the time to go through them now.

I'm also concerned about a WP:COPYLINK violation with the URL for Levenshtein, reference 5. The URL appears to be a home directory rather than Doklady or Springer or author posting.

Glrx (talk) 05:30, 1 July 2014 (UTC) (modified Glrx (talk) 14:10, 1 July 2014 (UTC))

I just removed all the ELs. People looking for implementations should just search for them. WP cannot vouch for the quality of such links. We also have to rewrite the ActionScript code into pseudocode, or just remove it altogether. QVVERTYVS (hm?) 08:16, 1 July 2014 (UTC)
I made the distance matrix allocation implicit rather than explicit. The allocation was implicit in the OSA algorithm.
I reordered some lines (initialize DA first), made some single line for loops multiline, and added some comments.
I'm leaving it there for now, but I plan some other changes later:
make distance matrix use H[-1..a.length][-1..b.length] (match paper)
only calculate transpose distance if i1 != 0 and j1 != 0 (parallel the OSA algorithm)
Glrx (talk) 05:13, 4 July 2014 (UTC)

OSA psuedocode superfluous transpositions

The OSA pseudocode has one issue in that it will carry out superfluous transpositions when the source and destination contain two identical characters next to each other such as "Kitten" and "Sitting" which the algorithm as written will return a distance of 4, when the actual distance is 3. Amending the code line

  if(i > 1 and j > 1 and str1[i] = str2[j-1] and str1[i-1] = str2[j]) then

to

  if(i > 1 and j > 1 and str1[i] = str2[j-1] and str1[i-1] = str2[j] and str1[i-1] <> str1[i]) then

prevents such superfluous transpositions at the cost of an additional comparison. — Preceding unsigned comment added by PaulBBSRC (talkcontribs) 23:59, 3 November 2014 (UTC)

Algortihm computes the cost of the transposition but then takes the min. The change is not needed. Glrx (talk) 16:58, 8 November 2014 (UTC)

A real distance called OSA

I would just want to stress that there exists in the literature a real distance for sequences of symbols called OSA (for Optimal Symbol Alignment). It is a true metric (satisfying the triangular inequality), in contrast to the OSA measure that is described in this Wikipedia entry. The reference to this paper is:

J. Herranz, J. Nin and M. Solé. Optimal Symbol Alignment distance: a new distance for sequences of symbols. IEEE Transactions on Knowledge and Data Engineering, Volume 23, Issue 10, pp. 1541-1554, 2011.
( see: http://www.computer.org/portal/web/csdl/doi/10.1109/TKDE.2010.190 )

Maybe it would be a good idea to add this reference and some related sentence to this Wikipedia entry. Who can do this, and how ?147.83.104.30 (talk) 13:34, 24 March 2015 (UTC)

Is this reference about Damerau–Levenshtein distance, or some other distance? Because if it's not about the same distance, then it's not clear to me why it should be mentioned. And as a somewhat obscure topic (a primary source only cited 11 times according to Google scholar) it's also not clear that it warrants a separate article. —David Eppstein (talk) 05:41, 25 March 2015 (UTC)

External link doesn't work for me (HTTP 404). These work better:

  • http://www.computer.org/csdl/trans/tk/2011/10/ttk2011101541-abs.html
  • alias: http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.190

--CiaPan (talk) 08:11, 25 March 2015 (UTC)

Just use {{cite doi}}: Herranz, J.; Nin, J.; Sole, M. (2011). "Optimal Symbol Alignment Distance: A New Distance for Sequences of Symbols". IEEE Transactions on Knowledge and Data Engineering. 23 (10): 1541. doi:10.1109/TKDE.2010.190. . QVVERTYVS (hm?) 14:45, 25 March 2015 (UTC)

i and j are not defined in the article text

The two variables and are not explained in the article text. From the source code I conclude that they are the string length, but this should be already explained under Definition. --87.139.233.163 (talk) 07:00, 30 November 2015 (UTC)

It's right there: "The Damerau–Levenshtein distance between two strings and is given by ". Then d is defined for arbitrary i, j. QVVERTYVS (hm?) 09:36, 30 November 2015 (UTC)
Hope this edit [4] clarifies the function (and it parameters) meaning. --CiaPan (talk) 10:00, 30 November 2015 (UTC)

Code for distance with adjacent transpositions

I have implemented the algorithm and found that line 7 should be da[i] := 0, not da[i] := i to give plausible results. As I have not proven the correctness otherwise, I'm requesting a verification here. Moreover I would add the comment transposition to the fourth parameter of minimum(). MasterFaS (talk) 01:54, 4 February 2016 (UTC)

I added your changes. Glrx (talk) 21:11, 6 February 2016 (UTC)

External links modified

Hello fellow Wikipedians,

I have just modified 2 external links on Damerau–Levenshtein distance. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

  • Added archive https://web.archive.org/web/20121221172057/http://acl.ldc.upenn.edu/P/P00/P00-1037.pdf to http://acl.ldc.upenn.edu/P/P00/P00-1037.pdf
  • Added archive https://web.archive.org/web/20100401081500/http://acl.ldc.upenn.edu:80/P/P06/P06-1129.pdf to http://acl.ldc.upenn.edu/P/P06/P06-1129.pdf

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).

You may set the |checked=, on this template, to true or failed to let other editors know you reviewed the change. If you find any errors, please use the tools below to fix them or call an editor by setting |needhelp= to your help request.

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

If you are unable to use these tools, you may set |needhelp=<your help request> on this template to request help from an experienced user. Please include details about your problem, to help other editors.

Cheers.—InternetArchiveBot (Report bug) 13:27, 5 December 2016 (UTC)

Retrieved from "https://en.wikipedia.org/w/index.php?title=Talk:Damerau–Levenshtein_distance&oldid=765615911"
This content was retrieved from Wikipedia : http://en.wikipedia.org/wiki/Talk:Damerau–Levenshtein_distance
This page is based on the copyrighted Wikipedia article "Talk:Damerau–Levenshtein distance"; it is used under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC-BY-SA). You may redistribute it, verbatim or modified, providing that you comply with the terms of the CC-BY-SA