C#Dot Net Perls

C#
Levenshtein Distance

by Sam Allen
Vladimir Levenshtein.

Problem

Use the Levenshtein distance algorithm, which allows approximate (fuzzy) string matching. It must be fast and robust and in managed code. We require that the algorithm perform well, and that it adheres to the standard format. This string distance can be used for spell-checkers, search engines, suggestion searches, as well as checking for plagiarism, and cryptography.

When using this input Levenshtein distance Explanation
ant, aunt 1 One edit to get from first to second word
Sam, Samantha 5 Five letters must be added to the end
flomax, volmax 3 Example of prescription medications that have been confused by medical professionals

C# Solution

First, credit goes to Vladimir Levenshtein, a Russian scientist. The code that follows is the complete C# implementation I adapted and optimized. It uses a two-dimensional array instead of a jagged array because the space required will only have one width and one height. Here's the implementation we need to add to our project.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

/// <summary>
/// Contains the implementation of Levenshtein distance.
/// </summary>
public static class LevenshteinDistance
{
    /// <summary>
    /// Compute the distance between two strings (the parameters).
    /// </summary>
    /// <param name="s">The first of the two strings.</param>
    /// <param name="t">The second of the two strings.</param>
    /// <returns>The Levenshtein cost.</returns>
    public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // Step 1
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }
}

How can I call the method?

We have some example calls to the Levenshtein distance computation above. We use the collection initializer syntax and test a series of two-string pairs. Look at the cost variable to see the exact calling code.

/// <summary>
/// This is an example class that tests the Levenshtein algorithm.
/// </summary>
class TestLevenshtein
{
    public TestLevenshtein()
    {
        List<string[]> testStrings = new List<string[]>
        {
            new string[]{"ant", "aunt"},
            new string[]{"Sam", "Samantha"},
            new string[]{"clozapine", "olanzapine"},
            new string[]{"flomax", "volmax"},
            new string[]{"toradol", "tramadol"}
        };

        foreach (string[] inner in testStrings)
        {
            int cost = LevenshteinDistance.ComputeDistance(inner[0], inner[1]);
            Console.WriteLine(inner[0] + " -> " + inner[1] + " = " + cost.ToString());
        }
    }
}

Is it fast?

Michael Gilleland has an excellent page about the Levenshtein distance and many implementations of it, and that resource is important if you need more detailed reference. There is a C# version linked from that page, but I adapted that code for some big performance improvements. I changed the first statement into the second statement.

// Slow because it makes new strings.
cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);

// More efficient because it doesn't make new strings with Substring.
cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

This is a case of small change, big effect. What the change I show above does is test the characters in the string instead of copying them into a new string. Here are some results of my informal benchmarks for you to look at. After the table, I show the data in a graph format.

Version Approximate Time Elapsed
Original with Substring 45 seconds
With character index
Shown on page
15 seconds
Levenshtein performance comparison.

Conclusion

Download and view this code at my code viewing site. The code here is in the public domain, and was modified by me. This means you are free to use it anywhere you want. I recommend the link above to the Levenshtein resource page and want to note that the picture of Mr. Levenshtein's is taken from his home page in Russian.

Dot Net Perls is dedicated to sharing code and knowledge. It has
© 2007-2008 Sam Allen. All rights reserved.

Ads by The Lounge