Use the Levenshtein distance algorithm, which allows approximate (fuzzy) string matching. It must be fast and robust and in managed code. We require that the algorithm perform well, and that it adheres to the standard format. This string distance can be used for spell-checkers, search engines, suggestion searches, as well as checking for plagiarism, and cryptography.
| When using this input | Levenshtein distance | Explanation |
| ant, aunt | 1 | One edit to get from first to second word |
| Sam, Samantha | 5 | Five letters must be added to the end |
| flomax, volmax | 3 | Example of prescription medications that have been confused by medical professionals |
First, credit goes to Vladimir Levenshtein, a Russian scientist. The code that follows is the complete C# implementation I adapted and optimized. It uses a two-dimensional array instead of a jagged array because the space required will only have one width and one height. Here's the implementation we need to add to our project.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
/// <summary>
/// Contains the implementation of Levenshtein distance.
/// </summary>
public static class LevenshteinDistance
{
/// <summary>
/// Compute the distance between two strings (the parameters).
/// </summary>
/// <param name="s">The first of the two strings.</param>
/// <param name="t">The second of the two strings.</param>
/// <returns>The Levenshtein cost.</returns>
public static int Compute(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}
We have some example calls to the Levenshtein distance computation above. We use the collection initializer syntax and test a series of two-string pairs. Look at the cost variable to see the exact calling code.
/// <summary>
/// This is an example class that tests the Levenshtein algorithm.
/// </summary>
class TestLevenshtein
{
public TestLevenshtein()
{
List<string[]> testStrings = new List<string[]>
{
new string[]{"ant", "aunt"},
new string[]{"Sam", "Samantha"},
new string[]{"clozapine", "olanzapine"},
new string[]{"flomax", "volmax"},
new string[]{"toradol", "tramadol"}
};
foreach (string[] inner in testStrings)
{
int cost = LevenshteinDistance.ComputeDistance(inner[0], inner[1]);
Console.WriteLine(inner[0] + " -> " + inner[1] + " = " + cost.ToString());
}
}
}
Michael Gilleland has an excellent page about the Levenshtein distance and many implementations of it, and that resource is important if you need more detailed reference. There is a C# version linked from that page, but I adapted that code for some big performance improvements. I changed the first statement into the second statement.
// Slow because it makes new strings. cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1); // More efficient because it doesn't make new strings with Substring. cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
This is a case of small change, big effect. What the change I show above does is test the characters in the string instead of copying them into a new string. Here are some results of my informal benchmarks for you to look at. After the table, I show the data in a graph format.
| Version | Approximate Time Elapsed |
| Original with Substring | 45 seconds |
|
With character index Shown on page |
15 seconds |
Download and view this code at my code viewing site. The code here is in the public domain, and was modified by me. This means you are free to use it anywhere you want. I recommend the link above to the Levenshtein resource page and want to note that the picture of Mr. Levenshtein's is taken from his home page in Russian.