In this tutorial, we will investigate the Levenshtein distance algorithm, which is also known as the Edit distance algorithm which compares words for similarity.
The Levenshtein distance algorithm was created by Russian scientist, Vladimir Levenshtein.
The Levenshtein distance algorithm compares words for similarity by calculating the smallest number of changes / substitutions required to transform one string into another. These changes include:
The algorithm is used in different applications or as a basis for:
The algorithm uses a brute force technique to transform a source String into a target String. It looks at all permutations in order to find the minimum number of changes, required to perform the transformation. The Levenshtein Distance algorithm is also knows as the edit distance algorithm.
For example consider the source word dog and the target word dodge. The edit distance between these two words is 2, because dog can be converted to dodge by inserting a d before g and an e after.
The Levenshtein distance algorithm can also assign different costs to each type of edit. For example insertion and deletion could have a cost of 2, and the cost of substitution could be less costly with a cost of 1.
In this tutorial we will follow a dynamic programming approach to implement the Levenshtein Distance Algorithm. To illustrate the approach we will use the matrix shown below.
The matrix above uses rows and columns to represent the source word dog, and the target word dodge. The matrix will be used to calculate the edit distance. The rows of the matrix represents a source word to be transformed and it’s entries are the cost of inserting each character. Moreover, the columns in the matrix are used for the target word to be transformed and the entries are the cost of deletion.
To calculate the values in each cell we will use a formula such as:
1 | minCost(value of left diagonal + substitution cost, value above + deletion cost, left value + insertion cost)
|
We will use the following values for the costs of the edits.
1 2 3 | insertion: 1
deletion: 1
substitution: 1
|
Moreover in the above formula we will always set the cost of insertion and deletion to 1, and we will only use a substitution value of 1, if the character at index (row:col) is different.
Let’s look at an example to calculate the first cell. For this cell the positional values will be:
1 2 3 | left diagonal: 0
value above: 1
left value: 1
|
and the edit cost will be:
1 2 3 | substitution: 0 – characters are the same
insertion: 1 – 1 is a constant value
deletion: 1 – 1 is a constant value
|
so:
1 | minCost(0 + 0, 1 + 1, 1 + 1) → min(0, 2, 2) = 0
|
Lets continue and look at the cell (1,2). For this cell the value will be:
1 | min(1 + 1, 2 + 1, 0 + 1) → min(2, 3, 1) = 1
|
We will continue the process for the remainder of the cells. After we have calculated the values we have the matrix shown below.
In the above matrix the value in the bottom right corner is the result of the Levenshtein Distance calculation. In this case the minimum edit distance between the words dog and dodge is 2.
On analyzing the above algorithm we can see that the algorithm performs with quadratic complexity O(MN) as each character from the source M is compared with each character from the target N to produce the fully populated matrix.
Lets write some code to implement the above algorithm.
First we will allow for the cost of the edits to be configurable:
1 2 3 4 5 6 7 8 9 | public LevenshteinDistance(int insertionCost, int deletionCost, int substitutionCost) {
AssertUtils.gte(insertionCost, 0, "Insertion cost must be greater than or equal to 0");
AssertUtils.gte(deletionCost, 0, "Deletion cost must be greater than or equal to 0");
AssertUtils.gte(substitutionCost, 0, "Substitution cost must be greater than or equal to 0");
this.insertionCost = insertionCost;
this.deletionCost = deletionCost;
this.substitutionCost = substitutionCost;
}
|
Next we will write some code to calculate the distance.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | public int calculateDistance(CharSequence source, CharSequence target) {
AssertUtils.notNull(source, "Source cannot be null");
AssertUtils.notNull(target, "Target cannot be null");
int sourceLength = source.length();
int targetLength = target.length();
int[][] matrix = new int[sourceLength + 1][targetLength + 1];
matrix[0][0] = 0;
for (int row = 1; row <= sourceLength; ++row) {
matrix[row][0] = row;
}
for (int col = 1; col <= targetLength; ++col) {
matrix[0][col] = col;
}
for (int row = 1; row <= sourceLength; ++row) {
for (int col = 1; col <= targetLength; ++col) {
matrix[row][col] = calcMinCost(source, target, matrix, row, col);
}
}
return matrix[sourceLength][targetLength];
}
|
We will now explain the code listed above. In the above code we first initialize the matrix using the source and target lengths:
1 | int[][] matrix = new int[sourceLength + 1][targetLength + 1];
|
We also set the matrix row and column size to one more than the source and target word length. This is so we can populate the empty matrix by adding the default values for the first row and first column of our matrix.
1 2 3 4 5 6 7 | for (int row = 1; row <= sourceLength; ++row) {
matrix[row][0] = row;
}
for (int col = 1; col <= targetLength; ++col) {
matrix[0][col] = col;
}
|
The above code populates the default values for the first row and column of the matrix. With the matrix fully initialized, the next step is calculate the value for each cell in the matrix.
1 2 3 4 5 | for (int row = 1; row <= sourceLength; ++row) {
for (int col = 1; col <= targetLength; ++col) {
matrix[row][col] = calcMinCost(source, target, matrix, row, col);
}
}
|
Above, we iterate through each cell of the matrix calculating its cost. When this step is complete the minimum distance can be found in the last cell of the matrix:
1 | return matrix[sourceLength][targetLength];
|
Now, let’s look at how to calculate the minimum cost of each cell.
1 2 3 4 5 6 7 8 | private int calcMinCost(CharSequence source, CharSequence target,
int[][] matrix, int row, int col) {
return Math.min(
calcSubstitutionCost(source, target, matrix, row, col), Math.min(
calcDeletionCost(matrix, row, col),
calcInsertionCost(matrix, row, col))
);
}
|
The calcMinCost
method finds the minimum cost of an edit for a cell. It calculates the substitution, insertion, and deletion costs. This method uses the Math.min
method to return the minimum value of the calculated costs.
The full code listing is shown below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 | public class LevenshteinDistance {
private final int insertionCost;
private final int deletionCost;
private final int substitutionCost;
public LevenshteinDistance() {
this(1, 1, 1);
}
public LevenshteinDistance(int insertionCost, int deletionCost, int substitutionCost) {
AssertUtils.gte(insertionCost, 0, "Insertion cost must be greater than or equal to 0");
AssertUtils.gte(deletionCost, 0, "Deletion cost must be greater than or equal to 0");
AssertUtils.gte(substitutionCost, 0, "Substitution cost must be greater than or equal to 0");
this.insertionCost = insertionCost;
this.deletionCost = deletionCost;
this.substitutionCost = substitutionCost;
}
public int calculateDistance(CharSequence source, CharSequence target) {
AssertUtils.notNull(source, "Source cannot be null");
AssertUtils.notNull(target, "Target cannot be null");
int sourceLength = source.length();
int targetLength = target.length();
int[][] matrix = new int[sourceLength + 1][targetLength + 1];
matrix[0][0] = 0;
for (int row = 1; row <= sourceLength; ++row) {
matrix[row][0] = row;
}
for (int col = 1; col <= targetLength; ++col) {
matrix[0][col] = col;
}
for (int row = 1; row <= sourceLength; ++row) {
for (int col = 1; col <= targetLength; ++col) {
matrix[row][col] = calcMinCost(source, target, matrix, row, col);
}
}
return matrix[sourceLength][targetLength];
}
private int calcMinCost(CharSequence source, CharSequence target,
int[][] matrix, int row, int col) {
return Math.min(
calcSubstitutionCost(source, target, matrix, row, col), Math.min(
calcDeletionCost(matrix, row, col),
calcInsertionCost(matrix, row, col))
);
}
private int calcInsertionCost(int[][] matrix, int row, int col) {
return matrix[row][col - 1] + insertionCost;
}
private int calcDeletionCost(int[][] matrix, int row, int col) {
return matrix[row - 1][col] + deletionCost;
}
private int calcSubstitutionCost(CharSequence source, CharSequence target,
int[][] matrix, int row, int col) {
int cost = 0;
if (source.charAt(row - 1) != target.charAt(col - 1)) {
cost = substitutionCost;
}
return matrix[row - 1][col - 1] + cost;
}
}
|
Finally we will show the test used to validate the implementation:
1 2 3 4 5 6 | @Test
public void calculateDistance() {
LevenshteinDistance calculator = new LevenshteinDistance(1,1,1);
final int result = calculator.calculateDistance("dog", "dodge");
assertEquals(2,result);
}
|
In this tutorial we examined the levenshtein distance algorithm and some of its applications. We also looked at one approach to implementing it using dynamic programming. We implemented an algorithm that performs with quadratic complexity O(MN). Therefore, the performance of this algorithm can be improved using different techniques. As it stands the algorithm performs in a time relative to O(MN), therefore, without some tweaking it probably does not exhibit the performance required for production use cases, such as an efficient spell-checker.