I was always curious about the auto correct feature of Google. I mean, lets face it - it looks ultra cool especially when we type in some vague disjointed letters and Google suggests us exactly what we had in our mind. It feels like Google kind of reads our mind !
With some time in my hand, I decided to venture out on finding how to do it. Of course, I don't have the monstrous server farms that Google has its disposal, to make an equivalent web solution. Also, considering the fact that I'm mostly a C/C++ guy, I decided to analyze and implement a solution more suited to the C++/desktop configuration.
A quick Google search for this problem domain yields many results. Apparently string matching / searching is a popular research topic. But, I wanted a more simpler solution than suggested by the results. And what do you know, the age old classic book on algorithms - 'The Algorithm Design Manual' by Skiena, just had something that meets my needs. Combining the solution given in the book with some extra code, I could arrive at a decent auto correct solution. Of course, it has many flaws. But its a good first step effort IMO.
So, without further ado, here's what it looks like :
For ex: lets assume our input string is abiliti and we have to convert it to either ability or capability.
We would need to substitute i with y, to convert abiliti to ability. But it would take much more effort to convert from abiliti to capability. So the distance between abiliti and ability is much lesser than abiliti and capability.
Now if we have a dictionary(text file) of grammatically correct words, all we need to do is to compare our input string (the one which needs a correction) against each of the word in the dictionary , calculating the distance in each case. The word with the minimum distance is the one that should be the correct one.
Easy isn't it ? Not so fast. Its easier said than done. And moreover, this method does have its flaws, which I will cover in another topic. But for now, lets see how to go about implementing this.
Lets say"dictionary" is a string of vectors containing the list of grammatically correct words.
lets initialize lowestCost to a high value say 65535,
Suggest would be the auto corrected string. This does work fine in many cases. But not all. Anyway, the correctness of the above solution will be analyzed in another post. But for now, even if its successful in giving the correct output, I would say it still has some basic flaws.
Can you spot the problem with the above approach ?
For one, we are unnecessarily comparing the input string with every word in the dictionary. To take the above ex: abiliti, my search should have compared only the alphabets starting from a. But in the above, I continue until Z.
So to improve it further, I would construct my dictionary in a more intelligent manner. Instead of a single vector, I would have an array of 26 vectors, each corresponding to an alphabet. I will have a Map that maps from the letter to the exact vector. Now, if I've to search for the word straight, I would go from my Map, to the corresponding vector list that has words starting only from S.
Our dictionary is now a map of alphabet -> vector list of grammatically correct words starting with that alphabet. So, instead of running through all the words, we just take the first character of our inputStr and use it as an index to directly jump to the corresponding vector list.
So, the above algorithm changes to :
With some time in my hand, I decided to venture out on finding how to do it. Of course, I don't have the monstrous server farms that Google has its disposal, to make an equivalent web solution. Also, considering the fact that I'm mostly a C/C++ guy, I decided to analyze and implement a solution more suited to the C++/desktop configuration.
A quick Google search for this problem domain yields many results. Apparently string matching / searching is a popular research topic. But, I wanted a more simpler solution than suggested by the results. And what do you know, the age old classic book on algorithms - 'The Algorithm Design Manual' by Skiena, just had something that meets my needs. Combining the solution given in the book with some extra code, I could arrive at a decent auto correct solution. Of course, it has many flaws. But its a good first step effort IMO.
So, without further ado, here's what it looks like :
1. Design:
Approximate String matching is achieved by finding the minimum distance between two strings. Here distance between two strings is calculated by the no. of operations (substitutions/insertions/deletions) it takes to convert one string to another.For ex: lets assume our input string is abiliti and we have to convert it to either ability or capability.
We would need to substitute i with y, to convert abiliti to ability. But it would take much more effort to convert from abiliti to capability. So the distance between abiliti and ability is much lesser than abiliti and capability.
Now if we have a dictionary(text file) of grammatically correct words, all we need to do is to compare our input string (the one which needs a correction) against each of the word in the dictionary , calculating the distance in each case. The word with the minimum distance is the one that should be the correct one.
Easy isn't it ? Not so fast. Its easier said than done. And moreover, this method does have its flaws, which I will cover in another topic. But for now, lets see how to go about implementing this.
2. Implementation
Lets say"dictionary" is a string of vectors containing the list of grammatically correct words.
lets initialize lowestCost to a high value say 65535,
Suggest would be the auto corrected string. This does work fine in many cases. But not all. Anyway, the correctness of the above solution will be analyzed in another post. But for now, even if its successful in giving the correct output, I would say it still has some basic flaws.
Can you spot the problem with the above approach ?
For one, we are unnecessarily comparing the input string with every word in the dictionary. To take the above ex: abiliti, my search should have compared only the alphabets starting from a. But in the above, I continue until Z.
So to improve it further, I would construct my dictionary in a more intelligent manner. Instead of a single vector, I would have an array of 26 vectors, each corresponding to an alphabet. I will have a Map that maps from the letter to the exact vector. Now, if I've to search for the word straight, I would go from my Map, to the corresponding vector list that has words starting only from S.
Our dictionary is now a map of alphabet -> vector list of grammatically correct words starting with that alphabet. So, instead of running through all the words, we just take the first character of our inputStr and use it as an index to directly jump to the corresponding vector list.
So, the above algorithm changes to :
Now , that the main function is in place , what about the utility function - string_compare, which is the one doing the actual grunt work of calculating the minimum distance. How do you go about comparing two strings to calculate the minimum distance ? Do keep in mind that we have to perform this operation for hundreds (maybe even thousands depending on the size of your dictionary). I'll leave it as an exercise to the reader. Skiena's book, proposes an elegant as well, using dynamic programming . It could also be easily googled.