Given a Context Free-Grammer G with start symbol ST and a string S, Find the minimum Levenshtein distance between S and any string accepted by G.
- If we form an automation using start symbol as root, we may get a graph with cycles and possibly self-loops and a set of terminal nodes allowing specific set or character or empty character to be accepted. Any string is said to be accepted if it can be generated by a choice of characters for terminal positions from start symbol.
- Let us calculate the answer for position (L, R, NT) where L and R are the positions of the left and right end of the string, and NT is the Non-terminal symbol. The value assigned to tuple (L, R, NT) is the minimum Levenshtein Distance between substring S[L..R] of the given string (possibly empty) and any string acceptable by S. We can try all terminal characters and empty character and find the Levenshtein distance from the current string. For Productions of type (NT1, NT2), we can find position k from L to R and find minimum of (L, k, NT1)+(k+1,R,NT2).
- Critical observation is, that if we do not gain anything from cycles formed in this automation which allows us to form a bottom-up Dynamic Programming structure to answer our problem. Answer to our problem is given by state (0, |S|-1, ST).
Refer the following box for Levenshtein Distance of two strings.
Click to view
The Levenshtein Distance between two string is the minimum number of operations required to transform one string to other when in one operation, we are allowed to insert a character, delete a character or substitute one character from either of the string. It is a well-known problem solvable in O(N^2) time, though we don’t need that here.
For understanding, Suppose we build the graph starting from start symbol and productions of type zero. We shall end up with a graph with self-loops. We are allowed to substitute each Terminal symbol with any character given by productions of type 1 or with an empty string if allowed using the production of type 2.
Let us solve this problem from small strings to larger strings.
For a substring of length 0, if the current symbol has empty string character, Minimum Levenshtein Distance is zero. Similarly, if we have a substring of a single character, Levenshtein Distance shall be depending upon whether that character is allowed as the terminal character for current symbol and whether current symbol allows empty character or not.
Now, For productions of type 0, we can split the string at all possible split positions. (Consider split position just before the start of the string and after the end of string too) and recursively calculate the Levenshtein Distance of respective Symbols with these substrings. Choose the minimum Levenshtein Distance arising as the sum of Levenshtein Distances of both parts. This way, we have utilized the production symbols of type 0 too.
But the problem may arise, that we may end up in cycles.
Consider graph formed by each tuple (L, R, NT) representing a vertex and adding directed edges for productions of type 0 from (L, R, NT) to (L, K, NT) and (K+1, R, NT) for all L-1 \leq K \leq R. This graph actually shows how we move from one state to another in recursive calls. But it has cycles too, which is a complication.
The thing is, that cycles are never useful to us. Consider we started from state (L, R, NT) (Call it State 1) and after passing through some recursive calls, reached this state (Call it State 2) again. If Levenshtein Distance for state 1is Levenshtein Distance between$S[L…R]$ and any string from Set of strings X, then Levenshtein Distance for state 1, when returned from state 2, shall actually be minimum Levenshtein Distance for state 1 plus some value P from intermediate calls to other states. Since this value P can never be negative, It is never optimal to enter into any cycle, since moving through cycle only increases our Levenshtein Distance which is not optimal.
Hence, we can proceed with the procedure given above to translate into a bottom-up Dynamic Programming solution with DP state (L, R, NT) representing the Levenshtein Distance of substring S[L..R] from strings represented by Symbol NT.
Another special point is to handle empty strings carefully. We have L>R representing an empty string. It may be better to handle it for once for each symbol, instead of handling for all (L, R) combinations where L > R. Further Levenshtein Distance of empty string to any other string is given by the length of the other sting.
This solution greatly resembles CYK algorithms for checking if a string belongs to the given Context Free grammar or not, is very similar to this solution. The pseudo-code present on Wikipedia page (linked above) may come in useful for better understanding.
Related Problems: I wasn’t able to find any related problems or problems based on Context Free Grammer. I’d be glad if anyone can share any problem they know.
The time complexity of this solution is the same CYK algorithm which is O(|S|^3*N) where N is the number of productions of Grammer.
AUTHOR’S AND TESTER’S SOLUTIONS:
Feel free to Share your approach, If it differs. Suggestions are always welcomed.