CYDB - Editorial

admin · November 22, 2012, 2:09pm

PROBLEM LINKS

DIFFICULTY

HARD

EXPLANATION

What exactly does Chef’s program compute? It iterates over all offsets i in string B. For each offset it tries all valid lengths j, checks whether first j characters of string A match with string B at current offset i and adds a corresponding score. String matching allows at most two errors on positions indicated by digits 1 in string C. Let’s call string B the text and string A the pattern. Instead of trying increasing lengths j, we could find the longest match of the pattern with text at offset i and apply the formula for sum of squares. That won’t be fast enough, but it’s very useful for testing whether your faster algorithm is in fact correct.

The problem was designed around the Aho-Corasick algorithm for string searching. Explaining the algorithm is beyond the scope of this editorial so just look it up if you’re not familiar with it (here’s one of many presentations). A quick summary would be that it is a generalization of Knuth-Morris-Pratt algorithm so that it can simultaneously search for a set of patterns in linear time. Instead of computing a failure function, it builds a trie of pattern strings and computes failure links in that trie. The searching phase is very similar to KMP. The result of running the algorithm is that it determines for all positions i, what is the longest prefix of a pattern which ends at position i.

First expand the string A into a set of possible patterns. Instead of computing the result by starting positions of the pattern in text as in Chef’s program, we can compute it by ending positions of the pattern. We can do this with a small modification of the Aho-Corasick algorithm. For every position in text, we get the length of the longest matching pattern prefix. To take into account the shorter ones, all we have to do is follow the failure links in the trie. Just precompute the scores of each such path along the failure links and you get an algorithm which is linear in the length of text and in the sum of pattern lengths.

Tester for this round came up with an alternative algorithm with complexity O(nb*log na). It is based on computing hashes of string prefixes and binary search. For each position i in string B, you can use binary search to find the longest prefix of A, which exactly matches with string B at offset i. All you have to do is repeat this three times to take care of allowed mismatches. A lot of effort went into preparing test cases, which would prevent such solutions with some additional optimizations from getting accepted. One such optimization would be to use the Z-algorithm for computing longest common prefixes of string A and all suffixes of B. If the entire string A matches or if the mismatch occurs at an invalid position, you can skip the other two steps of binary search with hashes and proceed to the next position i in string B. A side effect of this was that intended Aho-Corasick solutions had to be carefully implemented. For example, a sum of squares of first 1000 natural numbers fits in 32-bit integer type, so you should avoid modulo operations, which are rather slow. Some solutions still squeezed through but they were mostly more complicated than the intended solution. Low accuracy of submissions shows that our efforts were effective.

SETTER’S SOLUTION

Can be found here.