LCPMAX - Editorial

dpraveen · April 17, 2016, 10:30am

PROBLEM LINK:

Practice
Contest

Author: Constantine Sokol
Tester: Pavel Sheftelevich
Editorialist: Praveen Dhinwa

DIFFICULTY:

medium

PREREQUISITES:

None

PROBLEM:

Given two strings s, t. Both s, t contain first 10 lower case Latin alphabets (i.e. ‘a’-‘j’). You have to answer many queries of following kind. In each query, you are given two integers x, y. You have to find the length of the LCP (longest common prefix) of s[x..n] and t[y..n]. You are allowed to apply some permutation (mapping) to all of the distinct letters of s before finding the LCP. You should pick the bijection/mapping in such a way that LCP is as large as possible. If there are several possible bijections, you should pick the lexicographically smallest mapping, i.e. one with the least possible letter assigned to ‘a’, then (if still ambiguous) to ‘b’ and so on.

QUICK EXPLANATION:

If we omit the condition of allowing mapping among the characters of s, then this turns into a very standard problem, which we can solve using binary search over length of LCP and for each length checking whether the prefixes of that length of both the strings s, t are equal or not. For checking whether two substrings are equal or not, we can use hashing.
Even with the mapping, by making some simple observations, we can use the similar approach to find the LCP and corresponding mapping.

EXPLANATION:

Let us denote suffix of string s starting at x by S and that of t starting at y by T.

We have to find LCP(S, T).

In the beginning, we don’t know the mapping of the characters of S, So we just assume that they are mapped to some character not present in the strings at all, let us say ‘z’.
We will go step by step and build this mapping while maximizing the length of LCP.
Let us find the first position in the array S, T where the corresponding characters don’t match. Initially, we know that character of S are all mapped to ‘z’, so the first position itself won’t match.
Let i be the index at which S[i] != T[i].
So, in order to get a longer common prefix, we should try to map the S[i] to T[i]. Note that you can only do this mapping if the character S[i] was not already mapped to some character other than ‘z’, otherwise LCP build till now will be changed and be shorter than the current one, because there will be a character j < i such that S[j] != T[j].

We keep doing this till we are getting a longer prefix. When we stop, we know the length of LCP, though you should notice that it might be the case that there are some characters in S which are still not mapped. You can map them to any of the unmapped characters. But, as the answer asks to find lexicographically smallest possible mapping. You should greedily map them to smallest possible character available to map with.

So, we know the algorithm for solving the problem. We have to find out how can we implement the steps mentioned here fast, so that we can answer around 5 * 10^4 queries.

Finding the first position where the character mismatches for a given mapping of characters, is similar to find longest common prefix of corresponding strings. To find LCP, we can binary search over the length of the LCP and check whether the corresponding substrings are equal or not. For checking whether the two substrings are equal or not, we can compare the strings using hashing. Usually, it is easy to hash with string being fixed (i.e. mapping is not there). Let us see how we can deal with mapping in hashing.

Hashing with a given mapping.

You are given a string s. You have to find the hash of the any substring of s with a given mapping of characters of s.
First let us understand how the hash function of a string changes when we map some character to some other.

Let s = "baba" = {1, 0, 1, 0}. Let base = 2.
Hash(s) = 1 * 2^3 + 0 * 2^2 + 1 * 2^1 + 0*2^0).
Let us change “a” to “b”, we get Hash(s') = 1 * 2^3 + 1 * 2^2 + 1 * 2^1 + 1*2^0)
Notice that 0 * 2^2 + 0*2^0 changes to 1 * 2^2 + 1*2^0, remaining all other terms are equal.
The difference between these two will be 1 * 2^2 + 1*2^0.

Let terms(x) denotes the terms in hashing contributed by character x (excluding the multiple (‘x’ - ‘a’), i.e. (2^2 + 2^0) in the Hash(s).
If we change any character x to y, then the change in hashing will be (y - x) * terms(x).
So, for finding new hash, we just need to maintain terms for each of the characters.

Now, dealing with substring is also easier. You have to find part of terms included in the substring. For that, you can just maintain a prefix sum for terms.

See the tester’s code for more details on it.

Time Complexity

For each query, we can binary search at most 10 times (number of distinct characters in s or t). Substring hashing can be done in $O(1). So, overall time complexity for a query will \mathcal{O}(10 , log(n)$.

AUTHOR’S AND TESTER’S SOLUTIONS:

Author’s solution can be found here.
Tester’s solution can be found here.

war_machine225 · April 18, 2016, 9:04am

int get_hash(int n, int hashes[], int l, int r) {
return ((hashes[r] - hashes[l - 1] + MOD) * 1ll * powers[n - r]) % MOD;
}

The concept of hashing in strings is new for me and I haven’t understood it well. Could someone please explain why powers[n-r] is done in the above function (It is a function from setter’s solution). How about multiplying that entire thing by base^(-l) so that in the resulting hash, the leftmost character’s value is multiplied by 1 which would give the hash of the corresponding substring?

cjoa · April 18, 2016, 10:23am

Is the time complexity 10log(N) or 1010*log(N)? In author’s and tester’s solutions, the binary search check function loops over all 10 characters.

chemthan · April 18, 2016, 1:09pm

It seems that codechef’s system is too slow.

chandan721 · May 27, 2016, 2:54am

@dpraveen

in author’s solution

what does hashes[j][i] in build_hashes() function and

character___hashes[0][i] in check_substring() function signify?