We have a string of length N(<=10^5). For each index i=0 to N-1, we need to find a unique substring of minimum possible length which contains the index i.
So, let’s begin with Suffix Array and LCP array. This has already been explained so many times earlier and I don’t think I can do better than those explanations. So here is an awesome tutorial on Suffix Arrays and LCP Array by @kuruma. If you don’t know about Suffix Array don’t continue before reading this tutorial. Also read about segment trees here and here.
If we have build the suffix array and LCP array.
For any i,
Let p = max(LCP(suffix[i-1],suffix[i]), LCP(suffix[i],suffix[i+1]))
If p is not equal to length of i’th suffix, then all the substrings, S[suffix[i] to suffix[i]+p], S[suffix[i] to suffix[i]+p+1],…, S[suffix[i] to N-1] are unique substrings of length p+1, p+2, … respectively. How?
Suppose three suffices of a string after making suffix array are
suffix [ i - 1 ] : abccxyz…
suffix [ i ] : abcdfgh…
suffix [ i + 1 ] : abcdqrs…
LCP( i - 1 , i) = abc
LCP( i, i + 1 ) = abcd
So starting from index suffix[i], at most 4 length substring “abcd” are common in another position suffix[i+1].
So we can say, “abcdf” is unique because if it is not, then there should be another suffix “abcdf…” just after suffix[i] in the suffix array.
As “abcdf” does not occurs anywhere else, also “abcdfg”, “abcdfgh” and so on are unique.
So, we know that from suffix[i] to suffix[i]+p (both inclusive) the minimum length unique substring is p + 1.
Similarly indices suffix[i]+p+1, suffix[i]+p+2, … etc. also part of unique substrings which starts from i. But we have to handles these two case differently.
Let us handle first case first.
We’ll handle this using segment tree. We have update queries of the type (i,j,p) in which we do:
for k=i to j: if tree[k]>p: tree[k]=p
So, we sort in increasing order of p and star updating ranges beginning from largest p.
So, we keep an array of pair of q[i]=(p[i],i). We sort it in increasing order. And then going backwards we update in segment tree the values. But we are not storing p[i] in the segment tree. We are storing suffix[i] in the segment tree. We will extract the minimum length from these indexes later. So we do this:
for i=n-1 to 0: pos=q[i].second pp=suffix[pos] if ( pp + q[i].first - 1 < n ): update( pp, pp + q[i].first - 1, pp)
Now, we read all values from the segment tree and suppose in a array posi we store the values taken from the segment tree.
We can computer array length from posi by doing this:
for i=0 to n-1: length[i]=((posi[i]<0)?n+1:p[rank[posi[i]]])
Now posi[i] and length[i] might not be storing the optimal answer. Because second case might give us the optimal cases. Indices suffix[i]+p+1, suffix[i]+p+2, … etc. also part of unique substrings which starts from i. In this case larger the suffix[i], smaller the substring will be. Since we need suffix[i] to be largest, we can do this in one for loop. Pseudo code:
memset(g,-1,sizeof(g)) for i=0 to n-1: g[suffix[i]+p[i]]=suffix[i] t=-1 for i=0 to n-1: t=max(t,g[i]) if t<0: continue l=i-t+1 if l<length[i] or ( l==length[i] and rank[t]<rank[posi[i]]): // means we have good solution, if l==length and lexigraphically smaller, we take the smaller one posi[i]=t,length[i]=l
Now, for each index we have starting(0-based index) posi[i] and length equal to length[i].