SELECT - Editorial

kevinsogo · November 20, 2016, 9:03pm

PROBLEM LINK:

Author: Istvan Nagy
Tester: Kevin Atienza
Translators: Sergey Kulik (Russian), Team VNOI (Vietnamese) and Hu Zecong (Mandarin)
Editorialist: Kevin Atienza

DIFFICULTY:

Medium

PREREQUISITES:

Backtracking

PROBLEM:

Names are encrypted according to some algorithm (specified in the problem statement). Your job is to decrypt them. It’s possible that multiple solutions exist; output any one.

QUICK EXPLANATION:

Generate the name with backtracking, starting from the last. At every point in the backtracking, we need to remember the following data:

The current index.
The sum of the letters generated so far.
The list of intervals where we can still put some letters.

Each “interval” is a contiguous range of letters, say, lmnopqrs. In each interval, we need to remember the following data:

The lowest and highest possible letter that can be used (in the above example, l and s),
The number of letters we want to take from this interval,
The number of letters we have taken and will want to take below this interval.

All these data can be updated as we try out each letter during the backtracking phase.

EXPLANATION:

Decoding the encryption method

The first step is to understand what the encryption does.

encrypt(S[0..N-1])
    W[0..N-1] = {0,..,0}
    for i=0 to N-1
        if S[i]<'a' or S[i]>'z' then 
            return "failure in encryption"
        for j=0 to N-1
            if S[i]>S[j] then
                W[i]++
        for j=i to N-1
            if i != j and S[i] == S[j] then 
                return "failure in encryption"
            W[i] = W[i] + S[j] - 'a'
        W[i] = W[i] mod 10
    The encrypted name of person is W[0],W[1]..,W[N-1]

With some simple refactoring, we can split this process into four phases:

encrypt(S[0..N-1])
    W[0..N-1] = {0,..,0}

    # checking phase
    for i=0 to N-1
        if S[i]<'a' or S[i]>'z' then 
            return "failure in encryption"
        for j=i to N-1
            if i != j and S[i] == S[j] then 
                return "failure in encryption"

    # "rank" phase
    for i=0 to N-1
        for j=0 to N-1
            if S[i]>S[j] then
                W[i]++

    # "cumulative sum" phase
    for i=0 to N-1
        for j=i to N-1
            W[i] = W[i] + S[j] - 'a'

    # "mod" phase
    for i=0 to N-1
        W[i] = W[i] mod 10

    The encrypted name of person is W[0],W[1]..,W[N-1]

It’s self-explanatory what each of these phases do. From this, we get our first few observations:

The “checking” phase simply ensures that the input is valid, i.e., it is a string of N distinct lowercase letters.
The “mod” phase tells us that we are working modulo 10. (at least when computing W[i])
The “rank” and “cumulative sum” phases can be interchanged.
The “rank” phase calculates the letters’ ranks, i.e. the position of each letter assuming the string is sorted.

Let’s refactor it a little bit more:

encrypt(S[0..N-1])
    W[0..N-1] = {0,..,0}

    # checking phase
    for i=0 to N-1
        S[i] = S[i] - 'a' # convert to a number from 0 to 25
        if S[i] is not in the range [0,1,2,...,25] then 
            return "failure in encryption"
        for j=i+1 to N-1
            if S[i] == S[j] then 
                return "failure in encryption"

    # "rank" phase
    R[0..N-1] = {0,..,0}
    for i=0 to N-1
        for j=0 to N-1
            if S[i]>S[j] then
                R[i]++

    # "cumulative sum" phase
    T[0..N-1] = {0,..,0}
    for i=0 to N-1
        for j=i to N-1
            T[i] = T[i] + S[j]

    # "mod" phase
    for i=0 to N-1
        W[i] = (R[i] + T[i]) mod 10

    The encrypted name of person is W[0],W[1]..,W[N-1]

We can state this procedure in yet another way: we can say that W[i] = (R[i] + T[i]) \bmod 10, where

R[i] is the rank of i, i.e., the (0-indexed) position of S[i] when the string is sorted.
T[i] is the cumulative sum starting at i, i.e. S[i] + S[i+1] + S[i+2] + \ldots S[N-1], or simply S[i] + T[i+1].

Backtracking

Our goal now is to generate a name S that encrypts to the string of digits W. But beyond this refactoring, it’s hard to analyze the “encrypt” function, so we’ll simply resort to backtracking, taking advantage of the smallness of K

There are two ways this can be done: either left to right or right to left. Both are possible, but we’ll do the right-to-left method. Our goal is to generate the letters S[N-1], S[N-2], \ldots, S[2], S[1], S[0] such that as we generate S[i], we ensure that S[i] decrypts to W[i] correctly. By doing it from right to left, we can keep track of the cumulative sum (T[i]) of all the letters so far.

Unfortunately, there’s one complication: we can’t ensure that S[i] decrypts to W[i] without knowing in advance what the remaining letters will be! (Or at least some information about them.) This is because of the “rank” phase: as we select S[i], we also need to ensure that its rank (R[i]) is correct, and R[i] depends on all letters, not just the ones we’ve already generated. Therefore, whenever we select S[i], we must also specify its rank, and then we must ensure that our future choices will be such that this rank will be correct.

The way we do this is to ensure that the correct number of letters will be taken from certain intervals of the alphabet. Let’s give an example. Suppose N = 8. Let’s try to generate the string, starting with the last letter. Suppose we want the last letter to be k, and we want its rank to be exactly 4. To ensure that the letter gets rank 4, we want to make sure that:

The letters with ranks 1 to 3 will be taken from the interval abcdefghij in the future.
The letters with ranks 5 to 8 will be taken from the interval lmnopqrstuvwxyz in the future.

Now, we need to select the second letter. Suppose that we want it to be s, and its rank to be exactly 6. To guarantee this, we want to ensure that:

The letters with ranks 1 to 3 will be taken from the interval abcdefghij in the future.
The letters with ranks 5 to 5 will be taken from the interval lmnopqr in the future.
The letters with ranks 7 to 8 will be taken from the interval tuvwxyz in the future.

And so on.

As you can see, we need to remember all these things as we perform the backtracking. Notice that every time we choose the next letter, we “split” one of the intervals into two. We can continue this until we generate the entire string, or hit a dead end, in which case we backtrack and try other possibilities. Finally, we also need to keep track of T[i] during the backtracking.

On top of this, we need to ensure that S[i] encrypts correctly to W[i]. This means that we must select the letter and the rank such that the following is satisfied: W[i] \equiv R[i] + T[i] \pmod{10}.

More backtracking details

We now have a general backtracking strategy. It’s time to handle the details and formalize it For simplicity, we’ll assume each letter is actually a number from 0 to 25, not a letter from a to z.

Let’s call our (recursive) function find. This will have three arguments: i, T and intervals.

i is the current index, i.e. we’re trying to fill up S[i].
T is the cumulative sum of the letters after S[i], i.e. it is equal to T[i+1].
intervals is the list of intervals where we can select letters from.

Each “interval” encodes information of the following form:

The letters with ranks R1 to R2 will be taken from the interval [S1,S2] in the future.

Specifically, we represent it as a 4-tuple (S1, S2, R1, R2).

The initial call to our function will be: find(N-1, 0, [(0,25,0,N-1)]). Note that intervals contains a single interval (0,25,0,N-1), which says

The letters with ranks 0 to N-1 will be taken from the interval [0,25] in the future.

Now, let’s try to see what happens in the call find(i, T, intervals). During this time, we want to select the letter S[i]. We also want to select its rank, R[i]. We can only select this letter from one of the intervals in intervals, and once we select it, we need to split that interval into two, as illustrated above.

Specifically, suppose we want our letter to be S[i], and it is taken from some interval (S1, S2, R1, R2). Let’s also assume its rank is R[i]. This selection of S[i] and R[i] implies all of the following:

S[i] must be in the range S1 <= S[i] <= S2.
R[i] must be in the range R1 <= R[i] <= R2.
T[i] == S[i] + T.
The following must be satisfied: W[i] == (R[i] + T[i]) % 10.
The interval (S1, S2, R1, R2) will be split into the intervals (S1, S[i]-1, R1, R[i]-1) and (S[i]+1, S2, R[i]+1, R2).

The recursion ends when we reach i = -1, in which case we’ve already generated the whole string. That’s it!

Here’s an implementation in Python: (Pypy)

def find(i, T, intervals):
   if i == -1:
      # end case: whole string generated
      return True

   # across all intervals
   for idx, (S1, S2, R1, R2) in enumerate(intervals):
      # for all ranks
      for Ri in range(R1,R2+1):
         # try all letters
         for S[i] in range(S1,S2+1):
            Ti = S[i] + T  # cumulative sum
            if W[i] == (Ri + Ti) % 10:
               # recurse
               if find(i-1, Ti, intervals[:idx] + [(S1,S[i]-1,R1,Ri-1), (S[i]+1,S2,Ri+1,R2)] + intervals[idx+1:]):
                  return True


z, n = map(int, raw_input().strip().split())
S = [None]*n
for cas in xrange(z):
   W = map(int, raw_input().strip())
   assert find(n-1, 0, [(0, 25, 0, n-1)])
   print ''.join(chr(s + ord('a')) for s in S)

This search can be optimized further with a couple of ideas; for example, instead of choosing S[i] and R[i] in increasing order, we can try randomizing the order in which we try it. Another is to use more sophisticated improvements like beam search.

AUTHOR’S AND TESTER’S SOLUTIONS:

setter
tester
editorialist

gvaibhav21 · November 21, 2016, 12:56am

what was the significance of 3960 in this problem? Was it to allow probabilistic solutions to pass?

iscsi · November 21, 2016, 2:09am

yes, we want to allow probabilstic solutions. My solution which is unfortunately linked below on the SETELE as a setter solution is using beam search. The point it is a heuristic solution and maybe it sometimes cannot produce a result (Of course we can make the beam wider and it will produce every time if a solution exists, but it takes more time if we don’t use a good heuristic). The advantage of the beam search solution is that if we increase the N to 15-20 it seems to still working and fast. (But we cannot guarantee the solution always exists, that is the reason why we use a small N=8.)