CHN04 - Editorial

kevinsogo · January 29, 2016, 11:14am

PROBLEM LINK:

Author: ???
Tester: Kevin Atienza, Jingbo Shang
Editorialist: Kevin Atienza

PREREQUISITES:

Dynamic programming, sqrt decomposition, segment tree

PROBLEM:

Given an array L = [L_1, L_2, \ldots L_n] which is a permutation of [1\ldots n], we need to remove all elements of an array. In a single removal step, we first select an element, which we’ll call a source, then remove it and all elements it forms an inversion pair with.

What is the minimum number of moves to remove all elements?

QUICK EXPLANATION:

No two sources will form an inversion pair. Thus, the sequence of sources is an increasing subsequence.
Let L_i < L_j be two consecutive sources. Then there can never be a k such that i < k < j and L_i < L_k < L_j (otherwise not all elements are removed). In other words, our increasing subsequence is maximal.
The sources can be selected in any order.

Thus, the answer is the shortest maximal increasing subsequence of L.

This can be computed with dynamic programming and sqrt decomposition. Let f(j) be the shortest such subsequence that ends at L_j. Then f(j) = 1 + \min_{i < j} f(i) for all i satisfying the maximality condition. With sqrt decomposition, each value can be computed in O(\sqrt{n}) amortized time. Other techniques exist (such as using segment trees) to compute this quickly.

EXPLANATION:

The first thought is to be greedy and simply pick the element with the maximum number of inversion pairs at every step. However, this is incorrect, as shown by the following counterexample:

[4,1,2,10,9,8,7,6,5,3]

3 has the maximum number of inversions (7 pairs). If you remove it, you’ll end up with

[1,2]

and you’ll need to perform two more steps. But if you had deleted 4, then you’ll be left with

[10,9,8,7,6,5]

and you only need to remove one more. (Any one will do.)

The failure of this greedy solution stems from the fact that if two values L_i and L_j form an inversion pair (that is, L_i > L_j and i < j), then at most one can be used as a source. They can’t be both sources because once one is selected as a source, the other will be removed at that step.

But since there are no inversions among sources, if we take all sources as a subsequence of L_i, then we see that they form an increasing subsequence. Further, note that since they don’t form any inversions, the sources can be selected in any order. Thus, the answer is simply the shortest increasing subsequence such that all other elements form an inversion with one of them.

But how do we know if an increasing subsequence satisfies this condition? All other elements must form an inversion with one of the sources. The only way this fails for a given element, say L_k, is when for every source L_i, either i < k and L_i < L_k or i > k and L_i > L_k. But this simply means that our increasing sequence is not maximal, i.e. we can add more elements to the sequence and still make it increasing. Thus, we see that the answer is simply the shortest maximal increasing subsequence of L.

We will now try to find this subsequence.

Dynamic programming

An obvious dynamic programming approach comes to mind. Let’s define f(i) as the shortest maximal increasing subsequence of L[1\ldots i] that ends in L_i. Then we have the obvious recurrence

f(i) = 1 + \min_{\substack{j < i \\\ L_j < L_i}} f(j)

For all j satisfying the maximality condition, i.e. there is no k such that j < k < i and L_j < L_k < L_i. If there is no such j, then we simply say f(i) = 1. With f, the answer is simply \min_{j} f(j) for all j for which there is no i > j such that L_i > L_j. (Otherwise, the sequence will not be maximal.)

The following pseudocode illustrates one way of computing f and the answer:

f[1..n]

add an (n+1)th element to L, with value (n+1)

for i = 1..n+1:
    m = +infinity
    L_max = -infinity
    for j = i-1..1 by -1:
        if L_max < L[j] < L[i]:
            L_max = L[j]
            m = min(m, f[j])

    if m is infinity:
        f[i] = 1
    else:
        f[i] = 1 + m

print f[n+1]-1

(Note that we added n+1 at the end. This makes it easier for us to compute the answer without having to run a final loop for the answer.)

It iterates among all j < i in decreasing order, and uses a variable L_max to keep track of the running maximum L[j] encountered so far to check maximality quickly.

But doing this DP this way takes O(n^2) time so we need to find a faster way. We will describe a few ways to do it.

Sqrt decomposition (Malvika)

The problem might be amenable to optimization with sqrt optimization. Let’s say we split the array into blocks of size u each. (The last one may have fewer than u elements.) Thus, there are approximately n/u blocks. We will identify a block by its last index divided by u. For example, block u is L[1\ldots u], block 2u is L[u+1\ldots 2u], etc. In general, block j is L[j-u+1\ldots j] (where j is a multiple of u).

The slowest part of the code above is the following pass:

m = +infinity
L_max = -infinity
for j = i-1..1 by -1:
    if L_max < L[j] < L[i]:
        L_max = L[j]
        m = min(m, f[j])

Our goal is to use the sqrt decomposition so that each block can be processed quickly. Let’s define a “pass” as the following:

def perform_pass(j, bound, L_max):
    m = +infinity
    start = j-u+1
    end = j
    for j in end..start by -1:
        if L_max <= L[j] <= bound:
            L_max = L[j]
            m = min(m, f[j])

    return the pair (m, L_max)

(Notice we used the comparison <= instead of < in these, but since L is a permutation of [1\ldots n], this shouldn’t affect correctness.) Please take the time to understand what this piece of code does, because this will be very important.

We can use perform_pass to decompose our code into blocks:

for i = 1..n+1:
    m = +infinity
    L_max = -infinity

    // process 'extras'
    j = i-1
    while j % u != 0:
        if L_max < L[j] < L[i]:
            L_max = L[j]
            m = min(m, f[j])
        j -= 1

    // process blocks
    while j > 0:
        (cand_m, L_max) = perform_pass(j, L[i], L_max)
        m = min(m, cand_m)

        j -= u

    if m is infinity:
        f[i] = 1
    else:
        f[i] = 1 + m

    if i % u == 0:
        // new block to preprocess
        preprocess_block(i)

print f[n+1]-1

How do we preprocess a block L[j-u+1\ldots j] so that perform_pass can be done quickly in it? Here are a few observations:

We can replace bound with the largest element in the block that is \le bound.
We can replace L_max with the smallest element in the block that is \ge L_max.
After doing these replacements, the final value of L_max will be bound.

These new values can be computed with binary search on a sorted copy of the block.

Both replacements are acceptable, since we’re only comparing bound and L_max with elements from the block. (However, it has a minor problem, and that is when the new bound value becomes smaller than L_max. In that case, no values in the block satisfies the maximality condition, so simply return the pair (+infinity, L_max).) But in doing so, the number of distinct (L_max, bound) arguments for each block becomes O(u^2)! This suggests precomputing the answers for all distinct arguments, so that each perform_pass needs only a single lookup (after binary search).

More specifically, let’s define M(j, a, b) as the first value in the result of perform_pass(j, L[a], L[b]). Then we can use M for perform_pass as follows:

def perform_pass(j, bound, L_max):
    a = (index of largest element <= bound in L[j-u+1..j])

    if L[a] < L_max:
        return (+infinity, L_max)

    b = (index of smallest element >= L_max in L[j-u+1..j])

    m = M(j, a, b)
    L_max = L[a]

    return (m, L_max)

This now takes O(\log u) time, so all that remains is to precompute the M table. The size of M is O(nu) because j has to be a multiple of u, and a and b must belong to [j-u+1\ldots j].

Let’s assume j and b is fixed, and let’s try to compute M(j, a, b) for all a in increasing order of L[a]. If L[a] < L[b], then a = b, and the answer is simply +infinity (because there are no candidates j). Otherwise, L[a] \ge L[b]. Let a' be the index such that L[a'] < L[a] and L[a'] is the largest among all such $a’$s. Note that computing M(j, a', b) is similar to computing M(j, a, b). The only difference is that L[a] is not considered in M(j, a', b). This means that the computation in both are the same until the point when they encounter index a, in which case M(j, a', b) will just ignore it, while M(j, a, b) will consider it. We can exploit this similarity by using a stack to keep track of the computation state whenever L_max changes.

The stack contains pairs of integers (a, ans) in increasing order of a, and this pair means that the computation up to index a results in the answer \mathit{ans}.

The following pseudocode illustrates it for a fixed block j:

for b = j-u+1..j:
    initialize 'stack' as an empty stack
    for a from [j-u+1..j] in increasing order of L[a]:
        if L[a] < L[b]:
            set M(j, a, b) as +infinity
        else:
            ans = f[a]
            while stack is not empty:
                (p_a, p_ans) = stack.top()
                if a < p_a:
                    ans = min(ans, p_ans)
                    break
                else:
                    stack.pop()

            set M(j, a, b) as ans
            push the pair (a, ans) onto the stack

Since each a only pushes one entry to the stack, and each entry is removed at most once, the inner loop runs in O(u) amortized time, thus the whole code runs in O(u^2). We can do this preprocessing for each block after we obtain all f values in that block. Since it takes O(u^2) time to process each of the O(n/u) blocks, this takes O(nu) time overall. Afterwards, each f(i) value can be computed in O(u + n/u \log u) time. Thus, to get the fastest running time, we must select u = \Theta(\sqrt{n \log n}), which gives us a running time of O(n \sqrt{n \log n}).

The log factor can be shaved off by removing binary search, by precomputing all results of binary searches for each block. Specifically, for each block, we create two arrays:

\mathit{down}[v], defined as the index of the largest element \le v in the block.
\mathit{up}[v], defined as the index of the smallest element \ge v in the block.

All values of these arrays for 1 \le v \le n can be computed for each block in O(n) time, which means preprocessing a block takes O(u^2 + n) time. Using this, f(i) can now be computed without binary search in O(u + n/u). The overall running time becomes O(nu + n^2/u), and the optimal choice for u is \Theta(\sqrt{n}), giving a running time of O(n \sqrt{n}).

The rest of the editorial describes alternative approaches to optimizing the computation of the f(i). Note that we have already solved the problem, so if you’re happy with this solution, you may skip them.

Sqrt decomposition 2 (Arjun)

An alternative sqrt decomposition solution exploits a particular structure of the j s involved in the formula f(i) = 1 + \min_j f(j). We will call such j s “candidates” for i, i.e., a candidate for i is an index j < i such that L_j < L_i and there is no k such that j < k < i and L_j < L_k < L_i.

The first part is to compute the array of indices of the sorted version of L, i.e. x = [x_1,\ldots,x_n] is a permutation of [1,\ldots,n] such that

L_{x_1} < L_{x_2} < \ldots < L_{x_n}

Next, we perform sqrt decomposition on x. Let’s say we split them into blocks of size u each. Thus, the first block contains the indices of the u smallest values of L, the next one contains the next u values, and so on.

Now, we will define \mathit{parent}[j] as the largest i on the same block such that i < j and L_i > L_j. The \mathit{parent}[j] defines some kind of “forest”. It’s much easier to visualize this “forest” by thinking of each (i,L_i) pair as a point on the 2D plane. Then \mathit{parent}[j] is simply the rightmost point that is to the left and above (j,L_j), and the “forest” structure is easier to visualize.

This \mathit{parent} is actually related to the structure of our candidates:

Let j_1 > j_2 > j_3 > \ldots be the sequence of candidates for i. If j_c and j_{c+1} are in the same block, \mathit{parent}[j_c] = j_{c+1}.

This is just putting into words our pseudocode above for perform_pass.

Now, let’s try to compute f(i). We will want to find all candidates for i. The first part is to compute the rightmost index in every block that are to the left of i, i.e. find the largest j < i in each block. This can be done with binary search, or with a lookup table as in the previous. Thus, we get a sequence of indices [r_1, r_2, r_3, \ldots]. Because of the way x was formed, these indices satisfy L_{r_1} < L_{r_2} < L_{r_3}, \ldots. However, these points are not necessarily candidates for i. For example, r_c will not be a candidate for i if r_c < r_{c+1} and r_c < r_{c-1}. But we can easily remove those points by using a stack, similar to the Graham scan for removing “concave” parts of the convex hull:

stack = [r[1], r[2]]

for i in 3...(number of blocks):
    while stack.size() >= 2:
        size = stack.size
        r_j = stack[stack.size]     // last element
        r_k = stack[stack.size - 1] // next-to-last element
        if r_j < r[i] and r_j < r_k:
            stack.pop()

    stack.push(r[i])

// stack now contains only candidates

Thus, after doing this loop, we may assume that [r_1, r_2, \ldots] contains the first candidate for i in each block.

For each block, let r_c be the first candidate. We will need to figure out the sequence of candidates starting from r_c. These can be obtained by following the \mathit{parent} pointer until we find an index that is less than r_{c+1}, in which case we jump to the next block. Thus, for each block, we will need to know the lowest ancestor of r_c that is still greater than r_{c+1}, and then find the minimum f(j) along this path. We can do this by simply listing all ancestors of each j (of which there are at most u). Then to find the necessary ancestor, we simply perform a binary search on this list. To obtain the minimum f(j) on the paths from j to each ancestor, we simply maintain another array containing the required minimums. These values can be computed in O(u) time with a single pass.

The algorithm requires O(nu) preprocessing. Computing each f(i) and updating the structure requires O(n/u \log u + u) time. Thus, the optimal choice is \Theta(\sqrt{n \log n}) which gives a running time of O(n \sqrt{n \log n}).

As before, by replacing the binary searches with appropriate lookup tables, the running time can be reduced to O(n \sqrt{n}).

Segment trees (Surya)

We will now describe a technique which uses segment trees. This has the advantage of being much faster than previous algorithms.

We want to compute the values f[1], f[2], \ldots, f[n]. We can do so using divide and conquer as follows:

Find the f[i] for 1 \le i \le n/2 by recursively calling our procedure for the range [1,\ldots,n/2].
Next, update the $f[i]$s for i > n/2 with candidates that are \le n/2. This step will be explained below.
Finally, recursively call our procedure for the range [n/2+1, n] to update the $f[i]s with candidates > n/2$.

Now, we will need to update the values of f[n/2+1], \ldots, f[n] using only candidates from [1, \ldots, n/2]. Say we want to update f(i), where i > n/2. The key here is finding which j s in [1, \ldots, n/2] are candidates for i. This can be done with a particular clever observation. First, we define a few things:

For j \le n/2, define \mathit{succ}[j] as the smallest L_k such that L_j < L_k and j < k \le n/2. (or +\infty if no such k exists.)
For i > n/2, define \mathit{pred}[i] as the largest L_k such that L_k < L_i and n/2 < k \le i. (or -\infty if no such k exists.)

Then the observation is that:

j is a candidate for i if and only if \mathit{pred}[i] < L_j < L_i < \mathit{succ}[j].

Take a moment and try to see why this is true.

Now, armed with this, we now see that to update f[i], we only need to find the minimum f[j] among all j satisfying \mathit{pred}[i] < L_j < L_i and \mathit{succ}[j] > L_i. By plotting the points (\mathit{succ}[j], L_j), we now see that the points we need are those contained in the rectangle [i,+\infty]\times[\mathit{pred}[i],i], which reduces our problem to standard 2D range min queries!

We can reduce this to 1D range min queries by noticing that we only need to perform offline 2D range min queries. Multidimensional range queries without updates usually can be reduced one dimension by answering the queries in sorted order according to one dimension, and performing updates in that direction. In our case, we will process queries and updates in decreasing order of \mathit{succ}[j] and L_i, respectively. An update is simply an insertion of a new point. Thus, by doing this, we only need to process 1D range min queries, which can be done with normal segment trees!

The following pseudocode illustrates it:

Let x be [1..n/2] sorted according to succ[j]

Let st be an empty segment tree
for i in [n/2+1..n] in decreasing order of L[i]:
    // perform updates
    while x is not empty and succ[x.last()] > L[i]:
        st.set(L[x.last()], f[x.last()])
        x.pop_last()

    // perform queries
    f[i] = min(f[i], 1 + st.min(pred[i], L[i]))

Note that the keys for st can range from 1 to n, so you can’t initialize a new big segment tree every time. There are two ways around this:

Use implicit segment trees so initializing a new segment tree takes O(1) time.
Reuse the same segment tree, and use lazy propagation to reinitialize the whole tree.

The running time of this step is O(n \log n). Therefore, our whole divide and conquer solution runs in O(n \log^2 n) by the Master theorem. This also has the advantage that memory usage is low: only O(n).

Persistent segment trees (Kevin)

Finally, we describe another fast solution, which has an advantage over the previous one in that it is online. (So in a way it’s kinda like a best-of-both-worlds approach: It is online and fast.) The idea is similar to the original sqrt decomposition method, but instead of O(\sqrt{n}) blocks, we will need O(\log n) blocks with different sizes. Also, these “blocks” change (in a specific way) as we compute more values f(i).

We will preprocess each block so that M(j, a, b) the perform_pass routine can be performed quickly. However, we must modify the j argument slightly because our block is not of size u any more, so we instead say M(\mathit{start}, \mathit{end}, a, b), and we define it as the following:

def M(start, end, a, b):
    m = +infinity
    L_max = L[b]
    for j in end..start by -1:
        if L[b] <= L[j] <= L[a]:
            L_max = L[j]
            m = min(m, f[j])

    return m

This time, the results of the loop cannot all be stored. However, we can optimize it to run in O(\log n) with the use of segment trees.

First, assume that we can process all calls to M in increasing order of L[a]. This is what we did before when we were precomputing the values of M(j, a, b), except that we don’t assume b is fixed, i.e. we want to be able to process M calls with arbitrary b arguments. Thus, our “stack” approach to compute M(j, a, b) before will not work here.

The key here is to notice that M(\mathit{start}, \mathit{end}, a, b) is just the minimum of all f[j] for all j that satisfies the following conditions:

\mathit{start} \le j \le b
L[j] \le L[a]
There is no k such that j < k \le \mathit{end} and L[j] < L[k] < L[a]

This is true even if L[b] > L[a], in which case no candidate j s exist and the “minimum” f[j] is simply +infinity.

So for a fixed a, we want to be able to answer such queries quickly. To so so, we can maintain an array T[\mathit{start}\ldots \mathit{end}], where T[j] is defined as

T[j] = \begin{cases} f[j] & \text{if $L[j] \le L[a]$ and there is no $k$ such that $j < k < \mathit{end}$ and $L[j] < L[k] < L[a]$} \\\ +\infty & \text{otherwise} \end{cases}

Thus, M(\mathit{start}, \mathit{end}, a, b) is simply \displaystyle\min_{\mathit{start} \le j \le b} T[j], which can be answered with range minimum queries! Thus, a segment tree will help us.

All that remains is to maintain the array T as we update the variable a. Some $T[j]s will be replaced with +\infty$, and some will be replaced with f[j]. But remember that we’re processing them in increasing order of L[a], which means that when a is updated:

All T[j] will become +\infty for all j < a
T[a] becomes f[a]

But these are just range updates, which a segment tree also supports! Thus, we now have a way of computing M(\mathit{start}, \mathit{end}, a, b) for all arguments (a, b) in increasing order of L[a].

However, there is a problem: We can’t really process the queries in increasing order of a s. We need to process each query immediately. Luckily, we can accommodate for this by making our segment trees persistent, i.e. we preprocess the block and we should have a copy of the segment tree for every a. (Tree structures like segment trees can easily be converted into persistent ones with by thinking functionally: simply copy the nodes of the update path to “create” a new tree after each update.) This solves our issue.

The running time of processing a block is O(n \log n), and afterwards, each query runs in O(\log n).

All that remains is to describe how to maintain only O(\log n) blocks. The idea is that when computing f(i), the prefix L[1\ldots i-1] is decomposed into blocks whose sizes are powers of two. Additionally, the sizes of the trees must be decreasing. It can be seen that there is exactly one way to decompose an array (of any size) into blocks satisfying these conditions. This decomposition is related to the binary representation of i-1: There is a block for every on bit of i-1. So there are only O(\log n) blocks because i-1 has at most \log_2 n + 1 bits. Each block contains the segment tree structure as mentioned above.

Then, after computing f(i), we need to update these blocks. We will have to merge some of the blocks to maintain the property above. To merge two blocks, simply throw the segment trees array and build a new one from scratch. This takes O(k \log k) time, where k is the size of the block, so this seems to suggest that the running time is O(n^2 \log n). But notice that we don’t really merge large blocks that often. For example, if i-1 is odd, then no merging occurs at all. And if i-1 is even but is not a multiple of two, then only one merging occurs, and this only involves two blocks of size 1. In fact, it can be shown that total running time of all merging processes is O(n \log^2 n).

The running time of computing f(i) is O(\log^2 n), because there are O(\log n) blocks, each taking O(\log n) time to process. Thus, the overall running time is O(n \log^2 n).

Time Complexity:

O(n \sqrt{n} or O(n \log^2 n)

AUTHOR’S AND TESTER’S SOLUTIONS:

[setter][333]
[tester][444]
[editorialist][555]

[333]: The link is provided by admins after the contest ends and the solutions are uploaded on the CodeChef Server.
[444]: The link is provided by admins after the contest ends and the solutions are uploaded on the CodeChef Server.
[555]: The link is provided by admins after the contest ends and the solutions are uploaded on the CodeChef Server.