Problem Link:
Author: Misha Chorniy
Tester: Pushkar Mishra
Editorialist: Praveen Dhinwa
Difficulty:
medium hard
Pre-Requisites:
sqrt decomposition
Problem Statement
An array is called dominated if there exists an element in it whose frequency in the array is strictly greater than the half the size of array.
You will be given an array A of size n. You have to find out number of subarrays of A which are dominated.
Explanation
There is a unique dominating element of an array if there is any
An array is called dominated if there exists an element in it whose frequency in the array is strictly greater than the half the size of array. Note that by half of size the array, we mean the \lfloor \frac{\text{size of array}}{2} \rfloor. According to this definition, array [1, 1, 2, 3] won’t be called dominated as maximum frequency of an element in it is 2 which is equal to \lfloor \frac{4}{2} \rfloor.
Also note that this definition makes sure that there will be a unique dominating element of an array. We will call that element a dominator of the array. So, instead of counting number of dominating subarrays, we can as well find count the number of subarrays dominated by some dominator element.
Sqrt decomposition based solution
Let us first try to find out the subarrays dominated by A_i.
Let j \leq i be the smallest index such that A_i dominates the subarray [A_j, A_i].
Similarly, let k \geq i be the largest index such that A_i dominates the subarray [A_j, A_k].
Let freq(A_i) denote the frequency of A_i in the array A. Note that A_i can’t dominate a subarray of size > 2 * freq[A_i].
So, we notice that subarrays [j, i] and [i, k] will have their sizes at max 2 * freq[A_i].
Intuitively, we can understand that the number of subarrays dominated by an element will depends on its frequency, also the block of array surrounding it ([j, k] as described above) will have its size proportional to its frequency.
This suggest us to think of solutions based on frequency of an element. We will employ a sqrt decomposition solution which will deal with low and high frequency elements differently. For categorizing an element A_i into high or low frequency category, we have to define a boundary M (whose value will be decided in the end, consider it variable for now), such that elements with frequency < M are called low frequency elements, others are called high frequency elements.
Dealing with low frequency elements
We iterate over i from 1 to n and want to find number of j's such that j \geq i and subarray [A_i, A_j] is dominated by some low frequency element.
Notice that we don’t need to extend j all the way till N. In fact, we can note that i.e. j will be at most i + 2 * M. This is due to the fact that frequency of a low frequency element can be at max M, so the maximum size of array which can be dominated by this element will be at max 2 * M.
So, we iterate over j from i to at max i + M and find the maximum frequency element in the subarray [A_i, A_j] and check whether it is a low frequency element or not. If yes, then this subarray is dominated by some low frequency element, so we add it our answer.
for i = 1 to N
map<int, int> countMap; // key is an element and value is the count of the element
int which = 0
int maxCount = 0
for j = 1 to min(N, i + 2 * M)
countMap.add(A[j])
// Update if A[j] has become the maximum frequency element.
if (countMap.get(A[j]) > maxCount)
maxCount = countMap.get(A[j])
which = A[j];
// Check whether which is a low frequency element
if (freq[which] <= M)
ans++
Overall, this process will take O(N M log M) where log M factor comes from using maps for maintaining frequencies of the elements.
Dealing with high frequency elements
Now, we will see how to find out number of subarrays dominated by some high frequency element. So, we will take up each high frequency element in the array one by one and will find the number of subarrays dominated by it. Let x be such an element.
Let us take an example array [1 2 3 1 4 1 1 1 5 1]. Let us say that we want to compute the dominating subarrays that have 1 as the dominating number.
So we can ignore the other numbers and view our array as [1 X X 1 X 1 1 1 X 1] where X is some element other than 1.
How to check whether the subarray [A_j, A_i] is dominated by x.
Let us write this condition this mathematically. Let S be another array such that S[i] denotes the frequency of x in the subarray A_1 to A_i. In competitive programming lingo, such an array S is called prefix sum array of an array. So, frequency of x in subarray [A_j, A_i] will be S[i] - S[j - 1].
The subarray [A_j, A_i] would be dominated by x if
S_i - S_{j - 1} > \frac{i - \, (j - 1)}{2}
\implies 2 * (S_i - S_{j - 1}) > i - (j - 1)
\implies 2 * S_{j - 1} - (j - 1) < 2 * S_i - i.
Convert to some already known problem
Let us create an array B such that B_i = 2 * S_i - i. Now, the problem of finding number of subarrays dominated in array A by element x will be same as finding number of pair of indices (i, j), i \leq j in array B such that B_i < B_j.
What if we can find the number of inversion pairs in the array B? A pair of indices (i, j) is called inversion pair if i < j and B[i] > B[j]. We can notice that answer of our subproblem is \frac{n \, (n - 1)}{2} \, - \, \text{number of inversion pairs in } B.
We can solve the above inversion pair problem in O(N \, log N) where N denotes the number of elements in the array B. Note that the time complexity does not depend on range of elements of the array. You can submit your solutions for this problem at http://www.spoj.com/problems/INVCNT/. There are many possible approaches for this problem. The ones that I know are using merge sort, binary indexed tree (BIT) and segment tree. Please use the following resources to learn about these techniques.
- http://www.geeksforgeeks.org/counting-inversions/
- http://pavelsimo.blogspot.in/2012/09/counting-inversions-in-array-using-BIT.html
- http://www.geeksforgeeks.org/count-inversions-array-set-3-using-bit/
Binary indexed tree based approach
Now, I will mention the approach by binary indexed tree for finding number of subarrays dominated by high frequency elements.
So, we iterate over array A from left to right. We need to maintain some information so as to find out number of j’s that
2 * S_j - j < 2 * S_i - i. We maintain an array cnt where cnt[x] denotes the number of j’s such that 2 * S_j - j = x. As S_j can take value at max N, so the size of the cnt array will be at max 3 * N. Now, for finding number of j's for an i$i$, such that 2 * S_j - j < 2 * S_i - i, we can find the sum of elements in the range [-N, 2 * S_i - i].
So, we need a data structure in which we can support the following operations
- Increment cnt[x] by 1.
- Find the sum of elements from cnt[1] to cnt[L] for a given L.
These operations can be efficiently performed using binary indexed tree (BIT), or by using segment tree.
long long solve(int x) {
long long ans = 0
for i = 1..n
S[i] = S[i - 1] + (a[i] == x) ? 1 : 0;
add(2 * S[i] - i, +1)
ans += query(2 * S[i] - i);
return ans;
}
So, for an high frequeny element x, we can find the number of subarrays dominated by it in O(N \, log N) time. Note that number of high frequency elements are bounded by \frac{N}{M}, So, time complexity will be O(N \, log N \cdot frac{N}{M}).
Choosing appropriate value of M
By combining the above two approaches for dealing with low and high frequency elements separately, we get the total time as O(N\, M\, log N + N \, log N \cdot \frac{N}{M}). As M \leq N, log M \leq log N. So, we can as well write the expression as O(N \, (M + \frac{N}{M}) \, log N. Now we will choose M in such a way overall value of expression is minimized. As factor N \, log N is fixed for a given N, we need to minimize (M + \frac{N}{M}) which takes it minimum value at M = sqrt(N).
So, overall time complexity is O(N \, sqrt N \, log N). Did you get the reason why did we call this method sqrt decomposition method
Subtask 4
Now, this solution takes the approach of dealing with high frequency elements in the previous approach and smartly generalize this approach for all the elements in total O(N log N) time. So, similar to that approach, we will process each unique element x one by one.
Let (i_1, i_2, i_3, .., i_k) denote the positions in array A (in increasing order) where number x occurs. , i.e. A[i_1] = A[i_2] = \dots = A[i_k] = x.
Now, let us see two neighbouring occurrences of x, [i, j], so A[i] = A[j] = x.
So we do the following kind of operations at each step while moving from position i to j
add(2 * S_i - i, +1)
query(2 * S_i - i)
add(2 * S_{i + 1} - (i + 1), +1)
query(2 * S_{i + 1} - (i + 1)
add(2 * S_{i + 2} - (i + 2), +1)
query(2 * S_{i + 2} - (i + 2))
…
add(2 * S_{j - 1} - (j - 1), +1)
query(2 * S_{j - 1} - (j - 1))
add(2 * S_{j} - j, +1)
query(2 * S_{j} - j)
Now, we can re-write this as below by using the observation that value of S_k for k from i to j - 1 will be equal to S_i whereas the value of S_j will be S_i + 1.
add(2 * S_i - i, +1)
query(2 * S_i - i)
add(2 * S_i - (i + 1), +1)
query(2 * S_i - (i + 1))
add(2 * S_i - (i + 2), +1)
query(2 * S_i - (i + 2))
…
add(2 * S_i - (j - 1), +1)
query(2 * S_i - (j - 1))
add(2 * (S_i + 1) - j, +1)
query(2 * (S_i + 1) - j)
So, this means that from index i to j - 1, the corresponding index in cnt array gets decreasing by 1. So, this will form a continuous range. Similarly, query operation will also be forming a decreasing range.
Note that add operation done corresponding on i-th element, does not affect the next query operation till the j^{th} element. It is due to the fact the query index is strictly decreasing similar to the add operation. Please convince yourself with by working with some example. This is very crucial point to understand.
So, we can simulate the effect of all these operations in a single step by simply finding the sum of elements of cnt array from [2 * S_{j - 1} - (j - 1), \, 2 * S_i - i] and adding it to our answer and then increasing the cnt array range [2 * S_{j - 1} - (j - 1), \, 2 * S_i - i] by 1.
These range update and query operations can be done by maintaining a segment tree or a binary indexed tree…
Finally, at the end of processing the number x, we have to remove/undo the changes/update that we did during the processing so that the upcoming new element can be processed correctly.
For that, we have to somehow undo the changes that we did to cnt array during the processing of number x. We will store all the ranges at which did the update operations during the processing of x. We will go again over those ranges and undo their effect by making updates of adding -1 as opposed to addition of 1 earlier.
So, processing a unique number x with frequency freq_x takes O(freq_x \, log N) time. This enables us to process all the elements of array A in O(N \, log N) time.
Solution:
Setter’s solution can be found here
Tester’s solution can be found here
Please feel free to post comments if anything is not clear.