KGOOD - Editorial

PROBLEM LINK:

Contest
Practice

Author: Kevin Atienza
Testers: Sergey Kulik and Vasya Antoniuk
Translators: Sergey Kulik (Russian), Team VNOI (Vietnamese) and Hu Zecong (Mandarin)
Editorialist: Kevin Atienza

DIFFICULTY:

Easy

PREREQUISITES:

Strings, Greedy algorithms

PROBLEM:

A string is K-good if for every two letters in the string, if the first appears x times and the second y times, then |x - y| \le K. Given K and a string w, what is the minimum number of letters to remove to make w K-good?

QUICK EXPLANATION:

For every m, define F[m] as follows:

  • Remove every letter that appears < m times.
  • For every letter that appears > m+K times, remove just enough letters so that it appears exactly m+K times.
  • Let the number of removed letters be F[m].

The answer is then the smallest among \{F[1], F[2], \ldots, F[N]\}. To compute this quickly, simply precompute the letter frequencies in w beforehand, and operate on that array when computing F[m].

EXPLANATION:

Letā€™s use the notation w_{(c)} to denote the number of appearances of the letter c in the string w.

K-goodness

By definition, a string is K-good if for every two characters c and d, \left|w_{(c)} - w_{(d)}\right| \le K. But for a given string w, the value of \left|w_{(c)} - w_{(d)}\right| is maximized if c and d are the most and least occurring characters in w (or vice versa). Thus, we have a potentially simpler criterion for telling whether a string is K-good:

If c and d are the most- and least- occurring characters in w, then w is K-good if and only if w_{(c)} - w_{(d)} \le K.

Notice that we have removed the absolute value because w_{(c)} \ge w_{(d)}.

Removing letters

This immediately gives us a greedy algorithm for turning w into a K-good string in a few letter removals:

  • Compute the frequency counts for each of the 26 letters.
  • Find the least-occurring character d. (Note that we only consider characters that actually appear in w, that is, whose frequency counts are positive.)
  • For every character c appearing more than w_{(d)} + K times, remove w_{(c)} - (w_{(d)} + K) instances of c.

Clearly this algorithm yields a K-good string if you use our criterion above. Unfortunately though, it doesnā€™t yield the minimum number of removals! For example, consider the input boooooooooo 0. The algorithm above will have us remove 9 instances of the letter ā€˜oā€™, yielding the string ā€œboā€ which is 0-good. But itā€™s possible to do it with just 1 removal: just remove the ā€˜bā€™!

So how do we fix this? Clearly, we canā€™t assume that the least-occurring character in w will appear in the final string. In fact, it isnā€™t obvious what the minimum frequency count must be in the final string. So what do we do? We just try them all!

Our adjusted algorithm would be:

  • Decide on the minimum frequency count on the final string, and call it m.
  • Remove every letter that appears less than m times.
  • For every letter c that appears more than m + K times, remove w_{(c)} - (m + K) instances of c.

The answer is then the minimum number of removals for all such values m in the range m \in [1,N]. Now, clearly this is correct, but is it fast enough? Well, it depends on your implementation. Obviously, if you process the string for every m, youā€™ll get an O(N^2) running time which is pretty bad. But thankfully, we can reduce it by noticing that every iteration of the above procedure only requires the frequency counts of all the letters! By computing the frequency counts just once and then using that array instead, we get a running time of O(N)!

Hereā€™s a pseudocode:

// compute counts
counts['a'..'z'] // initialized to zero
for i=0..|w|-1:
    counts[w[i]]++

// compute answer
answer = N
for m=1..N:
    current = 0
    for c='a'..'z':
        if counts[c] < m:
            current += counts[c]
        else if counts[c] > m + K:
            current += counts[c] - (m + K)

    if answer > current:
        answer = current

print answer

Optimization

We can actually still optimize the algorithm above, by noticing that we donā€™t have to try all values m in the range [1,N]. Instead, we only try the m values that are equal to some w_{(c)}!

Why is that? Suppose we chose a value m and itā€™s not equal to any w_{(c)}. Then if we instead choose m+1, then there arenā€™t any additional characters whose frequency counts are < m+1, so we donā€™t have to remove any additional characters. However, we potentially save a few removals, particularly those whose frequency counts are > m+K! Thus, the result of the procedure above for m cannot be strictly less than its result for m+1, so it follows we donā€™t have to check m if itā€™s not equal to any w_{(c)}!

With this optimization, the number of times we perform the procedure reduces from N to (at most) 26! To formalize the optimization, notice that our original O(N) algorithm actually runs in O(N\alpha) where \alpha is the size of the alphabet. This optimization reduces that to O(N + \alpha^2), which is usually better because N \gg \alpha usually!

Hereā€™s a pseudocode:

// compute counts
counts['a'..'z'] // initialized to zero
for i=0..|w|-1:
    counts[w[i]]++

// compute answer
answer = N
for d='a'..'z':
    m = counts[d] // consider only the relevant m's
    current = 0
    for c='a'..'z':
        if counts[c] < m:
            current += counts[c]
        else if counts[c] > m + K:
            current += counts[c] - (m + K)

    if answer > current:
        answer = current

print answer

Time Complexity:

O(N)

AUTHORā€™S AND TESTERā€™S SOLUTIONS:

Setter
Tester

1 Like

I did a simple dp approach in this problem. I sorted the letter frequencies from least to highest.

Let A be the string and C be the sorted array of letter frequencies.
Then I used recurrence:

f(L, R, K) = minimum number of removals to make A[L:R] K-good
           = min(C[L]+f(L+1,R,K),C[R]-C[L]-K+f(L,R-1,K))  if C[R] - C[L] > K
           = 0 otherwise

Complexity of solving the DP is O(1) but recording the letter frequency is O(n).

Great editorial! :smiley:

3 Likes

I tried solving it using both approaches during contest but couldnā€™t get AC due to bugs. Btw, this doesnā€™t look like O(1)

I think itā€™s O(1) because there are 26*26 subproblems, which is a small constant. But you can argue that it is not O(1) because the computation still depends on the string and its letter frequencies.

How will we check a word of length 1 to be K-good ?

For example : if w=ā€œpā€ and K=5,
then is it 5-good or not ?

1 Like

If the word is of length one, it is K-good for any K because no pair of distinct letter exist to contradict the condition.

1 Like

1

abcdddddeeeeefffffggggghhhhh 3

the answer should be 3.when we remove all a,b,c

but some solutions that are accepted gave answer 5.

I have doubt in quick explanation section :
suppose input is aaaaea 0 then b[0]=5, b[1]=1
so we need to calculate F[0],F[1]ā€¦and min will our ans
calculating F[1] means we are taking b[1] i.e ā€˜eā€™ as reference element from which alphabets will not be removed.
My question is here :
Remove every letter that appears <m times.
Why not remove something less than m-k timesā€¦ here with m i mean b[0]ā€¦
since we are using modulus: For every letter that appears >m+K times, remove just enough letters so that it appears exactly m+Ktimes

@mightymercado

Can you please explain logic and working of your solution.

@mightymercado
I am beginner, your solution seems to be very nice can you please tell me how have you prepared for dp. I have seen some basic problems like 01-knapsack, LIS, coin change but I am not able to recognize dp when it comes to different problems(as in programming contests). Please help me out.

if input string is aaabbbbbccccccccc 3ā€¦i.e 3aā€™s, 6 bā€™s, 9 cā€™s
current for a should be 3
current for b should be 0
// though in the last we need to filter out b
current for c should be 3

But according to the algorithm provided here
current of a 3
current of b 3
current of c 9
I think the sole purpose for calculating current here is we are taking that element as reference that means nothing will be removed from that to make the string k goodā€¦
Can author please clarify my doubtā€¦Thanks in advance

@arpit728

Let a be the sorted array of letter frequencies and itā€™s ends are st(start) and ed(end).
Itā€™s best to remove letter with highest frequency or letter with lowest frequency. If we are removing letter with lowest frequency then we will need to remove them as a whole and if we are removing letter with highest frequency then we need to remove a[ed]-(a[st]+k) letters to make it k good. So our recurrence will look like this

calcmin(st,ed) = 0 if st>=ed or a[ed]-a[st]<=k

                = min(calcmin(st+1,ed)+a[st],calcmin(st,ed-1)+a[ed]-(a[st]+k));

Submission

You can read about this approach, here,

Itā€™s related to this problem.

thanks for explaining it

@kevinsogo

If m is the minimum count in final string then why wc-(m+k) letterh should be removed. But to make any character appear m times wc-m characters should be removed.

@kevinsogo

please explain why we should take m that is equal to some count. I read it in optimization section but didnā€™t get it.

@mightymercado Thanks for explaining it.

#include<stdio.h>
#include<string.h>
#include<ctype.h>
int main(void) {
int t;
scanf("%d",&t);
while(tā€“){
char str[100001];
int a[26]={0},k;
scanf("%s",&str);
scanf("%d",&k);
int i;
size_t len = strlen(str);
int min=len,max,count=0;
for (i = 0; i < len; i++){
char c = str[i];
a[(int)(c-ā€˜aā€™)]++;
}
for(i=0;i<26;i++){
if(a[i]==0)
continue;
else
if(min>a[i])
min=a[i];
}
//printf("%d",min);min is correct
max=min+k;
for(i=0;i<26;i++){
if(a[i]>max)
count+=a[i]-max;
}
printf("%d\n",count);
}

return 0;

}
can someone please help me out what is wrong in it?

@mightymercado I solved this problem using somewhat same approach as yours but i am getting TLE inspite of my solution being faster. i may have left any boundary cases. Can someone please check my code. http://ideone.com/rE25jd