In the May long contest, contestants were able to make many submissions and extract data about the test cases. Some users say this is against the rules; others say only the smart ones were able to do it, and hence it is part of the fair game. I am starting this thread to let you all know about my discussion with the CodeChef team and to seek your views on this.
Here is what I have to say about challenge problems. I had a discussion with the CodeChef team about this, and we concluded the following. I request all of you to respond with your views, as this is a general problem and not related to just one contest.
Why should we avoid solutions that try to get information from test cases and optimize accordingly?
Testing is never complete. Though we test solutions on only 20-40 test files, our goal is to test them in the most robust way. Solutions should work within the given CONSTRAINTS and never depend on the given TEST DATA.
For the same reason, CodeChef introduced the concept of Partial Score. Partial scoring was introduced to make sure that solutions are more robust and can handle any kind of test data within the given constraints. Now, if users are able to extract details about the test data and optimize accordingly, the actual purpose of partial scoring is defeated.
That being said, solutions to challenge problems should not be optimized based on the test data. I propose the following:
- We should not show the total runtime. It is better to show the maximum runtime over all the test cases. This has two advantages. First, a user cannot simply force a longer run to extract information about the number of test cases. Second, users often ask questions like "If the time limit is 2 seconds, how is a submission's time shown as 15 seconds?" This happens because we show the sum of the time taken over all files; by showing the maximum (or even the average) we remove this ambiguity as well. Note that it may still be possible to extract test cases, but it would not be easy.
Issues found: None from user point of view.
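As a small sketch of the display change (the per-file runtimes below are made-up values, not real judge data), reporting the maximum instead of the sum no longer leaks the number of test files:

```python
# Hypothetical per-file runtimes (in seconds) for one submission.
times = [0.8, 1.2, 0.5, 1.9, 0.7]

total_time = sum(times)             # shown today: 5.1 s, leaks the file count
max_time = max(times)               # proposed display: 1.9 s, within a 2 s limit
avg_time = total_time / len(times)  # alternative display, same benefit

print(f"total={total_time:.1f}s max={max_time:.1f}s avg={avg_time:.2f}s")
```

With the total, a user who deliberately runs slowly can divide the reported time by the time limit to estimate the file count; the maximum (or average) hides that signal.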
This next idea is not great. To reduce the post size and save readers' time, I am removing it and adding it at the end of the post.
Too many test files : This is a simple method and it still works. Contestants are able to extract data because the number of data sets is low. What if we follow this method instead? Say we want N data sets. We create 5N data sets. Submitted programs run on all 5N data sets; the score is shown for the first N data sets (20% of 5N), and AC is given only if the submission passes all 5N data sets. As we generally use N = 20 data sets, 5N is 100, and it is very difficult to extract information about all those files. After the contest, we choose N random files out of the 5N, discard the rest, and rejudge all submissions on those N files. With this method, a user cannot extract data for all 100 test cases; it is very hard. Even if they manage to extract 20-30 cases, they cannot be sure those cases will be included in the final tests. The method is fair because the initial partial score is on the same test files for all users, and the final full score is on the same test files for all users. The only downside is that the setter has to generate a lot of test cases; as most setters use automated generators, I think the number of tests will not matter much. Another issue is that each submission will take a long time to be tested during the contest, as it has to run on 100 files. The rejudge after the contest won't be a problem, as we scale down to N tests only.
Issues Found: There is a software limitation of only 64 test files. However, the CodeChef team reported that it is not advisable to use more than 20 test files anyway.
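A minimal sketch of the scheme above, using N = 20 as in the post (the file-naming pattern is an assumption for illustration): all 5N files are judged during the contest, only the first N are scored live, and a random N-file subset is kept for the final rejudge:

```python
import random

N = 20  # intended number of final test files (the usual count, per the post)
all_files = [f"test_{i:02d}.in" for i in range(5 * N)]  # hypothetical names, 100 files

# During the contest: score is shown only for the first N files (20% of 5N),
# but AC requires passing all 5N files.
partial_files = all_files[:N]

# After the contest: keep N random files and rejudge everyone on just those.
final_files = random.sample(all_files, N)

print(len(all_files), len(partial_files), len(final_files))
```

Since everyone is rejudged on the same sampled subset, the fairness property from the post holds: both the partial and the final scores use identical files for all contestants.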
The following idea was proposed by Gedi Zheng:
Separate the first 20% of the test data from the full test data, and limit the frequency of full-data submissions.
We can have two submit buttons for the challenge problem. One runs on the first 20% of the test data and returns the score and verdict for that 20%. The other runs on the full test data and, during the contest, returns only the verdict. We then limit the frequency of full-data submissions to something like once every 4 hours, so the maximum number of full-data submissions during the contest is limited to about 60. It will be very difficult to get much information about the full test data, while it will be enough for contestants to ensure their program gets AC on the full data.
Issues Found: What if the user gets AC on the 20% test data, but WA when he submits on the larger files? Since we are limiting the number of submissions, this case can be really tricky.
Also, what should the limit on the larger data set be? Would something like 40 be too few? Would something like 80 be enough to extract data?
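The figure of 60 follows from the length of a typical long contest; a quick check (the 10-day duration is an assumption based on the usual format):

```python
contest_days = 10     # typical long-contest duration (assumption)
cooldown_hours = 4    # proposed gap between full-data submissions

max_full_submissions = contest_days * 24 // cooldown_hours
print(max_full_submissions)  # 60
```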
The following ideas were proposed by mugurelionut:
- generate (part of) each test case randomly at each run (this has been done in the past)
- make the output of each test case sufficiently large so that it cannot be found entirely with 500 submissions
Issue: What can we do about interactive problems? Also, how does this stop users from extracting details about the test data?
- make the judge answer interactive queries adaptively (e.g. in order to try to generate the worst case situation for each contestant on the fly)
I agree with what Gedi Zheng suggested. That seems like a good solution. Please leave your comments on this.
- Always AC : Say the problem is to minimize a score. Select some value MAX that is large enough for any test case; it has to be higher than the score produced by a very simple brute-force submission. Now, if a test file produces any verdict other than AC, we add MAX as the score for that file instead of showing the verdict. This way the verdict shown is always AC, and at the same time wrong submissions do not get a good score.

There are a lot of things we need to discuss here. Note that the user will never know if he is solving all the test data correctly; his submission may give WA on some test cases. We will display the score for only 20% of the test data (if we use this method, it would be better to display the score for 40%). So for the rest of the test data, even if the program is wrong and gives WA or RTE, he will never know. That is still okay: the top contestants should be able to write good programs that do not fail, and even if a program fails on one or two cases, MAX is added as the score and we take the average later.

The good thing about this method is that the user cannot extract any details about the hidden test data. One last doubt: say a user did very well and managed to get a low score on all the test data, but gets WA on one test file, and even then has the globally best average score. In this situation, the winner of the challenge problem actually has a bug and gets WA on one test case. If we are okay with accepting this, since the method will make the programs more robust, we can think of implementing it. Also, because we want solutions to follow the constraints while not allowing them to extract any information about the test data, we should provide clear test generation plans.
Issues Found: This turned out to be a bad idea. The user will never know whether his solution is correct, which makes the problem very hard to solve.
Also, what if the best solution fails on one test case? Can we accept it as the best solution?
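A sketch of the Always-AC scoring rule for a minimization problem (the MAX value and the per-file results below are made up for illustration): every non-AC file silently contributes MAX, and the final score is the average:

```python
MAX = 10**6  # penalty score; must exceed any simple brute-force score (assumption)

# Hypothetical (verdict, raw_score) pairs, one per test file, for one submission.
results = [("AC", 120), ("AC", 95), ("WA", None), ("AC", 210), ("RTE", None)]

# Non-AC verdicts are hidden: the file simply scores MAX instead.
scores = [score if verdict == "AC" else MAX for verdict, score in results]
average = sum(scores) / len(scores)

print(scores, average)
```

This illustrates both properties discussed above: the submission never sees a WA/RTE verdict (so no information leaks), but the two failing files drag the average far above the genuine scores, so buggy solutions cannot win on score alone.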