Problem description:
Given an integer K, integers A_1,\ldots, A_N, and primes P_1,\ldots, P_M, find
D_1,\ldots,D_N satisfying 0\le D_i\le K, such that \frac{1}{M}\sum_{j=1}^M B_j is as large as possible, where
B_j=\prod_{i=1}^N (A_i+D_i) \mod P_j.
We are given four different parameter sets. In each case P_1,\ldots, P_M are consecutive primes with P_1 between 1.1\times10^9 and 1.5\times10^9, and 1\le A_i\le 10^9 are independently random.
The four parameter types are (1) N=100, M=10000, 1\le K\le1000000, (2) N=1000, M=1000, 1\le K\le1000000, (3) N=1000, M=1000, 1\le K\le1000, (4) N=10000, M=100, 1\le K\le1000000.
First attempt
In general this problem is a bit awkward because we want to fix an additive property of the
$B_j$s (like being in the range 0.9P_j to P_j-1), but if we vary one of the D_i, it
changes all of the B_j multiplicatively, which more-or-less randomises B_j
additively. Changing a single D_i makes a huge change to the product of
(A_1+D_1)\ldots(A_N+D_N), and reducing a huge number (approximately 31N bits) modulo
P_j (approx 31 bits) is a “chaotic” operation.
So I believe that to first approximation this “machine” that we are given that generates
B_j behaves as a random number generator and we are being asked to choose a seed (the
initial A_i+D_i) to maximise the sum of uniform random numbers. This means that “hill
climbing” from one solution to another by small changes in the D_i won’t be any better
than starting afresh with a new configuration of $D_i$s.
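As a quick illustration of this (a throwaway sketch of my own, not part of any submission; the primes are just the first few above 1.1\times10^9, found by trial division, not the actual contest primes), nudging a single D_i sends every B_j to an apparently unrelated residue:
import random

def is_prime(n):
    # trial division is plenty fast for n around 1.1e9
    if n % 2 == 0: return n == 2
    d = 3
    while d*d <= n:
        if n % d == 0: return False
        d += 2
    return True

# a handful of stand-in primes just above 1.1e9
P = []
p = 1_100_000_001
while len(P) < 5:
    if is_prime(p): P.append(p)
    p += 2

random.seed(1)
N = 100
A = [random.randint(1, 10**9) for _ in range(N)]
D = [random.randint(0, 1000) for _ in range(N)]

def residues(D):
    out = []
    for q in P:
        b = 1
        for a, d in zip(A, D):
            b = b*(a+d) % q
        out.append(b)
    return out

print(residues(D))    # B_j for the original choice of D
D[0] += 1             # nudge a single D_i ...
print(residues(D))    # ... and every B_j jumps to an unrelated-looking residue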
Looking at it like this allows you to write a simple program that would end up in about 5th position in the combined final Div1+Div2 ranklist. Example program here.
This first attempt just generates instances of B_1,\ldots,B_M as fast as possible, and takes the best score it finds. You can do this moderately fast by fixing D_2,\ldots,D_N and making k=D_1 range over 0,\ldots,K. That way B_j is of the form AA_{0,j}+k.AA_{1,j} where AA_{i,j}=\prod_{i'=i+1}^N (A_{i'}+D_{i'}) \mod P_j (and D_1=0 in this product). If we invert the order of the inner two loops (over 1\le j\le M and 0\le k\le K) then we don’t need to do a multiplication or full modulo reduction. The inner loops look like (replacing K with K+1):
val=[0]*K                      # val[k] accumulates sum_j ((AA[0][j] + k*AA[1][j]) mod P[j])
for j in range(M):
    p=P[j]
    b=AA[0][j];c=AA[1][j]      # b = current value mod p, c = increment per step of k
    for k in range(K):
        val[k]+=b
        b+=c
        if b>=p: b-=p          # single conditional subtraction instead of a full modulo
This inner loop goes at about 160\times10^6 loop iterations per second under pypy, and about 620\times10^6 iterations per second under C++ (measured on my desktop, which seems to be similar to CodeChef’s servers).
The factor of 5 that CodeChef allows Python over C++ seems to slightly favour pypy here since the measured speed ratio is about 4, but this pypy speed is a bit fragile and depends on using integers in a simple way: the C++ version admits optimisations that wouldn’t work well under pypy, so I think the factor of 5 is reasonably fair here. In particular, oleg_b’s C++ program cleverly reorders the modular reduction in a way that avoids unpredictable branches and hops through the k loop in steps of about 4. This manages a hefty 2350\times10^6 loop iterations per second.
Anyway I prefer writing in Python, so I’ll stick with that here.
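For completeness, the AA arrays assumed by that inner loop can be precomputed along the following lines (a sketch of my own, not from any submitted program; A, D and P are taken to be 0-based Python lists, so A[0] plays the role of A_1):
# AA[1][j] = prod_{i>=2} (A_i+D_i) mod P_j   (the coefficient of k = D_1)
# AA[0][j] = A_1 * AA[1][j]        mod P_j   (the value at D_1 = 0)
AA = [[0]*M for _ in range(2)]
for j in range(M):
    p = P[j]
    r = 1
    for i in range(1, N):      # indices 1..N-1 correspond to D_2..D_N
        r = r*(A[i]+D[i]) % p
    AA[1][j] = r
    AA[0][j] = A[0]*r % p      # D_1 = 0 in this product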
Analysis
Let’s rescale and pretend that we are adding up M independent uniform real random numbers in the range -1 to 1. The normal approximation works pretty well here, even at the tail. The variance of U[-1,1] is 1/3 and in the allowed time (20s for Python) we can generate about 166\times10^6\times20/M instances, so to work out the largest sum we might manage we take the M/(20\times166\times10^6) point of the tail of the normal distribution:
from scipy.stats import norm
for M in [10000, 1000, 100]:
    print(M, -norm.ppf(M/(166e6*20)) * (M/3.)**.5)
which gives (unnormalised) scores of about 261, 91, and 31 for M equal to 10000, 1000, 100. The rules of the contest meant that
we had to divide these sums by M to get the actual scores (rather than \sqrt{M} which would equalise the variance),
which means that type 4 (M=100) is much more important to optimise than type 1 (M=10000), because the possible relative changes (after dividing by M) are about 10 times greater for type 4.
Notice that 31 on the above scale can be interpreted as the number of “top” values of B_j. Scoring a sum of 31 is equivalent (on average) to fixing 31 of the $B_j$s equal to their maximum value of P_j-1 and leaving the other $B_j$s as random.
It is relevant that this number, 31, is quite big. If instead of this random brute force we try to do something clever that fixes some of the B_j, then it’s no use just fixing, say, 15 of them if that’s all we do, because that will only score 15 on this scale and so will lose out to the sampling as above. We need to be able to fix some of the B_j in such a way that it is fast to try lots of random variations, and that is harder to do (but possible - see below).
Second attempt
If you imagine doing the inner loops in the other order, with the loop over primes on the inside, then it’s clear that the process is wasteful. After you add up some of the $B_j$s it might be that the total isn’t very high, in which case it’s unlikely that the total over all M $B_j$s is going to be exceptionally high, and we only care about exceptionally high totals. In such a situation it would pay to abandon the current instance midway and start a new one.
So the inner two loops should look something like this
for k in range(K):
    v=0
    for j in range(M):
        if v < thr[j]: break   # running total too low to be worth continuing; abandon this k
        p=P[j];b=AA[0][j];c=AA[1][j]
        v+=(b+k*c)%p
where thr[j] are some suitably-chosen thresholds. Unfortunately we more-or-less have to
write the loops this way around (K then M) to use this method and can’t take that much
advantage of the fact that k increments by 1, which means we take an efficiency hit
before we can realise the gains. But the gains are quite substantial, so we still win
out. There is a question of how to measure the effectiveness of this method. It’s no longer
possible just to count inner loop iterations per second, because they are not all
equal. (Reaching large j is better than small j.) It’s also too random to measure how
well it does in terms of its final answer - that way it would take hours to collect
reliable statistics. One solution is to decide on a benchmark score, some quite rare
level, but not too rare, and count how many times this is achieved. For the discussion
here we can take the benchmark level to be
B=\frac{1}{2}MP(1+\left(\frac{2\log(10^6/M)}{3M}\right)^\frac{1}{2}), derived from the normal approximation mentioned above, where P
is a typical prime in the sequence P_1,\ldots,P_M (they are all very close).
The optimal threshold curve, thr[j], can be worked out from a standard Gittins Index calculation, but the formula
thr[j]=\frac{1}{2}(j-1-65/K)P+(1-(1-j/M)^\frac{1}{2})(B-\frac{1}{2}MP) seems to approximate this reasonably well.
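To make this concrete, here is how the benchmark level and threshold curve might be set up from these formulas (my own sketch; P_1 stands in for the “typical prime” P, and the exact indexing of thr against the loop above is a judgement call, since in practice the curve gets tuned empirically anyway):
from math import log, sqrt

def thresholds(M, K, P1):
    # benchmark level B and threshold curve thr[] from the formulas above
    B = 0.5*M*P1*(1 + sqrt(2*log(1e6/M)/(3*M)))
    thr = [0.5*(j - 1 - 65/K)*P1 + (1 - sqrt(1 - j/M))*(B - 0.5*M*P1)
           for j in range(M)]
    return B, thr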
For type 4, the first attempt above achieves about 13 benchmarks per second (bm/s), and this second attempt with the above threshold curve achieves about 28 bm/s.
Here is an example implementation.
Continued fraction method
The previous methods effectively improve the constant factors: they still require \Theta(M) time to evaluate a configuration of $(A_i+D_i)$s
because the product is evaluated modulo some definite proportion of the M primes, say M/10 on average.
It’s possible to reduce this to (amortised time) O(\sqrt{M(1+M/K)}). This sounds good, though actually it won’t turn out to be enormously useful here because the parameters for this challenge are not so favourable for this method.
Go back to the loop order from the first method (K-loop on the inside). We’d like to
calculate the KM values B_j+kC_j \mod P_j for 1\le j\le M and 0\le k\lt K as fast as
possible.
For each j, pick the best rational approximation to C_j/P_j, say c_j/p_j, with denominator p_j at most some fixed constant m
to be optimised. c_j, p_j can be calculated efficiently using continued fractions. Then B_j+kC_j\mod P_j
is nearly periodic in k with period p_j. More exactly, if k=k_0+k_1p_j (0\le k_0< p_j), then
B_j+kC_j = B_j+(k_0+k_1p_j)C_j = B_j+k_0C_j+k_1(p_jC_j-P_jc_j) = B_j+k_0C_j+k_1\Delta_j \mod P_j,
where \Delta_j=p_jC_j-P_jc_j.
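A minimal way to get c_j, p_j and \Delta_j in Python is to lean on the standard library, whose Fraction.limit_denominator does the continued-fraction work (a sketch; C, P and m here are the quantities described above):
from fractions import Fraction

def best_approx(C, P, m):
    # best rational approximation c/p to C/P with denominator p <= m,
    # together with Delta = p*C - P*c (small when the approximation is good)
    f = Fraction(C, P).limit_denominator(m)
    c, p = f.numerator, f.denominator
    return c, p, p*C - P*c

# e.g. c, p, Delta = best_approx(C_j, P_j, m) with C_j = AA[1][j], P_j = P[j]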
For each j, we would like to store some information that allows us to calculate
B_j+k_0C_j+k_1\Delta_j \mod P_j for all k< K, but do so in a way that only takes
O(p_j+K/p_j) time, rather than O(K) time, and also do so in a way that can be summed
over all j. This is possible because \Delta_j/P_j is usually small, so the
k_0 and k_1 pieces of B_j+k_0C_j+k_1\Delta_j \mod P_j, which interact via the \mod P_j, don’t
interact too much.
Fixing k_0 for the moment, the effect of the \mod P_j term is to subtract off
multiples of P_j at intervals of k_1, and these intervals will be large if \Delta_j is
small, which it usually is, so we can use a sparse array. Instead of looping over k_1,
we record the $k_1$s which will require an extra multiple of P_j to be subtracted. When
we process this in the next phase, we will have to take cumulative sums to recover the
relevant multiple of P_j. In symbols,
(B_j+k_0C_j+k_1\Delta_j)\%P_j = (B_j+k_0C_j)\%P_j+k_1\Delta_j - P_j\sum_{n\ge1}I(k_1\ge (nP_j-(B_j+k_0C_j)\%P_j)/\Delta_j),
so we subtract P_j from an adjustment array in positions (nP_j-(B_j+k_0C_j)\%P_j)/\Delta_j for n=1,2,\ldots, until
this position reaches K/p_j. This first phase code could look something like this (a slight variation of the above):
# first phase, for one prime P with residues B (=B_j), C (=C_j) and best
# approximation c/p to C/P; vl[p][k0] and dl[p][k0] accumulate starting values
# and increments over all primes sharing the same period p, and adj[p][q]
# records the +-P corrections that take effect just after position q
D=p*C-P*c
for k0 in range(p):
    vl[p][k0]+=B               # current value (B + k0*C) mod P
    dl[p][k0]+=D               # per-period increment Delta
    if D>0:
        x=P-1-B
        while 1:
            q=(x//D)*p+k0      # last k before the running value crosses a multiple of P
            if q>=K: break
            adj[p][q]-=P
            x+=P
    elif D<0:
        x=B
        while 1:
            q=(x//-D)*p+k0     # last k before the running value drops below a multiple of P
            if q>=K: break
            adj[p][q]+=P
            x+=P
    B+=C
    if B>=P: B-=P              # maintain B = (B_j + k0*C_j) mod P incrementally
When this has been done for all 1\le j\le M, we can recover the sum we want, val[k]=\sum_{j=1}^M (B_j+kC_j)%P_j by
adding over the k_0-based arrays and accumulating over the adjustment arrays, like this:
# second phase: sweep each period p, adding the accumulated arithmetic
# progressions into val[] and folding in the recorded +-P adjustments
for p in range(1,maxp):
    for k0 in range(p):
        v=vl[p][k0];D=dl[p][k0]
        for k in range(k0,K,p):
            val[k]+=v
            v+=D+adj[p][k]
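To check that this bookkeeping is right, here is a small self-contained test of my own (not contest code) that runs both phases on random data and compares against direct evaluation; note the moduli don’t even need to be prime for this check:
import random
from fractions import Fraction

random.seed(0)
M, K, m = 50, 1000, 30                       # small sizes, just for checking
P  = random.sample(range(10**4, 10**5), M)   # stand-ins for the primes
B0 = [random.randrange(p) for p in P]        # B_j
C0 = [random.randrange(p) for p in P]        # C_j

maxp = m + 1
vl  = [[0]*p for p in range(maxp)]
dl  = [[0]*p for p in range(maxp)]
adj = [[0]*K for _ in range(maxp)]
val = [0]*K

# first phase, once per modulus
for j in range(M):
    Pj, B, C = P[j], B0[j], C0[j]
    f = Fraction(C, Pj).limit_denominator(m)
    c, p = f.numerator, f.denominator
    D = p*C - Pj*c
    for k0 in range(p):
        vl[p][k0] += B
        dl[p][k0] += D
        if D > 0:
            x = Pj-1-B
            while 1:
                q = (x//D)*p + k0
                if q >= K: break
                adj[p][q] -= Pj
                x += Pj
        elif D < 0:
            x = B
            while 1:
                q = (x//-D)*p + k0
                if q >= K: break
                adj[p][q] += Pj
                x += Pj
        B += C
        if B >= Pj: B -= Pj

# second phase
for p in range(1, maxp):
    for k0 in range(p):
        v = vl[p][k0]; D = dl[p][k0]
        for k in range(k0, K, p):
            val[k] += v
            v += D + adj[p][k]

# direct evaluation for comparison
assert val == [sum((B0[j] + k*C0[j]) % P[j] for j in range(M)) for k in range(K)]
print("two-phase sums agree with direct evaluation")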
There are about K/(P_jp_j/|\Delta_j|)+1 iterations in the inner loop of the first phase,
so about K|\Delta_j|/P_j+p_j steps altogether. Taking C_j/P_j to be a random number
in [0,1), the average value of |\Delta_j|/P_j=p_j|C_j/P_j-c_j/p_j| will be around
0.42/m and the average value of p_j around 0.47m, though the actual time constants
depend on the implementation. So the first phase might take around M(K/m+m) or so steps.
The second phase takes K m steps, so the total time is around M(K/m+m)+Km=MK/m+(M+K)m.
The optimal value of m is \sqrt{MK/(M+K)} and overall it takes about \sqrt{MK(M+K)}
steps, which is a big saving over MK/10 (something like “second attempt” running time)
if M and K are both large.
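As a quick numerical sanity check of these formulas on the four parameter types (taking K at its maximum in each case; these are only order-of-magnitude figures since the true constants depend on the implementation):
from math import sqrt

for M, K in [(10000, 10**6), (1000, 10**6), (1000, 1000), (100, 10**6)]:
    m = sqrt(M*K/(M+K))        # optimal denominator bound
    print(M, K, round(m),
          "%.0e steps vs %.0e for the direct method" % (sqrt(M*K*(M+K)), M*K/10))
The predicted saving over MK/10 is largest for type 1, as noted below.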
This is an example implementation.
The main benefit of this method is for type 1 (M=10000) instances, but slightly
unfortunately for this method, the scoring system in this contest weights \sum_j B_j by
1/M (as opposed to 1/\sqrt{M} which would equalise the variances), so there is less
possible variation here. Also, this method doesn’t work well in conjunction with the
“second attempt” threshold-based speedups. Still, there is a large speedup benefit, even
if this doesn’t translate to a huge score gain: if K is moderately large
it manages about 400 bm/s for type 1 compared with about 42 bm/s from the “second attempt” above.
For type 4, it manages about 46 bm/s, which is better than the above, but we can do better
as we’ll see below.
Is there a way to “invert the random number generator”?
Is it possible to make some of the B_j equal to P_j-1 (the best value) or, for example,
force them to be greater than 0.9P_j by somehow solving for D_i in a way that doesn’t
require too big a search?
For example, it is suggested in the above editorial that you choose a set of N (out of
M) primes, say P_1,\ldots,P_N, and a smallish number x and try to factorise Z=P_1
P_2 \ldots P_N-x into (A_1+D_1)\ldots(A_N+D_N). The trouble with this is that it
requires (at a minimum) that the number Z doesn’t have any prime factors bigger than about Z^{1/N}
(that is, 1/N times the size of Z in bits). The chance of this happening is governed by the
Dickman function
and is roughly N^{-N}. In our examples,
N is at least 100, so it’s obviously out of the question to search for them one-by-one.
A variant of this would be to try to factorise \lambda.P_1\ldots P_r-x for some r
smaller than N, and some suitable multiple \lambda, but even though such choices of
\lambda and P_1,\ldots,P_r statistically should exist, it’s still no use trying to
factorise such a number head-on.
But there is something that can be done to give an edge over the direct searches mentioned
above, at least in the type 4 case.
The N-tree algorithm
We’d like to find a way of fixing some of the $B_j$s at good values (like P_j-1) while
also having the freedom to vary the D_i choice without affecting the fixed B_j. It’s
not enough to find a single example of B_1=P_1-1, B_2=P_2-1 if there is no
way of varying the other B_j other than rerunning the process that found the fixed B_j.
So the aim here is to find a collection of possible solutions that all have favourable $B_j$s
for some fixed set of j, and then to try each one to see which gives the highest total B_j
for the non-fixed j s.
Let’s stick with type 4 (M=100, N=10000, \max(K)=1000000) because more can be done
in this case.
Assume K\ge1000, which is usually true, and reduce K to (100P_1)^{1/4} (about 600, chosen so that K^4/P_1\approx100 below). (The
exact number you should use here depends on how fast the program is going to run.)
Fix D_{17}, D_{18}, \ldots, D_N to some random values up to K.
Let R=\prod_{i=17}^N(A_i+D_i). We are going to use the birthday method.
Make a list of (A_1+D_1)(A_2+D_2)\mod P_1 for 0\le D_1,D_2\lt K, and
a second list of -(R(A_3+D_3)(A_4+D_4))^{-1}\mod P_1 for 0\le D_3,D_4\lt K. The lists contain K^2
elements each, and statistically you expect about K^4/P_1=100 of them to agree modulo P_1. These agreements
can be found in O(K^2) or so time, and each such pair corresponds to the relationship
(A_1+D_1)(A_2+D_2)(A_3+D_3)(A_4+D_4)R=-1\mod P_1. Do the same for D_5, D_6, D_7, D_8 except that instead
of -R, just use 1. So we end up with a hundred or so quadruples such that
(A_5+D_5)(A_6+D_6)(A_7+D_7)(A_8+D_8)=1\mod P_1. Similarly we can get 100 quadruples D_9, D_{10}, D_{11}, D_{12}
such that (A_9+D_9)(A_{10}+D_{10})(A_{11}+D_{11})(A_{12}+D_{12})=1\mod P_1 and 100 quadruples such that
(A_{13}+D_{13})(A_{14}+D_{14})(A_{15}+D_{15})(A_{16}+D_{16})=1\mod P_1.
Taken together, we have 100^4=10^8
combinations of D_1, \ldots, D_{16} such that \prod_{i=1}^N(A_i+D_i)=-1\mod P_1, and these can be checked
quickly, in the manner of the “second attempt” above, using only M modular multiplications to check each instance
(you precalculate the product of each quadruple modulo each P_j).
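As a sketch of one of these birthday steps (my own illustration; A, R, P_1 and the reduced K are assumed to be set up as described above, and the function name is just for exposition):
# find quadruples (D1,D2,D3,D4), 0 <= Di < K (the reduced K, ~600), with
#   (A[0]+D1)(A[1]+D2)(A[2]+D3)(A[3]+D4) * R == -1  (mod P1)
def birthday_quadruples(A, K, R, P1):
    left = {}                                   # (A_1+D_1)(A_2+D_2) mod P1 -> [(D1,D2)]
    for D1 in range(K):
        for D2 in range(K):
            left.setdefault((A[0]+D1)*(A[1]+D2) % P1, []).append((D1, D2))
    quads = []
    for D3 in range(K):
        for D4 in range(K):
            t = (-pow(R*(A[2]+D3)*(A[3]+D4), -1, P1)) % P1   # -(R(A_3+D_3)(A_4+D_4))^{-1}
            for D1, D2 in left.get(t, []):
                quads.append((D1, D2, D3, D4))
    return quads                                # expect roughly K**4/P1 of these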
This has given us a “free” maximum (meaning one of the $B_j$s is guaranteed to take its
maximum value), and so pushes out the distribution, resulting in about 43 bm/s for type 4,
compared with the previous best of 28 bm/s. Of course it’s not quite free because there is
a setup time, but using the above parameters this is only about 5% of the total time, so
there is plenty of time to find an extreme point of the remaining distribution.
This method can be pushed further, starting from 4^3=64 lists instead of 4^2=16, and
using two stages of “birthday-refinement” instead of one, which ends up fixing B_1=-1\mod
P_1 and B_2=-1\mod P_2. Unfortunately, the setup time, though it just about fits within
the allotted time limit, doesn’t allow enough search time for the second phase, so it’s
not worth fixing more than one B_j using this method.
This was as far as I got in the competition, and this is my best answer
which implements a single fixed B_j.
But after the competition ended, I got to thinking that it should be possible to do
better… and of course it is. The trouble with the above method is that the lists of
allowable D_i need to get quite big to overcome the “cutdown” factor of P_j, and if
you want to do several stages to fix several B_j, the setup costs will be prohibitive.
Fortunately for many P_j, P_j-1 will split into several factors so the multiplicative
group, which is cyclic of order P_j-1, will admit a series of subgroups. This
potentially allows us to “cutdown” by smaller factors at each stage in the tree-like
refinement process, which means we need fewer items in the list of allowable values at
each stage, so we have a chance of making the setup phase run in a reasonable time while
fixing more $B_j$s.
For example, if we are fixing with respect to P_1=1400000023, then
P_1-1=2.3.3.41.263.7213, so one stage might be to fix with respect to the factor
7213. This means we have two lists, L_1, L_2 of numbers modulo P_1 and a target
product, t. We aim to make a single list of pairs, (x_1,x_2), from L_1\times L_2
such that x_1 x_2 t^{-1} is in the index 7213 subgroup of the multiplicative group. We
can do this using the birthday method by working in the quotient group: make a list of
(t.x_1^{-1})^r and x_2^r where r=(P_1-1)/7213. Elements in common from both lists
must satisfy (x_1 x_2)^r = t^r which is what we want. Next stage would be to combine
pairs of lists where the r^\mathrm{th} powers are constant over each list to make a new
list whose s^\mathrm{th} powers are constant, where s divides r, e.g.,
s=(P_1-1)/(7213.3.263). In other words we’re fixing things in increasingly big quotient
groups, until we reach the whole group. Eventually, at the last stage, we end up
with a few lists whose values are constant modulo a set of primes.
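Here is a sketch of the basic subgroup-matching step (my own illustration following the P_1=1400000023 example; the lists, the target t and the exponent r are the quantities described above):
# combine lists L1, L2 of residues mod P1 into pairs (x1,x2) with
# (x1*x2*t^{-1})^r == 1 mod P1, i.e. x1*x2/t lies in the index-7213 subgroup,
# where r = (P1-1)//7213; the matching is done on r-th powers (the quotient group)
def combine(L1, L2, t, P1, r):
    buckets = {}
    for x1 in L1:
        key = pow(t*pow(x1, -1, P1) % P1, r, P1)     # (t.x1^{-1})^r
        buckets.setdefault(key, []).append(x1)
    pairs = []
    for x2 in L2:
        for x1 in buckets.get(pow(x2, r, P1), []):   # match against x2^r
            pairs.append((x1, x2))
    return pairs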
(It would be simpler to work additively, but it’s probably going to be too expensive to
take discrete logs modulo each P_j we are fixing with respect to, which we’d need to do
for each i and each choice of D_i, so we have to stay in the multiplicative group.)
A good configuration (for the time limits for this problem)
is something like: at the i^\mathrm{th} stage (i=0,\ldots,11), you have 2^{13-i} lists
each of size roughly 40.2^i, except at i=11, the final stage, where there are four lists of size about 150.
150 is chosen so that 150^4\approx5\times10^8 is a good number of candidates to test in the second phase.
(That is simplifying slightly. If K is small, then you need to build up the sizes first before cutting down.)
This is very similar to Wagner’s algorithm
which uses a tree of lists of binary vectors to choose one element from each top-level list so that the choices
XOR to zero. (Though he uses equal size lists at all levels, which is not optimal as it doesn’t balance the work.)
This is an example implementation. It
tries to use this method for type 4, and for some values of K in types 2 and 3.
The program is a bit fiddly because it has to choose a suitable set of primes to
fix with respect to, put the factors in order to form a plan of action in such a way
that it won’t run out of list elements on the way, and do this robustly on-the-fly.
In the type 4 case it usually manages to fix B_j for four primes. This isn’t a lot, but
it’s enough to give it a big advantage over not doing this, because moving out a bit in
the extreme tail of a normal distribution makes a big difference to the probability. In
terms of the benchmarks per second mentioned above, the “unfixed” method gave about 28
bm/s, the single-fixed method about 42 bm/s, and this method about 200 bm/s. (It’s
possible the bm/s measure is slightly underestimating the benefit of this method, since
the further out in the tail of the distribution you go, the bigger the improvement.)