Hello guys… Recently I was studying Mo’s Algorithm from this blog and trying problems stated at the end. I was trying the problem Powerful Array. I saw that my solution barely passes the Time limit and that too I don’t know how as my same solution got TLE!
I saw that the blog clearly states that Mo’s Algorithm is the best way for this problem. It would be great if someone can point where I am going wrong. Any better approach or articles from which I can learn are also appreciated!! Thanks

Here’s a link to my AC code : LINK
Same code getting TLE : LINK

@dishant_18 I resubmitted your solution with changing the data type of input array as int instead of long long and passed in 4648ms which is an improvement. There are optimisations like these which can use to spped up your code.
I didn’t read through your code fully so there might be algorithmic optimizations in your logic.

Using left shift to multiply by 2 would probably further decrease the time.
Also @dishant_18 you might try varying the block size.

One more optimization: When sorting the queries, sort the right endpoints by ascending or descending depending on whether the left endpoint is in an odd or even block. This saves a lot of travel steps for the right pointer. I cannot say exactly where I found it, but I remember it was some Codeforces comment.

UPDATE: Solution using the above optimizations takes a little over 1 sec: 33239378

Thanks @meooow and @sdssudhu
Btw @meooow I learnt about this implementation from this blog(https://www.hackerearth.com/practice/notes/mos-algorithm/).
Is the sorting operation you mentioned always better??
And what exactly do you mean by varying block size… Sorry if its too trivial but I don’t get how much offset shall I give and how does that help?

Varying block sizes is…well XD. Let me illustrate this with an example. This long, that Chef and Xor queries.

Block size - \sqrt{N} = Gave me TLE.

Block size - 500 = TLE
Block size - 1000= TLE
Block size 2000 = TLE
Block size 800= Accepted at ~2.2seconds (had 1 good second left).

So yeah…:/. You cant do that in a CF contest though, since only 1 submission is allowed. You should, instead, focus on optimizations as @meooow mentioned.

From the math you can work out that the block size should be in the order of \sqrt N, but that does not mean exactly \sqrt N is the best. The best block size practically is k \sqrt N for some k, usually not very far from 1.
But as @vijju123 said, tuning the block size is not practical for short contests. You would either have to make multiple submissions or locally generate trial test data.

However the sorting order optimization always works. If you sort query endpoint in ascending order, for the first block the right pointer moves from start to end, then for next block it moves back to start and then back to end, then again back to start and back to end.
But if you alternate the sorting order as ascending/descending, then the pointer goes start to end for one block, and end to start for the next, and again start to end for the next. So nearly half as many steps are required.

PS: I’m using the terms “start” and “end” loosely here, but hopefully you understand what I’m saying.

What if the test cases have all queries such that difference between L and R is huge? Which performs better, block size of 1000 or block size of 100 ? (Hint: Greater block size means I will waste lesser iterations in jumping blocks to reach final R)

What if test cases are like, difference between L and R is not that big? Large block size can be inefficient, as if difference is less than the block size, then you are back to brute force. So a smaller block size helps here.