**Author:** Sidhant Bansal

**Tester:** Istvan Nagy

**Editorialist:** Misha Chorniy

## Difficulty:

Medium

# Pre-Requisites:

segment tree or related data structure

## Problem Statement

You are given array A consisting N elements. All the elements in the input are integers. You must be able to make two queries, first one - change the value of the element, second one - is to

present the product of elements with indices from l to r by modulo P in the sum of the squares of four non-negative integers.

## Explanation

We need to make some precalculation, for each number between 0 and 1000000 know to calculate his decomposition in the sum of four squares. The key idea in this problem, if we can decompose number A in the sum of four squares and B in the sum of four squares, then A*B can be decomposed in the sum of four squares.

A*B = (x_{1}^2+x_{2}^2+x_{3}^2+x_{4}^2)*(y_{1}^2+y_{2}^2+y_{3}^2+y_{4}^2) =

(x_{1}*y_{1}+x_{2}*y_{2}+x_{3}*y_{3}+x_{4}*y_{4})^2 +

(x_{1}*y_{2}-x_{2}*y_{1}+x_{3}*y_{4}-x_{4}*y_{3})^2 +

(x_{1}*y_{3}-x_{2}*y_{4}-x_{3}*y_{1}+x_{4}*y_{2})^2 +

(x_{1}*y_{4}-x_{4}*y_{1}+x_{2}*y_{3}-x_{2}*y_{3})^2

The second observation is Lagrange’s four-square theorem, it’s well-known fact in math. Every number can be represented in the sum of four squares, from here there are no -1 answers in the problem.

## Subtask 1

P is less or equal than 1000. Let’s make precalculation in a next way, iterate with four cycles, where each number less than sqrt(1000)<32

```
for i = 0..32
for j = i..32
for k = j..32
for l = k..32
s[i*i+j*j+k*k+l*l]=(i,j,k,l) //t is the array of tuples from 4 numbers
```

When we know this thing, numbers can be multiplied and values of the sum of 4 square squares can be recalculated. Of course, all operations must be processed by modulo P. After that queries can be simulated, the first one in O(1) time, the second one in O(N) time. Totally, it takes O(Q*N+32^4) time.

## Subtask 2

Precomputation with four for’s as in Subtask 1 get TL for P=10^6. How to speed up this one, let’s make next thing, find all numbers which can be represented as the sum of 3 square numbers.

```
for i=0..1000
if i*i>1000000
break
for j=i..1000
if i*i+j*j>1000000
break
for k=j..1000
if i*i+j*j+k*k>1000000
break
t[i*i+j*j+k*k]=(i,j,k) //t is the array of tuples from 3 numbers
s[i*i+j*j+k*k]=(i,j,k,0)
```

We will find almost 85% of representions after this preprocessing. How to find the rest:

```
for i=0..1000000
if s[i] != null
continue
for j=1..1000
if i - j * j<0
break
if s[i-j * j][3]==0 // If number i - j * j can be represented in a sum of 3 squares
s[i]=(s[i-j * j][0],s[i-j * j][1],s[i-j * j][2],j)
break
```

Let’s estimate total time of work of these two fragments of code, first fragment has three for’s each of those to 1000, but 0<=i<=j<=k<=1000, number of possible triplets (i, j, k) the same thing as 1000 \choose 3. The second fragment works in time which equal number of numbers which can’t be represented as a sum of 3 squares multiply by 1000 less than 2*10^5*1000 = 2*10^8, summary both fragments will process in time less than 5*10^8.

Now we have all decompositions for numbers in the range between 0 and 10^6, but we don’t know how to speed up processing of the 2nd query. A function of representation in a sum of 4 square is associative and multiplicative, we can use segment tree for solving this problem. In each node v which assigned with interval l..r storing decomposition of the product A_{l}*A_{l+1}*...*A_{r} in the sum of 4 squares by modulo P.

Now both queries can be done with complexity O(log N)

## Subtask 3

The difference of a subtask 3 from a subtask 2 that P can be up to 10^{12}. It does problems because 64-bit integers type can store numbers up to 1.8 * 10^{19}, how to cope with a multiplication of two numbers which can’t be stored in the 64-bit integer type, there are several ways:

### First way:

```
mult(a, b, p)
if b == 0
return 0
if b % 2 == 0
a2=mult(a,b/2,p) // Finding a * (b / 2), a * b = a * (b / 2) + a * (b / 2)
return (a2+a2) mod p // (a+...+a)+(a+...+a)
return (a+mult(a,b-1, p)) mod p //a * b = a * (b - 1) + a
```

Complexity of this algorithm is O(log N), unfortunately, this algorithm can’t pass TL

### Second way:

In C++ exists types like __int128, __int128_t, Python and Java has libraries for working with Long Arithmetics, but all of it is very slow to get AC.

### Third way:

Divide multipliers into two parts $X=A*2^{20}+B$, $0<=A, B<2^{20}$```
mult(a, b, p)
a0=a % (2^20)
a1=a / (2^20)
b0=b % (2^20)
b1=b / (2^20)
//(a1*(2^20)+a0)*(b1*(2^20)+b0)=a1*b1*(2^40)+(a1*b0+b1*a0)*(2^20)+(a0*b0)
return ((a1*b1)%p*(2^20)%p*(2^20)+(a1*b0+a0*b1)%p*2^20+a0*b0)%p
```

If change % and / into bit operations, correct solution must get AC.

### Fourth way:

To calcuate product in big real types(C++ version):```
//a * b mod p = a * b - [a * b / p] * p
long long mul(long long a, long long b, long long p) {
long long k = (long long)((long double)a * b / p + 0.5);
return a * b - k * p;
}
```

This optimization for multiplying numbers up to 10^{12} modulo 10^{12} is fastest.

The overall time complexity of this approach is O(N log N) but with huge constant.

## Solution:

Setter’s solution can be found here

Tester’s solution can be found here

**Please feel free to post comments if anything is not clear to you.**