Since these replies are listed as answers, not in the order they were posted, this is a follow-up to my previous answer.

Haskell is a fast language. I’ve gotten performance not far behind C or C++ when running the same algorithms on the same data structures. In every case I’ve found where an optimized Haskell solution wasn’t fast enough, the problem was with the algorithm or data structure I was using, and if you’re more than slightly behind, there’s a better approach that will get you the answer with a lower order of time complexity. In the most pathological case, a mutable `STUArray`

implementation of CSUBQ was good enough to get ten points, the same as its translation into C++, and that’s the single data structure Haskell is worst at. (In the real world, use `Data.Vector`

for mutable Haskell arrays. Unfortunately, it’s not available on CodeChef at present, but it’s supposed to be added to the default distribution soon.)

The lesson there, for me, was not to prematurly optimize so much. I spent a lot of time getting the mutable arrays in the ST monad right, and I learned something from it, but the real solution was to use a different data structure, segment trees.

You will, however, have to start with one optimization: fast I/O. The first snag you’re going to run into, on every single problem, is that the standard Haskell I/O library is extremely slow. Why is that? Remember, I said Haskell is fast *when using the same data structures* as other languages. The I/O library in `Prelude`

reads and writes `String`

data, which are lists of `Char`

. If you implemented your strings as singly-linked lists of Unicode characters, they’d be slow in C++ too!

The efficient data structure for output is arrays of characters, and ASCII is enough for (nearly?) any problem on this site. An even better one for this purpose is a list of buffers, dynamically allocated as needed. That exists in Haskell, as `Data.ByteString.Lazy`

. Specifically, since we only need to support ASCII, `Data.ByteString.Lazy.Char8`

. (In the real world, always use UTF-8 whenever possible; the library supports that too.) And the fastest way to write your output is to generate, at runtime, a data structure of chained operations to till these buffers. These are in `Data.ByteString.Builder`

. It even gives you back just a little bit of control over the size and behavior of your buffers.

An equivalent skeleton with fast I/O might be:

```
import Data.ByteString.Builder (Builder, char7, intDec, toLazyByteString)
import qualified Data.ByteString.Lazy.Char8 as B8
import Data.Monoid ((<>))
main :: IO()
main = B8.interact ( output . compute . input ) where
input :: B8.ByteString -> [Int]
input = map perLine . tail . B8.lines where
perLine = decode . B8.readInt
decode (Just (x, _)) = x
decode Nothing = error "Invalid input: expected integer."
compute :: [Int] -> [Int]
compute = _
output :: [Int] -> B8.ByteString
output = toLazyByteString . foldMap perCase where
perCase :: Int -> Builder
perCase x = intDec x <> char7 '\n'
```

And it says a lot about Haskell that the operations to combine the optimized, low-level data structures from I/O are found in the typeclasses `Semigroup`

and `Monoid`

, not to be confused with `Monad`

. It’s actually really simple: a semigroup has a single binary operation `<>`

, where `(a <> b) <> c`

always equals `a <> (b <> c)`

, and a monoid also has an empty element such that `mempty <> a`

and `a <> mempty`

both equal `a`

. Making it that generic is what allows us to use Prelude functions like `foldMap`

and syntactic sugar like `<>`

with `Builder`

transparently. Still, it is a bunch of abstract algebra to express the concept of `append`

.

It turns out that `Data.List.sort`

with a variant of this wrapper was fast enough to be accapted. But that’s not how you’re supposed to solve the problem. Take a close look at the conditions: all the numbers are integers between 1 and 1,000,000. There’s a O(max (*m*, *n*)) algorithm to sort this array, where *n* is the size of the input and *m* the size of the domain. Since the maximum and minimum values of *t* were set so low, it’s great for this problem. It’s called counting sort. Make an array `a`

covering the entire domain, initialized to zeroes. Each time you encounter a number `i`

, increment `a[i]`

(in Haskell, `a!i`

). Finally, for every `i`

in the domain, output `i`

, `a[i]`

times (which could be zero).

Both updating an array and iterating through an array in Haskell are concepts from imperative languages, and we’ll use library functions to get out of having to do them. If you check the documentation for `Data.Array.IArray`

, you’ll find that there already is a function in the standard library to build an array from a default value and an accumulating function, `accumArray`

, and a function to read out an array as a list of (*index*, *value*) pairs, `assocs`

. There are also a few different kinds of array, but in this case, we want `UArray`

from `Data.Array.Unboxed`

. (Boxed arrays, where the elements are evaluated lazily, are great for dynamic programming tables, though!) That lets us partly fill in the hole:

```
import Data.Array.IArray (accumArray, assocs)
import Data.Array.Unboxed (UArray)
import Data.ByteString.Builder (Builder, char7, intDec, toLazyByteString)
import qualified Data.ByteString.Lazy.Char8 as B8
import Data.Monoid ((<>))
main :: IO()
main = B8.interact ( output . compute . input ) where
input :: B8.ByteString -> [Int]
input = map perLine . tail . B8.lines where
perLine = decode . B8.readInt
decode (Just (x, _)) = x
decode Nothing = error "Invalid input: expected integer."
compute :: [Int] -> [Int]
compute = _ . assocs . countingSort . _ where
countingSort :: [(Int, Int)] -> UArray Int Int
countingSort = accumArray (+) 0 (lower, upper)
lower = 0
upper = 1000000
output :: [Int] -> B8.ByteString
output = toLazyByteString . foldMap perCase where
perCase :: Int -> Builder
perCase x = intDec x <> char7 '\n'
```

The first hole is now a function from `[Int] -> [(Int, Int)]`

, where the first member of the pair is an index and the second is a count of 1. We’ll turn this into a function on individual numbers with `map`

. The second hole is now a function from `[(Int, Int)] -> [Int]`

, where again, the first member of the pair is an index and the second is a count of 1. This time, each entry in the list could produce any number of output items, depending on the count, so we’ll combine them with `concatMap`

. This gives us:

```
import Data.Array.IArray (accumArray, assocs)
import Data.Array.Unboxed (UArray)
import Data.ByteString.Builder (Builder, char7, intDec, toLazyByteString)
import qualified Data.ByteString.Lazy.Char8 as B8
import Data.Monoid ((<>))
main :: IO()
main = B8.interact ( output . compute . input ) where
input :: B8.ByteString -> [Int]
input = map perLine . tail . B8.lines where
perLine = decode . B8.readInt
decode (Just (x, _)) = x
decode Nothing = error "Invalid input: expected integer."
compute :: [Int] -> [Int]
compute = concatMap _ . assocs . countingSort . map _ where
countingSort :: [(Int, Int)] -> UArray Int Int
countingSort = accumArray (+) 0 (lower, upper)
lower = 0
upper = 1000000
output :: [Int] -> B8.ByteString
output = toLazyByteString . foldMap perCase where
perCase :: Int -> Builder
perCase x = intDec x <> char7 '\n'
```

For the first remaining hole, each time we encounter an `i`

, we want to increase its count by 1, so we want to add `(i,1)`

to the output list. For the last remaining hole, if we get back an association `(i,c)`

, we want to produce a list of `c`

copies of `i`

, which is `replicate`

. So, we end up with:

```
import Data.Array.IArray (accumArray, assocs)
import Data.Array.Unboxed (UArray)
import Data.ByteString.Builder (Builder, char7, intDec, toLazyByteString)
import qualified Data.ByteString.Lazy.Char8 as B8
import Data.Monoid ((<>))
main :: IO()
main = B8.interact ( output . compute . input ) where
input :: B8.ByteString -> [Int]
input = map perLine . tail . B8.lines where
perLine = decode . B8.readInt
decode (Just (x, _)) = x
decode Nothing = error "Invalid input: expected integer."
compute :: [Int] -> [Int]
compute = concatMap expand . assocs . countingSort . map encode where
encode i = (i, 1)
countingSort :: [(Int, Int)] -> UArray Int Int
countingSort = accumArray (+) 0 (lower, upper)
lower = 0
upper = 1000000
expand (i,c) = replicate c i
output :: [Int] -> B8.ByteString
output = toLazyByteString . foldMap perCase where
perCase :: Int -> Builder
perCase x = intDec x <> char7 '\n'
```

I submitted a few different variants of it, for benchmarking. Subtle differences can have a big impact on performance, and since Haskell is such a high-level language, it’s often opaque which ones. (It took me a while to figure out that `foldMap replicate`

was twice as fast as `Data.Semigroup.stimes`

, and that’s probably a quirk in the implementation.) The main difference with this version is that it replicates the output of `perCase`

rather than the inputs. Or you might prefer this more concise style.

You’ll notice that it’s possible to pass flags to GHC inside the source code. I haven’t found that `-O3`

makes a big difference, but it can’t hurt. I would use `-llvm`

as well, since the LLVM code generator is better, but CodeChef doesn’t support it.