Range trees. Those of my friends who've had a lot of programming competition experience would be well-acquainted with them. They're neat recursive data structures which I thought would make an excellent toy project for me to learn a bit more about performance profiling in Haskell.

### ...range trees?

A range tree is a binary-tree-based data structure used to implement *range queries*.

**1D range query problem.** Given a semigroup (i.e. an associative binary operator \(\bullet\) over a domain \(S\)), we seek a data structure that implements the following operations over some virtual array `a`:

\(\textrm{update}(x,v)\): set \(\texttt{a}[x] := v\)

\(\textrm{query}(x_1,x_2)\) [given \(x_1 \leq x_2\)]: return \(\texttt{a}[x_1] \bullet \texttt{a}[x_1+1] \bullet \cdots \bullet \texttt{a}[x_2-1]\)

For example, the *range minimum query* problem requires a data structure that updates integers in an array and answers queries of the form "What is the lowest element in the range \([x_1,x_2)\)?". The *range sum query* problem requires a data structure answering queries like "What is the sum of elements 4 to 17 inclusive?", and so on.
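To pin down the conventions (note the half-open interval: \(\texttt{a}[x_2]\) itself is excluded), here is a naive list-based model of the two operations, specialised to the range sum case. The function names are mine, for illustration only:

```haskell
-- Naive O(W) model of the 1D range query problem, specialised to sums.
-- naiveUpdate writes a value at an index; naiveQuery folds over the
-- half-open interval [x1, x2).
naiveUpdate :: [Int] -> Int -> Int -> [Int]
naiveUpdate a x v = take x a ++ [v] ++ drop (x + 1) a

naiveQuery :: [Int] -> Int -> Int -> Int
naiveQuery a x1 x2 = sum (take (x2 - x1) (drop x1 a))

main :: IO ()
main = do
  let a = naiveUpdate (naiveUpdate (replicate 8 0) 2 5) 4 7
  print (naiveQuery a 0 8)  -- 12 (whole array)
  print (naiveQuery a 3 5)  -- 7  (a[3] + a[4] only; a[5] is excluded)
```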

Range trees offer us \(O(\lg W)\) updates and queries using \(O(W)\) space, where \(W\) is the size of the array. They're constructed as balanced binary trees where each leaf corresponds to an element in the array (ordered left to right, sensibly enough), and where each non-leaf node contains the sum of all the leaves underneath it.

When the value of an array element changes, only \(O(\lg W)\) tree nodes need to be updated. Likewise, querying for the sum of a range of elements also only requires looking at the values stored in \(O(\lg W)\) nodes. (In the example diagram above, finding the sum from `a`[1] to `a`[4] inclusive requires looking at three nodes: the ones storing the sums for `a`[1], `a`[2] + `a`[3], and `a`[4].)

### Haskell implementation

Here’s a first pass attempt at implementing a range tree in Haskell:

```haskell
data RangeTree = RangeTree {width :: Int, tree :: RangeTreeNode}

newRangeTree :: Int -> RangeTree
newRangeTree w = RangeTree {width = w, tree = newRangeTreeNode 0 w}

update :: RangeTree -> Int -> Int -> RangeTree
update rt x v = rt {tree = updateRangeTreeNode (tree rt) 0 (width rt) x v}

query :: RangeTree -> Int -> Int -> Int
query rt = queryRangeTreeNode (tree rt) 0 (width rt)

-- Internals

data RangeTreeNode = Leaf Int | Branch RangeTreeNode Int RangeTreeNode

rtnValue :: RangeTreeNode -> Int
rtnValue (Leaf v)       = v
rtnValue (Branch _ v _) = v

newRangeTreeNode :: Int -> Int -> RangeTreeNode
newRangeTreeNode x1 x2
  | x1 == x2-1 = Leaf 0
  | otherwise  = Branch (newRangeTreeNode x1 ((x1+x2) `div` 2))
                        0
                        (newRangeTreeNode ((x1+x2) `div` 2) x2)

updateRangeTreeNode :: RangeTreeNode -> Int -> Int -> Int -> Int -> RangeTreeNode
updateRangeTreeNode rtn x1 x2 x' v
  | x1 > x' || x2 <= x' = rtn
  | otherwise = case rtn of
      Leaf _       -> Leaf v
      Branch l _ r -> Branch l' (rtnValue l' + rtnValue r') r'
        where l' = updateRangeTreeNode l x1 ((x1+x2) `div` 2) x' v
              r' = updateRangeTreeNode r ((x1+x2) `div` 2) x2 x' v

queryRangeTreeNode :: RangeTreeNode -> Int -> Int -> Int -> Int -> Int
queryRangeTreeNode rtn x1 x2 x1' x2'
  | x1' >= x2 || x1 >= x2' = 0
  | x1' <= x1 && x2' >= x2 = rtnValue rtn
  | otherwise = queryRangeTreeNode l x1 ((x1+x2) `div` 2) x1' x2'
              + queryRangeTreeNode r ((x1+x2) `div` 2) x2 x1' x2'
  where Branch l _ r = rtn
```
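As a quick sanity check, here's the tree exercised at width 8 with two updates and two queries. The snippet restates the definitions so it runs standalone; the example values are mine:

```haskell
data RangeTree = RangeTree {width :: Int, tree :: RangeTreeNode}

data RangeTreeNode = Leaf Int | Branch RangeTreeNode Int RangeTreeNode

rtnValue :: RangeTreeNode -> Int
rtnValue (Leaf v)       = v
rtnValue (Branch _ v _) = v

newRangeTree :: Int -> RangeTree
newRangeTree w = RangeTree {width = w, tree = newRangeTreeNode 0 w}

newRangeTreeNode :: Int -> Int -> RangeTreeNode
newRangeTreeNode x1 x2
  | x1 == x2 - 1 = Leaf 0
  | otherwise    = Branch (newRangeTreeNode x1 m) 0 (newRangeTreeNode m x2)
  where m = (x1 + x2) `div` 2

update :: RangeTree -> Int -> Int -> RangeTree
update rt x v = rt {tree = updateRangeTreeNode (tree rt) 0 (width rt) x v}

updateRangeTreeNode :: RangeTreeNode -> Int -> Int -> Int -> Int -> RangeTreeNode
updateRangeTreeNode rtn x1 x2 x' v
  | x1 > x' || x2 <= x' = rtn
  | otherwise = case rtn of
      Leaf _       -> Leaf v
      Branch l _ r ->
        let l' = updateRangeTreeNode l x1 m x' v
            r' = updateRangeTreeNode r m x2 x' v
        in  Branch l' (rtnValue l' + rtnValue r') r'
  where m = (x1 + x2) `div` 2

query :: RangeTree -> Int -> Int -> Int
query rt = queryRangeTreeNode (tree rt) 0 (width rt)

queryRangeTreeNode :: RangeTreeNode -> Int -> Int -> Int -> Int -> Int
queryRangeTreeNode rtn x1 x2 x1' x2'
  | x1' >= x2 || x1 >= x2' = 0
  | x1' <= x1 && x2' >= x2 = rtnValue rtn
  | otherwise = queryRangeTreeNode l x1 m x1' x2'
              + queryRangeTreeNode r m x2 x1' x2'
  where m = (x1 + x2) `div` 2
        Branch l _ r = rtn

main :: IO ()
main = do
  let rt = update (update (newRangeTree 8) 2 5) 4 7  -- a[2] := 5, a[4] := 7
  print (query rt 0 8)  -- 12 (sum of the whole array)
  print (query rt 3 5)  -- 7  (a[3] + a[4]; half-open on the right)
```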

The implementation is quite standard. The module exposes the functions `update` and `query`, which are wrappers for the recursive functions `updateRangeTreeNode` and `queryRangeTreeNode`.

Let's see how this runs on some random data sets:

| width | # operations | runtime (s) |
|---|---|---|
| 1000 | 100000 | 0.23 |
| 1000 | 1000000 | 2.29 |
| 1000 | 2000000 | 4.74 |
| 1000 | 5000000 | 12.13 |
| 1000000 | 100000 | 0.79 |
| 1000000 | 1000000 | 6.62 |
| 1000000 | 2000000 | 13.34 |
| 1000000 | 5000000 | 33.08 |

Empirically it looks like we’re getting a total time complexity of \(O(Q \lg W)\), which is about what you’d expect for this data structure. What about space usage?

Turning on some profiling options and running the program with \(W = 1000\) and \(Q = 1000000\), we see the following space usage graph:

...well, that's odd. Our space consumption is *growing* over time with additional queries, even though range trees are supposed to have space consumption independent of the number of queries.
What's going on here?

Space leaks in languages with garbage collection (e.g. Haskell, Java, Python, to name a few) are the result of the runtime retaining more data than it actually needs. In the case of Haskell, this is often due to lazy evaluation.

(Lazy evaluation means what it says on the tin: functions and expressions aren't evaluated until they're absolutely needed. If you define `arr = map f [1..N]` and print the first three values of `arr`, the program will calculate what it needs, namely f(1), f(2) and f(3), and leave the remainder of the list sitting as a giant unevaluated thunk marked with the instructions `map f [4..N]`. If you later attempt to print out more of the list, more values will be calculated. On the other hand, if you never use the list again, then your memory usage is \(O(1)\) [not \(O(N)\)]. Either way, your results and thunks will be garbage collected the moment `arr` is out of scope.

When lazy evaluation is on your side, it allows for separation of concerns, namely definitions versus order/extent of calculation. One programmer on your team can define the entire game tree for Go or Chess, data structures of exponential, if not infinite, size. Later on, another programmer can take that definition and run height-bounded minimax on it (in which case only the requested levels are evaluated), or pick an arbitrary game path down the tree to visualise (in which case only that line down the tree is evaluated).

But lazy evaluation doesn't always save space, especially in situations where a given definition is more complicated than its result. E.g. if your program is carrying around the definition `x = 1 + 2 + 3 + 4 + … + N`, it could save a lot of space by evaluating that expression straight away.)
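That last case has a classic small-scale instance in the standard library's folds; a side illustration (my own, not from the post's code):

```haskell
import Data.List (foldl')

-- foldl builds the whole "0 + 1 + 2 + ..." definition as a chain of
-- unevaluated thunks before collapsing it: O(N) space for an Int result.
lazySum :: [Int] -> Int
lazySum = foldl (+) 0

-- foldl' forces the running total at every step, keeping it a plain
-- evaluated Int: O(1) space.
strictSum :: [Int] -> Int
strictSum = foldl' (+) 0

main :: IO ()
main = print (strictSum [1 .. 1000000])  -- 500000500000
```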

Digression over. Range tree! Space leaks. What’s going on?

Here are the results of a second profiling run, this time only looking at memory dedicated to unevaluated thunks. The memory in use is coloured according to what purpose it's being retained for:

It looks like unevaluated calls to `updateRangeTreeNode` are responsible for our space leak.

On further reflection this makes total sense: if there's an unevaluated call sitting at leaf [x,x+1), then the only way it's going to get evaluated is when there's a query so specific that it includes [x,x+1) but not its sibling. Anything else would result in early termination for the query, and so wouldn't force evaluation of the leaf. Those queries are going to be rare; hence in the meantime lots of unevaluated updates pile up.

(Don't forget that this is a *random* case! I believe it's possible to construct worst-case sequences of queries where the space usage is not just \(O(Q+W)\), but \(O(Q \lg W + W)\).)

### Forcing tree evaluation with DeepSeq

The `deepseq` Haskell package provides methods for forcing a program to evaluate an expression *in full*. (That is, the entire data structure is traversed and any remaining thunks are evaluated before the rest of the program is allowed to continue.) For our purposes we consider the function `force`, whose meaning is very similar to the identity function but which forces full evaluation of its input:

```haskell
force :: NFData a => a -> a
force x = x `deepseq` x
```
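One wrinkle glossed over here: `force` only works on types with an `NFData` instance, which tells `deepseq` how to traverse them. A minimal sketch for our tree types using GHC generics; the deriving clauses are my addition and aren't shown in the post's code:

```haskell
{-# LANGUAGE DeriveAnyClass, DeriveGeneric #-}
import Control.DeepSeq (NFData, force)
import GHC.Generics (Generic)

-- Deriving Generic gives deepseq a generic traversal of the structure,
-- and DeriveAnyClass lets NFData piggyback on it.
data RangeTreeNode = Leaf Int | Branch RangeTreeNode Int RangeTreeNode
  deriving (Generic, NFData)

data RangeTree = RangeTree {width :: Int, tree :: RangeTreeNode}
  deriving (Generic, NFData)

main :: IO ()
main =
  -- force traverses the whole tree, leaving no thunks behind.
  let rt = force (RangeTree 2 (Branch (Leaf 1) 3 (Leaf 2)))
  in  print (width rt)  -- 2
```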

After tacking on a few calls to `force` (see the diff), we're left with a program which has the following space profile:

This is a much better looking space profile: constant with respect to the number of queries. Although our previous program version used less space for the first few hundred thousand queries, as soon as we've had enough updates to justify maintaining the entire tree we're suddenly doing *very* well for ourselves. We could run this program overnight and not see any space consumption issues.

(I'm guessing the tiny spike towards the right end of the graph there represents a point where a large number of updates happened in a row. If the outer part of the program didn't ask for their value for several milliseconds, then the `force` instruction *itself* wouldn't get evaluated — leading to the whole tree being evaluated — until a query required the program to peek inside the tree.)

Thus we've successfully— wait, what in *hell* is going on with that horizontal axis? Twenty-five seconds? Even accounting for the profiling overhead, that's way slower than we started with.

Benchmarking time:

| width | # operations | runtime (s) |
|---|---|---|
| 100 | 1000000 | 2.20 |
| 400 | 1000000 | 4.10 |
| 700 | 1000000 | 5.84 |
| 1000 | 1000000 | 7.41 |

Our time complexity appears to have shot up to \(O(W \times Q)\). What’s going on?

Some different profiling reveals the number of calls to various functions:

```
                                             INDIV        INHERITED
function  module     id   calls       %time  %mem   %time  %mem
force     RangeTree  155  499912      0.1    0.0    80.8   0.0
  rnf     RangeTree  156  999324088   80.7   0.0    80.7   0.0
```

We're getting about \(\frac{Q}{2}\) calls to `force`, as expected (since this is how many times our random case generator updates the tree). But the `rnf` function, which
`force` uses to recursively force evaluation of the whole tree, is being called about 2000 times for each call to `force` — that is, it's traversing the entire tree.

For some silly reason I had hoped that the program wouldn't force evaluation of the same (unmodified) data structure more than once — memoisation and all that. Only at this point did
I realise I was confusing `force`'s traversal of the data structure with the data structure itself. Once the tree is evaluated, it sits in memory so long as there are still references to it.
But the *result* of the tree traversal (a zero-byte dummy value signifying "all good") isn't cached. A sufficiently clever compiler/interpreter might notice that the same function is being called on the same value twice; however there's nothing in the language spec to guarantee this.

Moral of the story: doing unnecessary tree traversals completely defeats the point of having a spiffy logarithmic data structure. Oops.

### Forcing evaluation with `seq`

Okay. Take two.

Instead of using the DeepSeq module, we'll take a look at its built-in, less ambitious relative, `seq`. `seq` forces evaluation of its first argument up to *weak head normal form* (WHNF). Generally this means evaluating until the program reaches a built-in constant or a constructor.

Here we're evaluating `RangeTreeNode`s, whose constructors are either `Branch` or `Leaf`. So `tree `seq` val` will evaluate `tree` until it encounters an expression of the form `Branch (...) (...) (...)` or `Leaf (...)`, where the omitted parts can be arbitrarily complicated, before finally returning `val`.
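A tiny illustration of the difference (my own example, not from the post): `seq` stops at the outermost constructor, so it never notices errors or thunks hiding inside one.

```haskell
main :: IO ()
main = do
  -- The pair constructor (,) is already in WHNF, so seq is a no-op
  -- here: the undefineds inside are never touched.
  putStrLn ((undefined :: Int, undefined :: Int) `seq` "pair: ok")
  -- By contrast, seq on an unevaluated bottom would crash:
  -- putStrLn ((undefined :: Int) `seq` "never printed")
```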

Adding a call to `seq` in our wrapper function `update` is close to useless, since at best it will do enough work to determine that our top-level node is a `Branch`
before bailing out. Instead, we want to call this recursively inside `updateRangeTreeNode`.

Here's the patch I settled on:

```diff
-      Branch l _ r -> Branch l' (rtnValue l' + rtnValue r') r'
+      Branch l _ r -> l' `seq` r' `seq` Branch l' (rtnValue l' + rtnValue r') r'
```

Here we force WHNF evaluation of the children before returning the constructor for the modified tree. Importantly, this means that if we reach this line of code while evaluating an instance
of `seq` further up the tree, we'll recursively call `seq` on our child nodes as well.

*However*, unlike the indiscriminate evaluation we got from `force`, we
*won't* need to recurse any deeper if it's a trivial case. (Our trivial cases are when the update is outside the scope of the node, and otherwise when the node is a leaf.)
This means that only \(O(\lg W)\) calls to `seq` (and hence forced partial evaluations) will occur for each `update` we perform on the tree.
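A design note: GHC also lets you get much the same effect declaratively, by making constructor fields strict so that building a `Branch` forces its contents to WHNF automatically. This variant is my own sketch, not the patch the post uses:

```haskell
-- Strict (!) fields: whenever a Branch cell is constructed, its two
-- children and its cached sum are forced to WHNF, so updates can't
-- leave chains of thunks inside the tree.
data RangeTreeNode = Leaf !Int
                   | Branch !RangeTreeNode !Int !RangeTreeNode

rtnValue :: RangeTreeNode -> Int
rtnValue (Leaf v)       = v
rtnValue (Branch _ v _) = v

main :: IO ()
main = print (rtnValue (Branch (Leaf 1) 3 (Leaf 2)))  -- 3
```

Note that strict fields also force the cached sum on every construction, which is slightly more eager than the `seq` patch above.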

Walls of text aside, let's look at our space performance:

(Looks constant. Excellent.)

And as for our time performance:

| width | # operations | runtime (s) |
|---|---|---|
| 1000 | 1000000 | 2.00 |
| 1000 | 2000000 | 3.97 |
| 1000 | 5000000 | 9.99 |
| 1000000 | 1000000 | 5.99 |
| 1000000 | 2000000 | 11.58 |
| 1000000 | 5000000 | 29.08 |

These are the same times we had for our original \(O(Q \lg W)\) program, if not a little faster(!)

Lessons learned:

- Functional languages with algebraic data types are perfectly suited for tree-like data structures.
- Lazy evaluation can be bad, but even so that doesn't make "strictly evaluate ALL of the things!1" the best strategy.

(Source code and miscellaneous scaffolding here.)