Monday, October 7, 2013

Range trees and profiling in Haskell

Range trees. Those of my friends who've had a lot of programming competition experience would be well-acquainted with them. They're neat recursive data structures which I thought would make an excellent toy project for me to learn a bit more about performance profiling in Haskell.

...range trees?

A range trees is a binary-tree-based data structure used to implement range queries.

1D range query problem. Given a semigroup (i.e. an associative binary operator \(\bullet\) over a domain \(S\)), we seek a data structure that implements the following operations over some virtual array a:

\(\textrm{update}(x,v)\): set \(\texttt{a}[x] := v\)

\(\textrm{query}(x_1,x_2)\) [given \(x_1 \leq x_2\)]: return \(\texttt{a}[x_1] \bullet \texttt{a}[x_1+1] \bullet \cdots \bullet \texttt{a}[x_2-1]\)

For example, the range minimum query problem requires a data structure that updates integers in an array and answers queries of the form What is the lowest element in the range \([x_1,x_2)\)?. The range sum query problem requires a data structure answering queries like what is the sum of elements 4 to 17 inclusive?, and so on.

Range trees offer us \(O(\lg W)\) updates and queries using \(O(W)\), where \(W\) is the size of the array. They're constructed as balanced binary trees where each leaf corresponds to an element in the array (ordered left to right, sensibly enough), and where each non-leaf node contains the sum of all the leaves underneath it.

Schematic of a range tree that calculates range sums for an array a[7].

The root node contains the sum of the entire array, its children contain the sums of the left and right halves respectively, and so on. The leaves contain the values (sums, if you will) of individual array elements.

When the value of an array element changes, only \(O(\lg W)\) tree nodes need to be updated. Likewise, querying for the sum of a range of elements also only requires looking at the values stored in \(O(\lg n)\) nodes. (In the example diagram above, finding the sum from a[1] to a[4] inclusive requires looking at three nodes: the ones storing the sums for a[1], a[2] + a[3], and a[4].)

Haskell implementation

Here’s a first pass attempt at implementing a range tree in Haskell:

data RangeTree = RangeTree {width :: Int, tree :: RangeTreeNode}

newRangeTree w = RangeTree {width = w, tree = newRangeTreeNode 0 w}

update :: RangeTree -> Int -> Int -> RangeTree
update rt x v = rt {tree = updateRangeTreeNode (tree rt) 0 (width rt) x v}

query :: RangeTree -> Int -> Int -> Int
query rt = queryRangeTreeNode (tree rt) 0 (width rt)

-- Internals

data RangeTreeNode =
    Leaf Int
  | Branch RangeTreeNode Int RangeTreeNode

rtnValue :: RangeTreeNode -> Int
rtnValue (Leaf v) = v
rtnValue (Branch _ v _) = v

newRangeTreeNode :: Int -> Int -> RangeTreeNode
newRangeTreeNode x1 x2
  | x1 == x2-1 = Leaf 0
  | otherwise = Branch
    (newRangeTreeNode x1 ((x1+x2) `div` 2))
    (newRangeTreeNode ((x1+x2) `div` 2) x2)

updateRangeTreeNode :: RangeTreeNode -> Int -> Int -> Int -> Int -> RangeTreeNode
updateRangeTreeNode rtn x1 x2 x' v
  | x1 > x' || x2 <= x'
                = rtn
  | otherwise   = case rtn of
                    Leaf _ -> Leaf v
                    Branch l _ r -> Branch l' (rtnValue l' + rtnValue r') r'
                        l' = updateRangeTreeNode l x1 ((x1+x2) `div` 2) x' v
                        r' = updateRangeTreeNode r ((x1+x2) `div` 2) x2 x' v

queryRangeTreeNode :: RangeTreeNode -> Int -> Int -> Int -> Int -> Int
queryRangeTreeNode rtn x1 x2 x1' x2'
  | x1' >= x2 || x1 >= x2'
      = 0
  | x1' <= x1 && x2' >= x2
      = rtnValue rtn
  | otherwise 
      = queryRangeTreeNode l x1 ((x1+x2) `div` 2) x1' x2'
        + queryRangeTreeNode r ((x1+x2) `div` 2) x2 x1' x2'
        Branch l _ r = rtn

The implementation is quite standard. The module exposes the functions update and query, which are wrappers for the recursive functions updateRangeTreeNode and queryRangeTreeNode.

Let's see how this runs on some random data sets:

width# operationsruntime (s)
Table: runtime versus input dimensions for original program

Empirically it looks like we’re getting a total time complexity of \(O(Q \lg W)\), which is about what you’d expect for this data structure. What about space usage?

Turning on some profiling options and running the program with \(W = 1000\) and \(Q = 1000000\), we see the following space usage graph:

...well, that's odd. Our space consumption is growing over time with additional queries, even though range trees are supposed to have space consumption independent of the number of queries. What's going on here?

Space leaks in languages with garbage collection (e.g. Haskell, Java, Python, to name a few) are the result of the runtime retaining more data than it actually needs. In the case of Haskell, this is often due to lazy evaluation.

(Lazy evaluation means what it says on the tin: functions and expressions aren’t evaluated until they’re absolutely needed. If you define arr = map f [1..N] and print the first three values of arr, the program will calculate what it needs — f(1), f(2) and f(3) — and leave the remainder of the list sitting as a giant unevaluated thunk marked with the instructions map f [4..N]. If you later attempt to print out more of the list, more values will be calculated. On the other hand, if you never use the list again, then your memory usage is \(O(1)\) [not \(O(N)\)]. Either way, your results and thunks will be garbage collected the moment arr is out of scope.

When lazy evaluation is on your side, it allows for separation of concerns, namely definitions versus order/extent of calculation. One programmer on your team can define the entire game tree for Go or Chess, data structures of exponential, if not infinite, size. Later on, another programmer can take that definition and run height-bounded minimax on it (in which case only the requested levels are evaluated), or pick an arbitrary game path down the tree to visualise (in which case only that line down the tree is evaluated).

But lazy evaluation doesn’t always save space, especially in situations where a given definition is more complicated than its result. E.g. if your program is carrying around the definition x = 1 + 2 + 3 + 4 + + N, it could save a lot of space by evaluating that expression straight away.)

Digression over. Range tree! Space leaks. What’s going on?

Here are the results of a second profiling run, this time only looking at memory dedicated to unevaluated thunks. The memory in use is coloured according to what purpose it's being retained for:

It looks like unevaluated calls to updateRangeTreeNode are responsible for our space leak.

On further reflection this makes total sense: if there’s an unevaluated call sitting at leaf [x,x+1), then the only way it’s going to get evaluated is when there’s a query so specific that it just includes [x,x+1) but not its sibling. Anything else would result in early termination for the query, and so wouldn’t force evaluation of the leaf. Those queries are going to be rare; hence in the meantime lots of unevaluated queries pile up.

(Don't forget that this is a random case! I believe it's possible to construct worst-case sequences of queries where the space usage is not just \(O(Q+W)\), but \(O(Q \lg W + W)\).

Forcing tree evaluation with DeepSeq

The deepseq Haskell module provides methods for forcing a program to evaluate an expression in full. (That is, the entire data structure is traversed and any remaining thunks are evaluated before the rest of the program is allowed to continue.) For our purposes we consider the function force, whose meaning is very similar to the identity function but which forces full evaluation of its input.

force :: a -> a
force = a `deepseq` a

After tacking on a few calls to force (see the diff), we're left with a program which has the following space profile:

This is a much better looking space profile: constant with respect to the number of queries. Although our previous program version used less space for the first few hundred thousand queries, as soon as we've had enough updates to justify maintaining the entire tree we're suddenly doing very well for ourselves. We could run this program overnight and not see any space consumption issues.

(I'm guessing the tiny spike towards the right end of the graph there represents a point where a large number of updates happened in a row. If the outer part of the program didn't ask for their value for several milliseconds, then the force instruction itself wouldn't get evaluated — leading to the whole tree being evaluated — until a query required the program to peek inside the tree.)

Thus we've successfully— wait, what in hell is going on with that horizontal axis? Twenty-five seconds? Even accounting for the profiling overhead, that's way slower than we started with.

Benchmarking time:

width# operationsruntime (s)
Table: runtime versus input dimensions for DeepSeq program

Our time complexity appears to have shot up to \(O(W \times Q)\). What’s going on?

Some different profiling reveals the number of calls to various functions:

                                              INDIV        INHERITED
[function]  [module]    [id]    [calls]   [%time] [%mem] [%time] [%mem]
  force     RangeTree   155      499912    0.1    0.0    80.8    0.0
   rnf      RangeTree   156   999324088   80.7    0.0    80.7    0.0

We're getting about \(\frac{Q}{2}\) calls to force, as expected (since this is how many times our random case generator updates the tree). But the rnf function, which force uses to recursively force evaluation of the whole tree, is being called about 2000 times for each call to force — that is, it's traversing the entire tree.

For some silly reason I had hoped that the program wouldn't force evaluation of the same (unmodified) data structure more than once — memoisation and all that. Only at this point did I realise I was confusing force's traversal of the data structure with the data structure itself. Once the tree is evaluated, it sits in memory so long as there are still references to it. But the result of the tree traversal (a zero-byte dummy value signifying all good) isn't cached. A sufficiently clever compiler/interpreter might notice that the same function is being called on the same value twice; however there's nothing in the language spec to guarantee this.

Moral of the story: doing unnecessary tree traversals completely defeats the point of having a spiffy logarithmic data structure. Oops.

Forcing evaluation with `seq`

Okay. Take two.

Instead of using the DeepSeq module, we'll take a look at its built-in, less ambitious relative, seq. seq forces evaluation of its first argument up to weak head normal form (WHNF). Generally this means evaluating until the program reaches a built-in constant or a constructor.

Here we're evaluating RangeTrees, whose constructors are either Branch or Leaf. So tree `seq` val will evaluate tree until it encounters an expression of the form Branch (...) (...) (...) or Leaf (...), where the omitted parts can be arbitrarily complicated, before finally returning val.

Adding a call to seq in our wrapper function update is close to useless, since at best it will do enough work to determine that our top-level node is a Branch before bailing out. Instead, we want to call this recursively inside updateRangeTreeNode.

Here's the patch I settled on:

-    Branch l _ r -> Branch l' (rtnValue l' + rtnValue r') r'
+    Branch l _ r -> l' `seq` r' `seq` Branch l' (rtnValue l' + rtnValue r') r'

Here we force WHNF evaluation of the children before returning the constructor for the modified tree. Importantly, this means that if we reach this line of code while evaluating an instance of seq further up the tree, we'll recursively call seq on our child nodes as well.

However, unlike the indiscriminate evaluation we got from force, we won't need to recurse any deeper if it's a trivial case. (Our trivial cases are when the update is outside the scope of the node, and otherwise when the node is a leaf.) This means that only \(O(\lg W)\) calls to seq (and hence forced partial evaluations) will occur for each update we perform on the tree.

Walls of text aside, let's look at our space performance:

(Looks constant. Excellent.)

And as for our time performance:

width# operationsruntime (s)
Table: runtime versus input dimensions for `seq` program

These are the same times we had for our original \(O(Q \lg W)\) program, if not a little faster(!)

Lessons learned:

  • Functional languages with algebraic data types are perfectly suited for tree-like data structures.
  • Lazy evaluation can be bad, but even so that doesn't make strictly evaluate ALL of the things!1 the best strategy.

(Source code and miscellaneous scaffolding here.)