Optimising Haskell: parallelism and the Par monad

The last Advent of Code solution I'll optimise (as described in my Overview of Advent of Code 2023) is day 23, which I'll do by making it use more cores.

The problem is to find the longest path through a graph. Unlike the various shortest-path algorithms, there's no way to stop early or to use a heuristic to guide the search. You've got no real option but to generate all the possible routes and then pick the longest.

However, a Haskell program runs on a single core unless you ask for more, and "finding all possible routes" is a task that is trivially parallelisable: all the extensions of a partial route can be explored independently. This dataflow-network structure fits the Par monad neatly. Par uses the GHC runtime's lightweight threads to provide parallel execution. (They're lightweight because they're scheduled inside the Haskell runtime, rather than each spawning an OS thread.)
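As a toy sketch of the programming model (this assumes the monad-par package, which provides Control.Monad.Par; it's not part of the day-23 solution itself):

```haskell
import Control.Monad.Par (runPar, parMap)

-- Square each number in a separate lightweight task.
-- runPar returns once the whole dataflow network has completed.
squares :: [Int] -> [Int]
squares xs = runPar $ parMap (^ 2) xs

main :: IO ()
main = print (squares [1 .. 5])  -- prints [1,4,9,16,25]
```

Each element becomes its own task, and the runtime's scheduler spreads those tasks across however many cores the program was given.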

Profiling the program showed that it was spending all its time in the searchCompressed function calls, not any of the setup. This meant I only needed to fix one function to get a good speedup.

Refactoring

The first thing to do was refactor the search function to be something that could be easily made parallel. The existing approach, using an agenda of partial solutions, was inherently single-threaded.

I considered keeping the agenda and having a pool of workers that updated it: each worker would take a partial solution from the agenda, produce its set of successor partial solutions, and put them back on the queue. However, that was a lot of work and didn't seem to fit the "Haskell way."

Instead, I took inspiration from Simon Marlow's book Parallel and Concurrent Programming in Haskell, which has several examples of using Par, including tree search, and used the Par monad to exploit the parallelism directly. The way the book makes tree searches parallel is to recurse directly into the successors of each partial solution, rather than using an explicit agenda.

That meant changing the previous depth-first search of this (in MainOriginal.hs):

searchCompressed :: CompressedMap -> Position -> [[Position]] -> [[Position]] -> [[Position]]
searchCompressed _ _ found [] = found
searchCompressed map goal found (current:agenda) 
  | head current == goal = searchCompressed map goal (current:found) agenda
  | otherwise = searchCompressed map goal found (nextPositions ++ agenda)
  where neighbours0 = map M.! (head current)
        neighbours = neighbours0 ^.. folded . filtered ((`notElem` current) . _nextPos)
        nextPositions = fmap ((: current) . _nextPos) neighbours

into this (in MainTree.hs):

searchCompressed :: CompressedMap -> Position -> [Position] -> [[Position]]
searchCompressed map goal current
  | head current == goal = [current]
  | otherwise = concatMap (searchCompressed map goal) nextPositions
  where neighbours0 = map M.! (head current)
        neighbours = neighbours0 ^.. folded . filtered ((`notElem` current) . _nextPos)
        nextPositions = fmap ((: current) . _nextPos) neighbours

Performance

This change alone made the program run faster, though not by a lot.

Program      Time
  Original   12.2 seconds
  Tree       10.8 seconds

Parallelising

Now that the code was refactored, I could introduce the parallelism. Conceptually, this was as simple as replacing the concatMap with a parConcatMap. Unfortunately, the Control.Monad.Par library doesn't provide one, so I split the operation into separate parMapM and concat steps. I also kept the whole computation inside a single Par monad, to avoid the overhead of launching many separate Par computations.
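The missing combinator is only a one-liner, so a hypothetical parConcatMap could be sketched like this (again assuming the monad-par package; the name is my own, not the library's):

```haskell
import Control.DeepSeq (NFData)
import Control.Monad.Par (Par, runPar, parMapM)

-- Hypothetical helper: Control.Monad.Par has no parConcatMap, but
-- it is just parMapM followed by concat, kept inside one Par
-- computation so there's a single runPar at the top level.
parConcatMap :: NFData b => (a -> Par [b]) -> [a] -> Par [b]
parConcatMap f xs = concat <$> parMapM f xs
```

For example, runPar (parConcatMap (\x -> return [x, x]) [1, 2]) yields [1, 1, 2, 2].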

I also needed a wrapper to call the parallel search. This was in MainNoDepthLimit.hs; you'll see why it's called that in a moment.

searchCompressed :: CompressedMap -> Position -> [Position] -> [[Position]]
searchCompressed map goal current = runPar $ searchCompressedM map goal current

searchCompressedM :: CompressedMap -> Position -> [Position] -> Par [[Position]]
searchCompressedM map goal current
  | head current == goal = return [current]
  | otherwise = 
    do  paths <- parMapM (searchCompressedM map goal) nextPositions
        return $ concat paths
  where neighbours0 = map M.! (head current)
        neighbours = neighbours0 ^.. folded . filtered ((`notElem` current) . _nextPos)
        nextPositions = fmap ((: current) . _nextPos) neighbours

Performance

This change gave a large improvement in performance, and it was clear that all the cores on my PC were being used during the execution.

Program      Time
  Original   12.2 seconds
  Tree       10.8 seconds
  Parallel   6.3 seconds

Optimising parallelism

The Parallel and Concurrent Programming in Haskell book mentions several times that parallelism often benefits from being tamed. While the parallel tasks are lightweight, they aren't cost-free. The ideal is to generate enough tasks that they can be allocated efficiently to different cores, but not so many that the runtime spends most of its time managing them.
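The classic illustration of this trade-off is a parallel Fibonacci with a sequential cut-off; here's a minimal sketch in the same spirit as the book's examples (assuming the monad-par package; the threshold value in main is arbitrary):

```haskell
import Control.Monad.Par (Par, runPar, spawn, get)

-- Sequential Fibonacci, used below the cut-off.
fibSeq :: Int -> Int
fibSeq n | n < 2     = n
         | otherwise = fibSeq (n - 1) + fibSeq (n - 2)

-- Parallel Fibonacci with a granularity threshold: small
-- sub-problems aren't worth a task of their own.
fibPar :: Int -> Int -> Par Int
fibPar threshold n
  | n < threshold = return (fibSeq n)
  | otherwise = do
      a <- spawn (fibPar threshold (n - 1))  -- fork one branch...
      b <- fibPar threshold (n - 2)          -- ...evaluate the other here
      x <- get a                             -- wait for the forked branch
      return (x + b)

main :: IO ()
main = print (runPar (fibPar 10 20))  -- prints 6765
```

Below the threshold, no new tasks are created and the cheap sequential code runs; above it, each recursive split becomes a new task. My search code needed exactly the same shape of cut-off.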

In this case, the code above creates too much parallelism: it forks a new task for every extension of every partial route. I can tame it using the technique mentioned in the book: a depth limit. I start by generating search tasks in parallel, but once I have enough of them, I revert to the more efficient sequential process to evaluate each task and its sub-tasks. I use the length of each partial solution as a proxy for the number of tasks generated so far.

That means adding a depth limit to the parallel search. Below the depth limit, I solve sub-tasks in parallel; once the limit is reached, I call the sequential version. This is in Main.hs, and uses the refactored searchCompressed from above, renamed searchCompressedTree.

searchCompressed :: CompressedMap -> Position -> [Position] -> [[Position]]
searchCompressed map goal current = runPar $ searchCompressedM parallelDepthLimit map goal current

searchCompressedM :: Int -> CompressedMap -> Position -> [Position] -> Par [[Position]]
searchCompressedM depthLimit map goal current
  | head current == goal = return [current]
  | depthLimit == 0 = return $ searchCompressedTree map goal current
  | otherwise = 
    do  paths <- parMapM (searchCompressedM (depthLimit - 1) map goal) nextPositions
        return $ concat paths
  where neighbours0 = map M.! (head current)
        neighbours = neighbours0 ^.. folded . filtered ((`notElem` current) . _nextPos)
        nextPositions = fmap ((: current) . _nextPos) neighbours

Performance

The right depth limit has to be found experimentally: there doesn't seem to be any good way of predicting it, as it depends on both the hardware and the task. I ran a few experiments and settled on a depth limit of 7 for this problem.

That led to the performance below: a factor-of-five speedup over the original. That's decent, but not as much as I would have expected from spreading the task across 24 cores.

Program               Time
  Original            12.2 seconds
  Tree                10.8 seconds
  Parallel            6.3 seconds
  Optimised parallel  2.4 seconds

Code

You can get the code from my locally-hosted Git repo, or from Gitlab.