Optimising Haskell: parallelism and the Par monad
The last Advent of Code solution I'll optimise (as described in my Overview of Advent of Code 2023) is day 23, which I'll do by making it use more cores.
The problem is to find the longest path through a graph. Unlike the various shortest-path algorithms, there's no way to stop early or to use a heuristic to guide the search. You've got no real option but to generate all the possible routes and then pick the longest.
However, Haskell typically uses only a single thread of execution, and "finding all possible routes" is a task that is trivially parallelisable: you can find all extensions of a partial route in parallel. This dataflow style of computation neatly fits the Par monad, which uses the GHC runtime's lightweight internal threads to provide parallel execution. (They're lightweight because they're scheduled inside the Haskell runtime, rather than each spawning a new OS thread.)
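As a minimal sketch of the Par monad's style (assuming the fork/new/put/get API from the monad-par package's Control.Monad.Par; not code from this solution):

```haskell
import Control.Monad.Par

-- Compute two values in parallel and combine them.
parPair :: Int -> Int -> (Int, Int)
parPair x y = runPar $ do
  ix <- new               -- an IVar to hold the first result
  iy <- new
  fork $ put ix (x * x)   -- each forked task may run on another core
  fork $ put iy (y * y)
  rx <- get ix            -- get blocks until the IVar is filled
  ry <- get iy
  return (rx, ry)

main :: IO ()
main = print (parPair 3 4)   -- prints (9,16)
```

The dataflow structure is explicit: `get` expresses a dependency on a result that some forked task will `put`, and the scheduler is free to run independent tasks on different cores.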
Profiling the program showed that it was spending all its time in the searchCompressed function calls, not in any of the setup. That meant I only needed to change one function to get a good speedup.
Refactoring
The first thing to do was refactor the search function to be something that could be easily made parallel. The existing approach, using an agenda of partial solutions, was inherently single-threaded.
I considered keeping the agenda and having a pool of workers that updated it; each worker would take a partial solution from the agenda, produce a set of successor partial solutions, and put them back on the queue. However, that was a lot of work and didn't seem to fit the "Haskell way."
Instead, I took inspiration from Simon Marlow's book, Parallel and Concurrent Programming in Haskell, and used the Par monad to exploit the parallelism directly. The book has several examples of using Par, including tree search. It makes the search parallel by recursing directly into the successors of each partial solution, rather than using an explicit agenda.
That meant changing the previous depth-first search from this (in MainOriginal.hs):
searchCompressed :: CompressedMap -> Position -> [[Position]] -> [[Position]] -> [[Position]]
searchCompressed _ _ found [] = found
searchCompressed map goal found (current:agenda)
| head current == goal = searchCompressed map goal (current:found) agenda
| otherwise = searchCompressed map goal found (nextPositions ++ agenda)
where neighbours0 = map M.! (head current)
neighbours = neighbours0 ^.. folded . filtered ((`notElem` current) . _nextPos)
nextPositions = fmap ((: current) . _nextPos) neighbours
into this (in MainTree.hs):
searchCompressed :: CompressedMap -> Position -> [Position] -> [[Position]]
searchCompressed map goal current
| head current == goal = [current]
| otherwise = concatMap (searchCompressed map goal) nextPositions
where neighbours0 = map M.! (head current)
neighbours = neighbours0 ^.. folded . filtered ((`notElem` current) . _nextPos)
nextPositions = fmap ((: current) . _nextPos) neighbours
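The shape of that refactoring can be illustrated on a toy search, independent of the AoC types (searchAgenda and searchTree are illustrative names of mine, not from the original code):

```haskell
import Data.List (sort)

-- Agenda version: threads an explicit work list of partial solutions.
searchAgenda :: (a -> [a]) -> (a -> Bool) -> [a] -> [a] -> [a]
searchAgenda _ _ found [] = found
searchAgenda succs done found (x:agenda)
  | done x    = searchAgenda succs done (x:found) agenda
  | otherwise = searchAgenda succs done found (succs x ++ agenda)

-- Tree version: recurses directly into the successors of each
-- partial solution, so there is no shared agenda to serialise on.
searchTree :: (a -> [a]) -> (a -> Bool) -> a -> [a]
searchTree succs done x
  | done x    = [x]
  | otherwise = concatMap (searchTree succs done) (succs x)

main :: IO ()
main = do
  -- Enumerate all bit-strings of length 3: eight complete "paths".
  let succs xs = [0 : xs, 1 : xs]
      done xs  = length xs == (3 :: Int)
  print (length (searchTree succs done []))
  print (sort (searchAgenda succs done [] [[]]) == sort (searchTree succs done []))
```

Both versions enumerate the same complete paths; the difference is that each recursive call in the tree version is an independent computation, ready to be handed to a parallel scheduler.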
Performance
This change alone actually made the program run faster, though not by much.
Program | Time |
---|---|
Original | 12.2 seconds |
Tree | 10.8 seconds |
Parallelising
Now the code was refactored, I could introduce the parallelism. Conceptually this was as simple as replacing the concatMap with a parConcatMap. Unfortunately, the Control.Monad.Par library doesn't provide one, so I split the operation into separate parMapM and concat steps. I also kept the whole computation in a single Par monad, to avoid the overhead of running many separate Par computations.
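For what it's worth, the missing combinator is a one-liner on top of what the library does provide (parConcatMap is my name for it, not part of Control.Monad.Par):

```haskell
import Control.Monad.Par
import Control.DeepSeq (NFData)

-- Run the action on every element in parallel, then flatten the
-- results: the same parMapM-then-concat split as in the search.
parConcatMap :: NFData b => (a -> Par [b]) -> [a] -> Par [b]
parConcatMap f xs = concat <$> parMapM f xs

main :: IO ()
main = print $ runPar $ parConcatMap (\x -> return [x, x * 10]) [1, 2, 3 :: Int]
```

parMapM preserves the order of its input, so this behaves like concatMap with the element computations running in parallel.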
I also needed a wrapper to call the parallel search. This version is in MainNoDepthLimit.hs; you'll see why it's called that in a moment.
searchCompressed :: CompressedMap -> Position -> [Position] -> [[Position]]
searchCompressed map goal current = runPar $ searchCompressedM map goal current
searchCompressedM :: CompressedMap -> Position -> [Position] -> Par [[Position]]
searchCompressedM map goal current
| head current == goal = return [current]
| otherwise =
do paths <- parMapM (searchCompressedM map goal) nextPositions
return $ concat paths
where neighbours0 = map M.! (head current)
neighbours = neighbours0 ^.. folded . filtered ((`notElem` current) . _nextPos)
nextPositions = fmap ((: current) . _nextPos) neighbours
Performance
This change gave a large increase in performance. It was also clear that all the cores on my PC were being used during execution.
Program | Time |
---|---|
Original | 12.2 seconds |
Tree | 10.8 seconds |
Parallel | 6.3 seconds |
Optimising parallelism
The Parallel and Concurrent Programming in Haskell book mentions several times that parallelism often benefits from being tamed. While Par's tasks are lightweight, they aren't cost-free. The ideal is to generate enough tasks that they can be allocated efficiently to different cores, but not so many that the runtime spends most of its time managing them.
In this case, the code above creates too much parallelism. I can tame it using the technique mentioned in the book: a depth limit. I start by generating search tasks in parallel, but once there are enough of them, I revert to the more efficient non-parallel process to evaluate each task and its sub-tasks. I use the length of each partial solution as a proxy for how many tasks have already been generated.
That means putting a depth limit on the parallel search. Below the depth limit, I solve sub-tasks in parallel; once the limit is reached, I call the sequential version. This is in Main.hs, and uses the refactored searchCompressed from above, renamed to searchCompressedTree.
searchCompressed :: CompressedMap -> Position -> [Position] -> [[Position]]
searchCompressed map goal current = runPar $ searchCompressedM parallelDepthLimit map goal current
searchCompressedM :: Int -> CompressedMap -> Position -> [Position] -> Par [[Position]]
searchCompressedM depthLimit map goal current
| head current == goal = return [current]
| depthLimit == 0 = return $ searchCompressedTree map goal current
| otherwise =
do paths <- parMapM (searchCompressedM (depthLimit - 1) map goal) nextPositions
return $ concat paths
where neighbours0 = map M.! (head current)
neighbours = neighbours0 ^.. folded . filtered ((`notElem` current) . _nextPos)
nextPositions = fmap ((: current) . _nextPos) neighbours
Performance
The right depth limit has to be found experimentally: there doesn't seem to be a good way of predicting it, as it depends on both the hardware and the task. I ran a few experiments and settled on a depth limit of 7 for this task.
That led to this performance: roughly a factor of five speedup over the original (12.2 s ÷ 2.4 s ≈ 5.1×). That's decent, but not as much as I would have expected for spreading the task across 24 cores.
Program | Time |
---|---|
Original | 12.2 seconds |
Tree | 10.8 seconds |
Parallel | 6.3 seconds |
Optimised parallel | 2.4 seconds |
Code
You can get the code from my locally-hosted Git repo, or from Gitlab.