In my Advent of Code review, I mentioned that several of my solutions took an excessive amount of time and storage to complete. This is a worked example of how I addressed one of them.

The day 15 problem involved reading the details of a bunch of sensors. Each sensor had a position (on a regular 2D grid) and a "forbidden zone" where items could not be. The overall forbidden zone was the union of each sensor's forbidden zone.

The first part of the problem was to find the number of forbidden cells in one particular row of the grid. The second part was to find the one position in a (very large) region that was not forbidden by any sensor.

My initial solution took 2 minutes to execute and consumed 17.9 GB of RAM, far outside the "15 seconds on ten-year-old hardware" criterion for a decent solution. I could do better.

## Profiling

The first step was to profile the program to see where it was spending its time. From a visual inspection of the output, it was obvious that the second part of the problem was the one taking most of the time. The code for this part was short enough to profile. (This is advent15/MainOriginal.hs.)

```haskell
part2 sensors = x * 4000000 + y
  where coverage = mconcat $ fmap nearby sensors
        boundaries = {-# SCC boundaries #-} S.filter (inRange searchRange)
                     $ S.unions $ fmap justOutside sensors
        V2 x y = {-# SCC findMinV #-} S.findMin
                 $ S.filter (not . (getRegion coverage)) boundaries
```

The profiling report showed that finding the boundary regions and combining them was the main bottleneck in the program.

| Cost centre | %time | %alloc |
|-------------|-------|--------|
| justOutside | 39.1 | 66.8 |
| boundaries | 28.0 | 23.6 |
| nearby | 14.1 | 0.0 |
| <> | 8.5 | 0.0 |
| part1 | 6.4 | 1.2 |
| findMinV | 2.1 | 0.0 |
| manhattan | 1.8 | 8.4 |

I decided to try optimising this program in stages of increasing complexity.

1. Ordering how the regions are checked when testing if a point is forbidden
2. Exploiting laziness when generating the candidate points to test
3. Adding parallelism to test several points at once

## Ordering regions

This was a simple optimisation: easy to implement, even though the profiling suggested it wouldn't have a great effect.

The `mappend` function for regions is basically a logical or:

```haskell
instance Semigroup Region where
  r1 <> r2 = Region (\p -> getRegion r1 p || getRegion r2 p)

instance Monoid Region where
  mempty = Region (const False)
```

This means that the `getRegion` function will short-circuit as soon as a point is found to be in one region. Therefore, testing the largest regions first should help performance. That involved defining an `Ord` instance for sensors, then building the coverage region in order of size. (This is advent15/MainSorted.hs.)

```haskell
instance Ord Sensor where
  (Sensor s1 b1) `compare` (Sensor s2 b2)
    = (s1 `manhattan` b1) `compare` (s2 `manhattan` b2)

coverage = mconcat $ fmap nearby $ sortOn Down sensors
```

| Program | System time | User time | Wall time | Memory used (KB) |
|---------|-------------|-----------|-----------|------------------|
| Original | 18.47 | 202.19 | 2:02.91 | 17,888,168 |
| Sorted region | 18.09 | 189.46 | 2:00.68 | 17,886,248 |

(A further optimisation would be to add new regions in order of how much additional area each contributes to the combined region, but that requires working out how large the overlaps are. I decided this was unlikely to make a difference.)
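To see the short-circuiting concretely, here is a tiny, self-contained sketch of my own (restating the Semigroup and Monoid instances from the post, with illustrative regions `big` and `never` that are not part of the solution):

```haskell
-- Self-contained illustration of the short-circuit behaviour.
newtype Region = Region { getRegion :: (Int, Int) -> Bool }

instance Semigroup Region where
  r1 <> r2 = Region (\p -> getRegion r1 p || getRegion r2 p)

instance Monoid Region where
  mempty = Region (const False)

big, never :: Region
big   = Region (const True)                  -- matches every point
never = Region (\_ -> error "never forced")  -- would crash if evaluated

-- (||) is lazy in its second argument, so `never` is never consulted.
demo :: Bool
demo = getRegion (big <> never) (0, 0)
```

Because `big` accepts the point, `getRegion never` is never called; putting the biggest regions first maximises how often the check stops early like this.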

## Exploiting laziness

My overall plan for the part 2 solution was to find the points on the border of each sensor's forbidden region, then check each one. As several of these regions overlap, some points will be present in the border of more than one region.

To avoid wasted work, I combine all the points-to-test into one Set and then test each point in it. This is a nice idea, except that the implementation of Set is strict. This means that I generate all 4.9 million points, store them in a set, then filter them for being outside all forbidden regions.

However, generating then discarding those 4.9 million points takes a lot of time and space, especially as I only need to find one of them.

Therefore, I tried to use Haskell's laziness to convert this into a pipeline, where points are generated only as needed by the filtering stage. This meant converting the intermediate storage of points from a Set to a List.
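As a toy comparison (my own illustration, not the solution code), the difference is visible even with plain numbers: filtering a `Set` builds and stores every element first, while filtering a list lets `head` stop at the first match.

```haskell
import qualified Data.Set as S

-- Strict: all 100,000 elements are built into the set before filtering.
firstBigStrict :: Int
firstBigStrict = S.findMin $ S.filter (> 10) $ S.fromList [1 .. 100000]

-- Lazy: elements are produced one at a time; `head` stops at the first hit.
firstBigLazy :: Int
firstBigLazy = head $ filter (> 10) [1 .. 100000]
```

Both give the same answer, but the lazy version only ever produces eleven elements.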

The change in the code was small. (This is advent15/MainLazy.hs.)

```haskell
part2 sensors = x * 4000000 + y
  where coverage = mconcat $ fmap nearby $ sortOn Down sensors
        boundaries = fmap (filter (inRange searchRange)) $ fmap justOutside sensors
        V2 x y = head $ concat $ fmap (filter (not . (getRegion coverage))) boundaries
```

This had a significant effect on the program's runtime (from two minutes to five seconds) and memory consumption (from 17.9 GB to 81 MB).

| Program | System time | User time | Wall time | Memory used (KB) |
|---------|-------------|-----------|-----------|------------------|
| Original | 18.47 | 202.19 | 2:02.91 | 17,888,168 |
| Sorted region | 18.09 | 189.46 | 2:00.68 | 17,886,248 |
| Lazy boundary creation | 0.10 | 5.69 | 0:05.64 | 80,708 |

## Parallelism

The use of laziness increased the speed enormously, but the program was still only using one core. Could I decrease the wall-clock time by using parallelism?

The Control.Parallel.Strategies library handles many of the details of parallelism for you. (I also decided to move the creation of the coverage region into the main function, where it could be reused rather than re-created.) (This is advent15/MainDirectParallel.hs.)

```haskell
part1, part2 :: [Sensor] -> Region -> Int
part1 sensors coverage = length $ filter (\p -> p `notElem` occupied)
                       $ fmap snd $ filter fst forbidden
  where rowCoords = range ( (V2 (globalMinX sensors) thisY)
                          , (V2 (globalMaxX sensors) thisY)
                          )
        occupied = concatMap (\(Sensor s b) -> [s, b]) sensors
        forbidden = (fmap (\p -> (getRegion coverage p, p)) rowCoords)
                      `using` (parList rdeepseq)

part2 sensors coverage = x * 4000000 + y
  where boundaries = fmap (filter (inRange searchRange))
                     $ fmap justOutside sensors
        holes = (fmap (filter (not . (getRegion coverage))) boundaries)
                  `using` (parList rpar)
        V2 x y = head $ concat holes
```

The good news is that this had a huge effect on part 2, with all cores being fully used and the solution arriving quickly.

The bad news is that part 1 was now slower and was still only using one core for most of the time.

| Program | System time | User time | Wall time | Memory used (KB) |
|---------|-------------|-----------|-----------|------------------|
| Original | 18.47 | 202.19 | 2:02.91 | 17,888,168 |
| Sorted region | 18.09 | 189.46 | 2:00.68 | 17,886,248 |
| Lazy boundary creation | 0.10 | 5.69 | 0:05.64 | 80,708 |
| Direct parallelism | 3.34 | 27.14 | 0:11.49 | 1,322,200 |

An investigation using Threadscope confirmed this was the case. The block of time from about 10 seconds on was the solution to part 2.

A single core being used implies that a non-parallel part of the job is dominating the work. That is likely to be the filter part of part 1. All the points are created and evaluated, and the final list of points then has to pass through the one filter operation.

That implies that a better way to execute the solution in part 1 is to split the range of points into chunks. Each chunk can be processed in a separate thread, generating the number of forbidden points in that chunk. Adding up those few numbers will then be fast.

That leads to an alternative solution to part 1. I find the list of positions to test, then chunk it into sections. Each section is then processed in parallel, with the processing converting the list of positions into a count of the forbidden positions in that chunk. (This is advent15/Main.hs.)

```haskell
part1, part2 :: [Sensor] -> Region -> Int
part1 sensors coverage = sum (fmap countForbidden rowChunks `using` parList rseq)
  where rowCoords = range ( (V2 (globalMinX sensors) thisY)
                          , (V2 (globalMaxX sensors) thisY)
                          )
        rowChunks = chunksOf 1000 rowCoords
        occupied = concatMap (\(Sensor s b) -> [s, b]) sensors
        countForbidden positions =
          length $ filter (\p -> p `notElem` occupied)
                 $ filter (getRegion coverage) positions
```

```haskell
part2 sensors coverage = x * 4000000 + y
  where boundaries = fmap (filter (inRange searchRange))
                     $ fmap justOutside sensors
        holes = fmap (filter (not . (getRegion coverage))) boundaries
                  `using` parList rseq
        V2 x y = head $ concat holes
```

This gives a large speedup, with all cores being used for most of part 1. The first half-second or so is reading the input file then building the list of positions to test.

This gives the fastest runtime of all, but not the smallest memory footprint.

| Program | System time | User time | Wall time | Memory used (KB) |
|---------|-------------|-----------|-----------|------------------|
| Original | 18.47 | 202.19 | 2:02.91 | 17,888,168 |
| Sorted region | 18.09 | 189.46 | 2:00.68 | 17,886,248 |
| Lazy boundary creation | 0.10 | 5.69 | 0:05.64 | 80,708 |
| Direct parallelism | 3.34 | 27.14 | 0:11.49 | 1,322,200 |
| Parallelism and chunking | 0.59 | 14.82 | 0:02.36 | 672,592 |

## Better algorithms

Of course, this post was an example of optimising the performance of this approach. Other approaches to the problem, and other algorithms, can yield different performance.

For instance, consideration of this problem shows that the one point we have to find must be one of the intersections of the boundaries of two (or more) forbidden regions. A bit of algebra can find those intersections, giving many fewer points to inspect.
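A sketch of that algebra (my own illustration, not code from the repo): each diamond boundary lies on lines of the form x + y = c (falling) and x − y = d (rising), and two lines of opposite slope cross at a single point.

```haskell
-- Hypothetical helper: intersection of a falling line x + y = c with a
-- rising line x - y = d. Adding and subtracting the two equations gives
-- x = (c + d) / 2 and y = (c - d) / 2, which is a grid point only when
-- c + d is even.
lineIntersection :: Int -> Int -> Maybe (Int, Int)
lineIntersection c d
  | even (c + d) = Just ((c + d) `div` 2, (c - d) `div` 2)
  | otherwise    = Nothing
```

Collecting these crossing points for every pair of sensor boundaries leaves only a handful of candidates to test against the coverage region.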

Another approach is to consider the forbidden regions as sets of forbidden intervals on the rows of the underlying grid. For any given y position and sensor, you can quickly calculate the forbidden region on that row. You can then use interval relations to find the region forbidden by the union of sensors on that row.
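That row-by-row idea can be sketched like this (again my own illustration; `rowInterval` and `mergeIntervals` are hypothetical helpers, not code from the repo):

```haskell
import Data.List (sortOn)

type Interval = (Int, Int)  -- inclusive [lo, hi]

-- The interval a sensor at (sx, sy) with Manhattan radius r forbids on
-- row y: the diamond is w = r - |sy - y| cells wide either side of sx,
-- and misses the row entirely when w is negative.
rowInterval :: Int -> (Int, Int, Int) -> Maybe Interval
rowInterval y (sx, sy, r)
  | w >= 0    = Just (sx - w, sx + w)
  | otherwise = Nothing
  where w = r - abs (sy - y)

-- Merge intervals that overlap or touch into disjoint ones.
mergeIntervals :: [Interval] -> [Interval]
mergeIntervals = go . sortOn fst
  where go (i1@(a, b) : i2@(c, d) : rest)
          | c <= b + 1 = go ((a, max b d) : rest)
          | otherwise  = i1 : go (i2 : rest)
        go xs = xs
```

A row whose merged intervals leave a one-cell gap inside the search range pinpoints the part 2 answer.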

## Code

You can get the code from my locally-hosted Git repo, or from Gitlab.