Honeycomb puzzle, part 1

    A little late to the day, but I eventually got around to addressing this "honeycomb puzzle" challenge. It's a simple puzzle, but there's scope for a little bit of optimisation.

    In this part, I solve the puzzle. In part 2, I look at a couple of optimisations for it.

    The puzzle was set by Five Thirty-Eight, based on a game in the New York Times. I'll just quote the rules from Five Thirty-Eight:

    The goal is to identify as many words that meet the following criteria:
    • The word must be at least four letters long.
    • The word must include the central letter.
    • The word cannot include any letter beyond the seven given letters.
    Note that letters can be repeated. For example, the words GAME and AMALGAM are both acceptable words. Four-letter words are worth 1 point each, while five-letter words are worth 5 points, six-letter words are worth 6 points, seven-letter words are worth 7 points, etc. Words that use all of the seven letters in the honeycomb are known as “pangrams” and earn 7 bonus points (in addition to the points for the length of the word). So in the above example, MEGAPLEX is worth 15 points.

    The challenge is to find the honeycomb that gives the highest score (using a vocabulary from a cached version of the Enable word list), but with restrictions that the honeycomb not contain any duplicate letters (why would you?), cannot contain the letter "S", and must contain at least one pangram.

    That's the challenge, let's go about solving it.

    I'll do that in both Python and Haskell. The approach, algorithms, and data structures are the same in both versions, even if the surface level syntax is different.

    Imports

    In both Python and Haskell, I need to import a few libraries.

    Python uses some standard libraries.

    import string
    import collections
    from dataclasses import dataclass
    

    Haskell uses some standard libraries, including the containers library for sets and maps.

    import qualified Data.Set as S
    import qualified Data.Map.Strict as M
    import Data.Map.Strict ((!))
    import Data.List
    

    Key data structures

    This problem revolves around sets of letters and checking subsets of letters. These are sets in the mathematical sense: collections of letters that are unordered and without duplicates. The letters in the honeycomb can be used in any order. I can use each letter as many times as we like, so the only thing I care about is whether a letter exists in the honeycomb or not.

    Being a bit more precise, I can make a word from a honeycomb if

    1. The letters in the word are a subset of the letters in the honeycomb
    2. The central letter is in the word

    Honeycomb

    From the definition of "is a word present in a honeycomb", it follows that a honeycomb is something that has a set of letters and a mandatory letter. Let's define that.

    In Python, I'll use a dataclass, which I'll make immutable (frozen):

    @dataclass(frozen=True)
    class Honeycomb():
        mandatory: str
        letters: frozenset
          
        def __repr__(self):
            return f'{self.mandatory}|{"".join(sorted(self.letters - set(self.mandatory)))}'
    

    (I also define __repr__ to pretty-print honeycombs if need be.)

    In Haskell, it's an algebraic data type (and a type alias for convenience):

    type LetterSet = S.Set Char
    
    data Honeycomb = Honeycomb Char LetterSet
        deriving (Show, Eq, Ord)
    

    Word-in-honeycomb test

    Given a honeycomb, a word is present in that honeycomb given the two conditions above. They're directly translated into code.

    In Python

    def present(target, honeycomb):
        return (   (honeycomb.mandatory in target) 
               and target.issubset(honeycomb.letters)
               )
    

    and Haskell

    present :: LetterSet -> Honeycomb -> Bool
    present target (Honeycomb mandatory letters) =
        (mandatory `S.member` target) && (target `S.isSubsetOf` letters)
    

    Vocabulary

    The next thing is to read and store the vocabulary. I could just store a list of the words, but really I'm only interested in the sets of letters in a word. Instead, I'll group together all the words that have the same set of letters. I'll have a dict or map connecting sets of letters to the words that they can make.

    For instance, the set {a, e, g, p} can be used to make the words {peage, agapae, page, gape, agape, peag}. The dict will have an entry with key {a, e, g, p} and the value at that key will be {peage, agapae, page, gape, agape, peag}.

    Loading the words

    This is basic file reading (and a new type synonym for Haskell). There are a couple of constraints we can apply at this point, which rule out three classes of words:

    1. Words with fewer than four letters don't score
    2. Words with more than seven distinct letters can't be made by a honeycomb
    3. Words containing an 's', as the honeycomb we're aiming for doesn't contain an 's'
    with open('enable1.txt') as f:
        words = [line.strip() for line in f]
    words = set(w for w in words 
                if len(w) >= 4
                if len(frozenset(w)) <= 7
                if 's' not in w)
    
    type WordSet = M.Map LetterSet (S.Set String)
    
    main = do
      allWords <- readFile "enable1.txt"
      let validWords = [w | w <- words allWords, 
                            length w >= 4, 
                            (S.size $ S.fromList w) <= 7,
                            's' `notElem` w]
    

    This is enough to check which words are in any given honeycomb. But I want to build the word sets, to help me check many honeycombs.

    Building the word sets

    The basic idea is to walk across the collection of loaded words, inserting each one into the collection of word sets. The key is the set of letters, the value is the set of words that can be made from those letters.

    In Python, this looks like iterating over the list of words, inserting each word as I go. I use a collections.defaultdict as the data structure, as that makes it easier to insert each word (we assume there's always a set of words for each key, regardless of whether we've we've see the key before).

    word_sets = collections.defaultdict(set)
    for w in words:
        letters = frozenset(w)
        word_sets[letters].add(w)
    

    In Haskell, I'm folding the words into the map, starting with an empty map. I use the the insertWith function to add words to the map.

    mkWordSets :: [String] -> WordSet
    mkWordSets = foldr addWord M.empty
        where addWord w = M.insertWith S.union (S.fromList w) (S.singleton w)
    

    Note that there's a fair bit of partial application in this definition. addWord has type String -> WordSet -> WordSet and uses union to combine this word (as a singleton set) with the existing words (if any).

    I end up being able to feed a key of word_sets into present and that will tell me if those words can be made with that honeycomb.

    Scoring word sets

    The end goal is to find the highest-scoring honeycomb. That means I'll be presenting lots of word sets to lots of honeycombs and then scoring them. But each set of words has the same score each time it's present, so I can save time by calculating the score for a set of letters and store that score.

    I'll follow the scoring rules as given. A four-letter word scores 1 point. Longer words score their length. If all seven letters are used, score a bonus 7 points. We wrap those rules in a function that, when given a set of letters, finds the words that can be made from that set, finds the score for each word, and returns the sum of all scores.

    For Python, I use a comprehension to transform the word_sets to scored_sets, and note how the global variable word_sets is accessed inside score.

    def score(present_set):
        score = 0
        for w in word_sets[present_set]:
            if len(w) == 4:
                score += 1
            else:
                score += len(w)
        if len(present_set) == 7:
            score += len(word_sets[present_set]) * 7
        return score
        
    scored_sets = {s: score(s) for s in word_sets}
    

    For Haskell, first I define a new type for scored letter sets, and I convert a WordSet to a ScoredSet by using mapWithKey to convert each set of letters to a score..

    type ScoredSet = M.Map LetterSet Int
    
    let scoredSets = M.mapWithKey scoreLetterSet wordSets
    

    The function scoreLetterSet is given a set of letters and the words can can be made from it. It finds the score for each word and adds them up (by using a fold to combine the scores for each word).

    scoreLetterSet :: LetterSet -> S.Set String -> Int
    scoreLetterSet letterSet scoringWords = bonus + (S.foldr' (\w t -> t + scoreAWord w) 0 scoringWords)
        where scoreAWord w 
                | length w == 4 = 1
                | otherwise     = length w
              bonus = if (S.size letterSet) == 7 then (S.size scoringWords) * 7 else 0
    

    Scoring honeycombs

    Now I know the score that comes from any particular set of letters, I can score a honeycomb. The first thing to check is that a set of letters is present in a honeycomb: the honeycomb's mandatory letter is in the set, and all the letters in the set are in the honeycomb.

    def present(target, honeycomb):
        return (   (honeycomb.mandatory in target) 
               and target.issubset(honeycomb.letters)
               )
    
    present :: LetterSet -> Honeycomb -> Bool
    present target (Honeycomb mandatory letters) =
        (mandatory `S.member` target) && (target `S.isSubsetOf` letters)
    

    The Python method to score a honeycomb uses the iterator interface to walk over the scored_sets dictionary, filter them by whether letters are present in the honeycomb, then sum the scores.

    def score_honeycomb(honeycomb):
        return sum(sc for letters, sc in scored_sets.items() 
                   if present(letters, honeycomb)
                  )
    

    I could do something similar in Haskell, using a filter to find just the letter sets that are present in the honeycomb, then summing their scores. But really what we're doing is a fold, combining the elements of the scored letter set map into a single value (the overall score); the wrinkle here is that the combining function uses the value of the key to decide if this letter set's score should be included in the total.

    scoreHoneycomb :: ScoredSet -> Honeycomb -> Int
    scoreHoneycomb scoredSets honeycomb = M.foldrWithKey scoreLetters 0 scoredSets
        where scoreLetters letters score total
                | present letters honeycomb = score + total
                | otherwise = total
    

    Finding the best honeycomb

    I'm now in a position to solve the original problem, that of finding the honeycomb with the best score (from the dictionary I have). How can I find the honeycombs to test?

    One approach is to find all seven-letter subsets of the alphabet and, for each of those, build the seven possible honeycombs. This gives

    $$7 \times \frac{26!}{19! \times 7!} = 4,604,600$$

    possible honeycombs. That's a doable number, but still large.

    But the puzzle description says I'm after a honeycomb where every letter is used, which means I can base our honeycombs on those letter sets. There are 14,741 words with seven distinct letters, grouped into 7986 distinct groups. That means there are only 55,902 distinct honeycombs to investigate.

    I can find those "pangram" letter sets, and hence the possible honeycombs, by filtering the letter sets we have. I generate the plausible honeycombs by generating honeycombs for each letter set.

    pangram_sets = [s for s in word_sets.keys() if len(s) == 7]
    
    ps_honeycombs = [Honeycomb(mandatory=mand, letters=ps) 
                     for ps in pangram_sets
                     for mand in ps]
    
    pangramSets = S.filter ((7 == ) . S.size) $ M.keysSet scoredSets
    
    mkPlausibleHoneycombs :: S.Set LetterSet -> S.Set Honeycomb
    mkPlausibleHoneycombs pangramSets = S.foldr S.union S.empty honeycombSets
        where honeycombSets = S.map honeycombsOfLetters pangramSets
              honeycombsOfLetters ls = S.map (\l -> Honeycomb l ls) ls 
    

    Now I know all the plausible honeycombs, and can find the score for each, I can find the honeycomb with the best score. In Python, it's easy using the built-in max function:

    best_honyecomb = max(ps_honeycombs, key=score_honeycomb)
    print(best_honyecomb, partitioned_score_honeycomb(best_honyecomb))
    

    In Haskell, I have to build our own version of max that operates over the Set of honeycombs, using foldr to combine the honeycombs into just the one with the best score.

    findBestHoneycomb scoredSets honeycombs = 
        S.foldr (betterHc scoredSets) (0, initHc) honeycombs
        where initHc = Honeycomb 'a' $ S.singleton 'a'
    
    betterHc scoredSets hc (bestScore, bestHc) 
        | thisScore > bestScore = (thisScore, hc)
        | otherwise             = (bestScore, bestHc)
        where thisScore = scoreHoneycomb scoredSets hc
    

    And the best honeycomb is...

    Thankfully, both programs agree on the best honeycomb and its score!

    The best honecomb is "r|aegint" with a score of 3898.

    The bad news is that this takes a long time: about 5½ minutes for Python, and 4 minutes for Haskell. That's far too long. In the next part, I'll look at a couple of ways to optimise these programs.

    Code

    You can get the code from my locally-hosed Git repo, or from Gitlab.

    Neil Smith

    Read more posts by this author.