My first is in flysheet but not in plushest
My second is in protean but not in protect
My third is in brassiere but not in brainier
My fourth is in petty but not in puffy
My fifth is in updated but not in updating
My last is in scurvier and also in scummier
This is the third of a three-part series on solving and generating riddles like the one above. The first post was about solving riddles, the second post was about generating riddles, and this post is about making that generation faster.
The riddle generation from the previous post works, but it takes time. A little bit of metering shows it takes on average 1.4 seconds to generate a riddle, with some taking up to 11 seconds! The main reason for this is that the generator is creating and rejecting about 1900 clues per line in the final riddle. That's a lot of wasted work.
This is mainly because of how clue words are chosen. I pick two random words then check that they're close enough (by edit distance) to form a clue pair. But, from a quick check, 99.7% of word pairs are more than an edit distance of 3 apart, and 3 is the default limit.
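For reference, the edit distance used throughout is the standard Levenshtein distance. The post assumes an `edit_distance` helper already exists; this dynamic-programming version is just an illustrative sketch, not the implementation from the series:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    # previous[j] holds the distance from the processed prefix of a to b[:j];
    # we sweep one row at a time to keep memory linear in len(b).
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]
```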
That suggests that an improvement would lie in finding and caching the "nearby" words for each word in the dictionary, then using those lists of nearby words to generate riddle clues.
Space for time
Caching "nearby" words is a classic trade-off of space for time. A new set of words, organised by distance, will take far more space to store, but it should make riddle generation faster.
Generating the cached dictionary
Generating the cache is a matter of comparing each word in the dictionary to every other word, finding the distance, and recording the connection if the words are close enough. To keep things simple, I do it word by word, returning the results as a (word, related) pair. I also include the "not a subset" constraint in the test.
```python
def find_neighbours(word: str, limit: int = 4) -> Tuple[str, List[str]]:
    sword = set(word)
    others = []
    for other in dictionary:
        if other != word:
            soth = set(other)
            if (not sword <= soth
                    and not soth <= sword
                    and edit_distance(word, other) <= limit):
                others.append(other)
    return word, others
```
Returning the words in this form allows me to process the whole dictionary with a `map` function, which in turn makes it easy to use the `multiprocessing` library to spread the work across multiple CPU cores.
```python
with multiprocessing.Pool() as pool:
    word_other_pairs = pool.map(find_neighbours, dictionary)
```
(I should have been able to use `imap_unordered` here instead of `map`, but that didn't seem to actually use multiple processors when I tried it.)
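One possible culprit, though I haven't verified it for this workload, is `imap_unordered`'s default `chunksize` of 1, which hands tasks to workers one at a time and can leave the run dominated by inter-process overhead. Passing a larger `chunksize` batches the work (the stand-in `find_neighbours` and the value 64 below are purely illustrative):

```python
import multiprocessing

def find_neighbours(word):
    # Stand-in for the real find_neighbours, which is too heavy to inline
    # here: this one just pairs each word with its length.
    return word, len(word)

dictionary = ["cat", "dog", "horse"]

with multiprocessing.Pool() as pool:
    # chunksize batches the tasks sent to each worker process; the default
    # of 1 round-trips between processes per word. 64 is an illustrative
    # guess, not a tuned setting.
    word_other_pairs = list(
        pool.imap_unordered(find_neighbours, dictionary, chunksize=64))
```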
Finally, I write the cached dictionary directly as a compressed file.
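The writing step might look something like this; gzip and JSON are illustrative choices, since the post only says "a compressed file", and the toy `word_other_pairs` stands in for the real results:

```python
import gzip
import json

# Toy stand-in for the real (word, related) pairs computed above.
word_other_pairs = [("petty", ["puffy", "patty"]), ("updated", ["updating"])]

# Collect the pairs into a single word -> neighbours mapping.
dictionary_neighbours = dict(word_other_pairs)

# Write the cache as gzip-compressed JSON; the exact serialisation format
# is an assumption, and any compressed format would do.
with gzip.open("dictionary_neighbours.json.gz", "wt", encoding="utf-8") as f:
    json.dump(dictionary_neighbours, f)

# Reading it back at riddle-generation time is the mirror image.
with gzip.open("dictionary_neighbours.json.gz", "rt", encoding="utf-8") as f:
    reloaded = json.load(f)
```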
This cached dictionary takes about 100 MB compressed and about 370 MB uncompressed: far more room than the roughly 1 MB original dictionary.
Using the cached dictionary
Using this dictionary requires a change of approach in the clue-creation functions. Rather than the original approach of picking two random words and testing if they fit the criteria, I instead pick a random word, find the nearby words, then filter them, and finally pick one at random from the filtered words. If there are none in the filtered words, I pick another source word.
```python
def include_exclude_clue(letter: str, limit: int = 3) -> (RiddleClue, RiddleClue):
    finished = False
    while not finished:
        with_letter = random.choice(
            [w for w in dictionary_neighbours if letter in w])
        without_letter = [w for w in dictionary_neighbours[with_letter]
                          if letter not in w
                          if edit_distance(with_letter, w) <= limit]
        if without_letter:
            other = random.choice(without_letter)
            finished = True
    return (RiddleClue(word=with_letter, valence=RiddleValence.Include),
            RiddleClue(word=other, valence=RiddleValence.Exclude))
```
The other clue-generation functions follow the same pattern. Apart from that, all the riddle creation program is the same.
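For instance, a clue where both words must contain the letter ("my last is in scurvier and also in scummier") could follow the same pattern. This is a hypothetical sketch, with minimal stand-ins for the real types, data, and `edit_distance` so it is self-contained; the actual functions are in the full code:

```python
import random
from dataclasses import dataclass
from enum import Enum, auto

# Minimal stand-ins for the definitions used throughout the series.
class RiddleValence(Enum):
    Include = auto()
    Exclude = auto()

@dataclass
class RiddleClue:
    word: str
    valence: RiddleValence

# A tiny cached dictionary; the real one maps every word to its neighbours.
dictionary_neighbours = {"scurvier": ["scummier", "curvier"],
                         "scummier": ["scurvier", "scummy"]}

def edit_distance(a: str, b: str) -> int:
    return 2  # stand-in: pretend every cached pair is within the limit

def include_include_clue(letter: str, limit: int = 3):
    """Hypothetical sibling of include_exclude_clue: pick a word with the
    letter, then a nearby word that also contains it."""
    finished = False
    while not finished:
        first = random.choice(
            [w for w in dictionary_neighbours if letter in w])
        candidates = [w for w in dictionary_neighbours[first]
                      if letter in w
                      and edit_distance(first, w) <= limit]
        if candidates:
            other = random.choice(candidates)
            finished = True
    return (RiddleClue(word=first, valence=RiddleValence.Include),
            RiddleClue(word=other, valence=RiddleValence.Include))
```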
This does much better. It now takes about 0.4 seconds to generate a riddle rather than 1.4 seconds, about three and a half times faster.
That's good, but half a second for a riddle still seems like a long time. I profiled the code and that revealed where the code was spending most of its time. A lot of it was spent calculating the edit distance between words, as I expected. But a lot of it was spent in the filtering of the related words found in the cached dictionary.
Each word has, on average, 940 "nearby" words. The code above processes each of those nearby words to find a group to choose from, then picks one of them. But, on average, each word only has about 150 words within an edit distance of 3, the typical limit used in riddle generation. That means that about one in seven words are close enough, or about one in fourteen if you account for having the required letter or not.
That means I should only need to find the edit distance of about fourteen words; the other 920-odd evaluations are wasted.
How can I avoid finding the edit distance of words I don't need to? I take a leaf out of Haskell's book and treat the list of related words as a lazy list, only processing as many elements as I need.
I still need to shuffle the words so that I will eventually process all of them. But once they're in a random order, I walk down that list, testing each word, until I find one that satisfies the conditions.
```python
def include_exclude_clue(letter: str, limit: int = 3) -> (RiddleClue, RiddleClue):
    finished = False
    while not finished:
        has_first = False
        while not has_first:
            with_letter = random.choice(possible_riddle_solutions)
            has_first = letter in with_letter
        others = dictionary_neighbours[with_letter][:]
        random.shuffle(others)
        while not finished and others:
            other = others[0]
            if letter not in other and edit_distance(with_letter, other) <= limit:
                finished = True
            else:
                others = others[1:]
    return (RiddleClue(word=with_letter, valence=RiddleValence.Include),
            RiddleClue(word=other, valence=RiddleValence.Exclude))
```
Again, the other clue-generation functions follow the same pattern and everything else stays the same.
This speeds things up a lot. The average time per riddle drops from 0.4 seconds per riddle to 0.05 seconds per riddle, better than a seven times speedup. It also has a much narrower spread of times, and doesn't seem to have any very long outlier times.
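The timings here came from simple wall-clock metering. A sketch of such a harness, using `time.perf_counter` (this is illustrative, not the exact measurement code used):

```python
import statistics
import time

def time_riddles(make_riddle, n=100):
    """Time n riddle generations; return mean and standard deviation
    in seconds."""
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        make_riddle()  # e.g. a zero-argument wrapper around the generator
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)
```

Called as `time_riddles(lambda: generate_riddle(word))`, it gives both the average time per riddle and the spread quoted in the table below.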
In fact, the lazy version is so much faster that you can't really see any range on the histogram above. Even just plotting the two faster versions isn't much better.
Here's a table of performance, including the speedup factors for the three different versions. As you can see, the lazy version of riddle generation is about 25 times faster than the original!
| Version | Time per riddle | Standard deviation | Speedup over original | Speedup over related |
|---|---|---|---|---|
| Original | 1.4 s | | 1× | |
| Related | 0.4 s | | ~3.5× | 1× |
| Lazy | 0.05 s | | ~25× | ~8× |
You can find the full code on my Git server, or over at Gitlab. Note that I developed this with Jupyter and Jupytext, so only Markdown files are in version control. Jupyter + Jupytext will convert these into Notebook files and Python scripts when they're opened.