Neil's musings

Snippets from a computing educator and researcher.

Breaking keyword substitution ciphers

October 31, 2018 in #python #codes and ciphers

Now we know how to encipher and decipher with keyword substitution ciphers, the next step is to automatically break them.

The basic idea is the same as it was for breaking Caesar ciphers and affine ciphers: decipher the message with all the possible keys and see which one looks most like English.

That means the keyword_break function looks like this:

def keyword_break(message, wordlist=keywords, fitness=Pletters):
    best_keyword = ''
    best_wrap_alphabet = True
    best_fit = float("-inf")
    for wrap_alphabet in KeywordWrapAlphabet:
        for keyword in wordlist:
            plaintext = keyword_decipher(message, keyword, wrap_alphabet)
            fit = fitness(plaintext)
            if fit > best_fit:
                best_fit = fit
                best_keyword = keyword
                best_wrap_alphabet = wrap_alphabet
    return (best_keyword, best_wrap_alphabet), best_fit

All you need now is a list of words to act as possible keywords. If you have a dictionary file on your computer (perhaps in /usr/share/dict/), you can use that. If not, Peter Norvig has links to several word lists, mainly originally for use by wordgame players.

And that works, but it takes a long time. This is to some extent expected, as there are a lot of words to try. However, if you look at a system monitor while this code is executing, you see that all the work is being done by one CPU core. In a world of CPUs coming with four, six, or more cores, it's inefficient to leave three-quarters or five-sixths of the computer idle while looking for keywords.

Python (and most other programming languages) are single-threaded by default and you have to use some kind of additional feature to allow multi-threaded programs. This isn't helped that, in the general case, you have to write programs that understand how other threads can modify the values stored in variables while this thread is running.

Luckily for us, this cipher-breaking example doesn't require much in the way of interacting threads. But to understand how we're going to use multiple CPU cores to break this cipher efficiently, let's step back a bit.

General pattern: map

Conceptually, we have a long list of possible keys to try. We also have a "scoring function" which takes a key and a ciphertext, and returns that key and the score of that key (for how much the proposed plaintext looks like English). One we have our list of (key and score), we can find the element of that list with the highest score, and that gives us the best key.

cipher-keyword-map

This general pattern of applying a a function to each element of a list, and generating a list of results, is so common it has its own name: map. Most programming languages have a function called map which takes a list of items and a function to apply to them, and returns the list of results. (This makes map a higher order function, because it's a function that takes a function as an argument.)

The important thing for us is that each of these scoring operations is entirely independent of each other: if we had a thousand keys to try and a thousand CPU cores to use, we could get each core to work on one key, and the different processes wouldn't need to communicate with each other at all.

The Python multiprocessing library has its own version of the map function which does this for us. Almost. The multiprocessing map uses a function which takes just one argument, but our scoring function needs to take several. For instance, each keyword cipher key has two parts (the word and how the rest of the alphabet is used) and the scoring function itself needs to be told which language model to use to generate the score.

A Python trick

We can get around that limitation with the star operator (sometimes referred to as the delightfully-named splat operator). It's used for passing in varying numbers of arguments to a function call.

We can define a Python function add3 which just adds three numbers together:

def add3(x, y, z):
    return x + y + z

We can call it in the usual way:

>>> add3(5, 6, 7)
18

But if the arguments are in the form of a tuple, it doesn't work:

>>> triple = (2, 4, 6)
>>> pair = (6, 9)
>>> add3(triple)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: add3() missing 2 required positional arguments: 'y' and 'z'
>>> add3(1, pair)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: add3() missing 1 required positional argument: 'z'

But the star operator unpacks the tuple into the individual arguments for the function[^1]:

[^1] You can also use the star operator on function definitions to allow functions to take arbitrary numbers of arguments. When called, the arguments are packaged up into a list for use inside the function body.

>>> add3(*triple)
12
>>> add3(1, *pair)
16

It even gives the right answers!

Putting it all together

We can now put all this together with the starmap function in the multiprocessing library.

We define a helper function that takes all the parameters we want, scoring the effectiveness of this key on this ciphertext:

def keyword_break_worker(message, keyword, wrap_alphabet, fitness):
    plaintext = keyword_decipher(message, keyword, wrap_alphabet)
    fit = fitness(plaintext)
    return (keyword, wrap_alphabet), fit

This is called by the keyword_break_mp function (mp for "multiprocessing"). This function just assembles all the arguments for each helper into a list of tuples called helper_args, then uses starmap to do the work.

def keyword_break_mp(message, wordlist=keywords, fitness=Pletters,
                     chunksize=500):
    with multiprocessing.Pool() as pool:
        helper_args = [(message, word, wrap, fitness)
                       for word in wordlist
                       for wrap in KeywordWrapAlphabet]
        breaks = pool.starmap(keyword_break_worker, helper_args, chunksize)
        return max(breaks, key=lambda k: k[1])

And that's it! The multiprocessing.Pool() takes care of spreading the work across the different processors, and the max() function with the key parameter picks out the cipher key with the highest score.

Code

As always, you can find the code in cipher/keyword_cipher.py on Github.


Cover photo by unsplash-logoImmanuel Starshov

Share on Google+