# Breaking keyword substitution ciphers without a keyword

I previously blogged about keyword substitution ciphers, and breaking them by working through a dictionary of possible keywords. (The basic idea is to decrypt the ciphertext with each possible keyword, and pick the one that gives the result that looks most like English.) But what if the keyword isn't in our dictionary?

A keyword substitution cipher essentially creates a mapping for how to substitute the letters in the plaintext with different letters in the ciphertext. One approach is to try all possible mappings, but that quickly gets out of hand. There are 26! possible ways of arranging the letters of the alphabet, which is about $4.03 \times 10^{26}$. Even at one million checks per second, trying them all would take about 12,800 billion years.
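
As a quick sanity check on those numbers, here is the arithmetic in Python. The only assumption beyond the text is a year length of 365.25 days.

```python
import math

mappings = math.factorial(26)             # every ordering of the 26 letters
checks_per_second = 1_000_000             # the rate assumed in the text
seconds_per_year = 365.25 * 24 * 60 * 60  # assumption: 365.25-day year

years = mappings / checks_per_second / seconds_per_year
print(f"{mappings:.2e} mappings")          # roughly 4.03e26
print(f"{years / 1e9:,.0f} billion years") # roughly 12,800 billion
```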

This is the first of three connected posts. This post introduces hillclimbing, which uses random search to find a good solution. The second post is about ways of refining and improving the base hillclimbing algorithm. The third post gives some empirical results of how these different variations work.

## Random choice with selection

An alternative approach is to pick some initial mapping, essentially at random, and then try to improve it. For instance, if we have some ciphertext that looks like this:

agkaz kzt agk

and a mapping of

| plaintext letter | a | c | e | h | t |
|---|---|---|---|---|---|
| ciphertext letter | a | g | k | z | t |

we would get a possible plaintext result of

aceah eht ace

which has a bigram score of -24.47. If we swap the t and k in the mapping, we get this mapping:

| plaintext letter | a | c | e | h | t |
|---|---|---|---|---|---|
| ciphertext letter | a | g | t | z | k |

and the result

actah the act

which has the better score of -23.34. On the other hand, swapping z and a in the original gives

| plaintext letter | a | c | e | h | t |
|---|---|---|---|---|---|
| ciphertext letter | z | g | k | a | t |

and the result

hceha eat hce

and a worse score of -26.42.

In this case, we should adopt the agtzk mapping as the best one and try variations on that. If we then swap the g and a, we end up with this mapping

| plaintext letter | a | c | e | h | t |
|---|---|---|---|---|---|
| ciphertext letter | g | a | t | z | k |

giving the result

catch the cat

with a score of -22.14. That's the plaintext.

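We can summarise the steps in this table. In this (rather contrived) example, we make four trial mappings and get to the correct result in three steps.

| Trial | Ciphertext letters for a c e h t | Result | Score | Outcome |
|---|---|---|---|---|
| 1 | a g k z t | aceah eht ace | -24.47 | starting point |
| 2 | a g t z k | actah the act | -23.34 | better, so kept |
| 3 | z g k a t | hceha eat hce | -26.42 | worse, so discarded |
| 4 | g a t z k | catch the cat | -22.14 | better, so kept |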
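
The swaps in this example are easy to reproduce. This is a minimal sketch using only the five letters above. The helper names decipher and swap are made up for this illustration, and there is no scoring here because the bigram model isn't part of this post.

```python
def decipher(ciphertext, plain_letters, cipher_letters):
    """Replace each ciphertext letter with the plaintext letter it stands for."""
    table = str.maketrans(cipher_letters, plain_letters)
    return ciphertext.translate(table)

def swap(cipher_letters, x, y):
    """Return the cipher alphabet with the letters x and y exchanged."""
    return cipher_letters.translate(str.maketrans(x + y, y + x))

ciphertext = "agkaz kzt agk"
plain_letters = "aceht"

start = "agkzt"
print(decipher(ciphertext, plain_letters, start))    # aceah eht ace
better = swap(start, "t", "k")                       # agtzk
print(decipher(ciphertext, plain_letters, better))   # actah the act
best = swap(better, "g", "a")                        # gatzk
print(decipher(ciphertext, plain_letters, best))     # catch the cat
```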

## The hillclimbing algorithm

That is the essence of the hillclimbing idea: make random steps in the space of possible solutions, and keep the ones that have a better score. (It's called hillclimbing because of the metaphor of score being like heights in a landscape. The best solution is the peak of the tallest hill. The algorithm finds that peak by always walking uphill.)

We can formalise the idea of this search by writing a hillclimbing algorithm:

    pick a random mapping and find its score
    repeat many times:
        tweak the mapping and find the new score
        if new score > current score:
            current mapping ← new mapping
            current score ← new score
    return current mapping


That's all there is to it.
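
Here is one way the pseudocode might look in Python. It's a sketch rather than the code from keyword_cipher.py: the fitness argument stands in for whatever scoring function you use, and toy_fitness is a made-up scorer that just counts letters already in the right place, so the example runs without a bigram model. That toy landscape has no local optima, so it says nothing about how hard a real ciphertext is.

```python
import random
import string

def hillclimb(fitness, iterations=20_000, rng=random):
    """Keep one mapping and replace it whenever a random swap scores higher."""
    current = list(string.ascii_lowercase)
    rng.shuffle(current)                       # pick a random mapping
    current_score = fitness(current)
    for _ in range(iterations):
        i, j = rng.sample(range(len(current)), 2)
        candidate = current.copy()
        candidate[i], candidate[j] = candidate[j], candidate[i]  # the tweak
        candidate_score = fitness(candidate)
        if candidate_score > current_score:    # keep only improvements
            current, current_score = candidate, candidate_score
    return "".join(current), current_score

# Toy stand-in for the bigram score (an assumption for illustration only).
target = "zyxwvutsrqponmlkjihgfedcba"

def toy_fitness(mapping):
    return sum(a == b for a, b in zip(mapping, target))

mapping, score = hillclimb(toy_fitness, rng=random.Random(1))
print(mapping, score)
```

Passing in a seeded random.Random makes a run repeatable, which helps when comparing variations of the algorithm.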

Unlike the other codebreaking algorithms I've talked about so far, this is a randomised algorithm: it makes random choices at each step. It isn't guaranteed to find the solution, but it will find a good solution most of the time.

One of the main problems with hillclimbing is getting trapped on a local optimum.

## The problem of local optima

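Consider what happens if the search in the example above reaches this mapping:

| plaintext letter | a | c | e | h | t |
|---|---|---|---|---|---|
| ciphertext letter | t | k | g | z | a |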

It gives the result tecth cha tec (scoring -22.38), but every possible swap gives a worse score. How does hillclimbing deal with this situation?

The answer, for the simple hillclimbing algorithm above, is that it doesn't deal with the situation. It will try each step, perhaps many times, without moving away from this (incorrect) solution.

The way around this is to run the algorithm several times. Each time it runs, it starts from a different, random place. If you start the algorithm from enough different places, hopefully at least one of the runs will end up on the global optimum!
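
The restart idea is easier to see on a toy problem than on a real cipher. In this sketch the "landscape" is just a made-up list of heights with three peaks, and climb walks to the higher neighbour until it is stuck. None of this comes from the cipher code; it only illustrates why several starting points help.

```python
import random

heights = [1, 3, 2, 5, 4, 9, 0]   # toy landscape: peaks at positions 1, 3 and 5

def climb(position):
    """Walk uphill from position until no neighbour is higher."""
    while True:
        neighbours = [p for p in (position - 1, position + 1)
                      if 0 <= p < len(heights)]
        best = max(neighbours, key=lambda p: heights[p])
        if heights[best] <= heights[position]:
            return position        # stuck on a peak, local or global
        position = best

def best_of_restarts(starts):
    """Run one climb per starting point and keep the highest peak found."""
    return max((climb(s) for s in starts), key=lambda p: heights[p])

print(climb(0))                    # 1: trapped on a local peak
starts = [random.randrange(len(heights)) for _ in range(20)]
print(best_of_restarts(starts))    # very likely 5, the global peak
```

A single climb from position 0 never gets past the first small peak. With twenty random starts, at least one almost certainly begins close enough to the tallest peak to reach it.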

The follow-up post to this one talks about simulated annealing, another approach to getting out of local optima.

## Tuning hillclimbing

There are a few things you can do to refine the basic hillclimbing algorithm. Two that are easy to implement are:

1. changing the stopping condition
2. picking a good starting position

### Stopping

There are two main ways to stop the search: either run for a fixed number of iterations, or stop when there's been no improvement for a while. Using a fixed number of iterations has the advantages of being both easy to implement (it's just a for loop) and having a predictable run time. In practice, something like 20,000 iterations is good enough.

Running until there's been no improvement isn't that much harder to implement. You have a counter that increments with every attempt to find a new solution. If the new solution is accepted, the counter is reset. The search terminates if the counter reaches some limit value.

    pick a random mapping and find its score
    unsuccessful_attempts ← 0
    while unsuccessful_attempts < time_limit:
        tweak the mapping and find the new score
        if new score > current score:
            current mapping ← new mapping
            current score ← new score
            unsuccessful_attempts ← 0
        else:
            increment unsuccessful_attempts
    return current mapping
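
A Python sketch of this version follows. As before, it isn't the code from keyword_cipher.py. The function takes its starting mapping as an argument rather than picking one at random, limit plays the role of time_limit, and toy_fitness is a made-up scorer on an eight-letter alphabet so the example runs on its own.

```python
import random

def hillclimb_until_stuck(start, fitness, limit=1000, rng=random):
    """Stop after `limit` consecutive swaps that fail to improve the score."""
    current, current_score = list(start), fitness(start)
    unsuccessful_attempts = 0
    while unsuccessful_attempts < limit:
        i, j = rng.sample(range(len(current)), 2)
        candidate = current.copy()
        candidate[i], candidate[j] = candidate[j], candidate[i]
        candidate_score = fitness(candidate)
        if candidate_score > current_score:
            current, current_score = candidate, candidate_score
            unsuccessful_attempts = 0      # progress made: reset the counter
        else:
            unsuccessful_attempts += 1
    return "".join(current), current_score

# Toy fitness (an assumption for illustration): letters already in place.
target = "hgfedcba"

def toy_fitness(mapping):
    return sum(a == b for a, b in zip(mapping, target))

print(hillclimb_until_stuck("abcdefgh", toy_fitness, rng=random.Random(0)))
```

The run time is no longer predictable, but the search stops soon after it runs out of improvements instead of grinding through a fixed budget.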


### Picking a good starting mapping

If we assume a monoalphabetic substitution, the chances are that the letter frequency distribution of the ciphertext will mostly follow the letter frequency distribution of generic plaintext, even though the letters will be different. In other words, the most frequent letter in the ciphertext will probably stand for e, the most frequent letter in English plaintext, and so on.

That suggests that a good starting mapping for the hillclimbing would be to sort the ciphertext letters by frequency and map them against the proposed plaintext letters of etoainhsrdlumwycfgpbvkxjqz (the order of frequency in English). It's unlikely to be perfect, but it's likely to be a better starting place than a random mapping.
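
Building that starting mapping takes only a few lines. This sketch returns a dictionary from ciphertext letter to guessed plaintext letter. The function name frequency_start and the alphabetical tie-breaking rule are my own choices for the example.

```python
from collections import Counter
import string

ENGLISH_ORDER = "etoainhsrdlumwycfgpbvkxjqz"   # frequency order quoted above

def frequency_start(ciphertext):
    """Map ciphertext letters, most frequent first, onto ENGLISH_ORDER."""
    counts = Counter(c for c in ciphertext.lower()
                     if c in string.ascii_lowercase)
    # Most common first; ties and unseen letters fall back to alphabetical order.
    cipher_order = sorted(string.ascii_lowercase,
                          key=lambda c: (-counts[c], c))
    return dict(zip(cipher_order, ENGLISH_ORDER))   # ciphertext -> plaintext

mapping = frequency_start("agkaz kzt agk")
print(mapping["a"], mapping["k"])   # e t
```

On a ciphertext this short the guess is poor (the true plaintext letters for a and k are c and e), which is why it is only a starting point for the search. Longer ciphertexts give more reliable counts.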

This mapping will, by definition, give the best unigram score for the proposed plaintext, so with this start we have to use bigram, trigram, or longer n-gram counts for scoring proposed plaintexts.

If we assume that having the mapping in frequency order is close to the desired mapping, we don't want to stray too far from that mapping when trying new ones. That suggests that, when we pick two letters in the mapping to swap, we should prefer to swap letters near each other in the frequency order. We can do that by picking one letter uniformly at random, and using the random.gauss() function to pick the second letter from a normal distribution near the first.
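
One way to pick such a pair is sketched below. The details are assumptions rather than a description of keyword_cipher.py: the spread parameter (the standard deviation, in positions) and the choice to clamp out-of-range picks to the ends of the alphabet are both made up for this example.

```python
import random

def pick_swap(n=26, spread=2.0, rng=random):
    """Pick two distinct positions, the second usually close to the first."""
    i = rng.randrange(n)                    # first position: uniform
    j = i
    while j == i:
        # Round a normal sample centred on i, then clamp it into 0..n-1.
        j = round(rng.gauss(i, spread))
        j = min(max(j, 0), n - 1)
    return i, j

rng = random.Random(42)
print([pick_swap(rng=rng) for _ in range(5)])
```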

The code for this post is in the keyword_cipher.py file on GitHub, but it includes the modifications for using simulated annealing search, as described in the next post.