In the previous post, I described how to use hillclimbing to break a substitution cipher when we don't know the key. But standard hillclimbing has weaknesses: it can get trapped in a local optimum (that is, trapped on the summit of a foothill in the fitness landscape); or it can be marooned on a fitness "plateau", where all changes give about the same result, so the search wanders aimlessly.
Exploration and exploitation
To think about how to fix this problem, have a think about restaurants.
Imagine you want to go out for dinner tonight. Do you try a new and different restaurant, or do you stick with one you already know is good?

The answer depends a lot on how long you've been in the area. If you've recently moved in, you don't know which are the good restaurants. In fact, you might not even know what counts as "good" around here! If you go to a new place, it might be rubbish, but it might also be better than anything you've visited in the area.
On the other hand, if you've been in the area for a long time, you know which the good places are. You've got your list of your favourite restaurants and a much longer list of not-so-good ones. A new restaurant you visit might be a hidden gem, but it's more likely to be average and therefore worse than your favourites.
Generalising, these two situations illustrate the balance between exploration and exploitation. When you know little about your situation, exploration is the best option. As you become more familiar with what's good, you should invest more time and effort exploiting what you know is best.
This is similar to what we want to do with the cipher breaking. We start with some random proposal of what the key would be, but we have no real idea if it's any good. At this stage, we want to explore the space of possible keys, looking for something promising. As the search progresses, we get more understanding of what the key is, so we want to move to more exploitation of a key that we know is pretty good.
Simulated annealing
Simulated annealing is one method of handling this. The term annealing comes from metallurgy. It's the process of hardening metal by cooling it slowly; as it cools, large crystals form in the metal, making the overall piece harder. The trick is that sometimes, some crystals get a bit smaller. This is a backward move in the short term; but it gives neighbouring crystals more room to grow, making the metal stronger in the long term.
That's the key idea of simulated annealing: sometimes we'll accept a worse solution than what we currently have, in the hope that it will lead to an even better solution in the long term. The simulated annealing algorithm is just a little bit different from the standard hillclimbing one:
current_solution ← some random solution
set a limit on number of iterations
set initial_temperature
while not reached iteration limit:
new_solution ← random_change(current_solution)
if goodness(new_solution) > goodness(current_solution):
current_solution ← new_solution
else:
MAYBE current_solution ← new_solution
decrease temperature a bit
The magic of the algorithm lies in that final MAYBE
statement. It sometimes chooses the new_solution
even if it's worse than the current_solution
. The probability it does so depends on both the goodness of the new_solution
and an overall temperature
. The better new_solution
is, the more likely it is to be chosen: very poor solutions are unlikely to ever be chosen. The temperature
is how the algorithm balances exploration with exploitation. When the temperature is high, MAYBE
is more likely to explore a given new_solution
; when the temperature
is low, the algorithm is more likely to reject a poor new_solution
and continue to explore the current_solution
it has.
The details of this algorithm can vary. Some variants don't automatically accept better solutions during the run. The method of varying the temperature
can change between implementations, though they nearly always end with temperature = 0
. And there is great variation in how the new solution
is selected if it's worse than the current_solution
.
The detail of how these factors are balanced is in the equation:
\[
p(\mathrm{new}) = \exp \left( \frac{\mathrm{good}(\mathrm{new}) - \mathrm{good}(\mathrm{current})}{T} \right)
\]
That gets implemented as this Python code (in keyword_cipher.py
on Github):
import math
import random
sa_chance = math.exp((new_fitness - current_fitness) / temperature)
if (new_fitness > current_fitness or random.random() < sa_chance):
current_solution = new_solution
Unfortunately, there are no hard-and-fast rules for how to set the initial temperature
or how to decrease it over the course of the execution. It needs to be high enough to allow a good amount of initial exploration, but needs to decrease over enough time to allow the algorithm to find a reasonable solution mid-run and refine it during the exploitation phase. I do it by a linear decrease in temperature, so the temperature decreases by a little bit each iteration until it reaches zero on the final iteration.
The algorithm still suffers from the overall flaw of hillclimbing, which is that there's no guarantee that it will find the overall (global) best solution. But done well, simulated annealing should be more likely to produce a sufficiently good solution.
In the next post, I'll test these claims by running both algorithms on some random enciphered texts.
Credits
Cover image: I took it from Instructables, but I'm far from sure that's the original source.
Restaurant photo by Alexis Pineaud