In part 1 of this post, I talked about the range of language models we could use. But that still leaves the question of which is best.
As it's not obvious which is the best language model, we'll perform an experiment to find out. We'll take some samples of text from our corpus of novels, encipher them with random keys, then try to break the key. We'll use different language models on each sample ciphertext and count how many each one gets right.
When developing things like this, it's often easier to start from the end point and then build the tools we need to make that work.
As we're just running some tests in a small experimental harness, I'll break some rules of good programming and keep various constants in global variables, where it makes life easier.
Testing some models
Let's assume we have some models to test, in a list called `models`, and a list of `message_lengths` to try. We want to build a dict of dicts: the outer dict has one element for each model, and the inner dicts have one element for each test message length.
Evaluating the models is easy with a pair of dict comprehensions:
```python
def eval_models():
    return {m['name']: {l: eval_one_model(m, l)
                        for l in message_lengths}
            for m in models}
```
…but it does highlight that we need two pieces of information for each model: a `name` we can use when talking about it, and a `func`, the function we call to use that model.

Given that, we can write `eval_one_model` by just making `trials` random ciphertexts, trying to break each one, and counting successes when the breaking function gets it right. (As we'll be testing tens of thousands of ciphertexts, the `print` is there just to reassure us the experiment is making progress.)
```python
def eval_one_model(model, message_length):
    print(model['name'], message_length)
    successes = 0
    for _ in range(trials):
        key, ciphertext = random_ciphertext(message_length)
        found_key, _ = caesar_break(ciphertext, model['func'])
        if found_key == key:
            successes += 1
    return successes
```
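For reference, here's a minimal sketch of the global settings the two functions above assume. The message lengths match the columns of the results table below; the value of `trials` is an assumption based on the "100,000 runs for each model" mentioned with the results.

```python
# Experiment parameters assumed by eval_models and eval_one_model.
# The trial count is an assumption, taken from the "100,000 runs for
# each model" mentioned with the results below.
trials = 100000
message_lengths = [100, 50, 30, 20, 10, 5]   # sample lengths, in letters
```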
Generating random ciphertext
The approach we'll use is to take a lot of real text and then pull samples out of it. At first sight, an alternative approach would be to generate random text from the letter frequencies, but that won't help when we come to test the bigram and trigram models.
To read the text, we make use of the `sanitise` function defined earlier. We just read the three novels we have lying around, join them together, sanitise them, and call that our corpus:
```python
corpus = sanitise(cat([
    open('support/shakespeare.txt').read(),
    open('support/sherlock-holmes.txt').read(),
    open('support/war-and-peace.txt').read()
]))
corpus_length = len(corpus)
```
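(`cat` and `sanitise` come from the earlier posts. As a rough, illustrative sketch of what's assumed here: `cat` joins a list of strings, and `sanitise` lowercases text and strips everything that isn't a letter.)

```python
# Illustrative sketches only: the real definitions come from the earlier posts.
cat = ''.join   # join an iterable of strings into one string

def sanitise(text):
    """Lowercase the text and keep only the letters."""
    return cat(c for c in text.lower() if c.isalpha())
```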
To generate a random piece of ciphertext, we pick a random start position in the corpus (taking care it's not too close to the end of the corpus), pick out a slice of the corpus of the appropriate length, pick a random key, and encipher the sample with that key. We return both the key and the ciphertext, so that `eval_one_model` can work.
```python
import random

def random_ciphertext(message_length):
    sample_start = random.randint(0, corpus_length - message_length)
    sample = corpus[sample_start:(sample_start + message_length)]
    key = random.randint(1, 25)
    ciphertext = caesar_encipher(sample, key)
    return key, ciphertext
```
We can test this:
```python
>>> k, c = random_ciphertext(20)
>>> k, c, caesar_decipher(c, k)
(16, 'qhusedludjyedqbjxydw', 'areconventionalthing')
```
Seems to work.
Building a list of models
This is trickier. We need to end up with `models`, a list of two-element dicts: the `name` and the `func` to call. Building the name is easy. Building the function is harder.
The n-gram models are easy: we can define `models` as:
```python
models = [{'name': 'Pletters', 'func': Pletters},
          {'name': 'Pbigrams', 'func': Pbigrams},
          {'name': 'Ptrigrams', 'func': Ptrigrams}]
```
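(As a reminder of what these functions do, here's a rough sketch of how `Pletters` might look, following the n-gram models post: it sums the log probability of each letter in the text, using an assumed dict `Pl` of single-letter probabilities estimated from the corpus. `Pbigrams` and `Ptrigrams` do the same over letter pairs and triples. Treat this as illustrative rather than the exact code.)

```python
from math import log10

# Illustrative sketch: Pl is assumed to be a dict mapping each letter to its
# probability in English, estimated from the corpus in the earlier post.
def Pletters(text):
    """Log probability of a text under a single-letter (unigram) model."""
    return sum(log10(Pl[letter]) for letter in text)
```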
For the norm-based models, we have to define
- the norm to scale the message's letter counts
- the norm to scale the letter counts of standard English
- the norm to compare these two vectors.
We need to test all combinations of these. For the sake of consistency, we'll use the same norm for both vector scalings.
In addition, the norm-based measures return the distance between two vectors, while the cipher breaking method wants to maximise the similarity of the two pieces of text. That means that, for some comparisons, we want to invert the function result to turn the distance into a similarity.
Let's start with what we know. For each metric for comparing two vectors, we need the `func` that does the comparison, an `invert` flag to say if this is finding a distance rather than a similarity, and a `name`. For each scaling, we need the `corpus_frequency` for the English counts we're comparing to, the `scaling` for scaling the sample text, and the `name` for this scaling.
```python
metrics = [
    {'func': l1, 'invert': True, 'name': 'l1'},
    {'func': l2, 'invert': True, 'name': 'l2'},
    {'func': l3, 'invert': True, 'name': 'l3'},
    {'func': cosine_similarity, 'invert': False, 'name': 'cosine_similarity'}]
```
```python
l2_scaled_english_counts = l2_scale(english_counts)

scalings = [
    {'corpus_frequency': normalised_english_counts,
     'scaling': l1_scale,
     'name': 'l1_scaled'},
    {'corpus_frequency': l2_scaled_english_counts,
     'scaling': l2_scale,
     'name': 'l2_scaled'}]
```
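(The scaling and norm functions come from the earlier post on norms. As a rough sketch of what's assumed, each scaling rescales a dict of letter counts so the counts have unit norm; something along these lines, though the real code may differ in detail.)

```python
# Illustrative sketches of the scaling functions assumed above.
def l1_scale(counts):
    """Scale letter counts so they sum to one (unit l1 norm)."""
    total = sum(counts.values())
    return {letter: count / total for letter, count in counts.items()}

def l2_scale(counts):
    """Scale letter counts so their Euclidean (l2) norm is one."""
    length = sum(count ** 2 for count in counts.values()) ** 0.5
    return {letter: count / length for letter, count in counts.items()}
```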
We can use that information to build the `models` we need:
```python
models = (
    [{'func': make_frequency_compare_function(
          s['corpus_frequency'], s['scaling'],
          m['func'], m['invert']),
      'name': '{} + {}'.format(m['name'], s['name'])}
     for m in metrics
     for s in scalings]
    +
    [{'func': Pletters, 'name': 'Pletters'},
     {'func': Pbigrams, 'name': 'Pbigrams'},
     {'func': Ptrigrams, 'name': 'Ptrigrams'}]
)
```
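As a quick sanity check, listing the names (once `make_frequency_compare_function`, defined below, is in place) shows one entry for each metric-and-scaling combination, plus the three n-gram models. These are the same names that appear in the results table below:

```python
>>> [m['name'] for m in models]
['l1 + l1_scaled', 'l1 + l2_scaled', 'l2 + l1_scaled', 'l2 + l2_scaled',
 'l3 + l1_scaled', 'l3 + l2_scaled', 'cosine_similarity + l1_scaled',
 'cosine_similarity + l2_scaled', 'Pletters', 'Pbigrams', 'Ptrigrams']
```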
All that's left is `make_frequency_compare_function`. It takes all sorts of parameters, but we want it to return a function that takes just one: the text being scored. The trick is that, inside `make_frequency_compare_function`, we can refer to all its parameters. If we create a function in that context and return it, the returned function can still access these parameters! This is termed a closure: the returned function encloses the parameters that were in scope when the closure was created. See the Wikibooks and Wikipedia articles.

An example might make it clearer (taken from Wikibooks). `make_adder(x)` returns a function which adds `x` to some other number. This returned function remembers the value of `x` when it was created.
```python
def make_adder(x):
    def addX(y):
        return x + y
    return addX
```
We can check this:
```python
>>> make_adder(3)
<function make_adder.<locals>.addX at 0x7f1ac1ab7488>
```
Yes, `make_adder` returns a function. Let's give that returned function a name so we can call it later.
```python
>>> add1 = make_adder(1)  # Adds 1 to its argument
>>> add5 = make_adder(5)  # Adds 5 to its argument
>>> add1(2)
3
>>> add1(4)
5
>>> add5(4)
9
```
That's the idea. The functions returned by `make_adder`, which I've called `add1` and `add5`, remember the "number to add" which was used when the closure was created. Now, even outside `make_adder`, we can use that closure to add 1 or 5 to a number.
This little example isn't that useful, but we use the same concept of closures to create the scoring function we need here. We build a closure which implements the scoring function we want, so that when the closure is passed a piece of text, it returns the appropriate score.
```python
def make_frequency_compare_function(
        target_frequency, frequency_scaling, metric, invert):
    def frequency_compare(text):
        counts = frequency_scaling(frequencies(text))
        if invert:
            score = -1 * metric(target_frequency, counts)
        else:
            score = metric(target_frequency, counts)
        return score
    return frequency_compare
```
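To make the connection back to the `models` list concrete, here's a small, illustrative example of building one scoring closure by hand and applying it to some text (the sample strings are the ones from the earlier test). The exact scores depend on the corpus counts, so only the comparison matters here:

```python
# l1 distance between l1-scaled letter counts, negated so that a closer
# match to English gives a higher score.
score_l1 = make_frequency_compare_function(
    normalised_english_counts, l1_scale, l1, True)

# The closure takes just the text, which is what caesar_break expects.
score_l1('areconventionalthing')   # English-like sample: higher (less negative) score
score_l1('qhusedludjyedqbjxydw')   # the enciphered version: lower score
```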
We now have all the pieces in place to do the experiment! Apart from one thing…
Writing results
Now we've generated all the results with the call to `eval_models`, we need to write them out to a file so we can analyse them. The standard `csv` library writes CSV files for us, and just about every spreadsheet and data analysis package reads them. We use the library to create a `csv.DictWriter` object, which writes dicts to a CSV file. The only tweak is that we add the `name` to each row of results so that things appear nicely.
```python
import csv

def write_results(scores):
    with open('caesar_break_parameter_trials.csv', 'w') as f:
        writer = csv.DictWriter(f, ['name'] + message_lengths,
                                quoting=csv.QUOTE_NONNUMERIC)
        writer.writeheader()
        for scoring in sorted(scores):
            scores[scoring]['name'] = scoring
            writer.writerow(scores[scoring])
```
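Putting the pieces together, running the whole experiment is then just a couple of calls (and, with tens of thousands of ciphertexts per model, a long wait):

```python
# Evaluate every model at every message length, then save the counts to CSV.
scores = eval_models()
write_results(scores)
```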
The results
You can analyse the results with a spreadsheet, but here I'll use the pandas data processing library.
```python
import pandas as pd

# Read the scores back in, indexing by model name.
results = pd.read_csv('caesar_break_parameter_trials.csv').set_index('name')

# In a notebook, these lines display the table, first sorted by the
# five-letter column and then in the original order.
results.sort_values('5')
results
```
We can plot the data easily:
```python
ax = results.sort_values('5', ascending=False).T.plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
```
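(If you're running this as a script rather than in a notebook, you'll also want to save or show the figure; the filename here is just an example.)

```python
# Save the chart to a file for the write-up; the filename is illustrative.
ax.figure.savefig('caesar_break_parameter_trials.png', bbox_inches='tight')
```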
And here are the results, as the percentage of ciphertexts correctly broken (after 100,000 runs for each model); the columns give the message length in letters:
(Note that the x-axis scale is nonlinear.)
name | 100 | 50 | 30 | 20 | 10 | 5 |
---|---|---|---|---|---|---|
Ptrigrams | 99.991% | 99.994% | 99.990% | 99.944% | 97.994% | 74.922% |
Pbigrams | 99.975% | 99.972% | 99.962% | 99.831% | 95.323% | 67.277% |
Pletters | 99.952% | 99.937% | 99.683% | 97.936% | 81.597% | 47.758% |
l1 + l1_scaled | 99.949% | 99.879% | 98.944% | 95.454% | 72.617% | 42.940% |
l1 + l2_scaled | 99.945% | 99.889% | 98.996% | 95.350% | 74.966% | 44.413% |
l2 + l1_scaled | 99.946% | 99.822% | 98.336% | 93.457% | 71.287% | 43.350% |
l2 + l2_scaled | 99.957% | 99.796% | 98.274% | 93.471% | 71.413% | 43.288% |
l3 + l1_scaled | 99.942% | 99.324% | 95.766% | 87.384% | 59.770% | 40.661% |
l3 + l2_scaled | 99.922% | 99.445% | 96.568% | 89.109% | 63.241% | 39.819% |
cosine_similarity + l1_scaled | 99.948% | 99.764% | 98.358% | 93.346% | 71.183% | 43.193% |
cosine_similarity + l2_scaled | 99.946% | 99.768% | 98.345% | 93.399% | 71.353% | 43.259% |
What does this show us? For "long" ciphertexts (20 letters or more) it doesn't really matter what language model we use, as all of them perform just about perfectly.
For short ciphertexts, the n-gram models significantly outperform the norm-based models. This means the n-gram models win out on both performance and ease of use and understanding.
But what's really surprising to me is how short the ciphertexts can be and still be broken. Even with just five characters of Caesar-enciphered text, the trigram model gets it right about 75% of the time, and even a very naïve unigram approach gets the right answer nearly 50% of the time.
Code
The code for this experiment is on Github, as is the code for the norms and the code for the n-gram models.
Acknowledgements
- Post cover image from Unsplash