What's the best language model?: part 2

    In part 1 of this post, I talked about the range of language models we could use. But that still leaves the question of which is best.

    As it's not obvious which is the best langauge model, we'll perform an experiment to find out. We'll take some samples of text from our corpus of novels, encipher them with random keys, then try to break the key. We'll use different language models on each sample ciphertext and count how many each one gets.

    When developing things like this, it's often easier to start from the end point and then build the tools we need to make that work.

    As we're just running some tests in a small experimental harness, I'll break some rules of good programming and keep various constants in global variables, where it makes life easier.

    Testing some models

    Let's assume we have some models to test, in a list called models and a list of message_lengths to try. We want to build a dict of dicts: the outer dict has one element for each model, and the inner dicts have one element for each test message length.

    Evaluating the models is easy with a pair of dict comprehensions:

    def eval_models():
        return {m['name']: {l: eval_one_model(m, l) 
                            for l in message_lengths}
                   for m in models}

    …but it does highlight that we need two pieces of information for each model: a name we can use when talking about it, and a func, the function which we call to use that model.

    Given that, we can eval_one_model by just making trials number of random ciphertexts, trying to break each one, and counting successes when the breaking function gets it right. (As we'll be testing tens of thousands of ciphertexts, the print is there just to reassure us the experiment is making progress.)

    def eval_one_model(model, message_length):
        print(model['name'], message_length)
        successes = 0
        for _ in range(trials):
            key, ciphertext = random_ciphertext(message_length)
            found_key, _ = caesar_break(ciphertext, model['func'])
            if found_key == key:
                successes += 1 
        return successes

    Generating random cipher text

    The approach we'll use is to take a lot of real text and then pull samples out of it. On first sight, an alternative approach would be to generate random text from the letter frequencies, but that won't help when we come to test bigram and trigram models.

    To read the text, we make use of the sanitise function defined earlier. We just read the three novels we have lying around, join them together, sanitise them, and call that our corpus:

    corpus = sanitise(cat([
    corpus_length = len(corpus)

    To generate a random piece of ciphertext, we pick a random start position in the corpus (taking care it's not too close to the end of the corpus), pick out a slice of corpus of the appropriate length, pick a random key, and encipher the sample with that key. We return both the key and the ciphertext, so that eval_one_model can work.

    def random_ciphertext(message_length):
        sample_start = random.randint(0, corpus_length - message_length)
        sample = corpus[sample_start:(sample_start + message_length)]
        key = random.randint(1, 25)
        ciphertext = caesar_encipher(sample, key)
        return key, ciphertext

    We can test this:

    >>> k, c = random_ciphertext(20)
    >>> k, c, caesar_decipher(c, k)
    (16, 'qhusedludjyedqbjxydw', 'areconventionalthing')

    Seems to work.

    Building a list of models

    This is tricker. We need to end up with models, a list of two element dicts: the name and the func to call. Building the name is easy. Building the function is harder.

    The n-gram models are easy: we can define models as:

    models = [{'name': 'Pletters', 'func': Pletters},
              {'name': 'Pbigrams', 'func': Pbigrams}
              {'name': 'Ptrigrams', 'func': Ptrigrams}]

    For the norm-based models, we have to define

    1. the norm to scale the message's letter counts
    2. the norm to scale the letter counts of standard English
    3. the norm to compare these two vectors.

    We need to test all combinations of these. For the sake of consistency, we'll use the same norm for both vector scalings.

    In addition, the norm-based measures return the distance between two vectors, while the cipher breaking method wants to maximise the similarity of the two pieces of text. That means that, for some comparisons, we want to invert the function result to turn the distance into a similarity.

    Let's start with what we know. For each metric for comparing two vectors, we need the func that does the comparison, an invert flag to say if this is finding a distance not a similarity, and a name. For each scaling, we need the corpus_frequency for the English counts we're comparing to, the scaling for scaling the sample text, and the name for this scaling.

    metrics = [
        {'func': l1, 'invert': True, 'name': 'l1'}, 
        {'func': l2, 'invert': True, 'name': 'l2'},
        {'func': l3, 'invert': True, 'name': 'l3'},
        {'func': cosine_similarity, 'invert': False, 'name': 'cosine_similarity'}]
    l2_scaled_english_counts = l2_scale(english_counts)
    scalings = [
        {'corpus_frequency': normalised_english_counts, 
         'scaling': l1_scale,
         'name': 'l1_scaled'},
        {'corpus_frequency': l2_scaled_english_counts, 
         'scaling': l2_scale,
         'name': 'l2_scaled'} ]

    We can use that information to build the models we need:

    models = (
        [ {'func': make_frequency_compare_function(
                s['corpus_frequency'], s['scaling'], 
                m['func'], m['invert']),
           'name': '{} + {}'.format(m['name'], s['name'])}
            for m in metrics
            for s in scalings ] 
        [{'func': Pletters, 'name': 'Pletters'}, 
         {'func': Pbigrams, 'name': 'Pbigrams'},
         {'func': Ptrigrams, 'name': 'Ptrigrams'}]

    All that's left is the make_frequecy_compare_function. make_frequecy_compare_function takes all sorts of parameters, but we want it to return a function that takes just one: the text being scored. The trick is that, inside make_frequecy_compare_function, we can refer to all its parameters. If we create a function in that context and return it, the returned function can still access these parameters! This is termed a closure: the returned function encloses the parameters that were in scope when the closure was created. See the Wikibooks and Wikipedia articles.

    An example might make it clearer (taken from Wikibooks). make_adder(x) returns a function which adds x to some other number. This returned function rembers the value of x when it was created.

    def make_adder(x):
        def addX(y):
            return x + y
        return addX

    We can check this:

    >>> make_adder(3)
    <function make_adder.<locals>.addX at 0x7f1ac1ab7488>

    Yes, make_adder returns a function. Let's give that returned function a name so we can call it later.

    >>> add1 = make_adder(1) # Adds 1 to its argument
    >>> add5 = make_adder(5) # Adds 5 to its argument
    >>> add1(2)
    >>> add1(4)
    >>> add5(4)

    That's the idea. The functions returned by make_adder, which I've called add1 and add5, remember the "number to add" which was used when the closure was created. Now, even outside make_adder, we can use that closure to add 1 or 5 to a number.

    This little example isn't that useful, but we use the same concept of closures to create the scoring function we need here. We build a closure which implements the scoring function we want, so that when the closure is passed a piece of text, it returns the appropriate score.

    def make_frequency_compare_function(
            target_frequency, frequency_scaling, metric, invert):
        def frequency_compare(text):
            counts = frequency_scaling(frequencies(text))
            if invert:
                score = -1 * metric(target_frequency, counts)
                score = metric(target_frequency, counts)
            return score
        return frequency_compare

    We now have all the pieces in place to do the experiment! Apart from one thing…

    Writing results

    Now we've generated all the results with the call to eval_models, we need to write them out to a file so we can analyse the results. The standard csv library writes csv files for us, and just about every spreadsheet and data analysis package reads them. We use the library to create a csv.DictWriter object, which writes dicts to a csv file. The only tweak is that we add the name to each row of results to that things appear nicely.

    import csv
    def write_results(scores):
        with open('caesar_break_parameter_trials.csv', 'w') as f:
            writer = csv.DictWriter(f, ['name'] + message_lengths, 
            for scoring in sorted(scores):
                scores[scoring]['name'] = scoring

    The results

    You can analyse the results with a spreadsheet, but here I'll use the pandas data processing library.

    import pandas as pd
    results = pd.read_csv('caesar_break_parameter_trials.csv').set_index('name')

    We can plot the data easily:

    ax = results.sort_values('5', ascending=False).T.plot()
    ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

    And here are the results (after 100,000 runs for each model):


    (Note that the x-axis scale is nonlinear.)

    name 100 50 30 20 10 5
    Ptrigrams 99.991% 99.994% 99.990% 99.944% 97.994% 74.922%
    Pbigrams 99.975% 99.972% 99.962% 99.831% 95.323% 67.277%
    Pletters 99.952% 99.937% 99.683% 97.936% 81.597% 47.758%
    l1 + l1_scaled 99.949% 99.879% 98.944% 95.454% 72.617% 42.940%
    l1 + l2_scaled 99.945% 99.889% 98.996% 95.350% 74.966% 44.413%
    l2 + l1_scaled 99.946% 99.822% 98.336% 93.457% 71.287% 43.350%
    l2 + l2_scaled 99.957% 99.796% 98.274% 93.471% 71.413% 43.288%
    l3 + l1_scaled 99.942% 99.324% 95.766% 87.384% 59.770% 40.661%
    l3 + l2_scaled 99.922% 99.445% 96.568% 89.109% 63.241% 39.819%
    cosine_similarity + l1_scaled 99.948% 99.764% 98.358% 93.346% 71.183% 43.193%
    cosine_similarity + l2_scaled 99.946% 99.768% 98.345% 93.399% 71.353% 43.259%

    What does this show us? For "long" ciphertexts (20 letters or more) it doesn't really matter what langauge model we use, as all of them perform just about perfectly.

    For short ciphertexts, the n-gram models significantly outperform the norm-based models. This means the n-gram models win out both on performance, and on ease of use and understanding.

    But that's really surprising for me is how short the ciphertexts can be and still be broken. Even with just five characters of Caesar-enciphered text, the trigram model gets it right about 75% of the time, and even a very naïve unigram approach gets the right answer nearly 50% of the time.


    The code for this experiment is on Github, as is the code for the norms and the code for the n-gram models.


    Neil Smith

    Read more posts by this author.