Breaking ciphers and language models

    Now that we can encipher and decipher with a Caesar cipher, the next step is to write a program to break enciphered messages automatically.

    The basic idea

    The basic model is simple: we'll get the computer to try all the keys and see which is the best.

    "Trying all the keys" is simple: for a Caesar cipher, we just run through the shifts 0–25.

    Finding which is "best" is somewhat trickier. How do we know which is the best possible decryption? For complex ciphers and long ciphertexts, we don't want to rely on a human to make that decision.

    A language model

    [Image: a monkey at a typewriter]

    My approach takes its idea from the apocryphal story that an infinite number of monkeys, with an infinite number of typewriters, will eventually produce the complete works of Shakespeare. As the computer tries each key, generating a possible plaintext, it can score that plaintext by how likely a monkey, typing completely at random, would be to generate it.

    [Image: a keyboard with English letters sized by their frequency]

    That idea isn't much help when all the letters are equally likely. But if the monkey is using a keyboard where the keys are sized in proportion to how often each letter appears in English, we have something. The image gives an idea of what such a keyboard could look like. A monkey using that keyboard, hitting keys at random, is more likely to produce something like treattlpis than nziuechjtk.

    That allows us to score how close a piece of text is to English. If we can work out the probability of how likely a random monkey would be to produce our possible plaintext, we can choose the key which produces the most likely plaintext.

    (This model is also called a bag of letters model, as it's the same as taking all the letters in the text and putting them in a bag, losing all idea of the order of letters. While we lose a lot of information with this, it's actually good enough for our purposes.)
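    One consequence of the bag-of-letters model is worth checking: because order is discarded, any two anagrams score identically. A quick sketch with made-up letter probabilities (toy_probs is invented purely for illustration, not taken from the corpora below):

```python
# Toy letter probabilities, invented purely for illustration.
toy_probs = {'e': 0.3, 'i': 0.2, 'l': 0.15, 'n': 0.15, 's': 0.1, 't': 0.1}

def bag_score(text, probs):
    """Probability of a random 'monkey' typing text: the product of its letter probabilities."""
    score = 1.0
    for letter in text:
        score *= probs[letter]
    return score

# 'listen' and 'silent' are anagrams, so the bag-of-letters model can't tell them apart.
```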

    Finding letter probabilities

    How do we find these probabilities? The easy answer is to simply read a lot of text, counting the letters. Let's use three large texts: the complete works of Shakespeare, War and Peace, and The Adventures of Sherlock Holmes, all from Project Gutenberg.

    The collections.Counter class from Python's standard library does a good job of counting letters. If we pass a Counter a sequence of characters, it counts each one:

    Python 3.6.3 (default, Oct  3 2017, 21:45:48)
    [GCC 7.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import collections
    >>> collections.Counter('this is some test text')
    Counter({'t': 5, 's': 4, ' ': 4, 'e': 3, 'i': 2, 'h': 1, 'o': 1, 'm': 1, 'x': 1})
    

    If we update the Counter, it will update the counts for us:

    >>> counts = collections.Counter()
    >>> counts.update('this is some text')
    >>> counts
    Counter({'t': 3, 's': 3, ' ': 3, 'i': 2, 'e': 2, 'h': 1, 'o': 1, 'm': 1, 'x': 1})
    >>> counts.update('here is some more text')
    >>> counts
    Counter({' ': 7, 'e': 7, 't': 5, 's': 5, 'i': 3, 'o': 3, 'm': 3, 'h': 2, 'x': 2, 'r': 2})
    

    This allows us to easily combine the counts of letters in all the texts, then write them to a file. We put this in lettercount.py:

    import collections
    from utilities import sanitise
    
    corpora = ['shakespeare.txt', 'sherlock-holmes.txt', 'war-and-peace.txt']
    counts = collections.Counter()
    
    # Count the letters in each corpus, accumulating everything into one Counter.
    for corpus in corpora:
        with open(corpus, 'r') as f:
            counts.update(sanitise(f.read()))
    
    # Write the letters to file, most common first.
    sorted_letters = sorted(counts, key=counts.get, reverse=True)
    
    with open('count_1l.txt', 'w') as f:
        for letter in sorted_letters:
            f.write("{0}\t{1}\n".format(letter, counts[letter]))
    

    (The sanitise function is defined in the post on sanitising text: it takes just the letters from some text, converting them to lowercase.)
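    The real sanitise lives in the utilities module; a minimal stand-in with the same behaviour (keep only ASCII letters, lowercased) might look like:

```python
import string

def sanitise(text):
    """Keep just the letters of text, converted to lower case."""
    return ''.join(ch.lower() for ch in text if ch in string.ascii_letters)
```

    For example, sanitise('Hello, World!') gives 'helloworld'.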

    Using the language model

    Now we have the letter counts, we need to use them to score possible plaintexts. There are two stages here:

    1. Convert the counts into probabilities;
    2. Score a piece of text for how probable it is (by the random monkey metric).

    Converting the counts to probabilities

    This just requires working out what proportion of the total each letter's count represents, a process called normalisation. We can find the total of all the counts with sum(counts.values()), then find the normalised version of the counts with:

    total = sum(counts.values())
    normalised_counts = {letter: count / total
        for letter, count in counts.items()}
    

    (This is a dictionary comprehension, similar to a list comprehension.)

    We end up with something that looks like this:

    {'e': 0.12099426536374505, 
     't': 0.08946868868814231, 
     'o': 0.08052207518149467, 
     'a': 0.07822525209432887,
     ...
     }
    

    Score a piece of text

    We can use these to find the probability of a sequence of letters by finding the probability of each letter, then multiplying them all together:

    >>> import functools
    >>> import operator
    >>> functools.reduce(operator.mul, 
    ...     [normalised_counts[l] 
    ...      for l in 'hello'])
    1.106482390681202e-06
    >>> functools.reduce(operator.mul, 
    ...     [normalised_counts[l] 
    ...      for l in 'hellothere'])
    2.3105390256926316e-13
    

    (operator.mul is the multiplication operator, but able to be used as a function. functools.reduce uses the function it's passed to reduce the sequence of items to a single value. In this case, functools.reduce(operator.mul, list_of_numbers) multiplies all the numbers together.)
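    (Since Python 3.8 there's also math.prod, which multiplies a sequence directly and reads more clearly than the reduce incantation. The numbers below are stand-ins for illustration, not real letter probabilities:)

```python
import functools
import math
import operator

probs = [0.1, 0.2, 0.5]  # stand-in letter probabilities, for illustration only

# The reduce-with-mul approach used above...
via_reduce = functools.reduce(operator.mul, probs)

# ...and the equivalent one-liner with math.prod (Python 3.8+).
via_prod = math.prod(probs)
# Both give the same product: 0.01, up to floating point rounding.
```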

    This points to a problem. The values we have are already very small, and they get much smaller for longer texts. With a long enough text, the product becomes too small for floating point numbers to represent: it underflows to zero.

    We can get around this with the standard trick of taking logarithms of probabilities. As probabilities shrink towards zero, their logarithms become more negative, but only slowly, so the log probability of even a long text stays comfortably representable. And because logarithms preserve order, the key with the highest log probability is still the key with the highest probability.
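    To see the danger concretely: raising a typical letter probability to the power of a few thousand letters underflows to zero, while the corresponding log probability remains an ordinary, comparable number. (The 0.07 here is just a plausible single-letter probability, not a value from our counts.)

```python
import math

p = 0.07    # a plausible single-letter probability (illustrative only)
n = 5000    # length of a longish text

# The direct product underflows: the smallest positive float is about 5e-324,
# but p ** n is around 1e-5775, so Python rounds it down to 0.0.
product = p ** n

# The log probability is perfectly representable, and because logs preserve
# order, comparisons between candidate plaintexts still work.
log_prob = n * math.log10(p)
```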

    To find the log probability of a sequence of letters, we find the log probability of each letter, then just add them up. This has the advantage that we can use Python's built-in sum() function rather than faffing around with functools and operator!

    We find the log probabilities of each letter in much the same way as before. We define a convenience function Pletters which finds the log probability of a letter sequence. Note that Pletters assumes it is only passed lower-case letters: anything else will cause it to raise an error.

    import math
    
    total = sum(counts.values())
    Pl = {letter: math.log10(count / total)
        for letter, count in counts.items()}
        
    def Pletters(letters):
        return sum(Pl[letter] for letter in letters)
    

    We can use that to find some log probabilities:

    >>> Pletters('hello')
    -5.956055493341638
    >>> Pletters('hellothere')
    -11.240897611186083
    

    Much better values!

    We've ended up with a simple model of the English language, which we can use to judge how likely it is that a given piece of text is actually English.

    Breaking Caesar ciphers

    Finally, we can use Pletters to score possible plaintexts, and hence automatically break Caesar ciphers!

    def caesar_break(message, fitness=Pletters):
        """Breaks a Caesar cipher using frequency analysis"""
        sanitised_message = sanitise(message)
        best_shift = 0
        best_fit = float('-inf')
        for shift in range(26):
            plaintext = caesar_decipher(sanitised_message, shift)
            fit = fitness(plaintext)
            if fit > best_fit:
                best_fit = fit
                best_shift = shift
        return best_shift, best_fit
    

    Let's see if it works. We encipher a message, then see if caesar_break returns the correct key.

    >>> from cipher.caesar import *
    >>> caesar_encipher('this is a sample message to be decrypted', 17)
    'kyzj zj r jrdgcv dvjjrxv kf sv uvtipgkvu'
    >>> caesar_break('kyzj zj r jrdgcv dvjjrxv kf sv uvtipgkvu')
    (17, -41.43440679319847)
    >>> 
    

    Success!

    Let's try it on something larger: the ciphertext from the National Cipher Challenge 2016 challenge 1.

    >>> c1a = open('2016/1a.ciphertext').read()
    >>> print(c1a)
    PIZZG,
    Q PIDM AKIVVML BPM MVKZGXBML VWBM BPM XWTQKM NWCVL WV RIUMTQI'A LMAS IVL
    IBBIKPML QB NWZ GWC BW TWWS IB. BPM XWTQKM LMKZGXBML QB NWZ BPMUAMTDMA (QB QA
    DMZG ABZIQOPBNWZEIZL WVKM GWC ZMITQAM BPIB QB PIA JMMV EZQBBMV JIKSEIZLA - QB
    RCAB CAMA I KIMAIZ APQNB KQXPMZ). BPM WNNQKMZ QV KPIZOM WN BPM QVDMABQOIBQWV
    UILM QB KTMIZ BW UM BPIB PM BPQVSA BPQA XZWDMA RIUMTQI'A LMIBP QA "RCAB" I
    XMZAWVIT BZIOMLG. KIZMTMAA CAM WN BPM EWZL "RCAB" MDMV QN PM QA ZQOPB, JCB Q
    LWV'B BPQVS PM QA. Q PIDM AXWSMV BW PMZ KWTTMIOCMA, IVL RIUMTQI LWMAV'B ABZQSM
    UM IA I RCUXMZ. APM EIA XZMBBG LZQDMV IVL PMZ EWZS EIA OWQVO MFBZMUMTG EMTT.
    IXXIZMVBTG APM EIA CVPIXXG IJWCB PMZ JWGNZQMVL TMIDQVO, JCB I YCQKS AKIV WN
    PMZ AMIZKP PQABWZG ACOOMABA APM EIA XZMBBG IKBQDM QV BZGQVO BW BZIKS PQU LWEV.
    BPM XWTQKM BPQVS BPIB APWEA PWE LMAXMZIBM APM EIA. Q BPQVS QB APWEA BPIB APM
    EIAV'B BPM AWZB BW OQDM CX MIAQTG.
    WV WVM BPQVO Q LW IOZMM EQBP BPM XWTQKM, QB LWMAV'B AMMU DMZG TQSMTG BPIB I
    XPGAQKQAB EWZSQVO WV OZIDQBG EIDMA QA KICOPB CX QV IVGBPQVO BWW ACAXQKQWCA.
    PMZ IZMI QA EMTT NCVLML IVL AQVKM BPM LQAKWDMZG WN OZIDQBG EIDMA I NME UWVBPA
    IOW QB QA QV BPM AXWBTQOPB. PMZ PMIL WN LMXIZBUMVB AIGA RIUMTQI EIA LMABQVML
    NWZ I OWWL KIZMMZ, IVL Q KIV'B AMM IVGBPQVO QV PMZ EWZS BPIB EWCTL JM WN
    QVBMZMAB BW LIZSVMB WZ OWDMZVUMVB IKBWZA.
    BW JM PWVMAB Q IU CVACZM QN Q IU KPIAQVO APILWEA PMZM, JCB BPMV APILWE KPIAQVO
    QA WVM WN GWCZ AXMKQITQBQMA AW Q EWCTL JM ZMITTG OZIBMNCT QN GWC KWCTL BISM I
    TWWS IVL TMB UM SVWE QN GWC BPQVS Q IU EIABQVO GWCZ BQUM.
    
    BPIVSA,
    
    KPIZTQM
    
    >>> key, fitness = caesar_break(c1a)
    >>> key
    8
    >>> print(caesar_decipher(c1a, key))
    HARRY,
    I HAVE SCANNED THE ENCRYPTED NOTE THE POLICE FOUND ON JAMELIA'S DESK AND
    ATTACHED IT FOR YOU TO LOOK AT. THE POLICE DECRYPTED IT FOR THEMSELVES (IT IS
    VERY STRAIGHTFORWARD ONCE YOU REALISE THAT IT HAS BEEN WRITTEN BACKWARDS - IT
    JUST USES A CAESAR SHIFT CIPHER). THE OFFICER IN CHARGE OF THE INVESTIGATION
    MADE IT CLEAR TO ME THAT HE THINKS THIS PROVES JAMELIA'S DEATH IS "JUST" A
    PERSONAL TRAGEDY. CARELESS USE OF THE WORD "JUST" EVEN IF HE IS RIGHT, BUT I
    DON'T THINK HE IS. I HAVE SPOKEN TO HER COLLEAGUES, AND JAMELIA DOESN'T STRIKE
    ME AS A JUMPER. SHE WAS PRETTY DRIVEN AND HER WORK WAS GOING EXTREMELY WELL.
    APPARENTLY SHE WAS UNHAPPY ABOUT HER BOYFRIEND LEAVING, BUT A QUICK SCAN OF
    HER SEARCH HISTORY SUGGESTS SHE WAS PRETTY ACTIVE IN TRYING TO TRACK HIM DOWN.
    THE POLICE THINK THAT SHOWS HOW DESPERATE SHE WAS. I THINK IT SHOWS THAT SHE
    WASN'T THE SORT TO GIVE UP EASILY.
    ON ONE THING I DO AGREE WITH THE POLICE, IT DOESN'T SEEM VERY LIKELY THAT A
    PHYSICIST WORKING ON GRAVITY WAVES IS CAUGHT UP IN ANYTHING TOO SUSPICIOUS.
    HER AREA IS WELL FUNDED AND SINCE THE DISCOVERY OF GRAVITY WAVES A FEW MONTHS
    AGO IT IS IN THE SPOTLIGHT. HER HEAD OF DEPARTMENT SAYS JAMELIA WAS DESTINED
    FOR A GOOD CAREER, AND I CAN'T SEE ANYTHING IN HER WORK THAT WOULD BE OF
    INTEREST TO DARKNET OR GOVERNMENT ACTORS.
    TO BE HONEST I AM UNSURE IF I AM CHASING SHADOWS HERE, BUT THEN SHADOW CHASING
    IS ONE OF YOUR SPECIALITIES SO I WOULD BE REALLY GRATEFUL IF YOU COULD TAKE A
    LOOK AND LET ME KNOW IF YOU THINK I AM WASTING YOUR TIME.
                                   
    THANKS,
    
    CHARLIE
    
    >>>
    

    Code

    All the language model code is in support/language_models.py. The Caesar cipher breaking routine is in cipher/caesar.py.

    Acknowledgments

    Neil Smith
