Neil's musings

Snippets from a computing educator and researcher.

Breaking ciphers and language models

March 16, 2018 in #codes and ciphers #python

Now that we can encipher and decipher with a Caesar cipher, the next step is to write a program that automatically breaks enciphered messages.

The basic idea

The basic model is simple: we'll get the computer to try all the keys and see which is the best.

"Trying all the keys" is simple: for a Caesar cipher, we just run through the shifts 0–25.

Finding which is "best" is somewhat trickier. How do we know which decryption is the best one? For complex ciphers and long ciphertexts, we don't want to rely on a human to make the decision.

A language model

My approach takes its idea from the apocryphal story that an infinite number of monkeys, with an infinite number of typewriters, will eventually produce the complete works of Shakespeare. As the computer tries each key, generating a possible plaintext, it can score that plaintext by how likely a monkey, typing completely at random, would be to generate it.

[Diagram: English letters by proportion] That idea isn't much help when all the letters are equally likely. But if the monkey is using a keyboard where the keys are sized in proportion to how often each letter appears in English, we have something. The diagram to the right gives an idea of what such a keyboard could look like. A monkey using that keyboard, hitting keys at random, is far more likely to produce something like treattlpis than nziuechjtk.

That allows us to score how close a piece of text is to English. If we can work out how likely a random monkey would be to produce our possible plaintext, we can choose the key which produces the most likely plaintext.

(This model is also called a bag of letters model, as it's the same as taking all the letters in the text and putting them in a bag, losing all idea of the order of letters. While we lose a lot of information with this, it's actually good enough for our purposes.)
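The order-blindness of the bag-of-letters model is easy to see with Python's collections.Counter (introduced below): two anagrams have exactly the same letter counts, so the model will always give them the same score:

```python
import collections

# Anagrams have identical letter counts, so a bag-of-letters
# model cannot tell them apart.
print(collections.Counter('listen') == collections.Counter('silent'))  # True
```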

Finding letter probabilities

How do we find these probabilities? The easy answer is to simply read a lot of text, counting the letters. Let's use three large texts: the complete works of Shakespeare, War and Peace, and The Adventures of Sherlock Holmes, all from Project Gutenberg.

The Python collections.Counter() object from the standard library does a good job of counting letters for us. If we pass a Counter a sequence of characters, it will count them for us:

Python 3.6.3 (default, Oct  3 2017, 21:45:48)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import collections
>>> collections.Counter('this is some test text')
Counter({'t': 5, 's': 4, ' ': 4, 'e': 3, 'i': 2, 'h': 1, 'o': 1, 'm': 1, 'x': 1})

If we update the Counter, it will update the counts for us:

>>> counts = collections.Counter()
>>> counts.update('this is some text')
>>> counts
Counter({'t': 3, 's': 3, ' ': 3, 'i': 2, 'e': 2, 'h': 1, 'o': 1, 'm': 1, 'x': 1})
>>> counts.update('here is some more text')
>>> counts
Counter({' ': 7, 'e': 7, 't': 5, 's': 5, 'i': 3, 'o': 3, 'm': 3, 'h': 2, 'x': 2, 'r': 2})

This allows us to easily combine the counts of letters in all the texts, then write them to a file. We put this in lettercount.py:

import collections
import string
from utilities import sanitise

corpora = ['shakespeare.txt', 'sherlock-holmes.txt', 'war-and-peace.txt']
counts = collections.Counter()

for corpus in corpora:
    with open(corpus) as f:
        counts.update(sanitise(f.read()))

sorted_letters = sorted(counts, key=counts.get, reverse=True)

with open('count_1l.txt', 'w') as f:
    for l in sorted_letters:
        f.write("{0}\t{1}\n".format(l, counts[l]))

(The sanitise function is defined in the post on sanitising text: it takes just the letters from some text, converting them to lowercase.)
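For reference, a minimal sanitise matching that description might look like this (the real version lives with the other utilities described in that post):

```python
def sanitise(text):
    """Keep only the letters from the text, converted to lower case."""
    return ''.join(c.lower() for c in text if c.isalpha())
```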

Using the language model

Now that we have the letter counts, we need to use them to score possible plaintexts. There are two stages here:

  1. Convert the counts into probabilities;
  2. Score a piece of text for how probable it is (by the random monkey metric).

Converting the counts to probabilities

This just requires working out what proportion each letter is of the total count. This process is called normalisation. We can find the total of all the counts with sum(counts.values()); computing that total once, the normalised counts are:

total = sum(counts.values())
normalised_counts = {letter: count / total
    for letter, count in counts.items()}

(This is a dictionary comprehension, similar to a list comprehension.)

We end up with something that looks like this:

{'e': 0.12099426536374505, 
 't': 0.08946868868814231, 
 'o': 0.08052207518149467, 
 'a': 0.07822525209432887,
 ...
 }
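A quick sanity check on the normalisation: the proportions should sum to (almost exactly) one. A sketch with toy counts, standing in for the real letter counts:

```python
import math

counts = {'e': 12, 't': 9, 'o': 8, 'a': 8}  # toy counts for illustration
total = sum(counts.values())
normalised_counts = {letter: count / total
                     for letter, count in counts.items()}

# Floating-point rounding aside, the proportions sum to 1.
print(math.isclose(sum(normalised_counts.values()), 1.0))  # True
```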

Score a piece of text

We can use these to find the probability of a sequence of letters by finding the probability of each letter, then multiplying them all together:

>>> import functools
>>> import operator
>>> functools.reduce(operator.mul,
...     [normalised_counts[l]
...      for l in 'hello'])
1.106482390681202e-06
>>> functools.reduce(operator.mul,
...     [normalised_counts[l]
...      for l in 'hellothere'])
2.3105390256926316e-13

(operator.mul is the multiplication operator, but able to be used as a function. functools.reduce uses the function it's passed to reduce the sequence of items to a single value. In this case, functools.reduce(operator.mul, list_of_numbers) multiplies all the numbers together.)

This reveals a problem. The values are already very small, and will be much smaller for longer texts. With a long enough text, we'll end up with a number too small for a floating-point value to represent: it underflows to zero, and then every key looks equally (im)probable.
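You can see the underflow directly: raising a typical single-letter probability of about 0.07 to a high enough power drops below the smallest positive float Python can represent:

```python
import sys

p = 0.07                     # a typical single-letter probability
print(p ** 150)              # tiny, but still representable
print(p ** 500 == 0.0)       # True: the product has underflowed to zero
print(sys.float_info.min)    # smallest positive normal float, about 2.2e-308
```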

We can get around this by using the "trick" of taking logarithms of probabilities. As probabilities get smaller, their logarithms become more negative, but only slowly: a probability of 10⁻⁵⁷⁷ has a base-10 logarithm of just −577. This means we can still handle the probability of long texts, while still being able to see which is the most probable.
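This works because logarithms turn products into sums: log(a × b) = log a + log b. A quick check, using 0.07 as an illustrative letter probability:

```python
import math

a, b = 0.07, 0.02
# The log of a product equals the sum of the logs.
print(math.isclose(math.log10(a * b), math.log10(a) + math.log10(b)))  # True

# 500 letters at probability 0.07 each: the raw product underflows to zero,
# but the log probability is a perfectly ordinary negative number.
print(500 * math.log10(0.07))
```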

To find the log probability of a sequence of letters, we find the log probability of each letter, then just add them up. This has the advantage that we can use Python's built-in sum() function rather than faffing around with functools and operator!

We find the log probabilities of each letter in much the same way as before. We define a convenience function Pletters which finds the log probability of a letter sequence. Note that Pletters assumes it is only passed lower-case letters: anything else will cause it to raise an error.

import math

total = sum(counts.values())
Pl = {letter: math.log10(count / total)
    for letter, count in counts.items()}

def Pletters(letters):
    return sum(Pl[letter] for letter in letters)

We can use that to find some log probabilities:

>>> Pletters('hello')
-5.956055493341638
>>> Pletters('hellothere')
-11.240897611186083

Much better values!

We've ended up with a simple model of the English language, which we can use to judge how likely it is that a given piece of text is actually English.

Breaking Caesar ciphers

Finally, we can use Pletters to score possible plaintexts, and hence automatically break Caesar ciphers!

def caesar_break(message, fitness=Pletters):
    """Breaks a Caesar cipher using frequency analysis"""
    sanitised_message = sanitise(message)
    best_shift = 0
    best_fit = float('-inf')
    for shift in range(26):
        plaintext = caesar_decipher(sanitised_message, shift)
        fit = fitness(plaintext)
        if fit > best_fit:
            best_fit = fit
            best_shift = shift
    return best_shift, best_fit

Let's see if it works. We encipher a message, then see if caesar_break returns the correct key.

>>> from cipher.caesar import *
>>> caesar_encipher('this is a sample message to be decrypted', 17)
'kyzj zj r jrdgcv dvjjrxv kf sv uvtipgkvu'
>>> caesar_break('kyzj zj r jrdgcv dvjjrxv kf sv uvtipgkvu')
(17, -41.43440679319847)
>>> 

Success!

Let's try it on something larger: the ciphertext from the National Cipher Challenge 2016 challenge 1.

>>> c1a = open('2016/1a.ciphertext').read()
>>> print(c1a)
PIZZG,
Q PIDM AKIVVML BPM MVKZGXBML VWBM BPM XWTQKM NWCVL WV RIUMTQI'A LMAS IVL
IBBIKPML QB NWZ GWC BW TWWS IB. BPM XWTQKM LMKZGXBML QB NWZ BPMUAMTDMA (QB QA
DMZG ABZIQOPBNWZEIZL WVKM GWC ZMITQAM BPIB QB PIA JMMV EZQBBMV JIKSEIZLA - QB
RCAB CAMA I KIMAIZ APQNB KQXPMZ). BPM WNNQKMZ QV KPIZOM WN BPM QVDMABQOIBQWV
UILM QB KTMIZ BW UM BPIB PM BPQVSA BPQA XZWDMA RIUMTQI'A LMIBP QA "RCAB" I
XMZAWVIT BZIOMLG. KIZMTMAA CAM WN BPM EWZL "RCAB" MDMV QN PM QA ZQOPB, JCB Q
LWV'B BPQVS PM QA. Q PIDM AXWSMV BW PMZ KWTTMIOCMA, IVL RIUMTQI LWMAV'B ABZQSM
UM IA I RCUXMZ. APM EIA XZMBBG LZQDMV IVL PMZ EWZS EIA OWQVO MFBZMUMTG EMTT.
IXXIZMVBTG APM EIA CVPIXXG IJWCB PMZ JWGNZQMVL TMIDQVO, JCB I YCQKS AKIV WN
PMZ AMIZKP PQABWZG ACOOMABA APM EIA XZMBBG IKBQDM QV BZGQVO BW BZIKS PQU LWEV.
BPM XWTQKM BPQVS BPIB APWEA PWE LMAXMZIBM APM EIA. Q BPQVS QB APWEA BPIB APM
EIAV'B BPM AWZB BW OQDM CX MIAQTG.
WV WVM BPQVO Q LW IOZMM EQBP BPM XWTQKM, QB LWMAV'B AMMU DMZG TQSMTG BPIB I
XPGAQKQAB EWZSQVO WV OZIDQBG EIDMA QA KICOPB CX QV IVGBPQVO BWW ACAXQKQWCA.
PMZ IZMI QA EMTT NCVLML IVL AQVKM BPM LQAKWDMZG WN OZIDQBG EIDMA I NME UWVBPA
IOW QB QA QV BPM AXWBTQOPB. PMZ PMIL WN LMXIZBUMVB AIGA RIUMTQI EIA LMABQVML
NWZ I OWWL KIZMMZ, IVL Q KIV'B AMM IVGBPQVO QV PMZ EWZS BPIB EWCTL JM WN
QVBMZMAB BW LIZSVMB WZ OWDMZVUMVB IKBWZA.
BW JM PWVMAB Q IU CVACZM QN Q IU KPIAQVO APILWEA PMZM, JCB BPMV APILWE KPIAQVO
QA WVM WN GWCZ AXMKQITQBQMA AW Q EWCTL JM ZMITTG OZIBMNCT QN GWC KWCTL BISM I
TWWS IVL TMB UM SVWE QN GWC BPQVS Q IU EIABQVO GWCZ BQUM.

BPIVSA,

KPIZTQM

>>> key, fitness = caesar_break(c1a)
>>> key
8
>>> print(caesar_decipher(c1a, key))
HARRY,
I HAVE SCANNED THE ENCRYPTED NOTE THE POLICE FOUND ON JAMELIA'S DESK AND
ATTACHED IT FOR YOU TO LOOK AT. THE POLICE DECRYPTED IT FOR THEMSELVES (IT IS
VERY STRAIGHTFORWARD ONCE YOU REALISE THAT IT HAS BEEN WRITTEN BACKWARDS - IT
JUST USES A CAESAR SHIFT CIPHER). THE OFFICER IN CHARGE OF THE INVESTIGATION
MADE IT CLEAR TO ME THAT HE THINKS THIS PROVES JAMELIA'S DEATH IS "JUST" A
PERSONAL TRAGEDY. CARELESS USE OF THE WORD "JUST" EVEN IF HE IS RIGHT, BUT I
DON'T THINK HE IS. I HAVE SPOKEN TO HER COLLEAGUES, AND JAMELIA DOESN'T STRIKE
ME AS A JUMPER. SHE WAS PRETTY DRIVEN AND HER WORK WAS GOING EXTREMELY WELL.
APPARENTLY SHE WAS UNHAPPY ABOUT HER BOYFRIEND LEAVING, BUT A QUICK SCAN OF
HER SEARCH HISTORY SUGGESTS SHE WAS PRETTY ACTIVE IN TRYING TO TRACK HIM DOWN.
THE POLICE THINK THAT SHOWS HOW DESPERATE SHE WAS. I THINK IT SHOWS THAT SHE
WASN'T THE SORT TO GIVE UP EASILY.
ON ONE THING I DO AGREE WITH THE POLICE, IT DOESN'T SEEM VERY LIKELY THAT A
PHYSICIST WORKING ON GRAVITY WAVES IS CAUGHT UP IN ANYTHING TOO SUSPICIOUS.
HER AREA IS WELL FUNDED AND SINCE THE DISCOVERY OF GRAVITY WAVES A FEW MONTHS
AGO IT IS IN THE SPOTLIGHT. HER HEAD OF DEPARTMENT SAYS JAMELIA WAS DESTINED
FOR A GOOD CAREER, AND I CAN'T SEE ANYTHING IN HER WORK THAT WOULD BE OF
INTEREST TO DARKNET OR GOVERNMENT ACTORS.
TO BE HONEST I AM UNSURE IF I AM CHASING SHADOWS HERE, BUT THEN SHADOW CHASING
IS ONE OF YOUR SPECIALITIES SO I WOULD BE REALLY GRATEFUL IF YOU COULD TAKE A
LOOK AND LET ME KNOW IF YOU THINK I AM WASTING YOUR TIME.
                               
THANKS,

CHARLIE

>>>

Code

All the code for the language models is in the support/language_models.py file. The Caesar cipher breaking routine is in cipher/caesar.py.

