Git for beginners part 3: moving HEAD
All about reset and why its useful.
Cryptanalysis, just not using a programming langauge.
Just as you don't always need to use a programming language for cryptography, you also don't need to use a programming language for cryptanalysis. In this post, I describe how to break a Caesar cipher with just a spreadsheet and a few formulas, using the same approach as before of a bagofletters language model.
You can follow along with a copy of the spreadsheet on Google Drive. You'll find this post makes a lot more sense if you can see how all the bits fit together in the spreadsheet, as well as being able to see the results of the formulas used. Make your own copy of it if you want to break your own messages.
Describing how the spreadsheet works is tricky, as you can't just quote snippets of program code and discuss them. I've broken the spreadsheet down into regions (see the diagram below) and will go through them, region by region, describing how it all fits together.
(Contents)
At the top of the spreadsheet, in cells B2 to B6, is the "dashboard" summarising the cipher breaking.
=CF44
)=AA48
).To use the spreadsheet, you put your ciphertext, in all uppercase, in cell B2. The spreadsheet will then break the Caesar cipher and show you the key and broken plaintext in cells B4 and B6.
In this spreadsheet, we use case to separate plaintext and ciphertext. Plaintext is lowercase; ciphertext is uppercase. This becomes important when it comes to converting between the two in the spreadsheet.
(Contents)
Just below the dashboard area is a copy of the alphabet, the log probabilities of each letter, and the counts of each letter in the ciphertext.
This fragment of spreadsheet looks like this, where the bold letters and numbers are the cell references. In this example, cell B9 holds the number 2.54
and cell D10 holds the letter c
.
B  C  D  E  …  AA  

9  2.54…  4.21…  3.79…  3.15…  …  7.46… 
10  a  b  c  d  …  z 
11  12  24  58  81  …  110 
Row 9 is the log probabilities of each letter. These numbers have been truncated in the fragment above, just for clarity.
Row 11 is the count of each letter in the ciphertext. Each cell in the range B11:AA11 contains the formula
=LEN($B$2)  LEN(SUBSTITUTE(LOWER($B$2), B10, ""))
where the B10
in the formula is the cell above^{[1]}.
The formula finds the length of the ciphertext. It then replaces each occurrence of the letter in cell B10 (in this case, the letter A
) with nothing, and finds the length of the modified ciphertext. The difference between these two lengths is the number of A
s in the ciphertext.
I calculated these number from the list of letter counts^{[2]}. I loaded the list of letter counts into a new sheet of spreadsheet, sorted the counts by letter, then calculated the log probabilities like this:
D  E  F  G  

3  Letter  Sample count  Log probability  
4  
5  a  489107  2.54…  
6  b  92647  4.21…  
7  c  140497  3.79…  
8  d  267381  3.15…  
\( \vdots \)  \( \vdots \)  \( \vdots \)  \( \vdots \)  
30  z  3575  7.46… 
Each of cells G5 to G30 contains the formula
=LN(E5/SUM(E$5:E$30))
To put these log probabilities into the main sheet, select cells in column G, copy them, then go back to the main sheet. Select cell B9 then Paste Special, making sure you do paste numbers, do not paste formulae, and check the transpose option. That should put all the probabilities in the main sheet in row 9.
(Contents)
Simply all the shifts for the Caesar cipher, 0–25.
(Contents)
Each cell in this block says what each plaintext letter would be, with the Caesar shift given in the corresponding row in A16:A41.
A  B  C  D  …  Z  AA  

10  a  b  c  …  y  z  
\( \vdots \)  
16  0  =B10 
=C10 
=C10 
…  =Z10 
=AA10 
17  1  =C16 
=D16 
=E16 
…  =AA16 
=B16 
18  2  =C17 
=D17 
=E17 
…  =AA17 
=B17 
\( \vdots \)  \( \vdots \)  \( \vdots \)  \( \vdots \)  \( \vdots \)  \( \vdots \)  \( \vdots \)  \( \vdots \) 
41  25  =C40 
=D40 
=E40 
…  =AA40 
=B40 
Cells B16 to AA16 just contain the formulas =B10
to =AA10
.
Cells B17 to Z17 each contain the formula =C16
to =AA16
: in other words, the content of the cell one above and to the right. Cell AA17 contains the formula =B16
. Together, these formulas rotate the row above one space to the left, with the leftmost cell of the block moving to the right.
Cut and paste row B17:AA17 into the remaining rows of the block, B18:AA41. You'll end up with all the shifts of the alphabet, with row 41 containing z a b c … w x y
.
(Contents)
This block calculates how many of each letter would have been in the plaintext, for a given key.
For instance, consider the ciphetext CZGGJ RJMGY
.
czggj rjmgy
.byffi qilfx
(a shift of 1 would take plaintext b
to ciphertext C
, y
to Z
, and so on).CZGGJ RJMGY
is the result of a Caesar shift of 2, the plaintext would have been axeeh phkew
.hello world
.For each of these possible plaintexts, we want to count the number of a
s, b
s, and so on which occur in that plaintext.
At first sight, you'd think we'd need to do 26 decipherings and then count the letters. But we can shortcut that process by relying on the regular pattern of the different Caesar shifts.
AC16:BB16 holds the counts for a Caesar shift of 0. This is just the same counts as we calculated in the alphabets and counts section where we counted the letters in the ciphertext.
If the plaintext was enciphered with a Caesar shift of 1, the number of a
s in the plaintext is the same as the number of B
s in the ciphertext, and the number of b
s in the plaintext is the same as the number of C
s in the ciphertext, and so on.
But we know the number of B
s and C
s (and so on) in the ciphertext: they're the same as the number b
s and c
in the plaintext if there was a shift of 0. In other words, the counts we need for row 17 are the counts we have in row 16, but one cell to the right. And to avoid losing counts, the number we want in cell BB17 is the same as the count in cell AC16.
Similarly, the counts we need for row 18 are the counts we have in row 17, but one cell to the right.
Therefore, and in much the same way as we did with the shifted plaintext letters, we can fill this block with copies of the cells AC17:BB17.
A  AC  AD  AE  …  BA  BB  

10  a  b  c  …  y  z  
\( \vdots \)  
16  0  =B11 
=C11 
=D11 
…  =Z11 
=AA11 

17  1  =AD16 
=AE16 
=AF16 
…  =BB16 
=AC16 

18  2  =AD17 
=AE17 
=AF17 
…  =BB17 
=AC17 

\( \vdots \)  \( \vdots \)  \( \vdots \)  \( \vdots \)  \( \vdots \)  \( \vdots \)  \( \vdots \)  \( \vdots \)  \( \vdots \) 
41  25  =AD40 
=AE40 
=AF40 
…  =BB40 
=AC40 
(Contents)
Now we know how many there are of each letter in the different possible plaintexts, we can work out how much each letter contributes to the overall probability of that plaintext occurring by chance. We do that by just multiplying the count of each letter with the probabilities in the Alphabet and counts block.
A  BD  BE  BF  …  CB  CC  

10  a  b  c  …  y  z  
\( \vdots \)  
16  0  =AC16 * B$9 
=AD16 * C$9 
=AE16 * D$9 
…  =BA16 * Z$9 
=BB16 * AA$9 

17  1  =AC17 * B$9 
=AD17 * C$9 
=AE17 * D$9 
…  =BA17 * Z$9 
=BB17 * AA$9 

18  2  =AC18 * B$9 
=AD18 * C$9 
=AE18 * D$9 
…  =BA18 * Z$9 
=BB18 * AA$9 

\( \vdots \)  \( \vdots \)  \( \vdots \)  \( \vdots \)  \( \vdots \)  \( \vdots \)  \( \vdots \)  \( \vdots \)  \( \vdots \) 
41  25  =AC41 * B$9 
=AD41 * C$9 
=AE41 * D$9 
…  =BA41 * Z$9 
=BB41 * AA$9 
(Contents)
We're past the worst of it now. This block just sums up each row from the plaintext probability block. Column CF is just there to make the next step a little easier.
CE  CF  

16  =SUM(BD16:CC16) 
=A16 
17  =SUM(BD17:CC17) 
=A17 
18  =SUM(BD18:CC18) 
=A18 
\( \vdots \)  \( \vdots \)  \( \vdots \) 
41  =SUM(BD41:CC41) 
=A41 
(Contents)
This is where we find the shift which corresponds to the proposed plaintext which looks most like English.
CE  CF  

41  =MAX(CE16:CE41) 
=VLOOKUP(CE43, CE16:CF41, 2, 0) 
Cell CE41 just finds the highest probability of all the plaintexts.
Cell CF41 uses the VLOOKUP
("vertical lookup") function to find the shift which generated this probability. It makes part of a spreadsheet behave like a Python dict
or, more generally, a lookup table.
The VLOOKUP
function takes four parameters:
=VLOOKUP(search_key, range, index, is_sorted)
The range
is the important part: it's our lookup table. As this is a vertical lookup, the table should be organised as a bunch of rows^{[3]}. In each row, there is a key and a value. We'll use the key to find the corresponding value. The range
parameter should encompass the whole lookup table, keys and values together.
The keys have to be in the first column of the range
, the lookup table as a whole. The index
tells the function which column in the table holds the value to return, counting from 1. This is for cases where the lookup table cells contain some intermediate values in columns between the keys and values we want.
The is_sorted
parameter should be True
or 1
if the keys are in sorted order. If the keys are sorted, searching is faster. In this case, the keys aren't sorted so the spreadsheet has to look through them all.
Finally, the search_key
is the value of the key we're looking up.
Putting that all together, the formula =VLOOKUP(CE43, CE16:CF41, 2, 0)
says. "A key is held in CE43. Look for that key in the cells in the range CE16:CE41; they're not sorted, by the way. When you find the key, return the value you find in the same row, but the next column over."
In this situation, what this means is, "look for the highest probability in the list of probabilites, and return the shift in the same row." In other words, find the best Caesar shift of our ciphertext.
(Contents)
Now we have the key for this ciphertext, we need to use it to recover the plaintext. To do that, we need to find the correspondences between the ciphertext letters and the plaintext letters. The shifted plaintext block does exactly that: for each key, that block shows the ciphertext letters which correspond to the plaintext letters. We know the key; the alphabet and counts block has the plaintext letters. We now treat the shifted plaintext block as a large lookup table, giving us the ciphertext letters we need.
Each column in the shifted plaintext block shows how a single plaintext letter is transformed with different shifts. Going down column B, you can see how a
is transformed into a
, b
, c
, … as the Caesar shift changes from 0, 1, 2, …
This is another vertical lookup. But this time, it's convenient to use the LOOKUP
function, as that allows us to more easily copy and paste the function from cell to cell^{[4]}.
LOOKUP
behaves similarly to VLOOKUP
. It can be used in a couple of ways. The way I'm using it here has this signature:
LOOKUP(search_key, search_range, result_range)
The function looks for the search_key
in the search_range
(either a row or a column). If it finds the key, it moves along the result_range
the same number of steps and returns the value it finds there. The result_range
can be either a row or a column, but need not be the same shape as the search range
.
In this case, the search_key
is the identified best shift, found in cell B4, itself a copy of CF43. The search_range
is the list of possible shifts in the cells A16:A41. The result_range
depends on the plaintext letter we're after. It's the column under the appropriate plaintext letter from the alphabet and counts block. For a
, the result_range
is B16:B41. For b
, the result range is C16:C41, and so on, up until AA16:AA41 for the column corresponding to z
.
Therefore, we can say the formula LOOKUP($B$4, $A$16:$A$41, B16:B41)
will give the ciphertext letter corresponding to plaintext a
for the best Caesar shift we've identified.
As we're using uppercase letters for ciphertext, we wrap the result in a call to UPPER()
to convert the ciphertext letter to uppercase.
That explains row 44: it's the ciphertext letters for each . Row 45 is just those plaintext letters, copied down from B10:AA10.
B  C  D  E  …  AA  

44  =UPPER(LOOKUP($B$4, $A$16:$A$41, B16:B41)) 
=UPPER(LOOKUP($B$4, $A$16:$A$41, C16:C41)) 
=UPPER(LOOKUP($B$4, $A$16:$A$41, D16:D41)) 
=UPPER(LOOKUP($B$4, $A$16:$A$41, E16:E41)) 
…  =UPPER(LOOKUP($B$4, $A$16:$A$41, AA16:AA41)) 
45  =B10 
=C10 
=D10 
=E10 
…  =AA10 
46  
47  =SUBSTITUTE($B$2, B44, B45) 
=SUBSTITUTE(B47, C44, C45) 
=SUBSTITUTE(C47, D44, D45) 
=SUBSTITUTE(D47, E44, E45) 
…  =SUBSTITUTE(Z47, AA44, AA45) 
Now for row 47.
We've already seen the SUBSTITUTE
command, back in the alphabet and counts block, when we used it to eliminate letters to count how many were in the ciphertext. Now we use it to replace the ciphertext letters with the plaintext letters we've found.
Cells B44:AA44 and B45:AA45 together show how each ciphertext letter is converted to a plaintext letter. The cells in B47:AA47 do that conversion. If we've identified a shift of 21, B44:B45 will say that ciphertext V
corresponds to plaintext a
^{[5]}. The formula in B47 converts all the V
s in the ciphertext into a
s. The formula in C47 takes the result of in B47 and then converts all the W
s into b
s. D47 takes the result in C47 and converts all the X
s into c
s, and so on until AA47 converts all the U
s into z
s. The result in AA47 is the final, extracted plaintext, and that result is copied up to the dashboard by cell B6.
And that's the result! That's how you can use a spreadsheet to automatically break Caesar ciphers. According to the post on evaluating langauge models, having 20 letters of ciphertext will allow you to break the code 98% of the time, and even 10 letters will allow you to get it right 82% of the time.
Post cover image by Tyler Easton.
In other words, enter the formula above into cell B11, then copy and past it into cells C11:AA11. When you enter the formula, note the $
symbols which signify an absolute cell location, not relative to the current cell. ↩︎
How to find the list of letter counts is descibed in the post on breaking ciphers ↩︎
Yes, there is a HLOOKUP
function for lookup tables organised as a bunch of columns, and a more general LOOKUP
function for more strangelyshaped lookup tables. ↩︎
If you wanted to use the VLOOKUP
function here, you could. What would you need to do to make it work? ↩︎
This is why the use of case to distinguish plaintext and ciphertext is important. If we just converted lowercase ciphertext v
to lowercase plaintext a
, when we came to convert lowercase ciphertext a
to lowercase ciphertext f
, the SUBSTITUTE
command wouldn't easily be able to distingush between the letter a
s which were still ciphertext and which were previouslyconverted plaintext. ↩︎