Keyword substitution ciphers

    In this post, I look at keyword substitution ciphers, and show a couple of Python techniques that can make things easier.

    The cipher

    The cipher itself is relatively simple. It's still a monoalphabetic substitution cipher, which means each plaintext letter turns into a specific ciphertext letter, and back again. While the Caesar cipher has the ciphertext alphabet shifted by a few spaces, the keyword cipher uses a keyword to scramble the ciphertext alphabet.

    It's easier to see with an example. Here we use the keyword remember to scramble the ciphertext alphabet.

    Plaintext A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    Ciphertext r e m b a c d f g h i j k l n o p q s t u v w x y z

    As you can see, duplicate letters in the keyword are removed, so remember becomes remb. The alphabet is given after the keyword, but missing out the letters which appeared in the keyword, so the alphabet continues 'acdf…'.

    Using the cipher is much the same as the Caesar and affine ciphers: you go through the plaintext letter by letter, looking up the ciphertext equivalent of each letter. From the example above, the message this is a test message will encipher to tfgs gs r tast kassrda.

    Variants of keyword alphabets

    However, before we get into the details of how to do the keyword cipher, we have to realise that there's more than one way to make the ciphertext alphabet. The example above starts the alphabet from a. But as you can see, that makes the end of the ciphertext alphabet almost the same as the plaintext alphabet. That measn that words with a lot of letters at the end of the alphabet, such as heavyweight, are enciphered to be something very similar, such as farvywagdft.

    To keep the latter half of the cipher alphabet different from the plaintext, you need a different way to extend the alphabet from the keyword. One approach is to continue the alphabet from the next letter after the final keyword letter. With the keyword remember, that means using the c which follows from the b and gives this alphabet:

    Plaintext A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    Ciphertext r e m b c d f g h i j k l n o p q s t u v w x y z a

    Another approach is to find the largest letter in the keyword and continue from that onwards. With remember, that mean continuing from the r to give this ciphertext alphabet:

    Plaintext A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    Ciphertext r e m b s t u v w x y z a c d f g h i j k l n o p q

    When it comes to Python code, we could keep track of which variant we're using with integer values, such as using 1 for the "alphabet from a" option and 2 for the "alphabet from last" option. The trouble with that approach is that the numbers 1, 2, and 3 are hardly clear when it comes to understanding the code.

    A better approach is to declare some constants, such as FROM_A or FROM_LAST and use those constants in the code. It has the advantage of more clearly describing what we're doing, as well as offering some degree of syntax checking: if we type FORM_A instead of FROM_A, the Python interpreter will let us know.

    However, Python 3.4 and later have the enum library available. This is explicitly designed for situation like this, where we want to enumerate a fixed set of options.

    We can use the enum library like this, to define the types of keyword alphabet generation:

    from enum import Enum
    
    class KeywordWrapAlphabet(Enum):
        from_a = 1
        from_last = 2
        from_largest = 3
    

    We can use these values like this:

    if wrap_alphabet == KeywordWrapAlphabet.from_a:
        cipher_alphabet = ...
    elif wrap_alphabet == KeywordWrapAlphabet.from_last:
        cipher_alphabet = ...
    

    We can also iterate over all the possible values:

    for wrap_alphabet in KeywordWrapAlphabet:
        ...
    

    Generating ciphertext alphabets

    A key part of generating ciphertext alphabets isn't so much generating the alphabets as it is deduplicating the 'alphabets' we generate. After all, if we were talking to a person, not a machine, the high-level way we'd describe generating a ciphertext alphabet could well be:

    Write out the keyword, followed by the whole alphabet. But don't write a letter if you've already written it.

    For example, the ciphertext alphabet from the keyword remember is rembacdfghijklnopqstuvwxyz. If we write out the word remember followed by the alphabet, we get remEMbERaBcdEfghijklMnopqRstuvwxyz where I've capitalised the second or subsequent occurrence of a letter.

    This gives us a skeleton of the function that generates keyword ciphertext alphabets. For the "from a" version of the alphabet generation, it's just what I described above:

    def keyword_cipher_alphabet_of(word, wrap_alphabet=KeywordWrapAlphabet.from_a):
        if wrap_alphabet == KeywordWrapAlphabet.from_a:
            cipher_alphabet = deduplicate(word + string.ascii_lowercase)
        elif wrap_alphabet == KeywordWrapAlphabet.from_last:
            cipher_alphabet = ...
        ...
        return cipher_alphabet
    

    What does deduplicate() look like? The obvious approach is walk along the source string a character at a time, building the deduplicated version as we go. Each character is added the deduplicated version if it's not already there.

    That gives us this:

    def deduplicate(text):
        deduplicated = ''
        for c in text:
            if c not in deduplicated:
                deduplicated += c
        return deduplicated
    

    However, there's a different way using the OrderedDict class from the collections standard library. The basic dict maintains a set of keys, so duplicate keys are not allowed. The OrderedDict is similar, but it remembers the order that keys were added.

    We can deduplicate some text by creating an OrderedDict from the letters: each letter will be a key in the OrderedDict, and the value will be None. This removes duplicates while keeping the order the same. We turn that back into a list, which gives just the keys in order.

    def deduplicate(text):
        return list(collections.OrderedDict.fromkeys(text))
    

    What about the other two variants of ciphtext alphabet generation, the "last letter" and "largest letter"? Now we have deduplicate, both of these variants become easy.

    For the "last letter" variant, we find the last letter in the (deduplicated) keyword. The "rest of the alphabet" is the alphabet from that letter onwards, and then wrapping around to abc.... That means the whole ciphertext alphabet is deduplicated keyword + alphabet from letter on + whole alphabet, with the whole bundle then deduplicated.

    The "largest letter" variant is exactly the same, but rather than picking out the last letter of the keyword, we pick out the largest letter. That leads to an implmementation of keyword_cipher_alphabet_of which looks like this:

    def keyword_cipher_alphabet_of(word, wrap_alphabet=KeywordWrapAlphabet.from_a):
        if wrap_alphabet == KeywordWrapAlphabet.from_a:
            cipher_alphabet = cat(deduplicate(sanitise(word) + string.ascii_lowercase))
        else:
            if wrap_alphabet == KeywordWrapAlphabet.from_last:
                last_keyword_letter = deduplicate(sanitise(keyword))[-1]
            else:
                last_keyword_letter = sorted(sanitise(keyword))[-1]
            alphabet_from_position = string.ascii_lowercase.find(
                last_keyword_letter) + 1
            cipher_alphabet = cat(
                deduplicate(sanitise(keyword) + 
                            string.ascii_lowercase[alphabet_from_position:] + 
                            string.ascii_lowercase))
        return cipher_alphabet
    

    Note the occasional cat and sanitise scattered around to ensure we accept any word and always return a string.

    Enciphering and deciphering

    Now we have the ciphertext alphabet, we can encipher and decipher in much the same way as we did for the Caesar cipher: to encipher the message, encipher one character at a time; to encipher a character, look up the corresonponding letter in the cipher alphabet.

    How do we do the "look up the corresponding letter" bit? We could do something involving str.find() to find the position of the letter in the string, then look up the letter at that position in the ciphertext alphabet.

    But what we're really doing is defining a mapping from a plaintext letter to a ciphertext letter. That suggests we should be using a Python dict: the keys are the plaintext letters and the values are the ciphertext letters. What we need to do is walk along the standard ordinary plaintext alphabet (i.e. string.ascii_lowercase) in lockstep with the ciphertext alphabet, and use that to define the dict. We want to do something like

    cipher_translation = {}
    for p in string.ascii_lowercase and c in ciphertext_alphabet:
        cipher_translation[p] = c
    

    The bad news is that for construction is illegal in Python (and most other programming languages). The good news is the function zip() does what we want.

    When zip() is given a bunch of iterables, it returns a sequence of tuples, where each tuple takes the corresponding parts of each iterable. It's easier to see with an example:

    >>> list(zip('abcd', 'wxyz', [1, 2, 3, 4]))
    [('a', 'w', 1), ('b', 'x', 2), ('c', 'y', 3), ('d', 'z', 4)]
    

    As you can see, we passed in three iteratbles of four items each, and got back a list of four 3-tuples, each with one element from the three arguments.

    (We need the list() because zip() is lazy and only creates the actual tuples when they're needed. The list() forces that creation for us.)

    We can now write our keyword cipher alphabet mapping in legal Python:

    cipher_translation = {}
    for p, c in zip(string.ascii_lowercase, cipher_alphabet):
        cipher_translation[p] = c
    

    From there, we can build the whole keyword cipher function:

    def keyword_encipher(message, keyword, wrap_alphabet=KeywordWrapAlphabet.from_a):
        cipher_alphabet = keyword_cipher_alphabet_of(keyword, wrap_alphabet=wrap_alphabet)
        cipher_translation = {}
        for p, c in zip(string.ascii_lowercase, cipher_alphabet):
            cipher_translation[p] = c
        ciphertext = ''
        for letter in message:
            if letter in cipher_translation:
                ciphertext += cipher_translation[letter]
            else:
                ciphertext += letter
        return ciphertext
    

    The keyword_decipher function is just about the same, but with a different cipher alphabet.

    def keyword_decipher(message, keyword, wrap_alphabet=KeywordWrapAlphabet.from_a):
        cipher_alphabet = keyword_cipher_alphabet_of(keyword, wrap_alphabet=wrap_alphabet)
        cipher_translation = {}
        for p, c in zip(string.ascii_lowercase, cipher_alphabet):
            cipher_translation[c] = p   # This line is different
        plaintext = ''
        for letter in message:
            if letter in cipher_translation:
                plaintext += cipher_translation[letter]
            else:
                plaintext += letter
        return plaintext
    

    Refactoring

    There are a couple of ways to refactor these functions. One is to replace the creation of cipher_translation with a dictionary comprehension:

    cipher_translation = {c: p for p, c in zip(string.ascii_lowercase, ciphertext_alphabet
    

    (I mentioned dictonary comprehensions in the post on language models.)

    However, Python has a better way to translate text from one form to another. str.translate() will do all the conversion for us in one method call: message.translate(cipher_translation). Letters (or other characters) in the cipher_translation table will be converted; everything else will be left as it is.

    To create the cipher_translation table we use the str.maketrans() method. It can be called in several ways, but the most convenient for us is to pass in two strings: the plaintext alphabet and the ciphertext alphabet:

    def keyword_encipher(message, keyword, wrap_alphabet=KeywordWrapAlphabet.from_a):
        cipher_alphabet = keyword_cipher_alphabet_of(keyword, wrap_alphabet)
        cipher_translation = ''.maketrans(string.ascii_lowercase, cipher_alphabet)
        return unaccent(message).lower().translate(cipher_translation)
    

    Deciphering is the same, but just swapping the arguments in maketrans():

    def keyword_decipher(message, keyword, wrap_alphabet=KeywordWrapAlphabet.from_a):
        cipher_alphabet = keyword_cipher_alphabet_of(keyword, wrap_alphabet)
        cipher_translation = ''.maketrans(cipher_alphabet, string.ascii_lowercase)
        return message.lower().translate(cipher_translation)
    

    This has the advantage of being both easier to understand. It's also slightly faster (about 20%), which is useful when it comes to automatically breaking these ciphers.

    Code

    The code is in cipher/keyword_cipher.py on Github.

    Acknowlegements

    Neil Smith

    Read more posts by this author.