Sanitising text

    In the Breaking ciphers and language models post, I skipped over the definition of sanitise, a function which takes some text and "sanitises" it. A sanitised text is just the letters in the original text, converted to lowercase.

    For example,

    Original                             Sanitised
    hello there                          hellothere
    Hello There!                         hellothere
    For example, some text to sanitise   forexamplesometexttosanitise
    A naïve café                         anaivecafe

    This is useful when trying to break ciphers, as we can concentrate on the letters that make up the ciphertext, rather than being confused by how the text is broken into different blocks of letters.

    In this post, I describe sanitise in all its gory detail.

    The simple case: ASCII characters

    If we assume that the text we're sanitising contains only ASCII characters, the process is fairly simple. Python's string library defines ascii_letters, and we can use that to identify the characters we need:

    import string
    
    # cat joins a sequence of characters back into a single string;
    # defining it as ''.join here keeps the snippet self-contained.
    cat = ''.join
    
    def letters(text):
        return cat(l for l in text
                   if l in string.ascii_letters)
    

    The function is just a comprehension that filters the text, keeping only the letters. The convenience function cat (an alias for ''.join, as defined above) is used to join those letters back into a string.
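
    For instance, here's a quick check at the Python prompt (note that letters keeps the original case; lowercasing happens later, in sanitise):

    >>> letters('Hello There!')
    'HelloThere'
    >>> letters('For example, some text to sanitise')
    'Forexamplesometexttosanitise'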

    The joys of Unicode

    The above code is all well and good, but what about characters like é, ä, or ø, which we want to convert to e, a, and o?[1] Ascii only defines the 26 basic, unaccented letters a–z and A–Z.

    Code points and combining characters

    Luckily for us, Python 3 doesn't treat strings as sequences of Ascii characters. Instead, it does things properly, treating strings as sequences of Unicode code points. Therefore, the character é can be represented by the single code point
    LATIN SMALL LETTER E WITH ACUTE (é, Unicode character U+00E9). But it can also be represented by the two characters LATIN SMALL LETTER E (e, U+0065) and COMBINING ACUTE ACCENT (◌́, U+0301); when displayed, these two characters are combined into one glyph on the screen or page. Most of the time, it makes sense to combine letters and accents, but there are cases where you'll want to separate them, mainly when dealing with legacy systems.
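
    A quick sketch of the two representations at the Python prompt: they display the same, but have different lengths and don't compare equal.

    >>> composed = '\u00e9'      # é as a single code point
    >>> decomposed = 'e\u0301'   # e followed by a combining acute accent
    >>> composed, decomposed
    ('é', 'é')
    >>> len(composed), len(decomposed)
    (1, 2)
    >>> composed == decomposed
    False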

    We'll use that to remove the accents from letters. The Unicode standard defines how to convert between the two forms, and that's implemented in Python by the normalize function in the unicodedata standard library module. The "composed" form is called NFC and the "decomposed" form NFD.
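
    For example, a quick sketch of normalising between the two forms:

    >>> import unicodedata
    >>> decomposed = unicodedata.normalize('NFD', '\u00e9')   # é as one code point
    >>> [hex(ord(c)) for c in decomposed]
    ['0x65', '0x301']
    >>> unicodedata.normalize('NFC', decomposed) == '\u00e9'
    True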

    Encoding Unicode as bytes

    The other thing to know about Unicode is the idea of an encoding. Unicode, inside the computer, is treated just as a sequence of abstract code points. But when it comes to writing data to a file, or sending it over a network, that sequence of code points must be encoded into a series of bytes for the physical hardware to manipulate. When those bytes are read back in, they're decoded back into abstract Unicode code points.
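
    In Python, that's the str.encode and bytes.decode pair; a quick sketch of the round trip:

    >>> 'café'.encode('utf-8')
    b'caf\xc3\xa9'
    >>> b'caf\xc3\xa9'.decode('utf-8')
    'café'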

    There are many ways of encoding Unicode data into bytes[2]. Just as Ascii is a mapping from letters to bytes (so that the letter A is represented by the byte with value 65; in hex, 0x41; in binary, 0b01000001), UTF-8 is a commonly used encoding for Unicode. The nice thing about UTF-8 is that it's a strict superset of pure Ascii (the first 128 code points only): any text that uses only those 128 Ascii characters is also valid UTF-8.
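
    A quick check of that superset property: text that uses only the basic Ascii characters encodes to exactly the same bytes either way.

    >>> 'hello there'.encode('ascii') == 'hello there'.encode('utf-8')
    True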

    Python knows about both Ascii and UTF-8 encodings, and several others. Not every code point fits in every encoding: for instance, accented characters and combining accent marks can't be encoded into strict Ascii. We can tell Python what to do in this situation: by default, Python will raise an exception, but we can also tell it to silently ignore these errors.
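
    That behaviour is controlled by the errors argument to encode; a quick sketch of both outcomes:

    >>> 'café'.encode('ascii')
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 3: ordinal not in range(128)
    >>> 'café'.encode('ascii', 'ignore')
    b'caf'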

    Removing accents

    Armed with the above, we can see how to strip accents from letters.

    1. Convert the text into the decomposed NFD form, so that accents are separate code points from letters.
    2. Encode the text into Ascii, silently dropping errors; the accents will disappear.
    3. Decode the Ascii byte sequence, containing just the letters, back into a Unicode string.

    And that's what we do:

    import unicodedata
    
    # unaccent_specials maps special cases that decomposition alone can't handle;
    # the real table lives in the supporting utilities module. This single entry
    # is just an illustrative stand-in so the snippet runs on its own.
    unaccent_specials = str.maketrans({'ß': 'ss'})
    
    def unaccent(text):
        """Remove all accents from letters.
        It does this by converting the Unicode string to the decomposed (NFD)
        form, dropping all the combining accents, then re-encoding the bytes.
        """
        translated_text = text.translate(unaccent_specials)
        return unicodedata.normalize('NFD', translated_text).\
            encode('ascii', 'ignore').\
            decode('utf-8')
    

    Sanitising text is then just a matter of removing the accents, keeping only the letters from what's left, and converting everything to lowercase:

    def sanitise(text):
        """Remove all non-alphabetic characters and convert the text to lowercase"""
        return letters(unaccent(text)).lower()
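
    Running a couple of the examples from the table at the top through sanitise gives the expected results:

    >>> sanitise('Hello There!')
    'hellothere'
    >>> sanitise('A naïve café')
    'anaivecafe'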
    

    Code

    The code here is on Github, in support/utilities.py.


    Acknowledgements

    Post photo by Zenhappy from Unsplash.


    1. For our purposes at least: if you're working with text in languages other than English, where the accented letters are distinct characters, you'll want to preserve them. For instance, the Swedish alphabet has 29 letters, and a, ä, and å are all different. ↩︎

    2. Why are there different encodings for Unicode? It's a long story. As Unicode code points fit in four bytes, the most obvious encoding is UTF-32, which uses four bytes for each code point. But this tends to waste a lot of space, as for text which uses the Latin alphabet, most code points will lie in the first 256. UTF-8 is a variable-length encoding: if the code point is less than 128, UTF-8 uses just one byte. Code points below 2048 use two bytes, and so on up to four bytes. ↩︎
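
    A quick illustration of those variable lengths:

    >>> [len(c.encode('utf-8')) for c in 'Aé€😀']
    [1, 2, 3, 4]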
