In the Breaking ciphers and language models post, I skipped over the definition of
sanitise, a function which takes some text and "sanitises" it. A sanitised text is just the letters in the original text, converted to lowercase.
| Original text | Sanitised text |
|---|---|
| For example, some text to sanitise | forexamplesometexttosanitise |
| A naïve café | anaivecafe |
This is useful when trying to break ciphers, as we can concentrate on the letters that make up the ciphertext, rather than being confused by any breaking of the text into different blocks of letters.
In this post, I describe sanitise in all its gory detail.
The simple case: ASCII characters
If we assume that the text we're sanitising contains only ASCII characters, the process is fairly simple. Python's
string library defines
ascii_letters, and we can use that to identify the characters we need:
```python
import string

cat = ''.join  # convenience function: join a sequence of characters into a string

def letters(text):
    return cat(l for l in text if l in string.ascii_letters)
```
The function is just a generator expression that filters the text, keeping only the letters. The convenience function
cat (an alias for ''.join) converts the sequence of letters back into a string.
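A quick check (repeating the definition so the snippet stands alone; cat here is just ''.join, as described above):

```python
import string

cat = ''.join  # join characters back into a string

def letters(text):
    return cat(l for l in text if l in string.ascii_letters)

print(letters("Hello, World! 123"))  # HelloWorld
```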
The joys of Unicode
The above code is all well and good, but what about characters like é, ä, or ø, which we want to convert to e, a, and o? Ascii only defines the 26 basic, unaccented letters a–z and A–Z.
Code points and combining characters
Luckily for us, Python 3 doesn't treat strings as sequences of Ascii characters. Instead, it does things properly, treating strings as sequences of Unicode code points. For example, the character é can be represented by the single Unicode code point
LATIN SMALL LETTER E WITH ACUTE (é, Unicode character U+00E9). But it can also be represented by the two characters
LATIN SMALL LETTER E (e, U+0065) and
COMBINING ACUTE ACCENT (◌́, U+0301); when displayed, these two characters will be combined into one glyph on the screen or page. Most of the time, it makes sense to combine letters and accents, but there are cases where you'll want to separate them, mainly when dealing with legacy systems.
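Python makes the difference easy to see: the two representations render identically, but they are different strings of different lengths:

```python
composed = '\u00e9'     # LATIN SMALL LETTER E WITH ACUTE
decomposed = 'e\u0301'  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

print(composed, decomposed)            # both display as é
print(composed == decomposed)          # False: different code point sequences
print(len(composed), len(decomposed))  # 1 2
```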
We'll use that to remove the accents from letters. The Unicode standard defines how to convert between the combined and decomposed forms, and that's implemented in Python by the
normalize function in the
unicodedata standard library module. The "composed" form is called
NFC and the "decomposed" form NFD.
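A quick sketch of round-tripping between the two forms with unicodedata.normalize:

```python
import unicodedata

e_acute = '\u00e9'  # composed form: one code point

nfd = unicodedata.normalize('NFD', e_acute)  # decompose: 'e' + combining accent
nfc = unicodedata.normalize('NFC', nfd)      # recompose into one code point

print([hex(ord(c)) for c in nfd])  # ['0x65', '0x301']
print(nfc == e_acute)              # True
```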
Encoding Unicode as bytes
The other thing to know about Unicode is the idea of an encoding. Unicode, inside the computer, is treated just as a sequence of abstract code points. But when it comes to writing data to a file, or sending it over a network, that sequence of code points must be encoded into a series of bytes for the physical hardware to manipulate. When those bytes are read back in, they're decoded back into abstract Unicode code points.
There are many ways of encoding Unicode data into bytes. Just as Ascii is a mapping from letters to bytes (so that the letter A is represented by the byte with value 65; in hex, 0x41; in binary, 0b01000001), UTF-8 is a commonly used encoding for Unicode. The nice thing about UTF-8 is that it's a strict superset of pure Ascii (the first 128 code points): any sequence of bytes which uses only the first 128 Ascii characters is also valid UTF-8.
Python knows about both the Ascii and UTF-8 encodings, and several others. Not every code point fits in every encoding: for instance, accented characters and combining accent marks can't be encoded into strict Ascii. We can tell Python what to do in this situation: by default, Python will raise an exception, but we can also tell it to silently ignore these errors.
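For instance, encoding an accented string to Ascii shows both behaviours:

```python
text = 'café'

# By default, encoding to Ascii raises UnicodeEncodeError on 'é'
try:
    text.encode('ascii')
except UnicodeEncodeError as e:
    print('cannot encode:', e.reason)

# With errors='ignore', the offending code points are silently dropped
print(text.encode('ascii', 'ignore'))  # b'caf'
```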
Armed with the above, we can see how to strip accents from letters.
- Convert the text into the decomposed NFD form, so that accents are separate code points from letters.
- Encode the text into Ascii, silently dropping errors; the accents will disappear.
- Decode the Ascii byte sequence, containing just the letters, back into a Unicode string.
And that's what we do:
```python
import unicodedata

def unaccent(text):
    """Remove all accents from letters.

    It does this by converting the unicode string to decomposed
    compatibility form, dropping all the combining accents, then
    re-encoding the bytes.
    """
    # unaccent_specials is a translation table for special characters,
    # defined elsewhere in the full source.
    translated_text = text.translate(unaccent_specials)
    return unicodedata.normalize('NFD', translated_text).\
        encode('ascii', 'ignore').\
        decode('utf-8')
```
Sanitising text is then just removing the accents and keeping just the letters from what's left, and converting everything to lowercase:
```python
def sanitise(text):
    """Remove all non-alphabetic characters and convert
    the text to lowercase."""
    return letters(unaccent(text)).lower()
```
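Putting it all together, here's a standalone sketch; it omits the unaccent_specials translation step, so plain Python can run it as-is:

```python
import string
import unicodedata

cat = ''.join

def letters(text):
    return cat(l for l in text if l in string.ascii_letters)

def unaccent(text):
    # simplified: decompose, then drop everything outside Ascii
    return unicodedata.normalize('NFD', text).\
        encode('ascii', 'ignore').\
        decode('utf-8')

def sanitise(text):
    return letters(unaccent(text)).lower()

print(sanitise('A naïve café'))  # anaivecafe
```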
The code here is on Github, in
- Nathan Reed's article on Unicode is an excellent guide to this in more depth.
- David C. Zentgraf's article on Unicode covers similar ground.
- Python's Unicode HOWTO
- Python's Unicode library
- Python's encoding method for strings
For our purposes at least: if you're working with text in languages other than English, where accented letters are distinct characters, you'll want to preserve them. The Swedish alphabet, for instance, has 29 letters, and a, ä, and å are all different. ↩︎
Why are there different encodings for Unicode? It's a long story. As Unicode code points fit in four bytes, the most obvious encoding is UTF-32, which uses four bytes for each code point. But this tends to waste a lot of space, as most code points, for text which uses the Latin alphabet, will lie in the first 256 code points. UTF-8 is a variable-length encoding: if the code point is less than 128, UTF-8 uses just one byte. Code points below 2048 use two bytes, and so on up to four bytes. ↩︎
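A quick illustration of those variable lengths (sample characters chosen arbitrarily):

```python
# One character from each UTF-8 length class
for ch in ('a', 'é', 'ア', '😀'):
    print(ch, len(ch.encode('utf-8')))
# a 1
# é 2
# ア 3
# 😀 4
```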