Strings and unicode in Python

Monday, Sep 20 2010

This post will be about strings and encodings in Python, with an eye to Python 3 migration. Hopefully, you will find some useful rules of thumb to navigate these sometimes perplexing questions.

'String' types

What you commonly call a string can be two different things:

Now this is extremely rough and even incorrect (Unicode is just one standard; there are non-Unicode encodings), but the aim is to get a working mental model. The important thing is that, seen from this very high altitude, both these concepts exist in both versions of Python, but they have changed names.
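As a rough illustration of the two concepts, here is a minimal sketch using Python 3 naming, where text is `str` and raw bytes are `bytes` (in Python 2 the corresponding types were called `unicode` and `str`). The sample word is an arbitrary choice for illustration.

```python
# Python 3 naming: str holds text (code points), bytes holds raw 8-bit values.
text = "caf\u00e9"                # "café" as a text string: 4 characters
data = text.encode("utf-8")       # the same word as UTF-8 bytes

print(type(text).__name__, len(text))   # str 4
print(type(data).__name__, len(data))   # bytes 5 (the accented letter takes two bytes)
```

The differing lengths are the quickest way to see that the two objects are not the same thing, even when they print alike.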

Encoding/decoding

The practical problems arise in the encoding/decoding dance.
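To make the dance concrete, here is a small Python 3 sketch: `encode` turns text into bytes, `decode` turns bytes back into text, and decoding with the wrong codec either fails or silently yields the wrong characters. The sample string is an assumption chosen for illustration.

```python
text = "na\u00efve"                      # "naïve" as a text string (str)

raw = text.encode("utf-8")               # str -> bytes
assert raw.decode("utf-8") == text       # bytes -> str, the round trip is lossless

# Decoding with the wrong codec does not always raise: latin-1 accepts any
# byte, so the result is silently wrong text (mojibake).
garbled = raw.decode("latin-1")
print(garbled != text)                   # True

# A stricter codec raises instead of guessing.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print("cannot decode as ASCII:", exc.reason)
```

The silent case is usually the more dangerous one, because nothing signals that the text has been corrupted.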

Examples

>>> type(text.decode('utf-8'))
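For comparison, a minimal Python 3 sketch of the same kind of call: there, `decode` exists only on `bytes` and returns a `str`. The sample bytes are a hypothetical value chosen for illustration.

```python
data = b"caf\xc3\xa9"                 # UTF-8 bytes (hypothetical sample value)

decoded = data.decode("utf-8")
print(type(decoded))                  # <class 'str'>

# In Python 3 only bytes can be decoded; str has encode but no decode.
print(hasattr(decoded, "decode"))     # False
print(hasattr(decoded, "encode"))     # True
```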

Conclusion

Postscript: some more encodings

You might wonder what that 'utf-8' parameter means. Well, as I said at the beginning, there are various ways of encoding characters. Most of the time in new programs you will want to use utf-8, but you might also need to handle other encodings. Some are defined in Unicode, most are not. It really does not matter from a practical standpoint. For instance, if you work with files originating from Western Europe, you will often be confronted with ISO-8859-1 (commonly named latin-1) and Windows-1252 (which, characteristically, is almost the same). These are not defined in the Unicode standard, but they are handled in exactly the same way in Python programs. That is, replace 'utf-8' with the appropriate string you can obtain from this handy reference.
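As a sketch of swapping the codec name, the snippet below decodes the same bytes with different encodings (Python 3 syntax). The byte values are arbitrary choices for illustration; `latin-1` and `cp1252` are the names Python's standard codecs accept for ISO-8859-1 and Windows-1252.

```python
raw = b"\xe9"                        # one byte, as might appear in a Western European file

print(raw.decode("latin-1"))         # é
print(raw.decode("cp1252"))          # é (the two encodings agree here)

# They differ in the 0x80-0x9F range: cp1252 puts printable characters
# there, while latin-1 maps those bytes to control characters.
euro = b"\x80"
print(euro.decode("cp1252") == "\u20ac")   # True: the euro sign
print(euro.decode("latin-1") == "\x80")    # True: a control character

# The same single byte is not valid UTF-8 on its own.
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```

This is why "almost the same" matters in practice: a file containing a euro sign or curly quotes decodes differently under the two names.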