In this post I look at string encoding behavior in python 2 and how it differs in python 3.
If you're still using python 2, as many folks are, you may run into encoding issues when processing data.
Let’s say we have a file called translations.txt
that contains translations between English and Mandarin (from the Oxford dictionary):
"The book has 500 pages of text.","这本书正文有500页。"
"I'll send you a text as soon as I have any news.","我一得到任何消息,就立刻给你发短信。"
Now, say we try to read the file in python 2:
>>> path = "/home/my/path/to/translations.txt"
>>> with open(path) as fp:
...     data = fp.read()
...
>>> print(data)
"The book has 500 pages of text.","这本书正文有500页。"
"I'll send you a text as soon as I have any news.","我一得到任何消息,就立刻给你发短信。"
So far, so good. In fact, it seems that everything is working as expected.
Now, say we wish to encode our data in a specific coding system, like utf-8.
>>> data.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 35: ordinal not in range(128)
Oops! Something is amiss. Notice that the error message says the ASCII codec failed to decode a byte, even though we asked it to encode. The problem is that in python 2, the built-in open function reads the file as a byte string (str). When we call .encode() on a byte string, python first implicitly decodes it to unicode using the default ASCII codec, and that hidden decode step fails on the first non-ASCII byte.
A way to avoid this problem is to use the open function from the io module instead.
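Another way to see what is going on is to do the decode explicitly. The sketch below (using a hypothetical temporary file as a stand-in for translations.txt) reads the raw bytes in binary mode and decodes them by hand; this works the same way in python 2 and python 3:

```python
import io
import os
import tempfile

# Write a small UTF-8 sample file (stand-in for translations.txt).
path = os.path.join(tempfile.mkdtemp(), "translations.txt")
with io.open(path, "w", encoding="utf-8") as fp:
    fp.write(u'"The book has 500 pages of text.","\u8fd9\u672c\u4e66\u6b63\u6587\u6709500\u9875\u3002"\n')

# Read the raw bytes, then decode explicitly instead of relying on
# the implicit ASCII decode that .encode() would otherwise trigger.
with open(path, "rb") as fp:
    raw = fp.read()          # byte string (str in python 2, bytes in python 3)

text = raw.decode("utf-8")   # unicode object (str in python 3)
data = text.encode("utf-8")  # encodes cleanly, no UnicodeDecodeError
```

The explicit decode/encode round-trip gives back exactly the bytes we read from disk.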
>>> import io
>>> with io.open(path) as fp:
...     data = fp.read()
...
>>> print(data)
"The book has 500 pages of text.","这本书正文有500页。"
"I'll send you a text as soon as I have any news.","我一得到任何消息,就立刻给你发短信。"
>>> data.encode('utf-8')
'"The book has 500 pages of text.","\xe8\xbf\x99\xe6\x9c\xac\xe4\xb9\xa6\xe6\xad\xa3\xe6\x96\x87\xe6\x9c\x89500\xe9\xa1\xb5\xe3\x80\x82"\n"I\'ll send you a text as soon as I have any news.","\xe6\x88\x91\xe4\xb8\x80\xe5\xbe\x97\xe5\x88\xb0\xe4\xbb\xbb\xe4\xbd\x95\xe6\xb6\x88\xe6\x81\xaf\xef\xbc\x8c\xe5\xb0\xb1\xe7\xab\x8b\xe5\x88\xbb\xe7\xbb\x99\xe4\xbd\xa0\xe5\x8f\x91\xe7\x9f\xad\xe4\xbf\xa1\xe3\x80\x82"\n'
That works as expected! To understand these differences better, we can observe the types returned by each option. First, for the built-in open:
>>> with open(path) as fp:
...     data = fp.read()
...
>>> type(data)
<type 'str'>
And for the io.open function:
>>> with io.open(path) as fp:
...     data = fp.read()
...
>>> type(data)
<type 'unicode'>
The good news is that in python 3, the built-in str type behaves like unicode from python 2, so encoding strings in utf-8 or other coding systems is more straightforward. Similarly, the built-in open function behaves like io.open; for instance, both take an encoding parameter to define the encoding of the input file.
This is worth noting because when converting code from python 2 to python 3, string or text transformations won’t always yield the same result, as these are different types in the two versions of the language.
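To see the python 3 behavior concretely, here is a short sketch (python 3 only, with a hypothetical temporary file standing in for translations.txt) showing that str now holds unicode text and that the built-in open accepts an encoding parameter:

```python
import os
import tempfile

sample = '"The book has 500 pages of text.","\u8fd9\u672c\u4e66\u6b63\u6587\u6709500\u9875\u3002"\n'

path = os.path.join(tempfile.mkdtemp(), "translations.txt")

# In python 3 the built-in open takes an encoding parameter,
# just like io.open does in python 2.
with open(path, "w", encoding="utf-8") as fp:
    fp.write(sample)

with open(path, encoding="utf-8") as fp:
    data = fp.read()

print(type(data))             # str: unicode text, like python 2's unicode
encoded = data.encode("utf-8")
print(type(encoded))          # bytes: raw bytes, like python 2's str
```

Here encoding never raises a UnicodeDecodeError, because data is already text and no implicit ASCII decode is involved.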
Summary:
- Use io.open when reading non-ascii files in python 2 — it avoids assuming a limited character set (see SO question on unicode errors)
- Be careful when migrating code, because str conversion can be tricky, as str exhibits different behavior from python 2 to 3.