Presentation: Why you should use Python 3 for text processing

Saturday 12:10 p.m.–12:55 p.m.

David Mertz

Audience level: Intermediate
Category: Core Python (Language, Stdlib)

Description

Python is a great language for text processing. Each new version of Python--but especially the 3.x series--has enhanced this strength of the language. String (and byte) objects have grown some handy methods and some built-in functions have improved or been added. More importantly, refinements and additions have been made to the standard library to cover the most common tasks in text processing.

Abstract

This talk, by its nature, will be a somewhat impressionistic review of nice-to-have improvements to text processing that have come to Python. Some of them arrived over the long stretch since my book on the topic, but the emphasis is on 3.x features.

Improvements to collections help with many things, but come up particularly often as nice ways to do text processing tasks, e.g. namedtuple, Counter, OrderedDict, defaultdict.
Lots of improvements to, and rationalization of, the email package (mailbox too).
Unicode handling--sometimes an important aspect of text processing--remains unwieldy, but has at least entered the domain of "possible to do right" (usually).
Improvements to codecs.
Relatively old but continues to improve: textwrap.
ElementTree as standard library high-level option for XML handling (with various tweaks in 3.x version).
str.format(): technically backported to 2.x as well, but a good option that wasn't available in older Python versions.
Miscellaneous improvements to datetime.
logging has become good enough that it should be a standard tool for logging (also backported generally).
hashlib
csv improvements.
Not only in 3.x, but json as a standard module is wonderful for serialization and data sharing.
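Ancient but little-known tip: str.startswith() accepts a tuple of prefixes, e.g. s.startswith(("list", "of", "values")). Note that it must be a tuple; passing a list raises TypeError.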
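
As a rough illustration of a few items from the list above, here is a minimal sketch that combines Counter, defaultdict, namedtuple, str.format(), textwrap and the str.startswith() tip. The sample sentence and the names (text, WordStat, by_letter and so on) are invented for this example and are not taken from the talk itself.

```python
import textwrap
from collections import Counter, defaultdict, namedtuple

# Sample input, made up for illustration only.
text = "the quick brown fox jumps over the lazy dog while the cat sleeps"
words = text.split()

# Counter: tally word frequencies in one call.
counts = Counter(words)
top_word, top_count = counts.most_common(1)[0]

# defaultdict: group words by first letter without checking for missing keys.
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)

# namedtuple: a lightweight record with named fields (hypothetical WordStat).
WordStat = namedtuple("WordStat", ["word", "count"])
stat = WordStat(top_word, top_count)

# str.format(): attribute access and !r conversion inside the template.
summary = "{0.word!r} appears {0.count} times".format(stat)

# textwrap: reflow the text to a maximum line width.
wrapped = textwrap.fill(text, width=30)

# str.startswith() with a tuple of prefixes (a list would raise TypeError).
t_or_s = [word for word in words if word.startswith(("t", "s"))]

if __name__ == "__main__":
    print(summary)
    print(wrapped)
    print(t_or_s)
```

The grouping step shows why defaultdict tends to appear in text processing code: the loop body stays a single line because missing keys are created on demand.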