Saturday 12:10 p.m.–12:40 p.m.

Building an image processing pipeline with Python

Franck Chastagnol

Audience level:
Distributed Computing


This talk covers how to build a highly scalable image processing pipeline using Python and third-party open-source libraries and tools such as OpenCV, NumPy, Tesseract, ImageMagick, Tornado, Nginx and MySQL.


As an example, we’ll use the Python-based pipeline we built, which processes hundreds of thousands of receipt pictures sent by our users from their mobile phones. A distributed architecture processes the images, extracts the product-level purchase data, and stores it in our back-end storage for handling by our downstream business layer.

After an overview of the architecture of the system, we will dive into the specifics of each area:

  • Building a reliable and high performance image upload flow for a mobile application using Nginx, Tornado and MySQL
  • Implementing a processing pipeline resilient to node failure and capable of running multi-pass algorithms
  • Using OpenCV, NumPy and ImageMagick for image filtering and image analysis
  • Training and using Tesseract as an Optical Character Recognition engine
  • Why regular expressions don’t work and how to use fuzzy matching to extract structured text data
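
To illustrate the last point, here is a minimal sketch of fuzzy matching on noisy OCR output. It is not the pipeline described in the talk: it uses the standard library's difflib, and the catalog entries, the match_product helper and the 0.7 cutoff are all hypothetical values chosen for the example.

```python
import difflib
import re

# Hypothetical product catalog; a real one would come from a database.
CATALOG = ["ORGANIC WHOLE MILK", "GREEK YOGURT", "WHEAT BREAD"]


def match_product(ocr_line, catalog=CATALOG, cutoff=0.7):
    """Return the catalog entry most similar to a noisy OCR line, or None.

    The cutoff (0.0 to 1.0) is an assumed threshold on difflib's
    similarity ratio; lines scoring below it are treated as non-products.
    """
    candidates = difflib.get_close_matches(
        ocr_line.upper(), catalog, n=1, cutoff=cutoff
    )
    return candidates[0] if candidates else None


# OCR engines often confuse O/0 and I/1, so an exact pattern misses the line.
noisy_line = "0RGANIC WH0LE M1LK"
exact_hit = re.search(r"ORGANIC WHOLE MILK", noisy_line)  # no match
fuzzy_hit = match_product(noisy_line)  # still resolves to a catalog entry
```

An exact pattern fails on a single misread character, while a similarity threshold tolerates a few substitutions. The trade-off is that the cutoff needs tuning: too low and unrelated lines such as totals are matched to products, too high and heavily degraded lines are dropped.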

We will then open up for a Q&A session with the audience.