Change the future

Wednesday 1:20 p.m.–4:40 p.m.

Python beyond the CPU

Andy Terrel, Travis Oliphant, Mark Florisson

Audience level: Experienced
Category: High Performance Computing


Accelerators are the hottest tools in high performance computing, but they are applicable to all fields. We present how to use Python's remarkable ability to abstract away low-level boilerplate code, turning accelerators from an exotic curiosity into a daily tool.


The world of computing, from mobile apps to high-end servers, is rapidly shifting away from beefy x86 processors. Perhaps the most disruptive of these changes are accelerators, e.g. GPUs and Intel MICs. Accelerators have become a mainstream technology useful to everyone, not just HPC practitioners. Additionally, learning how to utilize accelerators will also teach you how to use x86 processors better.

In this tutorial, we present several methods, drawn from practical experience, for utilizing accelerators in Python code. You will learn how to take basic algorithms, both structured loops and dynamic iterations, and turn them into algorithms appropriate for highly parallel architectures. The process is iterative, stressing fine-grained profiling along the way to demonstrate the effects of each transformation. The basic outline is:

  • Current and coming architectures
    • Vectorization
    • Parallel Instruction Execution
  • A survey of Python projects utilizing accelerators
    • Numba
    • Theano
    • Copperhead
    • Py{CUDA,OpenCL}
    • Cython OpenMP extension
  • The CUDA and OpenCL languages
  • Structured algorithms on CUDA and GPUs
  • Transforming Sorting and Graph algorithms
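As a minimal illustration of the vectorization topic in the outline above (our own sketch, not part of the tutorial materials), compare an explicit Python loop with the equivalent NumPy array expression, which dispatches to compiled and often SIMD-vectorized inner loops:

```python
import numpy as np

def saxpy_loop(a, x, y):
    # Scalar loop: one multiply-add per iteration, interpreted by CPython.
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out

def saxpy_vectorized(a, x, y):
    # The same computation as a single array expression; NumPy runs it
    # in a compiled inner loop over the whole array at once.
    return a * x + y

x = np.arange(4, dtype=np.float64)   # [0. 1. 2. 3.]
y = np.ones(4)
print(saxpy_vectorized(2.0, x, y))   # [1. 3. 5. 7.]
```

The same mental shift — expressing the whole-array operation rather than the per-element loop — is what tools like Numba, Theano, and Copperhead exploit to target accelerators.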


This material is targeted at the experienced programmer who wants to try out a new piece of technology. We aim to encourage beginners with examples everyone should understand, but also to challenge experts with deep dives into the interactions of the machine, interpreter, and code. The examples (e.g. Mandelbrot viewers, image processing, sorting, and graph processing) all start with a basic algorithm and iteratively add parallelism via multicore CPUs or GPUs.
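To make the iterative approach concrete, here is a sketch (ours, not the tutorial's code) of the scalar Mandelbrot kernel that such examples typically start from, before any parallelization is applied:

```python
def mandel(c_re, c_im, max_iter=100):
    # Count iterations of z -> z**2 + c before |z| exceeds 2
    # (i.e. |z|**2 exceeds 4); points in the set never escape.
    z_re = z_im = 0.0
    for i in range(max_iter):
        if z_re * z_re + z_im * z_im > 4.0:
            return i
        z_re, z_im = (z_re * z_re - z_im * z_im + c_re,
                      2.0 * z_re * z_im + c_im)
    return max_iter

print(mandel(0.0, 0.0))  # 100: the origin is in the set
print(mandel(2.0, 2.0))  # 1: escapes almost immediately
```

Because every pixel's iteration count is independent of every other pixel's, this kernel is embarrassingly parallel — the tutorial's transformations map exactly this kind of loop onto multicore CPUs and GPUs.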

We will point out that not all code runs faster on an accelerator. While accelerators offer roughly ten times the compute power of a CPU, they typically only triple the memory bandwidth, and this ratio matters even more when the accelerator is connected via a PCI bus. Algorithms that needlessly traverse memory in random patterns will not be able to exploit this model. For this reason there will be a strong focus on transforming your algorithm into a form that runs well on an accelerator. For most people, we expect the ability to treat your CPU as an accelerator via PyOpenCL will be the feature you take home.
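The compute-versus-bandwidth trade-off above can be quantified with a back-of-the-envelope roofline estimate; the device numbers below are illustrative placeholders, not measurements of any particular accelerator:

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    # Roofline model: sustained performance is capped either by raw
    # compute or by how fast memory can feed the arithmetic units.
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# A kernel doing 1 flop per 8-byte double it reads has an arithmetic
# intensity of 0.125 flops/byte. On a hypothetical 1 TFLOP/s device
# with 150 GB/s of memory bandwidth, it is badly bandwidth-bound:
print(attainable_gflops(1000.0, 150.0, 0.125))  # 18.75 GFLOP/s
```

Raising arithmetic intensity — reusing each byte fetched for more floating-point work — is exactly the kind of algorithmic transformation the tutorial emphasizes.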


Participants will be provided logins to machines with the appropriate hardware, or they can use any multicore CPU or any CUDA- or OpenCL-enabled GPU.