Thursday 9 a.m.–12:20 p.m.

Python for Data Analysis

Travis Oliphant, Peter Wang, Benjamin Zaitlen

Audience level: Intermediate
Category: Big Data

Description

Python has long played a role in analyzing large-scale data. From tightly knit supercomputers running MPI-based applications to heterogeneous clusters woven together with scripts, Python has made it easier to process data. This tutorial will cover the tried-and-true techniques and introduce new trends.

Abstract

Outline

Introduction

Tools For Analysis

NumPy
SciPy
Pandas
SciKit-learn
Disco
IPython Parallel

Example Problems

Large scale image processing
Text Analysis
Time-series analysis
Wiki Log File Analysis
Simple DNA Analysis

Emerging Trends in Data Analysis

Blaze: an open-source generalization of NumPy which maps easily onto distributed data-sets and allows distributed computations to be expressed at a high-level
Bokeh: web-based visualization of distributed data sets

Prerequisites for the tutorial

Students should be fairly comfortable with Python and have used NumPy/SciPy in the past. Students should also come prepared with working laptops and the latest version of Anaconda CE installed.

This tutorial is a crash course on Python for Data Analysis. We will explore a wide variety of domains and data types (text, images, time-series, log files, etc.) and demonstrate how Python and a number of accompanying modules can be used for effective scientific expression. In the first half of the course we will explore common tools and design patterns for single-machine computing. Starting with NumPy and SciPy, we will build a foundation for scientific computation. Next, we will explore two modules for data analysis that build upon NumPy and SciPy: Pandas and SciKit-Learn.

Pandas layers a DataFrame object on top of NumPy arrays. This makes table-like objects easy to express and manipulate. Pandas has been widely adopted as the Python library of choice for manipulating time-series data sets, and as such we will demonstrate a number of common analyses using Pandas.
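As a rough illustration of the kind of time-series manipulation a DataFrame allows, here is a minimal sketch. The data, the column name `visits`, and the chosen window sizes are made up for this example and are not taken from the tutorial materials.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one observation per day for two weeks.
days = pd.date_range("2013-03-01", periods=14, freq="D")
frame = pd.DataFrame({"visits": np.arange(14.0)}, index=days)

# Downsample to 7-day bins and average the values within each bin.
weekly_mean = frame["visits"].resample("7D").mean()

# Smooth the daily series with a 3-day moving average.
# The first two entries are NaN because the window is not yet full.
rolling_mean = frame["visits"].rolling(window=3).mean()

print(weekly_mean)
print(rolling_mean.head())
```

The date index is what lets `resample` and `rolling` work in terms of calendar time rather than row positions.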

We will also present a number of examples from the domain of Machine Learning. SciKit-Learn, like Pandas, is an extension of NumPy and SciPy, with routines commonly found in classical Machine Learning. Using SciKit-Learn we will develop a few common machine learning techniques and apply them to novel data sets.
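To give a feel for the SciKit-Learn workflow of fitting a model and then predicting, here is a minimal sketch using a nearest-neighbor classifier. The toy points and labels are invented for illustration, and the choice of classifier is an assumption rather than something the tutorial specifies.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: two well-separated clusters in 2-D.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y = np.array([0, 0, 1, 1])

# Fit a 1-nearest-neighbor classifier on NumPy arrays.
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)

# Classify two new points, one near each cluster.
predicted = model.predict(np.array([[0.1, 0.2], [4.8, 5.2]]))
print(predicted)
```

Most SciKit-Learn estimators follow the same `fit` and `predict` pattern, so swapping in a different model usually changes only the constructor line.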

In the second half of the course, we turn to Python for Big Data analysis and introduce two common distributed solutions: IPython Parallel and Disco. IPython Parallel enables parallel computation to be expressed entirely in Python. We will develop several routines commonly used for simultaneous calculations and analysis. Using Disco, a Python MapReduce framework, we introduce the concept of MapReduce and build up several MapReduce scripts that can process a variety of public data sets found on Amazon’s S3.
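The MapReduce idea itself can be sketched in plain Python on a single machine. The example below is a word count using only the standard library; it does not use the Disco API, and the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical names chosen for this illustration.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse the values for one key into a total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["the quick brown fox", "the lazy dog", "The end"])
print(counts)
```

In a distributed framework the map and reduce functions run on many machines and the shuffle moves data between them; the structure of the user-written functions stays much the same.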

In the final 30 minutes of the course we will introduce the next generation of open-source tools for distributed data processing that we have been developing at Continuum. These tools let you map onto large distributed data sets and express many calculations as easily as in Pandas and NumPy, while allowing the data to be out-of-core and spread across multiple machines.