PyCon 2016 in Portland, Or

Saturday 1:20 p.m.–4:40 p.m.

Python for Social Scientists: Cleaning and Prepping Data

Renee Chu

Audience level: Novice
Category: Best Practices & Patterns

Description

If you're learning to code, working with data is a great way to put your new skills into practice. However, before you can do analysis or visualization, you need a cleaned, prepped data set. This tutorial uses Python basics to unify data sets from disparate sources. It also shows you how to write your programs as modules so you can re-use them in future projects.

Abstract

* Intro
  * Who you are, why you are here, what we will learn
  * Introduce the project: we are interested in sovereign debt, i.e. money borrowed by the governments of countries. We want to analyze the factors in debt and credit rating side by side.
  * Time: 10 min, 10 min total
* Download data
  * Debt as % of GDP in 2014, reported by the World Bank (http://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS).
  * Import it into Python using the csv module's DictReader class.
  * Briefly discuss Python data structures: lists vs. dicts, and representing CSV rows as dictionaries.
  * Time: 20 min, 30 min total
* Clean data
  * Write a function that gets rid of extraneous fields and returns tabular data only.
  * Discuss functions and modularity
  * Time: 20 min, 50 min total
* Merging data
  * Download a data set of GDP % growth Y/Y for 2014 and clean it as well (http://data.worldbank.org/indicator/NY.GDP.DEFL.KD.ZG).
  * Download a data set of inflation % Y/Y for 2014 and clean it up also (http://data.worldbank.org/indicator/NY.GDP.DEFL.KD.ZG).
  * Write a function that takes all cleaned CSV objects and creates a master table of all indicators for each country
  * Time: 20 min, 1 hr 10 min total
* Review what we did so far: 5 min, 1:15 total
* Break: 15 min, 1:30 total
* Importing data that's in a different format
  * Download Moody's credit ratings for each country (http://www.theguardian.com/news/datablog/2010/apr/30/credit-ratings-country-fitch-moodys-standard)
  * Import and clean Moody's data as learned before the break.
  * Discuss issues with adding Moody's data to the master table (standardization of country names)
  * Time: 20 min, 1:50 total
* Write a class that resolves country names (all valid variations) to ISO codes
  * Get ISO codes from Wikipedia (https://en.wikipedia.org/wiki/ISO_3166-1#Current_codes)
  * Discuss Python classes, modularity, re-use
  * Time: 30 min, 2:20 total
* Create a unified table with the help of your name standardizer
  * Modify the existing World Bank importer to use the name standardizer.
  * Add Moody's data to the master table, also using the name standardizer
  * Time: 20 min, 2:40 total
* Review everything we did: 10 min, 2:50 total
* Questions: 10 min, 3:00 total
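
The import, clean, and merge steps in the outline above could be sketched roughly as follows. This is a minimal illustration, not the tutorial's actual code: the sample CSV text, the number of metadata lines to skip, the column names, and the function names `read_rows`, `clean`, and `merge` are all assumptions made for the example.

```python
import csv

# Hypothetical sample shaped like a downloaded indicator file: a couple of
# metadata lines, then a header row, then one row per country.
# The countries and figures are invented for illustration.
SAMPLE_DEBT_CSV = """\
"Data Source","World Development Indicators",

"Country Name","Country Code","Indicator Name","2013","2014"
"Exampleland","EXA","Central government debt, total (% of GDP)","40.1","42.5"
"Samplestan","SMP","Central government debt, total (% of GDP)","","61.0"
"""


def read_rows(text, skip_lines=0):
    """Parse CSV text into a list of dicts, one dict per data row."""
    lines = text.splitlines()[skip_lines:]
    # DictReader uses the first remaining line as the field names.
    return list(csv.DictReader(lines))


def clean(rows, year, indicator):
    """Drop extraneous fields; keep country name and one year's value."""
    cleaned = {}
    for row in rows:
        value = (row.get(year) or "").strip()
        if value:  # skip countries with no figure for that year
            cleaned[row["Country Name"]] = {indicator: float(value)}
    return cleaned


def merge(*tables):
    """Combine cleaned tables into one master table keyed by country."""
    master = {}
    for table in tables:
        for country, indicators in table.items():
            master.setdefault(country, {}).update(indicators)
    return master


debt = clean(read_rows(SAMPLE_DEBT_CSV, skip_lines=2), "2014", "debt_pct_gdp")
# Stand-in for a second cleaned data set (invented value).
growth = {"Exampleland": {"gdp_growth_pct": 2.3}}
master = merge(debt, growth)
```

Keeping each step in its own function means the same `clean` and `merge` code can be reused for every indicator file that shares this layout.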
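
The name-standardizer step in the outline above might look something like the sketch below. The class name `CountryNameStandardizer`, its `resolve` method, and the small hand-written table of name variations are hypothetical; in the tutorial the ISO codes would come from the Wikipedia list rather than being typed in.

```python
class CountryNameStandardizer:
    """Map known spellings of a country name to one ISO 3166-1 alpha-3 code."""

    def __init__(self, variations):
        # variations: {iso_code: [name, alternate name, ...]}
        self._lookup = {}
        for code, names in variations.items():
            for name in names:
                self._lookup[self._normalize(name)] = code

    @staticmethod
    def _normalize(name):
        # Ignore case and collapse extra whitespace before comparing.
        return " ".join(name.lower().split())

    def resolve(self, name):
        """Return the ISO code for a name, or None if it is not known."""
        return self._lookup.get(self._normalize(name))


# A tiny hand-written sample of variations, for illustration only.
SAMPLE_VARIATIONS = {
    "KOR": ["Korea, Republic of", "Korea, Rep.", "South Korea"],
    "RUS": ["Russian Federation", "Russia"],
    "USA": ["United States of America", "United States"],
}

standardizer = CountryNameStandardizer(SAMPLE_VARIATIONS)
code_a = standardizer.resolve("south korea")
code_b = standardizer.resolve("  Korea,  Rep. ")
unknown = standardizer.resolve("Atlantis")
```

Returning `None` for an unrecognized name is one possible design choice; it lets the importer decide whether to skip the row or report the name so a new variation can be added.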

Student Handout

No handouts have been provided yet for this tutorial