Saturday 1:55 p.m.–2:25 p.m.
Other people's messy data (and how not to hate it!)
- Audience level:
- Python Libraries
Have you ever viscerally hated a dataset? Do you want to just get data cleaning out of the way? Are you always left wondering how it consumes most of your time? Whether you work in the sciences, work with government data, or scrape websites, data cleaning is a necessary evil. We'll share our woes and check out the state of the art in day-to-day data cleaning tools and strategies.
Everyone who has to deal with data eventually has to deal with messy data. Cleaning often takes over 50% of the effort, yet it is billed as "not the meat of the work" and no one gets trained in it. Government data consumers, social scientists, other scientists, and even you, dear data consumer, might like this talk! You'll learn how to tackle day-to-day data cleaning. Spotting issues with data, dealing with missing data, and merging datasets are among the topics. I'll cover the deep, dark parts of pandas that help specifically with different types of cleaning, and go over some lesser-known but neat libraries and tools like Sunlight Labs' jellyfish, messytables, chardet, etc. I'll also share some thoughts on data collection, and finish with a demo of cleaning a real-life dataset!
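To give a flavor of the topics listed above, here is a minimal pandas sketch of spotting missing values, coercing bad entries, and merging two tables. It is not taken from the talk: the `people` and `cities` tables and their column names are made up for illustration, and a real dataset would likely need more care.

```python
import pandas as pd

# Hypothetical toy data with typical messiness: stray whitespace,
# inconsistent casing, a missing key, and a non-numeric "n/a" age.
people = pd.DataFrame({
    "name": [" Alice ", "BOB", "carol", None],
    "age": ["34", "n/a", "29", "41"],
})
cities = pd.DataFrame({
    "name": ["alice", "bob", "dave"],
    "city": ["Montreal", "Toronto", "Ottawa"],
})

# Spot issues: count missing values per column before touching anything.
missing_before = people.isna().sum()

# Normalize the join key so " Alice " and "alice" compare equal.
people["name"] = people["name"].str.strip().str.lower()

# Turn unparseable ages into NaN instead of raising an error.
people["age"] = pd.to_numeric(people["age"], errors="coerce")

# Rows without a key cannot be merged, so drop them explicitly.
people = people.dropna(subset=["name"])

# An outer merge with indicator=True adds a "_merge" column that
# records which side each row came from.
merged = people.merge(cities, on="name", how="outer", indicator=True)

# Rows found in only one table are the ones worth inspecting by hand.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["name", "_merge"]])
```

The outer merge with `indicator=True` is a deliberate choice here: rather than silently dropping rows that fail to match, it keeps them around so you can see what was lost.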