
Friday 2:35 p.m.–3:05 p.m.

Grids, Streets and Pipelines: Building a linguistic street map with scikit-learn

Michelle Fullwood

Audience level:
Python Libraries


Have you built a classifier in scikit-learn with out-of-the-box features, been disappointed with the results, and wanted to know where to go next? This talk shows how to add your own feature Pipelines and how to tune hyperparameters using GridSearchCV. We'll apply this to the problem of classifying street names in Singapore by linguistic origin, and turn the results into a colour-coded street map.


**Introduction** We start by motivating the running example I'll be using throughout this talk to illustrate the techniques: making a "linguistic" street map of Singapore. Why make this map? Why is there so much linguistic diversity among Singapore street names in the first place? I'll answer the latter question with a 2-minute history of Singapore presented via maps.

**Building a baseline classifier with scikit-learn** Next we'll very rapidly go through the steps of building a baseline classifier with scikit-learn: this is basically the contents of the "Working with Text Data" tutorial. We'll do data wrangling with GeoPandas, establish a classification schema, build character n-gram features, select a classifier, perform the classification, and evaluate the baseline result.

**Adding custom feature Pipelines** We now look at ways of improving the classifier over the baseline. I'll show how to add custom features beyond those included in scikit-learn, how to build Pipelines for those features, and how to use FeatureUnion to glue them together.

**Tuning hyperparameters with GridSearchCV()** We then look at how to tune hyperparameters using GridSearchCV(). We'll discuss what happens under the hood when you use GridSearchCV(), and how to choose which hyperparameters to experiment with, focusing on Linear SVC as an example classifier.

**Making the map** I'll outline the steps needed to go from the classification results to a whole map using OpenStreetMap data, with the heavy lifting largely provided by Mapnik, a C++ tool for developing mapping applications with Python bindings. It's actually easier than you might think!

**Conclusion** We'll recap what we've done and review which method of improving the baseline classifier worked best: more data, adding features, hyperparameter tuning, or swapping out classifiers?
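
For readers who want a feel for the Pipeline, FeatureUnion and GridSearchCV steps in the outline, here is a minimal sketch of how they might fit together. It is not the speaker's code: the `WordCount` transformer, the eight toy street names, the two labels and the parameter grid are all invented for illustration, and the real talk works with a much larger dataset and classification schema.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


class WordCount(BaseEstimator, TransformerMixin):
    """Hypothetical custom feature: number of words in each street name."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the transformer works in a Pipeline.
        return self

    def transform(self, X):
        # One row per name, one column holding the word count.
        return np.array([[len(name.split())] for name in X], dtype=float)


# Toy data, invented for illustration only.
names = [
    "Jalan Besar", "Jalan Kayu", "Lorong Chuan", "Jalan Bukit Merah",
    "Orchard Road", "Victoria Street", "Clarence Lane", "Stamford Road",
]
labels = ["Malay"] * 4 + ["British"] * 4

# FeatureUnion concatenates the character n-gram counts with the custom feature.
features = FeatureUnion([
    ("ngrams", CountVectorizer(analyzer="char_wb", ngram_range=(1, 3))),
    ("word_count", WordCount()),
])
model = Pipeline([("features", features), ("clf", LinearSVC())])

# Nested parameters are addressed as <step>__<substep>__<parameter>.
param_grid = {
    "features__ngrams__ngram_range": [(1, 2), (1, 3)],
    "clf__C": [0.1, 1.0, 10.0],
}

# GridSearchCV cross-validates every combination in the grid (2 x 3 = 6 here),
# then refits the best one on all of the data.
search = GridSearchCV(model, param_grid, cv=2)
search.fit(names, labels)

print(search.best_params_)
print(search.predict(["Jalan Sultan"]))
```

Putting feature extraction inside the Pipeline means that each cross-validation fold refits the vectorizer on its own training split, so the n-gram range can be tuned alongside the classifier's `C` without leaking information from the held-out names.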