MAKING BIG DATA A LITTLE BIT SMALLER
You have an ill-conditioned data set, and you're ready to make some predictions about what your consumers should buy, or who will win the NFL. Wait, what does "ill-conditioned" mean? Fortunately, as Jake VanderPlas pointed out at last PyCon, doing statistics has become accessible to a much larger set of programmers. Python's pandas, NumPy, scikit-learn, and SciPy, among others, have made data analysis far more approachable for the non-statistician. Regularization is one of the steps often recommended before fitting a model such as a neural network; it essentially dampens the effect of certain predictors. But how does this work? When should we use it? And what exactly are the pros (less variance in the solution) and cons (more biased estimates)?

This poster will present the least squares problem (an inverse problem), introduce the concept of ill-conditioned data, and describe the technique used to deal with ill-conditioned data, in statistics or machine learning, known as regularization. Regularization dampens the effect of features or covariates that are highly correlated (essentially filtering or re-weighting the covariates). The poster will highlight common regularization techniques (e.g. Lasso, Tikhonov, Elastic Net), as well as a new method called Iterative (L2) Tikhonov. It will compare the mean squared prediction error from cross-validated experiments on real data sets (NFL wins, as well as Kaggle's Rossmann Store Sales data).
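To make the comparison concrete: Tikhonov (ridge) regularization replaces the plain least squares objective min ||Ax - b||^2 with min ||Ax - b||^2 + lambda ||x||^2, shrinking coefficients toward zero; Lasso uses an L1 penalty instead, and Elastic Net mixes the two. Below is a minimal sketch of the kind of cross-validated MSE comparison the poster describes, using scikit-learn on synthetic highly correlated data. The synthetic data, the penalty strengths, and the model choices here are illustrative assumptions, not the poster's actual experiments or data sets.

```python
# Sketch: comparing regularizers on an ill-conditioned (highly correlated)
# synthetic design matrix -- an illustration, not the poster's experiments.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 200, 10
base = rng.normal(size=(n, 1))
# Columns are near-copies of one another -> highly correlated features,
# so the least squares problem is ill-conditioned.
X = base + 0.01 * rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

models = {
    "ols": LinearRegression(),
    "ridge (Tikhonov)": Ridge(alpha=1.0),                # L2 penalty
    "lasso": Lasso(alpha=0.1),                           # L1 penalty
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),  # L1/L2 mix
}
for name, model in models.items():
    # cross_val_score returns negated MSE, so flip the sign back
    mse = -cross_val_score(model, X, y,
                           scoring="neg_mean_squared_error", cv=5).mean()
    print(f"{name:18s} CV MSE = {mse:.3f}")
```

On correlated data like this, ordinary least squares tends to show higher prediction variance across folds, while the penalized models trade a little bias for a more stable (lower-variance) solution, which is exactly the pro/con trade-off the poster examines.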