faNFL - Exploring the possibilities of predicting NFL player performance for Fantasy NFL
How far can we get with the statistical and machine learning tools of the Python ecosystem in tackling an interesting real-world question: predicting the performance of individual NFL players from historical data? With the rise (hype?) of “big data”, how important are good models for training a predictor compared with the brute-force approach of checking all correlations to make the predictions?
How well can one jump-start interesting real-world analysis and prediction within the Python ecosystem? How close can we get to Yahoo’s predictions? Can we beat them with open source machine learning and statistics tools? (We have not yet beaten them at the time of writing.)
Fantasy football is an online competition in which users compete against one another as general managers of virtual teams. The performance of the players in a virtual team is based on their real-world performance. Each week, users can perform different actions that simulate running a professional football organization. Fantasy football has vastly increased in popularity, mainly because fantasy football providers such as ESPN, Yahoo! Fantasy Sports, and the NFL are able to keep track of statistics entirely online. The virtual teams are ranked according to the outcomes of the real-world games, so predicting the real-world performance of players can give the virtual general manager an advantage.
Using our fork of NFLGame (we ported the library to Python 3) to get statistics directly from NFL Game Center, we are able to produce a large pandas panel data structure of the historic performance of players. This data structure is much more convenient for exploratory data analysis and further processing than REST (web) APIs. We started directly with Python 3.4 for this project, and the libraries and tools we use include IPython, numpy, scipy, pandas, seaborn/matplotlib, sklearn, requests and python-yahooapi.
From simple counting, through correlation analysis, to building models as a basis for statistical evaluation and machine learning tools (provided by sklearn), we address our main question: how important are carefully hand-crafted performance models for the different learning algorithms, and how far can we get by "counting numbers"?
We plan to open-source the IPython notebook, since the setup and data preparation took a significant amount of time (longer than we estimated; surprise, surprise!) before we were able to start on the more interesting part of this project. In the future, others may want to reuse this basis to improve the predictions or try out other statistical models.
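
To make the "counting numbers" versus model comparison concrete, here is a minimal sketch of how such an experiment could be set up. It is not the code from our notebook. The data is synthetic (no real NFL statistics), and the column names, the three-game window, and the week-12 split are assumptions chosen for illustration. Because `pandas.Panel` has since been removed from pandas, the sketch uses a DataFrame indexed by (player, week) instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for historic per-game fantasy points (NOT real NFL data).
rng = np.random.default_rng(0)
n_players, n_weeks = 40, 16
skill = rng.normal(10.0, 3.0, size=n_players)  # hidden per-player level

index = pd.MultiIndex.from_product(
    [[f"player_{i}" for i in range(n_players)], range(1, n_weeks + 1)],
    names=["player", "week"],
)
points = np.repeat(skill, n_weeks) + rng.normal(0.0, 4.0, size=len(index))
stats = pd.DataFrame({"points": points}, index=index)

# Lagged features: only information available before the game being predicted.
points_by_player = stats["points"].groupby(level="player")
features = ["lag1", "lag2", "lag3"]
for lag, name in enumerate(features, start=1):
    stats[name] = points_by_player.shift(lag)
stats = stats.dropna()  # drops each player's first three weeks

# "Counting numbers" baseline: the plain mean of the previous three games.
stats["baseline"] = stats[features].mean(axis=1)

# Train on earlier weeks and evaluate on later ones, so no future data leaks in.
week = stats.index.get_level_values("week")
train_set, test_set = stats[week <= 12], stats[week > 12]

# A simple learned model on the same inputs (hypothetical choice of estimator).
model = LinearRegression().fit(train_set[features], train_set["points"])
predicted = model.predict(test_set[features])

mae_baseline = mean_absolute_error(test_set["points"], test_set["baseline"])
mae_model = mean_absolute_error(test_set["points"], predicted)
print(f"baseline MAE: {mae_baseline:.2f}  model MAE: {mae_model:.2f}")
```

Both predictors see exactly the same three lagged values, so any difference in error comes from how they are combined rather than from extra information. With real data, the lagged columns would be replaced by the statistics obtained through NFLGame, and the baseline could be compared against the predictions published by a fantasy provider.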