Saturday 4:30 p.m.–5 p.m.

Beyond scraping: how to use machine learning when you're not sure where to start

Julie Lavoie


Scraping one web site for information is easy; scraping 10,000 different sites is hard. Beyond page-specific scraping, how do you extract the publication date of (almost) any news article online, no matter the web site? We’ll discuss when to use machine learning versus humans or heuristics for extracting web data; the steps involved in framing the problem as a machine learning task with lxml and nltk, including feature selection on news articles; and the issues that arise when turning research into production code.
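
To give a flavour of what "framing the problem as machine learning" might look like, here is a minimal sketch, not the speaker's actual method. It walks an HTML page, collects every date-like string, and records a few features about where each one appears. A classifier could then be trained to pick out the publication date among the candidates. The sketch swaps in Python's standard-library html.parser for lxml so that it runs without third-party packages. The class name, the feature names, and the ISO-only date pattern are all illustrative assumptions.

```python
import re
from html.parser import HTMLParser

# Assumption for illustration: only ISO-style dates (YYYY-MM-DD) are matched.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DateCandidateParser(HTMLParser):
    """Collect date-like strings plus simple features about their context."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags as (tag, attrs dict)
        self.candidates = []   # one feature dict per date-like string
        self.text_nodes = 0    # number of non-empty text nodes seen so far

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not data.strip():
            return
        self.text_nodes += 1
        tag, attrs = self.stack[-1] if self.stack else ("", {})
        attr_text = " ".join(v or "" for v in attrs.values()).lower()
        for match in DATE_RE.finditer(data):
            self.candidates.append({
                "text": match.group(0),
                "tag": tag,
                "in_time_tag": tag == "time",
                "attr_mentions_date": "date" in attr_text or "publish" in attr_text,
                "text_node_index": self.text_nodes,
            })


def extract_candidates(html):
    """Return a list of feature dicts, one per date-like string in the page."""
    parser = DateCandidateParser()
    parser.feed(html)
    parser.close()
    return parser.candidates


SAMPLE = """<html><body>
<h1>Headline</h1>
<span class="published-date">2013-03-16</span>
<p>The festival first ran on 2009-07-01.</p>
<div class="footer">Last updated 2013-03-20</div>
</body></html>"""

if __name__ == "__main__":
    for candidate in extract_candidates(SAMPLE):
        print(candidate)
```

Each candidate is a plain dictionary of features, which is the shape that nltk's classifiers accept as a feature set once the candidates have been labelled by hand as "publication date" or "other date".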