Friday 11:30 a.m.–noon

Scrapy: it GETs the web

Asheesh Laroia

Audience level: Intermediate
Category: Best Practices/Patterns

Description

Scrapy lets you straightforwardly pull data out of the web. It helps you retry if the site is down, extract content from pages using CSS selectors (or XPath), and cover your code with tests. It downloads asynchronously with high performance. You program to a simple model, and it's good for web APIs, too.

If you use requests, mechanize, or celery for HTTP, you should probably switch to Scrapy.

Abstract

Extracting data from the web is often error-prone, hard to test, and slow. Scrapy changes all of that.

In this talk, we take two different kinds of web data retrieval -- one that scrapes data out of HTML, and one that uses a RESTful API -- and show how both can be improved by Scrapy.

Part I: Scraping without Scrapy

  • Web pages render into DOM nodes
  • Demonstrate a basic way to scrape a page: urllib2.urlopen() + lxml.html (sketched after this list)
  • Send the data somewhere by a synchronous call
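
A minimal sketch of that baseline, in the Python 2 idiom the outline names (urllib2); the page URL, the selector, and the receiving endpoint are illustrative assumptions:

    import json
    import urllib2

    import lxml.html

    # Download and parse: every call here blocks until the network answers.
    html = urllib2.urlopen('http://example.com/').read()
    doc = lxml.html.fromstring(html)
    titles = [h1.text_content() for h1 in doc.cssselect('h1')]

    # "Send the data somewhere by a synchronous call": a blocking POST to a
    # hypothetical internal service.
    request = urllib2.Request('http://internal.example.com/import',
                              data=json.dumps({'titles': titles}),
                              headers={'Content-Type': 'application/json'})
    urllib2.urlopen(request)

The process sits idle during every round-trip, which is exactly the cost Scrapy's asynchronous downloader removes.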

Part II: Importing Scrapy components for programmer sanity

  • Using scrapy.item.Item to define what you are scraping out (sketched after this list)
  • Using scrapy.spider.BaseSpider to clarify the code
  • Running spiders: You just got async for free
  • Discussion: What does async buy you? Quick benchmarks of 200 simultaneous connections with Scrapy and without.
  • Sending data onward through the item pipeline
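
A minimal sketch of the same job rebuilt on those pieces, using the old-style BaseSpider API named above; the start URL, field names, and XPath are illustrative assumptions:

    from scrapy.item import Item, Field
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider

    class PageTitle(Item):
        # Declaring fields up front is explicit about what leaves the spider.
        url = Field()
        title = Field()

    class TitleSpider(BaseSpider):
        name = 'titles'
        start_urls = ['http://example.com/']

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            for text in hxs.select('//h1/text()').extract():
                item = PageTitle()
                item['url'] = response.url
                item['title'] = text
                yield item  # handed to the item pipeline, no blocking call

Running it with "scrapy crawl titles" drives everything through Twisted's event loop, so downloads overlap instead of queueing behind one another.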

Part III: Everyone "loves" JavaScript

  • SpiderMonkey with Scrapy
  • Automating an entire Firefox with Selenium RC
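
A minimal sketch of the Selenium route, using the old Selenium RC client named above; it assumes a Selenium server on localhost:4444, and the URL is illustrative:

    import lxml.html
    from selenium import selenium  # the old-style Selenium RC client

    browser = selenium('localhost', 4444, '*firefox', 'http://example.com/')
    browser.start()
    browser.open('/javascript-heavy-page')
    # Parse the DOM *after* Firefox has executed the page's JavaScript.
    doc = lxml.html.fromstring(browser.get_html_source())
    browser.stop()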

Part IV: Automated testing when using Scrapy

  • Why testing is hard with synchronous scrapers
  • How to run a scrapy.spider.BaseSpider from Python's unittest (see the sketch after this list)
  • How to test offline (by keeping a copy of needed pages)
  • No network calls, so tests run fast
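
A minimal sketch of such a test; it assumes the TitleSpider from the Part II sketch lives in a hypothetical myproject.spiders module and that a copy of the page is saved under tests/pages/:

    import unittest

    from scrapy.http import HtmlResponse

    from myproject.spiders import TitleSpider  # hypothetical module path

    class TitleSpiderTest(unittest.TestCase):
        def test_parse_extracts_titles(self):
            # A saved copy of the page keeps the test offline and fast.
            body = open('tests/pages/example.html').read()
            response = HtmlResponse(url='http://example.com/', body=body)
            items = list(TitleSpider().parse(response))
            self.assertTrue(items)
            self.assertEqual(items[0]['url'], 'http://example.com/')

    if __name__ == '__main__':
        unittest.main()

Because parse() is just a function of a response, the test needs no network, no event loop, and no Scrapy machinery.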

Part V: Improving a Wikipedia API client with Scrapy

  • Start with a synchronous API client
  • When the web service is down, watch it crash
  • Make it a Scrapy spider, and get automatic retry on failure
  • Configure the request scheduler to not hammer Wikipedia
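
A minimal sketch of that spider, with illustrative query parameters; the download_delay attribute throttles the scheduler, and Scrapy's stock retry middleware re-issues failed requests instead of crashing the client:

    import json

    from scrapy.spider import BaseSpider

    class WikipediaApiSpider(BaseSpider):
        name = 'wikipedia_api'
        start_urls = [
            'http://en.wikipedia.org/w/api.php'
            '?action=query&titles=Scrapy&prop=info&format=json',
        ]
        # Throttle the request scheduler instead of hammering Wikipedia.
        download_delay = 2

        def parse(self, response):
            data = json.loads(response.body)
            for page in data['query']['pages'].values():
                self.log('fetched: %s' % page.get('title'))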

Part VI: Scrapy with Django

  • Like ModelForms, but for scraping: Scrapy's DjangoItem (sketched after this list)
  • Imperfect integration, but discipline gives results
  • Think in terms of scrapy.item.Item
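
A minimal sketch of DjangoItem, assuming a hypothetical Article model; the import path is the scrapy.contrib one from this era of Scrapy:

    from scrapy.contrib.djangoitem import DjangoItem
    from scrapy.item import Field

    from myapp.models import Article  # hypothetical Django model

    class ArticleItem(DjangoItem):
        django_model = Article  # item fields are derived from the model
        source_url = Field()    # scraped-only field, not on the model

    # In an item pipeline, saving goes through the Django ORM:
    #     article = item.save()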

Conclusion:

Asheesh's rules for sane scraping

  • Separate downloading from parsing.
  • Maintain high test coverage.
  • Be explicit about what data you pass from the wild, wild web into your application code.
Coding with Scrapy gives you all of these, unlike other scraping libraries.

When Scrapy isn't appropriate:

  • For short scripts, the verbose API can feel like a serious burden.
  • When you really want exceptions on failure.

Even if you use something else, you will love Scrapy's documentation on scraping in general.