Friday 11:30 a.m.–noon

Scrapy: it GETs the web

Asheesh Laroia

Audience level: Intermediate
Category: Best Practices/Patterns

Description

Scrapy lets you straightforwardly pull data out of the web. It helps you retry if the site is down, extract content from pages using CSS selectors (or XPath), and cover your code with tests. It downloads asynchronously with high performance. You program to a simple model, and it's good for web APIs, too.

If you use requests, mechanize, or celery for HTTP, you should probably switch to Scrapy.

Abstract

Extracting data from the web is often error-prone, hard to test, and slow. Scrapy changes all of that.

In this talk, we take two different kinds of web data retrieval -- one that scrapes data out of HTML, and one that uses a RESTful API -- and show how both can be improved by Scrapy.

Part I: Scraping without Scrapy

  • Web pages render into DOM nodes
  • Demonstrate a basic way to scrape a page: urllib2.urlopen() + lxml.html (sketched after this list)
  • Send the data somewhere by a synchronous call
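
A minimal sketch of that baseline, in the Python 2 idiom the outline names (urllib2); the page URL, the selector, and the receiving endpoint are illustrative assumptions:

    import json
    import urllib2

    import lxml.html

    # Download and parse: every call here blocks until the network answers.
    html = urllib2.urlopen('http://example.com/').read()
    doc = lxml.html.fromstring(html)
    titles = [h1.text_content() for h1 in doc.cssselect('h1')]

    # "Send the data somewhere by a synchronous call": a blocking POST to a
    # hypothetical internal service.
    request = urllib2.Request('http://internal.example.com/import',
                              data=json.dumps({'titles': titles}),
                              headers={'Content-Type': 'application/json'})
    urllib2.urlopen(request)

The process sits idle during every round-trip, which is exactly the cost Scrapy's asynchronous downloader removes.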

Part II: Importing Scrapy components for programmer sanity

  • Using scrapy.item.Item to define what you are scraping out (sketched after this list)
  • Using scrapy.spider.BaseSpider to clarify the code
  • Running spiders: You just got async for free
  • Discussion: What does async buy you? Quick benchmarks of 200 simultaneous connections with Scrapy and without.
  • Sending data onward through the item pipeline
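
A minimal sketch of the same job rebuilt on those pieces, using the old-style BaseSpider API named above; the start URL, field names, and XPath are illustrative assumptions:

    from scrapy.item import Item, Field
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider

    class PageTitle(Item):
        # Declaring fields up front is explicit about what leaves the spider.
        url = Field()
        title = Field()

    class TitleSpider(BaseSpider):
        name = 'titles'
        start_urls = ['http://example.com/']

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            for text in hxs.select('//h1/text()').extract():
                item = PageTitle()
                item['url'] = response.url
                item['title'] = text
                yield item  # handed to the item pipeline, no blocking call

Running it with "scrapy crawl titles" drives everything through Twisted's event loop, so downloads overlap instead of queueing behind one another.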

Part III: Everyone "loves" JavaScript

  • SpiderMonkey with Scrapy
  • Automating an entire Firefox with Selenium RC
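
A minimal sketch of the Selenium route, using the old Selenium RC client named above; it assumes a Selenium server on localhost:4444, and the URL is illustrative:

    import lxml.html
    from selenium import selenium  # the old-style Selenium RC client

    browser = selenium('localhost', 4444, '*firefox', 'http://example.com/')
    browser.start()
    browser.open('/javascript-heavy-page')
    # Parse the DOM *after* Firefox has executed the page's JavaScript.
    doc = lxml.html.fromstring(browser.get_html_source())
    browser.stop()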

Part IV: Automated testing when using Scrapy

  • Why testing is hard with synchronous scrapers
  • How to run a scrapy.spider.BaseSpider from Python's unittest (see the sketch after this list)
  • How to test offline (by keeping a copy of needed pages)
  • No network calls, so tests run fast
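
A minimal sketch of such a test; it assumes the TitleSpider from the Part II sketch lives in a hypothetical myproject.spiders module and that a copy of the page is saved under tests/pages/:

    import unittest

    from scrapy.http import HtmlResponse

    from myproject.spiders import TitleSpider  # hypothetical module path

    class TitleSpiderTest(unittest.TestCase):
        def test_parse_extracts_titles(self):
            # A saved copy of the page keeps the test offline and fast.
            body = open('tests/pages/example.html').read()
            response = HtmlResponse(url='http://example.com/', body=body)
            items = list(TitleSpider().parse(response))
            self.assertTrue(items)
            self.assertEqual(items[0]['url'], 'http://example.com/')

    if __name__ == '__main__':
        unittest.main()

Because parse() is just a function of a response, the test needs no network, no event loop, and no Scrapy machinery.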

Part V: Improving a Wikipedia API client with Scrapy

  • Start with a synchronous API client
  • When the web service is down, watch it crash
  • Make it a Scrapy spider, and get automatic retry on failure
  • Configure the request scheduler to not hammer Wikipedia
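
A minimal sketch of that spider, with illustrative query parameters; the download_delay attribute throttles the scheduler, and Scrapy's stock retry middleware re-issues failed requests instead of crashing the client:

    import json

    from scrapy.spider import BaseSpider

    class WikipediaApiSpider(BaseSpider):
        name = 'wikipedia_api'
        start_urls = [
            'http://en.wikipedia.org/w/api.php'
            '?action=query&titles=Scrapy&prop=info&format=json',
        ]
        # Throttle the request scheduler instead of hammering Wikipedia.
        download_delay = 2

        def parse(self, response):
            data = json.loads(response.body)
            for page in data['query']['pages'].values():
                self.log('fetched: %s' % page.get('title'))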

Part VI: Scrapy with Django

  • Like ModelForms, but for scraping: Scrapy's DjangoItem (sketched after this list)
  • Imperfect integration, but discipline gives results
  • Think in terms of scrapy.item.Item
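
A minimal sketch of DjangoItem, assuming a hypothetical Article model; the import path is the scrapy.contrib one from this era of Scrapy:

    from scrapy.contrib.djangoitem import DjangoItem
    from scrapy.item import Field

    from myapp.models import Article  # hypothetical Django model

    class ArticleItem(DjangoItem):
        django_model = Article  # item fields are derived from the model
        source_url = Field()    # scraped-only field, not on the model

    # In an item pipeline, saving goes through the Django ORM:
    #     article = item.save()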

Conclusion:

Asheesh's rules for sane scraping

  • Separate downloading from parsing.
  • Maintain high test coverage.
  • Be explicit about what data you pass from the wild, wild web into your application code.
Coding with Scrapy gives you all of these, unlike other scraping libraries.

When Scrapy isn't appropriate:

  • For short scripts, the verbose API can feel like a serious burden.
  • When you really want exceptions on failure.

Even if you use something else, you will love Scrapy's documentation on scraping in general.