Scrape the Web: Strategies for programming websites that don't expect it

Do you find yourself faced with websites that have data you need to extract? Would your life be simpler if you could programmatically input data into web applications, even those tuned to resist interaction by bots?

We'll discuss the basics of web scraping, and then dive into the details of different methods and where they are most applicable. You'll leave with an understanding of when to apply different tools, and learn about a "heavy hammer" for screen scraping that I picked up on a project for the Electronic Frontier Foundation.

Attendees should bring a laptop, if possible, to try the examples we discuss and optionally take notes.

Presenter

By day, Asheesh Laroia is a software engineer at Creative Commons, where he uses Python extensively. He began scraping web sites in 2001 and honed his skills in 2008 when confronted with a US Patent Office website the Electronic Frontier Foundation needed information from. At other times, he juggles, maintains software in Debian, and leads the Students for Free Culture web team. Asheesh lives in San Francisco with his stuffed dog Herbert.

Requirements

Attendees are welcome to bring their laptops with Python installed (version 2.5 or higher, preferably 2.6). You will want BeautifulSoup <http://crummy.com/software/BeautifulSoup> and mechanize <http://wwwsearch.sourceforge.net/mechanize/> installed. Having Firefox and Firebug <http://www.getfirebug.com/> installed is a bonus.

Suggestions: Attendees are encouraged to email me (scrape-pycon@asheesh.org) before the talk with suggestions of websites they want to see scraped.

Class Outline

  • Introduction: Be nice to the web, but get the better of it
  • Structure of HTML and XHTML
  • Extracting information with regular expressions, parsers, and XPath
  • HTTP: Setting User-Agent, dealing with cookies, and handling errors
  • Filling out forms with urllib2 and mechanize
  • Discuss example applications: Text-to-speech, music store scraping, and more
  • In-depth BeautifulSoup query discussion
  • Expanding to more computers with the Python 2.6 multiprocessing module and keeping your scraping stable over time
  • Scraping dynamic "AJAXy" websites by truly automating a web browser
  • Q&A
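
To give a flavor of the extraction and HTTP items in the outline, here is a minimal sketch, not material from the tutorial itself. It assumes a modern Python 3 standard library (re, html.parser, urllib.request) rather than the Python 2 tools named above (urllib2, BeautifulSoup, mechanize). The sample page, the URL, the User-Agent string, and all function names are hypothetical and invented for illustration.

```python
import re
import urllib.request
from html.parser import HTMLParser

# Hypothetical sample page, invented for this sketch.
SAMPLE_HTML = """
<html><body>
  <h1>Albums</h1>
  <ul>
    <li><a href="/albums/1">First Album</a></li>
    <li><a class="new" href='/albums/2'>Second Album</a></li>
  </ul>
</body></html>
"""


def links_with_regex(html):
    """Pull href values with a regular expression: quick, but brittle."""
    return re.findall(r'<a\s[^>]*href=["\']([^"\']+)["\']', html)


class LinkParser(HTMLParser):
    """Collect (href, link text) pairs using a real HTML parser."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag we are inside, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links_with_parser(html):
    """Run LinkParser over a page and return the links it found."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


def build_request(url, user_agent):
    """Build (but do not send) a request that identifies the scraper."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    print(links_with_regex(SAMPLE_HTML))
    print(links_with_parser(SAMPLE_HTML))
    # example.com and the agent string are placeholders, not real endpoints.
    req = build_request("http://example.com/albums", "scrape-sketch/0.1")
    print(req.get_header("User-agent"))
```

The regular expression only sees the attribute value, while the parser also recovers the link text and tolerates attribute order and quoting differences. That trade-off is one reason a parser is usually the safer default once pages get messy.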
