Usage

First demonstration

A code sample says more than a thousand words:

import dryscrape

search_term = 'dryscrape'

# set up a web scraping session
sess = dryscrape.Session(base_url = 'http://google.com')

# we don't need images
sess.set_attribute('auto_load_images', False)

# visit homepage and search for a term
sess.visit('/')
q = sess.at_xpath('//*[@name="q"]')
q.set(search_term)
q.form().submit()

# extract all links
for link in sess.xpath('//a[@href]'):
    print(link['href'])

# save a screenshot of the web page
sess.render('google.png')
print("Screenshot written to 'google.png'")

In this sample, we use dryscrape to run a simple web search on Google. Note that we do not create a driver ourselves: when none is passed to the constructor, the dryscrape Session sets up a WebKit driver instance on its own. The session instance then passes every method call it cannot resolve, such as visit() in this case, to the underlying driver.
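The forwarding behaviour described above can be illustrated with a small stand-alone sketch. It is not dryscrape's actual implementation: FakeDriver and DelegatingSession are hypothetical names invented for this example, and the sketch assumes the forwarding is done through Python's __getattr__ hook.

```python
class FakeDriver(object):
    """Hypothetical stand-in for a browser driver; not part of dryscrape."""

    def __init__(self):
        self.visited = []

    def visit(self, url):
        # Record the URL instead of loading a real page.
        self.visited.append(url)
        return url


class DelegatingSession(object):
    """Hypothetical session that forwards unknown attributes to its driver."""

    def __init__(self, driver, base_url=''):
        self.driver = driver
        self.base_url = base_url

    def complete_url(self, path):
        # Defined on the session itself, so __getattr__ is never consulted.
        return self.base_url + path

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal lookup fails,
        # so only names the session lacks reach the driver.
        return getattr(self.driver, name)


driver = FakeDriver()
sess = DelegatingSession(driver, base_url='http://example.com')
sess.visit(sess.complete_url('/'))  # visit() is resolved on the driver
```

The appeal of this design is that the session can add conveniences of its own while still exposing everything the driver offers, without wrapping each driver method by hand.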

A more complex example

There was nothing particularly special about the example above. Let’s look at a more advanced example that works on a JavaScript-only application: Gmail.

import time
import dryscrape

#==========================================
# Setup
#==========================================

email    = 'YOUR_USERNAME_HERE@gmail.com'
password = 'YOUR_PASSWORD_HERE'

# set up a web scraping session
sess = dryscrape.Session(base_url = 'https://mail.google.com/')

# there are some failing HTTP requests, so we need to enter
# a more error-resistant mode (like real browsers do)
sess.set_error_tolerant(True)

# we don't need images
sess.set_attribute('auto_load_images', False)

# if we wanted, we could also configure a proxy server to use,
# so we can for example use Fiddler to monitor the requests
# performed by this script
#sess.set_proxy('localhost', 8888)

#==========================================
# Gmail: send a mail to self
#==========================================

# visit homepage and log in
print("Logging in...")
sess.visit('/')

email_field    = sess.at_css('#Email')
password_field = sess.at_css('#Passwd')
email_field.set(email)
password_field.set(password)

email_field.form().submit()

# find the COMPOSE button and click it
print("Sending a mail...")
compose = sess.at_xpath('//*[contains(text(), "COMPOSE")]')
compose.click()

# compose the mail
to      = sess.at_xpath('//*[@name="to"]', timeout=10)
subject = sess.at_xpath('//*[@name="subject"]')
body    = sess.at_xpath('//*[@name="body"]')

to.set(email)
subject.set("Note to self")
body.set("Remember to try dryscrape!")

# send the mail

# seems like we need to wait a bit before clicking...
# Blame Google for this ;)
time.sleep(3)
send = sess.at_xpath('//*[normalize-space(text()) = "Send"]')
send.click()

# open the mail
print("Reading the mail...")
mail = sess.at_xpath('//*[normalize-space(text()) = "Note to self"]',
                     timeout=10)
mail.click()

# sleep a bit to give the mail a chance to open.
# This is ugly; it would be better to find something
# on the resulting page that we can wait for
time.sleep(3)

# save a screenshot of the web page
print("Writing screenshot to 'gmail.png'")
sess.render('gmail.png')

This just works.

There are some things to note about it, though:

  • at_xpath() and at_css() take an optional timeout argument that gives the application some time to load content before the lookup gives up.
  • XPath is really useful, so you should make yourself familiar with it. CSS selectors work as well, if you prefer them.
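The idea behind a lookup with a timeout, and behind waiting for something on the page instead of sleeping for a fixed time, can be sketched as a simple polling loop. This is a minimal illustration, not dryscrape's own code: the helper name wait_for, its parameters, and the simulated find_node function are assumptions made for this example.

```python
import time


def wait_for(condition, timeout=10.0, interval=0.1):
    """Call condition() until it returns a truthy value or timeout expires.

    Hypothetical helper for illustration; not taken from dryscrape.
    """
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() >= deadline:
            raise RuntimeError(
                'condition not met within %.1f seconds' % timeout)
        time.sleep(interval)


# Simulated page whose content only "appears" on the third lookup.
attempts = []


def find_node():
    attempts.append(1)
    return 'node' if len(attempts) >= 3 else None


node = wait_for(find_node, timeout=2.0, interval=0.01)
```

With a real session, the condition could be a lambda that calls sess.at_xpath() for an element that only exists once the mail has opened, assuming the lookup returns nothing when no node matches. That would replace the fixed time.sleep(3) calls in the example above.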