Usage

First demonstration
A code sample says more than a thousand words:
import dryscrape

search_term = 'dryscrape'

# set up a web scraping session
sess = dryscrape.Session(base_url='http://google.com')

# we don't need images
sess.set_attribute('auto_load_images', False)

# visit homepage and search for a term
sess.visit('/')
q = sess.at_xpath('//*[@name="q"]')
q.set(search_term)
q.form().submit()

# extract all links
for link in sess.xpath('//a[@href]'):
    print(link['href'])

# save a screenshot of the web page
sess.render('google.png')
print("Screenshot written to 'google.png'")
In this sample, we use dryscrape to do a simple web search on Google. Note that the Session sets up a WebKit driver instance behind the scenes; you can also construct a driver yourself and pass it to the Session in the constructor. The session instance then passes every method call it cannot resolve itself – such as visit(), in this case – to the underlying driver.
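This kind of delegation is easy to picture in plain Python via __getattr__, which is only consulted when normal attribute lookup fails. The classes below are a hypothetical minimal sketch of the pattern, not dryscrape's actual implementation:

```python
class Driver:
    """Stand-in for the WebKit driver (hypothetical)."""
    def visit(self, path):
        return "visited " + path

class Session:
    """Forwards unknown method calls to the underlying driver."""
    def __init__(self, driver):
        self.driver = driver

    def __getattr__(self, name):
        # only reached when the Session itself has no such
        # attribute, so unresolved calls fall through to the driver
        return getattr(self.driver, name)

sess = Session(Driver())
print(sess.visit('/'))  # resolved on the driver, prints: visited /
```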
A more complex example
There was nothing particularly special about the example above. Let's look at a more advanced example that works on a JavaScript-only application: GMail.
import time

import dryscrape

#==========================================
# Setup
#==========================================

email = 'YOUR_USERNAME_HERE@gmail.com'
password = 'YOUR_PASSWORD_HERE'

# set up a web scraping session
sess = dryscrape.Session(base_url='https://mail.google.com/')

# there are some failing HTTP requests, so we need to enter
# a more error-tolerant mode (like real browsers do)
sess.set_error_tolerant(True)

# we don't need images
sess.set_attribute('auto_load_images', False)

# if we wanted, we could also configure a proxy server to use,
# so that we could, for example, use Fiddler to monitor the
# requests performed by this script
#sess.set_proxy('localhost', 8888)

#==========================================
# GMail: send a mail to self
#==========================================

# visit homepage and log in
print("Logging in...")
sess.visit('/')
email_field = sess.at_css('#Email')
password_field = sess.at_css('#Passwd')
email_field.set(email)
password_field.set(password)
email_field.form().submit()

# find the COMPOSE button and click it
print("Sending a mail...")
compose = sess.at_xpath('//*[contains(text(), "COMPOSE")]')
compose.click()

# compose the mail
to = sess.at_xpath('//*[@name="to"]', timeout=10)
subject = sess.at_xpath('//*[@name="subject"]')
body = sess.at_xpath('//*[@name="body"]')
to.set(email)
subject.set("Note to self")
body.set("Remember to try dryscrape!")

# send the mail
# seems like we need to wait a bit before clicking...
# Blame Google for this ;)
time.sleep(3)
send = sess.at_xpath('//*[normalize-space(text()) = "Send"]')
send.click()

# open the mail
print("Reading the mail...")
mail = sess.at_xpath('//*[normalize-space(text()) = "Note to self"]',
                     timeout=10)
mail.click()

# sleep a bit to give the mail a chance to open.
# This is ugly; it would be better to find something
# on the resulting page that we can wait for
time.sleep(3)

# save a screenshot of the web page
print("Writing screenshot to 'gmail.png'")
sess.render('gmail.png')
This just works. There are some things to note about it, though:

- at_xpath() and at_css() take an optional timeout argument that can be used to give the application a bit of time to load content.
- XPath is really useful; you should make yourself familiar with it. You can also use CSS selectors, however.
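A timeout like this is typically implemented as a simple poll loop: retry the lookup, sleep briefly between attempts, and give up once the deadline passes. The helper below is a generic sketch of that idea; wait_for and its parameters are illustrative, not part of dryscrape's API:

```python
import time

def wait_for(lookup, timeout=10, interval=0.5):
    """Poll `lookup` until it returns a truthy result or `timeout` expires."""
    deadline = time.time() + timeout
    while True:
        result = lookup()
        if result:
            return result
        if time.time() >= deadline:
            return None
        time.sleep(interval)

# usage would look like:
#   to = wait_for(lambda: sess.at_xpath('//*[@name="to"]'), timeout=10)
```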