API Documentation

This documentation also contains the API docs for the webkit_server module, for convenience (and because I am too lazy to set up dedicated docs for it).

Overview

Inheritance diagram of dryscrape.session, dryscrape.mixins, dryscrape.driver.webkit, webkit_server

Module dryscrape.session

class dryscrape.session.Session(driver=None, base_url=None)[source]

Bases: object

A web scraping session based on a driver instance. Implements the proxy pattern to pass unresolved method calls to the underlying driver.

If no driver is specified, the session instantiates dryscrape.session.DefaultDriver to obtain one (this defaults to dryscrape.driver.webkit.Driver).

If base_url is present, relative URLs are completed with this URL base. If not, the get_base_url method is called on itself to get the base URL.
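The proxying and URL completion described above can be sketched roughly as follows. This is an illustration only, not dryscrape's source: SketchSession and FakeDriver are hypothetical stand-ins, the use of urllib.parse.urljoin is an assumption about how completion might be done, and the get_base_url fallback is left out for brevity.

```python
from urllib.parse import urljoin


class FakeDriver:
    """Hypothetical stand-in for a real driver; it only records visited URLs."""

    def __init__(self):
        self.visited = []

    def visit(self, url):
        self.visited.append(url)

    def url(self):
        return self.visited[-1] if self.visited else None


class SketchSession:
    """Illustrative session: completes URLs and proxies everything else."""

    def __init__(self, driver, base_url=None):
        self.driver = driver
        self.base_url = base_url

    def complete_url(self, url):
        # Relative URLs are joined onto the base; without a base they pass through.
        return urljoin(self.base_url, url) if self.base_url else url

    def visit(self, url):
        return self.driver.visit(self.complete_url(url))

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails: forward to the driver.
        return getattr(self.driver, name)


session = SketchSession(FakeDriver(), base_url="http://example.com/app/")
session.visit("search?q=test")
print(session.url())  # url() is not defined on the session, so the driver answers
```

Because __getattr__ is consulted only after regular lookup fails, methods defined on the session itself (such as visit) take precedence over the driver's.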

complete_url(url)[source]

Completes a given URL with this instance’s URL base.

interact(**local)[source]

Drops the user into an interactive Python session with the sess variable set to the current session instance. If keyword arguments are supplied, these names will also be available within the session.

visit(url)[source]

Passes through the URL to the driver after completing it using the instance’s URL base.

Module dryscrape.mixins

Mixins for use in dryscrape drivers.

class dryscrape.mixins.AttributeMixin[source]

Bases: object

Mixin that adds [] access syntax sugar to an object that supports a set_attr and get_attr method.
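One way such a mixin could be written is shown below. SketchAttributeMixin and FakeNode are hypothetical names used for illustration; the real mixin may differ in detail.

```python
class SketchAttributeMixin:
    """Illustrative mixin mapping [] access onto get_attr and set_attr."""

    def __getitem__(self, name):
        return self.get_attr(name)

    def __setitem__(self, name, value):
        self.set_attr(name, value)


class FakeNode(SketchAttributeMixin):
    """Hypothetical node that keeps its attributes in a plain dict."""

    def __init__(self):
        self._attrs = {}

    def get_attr(self, name):
        return self._attrs.get(name)

    def set_attr(self, name, value):
        self._attrs[name] = value


node = FakeNode()
node["href"] = "/about"   # same as node.set_attr("href", "/about")
print(node["href"])       # same as node.get_attr("href")
```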

class dryscrape.mixins.HtmlParsingMixin[source]

Bases: object

Mixin that adds a document method to an object that supports a body method returning valid HTML.

document()[source]

Parses the HTML returned by body and returns it as an lxml.html document. If the driver supports live DOM manipulation (like webkit_server does), changes performed on the returned document will not take effect.

class dryscrape.mixins.SelectionMixin[source]

Bases: object

Mixin that adds different methods of node selection to an object that provides an xpath method returning a collection of matches.

at_css(css)[source]

Returns the first node matching the given CSSv3 expression or None.

at_xpath(xpath)[source]

Returns the first node matching the given XPath 2.0 expression or None.
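The "first match or None" behaviour can be layered on top of any xpath method that returns a collection. The sketch below assumes exactly that and nothing more; SketchSelectionMixin and FakeDocument are hypothetical names, and the lookup table stands in for a real document.

```python
class SketchSelectionMixin:
    """Illustrative mixin: derives at_xpath from an existing xpath method."""

    def at_xpath(self, xpath):
        matches = self.xpath(xpath)
        return matches[0] if matches else None


class FakeDocument(SketchSelectionMixin):
    """Hypothetical document answering XPath queries from a lookup table."""

    def __init__(self, nodes_by_xpath):
        self._nodes = nodes_by_xpath

    def xpath(self, xpath):
        return self._nodes.get(xpath, [])


doc = FakeDocument({"//a": ["link-1", "link-2"]})
first = doc.at_xpath("//a")        # first of two matches
missing = doc.at_xpath("//table")  # no matches, so None
```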

children()[source]

Returns the child nodes.

css(css)[source]

Returns all nodes matching the given CSSv3 expression.

form()[source]

Returns the form wherein this node is contained or None.

parent()[source]

Returns the parent node.

class dryscrape.mixins.WaitMixin[source]

Bases: dryscrape.mixins.SelectionMixin

Mixin that allows waiting for conditions or elements.

at_css(css, timeout=1, **kw)[source]

Returns the first node matching the given CSSv3 expression or None if a timeout occurs.

at_xpath(xpath, timeout=1, **kw)[source]

Returns the first node matching the given XPath 2.0 expression or None if a timeout occurs.

wait_for(condition, interval=0.5, timeout=10)[source]

Wait until a condition holds by checking it in regular intervals. Raises WaitTimeoutError on timeout.

wait_for_safe(*args, **kw)[source]

Wait until a condition holds and return None on timeout.

wait_while(condition, *args, **kw)[source]

Wait while a condition holds.
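The three waiting helpers can be understood as thin layers over one polling loop. The following is a sketch under stated assumptions, not the library's code: SketchWaitTimeoutError is a hypothetical stand-in for WaitTimeoutError, and returning the condition's value from wait_for is an assumption.

```python
import time


class SketchWaitTimeoutError(Exception):
    """Hypothetical stand-in for dryscrape.mixins.WaitTimeoutError."""


def wait_for(condition, interval=0.5, timeout=10):
    """Poll condition() until it is truthy or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise SketchWaitTimeoutError("condition not met in time")
        time.sleep(interval)


def wait_for_safe(*args, **kw):
    """Like wait_for, but return None instead of raising on timeout."""
    try:
        return wait_for(*args, **kw)
    except SketchWaitTimeoutError:
        return None


def wait_while(condition, *args, **kw):
    """Wait until condition() stops holding."""
    return wait_for(lambda: not condition(), *args, **kw)


attempts = []


def ready():
    # Becomes true on the third poll.
    attempts.append(1)
    return len(attempts) >= 3


outcome = wait_for(ready, interval=0.01, timeout=1)
timed_out = wait_for_safe(lambda: False, interval=0.01, timeout=0.05)
```

Small interval and timeout values are used here only to keep the example fast; the defaults in the signatures above are 0.5 and 10 seconds.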

exception dryscrape.mixins.WaitTimeoutError[source]

Bases: exceptions.Exception

Raised when a wait times out.

Module dryscrape.driver.webkit

Headless Webkit driver for dryscrape. Wraps the webkit_server module.

class dryscrape.driver.webkit.Driver(**kw)[source]

Bases: webkit_server.Client, dryscrape.mixins.WaitMixin, dryscrape.mixins.HtmlParsingMixin

Driver implementation wrapping a webkit_server driver.

Keyword arguments are passed through to the underlying webkit_server.Client constructor. By default, node_factory_class is set to use the dryscrape node implementation.

class dryscrape.driver.webkit.Node(client, node_id)[source]

Bases: webkit_server.Node, dryscrape.mixins.SelectionMixin, dryscrape.mixins.AttributeMixin

Node implementation wrapping a webkit_server node.

class dryscrape.driver.webkit.NodeFactory(client)[source]

Bases: webkit_server.NodeFactory

Overrides the NodeFactory provided by webkit_server.

Module webkit_server

Python bindings for the webkit-server.

class webkit_server.Client(connection=None, node_factory_class=<class 'webkit_server.NodeFactory'>)[source]

Bases: webkit_server.SelectionMixin

Wrappers for the webkit_server commands.

If connection is not specified, a new instance of ServerConnection is created.

node_factory_class can be set to a value different from the default, in which case a new instance of the given class will be used to create nodes. The given class must accept a client instance through its constructor and support a create method that takes a node ID as an argument and returns a node object.
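The contract for a custom node factory (a constructor taking the client, plus a create method taking a node ID) can be illustrated without a running server. TaggedNode, TaggedNodeFactory and FakeClient below are all hypothetical; FakeClient only mimics the part of the client that instantiates the factory.

```python
class TaggedNode:
    """Hypothetical node type that remembers its client and ID."""

    def __init__(self, client, node_id):
        self.client = client
        self.node_id = node_id


class TaggedNodeFactory:
    """Hypothetical factory satisfying the contract described above."""

    def __init__(self, client):
        self.client = client

    def create(self, node_id):
        return TaggedNode(self.client, node_id)


class FakeClient:
    """Stand-in for a client; needs no server process."""

    def __init__(self, node_factory_class):
        # The factory class is instantiated with the client itself.
        self._node_factory = node_factory_class(self)

    def get_node_factory(self):
        return self._node_factory


client = FakeClient(node_factory_class=TaggedNodeFactory)
node = client.get_node_factory().create("42")
```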

body()[source]

Returns the current DOM as HTML.

clear_cookies()[source]

Deletes all cookies.

clear_proxy()[source]

Resets the custom HTTP proxy (no proxy is used for future requests).

cookies()[source]

Returns a list of all cookies in cookie string format.

eval_script(expr)[source]

Evaluates a piece of Javascript in the context of the current page and returns its value.

exec_script(script)[source]

Executes a piece of Javascript in the context of the current page.

get_node_factory()[source]

Returns the associated node factory.

get_timeout()[source]

Returns the timeout applied to every webkit-server command.

headers()[source]

Returns a list of the last HTTP response headers. Header keys are normalized to capitalized form, as in User-Agent.
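The capitalized form mentioned above can be reproduced with a small helper. normalize_header_key is a hypothetical name, and this is only one plausible way to arrive at the same form; the server may do it differently.

```python
def normalize_header_key(key):
    """Capitalize each dash-separated part, e.g. user-agent -> User-Agent."""
    return "-".join(part.capitalize() for part in key.split("-"))


examples = [normalize_header_key(k) for k in ("user-agent", "CONTENT-TYPE", "etag")]
print(examples)
```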

issue_node_cmd(*args)[source]

Issues a node-specific command.

render(path, width=1024, height=1024)[source]

Renders the current page to a PNG file (viewport size in pixels).

reset()[source]

Resets the current web session.

reset_attribute(attr)[source]

Resets a custom attribute.

set_attribute(attr, value=True)[source]

Sets a custom attribute for our Webkit instance. Possible attributes are:

  • auto_load_images
  • dns_prefetch_enabled
  • plugins_enabled
  • private_browsing_enabled
  • javascript_can_open_windows
  • javascript_can_access_clipboard
  • offline_storage_database_enabled
  • offline_web_application_cache_enabled
  • local_storage_enabled
  • local_storage_database_enabled
  • local_content_can_access_remote_urls
  • local_content_can_access_file_urls
  • accelerated_compositing_enabled
  • site_specific_quirks_enabled

For all of these options, value must be a boolean. You can find more information about them in the Qt docs.
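Several attributes are often set together. The helper below is a sketch that relies only on the set_attribute(attr, value) signature documented above; apply_attributes and RecordingClient are hypothetical names, and RecordingClient records calls instead of talking to a server.

```python
def apply_attributes(client, attributes):
    """Call client.set_attribute(name, value) for each entry, booleans only."""
    for name, value in attributes.items():
        if not isinstance(value, bool):
            raise TypeError("%s must be a boolean, got %r" % (name, value))
        client.set_attribute(name, value)


class RecordingClient:
    """Hypothetical stand-in that records calls for inspection."""

    def __init__(self):
        self.calls = []

    def set_attribute(self, attr, value=True):
        self.calls.append((attr, value))


client = RecordingClient()
apply_attributes(client, {"auto_load_images": False, "plugins_enabled": False})
print(client.calls)
```

With a real client, the same helper would be passed the driver or client instance in place of RecordingClient.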

set_cookie(cookie)[source]

Sets a cookie for future requests (must be in correct cookie string format).

set_error_tolerant(tolerant=True)[source]

DEPRECATED! This function is a no-op now.

Used to set or unset the error tolerance flag in the server. If this flag was set, dropped requests or erroneous responses would not lead to an error.

set_header(key, value)[source]

Sets a HTTP header for future requests.

set_html(html, url=None)[source]

Sets custom HTML in our Webkit session and allows a fake URL to be specified. Scripts and CSS are fetched dynamically, as if the HTML had been loaded from the given URL.

set_proxy(host='localhost', port=0, user='', password='')[source]

Sets a custom HTTP proxy to use for future requests.

set_timeout(timeout)[source]

Sets the timeout applied to every webkit-server command.

set_viewport_size(width, height)[source]

Sets the viewport size.

source()[source]

Returns the source of the page as it was originally served by the web server.

status_code()[source]

Returns the numeric HTTP status of the last response.

url()[source]

Returns the current location.

visit(url)[source]

Goes to a given URL.

exception webkit_server.EndOfStreamError(msg='Unexpected end of file')[source]

Bases: exceptions.Exception

Raised when the Webkit server closed the connection unexpectedly.

exception webkit_server.InvalidResponseError[source]

Bases: exceptions.Exception

Raised when the Webkit server signaled an error.

exception webkit_server.NoResponseError[source]

Bases: exceptions.Exception

Raised when the Webkit server does not respond.

exception webkit_server.NoX11Error[source]

Bases: webkit_server.WebkitServerError

Raised when the Webkit server cannot connect to X.

class webkit_server.Node(client, node_id)[source]

Bases: webkit_server.SelectionMixin

Represents a DOM node in our Webkit session.

client is the associated client instance.

node_id is the internal ID that is used to identify the node when communicating with the server.

click()[source]

Alias for left_click.

double_click()[source]

Double clicks the current node, then waits for the page to fully load.

drag_to(element)[source]

Drag the node to another one.

eval_script(js)[source]

Evaluate arbitrary Javascript with the node variable bound to the current node.

exec_script(js)[source]

Execute arbitrary Javascript with the node variable bound to the current node.

focus()[source]

Puts the focus onto the current node, then waits for the page to fully load.

get_attr(name)[source]

Returns the value of an attribute.

get_bool_attr(name)[source]

Returns the value of a boolean HTML attribute like checked or disabled.

get_node_factory()[source]

Returns the associated node factory.

hover()[source]

Hovers over the current node, then waits for the page to fully load.

is_attached()[source]

Checks whether the current node actually exists on the currently active web page.

is_checked()[source]

Is the checked attribute set for this node?

is_disabled()[source]

Is the disabled attribute set for this node?

is_multi_select()[source]

Is this node a multi-select?

is_selected()[source]

Is the selected attribute set for this node?

is_visible()[source]

Checks whether the current node is visible.

left_click()[source]

Left clicks the current node, then waits for the page to fully load.

path()[source]

Returns an XPath expression that uniquely identifies the current node.

right_click()[source]

Right clicks the current node, then waits for the page to fully load.

select_option()[source]

Selects an option node.

set(value)[source]

Sets the node content to the given value (e.g. for input fields).

set_attr(name, value)[source]

Sets the value of an attribute.

submit()[source]

Submits a form node, then waits for the page to completely load.

tag_name()[source]

Returns the tag name of the current node.

text()[source]

Returns the inner text (not HTML).

unselect_options()[source]

Unselects an option node (only possible within a multi-select).

value()[source]

Returns the node’s value.

exception webkit_server.NodeError[source]

Bases: exceptions.Exception

A problem occurred within a Node instance method.

class webkit_server.NodeFactory(client)[source]

Bases: object

Implements the default node factory.

client is the associated client instance.

class webkit_server.SelectionMixin[source]

Bases: object

Implements a generic XPath selection for a class providing _get_xpath_ids, _get_css_ids and get_node_factory methods.

css(css)[source]

Finds another node by a CSS selector relative to the current node.

xpath(xpath)[source]

Finds another node by XPath originating at the current node.
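The relationship between the ID lookups and the node factory can be sketched as follows. The names SketchSelectionMixin, EchoFactory and FakeClient are hypothetical, and the assumption that IDs arrive as one comma-separated string is made purely for illustration.

```python
class SketchSelectionMixin:
    """Illustrative mixin: turns node IDs into node objects via the factory."""

    def xpath(self, xpath):
        return self._build(self._get_xpath_ids(xpath))

    def css(self, css):
        return self._build(self._get_css_ids(css))

    def _build(self, ids):
        # Assumption: IDs arrive as a single comma-separated string.
        factory = self.get_node_factory()
        return [factory.create(node_id) for node_id in ids.split(",") if node_id]


class EchoFactory:
    """Hypothetical factory that returns a label instead of a real node."""

    def create(self, node_id):
        return "node#" + node_id


class FakeClient(SketchSelectionMixin):
    """Hypothetical class providing the three methods the mixin relies on."""

    def _get_xpath_ids(self, xpath):
        return "1,2,3"

    def _get_css_ids(self, css):
        return ""

    def get_node_factory(self):
        return EchoFactory()


xpath_nodes = FakeClient().xpath("//p")
css_nodes = FakeClient().css("p.missing")
```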

class webkit_server.Server(binary=None)[source]

Bases: object

Manages a Webkit server process. If binary is given, the specified webkit_server binary is used instead of the included one.

connect()[source]

Returns a new socket connection to this server.

kill()[source]

Kill the process.

class webkit_server.ServerConnection(server=None)[source]

Bases: object

A connection to a Webkit server.

server is a server instance or None if a singleton server should be connected to (will be started if necessary).

issue_command(cmd, *args)[source]

Sends a message to the server and receives its response.

class webkit_server.SocketBuffer(f)[source]

Bases: object

A convenience class for buffered reads from a socket.

read(n)[source]

Consume n characters from the stream.

read_line()[source]

Consume one line from the stream.
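A buffered reader offering read(n) and read_line() over a stream might look like the sketch below. SketchBuffer is a hypothetical name, the chunk size is arbitrary, and io.BytesIO stands in for a socket file object; the line-then-payload framing in the usage is illustrative only.

```python
import io


class SketchBuffer:
    """Illustrative buffered reader over any object with a read(size) method."""

    def __init__(self, f, chunk_size=4):
        self.f = f
        self.buf = b""
        self.chunk_size = chunk_size

    def _fill(self):
        # Pull one more chunk from the underlying stream into the buffer.
        data = self.f.read(self.chunk_size)
        if not data:
            raise EOFError("stream ended unexpectedly")
        self.buf += data

    def read(self, n):
        """Consume exactly n bytes from the stream."""
        while len(self.buf) < n:
            self._fill()
        result, self.buf = self.buf[:n], self.buf[n:]
        return result

    def read_line(self):
        """Consume one line (without the trailing newline) from the stream."""
        while b"\n" not in self.buf:
            self._fill()
        line, self.buf = self.buf.split(b"\n", 1)
        return line


stream = SketchBuffer(io.BytesIO(b"ok\n5\nhello"))
status = stream.read_line()
length = int(stream.read_line())
payload = stream.read(length)
```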

exception webkit_server.WebkitServerError[source]

Bases: exceptions.Exception

Raised when the Webkit server experiences an error.

webkit_server.get_default_server()[source]

Returns a singleton Server instance (possibly after creating it, if it doesn’t exist yet).
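Lazy creation of a shared instance, as described above, can be sketched like this. FakeServer and the _cache dictionary are hypothetical; the real Server starts a process, which this stand-in does not.

```python
class FakeServer:
    """Hypothetical stand-in for Server; counts how often it is constructed."""

    instances = 0

    def __init__(self):
        FakeServer.instances += 1


_cache = {}


def get_default_server():
    """Create the shared server on first use, then keep returning it."""
    if "server" not in _cache:
        _cache["server"] = FakeServer()
    return _cache["server"]


first = get_default_server()
second = get_default_server()
print(first is second)
```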