Executor
Executor is a crawler engine used to scrape data. Any custom executor extends the Executor interface and implements its abstract methods.
Executor Interface
Understanding the executor interface is crucial for understanding the default executors and for creating custom executors.
- class scrapqd.fetch.interface.Executor(url, method='get', headers=None, response_type=None)
Interface for Executor implementation
This class is exported only to assist people in implementing their own executors for crawling without duplicating too much code.
- property success_status_code
Default success status codes for the request. Defaults to [200].
- Returns
List
- get_payload(payload)
Creates payload for http request.
- Parameters
payload – Additional payload argument for request.
- Returns
Dict
- get_default_headers()
Gets the user-agent and constructs other default headers for the request.
User-Agent: from the data files.
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br
Pragma: no-cache
- Returns
Dict
- get_response_type()
Gets response type from the request response.
- Returns
String
- get_headers()
Constructs headers to be applied to the request from default headers and user-provided headers. User-provided headers override default headers.
- Returns
Dict
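The override behavior described above can be sketched with a plain dict merge. This is an illustrative example, not scrapqd's actual implementation; the header values and the `merge_headers` helper are assumptions.

```python
# Illustrative sketch of the get_headers() merge: user-provided headers
# override defaults. Names here are assumptions, not scrapqd internals.
DEFAULT_HEADERS = {
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Accept-Language": "en-US,en;q=0.9",
}

def merge_headers(user_headers=None):
    # Later keys win in a dict merge, so user headers take precedence.
    return {**DEFAULT_HEADERS, **(user_headers or {})}

merged = merge_headers({"Accept-Language": "de-DE", "X-Token": "abc"})
```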
- get_response_content()
Gets response content from the processed request.
- Returns
json
If the response type is json
html
If the response type is text/html
- execute(**kwargs)
Executes crawl method and gets http response from web.
- Parameters
kwargs – Additional keyword arguments for extensibility.
- Raises
Exception – Re-raises any exception that occurs during the crawl so the client can capture and handle it
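The log-and-re-raise behavior can be sketched as below. The function and names are hypothetical stand-ins for illustration, not the library's actual code.

```python
import logging

logger = logging.getLogger(__name__)

def execute_sketch(crawl_fn, url):
    # Run the crawl and re-raise any failure so the caller can handle it.
    try:
        return crawl_fn(url)
    except Exception:
        logger.exception("crawl failed for %s", url)
        raise  # re-raise for the client to capture and handle

def failing_crawl(url):
    raise ValueError("boom")
```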
- abstract get_response_url()
Gets response url. It should be the final url after redirect (if any).
- Returns
String
- abstract get_response_headers()
Gets http response headers
- Returns
Dict
- abstract is_success()
Method definition to identify whether the request was successful. By default, status_code == 200 is considered success.
- Returns
Boolean
- abstract get_response_text()
Gets response text.
- Returns
String
- abstract get_response_json()
Gets response as json.
- Returns
Dict
- abstract get_status_code()
Gets response status code of the http request made.
- Returns
Integer
- abstract crawl(url, method='get', headers=None, **kwargs)
Crawls the given url from the web. This method should return only the raw http response from the library without any further processing of the response.
- Parameters
url – URL to crawl
method – Http method to be used for the crawl
headers –
Additional headers for the executor. Some websites need additional headers to make the request. The system adds the request headers below by default. These headers can be overridden using the headers argument.
User-Agent: from the data files.
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br
Pragma: no-cache
kwargs – Additional keyword arguments to support executor.
- Returns
Http response
Requests
Requests uses the requests library to execute requests and implements the parent's abstract methods.
class Requests(Executor):
def get_response_url(self):
return self.response.url
def get_response_headers(self):
return dict(self.response.headers)
def get_status_code(self):
return self.response.status_code
def get_response_text(self):
    # .text decodes the body to a string; .content would return bytes
    return self.response.text
def get_response_json(self):
return self.response.json()
def is_success(self):
status_code = self.get_status_code()
return status_code in self.success_status_code
def crawl(self, url, headers=None, method="get", **kwargs):
    # Use the arguments passed to crawl rather than ignoring them
    return requests.request(method, url, headers=headers, **kwargs)
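The method-mapping contract that Requests follows can be demonstrated without a network call. `FakeResponse` and `RequestsLike` below are illustrative stand-ins showing how an executor adapts a raw library response to the Executor contract; they are not part of scrapqd.

```python
import json

# FakeResponse mimics the attributes of a requests-style response object;
# RequestsLike shows the adapter pattern. Both are assumptions for this
# example, not scrapqd classes.
class FakeResponse:
    def __init__(self, url, status_code, text):
        self.url = url
        self.status_code = status_code
        self.text = text

    def json(self):
        return json.loads(self.text)

class RequestsLike:
    success_status_code = [200]

    def __init__(self, response):
        self.response = response

    def get_response_url(self):
        return self.response.url

    def get_status_code(self):
        return self.response.status_code

    def is_success(self):
        # Mirrors the is_success check in the Requests executor above
        return self.get_status_code() in self.success_status_code

    def get_response_json(self):
        return self.response.json()

executor = RequestsLike(FakeResponse("https://example.com/", 200, '{"ok": true}'))
```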
Selenium
Selenium Driver
SeleniumDriver is the generic implementation for crawling using selenium.
- class scrapqd.executor.selenium_driver.selenium.SeleniumDriver
Internal selenium driver implementation for all the browser types
- wait_load(xpath, wait_time)
Waits for browser to load specific element in the given url. If the xpath is not given, selenium will wait for the document to be ready.
- Parameters
xpath – Element to wait
wait_time – Wait time in seconds for the element to be present in the web page.
- fetch(url, **kwargs)
Fetches web page for the url
- Parameters
url – url to crawl
kwargs –
wait
Wait time in seconds for the element in the web page.
xpath
Element to wait for. If this parameter is not given, selenium will wait for the document to be ready until the wait time.
- get_response_headers()
This executes javascript in the browser to get http response headers.
- Returns
Dict
- get_current_url()
Gets the current url after redirect (if any).
- Returns
String
- get_page_source(url, **kwargs)
Returns page source of the url
- Parameters
url – url to crawl
kwargs –
wait
Wait time in seconds for the element in the web page.
xpath
Element to wait for. If this parameter is not given, selenium will wait for the document to be ready until the wait time.
- Returns
HTML Web page string
- clean_up()
Quits the browser and sets the driver to None when this method is called.
- classmethod get_executable_path(browser, **kwargs)
Gets browser executable from repository using webdriver_manager.
- Parameters
browser – Name of the browser
kwargs – Webdriver_manager options for the browser to download executable.
- Returns
BrowserDriver
Selenium Browser
GoogleChrome and Firefox browsers are currently implemented. GoogleChrome is given as an example here.
Selenium Executor
The Selenium executor is used to crawl modern web pages that use javascript rendering (client-side rendering).
class Selenium(Executor):
"""Selenium is a generic processor (facade) for all browsers and
implements all abstract methods from the `Executor` class."""
def __init__(self, url, **kwargs):
super().__init__(url, **kwargs)
self._response_headers = {}
self._current_url = None
def get_response_url(self):
if not self._current_url:
logger.error("Not able to get current_url for %s from selenium", self.url, exc_info=True)
return self.url
return self._current_url
def is_success(self):
return True
def get_response_text(self):
return self.response
def get_response_json(self):
if isinstance(self.response, str):
try:
self.response = json.loads(self.response)
except Exception:
    logger.exception("Not able to convert response to json for %s", self.url)
return self.response
def get_status_code(self):
return 200
def get_response_headers(self):
return self._response_headers
def crawl(self, url, method="get", headers=None, **kwargs):
"""Selenium crawl gets a browser from the browser factory and crawls the url."""
browser_name = kwargs.get("browser", "GOOGLE_CHROME")
browser = BrowserFactory().get(browser_name)()
response = browser.get_page_source(url, **kwargs)
self._response_headers = browser.get_response_headers()
self._current_url = browser.get_current_url()
return response
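The factory lookup that crawl performs can be sketched as a simple class registry. The registry mechanics and the `StubChrome` browser below are assumptions made for illustration, not scrapqd's actual `BrowserFactory` implementation.

```python
# Illustrative registry sketch of the BrowserFactory lookup used by
# Selenium.crawl(). Names are assumptions, not scrapqd internals.
class BrowserFactorySketch:
    _registry = {}

    @classmethod
    def register(cls, name, browser_cls):
        cls._registry[name] = browser_cls

    def get(self, name):
        # Mirrors BrowserFactory().get(browser_name) from the snippet above
        try:
            return self._registry[name]
        except KeyError:
            raise ValueError(f"Unknown browser: {name}") from None

class StubChrome:
    def get_page_source(self, url, **kwargs):
        # A real browser class would drive selenium here
        return f"<html>stub page for {url}</html>"

BrowserFactorySketch.register("GOOGLE_CHROME", StubChrome)
browser = BrowserFactorySketch().get("GOOGLE_CHROME")()
page = browser.get_page_source("https://example.com")
```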