Executor

Executor is the crawler engine used to scrape data. Any custom executor extends the Executor interface and implements its abstract methods.

Executor Interface

Understanding the executor interface is crucial to understanding the default executors and to creating custom ones.

class scrapqd.fetch.interface.Executor(url, method='get', headers=None, response_type=None)

Interface for Executor implementation

This class is exported only to assist people in implementing their own executors for crawling without duplicating too much code.

property success_status_code

Default success codes for the request. The default is [200].

Returns

List
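A custom executor can widen the accepted codes by overriding this property. A minimal standalone sketch (`LenientExecutor` and its code list are illustrative, not part of scrapqd):

```python
class LenientExecutor:
    @property
    def success_status_code(self):
        # Also treat "No Content" and "Partial Content" as success
        return [200, 204, 206]

executor = LenientExecutor()
204 in executor.success_status_code  # True
```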

get_payload(payload)

Creates payload for http request.

Parameters

payload – Additional payload argument for request.

Returns

Dict

get_default_headers()

Gets the user agent and constructs the other default headers for the request:

  • User-Agent: from the data files.

  • Connection: keep-alive

  • Upgrade-Insecure-Requests: 1

  • Accept-Language: en-US,en;q=0.9

  • Accept-Encoding: gzip, deflate, br

  • Pragma: no-cache

Returns

Dict

get_response_type()

Gets response type from the request response.

Returns

String

get_headers()

Constructs the headers applied to the request from the default headers and user-provided headers. User-provided headers override the defaults.

Returns

Dict
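The merge behavior can be pictured as a plain dict update in which user values win. A standalone sketch (`merge_headers` is illustrative, not the library's actual method name):

```python
DEFAULT_HEADERS = {
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Pragma": "no-cache",
}

def merge_headers(user_headers=None):
    """Start from the defaults, then let user-provided headers override them."""
    headers = dict(DEFAULT_HEADERS)
    headers.update(user_headers or {})
    return headers

merged = merge_headers({"Accept-Language": "de-DE", "X-Custom": "1"})
# The user value replaces the default; unrelated defaults are kept.
```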

get_response_content()

Gets response content from the processed request.

Returns

  • json If the response type is json

  • html If the response type is text/html
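The dispatch can be sketched as a branch on the response type (the function name is illustrative):

```python
import json

def response_content(response_type, body):
    """Return parsed json for json responses, the raw html string otherwise."""
    if response_type == "json":
        return json.loads(body)
    return body  # text/html responses pass through as a string

response_content("json", '{"ok": true}')  # {'ok': True}
response_content("html", "<p>ok</p>")     # '<p>ok</p>'
```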

execute(**kwargs)

Executes the crawl method and gets the http response from the web.

Parameters

kwargs – Additional keyword arguments for extensibility.

Raises

Exception – Re-raises any exception that occurs during the crawl so the client can capture and handle it

abstract get_response_url()

Gets response url. It should be the final url after redirect (if any).

Returns

String

abstract get_response_headers()

Gets http response headers

Returns

Dict

abstract is_success()

Identifies whether the request was successful. By default, status_code == 200 is considered a success.

Returns

Boolean

abstract get_response_text()

Gets response text.

Returns

String

abstract get_response_json()

Gets response as json.

Returns

Dict

abstract get_status_code()

Gets response status code of the http request made.

Returns

Integer

abstract crawl(url, method='get', headers=None, **kwargs)

Crawls the given url from the web. This method should return only the raw http response from the underlying library, without any further processing of the response.

Parameters
  • url – URL to crawl

  • method – Http method which should be used to crawl

  • headers

    Additional headers for the executor. Some websites need additional headers to serve the request. The system adds the request headers below by default; they can be overridden using the headers argument.

    • User-Agent: from the data files.

    • Connection: keep-alive

    • Upgrade-Insecure-Requests: 1

    • Accept-Language: en-US,en;q=0.9

    • Accept-Encoding: gzip, deflate, br

    • Pragma: no-cache

  • kwargs – Additional keyword arguments to support the executor.

Returns

Http response
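Putting the abstract methods together, a custom executor wraps a single HTTP client and maps that client's response onto the interface. The standalone sketch below uses a canned response object instead of a real HTTP library so it runs without network access; the class and attribute names other than the abstract method names are assumptions, not scrapqd APIs:

```python
class CannedResponse:
    """Stand-in for an HTTP library's response object."""
    url = "https://example.com/final"
    status_code = 200
    headers = {"Content-Type": "text/html"}
    text = "<html>ok</html>"

class CannedExecutor:
    """Implements the Executor abstract methods over CannedResponse."""
    success_status_code = [200]

    def crawl(self, url, method="get", headers=None, **kwargs):
        # A real executor would call its HTTP client here.
        self.response = CannedResponse()
        return self.response

    def get_response_url(self):
        return self.response.url

    def get_response_headers(self):
        return dict(self.response.headers)

    def get_status_code(self):
        return self.response.status_code

    def get_response_text(self):
        return self.response.text

    def get_response_json(self):
        raise NotImplementedError("canned response is html only")

    def is_success(self):
        return self.get_status_code() in self.success_status_code

executor = CannedExecutor()
executor.crawl("https://example.com")
executor.is_success()  # True
```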

Requests

Requests uses the requests library to execute http requests and implements the parent's abstract methods.

import requests

class Requests(Executor):
    def get_response_url(self):
        return self.response.url

    def get_response_headers(self):
        return dict(self.response.headers)

    def get_status_code(self):
        return self.response.status_code

    def get_response_text(self):
        # .text decodes the body to a string, matching the String contract
        return self.response.text

    def get_response_json(self):
        return self.response.json()

    def is_success(self):
        status_code = self.get_status_code()
        return status_code in self.success_status_code

    def crawl(self, url, method="get", headers=None, **kwargs):
        # Use the arguments passed in, not instance attributes, and
        # forward the extra keyword arguments to requests.
        return requests.request(method, url, headers=headers, **kwargs)

Selenium

Selenium Driver

SeleniumDriver is the generic implementation for crawling using selenium.

class scrapqd.executor.selenium_driver.selenium.SeleniumDriver

Internal selenium driver implementation for all the browser types

wait_load(xpath, wait_time)

Waits for the browser to load a specific element on the given url. If the xpath is not given, selenium waits for the document to be ready.

Parameters
  • xpath – Element to wait

  • wait_time – Wait time in seconds for the element to present in the web page.

fetch(url, **kwargs)

Fetches the web page for the given url

Parameters
  • url – url to crawl

  • kwargs

    • wait Wait time in seconds for the element in the web page.

    • xpath Element to wait for. If this parameter is not given, selenium will wait for the document to be ready until the wait time elapses.

get_response_headers()

This executes javascript in the browser to get http response headers.

Returns

Dict

get_current_url()

Gets the current url after redirect (if any).

Returns

String

get_page_source(url, **kwargs)

Returns page source of the url

Parameters
  • url – url to crawl

  • kwargs

    • wait Wait time in seconds for the element in the web page.

    • xpath Element to wait for. If this parameter is not given, selenium will wait for the document to be ready until the wait time elapses.

Returns

HTML Web page string

clean_up()

Quits the browser and sets the driver to None when this method is called

classmethod get_executable_path(browser, **kwargs)

Gets browser executable from repository using webdriver_manager.

Parameters
  • browser – Name of the browser

  • kwargs – Webdriver_manager options for the browser to download executable.

Returns

BrowserDriver

Selenium Browser

The GoogleChrome and Firefox browsers are currently implemented; GoogleChrome is shown as an example here.

class scrapqd.executor.selenium_driver.browsers.GoogleChrome

Creates Google Chrome type driver

classmethod create_browser()

Returns a headless Google Chrome browser object

Selenium Executor

The Selenium executor is used to crawl modern web pages that use JavaScript rendering (client-side rendering).

import json

# `logger` and `BrowserFactory` are provided elsewhere in scrapqd.
class Selenium(Executor):
    """SeleniumExecutor is class a generic processor (facade) for all browsers and
    implements all abstract method from `Executor` class."""

    def __init__(self, url, **kwargs):
        super().__init__(url, **kwargs)
        self._response_headers = {}
        self._current_url = None

    def get_response_url(self):
        if not self._current_url:
            logger.error("Not able to get current_url for %s from selenium", self.url, exc_info=True)
            return self.url
        return self._current_url

    def is_success(self):
        return True

    def get_response_text(self):
        return self.response

    def get_response_json(self):
        if isinstance(self.response, str):
            try:
                self.response = json.loads(self.response)
            except Exception:
                logger.exception("Not able to get convert to json data %s", self.url, exc_info=True)

        return self.response

    def get_status_code(self):
        return 200

    def get_response_headers(self):
        return self._response_headers

    def crawl(self, url, method="get", headers=None, **kwargs):
        """"Selenium crawl gets browser from browser factory and crawls the url"""
        browser_name = kwargs.get("browser", "GOOGLE_CHROME")
        browser = BrowserFactory().get(browser_name)()
        response = browser.get_page_source(url, **kwargs)
        self._response_headers = browser.get_response_headers()
        self._current_url = browser.get_current_url()
        return response