Executor
Executor is a crawler engine used to scrape data. Any custom executor extends the Executor interface and implements its abstract methods.
Executor Interface
Understanding the executor interface is crucial for understanding the default executors and for creating custom executors.
- class scrapqd.fetch.interface.Executor(url, method='get', headers=None, response_type=None)
Interface for Executor implementation
This class is exported only to assist people in implementing their own executors for crawling without duplicating too much code.
- property success_status_code
Default success status codes for the request. Defaults to [200].
- Returns
List
- get_payload(payload)
Creates payload for http request.
- Parameters
payload – Additional payload argument for request.
- Returns
Dict
- get_default_headers()
Gets the user-agent and constructs other default headers for the request.
User-Agent: from the data files.
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br
Pragma: no-cache
- Returns
Dict
- get_response_type()
Gets response type from the request response.
- Returns
String
- get_headers()
Constructs headers to be applied to the request from default headers and user-provided headers. User-provided headers override default headers.
- Returns
Dict
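The override behavior described above can be sketched with a plain dict merge. This is an illustrative example, not scrapqd's actual implementation; the header values and the `merge_headers` helper are assumptions.

```python
# Illustrative sketch of the get_headers() merge: user-provided headers
# override defaults. Names here are assumptions, not scrapqd internals.
DEFAULT_HEADERS = {
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Accept-Language": "en-US,en;q=0.9",
}

def merge_headers(user_headers=None):
    # Later keys win in a dict merge, so user headers take precedence.
    return {**DEFAULT_HEADERS, **(user_headers or {})}

merged = merge_headers({"Accept-Language": "de-DE", "X-Token": "abc"})
```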
- get_response_content()
Gets response content from the processed request.
- Returns
json
If the response type is json
html
If the response type is text/html
- execute(**kwargs)
Executes crawl method and gets http response from web.
- Parameters
kwargs – Additional keyword arguments for extensibility.
- Raises
Exception – Re-raises any exception that occurs during the crawl so the client can capture and handle it
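The log-and-re-raise behavior can be sketched as below. The function and names are hypothetical stand-ins for illustration, not the library's actual code.

```python
import logging

logger = logging.getLogger(__name__)

def execute_sketch(crawl_fn, url):
    # Run the crawl and re-raise any failure so the caller can handle it.
    try:
        return crawl_fn(url)
    except Exception:
        logger.exception("crawl failed for %s", url)
        raise  # re-raise for the client to capture and handle

def failing_crawl(url):
    raise ValueError("boom")
```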
- abstract get_response_url()
Gets response url. It should be the final url after redirect (if any).
- Returns
String
- abstract get_response_headers()
Gets http response headers
- Returns
Dict
- abstract is_success()
Method definition to identify whether the request was successful. By default, status_code == 200 is considered success.
- Returns
Boolean
- abstract get_response_text()
Gets response text.
- Returns
String
- abstract get_response_json()
Gets response as json.
- Returns
Dict
- abstract get_status_code()
Gets response status code of the http request made.
- Returns
Integer
- abstract crawl(url, method='get', headers=None, **kwargs)
Crawls the given url from the web. This method should return only the raw http response from the library without any further processing of the response.
- Parameters
url – URL to crawl
method – Http method to be used for the crawl
headers –
Additional headers for the executor. Some websites need additional headers to make the request. The system adds the request headers below by default. These headers can be overridden using the headers argument.
User-Agent: from the data files.
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br
Pragma: no-cache
kwargs – Additional keyword arguments to support executor.
- Returns
Http response
Requests
Requests uses the requests library to execute requests and implements the parent's abstract methods.
class Requests(Executor):
def get_response_url(self):
return self.response.url
def get_response_headers(self):
return dict(self.response.headers)
def get_status_code(self):
return self.response.status_code
def get_response_text(self):
    # .text decodes the body to a string; .content would return bytes
    return self.response.text
def get_response_json(self):
return self.response.json()
def is_success(self):
status_code = self.get_status_code()
return status_code in self.success_status_code
def crawl(self, url, headers=None, method="get", **kwargs):
    # Use the arguments passed to crawl rather than ignoring them
    return requests.request(method, url, headers=headers, **kwargs)
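The method-mapping contract that Requests follows can be demonstrated without a network call. `FakeResponse` and `RequestsLike` below are illustrative stand-ins showing how an executor adapts a raw library response to the Executor contract; they are not part of scrapqd.

```python
import json

# FakeResponse mimics the attributes of a requests-style response object;
# RequestsLike shows the adapter pattern. Both are assumptions for this
# example, not scrapqd classes.
class FakeResponse:
    def __init__(self, url, status_code, text):
        self.url = url
        self.status_code = status_code
        self.text = text

    def json(self):
        return json.loads(self.text)

class RequestsLike:
    success_status_code = [200]

    def __init__(self, response):
        self.response = response

    def get_response_url(self):
        return self.response.url

    def get_status_code(self):
        return self.response.status_code

    def is_success(self):
        # Mirrors the is_success check in the Requests executor above
        return self.get_status_code() in self.success_status_code

    def get_response_json(self):
        return self.response.json()

executor = RequestsLike(FakeResponse("https://example.com/", 200, '{"ok": true}'))
```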
Selenium
Selenium Driver
SeleniumDriver is the generic implementation for crawling using selenium.
- class scrapqd.executor.selenium_driver.selenium.SeleniumDriver
Internal selenium driver implementation for all the browser types
- wait_load(xpath, wait_time)
Waits for browser to load specific element in the given url. If the xpath is not given, selenium will wait for the document to be ready.
- Parameters
xpath – Element to wait
wait_time – Wait time in seconds for the element to be present in the web page.
- fetch(url, **kwargs)
Fetches web page for the url
- Parameters
url – url to crawl
kwargs –
wait
Wait time in seconds for the element in the web page.
xpath
Element to wait for. If this parameter is not given, selenium will wait for the document to be ready until the wait time.
- get_response_headers()
This executes javascript in the browser to get http response headers.
- Returns
Dict
- get_current_url()
Gets the current url after redirect (if any).
- Returns
String
- get_page_source(url, **kwargs)
Returns page source of the url
- Parameters
url – url to crawl
kwargs –
wait
Wait time in seconds for the element in the web page.
xpath
Element to wait for. If this parameter is not given, selenium will wait for the document to be ready until the wait time.
- Returns
HTML Web page string
- clean_up()
Quits the browser and sets the driver to None when this method is called.
- classmethod get_executable_path(browser, **kwargs)
Gets browser executable from repository using webdriver_manager.
- Parameters
browser – Name of the browser
kwargs – Webdriver_manager options for the browser to download executable.
- Returns
BrowserDriver
Selenium Browser
GoogleChrome and Firefox browsers are currently implemented. GoogleChrome is given as an example here.
Selenium Executor
The Selenium executor is used to crawl modern web pages that use javascript rendering (client-side rendering).
class Selenium(Executor):
"""Selenium is a generic processor (facade) for all browsers and
implements all abstract methods from the `Executor` class."""
def __init__(self, url, **kwargs):
super().__init__(url, **kwargs)
self._response_headers = {}
self._current_url = None
def get_response_url(self):
if not self._current_url:
logger.error("Not able to get current_url for %s from selenium", self.url, exc_info=True)
return self.url
return self._current_url
def is_success(self):
return True
def get_response_text(self):
return self.response
def get_response_json(self):
if isinstance(self.response, str):
try:
self.response = json.loads(self.response)
except Exception:
    logger.exception("Not able to convert response to json for %s", self.url)
return self.response
def get_status_code(self):
return 200
def get_response_headers(self):
return self._response_headers
def crawl(self, url, method="get", headers=None, **kwargs):
"""Selenium crawl gets a browser from the browser factory and crawls the url."""
browser_name = kwargs.get("browser", "GOOGLE_CHROME")
browser = BrowserFactory().get(browser_name)()
response = browser.get_page_source(url, **kwargs)
self._response_headers = browser.get_response_headers()
self._current_url = browser.get_current_url()
return response
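The factory lookup that crawl performs can be sketched as a simple class registry. The registry mechanics and the `StubChrome` browser below are assumptions made for illustration, not scrapqd's actual `BrowserFactory` implementation.

```python
# Illustrative registry sketch of the BrowserFactory lookup used by
# Selenium.crawl(). Names are assumptions, not scrapqd internals.
class BrowserFactorySketch:
    _registry = {}

    @classmethod
    def register(cls, name, browser_cls):
        cls._registry[name] = browser_cls

    def get(self, name):
        # Mirrors BrowserFactory().get(browser_name) from the snippet above
        try:
            return self._registry[name]
        except KeyError:
            raise ValueError(f"Unknown browser: {name}") from None

class StubChrome:
    def get_page_source(self, url, **kwargs):
        # A real browser class would drive selenium here
        return f"<html>stub page for {url}</html>"

BrowserFactorySketch.register("GOOGLE_CHROME", StubChrome)
browser = BrowserFactorySketch().get("GOOGLE_CHROME")()
page = browser.get_page_source("https://example.com")
```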