Parser

The parser is used in the GraphQL query to parse HTML. The current system supports XPath through the lxml parser.

The library does not support Beautiful Soup because it is slower than the lxml parser, and selector-based parsing is comparatively slower than XPath.

Lxml

class scrapqd.gql_parser.lxml_parser.LXMLParser(raw_html=None, html_tree=None)

This is the concrete implementation of the lxml gql_parser for parsing HTML text.

xpath_element(element, xpath=None, **kwargs)

Extracts the target nodes from the given HTML element using XPath.

Parameters

Returns

List[HTMLElement]
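As a rough illustration of XPath-based node extraction, the sketch below calls lxml directly rather than going through LXMLParser, whose internals are not shown here. The sample markup and variable names are made up for the example.

```python
from lxml import html

# Hypothetical sample document, used only for illustration.
RAW_HTML = """
<html><body>
  <div class="item"><a href="/a">First</a></div>
  <div class="item"><a href="/b">Second</a></div>
</body></html>
"""

tree = html.fromstring(RAW_HTML)

# An XPath that selects nodes returns a list of elements.
items = tree.xpath('//div[@class="item"]')

# Each result can itself be queried with a relative XPath.
links = [item.xpath("./a")[0] for item in items]
link_tags = [link.tag for link in links]
```

The relative query on each result mirrors the idea of passing an element together with an XPath, as the method signature above suggests.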

xpath_text(element, xpath, **kwargs)

Extracts text for the given XPath.

Parameters

Returns

List[String]
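Text extraction with XPath can be sketched with plain lxml as follows. This is not the library's own code; it only shows two common ways lxml yields strings, and the sample markup is hypothetical.

```python
from lxml import html

# Hypothetical sample markup, used only for illustration.
RAW_HTML = "<ul><li>Alpha</li><li> Beta <b>bold</b></li></ul>"
tree = html.fromstring(RAW_HTML)

# text_content() joins the text of an element and all of its descendants.
texts = [li.text_content().strip() for li in tree.xpath("//li")]

# Selecting text() nodes directly returns strings instead of elements,
# and only the text that is an immediate child of each <li>.
direct = [t.strip() for t in tree.xpath("//li/text()")]
```

Note the difference: `text_content()` includes text nested in child tags such as `<b>`, while `text()` does not.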

extract_element_source_text(element)

Extracts the source HTML content.

extract_text(xpath, **kwargs)

Extracts text content from element.

Parameters

Returns

List[String]

extract_elements(xpath, **kwargs)

Extracts nodes from the given HTML element.

Parameters

Returns

List[HTMLElement]

extract_attr(xpath, **kwargs)

Extracts attributes from the HTML element.

Parameters

Returns

List[Dict]
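To show what a list of attribute dictionaries might look like, the sketch below reads the attributes of matched elements with plain lxml. The markup is hypothetical, and the exact shape LXMLParser returns may differ.

```python
from lxml import html

# Hypothetical sample markup, used only for illustration.
tree = html.fromstring(
    '<div><a href="/docs" title="Docs">Docs</a><a href="/api">API</a></div>'
)

# element.attrib behaves like a mapping; dict() copies it into a plain dict.
attrs = [dict(a.attrib) for a in tree.xpath("//a")]
```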

extract_form_input(xpath, **kwargs)

Extracts form inputs using the given XPath. The method expects the XPath to locate a form node.

Parameters

Returns

List[Dict]
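The idea of locating a form node and collecting its inputs can be sketched with plain lxml as below. The helper `form_inputs` and the sample form are hypothetical and only approximate what the method above describes.

```python
from lxml import html

# Hypothetical sample form, used only for illustration.
RAW_HTML = """
<form id="login" action="/login" method="post">
  <input type="hidden" name="csrf" value="abc123">
  <input type="text" name="user" value="">
  <input type="submit" value="Sign in">
</form>
"""
tree = html.fromstring(RAW_HTML)


def form_inputs(root, form_xpath):
    """Return one {name: value} dict per form matched by form_xpath."""
    results = []
    for form in root.xpath(form_xpath):
        fields = {}
        # Only inputs with a name attribute are submitted with a form.
        for inp in form.xpath(".//input[@name]"):
            fields[inp.get("name")] = inp.get("value", "")
        results.append(fields)
    return results


inputs = form_inputs(tree, '//form[@id="login"]')
```

The XPath targets the form itself rather than individual inputs, which matches the expectation stated above that the XPath locates a form node.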