Parser

The parser is used by GraphQL queries to parse HTML. The current system supports XPath through the lxml parser.

The library does not support Beautiful Soup because it is slower than the lxml parser, and selector parsing is comparatively slower than XPath.

Lxml

class scrapqd.gql_parser.lxml_parser.LXMLParser(raw_html=None, html_tree=None)

This is the concrete implementation of the lxml gql_parser, which parses HTML text.

xpath_element(element, xpath=None, **kwargs)

Extracts the target nodes from the given HTML element using XPath.

Parameters
  • element – Html element.

  • xpath – Xpath to locate the elements.

  • kwargs – Additional keyword arguments for extensibility.

Returns

List[HTMLElement]
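
The element-relative lookup described above can be sketched as follows. This is an illustration only, not scrapqd's implementation: to stay dependency-free it uses the standard library's xml.etree.ElementTree, whose findall() accepts only a subset of XPath. lxml elements expose the same findall() call and, in addition, an xpath() method for full XPath 1.0. The helper name find_elements and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup so the stdlib XML parser accepts it.
SAMPLE = """
<html><body>
  <div id="menu"><a href="/home">Home</a><a href="/about">About</a></div>
  <div id="footer"><a href="/terms">Terms</a></div>
</body></html>
"""


def find_elements(element, xpath):
    """Return every node under `element` that matches `xpath` (illustrative helper)."""
    return element.findall(xpath)


root = ET.fromstring(SAMPLE.strip())

# First narrow the search to one container, then query relative to it.
menu = find_elements(root, ".//div[@id='menu']")[0]
links = find_elements(menu, ".//a")

print([a.get("href") for a in links])
```

Querying relative to `menu` rather than `root` is what keeps the footer link out of the result.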

xpath_text(element, xpath, **kwargs)

Extracts the text of the nodes matched by the given XPath.

Parameters
  • element – Html element.

  • xpath – Xpath to locate the elements.

  • kwargs – Additional keyword arguments for extensibility.

Returns

List[String]
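
A minimal sketch of text extraction that yields a list of strings, one per matched node. It assumes that text is collected from the node and its descendants and then stripped; the library's exact whitespace handling is not documented here. The standard library's xml.etree.ElementTree stands in for lxml (lxml elements offer the same itertext() method), and text_of is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SAMPLE = "<ul><li>Alpha <b>one</b></li><li>  Beta  </li><li/></ul>"


def text_of(element, xpath):
    """Collect the stripped text of each node matching `xpath` (illustrative helper)."""
    texts = []
    for node in element.findall(xpath):
        # itertext() walks the node and its descendants in document order.
        texts.append("".join(node.itertext()).strip())
    return texts


root = ET.fromstring(SAMPLE)
items = text_of(root, "./li")
print(items)
```

An empty element contributes an empty string, so the result list keeps one entry per matched node.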

extract_element_source_text(element)

Extracts the source HTML content of the element.

Parameters
  • element – Html element.

Returns

String
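
Serializing an element back to markup can be sketched like this. It is an assumption that the method behaves this way; the sketch uses the standard library's ET.tostring(), and lxml.etree provides a tostring() function with the same role.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<div><p class="lead">Hello <em>world</em></p><p>Bye</p></div>')

# Pick one node and turn it back into a markup string.
first_p = root.find("./p")
source = ET.tostring(first_p, encoding="unicode")

print(source)
```

Passing encoding="unicode" returns a str rather than bytes, which matches the String return type listed above.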

extract_text(xpath, **kwargs)

Extracts text content from element.

Parameters
  • xpath – Xpath to locate the elements.

  • kwargs – Additional keyword arguments for extensibility.

Returns

List[String]

extract_elements(xpath, **kwargs)

Extracts nodes from given html element.

Parameters
  • xpath – Xpath to locate the elements.

  • kwargs – Additional keyword arguments for extensibility.

Returns

List[HTMLElement]
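
The extract_* methods take only an XPath, while the xpath_* methods also take an element. One plausible reading is that the former search the parser's own document tree and the latter search below a given node. The class below is a hypothetical stand-in that illustrates that relationship; it is not scrapqd's LXMLParser, and it uses the standard library's xml.etree.ElementTree so that it runs without lxml.

```python
import xml.etree.ElementTree as ET


class MiniParser:
    """Hypothetical stand-in used only to illustrate the two call styles."""

    def __init__(self, raw_html):
        # Parse once and keep the root so later calls need only an XPath.
        self.tree = ET.fromstring(raw_html)

    def xpath_element(self, element, xpath):
        # Element-level call: search below the node that is passed in.
        return element.findall(xpath)

    def extract_elements(self, xpath):
        # Document-level call: start the search at the stored root.
        return self.xpath_element(self.tree, xpath)

    def extract_text(self, xpath):
        return ["".join(n.itertext()).strip() for n in self.extract_elements(xpath)]


parser = MiniParser("<div><h1>Title</h1><p>First</p><p>Second</p></div>")
paragraphs = parser.extract_text("./p")
print(paragraphs)
```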

extract_attr(xpath, **kwargs)

Extracts attributes from the html element.

Parameters
  • xpath – Xpath to locate the elements.

  • kwargs – Additional keyword arguments for extensibility.

Returns

List[Dict]
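
A sketch of attribute extraction returning one dictionary per matched node, which is one way to read the List[Dict] return type. The standard library's xml.etree.ElementTree is used in place of lxml (both expose element attributes through .attrib), and attributes_of is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SAMPLE = '<nav><a href="/a" class="x">A</a><a href="/b">B</a></nav>'


def attributes_of(element, xpath):
    """Return the attributes of each node matching `xpath` as plain dicts."""
    # dict(...) copies the mapping so callers cannot mutate the tree by accident.
    return [dict(node.attrib) for node in element.findall(xpath)]


attrs = attributes_of(ET.fromstring(SAMPLE), ".//a")
print(attrs)
```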

extract_form_input(xpath, **kwargs)

Extracts form inputs using the given XPath. The XPath is expected to locate the form node.

Parameters
  • xpath – Xpath to locate the elements.

  • kwargs – Additional keyword arguments for extensibility.

Returns

List[Dict]
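
The idea of locating a form node and reading its inputs can be sketched as below. The result shape (one name-to-value dictionary per matched form) is an assumption, and the library's actual output may differ. The sketch uses the standard library's xml.etree.ElementTree in place of lxml; form_inputs and the sample form are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical login form, written as well-formed markup for the stdlib parser.
SAMPLE = """
<body>
  <form id="login" action="/session">
    <input type="hidden" name="csrf" value="abc123"/>
    <input type="text" name="user" value=""/>
    <input type="submit" value="Sign in"/>
  </form>
</body>
"""


def form_inputs(element, form_xpath):
    """Return one {name: value} dict per form matched by `form_xpath`."""
    results = []
    for form in element.findall(form_xpath):
        fields = {}
        for field in form.findall(".//input"):
            name = field.get("name")
            if name:  # inputs without a name are skipped in this sketch
                fields[name] = field.get("value", "")
        results.append(fields)
    return results


inputs = form_inputs(ET.fromstring(SAMPLE.strip()), ".//form[@id='login']")
print(inputs)
```

The XPath targets the form itself, not the individual inputs, in line with the expectation stated in the method description.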