Query

A scrape query is built from three kinds of queries: query, group and leaf.

Sample Query

query test_query($url: String!, $name: GenericScalar!) {
  result: fetch(url: $url) {
    name: constant(value: $name)
    summary: group {
      total_emp_expenses: text(xpath: "//*[@id='emp-exp-total']", data_type: INT)
      total_shown_expenses: text(xpath: "//*[@id='exp-total']/span[2]", data_type: INT)
      total_approved_expenses: text(xpath: "//*[@id='emp-exp-approved']/span[2]", data_type: INT)
    }
    exp_details: list(xpath: "//div[@class='card']") {
      name: text(xpath: "//div[contains(@class,'expense-emp-name')]")
      amount: group {
        money: text(xpath: "//h6[contains(@class,'expense-amount')]/span[1]", data_type: INT)
        name: text(xpath: "//h6[contains(@class,'expense-amount')]/span[2]")
      }
    }
  }
}

Query variables

{
    "url": "http://localhost:5000/scrapqd/sample_page/",
    "name": "local-testing"
}

Query Type

Query type queries crawl a url using one of the available executors and pass the data down to their child queries for further processing. They expect leaf queries (optionally wrapped in group queries, as in the sample query) as sub-queries.

fetch

fetch(url, headers={}, executor='requests', is_json_response=false, method='GET', cache=false)

The fetch query crawls traditional websites, that is, sites that do not need javascript rendering.

url

URL to crawl

headers
  • Some websites need additional headers in the request. By default, the system provides the headers below. The headers you pass are updated with these defaults, so the default headers are constant and are sent with every request.

    • User-Agent: from the data files. This can be changed using USER_AGET_DATA_FILE or USER_AGET_DATA config.

    • Connection: keep-alive

    • Upgrade-Insecure-Requests: 1

    • Accept-Language: en-US,en;q=0.9

    • Accept-Encoding: gzip, deflate, br

    • Pragma: no-cache

  • You will not need this for most websites. API type urls might need extra headers and other http methods.
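The merge described above can be sketched in plain Python. This is only an illustration, not the library's code: `DEFAULT_HEADERS` and `build_headers` are hypothetical names, the User-Agent entry is left out because its value comes from the data files, and the precedence (defaults overwrite user values on conflict) is an assumption based on the wording "updated with default headers".

```python
# Hypothetical defaults mirroring the list above. User-Agent is omitted
# because the real value is read from the data files.
DEFAULT_HEADERS = {
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Pragma": "no-cache",
}


def build_headers(user_headers=None):
    """Merge user headers with the defaults (assumed: defaults win on conflict)."""
    merged = dict(user_headers or {})
    merged.update(DEFAULT_HEADERS)
    return merged


headers = build_headers({"X-Api-Key": "secret", "Pragma": "cache"})
print(headers["X-Api-Key"])  # secret
print(headers["Pragma"])     # no-cache
```

Extra headers such as the hypothetical `X-Api-Key` survive the merge, while a user value for a default header is replaced under this assumption.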

executor
  • Executors define how to crawl the url and how to process the response. By default, the system has a “requests” executor, which uses the Requests library.

  • Custom executors can be created by extending the Executor class.

is_json_response
  • Defaults to False. Set it to True if the url returns json data. Processing of json data is not supported yet and is planned as a future enhancement, so the system currently throws an error if this is set to True.

method
  • http method to use for the request.

  • System uses GET by default. For website crawl you do not need to set this parameter.

  • API type urls might need other http methods like POST.

cache

Note

Use this only during development.

  • Fetch is time consuming because it gets the website data from the internet. While developing a query, you may run it many times, which slows development down.

  • Setting cache = true caches the result for the url and reuses it for subsequent requests to the same url.

  • Setting ENV=DEVELOPMENT in the config enables cache for all queries by default. In any environment other than development, the cache parameter is ignored.
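The idea behind the cache parameter can be illustrated with a url-keyed dictionary. This is a minimal sketch, not the library's implementation: `fetch_page`, `fake_downloader` and `_page_cache` are hypothetical names, and the ENV=DEVELOPMENT rule is deliberately not modelled.

```python
_page_cache = {}


def fetch_page(url, downloader, cache=False):
    """Download url, reusing an earlier result for the same url when cache is on."""
    if cache and url in _page_cache:
        return _page_cache[url]
    content = downloader(url)
    if cache:
        _page_cache[url] = content
    return content


calls = []


def fake_downloader(url):
    """Stand-in for a slow network request; records every call it receives."""
    calls.append(url)
    return f"<html>{url}</html>"


url = "http://localhost:5000/scrapqd/sample_page/"
first = fetch_page(url, fake_downloader, cache=True)
second = fetch_page(url, fake_downloader, cache=True)
print(len(calls))  # 1
```

The second call is served from the dictionary, so the downloader runs only once for the same url.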

selenium

selenium(url, browser=GOOGLE_CHROME, options={}, is_json_response=false, cache=false)

The selenium query crawls modern websites that rely on javascript rendering.

url

URL to crawl.

browser

The system supports the browsers below.

  • GOOGLE_CHROME

  • FIREFOX

options

Additional options to use when crawling with selenium.

  • xpath Selenium waits for this element to be present in the loaded webpage.

  • wait_time Number of seconds Selenium waits for the xpath target above.

is_json_response

Defaults to False. Set it to True if the url returns json data. Processing of json data is not supported yet and is planned as a future enhancement, so the system currently throws an error if this is set to True.

cache

Same as the cache parameter in the fetch query.

Group Type

Group type queries group multiple leaf nodes or process multiple results of an xpath. They expect leaf or group queries as sub-queries.

  • group

  • list

group

The group query groups the output of its leaf nodes under the group's name in the result returned to the client. This is helpful for grouping related elements/data inside the query, without needing additional code outside it.

amount: group {
    money: text(xpath: "//h6[contains(@class,'expense-amount')]/span[1]", data_type: INT)
    name: text(xpath: "//h6[contains(@class,'expense-amount')]/span[2]")
}

list

list(xpath)

The list query lets you write a sub-query that extracts data from the parent element and returns it. If the list xpath returns multiple elements, the sub-query is applied to each item in the list.

xpath

Path to locate element

Example

exp_details: list(xpath: "//div[@class='card']") {
    name: text(xpath: "//div[contains(@class,'expense-emp-name')]")
    amount: group {
        money: text(xpath: "//h6[contains(@class,'expense-amount')]/span[1]", data_type: INT)
        name: text(xpath: "//h6[contains(@class,'expense-amount')]/span[2]")
    }
}

Result

{
    "result": {
        "exp_details": [
            {
              "name": "Friedrich-Wilhelm, Langern",
              "amount": {
                "money": 8800,
                "name": "egp"
              }
            },
            {
              "name": "Sebastian, Bien",
              "amount": {
                "money": 3365,
                "name": "mkd"
              }
            },
            {
              "name": "Rosa, Becker",
              "amount": {
                "money": 6700,
                "name": "xof"
              }
            },
            {
              "name": "Ines, Gröttner",
              "amount": {
                "money": 8427,
                "name": "npr"
              }
            },
            {
              "name": "Clarissa, Bonbach",
              "amount": {
                "money": 1609,
                "name": "fjd"
              }
            },
            {
              "name": "Zbigniew, Stolze",
              "amount": {
                "money": 8789,
                "name": "ern"
              }
            },
            {
              "name": "Ines, Mentzel",
              "amount": {
                "money": 1750,
                "name": "srd"
              }
            }
          ]
    }
}
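To make the shape of this result easier to follow, here is a rough sketch of how a list of group results could be assembled by hand with Python's standard library. It is not how the library works internally: the markup is a simplified, hypothetical stand-in for the sample page, and the class names are shortened because ElementTree's XPath subset has no contains().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the sample page.
PAGE = """
<html><body>
  <div class="card">
    <div class="name">Rosa, Becker</div>
    <h6 class="amount"><span>6700</span><span>xof</span></h6>
  </div>
  <div class="card">
    <div class="name">Ines, Mentzel</div>
    <h6 class="amount"><span>1750</span><span>srd</span></h6>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

exp_details = []
# list: the sub-query runs once for every element matched by the list xpath
for card in root.findall(".//div[@class='card']"):
    spans = card.findall(".//h6[@class='amount']/span")
    exp_details.append({
        # leaf: a single value taken from the current item
        "name": card.find(".//div[@class='name']").text,
        # group: leaf values nested under one key
        "amount": {
            "money": int(spans[0].text),
            "name": spans[1].text,
        },
    })

print(exp_details)
```

Each matched card becomes one entry in the list, and the group shows up as a nested object inside that entry.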

Leaf Type

Leaf nodes are the final queries that get a value from an html element, such as text in the query above. A leaf query cannot take another leaf query as a sub-query.

Data Types

A few leaf queries support data types. If a data type is given, the element content is converted to that data type before it is sent to the client. The system supports the data types below. Custom data types can be created as well.

TEXT

Default data type.

RAW

When the element text is extracted, it might have extra whitespace, which is stripped away by default. When the RAW data type is given, the data is sent exactly as it was extracted from the element.

INT
  • Data is converted to integer.

  • Example

    • 1,024 -> 1024

    • 12K -> 12000 (k/K - thousand, m/M - million, b/B - billion)

FLOAT
  • Data is converted to decimal.
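The INT conversion examples above (thousands separators and k/m/b suffixes) can be approximated as follows. This is a sketch under assumptions, not the library's converter: `to_int` is a hypothetical helper and the real implementation may accept more formats or handle errors differently.

```python
import re

_SUFFIXES = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def to_int(text):
    """Convert strings such as '1,024' or '12K' to an integer."""
    cleaned = text.strip().replace(",", "")
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)([kKmMbB]?)", cleaned)
    if match is None:
        raise ValueError(f"cannot convert {text!r} to int")
    number, suffix = match.groups()
    # An empty suffix means no multiplier is applied.
    multiplier = _SUFFIXES.get(suffix.lower(), 1)
    return int(float(number) * multiplier)


print(to_int("1,024"))  # 1024
print(to_int("12K"))    # 12000
```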

Multi

Leaf nodes support the multi parameter. An xpath may locate multiple elements; this parameter tells the system how to process them and what to return to the client.

  • false Only first element will be processed and returned to the client.

  • true All the elements will be processed. Result will be sent as array/list to client. If the query supports data_type parameter, data_type conversion will be applied on all elements.

When multi is set to false, the result format is not the same as when it is set to true.

You can set NON_MULTI_RESULT_LIST to True in the config file to get the same format in both cases.
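The difference in result format can be pictured with a small helper. This is a sketch under assumptions: `shape_result` is a hypothetical function, and it assumes that multi = false yields a single value while NON_MULTI_RESULT_LIST wraps that value in a one-element list. The source only states that the setting makes both formats the same.

```python
def shape_result(values, multi=False, non_multi_result_list=False):
    """Shape the values extracted for one leaf query (assumed behaviour)."""
    if multi:
        # All matched elements are returned as a list.
        return list(values)
    first = values[0] if values else None
    if non_multi_result_list:
        # Assumed: the single value is wrapped so the format matches multi=true.
        return [] if first is None else [first]
    return first


matches = ["egp", "mkd", "xof"]
print(shape_result(matches))                              # egp
print(shape_result(matches, multi=True))                  # ['egp', 'mkd', 'xof']
print(shape_result(matches, non_multi_result_list=True))  # ['egp']
```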

constant

constant(value)

The constant query returns a value to the client unchanged, either hard coded in the query or passed in through query variables.

value

Non-null value, either written in the query or passed from a query variable as in the sample query.

name: constant(value:"local-testing")

text

text(xpath, data_type: TEXT, multi: false)

The text query gets the content of the given element. The name does not mean that the result is always returned as text; it only denotes that the value is extracted from the element's text, and it can then be converted using data_type.

xpath

Path to locate element

data_type

Data type to return

multi

When the xpath matches multiple elements:

  • False Processes first element

  • True Processes all elements

Example

total_emp_expenses: text(xpath: "//*[@id='emp-exp-total']", data_type: INT)

attr

attr(xpath, name=null, multi=false)

An element can have multiple attributes, for example data-hovercard-type and href. The attr query fetches all of them or only the specified one. It extracts the attributes as key, value pairs, with the attribute name as the key and the attribute value as the value.

name
  • If the name is not given, it will extract all the attributes.

  • For example, if name = ‘href’ is given, it will get the “{href: /abcxcom}” mapping.

multi

When the xpath matches multiple elements:

  • False Processes first element

  • True Processes all elements

Example

approval_id: attr(xpath: "//button[contains(@class, 'expense-approve')]", name: "id")
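The name/value mapping described for attr can be illustrated with Python's standard library. This is a sketch, not the library's code: the button markup and the `attr` helper are hypothetical, and the exact result format returned by the real query may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical element; the real sample page markup may differ.
button = ET.fromstring(
    '<button id="approve-17" class="btn expense-approve" type="button">Approve</button>'
)


def attr(element, name=None):
    """Return all attributes, or only the named one, as a name -> value mapping."""
    if name is None:
        return dict(element.attrib)
    return {name: element.attrib.get(name)}


print(attr(button, "id"))  # {'id': 'approve-17'}
print(attr(button))
```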

query_params

query_params(xpath, name: null, multi: false)

Use the query_params query to extract query parameters from the url in an anchor tag, or from any element whose content is a url.

xpath

Path to locate element

name
  • If the name is not given, it will extract all the query parameters in the url.

  • For example, if name = ‘product’ is given, it will get the “{product: xyzcourse}” mapping.

multi

When the xpath matches multiple elements:

  • False Processes first element

  • True Processes all elements

Example

user_id: query_params(xpath:"//a/@href", name: "user")
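The kind of extraction query_params performs can be approximated with `urllib.parse`. This is only a sketch: the `query_params` function here and the href value are hypothetical, and the library's actual output format may differ.

```python
from urllib.parse import parse_qs, urlsplit


def query_params(url, name=None):
    """Return the url's query parameters, or only the named one, as a mapping."""
    parsed = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; keep the first value for each key.
    params = {key: values[0] for key, values in parsed.items()}
    if name is None:
        return params
    return {name: params.get(name)}


href = "/courses?product=xyzcourse&user=42"  # hypothetical href value
print(query_params(href, "product"))  # {'product': 'xyzcourse'}
print(query_params(href))             # {'product': 'xyzcourse', 'user': '42'}
```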

Result

regex

regex(xpath, pattern, source="TEXT", multi: false)

The regex query applies a regular expression to the element located by the xpath and returns the result.

xpath

Path to locate element

pattern

Regular expression pattern to match. It is passed to Python's re.findall to extract the data.

source

The regular expression can be applied either to the located element’s content or to the element’s source html.

  • text The regex is applied to the element’s content. This is the default value.

  • html The regex is applied to the element’s html.

multi

When the xpath matches multiple elements:

  • False Processes first element

  • True Processes all elements

Example

total_shown_expenses: regex(xpath: "//*[@id='exp-total']", pattern: "(\\d+)")

Result

"total_shown_expenses": [
    "40"
]
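Because the pattern is handed to re.findall, the result is a list of matched strings, which is why the value above is `["40"]` rather than a number. The sketch below shows the difference between the text and html sources; both input strings are hypothetical stand-ins for the located element.

```python
import re

# Hypothetical content and source html of the located element.
element_text = "Total shown expenses: 40"
element_html = '<div id="exp-total" data-count="7">Total shown expenses: <span>40</span></div>'

pattern = r"(\d+)"

print(re.findall(pattern, element_text))  # ['40']
print(re.findall(pattern, element_html))  # ['7', '40']
```

With the html source, digits inside attributes also match, so the text source is usually the narrower choice.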

form_input

form_input(xpath, name: null, multi: false)

The form input query extracts the name, value pairs of the input elements in a form element.

xpath

Path to locate form element

name
  • If the name is not given, it will extract all the input elements under the form.

  • If the name is given, it will get input element with the given name.

multi

When the xpath matches multiple elements:

  • False Processes first element

  • True Processes all elements

Example

Html

<form class="requestParams" id="apiAttr">
    <input name="rlz" value="1C5CHFA_enIN991IN991" type="hidden">
    <input name="tbm" value="lcl" type="hidden">
    <input name="sxsrf" value="APq-WBu3vzrA9-WQU_Mp0Zs9aq2a-PQlJg:1644327612221" type="hidden">
    <input value="vHICYpKHDaWXseMP57uWuA4" name="ei" type="hidden">
    <input value="AHkkrS4AAAAAYgKAzF3dfuu_a7YROtX7wSMb404M2sTE" disabled="true" name="iflsig" type="hidden">
</form>

Query

meta_data: form_input(xpath: "//form[@class='requestParams']", name: "sxsrf")
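What this query is expected to pick out of the html above can be sketched with Python's `html.parser`. This is an illustration only: `InputCollector` and the `form_input` function are hypothetical, the result format of the real query may differ, and this sketch also collects the disabled input, which the library may or may not do.

```python
from html.parser import HTMLParser

FORM_HTML = """
<form class="requestParams" id="apiAttr">
    <input name="rlz" value="1C5CHFA_enIN991IN991" type="hidden">
    <input name="tbm" value="lcl" type="hidden">
    <input name="sxsrf" value="APq-WBu3vzrA9-WQU_Mp0Zs9aq2a-PQlJg:1644327612221" type="hidden">
    <input value="vHICYpKHDaWXseMP57uWuA4" name="ei" type="hidden">
    <input value="AHkkrS4AAAAAYgKAzF3dfuu_a7YROtX7wSMb404M2sTE" disabled="true" name="iflsig" type="hidden">
</form>
"""


class InputCollector(HTMLParser):
    """Collect name -> value for every named input element it sees."""

    def __init__(self):
        super().__init__()
        self.inputs = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attributes = dict(attrs)
            if "name" in attributes:
                self.inputs[attributes["name"]] = attributes.get("value")


def form_input(html, name=None):
    """Return all input name/value pairs, or only the named one."""
    collector = InputCollector()
    collector.feed(html)
    if name is None:
        return collector.inputs
    return {name: collector.inputs.get(name)}


print(form_input(FORM_HTML, "sxsrf"))
print(sorted(form_input(FORM_HTML)))
```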