Easy web scraping with Ferret

If you feel like doing a bit of web scraping on a Sunday, or more precisely, automatically extracting data from a web page, whether for testing, for machine learning, to compute some statistics or simply to pull in data, take a look at Ferret.

Ferret is an MIT-licensed tool that aims to make all of this very simple, thanks to its own declarative query language. That lets you focus solely on the data you want to retrieve and ignore the technical details.


Here's a code example:

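// Load the Google homepage; the second argument (true) asks for the page
// to be loaded dynamically, in a real browser
LET google = DOCUMENT("https://www.google.com/", true)

// Type the query into the search box and click the search button
INPUT(google, 'input[name="q"]', "korben")
CLICK(google, 'input[name="btnK"]')

// Wait for the results page to load
WAIT_NAVIGATION(google)

// Build one object per search result
LET result = (
    FOR result IN ELEMENTS(google, '.g')
       RETURN {
           title: ELEMENT(result, 'h3 > a'),
           description: ELEMENT(result, '.st'),
           url: ELEMENT(result, 'cite')
       }
)

// Keep only the results that have a title
RETURN (
    FOR page IN result
    FILTER page.title != NONE
    RETURN page
)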

In this example, Ferret opens the Google homepage, enters a word in the search field, and then clicks the "Search" button.

The script waits for the results page to load, then iterates over all the search results and collects the title, URL and description of each one into an object. Finally, it applies a filter to drop the entries that have no title before returning the retrieved content.
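
To get a feel for what the declarative syntax spares you, here is a rough imperative counterpart of the "collect, then filter" steps, written in Python with only the standard library. This is a sketch and not how Ferret works internally: the sample markup, the ResultParser class and the scrape function are made up for illustration, and the parser only handles flat, well-formed blocks like the ones in the sample.

```python
from html.parser import HTMLParser

# Hypothetical sample markup standing in for a results page; the class names
# mirror the selectors used in the Ferret script ('.g', 'h3 > a', '.st', 'cite').
SAMPLE_HTML = """
<div class="g"><h3><a href="https://example.com/">Example</a></h3>
  <cite>https://example.com/</cite><span class="st">An example site.</span></div>
<div class="g"><cite>https://example.org/</cite><span class="st">No title here.</span></div>
"""


class ResultParser(HTMLParser):
    """Collect title, description and url for every <div class="g"> block."""

    def __init__(self):
        super().__init__()
        self.results = []   # one dict per result block
        self.stack = []     # currently open tags, used to detect 'h3 > a'
        self.field = None   # field currently receiving text, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "g" in classes:
            # Missing fields stay None, much like NONE in the Ferret query.
            self.results.append({"title": None, "description": None, "url": None})
        elif self.results:
            if tag == "a" and self.stack and self.stack[-1] == "h3":
                self.field = "title"
            elif "st" in classes:
                self.field = "description"
            elif tag == "cite":
                self.field = "url"
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()
        self.field = None

    def handle_data(self, data):
        if self.field and data.strip():
            self.results[-1][self.field] = data.strip()


def scrape(html):
    parser = ResultParser()
    parser.feed(html)
    # Counterpart of: FILTER page.title != NONE
    return [page for page in parser.results if page["title"] is not None]


pages = scrape(SAMPLE_HTML)
print(pages)
```

The second block of the sample has no title, so the final filter discards it, just as the FILTER clause does in the Ferret script. All the state tracking above is what the three ELEMENT calls and their CSS selectors express in a few lines.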

