
Globy webcrawlers

Scrape the world with Globy!

Usage


$ globy-scraper.py -h
usage: globy-scraper.py [-h] -f URLS_FILE [-l LOGLEVEL] [-o OUTPUT_FILE] [-d] [-s] [-b BACKEND]

options:
  -h, --help            show this help message and exit
  -f URLS_FILE, --urls_file URLS_FILE
                        Newline-separated file of URLs to scrape
  -l LOGLEVEL, --loglevel LOGLEVEL
                        Set log level (1-3)
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Output CSV file path
  -d, --debug           Debug/inspect responses (will set PYTHONINSPECT to True)
  -s, --store_html_to_file
                        Store all HTML content to file, per domain, in the "html_output" folder
  -b BACKEND, --backend BACKEND
                        Select backend to use for scraping: "globy" (default) or "scrapy".
                        (!) The scrapy backend will only dump data to the "html_output" folder for now. No website analysis or other functionality is supported yet.
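When you pass -o/--output_file, results are written as CSV. A minimal sketch of reading such a file back with the stdlib csv module — the column names here ("url", "status") are assumptions for illustration, not the crawler's documented output schema:

```python
import csv

# Illustrative stand-in for a globy-scraper output file; the real
# column set depends on the crawler (these names are assumptions).
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "status"])
    writer.writerow(["https://example.com", "200"])

# Read the CSV back as dictionaries keyed by the header row.
with open("output.csv", newline="") as f:
    rows = list(csv.DictReader(f))
```

Check the header row of your actual output file for the real column names.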

Quickstart


You can just run globy-scraper as a script (no installation needed):

pip3 install -r requirements.txt  # Install dependencies if not already installed
./globy-scraper.py -f wordpress-top50.txt
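The file passed with -f is just a newline-separated list of URLs. A minimal sketch for generating one yourself (the filename and URLs are illustrative):

```python
# Write a newline-separated URLs file, the format expected by -f/--urls_file.
urls = [
    "https://example.com",
    "https://example.org",
]
with open("urls.txt", "w") as f:
    f.write("\n".join(urls) + "\n")
```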

Install the globy-webcrawlers package


You can install the package for development and for use with other Globy projects. It's recommended to set up a Python virtual environment before installing the package.

pip3 install -e .

Now you can use the package in python:

>>> from globy_webcrawlers.crawler import WebSiteDataCrawler
>>> c = WebSiteDataCrawler()
>>> c.load_urls_from_file("urls.txt")
>>> c.run()

Also, after installing globy-scraper, you can just run it: globy-scraper.py -h

Debugging/inspecting website content


Since the crawler is asynchronous, it can be tricky to debug responses. To make this easier, use the -d flag: it lets you inspect the HTML content and any internal objects from the most recently crawled website. Here's an example:

$ ./globy-scraper.py -f wordpress-top50.txt -d
>>> w = crawler.debug_latest_response()
>>> w.url
'https://www.thyroid.org/'
>>> len(w.html_content)
249167
>>> w.get_website_info()

To debug a specific response, change crawler.urls (or the input URL list) so that the site you want to inspect is the last one crawled.
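In the -d inspect session the live crawler object is already in scope. A minimal sketch of narrowing its URL list so the next run hits only the site you want to debug — the stand-in class below only mirrors the .urls attribute, and treating crawler.urls as a plain list of strings is an assumption:

```python
# Stand-in for the live crawler object available in the -d session
# (the real object is created by globy-scraper.py; this class only
# mirrors the .urls attribute for illustration).
class Crawler:
    def __init__(self, urls):
        self.urls = urls

crawler = Crawler(["https://www.thyroid.org/", "https://example.com"])

# Keep only the site we want to debug before re-running the crawl.
crawler.urls = [u for u in crawler.urls if "thyroid" in u]
```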

Development only


Build the Python package:

python3 -m build