
Globy webcrawlers

Scrape the world with Globy!

Usage


$ globy-scraper.py -h
usage: globy-scraper.py [-h] -f URLS_FILE [-l LOGLEVEL] [-o OUTPUT_FILE] [-d] [-s] [-b BACKEND]

options:
  -h, --help            show this help message and exit
  -f URLS_FILE, --urls_file URLS_FILE
                        Newline-separated file of URLs to scrape
  -l LOGLEVEL, --loglevel LOGLEVEL
                        Set log level (1-3)
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Output CSV file path
  -d, --debug           Debug/inspect responses (will set PYTHONINSPECT to True)
  -s, --store_html_to_file
                        Store all HTML content to file, per domain, in the "html_output" folder
  -b BACKEND, --backend BACKEND
                        Select backend to use for scraping: "globy" (default) or "scrapy".
                        (!) The scrapy backend will only dump data to the "html_output" folder for now. No website analysis or other functionality is supported yet.
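When you pass -o/--output_file, results are written as CSV. A minimal sketch of reading such a file back with the stdlib csv module — the column names here ("url", "status") are assumptions for illustration, not the crawler's documented output schema:

```python
import csv

# Illustrative stand-in for a globy-scraper output file; the real
# column set depends on the crawler (these names are assumptions).
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "status"])
    writer.writerow(["https://example.com", "200"])

# Read the CSV back as dictionaries keyed by the header row.
with open("output.csv", newline="") as f:
    rows = list(csv.DictReader(f))
```

Check the header row of your actual output file for the real column names.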

Quickstart


You can just run globy-scraper as a script (no installation needed):

pip3 install -r requirements.txt  # Install dependencies if not already installed
./globy-scraper.py -f wordpress-top50.txt
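The file passed with -f is just a newline-separated list of URLs. A minimal sketch for generating one yourself (the filename and URLs are illustrative):

```python
# Write a newline-separated URLs file, the format expected by -f/--urls_file.
urls = [
    "https://example.com",
    "https://example.org",
]
with open("urls.txt", "w") as f:
    f.write("\n".join(urls) + "\n")
```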

Install the globy-webcrawlers package


You can install the package for development and for use with other Globy projects. It's recommended to set up a Python virtual environment before installing the package.

pip3 install -e .

Now you can use the package in python:

>>> from globy_webcrawlers.crawler import WebSiteDataCrawler
>>> c = WebSiteDataCrawler()
>>> c.load_urls_from_file("urls.txt")
>>> c.run()

Also, after installing globy-scraper, you can just run it: globy-scraper.py -h

Debugging/inspecting website content


Since the crawler is asynchronous, it can be tricky to debug responses. To make this easier, use the -d flag: it lets you inspect the HTML content and any internal objects from the most recently crawled website. Here's an example:

$ ./globy-scraper.py -f wordpress-top50.txt -d
>>> w = crawler.debug_latest_response()
>>> w.url
'https://www.thyroid.org/'
>>> len(w.html_content)
249167
>>> w.get_website_info()

To debug a specific response, change crawler.urls (or the input URL list) so that the site you want to inspect is the last one crawled.
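In the -d inspect session the live crawler object is already in scope. A minimal sketch of narrowing its URL list so the next run hits only the site you want to debug — the stand-in class below only mirrors the .urls attribute, and treating crawler.urls as a plain list of strings is an assumption:

```python
# Stand-in for the live crawler object available in the -d session
# (the real object is created by globy-scraper.py; this class only
# mirrors the .urls attribute for illustration).
class Crawler:
    def __init__(self, urls):
        self.urls = urls

crawler = Crawler(["https://www.thyroid.org/", "https://example.com"])

# Keep only the site we want to debug before re-running the crawl.
crawler.urls = [u for u in crawler.urls if "thyroid" in u]
```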

Development only


Build the Python package:

python3 -m build