

HTML Content Extractor API


This project is a Flask-based web extractor and prompt generator. It processes HTML files and directories, extracts content such as headings, paragraphs, and images, and either returns the structured data as JSON or generates prompts suitable for Large Language Models (LLMs). Both individual HTML files and folders containing multiple files are supported, and the project also offers URL scraping and custom prompt generation based on the extracted content.

Quick Start


You can either extract HTML content using the Flask API or run the script locally.

Running the API


  1. Start the Flask Server:

    Run app.py to start the server (a minimal sketch of how such a server could be wired up follows this list):

    python app.py
  2. Use the API Endpoints:

  • Process HTML File, Folder, or URL: You can upload a file, or provide a folder path or URL, to extract content.

  • Save to JSON: Optionally, the API can save the extracted content to a JSON file.
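For orientation, below is a minimal sketch of how such a server could be wired up with Flask. The /process route name matches the endpoints documented in the next section, but the function names and internals are illustrative assumptions, not the project's actual app.py.

# Illustrative sketch only -- not the project's actual app.py.
from flask import Flask, request, jsonify

app = Flask(__name__)

def extract_from_html(html_text):
    # Placeholder for the real extraction logic (headings, paragraphs, div content).
    return {"headings": [], "paragraphs": []}

@app.route("/process", methods=["POST"])
def process():
    # The endpoint accepts an uploaded file, a folder_path field, or a url field.
    if "file" in request.files:
        html_text = request.files["file"].read().decode("utf-8", errors="ignore")
        return jsonify(extract_from_html(html_text))
    if "folder_path" in request.form or "url" in request.form:
        return jsonify({"error": "folder and URL handling omitted in this sketch"}), 501
    return jsonify({"error": "no input provided"}), 400

if __name__ == "__main__":
    app.run(port=5000)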

API Endpoints


  1. /process (POST): Extracts content from a single HTML file or a folder and returns it in the response as JSON (see the Python usage example after this list).

Upload an HTML file:

curl -X POST -F "file=@path/to/file.html" http://localhost:5000/process

Upload a folder:

curl -X POST -d "folder_path=path/to/folder" http://localhost:5000/process

Upload a URL:

curl -X POST -d "url=https://example.com" http://localhost:5000/process
  2. /process_and_save (POST): Extracts content from a single HTML file or a folder and saves the result to a JSON file. Returns a message confirming the save. (Linux/macOS required.)

Upload an HTML file:

curl -X POST -F "file=@path/to/file.html" http://localhost:5000/process_and_save

Upload a folder:

curl -X POST -d "folder_path=path/to/folder" http://localhost:5000/process_and_save

Upload a URL:

curl -X POST -d "url=https://example.com" http://localhost:5000/process_and_save
  3. /generate_prompt (POST): Processes an uploaded HTML file, folder, or URL and returns a prompt generated from the extracted content. Max depth control: a max_depth parameter specifies how deep the script searches for HTML files within folders; the default, max_depth=2, processes files in the root folder and up to two levels of subfolders.

Upload an HTML file:

curl -X POST -F "file=@path/to/file.html" http://localhost:5000/generate_prompt

Upload a folder:

curl -X POST -d "folder_path=path/to/folder" http://localhost:5000/generate_prompt

Upload a URL:

curl -X POST -d "url=https://example.com" http://localhost:5000/generate_prompt
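The same call can also be made from Python, which is handy for scripting against the API. This usage example assumes the server is running locally on port 5000; the commented response shape is only a guess based on the feature list below.

import requests

# Post a local HTML file to /process and print the extracted JSON.
with open("path/to/file.html", "rb") as f:
    response = requests.post("http://localhost:5000/process", files={"file": f})

response.raise_for_status()
print(response.json())
# Plausible (illustrative) shape: {"headings": [...], "paragraphs": [...], ...}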

Running Locally


If you prefer not to use the API, you can still run the script locally by executing main.py. Input can be a single file or folder, and the output will be saved to a JSON file.


python main.py
By default, the script will look for an HTML file or folder, process it, and save the results in an output.json file.
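When the input is a folder, the files to process have to be discovered first. The snippet below is a sketch of a depth-limited walk in the spirit of the max_depth parameter described above; the project's actual traversal code may differ.

import os

def find_html_files(root, max_depth=2):
    # Collect .html files in root and up to max_depth levels of subfolders (illustrative sketch).
    root = os.path.abspath(root)
    html_files = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth >= max_depth:
            dirnames[:] = []  # do not descend any deeper
        html_files.extend(os.path.join(dirpath, name)
                          for name in filenames if name.endswith(".html"))
    return html_files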

Features


  • Extract Headings: Extracts headings (h1 to h6) from HTML files.

  • Extract Paragraphs: Extracts paragraphs (<p>) and their id attributes.

  • Extract Div Content: Extracts the content inside <div> elements with class content.

  • Ignore Template Variables: Removes content wrapped inside {{ }} and { } so dynamic template content is not included (a BeautifulSoup sketch of these extraction steps follows this list).

  • Single File or Folder Input: Can process a single HTML file or all HTML files in a folder and its subfolders.

  • JSON Output: Stores the extracted content in a structured JSON file.
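To make these features concrete, here is a rough BeautifulSoup sketch of the extraction steps: headings, paragraphs with their id attributes, div elements with class content, and stripping of {{ }} / { } template variables. It is an illustration under those assumptions, not the project's actual extractor.

import re
from bs4 import BeautifulSoup

TEMPLATE_VAR = re.compile(r"\{\{.*?\}\}|\{.*?\}")  # matches {{ ... }} and { ... }

def extract_content(html_text):
    # Illustrative extraction of headings, paragraphs, and div.content blocks.
    soup = BeautifulSoup(html_text, "lxml")

    def clean(text):
        return TEMPLATE_VAR.sub("", text).strip()

    return {
        "headings": [clean(h.get_text()) for h in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])],
        "paragraphs": [{"id": p.get("id"), "text": clean(p.get_text())} for p in soup.find_all("p")],
        "div_content": [clean(d.get_text()) for d in soup.find_all("div", class_="content")],
    }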

Requirements


  • Python 3.x

  • Required Python libraries: beautifulsoup4, lxml, flask

Install Required Libraries


You can install the required libraries using pip:

pip install beautifulsoup4 lxml flask