Recommended Open Source AI Crawlers
01. Crawl4AI
Crawl4AI simplifies asynchronous web data extraction, making it fast and efficient and well suited to AI and LLM applications.
Advantages:
- 100% Open Source and Free: Fully open source code.
- Lightning-Fast Performance: Outperforms many paid services in crawling speed and reliability.
- LLM-Ready Output: Returns data in JSON, HTML, or Markdown format.
- Multi-Browser Support: Seamlessly works with Chromium, Firefox, and WebKit.
- Simultaneous URL Crawling: Processes multiple websites at once for efficient data extraction.
- Full Media Support: Easily extracts images, audio, video, and all HTML media tags.
- Link Extraction: Retrieves all internal and external links for deeper data mining.
- Metadata Retrieval: Captures page titles, descriptions, and other metadata.
- Customizable: Add features for authentication, headers, or custom page modifications.
- Anonymity Support: Custom user-agent settings help mask the crawler's identity.
- Screenshot Support: Takes page snapshots, backed by robust error handling.
- Custom JavaScript: Executes user-supplied scripts on the page before extracting results.
- Structured Data Output: Generates well-structured JSON data based on rules.
- Intelligent Extraction: Uses LLM, clustering, regular expressions, or CSS selectors for accurate data scraping.
- Proxy Support: Reaches protected content through secure proxies.
- Session Management: Easily handles multi-page navigation.
- Image Optimization: Supports lazy loading and responsive images.
- Dynamic Content Handling: Handles lazy-loaded content on interactive pages.
- LLM-Friendly Headers: Passes custom headers for LLM-specific interactions.
- Precise Extraction: Optimizes results using keywords or directives.
- Flexible Settings: Adjusts timeouts and delays for smoother crawling.
- Iframe Support: Extracts content from iframes for deeper data extraction.
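Two of the features above, simultaneous URL crawling and link extraction, can be sketched in plain Python. This is a minimal illustration of the pattern, not Crawl4AI's actual API; the `fetch` coroutine is a stub standing in for a real HTTP request:

```python
import asyncio
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from every <a> tag, mirroring the link-extraction feature."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

async def fetch(url: str) -> str:
    # Stub: a real crawler would issue an HTTP request here.
    await asyncio.sleep(0)
    return f'<html><body><a href="{url}/about">About</a></body></html>'

async def crawl(urls):
    # Simultaneous URL crawling: all fetches run concurrently via gather().
    pages = await asyncio.gather(*(fetch(u) for u in urls))
    results = {}
    for url, html in zip(urls, pages):
        parser = LinkExtractor()
        parser.feed(html)
        results[url] = parser.links
    return results

results = asyncio.run(crawl(["https://a.example", "https://b.example"]))
print(results)
```

The key point is that `asyncio.gather` overlaps the network waits, which is where most crawl time is spent.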
02. ScrapeGraphAI
ScrapeGraphAI is a Python library for web scraping that uses LLMs and graph-based logic to build scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.).
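The graph idea can be illustrated, independently of ScrapeGraphAI's actual API, as a pipeline of nodes where each node transforms a shared state and passes it on. The node names and state keys below are invented for illustration:

```python
from typing import Callable

def fetch_node(state: dict) -> dict:
    # Stub for a real fetch: loads the page content into the state.
    state["html"] = f"<h1>{state['url']}</h1>"
    return state

def parse_node(state: dict) -> dict:
    # Strips markup to plain text.
    state["text"] = state["html"].replace("<h1>", "").replace("</h1>", "")
    return state

def extract_node(state: dict) -> dict:
    # In a real LLM-driven pipeline, this step would call the model
    # with the user's prompt; here it just structures the text.
    state["answer"] = {"title": state["text"]}
    return state

def run_graph(nodes: list[Callable[[dict], dict]], state: dict) -> dict:
    # A "graph" here is a simple chain; each node's output feeds the next.
    for node in nodes:
        state = node(state)
    return state

result = run_graph([fetch_node, parse_node, extract_node], {"url": "example.com"})
print(result["answer"])
```

Real scraping graphs are richer (branching, conditional nodes), but the data flow, fetch then parse then LLM-extract, is the same idea.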
03. LLM Scraper
LLM Scraper is a TypeScript library for LLM-based scraping of web pages, with support for generating reusable scraper code.
Advantages:
- Local or Hosted Model Providers: Compatible with Ollama, GGUF models, OpenAI, and the Vercel AI SDK.
- Fully Type Safe: Implemented in TypeScript using schemas defined by Zod.
- Built on Playwright: Includes support for streaming objects as they are extracted.
- Code Generation: Can generate reusable scraper code.
- Four Data Formatting Modes:
  - HTML: Feeds the raw HTML to the model.
  - Markdown: Converts the page to Markdown first.
  - Text: Uses readable text extracted with Readability.js.
  - Image: Uses a page screenshot (multimodal models only).
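LLM Scraper itself is TypeScript and validates model output against a Zod schema. The same schema-first idea can be sketched language-neutrally in Python; the `Product` schema and `validate` helper below are illustrative, not LLM Scraper's API:

```python
from dataclasses import dataclass

@dataclass
class Product:
    """Declared schema: the shape we demand from the model's output."""
    name: str
    price: float

def validate(raw: dict) -> Product:
    # Reject model output that does not match the declared schema,
    # analogous to Zod's parse() rejecting malformed LLM responses.
    if not isinstance(raw.get("name"), str):
        raise ValueError("name must be a string")
    if not isinstance(raw.get("price"), (int, float)):
        raise ValueError("price must be a number")
    return Product(name=raw["name"], price=float(raw["price"]))

# Pretend this dict came back from the LLM:
product = validate({"name": "Widget", "price": 9.99})
print(product)
```

Validating at the boundary is what makes an LLM pipeline "fully type safe": malformed model output fails loudly instead of flowing downstream.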
04. Crawlee Python
Crawlee is a Python library for web scraping and browser automation. It extracts data for AI, LLM, RAG, or GPT pipelines and can download HTML, PDF, JPG, PNG, and other files from websites. It works with BeautifulSoup, Playwright, and raw HTTP, supports both headful and headless modes, and provides proxy rotation.
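Proxy rotation, the last feature mentioned, can be sketched independently of Crawlee's actual API as a round-robin pool; the proxy URLs below are placeholders:

```python
from itertools import cycle

class ProxyPool:
    """Round-robin proxy rotation: each request gets the next proxy in the pool."""
    def __init__(self, proxies):
        self._proxies = cycle(proxies)

    def next_proxy(self) -> str:
        # cycle() wraps around, so the pool never runs out.
        return next(self._proxies)

pool = ProxyPool(["http://proxy-a:8000", "http://proxy-b:8000"])
for _ in range(3):
    print(pool.next_proxy())
```

Spreading requests across proxies like this limits how often any single IP hits a target site, which is the point of rotation rules.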
05. CyberScraper 2077
CyberScraper 2077 is a web scraping tool powered by OpenAI, Gemini, or local LLMs. Designed for precise and efficient data extraction, it suits data analysts, tech enthusiasts, and anyone who wants streamlined access to online information.
Advantages:
- AI-Based Extraction: Utilizes AI models to intelligently understand and parse web content.
- Streamlined Interface: User-friendly GUI.
- Multi-Format Support: Exports data in JSON, CSV, HTML, SQL, or Excel formats.
- Tor Network Support: Securely scrapes .onion sites with automatic routing and security features.
- Incognito Mode: Runs with incognito parameters to help avoid detection as a bot.
- LLM Support: Provides functionality supporting various LLMs.
- Asynchronous Operation: Runs operations asynchronously for fast execution.
- Intelligent Parsing: Understands page structure and extracts the requested content accurately.
- Caching: Implements content and query-based caching using LRU caching and custom dictionaries to reduce redundant API calls.
- Supports Uploading to Google Sheets: Easily uploads extracted CSV data to Google Sheets.
- Captcha Bypass: Can bypass captchas by appending `captcha` to the end of the URL (currently works only locally, not in Docker).
- Current Browser: Uses your local browser environment, which helps bypass most bot detection.
- Proxy Mode (Coming Soon): Built-in proxy support to help bypass network restrictions.
- Page Browsing: Navigates across multiple web pages and scrapes data from each.
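The content- and query-based caching described above can be sketched as a small LRU cache keyed on the (url, query) pair. This is an illustration of the technique, not CyberScraper's implementation:

```python
from collections import OrderedDict

class LRUCache:
    """Caches scrape results per (url, query); evicts the least-recently-used entry."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data: OrderedDict = OrderedDict()

    def get(self, url: str, query: str):
        key = (url, query)
        if key not in self._data:
            return None  # cache miss: caller must hit the API/model
        self._data.move_to_end(key)  # mark as recently used
        return self._data[key]

    def put(self, url: str, query: str, result) -> None:
        key = (url, query)
        self._data[key] = result
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop least recently used

cache = LRUCache(capacity=2)
cache.put("https://example.com", "list headlines", ["h1", "h2"])
print(cache.get("https://example.com", "list headlines"))
```

Keying on both the URL and the query means the same page asked two different questions caches two separate answers, while repeat questions skip the redundant API call entirely.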