andrewji8

5 Open Source Web Crawler Projects Based on LLM

01. Crawl4AI

Crawl4AI simplifies asynchronous web crawling and data extraction, making it efficient to collect web data for AI and LLM applications.

Advantages:

  • 100% Open Source and Free: Fully open source code.
  • Lightning Fast Performance: Fast, reliable crawling that outperforms many paid services.
  • Built for LLMs: Outputs data in JSON, HTML, or Markdown format.
  • Multi-Browser Support: Seamlessly works with Chromium, Firefox, and WebKit.
  • Simultaneous URL Crawling: Processes multiple websites at once for efficient data extraction.
  • Full Media Support: Easily extracts images, audio, video, and all HTML media tags.
  • Link Extraction: Retrieves all internal and external links for deeper data mining.
  • Metadata Retrieval: Captures page titles, descriptions, and other metadata.
  • Customizable: Add features for authentication, headers, or custom page modifications.
  • User-Agent Customization: Lets you set a custom user agent.
  • Screenshot Support: Takes snapshots of pages, with robust error handling.
  • Custom JavaScript: Executes custom scripts on the page before extracting results.
  • Structured Data Output: Generates well-structured JSON data based on rules.
  • Intelligent Extraction: Uses LLM, clustering, regular expressions, or CSS selectors for accurate data scraping.
  • Proxy Support: Supports access to protected content via secure proxies.
  • Session Management: Easily handles multi-page navigation.
  • Image Handling: Handles lazy-loaded and responsive images.
  • Dynamic Content Handling: Handles lazily loaded content on interactive pages.
  • LLM-Friendly Headers: Passes custom headers for LLM-specific interactions.
  • Precise Extraction: Refines results using keywords or instructions.
  • Flexible Settings: Adjusts timeouts and delays for smoother crawling.
  • Iframe Support: Extracts content from iframes for deeper data extraction.
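Two of the features above, simultaneous URL crawling and internal/external link extraction, can be illustrated with a minimal standard-library sketch. This is not Crawl4AI's API: the FAKE_PAGES dictionary, the fetch stub, and the function names are all hypothetical stand-ins so the example runs without network access.

```python
import asyncio
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "web" so the sketch runs offline.
FAKE_PAGES = {
    "https://example.com/": '<a href="/about">About</a> <a href="https://other.org/x">Other</a>',
    "https://example.com/about": '<a href="/">Home</a>',
}


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def split_links(base_url, html):
    """Return (internal, external) lists of absolute links found in html."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    base_host = urlparse(base_url).netloc
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


async def fetch(url):
    """Stand-in for a real HTTP or browser fetch (assumption: no network)."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return FAKE_PAGES[url]


async def crawl_many(urls):
    """Fetch all URLs concurrently, then classify the links on each page."""
    pages = await asyncio.gather(*(fetch(u) for u in urls))
    return {u: split_links(u, html) for u, html in zip(urls, pages)}


results = asyncio.run(crawl_many(list(FAKE_PAGES)))
for url, (internal, external) in results.items():
    print(url, internal, external)
```

In a real crawler the fetch stub would be replaced by an HTTP client or a headless browser; the concurrency and link classification would stay the same.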

02. ScrapeGraphAI

ScrapeGraphAI is a Python web scraping library that uses LLMs and graph logic to build scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.).
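The idea of a graph-shaped scraping pipeline can be sketched with the standard library alone. The example below is an illustration of the concept, not ScrapeGraphAI's API: the node names, the GRAPH and NODES tables, the hard-coded HTML, and the fake_llm stub are all assumptions made so that it runs without a model or a network connection.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text chunks of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def fetch_node(state):
    # Hypothetical: a real node would download or read state["source"].
    state["html"] = "<html><body><h1>Acme</h1><p>Founded in 1999.</p></body></html>"
    return state


def parse_node(state):
    extractor = TextExtractor()
    extractor.feed(state["html"])
    extractor.close()
    state["text"] = " ".join(extractor.chunks)
    return state


def answer_node(state):
    # The LLM is injected through the state so it can be stubbed out.
    state["answer"] = state["llm"](state["prompt"], state["text"])
    return state


# Each node maps to its successor; None marks the end of the pipeline.
GRAPH = {"fetch": "parse", "parse": "answer", "answer": None}
NODES = {"fetch": fetch_node, "parse": parse_node, "answer": answer_node}


def run_graph(entry, state):
    """Walk the graph from the entry node, threading state through each node."""
    current = entry
    while current is not None:
        state = NODES[current](state)
        current = GRAPH[current]
    return state


def fake_llm(prompt, context):
    """Stand-in for a model call: returns its inputs instead of an answer."""
    return {"prompt": prompt, "context": context}


final = run_graph("fetch", {"source": "https://example.com", "prompt": "When was it founded?", "llm": fake_llm})
print(final["text"])
```

Keeping each step as a separate node makes it easy to swap the source (website or local file) or the model without touching the rest of the pipeline.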

03. LLM Scraper

LLM Scraper is a TypeScript library that uses LLMs to extract structured data from web pages. It also supports code generation.

Advantages:

  • Supports Local and Hosted Providers: Works with local models (Ollama, GGUF), OpenAI, and Vercel AI SDK providers.
  • Fully Type Safe: Written in TypeScript, with schemas defined in Zod.
  • Based on Playwright Framework: Built on Playwright, with support for streaming objects.
  • Code Generation: Can generate scraping code.
  • Four Data Formatting Modes:
    • HTML: Loads the raw HTML.
    • Markdown: Loads the page as Markdown.
    • Text: Loads the extracted text (using Readability.js).
    • Image: Loads a screenshot (multi-modal models only).
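The schema-driven approach above can be illustrated in Python, even though LLM Scraper itself is a TypeScript library that uses Zod. The sketch below is only an analogue of the idea (validate a model's JSON output against a declared schema before using it); the Story type, the SCHEMA table, and parse_story are hypothetical names, and the JSON strings stand in for model output.

```python
import json
from dataclasses import dataclass


@dataclass
class Story:
    title: str
    points: int


# Expected field names and types; a hand-written stand-in for a Zod schema.
SCHEMA = {"title": str, "points": int}


def parse_story(raw_json):
    """Validate model output against SCHEMA and return a typed Story."""
    data = json.loads(raw_json)
    for name, expected in SCHEMA.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected):
            raise ValueError(f"field {name} should be {expected.__name__}")
    # Ignore any extra keys the model may have added.
    return Story(**{name: data[name] for name in SCHEMA})


story = parse_story('{"title": "Hello", "points": 42, "extra": true}')
print(story)
```

Rejecting malformed output at this boundary means downstream code can rely on the declared types instead of re-checking every field.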

04. Crawlee Python

Crawlee is a Python library for web crawling and browser automation. It extracts web page data for AI, LLM, RAG, and GPT applications and can download HTML, PDF, JPG, PNG, and other files from websites. It works with BeautifulSoup, Playwright, and raw HTTP, runs in both headful and headless modes, and supports proxy rotation.
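Proxy rotation, in its simplest round-robin form, can be sketched as follows. This is not Crawlee's implementation or API: the ProxyRotator class, the proxy addresses, and the URLs are hypothetical, and no requests are actually sent.

```python
from itertools import cycle


class ProxyRotator:
    """Hands out proxies in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next_proxy(self):
        return next(self._pool)


# Hypothetical proxy addresses and target URLs.
rotator = ProxyRotator(["http://proxy-a.example:8000", "http://proxy-b.example:8000"])
urls = ["https://example.com/1", "https://example.com/2", "https://example.com/3"]

# Pair each request with the next proxy in the rotation.
assignments = [(url, rotator.next_proxy()) for url in urls]
for url, proxy in assignments:
    print(f"{url} via {proxy}")
```

Production crawlers usually go further, for example by retiring proxies that get blocked, but the rotation itself is this simple.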


05. CyberScraper 2077

CyberScraper 2077 is a web scraping tool powered by OpenAI, Gemini, or local LLMs. It is designed for precise, efficient data extraction and is aimed at data analysts, tech enthusiasts, and anyone who wants simpler access to online information.

Advantages:

  • AI-Based Extraction: Utilizes AI models to intelligently understand and parse web content.
  • Sleek Streamlit Interface: User-friendly GUI.
  • Multi-Format Support: Exports data in JSON, CSV, HTML, SQL, or Excel formats.
  • Tor Network Support: Securely scrapes .onion sites with automatic routing and security features.
  • Stealth Mode: Uses stealth-mode parameters to help avoid detection as a bot.
  • LLM Support: Works with a range of LLMs.
  • Asynchronous Operations: Runs asynchronously for fast execution.
  • Intelligent Parsing: Scrapes content as if directly extracting from primary memory.
  • Caching: Implements content and query-based caching using LRU caching and custom dictionaries to reduce redundant API calls.
  • Google Sheets Upload: Easily uploads extracted CSV data to Google Sheets.
  • Captcha Bypass: Can bypass captchas by using captcha at the end of the URL (currently works only when run locally, not in Docker).
  • Current Browser: Uses local browser environment to help bypass 99% of bot detection.
  • Proxy Mode (Coming Soon): Built-in proxy support to help bypass network restrictions.
  • Browse Pages: Navigates between web pages and scrapes data from each of them.