## Project Introduction
AnyCrawl is a high-performance web crawler and data scraping application designed for modern AI application scenarios. It is not just a simple crawling tool but a comprehensive data collection solution.
## Core Features

### Diverse Crawling Modes
- SERP Crawling: Supports multiple search engines
- Batch Web Page Crawling: Efficient single-page content extraction
- Site Crawling: Intelligent full-site traversal and scraping
- Batch Processing: Supports large-scale batch crawling tasks
### Powerful Technical Architecture
- Multithreaded Architecture: Fully utilizes system resources to enhance crawling efficiency
- Multiprocess Support: Excellent performance when handling large tasks
- Multi-engine Support: Choose from three major engines: Cheerio, Playwright, Puppeteer
- LLM Optimization: Specifically optimized for large language model application scenarios
## Tech Stack and Deployment
AnyCrawl is built on a modern tech stack:
- Node.js + TypeScript: Ensures code quality and development efficiency
- Redis: Provides high-performance caching support
- Docker: One-click deployment for quick setup
```shell
docker compose up --build
```
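A minimal compose file for this setup might look like the sketch below. The service layout, image tag, and port mapping are illustrative assumptions, not AnyCrawl's actual compose file; only the `ANYCRAWL_*` variable names and Redis dependency come from this document.

```yaml
# Illustrative sketch — service names, build context, and ports are assumptions.
services:
  anycrawl:
    build: .
    ports:
      - "8080:8080" # matches the ANYCRAWL_API_PORT default
    environment:
      - ANYCRAWL_REDIS_URL=redis://redis:6379
    depends_on:
      - redis
  redis:
    image: redis:7-alpine
```

With a file like this in place, `docker compose up --build` brings up both the API server and its Redis dependency in one step.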
## Rich Configuration Options

AnyCrawl offers flexible configuration via environment variables, including:
### Basic Configuration

- `ANYCRAWL_API_PORT`: API server port (default: 8080)
- `ANYCRAWL_HEADLESS`: browser headless mode
- `ANYCRAWL_AVAILABLE_ENGINES`: available crawling engines
### Network Configuration

- `ANYCRAWL_PROXY_URL`: proxy server settings (supports HTTP and SOCKS)
- `ANYCRAWL_IGNORE_SSL_ERROR`: SSL certificate error handling
- `ANYCRAWL_KEEP_ALIVE`: connection keep-alive policy
### Data Storage

- `ANYCRAWL_API_DB_TYPE`: database type (SQLite/PostgreSQL)
- `ANYCRAWL_REDIS_URL`: Redis connection configuration
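Taken together, a development `.env` file might look like the sketch below. Only the variable names come from the documentation above; every value (including the comma-separated engine list format) is an illustrative assumption.

```env
# Illustrative values — check the official docs for the exact accepted formats.
ANYCRAWL_API_PORT=8080
ANYCRAWL_HEADLESS=true
ANYCRAWL_AVAILABLE_ENGINES=cheerio,playwright,puppeteer
ANYCRAWL_PROXY_URL=http://proxy.example.com:3128
ANYCRAWL_IGNORE_SSL_ERROR=false
ANYCRAWL_KEEP_ALIVE=true
ANYCRAWL_API_DB_TYPE=sqlite
ANYCRAWL_REDIS_URL=redis://localhost:6379
```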
## Usage Examples

### Basic Web Page Scraping
```shell
curl -X POST http://localhost:8080/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
    "url": "https://example.com",
    "engine": "cheerio"
  }'
```
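Since AnyCrawl is built on Node.js, the same request is easy to issue from TypeScript. The helper below is a sketch of our own, not an official client: only the `/v1/scrape` endpoint, the `url`/`engine` fields, and the engine names come from the examples in this document.

```typescript
// Hypothetical request builder for AnyCrawl's /v1/scrape endpoint.
// The function name and validation are illustrative, not part of AnyCrawl.
type Engine = "cheerio" | "playwright" | "puppeteer";

interface ScrapeRequest {
    url: string;
    engine: Engine;
}

function buildScrapeRequest(url: string, engine: Engine = "cheerio"): ScrapeRequest {
    // The API expects an absolute URL, so reject relative ones early.
    if (!/^https?:\/\//.test(url)) {
        throw new Error("url must be absolute (http or https)");
    }
    return { url, engine };
}

// Usage with the built-in fetch (Node.js 18+):
// await fetch("http://localhost:8080/v1/scrape", {
//     method: "POST",
//     headers: {
//         "Content-Type": "application/json",
//         Authorization: `Bearer ${process.env.ANYCRAWL_API_KEY}`,
//     },
//     body: JSON.stringify(buildScrapeRequest("https://example.com")),
// });
```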
### Search Engine Results Scraping
```shell
curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
    "query": "AnyCrawl",
    "limit": 10,
    "engine": "google",
    "lang": "all"
  }'
```
## Engine Selection Strategy

AnyCrawl provides three crawling engines, each with different strengths:

- Cheerio: static HTML parsing; the fastest option, suited to simple pages
- Playwright: browser automation with full JavaScript rendering, powerful and flexible
- Puppeteer: Chrome-based JavaScript rendering with good compatibility

Developers can choose the engine that fits their needs to balance performance and functionality.
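The trade-offs above can be captured in a simple selection heuristic. The helper below is our own illustration, not an AnyCrawl API; only the engine names match the documented ones.

```typescript
// Illustrative heuristic (not part of AnyCrawl): pick the cheapest
// engine that can handle the target page.
type Engine = "cheerio" | "playwright" | "puppeteer";

function pickEngine(opts: {
    needsJavaScript: boolean;     // does the page rely on client-side rendering?
    preferChromeCompat?: boolean; // is Chrome-specific behavior required?
}): Engine {
    if (!opts.needsJavaScript) return "cheerio"; // static HTML: fastest path
    return opts.preferChromeCompat ? "puppeteer" : "playwright";
}
```

In practice the cheap engine is the right default: fall back to a browser engine only when the static result is missing content.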
## Practical Feature Highlights

- Proxy Support: supports HTTP and SOCKS proxies, adapting to varied network environments
- JavaScript Rendering: available through Puppeteer and Playwright, handling SPAs and dynamically loaded content
- Batch Processing: built-in batch task handling for large-scale data collection
- API Friendly: RESTful API design for straightforward integration
## Project Links

- GitHub repository: https://github.com/any4ai/anycrawl
- Official documentation: https://docs.anycrawl.dev