# DrissionPage Introduction
https://github.com/g1879/DrissionPage
DrissionPage is a web automation tool written in Python that cleverly integrates the functionalities of Selenium and Requests behind a unified, simple interface. Developers can switch freely between browser mode (like using Selenium) and packet mode (similar to using Requests). With this design, DrissionPage handles both dynamic web content that requires JavaScript rendering and simple static-page scraping with ease.
## Main Page Objects
DrissionPage provides three main page objects, each suited to different usage scenarios:

- ChromiumPage: Mainly used for direct browser control, suited to tasks that require interacting with the page, such as clicking buttons, entering text, or running JavaScript. Because it is bound to a browser, it runs more slowly and uses more memory than the packet-based objects.
- WebPage: A comprehensive page object that can both control the browser and send and receive data packets. It has two modes:
  - d mode: drives the browser; very powerful but slower;
  - s mode: handles data packets; fast, and suited to simpler requests.
- SessionPage: A lightweight page object designed purely for sending and receiving data packets, with no page interaction. It is highly efficient and the ideal choice for large-scale data scraping.
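To make the distinction concrete, here is a minimal sketch of how each object is instantiated; no site-specific setup is assumed:

```python
from DrissionPage import ChromiumPage, WebPage, SessionPage

browser_page = ChromiumPage()  # launches/attaches to a Chromium browser; for interactive pages
web_page = WebPage()           # dual-mode object; starts in d mode, can switch to s mode
session_page = SessionPage()   # pure HTTP session; no browser is launched at all
```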
## Features

### Seamless Mode Switching
DrissionPage allows developers to switch freely between Selenium's browser driver and Requests' session. If rendering a web page is needed, use Selenium; if quick data scraping is desired, use Requests. For example, when encountering a web page with both dynamic and static content, one can quickly obtain static data using SessionPage and then switch to ChromiumPage or WebPage's d mode to handle dynamic content.
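As a rough sketch of that workflow (the URLs are placeholders), a WebPage can render one page in the browser and fetch the next as a plain request:

```python
from DrissionPage import WebPage

page = WebPage()                      # starts in d mode (browser)
page.get('https://example.com')       # rendered by the browser, JavaScript runs
page.change_mode()                    # switch to s mode
page.get('https://example.com/list')  # now fetched as a plain HTTP request
print(page.mode)                      # -> 's'
```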
### Simplified Interface
DrissionPage provides a unified interface that simplifies web automation and data scraping, so developers no longer need to learn the separate, complex APIs of Selenium and Requests, saving considerable learning and development time. For locating web elements, DrissionPage offers `ele()` and `eles()` methods similar to Selenium's, supporting several selector types (such as CSS selectors and XPath), which makes them particularly convenient to use.
### Flexible Customization
It lets users set custom request headers, proxies, timeout durations, and more, making scraping more flexible. When scraping, one may run into a site's anti-scraping mechanisms; setting custom request headers and a proxy can often help bypass these restrictions.
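As an illustration, SessionPage's get() passes the usual requests-style keyword arguments through to the underlying session; the proxy address and header values below are placeholders:

```python
from DrissionPage import SessionPage

page = SessionPage()
page.get(
    'https://example.com/api',
    headers={'User-Agent': 'Mozilla/5.0', 'Referer': 'https://example.com'},
    proxies={'http': 'http://127.0.0.1:7890', 'https': 'http://127.0.0.1:7890'},
    timeout=10,
)
print(page.response.status_code)  # the underlying requests Response is exposed
```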
### Built-in Common Features
DrissionPage includes many commonly used features, such as waiting for elements to load and automatic retries. When dealing with dynamic web pages, loading web elements may take some time, and DrissionPage's wait-for-element-loading feature ensures that operations are performed only after elements are fully loaded, avoiding errors due to incomplete loading.
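For example, ele() accepts a timeout and waits for the element to appear before returning; the selector here is hypothetical:

```python
from DrissionPage import ChromiumPage

page = ChromiumPage()
page.get('https://example.com')

# ele() waits up to `timeout` seconds for the element to appear,
# rather than failing immediately while the page is still rendering.
button = page.ele('#submit', timeout=10)
button.click()
```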
### Multi-Tab Operation
It can operate on multiple browser tabs at the same time, even tabs that are not currently active, without switching between them. This is particularly useful when handling several web pages at once and significantly improves efficiency.
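A minimal sketch (URLs and selector are placeholders): new_tab() returns a tab object that can be driven directly while another tab stays in the foreground:

```python
from DrissionPage import ChromiumPage

page = ChromiumPage()
page.get('https://gitee.com/explore')

tab = page.new_tab('https://gitee.com/login')  # open a second tab and get its object
tab.ele('#user_login').input('name')           # drive it without activating it

print(page.title)  # the original tab remains usable at the same time
tab.close()
```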
### Packet Capture Function Upgrade: the Listen Feature
In DrissionPage version 4.0, the packet capture feature has been greatly enhanced, with each page object now having a built-in listener, making it more powerful and the API more reasonable. This greatly aids developers in debugging and data collection.
#### Example Code
The following example can be run directly to see the effect; it will also record the time, allowing you to understand how to use the Listen feature:
```python
from DrissionPage import ChromiumPage
from TimePinner import Pinner
from pprint import pprint

page = ChromiumPage()
page.listen.start('api/getkeydata')  # specify the target to listen for, then start listening
pinner = Pinner(True, False)

page.get('http://www.hao123.com/')  # open the website
packet = page.listen.wait()  # wait to receive the data packet
pprint(packet.response.body)  # print the content of the data packet
pinner.pin('Time taken', True)
```
After running this code, the content of the captured data packet and the total time taken will be output, making it easier for developers to analyze performance and debug.
### Page Access Logic Optimization
In version 3.x, page connections had two main issues: the `timeout` parameter of the browser page object's `get()` method was only effective during the page loading phase, not the connection phase; and the `none` loading strategy was not useful in practice. Both issues are resolved in version 4.0, and users can now control when to terminate a connection.
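A sketch of what this enables, as I understand the 4.0 API; the selector is a placeholder:

```python
from DrissionPage import ChromiumPage

page = ChromiumPage()
page.set.load_mode.none()         # don't block waiting for the page's load event

page.get('https://example.com')   # returns once the connection is established
page.ele('#content', timeout=10)  # wait only for the element we actually need
page.stop_loading()               # then terminate loading manually
```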
## Usage Scenarios

### Web Automation Testing
Using its browser-control capabilities, DrissionPage can simulate user operations on web pages for automated testing. Various page functions, such as login, registration, and form submission, can be tested to ensure the page's stability and reliability.
### Data Scraping
DrissionPage can fetch data from static pages over plain HTTP and switch to browser mode when it encounters complex pages, allowing quick, efficient scraping of data from a wide range of websites, such as news, product information, and social network data.
### Crawler Development
With its flexible mode switching and powerful element locating capabilities, DrissionPage is well-suited for developing various types of crawlers. One can choose the appropriate mode based on the characteristics of the website, improving the efficiency and stability of the crawler.
## Usage Examples

### Controlling the Browser
Using the ChromiumPage object, browser automation operations can be easily implemented, such as logging in and filling out forms.
```python
from DrissionPage import ChromiumPage

page = ChromiumPage()
page.get('https://gitee.com/login')  # open the login page

# Find the account input box and enter the account
user_login = page.ele('#user_login')
user_login.input('Your account')

# Find the password input box and enter the password
user_password = page.ele('#user_password')
user_password.input('Your password')

# Find the login button and click it
login_button = page.ele('@value=Log In')
login_button.click()
```
### Scraping Data
Using the SessionPage object, data can be efficiently scraped without complex interactions with web pages.
```python
from DrissionPage import SessionPage

page = SessionPage()
for i in range(1, 4):  # loop through three pages
    page.get(f'https://gitee.com/explore/all?page={i}')  # open each page
    # Find all project link elements
    links = page.eles('.title.project-namespace-path')
    for link in links:  # iterate through each link element
        print(link.text, link.link)  # print the link text and address
```
### Page Analysis
Using the WebPage object, one can flexibly switch between browser mode and data packet mode to adapt to different analysis needs.
```python
from DrissionPage import WebPage

page = WebPage()
page.get('https://gitee.com/explore/all')  # open the page (d mode)
page.change_mode()  # switch mode (d -> s)

# Find the project list element, then the items inside it
items = page.ele('.ui.relaxed.divided.items.explore-repo__list').eles('.item')
for item in items:  # iterate through each project
    print(item('t:h3').text)  # print the project title
    print(item('.project-desc.mb-1').text)  # print the project description
```
## Summary
DrissionPage is a powerful and user-friendly open-source Python package that provides efficient and flexible solutions for web automation and data scraping. By integrating the functionalities of Selenium and Requests, it offers seamless mode switching and a simple interface, allowing developers to focus more on business logic. Whether for novice developers or experienced professionals, DrissionPage is worth trying, making it easier to accomplish various web automation tasks.