The new generation alternative to Selenium - DrissionPage

Today, I recommend a web automation tool based on Python: DrissionPage. This tool can control browsers, send and receive data packets, and even combine the two. In simple terms, it combines the convenience of web browser automation with the efficiency of requests.

There are usually two forms of web automation:

  1. Send request packets directly to the server to obtain the required data, simulating the operations at the data level.

  2. Interact with the browser and web pages to simulate user interface operations.

The former is lightweight and fast, as with the requests library. But for websites that require login, requests often has to contend with anti-crawling measures such as captchas, JS obfuscation, and signed parameters, which raises the barrier considerably. And if the data is generated by JS calculations, the whole calculation process has to be reproduced, which makes development inefficient.

The latter drives a real browser to simulate user behavior, as with the Selenium library, and can largely bypass those obstacles; the trade-off is that a browser simply does not run as efficiently.

Therefore, DrissionPage was designed to merge the two into one object: you switch to whichever mode the task calls for, through a user-friendly API, improving both development and execution efficiency.
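This switching is done through the WebPage object, which has a browser mode ('d' for driver) and a packet mode ('s' for session). Below is a minimal sketch of what that looks like, assuming the documented change_mode() behavior; the URLs are placeholders:

from DrissionPage import WebPage

# Start in browser ('d') mode; WebPage('s') would start in packet mode
page = WebPage('d')
page.get('https://example.com/login')  # placeholder URL

# ... log in with the browser here, e.g. fill a form and click a button ...

# Switch to packet ('s') mode; the login cookies carry over automatically
page.change_mode()
page.get('https://example.com/data')  # now fetched as a plain HTTP request
print(page.mode)  # 's'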


Features:

  • No webdriver fingerprint, so controlled browsers are harder for sites to detect
  • No need to download different drivers for different versions of browsers
  • Faster execution speed
  • Can search for elements across iframes without switching in and out (see the sketch after this list)
  • Treat iframes as regular elements, making logic clearer
  • Can operate on multiple tabs in the browser simultaneously, even if the tabs are not active, no need to switch
  • Can directly read browser cache to save images, no need to use GUI to click "Save As"
  • Can take screenshots of the entire webpage, including parts outside the viewport (supported by browser versions 90 and above)
  • Can handle shadow-root in non-open state
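Here is a minimal sketch of two of these features, iframe handling and full-page screenshots; the URL and element id are placeholders:

from DrissionPage import ChromiumPage

page = ChromiumPage()
page.get('https://example.com')  # placeholder URL

# Treat an iframe like a regular element: grab it, then search inside it,
# with no switch-in/switch-out dance
frame = page.get_frame(1)  # the page's first iframe
btn = frame('#inner-btn')  # hypothetical element id

# Screenshot the whole page, including content outside the viewport
page.get_screenshot(path='full_page.png', full_page=True)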

Project address:

https://gitee.com/g1879/DrissionPage

Install DrissionPage with pip (the -i option points at the Tsinghua PyPI mirror to speed up downloads in mainland China; elsewhere it can simply be omitted):

pip install DrissionPage -i https://pypi.tuna.tsinghua.edu.cn/simple

Application example: Crawl the top 100 movies on Maoyan

This example demonstrates data crawling using a browser.

Target URL: https://www.maoyan.com/board/4

Example code:

The following code can be run directly.

Note that it uses a Recorder object from the DataRecorder library, which is introduced in more detail below.

from DrissionPage import ChromiumPage
from DataRecorder import Recorder

# Create a page object
page = ChromiumPage()

# Create a recorder object
recorder = Recorder('data.csv')

# Access the webpage
page.get('https://www.maoyan.com/board/4')

while True:
    # Iterate over all dd elements on the page
    for mov in page.eles('t:dd'):
        # Get the required information
        num = mov('t:i').text
        score = mov('.score').text
        title = mov('@data-act=boarditem-click').attr('title')
        star = mov('.star').text
        time = mov('.releasetime').text
        # Write to the recorder
        recorder.add_data((num, title, star, time, score))

    # Get the next-page button ('下一页' means "next page"); click it if it exists
    btn = page('下一页', timeout=2)
    if btn:
        btn.click()
        page.wait.load_start()
    # Exit the program if it doesn't exist
    else:
        break

recorder.record()  # flush any rows still cached in memory to data.csv

Now let's talk about this useful library, DataRecorder.

https://gitee.com/huiwei13/data-recorder

Although it is not widely known, it is very useful:

  • It caches data and writes it out once a set quantity accumulates, reducing file read/write operations and lowering overhead (see the sketch after this list)
  • It supports writing data from multiple threads simultaneously
  • Before writing, it automatically waits for the file to be closed, avoiding data loss
  • It provides good support for resuming a crawl from a breakpoint
  • It makes transferring data in batches easy
  • It can create headers automatically from dictionary data
  • It creates files and paths automatically, reducing code volume
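A minimal sketch of the caching and dictionary-header behavior; the file name and field names are placeholders:

from DataRecorder import Recorder

# Buffer up to 50 rows in memory before each automatic write to disk
r = Recorder('scores.csv', cache_size=50)

# Rows passed as dicts; the keys can serve as an automatically created header
r.add_data({'name': 'Alice', 'score': 98})
r.add_data({'name': 'Bob', 'score': 91})

r.record()  # flush whatever is still cached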

Recorder:

Recorder is a simple, intuitive, efficient, and practical tool that does just one thing: it keeps receiving data and appends it to a file in order. It accepts a single row of data, or multiple rows as a two-dimensional structure.

It supports four file formats: csv, xlsx, json, and txt.

from DataRecorder import Recorder

data = ((1, 2, 3, 4),
        (5, 6, 7, 8))

r = Recorder('data.csv')
r.add_data(data)   # record multiple rows at once
r.add_data('abc')  # record a single row
r.record()         # flush the cache to the file

Filler:

Filler is used to fill data into table files at specified coordinates. It is very flexible: you can treat a coordinate as the top-left corner and fill in a whole two-dimensional block of data. It also encapsulates progress tracking (useful for resuming a crawl from a breakpoint), and it can set links on cells.

It only supports csv and xlsx file formats.

from DataRecorder import Filler

f = Filler('results.csv')
f.add_data((1, 2, 3, 4), 'a2')  # write a row of data starting from cell A2
f.add_data(((1, 2), (3, 4)), 'd4')  # write a 2D data block with D4 as its top-left corner
f.record()  # flush the cache to the file