andrewji8


Surprise! A Python web crawler needs only 10 lines of code to crawl massive amounts of public account articles!

Preface
Since ChatGPT appeared, our ability to process text has jumped up a whole level. Before, after crawling text from the internet we had to write a dedicated cleaning program to tidy it up. Now, with ChatGPT's help, we can clean text of almost any type and format in a few seconds, stripping HTML tags and more.
Even after the articles are cleaned, there is still plenty we can do with them, such as extracting key points or rewriting them, and ChatGPT handles these tasks easily.
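
For instance, a minimal sketch of this kind of cleanup, assuming the official openai Python SDK (1.x) and an OPENAI_API_KEY environment variable; the model name and prompt here are only placeholders, not something from the original article:

# Hedged sketch: LLM-based cleanup of crawled HTML
# Assumes openai>=1.0 is installed and OPENAI_API_KEY is set; model and prompt are placeholders
from openai import OpenAI

client = OpenAI()

def clean_article(raw_html):
    # Ask the model to strip tags and boilerplate, keeping only the readable article text
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Remove all HTML tags and boilerplate; return only the readable article text."},
            {"role": "user", "content": raw_html},
        ],
    )
    return resp.choices[0].message.content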

As early as 2019 I wrote an article introducing how to crawl articles from public accounts. The method still works today, but back then we did little processing on the crawled articles. Things are different now: we can do much more with them, provided we do not violate any laws or regulations.

Obtaining the URL of an Article
The most important part of web crawling is finding the URL of the target pages, spotting the pattern, and then traversing the addresses one by one (or in multiple threads). There are generally two ways to obtain the addresses. One is through webpage pagination: infer the URL pattern, for example a parameter like pageNum=num, and simply increment num. The other is hidden in the HTML: parse the hyperlinks (the <a href="..."> tags) in the current page and extract their URLs as the addresses for subsequent crawling.
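
To make the two approaches concrete, here is a rough sketch; the site, URL pattern, and the use of requests and BeautifulSoup are illustrative assumptions rather than part of the original write-up:

# Illustrative sketch of the two ways of gathering addresses (hypothetical site and URL pattern)
import requests
from bs4 import BeautifulSoup

# Approach 1: paginated URLs that follow a pattern such as pageNum=1, 2, 3, ...
page_urls = ["https://example.com/articles?pageNum=%d" % num for num in range(1, 11)]

# Approach 2: parse the <a href="..."> hyperlinks out of the current page's HTML
html = requests.get("https://example.com/articles").text
soup = BeautifulSoup(html, "html.parser")
article_urls = [a["href"] for a in soup.find_all("a", href=True)]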

Unfortunately, neither of these methods works well for WeChat public accounts. When we open a public account article in the browser, we can see its URL format:

https://mp.weixin.qq.com/s?__biz=MzIxMTgyODczNg==&mid=2247483660&idx=1&sn=2c14b9b416e2d8eeed0cbbc6f44444e9&chksm=974e2863a039a1752605e6e76610eb39e5855c473edab16ab2c7c8aa2f624b4d892d26130110&token=20884314&lang=zh_CN#rd
Except for the domain name at the beginning, the trailing parameters follow no obvious pattern. Moreover, an article contains no link to the next article, so we cannot start from one article and crawl all the articles under the account.

Fortunately, there is still a way to batch-obtain all the article addresses under a given public account. We just need to save them first; traversing and crawling them later then becomes much easier.

1. First, you need a public account of your own. If you don't have one, register one; the steps are simple and you can complete them yourself. This is a prerequisite.
2. After registering, log in to the WeChat public platform and click "Drafts" on the left to create a new article.

3. On the new article page, click the hyperlink button at the top. In the popup that appears, select "Public Account Article" and enter the name of the public account you want to crawl.

4. After selecting it, you can see the list of all articles under that public account. Now open the developer tools (F12) and watch the network requests made by the page.

When you click "next page", you can see the requested URL along with the business parameters and request header parameters it carries.

These are the business parameters, and they are easy to understand. The important ones are "begin", which indicates which article the query starts from, and "count", which indicates how many articles to fetch per request. "fakeid" is the unique identifier of the public account; it differs for each account, so to crawl a different public account you only need to change this parameter. "random" can be omitted. The response shows the corresponding results.

Writing Code
With all this information, we can write the code (I used Python 3.8). First, import the required libraries, then define the URL, request headers, and business parameters.

import math
import random
import time

import pandas as pd
import requests

# Target URL
url = "https://mp.weixin.qq.com/cgi-bin/appmsg"
# Request header parameters
headers = {
  "Cookie": "ua_id=YF6RyP41YQa2QyQHAAAAAGXPy_he8M8KkNCUbRx0cVU=; pgv_pvi=2045358080; pgv_si=s4132856832; uuid=48da56b488e5c697909a13dfac91a819; bizuin=3231163757; ticket=5bd41c51e53cfce785e5c188f94240aac8fad8e3; ticket_id=gh_d5e73af61440; cert=bVSKoAHHVIldcRZp10_fd7p2aTEXrTi6; noticeLoginFlag=1; remember_acct=mf1832192%40smail.nju.edu.cn; data_bizuin=3231163757; data_ticket=XKgzAcTceBFDNN6cFXa4TZAVMlMlxhorD7A0r3vzCDkS++pgSpr55NFkQIN3N+/v; slave_sid=bU0yeTNOS2VxcEg5RktUQlZhd2xheVc5bjhoQTVhOHdhMnN2SlVIZGRtU3hvVXJpTWdWakVqcHowd3RuVF9HY19Udm1PbVpQMGVfcnhHVGJQQTVzckpQY042QlZZbnJzel9oam5SdjRFR0tGc0c1eExKQU9ybjgxVnZVZVBtSmVnc29ZcUJWVmNWWEFEaGtk; slave_user=gh_d5e73af61440; xid=93074c5a87a2e98ddb9e527aa204d0c7; openid2ticket_obaWXwJGb9VV9FiHPMcNq7OZzlzY=lw6SBHGUDQf1lFHqOeShfg39SU7awJMxhDVb4AbVXJM=; mm_lang=zh_CN",
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
}

# Business parameters
data = {
    "token": "1378111188",
    "lang": "zh_CN",
    "f": "json",
    "ajax": "1",
    "action": "list_ex",
    "begin": "0",
    "count": "5",
    "query": "",
    "fakeid": "MzU5MDUzMTk5Nw==",
    "type": "9",
}

The cookie and token must be replaced with the values from your own logged-in request. Then send the request and parse the response.

content_list = []
for i in range(20):
    data["begin"] = i*5
    time.sleep(5)
    # Use the get method to submit
    content_json = requests.get(url, headers=headers, params=data).json()
    # It returns a JSON, which contains the data of each page
    for item in content_json["app_msg_list"]:
        # Extract the title and corresponding URL of each article on the page
        items = []
        items.append(item["title"])
        items.append(item["link"])
        t = time.localtime(item["create_time"])
        items.append(time.strftime("%Y-%m-%d %H:%M:%S", t))
        content_list.append(items)

The outer for loop controls how many pages to crawl. Here I recommend crawling 20 pages at a time, 5 articles per page, i.e. 100 articles in total. You first need to know how many pages the public account's article history actually has, and the number you crawl should not exceed that total. data["begin"] indicates which article each request starts from, fetching 5 articles at a time. Do not crawl too much or too fast: wait a few seconds between requests, otherwise your IP, your cookie, or even the public account itself may be blocked.

Finally, we just need to save the titles and URLs, and then we can crawl them one by one.

name = ['title', 'link', 'create_time']
test = pd.DataFrame(columns=name, data=content_list)
test.to_csv("url.csv", mode='a', encoding='utf-8')
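
As a rough sketch of that follow-up step, reading url.csv back and fetching each article one by one; the column names match the CSV written above, the headers dict is the one defined earlier, and the pause lengths are only cautious guesses:

# Sketch: read the saved list back and crawl the articles one by one
import random
import time

import pandas as pd
import requests

df = pd.read_csv("url.csv")
for _, row in df.iterrows():
    article_html = requests.get(row["link"], headers=headers).text  # reuse the headers defined earlier
    # ... hand article_html to your cleaning step (e.g. the ChatGPT cleanup from the preface)
    print(row["title"], len(article_html))
    time.sleep(random.randint(5, 10))  # be polite between requests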

To get all the historical articles, we first need the total number of articles, which is returned in the "app_msg_cnt" field, and from it we can calculate how many pages there are. Then we can crawl all the articles in one run.

content_json = requests.get(url, headers=headers, params=data).json()
count = int(content_json["app_msg_cnt"])
print(count)
page = int(math.ceil(count / 5))
print(page)
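
Combining this with the earlier loop is straightforward; a small sketch of the change, reusing the variables defined above:

# Sketch: iterate over every page instead of a fixed 20, reusing the earlier loop body
content_list = []
for i in range(page):
    data["begin"] = i * 5
    content_json = requests.get(url, headers=headers, params=data).json()
    for item in content_json["app_msg_list"]:
        t = time.localtime(item["create_time"])
        content_list.append([item["title"], item["link"], time.strftime("%Y-%m-%d %H:%M:%S", t)])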

To avoid requesting too frequently, we also sleep inside this loop: after every 10 pages, save the results collected so far and let the program rest for a minute or so; otherwise pause for a dozen-odd seconds between pages.

    if (i > 0) and (i % 10 == 0):
        # Every 10 pages, save what has been collected so far and take a longer break
        name = ['title', 'link', 'create_time']
        test = pd.DataFrame(columns=name, data=content_list)
        test.to_csv("url.csv", mode='a', encoding='utf-8')
        print("Saved successfully for the " + str(i) + "th time")
        content_list = []
        time.sleep(random.randint(60, 90))
    else:
        time.sleep(random.randint(15, 25))

The complete code is available on GitHub and can be downloaded and used directly.
https://github.com/cxyxl66/WeChatCrawler
