andrewji8


Python Chinese word segmentation toolkit, accuracy far exceeds Jieba.

Chinese word segmentation is a profound and mysterious technology, whether for humans or for AI.
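To see why the task is hard, a naive dictionary-based baseline helps: forward maximum matching greedily takes the longest dictionary word at each position, and breaks down as soon as the dictionary is ambiguous or incomplete. Here is a minimal sketch of that baseline (the vocabulary is made up for illustration; PKUSeg itself uses a trained statistical model, not this):

```python
def fmm_segment(text, dictionary):
    """Forward maximum matching: at each position, greedily take the
    longest dictionary word; fall back to a single character."""
    max_len = max(map(len, dictionary))
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if length == 1 or cand in dictionary:
                words.append(cand)
                i += length
                break
    return words

vocab = {'北京', '天安门', '北京天安门'}
print(fmm_segment('我爱北京天安门', vocab))  # ['我', '爱', '北京天安门']
```

Note how the greedy match swallows '北京天安门' whole; whether that is right depends entirely on the dictionary, which is exactly the ambiguity statistical segmenters are built to resolve.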

Peking University has released PKUSeg, an open-source Chinese word segmentation toolkit written in Python.

The segmentation accuracy of this toolkit far exceeds that of its two major competitors, THULAC and Jieba.
In addition, PKUSeg supports domain-specific word segmentation and also supports training models with new annotation data.
Accuracy Comparison
In this competition, PKUSeg has two opponents:

One is THULAC from Tsinghua University, and the other is Jieba, which aims to be the "best Chinese word segmentation component". They are both mainstream word segmentation tools at present.

The test environment is Linux, and the test datasets are MSRA (news data) and CTB8 (mixed text).

The results are as follows:

[Image: accuracy comparison of PKUSeg, THULAC, and Jieba on MSRA and CTB8]
The scoring follows the word segmentation evaluation script provided by the Second International Chinese Word Segmentation Bakeoff.

In terms of F-score and error rate, PKUSeg is significantly better than the other two competitors.
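For reference, the F-score here is the standard word-level measure: a predicted word counts as correct only if its exact character span also appears in the gold segmentation, and F is the harmonic mean of precision and recall. A small sketch of the computation (illustrative only, not the official bakeoff script):

```python
def spans(words):
    """Map a segmentation to the set of (start, end) character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out

def f_score(gold, pred):
    """Word-level F1: harmonic mean of precision and recall over spans."""
    g, p = spans(gold), spans(pred)
    correct = len(g & p)
    precision = correct / len(p)
    recall = correct / len(g)
    return 2 * precision * recall / (precision + recall)

gold = ['我', '爱', '北京', '天安门']
pred = ['我', '爱', '北京天安门']
print(round(f_score(gold, pred), 3))  # 0.571
```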

Usage
Pre-trained models
PKUSeg provides three pre-trained models trained on different types of datasets.

The first model is trained on MSRA (news corpus):
https://pan.baidu.com/s/1twci0QVBeWXUg06dK47tiA

The second model is trained on CTB8 (a mixed corpus of news and web text):
https://pan.baidu.com/s/1DCjDOxB0HD2NmP9w1jm8MA

The third model is trained on Weibo (web text corpus):
https://pan.baidu.com/s/1QHoK2ahpZnNmX6X7Y9iCgQ
You can choose to load different models according to your needs.

In addition, you can also train new models with new annotation data.

Code examples:

# Example 1: Using the default model and default dictionary for word segmentation
import pkuseg
seg = pkuseg.pkuseg()                # Load the model with default configuration
text = seg.cut('我爱北京天安门')    # Perform word segmentation
print(text)
# Example 2: Setting a user-defined dictionary
import pkuseg
lexicon = ['北京大学', '北京天安门']    # Words in the user dictionary that should not be segmented
seg = pkuseg.pkuseg(user_dict=lexicon)    # Load the model with the user dictionary
text = seg.cut('我爱北京天安门')        # Perform word segmentation
print(text)
# Example 3: Loading a downloaded model from a local directory
import pkuseg
seg = pkuseg.pkuseg(model_name='./ctb8')    # Load the ctb8 model, assuming it has been downloaded and unpacked into './ctb8'
text = seg.cut('我爱北京天安门')            # Perform word segmentation
print(text)

If you want to train a new model yourself:

# Example 4: Training a new model
import pkuseg
pkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models', nthread=20)    # Train the model with 'msr_training.utf8' as the training file and 'msr_test_gold.utf8' as the test file, save the model to the './models' directory, and use 20 threads for training

For more detailed usage, please visit the link at the end of the text.
Go ahead and give it a try
PKUSeg was developed by three authors: Ruixuan Luo, Jingjing Xu, and Xu Sun.

The toolkit is based on an ACL paper co-authored by two of the three authors.

With such high accuracy, why not give it a try?

GitHub link:
https://github.com/lancopku/PKUSeg-python
