andrewji8


Python-Camelot: Extract PDF table data in three lines of code.

PDF is a very common file format, typically used for formal electronic documents. It preserves formatting faithfully and produces clear, aesthetically pleasing layouts. However, for anyone who wants to extract information from a PDF, especially tables, it can be a nightmare.

A large number of academic reports, papers, and analytical articles present tabular data in PDFs, yet copying data directly out of those tables is often very difficult. Recently, a developer released a tool called Camelot that extracts tables from text-based PDFs and can convert most of them directly into pandas DataFrames.

Project address: https://github.com/camelot-dev/camelot

What is Camelot?

According to the project description, Camelot is a Python tool used to extract table data from PDF files.

Specifically, users can open a PDF file much as they would open a CSV file with pandas, extract its tables, and then export them in a chosen output format, such as CSV.

Code example

The project provides a sample PDF file. Suppose the user wants to extract the information in its Table 2-1.

Using Camelot to extract table data, the code is as follows:

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')  # similar to opening a CSV file with pandas
>>> tables[0].df  # get a pandas DataFrame!
>>> tables.export('foo.csv', f='csv', compress=True)  # export all tables; f can also be json, excel, html, sqlite
>>> tables[0].to_csv('foo.csv')  # export a single table; also to_json, to_excel, to_html, to_sqlite
>>> tables
<TableList n=1>
>>> tables[0]  # the shape is (rows, columns)
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
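
The parsing report above can also be used programmatically. As a minimal sketch, the helper below keeps only the tables whose reported accuracy clears a threshold. The names filter_by_accuracy and read_reliable_tables and the default threshold of 90 are illustrative assumptions, not part of Camelot.

```python
def filter_by_accuracy(tables, min_accuracy=90.0):
    """Return the tables whose parsing_report accuracy is at least min_accuracy.

    Works on any iterable of objects exposing a `parsing_report` dict,
    such as the TableList returned by camelot.read_pdf.
    """
    return [t for t in tables if t.parsing_report["accuracy"] >= min_accuracy]


def read_reliable_tables(pdf_path, min_accuracy=90.0):
    """Read a PDF with Camelot and keep only the tables above the threshold."""
    import camelot  # imported lazily so filter_by_accuracy has no hard dependency

    return filter_by_accuracy(camelot.read_pdf(pdf_path), min_accuracy)
```

With the sample file above, read_reliable_tables('foo.pdf') would be expected to keep the single table, since its reported accuracy is 99.02.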

In the extracted output, merged cells show up as additional empty rows, which is a reliable way to handle them.
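
If the empty rows left by merged cells are not wanted downstream, they can be filtered out after extraction. The sketch below works on plain lists of cell strings, which could be obtained from a table with tables[0].df.values.tolist(). The helper name drop_empty_rows and the sample rows are hypothetical and only illustrate the idea.

```python
def drop_empty_rows(rows):
    """Remove rows in which every cell is empty or whitespace-only."""
    return [row for row in rows if any(str(cell).strip() for cell in row)]


# Hypothetical extracted rows: the second row is a blank left by a merged cell.
rows = [
    ["Cycle", "Speed"],
    ["", ""],
    ["2012_2", "3.1"],
]
cleaned = drop_empty_rows(rows)
print(cleaned)  # [['Cycle', 'Speed'], ['2012_2', '3.1']]
```

Rows that contain at least one non-blank cell are kept unchanged, so partially filled rows are not lost.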

Installation method

The project author provides three installation methods. First, you can use Conda, which is the simplest method.

conda install -c conda-forge camelot-py

The most popular installation method is using pip.

pip install "camelot-py[cv]"

You can also clone the code from the project and install from the source.

git clone https://www.github.com/camelot-dev/camelot
cd camelot
pip install ".[cv]"
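
Whichever method is used, a quick check from Python can confirm that the package is importable. This is a minimal sketch using only the standard library. The helper name camelot_available is an assumption made for illustration, and the check only confirms that a module named camelot can be found, not that it is a working install of camelot-py.

```python
import importlib.util


def camelot_available():
    """Return True if a module named `camelot` can be found in this environment."""
    return importlib.util.find_spec("camelot") is not None


if camelot_available():
    print("camelot is installed")
else:
    print("camelot not found; try one of the installation methods above")
```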
