PDF is a very common file format, typically used for formal electronic documents. It preserves formatting faithfully, producing clear and attractive layouts. However, for people who want to extract information from PDFs, especially tables, it can be a nightmare.
Many academic reports, papers, and analytical articles present tabular data in PDFs, yet copying data directly out of those tables can be very difficult. Recently, a developer released a tool called Camelot that extracts tables from text-based PDFs and converts most of them directly into pandas DataFrames.
Project address: https://github.com/camelot-dev/camelot
What is Camelot?
According to the project description, Camelot is a Python library for extracting table data from PDF files.
Specifically, users can open a PDF file much as they would open a file with pandas, extract its tables, and then choose an output format such as CSV.
Code example
The project provides a sample PDF file. Suppose the user wants to extract the data in its Table 2-1.
Using Camelot to extract table data, the code is as follows:
>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')  # similar to opening a CSV file with pandas
>>> tables[0].df  # the first table as a pandas DataFrame
>>> tables.export('foo.csv', f='csv', compress=True)  # export every table; f may also be json, excel, html, or sqlite
>>> tables[0].to_csv('foo.csv')  # write one table; to_json, to_excel, to_html, and to_sqlite also exist
>>> tables
<TableList n=1>
>>> tables[0]  # the repr shows the table's shape as (rows, columns)
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
In the extracted output, Camelot handles merged cells by adding empty rows, which is a reliable approach.
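The parsing report shown above can also be used to screen tables before exporting them. The following is only a minimal sketch: the helper names keep_accurate_tables and extract_accurate_tables and the threshold of 90 are assumptions made for illustration, not part of Camelot, and the stand-in objects simply mimic the parsing_report dictionary from the example.

```python
from types import SimpleNamespace


def keep_accurate_tables(tables, min_accuracy=90.0):
    """Return the tables whose parsing_report accuracy meets the threshold."""
    # min_accuracy is an arbitrary illustrative cutoff, not a Camelot default
    return [t for t in tables if t.parsing_report["accuracy"] >= min_accuracy]


def extract_accurate_tables(pdf_path, min_accuracy=90.0):
    """Read a PDF with Camelot and keep only the well-parsed tables."""
    import camelot  # imported lazily so the helper above works without Camelot

    tables = camelot.read_pdf(pdf_path)
    return keep_accurate_tables(tables, min_accuracy)


# Stand-in objects that mimic the parsing_report attribute (hypothetical data)
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 41.5, "whitespace": 60.0, "order": 2, "page": 1}),
]
kept = keep_accurate_tables(fake_tables)
print(len(kept), "of", len(fake_tables), "tables kept")
```

With a real file, extract_accurate_tables('foo.pdf') would return a plain list of tables, each of which can then be written out with to_csv as in the example above.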
Installation method
The project author describes three installation methods. The simplest is to use conda.
conda install -c conda-forge camelot-py
The most common installation method is pip. The package name is quoted here so that shells such as zsh do not interpret the square brackets.
pip install "camelot-py[cv]"
You can also clone the code from the project and install from the source.
git clone https://www.github.com/camelot-dev/camelot
cd camelot
pip install ".[cv]"
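Whichever method is used, it can be worth confirming that the package is importable before running the extraction example. A small sketch, assuming only the standard library; the helper name is_installed is hypothetical and not part of Camelot.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("camelot"):
    import camelot

    # Fall back to "unknown" in case a build does not expose a version string
    print("Camelot version:", getattr(camelot, "__version__", "unknown"))
else:
    print("Camelot is not importable; recheck the installation steps")
```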