[Python Office Automation] Batch convert Word documents to PDF and count the number of pages.

Hello everyone, today we will demonstrate how to use Python to batch convert Word documents to PDF format and count the pages of the resulting files.

Without further ado, let's get started!

PyPDF2 is a Python module that can be used to read, write, and manipulate PDF files. To install the PyPDF2 module, follow these steps:

Make sure you have Python installed. You can check if Python is installed by entering "python --version" in the terminal or command prompt.

Install the PyPDF2 module with pip:
pip install PyPDF2

If the module is not installed, importing it fails with the following error:
ModuleNotFoundError: No module named 'PyPDF2'

Once the installation is complete, you can use the PyPDF2 module in Python to read, write, and manipulate PDF files.
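
Before moving on, it can help to confirm that the interpreter you are running can actually see the package. The following is a minimal sketch using only the standard library; the helper name is_module_installed is made up for this illustration.

```python
import importlib.util


def is_module_installed(name):
    """Return True if `name` can be imported by the current interpreter."""
    # find_spec returns None when a top-level module cannot be located.
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if is_module_installed("PyPDF2"):
        print("PyPDF2 is available")
    else:
        print("PyPDF2 is missing; run: pip install PyPDF2")
```

If the check reports the module as missing right after a successful install, pip probably installed it for a different Python interpreter than the one running the script.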

For example, to extract the text content from a PDF file, import the PyPDF2 module in your Python script, open the file with the PdfReader class, and iterate over its pages. Here is a simple example:

import PyPDF2

# PdfReader, reader.pages and extract_text() are the PyPDF2 3.x names
reader = PyPDF2.PdfReader('example.pdf')
for page in reader.pages:
    print(page.extract_text())

This will print the text content of each page in the PDF file.

Note:
PyPDF2 3.0.0 removed a number of classes and functions that had previously been deprecated. Code written for earlier versions must switch to the replacements: for example, use "PdfReader" instead of "PdfFileReader", and "len(reader.pages)" instead of "getNumPages()" to get the number of pages in a PDF file.

Running old code against PyPDF2 3.0.0 or later raises errors such as the following two, each of which names the replacement to use:

PyPDF2.errors.DeprecationError: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead.

PyPDF2.errors.DeprecationError: reader.getNumPages is deprecated and was removed in PyPDF2 3.0.0. Use len(reader.pages) instead.
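
If a script has to run on machines with different PyPDF2 versions, one option is to count pages in a way that tolerates both APIs. This is only a sketch: page_count is a hypothetical helper, and it relies solely on the two calls named in the error messages above.

```python
def page_count(reader):
    """Return the number of pages for an old- or new-style reader object."""
    pages = getattr(reader, "pages", None)
    if pages is not None:
        # Current style: the reader exposes a sequence of pages.
        return len(pages)
    # Legacy style: getNumPages() was removed in PyPDF2 3.0.0.
    return reader.getNumPages()


# Typical use with PyPDF2 3.x (assumes example.pdf exists):
#     from PyPDF2 import PdfReader
#     print(page_count(PdfReader("example.pdf")))
```

In new code that only targets PyPDF2 3.x, calling len(reader.pages) directly is simpler and the helper is unnecessary.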

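Using Python to batch convert Word documents to PDF and count the pages of the converted documents
The script below drives Microsoft Word through COM, so it requires Windows with Word installed and the pywin32 package (pip install pywin32), which provides win32com and pythoncom. Code example: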

# -*- coding:utf-8 -*-
import os
from win32com.client import Dispatch
from win32com.client import constants
from win32com.client import gencache
from PyPDF2 import PdfReader
import re
import pythoncom

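def getfilenames(filepath='', filelist_out=None, file_ext='all'):
    """Recursively collect the files under filepath, filtered by extension."""
    # Avoid a mutable default argument, which would be shared between calls.
    if filelist_out is None:
        filelist_out = []
    for fpath, dirs, fs in os.walk(filepath):
        for f in fs:
            # Skip the "~$" lock files that Word creates while a document is open.
            if f.startswith('~$'):
                continue
            # Normalise backslashes to forward slashes.
            fi_d = re.sub(r'\\', '/', os.path.join(fpath, f))
            ext = os.path.splitext(f)[1].lower()
            if file_ext == 'all':
                filelist_out.append(fi_d)
            elif file_ext == '.doc':
                # '.doc' selects both legacy and modern Word documents.
                if ext in ('.doc', '.docx'):
                    filelist_out.append(fi_d)
            elif ext == file_ext.lower():
                filelist_out.append(fi_d)
    # Sort once, after the whole tree has been walked.
    filelist_out.sort()
    return filelist_out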

def wordtopdf(filelist, targetpath):
    """Convert each Word file to a PDF in targetpath.

    Returns (total pages, [[source file, page count], ...]), or False on failure.
    """
    totalPages = 0
    valueList = []
    w = None
    pythoncom.CoInitialize()
    try:
        # Load the Word type library wrappers so that `constants` is populated.
        gencache.EnsureModule('{00020905-0000-0000-C000-000000000046}', 0, 8, 4)
        w = Dispatch("Word.Application")
        for fullfilename in filelist:
            # Word's COM interface needs absolute paths.
            source = os.path.abspath(fullfilename)
            basename = os.path.splitext(os.path.basename(fullfilename))[0]
            pdf_name = os.path.abspath(os.path.join(targetpath, basename + ".pdf"))

            doc = None
            try:
                doc = w.Documents.Open(source, ReadOnly=1)
                doc.ExportAsFixedFormat(pdf_name, constants.wdExportFormatPDF,
                                        Item=constants.wdExportDocumentWithMarkup,
                                        CreateBookmarks=constants.wdExportCreateHeadingBookmarks)
            except Exception as e:
                print(e)
            finally:
                # Close the document so Word does not keep it open and locked.
                if doc is not None:
                    doc.Close(constants.wdDoNotSaveChanges)
            if os.path.isfile(pdf_name):
                pages = getPdfPageNum(pdf_name)
                valueList.append([fullfilename, str(pages)])
                totalPages += pages
            else:
                print('Conversion failed:', fullfilename)
                return False
        return totalPages, valueList
    except Exception as e:
        print('Error occurred!')
        print(e)
        return False
    finally:
        # Always shut Word down, even when a conversion fails part-way through.
        if w is not None:
            w.Quit(constants.wdDoNotSaveChanges)
        pythoncom.CoUninitialize()

def getPdfPageNum(path):
    with open(path, "rb") as file:
        doc = PdfReader(file)
        pagecount = len(doc.pages)
    return pagecount

if __name__ == '__main__':
    sourcepath = r"C:/Users/Lenovo/Desktop/python代码示例/word/"
    targetpath = r"C:/Users/Lenovo/Desktop/python代码示例/pdf/"
    # Make sure the output folder exists before converting.
    os.makedirs(targetpath, exist_ok=True)
    filelist = getfilenames(sourcepath, [], '.doc')
    result = wordtopdf(filelist, targetpath)
    # wordtopdf returns False on failure, so check before unpacking the result.
    if result and result[1]:
        totalPages, resultList = result
        for i in resultList:
            print(i[0], i[1])
        print("Total pages:", totalPages)
    else:
        print("No files to count or counting failed!")
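
One thing to watch for: getfilenames walks subfolders, but every PDF is written directly into targetpath, so two documents with the same base name (in different subfolders, or report.doc next to report.docx) would overwrite each other's output. A possible way around the subfolder case, sketched below under the assumption that you want the PDF folder to mirror the source folder structure, is to derive the PDF path from the document's path relative to the source root. target_pdf_path is a hypothetical helper and not part of the script above.

```python
from pathlib import Path


def target_pdf_path(source_file, source_root, target_root):
    """Map a document under source_root to a .pdf path under target_root,
    keeping the same relative subfolder."""
    # relative_to raises ValueError if source_file is not inside source_root.
    relative = Path(source_file).relative_to(source_root)
    return Path(target_root) / relative.with_suffix(".pdf")
```

Before exporting, the destination folder would still need to be created, for example with path.parent.mkdir(parents=True, exist_ok=True), because Word does not create missing folders itself.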