PDF Skills in Depth: From Basic Operations to Advanced Automation


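In day-to-day development, working with PDF files raises a recurring set of headaches. Left unresolved, they cost efficiency at best and can bring a system down at worst. These are the pain points I run into most often: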

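Memory leaks: when processing large PDFs, memory use can climb quickly if resources are not released promptly
Format compatibility: PDFs produced by different tools may parse with garbled text or broken layout
Slow batch processing: handling many files synchronously takes too long
Font rendering problems: missing embedded fonts cause documents to display differently across machines
Encrypted documents: password protection gets in the way of automated processing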
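For the batch-processing pain point, one common approach is to spread the files across a pool of workers. Below is a minimal sketch using only the standard library. The names `process_batch` and `process_one` are my own placeholders, not part of any PDF library; `process_one` stands for whatever per-file function you already have.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_batch(paths, process_one, max_workers=4):
    """Run process_one(path) for every path concurrently.

    Returns (results, errors): two dicts keyed by path.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which future belongs to which path
        futures = {pool.submit(process_one, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # One bad file should not stop the rest of the batch
                errors[path] = exc
    return results, errors
```

Threads help most when the per-file work waits on I/O or on an external process. For CPU-bound parsing in pure Python, `concurrent.futures.ProcessPoolExecutor` exposes the same interface and is probably the better fit.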

The Python ecosystem offers several PDF libraries, each with its own strengths and use cases:

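PyPDF2
Pros: pure-Python implementation, lightweight, supports basic operations (merge / split / rotate)
Cons: limited feature set; cannot generate PDFs from scratch
Best for: simple operations on existing PDF documents
pdfkit
Pros: built on wkhtmltopdf; produces good HTML-to-PDF output
Cons: requires an external dependency to be installed
Best for: converting web pages or HTML templates to PDF
ReportLab
Pros: powerful PDF generation with support for complex layouts
Cons: fairly steep learning curve
Best for: dynamically generating complex PDF reports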
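Before committing to one of these, it can be handy to check which of them are importable in the current environment. The sketch below uses only the standard library. `PDF_LIBRARIES` and `available_pdf_libraries` are names I made up for illustration, and the descriptions simply restate the comparison above.

```python
import importlib.util

# Import names of the three libraries compared above, with their typical use
PDF_LIBRARIES = {
    "PyPDF2": "basic operations (merge / split / rotate)",
    "pdfkit": "HTML to PDF (also needs the wkhtmltopdf binary)",
    "reportlab": "generating PDFs with complex layouts",
}

def available_pdf_libraries(candidates=PDF_LIBRARIES):
    """Map each import name to True/False depending on whether it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in candidates}

if __name__ == "__main__":
    for name, installed in available_pdf_libraries().items():
        print(f"{name:10} {'installed' if installed else 'missing':10} {PDF_LIBRARIES[name]}")
```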

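import os
from PyPDF2 import PdfReader, PdfWriter  # PdfFileReader / PdfFileWriter no longer work in PyPDF2 3.0+

def merge_pdfs(output_path, input_paths):
    """Merge several PDF files into one."""
    # Fail early with a clear error if any input file is missing
    missing = [p for p in input_paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"Input files not found: {missing}")

    writer = PdfWriter()
    try:
        for path in input_paths:
            reader = PdfReader(path)
            for page in reader.pages:
                writer.add_page(page)

        # The with block closes the output file even if write() fails
        with open(output_path, 'wb') as out:
            writer.write(out)
    except Exception as e:
        print(f"Merge failed: {e}")
        raise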

# Performance: when handling large files, read them page by page
# Error handling: catch file-not-found errors
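On the error-handling note: it is often cheaper to screen inputs before handing them to the merge step than to recover from a failure halfway through. Here is a small standard-library sketch that checks for the `%PDF-` header bytes. The helper names are hypothetical. The check is deliberately strict, so a file with junk bytes before the header would be rejected even though some readers tolerate it.

```python
import os

PDF_MAGIC = b"%PDF-"  # every PDF file header starts with these bytes

def looks_like_pdf(path):
    """Return True if path is an existing file that starts with the PDF header."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as fh:
        return fh.read(len(PDF_MAGIC)) == PDF_MAGIC

def split_valid_inputs(paths):
    """Partition paths into (usable, rejected), preserving the original order."""
    usable, rejected = [], []
    for p in paths:
        (usable if looks_like_pdf(p) else rejected).append(p)
    return usable, rejected
```

The `usable` list can then be passed to a merge function, while `rejected` goes to a log or a report.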

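import os
import tempfile

import pdfkit

def html_to_pdf(html_string, output_path):
    """Convert an HTML string to a PDF file."""
    # Write to a temporary file to avoid memory problems with large documents.
    # delete=False lets wkhtmltopdf reopen the file on every platform;
    # the finally block removes it.
    tmp = tempfile.NamedTemporaryFile(suffix='.html', delete=False)
    try:
        tmp.write(html_string.encode('utf-8'))
        tmp.close()

        options = {
            'encoding': 'UTF-8',
            'quiet': ''
        }
        pdfkit.from_file(tmp.name, output_path, options=options)
    except Exception as e:
        print(f"Conversion failed: {e}")
        raise
    finally:
        # Clean up the temporary file whether or not the conversion succeeded
        tmp.close()
        os.unlink(tmp.name)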

# Note: make sure wkhtmltopdf is installed correctly
# Performance tip: reuse the configuration when processing in batches
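Both notes can be covered by a small helper: locate the wkhtmltopdf binary once, then build the pdfkit arguments once and reuse them for every file. This is a sketch under assumptions. `find_wkhtmltopdf` and `build_pdfkit_kwargs` are names I invented, and the binary is assumed to be on `PATH`. `pdfkit.configuration(wkhtmltopdf=...)` and the `configuration=` / `options=` keyword arguments are part of pdfkit's public API.

```python
import shutil

def find_wkhtmltopdf(binary_name="wkhtmltopdf"):
    """Return the absolute path of the binary, or None if it is not on PATH."""
    return shutil.which(binary_name)

def build_pdfkit_kwargs(options=None):
    """Build the keyword arguments once so a whole batch can reuse them."""
    path = find_wkhtmltopdf()
    if path is None:
        raise RuntimeError("wkhtmltopdf not found on PATH; install it before using pdfkit")
    import pdfkit  # imported lazily so the PATH check works without pdfkit installed
    return {
        "configuration": pdfkit.configuration(wkhtmltopdf=path),
        "options": options or {"encoding": "UTF-8", "quiet": ""},
    }
```

In a batch loop, the result would be created once and unpacked into each call, for example `pdfkit.from_file(src, dst, **kwargs)`.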

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def generate_pdf(output_path, data):
    """Generate a PDF containing dynamic data."""
    c = canvas.Canvas(output_path, pagesize=letter)
    try:
        # Set document metadata
        c.setAuthor("Your Company")

        # Add content, one row per item
        y_position = 700
        for item in data:
            c.drawString(100, y_position, f"Item: {item['name']}")
            c.drawString(300, y_position, f"Price: ${item['price']}")
            y_position -= 20

            # Page break: start a new page when the bottom margin is reached
            if y_position < 50:
                c.showPage()
                y_position = 700

        c.save()
    except Exception as e:
        print(f"Generation failed: {e}")
        # Nothing to release here: the output file is only written by c.save()
        raise

# Best practice: use a stylesheet to keep formatting consistent
# Performance: precompute the layout to avoid repeated calculations
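One way to read the "precompute the layout" note is to work out page breaks and y coordinates before touching the canvas at all. The sketch below is plain Python with no ReportLab calls. `paginate` is a name I made up, and its defaults (700, 50, 20) are simply the values used in `generate_pdf` above.

```python
def paginate(items, top=700, bottom=50, line_height=20):
    """Split items into pages of (y, item) pairs for a top-down text layout."""
    if line_height <= 0 or top < bottom:
        raise ValueError("line_height must be positive and top must not be below bottom")
    # Rows fit at y = top, top - line_height, ... as long as y >= bottom
    rows_per_page = (top - bottom) // line_height + 1
    pages = []
    for start in range(0, len(items), rows_per_page):
        chunk = items[start:start + rows_per_page]
        pages.append([(top - i * line_height, item) for i, item in enumerate(chunk)])
    return pages
```

Each `(y, item)` pair can then be drawn with `c.drawString`, with `c.showPage()` called between pages, which keeps the drawing loop free of layout arithmetic.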