Hands-On Large-Model Application Solutions with the GPT-3/4 Transformer Architecture: From PDF Processing to Natural Language Understanding

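PDF is the most common carrier of unstructured data and is used heavily in finance, law, and healthcare. Its complex formatting (mixed text and image layouts, nested tables, multi-column typesetting, and so on) creates three main challenges for automated processing: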

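Hard-to-parse formatting: traditional OCR tools struggle to preserve the document's original logical structure and semantic relationships
Uneven information density: body text, footnotes, page headers, and similar content need to be handled differently
Model input limits: the GPT-3.5-turbo context window is only 16k tokens (roughly 12 pages of plain text), so anything longer has to be processed in chunks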
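The second challenge above, treating headers and footers differently from body text, can be approached with a simple frequency heuristic. The following is a minimal sketch, not the article's own method: `strip_repeated_lines` and its `min_ratio` threshold are hypothetical names, and it assumes the text has already been extracted into one string per page.

```python
from collections import Counter
from typing import List


def strip_repeated_lines(pages: List[str], min_ratio: float = 0.6) -> List[str]:
    """Drop lines that recur on most pages (likely running headers or footers).

    pages: one extracted text string per page.
    min_ratio: hypothetical threshold; a line seen on at least this fraction
               of pages (and on at least two pages) is treated as boilerplate.
    """
    # Count each distinct non-empty line once per page.
    counts: Counter = Counter()
    for page in pages:
        counts.update({line.strip() for line in page.splitlines() if line.strip()})

    threshold = max(2, int(len(pages) * min_ratio))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned
```

Lines that change from page to page, such as "Page 3", are not caught by this exact-match heuristic and would need a separate pattern.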

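PyPDF2:
Pros: easy to install, fast at basic text extraction
Cons: cannot parse visual elements such as tables, and handles scanned PDFs poorly
Best for: quick processing of text-only documents
pdfplumber:
Pros: supports table extraction (the extract_table() method) and keeps text position information
Cons: relatively high memory usage (best used inside a with statement)
Best for: academic papers and financial statements with complex layouts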

Model	Max tokens	Cost per 1k tokens	Notes
gpt-3.5-turbo	16k	$0.002	Cost-effective, fast responses
gpt-4	32k	$0.06	Higher accuracy on complex tasks
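To make the trade-off in the table concrete, here is a rough sketch of the arithmetic. The prices and window sizes are simply copied from the table and should be treated as a snapshot; `estimate_cost`, `chunks_needed`, and the `reserve_tokens` allowance for the prompt and the reply are hypothetical names introduced for illustration.

```python
import math

# Per-1k-token prices (USD) copied from the table above; a snapshot, not live pricing.
PRICE_PER_1K = {"gpt-3.5-turbo": 0.002, "gpt-4": 0.06}


def estimate_cost(total_tokens: int, model: str) -> float:
    """Approximate cost in USD of sending total_tokens through the given model."""
    return total_tokens / 1000 * PRICE_PER_1K[model]


def chunks_needed(total_tokens: int, window_tokens: int, reserve_tokens: int = 1000) -> int:
    """Number of chunks required when reserve_tokens of each window are
    kept free for the system prompt and the model's reply (an assumption)."""
    usable = window_tokens - reserve_tokens
    if usable <= 0:
        raise ValueError("reserve_tokens must be smaller than window_tokens")
    return math.ceil(total_tokens / usable)
```

Under these assumptions, a 50,000-token document needs 4 chunks in a 16,000-token window, and processing it costs about $0.10 with gpt-3.5-turbo versus about $3.00 with gpt-4.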

import pdfplumber
import re

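def extract_text_from_pdf(pdf_path):
    """
    Extract text from a PDF and clean it.
    :param pdf_path: path to the PDF file
    :return: list of cleaned text segments, split by paragraph
    """
    clean_text = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract the page text, keeping line breaks; a page with no text yields an empty string here
            text = page.extract_text(layout=True) or ""

            # Cleaning steps
            text = re.sub(r'[ \t]+', ' ', text)  # collapse runs of spaces/tabs but keep newlines
            text = text.replace('\ufeff', '')  # strip the BOM

            # Split into paragraphs (assumes a blank line separates paragraphs)
            paragraphs = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
            clean_text.extend(paragraphs)

    return clean_text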

import openai
from typing import List

def query_gpt(text_chunks: List[str], system_prompt: str) -> List[str]:
    """
    Call the GPT API chunk by chunk.
    :param text_chunks: list of text chunks (each chunk must fit within the model limit)
    :param system_prompt: system prompt that defines the task role
    :return: list of model responses
    """
    responses = []
    for chunk in text_chunks:
        # Legacy interface from openai<1.0; newer SDK versions use client.chat.completions.create
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": chunk}
            ],
            temperature=0.3  # lower randomness for more stable output
        )
        responses.append(response.choices[0].message.content)
    return responses

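# Example: summarizing legal clauses
system_prompt = """You are a senior legal counsel who turns complex legal clauses into plain-language key points. Requirements:
1. Keep every clause that states a right or an obligation
2. List the key content as bullet points
3. Cite the page number of the source text"""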

Semantic chunking: use NLTK to detect sentence boundaries so that no chunk cuts a sentence in half

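from typing import List

from nltk.tokenize import sent_tokenize  # requires NLTK's Punkt sentence tokenizer data

def semantic_chunking(text: str, max_tokens=15000) -> List[str]:
    # Token counts are approximated by whitespace-separated words,
    # which undercounts for text without spaces (e.g. Chinese).
    sentences = sent_tokenize(text)
    chunks, current_chunk = [], ""
    for sent in sentences:
        if len(current_chunk.split()) + len(sent.split()) < max_tokens:
            current_chunk += sent + " "
        else:
            if current_chunk:  # avoid an empty chunk when the first sentence is already too long
                chunks.append(current_chunk.strip())
            current_chunk = sent + " "
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks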

Overlapping windows: adjacent chunks share 20% of their content so that context is not broken at chunk boundaries
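One way the overlapping-window idea could be implemented is a sliding window over sentences. This is a sketch under assumptions: `overlapping_chunks`, `chunk_size`, and `overlap_ratio` are hypothetical names, the input is an already-split list of sentences, and chunk size is measured in sentences rather than tokens to keep the example short.

```python
from typing import List


def overlapping_chunks(sentences: List[str], chunk_size: int = 10,
                       overlap_ratio: float = 0.2) -> List[str]:
    """Group sentences into chunks where each chunk repeats the tail of the previous one."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)   # sentences shared with the previous chunk
    step = max(1, chunk_size - overlap)         # how far the window advances each time

    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + chunk_size]))
        if start + chunk_size >= len(sentences):
            break  # the window already reached the end; stop before emitting a redundant tail
    return chunks
```

With `chunk_size=5` and `overlap_ratio=0.2`, each chunk starts with the last sentence of the chunk before it.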

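Enable stream=True to receive the response incrementally, which avoids retries caused by timeouts
Use the max_tokens parameter to cap the output length
Monitor response quality through logprobs and automatically filter out low-confidence results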
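The logprobs-based filtering in the last tip could look roughly like the sketch below. It assumes the per-token log-probabilities have already been pulled out of each API response into plain lists (how that is done depends on the SDK version and is not shown); `mean_token_probability`, `filter_low_confidence`, and the `min_prob` cutoff of 0.7 are hypothetical choices for illustration.

```python
import math
from typing import List, Tuple


def mean_token_probability(token_logprobs: List[float]) -> float:
    """Geometric-mean token probability: exp of the average log-probability."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def filter_low_confidence(results: List[Tuple[str, List[float]]],
                          min_prob: float = 0.7) -> List[str]:
    """Keep only the response texts whose mean token probability reaches min_prob."""
    return [text for text, logprobs in results
            if mean_token_probability(logprobs) >= min_prob]
```

The geometric mean is used so that long and short responses are compared on the same per-token scale; a suitable threshold would have to be tuned on real outputs.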

Symptom: the extracted text contains garbled characters (for example, â€¢ in place of •)

Solution:

from pdfminer.high_level import extract_text

with open(pdf_path, 'rb') as f:
    raw = extract_text(f, codec='utf-8')  # specify the encoding explicitly
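The specific garbling shown in the symptom (â€¢ for •) is what UTF-8 bytes look like when they are decoded as Windows-1252. If text has already been extracted in that state, a round trip can sometimes recover it. This is a sketch of that swapped-in technique, with `repair_mojibake` as a hypothetical helper name; it is not part of pdfminer.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mistakenly decoded as Windows-1252."""
    try:
        # Re-encode to the bytes the text originally came from, then decode them correctly.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text is not this kind of mojibake; leave it unchanged.
        return text
```

Because the round trip is applied to the whole string, text that mixes garbled and correctly decoded non-ASCII characters is returned unchanged and would need to be repaired piece by piece.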

Recommended approach: combine pdfplumber and camelot

import camelot

tables = camelot.read_pdf(pdf_path, flavor='lattice')  # handles tables drawn with ruling lines
dfs = [table.df for table in tables]
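On the pdfplumber side of this combination, extracted tables arrive as nested lists rather than DataFrames. The sketch below shows one possible way to normalize them; it assumes `rows` has the list-of-rows shape that `extract_table()` returns (with `None` for empty cells) and that the first row is the header. `rows_to_records` is a hypothetical helper name.

```python
from typing import Dict, List, Optional


def rows_to_records(rows: List[List[Optional[str]]]) -> List[Dict[str, str]]:
    """Turn a header row plus data rows into a list of dicts, skipping blank rows."""
    if not rows:
        return []
    header = [(cell or "").strip() for cell in rows[0]]
    records = []
    for row in rows[1:]:
        cells = [(cell or "").strip() for cell in row]
        if not any(cells):
            continue  # drop rows where every cell is empty
        records.append(dict(zip(header, cells)))
    return records
```

Records in this form can be compared against the camelot DataFrames or passed to the model as compact text.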

Data masking:

Use regular expressions to automatically filter out sensitive information such as ID numbers and bank card numbers

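import re

def anonymize_text(text):
    # Mainland China resident ID: 18 characters, the last may be X.
    # The lookarounds prevent matches inside longer digit runs.
    text = re.sub(r'(?<!\d)\d{17}[\dXx](?!\d)', '[ID_NUMBER]', text)
    # 16-digit bank card number
    return re.sub(r'(?<!\d)\d{16}(?!\d)', '[CARD_NUMBER]', text)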