An Essential OpenClaw Development Skill: A Practical Guide to Building an Efficient Scraping System from Scratch

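Last year our team took over an e-commerce price-monitoring project. We started with a simple Requests + BeautifulSoup setup, and within three days we ran into the following:

The target site banned our entire /24 (Class C) IP range, so the site became unreachable from the company network
The miss rate for key product prices reached 37% (dynamically loaded content was not handled)
A single machine collected only about 20,000 records per day (competitors managed 200,000+)

The lesson we took from this: a scraper without proper engineering is a time bomb. Below is the practical experience we gathered while rebuilding it.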

Test environment: Alibaba Cloud server with 2 cores and 4 GB of RAM; the target site allows 5 requests per second

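# Synchronous approach (requests)
import requests

def sync_fetch(urls):
    for url in urls:
        # Each call blocks until the response arrives
        resp = requests.get(url, timeout=10)
        # process the response...

# Asynchronous approach (aiohttp)
import aiohttp

async def async_fetch(session, url):
    # The event loop can run other requests while this one waits
    async with session.get(url) as resp:
        return await resp.text()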

Performance comparison (1,000 requests):

Approach	Time (s)	CPU usage	Network I/O wait
requests	218	15%	92%
aiohttp	47	68%	31%
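
To illustrate why the asynchronous version finishes sooner when most of the time is spent waiting on the network, here is a minimal sketch that replaces real HTTP calls with `asyncio.sleep`. The helper `fake_fetch`, the 0.05 s latency, and the example URLs are assumptions made for illustration; this is not the benchmark behind the table above.

```python
import asyncio
import time

async def fake_fetch(url, latency=0.05):
    # Hypothetical stand-in for a network call: it only waits, no real I/O
    await asyncio.sleep(latency)
    return f"body of {url}"

async def run_sequential(urls):
    # One request at a time, like the requests loop
    return [await fake_fetch(u) for u in urls]

async def run_concurrent(urls):
    # All requests in flight at once, like aiohttp with asyncio.gather
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

def timed(coro_fn, urls):
    # Run the coroutine function and return (result, elapsed seconds)
    start = time.perf_counter()
    result = asyncio.run(coro_fn(urls))
    return result, time.perf_counter() - start

if __name__ == "__main__":
    urls = [f"https://example.com/item/{i}" for i in range(20)]
    _, seq = timed(run_sequential, urls)
    _, con = timed(run_concurrent, urls)
    print(f"sequential: {seq:.2f}s, concurrent: {con:.2f}s")
```

The sequential run pays the latency once per URL, while the concurrent run overlaps the waits, which is the same effect the I/O wait column reflects.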

import asyncio
import aiohttp
from aiohttp import ClientTimeout, TCPConnector

class AsyncFetcher:
    def __init__(self, max_conn=100):
        # Caps the number of requests in flight at any moment
        self.semaphore = asyncio.Semaphore(max_conn)
        self.timeout = ClientTimeout(total=10)

    async def _request(self, session, url):
        async with self.semaphore:
            try:
                async with session.get(url, timeout=self.timeout) as resp:
                    if resp.status == 200:
                        return await resp.text()
                    # Non-200 responses are reported and treated as failures
                    print(f"Unexpected status {resp.status}: {url}")
                    return None
            except Exception as e:
                print(f"Request failed: {url} - {e}")
                return None

    async def batch_fetch(self, urls):
        # limit=0 removes the connector's own cap; the semaphore is the only limit
        connector = TCPConnector(limit=0)
        async with aiohttp.ClientSession(connector=connector) as session:
            tasks = [self._request(session, url) for url in urls]
            return await asyncio.gather(*tasks, return_exceptions=True)

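Key optimizations:

Use a Semaphore to cap concurrency and avoid sudden bursts of requests
Tune the connection pool through TCPConnector
Handle timeouts and exceptions in one place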
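
To make the first point concrete, the sketch below measures how many jobs are inside a semaphore-guarded section at the same time. The names `ConcurrencyProbe`, `guarded_job`, and `run_jobs` are hypothetical, and `asyncio.sleep` stands in for the real `session.get` call.

```python
import asyncio

class ConcurrencyProbe:
    """Tracks how many jobs are inside the guarded section at once."""
    def __init__(self):
        self.active = 0
        self.peak = 0

async def guarded_job(semaphore, probe, delay=0.01):
    async with semaphore:
        probe.active += 1
        probe.peak = max(probe.peak, probe.active)
        await asyncio.sleep(delay)  # stand-in for session.get(...)
        probe.active -= 1

async def run_jobs(total, limit):
    # Create the semaphore inside the running loop, then launch every job at once
    semaphore = asyncio.Semaphore(limit)
    probe = ConcurrencyProbe()
    await asyncio.gather(*(guarded_job(semaphore, probe) for _ in range(total)))
    return probe.peak

if __name__ == "__main__":
    peak = asyncio.run(run_jobs(total=50, limit=5))
    print(f"peak concurrency: {peak}")
```

Even though all 50 jobs are scheduled immediately, the observed peak never exceeds the limit passed to the semaphore.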

from faker import Faker
import random

class HeaderFactory:
    def __init__(self):
        self.faker = Faker()
        # One User-Agent per line; blank lines are skipped
        with open('user_agents.txt', encoding='utf-8') as f:
            self.user_agents = [line.strip() for line in f if line.strip()]

    def generate(self):
        # Build a fresh randomized header set for each request
        return {
            'User-Agent': random.choice(self.user_agents),
            'Referer': self.faker.url(),
            'Accept-Language': 'en-US,en;q=0.9',
            'X-Forwarded-For': self.faker.ipv4(),
        }

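Supporting strategies:

Maintain 2,000+ real User-Agent strings
Switch IP every 10 requests (implemented with a proxy pool)
Randomize the interval between requests (0.5 s to 3 s)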
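
The rotation and delay rules above could be sketched as follows. `ProxyRotator` and `random_delay` are hypothetical helpers, not part of any library, and the proxy URLs are placeholders; how the proxy pool itself is filled is outside the scope of this sketch.

```python
import itertools
import random

class ProxyRotator:
    """Hands out the same proxy for `per_proxy` requests, then moves to the next."""
    def __init__(self, proxies, per_proxy=10):
        if not proxies:
            raise ValueError("proxies must not be empty")
        self._cycle = itertools.cycle(proxies)
        self._per_proxy = per_proxy
        self._used = 0
        self._current = next(self._cycle)

    def get(self):
        # Advance to the next proxy once the current one has served its quota
        if self._used >= self._per_proxy:
            self._current = next(self._cycle)
            self._used = 0
        self._used += 1
        return self._current

def random_delay(low=0.5, high=3.0):
    # Seconds to wait before the next request
    return random.uniform(low, high)

if __name__ == "__main__":
    rotator = ProxyRotator(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])
    for _ in range(12):
        print(rotator.get(), f"sleep {random_delay():.2f}s")
```

In an aiohttp-based fetcher, one way to use this would be passing `proxy=rotator.get()` to `session.get` and awaiting `asyncio.sleep(random_delay())` between requests.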

from lxml import etree
import re

def parse_product(html):
    tree = etree.HTML(html)

    # Extract the main fields with XPath
    name = tree.xpath('//h1[@class="title"]/text()')[0].strip()

    # Use a regex to pull the number out of formatted price text
    price_text = tree.xpath('//span[@class="price"]/text()')[0]
    price = re.search(r'\d+\.\d{2}', price_text).group()

    # Read data that the page injects through JavaScript
    script = tree.xpath('//script[contains(.,"window.__DATA__")]/text()')[0]
    stock = re.search(r'"stock":(\d+)', script).group(1)

    return {
        'name': name,
        'price': float(price),
        'stock': int(stock)
    }
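
The stock regex above works when the key sits directly in the script text. If `window.__DATA__` is assigned a valid JSON object literal (an assumption that may not hold for every site), parsing the whole object with the standard `json` module may be more robust than matching individual keys. The function name `extract_page_data` and the sample string are hypothetical.

```python
import json
import re

def extract_page_data(script_text, marker="window.__DATA__"):
    """Return the JSON value assigned to `marker`, or None if absent or invalid."""
    match = re.search(re.escape(marker) + r"\s*=\s*", script_text)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index
        # and ignores whatever follows it (for example a trailing ";")
        data, _ = json.JSONDecoder().raw_decode(script_text, match.end())
    except json.JSONDecodeError:
        return None
    return data

if __name__ == "__main__":
    sample = 'window.__DATA__ = {"stock": 12, "sku": {"id": "A1"}};'
    data = extract_page_data(sample)
    print(data["stock"] if data else "no data found")
```

Returning `None` instead of raising lets the caller count the page as a parse miss and continue with the rest of the batch.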

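Switch to a different proxy IP pool immediately
Reduce the request rate to 1/10 of the original
Check whether the crawler violates robots.txt rules
Simulate human behavior (mouse movement, time spent on the page)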
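
Two of these measures, the robots.txt check and the rate reduction, can be sketched with the standard library alone. `build_robots_checker` and `Throttle` are hypothetical names, the robots.txt text is assumed to have been downloaded already, and the starting rate of 5 requests per second is taken from the test environment described earlier.

```python
from urllib.robotparser import RobotFileParser

def build_robots_checker(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

class Throttle:
    """Holds the current request rate and cuts it when a block is detected."""
    def __init__(self, requests_per_second=5.0):
        self.requests_per_second = requests_per_second

    def on_blocked(self, factor=10):
        # Drop to 1/factor of the current rate
        self.requests_per_second /= factor

    @property
    def interval(self):
        # Seconds to wait between two requests at the current rate
        return 1.0 / self.requests_per_second

if __name__ == "__main__":
    checker = build_robots_checker("User-agent: *\nDisallow: /private/\n")
    print(checker.can_fetch("my-crawler", "https://example.com/private/x"))
    throttle = Throttle()
    throttle.on_blocked()
    print(throttle.interval)
```

How a block is detected (for example an HTTP 403 or 429 response) is left to the caller; the sketch only covers the reaction.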

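Recommended options, in priority order:

CAPTCHA-solving platform (e.g. Chaojiying)
Self-trained CNN model (suited to CAPTCHAs with a fixed style)
Manual solving queue (fallback)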
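
One possible way to wire this priority order together is a simple fallback chain. Everything here is hypothetical: `solve_with_fallback` is an illustrative helper, and each solver is assumed to be a plain callable that takes the image bytes and returns the recognized text (or raises or returns an empty value on failure). No real solving-platform API is shown.

```python
def solve_with_fallback(image_bytes, solvers):
    """Try each (name, solver) pair in order; return (name, answer) of the first success."""
    for name, solver in solvers:
        try:
            answer = solver(image_bytes)
        except Exception as exc:
            # A failing solver should not stop the chain
            print(f"{name} failed: {exc}")
            continue
        if answer:
            return name, answer
    return None, None

if __name__ == "__main__":
    def platform_solver(image_bytes):
        # Stub that simulates an outage of the external platform
        raise TimeoutError("platform unavailable")

    def cnn_solver(image_bytes):
        # Stub that simulates a local model prediction
        return "7K2P"

    solvers = [("platform", platform_solver), ("cnn", cnn_solver)]
    print(solve_with_fallback(b"fake-image", solvers))
```

The manual queue would be the last entry in `solvers`, so it is reached only when the automated options fail.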

import pickle

class StateManager:
    def __init__(self, state_file='crawl_state.pkl'):
        self.state_file = state_file

    def save(self, urls_done, urls_todo):
        with open(self.state_file, 'wb') as f:
            pickle.dump({
                'done': urls_done,
                'todo': urls_todo
            }, f)

    def load(self):
        try:
            with open(self.state_file, 'rb') as f:
                return pickle.load(f)
        except FileNotFoundError:
            return {'done': [], 'todo': []}
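
If the crawler is killed while the state file is being written, a half-written pickle can be left on disk. A common way to reduce that risk, sketched here under the assumption that the temporary file and the state file live on the same filesystem, is to write to a temporary file and then rename it into place. The function names `save_state_atomic` and `load_state` are hypothetical and mirror the methods above.

```python
import os
import pickle
import tempfile

def save_state_atomic(path, urls_done, urls_todo):
    """Write the state to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump({"done": list(urls_done), "todo": list(urls_todo)}, f)
        # os.replace swaps the file in one step on the same filesystem
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the swap
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_state(path):
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        # First run: nothing has been crawled yet
        return {"done": [], "todo": []}

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        state_path = os.path.join(workdir, "crawl_state.pkl")
        save_state_atomic(state_path, ["https://example.com/1"], ["https://example.com/2"])
        print(load_state(state_path))
```

Because `pickle.load` can execute arbitrary code, the state file should only ever be one the crawler wrote itself.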