Test Automation in Practice: Building a Highly Maintainable Skill Testing Framework from Scratch


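When developing conversational AI skills, test automation tends to run into a few typical problems:

Test-case avalanche from NLU model iteration: when intent classification or entity extraction rules change, 70% of the test cases need to be updated in step
Multi-turn dialogue state is hard to track: when the user says "cancel that last action", the test script cannot automatically tell which step the context refers to
Platform API limits: skill platform providers often strictly throttle call frequency in development environments, so a traditional polling strategy immediately triggers rate limiting
Complex result verification: the response to "book me a hotel in Beijing for tomorrow" requires checking the voice output, the card display, and the backend booking status together

Traditional UI automation tools (Selenium/Appium) do not work here, because a skill has no visual interface and the interaction depends on natural language understanding.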

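# Example project structure
skill_test/
├── api/                   # API layer
│   └── skill_client.py    # wraps the platform REST API
├── intent/                # intent layer
│   └── weather.py         # intents specific to the weather skill
├── flow/                  # flow layer
│   └── hotel_booking.py   # multi-turn hotel booking flow
└── conftest.py            # global Pytest configuration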

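API layer: calls the skill platform's endpoints directly and returns the raw response data

import requests

API_ENDPOINT = "https://skill-platform.example.com/query"  # placeholder, replace with your platform's endpoint

class SkillClient:
    def send_text(self, utterance: str):
        # OAuth2 authentication, rate limiting, and similar plumbing belong here
        return requests.post(API_ENDPOINT, json={"query": utterance}, timeout=10)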

Intent layer: wraps business operations using the Page Object pattern

class WeatherSkill:
    def __init__(self, client):
        self.client = client

    def ask_weather(self, city: str, date="today"):
        # Wrap the logic that triggers the intent
        response = self.client.send_text(f"{city} weather {date}")
        return WeatherResponse(response)  # return a domain object
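
The `WeatherResponse` domain object is used above but never shown. A minimal sketch of what it might look like follows, assuming the platform returns a JSON payload with `speech`, `card`, and `weather` fields. Those field names, and the `from_payload` constructor, are assumptions made for illustration and do not come from any specific platform.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class WeatherResponse:
    """Domain view over a raw skill payload (field names are assumptions)."""

    speech: str
    card_title: Optional[str]
    weather_description: str

    @classmethod
    def from_payload(cls, payload: Dict[str, Any]) -> "WeatherResponse":
        card = payload.get("card") or {}
        return cls(
            speech=payload.get("speech", ""),
            card_title=card.get("title"),
            # Fall back to the spoken text when no structured field is present.
            weather_description=payload.get("weather", payload.get("speech", "")),
        )


sample = {"speech": "Beijing is sunny today", "card": {"title": "Beijing"}, "weather": "sunny"}
parsed = WeatherResponse.from_payload(sample)
```

Tests then assert on `parsed.weather_description` rather than on raw JSON, so a change in the payload layout only touches this one class.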

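Flow layer: combines multiple intents into an end-to-end test

def test_hotel_booking_flow(client):  # `client` is assumed to be a fixture from conftest.py
    # Initialize the Page Objects
    calendar = CalendarSkill(client)
    hotel = HotelSkill(client)

    # Simulate a multi-turn conversation
    calendar.select_date("tomorrow")
    hotels = hotel.search("Beijing", price_range="500-1000")
    first_hotel = hotels[0].select()
    assert "Booking confirmed" in first_hotel.confirm()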
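
Flow tests like the one above need a `client` to drive the conversation. One option, sketched here under the assumption that the client only has to expose `send_text`, is a scripted test double that replays canned replies, so the flow logic can be exercised without calling the real platform. `ScriptedClient` is a hypothetical helper and not part of any library.

```python
from collections import deque
from typing import Deque, List, Tuple


class ScriptedClient:
    """Test double that replays canned replies in order (hypothetical helper)."""

    def __init__(self, replies: List[str]):
        self._replies: Deque[str] = deque(replies)
        self.sent: List[str] = []  # utterances in the order they were sent

    def send_text(self, utterance: str) -> str:
        self.sent.append(utterance)
        if not self._replies:
            raise AssertionError(f"no scripted reply left for: {utterance!r}")
        return self._replies.popleft()


client = ScriptedClient(["Which city?", "Found 3 hotels", "Booking confirmed"])
transcript: List[Tuple[str, str]] = []
for utterance in ["book a hotel for tomorrow", "Beijing", "the first one"]:
    transcript.append((utterance, client.send_text(utterance)))
```

Because the double raises as soon as the script runs out, a flow that sends more turns than expected fails loudly instead of silently passing.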

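Use Pytest parametrization to avoid a test-case explosion:

# test_weather.py
import pytest

@pytest.mark.parametrize("city,date,expected", [
    ("Beijing", "today", "sunny"),
    ("Shanghai", "tomorrow", "cloudy"),
    ("Guangzhou", "the day after tomorrow", "thunderstorm"),
])
def test_weather_prediction(client, city, date, expected):  # `client` is assumed to be a fixture
    result = WeatherSkill(client).ask_weather(city, date)
    assert expected in result.weather_description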

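Skill platforms typically take 3 to 5 seconds to generate a response, so a hybrid waiting strategy is recommended:

from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

class ResponseNotReady(Exception):
    """Raised while the skill is still processing, so that tenacity retries."""

class AsyncSkillClient(SkillClient):
    @retry(
        retry=retry_if_exception_type(ResponseNotReady),
        wait=wait_exponential(multiplier=1, max=10),
        stop=stop_after_attempt(3),
    )
    def wait_for_response(self, session_id: str):
        # get_session_status and SkillRuntimeError are assumed to be defined elsewhere
        response = self.get_session_status(session_id)
        if response["status"] == "COMPLETED":
            return response
        if response["status"] == "FAILED":
            raise SkillRuntimeError(response["error"])  # a real failure, not retried
        raise ResponseNotReady(session_id)  # any other status triggers a retry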
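
To make the "hybrid" idea concrete, here is a dependency-free sketch of one possible reading: wait a fixed initial delay (since responses rarely arrive sooner), then poll with doubling intervals until a timeout. The function name `poll_until` and all of its parameters are assumptions for illustration. It is a simplified stand-in and does not show how tenacity works internally.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    fetch: Callable[[], Optional[T]],
    initial_delay: float = 3.0,
    max_interval: float = 10.0,
    timeout: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Sleep `initial_delay` once, then poll with doubling intervals.

    Only time spent sleeping counts toward `timeout`; time spent inside
    `fetch` itself is ignored in this simplified version.
    """
    sleep(initial_delay)
    waited = initial_delay
    interval = 1.0
    while True:
        result = fetch()
        if result is not None:
            return result
        if waited + interval > timeout:
            raise TimeoutError(f"no response after {waited:.1f}s")
        sleep(interval)
        waited += interval
        interval = min(interval * 2, max_interval)


# Demo with a fake sleep so nothing actually blocks.
answers = iter([None, None, {"status": "COMPLETED"}])
slept: List[float] = []
result = poll_until(lambda: next(answers), sleep=slept.append)
```

Injecting `sleep` keeps the helper testable: the demo records the requested delays instead of actually waiting.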

Attach the conversation context to the Allure report:

import json

import allure
import pytest

@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    outcome = yield
    report = outcome.get_result()

    if report.when == "call" and hasattr(item, "conversation"):
        allure.attach(
            json.dumps(item.conversation, indent=2, ensure_ascii=False),
            name="Conversation log",
            attachment_type=allure.attachment_type.JSON,
        )
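
The hook above reads `item.conversation`, but the article does not show where that attribute comes from. One way to populate it, sketched here with a hypothetical `ConversationRecorder` wrapper, is to record every turn that passes through the client.

```python
import json
from typing import Dict, List


class ConversationRecorder:
    """Wraps a client and keeps every turn for later attachment (hypothetical helper)."""

    def __init__(self, client):
        self._client = client
        self.turns: List[Dict[str, str]] = []

    def send_text(self, utterance: str) -> str:
        reply = self._client.send_text(utterance)
        self.turns.append({"user": utterance, "skill": str(reply)})
        return reply


class EchoClient:
    """Stand-in client used only for this demo."""

    def send_text(self, utterance: str) -> str:
        return f"echo: {utterance}"


recorder = ConversationRecorder(EchoClient())
recorder.send_text("weather in Beijing")
log = json.dumps(recorder.turns, indent=2, ensure_ascii=False)
```

Inside a Pytest fixture, assigning `request.node.conversation = recorder.turns` would expose the list to the hook as `item.conversation`.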

# conftest.py
@pytest.fixture
def mock_hotel_db(monkeypatch):
    # Inject an isolated database before each test case runs
    test_db = HotelDB(":memory:")
    test_db.load_test_data("fixtures/hotels.json")

    monkeypatch.setattr(
        "hotel.provider.get_db", 
        lambda: test_db
    )
    return test_db
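
The fixture above depends on a `HotelDB` class that the article does not show. A minimal sketch of what such a class could look like, backed by an in-memory SQLite database, is given below. The schema, the `load_rows` helper, and the `search` signature are all assumptions made for illustration.

```python
import json
import sqlite3
from typing import List


class HotelDB:
    """Tiny SQLite-backed store; schema and method names are assumptions."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS hotels (name TEXT, city TEXT, price INTEGER)"
        )

    def load_rows(self, rows: List[dict]) -> None:
        self.conn.executemany(
            "INSERT INTO hotels (name, city, price) VALUES (:name, :city, :price)", rows
        )
        self.conn.commit()

    def load_test_data(self, json_path: str) -> None:
        # The fixture file is expected to hold a JSON list of row objects.
        with open(json_path, encoding="utf-8") as fh:
            self.load_rows(json.load(fh))

    def search(self, city: str, low: int, high: int) -> List[str]:
        cur = self.conn.execute(
            "SELECT name FROM hotels WHERE city = ? AND price BETWEEN ? AND ? ORDER BY price",
            (city, low, high),
        )
        return [row[0] for row in cur.fetchall()]


db = HotelDB()
db.load_rows([
    {"name": "A", "city": "Beijing", "price": 600},
    {"name": "B", "city": "Beijing", "price": 1500},
    {"name": "C", "city": "Shanghai", "price": 700},
])
matches = db.search("Beijing", 500, 1000)
```

Each `HotelDB(":memory:")` opens its own private database, which is what gives every test case an isolated data set.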

# Add rate control on top of SkillClient
from ratelimit import limits, sleep_and_retry

class RateLimitedClient(SkillClient):
    @sleep_and_retry
    @limits(calls=30, period=60)  # at most 30 calls per 60 seconds
    def send_text(self, utterance: str):
        return super().send_text(utterance)
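
For readers who want to see the idea behind the decorators above without the dependency, here is a standard-library sketch of a blocking rate limiter. It uses a sliding-window technique chosen for illustration, which is not necessarily how the `ratelimit` package counts calls. `SlidingWindowLimiter` and `FakeTime` are hypothetical names.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Blocks until a call fits within `calls` per `period` seconds."""

    def __init__(self, calls: int, period: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.calls, self.period = calls, period
        self._clock, self._sleep = clock, sleep
        self._stamps: Deque[float] = deque()

    def acquire(self) -> None:
        while True:
            now = self._clock()
            # Drop timestamps that have left the window.
            while self._stamps and now - self._stamps[0] >= self.period:
                self._stamps.popleft()
            if len(self._stamps) < self.calls:
                self._stamps.append(now)
                return
            # Window is full: sleep until the oldest call expires, then re-check.
            self._sleep(self.period - (now - self._stamps[0]))


class FakeTime:
    """Deterministic clock for the demo; sleeping just advances `now`."""

    def __init__(self) -> None:
        self.now = 0.0

    def clock(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeTime()
limiter = SlidingWindowLimiter(calls=2, period=60, clock=fake.clock, sleep=fake.sleep)
for _ in range(3):
    limiter.acquire()  # the third call has to wait for the window to roll over
```

Calling `limiter.acquire()` at the top of `send_text` would give the same blocking behavior as stacking the two decorators.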

Sample GitLab CI configuration:

test_skill:
  stage: test
  image: python:3.9
  script:
    - pip install -r requirements.txt
    - pytest --alluredir=allure-results
  artifacts:
    paths:
      - allure-results
  only:
    - merge_requests

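Do not over-rely on record and replay:
Recorded cases contain a lot of implementation detail (such as the exact wording)
Use intent(user says "X") instead of send("What's the weather like in Beijing today?")
Set a dynamic threshold for intent matching:

# conftest.py
def pytest_addoption(parser):
    # Register --env so that getoption() below can read it (the default is an assumption)
    parser.addoption("--env", default="prod")

def pytest_configure(config):
    # Adjust the matching threshold according to the environment
    if config.getoption("--env") == "staging":
        IntentMatcher.threshold = 0.7  # relaxed standard for the test environment
    else:
        IntentMatcher.threshold = 0.9

Multilingual testing strategy:
Cover the core flow with at least one case per language
Test language-specific features (such as Chinese numeral recognition) separately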
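
The first two tips above combine naturally: assert on the recognized intent and its confidence instead of on exact wording, and let the environment decide how strict the confidence check is. The sketch below assumes an `IntentResult` shape and a `matches` method. Only the `IntentMatcher.threshold` attribute comes from the article. The rest is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IntentResult:
    name: str
    confidence: float


class IntentMatcher:
    threshold = 0.9  # class-level so conftest.py can override it per environment

    @classmethod
    def matches(cls, result: IntentResult, expected: str) -> bool:
        return result.name == expected and result.confidence >= cls.threshold


result = IntentResult(name="query_weather", confidence=0.82)
strict = IntentMatcher.matches(result, "query_weather")   # checked against 0.9
IntentMatcher.threshold = 0.7                             # what a staging run would set
relaxed = IntentMatcher.matches(result, "query_weather")  # checked against 0.7
```

The same recognition result fails under the strict threshold and passes under the relaxed one, which is the behavior the environment switch is meant to produce.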

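Context-switch verification:

def test_context_switch(flight, skill):  # both are assumed to be fixtures
    # Start a flight search flow first
    flight.search("Beijing", "Shanghai")

    # Switch intent before the flow is finished
    response = skill.send("Wait, check the weather for me first")
    assert "Which city would you like to check" in response  # confirms the context was cleared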

Skill state snapshot:

import pytest

@pytest.fixture
def skill_snapshot(skill):
    # Save the current skill state
    state = skill.backup_state()
    yield skill
    # Restore it after the test finishes
    skill.restore_state(state)

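
The snapshot fixture relies on `backup_state` and `restore_state`, which are not shown. A minimal sketch of how they might behave, using deep copies so that later mutations cannot leak into the saved snapshot, is given below. The toy `Skill` class and its state layout are assumptions for illustration.

```python
import copy
from contextlib import contextmanager
from typing import Any, Dict, Iterator


class Skill:
    """Toy skill holding mutable dialogue state (a hypothetical stand-in)."""

    def __init__(self) -> None:
        self.state: Dict[str, Any] = {"slots": {}, "turn": 0}

    def backup_state(self) -> Dict[str, Any]:
        return copy.deepcopy(self.state)

    def restore_state(self, snapshot: Dict[str, Any]) -> None:
        self.state = copy.deepcopy(snapshot)


@contextmanager
def snapshot(skill: Skill) -> Iterator[Skill]:
    saved = skill.backup_state()
    try:
        yield skill
    finally:
        skill.restore_state(saved)  # runs even if the body raises


skill = Skill()
with snapshot(skill) as s:
    s.state["slots"]["city"] = "Beijing"
    s.state["turn"] += 1
    inside = copy.deepcopy(s.state)
```

A shallow copy would not be enough here, because the nested `slots` dictionary would still be shared between the live state and the snapshot.
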
Performance baseline monitoring:

# pytest.ini
[pytest]
markers =
    slow: mark test as slow (deselect with '-m "not slow"')
    performance: record a response-time baseline
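
The `performance` marker only labels tests. The baseline comparison itself still has to be written. One possible shape is sketched below, assuming the median of a few runs is compared against a stored baseline with a fractional tolerance. The helper names `measure` and `within_baseline`, and the 20% default tolerance, are assumptions for illustration.

```python
import statistics
import time
from typing import Callable, List


def measure(call: Callable[[], object], runs: int = 5,
            clock: Callable[[], float] = time.perf_counter) -> List[float]:
    """Return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = clock()
        call()
        durations.append(clock() - start)
    return durations


def within_baseline(durations: List[float], baseline: float, tolerance: float = 0.2) -> bool:
    """True when the median stays within `tolerance` (fractional) above the baseline."""
    return statistics.median(durations) <= baseline * (1 + tolerance)


# Fake clock for the demo: three runs taking 1.0s, 1.5s, and 1.0s.
ticks = iter([0.0, 1.0, 1.0, 2.5, 2.5, 3.5])
samples = measure(lambda: None, runs=3, clock=lambda: next(ticks))
ok = within_baseline(samples, baseline=1.0)
```

Using the median rather than the mean keeps a single slow outlier, such as a cold start, from failing the check.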