Getting Started with TTS Skill Development: Building Your First Speech Synthesis Application from Scratch

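TTS (Text-to-Speech) is the technology that converts written text into audible, human-sounding speech. It is widely used in intelligent assistants, accessibility services, educational tools, and similar applications. A typical system works in three stages:

- A text-analysis (linguistic) front end normalizes the input and converts it into linguistic features such as phonemes, stress and prosody.
- An acoustic model predicts acoustic features, for example mel spectrograms, from those linguistic features.
- A vocoder turns the acoustic features into the final audio waveform.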
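To make the three-stage structure concrete, here is a deliberately toy sketch. Every stage is a hypothetical stand-in, and the names `front_end`, `acoustic_model` and `vocoder` are invented for illustration:

- The "front end" keeps only ASCII letters.
- The "acoustic model" maps each letter to a made-up pitch and a fixed 80 ms duration.
- The "vocoder" renders sine tones.

The result is beeps, not speech. The point is only to show how data flows from text, to features, to waveform samples.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this toy example)

def front_end(text):
    """Toy text analysis: lowercase and keep only ASCII letters as 'units'."""
    return [ch for ch in text.lower() if "a" <= ch <= "z"]

def acoustic_model(units):
    """Toy acoustic model: map each unit to (pitch in Hz, duration in ms)."""
    return [(200.0 + 10.0 * (ord(u) - ord("a")), 80) for u in units]

def vocoder(features):
    """Toy vocoder: render each (pitch, duration) pair as a sine tone."""
    samples = []
    for pitch, duration_ms in features:
        n = SAMPLE_RATE * duration_ms // 1000  # number of samples for this unit
        samples.extend(
            math.sin(2 * math.pi * pitch * i / SAMPLE_RATE) for i in range(n)
        )
    return samples

def toy_tts(text):
    """Chain the three stages: text -> units -> features -> waveform samples."""
    return vocoder(acoustic_model(front_end(text)))

waveform = toy_tts("Hi")  # 2 letters x 1280 samples each
```

In a real system each stage is a trained model, or rules combined with a model. Neural vocoders replace the sine generator.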

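Google TTS
Pros: 220+ voices across 40+ languages and variants; high-quality neural (WaveNet) voices
Cons: limited free tier (for example, 1 million characters per month for WaveNet voices)
Amazon Polly
Pros: offers two engines, neural voices and standard voices
Cons: custom pronunciation rules are relatively complex to set up
Azure TTS
Pros: integrates seamlessly with the Microsoft ecosystem; supports SSML markup
Cons: real-time streaming requires additional configuration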
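All three services accept SSML (Speech Synthesis Markup Language). SSML is the W3C markup for controlling pauses, speaking rate and pronunciation. The helper below is a minimal sketch that takes a plain list of sentences; `build_ssml` is a hypothetical name.

The tags and root attributes each provider accepts differ. Azure, for instance, expects the text to be wrapped in a `<voice name="...">` element. Check each service's SSML reference before relying on a tag.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def build_ssml(sentences, lang="en-US", rate="medium", pause_ms=300, voice=None):
    """Build a minimal SSML document from a list of plain-text sentences.

    XML special characters are escaped, sentences are separated by <break>
    pauses, and one <prosody rate="..."> applies to the whole text. If `voice`
    is given, the content is wrapped in <voice name="..."> (required by Azure).
    """
    pause = f'<break time="{int(pause_ms)}ms"/>'
    body = pause.join(escape(s) for s in sentences)
    body = f"<prosody rate={quoteattr(rate)}>{body}</prosody>"
    if voice is not None:
        body = f"<voice name={quoteattr(voice)}>{body}</voice>"
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>{body}</speak>"
    )

example = build_ssml(["R&D <team> says hello.", "Welcome!"])
ET.fromstring(example)  # raises ParseError if the markup is not well-formed
```

To send the result, pass it differently depending on the service:

- Google Cloud TTS: `texttospeech.SynthesisInput(ssml=example)` instead of `text=...`.
- Amazon Polly: `Text=example, TextType="ssml"`.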

# Install the required libraries
pip install google-cloud-texttospeech boto3
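The install command also pulls in boto3, the AWS SDK for Python, which is how Amazon Polly is called. The article does not show that call, so here is a minimal sketch. It assumes AWS credentials are already configured, for example through environment variables, `~/.aws/credentials` or an IAM role.

`synthesize_with_polly` and its defaults are illustrative choices, not values from the original. The defaults are the voice `Joanna` and the `neural` engine. The optional `client` parameter lets you inject a stub for offline testing.

```python
from contextlib import closing

def synthesize_with_polly(text, output_file="output_polly.mp3",
                          voice_id="Joanna", engine="neural", client=None):
    """Synthesize `text` with Amazon Polly and write the MP3 to `output_file`.

    By default a boto3 Polly client is created from the standard AWS
    credential chain; pass `client` to inject a stub instead.
    """
    if client is None:
        import boto3  # imported lazily so the function can be tested offline
        client = boto3.client("polly")
    response = client.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
        Engine=engine,  # e.g. "standard" or "neural"
    )
    # AudioStream is a streaming body; close it once it has been read
    with closing(response["AudioStream"]) as stream, open(output_file, "wb") as out:
        out.write(stream.read())
    return output_file
```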

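import html
import re

def preprocess_text(text, max_bytes=5000):
    """Decode HTML entities, spell out '&' and cap the input's UTF-8 size."""
    text = html.unescape(text)                # "&amp;" -> "&", "&lt;" -> "<"
    text = re.sub(r'\s*&\s*', ' and ', text)  # "R&D" -> "R and D"
    # Google Cloud TTS limits a request to 5000 bytes (not characters): cut on
    # a byte boundary and drop any multi-byte character split by the cut
    return text.encode('utf-8')[:max_bytes].decode('utf-8', errors='ignore')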
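Truncating the input at the 5000 limit silently discards the rest of a long document. If the whole text must be spoken, one alternative is to split it into request-sized chunks and synthesize each chunk separately. The split prefers sentence boundaries.

`split_for_tts` is a hypothetical helper. Its sentence detection is a simple punctuation heuristic, which also matches the period in "3.14". That only affects where a chunk is allowed to end; no text is lost.

```python
import re

# Zero-width split point after sentence-ending punctuation (incl. CJK marks)
_SENTENCE_END = r'(?<=[.!?。！？])'

def split_for_tts(text, max_bytes=5000):
    """Split `text` into chunks of at most `max_bytes` UTF-8 bytes.

    Chunks end at sentence boundaries where possible; a sentence that is
    longer than the limit on its own is cut between characters.
    Assumes max_bytes >= 4 (the largest UTF-8 character).
    """
    chunks, current, size = [], "", 0
    for sentence in re.split(_SENTENCE_END, text):
        n = len(sentence.encode("utf-8"))
        if size + n <= max_bytes:
            # The sentence still fits into the current chunk
            current, size = current + sentence, size + n
            continue
        if current:
            chunks.append(current)
        current, size = "", 0
        # Start a new chunk and fill it character by character, so an
        # overlong sentence is split without exceeding the limit
        for ch in sentence:
            b = len(ch.encode("utf-8"))
            if size + b > max_bytes:
                chunks.append(current)
                current, size = "", 0
            current, size = current + ch, size + b
    if current:
        chunks.append(current)
    # Trim spaces left between sentences and drop whitespace-only chunks
    return [c.strip() for c in chunks if c.strip()]
```

Each chunk can then be passed to the synthesis function on its own.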

from google.cloud import texttospeech

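def synthesize_speech(text, output_file='output.mp3'):
    """Synthesize `text` with Google Cloud TTS and save it as an MP3 file."""
    # Authenticates via Application Default Credentials
    # (e.g. the GOOGLE_APPLICATION_CREDENTIALS environment variable)
    client = texttospeech.TextToSpeechClient()

    # Text input
    synthesis_input = texttospeech.SynthesisInput(text=text)

    # Voice selection: language and preferred voice gender
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL
    )

    # Output audio format
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )

    # Send the synthesis request
    response = client.synthesize_speech(
        input=synthesis_input,
        voice=voice,
        audio_config=audio_config
    )

    # Write the binary audio content to disk
    with open(output_file, "wb") as out:
        out.write(response.audio_content)
    return output_file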
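The free tiers are metered in characters, so re-synthesizing the same text wastes quota. A simple mitigation is to cache the audio under a hash of the text plus the voice settings.

This is a sketch: `cached_synthesize` and the `voice_key` string are hypothetical. `synthesize_fn` is assumed to take `(text, output_file)`, which matches `synthesize_speech` above.

```python
import hashlib
import os

def cached_synthesize(text, synthesize_fn, cache_dir="tts_cache",
                      voice_key="en-US|NEUTRAL|MP3"):
    """Return the path of a cached audio file, synthesizing only on a cache miss.

    `synthesize_fn(text, output_file)` must write the audio to `output_file`.
    `voice_key` should encode every setting that changes the audio
    (language, voice, encoding, ...) so different settings never share a file.
    """
    os.makedirs(cache_dir, exist_ok=True)
    digest = hashlib.sha256(f"{voice_key}\n{text}".encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, digest + ".mp3")
    if not os.path.exists(path):
        tmp_path = path + ".part"
        synthesize_fn(text, tmp_path)
        os.replace(tmp_path, path)  # publish only fully written files
    return path
```

With Google Cloud TTS this would be called as `cached_synthesize("Hello, world!", synthesize_speech)`. A second call with the same text returns the existing file without a new request.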