Tushare in Practice: A Guide to Avoiding Pitfalls, from Data Acquisition to Efficient Analysis


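The first step in financial data analysis is obtaining a high-quality data source, but in practice the following problems come up often:

API rate limits: the free tier of Tushare caps requests per minute (typically 50 per minute), which is very easy to hit when bulk-fetching historical data
Inconsistent data formats: field types differ between interfaces (dates, for example, may arrive as YYYYMMDD or as a timestamp) and need extra handling
Network instability: long-running collection jobs can be interrupted by network problems, and without a retry mechanism the result is missing data
High cleaning cost: raw data contains many finance-specific fields (such as adjustment factors and suspension flags) that take domain knowledge to standardize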
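
One way to stay under a per-minute quota is a small sliding-window throttle placed in front of every call. The sketch below is illustrative only: RateLimiter and its parameters are hypothetical names, and the default of 50 calls per 60 seconds simply mirrors the typical figure quoted above, not a guarantee about any account tier.

```python
import time
from collections import deque


class RateLimiter:
    """Sliding-window throttle: at most max_calls per `period` seconds."""

    def __init__(self, max_calls=50, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._calls = deque()    # timestamps of recent calls, oldest first

    def wait(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self._clock()
            # Drop timestamps that have left the window
            while self._calls and now - self._calls[0] >= self.period:
                self._calls.popleft()
            if len(self._calls) < self.max_calls:
                self._calls.append(now)
                return
            # Sleep until the oldest recorded call leaves the window
            self._sleep(self.period - (now - self._calls[0]))
```

In use, a single shared limiter would be created once and limiter.wait() called immediately before each API request.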

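Parameter combination tips
Use the fields parameter to restrict the returned columns and reduce network transfer
For time-series data, filter with start_date/end_date first to avoid pulling the full history
Use the paging parameters limit and offset together, keeping each request to at most 5000 records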
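
The limit/offset tip can be sketched as a loop that keeps requesting pages until a short page comes back. Here fetch_all and fetch_page are hypothetical names: fetch_page stands in for whatever Tushare call you wrap, and whether a given interface accepts limit and offset should be checked against its documentation.

```python
def fetch_all(fetch_page, page_size=5000):
    """Collect all rows by paging with limit/offset.

    fetch_page(limit, offset) is a caller-supplied (hypothetical) function
    that returns a list of rows for one page.
    """
    rows = []
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        rows.extend(page)
        # A short (or empty) page means there is nothing left to fetch
        if len(page) < page_size:
            return rows
        offset += page_size
```

In practice fetch_page would wrap an API call and convert the returned DataFrame to a list of records, or the loop could collect DataFrames and concatenate them at the end.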

Asynchronous requests

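import aiohttp
import asyncio

API_URL = 'http://api.tushare.pro'
TOKEN = 'your_token'  # replace with your actual token

async def fetch_data(session, api_name, params, fields=''):
    # The Tushare Pro HTTP endpoint expects a POST request with a JSON body
    payload = {'api_name': api_name, 'token': TOKEN, 'params': params, 'fields': fields}
    async with session.post(API_URL, json=payload) as resp:
        return await resp.json()

async def batch_fetch(api_params_list):
    # Each item looks like {'api_name': 'daily', 'params': {...}}
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_data(session, **item) for item in api_params_list]
        # return_exceptions=True puts failures in the result list instead of raising on the first one
        return await asyncio.gather(*tasks, return_exceptions=True)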
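
Firing every request at once can itself trip the rate limit, and a single network hiccup loses that request. The following is a minimal sketch of bounding concurrency with asyncio.Semaphore and retrying with exponential backoff. The helpers with_retry and bounded_gather are hypothetical, and each entry in request_factories is assumed to be a zero-argument function returning a coroutine.

```python
import asyncio


async def with_retry(make_request, retries=3, base_delay=1.0):
    """Await make_request(), retrying failures with exponential backoff."""
    for attempt in range(retries):
        try:
            return await make_request()
        except Exception:  # in real code, narrow this to network/timeout errors
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)


async def bounded_gather(request_factories, max_concurrency=5, **retry_kwargs):
    """Run all requests with at most max_concurrency in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(make_request):
        async with semaphore:
            return await with_retry(make_request, **retry_kwargs)

    # Failures that exhaust their retries are returned in place, not raised
    return await asyncio.gather(
        *(run_one(factory) for factory in request_factories),
        return_exceptions=True,
    )
```

A factory can be as simple as lambda: fetch_data(session, 'daily', {'ts_code': '000001.SZ'}); the lambda matters because a coroutine object can only be awaited once, so each retry needs a fresh one.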

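Use partitioned storage: physically separate data by market (Shanghai/Shenzhen vs. Hong Kong), data type (daily bars vs. minute bars), and time range
Use the Parquet format: it saves roughly 50% of storage space compared with CSV (the exact figure depends on the data and the compression codec) and its columnar layout speeds up queries that read only a few columns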
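
As a rough illustration of the two storage points, the sketch below builds one directory per market, bar frequency, and year, and writes a Parquet file into it. The key=value directory layout, the part-0.parquet file name, and both function names are assumptions made for this example; DataFrame.to_parquet also needs a Parquet engine such as pyarrow to be installed.

```python
from pathlib import Path


def partition_path(root, market, freq, year):
    """Build a partition directory such as root/market=SZ/freq=daily/year=2020."""
    return Path(root) / f"market={market}" / f"freq={freq}" / f"year={year}"


def write_partition(df, root, market, freq, year):
    """Write one partition as Parquet; df is expected to be a pandas DataFrame."""
    target = partition_path(root, market, freq, year)
    target.mkdir(parents=True, exist_ok=True)
    out_file = target / "part-0.parquet"  # hypothetical file name
    df.to_parquet(out_file, index=False)
    return out_file
```

Keeping the partition keys in the directory names means a job that needs only 2020 daily bars for one market can read a single directory and ignore the rest.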

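import tushare as ts
from sqlalchemy import create_engine
import pandas as pd

# Initial configuration
token = 'your_token'  # replace with your actual token
pro = ts.pro_api(token)
engine = create_engine('postgresql://user:pass@localhost:5432/finance')

# Data retrieval
def get_daily(ts_code='000001.SZ', start='20200101', end='20201231'):
    # Query one stock at a time (the default is only an example code): each call
    # returns a capped number of rows, so an empty ts_code over a full year can be truncated
    df = pro.daily(
        ts_code=ts_code,
        start_date=start,
        end_date=end,
        fields='ts_code,trade_date,open,high,low,close,vol'
    )
    # Normalize the date format
    df['trade_date'] = pd.to_datetime(df['trade_date'], format='%Y%m%d')
    return df

# Data cleaning
def clean_data(raw_df):
    # Drop rows with no traded volume
    cleaned = raw_df[raw_df['vol'] > 0].copy()
    # Sort oldest-first so pct_change compares each day with the previous trading day
    cleaned = cleaned.sort_values(['ts_code', 'trade_date'])
    # Daily return as a fraction (0.01 means 1%)
    cleaned['pct_chg'] = cleaned.groupby('ts_code')['close'].pct_change()
    return cleaned

# Main flow
if __name__ == '__main__':
    raw_data = get_daily()
    clean_df = clean_data(raw_data)
    clean_df.to_sql('daily_price', engine, if_exists='append', index=False)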

Strategy	First request latency	Repeat request latency	Memory usage
No cache	1200ms	1100ms	Low
In-memory cache	1200ms	5ms	High
Local SQLite	1200ms	20ms	Medium

The diskcache library is a good choice for implementing an on-disk cache:

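from diskcache import Cache

# Cached results are stored under this directory and survive restarts
cache = Cache('./tushare_cache')

# `pro` is the client created with ts.pro_api(token) in the script above
@cache.memoize(expire=86400)  # keep each result for 24 hours
def cached_query(api_name, **kwargs):
    return pro.query(api_name, **kwargs)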
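
The comparison table also lists a local SQLite cache. A minimal sketch of that option, using only the standard library, might look like the following. SQLiteCache and cached_call are hypothetical names, and values are assumed to be JSON-serializable (a DataFrame would first be converted, for example to a list of records).

```python
import json
import sqlite3
import time


class SQLiteCache:
    """Tiny key/value cache with per-entry expiry, backed by SQLite."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL, expires_at REAL NOT NULL)"
        )

    def get(self, key):
        row = self.conn.execute(
            "SELECT value, expires_at FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None or row[1] < time.time():
            return None  # missing or expired
        return json.loads(row[0])

    def set(self, key, value, ttl=86400):
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, value, expires_at) VALUES (?, ?, ?)",
            (key, json.dumps(value), time.time() + ttl),
        )
        self.conn.commit()


def cached_call(cache, key, loader, ttl=86400):
    """Return the cached value for key, calling loader() only on a miss."""
    value = cache.get(key)
    if value is None:
        value = loader()
        cache.set(key, value, ttl)
    return value
```

Passing a file path instead of ":memory:" makes the cache persist across runs, which is the case the table's "Local SQLite" row describes.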