Speech Recognition (Whisper)

Last updated: 2026-06-15

Introduction

Speech recognition module: a multilingual speech-to-text solution based on the Whisper model.

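Features

Multilingual recognition: supports Chinese, English, and other major languages
Speech translation: recognition results can be translated into English
Audio clips of 30 seconds or less can be processed with batch_size > 1 for higher throughput; for longer clips, setting batch_size to 1 is recommended
Recognition accuracy is highest for English audio
Supported models:
openai/whisper-large-v3-turbo
openai/whisper-large-v3
openai/whisper-medium (limited Chinese support)
openai/whisper-small (Chinese output is in Traditional characters)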
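
The batching guidance above can be sketched as a small helper. This is only an illustration: the function name `choose_batch_size`, the constant `MAX_BATCHABLE_SECONDS`, and the assumption that clip durations are known in advance are hypothetical and not part of the operator's API.

```python
from typing import Sequence

# Threshold taken from the guidance above; the name is hypothetical.
MAX_BATCHABLE_SECONDS = 30.0


def choose_batch_size(durations_s: Sequence[float], preferred: int = 10) -> int:
    """Return `preferred` if every clip is at most 30 s long, otherwise 1."""
    if preferred < 1:
        raise ValueError("preferred must be >= 1")
    if all(d <= MAX_BATCHABLE_SECONDS for d in durations_s):
        return preferred
    return 1


print(choose_batch_size([4.2, 12.0, 29.9]))  # all clips are short: 10
print(choose_batch_size([4.2, 45.0]))        # one clip exceeds 30 s: 1
```

The returned value could then be passed as the operator's batch_size parameter.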

Operator Parameters

Input

Input	Description
audios	Array of audio data in one of the following formats: audio_base64 (base64-encoded audio string); audio_url (URL of the audio file); audio_binary (raw audio bytes)
languages	-
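
As a rough illustration of the three input formats, the sketch below builds one example value per format using only the Python standard library. The helper name `make_audio_inputs`, the placeholder bytes, and the example URL are assumptions; the source does not specify the exact payload details the operator expects (for instance, whether a base64 string may carry a data-URI prefix).

```python
import base64


def make_audio_inputs(raw: bytes, url: str) -> dict:
    """Return one example value for each documented audio format."""
    return {
        "audio_binary": raw,                                    # raw bytes as read from a file
        "audio_base64": base64.b64encode(raw).decode("ascii"),  # text-safe encoding of the same bytes
        "audio_url": url,                                       # location the operator would fetch from
    }


# Placeholder bytes stand in for the contents of a real audio file.
inputs = make_audio_inputs(b"RIFF....WAVE", "https://example.com/sample.wav")
print(sorted(inputs))
```

Whichever format is used for the audios column, the audio_src_type parameter should be set to the matching value.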

Output

Output	Description
asr_result	Recognized text
timestamps	List of timestamp pairs (start time, end time)
segments	List of segmented text results
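
One way the timestamps and segments outputs might be combined is shown below. This sketch assumes that timestamps is a list of (start, end) pairs in seconds and that it aligns index by index with segments; the source does not confirm either the units or the alignment, and `format_segments` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple


def format_segments(
    timestamps: Sequence[Tuple[float, float]],
    segments: Sequence[str],
) -> List[str]:
    """Pair each (start, end) timestamp with its segment text."""
    if len(timestamps) != len(segments):
        raise ValueError("timestamps and segments must have the same length")
    return [
        f"[{start:.2f}s - {end:.2f}s] {text.strip()}"
        for (start, end), text in zip(timestamps, segments)
    ]


# Made-up values for illustration only; not real operator output.
lines = format_segments([(0.0, 2.5), (2.5, 5.1)], [" hello", " world"])
for line in lines:
    print(line)
```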

Parameters

Name	Type	Default	Description
audio_src_type	str	Required	Audio source type. Options: "audio_url" (BOS/HTTP address), "audio_base64" (base64-encoded string), "audio_binary" (binary stream).
model_path	str	'/opt/aihc/models'	Model storage path.
model_name	str	'openai/whisper-large-v3'	Model name. Options: "openai/whisper-small" (small model), "openai/whisper-medium" (medium model), "openai/whisper-large-v3" (latest large model), "openai/whisper-large-v3-turbo" (optimized large model).
batch_size	int	10	Number of audio samples processed per batch.
source_language	str	None	Source language of the audio, for example chinese, english, japanese, or korean. Set to None to enable automatic detection.
translate_to_english	bool	False	Whether to translate the recognition result into English. When enabled, the output text is the English translation.
condition_on_prev_tokens	bool	True	Whether to condition prediction on previously generated tokens. Disabling it reduces the coherence of the result but speeds up processing.
compression_ratio_threshold	float	1.35	Compression ratio threshold for the generated text (recommended range 1.2-2.0). Larger values retain more repeated content.
temperature	float	0.5	Temperature controlling the randomness of the generated text (0.0-1.0). Higher values suit creative scenarios; lower values suit deterministic scenarios.
logprob_threshold	float	-1.0	Log-probability threshold used to filter out low-confidence words. Words whose log probability falls below this value may be rejected. The default of -1.0 disables filtering and keeps all words.
dtype	str	'bfloat16'	Numeric precision used for model inference. Options: "bfloat16" (balances precision and speed), "float16" (faster inference), "float32" (highest precision).
rank	int	0	GPU device ID (takes effect in multi-GPU environments). Defaults to the first GPU (ID = 0).
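
Before the operator is constructed, its arguments can be checked against the documented options. The sketch below is a client-side illustration only: `validate_kwargs` and `ALLOWED` are hypothetical names, and the operator may perform its own, different validation.

```python
# Allowed values copied from the parameter table above.
ALLOWED = {
    "audio_src_type": {"audio_binary", "audio_url", "audio_base64"},
    "model_name": {
        "openai/whisper-small",
        "openai/whisper-medium",
        "openai/whisper-large-v3-turbo",
        "openai/whisper-large-v3",
    },
    "dtype": {"bfloat16", "float16", "float32"},
}


def validate_kwargs(kwargs: dict) -> list:
    """Return a list of problems; an empty list means the kwargs look valid."""
    problems = []
    if "audio_src_type" not in kwargs:
        problems.append("audio_src_type is required")
    for name, allowed in ALLOWED.items():
        if name in kwargs and kwargs[name] not in allowed:
            problems.append(f"{name}={kwargs[name]!r} is not one of {sorted(allowed)}")
    # Fall back to the documented defaults when a value is not supplied.
    if not 0.0 <= kwargs.get("temperature", 0.5) <= 1.0:
        problems.append("temperature must be within 0.0-1.0")
    if kwargs.get("batch_size", 10) < 1:
        problems.append("batch_size must be >= 1")
    return problems


print(validate_kwargs({"audio_src_type": "audio_url", "dtype": "float16"}))
print(validate_kwargs({"dtype": "int8", "temperature": 1.5}))
```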

Usage Example

Python

from __future__ import annotations

import os

import daft
from daft import col

from daft.aihc.common.udf import aihc_udf
from daft.aihc.functions.audio.audio_asr_whisper import AudioAsrWhisper

if __name__ == "__main__":
    if os.getenv("DAFT_RUNNER", "native") == "ray":
        import ray
        ray.init(dashboard_host="0.0.0.0", ignore_reinit_error=True)
        daft.set_runner_ray()
    daft.set_execution_config(actor_udf_ready_timeout=6000, min_cpu_per_task=0)

    # TODO: prepare sample data for your actual scenario
    samples = {"audios": [...], "languages": [...]}
    ds = daft.from_pydict(samples)
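    constructor_kwargs = {
        # audio_src_type is listed as required in the parameter table;
        # "audio_url" is only an example, use the value matching your data
        "audio_src_type": "audio_url",
        "model_path": '/opt/aihc/models',
        "model_name": 'openai/whisper-large-v3',
        "batch_size": 10,
        "source_language": None,
    }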
    ds = ds.with_column(
        "result",
        aihc_udf(
            AudioAsrWhisper,
            construct_args=constructor_kwargs,
            num_cpus=1,
            concurrency=4,
            batch_size=8,
        )(col("audios"), col("languages")),
    )
    ds.show()
            
