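Introduction: This article walks through deploying the FunASR speech-to-text model locally on Windows 10. It covers environment setup, dependency installation, model download, and inference code, as a complete start-to-finish solution.
Windows 10 must be a 64-bit edition; Professional or Enterprise is recommended. Confirm the system version under Settings > System > About. At least 16 GB of RAM is recommended. An NVIDIA GPU is optional but recommended, and it needs an NVIDIA driver with CUDA support installed.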
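The requirements above can also be checked from Python. The following is a minimal sketch using only the standard library; the function name `check_system` is illustrative, and it does not measure RAM because the standard library has no portable call for that. It assumes `nvidia-smi` is on PATH when an NVIDIA driver is installed, which is usual but not guaranteed.

```python
import platform
import shutil
import struct


def check_system() -> dict:
    """Collect the facts the system requirements ask about."""
    return {
        "os": platform.system(),        # "Windows" on Windows 10
        "release": platform.release(),  # typically "10" on Windows 10
        # Pointer size of the running interpreter: 8 bytes means 64-bit Python.
        "is_64bit_python": struct.calcsize("P") * 8 == 64,
        "machine": platform.machine(),  # typically "AMD64" on 64-bit Windows
        # nvidia-smi ships with the NVIDIA driver; None means it is not on PATH.
        "nvidia_smi": shutil.which("nvidia-smi"),
    }


if __name__ == "__main__":
    for key, value in check_system().items():
        print(f"{key}: {value}")
```

A `None` for `nvidia_smi` only means the tool was not found on PATH; CPU-only deployment remains possible in that case.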
conda create -n funasr_env python=3.9
conda activate funasr_env
Install the base dependencies with conda and pip:
conda install -c conda-forge cudatoolkit=11.3 cudnn=8.2.0
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
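After the install, it is worth confirming that PyTorch is importable and can see the GPU. The sketch below uses only `torch.__version__` and `torch.cuda.is_available()`; the helper name `torch_status` is illustrative, and the function falls back gracefully when torch is not installed.

```python
def torch_status() -> dict:
    """Report whether torch is importable and whether it can see a CUDA device."""
    try:
        import torch  # imported lazily so the check also runs without torch
    except ImportError:
        return {"installed": False, "version": None, "cuda_available": False}
    return {
        "installed": True,
        "version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(torch_status())
```

If `cuda_available` is `False` on a machine with an NVIDIA GPU, a CPU-only torch build may have been installed, or the driver may be too old for the CUDA 11.3 toolkit.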
git clone https://github.com/alibaba-damo-academy/FunASR.git
cd FunASR
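Download the paraformer-zh-20230410 Chinese model and extract it into the models directory.
Edit funasr/conf/model.yaml, paying particular attention to the following parameters: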
model:
  type: paraformer
  checkpoint_path: ./models/paraformer-zh-20230410/exp/model.pt
  device: cuda:0  # or cpu
decoder:
  beam_size: 5
  max_len: 200
pip install -r requirements.txt
pip install .  # install from source
Some model formats must first be converted with a conversion script:
from funasr.utils.model_converter import convert_checkpoint

convert_checkpoint(
    input_path="original_model.bin",
    output_path="converted_model.pt",
    model_type="paraformer",
)
python funasr/bin/asr_cli.py --model_path ./models/paraformer-zh-20230410 --audio_path test.wav --output_path result.txt
from funasr import AutoModelForSpeech2Text

model = AutoModelForSpeech2Text.from_pretrained(
    "./models/paraformer-zh-20230410",
    device="cuda",  # or "cpu"
)
output = model.transcribe("test.wav")
print(output["text"])
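Set device="cuda" to run inference on the GPU.
Set torch.backends.cudnn.benchmark=True so that cuDNN auto-tunes its kernel selection.
Use dynamic quantization to reduce memory usage: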
import torch

# Dynamic quantization targets CPU inference; Linear weights are stored as int8.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
from funasr.utils.audio_processor import AudioProcessor

processor = AudioProcessor(sample_rate=16000, chunk_size=3200)
model = AutoModelForSpeech2Text(...)
for chunk in processor.stream_read("input.wav"):
    output = model.transcribe_chunk(chunk)
    print(output["partial_text"], end="", flush=True)
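To make the chunking in the streaming loop concrete: at a 16 kHz sample rate, a chunk of 3200 samples covers 200 ms of audio. The sketch below reproduces only the chunked reading step with Python's standard `wave` module. It is not FunASR's own reader, and the names `read_chunks` and `chunk_duration_ms` are illustrative.

```python
import wave
from typing import Iterator


def read_chunks(path: str, chunk_size: int = 3200) -> Iterator[bytes]:
    """Yield successive blocks of at most chunk_size frames from a WAV file."""
    with wave.open(path, "rb") as wav:
        while True:
            frames = wav.readframes(chunk_size)
            if not frames:
                break  # end of file
            yield frames


def chunk_duration_ms(chunk_size: int = 3200, sample_rate: int = 16000) -> float:
    """Duration covered by one chunk, in milliseconds."""
    return 1000.0 * chunk_size / sample_rate
```

Each yielded block holds raw PCM bytes, so for 16-bit mono audio a full chunk is 6400 bytes and the final chunk may be shorter. Smaller chunks lower latency but mean more calls into the model.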
ModuleNotFoundError: No module named 'xxx'
Fix: run pip install --ignore-installed package_name, or isolate the installation in a conda virtual environment.
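Before reinstalling anything, it can help to list which modules are actually missing from the active environment. This is a small sketch built on `importlib.util.find_spec`; the helper name and the module list in the usage section are illustrative and should be adapted to the project's requirements.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Hypothetical list of imports to check.
    required = ["torch", "torchaudio", "funasr"]
    for name in missing_modules(required):
        print(f"missing: {name}")
```

Running the check inside the activated funasr_env shows whether a missing module is an installation problem or simply the wrong environment being active.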
CUDA out of memory
Fix: reduce the batch_size parameter, or call torch.cuda.empty_cache() to release cached memory.
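Reducing the batch size can also be automated. The sketch below retries a batch with half the size whenever the inference callable raises an out-of-memory error. `run_with_backoff` and the `infer` callable are hypothetical names, and the sketch relies on PyTorch reporting GPU exhaustion as a `RuntimeError` whose message contains "out of memory".

```python
from typing import Callable, List, Sequence


def run_with_backoff(
    infer: Callable[[Sequence[str]], List[str]],
    files: Sequence[str],
    batch_size: int = 8,
) -> List[str]:
    """Run infer over files in batches, halving batch_size on out-of-memory errors."""
    results: List[str] = []
    start = 0
    while start < len(files):
        batch = files[start:start + batch_size]
        try:
            results.extend(infer(batch))
        except RuntimeError as err:
            # Re-raise anything that is not an OOM, or an OOM at batch size 1.
            if "out of memory" not in str(err).lower() or batch_size == 1:
                raise
            batch_size = max(1, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(batch)
    return results
```

In a real GPU setup, calling `torch.cuda.empty_cache()` just before the retry would release cached blocks as well; it is left out here so that the sketch runs without torch.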
from funasr.utils.lm import KenLMLanguageModel

lm = KenLMLanguageModel("zh_giga.no_cna_cmn.prune01244.klm")
output = model.transcribe("test.wav", lm=lm)
Create a hotwords.txt file with one hotword per line, then register the hotwords when loading the model:
model.set_hotwords(["热词1", "热词2"], weights=[2.0, 1.5])
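The text describes a hotwords.txt file, while the call above takes Python lists, so a small loader can bridge the two. In this sketch the optional tab-separated weight after each hotword is an assumed file format, not something the original specifies, and `load_hotwords` is an illustrative name.

```python
from typing import List, Tuple


def load_hotwords(path: str, default_weight: float = 1.0) -> Tuple[List[str], List[float]]:
    """Read one hotword per line; an optional weight may follow after a tab."""
    words: List[str] = []
    weights: List[float] = []
    # utf-8-sig also accepts files saved with a BOM, as Windows Notepad may do.
    with open(path, encoding="utf-8-sig") as f:
        for line in f:
            text = line.strip()
            if not text:
                continue  # skip blank lines
            word, sep, weight = text.partition("\t")
            words.append(word.strip())
            weights.append(float(weight) if sep else default_weight)
    return words, weights
```

The two returned lists line up index by index, so they can be passed on as `model.set_hotwords(words, weights=weights)`.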
Combine the recognizer with a separation model to transcribe multi-speaker conversations:
from funasr.models.speech_separation import SpeechSeparation

sep_model = SpeechSeparation.from_pretrained("separation_model")
separated = sep_model.separate("mixed.wav")
for i, audio in enumerate(separated):
    asr_result = model.transcribe(audio, speaker_id=i)
Create a REST endpoint with FastAPI:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AudioRequest(BaseModel):
    audio_data: bytes

@app.post("/asr")
async def transcribe(request: AudioRequest):
    with open("temp.wav", "wb") as f:
        f.write(request.audio_data)
    text = model.transcribe("temp.wav")["text"]
    return {"text": text}
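One caveat with the handler above is that every request writes to the same temp.wav, so concurrent requests can overwrite each other's audio. A possible remedy, sketched here with the standard `tempfile` module, is to give each request its own file and delete it afterwards. `transcribe_bytes` and its `transcribe` callable are illustrative names standing in for the model call.

```python
import os
import tempfile
from typing import Callable


def transcribe_bytes(audio_data: bytes, transcribe: Callable[[str], str]) -> str:
    """Write audio_data to a unique temp file, run transcribe on its path, then clean up."""
    # delete=False so the closed file can be reopened by path, which Windows requires.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(audio_data)
        path = tmp.name
    try:
        return transcribe(path)
    finally:
        os.remove(path)  # always remove the file, even if transcription fails
```

Inside the endpoint this could be called as `transcribe_bytes(request.audio_data, lambda p: model.transcribe(p)["text"])`.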
Watch the Releases page of the GitHub repository, and use an incremental update script:
python update_model.py --model paraformer-zh --target_dir ./models
Export the environment with conda:
conda env export > environment.yml
Write a benchmark script:
import time

start = time.time()
model.transcribe("test.wav")
print(f"Inference time: {time.time()-start:.2f}s")
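A single timed call includes one-off costs such as model loading and CUDA initialization. A slightly more robust sketch runs a warm-up pass, averages several runs, and reports the real-time factor (processing time divided by audio duration). The `benchmark` helper and its parameters are illustrative, and the audio duration has to be supplied by the caller.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], audio_seconds: float,
              warmup: int = 1, runs: int = 5) -> Dict[str, float]:
    """Time fn over several runs and report mean latency and real-time factor."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    mean = statistics.mean(timings)
    return {
        "mean_s": mean,
        "min_s": min(timings),
        "max_s": max(timings),
        # A real-time factor below 1 means faster than real time.
        "rtf": mean / audio_seconds,
    }
```

For example, `benchmark(lambda: model.transcribe("test.wav"), audio_seconds=10.0)` would time five runs on a 10-second clip after one warm-up call.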
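This guide covers the full workflow from environment setup to advanced usage. With its modular design and rich configuration options, the deployment can serve needs ranging from individual developers to enterprise users. For a real deployment, validate in a test environment first, then migrate to production step by step.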