Summary: A complete guide, from environment setup to model deployment, for quickly building a local speech recognition system.
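Whisper is an open-source speech recognition model from OpenAI that supports roughly 100 languages. Accuracy and latency depend on the model size and the hardware you run it on. Running it locally, rather than calling a hosted API, keeps audio on your own machine and allows offline use once the model weights have been downloaded.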
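Install the system packages and create a virtual environment:

```bash
# Example for Ubuntu 22.04
sudo apt update && sudo apt install -y \
    python3.10 python3-venv python3-pip ffmpeg git \
    nvidia-cuda-toolkit nvidia-driver-535

# Create a virtual environment (recommended)
python3 -m venv whisper_env
source whisper_env/bin/activate
pip install --upgrade pip
```

Then install the Python dependencies:

```bash
# Core dependencies
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu118
pip install openai-whisper transformers

# Optional extras
pip install soundfile librosa   # audio processing
pip install gradio              # quick interactive UI
```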
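Whisper provides pretrained models in five sizes:

| Model size | Parameters | Typical use | Hardware |
|---|---|---|---|
| tiny | 39M | Real-time transcription | CPU |
| base | 74M | General purpose | CPU/GPU |
| small | 244M | Higher accuracy | GPU |
| medium | 769M | Professional use | High-performance GPU |
| large | 1550M | Maximum accuracy | Flagship GPU |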
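The table can be turned into a small selection helper. The sketch below is illustrative only: `pick_model_size` is a hypothetical name, and the VRAM figures are the approximate values listed in the Whisper README (about 1 GB for tiny and base, 2 GB for small, 5 GB for medium, 10 GB for large), so treat them as assumptions and measure on your own hardware.

```python
from typing import Optional

# Approximate VRAM needs in GB per model size (rough README figures, assumed here).
APPROX_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

# Ordered from most to least accurate, so the first fit is the largest usable model.
SIZES_BY_ACCURACY = ["large", "medium", "small", "base", "tiny"]


def pick_model_size(vram_gb: Optional[float]) -> str:
    """Return the largest model size that fits in the given amount of VRAM.

    vram_gb=None means "no GPU". The table lists tiny and base for CPU use,
    so this sketch falls back to "base" in that case.
    """
    if vram_gb is None:
        return "base"
    for size in SIZES_BY_ACCURACY:
        if APPROX_VRAM_GB[size] <= vram_gb:
            return size
    return "tiny"  # under 1 GB available: the smallest model is the only option
```

The returned string can be passed straight to `whisper.load_model`, for example `whisper.load_model(pick_model_size(6))`.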
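Example download commands:

```bash
# Alternative: install Whisper from source
git clone https://github.com/openai/whisper.git
cd whisper
pip install .

# Download the base model (balances accuracy and speed)
wget https://openaipublic.blob.core.windows.net/main/models/base.en.pt   # English-only
wget https://openaipublic.blob.core.windows.net/main/models/base.pt      # multilingual
```

These direct URLs may not resolve, because Whisper publishes its checkpoints under hash-prefixed paths. The manual download is optional: `whisper.load_model("base")` fetches the correct file on first use and caches it under `~/.cache/whisper`.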
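Basic transcription:

```python
import whisper

# Load the model (downloaded automatically on first run)
model = whisper.load_model("base")

# Transcribe the audio
result = model.transcribe("audio.mp3", language="zh", task="transcribe")

# Print the result
print(result["text"])
```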
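Additional decoding options can be passed to `transcribe`:

```python
result = model.transcribe(
    "audio.wav",
    language="zh",
    task="translate",                 # translate into English
    temperature=0.3,                  # lower sampling randomness
    no_speech_threshold=0.6,          # silence detection threshold
    condition_on_previous_text=True,  # carry context across segments
)
```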
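Besides `result["text"]`, the dictionary returned by `transcribe` contains a `"segments"` list whose entries carry `start`, `end` (in seconds) and `text`. As a minimal sketch, these can be rendered as SRT subtitles; the helper names `format_timestamp` and `segments_to_srt` are illustrative and not part of Whisper.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp format HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render an iterable of {"start", "end", "text"} dicts as SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        # Each SRT cue: running number, time range, then the text.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)
```

A typical call would be `segments_to_srt(result["segments"])`, with the returned string written to a `.srt` file using UTF-8 encoding.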
Batch transcription of a directory:

```python
import os
import whisper

def batch_transcribe(input_dir, output_file):
    """Transcribe every .mp3/.wav file in input_dir into one text file."""
    model = whisper.load_model("small")
    results = []
    # Sort for a stable order; compare extensions case-insensitively
    for filename in sorted(os.listdir(input_dir)):
        if filename.lower().endswith((".mp3", ".wav")):
            path = os.path.join(input_dir, filename)
            res = model.transcribe(path, language="zh")
            results.append(f"{filename}:\n{res['text']}\n")
    with open(output_file, "w", encoding="utf-8") as f:
        f.write("\n".join(results))

batch_transcribe("audio_files", "transcriptions.txt")
```
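The batch function only looks at the top level of one directory. If recordings are spread across subfolders, file discovery can be split into its own step. The following is a sketch under assumptions: `find_audio_files` is a hypothetical helper, and the extension set is an example to be extended as needed.

```python
from pathlib import Path

# Assumed set of extensions; anything ffmpeg can decode could be added.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac"}


def find_audio_files(input_dir):
    """Return audio files under input_dir (recursively), sorted for a stable order."""
    root = Path(input_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )
```

Each returned path can be passed to `model.transcribe(str(path))` in place of the `os.listdir` loop.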
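- Monitor GPU utilization with `nvidia-smi`.
- Pass `fp16=True` to `transcribe` to use half precision on the GPU (on CPU, Whisper falls back to FP32).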
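Long audio can be processed in fixed-length chunks:

```python
import whisper

SAMPLE_RATE = 16000   # whisper.load_audio resamples to 16 kHz
chunk_size = 30       # process 30 seconds at a time

audio = whisper.load_audio("long_audio.mp3")
for i in range(0, len(audio), chunk_size * SAMPLE_RATE):
    chunk = audio[i:i + chunk_size * SAMPLE_RATE]
    # process chunk...
```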
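Hard cuts every 30 seconds can split a word across two chunks. One common mitigation is to let consecutive chunks overlap slightly. The sketch below only computes the sample index ranges; `chunk_bounds` and its `overlap_seconds` parameter are illustrative names, not part of Whisper, and merging the overlapping transcripts is left out.

```python
SAMPLE_RATE = 16000  # whisper.load_audio returns 16 kHz mono audio


def chunk_bounds(n_samples, chunk_seconds=30, overlap_seconds=0, sample_rate=SAMPLE_RATE):
    """Yield (start, end) sample indices covering n_samples in fixed-size chunks.

    Consecutive chunks share overlap_seconds of audio; the last chunk may be shorter.
    """
    chunk = int(chunk_seconds * sample_rate)
    step = chunk - int(overlap_seconds * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk_seconds must be positive and larger than overlap_seconds")
    start = 0
    while start < n_samples:
        end = min(start + chunk, n_samples)
        yield start, end
        if end == n_samples:
            break  # reached the end of the audio
        start += step
```

With an array from `whisper.load_audio`, each chunk is then `audio[start:end]` for `start, end in chunk_bounds(len(audio), overlap_seconds=2)`.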
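Models can also be quantized with Hugging Face Optimum:

```bash
pip install "optimum[onnxruntime]"
```

```python
# Sketch: export to ONNX, then apply dynamic INT8 quantization (ORTQuantizer does not
# produce FP16). The optimum API differs between releases; check your installed version.
from pathlib import Path
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

ORTModelForSpeechSeq2Seq.from_pretrained("openai/whisper-base", export=True).save_pretrained("whisper-base-onnx")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)  # pick the config for your CPU
for onnx_file in Path("whisper-base-onnx").glob("*.onnx"):
    quantizer = ORTQuantizer.from_pretrained("whisper-base-onnx", file_name=onnx_file.name)
    quantizer.quantize(save_dir="whisper-base-quantized", quantization_config=qconfig)
```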
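- Check whether `nvcc --version` and `torch.version.cuda` report the same CUDA version.
- If `pip install` fails with a permission error, add the `--user` flag or use `sudo`; inside the virtual environment neither should be needed.
- If package downloads are slow, point pip at a mirror:

```bash
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
```

- If the GPU runs out of memory, reduce `batch_size` (where your pipeline batches inputs) or switch to a smaller model.
- If an audio format is not accepted, convert it to 16 kHz mono WAV with ffmpeg:

```bash
ffmpeg -i input.m4a -ar 16000 -ac 1 output.wav
```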
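Several of these problems can be spotted before the first transcription with a quick environment check. The sketch below uses only the standard library and does not import the heavy packages; `check_environment` and `missing_items` are hypothetical helper names.

```python
import importlib.util
import shutil


def check_environment():
    """Collect basic facts about the local setup without importing heavy packages."""
    return {
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "nvcc_on_path": shutil.which("nvcc") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "whisper_installed": importlib.util.find_spec("whisper") is not None,
    }


def missing_items(report):
    """Return the names of checks that failed, in a stable order."""
    return sorted(name for name, ok in report.items() if not ok)


if __name__ == "__main__":
    problems = missing_items(check_environment())
    print("Environment looks complete" if not problems else f"Missing: {problems}")
```

When `torch` is installed, `torch.version.cuda` and `torch.cuda.is_available()` give the CUDA version PyTorch was built with and whether a usable GPU was found, which complements the `nvcc --version` comparison.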
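When the spoken language is not known in advance, Whisper can detect it:

```python
import whisper

def detect_language(audio_path):
    """Return the language code Whisper detects for the audio file."""
    model = whisper.load_model("tiny")
    # With no `language` argument, transcribe() detects the language from the
    # first 30 seconds of audio and reports it in the result.
    res = model.transcribe(audio_path)
    return res["language"]
```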
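### VII. Extended Use Cases

#### 1. Real-time speech recognition

```python
import numpy as np
import sounddevice as sd

def callback(indata, frames, time, status):
    if status:
        print(status)
    # Whisper expects float32 samples in [-1, 1]; astype() also copies the reused buffer
    audio = indata[:, 0].astype(np.float32)
    # process audio data in real time...

with sd.InputStream(samplerate=16000, channels=1, dtype="float32", callback=callback):
    print("Recording... press Ctrl+C to stop")
    try:
        while True:
            sd.sleep(1000)  # sleep instead of busy-waiting
    except KeyboardInterrupt:
        pass
```

Transcriptions can also be passed to a text-to-speech engine such as gTTS:

```python
import subprocess
import whisper
from gtts import gTTS

def asr_to_tts(audio_path):
    model = whisper.load_model("medium")
    text = model.transcribe(audio_path)["text"]
    tts = gTTS(text=text, lang='zh')
    tts.save("output.mp3")
    subprocess.run(["mpg321", "output.mp3"], check=False)  # play the result
```

Dependencies can be installed from a requirements file:

```bash
pip install -r requirements.txt
```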
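For the real-time scenario, the audio callback receives small blocks of samples, while Whisper works better on windows of several seconds. One possible approach is to accumulate samples and hand over complete windows. The class below is a minimal, standard-library sketch: `WindowBuffer` is a hypothetical name, it stores samples in plain lists for clarity, and a production version would more likely use NumPy arrays and a thread-safe queue.

```python
class WindowBuffer:
    """Accumulate incoming samples and release fixed-length windows."""

    def __init__(self, window_seconds=5.0, sample_rate=16000):
        self.window_size = int(window_seconds * sample_rate)
        if self.window_size <= 0:
            raise ValueError("window_seconds * sample_rate must be at least 1")
        self._samples = []

    def push(self, samples):
        """Add samples; return a list of complete windows (possibly empty)."""
        self._samples.extend(samples)
        windows = []
        while len(self._samples) >= self.window_size:
            windows.append(self._samples[: self.window_size])
            del self._samples[: self.window_size]
        return windows

    def pending(self):
        """Number of samples waiting for the next window to fill up."""
        return len(self._samples)
```

Inside the stream callback, `push` would be called with the latest block, and each returned window passed to a worker that runs `model.transcribe`. Keeping transcription out of the callback itself avoids dropped audio, because the callback has to return quickly.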
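With the steps above, developers can build an efficient and reliable speech recognition system locally. Validate performance in a test environment first, then migrate to production step by step. For enterprise applications, consider containerizing the deployment with Docker to isolate the environment.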