Introduction: This article uses a complete worked example to show how to build a local voice assistant with Whisper, DeepSeek, and TTS, providing step-by-step guidance from environment setup to feature implementation so that readers with no prior background can quickly pick up LLM application development.
```
[Microphone input] → [Whisper ASR] → [DeepSeek NLP] → [TTS synthesis] → [Speaker output]
                           ↑                  ↓
                 [Context memory module] ← [Knowledge base retrieval]
```
This architecture achieves low-latency interaction through pipelined processing, and because the modules are decoupled, each stage can be optimized independently. Memory usage is kept under 8 GB, so the system runs on mainstream laptops. To make the decoupling concrete, a minimal interface sketch follows.
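The sketch below illustrates one way the uniform stage interface could look; the function names in the usage comment refer to modules built later in this article, and nothing here is part of any library API.

```python
from typing import Any, Callable, List

# Each stage is just a callable from its input to its output:
# ASR: audio -> text, NLP: text -> text, TTS: text -> audio.
def run_pipeline(data: Any, stages: List[Callable[[Any], Any]]) -> Any:
    """Feed the output of each stage into the next; any stage can be
    swapped or optimized independently without touching the others."""
    for stage in stages:
        data = stage(data)
    return data

# Illustrative usage with the modules implemented later in this article:
# run_pipeline(mic_audio, [transcribe_audio,
#                          ds_engine.generate_response,
#                          synthesizer.synthesize])
```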
```bash
# Base environment setup (Ubuntu 22.04 example)
sudo apt update && sudo apt install -y python3.10-dev python3-pip ffmpeg libsndfile1

# Create a virtual environment
python3 -m venv ai_assistant
source ai_assistant/bin/activate

# Install PyTorch (CUDA 11.8 build)
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu118

# Install Whisper (the medium model is a good accuracy/speed trade-off)
pip install openai-whisper

# Deploy the DeepSeek model (download a quantized version from official channels)
git clone https://github.com/deepseek-ai/DeepSeek-LLM.git
cd DeepSeek-LLM && pip install -e .
```
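Before moving on, a quick sanity check (with the virtual environment activated) confirms that PyTorch can see the GPU and that Whisper imports cleanly:

```bash
# Should print the torch version followed by True if CUDA is usable
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# Lists the Whisper checkpoint names that can be passed to load_model()
python3 -c "import whisper; print(whisper.available_models())"
```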
Call `torch.cuda.empty_cache()` periodically to clean up memory fragmentation and prevent OOM errors; a sketch of where this fits in a serving loop follows.
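Here is a minimal sketch of periodic cache cleanup; the loop and the `cleanup_every` interval are illustrative, since `empty_cache()` itself has a cost and should not run on every request:

```python
import torch

def serve_requests(generate_fn, prompts, cleanup_every=50):
    """Illustrative serving loop: generate_fn is any text-generation callable.

    Clearing the CUDA cache every N requests releases cached allocator
    blocks back to the driver and limits fragmentation buildup.
    """
    for i, prompt in enumerate(prompts, start=1):
        yield generate_fn(prompt)
        if i % cleanup_every == 0:
            torch.cuda.empty_cache()
```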
```python
import whisper
import sounddevice as sd
import numpy as np

def record_audio(duration=5, sample_rate=16000):
    """Record mono int16 audio from the default microphone."""
    print("Recording...")
    recording = sd.rec(
        int(duration * sample_rate),
        samplerate=sample_rate,
        channels=1,
        dtype='int16'
    )
    sd.wait()  # block until the recording finishes
    return recording.flatten()

# Load the model once; reloading it on every call would add seconds of latency
model = whisper.load_model("medium")

def transcribe_audio(audio_data):
    # Convert int16 samples to float32 in [-1, 1], the range Whisper expects
    audio_float = audio_data.astype(np.float32) / 32768.0
    # task="transcribe" keeps the output in Chinese; task="translate" would
    # return an English translation instead
    result = model.transcribe(audio_float, language="zh", task="transcribe")
    return result["text"]

# Example call
audio = record_audio()
text = transcribe_audio(audio)
print("Recognition result:", text)
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

class DeepSeekEngine:
    def __init__(self, model_path="deepseek-ai/DeepSeek-Coder-6.7B-Instruct"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.model.eval()

    def generate_response(self, prompt, max_length=200):
        # device_map="auto" may shard the model, so target the model's own device
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            outputs = self.model.generate(
                inputs.input_ids,
                max_new_tokens=max_length,
                temperature=0.7,
                do_sample=True
            )
        # Drop the prompt tokens so only the newly generated reply is returned
        new_tokens = outputs[0][inputs.input_ids.shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)

# Initialize the engine
ds_engine = DeepSeekEngine()
# Prompt: "User says: What will the weather in Beijing be like tomorrow?"
response = ds_engine.generate_response("用户说:明天北京天气怎么样?")
print("AI reply:", response)
```
```python
from TTS.api import TTS
from scipy.io.wavfile import write
import numpy as np
import sounddevice as sd

class VoiceSynthesizer:
    def __init__(self, model_name="tts_models/zh-CN/biao/vits"):
        self.tts = TTS(model_name, gpu=True)
        self.speaker_idx = 0  # default female voice
        self.style_idx = 0    # neutral style

    def synthesize(self, text, output_path="output.wav"):
        wav = self.tts.tts(
            text=text,
            speaker_idx=self.speaker_idx,
            style_idx=self.style_idx
        )
        # Play the audio
        sd.play(wav, samplerate=self.tts.sample_rate)
        sd.wait()
        # Save to file as 16-bit PCM
        write(output_path, self.tts.sample_rate, (np.array(wav) * 32767).astype(np.int16))

# Example call ("Tomorrow Beijing will be cloudy turning sunny, 15 to 25 degrees.")
synthesizer = VoiceSynthesizer()
synthesizer.synthesize("明天北京多云转晴,气温15到25度。")
```
```python
import asyncio

class AssistantPipeline:
    def __init__(self):
        # asyncio.Queue (not queue.Queue): its get/put are awaitable
        self.audio_queue = asyncio.Queue(maxsize=10)
        self.text_queue = asyncio.Queue(maxsize=10)
        self.running = False

    async def asr_worker(self):
        while self.running:
            audio_data = await self.audio_queue.get()
            # Transcription blocks, so run it in a thread to keep the loop responsive
            text = await asyncio.to_thread(transcribe_audio, audio_data)
            await self.text_queue.put(text)

    async def nlp_worker(self):
        ds_engine = DeepSeekEngine()
        while self.running:
            text = await self.text_queue.get()
            response = await asyncio.to_thread(ds_engine.generate_response, text)
            # A TTS call can be added here
            print("System reply:", response)

    async def start(self):
        self.running = True
        await asyncio.gather(
            self.asr_worker(),
            self.nlp_worker()
        )

    def stop(self):
        self.running = False

# Launch example
pipeline = AssistantPipeline()
asyncio.run(pipeline.start())
```
```dockerfile
FROM nvidia/cuda:11.8.0-base-ubuntu22.04
RUN apt update && apt install -y python3.10 python3-pip ffmpeg libsndfile1
RUN pip install torch torchvision torchaudio openai-whisper TTS
COPY ./app /app
WORKDIR /app
CMD ["python3", "main.py"]
```
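Assuming the Dockerfile above sits at the project root and the NVIDIA Container Toolkit is installed on the host, the image could be built and run like this (the image tag is arbitrary):

```bash
docker build -t local-voice-assistant .
# --gpus all exposes the host GPU; --device /dev/snd passes the sound hardware through
docker run --rm --gpus all --device /dev/snd local-voice-assistant
```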
CUDA out of memory:

- Use `torch.cuda.memory_summary()` to diagnose memory leaks

Speech recognition errors:
- Adjust the `energy_threshold` parameter (default 300)
- Tune beam search via `whisper.decoding.DecodingOptions`, as sketched below
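For the beam-search tuning mentioned above, `transcribe()` forwards extra keyword arguments into `whisper.DecodingOptions`; the values shown are illustrative starting points, not tuned settings:

```python
import whisper

model = whisper.load_model("medium")  # or reuse the model from the ASR module
# A larger beam_size trades speed for accuracy; temperature=0.0 disables sampling
result = model.transcribe(
    "sample.wav",          # a file path or float32 waveform, as in the ASR module
    language="zh",
    beam_size=5,
    temperature=0.0,
)
print(result["text"])
```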
Model loading failures:

- Check the `device_map` configuration
- Raise the shared-memory limit (`sudo sysctl -w kernel.shmmax=17179869184`)

By following this guide, a reader can go from environment setup to a fully functional voice assistant in about 8 hours; practical testing was carried out on an RTX 3060.
Beginners are advised to start with a quantized model and move up to the full model gradually; one way to load a quantized checkpoint is sketched below. When problems arise, check CUDA version compatibility and Python dependency conflicts first.
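This sketch shows one common quantization route via the `transformers` BitsAndBytesConfig API; it assumes the `bitsandbytes` package is installed, and the model id simply matches the one used earlier in this article:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Illustrative 4-bit quantized load; swap in whatever checkpoint you actually use
model_id = "deepseek-ai/DeepSeek-Coder-6.7B-Instruct"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # store weights in 4-bit precision
    bnb_4bit_compute_dtype=torch.float16,    # run matmuls in fp16
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```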