Overview: This article is a complete guide to deploying DeepSeek locally, covering environment setup, model installation, and building a visual interface end to end. It pays special attention to managing disk space when installing on the D: drive, and includes a pitfall checklist and code examples.
Run `nvidia-smi` to confirm the GPU driver is working; the CUDA version must be ≥ 11.8 (check with `nvcc --version`).
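Both checks from the command line:

```bash
# Confirm the driver sees the GPU and shows current VRAM usage
nvidia-smi
# Confirm the installed CUDA toolkit version is >= 11.8
nvcc --version
```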
```bash
conda create -n deepseek python=3.10
conda activate deepseek
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers datasets accelerate gradio
```
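After installation, a quick sanity check (run inside the deepseek environment) confirms that PyTorch sees the GPU and the expected CUDA build:

```python
import torch

# Should print True and the detected GPU name if the cu118 build installed correctly
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no CUDA device")
print(torch.version.cuda)  # expected to report 11.8 for the cu118 wheels
```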
```
D:/deepseek/
├── models/    # model weights
├── logs/      # run logs
└── outputs/   # visualization results
```
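A small sketch to create this layout up front (assumes the D: drive exists; adjust the root path if needed):

```python
import os

# Create the model, log, and output directories used throughout this guide
root = "D:/deepseek"
for sub in ("models", "logs", "outputs"):
    os.makedirs(os.path.join(root, sub), exist_ok=True)
```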
```bash
git lfs install
git clone https://huggingface.co/deepseek-ai/deepseek-r1-7b D:/deepseek/models/deepseek-r1-7b
```
Basic inference script (infer.py):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load tokenizer and model from the local D: drive path in half precision
model_path = "D:/deepseek/models/deepseek-r1-7b"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", torch_dtype=torch.float16
)

prompt = "Explain the basic principles of quantum computing:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Pass `low_cpu_mem_usage=True` to `from_pretrained` to reduce host memory usage while loading. Quick web UI with Gradio:
```python
import gradio as gr

def predict(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

demo = gr.Interface(fn=predict, inputs="text", outputs="text")
demo.launch(server_name="0.0.0.0", server_port=7860)
```
Pitfall checklist:

- Gradio is the lightest option for the UI (Streamlit works as an alternative).
- Keep the install path short (e.g. D:/ds/) and write Windows paths as raw strings such as `r"D:\deepseek\models"` to avoid escape issues.
- For CUDA out of memory errors: clear caches with `torch.backends.cuda.cufft_plan_cache.clear()`, lower the `max_new_tokens` parameter (a starting value of 128 is recommended), or quantize with bitsandbytes (4-bit in the example below):
```python
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(model_path, quantization_config=quant_config)
```
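To confirm how much VRAM the quantized model actually occupies, a quick check after loading (a sketch; numbers vary by GPU and model):

```python
import torch

# Allocated and reserved VRAM in GiB once the quantized model is on the GPU
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")
```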
```bash
pip check            # check for dependency conflicts
conda list           # list package versions in the environment
pip install --force-reinstall transformers==4.35.0   # pin transformers if a conflict is reported
```
Speed up inference with torch.compile:
```python
model = torch.compile(model)
```
Set `do_sample=False` in `generate` to disable sampling and improve throughput (see the usage sketch after the next snippet). Model-switching script:
```python
from transformers import AutoModelForCausalLM

# Map model size labels to their local paths on the D: drive
model_variants = {
    "7B": "D:/deepseek/models/deepseek-r1-7b",
    "13B": "D:/deepseek/models/deepseek-r1-13b",
}

def load_model(variant):
    return AutoModelForCausalLM.from_pretrained(model_variants[variant], device_map="auto")
```
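A usage sketch that combines the switcher with the deterministic-decoding tip above (the tokenizer is assumed to be the one loaded in infer.py):

```python
# Load the 7B variant and run a greedy (non-sampling) generation pass
model = load_model("7B")
inputs = tokenizer("Summarize the advantages of local deployment:", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```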
Write run logs to D:/deepseek/logs/:

```python
import logging

logging.basicConfig(
    filename="D:/deepseek/logs/inference.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
```
One-click launch script (run.bat):
```bat
@echo off
rem Use "call" so the batch script continues after conda activation
call conda activate deepseek
set PYTHONPATH=D:/deepseek
python infer.py --model_path D:/deepseek/models/deepseek-r1-7b --port 7860
pause
```
Containerized deployment (Dockerfile):

```dockerfile
FROM nvidia/cuda:11.8.0-base-ubuntu22.04
WORKDIR /app
RUN apt update && apt install -y python3-pip
COPY requirements.txt .
RUN pip3 install -r requirements.txt
COPY . .
CMD ["python3", "infer.py"]
```
```bash
docker build -t deepseek-local .
docker run --gpus all -v D:/deepseek:/app/data -p 7860:7860 deepseek-local
```
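The Dockerfile copies a requirements.txt that is not shown in this guide; a minimal sketch covering the packages used above might look like this (pin versions to what you validated locally, and note that CUDA-enabled torch wheels may need the --index-url flag from the earlier pip command):

```text
torch
torchvision
torchaudio
transformers==4.35.0
datasets
accelerate
gradio
bitsandbytes
fastapi
uvicorn
```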
| Test scenario | Example input | Expected output |
|---|---|---|
| Basic Q&A | "What is 2+2?" | "2+2 equals 4" |
| Code generation | "Write bubble sort in Python" | Complete Python bubble sort code |
| Long-form generation | "Write a 500-word tech commentary" | A well-structured article of about 500 words |
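A hedged smoke-test sketch that runs the table's scenarios through the predict function from the Gradio demo (the outputs still need manual review):

```python
# Run each scenario from the table and print the model's reply for inspection
test_cases = {
    "basic_qa": "What is 2+2?",
    "code_generation": "Write bubble sort in Python",
    "long_form": "Write a 500-word tech commentary",
}
for name, prompt in test_cases.items():
    print(f"--- {name} ---")
    print(predict(prompt))
```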
Measure single-request latency:

```python
import time

start = time.time()
_ = model.generate(**inputs, max_new_tokens=128)
print(f"Inference time: {(time.time() - start) * 1000:.2f} ms")
```
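To relate latency to the tokens-per-second figure quoted at the end of this guide, a small extension of the same measurement (a sketch; it counts only newly generated tokens):

```python
import time

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.time() - start
# Newly generated tokens = total output length minus the prompt length
new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
```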
If you hit `OSError: Can't load config`, check that config.json exists in the model directory and that the pytorch_model.bin weights finished downloading.
If port 7860 is already in use, find and kill the occupying process.

Linux:

```bash
lsof -i :7860
kill -9 <PID>
```

Windows:

```bat
netstat -ano | findstr 7860
taskkill /PID <PID> /F
```
Build model paths with os.path.join to avoid Windows escape issues:

```python
import os

# The drive root needs a trailing separator, otherwise the result is a drive-relative path
model_path = os.path.join("D:/", "deepseek", "models", "deepseek-r1-7b")
```
```python
from datasets import load_dataset

# Load local JSON training data for fine-tuning
dataset = load_dataset("json", data_files="train.json")
```
```python
from peft import LoraConfig, get_peft_model

# Attach LoRA adapters to the attention projection layers
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
```
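A minimal fine-tuning sketch that ties the dataset and LoRA config together with the Hugging Face Trainer; the "text" column name, hyperparameters, and output paths are assumptions rather than values from the original guide:

```python
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# Some tokenizers ship without a pad token; fall back to EOS for padding
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token

# The JSON dataset is assumed to have a "text" column; adjust to your schema
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(
    tokenize, batched=True, remove_columns=dataset["train"].column_names
)

args = TrainingArguments(
    output_dir="D:/deepseek/outputs/lora-run",   # checkpoints land in the outputs/ folder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    logging_dir="D:/deepseek/logs",
)

trainer = Trainer(
    model=model,   # the LoRA-wrapped model returned by get_peft_model
    args=args,
    train_dataset=tokenized,
    # mlm=False produces causal-LM labels by copying input_ids
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```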
FastAPI implementation:
```python
from fastapi import FastAPI

app = FastAPI()

@app.post("/predict")
async def predict(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=200)
    return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}
```
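Since prompt is declared as a plain string parameter, FastAPI reads it from the query string of the POST request. A usage sketch, assuming the API code is saved as api.py (the module name is an assumption):

```bash
# Start the service
uvicorn api:app --host 0.0.0.0 --port 7860
# In another terminal, call the endpoint with the prompt as a query parameter
curl -X POST "http://localhost:7860/predict?prompt=Hello"
```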
Ongoing maintenance:

- Track updates to the deepseek-llm model releases.
- Monitor GPU utilization with `gpustat -i 1`.
- Watch CPU and memory with htop (Linux) / Task Manager (Windows).
- Check for outdated dependencies regularly (`pip list --outdated`).

With this guide, developers can run the 7B-parameter model on a GPU with 4 GB of VRAM at roughly 2-3 tokens per second. For real deployments, start from a quantized build (4-bit quantization saves about 75% of VRAM) and upgrade to the full-precision model as requirements grow.