Deploy an SGLang-based DeepSeek-V3.1 single-node inference service

Last updated: 2025-10-14

This article describes how to deploy an SGLang-based DeepSeek-V3.1 single-node inference service on Cloud Container Engine (CCE).

Background

DeepSeek-V3.1

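DeepSeek-V3.1 is the latest-generation large language model released by DeepSeek. It is built on a Mixture-of-Experts (MoE) architecture with 671B total parameters, of which 37B are activated per token. The model performs strongly on mathematical reasoning, code generation, and multilingual understanding, supports a 32K context length, and offers strong instruction following and multi-turn dialogue.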

SGLang

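SGLang is a high-performance serving and inference engine for large language models and multimodal models. It co-designs the frontend and backend to make model interaction faster and more controllable. The backend supports RadixAttention (prefix caching), a zero-overhead CPU scheduler, prefill-decode (PD) disaggregation, speculative decoding, continuous batching, PagedAttention, TP/DP/PP/EP parallelism, structured outputs, chunked prefill, several quantization schemes (FP8/INT4/AWQ/GPTQ), and multi-LoRA batching, which together improve inference efficiency significantly. The frontend offers a flexible programming interface with chained generation, advanced prompting, control flow, multimodal input, parallel processing, and external interaction, which makes complex applications easier to build. SGLang supports generative models such as Qwen, DeepSeek, and Llama, embedding models such as E5-Mistral, and reward models such as Skywork, and it is easy to extend to new models. For more information about the SGLang inference engine, see SGLang GitHub.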

Prerequisites

A CCE cluster running version 1.31 or later has been created, and GPU nodes have been added to it.

For details, see Create a CCE Managed Cluster.

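This article recommends the bcc.lsgn7ec.c176m1952.8h20-141.2d instance type (contact your account manager to request invitation-only trial access to this GPU type).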
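As a rough sanity check on why a node with eight large-memory GPUs is recommended, the weight footprint can be estimated with simple arithmetic. This is only a sketch: the 1 byte per parameter figure (FP8 weights), the GPU count, and the 141 GB per GPU figure (read off the `8h20-141` part of the instance name) are assumptions, and the estimate ignores KV cache, activations, and runtime overhead.

```python
# Back-of-the-envelope sizing. The parameter counts come from the model
# description in this article; every other constant is an assumption.
TOTAL_PARAMS_B = 671    # total parameters, in billions
ACTIVE_PARAMS_B = 37    # parameters activated per token, in billions
BYTES_PER_PARAM = 1     # assumption: weights stored in FP8
GPU_COUNT = 8           # assumption: 8 GPUs on the node
GPU_MEMORY_GB = 141     # assumption: 141 GB of memory per GPU


def weight_memory_gb(params_b: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (billions of params x bytes each)."""
    return params_b * bytes_per_param


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of parameters that take part in computing one token."""
    return active_b / total_b


weights_gb = weight_memory_gb(TOTAL_PARAMS_B, BYTES_PER_PARAM)
node_gb = GPU_COUNT * GPU_MEMORY_GB
headroom_gb = node_gb - weights_gb

print(f"weights:  {weights_gb} GB")
print(f"node:     {node_gb} GB across {GPU_COUNT} GPUs")
print(f"headroom: {headroom_gb} GB for KV cache and overhead")
print(f"active:   {active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B):.1%} of parameters per token")
```

Under these assumptions the weights alone would not fit on a single GPU, which is consistent with the tensor-parallel degree of 8 used in the deployment step.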

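Model deployment

Step 1: Prepare the DeepSeek-V3.1 model files

Create a directory in PFS and download the model to PFS.

Log in to the PFS console and mount the cluster nodes to the PFS mount service; see the console operation guide and the command-line operation guide. For how to create a PFS file system, create a mount service, and bind a storage instance, see Create a File System, Create a Mount Service, and Bind a Storage Instance.
Create a directory in PFS, then run the following commands to download the DeepSeek-V3.1 model from ModelScope into that directory.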

Plain Text

mkdir -p ./<pfs-directory>/models-test/DeepSeek-V3.1
pip install modelscope
modelscope download --model deepseek-ai/DeepSeek-V3.1 --local_dir ./<pfs-directory>/models-test/DeepSeek-V3.1
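Because the download is large, it can be worth checking the target directory before moving on. The following is a minimal sketch of such a check; the function name is hypothetical, and the expected file names (`config.json` and `*.safetensors` shards at the top level) are assumptions about the usual model repository layout rather than something this article specifies.

```python
from pathlib import Path


def missing_model_files(model_dir: str) -> list[str]:
    """Return a list of problems found in model_dir (empty means it looks complete)."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{model_dir} is not a directory"]
    problems = []
    # Assumption: the repository ships a config.json at its top level.
    if not (root / "config.json").is_file():
        problems.append("config.json not found")
    # Assumption: the weights are stored as *.safetensors shards.
    if not any(root.glob("*.safetensors")):
        problems.append("no *.safetensors weight files found")
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_model.py ./<pfs-directory>/models-test/DeepSeek-V3.1
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    issues = missing_model_files(target)
    print("OK" if not issues else "\n".join(issues))
```

This only confirms that the expected kinds of files exist. It does not verify checksums or that every shard was downloaded.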

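Create a PV and a PVC. Configure a PersistentVolume (PV) and a PersistentVolumeClaim (PVC) for the target cluster. See Using Parallel File Storage PFS L2.

Example of creating a PV from YAML: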

Plain Text

apiVersion: v1
kind: PersistentVolume
metadata:
  name: <your-pv-name> # test-pv-02 in this example
spec:
  accessModes:
  - ReadOnlyMany
  capacity:
    storage: 500Gi
  local:
    path: <your-pfs-path> # /pfs/pfs-qnL8Jh/Qwen-models in this example
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: ready-for-pfsl2
          operator: In
          values:
          - "true"
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-volume
  volumeMode: Filesystem

Example of creating a PVC from YAML:

Plain Text

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  finalizers:
  - kubernetes.io/pvc-protection
  name: <your-pvc-name> # test-pvc-02 in this example
  namespace: default
spec:
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      storage: 500Gi
  storageClassName: local-volume
  volumeMode: Filesystem
  volumeName: <your-pv-name> # test-pv-02 in this example
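The PVC only binds to the PV if the two manifests agree on a few fields. The sketch below illustrates that relationship with plain dictionaries that mirror the two examples above. The function name is hypothetical and the checks are a simplified subset of what Kubernetes evaluates (capacity comparison is omitted), so treat it as an illustration rather than a validator.

```python
def binding_problems(pv: dict, pvc: dict) -> list[str]:
    """Return mismatches that would keep the PVC from binding to the PV."""
    problems = []
    pv_spec, pvc_spec = pv["spec"], pvc["spec"]
    # The PVC names the PV it wants through spec.volumeName.
    if pvc_spec.get("volumeName") != pv["metadata"]["name"]:
        problems.append("volumeName does not match the PV name")
    if pvc_spec.get("storageClassName") != pv_spec.get("storageClassName"):
        problems.append("storageClassName differs")
    # Every access mode the PVC asks for must be offered by the PV.
    if not set(pvc_spec.get("accessModes", [])) <= set(pv_spec.get("accessModes", [])):
        problems.append("PVC requests access modes the PV does not offer")
    if pvc_spec.get("volumeMode", "Filesystem") != pv_spec.get("volumeMode", "Filesystem"):
        problems.append("volumeMode differs")
    return problems


# Dictionaries mirroring the relevant fields of the example manifests.
pv = {
    "metadata": {"name": "test-pv-02"},
    "spec": {
        "accessModes": ["ReadOnlyMany"],
        "storageClassName": "local-volume",
        "volumeMode": "Filesystem",
    },
}
pvc = {
    "metadata": {"name": "test-pvc-02", "namespace": "default"},
    "spec": {
        "accessModes": ["ReadOnlyMany"],
        "storageClassName": "local-volume",
        "volumeMode": "Filesystem",
        "volumeName": "test-pv-02",
    },
}

print(binding_problems(pv, pvc) or "PV and PVC are consistent")
```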

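Step 2: Deploy the inference service

Use the following YAML example to deploy a single-node DeepSeek-V3.1 inference service on CCE with the SGLang inference engine.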

Plain Text

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    # for prometheus to scrape
    baidu-cce/inference-workload: sglang-ds-v3-1
    baidu-cce/inference_backend: sglang
  name: sglang-ds-v3-1
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      baidu-cce/inference-workload: sglang-ds-v3-1
      baidu-cce/inference_backend: sglang
  template:
    metadata:
      labels:
        baidu-cce/inference-workload: sglang-ds-v3-1
        baidu-cce/inference_backend: sglang
    spec:
      nodeSelector:
        gputype: h20
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: <your-pvc-name> # test-pvc-02 in this example
          readOnly: true
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 15Gi
      containers:
      - command:
        - sh
        - -c
        - python -m sglang.launch_server --model-path /<your-model-path>/DeepSeek-V3.1/ --tp 8 --host 0.0.0.0 --port 8000 --enable-metrics
        image: registry.baidubce.com/ai-native-dev/infer-manager/dev-image:0.4.ubuntu2204-py313-sglang0.5.2-router0.1.9-mooncake-0.3.6-nixl-0.6.0-cuda12.4
        name: sglang
        ports:
        - containerPort: 8000
          name: http
        readinessProbe:
          initialDelaySeconds: 600
          periodSeconds: 30
          tcpSocket:
            port: 8000
        resources:
          limits:
            nvidia.com/gpu: "8"
            memory: "1024Gi"
            cpu: "128"
          requests:
            nvidia.com/gpu: "8"
            memory: "1024Gi"
            cpu: "128"
        volumeMounts:
        - mountPath: <your-model-path> # /models-test/DeepSeek-V3.1 in this example
          name: model
        - mountPath: /dev/shm
          name: dshm
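The readiness probe in the manifest is a TCP check on port 8000, and loading a model of this size can take many minutes. A client-side analogue of that check, useful before sending the first request, might look like the sketch below. The function name is hypothetical; the 30-second default interval mirrors `periodSeconds`, while the default timeout is an arbitrary assumption.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 1800.0,
                  interval_s: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # Same idea as the tcpSocket probe: success means something is listening.
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            pass
        # Give up if another full interval would run past the deadline.
        if time.monotonic() + interval_s > deadline:
            return False
        time.sleep(interval_s)


# Example (run inside the container once the pod has started):
#   ready = wait_for_port("127.0.0.1", 8000)
```

An open port only shows that the server process is listening. It does not guarantee that a completion request will succeed, which is what the verification step checks.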

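Step 3: Verify the inference service

Log in to the container and run the following command to send a sample inference request to the model inference service.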

Plain Text

curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "/models-test/DeepSeek-V3.1/","messages":[{"role":"user","content":"测试一下，用python代码写出hello world"}],"max_tokens":2000,"temperature":0.7,"top_p":0.9,"seed":10}'
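The same request can be sent from Python using only the standard library. This is a sketch under the assumptions that the server listens on port 8000, as configured in the StatefulSet, and that the model is addressed by its path inside the container; the helper names `build_chat_request` and `send` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 2000,
        "temperature": 0.7,
        "top_p": 0.9,
        "seed": 10,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout_s: float = 300.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (run inside the container):
#   req = build_chat_request("http://127.0.0.1:8000", "/models-test/DeepSeek-V3.1/",
#                            "Write hello world in Python")
#   print(send(req)["choices"][0]["message"]["content"])
```

Building the payload with `json.dumps` avoids the quoting and bracket mistakes that are easy to make when writing the JSON body by hand on a shell command line.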

Expected output:

Plain Text

{"id":"dfe576847a034b33ba882f2de690ee4f","object":"chat.completion","created":1760414204,"model":"/pfs/pfs-qnL8Jh/models-test/DeepSeek-V3.1/","choices":[{"index":0,"message":{"role":"assistant","content":"当然可以！以下是使用 Python 输出 \"Hello, World!\" 的代码：\n\n```python\nprint(\"Hello, World!\")\n```\n\n运行这段代码后，控制台将会显示：\n```\nHello, World!\n```\n\n如果你想要更正式一些的版本（例如包含主函数），可以这样写：\n\n```python\ndef main():\n    print(\"Hello, World!\")\n\nif __name__ == \"__main__\":\n    main()\n```\n\n两种方式都可以正确输出结果。第一个是最简单的单行实现，第二个是更结构化的写法，适用于大型程序。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":1}],"usage":{"prompt_tokens":13,"total_tokens":125,"completion_tokens":112,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}

The output shows that the model generates a reply to the given input (here, a test message asking for Python code that prints hello world).
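When verifying the service from a script, it helps to pull out the fields that matter rather than reading the raw JSON. The sketch below does this for a response shaped like the one above. The function name is hypothetical, and `SAMPLE` is an abridged stand-in: its token counts are taken from the expected output, but the message content is shortened.

```python
import json


def summarize_completion(raw: str) -> dict:
    """Extract the reply text and token accounting from a chat completion response."""
    response = json.loads(raw)
    choice = response["choices"][0]
    usage = response["usage"]
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        # The total should equal prompt tokens plus completion tokens.
        "usage_consistent": usage["prompt_tokens"] + usage["completion_tokens"]
        == usage["total_tokens"],
    }


# Abridged stand-in for the response above (content shortened for illustration).
SAMPLE = json.dumps({
    "object": "chat.completion",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": 'print("Hello, World!")'},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 13, "total_tokens": 125, "completion_tokens": 112},
})

summary = summarize_completion(SAMPLE)
print(summary["finish_reason"], summary["usage_consistent"])
```

A `finish_reason` of `stop` indicates that the model ended the reply on its own rather than being cut off by `max_tokens`.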
