Last updated: 2025-02-17 19:04:26
This article describes how to deploy the DeepSeek-R1 model (using the 70B-parameter variant as an example) on a Kingsoft Cloud Container Engine (KCE) cluster. We use vLLM to run DeepSeek-R1 and expose an API, and Open WebUI for interaction. A follow-up article will cover deploying DeepSeek models with Ollama in detail.
In this article you will learn how to deploy a large language model simply and efficiently with Kingsoft Cloud and vLLM, and how Open WebUI provides an intuitive way to interact with it.
vLLM is a tool focused on optimizing large language model inference. It raises inference efficiency, particularly when compute resources are limited, so that models run well across different hardware. Compared with Ollama, vLLM not only supports deploying large models but is specifically optimized for the inference stage, making inference faster and more resource-efficient. Through fine-grained resource management it reduces memory and compute consumption while preserving inference speed and accuracy.
vLLM also exposes an OpenAI-compatible API, so developers can integrate it into existing applications through a familiar interface for remote model invocation and management.
Ollama is a tool that simplifies deploying large language models (LLMs), playing a role similar to Docker for containerized applications. It provides a simple framework for downloading and running a wide range of large models and exposes them as services through its own API. Ollama greatly lowers the barrier to deploying large models, letting developers focus on the application layer instead of low-level deployment details.
Ollama can run a variety of large language models, including but not limited to DeepSeek and the GPT series, and suits scenarios that need rapid deployment and large-scale inference.
Feature | Ollama | vLLM |
---|---|---|
Primary function | Model download and API exposure | Inference optimization and performance |
Strengths | Simple to use, fast deployment | Efficient inference, lower resource usage |
Typical scenarios | Model deployment and management | High-throughput inference and application integration |
API support | Custom API exposure | OpenAI-compatible API |
Performance optimization | No inference-specific optimization | Optimized for the inference stage |
We recommend using bare-metal servers as Kubernetes worker nodes to run the models and the associated tooling; see the guide for creating bare-metal servers. The resources required by the different models are as follows:
Model | CPU (cores) | Memory (GB) | GPUs |
---|---|---|---|
DeepSeek-R1-Distill-Qwen-1.5B | 4 | 16 | 1 |
DeepSeek-R1-Distill-Qwen-7B | 8 | 24 | 1 |
DeepSeek-R1-Distill-Qwen-14B | 16 | 48 | 1 |
DeepSeek-R1-Distill-Qwen-32B | 32 | 96 | 2 |
DeepSeek-R1-Distill-Llama-8B | 16 | 32 | 1 |
DeepSeek-R1-Distill-Llama-70B | 32 | 128 | 4 |
Because the model files are large, we recommend storing them in KS3 object storage or in the higher-performance KPFS file storage, and mounting them as a PVC into the Pod that runs the serving tool (vLLM/Ollama). This article uses KS3 to store the DeepSeek-R1 model.
Sample deployment environment:
- Kernel: 5.15.0-60-generic
- NVIDIA-SMI: 550.127.08
- Driver Version: 550.127.08
- CUDA Version: 12.4
- GPU: 8 × A30 (one node)
Before deploying DeepSeek-R1-70B, download the DeepSeek-R1-Distill-Llama-70B model from Hugging Face with the following command; the path after --local-dir specifies where the model is saved:
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Llama-70B --local-dir /root/DeepSeek-R1-Distill-Llama-70B
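If the huggingface-cli tool is not available yet, it is provided by the huggingface_hub Python package; the commands below are a minimal sketch (the target directory matches the one used above):
# Install the Hugging Face CLI
pip install -U huggingface_hub
# After the download completes, the sharded .safetensors weights should be present
ls -lh /root/DeepSeek-R1-Distill-Llama-70B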
Log in to the Object Storage (KS3) console and create a bucket; the model files can be uploaded through the console.
You can also upload them with the KS3 API or the KS3 Finder tool; see: Upload Files.
If you store the model in KPFS instead, first create a KPFS instance in the Kingsoft Cloud console.
Mount the KPFS file system on the server:
sudo curl -L http://${KPFS_URL}/onpremise/juicefs -o /usr/local/bin/juicefs && sudo chmod +x /usr/local/bin/juicefs
sudo /usr/local/bin/juicefs mount ${KPFS_NAME} /root/juicefs
Store the DeepSeek-R1-Distill-Llama-70B model in KPFS:
mv /root/DeepSeek-R1-Distill-Llama-70B /root/juicefs
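A quick sanity check that the model is now reachable through the KPFS mount point (assuming the mount path used above):
# The JuiceFS/KPFS mount should appear as a mounted file system
df -h /root/juicefs
# The model directory should now live inside the mount
ls -lh /root/juicefs/DeepSeek-R1-Distill-Llama-70B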
To mount the model from KS3, create a Secret, a PersistentVolume, and a PersistentVolumeClaim (reference: Using KS3 object storage):
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: "****"
  namespace: "*****"
data:
  # base64-encoded AccessKey ID and Secret
  akId: **************
  akSecret: **************
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: "****"
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 100Gi
  csi:
    driver: com.ksc.csi.ks3plugin
    volumeHandle: pv-ks3
    volumeAttributes:
      # Replaced by the url of your region.
      url: "http://ks3-cn-beijing.ksyuncs.com"
      # Replaced by the bucket name you want to use.
      bucket: "****"
      # Replaced by the subPath in bucket you want to use.
      path: /test
      # You can specify any other options used by the s3fs command in here.
      additional_args: "-oensure_diskfree=2048 -osigv2"
    nodePublishSecretRef:
      # Replaced by the name and namespace of your secret.
      name: "*****"
      namespace: "*****"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: deepseek-r1-70b-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  # You can specify the pv name manually or just let kubernetes to bind the pv and pvc.
  volumeName: "****"
  # Currently ks3 only supports static provisioning, the StorageClass name should be empty.
  storageClassName: ""
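Apply the manifests and confirm the PVC binds to the PV before continuing (a minimal sketch; the manifest file name is arbitrary):
kubectl apply -f ks3-model-storage.yaml
# The PVC should report STATUS "Bound"
kubectl get pv,pvc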
To mount the model from KPFS instead, create the corresponding Secret, PersistentVolume, and PersistentVolumeClaim (reference: CSI volumes):
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: juicefs-secret
  namespace: default
  labels:
    juicefs.com/validate-secret: "true"
stringData:
  name: ${KPFS_NAME}
  token: ${KPFS_TOKEN}
  access-key: ${ACCESS_KEY}
  secret-key: ${SECRET_KEY}
  envs: '{"BASE_URL": "$KPFS_URL/static", "CFG_URL": "$KPFS_URL/volume/%s/mount"}'
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: juicefs-pv
  labels:
    juicefs-name: ten-pb-fs
spec:
  capacity:
    storage: 10Pi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: csi.juicefs.com
    volumeHandle: juicefs-pv
    fsType: juicefs
    nodePublishSecretRef:
      name: juicefs-secret
      namespace: default
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: juicefs-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  storageClassName: ""
  resources:
    requests:
      storage: 10Pi
  selector:
    matchLabels:
      juicefs-name: ten-pb-fs
Deploy DeepSeek-R1-70B as a Deployment that serves the model with vLLM (single node, 8 GPUs):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1-70b
  namespace: default
  labels:
    app: deepseek-r1-70b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-r1-70b
  template:
    metadata:
      labels:
        app: deepseek-r1-70b
    spec:
      # Mount the model into the container as a persistent volume
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: deepseek-r1-70b-pvc
        # vLLM needs to access the host's shared memory for tensor parallel inference.
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "20Gi"
      containers:
        - name: deepseek-r1-70b
          image: hub.kce.ksyun.com/ksyun/vllm-openai:0.6.6
          command: ["/bin/sh", "-c"]
          # Serve the model as an OpenAI-compatible API server with vLLM
          args: [
            "python3 -m vllm.entrypoints.openai.api_server --model /root/.cache/huggingface/DeepSeek-R1-Distill-Llama-70B --tensor-parallel-size 8 --trust-remote-code --enforce-eager --max-model-len 22560 --port 8000 --api-key token-abc123"
          ]
          ports:
            - containerPort: 8000
          resources:
            limits:
              cpu: "10"
              memory: 60G
              nvidia.com/gpu: "8"
            requests:
              cpu: "2"
              memory: 6G
              nvidia.com/gpu: "8"
          volumeMounts:
            - mountPath: /root/.cache/huggingface
              name: model
              mountPropagation: HostToContainer
            - name: shm
              mountPath: /dev/shm
After the model starts, run nvidia-smi to check GPU status, and inspect the deepseek-r1-70b container logs to confirm that the model is serving.
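For example (pod names must be taken from kubectl get pods; replace <pod-name> accordingly):
# Find the pod created by the Deployment
kubectl get pods -l app=deepseek-r1-70b
# Check GPU utilization inside the pod
kubectl exec -it <pod-name> -- nvidia-smi
# Follow the vLLM startup logs; the server is ready once it reports the API listening on port 8000
kubectl logs -f <pod-name>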
Deploy Open WebUI with Helm:
helm repo add open-webui https://open-webui.github.io/helm-charts
helm repo update
helm install openwebui open-webui/open-webui
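You can confirm the release came up before wiring it to the model (the pod and service names are set by the chart and may differ in your environment):
helm status openwebui
kubectl get pods,svc | grep open-webui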
Create a Service so that Open WebUI can connect to DeepSeek-R1-70B:
apiVersion: "v1"
kind: "Service"
metadata:
name: "deepseek-70b"
namespace: "default"
spec:
ports:
- port: 80
protocol: "TCP"
targetPort: 8000
selector:
app: "deepseek-r1-70b"
sessionAffinity: "None"
type: "ClusterIP"
Access the Open WebUI Service (see: Accessing a Service through Kingsoft Cloud load balancing), then connect the WebUI to DeepSeek-R1-70B: in Open WebUI, add an OpenAI-compatible connection pointing to the deepseek-70b Service (for example http://deepseek-70b.default.svc.cluster.local/v1) with the API key token-abc123 configured in the Deployment.
The following walks through a multi-node (distributed) deployment, where vLLM uses Ray to run tensor-parallel inference across two GPU nodes.
Sample deployment environment:
- Kernel: 5.15.0-60-generic
- NVIDIA-SMI: 550.127.08
- Driver Version: 550.127.08
- CUDA Version: 12.4
- GPU: 8 × A30 (two nodes)
Note: The model download and storage-volume creation steps are the same as above.
Adjust the variable values in the ConfigMap below as needed.
apiVersion: v1
kind: ConfigMap
metadata:
  name: deepseek-70b-config
data:
  ACTIVE_NODE: "2"                 # number of Ray nodes (head + workers)
  IP_OF_HEAD_NODE: "deepseek-70b-head.default.svc.cluster.local"  # Service DNS name of the head node
  NCCL_SOCKET_IFNAME: "eth0,eth1"  # network interfaces used by NCCL
  NCCL_DEBUG: "TRACE"              # NCCL log level
  NCCL_IB_DISABLE: "1"             # disable InfiniBand (1 = disabled)
  MODEL_DIR: "/root/.cache/huggingface/DeepSeek-R1-Distill-Llama-70B"  # model path inside the container
  PIPELINE_PARALLEL_SIZE: "1"      # pipeline parallel size
  TENSOR_PARALLEL_SIZE: "16"       # tensor parallel size (total GPUs: 8 GPUs x 2 nodes)
  MAX_MODEL_LEN: "22560"           # maximum context length of the model
  MODEL_PORT: "8000"               # model API port
  MODEL_API_KEY: "token-abc123"    # model API token
Create the Ray head node Service, used for service discovery:
apiVersion: "v1"
kind: "Service"
metadata:
name: "deepseek-70b-head"
namespace: "default"
spec:
ports:
- port: 80
protocol: "TCP"
targetPort: 8000
name: model
- port: 6379
protocol: "TCP"
targetPort: 6379
name: ray
selector:
app: "deepseek-r1-70b-head"
sessionAffinity: "None"
type: "ClusterIP"
Deploy the Ray head node:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1-70b-head
  namespace: default
  labels:
    app: deepseek-r1-70b-head
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-r1-70b-head
  template:
    metadata:
      labels:
        app: deepseek-r1-70b-head
    spec:
      # Mount the model into the container as a persistent volume
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: juicefs-pvc
        # vLLM needs to access the host's shared memory for tensor parallel inference.
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "20Gi"
      containers:
        - name: deepseek-r1-70b-head
          image: hub.kce.ksyun.com/ksyun/vllm-openai:0.6.6
          envFrom:
            - configMapRef:
                name: deepseek-70b-config
          # Start the Ray head node, wait until all Ray nodes are active, then serve the model
          # as an OpenAI-compatible API server with vLLM
          command:
            - /bin/sh
            - -c
            - |
              nohup ray start --head --port=6379 --dashboard-host=0.0.0.0 --object-manager-port=8625 --block 2>&1 &
              while :; do
                active_count=$(ray status | awk '/Active:/{flag=1; next} /Pending:/{flag=0} flag && /^[[:space:]]*[0-9]+/{count++} END {print count}');
                echo "Current number of active Ray nodes: $active_count";
                [ "$active_count" -eq "$ACTIVE_NODE" ] && python3 -m vllm.entrypoints.openai.api_server --model $MODEL_DIR --tensor-parallel-size $TENSOR_PARALLEL_SIZE --pipeline-parallel-size $PIPELINE_PARALLEL_SIZE --trust-remote-code --enforce-eager --max-model-len $MAX_MODEL_LEN --port $MODEL_PORT --api-key $MODEL_API_KEY && break;
                sleep 1;
              done
          ports:
            - containerPort: 8000
            - containerPort: 6379
          resources:
            limits:
              cpu: "10"
              memory: 60G
              nvidia.com/gpu: "8"
            requests:
              cpu: "2"
              memory: 6G
              nvidia.com/gpu: "8"
          volumeMounts:
            - mountPath: /root/.cache/huggingface
              name: model
              mountPropagation: HostToContainer
            - name: shm
              mountPath: /dev/shm
Deploy the Ray worker node:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1-70b-worker
  namespace: default
  labels:
    app: deepseek-r1-70b-worker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-r1-70b-worker
  template:
    metadata:
      labels:
        app: deepseek-r1-70b-worker
    spec:
      # Mount the model into the container as a persistent volume
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: juicefs-pvc
        # vLLM needs to access the host's shared memory for tensor parallel inference.
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "20Gi"
      containers:
        - name: deepseek-r1-70b-worker
          image: hub.kce.ksyun.com/ksyun/vllm-openai:0.6.6
          envFrom:
            - configMapRef:
                name: deepseek-70b-config
          # Join the Ray cluster started by the head node; vLLM on the head uses this node's GPUs
          command:
            - /bin/sh
            - -c
            - |
              ray start --address=${IP_OF_HEAD_NODE}:6379 --block
          ports:
            - containerPort: 8000
          resources:
            limits:
              cpu: "10"
              memory: 60G
              nvidia.com/gpu: "8"
            requests:
              cpu: "2"
              memory: 6G
              nvidia.com/gpu: "8"
          volumeMounts:
            - mountPath: /root/.cache/huggingface
              name: model
              mountPropagation: HostToContainer
            - name: shm
              mountPath: /dev/shm
Check the model startup status.
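For example (replace <head-pod-name> with the actual pod name from kubectl get pods):
# Confirm the head and worker pods are running
kubectl get pods -l 'app in (deepseek-r1-70b-head, deepseek-r1-70b-worker)'
# Verify that both Ray nodes have joined the cluster
kubectl exec -it <head-pod-name> -- ray status
# Follow the head logs; vLLM starts once the expected number of Ray nodes is active
kubectl logs -f <head-pod-name>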
Deploy Open WebUI with Helm:
helm repo add open-webui https://open-webui.github.io/helm-charts
helm repo update
helm install openwebui open-webui/open-webui
Access the Open WebUI Service (see: Accessing a Service through Kingsoft Cloud load balancing), then connect the WebUI to DeepSeek-R1-70B as in the single-node deployment, this time pointing the connection at the head node's Service (for example http://deepseek-70b-head.default.svc.cluster.local/v1).