CCE Support for GPUSharing Clusters

          Introduction to K8S GPUSharing

          GPU scheduling in K8S based on nvidia-device-plugin normally uses a whole "GPU card" as the smallest schedulable unit, so every Pod is bound to at least one card. This provides good isolation, but falls short in the following scenarios:

          1. In AI development and inference scenarios, GPU utilization is often low; letting multiple Pods share one card improves utilization;
          2. A K8S cluster may mix several types of GPU cards whose compute power differs significantly, so scheduling needs to take the card type into account.

          For these reasons, CCE has opened up its internal KongMing GPUSharing solution and now offers a GPUSharing feature that both lets multiple Pods share a single GPU card and schedules workloads by card type.

          Using GPUSharing on CCE

          Creating a new cluster

          CCE supports creating a GPUSharing cluster directly. Go through the normal cluster-creation flow and select your parameters, then switch to "custom cluster configuration" mode before submitting.

          In the custom configuration, change clusterType to gpuShare and submit the creation request directly.

          Note: a dedicated GPUSharing cluster type will be supported directly in a later release, which will make this more convenient.

          Existing clusters

          For an existing cluster, you can modify the component configuration yourself as described below; back up the configuration before changing it. All of the following operations are performed on the Master machines, and only custom-type clusters are supported.

          Deploy the extender-scheduler

          Modify the /etc/kubernetes/scheduler-policy.json configuration

          Back up the existing configuration:

          $ cp /etc/kubernetes/scheduler-policy.json /etc/kubernetes/scheduler-policy.json.bak

          Edit scheduler-policy.json. The configuration below covers common GPU card types such as v100, k40, p40, and p4; adjust it to your actual card types (for example, a node with T4 cards, as in the sample output later in this document, would need matching baidu.com/t4_cgpu_* entries):

          {
            "kind": "Policy",
            "apiVersion": "v1",
            "predicates": [{"name":"PodFitsHostPorts"},{"name":"PodFitsResources"},{"name":"NoDiskConflict"},{"name":"CheckVolumeBinding"},{"name":"NoVolumeZoneConflict"},{"name":"MatchNodeSelector"},{"name":"HostName"}],
            "priorities": [{"name":"ServiceSpreadingPriority","weight":1},{"name":"EqualPriority","weight":1},{"name":"LeastRequestedPriority","weight":1},{"name":"BalancedResourceAllocation","weight":1}],
            "extenders":[
              {
                "urlPrefix":"http://127.0.0.1:39999/gpushare-scheduler",
                "filterVerb":"filter",
                "bindVerb":"bind",
                "enableHttps":false,
                "nodeCacheCapable":true,
                "ignorable":false,
                "managedResources":[
                  {
                    "name":"baidu.com/v100_cgpu_memory",
                    "ignoredByScheduler":false
                  },
                  {
                    "name":"baidu.com/v100_cgpu_core",
                    "ignoredByScheduler":false
                  },
                  {
                    "name":"baidu.com/k40_cgpu_memory",
                    "ignoredByScheduler":false
                  },
                  {
                    "name":"baidu.com/k40_cgpu_core",
                    "ignoredByScheduler":false
                  },
                  {
                    "name":"baidu.com/p40_cgpu_memory",
                    "ignoredByScheduler":false
                  },
                  {
                    "name":"baidu.com/p40_cgpu_core",
                    "ignoredByScheduler":false
                  },
                  {
                    "name":"baidu.com/p4_cgpu_memory",
                    "ignoredByScheduler":false
                  },
                  {
                    "name":"baidu.com/p4_cgpu_core",
                    "ignoredByScheduler":false
                  }
                ]
              }
            ],
            "hardPodAffinitySymmetricWeight": 10
          }
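
          Before restarting anything, it is worth a quick check that the edited file is still valid JSON, since a malformed policy file will keep the scheduler from loading it. For example:

          $ python -m json.tool /etc/kubernetes/scheduler-policy.json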

          Modify the /etc/systemd/system/kube-extender-scheduler.service configuration

          [Unit]
          Description=Kubernetes Extender Scheduler
          After=network.target
          After=kube-apiserver.service
          After=kube-scheduler.service
          
          [Service]
          Environment=KUBECONFIG=/etc/kubernetes/admin.conf
          
          ExecStart=/opt/kube/bin/kube-extender-scheduler \
          --logtostderr \
          --policy-config-file=/etc/kubernetes/scheduler-policy.json \
          --mps=false  \
          --core=100  \
          --health-check=true \
          --memory-unit=GiB \
          --mem-quota-env-name=GPU_MEMORY \
          --compute-quota-env-name=GPU_COMPUTATION \
          --v=6
          Restart=always
          Type=simple
          LimitNOFILE=65536
          
          [Install]
          WantedBy=multi-user.target

          Download and start the extender-scheduler

          Binary download addresses vary by region (the URL below appears to be the bj region address). Download the binary:

          $ wget -q -O /opt/kube/bin/kube-extender-scheduler http://baidu-container.bj.bcebos.com/packages/gpu-extender/nvidia-share-extender-scheduler

          Make the binary executable and start the extender-scheduler service:

          $ chmod +x /opt/kube/bin/kube-extender-scheduler

          $ systemctl daemon-reload

          $ systemctl enable kube-extender-scheduler.service

          $ systemctl restart kube-extender-scheduler.service
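
          To confirm the extender-scheduler came up cleanly, check the service state and tail its log (with --v=6 as configured above, the log is fairly verbose):

          $ systemctl status kube-extender-scheduler.service

          $ journalctl -u kube-extender-scheduler.service --no-pager | tail -n 50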

          Restart kube-scheduler

          $ systemctl restart kube-scheduler.service

          A Master is typically deployed as 3 replicas; complete the steps above on each Master in turn. A quick way to verify each one after the restarts is shown below.
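
          On each Master, check that both scheduler services are active:

          $ systemctl is-active kube-scheduler.service kube-extender-scheduler.service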

          Deploy the device-plugin

          Back up the existing nvidia-device-plugin DaemonSet, then delete it:

          $ kubectl get ds nvidia-device-plugin-daemonset -n kube-system -o yaml > nvidia-device-plugin.yaml
          
          $ kubectl delete ds nvidia-device-plugin-daemonset -n kube-system
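
          Before deploying the replacement, you can confirm the old plugin is gone; the following should return no matches:

          $ kubectl get ds -n kube-system | grep nvidia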

          Deploy the kongming-device-plugin; the all-in-one YAML is as follows:

          # RBAC authn and authz
          apiVersion: v1
          kind: ServiceAccount
          metadata:
            name: cce-gpushare-device-plugin
            namespace: kube-system
            labels:
              k8s-app: cce-gpushare-device-plugin
              kubernetes.io/cluster-service: "true"
              addonmanager.kubernetes.io/mode: Reconcile
          
          ---
          kind: ClusterRole
          apiVersion: rbac.authorization.k8s.io/v1
          metadata:
            name: cce-gpushare-device-plugin
            labels:
              k8s-app: cce-gpushare-device-plugin
              kubernetes.io/cluster-service: "true"
              addonmanager.kubernetes.io/mode: Reconcile
          rules:
            - apiGroups:
                - ""
              resources:
                - nodes
              verbs:
                - get
                - list
                - watch
            - apiGroups:
                - ""
              resources:
                - events
              verbs:
                - create
                - patch
            - apiGroups:
                - ""
              resources:
                - pods
              verbs:
                - update
                - patch
                - get
                - list
                - watch
            - apiGroups:
                - ""
              resources:
                - nodes/status
              verbs:
                - patch
                - update
          
          ---
          kind: ClusterRoleBinding
          apiVersion: rbac.authorization.k8s.io/v1
          metadata:
            namespace: kube-system
            name: cce-gpushare-device-plugin
            labels:
              k8s-app: cce-gpushare-device-plugin
              kubernetes.io/cluster-service: "true"
              addonmanager.kubernetes.io/mode: Reconcile
          subjects:
            - kind: ServiceAccount
              name: cce-gpushare-device-plugin
              namespace: kube-system
              apiGroup: ""
          roleRef:
            kind: ClusterRole
            name: cce-gpushare-device-plugin
            apiGroup: ""
          
          ---
          apiVersion: apps/v1
          kind: DaemonSet
          metadata:
            namespace: kube-system
            name: cce-gpushare-device-plugin
            labels:
              app: cce-gpushare-device-plugin
          spec:
            updateStrategy:
              type: RollingUpdate
            selector:
              matchLabels:
                app: cce-gpushare-device-plugin
            template:
              metadata:
                labels:
                  app: cce-gpushare-device-plugin
              spec:
                serviceAccountName: cce-gpushare-device-plugin
                nodeSelector:
                  beta.kubernetes.io/instance-type: GPU
                containers:
                  - name: cce-gpushare-device-plugin
                    image: hub.baidubce.com/jpaas-public/cce-nvidia-share-device-plugin:v0
                    imagePullPolicy: Always
                    args:
                      - --logtostderr
                      - --mps=false
                      - --core=100
                      - --health-check=true
                      - --memory-unit=GiB
                      - --mem-quota-env-name=GPU_MEMORY
                      - --compute-quota-env-name=GPU_COMPUTATION
                      - --gpu-type=baidu.com/gpu_k40_4,baidu.com/gpu_k40_16,baidu.com/gpu_p40_8,baidu.com/gpu_v100_8,baidu.com/gpu_p4_4
                      - --v=1
                    resources:
                      limits:
                        memory: "300Mi"
                        cpu: "1"
                      requests:
                        memory: "300Mi"
                        cpu: "1"
                    env:
                      - name: NODE_NAME
                        valueFrom:
                          fieldRef:
                            fieldPath: spec.nodeName
                    securityContext:
                      allowPrivilegeEscalation: false
                      capabilities:
                        drop: ["ALL"]
                    volumeMounts:
                      - name: device-plugin
                        mountPath: /var/lib/kubelet/device-plugins
                volumes:
                  - name: device-plugin
                    hostPath:
                      path: /var/lib/kubelet/device-plugins
                dnsPolicy: ClusterFirst
                hostNetwork: true
                restartPolicy: Always
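
          Save the manifest to a file (the name below is arbitrary), apply it, and check that the DaemonSet pods start on the GPU nodes:

          $ kubectl apply -f cce-gpushare-device-plugin.yaml   # file name is arbitrary

          $ kubectl get pods -n kube-system -l app=cce-gpushare-device-plugin -o wide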

          Check Node resources

          Run kubectl get node -o yaml and you should see the new GPU resources on the node (a single T4 card in this example):

            allocatable:
              baidu.com/gpu-count: "1"
              baidu.com/t4_cgpu_core: "100"
              baidu.com/t4_cgpu_memory: "14"
              cpu: 23870m
              ephemeral-storage: "631750310891"
              hugepages-1Gi: "0"
              hugepages-2Mi: "0"
              memory: "65813636449"
              pods: "256"
            capacity:
              baidu.com/gpu-count: "1"
              baidu.com/t4_cgpu_core: "100"
              baidu.com/t4_cgpu_memory: "14"
              cpu: "24"
              ephemeral-storage: 685492960Ki
              hugepages-1Gi: "0"
              hugepages-2Mi: "0"
              memory: 74232212Ki
              pods: "256"
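
          Rather than reading the full YAML, a grep over the describe output also surfaces just the shared-GPU resources across all nodes:

          $ kubectl describe nodes | grep -E 'cgpu|gpu-count'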

          Submit a test job

          Submit a test job that requests a slice of a shared GPU:

          apiVersion: v1
          kind: ReplicationController
          metadata:
            name: paddlebook
          spec:
            replicas: 1
            selector:
              app: paddlebook
            template:
              metadata:
                name: paddlebook
                labels:
                  app: paddlebook
              spec:
                containers:
                - name: paddlebook
                  image: hub.baidubce.com/cce/tensorflow:gpu-benckmarks
                  command: ["/bin/sh", "-c", "sleep 3600"]
                  #command: ["/bin/sh", "-c", "python /root/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server"]
                  resources:
                    requests:
                      baidu.com/t4_cgpu_core: 10
                      baidu.com/t4_cgpu_memory: 2
                    limits:
                      baidu.com/t4_cgpu_core: 10
                      baidu.com/t4_cgpu_memory: 2
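
          Assuming the manifest is saved as paddlebook.yaml (an arbitrary name), create the controller and then check inside the container for the quota environment variables named by the --mem-quota-env-name and --compute-quota-env-name flags configured above:

          $ kubectl apply -f paddlebook.yaml   # file name is arbitrary

          $ kubectl get pods -l app=paddlebook

          $ kubectl exec <paddlebook-pod-name> -- env | grep GPU   # expect GPU_MEMORY / GPU_COMPUTATION, per the flags above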