A new AI project was recently launched that primarily provides online AI experiments for universities. A GPU server was purchased for the project, but it has only a single NVIDIA Tesla T4 card, which needs to support multiple students running experiments online at the same time.

The online experiment system runs on Kubernetes, so GPU sharing needs to work in a k8s environment. We had previously tested Alibaba Cloud's GPU sharing solution, so here I will simply record the steps for using it:

Kubernetes cluster version: 1.23.1

Adjusting the K8S Scheduler

Starting with 1.23, kube-scheduler no longer accepts the legacy scheduler Policy API (the --policy-config-file flag was removed), so the previous deployment method no longer works; the extender has to be registered through a KubeSchedulerConfiguration file instead.

Refer to: https://kubernetes.io/docs/reference/scheduling/policies/

git clone https://github.com/AliyunContainerService/gpushare-scheduler-extender.git
cp gpushare-scheduler-extender/config/scheduler-policy-config.yaml /etc/kubernetes/

Since my k8s was installed via kubeasz, the kubeconfig files live under /etc/kubernetes/; the scheduler uses /etc/kubernetes/kube-scheduler.kubeconfig, as referenced below. Update /etc/kubernetes/scheduler-policy-config.yaml as follows:

---
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/kube-scheduler.kubeconfig
extenders:
- urlPrefix: "http://192.168.233.101:32766/gpushare-scheduler"
  filterVerb: filter
  bindVerb: bind
  enableHTTPS: false
  nodeCacheCapable: true
  managedResources:
  - name: aliyun.com/gpu-mem
    ignoredByScheduler: false
  ignorable: false
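
A note on the extender URL: 32766 is the NodePort that the upstream gpushare-schd-extender.yaml exposes (worth verifying against your copy of the manifest), and 192.168.233.101 should be an address on which that NodePort is reachable. Once the extender is deployed in a later step, the port can be double-checked:

kubectl get svc -n kube-system gpushare-schd-extender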

Then modify the kube-scheduler startup parameters in /etc/systemd/system/kube-scheduler.service so that the scheduler loads this configuration via --config:

[Unit]
Description=Kubernetes Scheduler
Documentation=https://github.com/GoogleCloudPlatform/kubernetes

[Service]
ExecStart=/opt/kube/bin/kube-scheduler \
  --authentication-kubeconfig=/etc/kubernetes/kube-scheduler.kubeconfig \
  --authorization-kubeconfig=/etc/kubernetes/kube-scheduler.kubeconfig \
  --bind-address=0.0.0.0 \
  --kubeconfig=/etc/kubernetes/kube-scheduler.kubeconfig \
  --config=/etc/kubernetes/scheduler-policy-config.yaml \
  --leader-elect=true \
  --v=2
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Reload systemd and restart the scheduler:

systemctl daemon-reload
systemctl restart kube-scheduler.service
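
To confirm the scheduler came back up with the new configuration, check the unit status and skim the recent log lines (generic systemd checks; adjust the unit name if yours differs):

systemctl status kube-scheduler.service --no-pager
journalctl -u kube-scheduler.service --no-pager -n 20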

Installing the Scheduler Extender

Label the GPU machine (mynode is a placeholder for the GPU node's actual name):

kubectl label node mynode gpushare=true
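
The gpushare manifests select nodes by this label (as far as I can tell from the upstream device-plugin DaemonSet), so it is worth confirming it was applied:

kubectl get nodes -L gpushare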

Then deploy the scheduler extender from the repo's config directory:

cd gpushare-scheduler-extender/config
kubectl apply -f gpushare-schd-extender.yaml

Installing the Plugin

kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml

After installation, it should look like this:

# kubectl get pod -n kube-system | grep gpushare
gpushare-device-plugin-ds-j8blj              1/1     Running   0                30h
gpushare-schd-extender-74796c5f64-7g4bl      1/1     Running   0                29h
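
Once the device plugin is up, the GPU node should advertise the virtual aliyun.com/gpu-mem resource in its capacity and allocatable fields; the value is in GiB. A quick check (node name from my environment):

kubectl describe node 192.168.233.101 | grep -i gpu-mem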

Testing the GPU

Modify samples/1.yaml as follows:

apiVersion: apps/v1
kind: Deployment

metadata:
  name: binpack-1
  labels:
    app: binpack-1

spec:
  replicas: 1

  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1

  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-1

    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # GiB
            aliyun.com/gpu-count: 1
            aliyun.com/gpu-mem: 5

Apply the manifest and check the pod logs:
# kubectl apply -f  samples/1.yaml
deployment.apps/binpack-1 created
# kubectl get pod  |grep binpack
binpack-1-9995bdf69-pk2d4                 1/1     Running   0             12s
# kubectl logs -f binpack-1-9995bdf69-pk2d4
ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=5
2023-06-30 09:40:50.890296: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2023-06-30 09:40:50.976283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:03:00.0
totalMemory: 14.75GiB freeMemory: 14.66GiB
2023-06-30 09:40:50.976313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0000:03:00.0, compute capability: 7.5)

From the logs, GPU sharing is working: the container was allocated a 5 GiB slice (ALIYUN_COM_GPU_MEM_CONTAINER=5) of the card's 14 GiB (ALIYUN_COM_GPU_MEM_DEV=14). Note that this quota applies at scheduling time only; the gpushare solution does not isolate GPU memory between containers, so applications are expected to cap their own usage based on these environment variables.
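
To see the scheduling-level accounting in action, you can scale the deployment: with 14 GiB allocatable and 5 GiB per replica, two replicas fit on the card and a third should stay Pending (a hypothetical walk-through, assuming nothing else is holding GPU memory):

kubectl scale deployment binpack-1 --replicas=3
kubectl get pod | grep binpack   # expect two Running, one Pending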

You can also inspect allocations with the kubectl plugin kubectl-inspect-gpushare:

# cd /usr/bin/
# wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
# chmod u+x /usr/bin/kubectl-inspect-gpushare
# kubectl-inspect-gpushare
NAME             IPADDRESS        GPU0(Allocated/Total)  GPU Memory(GiB)
192.168.233.101  192.168.233.101  9/14                   9/14
--------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
9/14 (64%)
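
Since the binary sits on the PATH with the kubectl- prefix, kubectl's plugin discovery should also pick it up, making the same report available as a subcommand:

kubectl inspect gpushare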

At this point, GPU card sharing is in place; the next step is to run TensorFlow workloads scheduled onto slices of the shared card.

Reference documentation:

https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/install.md