A new AI project was recently launched, primarily providing online AI experiments for universities. The project also purchased a GPU server, but it has only one Nvidia Tesla T4 card, which has to support multiple students running experiments online at the same time.
The online experiment system runs on Kubernetes, so GPU sharing has to work in the k8s environment. We had previously tested the Alibaba Cloud GPU card sharing (gpushare) solution; here I will just record the steps for using it:
Kubernetes cluster version: 1.23.1
Adjusting the K8S Scheduler
Starting from 1.23, kube-scheduler no longer supports the old scheduling Policy API, so the previous way of deploying the gpushare scheduler extender no longer works: the extender now has to be declared through a KubeSchedulerConfiguration file.
Since my k8s was installed via kubeasz, the kubeconfig is located at /etc/kubernetes/kubelet.kubeconfig, and the scheduler configuration lives in /etc/kubernetes/scheduler-policy-config.yaml, which needs to be updated.
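With the Policy API gone, this file becomes a KubeSchedulerConfiguration that registers the gpushare extender. Below is a minimal sketch, assuming the gpushare-scheduler-extender from AliyunContainerService is already deployed and reachable on its default port 32766 (the extender URL and port are assumptions based on the gpushare defaults, not values taken from this cluster):

apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/kubelet.kubeconfig   # the kubeasz kubeconfig noted above
extenders:
- urlPrefix: "http://127.0.0.1:32766/gpushare-scheduler"   # assumed default extender address
  filterVerb: filter       # extender filters nodes by free GPU memory
  bindVerb: bind           # extender binds pods so it can track allocations
  enableHTTPS: false
  nodeCacheCapable: true
  managedResources:
  - name: aliyun.com/gpu-mem   # the extended resource requested by the pods below
    ignoredByScheduler: false
  ignorable: false

kube-scheduler then has to be restarted with --config pointing at this file for the change to take effect.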
With the scheduler reconfigured, GPU sharing can be tested with the sample deployment samples/1.yaml, which requests a 5 GiB slice of the card:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-1
  labels:
    app: binpack-1
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-1
    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # GiB
            aliyun.com/gpu-count: 1
            aliyun.com/gpu-mem: 5
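Before applying it, you can check that the gpushare device plugin is actually advertising these extended resources on the node (the node name here is the one that appears in the inspect output further down):

# kubectl describe node 192.168.233.101 | grep aliyun.com

This should list aliyun.com/gpu-mem: 14 (and aliyun.com/gpu-count: 1) under Capacity, matching the totals reported later by the inspect tool.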
# kubectl apply -f samples/1.yaml
deployment.apps/binpack-1 created
# kubectl get pod | grep binpack
binpack-1-9995bdf69-pk2d4   1/1   Running   0   12s
# kubectl logs -f binpack-1-9995bdf69-pk2d4
ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=5
2023-06-30 09:40:50.890296: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2023-06-30 09:40:50.976283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:03:00.0
totalMemory: 14.75GiB freeMemory: 14.66GiB
2023-06-30 09:40:50.976313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0000:03:00.0, compute capability: 7.5)
The logs show that GPU sharing is working: the container was allocated 5 GiB of the card's 14 GiB (ALIYUN_COM_GPU_MEM_CONTAINER=5 out of ALIYUN_COM_GPU_MEM_DEV=14). Note that TensorFlow still reports the full 14.75 GiB of the T4: gpushare enforces the quota at scheduling time, not inside the container, so the application is expected to limit itself to the amount exposed through these environment variables.
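The same allocation can also be confirmed directly from the running pod without tailing the logs, since the two gpushare variables are just part of the container's environment:

# kubectl exec binpack-1-9995bdf69-pk2d4 -- env | grep ALIYUN
ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=5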
You can also check the allocations with the kubectl plugin kubectl-inspect-gpushare:
# cd /usr/bin/
# wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
# chmod u+x /usr/bin/kubectl-inspect-gpushare
# kubectl-inspect-gpushare
NAME             IPADDRESS        GPU0(Allocated/Total)  GPU Memory(GiB)
192.168.233.101  192.168.233.101  9/14                   9/14
--------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
9/14 (64%)
At this point, GPU card sharing is in place; the next step is to schedule the students' TensorFlow experiments onto the shared card.
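As a sketch of what such a workload could look like (the deployment name, image tag, and 4 GiB figure below are illustrative assumptions, not part of this setup), each experiment simply requests its own slice of GPU memory and the gpushare scheduler packs it onto the shared T4:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: student-notebook            # hypothetical per-student workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: student-notebook
  template:
    metadata:
      labels:
        app: student-notebook
    spec:
      containers:
      - name: notebook
        image: tensorflow/tensorflow:2.12.0-gpu-jupyter   # assumed public TF image
        resources:
          limits:
            aliyun.com/gpu-mem: 4   # GiB slice of the shared card

Each such deployment draws from the same 14 GiB pool shown by kubectl-inspect-gpushare, so several experiments can run on the single T4 at the same time.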