Before deployment, ensure that nvidia-driver and nvidia-docker are installed on your Kubernetes nodes, and Docker’s default runtime has been set to nvidia.
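With nvidia-docker2 installed, setting Docker's default runtime to nvidia is done in `/etc/docker/daemon.json` (restart the Docker daemon afterward). A typical configuration looks like this; the runtime path may differ on your distribution:

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```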
$ kubectl apply -f 3.yaml -n test
$ kubectl inspect gpushare
NAME          IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
sd-cluster-04 192.168.1.214  8/14                   8/14
sd-cluster-05 192.168.1.215 12/14 12/14
----------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
20/28 (71%)
$ kubectl get pod -n test
NAME                         READY  STATUS   RESTARTS  AGE
binpack-1-6d6955c487-j4c4b 1/1 Running 0 28m
binpack-2-58579b95f7-4wpbl 1/1 Running 0 27m
binpack-2-58579b95f7-sjhwt 1/1 Running 0 27m
binpack-3-556bbd84f9-9xqg7 1/1 Running 0 14m
$ kubectl logs -f binpack-3-556bbd84f9-9xqg7 -n test
ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=2
2021-08-13 03:01:53.897423: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2021-08-13 03:01:54.008665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:af:00.0
totalMemory: 14.75GiB freeMemory: 7.08GiB
2021-08-13 03:01:54.008716: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0000:af:00.0, compute capability: 7.5)
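The `ALIYUN_COM_GPU_MEM_DEV` and `ALIYUN_COM_GPU_MEM_CONTAINER` variables shown in the log are injected into the container by the gpushare device plugin; the memory limit is not enforced by the plugin itself, so the workload is expected to self-limit using them. A minimal sketch of how a container might turn those variables into a memory fraction (the function name and default are illustrative, not from the plugin):

```python
import os

def gpu_mem_fraction(default_total_gib=14):
    """Fraction of the card's memory this container was granted,
    derived from the env vars injected by gpushare-device-plugin."""
    total = int(os.environ.get("ALIYUN_COM_GPU_MEM_DEV", default_total_gib))
    granted = int(os.environ.get("ALIYUN_COM_GPU_MEM_CONTAINER", total))
    return granted / total
```

In a TensorFlow 1.x workload like the one logged above, this fraction could then be passed to `tf.GPUOptions(per_process_gpu_memory_fraction=...)` so the process stays within its granted share.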
After deploying the third application, 8GiB of GPU memory remains unallocated cluster-wide (28 − 20). However, a single task can request at most 6GiB, because the free memory is split across two cards (6GiB on sd-cluster-04, 2GiB on sd-cluster-05) and a single task cannot span multiple GPU cards.
4. Deploy Fourth Application
Request 5GiB of GPU memory; the pod should be scheduled on sd-cluster-04, the only node with enough free memory on a single card.
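The request is expressed as a resource limit on the extended resource `aliyun.com/gpu-mem`. A hypothetical sketch of what `4.yaml` might contain (the image and labels are assumptions, not taken from the article):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-4
spec:
  replicas: 1
  selector:
    matchLabels:
      app: binpack-4
  template:
    metadata:
      labels:
        app: binpack-4
    spec:
      containers:
      - name: binpack-4
        image: cheyang/gpu-player:v2   # demo image; substitute your workload
        resources:
          limits:
            aliyun.com/gpu-mem: 5     # GiB, counted against a shared card, not whole GPUs
```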
$ kubectl apply -f 4.yaml -n test
$ kubectl inspect gpushare
NAME          IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
sd-cluster-04 192.168.1.214  13/14                  13/14
sd-cluster-05 192.168.1.215 12/14 12/14
------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
25/28 (89%)
$ kubectl get pod -n test
NAME                         READY  STATUS   RESTARTS  AGE
binpack-1-6d6955c487-j4c4b 1/1 Running 0 26m
binpack-2-58579b95f7-4wpbl 1/1 Running 0 24m
binpack-2-58579b95f7-sjhwt 1/1 Running 0 24m
binpack-3-556bbd84f9-9xqg7 1/1 Running 0 11m
binpack-4-6956458f85-cv62j 1/1 Running 0 6s
$ kubectl logs -f binpack-4-6956458f85-cv62j -n test
ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=5
2021-08-13 03:13:20.208122: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2021-08-13 03:13:20.361391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:af:00.0
totalMemory: 14.75GiB freeMemory: 6.46GiB
2021-08-13 03:13:20.361481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0000:af:00.0, compute capability: 7.5)
III. Summary
The gpushare-device-plugin has the following limitations:
A single task cannot use shared GPU memory spread across multiple machines.
GPU resource allocation cannot be based on utilization percentage within a single GPU.
However, it is sufficient for the algorithm team's model-testing scenarios. Two alternative GPU sharing solutions exist; they are not covered here, so refer directly to their official repositories if needed: