How to create a Kubernetes cluster with preemptible GPUs in GCP

Sam A
Apr 20, 2021

If you have GPU-dependent workloads (machine learning, for example) that need to be highly available or run as batch jobs, you probably want a Kubernetes cluster to manage them. In this tutorial I will demonstrate how to build a very basic Kubernetes GPU cluster in GCP.

To create the cluster, you can either go through the Google Cloud console or run a gcloud CLI command.

In this example I'm using the CLI command.

Run the command below, but make sure to add your own project ID and your computer's public IP. Restricting access to your IP adds extra security, so that no one other than you can log in to the cluster.

Also, in this example I'm using nvidia-tesla-a100; you can change that based on your needs.

Price list of GPUs in GCP : https://cloud.google.com/compute/gpus-pricing
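If you want to see which accelerator types are actually available before picking one, gcloud can list them per zone; something like the following (using the same zone as this example) should work:

gcloud compute accelerator-types list --filter="zone:us-central1-a"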

gcloud beta container --project "<your-project-id>" clusters create "gpu-cluster-1" \
  --zone "us-central1-a" \
  --no-enable-basic-auth \
  --cluster-version "1.18.16-gke.502" \
  --release-channel "regular" \
  --machine-type "a2-highgpu-1g" \
  --accelerator "type=nvidia-tesla-a100,count=1" \
  --image-type "ubuntu" \
  --disk-type "pd-standard" \
  --disk-size "100" \
  --metadata disable-legacy-endpoints=true \
  --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
  --preemptible \
  --num-nodes "1" \
  --enable-stackdriver-kubernetes \
  --enable-private-nodes \
  --master-ipv4-cidr "172.16.0.0/28" \
  --enable-ip-alias \
  --network "projects/<your-project-id>/global/networks/default" \
  --subnetwork "projects/<your-project-id>/regions/us-central1/subnetworks/default" \
  --default-max-pods-per-node "110" \
  --enable-master-authorized-networks \
  --master-authorized-networks <your-computer-public-ip> \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing \
  --enable-autoupgrade \
  --enable-autorepair \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade 0 \
  --node-locations "us-central1-a" \
  --enable-shielded-nodes

Note: make sure you have the gcloud CLI installed
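If you haven't used gcloud before, a quick sanity check along these lines confirms it is installed, authenticated, and pointed at the right project (the project ID is a placeholder, as above):

gcloud --version
gcloud auth login
gcloud config set project <your-project-id>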

After that, you need to add a Cloud NAT gateway so the private nodes can download from the internet:

  1. Use the default VPC, since that's where we deployed the cluster.
  2. Choose the region the cluster is located in, in our case "us-central1".
  3. Create a new router and leave the rest unchanged (the equivalent gcloud commands are sketched below this list).
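If you prefer to stay on the command line, something like the following sets up the router and NAT gateway; the names nat-router and nat-config are just placeholders I picked for this sketch:

# Create a Cloud Router in the default network
gcloud compute routers create nat-router \
  --network default \
  --region us-central1

# Attach a NAT configuration so the private nodes can reach the internet
gcloud compute routers nats create nat-config \
  --router nat-router \
  --region us-central1 \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges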

Now that your cluster is up and running, click Connect in the console to get the CLI command for connecting to your cluster.

Note: At this point make sure you have kubectl installed
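The Connect button gives you a gcloud command along these lines (shown here for the cluster name and zone used above), which fetches credentials and configures kubectl:

gcloud container clusters get-credentials gpu-cluster-1 \
  --zone us-central1-a \
  --project <your-project-id>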

Now log in and run the command below to install the NVIDIA drivers

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml

Now run get pods and wait until the NVIDIA driver installer pods show as Running (the installer DaemonSet is deployed into the kube-system namespace)

kubectl get pods -n kube-system
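Once the driver installer has finished, the GPU should also appear in the node's allocatable resources; a quick way to check (assuming a single GPU node, as in this example):

kubectl describe nodes | grep -i "nvidia.com/gpu"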

Now it's time to test the driver

Save the content below as test.yml

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0-runtime-ubuntu18.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1

Now deploy it and test the driver

kubectl apply -f test.yml
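The pod may take a minute or two to pull the CUDA image; if you like, you can wait for it to become Ready before exec'ing into it (purely optional):

kubectl wait --for=condition=Ready pod/my-gpu-pod --timeout=300s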

Test to see if the driver is operational

kubectl exec --stdin --tty my-gpu-pod -- nvidia-smi
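If nvidia-smi prints the GPU details, the driver is working. You can then remove the test pod:

kubectl delete pod my-gpu-pod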

Congratulations, everything is now set up!


Sam A

Senior DevOps Consultant, tech enthusiast, and cloud automation expert who helps companies improve efficiency through automation