If you have GPU-dependent workloads (machine learning, for example) that need to be highly available or run as batch jobs, you probably want a Kubernetes cluster to manage them. In this tutorial I will demonstrate how to build a very basic Kubernetes GPU cluster in GCP.
To create the cluster you can either go through the Google Cloud GUI or run a CLI command. In this example I'm using the CLI.
Run the command below, but make sure to add your own project ID and your computer's public IP. The IP restriction is for extra security, so that no one besides you can log in to the cluster.
Also, in this example I'm using an nvidia-tesla-a100; you can change that based on your needs.
Price list of GPUs in GCP : https://cloud.google.com/compute/gpus-pricing
gcloud beta container --project "<your-project-id>" clusters create "gpu-cluster-1" \
  --zone "us-central1-a" \
  --no-enable-basic-auth \
  --cluster-version "1.18.16-gke.502" \
  --release-channel "regular" \
  --machine-type "a2-highgpu-1g" \
  --accelerator "type=nvidia-tesla-a100,count=1" \
  --image-type "ubuntu" \
  --disk-type "pd-standard" \
  --disk-size "100" \
  --metadata disable-legacy-endpoints=true \
  --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
  --preemptible \
  --num-nodes "1" \
  --enable-stackdriver-kubernetes \
  --enable-private-nodes \
  --master-ipv4-cidr "172.16.0.0/28" \
  --enable-ip-alias \
  --network "projects/<your-project-id>/global/networks/default" \
  --subnetwork "projects/<your-project-id>/regions/us-central1/subnetworks/default" \
  --default-max-pods-per-node "110" \
  --enable-master-authorized-networks \
  --master-authorized-networks <your-computer-public-ip> \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing \
  --enable-autoupgrade \
  --enable-autorepair \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade 0 \
  --node-locations "us-central1-a" \
  --enable-shielded-nodes
Note: make sure you have the gcloud CLI installed.
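If you want to sanity-check your local setup before running the create command, something like this should work (the project ID is a placeholder you replace with your own):

```shell
# Confirm the gcloud CLI is installed and print its version
gcloud version

# Authenticate and point gcloud at your project
# (<your-project-id> is a placeholder, same as in the cluster command)
gcloud auth login
gcloud config set project <your-project-id>
```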
After that you need to add a NAT gateway so that the private nodes can download from the internet:
- Use the default VPC, since that's where we deployed the cluster
- Choose the region the cluster is located in, in our case “us-central1”
- Create a new router and leave the rest unchanged.
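The Console steps above can also be sketched with gcloud; the router and NAT config names here (nat-router, nat-config) are just placeholders of my choosing:

```shell
# Create a Cloud Router in the default VPC, in the cluster's region
gcloud compute routers create nat-router \
  --network default \
  --region us-central1

# Attach a Cloud NAT config so the private nodes can reach the internet
gcloud compute routers nats create nat-config \
  --router nat-router \
  --region us-central1 \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges
```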
Now that your cluster is up and running, click on “Connect” in the Console to get the CLI command for connecting to your cluster.
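The “Connect” button typically gives you a gcloud command along these lines (the names match the cluster created above; the project ID is your own):

```shell
# Fetch cluster credentials so kubectl can talk to the cluster
gcloud container clusters get-credentials gpu-cluster-1 \
  --zone us-central1-a \
  --project <your-project-id>
```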
Note: At this point make sure you have kubectl installed
Now connect and run the command below to install the NVIDIA drivers:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
Now list the pods in the kube-system namespace (where the driver installer DaemonSet runs) and wait until the NVIDIA driver installer pods show as Running:
kubectl get pods -n kube-system
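Instead of polling manually, you can wait for the rollout to finish. This assumes the DaemonSet keeps its upstream name, nvidia-driver-installer, from the manifest applied above:

```shell
# Block until the driver installer DaemonSet has finished rolling out
kubectl rollout status daemonset/nvidia-driver-installer -n kube-system
```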
Now it's time to test the driver.
Save the content below as test.yml:
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0-runtime-ubuntu18.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1
Now deploy it:
kubectl apply -f test.yml
Test to see if the driver is operational
kubectl exec --stdin --tty my-gpu-pod -- nvidia-smi
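If nvidia-smi prints the GPU table, the driver is working. As an extra check, you can also confirm the node itself advertises the GPU resource; this is a generic kubectl pattern, not specific to this setup:

```shell
# Confirm the node reports an allocatable nvidia.com/gpu resource
kubectl describe nodes | grep -A 2 "nvidia.com/gpu"
```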
Congratulations, everything is now set up!