Deploy a batch system using Kueue
This tutorial shows you how to use Terraform to deploy a batch system on Google Kubernetes Engine (GKE), with Kueue performing the Job queueing.
Jobs are applications that run to completion, such as machine learning, rendering, simulation, analytics, CI/CD, and similar workloads.
Kueue is a Cloud Native Job scheduler that works with the default Kubernetes scheduler, the Job controller, and the cluster autoscaler to provide an end-to-end batch system. Kueue implements Job queueing, deciding when Jobs should wait and when they should start, based on quotas and a hierarchy for sharing resources fairly among teams.
Kueue has the following characteristics:
- It is optimized for cloud architectures, where resources are heterogeneous, interchangeable, and scalable.
- It provides a set of APIs to manage elastic quotas and manage Job queueing.
- It does not re-implement existing functionality such as autoscaling, pod scheduling, or Job lifecycle management.
- Kueue has built-in support for the Kubernetes batch/v1.Job API.
- It can integrate with other job APIs.
- Kueue refers to jobs defined with any API as Workloads, to avoid the confusion with the specific Kubernetes Job API.
When working with Kueue, you need to be familiar with a few concepts:
- ResourceFlavor: An object that you define to describe the resources available in a cluster. It is typically associated with the characteristics of a group of Nodes, such as availability, pricing, architecture, or model.
- ClusterQueue: A cluster-scoped resource that governs a pool of resources, defining usage limits and fair sharing rules.
- LocalQueue: A namespaced resource that groups closely related workloads belonging to a single tenant.
- Workload: An application that runs to completion. It is the unit of admission in Kueue and is sometimes referred to as a job.
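To make these concepts concrete, the following sketch prints manifests for a ResourceFlavor, a ClusterQueue, and a LocalQueue. It is an illustration only and not the Terraform code used by this tutorial. It assumes the `kueue.x-k8s.io/v1beta1` API version; the flavor name `default-flavor` and the quota values are hypothetical, while `cluster-queue` and `local-queue` are the names used by the commands later in this tutorial.

```shell
# Print a ResourceFlavor and a ClusterQueue that draws its quota from it.
# Assumption: flavor name and quota values are illustrative only.
render_cluster_queue() {
  cat <<EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 10
      - name: memory
        nominalQuota: 10Gi
EOF
}

# Print a LocalQueue in the given namespace that points at the ClusterQueue.
render_local_queue() {
  lq_namespace=$1
  cat <<EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: ${lq_namespace}
  name: local-queue
spec:
  clusterQueue: cluster-queue
EOF
}
```

For example, `render_local_queue team-a` prints a LocalQueue for the team-a namespace, which could be piped to `kubectl apply -f -` on a cluster where Kueue is already installed.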
Objectives
This tutorial is for cluster operators and other users who want to implement a batch system on Kubernetes. In this tutorial, you set up a shared cluster for two tenant teams. Each team has its own namespace in which it creates Jobs, and both teams share the same global resources, which are controlled by the corresponding quotas.
In this tutorial, you do the following using Terraform code available in a Git repository:
- Create a GKE cluster.
- Create a namespace for Kueue (kueue-system).
- Create a namespace for each team running batch jobs in the cluster (team-a, team-b).
- Install Kueue in the namespace created for it.
- Create the ResourceFlavor.
- Create the ClusterQueue.
- Create a LocalQueue for each of the teams in the corresponding namespace.
- Create, for each team, a manifest for a sample Job associated with the corresponding LocalQueue.
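A Job is associated with a LocalQueue through the `kueue.x-k8s.io/queue-name` label. The sketch below prints what such a sample Job manifest might look like. It is not the manifest shipped in the repository: the function name, the `busybox` image, the sleep duration, and the resource requests are all hypothetical.

```shell
# Print a sample Job manifest bound to a LocalQueue.
# Assumptions: image, command, and resource requests are illustrative only.
render_sample_job() {
  job_namespace=$1
  job_queue=${2:-local-queue}
  cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  namespace: ${job_namespace}
  generateName: sample-job-
  labels:
    kueue.x-k8s.io/queue-name: ${job_queue}
spec:
  suspend: true
  parallelism: 1
  completions: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: sleeper
        image: busybox:1.36
        command: ["sleep", "30"]
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
EOF
}
```

Because the manifest uses `generateName`, it would be submitted with `kubectl create` rather than `kubectl apply`, for example `render_sample_job team-a | kubectl create -f -`. The Job starts suspended, and Kueue unsuspends it once the Workload is admitted.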
Select or create a project
Create the Autopilot GKE cluster
- Change to the autopilot-cluster directory.
cd autopilot-cluster
- Create a new file named terraform.tfvars in that directory.
touch terraform.tfvars
- Open the file for editing.
- Paste the following content into the file and update the values as needed.
project_id = "<walkthrough-project-name/>"
cluster_name = "cluster"
cluster_create = {
  deletion_protection = false
}
region = "europe-west1"
vpc_create = {
  enable_cloud_nat = true
}
- Initialize the Terraform configuration.
terraform init
- Apply the Terraform configuration.
terraform apply
- Fetch the cluster credentials.
gcloud container fleet memberships get-credentials cluster --project "<walkthrough-project-name/>"
- Check that the cluster is ready by listing the system Pods.
kubectl get pods -n kube-system
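The three file steps above (create, open, paste) can also be collapsed into a single heredoc. The sketch below is a convenience illustration, not part of the repository: the function name `write_cluster_tfvars` and its arguments are hypothetical, and the values mirror the ones listed above.

```shell
# Write terraform.tfvars for the Autopilot cluster in one step.
# $1: project ID (required); $2: output path (defaults to ./terraform.tfvars).
write_cluster_tfvars() {
  tfvars_project=$1
  tfvars_file=${2:-terraform.tfvars}
  cat > "$tfvars_file" <<EOF
project_id = "${tfvars_project}"
cluster_name = "cluster"
cluster_create = {
  deletion_protection = false
}
region = "europe-west1"
vpc_create = {
  enable_cloud_nat = true
}
EOF
}
```

Calling `write_cluster_tfvars my-project` from the autopilot-cluster directory would produce the same file as the manual steps, with `my-project` standing in for your own project ID.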
Install Kueue and create associated resources
- Change to the patterns/batch directory.
cd ../batch
- Create a new file named terraform.tfvars in that directory.
touch terraform.tfvars
- Open the file for editing.
- Paste the following content into the file.
credentials_config = { kubeconfig = { path = "~/.kube/config" } }
- Initialize the Terraform configuration.
terraform init
- Apply the Terraform configuration.
terraform apply
- Check that the Kueue Pods are ready. Press Ctrl+C to stop watching.
kubectl get pods -n kueue-system -w
- Check the status of the ClusterQueue.
kubectl get clusterqueue cluster-queue -o wide -w
- Check the status of the LocalQueue for each team.
kubectl get localqueue -n team-a local-queue -o wide -w
kubectl get localqueue -n team-b local-queue -o wide -w
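As an alternative to watching with `-w` and pressing Ctrl+C, the checks above can be scripted so that they block until Kueue is ready and then print the queues once. This is a minimal sketch under assumptions: the function name `wait_for_kueue` is hypothetical, it assumes Kueue runs as one or more Deployments in the kueue-system namespace, and it requires `kubectl` to be configured for the cluster.

```shell
# Block until every Deployment in kueue-system reports Available,
# then print the ClusterQueue and all LocalQueues once.
# $1: timeout passed to kubectl wait (defaults to 300s).
wait_for_kueue() {
  kueue_timeout=${1:-300s}
  kubectl wait deployment --all --for=condition=Available \
    --namespace kueue-system --timeout="$kueue_timeout" || return 1
  kubectl get clusterqueue cluster-queue -o wide
  kubectl get localqueue --all-namespaces -o wide
}
```

The function returns a non-zero status if the Deployments do not become Available within the timeout, which makes it usable as a gate in a larger script.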
Run jobs in the cluster
- Create Jobs in the team-a and team-b namespaces every 10 seconds, each associated with the corresponding LocalQueue. Press Ctrl+C when you want to stop creating Jobs.
./create_jobs.sh job-team-a.yaml job-team-b.yaml 10
- Observe the Workloads being queued, admitted by the ClusterQueue, and nodes being brought up by GKE Autopilot.
kubectl -n team-a get workloads
- Copy a Job name from the previous step and observe the admission status and events for that Job through the Workloads API.
kubectl -n team-a describe workload JOB_NAME
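To see admission status for all Workloads at once instead of describing them one by one, the Admitted condition can be extracted with a JSONPath query. The sketch below is illustrative: the function names are hypothetical, and it assumes each Workload reports a condition of type `Admitted` in its status, with an empty value meaning the condition has not been set yet.

```shell
# Print "<workload name><TAB><status of the Admitted condition>" per Workload.
# $1: namespace to inspect, for example team-a.
list_admission() {
  kubectl -n "$1" get workloads -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Admitted")].status}{"\n"}{end}'
}

# Read the tab-separated listing on stdin and print how many Workloads
# have Admitted set to True.
count_admitted() {
  awk -F '\t' '$2 == "True" { n++ } END { print n + 0 }'
}
```

For example, `list_admission team-a | count_admitted` prints the number of team-a Workloads that are currently admitted.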
Destroy resources (optional)
- Change to the patterns/autopilot-cluster directory.
cd ../autopilot-cluster
- Destroy the cluster with the following command.
terraform destroy
Congratulations
You’re all set!