Adding TPU limits for GKE cluster node auto-provisioning (NAP) (#2406)
* Adding TPU limits for GKE cluster node auto-provisioning (NAP)
* rework of the cluster autoscaling configuration
* updated README
* fixing README
* Update modules/gke-cluster-standard/README.md

Co-authored-by: Wiktor Niesiobędzki <wiktorn@google.com>

* fixing indentation

---------

Co-authored-by: Wiktor Niesiobędzki <wiktorn@google.com>
parent c81bc84e3a
commit 59657415be

@@ -15,6 +15,7 @@ This module offers a way to create and manage Google Kubernetes Engine (GKE) [St
 - [Cloud DNS](#cloud-dns)
 - [Backup for GKE](#backup-for-gke)
 - [Automatic creation of new secondary ranges](#automatic-creation-of-new-secondary-ranges)
+- [Node auto-provisioning with GPUs and TPUs](#node-auto-provisioning-with-gpus-and-tpus)
 - [Variables](#variables)
 - [Outputs](#outputs)
 <!-- END TOC -->
@@ -305,6 +306,47 @@ module "cluster-1" {
 }
 # tftest modules=1 resources=1
 ```
+
+### Node auto-provisioning with GPUs and TPUs
+
+You can use the `var.cluster_autoscaling` block to configure node auto-provisioning for the GKE cluster. The example below configures limits for CPU, memory, GPUs and TPUs.
+
+```hcl
+module "cluster-1" {
+  source     = "./fabric/modules/gke-cluster-standard"
+  project_id = var.project_id
+  name       = "cluster-1"
+  location   = "europe-west1-b"
+  vpc_config = {
+    network    = var.vpc.self_link
+    subnetwork = var.subnet.self_link
+    secondary_range_blocks = {
+      pods     = ""
+      services = "/20" # can be an empty string as well
+    }
+  }
+  cluster_autoscaling = {
+    cpu_limits = {
+      max = 48
+    }
+    mem_limits = {
+      max = 182
+    }
+    # Can be GPUs or TPUs
+    accelerator_resources = [
+      {
+        resource_type = "nvidia-l4"
+        max           = 2
+      },
+      {
+        resource_type = "tpu-v5-lite-podslice"
+        max           = 2
+      }
+    ]
+  }
+}
+# tftest modules=1 resources=1
+```
 <!-- BEGIN TFDOC -->
 ## Variables

@@ -315,7 +357,7 @@ module "cluster-1" {
 | [project_id](variables.tf#L410) | Cluster project id. | <code>string</code> | ✓ | |
 | [vpc_config](variables.tf#L421) | VPC-level configuration. | <code title="object({ network = string subnetwork = string master_ipv4_cidr_block = optional(string) master_endpoint_subnetwork = optional(string) secondary_range_blocks = optional(object({ pods = string services = string })) secondary_range_names = optional(object({ pods = optional(string, "pods") services = optional(string, "services") })) additional_ranges = optional(list(string)) master_authorized_ranges = optional(map(string)) stack_type = optional(string) })">object({…})</code> | ✓ | |
 | [backup_configs](variables.tf#L17) | Configuration for Backup for GKE. | <code title="object({ enable_backup_agent = optional(bool, false) backup_plans = optional(map(object({ region = string applications = optional(map(list(string))) encryption_key = optional(string) include_secrets = optional(bool, true) include_volume_data = optional(bool, true) labels = optional(map(string)) namespaces = optional(list(string)) schedule = optional(string) retention_policy_days = optional(number) retention_policy_lock = optional(bool, false) retention_policy_delete_lock_days = optional(number) })), {}) })">object({…})</code> | | <code>{}</code> |
-| [cluster_autoscaling](variables.tf#L39) | Enable and configure limits for Node Auto-Provisioning with Cluster Autoscaler. | <code title="object({ enabled = optional(bool, true) autoscaling_profile = optional(string, "BALANCED") auto_provisioning_defaults = optional(object({ boot_disk_kms_key = optional(string) disk_size = optional(number) disk_type = optional(string, "pd-standard") image_type = optional(string) oauth_scopes = optional(list(string)) service_account = optional(string) management = optional(object({ auto_repair = optional(bool, true) auto_upgrade = optional(bool, true) })) shielded_instance_config = optional(object({ integrity_monitoring = optional(bool, true) secure_boot = optional(bool, false) })) upgrade_settings = optional(object({ blue_green = optional(object({ node_pool_soak_duration = optional(string) standard_rollout_policy = optional(object({ batch_percentage = optional(number) batch_node_count = optional(number) batch_soak_duration = optional(string) })) })) surge = optional(object({ max = optional(number) unavailable = optional(number) })) })) })) cpu_limits = optional(object({ min = number max = number })) mem_limits = optional(object({ min = number max = number })) gpu_resources = optional(list(object({ resource_type = string min = number max = number }))) })">object({…})</code> | | <code>null</code> |
+| [cluster_autoscaling](variables.tf#L39) | Enable and configure limits for Node Auto-Provisioning with Cluster Autoscaler. | <code title="object({ enabled = optional(bool, true) autoscaling_profile = optional(string, "BALANCED") auto_provisioning_defaults = optional(object({ boot_disk_kms_key = optional(string) disk_size = optional(number) disk_type = optional(string, "pd-standard") image_type = optional(string) oauth_scopes = optional(list(string)) service_account = optional(string) management = optional(object({ auto_repair = optional(bool, true) auto_upgrade = optional(bool, true) })) shielded_instance_config = optional(object({ integrity_monitoring = optional(bool, true) secure_boot = optional(bool, false) })) upgrade_settings = optional(object({ blue_green = optional(object({ node_pool_soak_duration = optional(string) standard_rollout_policy = optional(object({ batch_percentage = optional(number) batch_node_count = optional(number) batch_soak_duration = optional(string) })) })) surge = optional(object({ max = optional(number) unavailable = optional(number) })) })) })) cpu_limits = optional(object({ min = optional(number, 0) max = number })) mem_limits = optional(object({ min = optional(number, 0) max = number })) accelerator_resources = optional(list(object({ resource_type = string min = optional(number, 0) max = number }))) })">object({…})</code> | | <code>null</code> |
 | [default_nodepool](variables.tf#L118) | Enable default nodepool. | <code title="object({ remove_pool = optional(bool, true) initial_node_count = optional(number, 1) })">object({…})</code> | | <code>{}</code> |
 | [deletion_protection](variables.tf#L136) | Whether or not to allow Terraform to destroy the cluster. Unless this field is set to false in Terraform state, a terraform destroy or terraform apply that would delete the cluster will fail. | <code>bool</code> | | <code>true</code> |
 | [description](variables.tf#L143) | Cluster description. | <code>string</code> | | <code>null</code> |

@@ -222,15 +222,15 @@ resource "google_container_cluster" "cluster" {
       }
       dynamic "resource_limits" {
         for_each = (
-          try(local.cas.gpu_resources, null) == null
+          try(local.cas.accelerator_resources, null) == null
           ? []
-          : local.cas.gpu_resources
+          : local.cas.accelerator_resources
         )
-        iterator = gpu_resources
+        iterator = accelerator_resources
         content {
-          resource_type = gpu_resources.value.resource_type
-          minimum       = gpu_resources.value.min
-          maximum       = gpu_resources.value.max
+          resource_type = accelerator_resources.value.resource_type
+          minimum       = accelerator_resources.value.min
+          maximum       = accelerator_resources.value.max
         }
       }
     }
|
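To make the effect of the renamed dynamic block concrete, the sketch below shows roughly what it would generate for the two `accelerator_resources` entries used in the README example. It assumes that `min` is left unset, so the default of `0` from the variable type applies, and that `local.cas` carries the variable value through unchanged. The enclosing `cluster_autoscaling` block and the CPU and memory limits are omitted. This is an illustration of the expansion, not output copied from a real plan.

```hcl
# Hypothetical expansion of dynamic "resource_limits" for the README example.
# One block is generated per element of local.cas.accelerator_resources.
resource_limits {
  resource_type = "nvidia-l4"
  minimum       = 0 # min omitted by the caller, so the default is assumed
  maximum       = 2
}
resource_limits {
  resource_type = "tpu-v5-lite-podslice"
  minimum       = 0 # same assumption as above
  maximum       = 2
}
```

GPU and TPU limits take the same `resource_limits` shape, which is presumably why a single `accelerator_resources` list can cover both.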
@@ -73,16 +73,16 @@ variable "cluster_autoscaling" {
       # add validation rule to ensure only one is present if upgrade settings is defined
     }))
     cpu_limits = optional(object({
-      min = number
+      min = optional(number, 0)
       max = number
     }))
     mem_limits = optional(object({
-      min = number
+      min = optional(number, 0)
       max = number
     }))
-    gpu_resources = optional(list(object({
+    accelerator_resources = optional(list(object({
       resource_type = string
-      min           = number
+      min           = optional(number, 0)
       max           = number
     })))
   })
|
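For callers of the module, the type change implies a small migration: the `gpu_resources` attribute becomes `accelerator_resources`, and `min` can be dropped wherever it was `0`. Below is a minimal sketch of a module call written against the new type. It reuses the inputs from the README example. The explicit `min = 16` on `mem_limits` is a made-up value, included only to show that an explicit minimum is still accepted.

```hcl
module "cluster-1" {
  source     = "./fabric/modules/gke-cluster-standard"
  project_id = var.project_id
  name       = "cluster-1"
  location   = "europe-west1-b"
  vpc_config = {
    network    = var.vpc.self_link
    subnetwork = var.subnet.self_link
    secondary_range_blocks = {
      pods     = ""
      services = "/20"
    }
  }
  cluster_autoscaling = {
    # min omitted: falls back to the new default of 0
    cpu_limits = {
      max = 48
    }
    # explicit min (illustrative value, not from the commit)
    mem_limits = {
      min = 16
      max = 182
    }
    # Formerly `gpu_resources`, where `min` was required on every entry.
    accelerator_resources = [
      {
        resource_type = "nvidia-l4"
        max           = 2
      }
    ]
  }
}
```

Configurations that still set `gpu_resources` would need the attribute renamed, since the old name no longer appears in the object type.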