diff --git a/ci/README.md b/ci/README.md
index 84bec8d620..0e44ef8a66 100644
--- a/ci/README.md
+++ b/ci/README.md
@@ -2,13 +2,7 @@
 Our CI infrastructure is built around [BuildKite](https://buildkite.com) with some
 additional GitHub integration provided by https://github.com/mvines/ci-gate
 
-## Buildkite AWS CloudFormation Setup
-
-We use AWS CloudFormation to scale machines up and down based on the current CI
-load. If no machine is currently running it can take up to 60 seconds to spin
-up a new instance, please remain calm during this time.
-
-### Agent Queues
+## Agent Queues
 
 We define two [Agent Queues](https://buildkite.com/docs/agent/v3/queues):
 `queue=default` and `queue=cuda`. The `default` queue should be favored and
@@ -18,7 +12,58 @@ be run on the `default` queue, and the [buildkite artifact
 system](https://buildkite.com/docs/builds/artifacts) used to transfer build
 products over to a GPU instance for testing.
 
-### AMI
+## Buildkite Agent Management
+
+### Buildkite GCP Setup
+
+CI runs on Google Cloud Platform via two Compute Engine Instance groups:
+`ci-default` and `ci-cuda`. Autoscaling is currently disabled and the number of
+VM Instances in each group is manually adjusted.
+
+#### Updating a CI Disk Image
+
+Each Instance group has its own disk image, `ci-default-vX` and
+`ci-cuda-vY`, where *X* and *Y* are incremented each time the image is changed.
+
+The process to update a disk image is as follows (TODO: make this less manual):
+
+1. Create a new VM Instance using the disk image to modify.
+2. Once the VM boots, ssh to it and modify the disk as desired.
+3. Stop the VM Instance running the modified disk. Remember the name of the VM disk.
+4. From another machine, `gcloud auth login`, then create a new Disk Image based
+off the modified VM Instance:
+```
+ $ gcloud compute images create ci-default-v5 --source-disk xxx --source-disk-zone us-east1-b
+```
+or
+```
+ $ gcloud compute images create ci-cuda-v5 --source-disk xxx --source-disk-zone us-east1-b
+```
+5. Delete the new VM instance.
+6. Go to the Instance templates tab, find the existing template named
+`ci-default-vX` or `ci-cuda-vY` and select it. Use the "Copy" button to create
+a new Instance template called `ci-default-vX+1` or `ci-cuda-vY+1` with the
+newly created Disk image.
+7. Go to the Instance Groups tab and find the applicable group, `ci-default` or
+`ci-cuda`. Edit the Instance Group in two steps: (a) Set the number of
+instances to 0 and wait for them all to terminate, (b) Update the Instance
+template and restore the number of instances to the original value.
+8. Clean up the previous version by deleting it from Instance Templates and
+Images.
+
+
+## Reference
+
+### Buildkite AWS CloudFormation Setup
+
+**AWS CloudFormation is currently inactive, although it may be restored in the
+future**
+
+AWS CloudFormation can be used to scale machines up and down based on the
+current CI load. If no machine is currently running it can take up to 60
+seconds to spin up a new instance, please remain calm during this time.
+
+#### AMI
 
 We use a custom AWS AMI built via
 https://github.com/solana-labs/elastic-ci-stack-for-aws/tree/solana/cuda. Use the following process to update this AMI as dependencies change: