Document GCP setup

Michael Vines 2018-06-23 00:21:59 -07:00
parent ec86b1dffa
commit e4782b19a3
1 changed file with 53 additions and 8 deletions


Our CI infrastructure is built around [Buildkite](https://buildkite.com), with some
additional GitHub integration provided by https://github.com/mvines/ci-gate.
## Agent Queues
We define two [Agent Queues](https://buildkite.com/docs/agent/v3/queues):
`queue=default` and `queue=cuda`. The `default` queue should be favored and
be run on the `default` queue, and the [Buildkite artifact
system](https://buildkite.com/docs/builds/artifacts) used to transfer build
products over to a GPU instance for testing.
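The following is a rough sketch of the queue split described above. It assumes the standard `buildkite-agent` v3 CLI, where an agent joins a queue through its `queue=...` tag, the `default` job uploads its outputs as artifacts, and the `cuda` job downloads them. The artifact glob `target/release/*` and the script `ci/test-cuda.sh` are hypothetical placeholders, not paths taken from this repository. Setting `DRY_RUN=1` prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: the artifact glob and test script below are hypothetical.
set -eu

# Print commands instead of executing them when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Start an agent that serves one queue ("default" or "cuda").
# The agent token is expected to come from the agent's own configuration.
start_agent() {
  run buildkite-agent start --tags "queue=$1"
}

# Job on the default queue: publish build products as artifacts.
upload_build_products() {
  run buildkite-agent artifact upload "target/release/*"
}

# Job on the cuda queue: fetch those artifacts, then run the GPU tests.
download_and_test() {
  run buildkite-agent artifact download "target/release/*" .
  run ./ci/test-cuda.sh
}
```

The artifact commands only work inside a running Buildkite job, which supplies the build context they need.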
## Buildkite Agent Management
### Buildkite GCP Setup
CI runs on Google Cloud Platform via two Compute Engine Instance groups:
`ci-default` and `ci-cuda`. Autoscaling is currently disabled and the number of
VM Instances in each group is manually adjusted.
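You can adjust a group by hand in the Console or, as sketched here, with `gcloud compute instance-groups managed resize`. The sketch assumes the groups are zonal managed instance groups in `us-east1-b`. That zone is borrowed from the image commands later in this document and may not match the real groups. A regional group would take `--region` instead of `--zone`.

```shell
#!/bin/sh
# Sketch: manually set the size of one CI instance group.
# Assumption: zonal managed instance groups in us-east1-b (override with ZONE).
set -eu

ZONE="${ZONE:-us-east1-b}"

# Print commands instead of executing them when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# List the CI groups and their current sizes.
show_groups() {
  run gcloud compute instance-groups managed list --filter="name~^ci-"
}

# resize_group GROUP SIZE, e.g. `resize_group ci-cuda 2`
resize_group() {
  case "$1" in
    ci-default|ci-cuda) ;;
    *) echo "unknown instance group: $1" >&2; return 1 ;;
  esac
  run gcloud compute instance-groups managed resize "$1" \
    --size="$2" --zone="$ZONE"
}
```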
#### Updating a CI Disk Image
Each Instance group has its own disk image, `ci-default-vX` and
`ci-cuda-vY`, where *X* and *Y* are incremented each time the image is changed.
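Because each name ends in `-v<N>`, the next name can be derived mechanically. The helper below is a hypothetical convenience and not part of the current process. It is sketched in POSIX shell.

```shell
#!/bin/sh
# Hypothetical helper: derive the next versioned image/template name.
set -eu

# next_version NAME prints NAME with its trailing -v<N> incremented,
# e.g. ci-default-v5 -> ci-default-v6.
next_version() {
  base="${1%-v*}"   # strip the shortest trailing "-v..." suffix
  n="${1##*-v}"     # keep what follows the last "-v"
  case "$n" in
    ''|*[!0-9]*) echo "not a versioned name: $1" >&2; return 1 ;;
  esac
  printf '%s-v%d\n' "$base" "$((n + 1))"
}
```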
The process to update a disk image is as follows (TODO: make this less manual):
1. Create a new VM Instance using the disk image to modify.
2. Once the VM boots, ssh to it and modify the disk as desired.
3. Stop the VM Instance running the modified disk, and note the name of its VM disk.
4. From another machine, run `gcloud auth login`, then create a new Disk Image
   from the modified VM disk, where `xxx` is the disk name noted in step 3:
   ```
   $ gcloud compute images create ci-default-v5 --source-disk xxx --source-disk-zone us-east1-b
   ```
   or
   ```
   $ gcloud compute images create ci-cuda-v5 --source-disk xxx --source-disk-zone us-east1-b
   ```
5. Delete the new VM instance.
6. Go to the Instance templates tab, find the existing template named
`ci-default-vX` or `ci-cuda-vY` and select it. Use the "Copy" button to create
a new Instance template called `ci-default-vX+1` or `ci-cuda-vY+1` with the
newly created Disk image.
7. Go to the Instance Groups tab and find the applicable group, `ci-default` or
`ci-cuda`. Edit the Instance Group in two steps: (a) Set the number of
instances to 0 and wait for them all to terminate, (b) Update the Instance
template and restore the number of instances to the original value.
8. Clean up the previous version by deleting it from Instance Templates and
Images.
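The process above can be sketched with the `gcloud` CLI for steps 3 to 5, 7 and 8. Step 6, copying the Instance template, stays in the Console, because this sketch assumes no direct CLI equivalent of the "Copy" button. The sketch also assumes zonal managed instance groups in `us-east1-b`, and that the new template and image share one name, as the steps describe. `DRY_RUN=1` prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of steps 3-5, 7 and 8; step 6 (copying the template) is manual.
# Assumption: zonal managed instance groups in us-east1-b (override with ZONE).
set -eu

ZONE="${ZONE:-us-east1-b}"

# Print commands instead of executing them when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Steps 3-5: image_from_vm VM DISK IMAGE
image_from_vm() {
  run gcloud compute instances stop "$1" --zone="$ZONE"
  run gcloud compute images create "$3" \
    --source-disk="$2" --source-disk-zone="$ZONE"
  run gcloud compute instances delete "$1" --zone="$ZONE" --quiet
}

# Step 7: roll_group GROUP TEMPLATE SIZE (SIZE = the group's original size)
roll_group() {
  # (a) scale to zero and wait until the group settles
  run gcloud compute instance-groups managed resize "$1" --size=0 --zone="$ZONE"
  run gcloud compute instance-groups managed wait-until "$1" --stable --zone="$ZONE"
  # (b) switch to the new template and restore the original size
  run gcloud compute instance-groups managed set-instance-template "$1" \
    --template="$2" --zone="$ZONE"
  run gcloud compute instance-groups managed resize "$1" --size="$3" --zone="$ZONE"
}

# Step 8: cleanup OLD_NAME (old template and image share the name)
cleanup() {
  run gcloud compute instance-templates delete "$1" --quiet
  run gcloud compute images delete "$1" --quiet
}
```

A run might look like `image_from_vm <vm> <disk> ci-default-v6`, followed by the Console copy for step 6, then `roll_group ci-default ci-default-v6 <size>` and `cleanup ci-default-v5`. Passing the original size explicitly keeps step 7(b) under the operator's control.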
## Reference
### Buildkite AWS CloudFormation Setup
**AWS CloudFormation is currently inactive, although it may be restored in the
future.**
AWS CloudFormation can be used to scale machines up and down based on the
current CI load. If no machine is currently running, it can take up to 60
seconds to spin up a new instance; please remain calm during this time.
#### AMI
We use a custom AWS AMI built via https://github.com/solana-labs/elastic-ci-stack-for-aws/tree/solana/cuda.
Use the following process to update this AMI as dependencies change: