solana

Michael Vines 0d1f463f7f Update testnet-manager.sh		2019-10-25 10:56:20 -07:00
docker-rust	Upgrade to rust 1.38	2019-10-02 22:51:14 -07:00
docker-rust-nightly	Jump to nightly-2019-10-03 (#6233 )	2019-10-03 20:05:44 -06:00
semver_bash	Vendor https://github.com/cloudflare/semver_bash/tree/c1133faf0e	2018-08-17 23:15:48 -07:00
.gitignore	Package solana as a snap	2018-06-18 17:36:03 -07:00
README.md	Remove some TODOs (#6488 )	2019-10-21 22:25:06 -07:00
_	Initial revision	2018-12-18 14:27:37 -08:00
affects-files.sh	Introduce normalized CI environment vars: ci/env.sh (#4571 )	2019-06-06 12:20:47 -07:00
buildkite-release.yml	Don't rebuild/retest release tags (#5385 )	2019-08-01 13:11:42 -07:00
buildkite-secondary.yml	🐌 Publish crates for even longer	2019-08-16 21:52:12 -07:00
buildkite.yml	Remove some TODOs (#6488 )	2019-10-21 22:25:06 -07:00
channel-info.sh	Introduce normalized CI environment vars: ci/env.sh (#4571 )	2019-06-06 12:20:47 -07:00
check-ssh-keys.sh	Add ssh key check (#6149 )	2019-09-27 10:55:51 -07:00
crate-version.sh	Turn top-level Cargo.toml into a virtual manifest	2019-03-21 08:47:58 -07:00
docker-run.sh	Introduce normalized CI environment vars: ci/env.sh (#4571 )	2019-06-06 12:20:47 -07:00
env.sh	Ensure CI_OS_NAME is set for appveyor server	2019-07-19 20:06:32 -07:00
format-url.sh	Add format-url.sh	2018-12-15 15:10:04 -08:00
hoover.sh	codemod --extensions sh '#!/bin/bash' '#!/usr/bin/env bash'	2018-11-11 16:24:36 -08:00
iterations-localnet.sh	Add explicit validator-cuda crate (#5985 )	2019-09-19 20:50:34 -07:00
localnet-sanity.sh	Colo: Put NVMe disks to use (#6357 )	2019-10-17 14:44:45 -07:00
nits.sh	Remove some TODOs (#6488 )	2019-10-21 22:25:06 -07:00
order-crates-for-publishing.py	Remove some TODOs (#6488 )	2019-10-21 22:25:06 -07:00
publish-book.sh	Call book/build.sh from docker (#5237 )	2019-07-22 21:37:43 -07:00
publish-bpf-sdk.sh	Remove dead code	2019-03-20 20:51:58 -07:00
publish-crate.sh	Remove some TODOs (#6488 )	2019-10-21 22:25:06 -07:00
publish-metrics-dashboard.sh	Remove some TODOs (#6488 )	2019-10-21 22:25:06 -07:00
publish-tarball.sh	Keep the build green when there's nowhere to publish	2019-10-03 14:55:04 -07:00
rust-version.sh	Jump to nightly-2019-10-03 (#6233 )	2019-10-03 20:05:44 -06:00
shellcheck.sh	nit: hide echo	2019-01-08 21:11:43 -08:00
test-bench.sh	Remove some TODOs (#6488 )	2019-10-21 22:25:06 -07:00
test-checks.sh	Remove Backend trait (#6407 )	2019-10-17 15:19:27 -06:00
test-coverage.sh	gzip -f	2019-10-17 13:08:51 -07:00
test-local-cluster.sh	Release builds for local cluster tests (#5891 )	2019-09-18 13:10:50 -07:00
test-stable-perf.sh	Run featurized tests on sub-packages (#2867 )	2019-02-21 22:38:36 -08:00
test-stable.sh	ignore test_fail_entry_verification_leader (#6537 )	2019-10-24 21:16:17 -07:00
testnet-deploy.sh	Reduce TdS fees to 1 lamport per sig, and slots_per_epoch/2 (#6542 )	2019-10-24 20:37:23 -07:00
testnet-manager.sh	Update testnet-manager.sh	2019-10-25 10:56:20 -07:00
testnet-sanity.sh	Remove validator sanity check (#6435 )	2019-10-18 08:26:08 -07:00
upload-ci-artifact.sh	Desnake upload_ci_artifact for consistency	2018-12-13 22:25:27 -08:00
upload-github-release-asset.sh	Add CI_REPO_SLUG (#4714 )	2019-06-17 20:42:09 -07:00

README.md

Our CI infrastructure is built around Buildkite, with some additional GitHub integration provided by https://github.com/mvines/ci-gate.

Agent Queues

We define two Agent Queues: queue=default and queue=cuda. The default queue should be favored and runs on lower-cost CPU instances. The cuda queue is only necessary for running tests that depend on GPU (via CUDA) access. CUDA builds may still be run on the default queue, and the Buildkite artifact system can be used to transfer the build products over to a GPU instance for testing.
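The hand-off between the two queues could be sketched roughly as follows. This is a minimal illustration, not the pipeline this repository actually uses: the function names, the AGENT_CMD variable, and the artifact paths are assumptions made for the example. Only the `buildkite-agent artifact upload` and `buildkite-agent artifact download` subcommands are standard.

```shell
# Sketch only: AGENT_CMD and the artifact paths below are illustrative assumptions.
AGENT_CMD="${AGENT_CMD:-buildkite-agent}"

# Run at the end of the build step, on a queue=default agent.
publish_build_products() {
  local pattern="$1"
  $AGENT_CMD artifact upload "$pattern"
}

# Run at the start of the test step, on a queue=cuda agent.
fetch_build_products() {
  local pattern="$1"
  local dest="${2:-.}"
  $AGENT_CMD artifact download "$pattern" "$dest"
}

# Example (hypothetical paths):
#   publish_build_products "target/release/*"
#   fetch_build_products "target/release/*" .
```

Keeping the build on the default queue means the GPU instance is only occupied for the duration of the tests themselves.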

Buildkite Agent Management

Buildkite Azure Setup

Create a new Azure-based "queue=default" agent by running the following command:

$ az vm create \
   --resource-group ci \
   --name XYZ \
   --image boilerplate \
   --admin-username $(whoami) \
   --ssh-key-value ~/.ssh/id_rsa.pub

The "boilerplate" image has all the required packages pre-installed, so the new machine should show up in the Buildkite agent list and be ready for service as soon as it has been provisioned.

Creating a "queue=cuda" agent follows the same process but additionally:

Resize the image from the Azure portal to include a GPU
Edit the tags field in /etc/buildkite-agent/buildkite-agent.cfg to tags="queue=cuda,queue=default" and decrease the value of the priority field by one
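The config edit described above could be scripted along these lines. This is a sketch under a stated assumption: it expects the file to contain plain `tags=...` and `priority=N` lines with an unquoted integer priority, which may not match the layout of your agent's config. The function name is hypothetical.

```shell
# Sketch: retag a Buildkite agent config for the cuda queue.
# Assumes lines of the form
#   tags="queue=default"
#   priority=5
# (unquoted integer priority); adjust if your file differs.
retag_for_cuda() {
  local cfg="$1"
  local tmp
  tmp="$(mktemp)" || return 1
  awk '
    /^tags=/     { print "tags=\"queue=cuda,queue=default\""; next }
    /^priority=/ { split($0, kv, "="); print "priority=" (kv[2] - 1); next }
                 { print }
  ' "$cfg" > "$tmp" || { rm -f "$tmp"; return 1; }
  # Write back in place so the original file's owner and mode are kept.
  cat "$tmp" > "$cfg"
  rm -f "$tmp"
}

# Example (run as root on the new agent):
#   retag_for_cuda /etc/buildkite-agent/buildkite-agent.cfg
```

The agent would most likely need to be restarted afterwards for the new tags to take effect.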

Updating the CI Disk Image

Create a new VM Instance as described above
Modify it as required
When ready, ssh into the instance and start a root shell with sudo -i. Then prepare it for deallocation by running: waagent -deprovision+user; cd /etc; ln -s ../run/systemd/resolve/stub-resolv.conf resolv.conf
Run az vm deallocate --resource-group ci --name XYZ
Run az vm generalize --resource-group ci --name XYZ
Run az image create --resource-group ci --source XYZ --name boilerplate
Go to the ci resource group in the Azure portal and remove all resources with the XYZ name in them
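The three az commands in the steps above can be chained in a small wrapper, sketched here. The DRY_RUN switch and the function names are assumptions added for illustration. By default the wrapper only prints the commands, so it can be reviewed before anything is deallocated. It covers only the steps run from your workstation, after the VM has been deprovisioned from its root shell.

```shell
# Sketch: capture a deprovisioned VM as the CI disk image.
# DRY_RUN and the function names are illustrative, not part of the az CLI.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

capture_ci_image() {
  local vm="$1"
  local group="${2:-ci}"
  local image="${3:-boilerplate}"
  run az vm deallocate --resource-group "$group" --name "$vm" &&
  run az vm generalize --resource-group "$group" --name "$vm" &&
  run az image create --resource-group "$group" --source "$vm" --name "$image"
}

# Example:
#   capture_ci_image XYZ             # prints the three commands
#   DRY_RUN=0 capture_ci_image XYZ   # actually runs them
```

Chaining with && stops at the first failing step, so a VM that failed to deallocate is never generalized.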

Reference

This section contains details regarding previous CI setups that have been used, and that we may return to one day.

Buildkite AWS CloudFormation Setup

AWS CloudFormation is currently inactive, although it may be restored in the future.

AWS CloudFormation can be used to scale machines up and down based on the current CI load. If no machine is currently running, it can take up to 60 seconds to spin up a new instance; please remain calm during this time.

AMI

We use a custom AWS AMI built via https://github.com/solana-labs/elastic-ci-stack-for-aws/tree/solana/cuda.

Use the following process to update this AMI as dependencies change:

$ export AWS_ACCESS_KEY_ID=my_access_key
$ export AWS_SECRET_ACCESS_KEY=my_secret_access_key
$ git clone https://github.com/solana-labs/elastic-ci-stack-for-aws.git -b solana/cuda
$ cd elastic-ci-stack-for-aws/
$ make build
$ make build-ami

Watch for the "amazon-ebs: AMI:" log message to extract the ID of the new AMI. For example:

amazon-ebs: AMI: ami-07118545e8b4ce6dc
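If the build output has been saved to a file, the AMI id can be pulled out of it with a small filter such as the one below. The function name and the build.log file are hypothetical. The pattern assumes the log line has the shape shown in the example above.

```shell
# Sketch: print the last AMI id reported in a build log read from stdin.
# Prints nothing if no matching line is found.
extract_ami_id() {
  sed -n 's/.*amazon-ebs: AMI: \(ami-[0-9a-f]*\).*/\1/p' | tail -n 1
}

# Example (build.log is a hypothetical capture of the build output):
#   make build-ami 2>&1 | tee build.log
#   extract_ami_id < build.log
```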

The new AMI should also now be visible in your EC2 Dashboard. Go to the desired AWS CloudFormation stack, update the ImageId field to the new AMI id, and apply the stack changes.

Buildkite GCP Setup

CI runs on Google Cloud Platform via two Compute Engine Instance groups: ci-default and ci-cuda. Autoscaling is currently disabled and the number of VM Instances in each group is manually adjusted.

Updating a CI Disk Image

Each Instance group has its own disk image, ci-default-vX and ci-cuda-vY, where X and Y are incremented each time the image is changed.

The manual process to update a disk image is as follows:

Create a new VM Instance using the disk image to modify.
Once the VM boots, ssh to it and modify the disk as desired.
Stop the VM Instance running the modified disk. Remember the name of the VM disk.
From another machine, run gcloud auth login, then create a new Disk Image based on the modified VM Instance, using the command that matches the Instance group:

$ gcloud compute images create ci-default-$(date +%Y%m%d%H%M) --source-disk xxx --source-disk-zone us-east1-b --family ci-default

$ gcloud compute images create ci-cuda-$(date +%Y%m%d%H%M) --source-disk xxx --source-disk-zone us-east1-b --family ci-cuda

Delete the new VM instance.
Go to the Instance templates tab, find the existing template named ci-default-vX or ci-cuda-vY and select it. Use the "Copy" button to create a new Instance template called ci-default-vX+1 or ci-cuda-vY+1 with the newly created Disk image.
Go to the Instance Groups tab and find the applicable group, ci-default or ci-cuda. Edit the Instance Group in two steps: (a) Set the number of instances to 0 and wait for them all to terminate, (b) Update the Instance template and restore the number of instances to the original value.
Clean up the previous version by deleting it from Instance Templates and Images.
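The vX to vX+1 naming used for the Instance templates is easy to get wrong by hand. A small helper like the following could compute the next name. It is a sketch only: the function name is hypothetical, and it assumes template names end in -v followed by a decimal number without leading zeros.

```shell
# Sketch: compute the next Instance template name, e.g. ci-default-v7 -> ci-default-v8.
# Assumes names end in "-v<number>" with no leading zeros.
next_template_name() {
  local name="$1"
  local prefix="${name%-v*}"     # everything before the final -v<number>
  local version="${name##*-v}"   # the trailing number
  case "$version" in
    ''|*[!0-9]*|0?*)
      echo "unexpected template name: $name" >&2
      return 1
      ;;
  esac
  echo "${prefix}-v$((version + 1))"
}

# Example:
#   next_template_name ci-cuda-v12
```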