solana/ci
Yihau Chen 405db3e436
ci: fix publish crate version checking (#31894)
* ci: fix publish crate version checking

* ci: add warning when publish crates don't use worksapce version
2023-06-01 14:28:10 +08:00
bench ci: separate bench tests (#31763) 2023-05-25 21:39:17 +08:00
common ci: cleanup (#31196) 2023-04-14 20:37:07 +00:00
docker-rust ci: install lld to docker images (#31645) 2023-05-15 11:34:14 +00:00
docker-rust-nightly Upgrades Rust to 1.69.0 (#31336) 2023-04-25 10:22:38 -04:00
downstream-projects ci: reorg downstream projects (#30463) 2023-02-24 15:55:24 +08:00
semver_bash Vendor https://github.com/cloudflare/semver_bash/tree/c1133faf0e 2018-08-17 23:15:48 -07:00
setup-new-buildkite-agent bump recommended maps/nofiles 2021-08-04 11:15:18 -06:00
stable ci: refactor local cluster tests (#31730) 2023-05-24 15:07:33 +08:00
.gitignore Package solana as a snap 2018-06-18 17:36:03 -07:00
README.md Add cargo sort check to test-checks.sh 2022-06-24 12:41:38 -07:00
_ Ignore RUSTSEC-2020-0006 for the moment (#9057) 2020-03-24 20:10:20 -07:00
affects.sh chore: add docs actions (#25029) 2022-05-06 12:14:50 +08:00
buildkite-pipeline-in-disk.sh ci: run stable tests partially (#31151) 2023-04-21 14:51:17 +08:00
buildkite-pipeline.sh ci: separate bench tests (#31763) 2023-05-25 21:39:17 +08:00
buildkite-secondary.yml chore: add cargo audit at the beginning of secondary pipeline (#27470) 2022-08-31 15:45:29 +00:00
buildkite-solana-private.sh ci: separate bench tests (#31763) 2023-05-25 21:39:17 +08:00
channel-info.sh Add channel version check 2021-03-30 08:46:32 -07:00
channel_restriction.sh chore: add docs actions (#25029) 2022-05-06 12:14:50 +08:00
check-channel-version.sh chore: workspace inheritance (#29893) 2023-02-23 22:01:54 +08:00
check-crates.sh add and recommend script for reserving new package names on crates.io (#31416) 2023-05-08 09:31:24 -06:00
check-ssh-keys.sh Add ssh key check (#6149) 2019-09-27 10:55:51 -07:00
crate-version.sh Turn top-level Cargo.toml into a virtual manifest 2019-03-21 08:47:58 -07:00
dependabot-pr.sh reverting back to the original state 2021-12-09 13:16:05 +05:30
dependabot-updater.sh Follow new dependabot's commit author name (#14091) 2020-12-13 02:27:59 +09:00
do-audit.sh ci: fix do-audit don't report error (#30728) 2023-03-16 11:58:08 +08:00
docker-run.sh ci: show sccache version and prefix (#31317) 2023-04-25 13:11:58 +09:00
env.sh chore: build windows artifacts on Github Actions (#25188) 2022-05-13 17:10:03 +08:00
format-url.sh Add format-url.sh 2018-12-15 15:10:04 -08:00
hoover.sh codemod --extensions sh '#!/bin/bash' '#!/usr/bin/env bash' 2018-11-11 16:24:36 -08:00
intercept.sh ci: silence ci test output while recording in full (#30654) 2023-03-16 22:17:29 +09:00
localnet-sanity.sh Add --allow-private-addr to bootstrap-validator.sh (#30163) 2023-02-22 09:54:15 -08:00
nits.sh ci: suggest to use hidden_unless_forced() (#31321) 2023-04-25 09:43:42 +09:00
order-crates-for-publishing.py ci: use versioned cargo wrapper for crate ordering 2021-06-24 22:14:54 -06:00
platform-tools-info.sh Update references to platform-tools (#30764) 2023-03-22 07:41:40 -07:00
publish-crate.sh ci: fix publish crate version checking (#31894) 2023-06-01 14:28:10 +08:00
publish-installer.sh chore: separate publish installer process (#26137) 2022-06-23 11:18:30 +08:00
publish-metrics-dashboard.sh Remove some TODOs (#6488) 2019-10-21 22:25:06 -07:00
publish-tarball.sh take rust version from toolchain file (#29320) 2022-12-19 17:00:14 +01:00
run-local.sh Migrate SDK from BPF to SBF 2022-10-07 08:57:06 -04:00
run-sanity.sh Make run-sanity.sh ledger-tool verify operate on more slots (#25591) 2022-05-26 15:44:09 -05:00
rust-version.sh Update to nightly rustc to 2023-04-19 (#31381) 2023-05-11 15:48:13 +09:00
shellcheck.sh ci: update buildkite's hook (#31298) 2023-04-25 15:49:58 +00:00
test-bench.sh Add toolchain file usage (#29370) 2023-01-17 20:55:41 +01:00
test-checks.sh Explain use of nightly clippy over whole monorepo (#31833) 2023-05-26 16:43:14 +09:00
test-coverage.sh ci: silence ci test output while recording in full (#30654) 2023-03-16 22:17:29 +09:00
test-docs.sh Split out rust doc tests in CI (#24397) 2022-04-15 19:40:27 -06:00
test-downstream-builds.sh ci: reorg downstream projects (#30463) 2023-02-24 15:55:24 +08:00
test-local-cluster-flakey.sh Split up local cluster tests into separate CI steps (#22295) 2022-01-05 14:44:15 +00:00
test-local-cluster-slow-1.sh Move long-running local-cluster tests to local-cluster-slow (#24952) 2022-05-04 06:03:38 -05:00
test-local-cluster-slow-2.sh Move long-running local-cluster tests to local-cluster-slow (#24952) 2022-05-04 06:03:38 -05:00
test-local-cluster.sh Release builds for local cluster tests (#5891) 2019-09-18 13:10:50 -07:00
test-sanity.sh ci: make merge conflict marker sanity check more robust (#29995) 2023-02-01 21:15:12 -07:00
test-stable-perf.sh Run featurized tests on sub-packages (#2867) 2019-02-21 22:38:36 -08:00
test-stable-sbf.sh Migrate SDK from BPF to SBF 2022-10-07 08:57:06 -04:00
test-stable.sh ci: cleanup (#31196) 2023-04-14 20:37:07 +00:00
test-wasm.sh Add wasm bindings for `Pubkey` and `Keypair` 2021-12-09 15:53:58 -08:00
upload-ci-artifact.sh Use experimential docker virtualization framework for arm64 2022-01-03 15:57:06 -08:00
upload-github-release-asset.sh Make curl verbose when uploading assets to github (#10757) 2020-06-24 00:27:55 +00:00

README.md

Our CI infrastructure is built around Buildkite, with some additional GitHub integration provided by https://github.com/mvines/ci-gate.

Running Locally

To run the CI suite locally, use the run-local.sh script.

Before you do, there are a few dependencies that need to be installed:

cargo install cargo-audit cargo-sort grcov
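
Before running run-local.sh, it can help to confirm that these tools are actually on your PATH. The following is a minimal sketch, not part of the repository: the missing_tools helper is hypothetical, and it assumes each crate installs a binary with the same name as the crate.

```shell
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# List which of the CI dependencies still need installing (prints nothing if all are present).
missing_tools cargo-audit cargo-sort grcov
```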

macOS

On macOS, you will need to install coreutils:

brew install coreutils

Make sure to update your PATH environment variable:

export PATH="/usr/local/opt/coreutils/libexec/gnubin:$PATH"
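
If the export goes into a shell profile, re-sourcing the profile prepends the same directory again. A small guard avoids that. This is a sketch only: prepend_path_once is a hypothetical helper, and the directory is the one from the export line above. On Apple Silicon machines, where Homebrew's default prefix is /opt/homebrew, the path may differ; brew --prefix coreutils reports the actual location.

```shell
# Hypothetical helper: prepend a directory to PATH only if it exists
# and is not already listed, so re-sourcing a profile does not grow PATH.
prepend_path_once() {
  dir=$1
  [ -d "$dir" ] || return 0
  case ":$PATH:" in
    *":$dir:"*) ;;                         # already present, nothing to do
    *) PATH="$dir:$PATH"; export PATH ;;
  esac
}

prepend_path_once /usr/local/opt/coreutils/libexec/gnubin
```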

If you notice the error UnableToSetOpenFileDescriptorLimit, you may need to increase the number of available file descriptors:

sudo launchctl limit maxfiles 100000
ulimit -n 1000000
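
To see whether the current shell already has a high enough limit, the value reported by ulimit -n can be compared against the target. A minimal sketch follows; limit_ok is a hypothetical helper, and the 1000000 threshold is simply the value used in the ulimit command above.

```shell
# Hypothetical helper: succeed if a limit value meets the required minimum.
# $1: current limit as printed by `ulimit -n`, $2: required minimum
limit_ok() {
  [ "$1" = "unlimited" ] || [ "$1" -ge "$2" ]
}

current=$(ulimit -n)
if ! limit_ok "$current" 1000000; then
  echo "open-file limit is $current; consider raising it before running the tests" >&2
fi
```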

Agent Queues

We define two Agent Queues: queue=default and queue=cuda. The default queue should be favored and runs on lower-cost CPU instances. The cuda queue is only necessary for running tests that depend on GPU (via CUDA) access. CUDA builds may still be run on the default queue, with the Buildkite artifact system used to transfer build products over to a GPU instance for testing.
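
As an illustration of how a step is routed to one queue or the other, the sketch below prints Buildkite pipeline YAML from a shell function, using the agents/queue keys of a Buildkite command step. The emit_step helper and both script paths are hypothetical and are not taken from the repository's pipeline scripts.

```shell
# Hypothetical helper: print one Buildkite command step pinned to a queue.
# $1: label, $2: command, $3: queue name (default or cuda)
emit_step() {
  cat <<EOF
  - label: "$1"
    command: "$2"
    agents:
      queue: "$3"
EOF
}

echo "steps:"
# Build on the cheaper CPU queue, then run only the GPU tests on the cuda queue.
emit_step "build" "ci/hypothetical-build.sh" "default"
emit_step "gpu tests" "ci/hypothetical-gpu-tests.sh" "cuda"
```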

Buildkite Agent Management

Manual Node Setup for Colocated Hardware

This section describes how to set up a new machine that does not have a pre-configured image with all the requirements installed. It is intended for custom-built hardware at a colocation or office facility, and also works for vanilla Ubuntu cloud instances.

Pre-Requisites

  • Install Ubuntu 18.04 LTS Server
  • Log in as a local or remote user with sudo privileges

Install Core Requirements

Non-GPU enabled machines
sudo ./setup-new-buildkite-agent/setup-new-machine.sh
GPU-enabled machines
  • 1 or more NVIDIA GPUs should be installed in the machine (tested with 2080Ti)
sudo CUDA=1 ./setup-new-buildkite-agent/setup-new-machine.sh

Configure Node for Buildkite-agent based CI

  • Install buildkite-agent and set up its user environment with:
sudo ./setup-new-buildkite-agent/setup-buildkite.sh
  • Copy the pubkey contents from ~buildkite-agent/.ssh/id_ecdsa.pub and add the pubkey as an authorized SSH key on GitHub.
    • In net/scripts/solana-user-authorized_keys.sh
    • Bug mvines to add it to the "solana-grimes" GitHub user
  • Edit /etc/buildkite-agent/buildkite-agent.cfg and/or /etc/systemd/system/buildkite-agent@* to the desired configuration of the agent(s)
  • Copy ejson keys from another CI node at /opt/ejson/keys/ to the same location on the new node.
  • Start the new agent(s) with sudo systemctl enable --now buildkite-agent

Reference

This section contains details regarding previous CI setups that have been used, and that we may return to one day.

Buildkite Azure Setup

Create a new Azure-based "queue=default" agent by running the following command:

$ az vm create \
   --resource-group ci \
   --name XYZ \
   --image boilerplate \
   --admin-username $(whoami) \
   --ssh-key-value ~/.ssh/id_rsa.pub

The "boilerplate" image contains all the required packages pre-installed so the new machine should immediately show up in the Buildkite agent list once it has been provisioned and be ready for service.

Creating a "queue=cuda" agent follows the same process but additionally:

  1. Resize the image from the Azure portal to include a GPU
  2. Edit the tags field in /etc/buildkite-agent/buildkite-agent.cfg to tags="queue=cuda,queue=default" and decrease the value of the priority field by one
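
The tags edit in step 2 can also be scripted. The sketch below is one possible approach, assuming the config file already contains a line starting with tags= and that the new value contains no | or & characters. set_agent_tags is a hypothetical helper, the sample file contents are made up for the demonstration, and the priority field is left for manual editing.

```shell
# Hypothetical helper: rewrite the tags= line of an agent config file.
# $1: path to the config file, $2: new tags value
set_agent_tags() {
  tmp=$(mktemp) || return 1
  status=0
  sed "s|^tags=.*|tags=\"$2\"|" "$1" > "$tmp" && cat "$tmp" > "$1" || status=$?
  rm -f "$tmp"
  return "$status"
}

# Try it on a scratch copy rather than the live file under /etc.
cfg=$(mktemp)
printf 'name="agent-1"\ntags="queue=default"\n' > "$cfg"
set_agent_tags "$cfg" "queue=cuda,queue=default"
cat "$cfg"
rm -f "$cfg"
```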

Updating the CI Disk Image

  1. Create a new VM Instance as described above
  2. Modify it as required
  3. When ready, ssh into the instance and start a root shell with sudo -i. Then prepare it for deallocation by running: waagent -deprovision+user; cd /etc; ln -s ../run/systemd/resolve/stub-resolv.conf resolv.conf
  4. Run az vm deallocate --resource-group ci --name XYZ
  5. Run az vm generalize --resource-group ci --name XYZ
  6. Run az image create --resource-group ci --source XYZ --name boilerplate
  7. Go to the ci resource group in the Azure portal and remove all resources with the XYZ name in them
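
Steps 4 to 6 always run in the same order against the same VM name, so it can be convenient to generate them together and review them before running anything. The sketch below only prints the commands from the list above; azure_image_commands is a hypothetical helper and does not call az itself.

```shell
# Hypothetical helper: print steps 4-6 for a given VM without running them.
# $1: VM name (XYZ in the text), $2: image name (defaults to boilerplate)
azure_image_commands() {
  vm=$1
  image=${2:-boilerplate}
  echo "az vm deallocate --resource-group ci --name $vm"
  echo "az vm generalize --resource-group ci --name $vm"
  echo "az image create --resource-group ci --source $vm --name $image"
}

azure_image_commands XYZ
```

Once the printed commands look right, they can be pasted into a shell one at a time, which keeps a failure in one step from silently carrying on to the next.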

Buildkite AWS CloudFormation Setup

AWS CloudFormation is currently inactive, although it may be restored in the future.

AWS CloudFormation can be used to scale machines up and down based on the current CI load. If no machine is currently running, it can take up to 60 seconds to spin up a new instance; please remain calm during this time.

AMI

We use a custom AWS AMI built via https://github.com/solana-labs/elastic-ci-stack-for-aws/tree/solana/cuda.

Use the following process to update this AMI as dependencies change:

$ export AWS_ACCESS_KEY_ID=my_access_key
$ export AWS_SECRET_ACCESS_KEY=my_secret_access_key
$ git clone https://github.com/solana-labs/elastic-ci-stack-for-aws.git -b solana/cuda
$ cd elastic-ci-stack-for-aws/
$ make build
$ make build-ami

Watch for the "amazon-ebs: AMI:" log message to extract the name of the new AMI. For example:

amazon-ebs: AMI: ami-07118545e8b4ce6dc

The new AMI should also now be visible in your EC2 Dashboard. Go to the desired AWS CloudFormation stack, update the ImageId field to the new AMI ID, and apply the stack changes.
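
Rather than scanning the build output by eye, the AMI ID can be pulled out of the "amazon-ebs: AMI:" log line with sed. This is a minimal sketch that assumes the log line has the form shown in the example above; extract_ami_id is a hypothetical helper.

```shell
# Hypothetical helper: read build output on stdin and print the last AMI ID seen.
extract_ami_id() {
  sed -n 's/.*amazon-ebs: AMI: \(ami-[0-9a-f]*\).*/\1/p' | tail -n 1
}

# Example using the sample log line from the text.
echo "amazon-ebs: AMI: ami-07118545e8b4ce6dc" | extract_ami_id
```

In practice the output of make build-ami could be saved to a file (the file name is up to you) and then fed to the helper on stdin.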

Buildkite GCP Setup

CI runs on Google Cloud Platform via two Compute Engine Instance groups: ci-default and ci-cuda. Autoscaling is currently disabled and the number of VM Instances in each group is manually adjusted.

Updating a CI Disk Image

Each Instance group has its own disk image, ci-default-vX and ci-cuda-vY, where X and Y are incremented each time the image is changed.
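
The version bump itself is simple arithmetic on the name's suffix. As a sketch, assuming names always end in -v followed by a number, the next name can be computed as follows. next_image_version is a hypothetical helper, and the version numbers in the example are made up.

```shell
# Hypothetical helper: given a name such as ci-default-v7, print ci-default-v8.
next_image_version() {
  base=${1%-v*}        # strip the trailing -v<number>
  num=${1##*-v}        # keep only the number
  echo "${base}-v$((num + 1))"
}

next_image_version ci-default-v7
next_image_version ci-cuda-v12
```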

The manual process to update a disk image is as follows:

  1. Create a new VM Instance using the disk image to modify.
  2. Once the VM boots, ssh to it and modify the disk as desired.
  3. Stop the VM Instance running the modified disk. Remember the name of the VM disk.
  4. From another machine, run gcloud auth login, then create a new Disk Image based off the modified VM Instance:
  $ gcloud compute images create ci-default-$(date +%Y%m%d%H%M) --source-disk xxx --source-disk-zone us-east1-b --family ci-default

or

  $ gcloud compute images create ci-cuda-$(date +%Y%m%d%H%M) --source-disk xxx --source-disk-zone us-east1-b --family ci-cuda
  5. Delete the new VM instance.
  6. Go to the Instance templates tab, find the existing template named ci-default-vX or ci-cuda-vY and select it. Use the "Copy" button to create a new Instance template called ci-default-vX+1 or ci-cuda-vY+1 with the newly created Disk image.
  7. Go to the Instance Groups tab and find the applicable group, ci-default or ci-cuda. Edit the Instance Group in two steps: (a) Set the number of instances to 0 and wait for them all to terminate, (b) Update the Instance template and restore the number of instances to the original value.
  8. Clean up the previous version by deleting it from Instance Templates and Images.
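
The two image-creation commands in the steps above differ only in the family name, so they can be generated from one template. The sketch below prints the command instead of running it. gcloud_image_command is a hypothetical helper, the zone is the one used in the text, and the optional third argument exists only so the timestamp can be fixed for review or testing.

```shell
# Hypothetical helper: print the image-creation command for one family.
# $1: family (ci-default or ci-cuda), $2: source disk name, $3: optional timestamp
gcloud_image_command() {
  stamp=${3:-$(date +%Y%m%d%H%M)}
  echo "gcloud compute images create $1-$stamp --source-disk $2 --source-disk-zone us-east1-b --family $1"
}

# xxx stands for the VM disk name noted when the instance was stopped.
gcloud_image_command ci-default xxx
gcloud_image_command ci-cuda xxx
```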