Fixup scripts to set up a new CI node (#9348)

* Clean up node setup scripts for new CI boxes

* Move files under ci directory

* Set CUDA env var to setup cuda drivers

* Fixup and add README

* shellcheck

* Apply review feedback, rename dir and setup files

Co-authored-by: publish-docs.sh <maintainers@solana.com>
Dan Albert 2020-04-20 17:43:13 -06:00 committed by GitHub
parent 41fec5bd5b
commit 3fbe7f0bb3
19 changed files with 266 additions and 147 deletions


@@ -2,7 +2,7 @@
Our CI infrastructure is built around [BuildKite](https://buildkite.com) with some
additional GitHub integration provided by https://github.com/mvines/ci-gate
## Agent Queues
# Agent Queues
We define two [Agent Queues](https://buildkite.com/docs/agent/v3/queues):
`queue=default` and `queue=cuda`. The `default` queue should be favored and
@@ -12,9 +12,52 @@ be run on the `default` queue, and the [buildkite artifact
system](https://buildkite.com/docs/builds/artifacts) used to transfer build
products over to a GPU instance for testing.
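A hedged sketch of that hand-off, using the `buildkite-agent artifact` subcommands
(the artifact path here is illustrative, not the actual pipeline's):

```bash
# On a queue=default agent: publish the build product
buildkite-agent artifact upload "farf/solana-release.tar.bz2"

# On a queue=cuda agent, in a later pipeline step: fetch it for GPU testing
buildkite-agent artifact download "farf/solana-release.tar.bz2" .
```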
## Buildkite Agent Management
# Buildkite Agent Management
### Buildkite Azure Setup
## Manual Node Setup for Colocated Hardware
This section describes how to set up a new machine that does not have a
pre-configured image with all the requirements installed. It is intended for
custom-built hardware at a colocation or office facility, and also works for
vanilla Ubuntu cloud instances.
### Prerequisites
- Install Ubuntu 18.04 LTS Server
- Log in as a local or remote user with `sudo` privileges
### Install Core Requirements
#### Non-GPU enabled machines
```bash
sudo ./setup-new-buildkite-agent/setup-new-machine.sh
```
#### GPU-enabled machines
- One or more NVIDIA GPUs must be installed in the machine (tested with a 2080 Ti)
```bash
sudo CUDA=1 ./setup-new-buildkite-agent/setup-new-machine.sh
```
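Once the script finishes, a quick sanity check that the driver sees the GPUs
(assuming `nvidia-smi` was installed by the CUDA setup):

```bash
nvidia-smi -L  # lists each detected GPU, e.g. "GPU 0: GeForce RTX 2080 Ti"
```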
### Configure Node for Buildkite-agent based CI
- Install `buildkite-agent` and set up its user environment with:
```bash
sudo ./setup-new-buildkite-agent/setup-buildkite.sh
```
- Copy the pubkey contents from `~buildkite-agent/.ssh/id_ecdsa.pub` and
add it as an authorized SSH key on GitHub (see the sketch after this list).
- Edit `/etc/buildkite-agent/buildkite-agent.cfg` and/or `/etc/systemd/system/buildkite-agent@*` to match the desired configuration of the agent(s)
- Copy `ejson` keys from another CI node at `/opt/ejson/keys/`
to the same location on the new node.
- Start the new agent(s) with `sudo systemctl enable --now buildkite-agent`
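A hedged sketch of the pubkey and startup steps above (the template-instance
names are illustrative; adjust to however many agents the box should run):

```bash
# Print the agent's pubkey for pasting into GitHub
sudo cat ~buildkite-agent/.ssh/id_ecdsa.pub

# If multiple agents were configured as systemd template instances:
sudo systemctl enable --now buildkite-agent@1 buildkite-agent@2
sudo systemctl status 'buildkite-agent@*' --no-pager
```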
# Reference
This section contains details regarding previous CI setups that have been used,
and that we may return to one day.
## Buildkite Azure Setup
Create a new Azure-based "queue=default" agent by running the following command:
```
@@ -35,7 +78,7 @@ Creating a "queue=cuda" agent follows the same process but additionally:
2. Edit the tags field in /etc/buildkite-agent/buildkite-agent.cfg to `tags="queue=cuda,queue=default"`
and decrease the value of the priority field by one
#### Updating the CI Disk Image
### Updating the CI Disk Image
1. Create a new VM Instance as described above
1. Modify it as required
@@ -48,12 +91,7 @@ Creating a "queue=cuda" agent follows the same process but additionally:
1. Go to the `ci` resource group in the Azure portal and remove all resources
with the XYZ name in them (a CLI sketch follows below)
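If the Azure CLI is available, a hypothetical equivalent of that portal search
(resource group and name filter as above):

```bash
# List every resource in the "ci" group whose name contains the VM name (XYZ)
az resource list --resource-group ci \
  --query "[?contains(name, 'XYZ')].{name:name, type:type}" -o table
```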
## Reference
This section contains details regarding previous CI setups that have been used,
and that we may return to one day.
### Buildkite AWS CloudFormation Setup
## Buildkite AWS CloudFormation Setup
**AWS CloudFormation is currently inactive, although it may be restored in the
future**
@@ -62,7 +100,7 @@ AWS CloudFormation can be used to scale machines up and down based on the
current CI load. If no machine is currently running, it can take up to 60
seconds to spin up a new instance; please remain calm during this time.
#### AMI
### AMI
We use a custom AWS AMI built via https://github.com/solana-labs/elastic-ci-stack-for-aws/tree/solana/cuda.
Use the following process to update this AMI as dependencies change:
@@ -84,13 +122,13 @@ The new AMI should also now be visible in your EC2 Dashboard. Go to the desired
AWS CloudFormation stack, update the **ImageId** field to the new AMI id, and
*apply* the stack changes.
### Buildkite GCP Setup
## Buildkite GCP Setup
CI runs on Google Cloud Platform via two Compute Engine Instance groups:
`ci-default` and `ci-cuda`. Autoscaling is currently disabled and the number of
VM Instances in each group is manually adjusted.
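With autoscaling disabled, resizing is a manual `gcloud` operation; a sketch,
with the target size and zone as assumptions:

```bash
# Grow the default CI group to 4 VM instances
gcloud compute instance-groups managed resize ci-default \
  --size=4 --zone=us-west1-b
```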
#### Updating a CI Disk Image
### Updating a CI Disk Image
Each Instance group has its own disk image, `ci-default-vX` and
`ci-cuda-vY`, where *X* and *Y* are incremented each time the image is changed.
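Capturing a modified VM as the next image version might look like this sketch
(disk name, zone, and version number are assumptions):

```bash
# Create image version X+1 from the modified VM's boot disk
gcloud compute images create ci-default-v5 \
  --source-disk=ci-default-modified --source-disk-zone=us-west1-b
```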


@@ -2,7 +2,7 @@
HERE="$(dirname "$0")"
# shellcheck source=net/datacenter-node-install/utils.sh
# shellcheck source=ci/setup-new-buildkite-agent/utils.sh
source "$HERE"/utils.sh
ensure_env || exit 1


@@ -2,7 +2,7 @@
HERE="$(dirname "$0")"
# shellcheck source=net/datacenter-node-install/utils.sh
# shellcheck source=ci/setup-new-buildkite-agent/utils.sh
source "$HERE"/utils.sh
ensure_env || exit 1


@@ -0,0 +1,4 @@
#!/usr/bin/env bash
sudo systemctl daemon-reload
sudo systemctl enable --now buildkite-agent


@@ -2,7 +2,7 @@
HERE="$(dirname "$0")"
# shellcheck source=net/datacenter-node-install/utils.sh
# shellcheck source=ci/setup-new-buildkite-agent/utils.sh
source "$HERE"/utils.sh
ensure_env || exit 1


@@ -0,0 +1,84 @@
#!/usr/bin/env bash
HERE="$(dirname "$0")"
# shellcheck source=ci/setup-new-buildkite-agent/utils.sh
source "$HERE"/utils.sh
ensure_env || exit 1
set -e
# Install buildkite-agent
echo "deb https://apt.buildkite.com/buildkite-agent stable main" | tee /etc/apt/sources.list.d/buildkite-agent.list
apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 32A37959C2FA5C3C99EFBC32A79206696452D198
apt-get update
apt-get install -y buildkite-agent
# Configure the installation
echo "Go to https://buildkite.com/organizations/solana-labs/agents"
echo "Click Reveal Agent Token"
echo "Paste the Agent Token, then press Enter:"
read -r agent_token
sudo sed -i "s/xxx/$agent_token/g" /etc/buildkite-agent/buildkite-agent.cfg
cat > /etc/buildkite-agent/hooks/environment <<EOF
set -e
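# Clean with -ffdqx: force (twice, to clear nested repos), directories, quiet, and ignored files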
export BUILDKITE_GIT_CLEAN_FLAGS="-ffdqx"
# Hack for non-docker rust builds
export PATH='$PATH':~buildkite-agent/.cargo/bin
# Add path to snaps
source /etc/profile.d/apps-bin-path.sh
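# GitHub pull request branches (pull/123) are not covered by the default refspec, so fetch them explicitly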
if [[ '$BUILDKITE_BRANCH' =~ pull/* ]]; then
export BUILDKITE_REFSPEC="+'$BUILDKITE_BRANCH':refs/remotes/origin/'$BUILDKITE_BRANCH'"
fi
EOF
chown buildkite-agent:buildkite-agent /etc/buildkite-agent/hooks/environment
# Create SSH key
sudo -u buildkite-agent mkdir -p ~buildkite-agent/.ssh
sudo -u buildkite-agent ssh-keygen -t ecdsa -q -N "" -f ~buildkite-agent/.ssh/id_ecdsa
# Set buildkite-agent user's shell
sudo usermod --shell /bin/bash buildkite-agent
# Install Rust for buildkite-agent
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs -o /tmp/rustup-init.sh
sudo -u buildkite-agent HOME=~buildkite-agent sh /tmp/rustup-init.sh -y
# Add to docker and sudoers group
addgroup buildkite-agent docker
addgroup buildkite-agent sudo
# Edit the systemd unit file to include LimitNOFILE
cat > /lib/systemd/system/buildkite-agent.service <<EOF
[Unit]
Description=Buildkite Agent
Documentation=https://buildkite.com/agent
After=syslog.target
After=network.target
[Service]
Type=simple
User=buildkite-agent
Environment=HOME=/var/lib/buildkite-agent
ExecStart=/usr/bin/buildkite-agent start
RestartSec=5
Restart=on-failure
RestartForceExitStatus=SIGPIPE
TimeoutStartSec=10
TimeoutStopSec=0
KillMode=process
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
DefaultInstance=1
EOF


@@ -2,12 +2,13 @@
# https://developer.nvidia.com/cuda-toolkit-archive
VERSIONS=()
VERSIONS+=("https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux")
VERSIONS+=("https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.168_418.67_linux.run")
#VERSIONS+=("https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux")
#VERSIONS+=("https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.168_418.67_linux.run")
VERSIONS+=("http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run")
HERE="$(dirname "$0")"
# shellcheck source=net/datacenter-node-install/utils.sh
# shellcheck source=ci/setup-new-buildkite-agent/utils.sh
source "$HERE"/utils.sh
ensure_env || exit 1
@@ -51,3 +52,14 @@ done
# Allow normal users to use CUDA profiler
echo 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' > /etc/modprobe.d/nvidia-enable-user-profiling.conf
# setup persistence mode across reboots
TMPDIR="$(mktemp -d)"
if pushd "$TMPDIR"; then
tar -xvf /usr/share/doc/NVIDIA_GLX-1.0/samples/nvidia-persistenced-init.tar.bz2
./nvidia-persistenced-init/install.sh systemd
popd
rm -rf "$TMPDIR"
fi
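# Enable persistence mode now; the init service installed above restores it after reboot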
nvidia-smi -pm ENABLED


@@ -2,7 +2,7 @@
HERE="$(dirname "$0")"
# shellcheck source=net/datacenter-node-install/utils.sh
# shellcheck source=ci/setup-new-buildkite-agent/utils.sh
source "$HERE"/utils.sh
ensure_env || exit 1


@@ -0,0 +1,49 @@
#!/usr/bin/env bash
HERE="$(dirname "$0")"
SOLANA_ROOT="$HERE"/../..
# shellcheck source=ci/setup-new-buildkite-agent/utils.sh
source "$HERE"/utils.sh
ensure_env || exit 1
set -ex
apt update
apt upgrade -y
cat >/etc/apt/apt.conf.d/99-solana <<'EOF'
// Set and persist extra caps on iftop binary
Dpkg::Post-Invoke { "which iftop 2>&1 >/dev/null && setcap cap_net_raw=eip $(which iftop) || true"; };
EOF
apt install -y build-essential pkg-config clang cmake sysstat linux-tools-common \
linux-generic-hwe-18.04-edge linux-tools-generic-hwe-18.04-edge \
iftop heaptrack jq ruby python3-venv gcc-multilib libudev-dev
gem install ejson ejson2env
mkdir -p /opt/ejson/keys
"$SOLANA_ROOT"/net/scripts/install-docker.sh
usermod -aG docker "$SETUP_USER"
"$SOLANA_ROOT"/net/scripts/install-certbot.sh
"$HERE"/setup-sudoers.sh
"$HERE"/setup-ssh.sh
"$HERE"/disable-nouveau.sh
"$HERE"/disable-networkd-wait.sh
"$SOLANA_ROOT"/net/scripts/install-earlyoom.sh
"$SOLANA_ROOT"/net/scripts/install-nodejs.sh
"$SOLANA_ROOT"/net/scripts/localtime.sh
"$SOLANA_ROOT"/net/scripts/install-redis.sh
"$SOLANA_ROOT"/net/scripts/install-rsync.sh
"$SOLANA_ROOT"/net/scripts/install-libssl-compatability.sh
"$HERE"/setup-procfs-knobs.sh
"$HERE"/setup-limits.sh
[[ -n $CUDA ]] && "$HERE"/setup-cuda.sh
exit 0


@@ -2,12 +2,12 @@
HERE="$(dirname "$0")"
# shellcheck source=net/datacenter-node-install/utils.sh
# shellcheck source=ci/setup-new-buildkite-agent/utils.sh
source "$HERE"/utils.sh
ensure_env || exit 1
set -xe
set -ex
"$HERE"/disable-nouveau.sh
"$HERE"/disable-networkd-wait.sh


@@ -2,7 +2,7 @@
HERE="$(dirname "$0")"
# shellcheck source=net/datacenter-node-install/utils.sh
# shellcheck source=ci/setup-new-buildkite-agent/utils.sh
source "$HERE"/utils.sh
ensure_env || exit 1


@@ -2,7 +2,7 @@
HERE="$(dirname "$0")"
# shellcheck source=net/datacenter-node-install/utils.sh
# shellcheck source=ci/setup-new-buildkite-agent/utils.sh
source "$HERE"/utils.sh
ensure_env || exit 1


@@ -2,7 +2,7 @@
HERE="$(dirname "$0")"
# shellcheck source=net/datacenter-node-install/utils.sh
# shellcheck source=ci/setup-new-buildkite-agent/utils.sh
source "$HERE"/utils.sh
ensure_env || exit 1


@@ -1,25 +0,0 @@
# Introduction
These scripts are intended to facilitate the preparation of dedicated Solana
nodes. They have been tested as working from a clean installation of Ubuntu
18.04 Server. Use elsewhere is unsupported.
# Installation
Both installation methods require that the NVIDIA proprietary driver installer
programs be downloaded alongside [setup-cuda.sh](./setup-cuda.sh). If they do
not exist at runtime, an attempt will be made to download them automatically. To
avoid downloading the installers at runtime, they may be downloaded in advance
and placed as siblings to [setup-cuda.sh](./setup-cuda.sh).
For up-to-date NVIDIA driver version requirements, see [setup-cuda.sh](./setup-cuda.sh)
## Datacenter Node
1) `sudo ./setup-dc-node-1.sh`
2) `sudo reboot`
3) `sudo ./setup-dc-node-2.sh`
## Partner Node
1) `$ sudo ./setup-partner-node.sh`


@@ -1,61 +0,0 @@
#!/usr/bin/env bash
HERE="$(dirname "$0")"
# shellcheck source=net/datacenter-node-install/utils.sh
source "$HERE"/utils.sh
ensure_env || exit 1
if [[ -n "$1" ]]; then
PUBKEY_FILE="$1"
else
cat <<EOF
Usage: $0 [pubkey_file]
The pubkey_file should be the pubkey that will be set up to allow the current user
(assumed to be the machine admin) to log in via ssh
EOF
exit 1
fi
set -xe
apt update
apt upgrade -y
cat >/etc/apt/apt.conf.d/99-solana <<'EOF'
// Set and persist extra caps on iftop binary
Dpkg::Post-Invoke { "which iftop 2>&1 >/dev/null && setcap cap_net_raw=eip $(which iftop) || true"; };
EOF
apt install -y build-essential pkg-config clang cmake sysstat linux-tools-common \
linux-generic-hwe-18.04-edge linux-tools-generic-hwe-18.04-edge \
iftop heaptrack
"$HERE"/../scripts/install-docker.sh
usermod -aG docker "$SETUP_USER"
"$HERE"/../scripts/install-certbot.sh
"$HERE"/setup-sudoers.sh
"$HERE"/setup-ssh.sh
# Allow admin user to log in
BASE_SSH_DIR="${SETUP_HOME}/.ssh"
mkdir "$BASE_SSH_DIR"
chown "$SETUP_USER:$SETUP_USER" "$BASE_SSH_DIR"
cat "$PUBKEY_FILE" > "${BASE_SSH_DIR}/authorized_keys"
chown "$SETUP_USER:$SETUP_USER" "${BASE_SSH_DIR}/.ssh/authorized_keys"
"$HERE"/disable-nouveau.sh
"$HERE"/disable-networkd-wait.sh
"$HERE"/setup-grub.sh
"$HERE"/../scripts/install-earlyoom.sh
"$HERE"/../scripts/install-nodeljs.sh
"$HERE"/../scripts/localtime.sh
"$HERE"/../scripts/install-redis.sh
"$HERE"/../scripts/install-rsync.sh
"$HERE"/../scripts/install-libssl-compatability.sh
"$HERE"/setup-procfs-knobs.sh
"$HERE"/setup-limits.sh
echo "Please reboot then run setup-dc-node-2.sh"


@@ -1,22 +0,0 @@
#!/usr/bin/env bash
HERE="$(dirname "$0")"
# shellcheck source=net/datacenter-node-install/utils.sh
source "$HERE"/utils.sh
ensure_env || exit 1
set -xe
"$HERE"/setup-cuda.sh
# setup persistence mode across reboots
TMPDIR="$(mktemp)"
mkdir -p "$TMPDIR"
if pushd "$TMPDIR"; then
tar -xvf /usr/share/doc/NVIDIA_GLX-1.0/sample/nvidia-persistenced-init.tar.bz2
./nvidia-persistenced-init/install.sh systemd
popd
rm -rf "$TMPDIR"
fi


@@ -1,13 +0,0 @@
#!/usr/bin/env bash
HERE="$(dirname "$0")"
# shellcheck source=net/datacenter-node-install/utils.sh
source "$HERE"/utils.sh
ensure_env || exit 1
set -xe
printf "GRUB_GFXPAYLOAD_LINUX=1280x1024x32\n\n" >> /etc/default/grub
update-grub


@@ -18,9 +18,62 @@ add-apt-repository \
apt-get update
apt-get install -y docker-ce
docker run hello-world
cat > /lib/systemd/system/docker.service <<EOF
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
BindsTo=containerd.service
After=network-online.target firewalld.service
Wants=network-online.target
[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd -H unix://
ExecReload=/bin/kill -s HUP '$MAINPID'
TimeoutSec=0
RestartSec=2
Restart=always
# Note that StartLimit* options were moved from "Service" to "Unit" in systemd 229.
# Both the old, and new location are accepted by systemd 229 and up, so using the old location
# to make them work for either version of systemd.
StartLimitBurst=3
# Note that StartLimitInterval was renamed to StartLimitIntervalSec in systemd 230.
# Both the old, and new name are accepted by systemd 230 and up, so using the old name to make
# this option work for either version of systemd.
StartLimitInterval=60s
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Comment TasksMax if your systemd version does not support it.
# Only systemd 226 and above support this option.
TasksMax=infinity
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now /lib/systemd/system/docker.service
# Grant the solana user access to docker
if id solana; then
addgroup solana docker
fi
docker run hello-world