
Validators

Motivation

This repository is meant to serve as an example of how to run a Solana validator. It does not give specifics on the architecture of Solana and should not be used as a substitute for Solana's documentation; it is highly recommended to read Solana's documentation on running a validator. This repository should be used in conjunction with Solana's guide: it provides practical, real-world examples of a cluster setup and should act as a starting point for participating in mainnet validation.

This repository gives two examples of potential validator setups. The first is a single-node validator that can be used as an entry point for querying on-chain Solana data or validating transactions. The second is a cluster of Solana validators load balanced by an NGINX server. NGINX offers an active health check feature in its premium version, NGINX Plus; a load balancer configuration using the active health checks feature is also included.

The end goal of this guide is to have a Solana validator cluster running in a cloud environment.

Running a single validator

Choosing an instance type

Solana's documentation recommends choosing a node type with the highest number of cores available and a CUDA-enabled GPU (see here). Solana uses GPUs to increase throughput, and the documentation recommends Nvidia Turing or Volta family GPUs, which are available through most cloud providers.

This guide was tested using Amazon AWS g4dn.xlarge instances with the Ubuntu 18.04 Deep Learning AMI. g4dn.xlarge instances come with Nvidia T4 GPUs, which offer a balance of high network throughput and CPU resources. Additionally, the Deep Learning AMI comes preinstalled with CUDA, which is required by Solana.

The Solana ledger is stored on disk. As of this writing, specifying the minimum required ledger length uses roughly 200GiB of disk space. When provisioning your instance, you should choose a disk size of at least 200GiB to ensure sufficient space for the ledger.

Instance network configuration

After provisioning an instance, it is important to configure network whitelists to be compatible with a validator's network usage. Solana nodes communicate via a gossip protocol on a port range that is specified upon validator startup. For this guide we will set that port range to 8000-8012 (see sol/api.sh). Be sure to whitelist all network traffic on whichever port range you choose. Validator RPC servers also bind to configurable ports. This guide will set RPC servers to use port 8899 for standard REST requests and 8900 for websocket connections (also defined in sol/api.sh).
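
On AWS, for example, these rules can be added with security group entries. A minimal sketch, assuming a placeholder security group ID and opening the gossip range for both TCP and UDP:

$ # placeholder security group ID -- substitute your own
$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 8000-8012 --cidr 0.0.0.0/0
$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol udp --port 8000-8012 --cidr 0.0.0.0/0

RPC ports 8899 and 8900 can be opened the same way, restricted to the networks that need them.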

Setting up a single validator

Once an instance has been deployed and is accessible over SSH, we can use Ansible to run the validator setup script. Ansible works by inspecting the contents of the hosts.yaml file, which defines the inventory of servers. To make our servers accessible to Ansible, add your server's URL or IP address to the validators block in deploy/hosts.yaml. This adds the specified server to the validators group, which describes our inventory of validator machines.
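
For illustration, a minimal validators block in a standard Ansible YAML inventory might look like the following; the hostnames are placeholders, and the repository's actual file may carry additional per-host variables (such as nginx_sites, mentioned later in this guide):

all:
  children:
    validators:
      hosts:
        validator-1.example.net:
        validator-2.example.net: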

Ansible will need SSH access to the sol user, which can be granted by adding an SSH key to the sol user's ~/.ssh/authorized_keys file. Our Ansible setup script takes care of this by copying the deploy/authorized_keys file to the sol user's authorized_keys file. Be sure to add any SSH keys that need deploy permissions to that file prior to running the setup script.
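
For example, assuming your public key is at the default path:

$ # key path is an example -- use whichever key needs deploy access
$ cat ~/.ssh/id_ed25519.pub >> deploy/authorized_keys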

deploy/setup.yaml contains a set of common setup steps for configuring a validator from the base OS image. You can run these setup steps using:

$ # run this from the /deploy directory
$ ansible-playbook -i hosts.yaml -l validators setup.yaml

Among other things, the above command will modify your new instance by

  • creating and configuring the sol user
  • installing the contents of deploy/etc/common and deploy/etc/validator to the /etc directory on the validator
  • increasing the memory mapped file limit (see here, and the sketch after this list)
  • increasing the UDP buffer size (see here, and the sketch after this list)
  • increasing the DNS cache size (see deploy/common/dnsmasq.d/local.conf)
  • caching negative DNS replies even if they don't have a TTL (see deploy/common/dnsmasq.d/local.conf)
  • creating a basic NGINX config at /etc/nginx
  • setting up supervisor and running the deploy/validator.conf configuration
  • installing NGINX sites located at deploy/etc/validator/nginx/sites-available/validator.conf
  • installing code from this repository
  • installing the Solana CLI
  • increasing the file descriptor limit for processes managed by supervisor to 600000
  • creating a tmpfs device for the accounts store mounted at /mnt/accounts
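
For reference, the kernel limits mentioned above typically land in /etc/sysctl.d/ with values like the following. These mirror Solana's validator documentation at the time of writing; the playbook's actual values may differ:

# representative sysctl settings -- values taken from Solana's docs;
# the playbook's actual configuration may differ
vm.max_map_count = 1000000       # memory mapped file limit
net.core.rmem_max = 134217728    # maximum UDP receive buffer size
net.core.wmem_max = 134217728    # maximum UDP send buffer size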

After running the setup.yaml script, a reboot is necessary to pick up various system configs. The first time the validator starts, it may incorrectly assume a local ledger is present when there is none; commenting out lines 64-66 of sol/api.sh will prevent this. Post-reboot, supervisor should start up the validator using the sol/api.sh script. The validator will be listening on port 8899 for REST requests, so issuing a curl to the /health path will return the health status of the validator.

$ curl http://localhost:8899/health

This curl will likely return the status behind, as the validator is catching up with the existing cluster. Once the validator has successfully caught up, it will be ready to serve RPC requests on port 8899.

A note about the accounts store

The accounts store is a portion of the Solana state model that requires high throughput access for validators to function properly. One method for achieving higher throughput is to keep the accounts store in memory rather than on disk. A straightforward way to do this is with a tmpfs block device. For the uninitiated, tmpfs is a temporary file storage facility in Unix that emulates a filesystem, with the data actually being stored in memory. Keeping the accounts store in memory allows for quicker accesses and significantly reduces latency for expensive account operations.
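
A minimal sketch of such a mount (setup.yaml already creates an equivalent one for you; the 32G size here is an assumption and must fit within the instance's memory):

$ # illustrative only -- setup.yaml configures an equivalent mount;
$ # the 32G size is an assumption and must fit in instance memory
$ sudo mkdir -p /mnt/accounts
$ sudo mount -t tmpfs -o size=32G tmpfs /mnt/accounts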

TL;DR

  • deploy a cloud instance with a CUDA-enabled NVIDIA GPU
  • choose a disk size of at least 200 GiB
  • whitelist all internet traffic on ports 8000-8012
  • add your ssh key to deploy/authorized_keys
  • add your server's URL or IP address to the validators block of deploy/hosts.yaml
  • run ansible-playbook -i hosts.yaml -l validators setup.yaml from your local terminal
  • ssh into the new machine and reboot to pick up system config changes
  • check that the new validator is running with curl http://localhost:8899

Running a cluster of validators

Running a cluster of validators allows for load balancing RPC requests across many machines. Using this repo's setup, each validator in the cluster can act as a standalone validator and will respond to RPC requests over port 8899. The cluster is composed of many validators with an NGINX load balancer directing traffic.

Adding nodes

To add a node to the cluster, initialize an instance as we did in the first section. After running the setup.yaml script and rebooting the instance, the new validator should be in a state where it is responding to queries on its RPC port.

Adding a load balancer

At this point, we need to add a third machine which will act as the load balancer, sitting in front of the Solana validators and directing traffic to the cluster. The load balancer machine will only be used for proxying traffic to upstream validators, so there is no need to choose a GPU instance as we did for the validators. A general purpose compute instance is likely sufficient (e.g. an AWS m5 instance is a good choice).

The load balancer setup in this repo uses an NGINX reverse proxy. NGINX Plus (the premium version) includes an active health checks feature which is useful for maintaining a healthy cluster. A common failure pattern is for RPC requests to overload compute resources on a validator, causing the validator to fall behind. This leads to client issues: reads directed at the lagging node will be delayed, and writes will fail due to expired transaction signatures. NGINX active health checks provide a mechanism for directing traffic away from a lagging node, which allows the node to recover. Once the node starts passing the health check, NGINX will add it back into the set of load balanced machines. It is highly recommended to use active health checks if your application requires high uptime.

If your validators are located in AWS, one easy way to provision NGINX Plus is through the NGINX Plus Ubuntu AMI. This AMI comes preinstalled with NGINX Plus and removes some of the manual setup involved in licensing. If you choose not to use active health checks, a standard Ubuntu LTS image will suffice.

Configure NGINX

To set up the reverse proxy, we need to specify the upstream servers to which NGINX should proxy incoming requests. The NGINX config works by importing the configurations defined in deploy/etc/lb/nginx/sites-available to form a global configuration. NGINX includes all configuration files located in the /etc/nginx/sites-enabled directory. To enable a site, Ansible creates a symbolic link from the /etc/nginx/sites-available directory to /etc/nginx/sites-enabled. Ansible will create such a link for any site listed in the nginx_sites entry of deploy/hosts.yaml for a particular server.
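
Shown manually for illustration, the link Ansible creates for the validator site is equivalent to:

$ # equivalent manual step -- Ansible performs this for each listed site
$ sudo ln -s /etc/nginx/sites-available/validator.conf /etc/nginx/sites-enabled/validator.conf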

To add servers to the set of upstreams, modify the two upstream blocks of deploy/etc/lb/nginx/sites-available/validator.conf as follows:

...
upstream validator_backend {
    least_conn;
    keepalive 8192;

    # Delete these
    server validator-1.test.net:8899 max_fails=20 fail_timeout=2;
    server validator-2.test.net:8899 max_fails=20 fail_timeout=2;

    # add new servers here at port 8899 for REST connections
}

upstream validator_ws_backend {
    least_conn;

    # Delete these
    server validator-1.test.net:8900 max_fails=20 fail_timeout=2;
    server validator-2.test.net:8900 max_fails=20 fail_timeout=2;

    # add new servers here at port 8900 for websocket connections
}
...

You will also need to replace validator-lb.test.net in the server_name directive with the URL you will be using to access this server. Once these steps are completed, HTTP access on port 80 will be properly configured. To add HTTPS support, modify the certificate files at deploy/etc/lb/ssl to include your HTTPS certificates and change the lines of validator.conf starting with ssl_ to point to your new certificate files.
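
The relevant directives in validator.conf will look something like the following; the domain and certificate paths are placeholders:

...
server_name validator-lb.example.net;          # your public URL

ssl_certificate     /etc/ssl/your-cert.pem;    # placeholder paths --
ssl_certificate_key /etc/ssl/your-key.pem;     # point at your own files
...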

HTTPS certificates can be obtained through any certificate authority; one such CA is Let's Encrypt. Alternatively, if you are routing traffic through Cloudflare to prevent DDoS attacks, Cloudflare provides SSL certificates. If you are running validators behind a firewall and are not concerned with SSL, you can comment out the entire NGINX server block listening on port 443.

Deploying to the load balancer

Once the load balancer machine is provisioned and NGINX is configured, add the URL or IP address of the new machine to the deploy/hosts.yaml file in the load_balancers group. An example load balancer for each setup is provided: if you are using NGINX Plus, use the validator-lb-health-checks.test.net setup, otherwise use the validator-lb-standard.test.net setup. Note that if you are using the NGINX Plus AMI, it is important not to include nginx in the extra_packages list in hosts.yaml. Deploy to the load balancer machine using

$ # run this from the /deploy directory
$ ansible-playbook -i hosts.yaml -l load_balancers setup.yaml

After the deploy, it might be necessary to restart NGINX on the load balancer machine.
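
Assuming NGINX is managed by systemd on the load balancer image, that is:

$ # assumes systemd manages nginx on this image
$ sudo systemctl restart nginx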

Cluster monitoring

One advantage of using NGINX Plus is the monitoring dashboard that is included. With active health checks enabled, the NGINX dashboard served on port 30000 gives an overview of hosts that are failing the health check.

Active health checks work by pinging a Flask server on port 9090. This server checks the block height of the local validator against that of Solana mainnet validators. If the local validator is more than 15 blocks behind (configurable by changing sol/data/unhealthy_block_threshold), the server responds with a 500. NGINX marks servers responding with an error code in the 5xx range as unhealthy and will avoid routing traffic to unhealthy nodes.
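
In NGINX Plus terms, this corresponds to a health_check directive inside the proxied location. A minimal sketch, with illustrative interval/fails/passes values and the assumption that the Flask server answers on its root path:

...
location / {
    proxy_pass http://validator_backend;
    # probe the Flask helper on port 9090; timing values are illustrative
    health_check port=9090 interval=5 fails=3 passes=2;
}
...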

TL;DR

  • set up validator nodes following part one of this guide
  • add URLs for those nodes to the NGINX config
  • set up a load balancer machine which will run an NGINX reverse proxy load balancer
  • modify the NGINX template configs in deploy/etc/lb/nginx/sites-available/. Use validator.conf if you are using standard open source NGINX, and validator-health-checks.conf if you are using NGINX Plus
  • add SSL certificates to deploy/etc/lb/ssl
  • deploy to the load balancer machine using ansible-playbook -i hosts.yaml -l load_balancers setup.yaml