Remove testnet/metrics server debug info from book

This commit is contained in:
Michael Vines 2019-04-10 11:55:02 -07:00 committed by Grimes
parent 842d146b0d
commit f243a96e01
2 changed files with 41 additions and 40 deletions

View File

@ -125,6 +125,47 @@ can run your own testnet using the scripts in the `net/` directory.
Edit `ci/testnet-manager.sh`
## Metrics Server Maintenance
Sometimes the dashboard becomes unresponsive. This happens due to glitch in the metrics server.
The current solution is to reset the metrics server. Use the following steps.
1. The server is hosted in a GCP VM instance. Check if the VM instance is down by trying to SSH
into it from the GCP console. The name of the VM is ```metrics-solana-com```.
2. If the VM is inaccessible, reset it from the GCP console.
3. Once VM is up (or, was already up), the metrics services can be restarted from build automation.
1. Navigate to https://buildkite.com/solana-labs/metrics-dot-solana-dot-com in your web browser
2. Click on ```New Build```
3. This will show a pop up dialog. Click on ```options``` drop down.
4. Type in ```FORCE_START=true``` in ```Environment Variables``` text box.
5. Click ```Create Build```
6. This will restart the metrics services, and the dashboards should be accessible afterwards.
## Debugging Testnet
Testnet may exhibit different symptoms of failures. Primary statistics to check are
1. Rise in Confirmation Time
2. Nodes are not voting
3. Panics, and OOM notifications
Check the following if there are any signs of failure.
1. Did testnet deployment fail?
1. View buildkite logs for the last deployment: https://buildkite.com/solana-labs/testnet-management
2. Use the relevant branch
3. If the deployment failed, look at the build logs. The build artifacts for each remote node is uploaded.
It's a good first step to triage from these logs.
2. You may have to log into remote node if the deployment succeeded, but something failed during runtime.
1. Get the private key for the testnet deployment from ```metrics-solana-com``` GCP instance.
2. SSH into ```metrics-solana-com``` using GCP console and do the following.
```bash
sudo bash
cd ~buildkite-agent/.ssh
ls
```
3. Copy the relevant private key to your local machine
4. Find the public IP address of the AWS instance for the remote node using AWS console
5. ```ssh -i <private key file> ubuntu@<ip address of remote node>```
6. The logs are in ```~solana\solana``` folder
Benchmarking
---

View File

@ -99,43 +99,3 @@ export SOLANA_METRICS_CONFIG="db=testnet-beta,u=${u:?},p=${p:?}"
source scripts/configure-metrics.sh
```
Inspect for your contributions to our [metrics dashboard](https://metrics.solana.com:3000/d/U9-26Cqmk/testnet-monitor-cloud?refresh=60s&orgId=2&var-hostid=All).
#### Metrics Server Maintenance
Sometimes the dashboard becomes unresponsive. This happens due to glitch in the metrics server.
The current solution is to reset the metrics server. Use the following steps.
1. The server is hosted in a GCP VM instance. Check if the VM instance is down by trying to SSH
into it from the GCP console. The name of the VM is ```metrics-solana-com```.
2. If the VM is inaccessible, reset it from the GCP console.
3. Once VM is up (or, was already up), the metrics services can be restarted from build automation.
1. Navigate to https://buildkite.com/solana-labs/metrics-dot-solana-dot-com in your web browser
2. Click on ```New Build```
3. This will show a pop up dialog. Click on ```options``` drop down.
4. Type in ```FORCE_START=true``` in ```Environment Variables``` text box.
5. Click ```Create Build```
6. This will restart the metrics services, and the dashboards should be accessible afterwards.
#### Debugging Testnet
Testnet may exhibit different symptoms of failures. Primary statistics to check are
1. Rise in Confirmation Time
2. Nodes are not voting
3. Panics, and OOM notifications
Check the following if there are any signs of failure.
1. Did testnet deployment fail?
1. View buildkite logs for the last deployment: https://buildkite.com/solana-labs/testnet-management
2. Use the relevant branch
3. If the deployment failed, look at the build logs. The build artifacts for each remote node is uploaded.
It's a good first step to triage from these logs.
2. You may have to log into remote node if the deployment succeeded, but something failed during runtime.
1. Get the private key for the testnet deployment from ```metrics-solana-com``` GCP instance.
2. SSH into ```metrics-solana-com``` using GCP console and do the following.
```bash
sudo bash
cd ~buildkite-agent/.ssh
ls
```
3. Copy the relevant private key to your local machine
4. Find the public IP address of the AWS instance for the remote node using AWS console
5. ```ssh -i <private key file> ubuntu@<ip address of remote node>```
6. The logs are in ```~solana\solana``` folder