Remove testnet/metrics server debug info from book
This commit is contained in:
parent
842d146b0d
commit
f243a96e01
41
README.md
41
README.md
|
@ -125,6 +125,47 @@ can run your own testnet using the scripts in the `net/` directory.
|
|||
Edit `ci/testnet-manager.sh`
|
||||
|
||||
|
||||
## Metrics Server Maintenance
|
||||
Sometimes the dashboard becomes unresponsive. This happens due to glitch in the metrics server.
|
||||
The current solution is to reset the metrics server. Use the following steps.
|
||||
|
||||
1. The server is hosted in a GCP VM instance. Check if the VM instance is down by trying to SSH
|
||||
into it from the GCP console. The name of the VM is ```metrics-solana-com```.
|
||||
2. If the VM is inaccessible, reset it from the GCP console.
|
||||
3. Once VM is up (or, was already up), the metrics services can be restarted from build automation.
|
||||
1. Navigate to https://buildkite.com/solana-labs/metrics-dot-solana-dot-com in your web browser
|
||||
2. Click on ```New Build```
|
||||
3. This will show a pop up dialog. Click on ```options``` drop down.
|
||||
4. Type in ```FORCE_START=true``` in ```Environment Variables``` text box.
|
||||
5. Click ```Create Build```
|
||||
6. This will restart the metrics services, and the dashboards should be accessible afterwards.
|
||||
|
||||
## Debugging Testnet
|
||||
Testnet may exhibit different symptoms of failures. Primary statistics to check are
|
||||
1. Rise in Confirmation Time
|
||||
2. Nodes are not voting
|
||||
3. Panics, and OOM notifications
|
||||
|
||||
Check the following if there are any signs of failure.
|
||||
1. Did testnet deployment fail?
|
||||
1. View buildkite logs for the last deployment: https://buildkite.com/solana-labs/testnet-management
|
||||
2. Use the relevant branch
|
||||
3. If the deployment failed, look at the build logs. The build artifacts for each remote node is uploaded.
|
||||
It's a good first step to triage from these logs.
|
||||
2. You may have to log into remote node if the deployment succeeded, but something failed during runtime.
|
||||
1. Get the private key for the testnet deployment from ```metrics-solana-com``` GCP instance.
|
||||
2. SSH into ```metrics-solana-com``` using GCP console and do the following.
|
||||
```bash
|
||||
sudo bash
|
||||
cd ~buildkite-agent/.ssh
|
||||
ls
|
||||
```
|
||||
3. Copy the relevant private key to your local machine
|
||||
4. Find the public IP address of the AWS instance for the remote node using AWS console
|
||||
5. ```ssh -i <private key file> ubuntu@<ip address of remote node>```
|
||||
6. The logs are in ```~solana\solana``` folder
|
||||
|
||||
|
||||
Benchmarking
|
||||
---
|
||||
|
||||
|
|
|
@ -99,43 +99,3 @@ export SOLANA_METRICS_CONFIG="db=testnet-beta,u=${u:?},p=${p:?}"
|
|||
source scripts/configure-metrics.sh
|
||||
```
|
||||
Inspect for your contributions to our [metrics dashboard](https://metrics.solana.com:3000/d/U9-26Cqmk/testnet-monitor-cloud?refresh=60s&orgId=2&var-hostid=All).
|
||||
|
||||
#### Metrics Server Maintenance
|
||||
Sometimes the dashboard becomes unresponsive. This happens due to glitch in the metrics server.
|
||||
The current solution is to reset the metrics server. Use the following steps.
|
||||
|
||||
1. The server is hosted in a GCP VM instance. Check if the VM instance is down by trying to SSH
|
||||
into it from the GCP console. The name of the VM is ```metrics-solana-com```.
|
||||
2. If the VM is inaccessible, reset it from the GCP console.
|
||||
3. Once VM is up (or, was already up), the metrics services can be restarted from build automation.
|
||||
1. Navigate to https://buildkite.com/solana-labs/metrics-dot-solana-dot-com in your web browser
|
||||
2. Click on ```New Build```
|
||||
3. This will show a pop up dialog. Click on ```options``` drop down.
|
||||
4. Type in ```FORCE_START=true``` in ```Environment Variables``` text box.
|
||||
5. Click ```Create Build```
|
||||
6. This will restart the metrics services, and the dashboards should be accessible afterwards.
|
||||
|
||||
#### Debugging Testnet
|
||||
Testnet may exhibit different symptoms of failures. Primary statistics to check are
|
||||
1. Rise in Confirmation Time
|
||||
2. Nodes are not voting
|
||||
3. Panics, and OOM notifications
|
||||
|
||||
Check the following if there are any signs of failure.
|
||||
1. Did testnet deployment fail?
|
||||
1. View buildkite logs for the last deployment: https://buildkite.com/solana-labs/testnet-management
|
||||
2. Use the relevant branch
|
||||
3. If the deployment failed, look at the build logs. The build artifacts for each remote node is uploaded.
|
||||
It's a good first step to triage from these logs.
|
||||
2. You may have to log into remote node if the deployment succeeded, but something failed during runtime.
|
||||
1. Get the private key for the testnet deployment from ```metrics-solana-com``` GCP instance.
|
||||
2. SSH into ```metrics-solana-com``` using GCP console and do the following.
|
||||
```bash
|
||||
sudo bash
|
||||
cd ~buildkite-agent/.ssh
|
||||
ls
|
||||
```
|
||||
3. Copy the relevant private key to your local machine
|
||||
4. Find the public IP address of the AWS instance for the remote node using AWS console
|
||||
5. ```ssh -i <private key file> ubuntu@<ip address of remote node>```
|
||||
6. The logs are in ```~solana\solana``` folder
|
||||
|
|
Loading…
Reference in New Issue