* Remove a redundant sprout full sync job
* Add two new full sync jobs
* Allow the full sync test to run for 48 hours (estimated current time 40-45 hours)
* chore: add Network as a label
* Fix network parameter in continous-delivery.yml
* Standardise network usage in zcashd-manual-deploy
* Use lowercase network labels
* Fix some shellcheck errors
* Hard-code a Mainnet default to support contexts where env is not available
* Fix string syntax
* Fix more shellcheck errors
* Update .github/workflows/zcashd-manual-deploy.yml
Co-authored-by: Gustavo Valverde <gustavo@iterativo.do>
Co-authored-by: Arya <aryasolhi@gmail.com>
* feat(gcp): add label to instances for cost and logs grouping
Previous behavior:
We couldn't search GCP logs using the instance name if that instance was
already deleted. And if we want to know how we're spending our budget its
also difficult to know if specific tests or type of instances are the one
responsible for a certain % of the costs
Fixes#5153
Fixses #5543
Expected behavior:
Be able to search logs using the test ID or at least the github reference,
and be able to group GCP costs by labels
Solution:
- Add labels to instances
* chore: add Network as a label
* Revert "chore: add Network as a label"
This reverts commit 146f747d50.
* Update .github/workflows/zcashd-manual-deploy.yml
Co-authored-by: teor <teor@riseup.net>
Co-authored-by: teor <teor@riseup.net>
* ci(compute): use debian public image on the VM, not the container
Previous behavior:
We were pulling the debian image the wrong way, as this was being used
as a container but it was meant to be the VM image
The image being pulled to create the internal container has been causing
crashes as this images do not exists on Google's container repositories
Expected behavior:
Use a public image as debian-11 to get multiple benefits from it, as being
able to use machine-images (#5615) and automatic disk resizing (which
is now possible as we're using COS images, but those are more restrictive)
Solution
Add `--image-project=debian-cloud` and `--image-family=debian-11` as
stated in the official documentation: https://cloud.google.com/sdk/gcloud/reference/compute/instances/create-with-container#--image-project
More info: https://cloud.google.com/compute/docs/images/os-details#import
* fix: use a public image with docker on the host
* fix(logs): missing sudo before docker command
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* feat(ssh): enable OS Login for GCP test instances
* fix(ssh): force service account impersonation for OS Login
* debug: show actual user trying to impersonate SA
* fix(glcloud): configure gcloud before running commands
* fix(ssh): add VM zone to ssh command
* fix(auth): bringing changes from #5614
* fix(auth): impersonation is working as expected now
* fix(gcloud): setup the GCP CLI after authenticating (#5606)
Previous behavior:
`gcloud` commands have been running without an appropiate authentication
as the `auth` auction was sucessfully executed, but the actual gcloud
CLI being used in further jobs was not using the correct configuration
nor credentials
Expected behavior:
All `gcloud` commands should be properly configured and authenticated.
Solution:
Add the `google-github-actions/setup-gcloud` action after each
`google-github-actions/auth` invocation, and before running any `gcloud`
command.
Remove the need of an OAuth Access token when not required by following
steps
* fix(auth): revert to latest version
* fix: wrong replace
* fix(ci): use a specific debian image for VM containers
* fix(ssh): delete generated SSH keys by CI after 30 seconds
* debug: remove debug commands
* fix(compute): use a lightweight container image
* fix(ci): add missing sudo to docker command
* Update .github/workflows/deploy-gcp-tests.yml
Co-authored-by: Deirdre Connolly <durumcrustulum@gmail.com>
* fix(ssh): delete ssh-keys for the specific GHA service account
Co-authored-by: Deirdre Connolly <durumcrustulum@gmail.com>
Previous behavior:
`gcloud` commands have been running without an appropiate authentication
as the `auth` auction was sucessfully executed, but the actual gcloud
CLI being used in further jobs was not using the correct configuration
nor credentials
Expected behavior:
All `gcloud` commands should be properly configured and authenticated.
Solution:
Add the `google-github-actions/setup-gcloud` action after each
`google-github-actions/auth` invocation, and before running any `gcloud`
command.
Remove the need of an OAuth Access token when not required by following
steps
* Only run multiple test jobs if they are needed for a long test
* Remove unused job steps
* Remove trailing whitespace
* Follow logs in the Run step
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* Revert "ci(ssh): connect using `ssh-compute` action by Google (#5330)"
This reverts commit b366d6e7bb.
* ci(ssh): use sudo for docker commands if user is not root
* ci(ssh): specify the service account to connect with
* ci(ssh): increase the Google Cloud instance sshd connection limit
* chore: add a new line at the end of the script
* chore: update our VM image to bullseye
* chore: fix `tj-actions/changed-files` file comparison
* ci(disk): use an official image on CI VMs for disk auto-resizing
Previous behavior:
We've presented issues in the past with resizing as the device is busy,
for example:
```
e2fsck: Cannot continue, aborting.
/dev/sdb is in use.
```
Expected behavior:
We've been manually resizing the disk as this task was not being done
automatically, but having an official Public Image from GCP would make
this easier (automatic) and it also integrates better with other GCP
services
Configuration differences: https://cloud.google.com/compute/docs/images/os-details#notable-difference-debian
Solution:
- Use `debian-11` from the official public images https://cloud.google.com/compute/docs/images/os-details#debian
- Remove the manual disk resizing from the pipeline
* ci: increase VM disk size to fit future cached states sizes
Some GCP disk images are 160 GB, which means they could get to the current
200 GB size soon.
* Revert "ci(ssh): connect using `ssh-compute` action by Google (#5330)"
This reverts commit b366d6e7bb.
* ci(ssh): use sudo for docker commands if user is not root
* ci(ssh): specify the service account to connect with
* ci(ssh): increase the Google Cloud instance sshd connection limit
* chore: add a new line at the end of the script
* chore: update our VM image to bullseye
* chore: fix `tj-actions/changed-files` file comparison
* refactor(ssh): connect using `ssh-compute` action by Google
Previous behavior:
From time to time SSH connections to deployed VMs fails with the following
error: `kex_exchange_identification: Connection closed by remote host`
This was still happening after implementing https://github.com/ZcashFoundation/zebra/pull/5292
Excpected behavior:
Ensure we're not creating SSH key pairs on the fly to improve our connections
guarantees
Solution:
- Enable the Cloud Identity-Aware Proxy API in GCP
- Create a firewall rule to enable connections from IAP
- Grant the required IAM permissions to enable IAP TCP forwarding
- Generate an SSH keys pair and set a private key as an input param
- Set the GitHub Action SA to have authorized ssh connection to the VMs
- Implement the `google-github-actions/ssh-compute` action to connect
* fix(ssh): id `compute-ssh` cannot be used more than once within the same scope
* fix(ci): try to enclose commands to override parsing issues
* tmp: remove ssh_args
* fix(action): secrets must be inherited to be used
* tmp: validate command enclosing fixes executin
* fix(ssh): ssh_args are not implemented correctly
* fix(ssh): login with the root user
* fix(privelege): uso sudo with docker commands
* tmp: add sudo
* fix(ssh): use sudo for all docker commands
* fix(ssh): add missing `sudo` commands
* fix(ssh): get sync height from ssh stdout
* fix(height): get the height correctly
Previous behavior
From time to time SSH connections to deployed VMs fails with the following
error: `kex_exchange_identification: Connection closed by remote host`
Expected behavior
If the connection fails, attempt to reconnect once again (or multiple times)
Solution
Add the `ConnectionAttempts` and `ConnectTimeout` with 20 and 5 values
respectively, which attempst to reconnect 19 more times every 5 seconds
* Increase search range for sync height
* Update sync height regexes for zebrad and lwd cached states
* Add labels to cached state images
* Update deploy-gcp-tests.yml
* Don't create new cached states for lwd updates
* Add a missing line continuation
* Fix a comment
* Revert a mistaken comment change
* Clarify a TODO comment
* Partially revert to old docker height log handling
* Use an output for the cached disk name
* Increase search range for sync height
* Update sync height regexes for zebrad and lwd cached states
* Add labels to cached state images
* Add a missing line continuation
* Show the arguments of acceptance test functions in the logs
* Show all the logs in the "Run tests" jobs
* Document expected "broken pipe" error from `tee`
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* Expand cached state disks before running tests
* Install partition management tool
* There isn't actually a partition on the cached state image
* Make e2fsck non-interactive
* Limit the length of image names to 63 characters
* Ignore possibly long branch names when matching images, just match the commit
* Increase full sync timeout to 24 hours
Expected sync time is ~21 hours as of August 2022.
* Split final checkpoint job into two smaller jobs to avoid timeouts
Also make regexes easier to read.
* Fix a job name typo
* Save cached state on full syncs and updates
* Add an -update suffix to CI images created by updating cached state
* Make disk image names unique by adding a time suffix
* Use the latest image from any branch, but prefer the current commit if available
* Document Zebra's continuous integration tests
* Fix typos in environmental variable names
* Expand documentation
* Fix variable name typo
* Fix shell syntax
Previous behavior:
Sometimes Google Cloud authentication fails, this might happen before
IAM permissions are fully propagated
Expected behavior:
If the authentication fails, retry at least 3 times before exiting with
a non zero exit code
Applied solution:
Google GitHub Actions for auth recently added this a `retries` feature
which is now implemented to workaround this issue.
Note: 95a6bc2a27
Fixes https://github.com/ZcashFoundation/zebra/issues/4846
* update timeout
* update the doc comment
* Increase test timeouts for Zebra update syncs
* Stop failing the 1740k job if the cached state is after block 1740k
Co-authored-by: teor <teor@riseup.net>
* Make code execution time logs shorter
* Do ZK parameter preloads in the lightwalletd tests that need them
* Try to re-launch `lightwalletd` when it hangs during sync tests
* Increase full sync timeout
* Clear the `zebrad` logs during `lightwalletd` tests, to avoid logging deadlocks
* Actually clear more than one line of logs
* Check zebrad and lightwalletd output in parallel threads, while waiting for zebrad
* Check zebrad and lightwalletd output in parallel threads, while waiting for lightwalletd
* Improve test logging
* Fix a log typo
* Only wait for lightwalletd once, because its logs stop after the initial sync
* Look for cached state disks for this commit and branch first
* Only copy the state once in the send transactions test
* Wait longer for lightwalletd gRPC server startup
* Add some function docs
* cargo fmt --all
* Checkout zebra in each job to avoid warnings
But put TODOs where we might be able to skip checkouts
* Split log following into sprout checkpoints, sapling/orchard checkpoints, and full validation
* Make job IDs shorter
* Use /dev/stderr because docker doesn't have a tty
* remove pipefail
* Revert "remove pipefail"
This reverts commit a7ee37bebdc107a4215e7dd307b189d925969234.
* Make tee ignore errors writing to a grep pipe
* Avoid launching multiple docker instances for duplicate jobs
* Ignore broken pipe error messages and statuses
* fix(ci): docker wait not finding container
We had this issue before, I can't recall if this was a parsing error between GitHub Actions and gcloud `--command` parsing, but we had to change this into two pieces.
This implementation keeps it how we did it before 9b9578c999/.github/workflows/test.yml (L235-L243)
* docs: remove pending TODO
We can't remove `actions/checkout` nor set `create_credentials_file` to `false` as next steps won't be able to authenticate to GCP.
We can surely remove `actions/checkout` and leave `create_credentials_file` as `true`, but this will raise a warning on each step, and there's no benefit of doing so.
* Show `docker wait` and `gcloud ssh` output
* If `docker wait` fails, get the exit code using `docker inspect`
Co-authored-by: Conrado Gouvea <conrado@zfnd.org>
Co-authored-by: Gustavo Valverde <gustavo@iterativo.do>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* Put arguments to "docker run" on different lines
And update some comments.
* Split docker run into launch, logs, and wait
* Remove mistaken "needs state" condition on log and results job
* Exit the ssh and the job with the container test's exit status
* Split full sync into checkpoint and full validation
* Sort workflow variables into categories and add descriptions
* Split Create instance/volume and Run test into separate jobs
* Copy initial conditions to all jobs in the series
* Actually create a cached state image
* fix(state): use same disk naming convention for all test instances
Co-authored-by: Gustavo Valverde <gustavo@iterativo.do>
* Upgrade tracing and related dependencies
```sh
cargo upgrade --workspace
tracing-error
tracing-subscrber
color-eyre
tracing-flame
tracing-journald
sentry
sentry-tracing
metrics
metrics-exporter-prometheus
reqwest
```
* Update duplicate dependency checks
* Enable the tracing/env-filter feature
* Fix type inference for metrics
Manual changes, plus:
```sh
fastmod "as _" "as f64"
```
* Tidy up some unrelated test code
* Update metrics-exporter-prometheus API
And make unused dependencies optional.
* Adjust test regexes to new tracing format
Also fix some regex bugs, and refactor to simplify.
* Disable color-eyre span traces and track caller in release builds
* Add a feature that enables extra debugging in release builds
* Clean up some redundant features
* Increase a test timeout
* refactor(ci): keep tests jobs under the 6 hour timeout
When running a full sync or any other test which takes almost 5 hours, having those jobs running with other actions that might take several minutes, also reduces the overall time from the job_id.
We use a separate job for image creation and deletion to handle this cases.
* fix(ci): instance deletion can't run on non finished tests
* fix(ci): tests without a cached state might save to disk
* fix(ci): ignore failures when deleting an instance
* fix(ci): remove delete step `needs` redundancy
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* fix(ci): allow for the lightwalletd-full-sync to mount the lwd-cache dir
* fix(ci): compare with a string
* imp(ci): run a lightwalletd tip if there's no lwd tip disk available
* docs(ci): add TODO explaining this is a temporal condition