* Move format upgrades to their own module and enum
* Launch a format change thread if needed, and shut it down during shutdown
* Add some TODOs and remove a redundant timer
* Regularly check for panics in the state upgrade task
* Only run example upgrade once, change version field names
* Increment database format to 25.0.2: add format change task
* Log the running and initial disk database format versions on startup
* Add initial disk and running state versions to cached state images in CI
* Fix missing imports
* Fix typo in logs workflow command
* Add a force_save_to_disk argument to the CI workflow
* Move use_internet_connection into zebrad_config()
* fastmod can_spawn_zebrad_for_rpc can_spawn_zebrad_for_test_type zebra*
* Add a spawn_zebrad_without_rpc() function
* Remove unused copy_state() test code
* Assert that upgrades and downgrades happen with the correct versions
* Add a kill_and_return_output() method for tests
* Add a test for new_state_format() versions (no upgrades or downgrades)
* Add use_internet_connection to can_spawn_zebrad_for_test_type()
* Fix workflow parameter passing
* Check that reopening a new database doesn't upgrade (or downgrade) the format
* Allow ephemeral to be set to false even if we don't have a cached state
* Add a test type that will accept any kind of state
* When re-using a directory, configure the state test config with that path
* Actually mark newly created databases with their format versions
* Wait for the state to be opened before testing the format
* Run state format tests on mainnet and testnet configs (no network access)
* run multiple reopens in tests
* Test upgrades run correctly
* Test that version downgrades work as expected (best effort)
* Add a TODO for testing partial updates
* Fix missing test arguments
* clippy if chain
* Fix typo
* another typo
* Pass a database instance to the format upgrade task
* Fix a timing issue in the tests
* Fix version matching in CI
* Use correct env var reference
* Use correct github env file
* Wait for the database to be written before killing Zebra
* Use correct workflow syntax
* Version changes aren't always upgrades
---------
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* Delete an unused CI job that was previously partially deleted
* Add 2 more jobs to the full sync test
* Increase Rust test time: current expected time is 60 hours
* ci(build): unpin specific `buildkit` version
We previously had an issue with the following error: `cannot reuse body, request must be retried`
This commonly was a wrong error, caused by a containerd issue which has being tracked and solved here: https://github.com/docker/build-push-action/issues/761#issuecomment-1406261692
We're having errors when building, and this might be caused by an underliying error which containerd is not showing us correctly.
* ci(deploy): Use specific subnetworks on GCP VMs
* refactor(ci): use GitHub secrets and variables
We've been using values that are variable across multiple workflows,
and those can only be changed if modifying the workflows, but we should
be able to change the values without committing new changes in the code
for this purpose we're now using GitHub Variables, and even moving
non-sensitive information into variables instead of secrets. Allowing
more flexibility and other scenarios that should be easier to manage,
like deploying to Mainnet or Testnet.
* refactor(ci): use new GitHub variables for GCP auth
* fix(ci): typo
* fix(ci): do not use multiple variables for the same value
* fix(ci): typo in variable
* fix(vars): use different variables for machine types
* fix(vars): missing substitution
* fix: typo
* fix: make the input CI network override the default network
* Use the correct network variable for creating disks
---------
Co-authored-by: teor <teor@riseup.net>
* Remove a redundant sprout full sync job
* Add two new full sync jobs
* Allow the full sync test to run for 48 hours (estimated current time 40-45 hours)
* chore: add Network as a label
* Fix network parameter in continous-delivery.yml
* Standardise network usage in zcashd-manual-deploy
* Use lowercase network labels
* Fix some shellcheck errors
* Hard-code a Mainnet default to support contexts where env is not available
* Fix string syntax
* Fix more shellcheck errors
* Update .github/workflows/zcashd-manual-deploy.yml
Co-authored-by: Gustavo Valverde <gustavo@iterativo.do>
Co-authored-by: Arya <aryasolhi@gmail.com>
* feat(gcp): add label to instances for cost and logs grouping
Previous behavior:
We couldn't search GCP logs using the instance name if that instance was
already deleted. And if we want to know how we're spending our budget its
also difficult to know if specific tests or type of instances are the one
responsible for a certain % of the costs
Fixes#5153
Fixses #5543
Expected behavior:
Be able to search logs using the test ID or at least the github reference,
and be able to group GCP costs by labels
Solution:
- Add labels to instances
* chore: add Network as a label
* Revert "chore: add Network as a label"
This reverts commit 146f747d50.
* Update .github/workflows/zcashd-manual-deploy.yml
Co-authored-by: teor <teor@riseup.net>
Co-authored-by: teor <teor@riseup.net>
* ci(compute): use debian public image on the VM, not the container
Previous behavior:
We were pulling the debian image the wrong way, as this was being used
as a container but it was meant to be the VM image
The image being pulled to create the internal container has been causing
crashes as this images do not exists on Google's container repositories
Expected behavior:
Use a public image as debian-11 to get multiple benefits from it, as being
able to use machine-images (#5615) and automatic disk resizing (which
is now possible as we're using COS images, but those are more restrictive)
Solution
Add `--image-project=debian-cloud` and `--image-family=debian-11` as
stated in the official documentation: https://cloud.google.com/sdk/gcloud/reference/compute/instances/create-with-container#--image-project
More info: https://cloud.google.com/compute/docs/images/os-details#import
* fix: use a public image with docker on the host
* fix(logs): missing sudo before docker command
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* feat(ssh): enable OS Login for GCP test instances
* fix(ssh): force service account impersonation for OS Login
* debug: show actual user trying to impersonate SA
* fix(glcloud): configure gcloud before running commands
* fix(ssh): add VM zone to ssh command
* fix(auth): bringing changes from #5614
* fix(auth): impersonation is working as expected now
* fix(gcloud): setup the GCP CLI after authenticating (#5606)
Previous behavior:
`gcloud` commands have been running without an appropiate authentication
as the `auth` auction was sucessfully executed, but the actual gcloud
CLI being used in further jobs was not using the correct configuration
nor credentials
Expected behavior:
All `gcloud` commands should be properly configured and authenticated.
Solution:
Add the `google-github-actions/setup-gcloud` action after each
`google-github-actions/auth` invocation, and before running any `gcloud`
command.
Remove the need of an OAuth Access token when not required by following
steps
* fix(auth): revert to latest version
* fix: wrong replace
* fix(ci): use a specific debian image for VM containers
* fix(ssh): delete generated SSH keys by CI after 30 seconds
* debug: remove debug commands
* fix(compute): use a lightweight container image
* fix(ci): add missing sudo to docker command
* Update .github/workflows/deploy-gcp-tests.yml
Co-authored-by: Deirdre Connolly <durumcrustulum@gmail.com>
* fix(ssh): delete ssh-keys for the specific GHA service account
Co-authored-by: Deirdre Connolly <durumcrustulum@gmail.com>
Previous behavior:
`gcloud` commands have been running without an appropiate authentication
as the `auth` auction was sucessfully executed, but the actual gcloud
CLI being used in further jobs was not using the correct configuration
nor credentials
Expected behavior:
All `gcloud` commands should be properly configured and authenticated.
Solution:
Add the `google-github-actions/setup-gcloud` action after each
`google-github-actions/auth` invocation, and before running any `gcloud`
command.
Remove the need of an OAuth Access token when not required by following
steps
* Only run multiple test jobs if they are needed for a long test
* Remove unused job steps
* Remove trailing whitespace
* Follow logs in the Run step
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* Revert "ci(ssh): connect using `ssh-compute` action by Google (#5330)"
This reverts commit b366d6e7bb.
* ci(ssh): use sudo for docker commands if user is not root
* ci(ssh): specify the service account to connect with
* ci(ssh): increase the Google Cloud instance sshd connection limit
* chore: add a new line at the end of the script
* chore: update our VM image to bullseye
* chore: fix `tj-actions/changed-files` file comparison
* ci(disk): use an official image on CI VMs for disk auto-resizing
Previous behavior:
We've presented issues in the past with resizing as the device is busy,
for example:
```
e2fsck: Cannot continue, aborting.
/dev/sdb is in use.
```
Expected behavior:
We've been manually resizing the disk as this task was not being done
automatically, but having an official Public Image from GCP would make
this easier (automatic) and it also integrates better with other GCP
services
Configuration differences: https://cloud.google.com/compute/docs/images/os-details#notable-difference-debian
Solution:
- Use `debian-11` from the official public images https://cloud.google.com/compute/docs/images/os-details#debian
- Remove the manual disk resizing from the pipeline
* ci: increase VM disk size to fit future cached states sizes
Some GCP disk images are 160 GB, which means they could get to the current
200 GB size soon.
* Revert "ci(ssh): connect using `ssh-compute` action by Google (#5330)"
This reverts commit b366d6e7bb.
* ci(ssh): use sudo for docker commands if user is not root
* ci(ssh): specify the service account to connect with
* ci(ssh): increase the Google Cloud instance sshd connection limit
* chore: add a new line at the end of the script
* chore: update our VM image to bullseye
* chore: fix `tj-actions/changed-files` file comparison
* refactor(ssh): connect using `ssh-compute` action by Google
Previous behavior:
From time to time SSH connections to deployed VMs fails with the following
error: `kex_exchange_identification: Connection closed by remote host`
This was still happening after implementing https://github.com/ZcashFoundation/zebra/pull/5292
Excpected behavior:
Ensure we're not creating SSH key pairs on the fly to improve our connections
guarantees
Solution:
- Enable the Cloud Identity-Aware Proxy API in GCP
- Create a firewall rule to enable connections from IAP
- Grant the required IAM permissions to enable IAP TCP forwarding
- Generate an SSH keys pair and set a private key as an input param
- Set the GitHub Action SA to have authorized ssh connection to the VMs
- Implement the `google-github-actions/ssh-compute` action to connect
* fix(ssh): id `compute-ssh` cannot be used more than once within the same scope
* fix(ci): try to enclose commands to override parsing issues
* tmp: remove ssh_args
* fix(action): secrets must be inherited to be used
* tmp: validate command enclosing fixes executin
* fix(ssh): ssh_args are not implemented correctly
* fix(ssh): login with the root user
* fix(privelege): uso sudo with docker commands
* tmp: add sudo
* fix(ssh): use sudo for all docker commands
* fix(ssh): add missing `sudo` commands
* fix(ssh): get sync height from ssh stdout
* fix(height): get the height correctly
Previous behavior
From time to time SSH connections to deployed VMs fails with the following
error: `kex_exchange_identification: Connection closed by remote host`
Expected behavior
If the connection fails, attempt to reconnect once again (or multiple times)
Solution
Add the `ConnectionAttempts` and `ConnectTimeout` with 20 and 5 values
respectively, which attempst to reconnect 19 more times every 5 seconds