Previous behavior:
When a push was detected in the `main` branch, the workflow would run the
`versioning` job and crash trying to detect the version being deployed as
there was none.
Expected behavior:
Do not fail the `versioning` job when pushing to `main`
Solution:
Limit the `versioning` job to only run when a release event is triggered
and allow the `deploy-nodes` job to run even if `versioning` is skipped
* Show the arguments of acceptance test functions in the logs
* Show all the logs in the "Run tests" jobs
* Document expected "broken pipe" error from `tee`
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* feat(build): deploy long running instances on release
Previous behavior:
Each time we merged to main, new nodes would be deployed. This is
expected behavior, as we need to ensure nodes get deployed and run
without issues, but it could also replace nodes very hastily.
Expected behavior:
We want instances that run for a longer time, to allow us to
troubleshoot issues or inspect the behavior of these instances for longer
periods of time (2+ weeks)
Applied solution:
Deploy a versioned managed instance group (MiG) using the major version
of the release semver. We just use the first part of the version to
replace old instances, and change it when a major version is released
to keep a segregation between new and old versions.
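
As an illustration of the naming idea above, here is a minimal sketch that derives a group name from the major part of a release tag. The helper name, the `zebrad` prefix default, and the accepted tag shapes are assumptions made for this example; the actual workflow computes the value in a GitHub Actions step.

```python
import re

# Hypothetical pattern: an optional leading "v", then MAJOR.MINOR.PATCH.
# Anything after the patch number (for example "-rc.1") is ignored.
_SEMVER_TAG = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)")


def instance_group_name(release_tag: str, prefix: str = "zebrad") -> str:
    """Return a group name that only changes when the major version changes."""
    match = _SEMVER_TAG.match(release_tag)
    if match is None:
        raise ValueError(f"not a semver release tag: {release_tag!r}")
    major = int(match.group(1))
    return f"{prefix}-v{major}"
```

Under these assumptions, every release in the same major series maps to the same name, so a deploy replaces the existing instances, while a new major version gets its own group.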
* ci(build): allow v0 as a major version tag
* fix(build): use rust conventions for versioning
* fix(deploy): improve documentation and trigger on release
* Update .github/workflows/continous-delivery.yml
Co-authored-by: teor <teor@riseup.net>
* fix(versioning): typo
* fix(deploy): use `zebrad-v1` as the instance name, with no SHA
* fix(deploy): create and update MiG must use the same name
* docs(deployments): add Continuous Delivery process
Co-authored-by: teor <teor@riseup.net>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* Expand cached state disks before running tests
* Install partition management tool
* There isn't actually a partition on the cached state image
* Make e2fsck non-interactive
* Limit the length of image names to 63 characters
* Ignore possibly long branch names when matching images, just match the commit
* Increase full sync timeout to 24 hours
Expected sync time is ~21 hours as of August 2022.
* Split final checkpoint job into two smaller jobs to avoid timeouts
Also make regexes easier to read.
* Fix a job name typo
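
The 63-character image name limit mentioned above can be sketched as follows. This is only an illustration: the function name, the `prefix-branch-sha` layout, and the sanitizing rule are assumptions, not the workflow's actual shell logic. The branch part is shortened so the commit hash always survives, which matches the idea of matching images by commit rather than by branch name.

```python
import re

MAX_NAME_LENGTH = 63  # length limit stated in the commit above


def image_name(prefix: str, branch: str, short_sha: str) -> str:
    """Build `prefix-branch-sha`, shortening only the branch part."""
    suffix = f"-{short_sha}"
    # Room left for the branch slug, minus one hyphen in front of it.
    room = MAX_NAME_LENGTH - len(prefix) - len(suffix) - 1
    # Keep lowercase letters and digits, collapse everything else to hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    slug = slug[: max(room, 0)].rstrip("-")
    name = f"{prefix}-{slug}{suffix}" if slug else f"{prefix}{suffix}"
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"name is still too long: {name!r}")
    return name
```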
Previous behavior:
If warnings or errors are added in `.cargo/config.toml` or `clippy.toml`,
and those could generate CI failures, we wouldn't catch the new failures,
as the pipelines are not run when these files are changed
Expected behavior:
If warnings or errors are added in `.cargo/config.toml` or `clippy.toml`,
run all the build and test jobs which also track a `Cargo.toml`.
Solution:
Add `.cargo/config.toml` and `clippy.toml` as paths to all the required
jobs which need to be triggered when these files change.
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* Save cached state on full syncs and updates
* Add an -update suffix to CI images created by updating cached state
* Make disk image names unique by adding a time suffix
* Use the latest image from any branch, but prefer the current commit if available
* Document Zebra's continuous integration tests
* Fix typos in environmental variable names
* Expand documentation
* Fix variable name typo
* Fix shell syntax
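
The image selection rule described above (prefer an image from the current commit, otherwise use the latest image from any branch) could look roughly like this. The `DiskImage` type, the function name, and the assumption that the short commit hash appears as a hyphen-separated part of the image name are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class DiskImage:
    name: str
    created: datetime


def pick_cached_image(images: Iterable[DiskImage], short_sha: str) -> Optional[DiskImage]:
    """Prefer the newest image for this commit, else the newest image overall."""
    all_images = list(images)
    same_commit = [
        image
        for image in all_images
        if f"-{short_sha}-" in image.name or image.name.endswith(f"-{short_sha}")
    ]
    # Fall back to every image when nothing was built from this commit.
    candidates = same_commit or all_images
    return max(candidates, key=lambda image: image.created, default=None)
```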
Previous behavior:
Sometimes Google Cloud authentication fails. This might happen before
IAM permissions are fully propagated
Expected behavior:
If the authentication fails, retry at least 3 times before exiting with
a non-zero exit code
Applied solution:
Google GitHub Actions for auth recently added a `retries` feature,
which is now used to work around this issue.
Note: 95a6bc2a27
Fixes https://github.com/ZcashFoundation/zebra/issues/4846
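
The retry behavior described above can be illustrated with a generic helper. This is a sketch of the general technique only, not the auth action's implementation; the function name and its parameters are assumptions for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    retries: int = 3,
    delay_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` once, then retry up to `retries` more times on failure."""
    if retries < 0:
        raise ValueError("retries must not be negative")
    last_error: Exception = RuntimeError("action was never called")
    for attempt in range(retries + 1):
        try:
            return action()
        except Exception as error:
            last_error = error
            # Wait between attempts, but not after the final failure.
            if attempt < retries:
                sleep(delay_seconds)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

A caller would wrap the flaky step, for example `with_retries(authenticate, retries=3, delay_seconds=5)`, where `authenticate` is a placeholder for whatever performs the login.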
* update timeout
* update the doc comment
* Increase test timeouts for Zebra update syncs
* Stop failing the 1740k job if the cached state is after block 1740k
Co-authored-by: teor <teor@riseup.net>
* Apply the same Rust logging settings to all GitHub workflows
* Enable full optimisations in dev builds for downloading large parameter files
* Disable beta Rust tests in CI
* Make code execution time logs shorter
* Do ZK parameter preloads in the lightwalletd tests that need them
* Try to re-launch `lightwalletd` when it hangs during sync tests
* Increase full sync timeout
* Clear the `zebrad` logs during `lightwalletd` tests, to avoid logging deadlocks
* Actually clear more than one line of logs
* Check zebrad and lightwalletd output in parallel threads, while waiting for zebrad
* Check zebrad and lightwalletd output in parallel threads, while waiting for lightwalletd
* Improve test logging
* Fix a log typo
* Only wait for lightwalletd once, because its logs stop after the initial sync
* Look for cached state disks for this commit and branch first
* Only copy the state once in the send transactions test
* Wait longer for lightwalletd gRPC server startup
* Add some function docs
* cargo fmt --all
* Fix clippy::let_and_return
* Increase lightwalletd test timeouts for zebrad slowness
* Add a `zebrad_update_sync()` test, that update syncs Zebra without lightwalletd
* Run the zebrad-update-sync test in CI
* Add extra zebrad time to work around lightwalletd bugs
* Initialize the rayon threadpool with a new config for CPU-bound threads
* Verify proofs and signatures on the rayon thread pool
* Only spawn one concurrent batch per verifier, for now
* Allow tower-batch to queue multiple batches
* Fix up a potentially incorrect comment
* Rename some variables for concurrent batches
* Spawn multiple batches concurrently, without any limits
* Simplify batch worker loop using OptionFuture
* Clear pending batches once they finish
* Stop accepting new items when we're at the concurrent batch limit
* Fail queued requests on drop
* Move pending_items and the batch timer into the worker struct
* Add worker fields to batch trace logs
* Run docker tests on PR series
* During full verification, process 20 blocks concurrently
* Remove an outdated comment about yielding to other tasks
* Make the release checklist shorter and hide some details
* Ignore any `fastmod` updates to previous release notes in `CHANGELOG.md`
* Use recent versions in examples
* Fix markdown that doesn't render correctly
* Fix some weird line breaks
* Use capital letters to start list items
* Clarify `fastmod` and `CHANGELOG.md`
* Clarify version format by changing highlighting
* Checkout zebra in each job to avoid warnings
But put TODOs where we might be able to skip checkouts
* Split log following into sprout checkpoints, sapling/orchard checkpoints, and full validation
* Make job IDs shorter
* Use /dev/stderr because docker doesn't have a tty
* remove pipefail
* Revert "remove pipefail"
This reverts commit a7ee37bebdc107a4215e7dd307b189d925969234.
* Make tee ignore errors writing to a grep pipe
* Avoid launching multiple docker instances for duplicate jobs
* Ignore broken pipe error messages and statuses
* fix(ci): docker wait not finding container
We had this issue before. I can't recall whether it was a parsing error between GitHub Actions and gcloud `--command` parsing, but we had to split this into two pieces.
This implementation keeps it how we did it before 9b9578c999/.github/workflows/test.yml (L235-L243)
* docs: remove pending TODO
We can't remove `actions/checkout` or set `create_credentials_file` to `false`, as the next steps won't be able to authenticate to GCP.
We could remove `actions/checkout` and leave `create_credentials_file` as `true`, but this would raise a warning on each step, and there's no benefit in doing so.
* Show `docker wait` and `gcloud ssh` output
* If `docker wait` fails, get the exit code using `docker inspect`
Co-authored-by: Conrado Gouvea <conrado@zfnd.org>
Co-authored-by: Gustavo Valverde <gustavo@iterativo.do>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
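
The `docker wait` fallback described above can be sketched as follows. The function names and the injectable `run` parameter are assumptions made so the logic can be exercised without Docker; the actual workflow runs these commands in a shell over `gcloud` SSH.

```python
import subprocess
from typing import Callable, Sequence


def _run(cmd: Sequence[str]) -> subprocess.CompletedProcess:
    """Run a command and capture its output as text."""
    return subprocess.run(cmd, capture_output=True, text=True)


def container_exit_code(
    container: str,
    run: Callable[[Sequence[str]], subprocess.CompletedProcess] = _run,
) -> int:
    """Get a container's exit code from `docker wait`, else `docker inspect`."""
    waited = run(["docker", "wait", container])
    output = waited.stdout.strip()
    if waited.returncode == 0 and output.isdigit():
        return int(output)
    # `docker wait` failed, so read the recorded exit code instead.
    inspected = run(
        ["docker", "inspect", "--format", "{{.State.ExitCode}}", container]
    )
    if inspected.returncode != 0:
        raise RuntimeError(f"could not inspect {container}: {inspected.stderr}")
    return int(inspected.stdout.strip())
```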
* Put arguments to "docker run" on different lines
And update some comments.
* Split docker run into launch, logs, and wait
* Remove mistaken "needs state" condition on log and results job
* Exit the ssh and the job with the container test's exit status
* Split full sync into checkpoint and full validation
* Sort workflow variables into categories and add descriptions
* Split Create instance/volume and Run test into separate jobs
* Copy initial conditions to all jobs in the series
* Actually create a cached state image
* fix(state): use same disk naming convention for all test instances
Co-authored-by: Gustavo Valverde <gustavo@iterativo.do>
* feat(ci): build each crate individually
* fix(ci): use valid names for each job
* feat(ci): builds and checks with and without all features
* refactor(ci): build job matrix dynamically
* fix: use a JSON_CRATES variable with resulting values
* test: check-matrix
* fix(ci): use "crate" in singular for reference
* imp(ci): use a matrix for feature build arguments
* fix(ci): use correct naming and includes
* fix(ci): implement most recommendations given in review
* fix(ci): use simpler shell script
* fix: typo
* fix: add string to file, not cmd
* fix: some shellchecks
* fix(ci): remove warnings and errors from shellcheck
* imp(ci): add patch file for `Build crates individually` workflow
* Remove unused configs in patch job
Co-authored-by: teor <teor@riseup.net>
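
A dynamically built job matrix, as described above, amounts to emitting a JSON document that the workflow can expand into one job per combination. The sketch below is an illustration only: the function name, the key names, and the default feature arguments are assumptions, and the real workflow builds its `JSON_CRATES` value in a shell step.

```python
import json
from typing import Iterable, Sequence


def crate_matrix(
    crates: Iterable[str],
    feature_args: Sequence[str] = ("", "--no-default-features", "--all-features"),
) -> str:
    """Return a JSON matrix with one axis for crates and one for feature flags."""
    matrix = {
        # Sort and de-duplicate so the generated job list is stable.
        "crate": sorted(set(crates)),
        "features": list(feature_args),
    }
    return json.dumps(matrix)
```

The resulting string could be written to a job output and read back with `fromJSON` in a later job's `strategy.matrix`.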
* feat(actions): delete old GCP resources
* fix(ci): delete old instances templates
* fix(actions): use correct date arguments and conversion
* fix(actions): missing command in gcloud
* fix(gcp): if an instance can't be deleted, continue
* refactor(action): cleanup and execute monthly
* increase lightwalletd timeout
* switch back to aditya's fork
* manually point to new aditya's lightwalletd image
* disable sync_one_checkpoint_testnet test
* disable restart_stop_at_height in testnet
* revert to 'latest' lightwalletd image