* ref(docker): remove all unrequired docker arguments
* fix(ci): use correct `$NETWORK` approach for tests
* fix(release): do not change default `$NETWORK` for experimental image
* Update .github/workflows/continous-integration-docker.yml
Co-authored-by: Marek <mail@marek.onl>
* Revert "fix(release): do not change default `$NETWORK` for experimental image"
This reverts commit bd5b6c831b.
* fix: typo
---------
Co-authored-by: Marek <mail@marek.onl>
* Update the workflow run conditions for CI docker tests
* Run release builds and release Docker image tests on pull requests
* Remove the manual docker test from the release checklist
* Fix workflow syntax
* Use the right kind of quotes
* fix(deploy): allow the container to raise in MIGs
* fix(docker): add the `ZEBRA_CACHED_STATE_DIR` as a default `ENV`
This no longer requires the env variable to be defined in other places, unless we're changing the default configuration
* ci(build): unpin specific `buildkit` version
We previously had an issue with the following error: `cannot reuse body, request must be retried`
This was usually a misleading error, caused by a containerd issue which has been tracked and solved here: https://github.com/docker/build-push-action/issues/761#issuecomment-1406261692
We're having errors when building, and this might be caused by an underlying error which containerd is not showing us correctly.
* ci(deploy): Use specific subnetworks on GCP VMs
* refactor(ci): use GitHub secrets and variables
We've been using values that vary across multiple workflows,
and those could only be changed by modifying the workflows, but we should
be able to change the values without committing new changes to the code.
For this purpose we're now using GitHub Variables, and even moving
non-sensitive information into variables instead of secrets. This allows
more flexibility and makes other scenarios easier to manage,
like deploying to Mainnet or Testnet.
* refactor(ci): use new GitHub variables for GCP auth
* fix(ci): typo
* fix(ci): do not use multiple variables for the same value
* fix(ci): typo in variable
* fix(vars): use different variables for machine types
* fix(vars): missing substitution
* fix: typo
* fix: make the input CI network override the default network
* Use the correct network variable for creating disks
---------
Co-authored-by: teor <teor@riseup.net>
* ci: add a test to validate Zebra's config file and path
* fix: use `ZEBRA_CONF_PATH` as single variable locating the conf
* fix: do not remove the containers
* fix: use extended regex
* fix: use different steps to validate the conf tests
* fix: do not specify a default CMD for running Docker in test builds
* fix: use actual starting commands for entrypoint
* fix: do not add cargo twice if cargo is in $1
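The check described above can be sketched as follows. This is only an illustration under assumptions: `resolve_command` is a hypothetical helper, not the actual entrypoint code, and it prints the command line instead of executing it.

```shell
# Hypothetical helper: print the command line an entrypoint would run.
# If the first argument is already "cargo", pass the arguments through;
# otherwise prepend "cargo" exactly once.
resolve_command() {
    if [ "$1" = "cargo" ]; then
        echo "$*"
    else
        echo "cargo $*"
    fi
}

# Both calls print the same command line, with a single "cargo".
resolve_command cargo test --locked
resolve_command test --locked
```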
* fix: allow to run `zebrad` in the `tests` stage of Dockerfile
* fix: new entrypoint does not allow an empty CMD
* fix: do not duplicate the `zebrad` command
* fix: segregate configuration jobs
* refactor(entrypoint): handle better parameters conditions
* fix: make `zebrad` an executable command in `tests` stage
* Show the commands that are being executed in the new docker test
* Show full logs without tee or grep
* Apply suggestions from code review
Co-authored-by: teor <teor@riseup.net>
* fix: use the actual path inside docker
* fix: use `grep` with exit code
If the container is logging to stderr, piping works only for stdout, so we're adding `2>&1`
* fix: use `grep -q` to get an exit code
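A minimal sketch of why `2>&1` and `grep -q` are needed together. `fake_container` and its log message are hypothetical stand-ins for a container that logs only to stderr.

```shell
# Hypothetical stand-in for a container that logs only to stderr.
fake_container() {
    echo "Loaded zebrad config" >&2
}

# Without 2>&1, only stdout reaches the pipe, so grep finds nothing.
# (stderr is discarded here just to keep the output quiet.)
if fake_container 2>/dev/null | grep -q "Loaded zebrad config"; then
    stdout_only="found"
else
    stdout_only="missing"
fi

# With 2>&1, stderr is merged into the pipe, so grep -q can match,
# and its exit code decides which branch runs.
if fake_container 2>&1 | grep -q "Loaded zebrad config"; then
    merged="found"
else
    merged="missing"
fi

echo "stdout only: $stdout_only"
echo "stderr merged: $merged"
```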
* fix: fail if any error is detected
* fix: fail if this test takes more than 5 minutes
* fix: update patch workflows
* feat: test Dockerfile `runtime` config
* fix: depend on the configuration test to continue
Co-authored-by: teor <teor@riseup.net>
* fix(ci): remove warnings caused by missing `actions/checkout`
* fix: typo in arguments
* fix: add the whole disk name as this is a single instance
* fix: add disk name to mount
* chore: add Network as a label
* Fix network parameter in continous-delivery.yml
* Standardise network usage in zcashd-manual-deploy
* Use lowercase network labels
* Fix some shellcheck errors
* Hard-code a Mainnet default to support contexts where env is not available
* Fix string syntax
* Fix more shellcheck errors
* Update .github/workflows/zcashd-manual-deploy.yml
Co-authored-by: Gustavo Valverde <gustavo@iterativo.do>
Co-authored-by: Arya <aryasolhi@gmail.com>
* feat(gcp): add label to instances for cost and logs grouping
Previous behavior:
We couldn't search GCP logs using the instance name if that instance was
already deleted. And if we want to know how we're spending our budget, it's
also difficult to know whether specific tests or types of instances are
responsible for a certain % of the costs
Fixes #5153
Fixes #5543
Expected behavior:
Be able to search logs using the test ID or at least the github reference,
and be able to group GCP costs by labels
Solution:
- Add labels to instances
* chore: add Network as a label
* Revert "chore: add Network as a label"
This reverts commit 146f747d50.
* Update .github/workflows/zcashd-manual-deploy.yml
Co-authored-by: teor <teor@riseup.net>
* fix(cd): allow deploying instance templates without disk errors
Motivation:
PR #5670 failed in `main` because it was tested with `gcloud compute instances create-with-container`,
and the manual deployment, which also uses `instances`, works.
But the job that failed uses `gcloud compute instance-templates create-with-container`
(`instance-templates`), and it complains with: `When attaching or creating a disk that is also being mounted to a container, must specify the disk name`
According to the documentation, the name is optional when using `create-with-container`,
for both `instances` and `instance-templates`
Source: https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create-with-container#--container-mount-disk
Solution:
Revert this specific job to how it was, and do not scale the instances
above 1, as this would cause the following error:
`Instance template specifies a disk with a custom name. This will cause instance group not to scale beyond 1 instance per zone.`
* chore: reduce diff
* ci(compute): use debian public image on the VM, not the container
Previous behavior:
We were pulling the debian image the wrong way, as it was being used
as a container image but it was meant to be the VM image
The image being pulled to create the internal container has been causing
crashes, as these images do not exist in Google's container repositories
Expected behavior:
Use a public image such as debian-11 to get multiple benefits from it, such as
being able to use machine-images (#5615) and automatic disk resizing (which
is now possible as we're using COS images, but those are more restrictive)
Solution:
Add `--image-project=debian-cloud` and `--image-family=debian-11` as
stated in the official documentation: https://cloud.google.com/sdk/gcloud/reference/compute/instances/create-with-container#--image-project
More info: https://cloud.google.com/compute/docs/images/os-details#import
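For reference, a sketch of how the two flags fit into the command. It only prints the invocation and does not run `gcloud`. `example-instance` is a hypothetical instance name, and the container image and other flags the real job needs are omitted.

```shell
# Print (not run) the gcloud invocation so the image flags can be reviewed.
# The instance name passed in is a hypothetical placeholder.
build_create_command() {
    instance_name="$1"
    printf '%s ' gcloud compute instances create-with-container "$instance_name" \
        --image-project=debian-cloud \
        --image-family=debian-11
    printf '\n'
}

create_command="$(build_create_command example-instance)"
echo "$create_command"
```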
* fix: use a public image with docker on the host
* fix(logs): missing sudo before docker command
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Previous behavior:
`gcloud` commands have been running without appropriate authentication:
the `auth` action was successfully executed, but the actual gcloud
CLI being used in further jobs was not using the correct configuration
or credentials
Expected behavior:
All `gcloud` commands should be properly configured and authenticated.
Solution:
Add the `google-github-actions/setup-gcloud` action after each
`google-github-actions/auth` invocation, and before running any `gcloud`
command.
Remove the need for an OAuth access token when it is not required by the
following steps
* Run CI jobs on dependent PRs
* Change job names to be unique
* Fix outdated workflow name
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* Revert "ci(ssh): connect using `ssh-compute` action by Google (#5330)"
This reverts commit b366d6e7bb.
* ci(ssh): use sudo for docker commands if user is not root
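One way to express this check, as a sketch rather than the script used in the workflow. `docker_prefix` is a hypothetical helper and `CONTAINER_ID` is a placeholder. The command is only printed, not executed.

```shell
# Print "sudo" when the given numeric user ID is not root (0), else nothing.
docker_prefix() {
    if [ "$1" -ne 0 ]; then
        echo "sudo"
    fi
}

# `id -u` prints the numeric ID of the current user.
prefix="$(docker_prefix "$(id -u)")"

# ${prefix:+$prefix } expands to "sudo " only when prefix is non-empty.
echo "Would run: ${prefix:+$prefix }docker logs CONTAINER_ID"
```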
* ci(ssh): specify the service account to connect with
* ci(ssh): increase the Google Cloud instance sshd connection limit
* chore: add a new line at the end of the script
* chore: update our VM image to bullseye
* chore: fix `tj-actions/changed-files` file comparison
* ci(disk): use an official image on CI VMs for disk auto-resizing
Previous behavior:
We've had issues in the past with resizing because the device was busy,
for example:
```
e2fsck: Cannot continue, aborting.
/dev/sdb is in use.
```
Expected behavior:
We've been manually resizing the disk as this task was not being done
automatically, but having an official Public Image from GCP would make
this easier (automatic) and it also integrates better with other GCP
services
Configuration differences: https://cloud.google.com/compute/docs/images/os-details#notable-difference-debian
Solution:
- Use `debian-11` from the official public images https://cloud.google.com/compute/docs/images/os-details#debian
- Remove the manual disk resizing from the pipeline
* ci: increase VM disk size to fit future cached states sizes
Some GCP disk images are 160 GB, which means they could get to the current
200 GB size soon.
* ci(concurrency)!: run a single CI workflow as required
Previous behavior:
Multiple Mainnet full syncs were able to run on the main branch at the
same time, and pushing multiple commits to the same branch would run
multiple CI workflows, when only the run from the last commit was relevant
Expected behavior:
Ensure that only a single CI workflow runs at the same time in PRs.
The latest commit should cancel any previous running workflows from the
same PR.
Solution:
Use GitHub actions concurrency feature https://docs.github.com/en/actions/using-jobs/using-concurrency
Fixes https://github.com/ZcashFoundation/zebra/issues/4977
Fixes https://github.com/ZcashFoundation/zebra/issues/4857
* docs: typo
* ci(concurrency): do not cancel running full syncs
Co-authored-by: teor <teor@riseup.net>
* fix(concurrency): explain the behavior better & add new ones
Co-authored-by: teor <teor@riseup.net>
Previous behavior:
When a push was detected in the `main` branch, the workflow would run the
`versioning` job and crash trying to detect the version being deployed as
there was none.
Expected behavior:
Do not fail the `versioning` job when pushing to `main`
Solution:
Limit the `versioning` job to only run when a release event is triggered
and allow the `deploy-nodes` job to run even if `versioning` is skipped