114 lines
7.7 KiB
Go
114 lines
7.7 KiB
Go
|
/*
|
||
|
Package upgrade provides a Cosmos SDK module that can be used for smoothly upgrading a live Cosmos chain to a
|
||
|
new software version. It accomplishes this by providing a BeginBlocker hook that prevents the blockchain state
|
||
|
machine from proceeding once a pre-defined upgrade block time or height has been reached. The module does not prescribe
|
||
|
anything regarding how governance decides to do an upgrade, but just the mechanism for coordinating the upgrade safely.
|
||
|
Without software support for upgrades, upgrading a live chain is risky because all of the validators need to pause
|
||
|
their state machines at exactly the same point in the process. If this is not done correctly, there can be state
|
||
|
inconsistencies which are hard to recover from.
|
||
|
|
||
|
General Workflow
|
||
|
|
||
|
Let's assume we are running v0.38.0 of our software in our testnet and want to upgrade to v0.40.0.
|
||
|
How would this look in practice? First of all, we want to finalize the v0.40.0 release candidate
|
||
|
and there install a specially named upgrade handler (eg. "testnet-v2" or even "v0.40.0"). An upgrade
|
||
|
handler should be defined in a new version of the software to define what migrations
|
||
|
to run to migrate from the older version of the software. Naturally, this is app-specific rather
|
||
|
than module specific, and must be defined in `app.go`, even if it imports logic from various
|
||
|
modules to perform the actions. You can register them with `upgradeKeeper.SetUpgradeHandler`
|
||
|
during the app initialization (before starting the abci server), and they serve not only to
|
||
|
perform a migration, but also to identify if this is the old or new version (eg. presence of
|
||
|
a handler registered for the named upgrade).
|
||
|
|
||
|
Once the release candidate along with an appropriate upgrade handler is frozen,
|
||
|
we can have a governance vote to approve this upgrade at some future block time
|
||
|
or block height (e.g. 200000). This is known as an upgrade.Plan. The v0.38.0 code will not know of this
|
||
|
handler, but will continue to run until block 200000, when the plan kicks in at BeginBlock. It will check
|
||
|
for existence of the handler, and finding it missing, know that it is running the obsolete software,
|
||
|
and gracefully exit.
|
||
|
|
||
|
Generally the application binary will restart on exit, but then will execute this BeginBlocker
|
||
|
again and exit, causing a restart loop. Either the operator can manually install the new software,
|
||
|
or you can make use of an external watcher daemon to possibly download and then switch binaries,
|
||
|
also potentially doing a backup. An example of such a daemon is https://github.com/regen-network/cosmosd/
|
||
|
described below under "Automation".
|
||
|
|
||
|
When the binary restarts with the upgraded version (here v0.40.0), it will detect we have registered the
|
||
|
"testnet-v2" upgrade handler in the code, and realize it is the new version. It then will run the upgrade handler
|
||
|
and *migrate the database in-place*. Once finished, it marks the upgrade as done, and continues processing
|
||
|
the rest of the block as normal. Once 2/3 of the voting power has upgraded, the blockchain will immediately
|
||
|
resume the consensus mechanism. If the majority of operators add a custom `do-upgrade` script, this should
|
||
|
be a matter of minutes and not even require them to be awake at that time.
|
||
|
|
||
|
Integrating With An App
|
||
|
|
||
|
Setup an upgrade Keeper for the app and then define a BeginBlocker that calls the upgrade
|
||
|
keeper's BeginBlocker method:
|
||
|
func (app *myApp) BeginBlocker(ctx sdk.Context, req abci.RequestBeginBlock) abci.ResponseBeginBlock {
|
||
|
app.upgradeKeeper.BeginBlocker(ctx, req)
|
||
|
return abci.ResponseBeginBlock{}
|
||
|
}
|
||
|
|
||
|
The app must then integrate the upgrade keeper with its governance module as appropriate. The governance module
|
||
|
should call ScheduleUpgrade to schedule an upgrade and ClearUpgradePlan to cancel a pending upgrade.
|
||
|
|
||
|
Performing Upgrades
|
||
|
|
||
|
Upgrades can be scheduled at either a predefined block height or time. Once this block height or time is reached, the
|
||
|
existing software will cease to process ABCI messages and a new version with code that handles the upgrade must be deployed.
|
||
|
All upgrades are coordinated by a unique upgrade name that cannot be reused on the same blockchain. In order for the upgrade
|
||
|
module to know that the upgrade has been safely applied, a handler with the name of the upgrade must be installed.
|
||
|
Here is an example handler for an upgrade named "my-fancy-upgrade":
|
||
|
app.upgradeKeeper.SetUpgradeHandler("my-fancy-upgrade", func(ctx sdk.Context, plan upgrade.Plan) {
|
||
|
// Perform any migrations of the state store needed for this upgrade
|
||
|
})
|
||
|
|
||
|
This upgrade handler performs the dual function of alerting the upgrade module that the named upgrade has been applied,
|
||
|
as well as providing the opportunity for the upgraded software to perform any necessary state migrations. Both the halt
|
||
|
(with the old binary) and applying the migration (with the new binary) are enforced in the state machine. Actually
|
||
|
switching the binaries is an ops task and not handled inside the sdk / abci app.
|
||
|
|
||
|
Halt Behavior
|
||
|
|
||
|
Before halting the ABCI state machine in the BeginBlocker method, the upgrade module will log an error
|
||
|
that looks like:
|
||
|
UPGRADE "<Name>" NEEDED at height <NNNN>: <Info>
|
||
|
where Name are Info are the values of the respective fields on the upgrade Plan.
|
||
|
|
||
|
To perform the actual halt of the blockchain, the upgrade keeper simply panics which prevents the ABCI state machine
|
||
|
from proceeding but doesn't actually exit the process. Exiting the process can cause issues for other nodes that start
|
||
|
to lose connectivity with the exiting nodes, thus this module prefers to just halt but not exit.
|
||
|
|
||
|
Automation and Plan.Info
|
||
|
|
||
|
We have deprecated calling out to scripts, instead with propose https://github.com/regen-network/cosmosd
|
||
|
as a model for a watcher daemon that can launch gaiad as a subprocess and then read the upgrade log message
|
||
|
to swap binaries as needed. You can pass in information into Plan.Info according to the format
|
||
|
specified here https://github.com/regen-network/cosmosd/blob/master/README.md#auto-download .
|
||
|
This will allow a properly configured cosmsod daemon to auto-download new binaries and auto-upgrade.
|
||
|
As noted there, this is intended more for full nodes than validators.
|
||
|
|
||
|
Cancelling Upgrades
|
||
|
|
||
|
There are two ways to cancel a planned upgrade - with on-chain governance or off-chain social consensus.
|
||
|
For the first one, there is a CancelSoftwareUpgrade proposal type, which can be voted on and will
|
||
|
remove the scheduled upgrade plan. Of course this requires that the upgrade was known to be a bad idea
|
||
|
well before the upgrade itself, to allow time for a vote. If you want to allow such a possibility, you
|
||
|
should set the upgrade height to be 2 * (votingperiod + depositperiod) + (safety delta) from the beginning of
|
||
|
the first upgrade proposal. Safety delta is the time available from the success of an upgrade proposal
|
||
|
and the realization it was a bad idea (due to external testing). You can also start a CancelSoftwareUpgrade
|
||
|
proposal while the original SoftwareUpgrade proposal is still being voted upon, as long as the voting
|
||
|
period ends after the SoftwareUpgrade proposal.
|
||
|
|
||
|
However, let's assume that we don't realize the upgrade has a bug until shortly before it will occur
|
||
|
(or while we try it out - hitting some panic in the migration). It would seem the blockchain is stuck,
|
||
|
but we need to allow an escape for social consensus to overrule the planned upgrade. To do so, we are
|
||
|
adding a --unsafe-skip-upgrade flag to the start command, which will cause the node to mark the upgrade
|
||
|
as done upon hiting the planned upgrade height, without halting and without actually performing a migration.
|
||
|
If over two-thirds run their nodes with this flag on the old binary, it will allow the chain to continue through
|
||
|
the upgrade with a manual override. (This must be well-documented for anyone syncing from genesis later on).
|
||
|
|
||
|
(Skip-upgrade flag is in a WIP PR - will update this text when merged ^^)
|
||
|
*/
|
||
|
package upgrade
|