[Proposal ]Partitioned Inflationary Rewards Distribution (#27455)

* add epoch-boundary-stake-reward proposal

* 80 col

* clarify rewarding interval selection for skipping slots

* update proposal with reward credit based on jeff's comments

* Update docs/src/proposals/epoch-boundary-stake-reward.md

Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>

* Update docs/src/proposals/epoch-boundary-stake-reward.md

Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>

* rename

* update proposal with more feedbacks

* revise

* update with carl's feedback

* use mathmatic notation to clarify interval boundaries

* more feedbacks

* remove parenthesis

* update snapshot paragraph

* update with reward calc service

* more feedbacks

* update with more feedbacks

* more feedbacks from carllin

Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>
This commit is contained in:
HaoranYi 2022-10-06 14:26:47 -05:00 committed by GitHub
parent d9ef04772d
commit 6eeedaec4f
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
1 changed files with 140 additions and 0 deletions

View File

@ -0,0 +1,140 @@
---
title: Partitioned Inflationary Rewards Distribution
---
## Problem
With the increase of number of stake accounts, computing and redeeming the stake
rewards at the start block of the epoch boundary becomes very expensive.
Currently, with 550K stake accounts, the stake reward time has already taken
more than 10 seconds. This prolonged computation slows down the network, and can
cause large number of forks at the epoch boundary, which makes the matter even
worse.
## Proposed Solutions
Instead of computing and reward stake accounts at epoch boundary, we will
decouple reward computation and reward credit into two phases.
A separate service, "EpochRewardCalculationService" will be created. The service
will listen to a channel for any incoming rewards calculation requests, and
perform the calculation for the rewards. For each block that cross the epoch
boundary, the bank will send a request to the `EpochRewardCalculationService`.
This marks the start of the reward computation phase.
```
N-1 -- N -- N+1
\
\
N+2
```
In the above example, N is the start of the new epoch. Two rewards calculation
requests will be sent out at slot N and slot N+2 because they both cross the
epoch boundary and are on different forks. To avoid repeated computation with
the same input, the signature of the computation requests, `hash(epoch_number,
hash(stake_accounts_data), hash(vote_accounts), hash(delegation_map))`, are
calculated. Duplicated computation requests will be discard. For the above
example, if there are no stake/vote accounts changes between slot N and slot
N+2, the 2nd computation request will be discarded.
When reaching block height `N` after the start of the `reward computation
phase`, the bank starts the second phase - reward credit, in which, the bank
first query the `epoch calc service` with the request signature to get the
rewards result, which will be resented as a map from accounts_pubkey->rewards,
then credit the rewards to the stake accounts for the next `M` blocks. If the
rewards result is not available, the bank will wait until the results are
available.
We call them: <br/>
(a) calculating interval: `[epoch_start, epoch_start+N]` <br/>
(b) credit interval: `[epoch_start+N+1, epoch_start+N+M]`, respectively. <br/>
And the combined interval `[epoch_start, epoch_start+N+M]` is called
`rewarding interval`.
For `calculating interval`, `N` is chosen to be sufficiently large so that the
background computation should have completed and the result of the reward
computation is available at the end of `calculating interval`. `N` can be fixed
such as 100 (roughly equivalent to 50 seconds), or chosen as a function of the
number of stake accounts, `f(num_stake_accounts)`.
In `credit interval`, the bank will fetch the reward computation results from
the background thread and start credit the rewards during the next `M` blocks.
The idea is partition the accounts into `M` partitions. And each block, the bank
credit `1/M` accounts. The partition is required to be deterministic for the
current epoch, but must also be random across different epochs. One way to
achieve these properties is to hash the account's pubkey with some epoch
dependent values, sort the results, and divide them into `M` bins. The epoch
dependent value can be the epoch number, total rewards for the epoch, the leader
pubkey for the epoch block, etc. `M` can be choses based on 50K account per
block, which equal to `ceil(num_stake_accounts/50,000)`.
`num_stake_account` is extracted from `leader_schedule_epoch` block, so we don't
run into discrepancy where new transactions right before an epoch boundary
creates one fork with `X` stake accounts and another fork with `Y` stake accounts.
In order to avoid putting extra burden of computing and credit the stake reward
for blocks produced during the `rewarding interval`, we can reduce the compute
budget limits on those blocks in `rewarding interval`, and reserve some computing
and read/write capacity to perform stake rewarding.
### Challenges
1. stake accounts reads/writes during the `rewarding interval`
`epoch_start..epoch_start+N+M` Because of the delayed credit of the rewards,
Reads to those stake accounts will not return the value that the user are
expecting (viz. not include the recent epoch stake rewards). Writes to those
stake accounts will be lost once the reward are credited on block
`epoch_start+N+M`. We will need to modify the runtime to restrict read/writes to
stake accounts during the `rewarding interval`. Any transactions, which involves
stake accounts, will result in a new execution error, i.e. "stake rewards
pending, account access is restricted". However, normal rpc queries, such as
'getBalance', will return the current lamport of the account. The user can
expect the rewards to be credit as some time point during the 'rewarding
interval'.
2. snapshot taken during the `rewarding interval`
If a snapshot is taken during the `rewarding interval`, it would miss the
rewards for the stake accounts. Any plain restart from those snapshots will be
wrong, unless we reconstruct the rewards from the recent epoch boundary. This
will add some complexity to validator restart. In the first implementation, we
will force *not* taking any snapshot and *not* performing accounts hash
calculation during the `rewarding interval`. Incremental snapshot request will
be skipped. Full snapshot request will be re-queued be picked up later at the
end of the `reward interval`.
In future, if needed, we can
revisit to enable taking snapshots and perform hash calculation during reward
interval.
3. account-db related action during the `rewarding interval`
Account-db related action such as flush, clean, squash, shrink etc. may touch
and evict the stake accounts from account db's cache during the `rewarding
interval`. This will slow down the credit in the future at bank `epoch_start+N`.
We may need to exclude such accounts_db actions for stake_accounts during
`rewarding interval`. This is going to be a performance tuning problem. In the
first implementation, for simplicity, we will keep the account-db action as it
is, and make the `credit interval` larger to accommodate the performance hit
when writing back those accounts. In future, we can continue tuning account db
actions during 'rewarding interval'.
4. view of total epoch capitalization change
The view of total epoch capitalization, instead of being available at every
epoch boundary, is only available after the `rewarding interval`. Any third
party application logic, which depends on total epoch capitalization, need to
wait after `rewarding interval`.
5. `getInflationReward` JSONRPC API method call
Today, the `getInflationReward` JSONRPC API method call can simply grab the
first block in the target epoch and lookup the target stake account's rewards
entry. With these changes, the call will need updated to derive the target
stake account's credit block, grab _that_ block, then lookup rewards.
Additionally we'll need to return more informative errors for queries made
during the lockout period, so users can know that their rewards are pending for
the target epoch. A new rpc API, i.e. `getRewardInterval`, will be added for
querying the `rewarding interval` for the current epoch.