[Proposal ]Partitioned Inflationary Rewards Distribution (#27455)
* add epoch-boundary-stake-reward proposal
* 80 col
* clarify rewarding interval selection for skipping slots
* update proposal with reward credit based on jeff's comments
* Update docs/src/proposals/epoch-boundary-stake-reward.md
  Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>
* Update docs/src/proposals/epoch-boundary-stake-reward.md
  Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>
* rename
* update proposal with more feedbacks
* revise
* update with carl's feedback
* use mathmatic notation to clarify interval boundaries
* more feedbacks
* remove parenthesis
* update snapshot paragraph
* update with reward calc service
* more feedbacks
* update with more feedbacks
* more feedbacks from carllin

Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>
parent d9ef04772d
commit 6eeedaec4f
@ -0,0 +1,140 @@
---
title: Partitioned Inflationary Rewards Distribution
---

## Problem

As the number of stake accounts grows, computing and redeeming stake rewards in
the first block of a new epoch becomes very expensive. Currently, with 550K
stake accounts, the stake reward computation already takes more than 10
seconds. This prolonged computation slows down the network and can cause a
large number of forks at the epoch boundary, which makes matters even worse.

## Proposed Solutions

Instead of computing and crediting stake rewards at the epoch boundary, we will
decouple reward computation and reward credit into two phases.

A separate service, `EpochRewardCalculationService`, will be created. The
service will listen on a channel for incoming reward calculation requests and
perform the reward calculations. For each block that crosses the epoch
boundary, the bank will send a request to the `EpochRewardCalculationService`.
This marks the start of the reward computation phase.

```
N-1 -- N -- N+1
   \
    \
     N+2
```

In the above example, N is the start of the new epoch. Two reward calculation
requests will be sent out, at slot N and at slot N+2, because both blocks cross
the epoch boundary and are on different forks. To avoid repeated computation
with the same input, a signature is calculated for each computation request:
`hash(epoch_number, hash(stake_accounts_data), hash(vote_accounts),
hash(delegation_map))`. Duplicate computation requests will be discarded. In
the above example, if no stake or vote accounts change between slot N and slot
N+2, the second computation request will be discarded.
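
The request signature and the duplicate check described above could be sketched
roughly as follows. This is only an illustration: `RewardCalcRequest` and
`RequestDeduper` are hypothetical names, the three `u64` digests stand in for
`hash(stake_accounts_data)`, `hash(vote_accounts)` and `hash(delegation_map)`,
and the standard library's `DefaultHasher` is used as a non-cryptographic
placeholder for whatever hash the real service would use.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical request type. Each digest is assumed to be computed elsewhere
/// from the corresponding account data.
#[derive(Clone, Debug, Hash)]
pub struct RewardCalcRequest {
    pub epoch: u64,
    pub stake_accounts_hash: u64,
    pub vote_accounts_hash: u64,
    pub delegation_map_hash: u64,
}

impl RewardCalcRequest {
    /// Signature over the epoch number and the three input digests.
    /// `DefaultHasher` is a stand-in, not a cryptographic hash.
    pub fn signature(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.hash(&mut hasher);
        hasher.finish()
    }
}

/// Remembers the signatures of requests that were already accepted.
#[derive(Default)]
pub struct RequestDeduper {
    seen: HashSet<u64>,
}

impl RequestDeduper {
    /// Returns `true` if the request is new and should be computed, and
    /// `false` if a request with the same signature was already accepted.
    pub fn accept(&mut self, request: &RewardCalcRequest) -> bool {
        self.seen.insert(request.signature())
    }
}
```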

When reaching block height `N` after the start of the reward computation phase
(`N` here is a number of blocks, not the slot labelled N in the example above),
the bank starts the second phase, reward credit. The bank first queries the
`EpochRewardCalculationService` with the request signature to get the rewards
result, which will be represented as a map from `accounts_pubkey` to rewards,
and then credits the rewards to the stake accounts over the next `M` blocks. If
the rewards result is not available, the bank will wait until it is.

We call these two phases: <br/>
(a) calculating interval: `[epoch_start, epoch_start+N]` <br/>
(b) credit interval: `[epoch_start+N+1, epoch_start+N+M]` <br/>
The combined interval `[epoch_start, epoch_start+N+M]` is called the
`rewarding interval`.
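
To make the inclusive interval boundaries concrete, here is a small sketch that
classifies a block height against the two intervals. The names `RewardPhase`
and `reward_phase` are made up for this example, and block heights are assumed
to be plain `u64` values.

```rust
/// Which part of the rewarding interval a block falls into.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RewardPhase {
    /// `[epoch_start, epoch_start+N]`
    Calculating,
    /// `[epoch_start+N+1, epoch_start+N+M]`
    Credit,
    /// Outside `[epoch_start, epoch_start+N+M]`
    Inactive,
}

/// Classifies `block_height` using the inclusive intervals from the text.
pub fn reward_phase(block_height: u64, epoch_start: u64, n: u64, m: u64) -> RewardPhase {
    // Blocks before the epoch start are not part of the rewarding interval.
    let offset = match block_height.checked_sub(epoch_start) {
        Some(offset) => offset,
        None => return RewardPhase::Inactive,
    };
    if offset <= n {
        RewardPhase::Calculating
    } else if offset <= n.saturating_add(m) {
        RewardPhase::Credit
    } else {
        RewardPhase::Inactive
    }
}
```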

For the `calculating interval`, `N` is chosen to be large enough that the
background computation has completed and the reward computation result is
available by the end of the interval. `N` can be a fixed value, such as 100
(roughly equivalent to 50 seconds), or a function of the number of stake
accounts, `f(num_stake_accounts)`.

In the `credit interval`, the bank will fetch the reward computation results
from the background thread and credit the rewards over the next `M` blocks. The
idea is to partition the accounts into `M` partitions and, in each block,
credit the accounts of one partition (`1/M` of the accounts). The partition is
required to be deterministic for the current epoch, but must also be random
across different epochs. One way to achieve these properties is to hash each
account's pubkey with some epoch-dependent value, sort the results, and divide
them into `M` bins. The epoch-dependent value can be the epoch number, the
total rewards for the epoch, the leader pubkey for the epoch block, etc. `M`
can be chosen based on 50K accounts per block, which gives
`M = ceil(num_stake_accounts/50,000)`.
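
The hash, sort and bin scheme could look roughly like the sketch below. It is
an illustration under several assumptions: pubkeys are modelled as `[u8; 32]`,
`epoch_seed` stands for whichever epoch-dependent value is chosen, the helper
names are invented here, and `DefaultHasher` is a non-cryptographic placeholder
for the real hash function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Target number of stake accounts credited per block (50K in the text).
pub const ACCOUNTS_PER_BLOCK: usize = 50_000;

/// `M = ceil(num_stake_accounts / 50,000)`.
pub fn num_partitions(num_stake_accounts: usize) -> usize {
    (num_stake_accounts + ACCOUNTS_PER_BLOCK - 1) / ACCOUNTS_PER_BLOCK
}

/// Hashes a pubkey together with an epoch-dependent seed.
fn epoch_keyed_hash(pubkey: &[u8; 32], epoch_seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    epoch_seed.hash(&mut hasher);
    pubkey.hash(&mut hasher);
    hasher.finish()
}

/// Splits `pubkeys` into `m` bins: hash each key with the seed, sort by hash,
/// then cut the sorted list into consecutive chunks of equal size (the last
/// bins may be shorter or empty).
pub fn partition_accounts(
    pubkeys: &[[u8; 32]],
    epoch_seed: u64,
    m: usize,
) -> Vec<Vec<[u8; 32]>> {
    let mut bins = vec![Vec::new(); m];
    if m == 0 {
        return bins;
    }
    let mut keyed: Vec<(u64, [u8; 32])> = pubkeys
        .iter()
        .map(|pubkey| (epoch_keyed_hash(pubkey, epoch_seed), *pubkey))
        .collect();
    // Sort by hash; ties are broken by the pubkey itself, so the order is
    // fully deterministic for a given seed.
    keyed.sort();
    let bin_size = (keyed.len() + m - 1) / m;
    for (index, (_, pubkey)) in keyed.into_iter().enumerate() {
        bins[index / bin_size].push(pubkey);
    }
    bins
}
```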

`num_stake_accounts` is extracted from the `leader_schedule_epoch` block, so we
don't run into a discrepancy where new transactions right before an epoch
boundary create one fork with `X` stake accounts and another fork with `Y`
stake accounts.

To avoid placing the extra burden of computing and crediting stake rewards on
blocks produced during the `rewarding interval`, we can reduce the compute
budget limits on those blocks and reserve some compute and read/write capacity
for stake rewarding.

### Challenges

1. stake account reads/writes during the `rewarding interval`

During the `rewarding interval`, `epoch_start..epoch_start+N+M`, the credit of
rewards is delayed, so reads of those stake accounts will not return the value
that the user expects (viz. it will not include the most recent epoch's stake
rewards). Writes to those stake accounts will be lost once the rewards are
credited on block `epoch_start+N+M`. We will need to modify the runtime to
restrict reads/writes to stake accounts during the `rewarding interval`. Any
transaction that involves stake accounts will result in a new execution error,
i.e. "stake rewards pending, account access is restricted". However, normal RPC
queries, such as `getBalance`, will return the current lamports of the account.
The user can expect the rewards to be credited at some point during the
`rewarding interval`.
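
A minimal sketch of such a runtime check is shown below. The error type
`RewardLockError`, the function `check_stake_account_access`, and the
`[u8; 32]` pubkey alias are all hypothetical and exist only for this example;
how the runtime actually identifies stake accounts is left open.

```rust
use std::collections::HashSet;
use std::fmt;

/// Stand-in for an account address.
pub type Pubkey = [u8; 32];

/// Hypothetical execution error for transactions touching stake accounts
/// while rewards are still pending.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RewardLockError {
    StakeRewardsPending,
}

impl fmt::Display for RewardLockError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RewardLockError::StakeRewardsPending => {
                write!(f, "stake rewards pending, account access is restricted")
            }
        }
    }
}

/// Rejects a transaction if it references any stake account during the
/// rewarding interval. Outside the interval every transaction passes.
pub fn check_stake_account_access(
    account_keys: &[Pubkey],
    stake_accounts: &HashSet<Pubkey>,
    in_rewarding_interval: bool,
) -> Result<(), RewardLockError> {
    let touches_stake = account_keys.iter().any(|key| stake_accounts.contains(key));
    if in_rewarding_interval && touches_stake {
        return Err(RewardLockError::StakeRewardsPending);
    }
    Ok(())
}
```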

2. snapshots taken during the `rewarding interval`

If a snapshot is taken during the `rewarding interval`, it will be missing the
rewards for the stake accounts. Any plain restart from such a snapshot will be
wrong unless we reconstruct the rewards from the most recent epoch boundary.
This will add some complexity to validator restart. In the first
implementation, we will force *not* taking any snapshot and *not* performing
accounts hash calculation during the `rewarding interval`. Incremental snapshot
requests will be skipped. Full snapshot requests will be re-queued and picked
up later, at the end of the `rewarding interval`.

In the future, if needed, we can revisit this and enable taking snapshots and
performing hash calculation during the `rewarding interval`.
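
The first-implementation policy for snapshot requests can be summarized in a few
lines. The types and the `handle_snapshot_request` function below are invented
for illustration and do not correspond to existing validator code; the deferred
queue is assumed to be drained once the `rewarding interval` ends.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SnapshotRequest {
    Full,
    Incremental,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SnapshotAction {
    /// Take the snapshot now.
    Take,
    /// Drop the request.
    Skip,
    /// Defer the request until the rewarding interval is over.
    Requeue,
}

/// Applies the policy from the text: outside the rewarding interval every
/// request is served; inside it, incremental requests are skipped and full
/// requests are pushed onto `deferred` for later.
pub fn handle_snapshot_request(
    request: SnapshotRequest,
    in_rewarding_interval: bool,
    deferred: &mut VecDeque<SnapshotRequest>,
) -> SnapshotAction {
    if !in_rewarding_interval {
        return SnapshotAction::Take;
    }
    match request {
        SnapshotRequest::Incremental => SnapshotAction::Skip,
        SnapshotRequest::Full => {
            deferred.push_back(request);
            SnapshotAction::Requeue
        }
    }
}
```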

3. account-db related actions during the `rewarding interval`

Account-db related actions such as flush, clean, squash, shrink, etc. may touch
and evict the stake accounts from the account db's cache during the `rewarding
interval`. This will slow down the later reward credit at bank `epoch_start+N`.
We may need to exclude stake accounts from such account-db actions during the
`rewarding interval`. This is going to be a performance tuning problem. In the
first implementation, for simplicity, we will keep the account-db actions as
they are and make the `credit interval` larger to accommodate the performance
hit when writing back those accounts. In the future, we can continue tuning
account-db actions during the `rewarding interval`.

4. view of total epoch capitalization change

The view of total epoch capitalization, instead of being available at every
epoch boundary, is only available after the `rewarding interval`. Any
third-party application logic that depends on total epoch capitalization needs
to wait until after the `rewarding interval`.

5. `getInflationReward` JSONRPC API method call

Today, the `getInflationReward` JSONRPC API method call can simply grab the
first block in the target epoch and look up the target stake account's rewards
entry. With these changes, the call will need to be updated to derive the
target stake account's credit block, grab _that_ block, then look up the
rewards. Additionally, we'll need to return more informative errors for queries
made during the lockout period, so users know that their rewards are pending
for the target epoch. A new RPC API, i.e. `getRewardInterval`, will be added
for querying the `rewarding interval` for the current epoch.
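
As a rough sketch of how the credit block might be derived for the updated
`getInflationReward` lookup, assume that partition `i` (counting from 0) is
credited in the `i`-th block of the `credit interval`. That mapping is an
assumption of this example, not something the proposal fixes, and the names
`credit_block`, `locate_reward` and `RewardLookup` are hypothetical.

```rust
/// Result of locating a stake account's reward for the target epoch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RewardLookup {
    /// The credit block has not been reached yet; rewards are pending.
    Pending { credit_block: u64 },
    /// The rewards entry should be looked up in `credit_block`.
    Credited { credit_block: u64 },
}

/// Block in which `partition_index` is credited, assuming partition `i` is
/// credited in block `epoch_start + N + 1 + i`. Returns `None` for an index
/// outside `0..M`.
pub fn credit_block(epoch_start: u64, n: u64, m: u64, partition_index: u64) -> Option<u64> {
    if partition_index >= m {
        return None;
    }
    Some(epoch_start + n + 1 + partition_index)
}

/// Decides whether a query made at `current_block` should report pending
/// rewards or point at the block holding the rewards entry.
pub fn locate_reward(
    current_block: u64,
    epoch_start: u64,
    n: u64,
    m: u64,
    partition_index: u64,
) -> Option<RewardLookup> {
    let block = credit_block(epoch_start, n, m, partition_index)?;
    if current_block < block {
        Some(RewardLookup::Pending { credit_block: block })
    } else {
        Some(RewardLookup::Credited { credit_block: block })
    }
}
```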