[Proposal] Partitioned Inflationary Rewards Distribution (#27455)

* add epoch-boundary-stake-reward proposal
* 80 col
* clarify rewarding interval selection for skipping slots
* update proposal with reward credit based on jeff's comments
* Update docs/src/proposals/epoch-boundary-stake-reward.md
  Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>
* Update docs/src/proposals/epoch-boundary-stake-reward.md
  Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>
* rename
* update proposal with more feedbacks
* revise
* update with carl's feedback
* use mathmatic notation to clarify interval boundaries
* more feedbacks
* remove parenthesis
* update snapshot paragraph
* update with reward calc service
* more feedbacks
* update with more feedbacks
* more feedbacks from carllin

Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>
2022-10-06 14:26:47 -05:00
parent d9ef04772d
commit 6eeedaec4f
1 changed files with 140 additions and 0 deletions
--- a/docs/src/proposals/partitioned-inflationary-rewards-distribution.md
+++ b/docs/src/proposals/partitioned-inflationary-rewards-distribution.md
@@ -0,0 +1,140 @@
+---
+title: Partitioned Inflationary Rewards Distribution
+---
+
+## Problem
+
+As the number of stake accounts grows, computing and redeeming stake rewards
+in the first block after the epoch boundary becomes very expensive. Currently,
+with 550K stake accounts, stake reward processing already takes more than 10
+seconds. This prolonged computation slows down the network and can cause a
+large number of forks at the epoch boundary, which makes matters even worse.
+
+## Proposed Solutions
+
+Instead of computing and crediting stake rewards at the epoch boundary, we
+will decouple reward computation and reward credit into two phases.
+
+A separate service, `EpochRewardCalculationService`, will be created. The
+service will listen on a channel for incoming reward calculation requests and
+perform the reward calculations. For each block that crosses the epoch
+boundary, the bank will send a request to the `EpochRewardCalculationService`.
+This marks the start of the reward computation phase.
+
+```
+N-1 -- N -- N+1
+     \
+      \
+        N+2
+```
+
+In the above example, N is the start of the new epoch. Two reward calculation
+requests will be sent out, at slot N and slot N+2, because both blocks cross
+the epoch boundary and are on different forks. To avoid repeated computation
+with the same input, a signature is calculated for each computation request:
+`hash(epoch_number, hash(stake_accounts_data), hash(vote_accounts),
+hash(delegation_map))`. Duplicate computation requests will be discarded. For
+the above example, if there are no stake/vote account changes between slot N
+and slot N+2, the 2nd computation request will be discarded.
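As an illustration of the request signature described above, here is a minimal Python sketch. It is not the proposal's implementation: the names `digest`, `hash_map`, `request_signature` and `RequestDeduplicator` are hypothetical, SHA-256 stands in for whatever hash the runtime would use, and accounts are modelled as plain `bytes -> bytes` mappings.

```python
import hashlib


def digest(*parts: bytes) -> bytes:
    """SHA-256 over length-prefixed parts, so part boundaries are unambiguous."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(8, "little"))
        h.update(part)
    return h.digest()


def hash_map(entries: dict) -> bytes:
    """Hash a bytes->bytes mapping in sorted key order (insertion order ignored)."""
    flat = []
    for key, value in sorted(entries.items()):
        flat.extend((key, value))
    return digest(*flat)


def request_signature(epoch: int, stake_accounts: dict, vote_accounts: dict,
                      delegations: dict) -> bytes:
    """hash(epoch_number, hash(stake_accounts), hash(vote_accounts), hash(delegations))."""
    return digest(
        epoch.to_bytes(8, "little"),
        hash_map(stake_accounts),
        hash_map(vote_accounts),
        hash_map(delegations),
    )


class RequestDeduplicator:
    """Remembers signatures already submitted (hypothetical helper)."""

    def __init__(self):
        self._seen = set()

    def should_compute(self, signature: bytes) -> bool:
        """Return True the first time a signature is seen, False for duplicates."""
        if signature in self._seen:
            return False
        self._seen.add(signature)
        return True
```

With this shape, the requests from slot N and slot N+2 produce the same signature whenever their stake, vote and delegation inputs are identical, so the second one is dropped.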
+
+On reaching block height `N` after the start of the `reward computation
+phase`, the bank starts the second phase, reward credit. The bank first
+queries the `EpochRewardCalculationService` with the request signature to get
+the rewards result, which will be represented as a map from
+`accounts_pubkey->rewards`, then credits the rewards to the stake accounts
+over the next `M` blocks. If the rewards result is not yet available, the bank
+will wait until it is.
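The hand-off between the bank and the calculation service can be sketched as follows. This is only an illustration under stated assumptions: `RewardCalcServiceSketch` is a hypothetical name, a single-worker thread pool stands in for the service's channel and background thread, and `compute` is any callable that returns the `pubkey -> rewards` map.

```python
from concurrent.futures import ThreadPoolExecutor


class RewardCalcServiceSketch:
    """Hypothetical stand-in for the reward calculation service."""

    def __init__(self):
        # One worker models a single background calculation thread.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._results = {}  # request signature -> Future of the rewards map

    def request(self, signature: bytes, compute) -> None:
        """Start a calculation unless one with this signature already exists."""
        if signature not in self._results:
            self._results[signature] = self._pool.submit(compute)

    def get(self, signature: bytes) -> dict:
        """Return the rewards map, blocking until the calculation finishes."""
        return self._results[signature].result()

    def shutdown(self) -> None:
        self._pool.shutdown()
```

The blocking `get` mirrors the rule that the bank waits when the result is not yet available; a real implementation would also need to evict results for abandoned forks.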
+
+We call these two intervals: <br/>
+(a) calculating interval: `[epoch_start, epoch_start+N]` <br/>
+(b) credit interval: `[epoch_start+N+1, epoch_start+N+M]` <br/>
+The combined interval `[epoch_start, epoch_start+N+M]` is called the
+`rewarding interval`.
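To make the closed-interval boundaries concrete, the following sketch classifies a block height into one of the phases. `RewardingInterval` and `phase` are hypothetical names, and block heights are plain integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardingInterval:
    epoch_start: int  # block height of the first block of the epoch
    n: int            # length parameter N of the calculating interval
    m: int            # number of credit blocks M

    def phase(self, block: int) -> str:
        """Classify a block height against the intervals defined above."""
        offset = block - self.epoch_start
        if offset < 0 or offset > self.n + self.m:
            return "outside"
        # [epoch_start, epoch_start+N] is calculating; the rest is credit.
        return "calculating" if offset <= self.n else "credit"
```

For example, with `epoch_start=1000`, `N=100` and `M=11`, block 1100 is the last calculating block and blocks 1101 through 1111 are credit blocks.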
+
+For the `calculating interval`, `N` is chosen to be large enough that the
+background computation has completed and its result is available by the end of
+the interval. `N` can be a fixed value such as 100 (roughly equivalent to 50
+seconds), or chosen as a function of the number of stake accounts,
+`f(num_stake_accounts)`.
+
+In the `credit interval`, the bank will fetch the reward computation results
+from the background thread and credit the rewards over the next `M` blocks.
+The idea is to partition the accounts into `M` partitions and, in each block,
+credit one partition, i.e. `1/M` of the accounts. The partition is required to
+be deterministic for the current epoch, but must also be random across
+different epochs. One way to achieve these properties is to hash each
+account's pubkey with some epoch-dependent value, sort the results, and divide
+them into `M` bins. The epoch-dependent value can be the epoch number, the
+total rewards for the epoch, the leader pubkey for the epoch block, etc. `M`
+can be chosen based on 50K accounts per block, which gives
+`ceil(num_stake_accounts/50,000)`.
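The hash-sort-bin scheme described above could look roughly like this. It is a sketch under assumptions: SHA-256 is used as the hash, the epoch-dependent value is passed in as an opaque `epoch_seed`, and the function names are hypothetical.

```python
import hashlib


def num_partitions(num_stake_accounts: int, accounts_per_block: int = 50_000) -> int:
    """M = ceil(num_stake_accounts / accounts_per_block), with a minimum of 1."""
    return max(1, (num_stake_accounts + accounts_per_block - 1) // accounts_per_block)


def partition_accounts(pubkeys, epoch_seed: bytes, m: int):
    """Sort pubkeys by hash(epoch_seed + pubkey) and split them into m bins."""
    ordered = sorted(pubkeys, key=lambda pk: hashlib.sha256(epoch_seed + pk).digest())
    total = len(ordered)
    # Bin i holds ranks [i*total//m, (i+1)*total//m), so sizes differ by at most 1.
    return [ordered[i * total // m:(i + 1) * total // m] for i in range(m)]
```

Sorting by the seeded hash makes the result independent of input order, so every validator derives the same bins for an epoch, while a different seed reshuffles the assignment. With the 550K accounts mentioned earlier, `num_partitions(550_000)` gives 11 credit blocks.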
+
+`num_stake_accounts` is extracted from the `leader_schedule_epoch` block, so
+we don't run into a discrepancy where new transactions right before an epoch
+boundary create one fork with `X` stake accounts and another fork with `Y`
+stake accounts.
+
+To avoid putting the extra burden of computing and crediting stake rewards on
+blocks produced during the `rewarding interval`, we can reduce the compute
+budget limits on those blocks and reserve some compute and read/write capacity
+for stake rewarding.
+
+### Challenges
+
+1. stake account reads/writes during the `rewarding interval`
+
+Because the rewards are credited with a delay, reads of stake accounts during
+the `rewarding interval` (`epoch_start..epoch_start+N+M`) will not return the
+value that users expect (viz. they will not include the most recent epoch's
+stake rewards). Writes to those stake accounts will be lost once the rewards
+are credited on block `epoch_start+N+M`. We will need to modify the runtime to
+restrict reads and writes to stake accounts during the `rewarding interval`.
+Any transaction that involves stake accounts will result in a new execution
+error, i.e. "stake rewards pending, account access is restricted". However,
+normal RPC queries, such as `getBalance`, will return the current lamports of
+the account. Users can expect the rewards to be credited at some point during
+the `rewarding interval`.
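A minimal sketch of the access restriction, assuming a transaction is represented by a mapping from each referenced account's pubkey to its owning program id. `StakeRewardsPending` and `check_account_access` are hypothetical names; the runtime's actual error type and check are not specified by the proposal.

```python
# Program id of the Solana stake program; accounts it owns are stake accounts.
STAKE_PROGRAM_ID = "Stake11111111111111111111111111111111111111"


class StakeRewardsPending(Exception):
    """Hypothetical execution error for transactions touching stake accounts."""


def check_account_access(account_owners: dict, block: int,
                         interval_start: int, interval_end: int) -> None:
    """Reject a transaction that references a stake account during the interval.

    account_owners maps each referenced pubkey to its owning program id.
    """
    if not (interval_start <= block <= interval_end):
        return
    for owner in account_owners.values():
        if owner == STAKE_PROGRAM_ID:
            raise StakeRewardsPending(
                "stake rewards pending, account access is restricted"
            )
```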
+
+2. snapshot taken during the `rewarding interval`
+
+If a snapshot is taken during the `rewarding interval`, it would miss the
+rewards for the stake accounts. Any plain restart from such a snapshot will be
+wrong, unless we reconstruct the rewards from the most recent epoch boundary.
+This will add some complexity to validator restart. In the first
+implementation, we will force *not* taking any snapshot and *not* performing
+accounts hash calculation during the `rewarding interval`. Incremental
+snapshot requests will be skipped. Full snapshot requests will be re-queued to
+be picked up later, at the end of the `rewarding interval`.
+
+In the future, if needed, we can revisit this and enable taking snapshots and
+performing hash calculation during the `rewarding interval`.
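The first-implementation snapshot policy can be sketched as a small scheduler. All names here (`SnapshotKind`, `SnapshotScheduler`, the returned status strings) are hypothetical and only illustrate the skip and re-queue rules.

```python
from collections import deque
from enum import Enum


class SnapshotKind(Enum):
    FULL = "full"
    INCREMENTAL = "incremental"


class SnapshotScheduler:
    """Skips incremental and defers full snapshots inside the rewarding interval."""

    def __init__(self, interval_start: int, interval_end: int):
        self.interval_start = interval_start
        self.interval_end = interval_end
        self.deferred = deque()  # full snapshot requests waiting for the interval to end

    def in_rewarding_interval(self, block: int) -> bool:
        return self.interval_start <= block <= self.interval_end

    def request(self, kind: SnapshotKind, block: int) -> str:
        """Return 'taken', 'skipped' or 'requeued' for a snapshot request."""
        if not self.in_rewarding_interval(block):
            return "taken"
        if kind is SnapshotKind.INCREMENTAL:
            return "skipped"
        self.deferred.append(kind)
        return "requeued"

    def drain_deferred(self, block: int) -> list:
        """Release re-queued requests once the block is past the interval."""
        if self.in_rewarding_interval(block):
            return []
        released = list(self.deferred)
        self.deferred.clear()
        return released
```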
+
+3. account-db related action during the `rewarding interval`
+
+Account-db related actions such as flush, clean, squash, shrink, etc. may
+touch and evict the stake accounts from the account db's cache during the
+`rewarding interval`. This will slow down the later credit at bank
+`epoch_start+N`. We may need to exclude such account-db actions for stake
+accounts during the `rewarding interval`. This is going to be a performance
+tuning problem. In the first implementation, for simplicity, we will keep the
+account-db actions as they are, and make the `credit interval` larger to
+accommodate the performance hit when writing back those accounts. In the
+future, we can continue tuning account-db actions during the `rewarding
+interval`.
+
+4. view of total epoch capitalization change
+
+The view of total epoch capitalization, instead of being available at every
+epoch boundary, is only available after the `rewarding interval`. Any
+third-party application logic that depends on total epoch capitalization needs
+to wait until after the `rewarding interval`.
+
+5. `getInflationReward` JSONRPC API method call
+
+Today, the `getInflationReward` JSONRPC API method call can simply grab the
+first block in the target epoch and look up the target stake account's rewards
+entry. With these changes, the call will need to be updated to derive the
+target stake account's credit block, grab _that_ block, then look up the
+rewards. Additionally, we'll need to return more informative errors for
+queries made during the lockout period, so users know that their rewards are
+pending for the target epoch. A new RPC API, i.e. `getRewardInterval`, will be
+added for querying the `rewarding interval` for the current epoch.
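Deriving a stake account's credit block might look like the sketch below. It assumes the hash-sort-bin partitioning suggested earlier, with partition `i` credited in block `epoch_start+N+1+i`. `credit_block`, `lookup_reward`, `RewardsPending` and the `rewards_by_block` structure are hypothetical and are not part of the existing RPC API.

```python
import hashlib


class RewardsPending(Exception):
    """Hypothetical error for queries made before the account's credit block."""


def credit_block(pubkey: bytes, all_pubkeys, epoch_seed: bytes,
                 epoch_start: int, n: int, m: int) -> int:
    """Block in [epoch_start+N+1, epoch_start+N+M] that credits this account."""
    ordered = sorted(all_pubkeys, key=lambda pk: hashlib.sha256(epoch_seed + pk).digest())
    rank = ordered.index(pubkey)
    total = len(ordered)
    # Partition i holds ranks [i*total//m, (i+1)*total//m).
    index = next(i for i in range(m) if rank < (i + 1) * total // m)
    return epoch_start + n + 1 + index


def lookup_reward(pubkey: bytes, current_block: int, rewards_by_block: dict,
                  all_pubkeys, epoch_seed: bytes, epoch_start: int, n: int, m: int):
    """Look up the reward entry in the account's credit block, or report pending."""
    block = credit_block(pubkey, all_pubkeys, epoch_seed, epoch_start, n, m)
    if current_block < block:
        raise RewardsPending(f"rewards pending until block {block}")
    return rewards_by_block[block].get(pubkey)
```

Sorting every stake account on each query would be too slow in practice; a real implementation would more likely record the partition assignment when rewards are calculated.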