document optimistic confirmation and slashing roadmap (#10164)

* docs

* book nits

* Update docs/src/proposals/optimistic-confirmation-and-slashing.md

Co-authored-by: Michael Vines <mvines@gmail.com>

* Update optimistic-confirmation-and-slashing.md

* Update optimistic-confirmation-and-slashing.md

* Update optimistic-confirmation-and-slashing.md

* Update optimistic-confirmation-and-slashing.md

* Update optimistic-confirmation-and-slashing.md

* fixups

Co-authored-by: Michael Vines <mvines@gmail.com>
anatoly yakovenko 2020-05-21 18:15:09 -07:00 committed by GitHub
parent 12a3b1ba6a
commit c78fd2b36d
2 changed files with 90 additions and 0 deletions

@@ -96,6 +96,7 @@
* [Commitment](implemented-proposals/commitment.md)
* [Snapshot Verification](implemented-proposals/snapshot-verification.md)
* [Accepted Design Proposals](proposals/README.md)
* [Optimistic Confirmation and Slashing](proposals/optimistic-confirmation-and-slashing.md)
* [Secure Vote Signing](proposals/vote-signing-to-implement.md)
* [Cluster Test Framework](proposals/cluster-test-framework.md)
* [Validator](proposals/validator-proposal.md)

@@ -0,0 +1,89 @@
# Optimistic Confirmation and Slashing
Progress on optimistic confirmation can be tracked here:
https://github.com/solana-labs/solana/projects/52
At the end of May, the mainnet-beta is moving to 1.1, and testnet
is moving to 1.2. With 1.2, testnet will behave as if it has 1-block
confirmation as long as no more than 4.66% of the validators are acting
maliciously. Applications can assume that 2/3+ votes observed in
gossip confirm a block, or that at least 4.66% of the network is
violating the protocol.
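
To make that assumption concrete, here is a minimal sketch, assuming a
hypothetical tally of gossip votes keyed by validator identity, of how an
application might treat a block as optimistically confirmed once 2/3+ of the
epoch stake has voted for it. The names and types are illustrative, not the
actual Solana client API.

```rust
use std::collections::HashMap;

/// Hypothetical check: treat a block as optimistically confirmed once the
/// stake voting for it in gossip exceeds 2/3 of the epoch stake.
fn is_optimistically_confirmed(
    votes_for_block: &HashMap<String, u64>, // validator identity -> stake voting for the block
    total_epoch_stake: u64,
) -> bool {
    let voted_stake: u64 = votes_for_block.values().sum();
    // Integer form of voted_stake / total_epoch_stake > 2/3, avoiding floats.
    3 * voted_stake > 2 * total_epoch_stake
}

fn main() {
    let votes: HashMap<String, u64> = [
        ("validator-a".to_string(), 40),
        ("validator-b".to_string(), 30),
    ]
    .into_iter()
    .collect();
    // 70 of 100 staked units voted, which clears the 2/3 threshold.
    println!("confirmed: {}", is_optimistically_confirmed(&votes, 100));
}
```
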
## How does it work?
The general idea is that validators have to continue voting, following
their last fork, unless they can construct a proof that their fork
may not reach finality. The way validators construct this proof is
by collecting votes for all the other forks, excluding their own.
If the set of valid votes represents over 1/3+X of the epoch stake
weight, there may not be a way for the validator's current fork
to reach 2/3+ finality. The validator hashes the proof (creates a
witness) and submits it with their vote for the alternative fork.
But if 2/3+ of the stake votes for the same block, it is impossible for
any of the nodes to construct this proof, and therefore no node is able
to switch forks and the block will eventually be finalized.
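
A rough sketch of that switching rule, assuming a flat list of observed votes
with stake weights (illustrative names, not the actual implementation, and
ignoring details such as deduplicating validators and hashing the proof into
a witness):

```rust
/// Minimum fraction of epoch stake that must be voting on other forks before
/// a validator may switch: 1/3 plus the slashable threshold X (4.66% here).
const SWITCH_THRESHOLD: f64 = 1.0 / 3.0 + 0.0466;

struct ObservedVote {
    fork: u64,  // fork (slot) the vote is for
    stake: u64, // stake of the voting validator
}

/// Returns true if votes on forks other than `my_fork` represent more than
/// 1/3 + X of the epoch stake, i.e. the validator can construct a proof that
/// its own fork may never reach 2/3+ finality.
fn can_switch_forks(my_fork: u64, observed: &[ObservedVote], total_epoch_stake: u64) -> bool {
    let other_fork_stake: u64 = observed
        .iter()
        .filter(|v| v.fork != my_fork)
        .map(|v| v.stake)
        .sum();
    other_fork_stake as f64 / total_epoch_stake as f64 > SWITCH_THRESHOLD
}
```
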
## Tradeoffs
The safety margin is 1/3+X, where X represents the minimum amount
of stake that will be slashed in case the protocol is violated. The
tradeoff is that liveness is now reduced by 2X in the worst case.
If more than 1/3 - 2X of the network is unavailable, the network
may stall and will resume finalizing blocks after the network
recovers. So far, we haven't observed a large unavailability hit
on our mainnet, Cosmos, or Tezos. Currently, we have set the
threshold percentage to 4.66%, which means that if 23.68% of the
stake has failed, the network may stop finalizing blocks. For our
network, which is primarily composed of high-availability systems,
a 23.68% drop in availability seems unlikely: roughly 1:10^12 odds,
assuming five nodes with 4.7% of stake each and 0.995 uptime.
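
That availability figure can be sanity-checked with a quick back-of-the-envelope
calculation under the stated assumptions (five validators at roughly 4.7% of
stake each, 0.995 uptime, failures treated as independent):

```rust
fn main() {
    // Five nodes at ~4.7% of stake each is roughly the 23.68% liveness threshold.
    let combined_stake = 5.0 * 4.7;
    // Probability that all five are down at once, assuming independent failures
    // and 0.995 uptime per node.
    let per_node_downtime: f64 = 1.0 - 0.995;
    let all_down = per_node_downtime.powi(5); // ≈ 3.1e-12, i.e. roughly 1:10^12 odds
    println!("combined stake offline: {:.1}%", combined_stake);
    println!("probability all five are down: {:.2e}", all_down);
}
```
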
## Security
The long-term average number of votes per slot has been 670,000,000 votes /
12,000,000 slots, or roughly 55 out of 64 voting validators. This includes
missed blocks due to block producer failures. When a client sees
55/64, or ~86%, confirming a block, it can expect that ~24%, i.e.
(86 - 66.666.. + 4.666..)%, of the network's stake must be slashed for
this block to fail full finalization.
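
The arithmetic behind the ~24% figure, using only the numbers stated above
(a sketch of the calculation, not additional measured data):

```rust
fn main() {
    // Long-term average: 670,000,000 votes over 12,000,000 slots.
    let avg_votes_per_slot = 670_000_000f64 / 12_000_000f64; // ≈ 55.8, i.e. ~55 of 64 validators
    println!("average votes per slot: {:.1}", avg_votes_per_slot);

    let observed_pct = 55.0 / 64.0 * 100.0; // ≈ 86% of validators confirming the block
    let supermajority_pct = 200.0 / 3.0;    // 66.66..% needed for finality
    let threshold_pct = 4.666;              // slashable switching threshold X
    // Stake that must be slashed for an ~86%-confirmed block to miss full finalization.
    let must_be_slashed = observed_pct - supermajority_pct + threshold_pct;
    println!("~{:.0}% of the stake must be slashed", must_be_slashed); // ≈ 24%
}
```
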
## Why Solana?
This approach can be built on other networks, but the implementation
complexity is significantly reduced on Solana because our votes
have provable VDF-based timeouts. It's not clear if switching proofs
can be easily constructed in networks with weak assumptions about
time.
## Slashing roadmap
Slashing is a hard problem, and it becomes harder when the goal of
the network is to be the fastest possible implementation. The
tradeoffs are especially apparent when optimizing for latency. For
example, we would really like the validators to cast and propagate
their votes before the memory has been synced to disk, which means
that the risk of local state corruption is much higher.
Fundamentally, our goal for slashing is to slash 100% in cases where
the node is maliciously trying to violate safety rules and 0% during
routine operation. We aim to achieve that by first implementing
slashing proofs without any automatic slashing whatsoever.
Right now, for regular consensus, after a safety violation, the
network will halt. We can analyze the data and figure out who was
responsible and propose that the stake should be slashed after
restart. A similar approach will be used with optimistic confirmation.
An optimistic confirmation safety violation is easily observable, but
under normal circumstances it may not halt the network. Once the
violation has been observed, the validators will freeze the affected
stake in the next epoch and decide, in the next upgrade, whether the
violation requires slashing.
In the long term, transactions should be able to recover a portion
of the slashing collateral if the optimistic safety violation is
proven. In that scenario, each block is effectively insured by the
network.