Lots more text on transactions and interface

2017-02-03 18:54:19 +01:00 · 2017-02-03 18:54:19 +01:00 · 5c2ead741c
parent b31e5d7764
commit 5c2ead741c
1 changed files with 142 additions and 24 deletions
--- a/docs/architecture/merkle-frey.md
+++ b/docs/architecture/merkle-frey.md
@ -1,17 +1,15 @@
 # Merkle data stores - Frey's proposal

-To allow the efficient creation of an ABCi app, tendermint wishes to provide a reference implemention of a key-value store that provides merkle proofs of the data.  These proofs then quickly allow the ABCi app to provide an apphash to the consensus engine, as well as a full proof to any client.
-
-This is equivalent to building a database, and I would propose designing it from the API first, then looking how to implement this (or make an adaptor from the API to existing implementations). Once we agree on the functionality and the interface, we can implement the API bindings, and then work on building adaptors to existince merkle-ized data stores, or modifying the stores to support this interface
-
 ## TL;DR

-* **Interface**
-  * This interface should be domain-specific - ie. designed just for this use case
-  * It should present a simple go interface for embedding the data store in-process
-  * It should create a gRPC/protobuf API for calling from any client
-  * It should provide and maintain client adaptors from our in-process interface to gRPC client calls for at least golang and java (maybe more languages?)
-  * It should provide and maintain server adaptors from our gRPC calls to the in-process interface for golang at least (unless there is another server we wish to support)
+To allow the efficient creation of an ABCi app, tendermint wishes to provide a reference implemention of a key-value store that provides merkle proofs of the data.  These proofs then quickly allow the ABCi app to provide an apphash to the consensus engine, as well as a full proof to any client.
+
+This is equivalent to building a database, and I would propose designing it from the API first, then looking how to implement this (or make an adaptor from the API to existing implementations). Once we agree on the functionality and the interface, we can implement the API bindings, and then work on building adaptors to existince merkle-ized data stores, or modifying the stores to support this interface.
+
+We need to consider the API (both in-process and over the network), language bindings, maintaining handles to old state (and garbage collecting), persistence, security, providing merkle proofs, and general key-value store operations. To stay consistent with the blockchains "single global order of operations", this data store should only allow one connection at a time to have write access.
+
+## Overview
+
 * **State**
  * There are two concepts of state, "committed state" and "working state"
  * The working state is only accessible from the ABCi app, allows writing, but does not need to support proofs.
@ -26,6 +24,12 @@ This is equivalent to building a database, and I would propose designing it from
  * It should support range queries on subsets of the key space if possible (ie. if the db doesn't hash keys)
  * It should also support listening to changes to a desired key via pub-sub or similar method, so I can quickly notify you on a change to your balance without constant polling.
  * It may support other db-specific query types as an extension to this interface, as long as all specified actions maintain their meaning.
+* **Interface**
+  * This interface should be domain-specific - ie. designed just for this use case
+  * It should present a simple go interface for embedding the data store in-process
+  * It should create a gRPC/protobuf API for calling from any client
+  * It should provide and maintain client adaptors from our in-process interface to gRPC client calls for at least golang and java (maybe more languages?)
+  * It should provide and maintain server adaptors from our gRPC calls to the in-process interface for golang at least (unless there is another server we wish to support)
 * **Persistence**
  * It must support atomic persistance upon committing a new block.  That is, upon crash recovery, the state is guaranteed to represent the state at the end of a complete block (along with a note of which height it was).
  * It must delay deletion of old data as long as there are open read-only transactions refering to it, thus we must maintain some sort of WAL to keep track of pending cleanup.
@ -38,34 +42,148 @@ This is equivalent to building a database, and I would propose designing it from

 Here we go more in-depth in each of the sections, explaining the reasoning and more details on the desired behavior. This document is only the high-level architecture and should support multiple implementations.  When building out a specific implementation, a similar document should be provided for that repo, showing how it implements these concepts, and details about memory usage, storage, efficiency, etc.

-### Interface
-
-I will start with a simple go interface to illustrate the in-process interface.  Once there is agreement on how this looks, we can work out the gRPC bindings to support calling out of process.
-
-The use-case of allowing out-of-process calls is very powerful.  It we allow the ABCi app to maintain the only writable connections, we can guarantee that all transactions are only processed through the tendermint consensus engine.  We could then allow multiple "web server" machines "read-only" access and scale out the database reads, assuming the consensus engine, ABCi logic, and public key cryptography is more the bottleneck than the database.
-
-**TODO**

 ### State

-**TODO**
+The current ABCi interface avoids this question a bit and that has brought confusion.  If I use `merkleeyes` to store data, which state is returned from `Query`?  The current "working" state, which I would like to refer to in my ABCi application?  Or the last committed state, which I would like to return to a client's query?  Or an old state, which I may select based on height?
+
+Right now, `merkleeyes` implements `Query` like a normal ABCi app and only returns committed state, which has lead to problems and confusion.  Thus, we need to be explicit about which state we want to view.  Each viewer can then specify which state it wants to view.  This allows the app to query the workign state in DeliverTx, but the committed state in Query.
+
+We can easily provide two global references for "last committed" and "current working" states.  However, if we want to also allow querying of older commits... then we need some way to keep track of which ones are still in use, so we can garbage collect the unneeded ones. There is a non-trivial overhead in holdign references to all past states, but also a hardcoded solution (hold onto the last 5 commits) may not support all clients.  We should let the client define this somehow.

 ### Transactions

-**TODO**
+Transactions (in the typical database sense) are a clean and estabilished solution to this issue.  We can look at the [isolations levels](https://en.wikipedia.org/wiki/Isolation_(database_systems)#Serializable) which attempt to provide us things like "repeatable reads".  That means if we open a transaction, and query some data 100 times while other processes are writing to the db, we get the same result each time.  This transaction has a reference to its own local state from the time the transaction started. (We are refering to the highest isolation levels here, which correlate well this the blockchain use case).

-Points to add:
+If we implement a read-only transaction as a reference to state at the time of creation of that transaction, we can then hold these references to various snapshots, one per block that we are interested, and allow the client to multiplex queries and proofs from these various blocks.

-* Subtransactions
-* Rollback and commit
-* Multiplexing various transactions over on stateless gRPC connection
-* Ping-ing and timeouts
+If we continue using these concepts (which have informed 30+ years of server side design), we can add a few nice features to our write transactions. The first of which is `Rollback` and `Commit`.  That means all the changes we make in this transaction have no effect on the database until they are committed.  And until they are committed, we can always abort if we detect an anomaly, returning to the last committed state with a rollback.

+There is also a nice extension to this available on some database servers, basically, "nested" transactions or "savepoints".  This means that within one transaction, you can open a subtransaction/savepoint and continue work.  Later you have the option to commit or rollback all work since the savepoint/subtransaction.  And then continue with the main transaction.
+
+If you don't understand why this is useful, look at how basecoin needs to [hold cached state for AppTx](https://github.com/tendermint/basecoin/blob/master/state/execution.go#L126-L149), meaning that it rolls back all modifications if the AppTx returns an error. This was implemented as a wrapper in basecoin, but it is a reasonable thing to support in the DB interface itself (especially since the implementation becomes quite non-trivial as soon as you support range queries).
+
+To give a bit more reference to this concept in practice, read about [Savepoints in Postgesql](https://www.postgresql.org/docs/current/static/tutorial-transactions.html) ([reference](https://www.postgresql.org/docs/current/static/sql-savepoint.html)) or [Nesting transactions in SQL Server](http://dba-presents.com/index.php/databases/sql-server/43-nesting-transactions-and-save-transaction-command) (TL;DR: scroll to the bottom, section "Real nesting transactions with SAVE TRANSACTION")

 ### Functionality

 **TODO**

+### Go Interface
+
+I will start with a simple go interface to illustrate the in-process interface.  Once there is agreement on how this looks, we can work out the gRPC bindings to support calling out of process. These interfaces are not finalized code, but I think the demonstrate the concepts better than text and provide a strawman to get feedback.
+
+```
+// DB represents the committed state of a merkle-ized key-value store
+type DB interface {
+  // Snapshot returns a reference to last commited state to use for
+  // providing proofs, you must close it at the end to garbage collect
+  // the historical state we hold on to to make these proofs
+  Snapshot() Prover
+
+  // Start a transaction - only way to change state
+  // This will return an error if there is an open Transaction
+  Begin() (Transaction, error)
+
+  // These callbacks are triggered when the Transaction is Committed
+  // to the DB.  They can be used to eg. notify clients via websockets when
+  // their account balance changes.
+  AddListener(key []byte, listener Listener)
+  RemoveListener(listener Listener)
+}
+
+// DBReader represents a read-only connection to a snapshot of the db
+type DBReader interface {
+  // Queries on my local view
+  Has(key []byte) (bool, error)
+  Get(key []byte) (Model, error)
+  GetRange(start, end []byte, ascending bool, limit int) ([]Model, error)
+  Closer
+}
+
+// Prover is an interface that lets one query for Proofs, holding the
+// data at a specific location in memory
+type Prover interface {
+  DBReader
+
+  // Hash is the AppHash (RootHash) for this block
+  Hash() (hash []byte)
+
+  // Prove returns the data along with a merkle Proof
+  // Model and Proof are nil if not found
+  Prove(key []byte) (Model, Proof, error)
+}
+
+// Transaction is a set of state changes to the DB to be applied atomically.
+// There can only be one open transaction at a time, which may only have
+// maximum one subtransaction at a time.
+// In short, at any time, there is exactly one object that can write to the
+// DB, and we can use Subtransactions to group operations and roll them back
+// together (kind of like `types.KVCache` from basecoin)
+type Transaction interface {
+  DBReader
+
+  // Change the state - will raise error immediately if this Transaction
+  // is not holding the exclusive write lock
+  Set(model Model) (err error)
+  Remove(key []byte) (removed bool, err error)
+
+  // Subtransaction starts a new subtransaction, rollback will not affect the
+  // parent.  Only on Commit are the changes applied to this transaction.
+  // While the subtransaction exists, no write allowed on the parent.
+  // (You must Commit or Rollback the child to continue)
+  Subtransaction() Transaction
+
+  // Commit this transaction (or subtransaction), the parent reference is
+  // now updated.
+  // This only updates persistant store if the top level transaction commits
+  // (You may have any number of nested sub transactions)
+  Commit() error
+
+  // Rollback ends the transaction and throw away all transaction-local state,
+  // allowing the tree to prune those elements.
+  // The parent transaction now recovers the write lock.
+  Rollback()
+}
+
+// Listener registers callbacks on changes to the data store
+type Listener interface {
+  OnSet(key, value, oldValue []byte)
+  OnRemove(key, oldValue []byte)
+}
+
+// Proof represents a merkle proof for a key
+type Proof interface {
+  RootHash() []byte
+  Verify(key, value, root []byte) bool
+}
+
+type Model interface {
+  Key() []byte
+  Value() []byte
+}
+
+// Closer releases the reference to this state, allowing us to garbage collect
+// Make sure to call it before discarding.
+type Closer interface {
+  Close()
+}
+```
+
+### Remote Interface
+
+The use-case of allowing out-of-process calls is very powerful.  Not just to provide a powerful merkle-ready data store to non-go applications.
+
+It we allow the ABCi app to maintain the only writable connections, we can guarantee that all transactions are only processed through the tendermint consensus engine.  We could then allow multiple "web server" machines "read-only" access and scale out the database reads, assuming the consensus engine, ABCi logic, and public key cryptography is more the bottleneck than the database. We could even place the consensus engine, ABCi app, and data store on one machine, connected with unix sockets for security, and expose a tcp/ssl interface for reading the data, to scale out query processing over multiple machines.
+
+But returning our focus directly to the ABCi app (which is the most important use case).  An app may well want to maintain 100 or 1000 snapshots of different heights to allow people to easily query many proofs at a given height without race conditions (very important for IBC, ask Jae).  Thus, we should not require a separate TCP connection for each height, as this gets quite awkward with so many connections. Also, if we want to use gRPC, we should consider the connections potentially transient (although they are more efficient with keep-alive).
+
+Thus, the wire encoding of a transaction or a snapshot should simply return a unique id.  All methods on a `Prover` or `Transaction` over the wire can send this id along with the arguments for the method call.  And we just need a hash map on the server to map this id to a state.
+
+The only negative of not requiring a persistent tcp connection for each snapshot is there is no auto-detection if the client crashes without explicitly closing the connections.  Thus, I would suggest adding a `Ping` thread in the gRPC interface which keeps the Snapshot alive.  If no ping is received within a server-defined time, it may automatically close those transactions.  And if we consider a client with 500 snapshots that needs to ping each every 10 seconds, that is a lot of overhead, so we should design the ping to accept a list of IDs for the client and update them all.  Or associate all snapshots with a clientID and then just send the clientID in the ping.  (Please add other ideas on how to detect client crashes without persistent connections).
+
+To encourage adoption, we should provide a nice client that uses this gRPC interface (like we do with ABCi).  For go, the client may have the exact same interface as the in-process version, just that the error call may return network errors, not just illegal operations.  We should also add a client with a clean API for Java, since that seems to be popular among app developers in the current tendermint community.  Other bindings as we see the need in the server space.
+
 ### Persistence

 **TODO**