From 1b9f144b9de1b0437be8f65e06ff6f982436b56d Mon Sep 17 00:00:00 2001 From: Simon Warta <2603011+webmaster128@users.noreply.github.com> Date: Wed, 19 Aug 2020 19:22:15 +0200 Subject: [PATCH] Propose ADR 027: Deterministic Protobuf Serialization (#6979) * Add ADR 026: Protocol Buffer Regencode * Use Regencode in ADR-020 * Apply suggestions from code review Co-authored-by: Aaron Craelius * Add reasoning behind omitting empty fields * Rename to ADR 026: Deterministic Protobuf Serialization * i++ * explicitely -> explicitly * Fix description of optional values * Update reasoning behind omitting default values * Fix empty -> defaut * Update docs/architecture/adr-027-deterministic-protobuf-serialization.md Co-authored-by: Amaury Martiny * Add rules to avoid duplicate encoding and specify packed fields * Add rule for variant encoding * Add surrogate pair to test UTF-8 encoding Co-authored-by: Aaron Craelius Co-authored-by: Amaury Martiny --- docs/architecture/README.md | 3 +- .../adr-020-protobuf-transaction-encoding.md | 11 +- ...27-deterministic-protobuf-serialization.md | 251 ++++++++++++++++++ 3 files changed, 256 insertions(+), 9 deletions(-) create mode 100644 docs/architecture/adr-027-deterministic-protobuf-serialization.md diff --git a/docs/architecture/README.md b/docs/architecture/README.md index 3e0879b21..b60a1b38a 100644 --- a/docs/architecture/README.md +++ b/docs/architecture/README.md @@ -52,4 +52,5 @@ Please add a entry below in your Pull Request for an ADR. - [ADR 023: Protocol Buffer Naming and Versioning](./adr-023-protobuf-naming.md) - [ADR 024: Coin Metadata](./adr-024-coin-metadata.md) - [ADR 025: IBC Passive Channels](./adr-025-ibc-passive-channels.md) -- [ADR 026: IBC Client Recovery Mechanisms](./adr-026-ibc-client-recovery-mechanisms.md) \ No newline at end of file +- [ADR 026: IBC Client Recovery Mechanisms](./adr-026-ibc-client-recovery-mechanisms.md) +- [ADR 027: Deterministic Protobuf Serialization](./adr-027-deterministic-protobuf-serialization.md) diff --git a/docs/architecture/adr-020-protobuf-transaction-encoding.md b/docs/architecture/adr-020-protobuf-transaction-encoding.md index 7146a1aa3..885e71342 100644 --- a/docs/architecture/adr-020-protobuf-transaction-encoding.md +++ b/docs/architecture/adr-020-protobuf-transaction-encoding.md @@ -8,6 +8,7 @@ - 2020 April 30: Switch to `Any` - 2020 May 14: Describe public key encoding - 2020 June 08: Store `TxBody` and `AuthInfo` as bytes in `SignDoc`; Document `TxRaw` as broadcast and storage type. +- 2020 August 07: Use ADR 027 for serializing `SignDoc`. ## Status @@ -188,8 +189,6 @@ message SignDoc { bytes auth_info = 2; string chain_id = 3; uint64 account_number = 4; - // account_sequence starts at 1 rather than 0 to avoid the case where - // the default 0 value must be omitted in protobuf serialization uint64 account_sequence = 5; } ``` @@ -197,8 +196,7 @@ message SignDoc { In order to sign in the default mode, clients take the following steps: 1. Serialize `TxBody` and `AuthInfo` using any valid protobuf implementation. -2. Create a `SignDoc` and encode it. (The only requirement of the underlying - protobuf implementation is that fields are serialized in order). +2. Create a `SignDoc` and serialize it using [ADR 027](./adr-027-deterministic-protobuf-serialization.md). 3. Sign the encoded `SignDoc` bytes. 4. Build a `TxRaw` and serialize it for broadcasting. @@ -214,8 +212,7 @@ Signature verifiers do: 3. For each required signer: - Pull account number and sequence from the state. - Obtain the public key either from state or `AuthInfo`'s `signer_infos`. - - Create a `SignDoc` and serialize it. Due to the simplicity of the type it - is expected that this matches the serialization used by the signer. + - Create a `SignDoc` and serialize it using [ADR 027](./adr-027-deterministic-protobuf-serialization.md). - Verify the signature at the the same list position against the serialized `SignDoc`. #### `SIGN_MODE_LEGACY_AMINO` @@ -423,8 +420,6 @@ message SignDocAux { PublicKey public_key = 2; string chain_id = 3; uint64 account_number = 4; - // account_sequence starts at 1 rather than 0 to avoid the case where - // the default 0 value must be omitted in protobuf serialization uint64 account_sequence = 5; } ``` diff --git a/docs/architecture/adr-027-deterministic-protobuf-serialization.md b/docs/architecture/adr-027-deterministic-protobuf-serialization.md new file mode 100644 index 000000000..fbc96b7dc --- /dev/null +++ b/docs/architecture/adr-027-deterministic-protobuf-serialization.md @@ -0,0 +1,251 @@ +# ADR 027: Deterministic Protobuf Serialization + +## Changelog + +- 2020-08-07: Initial Draft + +## Status + +Proposed + +## Context + +[Protobuf](https://developers.google.com/protocol-buffers/docs/proto3) +seralization is not unique (i.e. there exist a practically unlimited number of +valid binary representations for a protobuf document)1. For signature +verification in Cosmos SDK, signer and verifier need to agree on the same +serialization of a SignDoc as defined in +[ADR-020](./adr-020-protobuf-transaction-encoding.md) without transmitting the +serialization. This document describes a deterministic serialization scheme for +a subset of protobuf documents, that covers this use case but can be reused in +other cases as well. + +## Decision + +The following encoding scheme is proposed to be used by other ADRs. + +### Scope + +This ADR defines a protobuf3 serializer. The output is a valid protobuf +serialization, such that every protobuf parser can parse it. + +No maps are supported in version 1 due to the complexity of defining a +derterministic serialization. This might change in future. Implementations must +reject documents containing maps as invalid input. + +### Serialization rules + +The serialization is based on the +[protobuf 3 encoding](https://developers.google.com/protocol-buffers/docs/encoding) +with the following additions: + +1. Fields must be serialized only once in ascending order +2. Extra fields or any extra data must not be added +3. [Default values](https://developers.google.com/protocol-buffers/docs/proto3#default) + must be omitted +4. `repeated` fields of scalar numeric types must use + [packed encoding](https://developers.google.com/protocol-buffers/docs/encoding#packed) + by default. +5. Variant encoding of integers must not be longer than needed. + +While rule number 1. and 2. should be pretty straight forward and describe the +default behaviour of all protobuf encoders the author is aware of, the 3rd rule +is more interesting. After a protobuf 3 deserialization you cannot differentiate +between unset fields and fields set to the default value2. At +serialization level however, it is possible to set the fields with an empty +value or omitting them entirely. This is a significant difference to e.g. JSON +where a property can be empty (`""`, `0`), `null` or undefined, leading to 3 +different documents. + +Omitting fields set to default values is valid because the parser must assign +the default value to fields missing in the serialization3. For scalar +types, omitting defaults is required by the spec4. For `repeated` +fields, not serializing them is the only way to express empty lists. Enums must +have a first element of numeric value 0, which is the default5. And +message fields default to unset6. + +Omitting defaults allows for some amount of forward compatibility: users of +newer versions of a protobuf schema produce the same serialization as users of +older versions as long as newly added fields are not used (i.e. set to their +default value). + +### Implementation + +There are three main implementation strategies, ordered from the least to the +most custom development: + +- **Use a protobuf serializer that follows the above rules by default.** E.g. + [gogoproto](https://pkg.go.dev/github.com/gogo/protobuf/gogoproto) is known to + be compliant by in most cases, but not when certain annotations such as + `nullable = false` are used. It might also be an option to configure an + existing serializer accordingly. +- **Normalize default values before encoding them.** If your serializer follows + rule 1. and 2. and allows you to explicitly unset fields for serialization, + you can normalize default values to unset. This can be done when working with + [protobuf.js](https://www.npmjs.com/package/protobufjs): + + ```js + const bytes = SignDoc.encode({ + bodyBytes: body.length > 0 ? body : null, // normalize empty bytes to unset + authInfoBytes: authInfo.length > 0 ? authInfo : null, // normalize empty bytes to unset + chainId: chainId || null, // normalize "" to unset + accountNumber: accountNumber || null, // normalize 0 to unset + accountSequence: accountSequence || null, // normalize 0 to unset + }).finish(); + ``` + +- **Use a hand-written serializer for the types you need.** If none of the above + ways works for you, you can write a serializer yourself. For SignDoc this + would look something like this in Go, building on existing protobuf utilities: + + ```go + if !signDoc.body_bytes.empty() { + buf.WriteUVarInt64(0xA) // wire type and field number for body_bytes + buf.WriteUVarInt64(signDoc.body_bytes.length()) + buf.WriteBytes(signDoc.body_bytes) + } + + if !signDoc.auth_info.empty() { + buf.WriteUVarInt64(0x12) // wire type and field number for auth_info + buf.WriteUVarInt64(signDoc.auth_info.length()) + buf.WriteBytes(signDoc.auth_info) + } + + if !signDoc.chain_id.empty() { + buf.WriteUVarInt64(0x1a) // wire type and field number for chain_id + buf.WriteUVarInt64(signDoc.chain_id.length()) + buf.WriteBytes(signDoc.chain_id) + } + + if signDoc.account_number != 0 { + buf.WriteUVarInt64(0x20) // wire type and field number for account_number + buf.WriteUVarInt(signDoc.account_number) + } + + if signDoc.account_sequence != 0 { + buf.WriteUVarInt64(0x28) // wire type and field number for account_sequence + buf.WriteUVarInt(signDoc.account_sequence) + } + ``` + +### Test vectors + +Given the protobuf definition `Article.proto` + +```protobuf +package blog; +syntax = "proto3"; + +enum Type { + UNSPECIFIED = 0; + IMAGES = 1; + NEWS = 2; +}; + +enum Review { + UNSPECIFIED = 0; + ACCEPTED = 1; + REJECTED = 2; +}; + +message Article { + string title = 1; + string description = 2; + uint64 created = 3; + uint64 updated = 4; + bool public = 5; + bool promoted = 6; + Type type = 7; + Review review = 8; + repeated string comments = 9; + repeated string backlinks = 10; +}; +``` + +serializing the values + +```yaml +title: "The world needs change 🌳" +description: "" +created: 1596806111080 +updated: 0 +public: true +promoted: false +type: Type.NEWS +review: Review.UNSPECIFIED +comments: ["Nice one", "Thank you"] +backlinks: [] +``` + +must result in the serialization + +``` +0a1b54686520776f726c64206e65656473206368616e676520f09f8cb318e8bebec8bc2e280138024a084e696365206f6e654a095468616e6b20796f75 +``` + +When inspecting the serialized document, you see that every second field is +omitted: + +``` +$ echo 0a1b54686520776f726c64206e65656473206368616e676520f09f8cb318e8bebec8bc2e280138024a084e696365206f6e654a095468616e6b20796f75 | xxd -r -p | protoc --decode_raw +1: "The world needs change \360\237\214\263" +3: 1596806111080 +5: 1 +7: 2 +9: "Nice one" +9: "Thank you" +``` + +## Consequences + +Having such an encoding available allows us to get deterministic serialization +for all protobuf documents we need in the context of Cosmos SDK signing. + +### Positive + +- Well defined rules that can be verified independent of a reference + implementation +- Simple enough to keep the barrier to implement transaction signing low +- It allows us to continue to use 0 and other empty values in SignDoc, avoiding + the need to work around 0 sequences. This does not imply the change from + https://github.com/cosmos/cosmos-sdk/pull/6949 should not be merged, but not + too important anymore. + +### Negative + +- When implementing transaction signing, the encoding rules above must be + understood and implemented. +- The need for rule number 3. adds some complexity to implementations. + +### Neutral + +## References + +- 1 _When a message is serialized, there is no guaranteed order for + how its known or unknown fields should be written. Serialization order is an + implementation detail and the details of any particular implementation may + change in the future. Therefore, protocol buffer parsers must be able to parse + fields in any order._ from + https://developers.google.com/protocol-buffers/docs/encoding#order +- 2 _Note that for scalar message fields, once a message is parsed + there's no way of telling whether a field was explicitly set to the default + value (for example whether a boolean was set to false) or just not set at all: + you should bear this in mind when defining your message types. For example, + don't have a boolean that switches on some behaviour when set to false if you + don't want that behaviour to also happen by default._ from + https://developers.google.com/protocol-buffers/docs/proto3#default +- 3 _When a message is parsed, if the encoded message does not + contain a particular singular element, the corresponding field in the parsed + object is set to the default value for that field._ from + https://developers.google.com/protocol-buffers/docs/proto3#default +- 4 _Also note that if a scalar message field is set to its default, + the value will not be serialized on the wire._ from + https://developers.google.com/protocol-buffers/docs/proto3#default +- 5 _For enums, the default value is the first defined enum value, + which must be 0._ from + https://developers.google.com/protocol-buffers/docs/proto3#default +- 6 _For message fields, the field is not set. Its exact value is + language-dependent._ from + https://developers.google.com/protocol-buffers/docs/proto3#default +- Encoding rules and parts of the reasoning taken from + [canonical-proto3 Aaron Craelius](https://github.com/regen-network/canonical-proto3)