315 lines
13 KiB
Markdown
315 lines
13 KiB
Markdown
# ADR 027: Deterministic Protobuf Serialization
|
|
|
|
## Changelog
|
|
|
|
* 2020-08-07: Initial Draft
|
|
* 2020-09-01: Further clarify rules
|
|
|
|
## Status
|
|
|
|
Proposed
|
|
|
|
## Abstract
|
|
|
|
Fully deterministic structure serialization, which works across many languages and clients,
|
|
is needed when signing messages. We need to be sure that whenever we serialize
|
|
a data structure, no matter in which supported language, the raw bytes
|
|
will stay the same.
|
|
[Protobuf](https://developers.google.com/protocol-buffers/docs/proto3)
|
|
serialization is not bijective (i.e. there exist a practically unlimited number of
|
|
valid binary representations for a given protobuf document)<sup>1</sup>.
|
|
|
|
This document describes a deterministic serialization scheme for
|
|
a subset of protobuf documents, that covers this use case but can be reused in
|
|
other cases as well.
|
|
|
|
### Context
|
|
|
|
For signature verification in Cosmos SDK, the signer and verifier need to agree on
|
|
the same serialization of a `SignDoc` as defined in
|
|
[ADR-020](./adr-020-protobuf-transaction-encoding.md) without transmitting the
|
|
serialization.
|
|
|
|
Currently, for block signatures we are using a workaround: we create a new [TxRaw](https://github.com/cosmos/cosmos-sdk/blob/9e85e81e0e8140067dd893421290c191529c148c/proto/cosmos/tx/v1beta1/tx.proto#L30)
|
|
instance (as defined in [adr-020-protobuf-transaction-encoding](https://github.com/cosmos/cosmos-sdk/blob/main/docs/architecture/adr-020-protobuf-transaction-encoding.md#transactions))
|
|
by converting all [Tx](https://github.com/cosmos/cosmos-sdk/blob/9e85e81e0e8140067dd893421290c191529c148c/proto/cosmos/tx/v1beta1/tx.proto#L13)
|
|
fields to bytes on the client side. This adds an additional manual
|
|
step when sending and signing transactions.
|
|
|
|
### Decision
|
|
|
|
The following encoding scheme is to be used by other ADRs,
|
|
and in particular for `SignDoc` serialization.
|
|
|
|
## Specification
|
|
|
|
### Scope
|
|
|
|
This ADR defines a protobuf3 serializer. The output is a valid protobuf
|
|
serialization, such that every protobuf parser can parse it.
|
|
|
|
No maps are supported in version 1 due to the complexity of defining a
|
|
deterministic serialization. This might change in future. Implementations must
|
|
reject documents containing maps as invalid input.
|
|
|
|
### Background - Protobuf3 Encoding
|
|
|
|
Most numeric types in protobuf3 are encoded as
|
|
[varints](https://developers.google.com/protocol-buffers/docs/encoding#varints).
|
|
Varints are at most 10 bytes, and since each varint byte has 7 bits of data,
|
|
varints are a representation of `uint70` (70-bit unsigned integer). When
|
|
encoding, numeric values are casted from their base type to `uint70`, and when
|
|
decoding, the parsed `uint70` is casted to the appropriate numeric type.
|
|
|
|
The maximum valid value for a varint that complies with protobuf3 is
|
|
`FF FF FF FF FF FF FF FF FF 7F` (i.e. `2**70 -1`). If the field type is
|
|
`{,u,s}int64`, the highest 6 bits of the 70 are dropped during decoding,
|
|
introducing 6 bits of malleability. If the field type is `{,u,s}int32`, the
|
|
highest 38 bits of the 70 are dropped during decoding, introducing 38 bits of
|
|
malleability.
|
|
|
|
Among other sources of non-determinism, this ADR eliminates the possibility of
|
|
encoding malleability.
|
|
|
|
### Serialization rules
|
|
|
|
The serialization is based on the
|
|
[protobuf3 encoding](https://developers.google.com/protocol-buffers/docs/encoding)
|
|
with the following additions:
|
|
|
|
1. Fields must be serialized only once in ascending order
|
|
2. Extra fields or any extra data must not be added
|
|
3. [Default values](https://developers.google.com/protocol-buffers/docs/proto3#default)
|
|
must be omitted
|
|
4. `repeated` fields of scalar numeric types must use
|
|
[packed encoding](https://developers.google.com/protocol-buffers/docs/encoding#packed)
|
|
5. Varint encoding must not be longer than needed:
|
|
* No trailing zero bytes (in little endian, i.e. no leading zeroes in big
|
|
endian). Per rule 3 above, the default value of `0` must be omitted, so
|
|
this rule does not apply in such cases.
|
|
* The maximum value for a varint must be `FF FF FF FF FF FF FF FF FF 01`.
|
|
In other words, when decoded, the highest 6 bits of the 70-bit unsigned
|
|
integer must be `0`. (10-byte varints are 10 groups of 7 bits, i.e.
|
|
70 bits, of which only the lowest 70-6=64 are useful.)
|
|
* The maximum value for 32-bit values in varint encoding must be `FF FF FF FF 0F`
|
|
with one exception (below). In other words, when decoded, the highest 38
|
|
bits of the 70-bit unsigned integer must be `0`.
|
|
* The one exception to the above is _negative_ `int32`, which must be
|
|
encoded using the full 10 bytes for sign extension<sup>2</sup>.
|
|
* The maximum value for Boolean values in varint encoding must be `01` (i.e.
|
|
it must be `0` or `1`). Per rule 3 above, the default value of `0` must
|
|
be omitted, so if a Boolean is included it must have a value of `1`.
|
|
|
|
While rule number 1. and 2. should be pretty straight forward and describe the
|
|
default behavior of all protobuf encoders the author is aware of, the 3rd rule
|
|
is more interesting. After a protobuf3 deserialization you cannot differentiate
|
|
between unset fields and fields set to the default value<sup>3</sup>. At
|
|
serialization level however, it is possible to set the fields with an empty
|
|
value or omitting them entirely. This is a significant difference to e.g. JSON
|
|
where a property can be empty (`""`, `0`), `null` or undefined, leading to 3
|
|
different documents.
|
|
|
|
Omitting fields set to default values is valid because the parser must assign
|
|
the default value to fields missing in the serialization<sup>4</sup>. For scalar
|
|
types, omitting defaults is required by the spec<sup>5</sup>. For `repeated`
|
|
fields, not serializing them is the only way to express empty lists. Enums must
|
|
have a first element of numeric value 0, which is the default<sup>6</sup>. And
|
|
message fields default to unset<sup>7</sup>.
|
|
|
|
Omitting defaults allows for some amount of forward compatibility: users of
|
|
newer versions of a protobuf schema produce the same serialization as users of
|
|
older versions as long as newly added fields are not used (i.e. set to their
|
|
default value).
|
|
|
|
### Implementation
|
|
|
|
There are three main implementation strategies, ordered from the least to the
|
|
most custom development:
|
|
|
|
* **Use a protobuf serializer that follows the above rules by default.** E.g.
|
|
[gogoproto](https://pkg.go.dev/github.com/cosmos/gogoproto/gogoproto) is known to
|
|
be compliant by in most cases, but not when certain annotations such as
|
|
`nullable = false` are used. It might also be an option to configure an
|
|
existing serializer accordingly.
|
|
* **Normalize default values before encoding them.** If your serializer follows
|
|
rule 1. and 2. and allows you to explicitly unset fields for serialization,
|
|
you can normalize default values to unset. This can be done when working with
|
|
[protobuf.js](https://www.npmjs.com/package/protobufjs):
|
|
|
|
```js
|
|
const bytes = SignDoc.encode({
|
|
bodyBytes: body.length > 0 ? body : null, // normalize empty bytes to unset
|
|
authInfoBytes: authInfo.length > 0 ? authInfo : null, // normalize empty bytes to unset
|
|
chainId: chainId || null, // normalize "" to unset
|
|
accountNumber: accountNumber || null, // normalize 0 to unset
|
|
accountSequence: accountSequence || null, // normalize 0 to unset
|
|
}).finish();
|
|
```
|
|
|
|
* **Use a hand-written serializer for the types you need.** If none of the above
|
|
ways works for you, you can write a serializer yourself. For SignDoc this
|
|
would look something like this in Go, building on existing protobuf utilities:
|
|
|
|
```go
|
|
if !signDoc.body_bytes.empty() {
|
|
buf.WriteUVarInt64(0xA) // wire type and field number for body_bytes
|
|
buf.WriteUVarInt64(signDoc.body_bytes.length())
|
|
buf.WriteBytes(signDoc.body_bytes)
|
|
}
|
|
|
|
if !signDoc.auth_info.empty() {
|
|
buf.WriteUVarInt64(0x12) // wire type and field number for auth_info
|
|
buf.WriteUVarInt64(signDoc.auth_info.length())
|
|
buf.WriteBytes(signDoc.auth_info)
|
|
}
|
|
|
|
if !signDoc.chain_id.empty() {
|
|
buf.WriteUVarInt64(0x1a) // wire type and field number for chain_id
|
|
buf.WriteUVarInt64(signDoc.chain_id.length())
|
|
buf.WriteBytes(signDoc.chain_id)
|
|
}
|
|
|
|
if signDoc.account_number != 0 {
|
|
buf.WriteUVarInt64(0x20) // wire type and field number for account_number
|
|
buf.WriteUVarInt(signDoc.account_number)
|
|
}
|
|
|
|
if signDoc.account_sequence != 0 {
|
|
buf.WriteUVarInt64(0x28) // wire type and field number for account_sequence
|
|
buf.WriteUVarInt(signDoc.account_sequence)
|
|
}
|
|
```
|
|
|
|
### Test vectors
|
|
|
|
Given the protobuf definition `Article.proto`
|
|
|
|
```protobuf
|
|
package blog;
|
|
syntax = "proto3";
|
|
|
|
enum Type {
|
|
UNSPECIFIED = 0;
|
|
IMAGES = 1;
|
|
NEWS = 2;
|
|
};
|
|
|
|
enum Review {
|
|
UNSPECIFIED = 0;
|
|
ACCEPTED = 1;
|
|
REJECTED = 2;
|
|
};
|
|
|
|
message Article {
|
|
string title = 1;
|
|
string description = 2;
|
|
uint64 created = 3;
|
|
uint64 updated = 4;
|
|
bool public = 5;
|
|
bool promoted = 6;
|
|
Type type = 7;
|
|
Review review = 8;
|
|
repeated string comments = 9;
|
|
repeated string backlinks = 10;
|
|
};
|
|
```
|
|
|
|
serializing the values
|
|
|
|
```yaml
|
|
title: "The world needs change 🌳"
|
|
description: ""
|
|
created: 1596806111080
|
|
updated: 0
|
|
public: true
|
|
promoted: false
|
|
type: Type.NEWS
|
|
review: Review.UNSPECIFIED
|
|
comments: ["Nice one", "Thank you"]
|
|
backlinks: []
|
|
```
|
|
|
|
must result in the serialization
|
|
|
|
```text
|
|
0a1b54686520776f726c64206e65656473206368616e676520f09f8cb318e8bebec8bc2e280138024a084e696365206f6e654a095468616e6b20796f75
|
|
```
|
|
|
|
When inspecting the serialized document, you see that every second field is
|
|
omitted:
|
|
|
|
```sh
|
|
$ echo 0a1b54686520776f726c64206e65656473206368616e676520f09f8cb318e8bebec8bc2e280138024a084e696365206f6e654a095468616e6b20796f75 | xxd -r -p | protoc --decode_raw
|
|
1: "The world needs change \360\237\214\263"
|
|
3: 1596806111080
|
|
5: 1
|
|
7: 2
|
|
9: "Nice one"
|
|
9: "Thank you"
|
|
```
|
|
|
|
## Consequences
|
|
|
|
Having such an encoding available allows us to get deterministic serialization
|
|
for all protobuf documents we need in the context of Cosmos SDK signing.
|
|
|
|
### Positive
|
|
|
|
* Well defined rules that can be verified independent of a reference
|
|
implementation
|
|
* Simple enough to keep the barrier to implement transaction signing low
|
|
* It allows us to continue to use 0 and other empty values in SignDoc, avoiding
|
|
the need to work around 0 sequences. This does not imply the change from
|
|
https://github.com/cosmos/cosmos-sdk/pull/6949 should not be merged, but not
|
|
too important anymore.
|
|
|
|
### Negative
|
|
|
|
* When implementing transaction signing, the encoding rules above must be
|
|
understood and implemented.
|
|
* The need for rule number 3. adds some complexity to implementations.
|
|
* Some data structures may require custom code for serialization. Thus
|
|
the code is not very portable - it will require additional work for each
|
|
client implementing serialization to properly handle custom data structures.
|
|
|
|
### Neutral
|
|
|
|
### Usage in Cosmos SDK
|
|
|
|
For the reasons mentioned above ("Negative" section) we prefer to keep workarounds
|
|
for shared data structure. Example: the aforementioned `TxRaw` is using raw bytes
|
|
as a workaround. This allows them to use any valid Protobuf library without
|
|
the need of implementing a custom serializer that adheres to this standard (and related risks of bugs).
|
|
|
|
## References
|
|
|
|
* <sup>1</sup> _When a message is serialized, there is no guaranteed order for
|
|
how its known or unknown fields should be written. Serialization order is an
|
|
implementation detail and the details of any particular implementation may
|
|
change in the future. Therefore, protocol buffer parsers must be able to parse
|
|
fields in any order._ from
|
|
https://developers.google.com/protocol-buffers/docs/encoding#order
|
|
* <sup>2</sup> https://developers.google.com/protocol-buffers/docs/encoding#signed_integers
|
|
* <sup>3</sup> _Note that for scalar message fields, once a message is parsed
|
|
there's no way of telling whether a field was explicitly set to the default
|
|
value (for example whether a boolean was set to false) or just not set at all:
|
|
you should bear this in mind when defining your message types. For example,
|
|
don't have a boolean that switches on some behavior when set to false if you
|
|
don't want that behavior to also happen by default._ from
|
|
https://developers.google.com/protocol-buffers/docs/proto3#default
|
|
* <sup>4</sup> _When a message is parsed, if the encoded message does not
|
|
contain a particular singular element, the corresponding field in the parsed
|
|
object is set to the default value for that field._ from
|
|
https://developers.google.com/protocol-buffers/docs/proto3#default
|
|
* <sup>5</sup> _Also note that if a scalar message field is set to its default,
|
|
the value will not be serialized on the wire._ from
|
|
https://developers.google.com/protocol-buffers/docs/proto3#default
|
|
* <sup>6</sup> _For enums, the default value is the first defined enum value,
|
|
which must be 0._ from
|
|
https://developers.google.com/protocol-buffers/docs/proto3#default
|
|
* <sup>7</sup> _For message fields, the field is not set. Its exact value is
|
|
language-dependent._ from
|
|
https://developers.google.com/protocol-buffers/docs/proto3#default
|
|
* Encoding rules and parts of the reasoning taken from
|
|
[canonical-proto3 Aaron Craelius](https://github.com/regen-network/canonical-proto3)
|