zips/zip-0173.rst

::

  ZIP: 173
  Title: Bech32 Format
  Owners: Daira Emma Hopwood <daira@electriccoin.co>
  Credits: Pieter Wuille <pieter.wuille@gmail.com>
           Greg Maxwell <greg@xiph.org>
           Rusty Russell
           Mark Friedenbach
  Status: Final
  Category: Standards / Wallet
  Created: 2018-06-13
  License: MIT


Terminology
===========

The key words "MUST", "MUST NOT", "SHOULD", and "SHOULD NOT" in this document are
to be interpreted as described in BCP 14 [#BCP14]_ when, and only when, they appear
in all capitals.

The term "network upgrade" in this document is to be interpreted as described in
ZIP 200. [#zip-0200]_

The term "Sapling" in this document is to be interpreted as described in ZIP 205.
[#zip-0205]_


Abstract
========

This document proposes a checksummed base32 format, "Bech32", and a standard for
Sapling addresses and keys using it.


Motivation
==========

Since launch, Zcash has relied on Base58 addresses with a truncated double-SHA256
checksum, inherited from Bitcoin. They were part of the original Bitcoin software
and their scope was extended in BIP 13 [#bip-0013]_ for P2SH (Pay-to-Script-Hash)
addresses. However, both the character set and the checksum algorithm have
limitations:

* Base58 needs a lot of space in QR codes, as it cannot use the
  ''alphanumeric mode''.

* The mixed case in Base58 makes it inconvenient to reliably write down, type on
  mobile keyboards, or read out loud.

* The double-SHA256 checksum is slow and has no error-detection guarantees.

* Most of the research on error-detecting codes only applies to character-set
  sizes that are a `prime power <https://en.wikipedia.org/wiki/Prime_power>`_,
  which 58 is not.

* Base58 decoding is complicated and relatively slow.

To address these issues, Bitcoin adopted a new encoding called Bech32 [#bip-0173]_,
for use with address types added by its Segregated Witness proposal. Zcash does
not implement Segregated Witness, but it reuses Bech32 with address and key types
introduced by the Sapling network upgrade [#zip-0205]_.

Since the description of Bech32 in [#bip-0173]_ is partially entangled with
Segregated Witness address formats, we have found it clearer to write this ZIP to
specify Bech32 separately. This specification should be read in conjunction with
section 5.6 ("Encodings of Addresses and Keys") of the Zcash Sapling protocol
specification [#protocol]_, and with ZIP 32 which specifies additional key types
for support of shielded hierarchical deterministic wallets [#zip-0032]_.


Specification
=============

We describe the general checksummed base32 format called ''Bech32''. Its use for
Sapling payment addresses and keys is defined in the Sapling protocol specification
[#protocol]_.

A Bech32 string consists of:

* The *human-readable part*, which is intended to convey the type of data, or
  anything else that is relevant to the reader. This part MUST contain 1 to 83
  US-ASCII characters, with each character having a value in the range [33-126].
  HRP validity may be further restricted by specific applications.

* The *separator*, which is always "1". In case "1" is allowed inside the
  human-readable part, the last one in the string is the separator.

* The *data part*, which is at least 6 characters long and only consists of
  alphanumeric characters excluding "1", "b", "i", and "o".

+-----+---+---+---+---+---+---+---+---+
|     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+=====+===+===+===+===+===+===+===+===+
|  +0 | q | p | z | r | y | 9 | x | 8 |
+-----+---+---+---+---+---+---+---+---+
|  +8 | g | f | 2 | t | v | d | w | 0 |
+-----+---+---+---+---+---+---+---+---+
| +16 | s | 3 | j | n | 5 | 4 | k | h |
+-----+---+---+---+---+---+---+---+---+
| +24 | c | e | 6 | m | u | a | 7 | l |
+-----+---+---+---+---+---+---+---+---+


Checksum
--------

The last six characters of the data part form a checksum and contain no information.
Valid strings MUST pass the criteria for validity specified by the Python3 code
snippet below. The checksum is defined so that the function ``bech32_verify_checksum``
returns true when its arguments are:

* ``hrp``: the human-readable part as a string;
* ``data``: the data part as a list of integers representing the characters after
  conversion using the table above.

::

  def bech32_polymod(values):
    GEN = [0x3b6a57b2, 0x26508e6d, 0x1ea119fa, 0x3d4233dd, 0x2a1462b3]
    chk = 1
    for v in values:
      b = (chk >> 25)
      chk = (chk & 0x1ffffff) << 5 ^ v
      for i in range(5):
      chk ^= GEN[i] if ((b >> i) & 1) else 0
    return chk

  def bech32_hrp_expand(s):
    return [ord(x) >> 5 for x in s] + [0] + [ord(x) & 31 for x in s]

  def bech32_verify_checksum(hrp, data):
    return bech32_polymod(bech32_hrp_expand(hrp) + data) == 1


This implements a `BCH code <https://en.wikipedia.org/wiki/BCH_code>`_;
in the case where the encoded string is at most 90 characters, this code
guarantees detection of *any error affecting at most 4 characters* and has
less than a 1 in 10\ :sup:`9` chance of failing to detect more errors.
More details about the properties can be found in the Checksum Design section.
The human-readable part is processed by first feeding the higher 3 bits of each
character's US-ASCII value into the checksum calculation followed by a zero
and then the lower 5 bits of each US-ASCII value.

To construct a valid checksum given the human-readable part and (non-checksum)
values of the data-part characters, the code below can be used:

::

  def bech32_create_checksum(hrp, data):
    values = bech32_hrp_expand(hrp) + data
    polymod = bech32_polymod(values + [0,0,0,0,0,0]) ^ 1
    return [(polymod >> 5 * (5 - i)) & 31 for i in range(6)]


Error correction
''''''''''''''''

One of the properties of these BCH codes is that they can be used for error
correction. An unfortunate side effect of error correction is that it erodes
error detection: correction changes invalid inputs into valid inputs, but if
more than a few errors were made then the valid input may not be the correct
input. Use of an incorrect but valid input can cause funds to be lost
irrecoverably. Because of this, implementations SHOULD NOT implement correction
beyond potentially suggesting to the user where in the string an error might be
found, without suggesting the correction to make.

Uppercase/lowercase
'''''''''''''''''''

The lowercase form is used when determining a character's value for checksum
purposes.

Encoders MUST always output an all-lowercase Bech32 string. If an uppercase
version of the encoding result is desired (e.g. for presentation purposes, or
QR code use), then an uppercasing procedure can be performed external to the
encoding process.

Decoders MUST NOT accept strings where some characters are uppercase and some
are lowercase (such strings are referred to as mixed-case strings).

For presentation, lowercase is usually preferable, but inside QR codes
uppercase SHOULD be used, as those permit the use of `alphanumeric mode
<https://www.thonky.com/qr-code-tutorial/alphanumeric-mode-encoding>`_,
which is 45% more compact than the `byte mode
<https://www.thonky.com/qr-code-tutorial/byte-mode-encoding>`_ that would
otherwise be used.

Encoding
''''''''

* Start with the bits of the raw encoding of the appropriate address or key
  type, most significant bit per byte first.

* Re-arrange those bits into groups of 5, and pad with zeroes at the end if
  needed.

* Translate those bits, most significant bit first, to characters using the
  table above.

Decoding
''''''''

Software interpreting a Bech32-encoded address MUST first verify that the
human-readable part matches that of a specified address type, and similarly
for keys.

If this check passes, convert the rest of the data to bytes:

* Translate the values using the table above to 5 bits, most significant bit
  first.

* Re-arrange those bits into groups of 8 bits. Any incomplete group at the
  end MUST be 4 bits or fewer, MUST be all zeroes, and is discarded.

* The resulting groups are interpreted as the raw encoding of the appropriate
  address or key type, with the most significant bit first in each byte.

Decoders SHOULD enforce known-length restrictions on address and key types.

For example, [#protocol]_ specifies that the length of the raw encoding of a
Sapling payment address is 43 bytes (88 + 256 bits). This results in a
Bech32-encoded Sapling payment address being 78 characters long.


Compatibility
=============

Only software supporting the Sapling network upgrade is able to use these
addresses.

There is no effect on support for transparent addresses and keys, or for Sprout
z-addresses and keys; these will continue to be encoded using Base58Check, and
MUST NOT be encoded using Bech32.


Rationale
=========

Why use base32 at all?
----------------------

The lack of mixed case makes it more efficient to read out loud or to put into
QR codes. It does come with a 15% length increase. However, the length of a
Bech32-encoded Sapling payment address remains below 80 characters, which
reduces the likelihood of line splitting in terminals or email. Thus,
cutting-and-pasting of addresses is still reliable.

Why call it Bech32?
-------------------

"Bech" contains the characters BCH (the error detection algorithm used) and
sounds a bit like "base". The term Bech32 is established for Bitcoin and there
was no reason to use a different name for it in Zcash Sapling.

Why not support Bech32 encoding of Sprout or transparent addresses?
-------------------------------------------------------------------

This was not considered to be sufficiently well-motivated given the
compatibility issues that would arise from having two formats for these
address types, with pre-Sapling software not supporting the new format.

Why include a separator in addresses?
-------------------------------------

That way the human-readable part is unambiguously separated from the data part,
avoiding potential collisions with other human-readable parts that share a
prefix. It also allows us to avoid having character-set restrictions on the
human-readable part. The separator is ''1'' because using a non-alphanumeric
character would complicate copy-pasting of addresses (with no double-click
selection in several applications). Therefore an alphanumeric character outside
the normal character set was chosen.

Why not use an existing base32 character set?
---------------------------------------------

Existing character sets for base-32 encodings include
`RFC 3548 <https://www.rfc-editor.org/rfc/rfc3548.html>`_ and
`z-base-32 <https://philzimmermann.com/docs/human-oriented-base-32-encoding.txt>`_.
The set used for Bech32 was chosen to minimize ambiguity according to
`this <https://hissa.nist.gov/~black/GTLD/>`_ visual similarity data, and the
ordering is chosen to minimize the number of pairs of similar characters
(according to the same data) that differ in more than 1 bit. As the checksum is
chosen to maximize detection capabilities for low numbers of bit errors, this
choice improves its performance under some error models.

Why are the high bits of the human-readable part processed first?
-----------------------------------------------------------------

This design choice had a rationale for Bitcoin Segregated Witness addresses
(see [#bip-0173]_) that does not apply to Zcash Sapling addresses. It is
retained for compatibility with Bech32 encoders/decoders written for Bitcoin.


Reference implementations
=========================

* The encoder/decoder used by ``zcashd``:

  * `In C++ <https://github.com/zcash/zcash/blob/master/src/bech32.cpp>`_

* Encoders and decoders written for Bitcoin:

  * `For C <https://github.com/sipa/bech32/tree/master/ref/c>`_
  * `For C++ <https://github.com/sipa/bech32/tree/master/ref/c++>`_
  * `For Javascript <https://github.com/sipa/bech32/tree/master/ref/javascript>`_
  * `For Go <https://github.com/sipa/bech32/tree/master/ref/go>`_
  * `For Python <https://github.com/sipa/bech32/tree/master/ref/python>`_
  * `For Haskell <https://github.com/sipa/bech32/tree/master/ref/haskell>`_
  * `For Ruby <https://github.com/sipa/bech32/tree/master/ref/ruby>`_
  * `For Rust <https://github.com/sipa/bech32/tree/master/ref/rust>`_

* Fancy decoder written for Bitcoin that localizes errors:

  * `Fancy decoder for Javascript <https://github.com/sipa/bech32/tree/master/ecc/javascript>`_
    (`demo website <HTTP://bitcoin.sipa.be/bech32/demo/demo.html>`_)

Note that the encoders written for Bitcoin may make assumptions specific to
Segregated Witness address formats that do not apply to Zcash. Only the Python
one has been tested with Zcash at the time of writing.


Examples
========

TODO: add valid Sapling payment addresses and keys, and their corresponding raw encodings.


Test vectors
============

The following strings are valid Bech32:

* ``A12UEL5L``
* ``a12uel5l``
* ``an83characterlonghumanreadablepartthatcontainsthenumber1andtheexcludedcharactersbio1tt5tgs``
* ``abcdef1qpzry9x8gf2tvdw0s3jn54khce6mua7lmqqqxw``
* ``11qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqc8247j``
* ``split1checkupstagehandshakeupstreamerranterredcaperred2y9e3w``
* ``?1ezyfcl``

The following strings are not valid Bech32 (with reason for invalidity):

* 0x20 + ``1nwldj5``: HRP character out of range
* 0x7F + ``1axkwrx``: HRP character out of range
* 0x80 + ``1eym55h``: HRP character out of range
* ``pzry9x0s0muk``: No separator character
* ``1pzry9x0s0muk``: Empty HRP
* ``x1b4n0q5v``: Invalid data character
* ``li1dgmt3``: Too short checksum
* ``de1lg7wt`` + 0xFF: Invalid character in checksum
* ``A1G7SGD8``: checksum calculated with uppercase form of HRP
* ``10a06t8``: empty HRP
* ``1qzzfhee``: empty HRP
* ``bc1qw508d6qejxtdg4y5r3zarvary0c5xw7kv8f3t5``: Invalid checksum
* ``tb1qrp33g0q5c5txsp9arysrx4k6zdkfs4nce4xj0gdcccefvpysxf3q0sL5k7``: Mixed case
* ``bc1zw508d6qejxtdg4y5r3zarvaryvqyzf3du``: Zero padding of more than 4 bits
* ``tb1qrp33g0q5c5txsp9arysrx4k6zdkfs4nce4xj0gdcccefvpysxf3pjxtptv``: Non-zero padding in 8-to-5 conversion


Checksum design
===============

Design choices
--------------

BCH codes can be constructed over any prime-power alphabet and can be chosen to
have a good trade-off between size and error-detection capabilities. Unlike
`Reed-Solomon codes <https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction>`_,
they are not restricted in length to one less than the alphabet size. While they
also support efficient error correction, the implementation of just error detection
is very simple.

We pick 6 checksum characters as a trade-off between length of the addresses and
the error-detection capabilities, as 6 characters is the lowest number sufficient
for a random failure chance below 1 per billion. For the length of data we're
most interested in protecting (up to 77 bytes excluding the separator, for a
Sapling payment address), BCH codes can be constructed that guarantee detecting
up to 4 errors. Longer data is also supported with slightly weaker error detection.

Selected properties
'''''''''''''''''''

Many of these codes perform badly when dealing with more errors than they are
designed to detect, but not all. For that reason, we consider codes that are
designed to detect only 3 errors as well as 4 errors, and analyse how well they
perform in practice.

The specific code chosen here is the result of:

* Starting with an exhaustive list of 159605 BCH codes designed to detect 3 or 4
  errors up to length 93, 151, 165, 341, 1023, and 1057.

* From those, requiring the detection of 4 errors up to length 71, resulting in
  28825 remaining codes.

* From those, choosing the codes with the best worst-case window for 5-character
  errors, resulting in 310 remaining codes.

* From those, picking the code with the lowest chance for not detecting small
  numbers of ''bit'' errors.

As a naive search would require over 6.5 \* 10\ :sup:`19` checksum evaluations,
a collision-search approach was used for analysis. The code can be found
`here <https://github.com/sipa/ezbase32/>`_.

Properties
''''''''''

The following table summarizes the chances for detection failure (as multiples of
1 in 10\ :sup:`9`).

+-------------------------------------+-----------------------------------------------+
|            Window length            |          Number of wrong characters           |
+--------+----------------------------+-------+-------+-------+-------+-------+-------+
| Length | Description                |  ≤4   |   5   |   6   |   7   |   8   |  ≥9   |
+--------+----------------------------+-------+-------+-------+-------+-------+-------+
|    8   | Longest detecting 6 errors |   0                   | 1.127 | 0.909 |  n/a  |
+--------+----------------------------+---------------+-------+-------+-------+-------+
|   18   | Longest detecting 5 errors |   0           | 0.965 | 0.929 | 0.932 | 0.931 |
+--------+----------------------------+-------+-------+-------+-------+-------+-------+
|   19   | Worst case for 6 errors    |   0   | 0.093 | 0.972 | 0.928 | 0.931         |
+--------+----------------------------+-------+-------+-------+-------+---------------+
|   39   | Length ≤ 39 characters     |   0   | 0.756 | 0.935 | 0.932 | 0.931         |
+--------+----------------------------+-------+-------+-------+-------+---------------+
|   59   | Length ≤ 59 characters     |   0   | 0.805 | 0.933 | 0.931                 |
+--------+----------------------------+-------+-------+-------+-----------------------+
|   71   | Length ≤ 71 characters     |   0   | 0.830 | 0.934 | 0.931                 |
+--------+----------------------------+-------+-------+-------+-----------------------+
|   89   | Longest detecting 4 errors |   0   | 0.867 | 0.933 | 0.931                 |
+--------+----------------------------+-------+-------+-------+-----------------------+

TODO: fill in table for a Sapling payment address, and check the following paragraph.

This means that when 5 changed characters occur randomly distributed in the 77
characters (excluding the separator) of a Sapling payment address, there is a chance
of at most 0.867 per billion that it will go undetected. When those 5 changes occur
randomly within a 19-character window, that chance goes down to 0.093 per billion.
As the number of errors goes up, the chance converges towards 1 in 2\ :sup:`30` =
0.931 per billion.

The chosen code performs reasonably well up to 1023 characters, but is best for
lengths up to 89 characters (excluding the separator). Since the search for suitable
codes was based on the requirements for Bitcoin P2WPKH and P2WSH addresses, it is
not quite optimal for the address lengths used by Sapling, but the advantages of
compatibility with existing Bitcoin libraries were considered to outweigh this
consideration.


Acknowledgements
================

This document is closely based on BIP 173 written by Pieter Wuille and Greg Maxwell,
which was inspired by the `address proposal <https://rusty.ozlabs.org/?p=578>`_ by
Rusty Russell and the `base32
<https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2014-February/004402.html>`_
proposal by Mark Friedenbach. BIP 173 also had input from Luke Dashjr, Johnson Lau,
Eric Lombrozo, Peter Todd, and various other reviewers.


References
==========

.. [#BCP14] `Information on BCP 14 — "RFC 2119: Key words for use in RFCs to Indicate Requirement Levels" and "RFC 8174: Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words" <https://www.rfc-editor.org/info/bcp14>`_
.. [#protocol] `Zcash Protocol Specification, Version 2021.2.16 or later <protocol/protocol.pdf>`_
.. [#zip-0032] `ZIP 32: Shielded Hierarchical Deterministic Wallets <zip-0032.rst>`_
.. [#zip-0200] `ZIP 200: Network Upgrade Mechanism <zip-0200.rst>`_
.. [#zip-0205] `ZIP 205: Deployment of the Sapling Network Upgrade <zip-0205.rst>`_
.. [#bip-0013] `BIP 13: Address Format for pay-to-script-hash <https://github.com/bitcoin/bips/blob/master/bip-0013.mediawiki>`_
.. [#bip-0173] `BIP 173: Base32 address format for native v0-16 witness outputs <https://github.com/bitcoin/bips/blob/master/bip-0173.mediawiki>`_