Previously we used the existing `Polynomial::rotate` and
`EvaluationDomain::rotate_extended` APIs to rotate the polynomials we
query at non-zero rotations, and then stored references to the chunks of
each (rotated and unrotated) polynomial as slices. When we reached an
`AstLeaf`, we would then clone the corresponding polynomial chunk.
We now avoid all of this with combined "rotate-and-chunk" APIs, making
use of the observation that a rotation is simply a renumbering of the
indices of the polynomial coefficients. Given that when we reach an
`AstLeaf` we already needed to allocate, these new APIs have the same
number of allocations during AST evaluation, but enable us to completely
avoid caching any information about the rotations or rotated polynomials
ahead of time.
`poly::Evaluator` stores all of the polynomials registered with it in
memory for the duration of its existence. When evaluating an AST, it
additionally caches the rotated polynomials in memory, and then chunks
all of the rotated polynomials for parallel evaluation.
Previously, it stored a polynomial for every unique AST leaf, regardless
of whether that leaf required a rotation or not. This resulted in the
unrotated polynomials being stored twice in memory. However, the chunks
simply refer to slices over cached polynomials, so we can reference the
unrotated polynomials stored in `poly::Evaluator` instead of copies of
them stored in the rotated polynomial `HashMap`. This strictly reduces
memory usage during proving with no effect on correctness.
The existing code will fold together a very deep AST that applies Horner's
rule to each gate in a proof -- which could include multiple circuits and
so for some applications will quickly grow such that when we recursively
descend later during evaluation the stack will easily overflow.
This change special cases the application of Horner's rule to a
"DistributePowers" AST node to keep the tree depth from exploding in size.
Previously we were passing through the chunk size and index to each
thread's evaluation context, but this was insufficient for them to
determine whether or not they were processing the final chunk, or if
the final chunk was short. This led to constant and linear term chunks
being created with the full chunk size, even if the last chunk was
short. If this longer-than-short chunk reached the root of the AST, it
triggered a panic in the final `copy_from_slice()`.
The bug was obscured in two ways:
- Currently polynomials always have a power-of-two length, and on CPUs
with power-of-two threads this meant we never produced short chunks.
- The way that subsequent operations like `Ast::Add` were implemented
meant that if a constant or linear term occurred on the right-hand
side of an operation, the longer chunks were masked to the short chunk
length.
We fix this by passing the polynomial length into each thread's context,
so that we can compute the correct length for the final chunk.
Co-authored-by: Daira Hopwood <daira@jacaranda.org>