From de22cebdeafe29eb6daafde9837e4071435502f4 Mon Sep 17 00:00:00 2001
From: John Tromp <john.tromp@gmail.com>
Date: Thu, 27 Oct 2016 14:55:33 -0400
Subject: [PATCH] add info to README

---
 README.md | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index a904e80..2b458c2 100644
--- a/README.md
+++ b/README.md
@@ -8,6 +8,53 @@ only suitable for single-threaded use (where they get some speedup over the gene
 
 Options -h HEADER -n NONCE are self explanatory.
 Add option -r RANGESIZE to search a range of nonces.
+For benching, options -n 255 -r 100 are useful, as they yield exactly 188 solutions from 100 runs.
 
-Build the faster version courtesy of xenoncat's AVX2 4-way parallel blake2b assembly code with
-"make dev1" and bench with "time ./dev1 -n 255 -r 100"
+My original submission was triggered by seeing how xenoncat's solver
+"has much of the same ideas as mine", so that making my open sourcing conditional
+on getting sufficient funding for the Cuckoo Cycle Bounty Fund no longer made sense.
+
+https://forum.z.cash/t/tromps-solvers/2465/76
+
+I noticed that we both use bucket sorting with tree nodes stored as a directed acyclic graph.
+Upon original submission, I wrote: Compared to xenoncat, my solver differs in
+- having way more buckets,
+- wasting some memory,
+- having simpler pair compression,
+- being multi-threaded,
+- and supporting (144,5).
+- And of course in not using any assembly.
+- Oh, and having some cool visualization of bucket size distribution...
+
+David Jaenson gave me the idea to disable atomics for single-threaded operation,
+which gave a nice speed boost for that case.
+
+Since then I reduced the number of buckets in the CPU solver from 2^16 to 2^12,
+which allowed for reducing the bucket space overhead. I borrowed from xenoncat
+the idea to allocate all memory statically, and found a way to improve upon his memory layout,
+reducing waste by about 7%.
+
+Seeing that my solver was spending 45% of runtime on hashing, I asked xenoncat if (s)he
+could make their assembly blake2b implementation available through a C binding, which (s)he
+very generously did.
+
+Zooko had earlier suggested looking at Samuel Neves' blake2bp implementation for faster hashing.
+After initially rejecting this approach due to different blake2bp semantics, I came back
+to it in search of a more portable way to gain speed. I managed to bridge the semantic gap
+and modify Samuel's source to serve Equihash's purposes.
+
+On the morning of the submission deadline day, discussion on sorting with judge Solardiz
+made me realize that my 2nd-stage bucket sort could benefit from linking rather than listing
+xor-able slots, which gave me the final speed boost.
+
+More detailed documentation is available in the equi_miner.h source code.
+
+Performance summary:
+
+- equi1: 4.6 Sol/s - 5.9 Sol/s (with AVX2)
+- equi -t 8: 16.7 Sol/s
+- 8 x equi1: 20.3 Sol/s
+- dev1: 6.5 Sol/s (xenoncat's blake)
+- 8 x dev1: 20.6 Sol/s
+- dev -t 8: 17.2 Sol/s
+- eqcuda: 23.6 Sol/s