diff options
1 files changed, 72 insertions, 14 deletions
diff --git a/usenix2019.tex b/usenix2019.tex
index dd3e891..04ed8db 100644
--- a/usenix2019.tex
+++ b/usenix2019.tex
@@ -158,22 +158,80 @@ rand_mixed: 10.0M with 4 threads in 38 sec, 14806 nsec per iter, 263k per s
\section{Btree design}
+The core design elements of bcachefs's btree are fairly old at this point, as
+they originated in the earliest versions of bcache, but have not been described
+formally before.
+Bcachefs's btree is mostly a relatively standard copy on write B+ tree, with one
+major difference - btree nodes are large (typically 256k), and log structured.
+Log structured btree nodes do add a significant amount of complexity to the
+btree code. Nodes are too big for all the keys within a node to be kept as a
+single sorted array of keys, so we have to keep multiple (up to three) sorted
+arrays of keys in memory for a given btree node and the lookup and iterator code
+has to internally search all three and combine the results. More significantly,
+log structured btree nodes mean that we have some of the properties of a
+conventional update-in-place btree: with a copy on write btree, the only writes
+that have any direct effect on visible on disk metadata are writes that update
+the pointer to the root node. With log structured btree nodes, writes that
+append to btree nodes also make changes visible on disk and have to be correctly
+ordered with respect to other metadata writes.
+But, log structured btree nodes have turned out to be an enormous performance
+First, by having multiple sorted arrays of keys (referred to as bsets within the
+code) per btree node, most keys in a btree node will be in bsets that are no
+longer being modified, and this means we can construct lookup tables that would
+be much too expensive to use if we had to modify them.
+This is a major issue because binary search is a worst case scenario for
+performance on modern processes - prefetching is almost useless because at every
+iteration, you have two possible memory accesses depending on which way the
+previous comparison went, and they'll be at completely different locations in
+However, if we construct a binary search tree in Eytzinger layout
+(, prefetching is usefully possible because
+a given node's two children are adjacent in memory - as well as all its
+grandchildren, great grandchildren, and so on. Then we use a couple tricks to
+make nodes in the auxiliary search tree as small as possible - four bytes - so
+we can fit as many of them as possible onto a cacheline:
+ \item Each node in the auxiliary search tree corresponds to one
+ cacheline worth of data in the original set of keys, which means
+ we don't have to store a full pointer to the original
+ corresponding key - just an offset within that cacheline,
+ combined with some math.
+ \item Keys are large integers - 160 bits - but for any given comparison,
+ we don't actually need most of those bits. The keys in the range
+ we're currently searching will all have the same high bits - we
+ don't need to store those. And we don't need to store the low
+ bits: if we compare the pivot with the key immediately prior to
+ it, we only need to store bits up to and including the first bit
+ where those two keys differ (i.e. the key in the auxiliary
+ search tree has to compare greater than the pivot's previous
+ key).
+ We need the nodes in the auxiliary search tree to be fixed size,
+ which means sometimes when constructing those nodes the pivot
+ and the prior key won't differ in the bits created - to handle
+ that case, we flag that node as failed and the lookup code falls
+ back to comparing against the original key. But the lookup code
+ parameters have been tuned so that this is a very rare
+ occurrence (statistics on the auxiliary search trees are
+ available in debugfs).
+All these tricks mean we can fit 16 nodes in the auxiliary search tree onto a
+cacheline, and we can prefetch four levels in advance - which means we can
+pipeline memory accesses reasonably effectively. The auxiliary search tree code
+was, by far, the biggest single performance improvement in the history of
+bcache/bcachefs - around an order of magnitude. This is why we're able to hit 1
+million random lookups per second on a single core - that is an extremely
+difficult number to hit for any comparison based data structure, simply based on
+memory latency considerations and number of memory accesses needed per lookup.