|author||Kent Overstreet <firstname.lastname@example.org>||2021-04-07 01:15:45 -0400|
|committer||Kent Overstreet <email@example.com>||2021-04-07 01:15:45 -0400|
fsck & allocator design docs
3 files changed, 217 insertions, 0 deletions
diff --git a/Allocator.mdwn b/Allocator.mdwn
new file mode 100644
@@ -0,0 +1,55 @@
+It's past time for the allocator to be redesigned, or at least heavily
+reconsidered. Nothing should be off the table at this point - we could
+conceivably even move away from bucket based allocation.
+Here's a list of considerations:
+* The bucket based allocator does have advantages. It's fast - most of the work
+ in the allocator is per bucket, and buckets are typically 512k to 2M, even up to 8M.
+ Metadata is one key per bucket, and the alloc btree uses the btree key
+ cache, which is very good for performance - it means extents btree updates
+ don't have to do a second (relatively expensive) btree update to update alloc
+ info, they're only updating the key cache.
+* The bucket based allocator also lets us store and cheaply discard cached data,
+ by incrementing the generation number for that bucket.
+* Disadvantage: the bucket allocator requires us to have copygc, with everything
+ that entails.
+* It would be really nice to have backpointers at some point - copygc would
+ greatly benefit from having backpointers (currently we have to scan the entire
+ extents and reflink btrees to find data that needs to be moved), and erasure
+ coding could also make good use of backpointers.
+ If we added another btree for backpointers, it wouldn't make sense for it to
+ use the btree key cache, and thus extents btree updates would get more
+ expensive. Perhaps alloc keys could include a list of backpointers? They'd be
+ 32 bytes each, and bkeys can be a little short of 2k - that's only ~50
+ backpointers per bkey, which we'd easily overflow with small writes and large
+ buckets. If a secondary backpointers btree was just used as a fallback, it
+ would be getting used right when the performance impact matters most - small
+ writes.
+ We should investigate the actual performance overhead of adding another btree
+ update for backpointers - it might not be all that bad. Right now we're
+ looking at ~450k random updates per second on a single core, and that includes
+ the full `bch2_trans_commit()` path, and the most expensive things in that are
+ per transaction commit, not per btree update (e.g. getting the journal
+ reservation). Also, we'll hopefully be getting gap buffers soon, which will
+ improve btree update performance.
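+To make the in-key backpointer idea concrete, here's a purely illustrative
+sketch - the struct name and fields below are made up for sizing, not an actual
+bcachefs format - of a 32 byte backpointer entry and the arithmetic that follows
+from it:
+
+    #include <stdint.h>
+
+    struct bp_entry {               /* hypothetical, for sizing only */
+            uint64_t bucket_offset; /* where within the bucket the data lives */
+            uint64_t len;           /* length of the extent, in sectors */
+            uint64_t pos_inode;     /* btree position of the pointing key */
+            uint64_t pos_offset;
+    };                              /* 4 x 8 = 32 bytes per backpointer */
+
+    /*
+     * bkey values max out a little short of 2k, so an alloc key could hold
+     * roughly 2048 / 32 = 64 such entries (~50 once the rest of the alloc
+     * key is counted) - while a 2M bucket filled by 4k writes would need
+     * 512 backpointers.
+     */
+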
+The most pressing consideration though is that we need to eliminate, as much as
+possible, the periodic scanning the allocator thread(s) have to do to find
+available buckets. It's too easy to end up with bugs where this scanning is
+happening repeatedly (we have one now...); it's a scalability issue, and we
+shouldn't be doing it at all.
+Also, we would really like to get rid of the in-memory array of buckets; this is
+another scalability issue. The in-memory `struct bucket` now mostly just mirrors
+what's in the btree, in `KEY_TYPE_alloc` keys. The one exception is
+`journal_seq`, which is the journal sequence number of the last btree update
+that touched that bucket. It's needed for knowing whether we need to flush the
+journal before allocating and writing to that bucket - perhaps it should just be
+added to `struct bch_alloc` and stored in the keys in the btree.
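+As a sketch of that last idea - the field and struct names below are
+hypothetical, not the real `struct bch_alloc` layout - the persistent alloc key
+would simply grow a journal_seq field, letting the in-memory `struct bucket` go
+away:
+
+    #include <stdint.h>
+
+    /* Hypothetical sketch of an extended alloc key value: */
+    struct alloc_key_sketch {
+            uint8_t  gen;            /* bucket generation number */
+            uint8_t  data_type;
+            uint32_t dirty_sectors;
+            uint32_t cached_sectors;
+            uint64_t journal_seq;    /* the journal must be flushed to at
+                                      * least this seq before the bucket can
+                                      * be reused for new allocations */
+    };
+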
diff --git a/Fsck.mdwn b/Fsck.mdwn
new file mode 100644
@@ -0,0 +1,160 @@
+Fsck in bcachefs:
+This document is an overview of fsck in bcachefs, as well as what will need to
+change for snapshots.
+In bcachefs, the fsck code is only responsible for checking higher level
+filesystem structure. Btree topology and allocation information are checked by
+`btree_gc.c`, which doesn't need to be aware of snapshots. This reduces the
+scope of fsck considerably.
+What it does have to check is:
+* For each extent, dirent and xattr, that a corresponding inode exists and is of
+ the appropriate type.
+* Dirents and xattrs are indexed by hash, with linear probing. The fsck code is
+ responsible for checking that they're stored at the correct index (in case
+ there was a hash collision, every index between the hash value and the actual
+ index must be occupied by keys or hash whiteouts) and that there are no
+ duplicates.
+* Dirents: check that target inode exists and matches dirent's `d_type`, and
+ check the target inode's backpointer fields.
+* Extents:
+ * Check that extents do not overlap
+ * Check that extents do not exist past inode's `i_size`
+ * Count extents for each inode and check inode's `i_sectors`.
+* Check that root directory exists
+* Check that lost+found exists
+* Check directory structure: currently, this is done with a depth-first traversal
+ from the root directory. We check for directories with multiple links, and
+ then scan the inodes btree for directories that weren't found by the DFS.
+* Check each inode's link count: At this point we know that every dirent exists
+ in a directory (has a corresponding inode of the correct type), has a valid
+ target, and is reachable (because we've checked that all directories are
+ reachable). All that's left is to scan the dirents btree, counting the number
+ of dirents that point to each inode number, and verifying that against each
+ inode's link count.
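+To make one of these passes concrete, here's a minimal self-contained sketch of
+the `i_sectors` check - the real code walks the extents btree in a transaction,
+but the accounting amounts to this (all names here are illustrative):
+
+    #include <stdint.h>
+    #include <stdio.h>
+
+    struct extent { uint64_t inum; uint64_t sectors; };
+    struct inode  { uint64_t inum; uint64_t i_sectors; };
+
+    /* Sum the sectors of each inode's extents and compare to i_sectors: */
+    static void check_i_sectors(const struct extent *e, size_t nr_extents,
+                                const struct inode *inodes, size_t nr_inodes)
+    {
+            for (size_t i = 0; i < nr_inodes; i++) {
+                    uint64_t sum = 0;
+
+                    for (size_t j = 0; j < nr_extents; j++)
+                            if (e[j].inum == inodes[i].inum)
+                                    sum += e[j].sectors;
+
+                    if (sum != inodes[i].i_sectors)
+                            printf("inode %llu: i_sectors %llu should be %llu\n",
+                                   (unsigned long long) inodes[i].inum,
+                                   (unsigned long long) inodes[i].i_sectors,
+                                   (unsigned long long) sum);
+            }
+    }
+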
+Changes for snapshots:
+Subvolumes vs. snapshots:
+Filesystem code has to know what subvolume it is in, in order to fetch the
+current snapshot ID at the start of every btree transaction. Fsck code does not
+know or care what subvolume a given key belongs to - a key may belong to multiple
+subvolumes, if it's in a snapshot that's an interior node in the snapshot
+tree.
+Instead, when one key references another (e.g. a dirent that points to an
+inode), it does the lookup for the inode by searching for an inode with a
+snapshot ID equal to or an ancestor of the dirent key's snapshot ID.
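+A minimal sketch of that lookup rule, with a toy parent[] array standing in for
+the snapshots btree (all names here are hypothetical):
+
+    #include <stdint.h>
+    #include <stddef.h>
+
+    /* parent snapshot ID of each snapshot ID; 0 = no parent */
+    static uint32_t snapshot_parent[16];
+
+    static int snapshot_is_ancestor(uint32_t anc, uint32_t id)
+    {
+            while (id && id != anc)
+                    id = snapshot_parent[id];
+            return id == anc;
+    }
+
+    struct inode_version { uint32_t snapshot; /* ...inode fields... */ };
+
+    /*
+     * Given every version of an inode (assumed ordered from most to least
+     * recent snapshot), return the one visible from @key_snapshot, i.e. the
+     * closest version whose snapshot equals or is an ancestor of it:
+     */
+    static struct inode_version *
+    lookup_inode_for_key(struct inode_version *v, size_t nr, uint32_t key_snapshot)
+    {
+            for (size_t i = 0; i < nr; i++)
+                    if (snapshot_is_ancestor(v[i].snapshot, key_snapshot))
+                            return &v[i];
+            return NULL;    /* nothing visible: an inconsistency to repair */
+    }
+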
+IOW, snapshot IDs represent (branchable) "points in time", and fsck wants to
+check that every view of the filesystem, at every point in time, is consistent.
+Checking the filesystem for consistency in this fashion is conceptually
+straightforward - every key has a snapshot ID, and we use that when looking up
+anything that it references. However, we'll have to be careful about how we
+repair inconsistencies when the snapshot ID is an interior node:
+If we repair an inconsistency by deleting a key, we need to ensure that key is
+not referenced by child snapshots. For example, if we find an extent that does
+not have an `S_ISREG` inode, currently we delete it. But we might find an extent
+that doesn't have a corresponding inode with a snapshot ID equal to or an
+ancestor of the extent's snapshot ID, but does have an inode in a child snapshot.
+If that were to occur, we should probably copy that inode to the extent's
+snapshot ID.
+We would like it to be as hard a rule as possible that fsck never deletes or
+modifies keys that belong to interior node snapshot IDs, because when people are
+using snapshots they'll generally want the data in that snapshot to always stay
+intact and it would be rather bad form if data was torched due to a bug in fsck.
+It may not be possible to adhere to this rule while fixing all filesystem
+inconsistencies - e.g. if extents exist above an inode's `i_size`, then
+generally speaking the right thing to do is to delete that extent. But if we
+find inconsistencies at interior node snapshot IDs, then perhaps the best
+thing to do would be to just leave them, or only apply changes in leaf node
+snapshots.
+We want bcachefs to scale well into the millions of snapshots. This means that
+there can't generally be operations that have to run for each subvolume or
+snapshot - the fsck algorithms have to work one key at a time, processing them
+in natural btree order.
+This means that the extents, dirents and xattrs passes need to use
+`BTREE_ITER_ALL_SNAPSHOTS`, meaning that as they walk keys for a given inode,
+they'll be seeing keys from every snapshot version mixed together - and there
+will be multiple versions of the inode, in different snapshots, that those keys
+need to be checked against.
+There's a helper in the fsck code that these passes use, `walk_inode()`, that
+fetches the inode for the current inode number when it changes. This will
+probably change to load every copy of that inode, in each snapshot ID. Also,
+where we have to maintain state we'll have to duplicate that state for every
+snapshot ID that exists at that inode number.
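+A rough sketch of what that per-inode-number state might look like - the names
+here are made up, and the real helper will differ:
+
+    #include <stdint.h>
+    #include <stddef.h>
+
+    /* One slot per snapshot version of the inode currently being walked: */
+    struct inode_walker_entry {
+            uint32_t snapshot;
+            uint64_t i_size;
+            uint64_t i_sectors;     /* what the inode claims */
+            uint64_t count;         /* sectors seen so far by this pass */
+    };
+
+    struct inode_walker {
+            uint64_t                  cur_inum;
+            size_t                    nr;
+            struct inode_walker_entry entries[32];  /* toy fixed size */
+    };
+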
+We'll also need to know, for every key that we process, which snapshot IDs it is
+visible in: so that e.g. the extents pass can correctly count `i_sectors`.
+Note that "millions of snapshots" will not _typically_ mean "millions of snaphot
+IDs at the same time" - when we allocate new inodes we pick inode numbers that
+aren't in use in any other snapshot or subvolume, and most files in a filesystem
+are created, written to once and then later atomically replaced with a rename
+call. When we're dealing with a file that has keys with many different snapshot
+IDs we need to make sure we don't have any hidden O(n^2) algorithms, but if fsck
+requires hundreds of megabytes of state to check these files with worst case
+numbers of overwrites, that's not the end of the world.
+## Checking directory structure:
+The current approach of doing a DFS from the root directory still works with
+snapshots, but algorithmically is too slow. If we have many sparse snapshots,
+the DFS would recurse into each of those snapshots, walking directory structure
+it has already seen in the original subvolume - but those snapshots could have
+directory structure changes in one or two places, so it would have to recurse
+into them anyway. We need an algorithm that iterates over keys, not filesystem
+structure.
+Fortunately, since backpointers from inodes to dirents were recently added we
+can take a different approach. We walk the inodes btree, and for every directory
+inode we find we chase backpointers until we get to the root directory. Note
+that since previous passes have already checked backpointers, we're really just
+checking for loops here. This will still check the same structure repeatedly, like
+the DFS approach, but here we can solve it by adding every inum:snapshot tuple
+to a hash table after we've verified it has a path to the root.
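+A self-contained sketch of that loop check, with a "verified" flag standing in
+for the inum:snapshot hash table (names are illustrative):
+
+    #include <stdbool.h>
+    #include <stdint.h>
+    #include <stddef.h>
+
+    struct dir {
+            uint64_t inum;
+            uint64_t parent_inum;   /* from the inode's dirent backpointer */
+            bool     verified;      /* already known to reach the root */
+    };
+
+    static struct dir *find_dir(struct dir *dirs, size_t nr, uint64_t inum)
+    {
+            for (size_t i = 0; i < nr; i++)
+                    if (dirs[i].inum == inum)
+                            return &dirs[i];
+            return NULL;
+    }
+
+    /* Returns false if @start is in a loop or has a dangling parent. */
+    static bool check_path_to_root(struct dir *dirs, size_t nr,
+                                   struct dir *start, uint64_t root_inum)
+    {
+            struct dir *d = start;
+            size_t steps = 0;
+
+            while (d && !d->verified && d->inum != root_inum) {
+                    if (++steps > nr)
+                            return false;   /* path longer than #dirs: a loop */
+                    d = find_dir(dirs, nr, d->parent_inum);
+            }
+            if (!d)
+                    return false;           /* dangling parent pointer */
+
+            /*
+             * Memoize: a full version would mark (hash) every inum:snapshot
+             * tuple along the path, not just the starting directory.
+             */
+            start->verified = true;
+            return true;
+    }
+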
+## Checking inode link counts:
+This one will be tedious, but algorithmically doesn't change much. The existing
+code walks the dirents btree in natural key order, and then for each dirent
+increments a link count in a radix tree. After walking the dirents, we walk the
+inodes btree and compare link counts we calculated to what's in the inode.
+As before, we'll walk the dirents btree in natural key order. However, we can't
+accumulate link counts without also looking at what actually exists in the inodes
+btree: a dirent may point to multiple inodes with different snapshot IDs, and
+which of those we increment also depends on which snapshots that dirent is
+visible in (we'll already have a helper for that part, as discussed previously).
+This means that instead of using a radix tree, we'll have to start by walking
+the inodes btree and, for every inode we find, creating an entry in a new
+in-memory data structure (an eytzinger-layout search tree, perhaps) indexed by
+inode number:snapshot ID and containing just the link count we'll be calculating.
+Fortunately, hardlinks are rare in practice. We can make use of
+the fact that we've already checked inode backpointers and only include inodes
+that actually have hardlinks in this table.
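+A sketch of that table as a plain sorted array - an eytzinger layout is just a
+cache-friendlier arrangement of the same sorted data (names are illustrative):
+
+    #include <stdint.h>
+    #include <stdlib.h>
+
+    struct nlink {
+            uint64_t inum;
+            uint32_t snapshot;
+            uint32_t count;         /* the link count we're calculating */
+    };
+
+    static int nlink_cmp(const void *_l, const void *_r)
+    {
+            const struct nlink *l = _l, *r = _r;
+
+            if (l->inum != r->inum)
+                    return l->inum < r->inum ? -1 : 1;
+            if (l->snapshot != r->snapshot)
+                    return l->snapshot < r->snapshot ? -1 : 1;
+            return 0;
+    }
+
+    /* Once the table is built and sorted, each dirent bumps its target: */
+    static void nlink_inc(struct nlink *table, size_t nr,
+                          uint64_t inum, uint32_t snapshot)
+    {
+            struct nlink key = { .inum = inum, .snapshot = snapshot };
+            struct nlink *n = bsearch(&key, table, nr, sizeof(key), nlink_cmp);
+
+            if (n)
+                    n->count++;
+    }
+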
diff --git a/sidebar.mdwn b/sidebar.mdwn
index ae7eee0..06cf820 100644
@@ -16,5 +16,7 @@
+ * [[Allocator]]
+ * [[Fsck]]
* Other technical docs