- Jan 21, 2024
-
-
Kent Overstreet authored
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Add a field to bch_snapshot for creation time; this will be important when we start exposing the snapshot tree to userspace. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
- Jan 06, 2024
-
-
Kent Overstreet authored
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Add new fields for split brain detection: - bch_member->seq, which tracks the sequence number of the last superblock write that happened to each member device - bch_sb->write_time, which tracks the time of the last superblock write, to allow detection of when two members have diverged but had the same number of superblock writes. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
- Jan 01, 2024
-
-
Kent Overstreet authored
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Previosuly, the transaction commit path would have to add keys to the btree write buffer as a separate operation, requiring additional global synchronization. This patch introduces a new journal entry type, which indicates that the keys need to be copied into the btree write buffer prior to being written out. We switch the journal entry type back to JSET_ENTRY_btree_keys prior to write, so this is not an on disk format change. Flushing the btree write buffer may require pulling keys out of journal entries yet to be written, and quiescing outstanding journal reservations; we previously added journal->buf_lock for synchronization with the journal write path. We also can't put strict bounds on the number of keys in the journal destined for the write buffer, which means we might overflow the size of the preallocated buffer and have to reallocate - this introduces a potentially fatal memory allocation failure. This is something we'll have to watch for, if it becomes an issue in practice we can do additional mitigation. The transaction commit path no longer has to explicitly check if the write buffer is full and wait on flushing; this is another performance optimization. Instead, when the btree write buffer is close to full we change the journal watermark, so that only reservations for journal reclaim are allowed. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
- add a tracepoint for write_buffer_flush_sync; this is expensive - fix the write_buffer_flush_slowpath tracepoint Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This counter is redundant; it's simply the sum of BCH_DATA_stripe and BCH_DATA_parity buckets. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Prep work for introducing bch_replicas_entry_v2 Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Add a new superblock section that contains a list of { minor version, recovery passes, errors_to_fix } that is - a list of recovery passes that must be run when downgrading past a given version, and a list of errors to silently fix. The upcoming disk accounting rewrite is not going to be fully compatible: we're going to have to regenerate accounting both when upgrading to the new version, and also from downgrading from the new version, since the new method of doing disk space accounting is a completely different architecture based on deltas, and synchronizing them for every jounal entry write to maintain compatibility is going to be too expensive and impractical. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Add two new superblock fields. Since the main section of the superblock is now fully, we have to add a new variable length section for them - bch_sb_field_ext. - recovery_passes_requried: recovery passes that must be run on the next mount - errors_silent: errors that will be silently fixed These are to improve upgrading and dwongrading: these fields won't be cleared until after recovery successfully completes, so there won't be any issues with crashing partway through an upgrade or a downgrade. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
- Nov 28, 2023
-
-
Kent Overstreet authored
Renamed from trace_move_extent_alloc_mem_fail, because there are other reasons we colud fail (disk space allocation failure). Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
- Nov 26, 2023
-
-
Kent Overstreet authored
bkey embeds a bpos that is misaligned on big endian; this is so that bch2_bkey_swab() works correctly without having to differentiate between packed and non-packed keys (a debatable design decision). This means it can't have the __aligned() tag on big endian. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
- Nov 05, 2023
-
-
Kent Overstreet authored
This lets us use bch2_prt_bitflags to print them out. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
rebalance_work entries may refer to entries in the extents btree, which is a snapshots btree, or they may also refer to entries in the reflink btree, which is not. Hence rebalance_work keys may use the snapshot field but it's not required to be nonzero - add a new btree flag to reflect this. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
- Nov 04, 2023
-
-
Kent Overstreet authored
gcc 10 seems to complain about array bounds in situations where gcc 11 does not - curious. This unfortunately requires adding some casts for now; we may investigate getting rid of our __u64 _data[] VLA in a future patch so that our start[0] members can be VLAs. Reported-by:
John Stoffel <john@stoffel.org> Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
- Nov 02, 2023
-
-
Kent Overstreet authored
Add a new superblock section to keep counts of errors seen since filesystem creation: we'll be addingcounters for every distinct fsck error. The new superblock section has entries of the for [ id, count, time_of_last_error ]; this is intended to let us see what errors are occuring - and getting fixed - via show-super output. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
We now track IO errors per device since filesystem creation. IO error counts can be viewed in sysfs, or with the 'bcachefs show-super' command. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This adds a new btree, rebalance_work, to eliminate scanning required for finding extents that need work done on them in the background - i.e. for the background_target and background_compression options. rebalance_work is a bitset btree, where a KEY_TYPE_set corresponds to an extent in the extents or reflink btree at the same pos. A new extent field is added, bch_extent_rebalance, which indicates that this extent has work that needs to be done in the background - and which options to use. This allows per-inode options to be propagated to indirect extents - at least in some circumstances. In this patch, changing IO options on a file will not propagate the new options to indirect extents pointed to by that file. Updating (setting/clearing) the rebalance_work btree is done by the extent trigger, which looks at the bch_extent_rebalance field. Scanning is still requrired after changing IO path options - either just for a given inode, or for the whole filesystem. We indicate that scanning is required by adding a KEY_TYPE_cookie key to the rebalance_work btree: the cookie counter is so that we can detect that scanning is still required when an option has been flipped mid-way through an existing scan. Future possible work: - Propagate options to indirect extents when being changed - Add other IO path options - nr_replicas, ec, to rebalance_work so they can be applied in the background when they change - Add a counter, for bcachefs fs usage output, showing the pending amount of rebalance work: we'll probably want to do this after the disk space accounting rewrite (moving it to a new btree) Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
- Oct 22, 2023
-
-
Hunter Shaffer authored
Signed-off-by:
Hunter Shaffer <huntershaffer182456@gmail.com> Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Hunter Shaffer authored
Signed-off-by:
Hunter Shaffer <huntershaffer182456@gmail.com> Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Hunter Shaffer authored
members_v2 has dynamically resizable entries so that we can extend bch_member. The members can no longer be accessed with simple array indexing Instead members_v2_get is used to find a member's exact location within the array and returns a copy of that member. Alternatively member_v2_get_mut retrieves a mutable point to a member. Signed-off-by:
Hunter Shaffer <huntershaffer182456@gmail.com> Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Now that we have the logged operations btree, we can make finsert/fcollapse atomic w.r.t. unclean shutdown as well. This adds bch_logged_op_finsert to represent the state of an finsert or fcollapse, which is a bit more complicated than truncate since we need to track our position in the "shift extents" operation. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Previously, we guaranteed atomicity of truncate after unclean shutdown with the BCH_INODE_I_SIZE_DIRTY flag - which required a full scan of the inodes btree. Recently the deleted inodes btree was added so that we no longer have to scan for deleted inodes, but truncate was unfinished and that change left it broken. This patch uses the new logged operations btree to fix truncate atomicity; we now log an operation that can be replayed at the start of a truncate. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Add a new btree for long running logged operations - i.e. for logging operations that we can't do within a single btree transaction, so that they can be resumed if we crash. Keys in the logged operations btree will represent operations in progress, with the state of the operation stored in the value. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
It's no longer legal to use a zero size array as a flexible array member - this causes UBSAN to complain. This patch switches our zero size arrays to normal flexible array members when possible, and inserts casts in other places (e.g. where we use the zero size array as a marker partway through an array). Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
After deleteing snapshots, we may be left with a snapshot tree where some nodes only have one child, and we have a linear chain. Interior snapshot nodes are never used directly (i.e. they never have subvolumes that point to them), they are only referered to by child snapshot nodes - hence, they are redundant. The existing code talks about redundant snapshot nodes as forming and equivalence class; i.e. nodes for which snapshot_t->equiv is equal. In a given equivalence class, we only ever need a single key at a given position - i.e. multiple versions with different snapshot fields are redundant. The existing snapshot cleanup code deletes these redundant keys, but not redundant nodes. It turns out this is buggy, because we assume that after snapshot deletion finishes we should only have a single key per equivalence class, but the btree update path doesn't preserve this - overwriting keys in old snapshots doesn't check for the equivalence class being equal, and thus we can end up with duplicate keys in the same equivalence class and fsck complaining about snapshot deletion not having run correctly. The equivalence class notion has been leaking out of the core snapshots code and into too much other code, i.e. fsck, so this patch takes a different approach: snapshot deletion now moves keys to the node in an equivalence class being kept (the leafiest node) and then deletes the redundant nodes in the equivalance class. Some work has to be done to correctly delete interior snapshot nodes; snapshot node depth and skiplist fields for descendent nodes have to be fixed. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
subvolume.c has gotten a bit large, this splits out a separate file just for managing snapshot trees - BTREE_ID_snapshots. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Joshua Ashton authored
To ensure we aren't shooting ourselves in the foot after merge for potentially doing future revisions for dirent or for storing multiple names for casefolding, limit this to 512 for now. Previously this define was linked to the max size a d_name in bch_dirent could be. Signed-off-by:
Joshua Ashton <joshua@froggi.es> Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Add a new bitset btree for inodes pending deletion; this means we no longer have to scan the full inodes btree after an unclean shutdown. Specifically, this adds: - a trigger to update the deleted_inodes btree based on changes to the inodes btree - a new recovery pass - and check_inodes is now only a fsck pass. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
clang had a few more warnings about enum conversion, and also didn't like the opts.c initializer. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This refactoring centralizes defining per-btree properties. bch2_key_types_allowed was also about to overflow a u32, so expand that to a u64. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
The upper 4 bits are for compression level. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This extents KEY_TYPE_snapshot to include some new fields: - depth, to indicate depth of this particular node from the root - skip[3], skiplist entries for quickly walking back up to the root These are to improve bch2_snapshot_is_ancestor(), making it O(ln(n)) instead of O(n) in the snapshot tree depth. Skiplist nodes are picked at random from the set of ancestor nodes, not some fixed fraction. This introduces bcachefs_metadata_version 1.1, snapshot_skiplists. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Now that we've got forward compatibility sorted out, we should be doing more frequent version upgrades in the future. To avoid having to run a full fsck for every version upgrade, this improves the BCH_METADATA_VERSIONS() table to explicitly specify a bitmask of recovery passes to run when upgrading to or past a given version. This means we can also delete PASS_UPGRADE(). Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This introduces major/minor versioning to the superblock version number. Major version number changes indicate incompatible releases; we can move forward to a new major version number, but not backwards. Minor version numbers indicate compatible changes - these add features, but can still be mounted and used by old versions. With the recent patches that make it possible to roll out new btrees and key types without breaking compatibility, we should be able to roll out most new features without incompatible changes. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-