- Jan 21, 2024
Kent Overstreet authored
drop_locks_do() should not be used in a fastpath without first attempting the do operation in nonblocking mode - the unlock and relock will cause excessive transaction restarts and potentially livelock with other threads that are contending for the same locks. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
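A rough sketch of the pattern this implies - try the operation in nonblocking mode under btree locks, and only pay for the unlock/relock on actual contention. try_op() and the -EAGAIN convention here are illustrative, not bcachefs API:

    /* fastpath: attempt the operation without dropping btree locks */
    int ret = try_op(trans, true /* nonblocking */);

    /* slowpath: only unlock and relock on actual contention */
    if (ret == -EAGAIN)
            ret = drop_locks_do(trans, try_op(trans, false));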
-
- Jan 01, 2024
Brian Foster authored
When investigating transient failures of generic/441 on bcachefs, it was determined that the cause of the failure was a combination of unconditional emergency shutdown and racing between background journal activity and the test switchover from a working device mapper table to an error injecting table. Part of the reason for this sequence of events is that bcachefs aggressively flushes as much as possible during fsync(), regardless of errors. While this is reasonable behavior, it is technically unnecessary because once an error is returned from fsync(), the caller cannot make any assumptions about the resilience of data. Tweak the bch2_fsync() logic to return an error on failure of any of the steps involved in the flush. Note that this change alone does not prevent generic/441 failure, but in combination with a test tweak to avoid racing during the dm-error table switchover it avoids the unnecessary shutdowns and allows the test to pass reliably on bcachefs. Signed-off-by:
Brian Foster <bfoster@redhat.com> Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
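A minimal sketch of the shape of that change - each step's result is captured and the first error is returned to the caller; the exact bcachefs helpers and their arguments may differ:

    int bch2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
    {
            struct bch_inode_info *inode = file_bch_inode(file);
            int ret, ret2, ret3;

            ret  = file_write_and_wait_range(file, start, end); /* writeback + wait */
            ret2 = sync_inode_metadata(&inode->v, 1);           /* inode metadata */
            ret3 = bch2_flush_inode(inode);                     /* journal/device flush */

            return ret ?: ret2 ?: ret3;                         /* first failure wins */
    }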
-
Kent Overstreet authored
In an ideal world, we'd have a common helper that could be used for sorting a list of inodes into the correct lock order, and then the same lock ordering could be used for any type of inode lock, not just i_rwsem. But the lock ordering rules for i_rwsem are a bit complicated, so - abandon that dream for now and do it the more standard way. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
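The "more standard way" for two inodes generally looks like the following sketch - lock in a consistent global order (here by pointer comparison, as the VFS does) so concurrent lockers of the same pair cannot deadlock; this is illustrative, not the bcachefs code:

    static void lock_two_inodes_sketch(struct inode *a, struct inode *b)
    {
            if (a == b) {
                    inode_lock(a);
                    return;
            }

            if (a > b)
                    swap(a, b);                     /* consistent global order */

            inode_lock(a);
            inode_lock_nested(b, I_MUTEX_NONDIR2);  /* distinct lockdep class */
    }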
-
- Oct 22, 2023
Kent Overstreet authored
We're using more stack than we'd like in a number of functions, and btree_trans is the biggest object that we stack allocate. But we have to do a heap allocation to initialize it anyways, so there's no real downside to heap allocating the entire thing. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
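The resulting API shape, sketched under the assumption that bch2_trans_get()/bch2_trans_put() pair allocation with initialization and teardown:

    /* before: a large btree_trans on the stack */
    struct btree_trans trans;
    bch2_trans_init(&trans, c, 0, 0);
    /* ... btree operations on &trans ... */
    bch2_trans_exit(&trans);

    /* after: heap allocated, initialized in one call */
    struct btree_trans *trans = bch2_trans_get(c);
    /* ... btree operations on trans ... */
    bch2_trans_put(trans);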
-
Colin Ian King authored
The variables start_offset and end_offset are being initialized with values that are never read, as they are re-assigned later on. The initializations are redundant and can be removed. Cleans up clang-scan build warnings:
fs/bcachefs/fs-io.c:243:11: warning: Value stored to 'start_offset' during its initialization is never read [deadcode.DeadStores]
fs/bcachefs/fs-io.c:244:11: warning: Value stored to 'end_offset' during its initialization is never read [deadcode.DeadStores]
Signed-off-by:
Colin Ian King <colin.i.king@gmail.com> Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
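The shape of the cleanup, with an illustrative expression rather than the actual fs-io.c code:

    /* before: the initializer is never read */
    u64 start_offset = 0;                       /* dead store */
    start_offset = pos & (block_bytes(c) - 1);

    /* after: initialize at the point of first use */
    u64 start_offset = pos & (block_bytes(c) - 1);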
-
Kent Overstreet authored
This pulls the non vfs specific parts of truncate and finsert/fcollapse out of fs-io.c, and moves them to io_misc.c. This is prep work for logging these operations, to make them atomic in the event of a crash. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
More reorganization, this splits up io.c into:
- io_read.c
- io_misc.c: fallocate, fpunch, truncate
- io_write.c
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Joshua Ashton authored
This will be used when we need to re-hash a directory tree while setting flags. Note that it is not possible to have concurrent btree_trans on a thread. Signed-off-by:
Joshua Ashton <joshua@froggi.es> Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
fs-io.c is too big - time for some reorganization:
- fs-dio.c: direct io
- fs-pagecache.c: pagecache data structures (bch_folio), utility code
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
We've observed significant lock thrashing on fstests generic/083 in fallocate, due to dropping and retaking btree locks when checking the pagecache for data. This adds a nonblocking mode to bch2_clamp_data_hole(), which only uses folio_trylock() and can thus be used safely while btree locks are held - so we only have to drop btree locks as a fallback, on actual lock contention. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
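A sketch of the nonblocking mode described above; the helper name and true/false contract are illustrative:

    static bool folio_lock_or_fail(struct folio *folio, bool nonblock)
    {
            if (folio_trylock(folio))
                    return true;    /* fastpath: uncontended */

            if (nonblock)
                    return false;   /* caller drops btree locks and retries */

            folio_lock(folio);      /* safe: btree locks already dropped */
            return true;
    }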
-
Kent Overstreet authored
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Previously, fallocate would only check the state of the extents btree when determining if we need to create a reservation. But the page cache might already have dirty data or a disk reservation. This changes __bchfs_fallocate() to call bch2_seek_pagecache_hole() to check for this. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Now that we have distinct error codes for different memory allocation failures, the early init log messages are no longer needed. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
- endianness fixes
- mark some things static
- fix a few __percpu annotations
- fix silent enum conversions
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
GFP_NOFS doesn't ever make sense. If we're allocating memory it should be GFP_NOWAIT if btree locks are held, GFP_KERNEL otherwise. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
We've been using __GFP_NOFAIL for allocating struct bch_folio, our private per-folio state. However, that struct is variable size - it holds state for each sector in the folio, and folios can be quite large now, which means it's possible for bch_folio to be larger than PAGE_SIZE now. __GFP_NOFAIL allocations are undesirable in normal circumstances, but particularly so at >= PAGE_SIZE, and warnings are emitted for that. So, this patch adds proper error paths and eliminates most uses of __GFP_NOFAIL. Also, do some more cleanup of gfp flags w.r.t. btree node locks: we can use GFP_KERNEL, but only if we're not holding btree locks, and if we are holding btree locks we should be using GFP_NOWAIT. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
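A sketch of the allocation discipline both of the above commits converge on - GFP_NOWAIT while btree locks are held, with an explicit unlock/GFP_KERNEL/relock fallback instead of __GFP_NOFAIL; error handling is condensed:

    struct bch_folio *s = kzalloc(size, GFP_NOWAIT);
    int ret = 0;

    if (!s) {
            bch2_trans_unlock(trans);        /* now safe to sleep */
            s = kzalloc(size, GFP_KERNEL);
            if (!s)
                    return -ENOMEM;          /* proper error path, no __GFP_NOFAIL */

            ret = bch2_trans_relock(trans);  /* may return a transaction restart */
            if (ret) {
                    kfree(s);
                    return ret;
            }
    }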
-
Kent Overstreet authored
Now that we can reliably designate and find the master subvolume out of a tree of snapshots, we can finally make quotas work with snapshots: That is - quotas will now _ignore_ snapshot subvolumes, and only be in effect for the master (non snapshot) subvolume. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Brian Foster authored
Create a small helper to translate from file offset to the associated bch_folio_sector index in the underlying bch_folio. The helper assumes the file offset is covered by the passed folio. Signed-off-by:
Brian Foster <bfoster@redhat.com> Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
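The helper plausibly reduces to a one-liner along these lines (the name folio_pos_to_s and the sector-shift conversion are assumptions based on the description):

    static inline int folio_pos_to_s(struct folio *folio, loff_t pos)
    {
            /* caller guarantees pos lies within this folio */
            return (pos - folio_pos(folio)) >> SECTOR_SHIFT;
    }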
-
Brian Foster authored
Some of the folio_end_*() helpers are prone to overflow of signed 64-bit types because the mapping is only limited by the max value of loff_t and the associated helpers return the start offset of the next folio. Therefore, a folio_end_pos() of the max allowable folio in a mapping returns a value that overflows loff_t. This makes it hard to rely on such values when doing folio processing across a range of a file, as bcachefs attempts to do with the recent folio changes. For example, generic/564 causes problems in the buffered write path when testing writes at max boundary conditions. The current understanding is that the pagecache historically limited the mapping to one less page to avoid this problem and this was dropped with some of the folio conversions, but may be reinstated to properly address the problem. In the meantime, update the internal folio_end_*() helpers in bcachefs to return a u64, and all of the associated code to use or cast to u64 to avoid overflow problems. This allows generic/564 to pass and can be reverted back to using loff_t if at any point the pagecache subsystem can guarantee these boundary conditions will not overflow. Signed-off-by:
Brian Foster <bfoster@redhat.com> Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
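Sketched, the change amounts to widening the return type so the end position of the last possible folio in a mapping can't overflow a signed loff_t (folio_end_sector shown as one of the associated helpers):

    static inline u64 folio_end_pos(struct folio *folio)
    {
            return folio_pos(folio) + folio_size(folio);
    }

    static inline u64 folio_end_sector(struct folio *folio)
    {
            return folio_end_pos(folio) >> SECTOR_SHIFT;
    }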
-
Brian Foster authored
The buffered write path batches folio creations in the file mapping based on the requested size of the write. Under low free space conditions, it is possible to add a bunch of folios to the mapping and then return a short write or -ENOSPC due to lack of space. If this occurs on an extending write, the file size is updated based on the amount of data successfully written to the file. If folios were added beyond the final i_size, they may hang around until reclaimed, truncated or encountered unexpectedly by another operation. For example, generic/083 reproduces a sequence of events where a short write leaves around one or more post-EOF folios on an inode, a subsequent zero range request extends beyond i_size and overlaps with an aforementioned folio, and __bch2_truncate_folio() happens across it and complains. Update __bch2_buffered_write() to keep track of the start offset of the last folio added to the mapping for a prospective write. After i_size is updated, check whether this offset starts beyond EOF. If so, truncate pagecache beyond the latest EOF to clean up any folios that don't reside at least partially within EOF upon completion of the write. Signed-off-by:
Brian Foster <bfoster@redhat.com> Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
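The fix plausibly reduces to a post-write check of this shape, where last_folio_pos is the tracked start offset of the last folio added for the write (name illustrative):

    /* after i_size has been updated for the (possibly short) write: */
    if (last_folio_pos >= inode->v.i_size)
            truncate_pagecache(&inode->v, inode->v.i_size);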
-
Brian Foster authored
generic/083 occasionally reproduces a panic caused by an overflow when accessing the bch_folio_sector array of the folio being processed by __bch2_truncate_folio(). The immediate cause of the overflow is that the folio offset is beyond i_size, and therefore the sector index calculation underflows on subtraction of the folio offset. This is mainly observed on nocow mounts: when nocow is enabled, fallocate performs physical block allocation (as opposed to block reservation in cow mode), which range_has_data() then interprets as valid data that requires partial zeroing on truncate. Therefore, if a post-eof zero range request lands across post-eof preallocated blocks, __bch2_truncate_folio() may actually create a post-eof folio in order to perform zeroing. To avoid this problem, update range_has_data() to filter out unwritten blocks from folio creation and partial zeroing. Even though we should never create folios beyond EOF like this, the mere existence of such folios is not necessarily a fatal error. Fix up the truncate code to warn about this condition and not overflow the sector array and possibly crash the system. The addition of this warning without the corresponding unwritten extent fix has shown that various other fstests are able to reproduce this problem fairly frequently, but often in ways that don't necessarily result in a kernel panic or a change in user observable behavior, and therefore the problem goes undetected. Signed-off-by:
Brian Foster <bfoster@redhat.com> Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
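The range_has_data() filter sketched; bkey_extent_is_data() and bkey_extent_is_unwritten() are existing bcachefs predicates, with the iteration over the extents btree elided:

    /* unwritten (preallocated) extents don't count as data needing zeroing: */
    if (bkey_extent_is_data(k.k) && !bkey_extent_is_unwritten(k))
            ret = true;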
-
Kent Overstreet authored
With large folios, it's now incidentally possible to end up with a clean, uptodate folio in the page cache that doesn't have a bch_folio attached, if a folio has to be split. This patch fixes __bch2_truncate_folio() to check for this; other code paths appear to handle it. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
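A sketch of the added check; bch2_folio() returns the attached private state, while the creation helper's name and gfp flags here are assumptions:

    struct bch_folio *s = bch2_folio(folio);

    if (!s) {
            /* folio was split: re-attach per-sector state */
            s = __bch2_folio_create(folio, GFP_KERNEL);
            if (!s)
                    goto err_unlock;
    }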
-
Kent Overstreet authored
Readahead now uses the new filemap_get_contig_folios_d() helper. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Add a new helper for getting a range of contiguous folios and returning them in a darray. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
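A plausible shape for the helper and its use - the darray typedef, parameter list, and fgp flags here are assumptions based on the description:

    typedef DARRAY(struct folio *) folios;

    int filemap_get_contig_folios_d(struct address_space *mapping,
                                    loff_t start, u64 end,
                                    unsigned fgp_flags, gfp_t gfp,
                                    folios *fs);

    /* usage: */
    folios fs = {};
    ret = filemap_get_contig_folios_d(mapping, start, end,
                                      FGP_LOCK|FGP_CREAT, GFP_KERNEL, &fs);
    darray_for_each(fs, fi)
            /* *fi is the next contiguous folio */;
    darray_exit(&fs);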
-
Kent Overstreet authored
- X-macro-ize the bch_folio_sector_state enum: this means we can easily generate strings, which is helpful for debugging (sketched below).
- Add helpers for state transitions: folio_sector_dirty(), folio_sector_undirty(), folio_sector_reserve().
- Add folio_sector_set(), a single helper for changing folio sector state, just so that we have a single place to instrument when we're debugging.
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
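A sketch of the X-macro pattern from the first item in the list above; the state names match bcachefs usage, though the exact list is illustrative:

    #define BCH_FOLIO_SECTOR_STATES()       \
            x(unallocated)                  \
            x(reserved)                     \
            x(dirty)                        \
            x(allocated)

    enum bch_folio_sector_state {
    #define x(n)    SECTOR_##n,
            BCH_FOLIO_SECTOR_STATES()
    #undef x
    };

    /* the same list generates debug strings for free: */
    static const char * const bch2_folio_sector_states[] = {
    #define x(n)    #n,
            BCH_FOLIO_SECTOR_STATES()
    #undef x
            NULL
    };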
-
Kent Overstreet authored
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Various misc small conversions in fs-io.c for large folios. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This converts bch2_seek_pagecache_data() to handle large folios. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This converts bch2_seek_pagecache_hole() to handle large folios. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This converts the writepage end_io path to folios. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This converts fs-io.c to pass folios, not pages. We're not handling large folios yet, there's no functional changes in this patch - just a lot of churn doing the initial type conversions. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Start of the large folio conversion. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
We're seeing an odd bug with page/folio state not being properly initialized; this is to help track it down. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This adds private error codes for most (but not all) of our ENOMEM uses, which makes it easier to track down assorted allocation failures. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This adds support for nocow mode, where we do writes in-place when possible. Patch components:
- New boolean filesystem and inode option, nocow: note that when nocow is enabled, data checksumming and compression are implicitly disabled.
- To prevent in-place writes from racing with data moves (data_update.c) or bucket reuse (i.e. a bucket being reused and re-allocated while a nocow write is in flight), we have a new locking mechanism: buckets can be locked for either data update or data move, using a fixed size hash table of two_state_shared locks (sketched after this entry). We don't have any chaining, meaning updates and moves to different buckets that hash to the same lock will wait unnecessarily - we'll want to watch for this becoming an issue.
- The allocator path also needs to check for in-place writes in flight to a given bucket before giving it out: thus we add another counter to bucket_alloc_state so we can track this.
- Fsync may now need to issue cache flushes to block devices instead of flushing the journal. We add a device bitmask to bch_inode_info, ei_devs_need_flush, which tracks devices that need to have flushes issued - note that this will lead to unnecessary flushes when other codepaths have already issued flushes; we may want to replace this with a sequence number.
- New nocow write path: look up extents, and if they're writable, write to them - otherwise fall back to the normal COW write path.
XXX: switch to sequence numbers instead of bitmask for devs needing journal flush
XXX: ei_quota_lock being a mutex means bch2_nocow_write_done() needs to run in process context - see if we can improve this
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
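A sketch of the bucket-locking scheme from the second item in the list above - a fixed-size table of two-state shared locks indexed by a hash of the bucket, with no chaining; names and sizes are illustrative, not the actual bcachefs definitions:

    #define BUCKET_NOCOW_LOCKS      (1U << 10)

    struct bucket_nocow_lock_table {
            two_state_lock_t        l[BUCKET_NOCOW_LOCKS];
    };

    static two_state_lock_t *bucket_nocow_lock(struct bucket_nocow_lock_table *t,
                                               u64 dev_bucket)
    {
            /* no chaining: distinct buckets may share a lock and wait unnecessarily */
            return t->l + hash_64(dev_bucket, ilog2(BUCKET_NOCOW_LOCKS));
    }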
-
Kent Overstreet authored
- bch2_extent_merge checks the unwritten bit
- read path returns 0s for unwritten extents without actually reading
- reflink path skips over unwritten extents
- bch2_bkey_ptrs_invalid() checks for extents with both written and unwritten ptrs, and for non-normal extents (stripes, btree ptrs) with unwritten ptrs
- fiemap checks for unwritten extents and returns FIEMAP_EXTENT_UNWRITTEN (sketched below)
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
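The fiemap piece from the last item above, sketched; FIEMAP_EXTENT_UNWRITTEN is the standard flag from linux/fiemap.h, and bkey_extent_is_unwritten() the predicate this series adds, with the surrounding fiemap loop elided:

    unsigned flags = 0;

    if (bkey_extent_is_unwritten(k))
            flags |= FIEMAP_EXTENT_UNWRITTEN;   /* report preallocated space */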
-