Retrieving jobs from Lava is much slower than retrieving devices, which delays device status updates and limits how many queries can be made in any reasonable amount of time.
The main changes here:
Create a generic cache type that permits simultaneous reading and updating. Because the update process is potentially lengthy, holding a RW lock for its duration isn't ideal (and could delay queries anyway), whilst the volume of memory we're using makes it entirely practical to just work on a separate copy. This also guarantees that updates are atomic, so readers never observe partial or broken states.
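The copy-and-swap approach can be sketched as follows. This is a minimal Go illustration, not the actual implementation: the `Cache` type, its methods, and the map payload are all hypothetical. The slow rebuild happens entirely outside the lock; the lock is only held for the pointer swap, so readers always see either the old complete snapshot or the new one.

```go
package main

import (
	"fmt"
	"sync"
)

// Cache holds an immutable snapshot behind a short-lived lock.
// Readers take the current snapshot pointer; updaters build a fresh
// copy off to the side and swap it in, so readers never observe a
// partial or broken state.
type Cache[T any] struct {
	mu   sync.RWMutex
	snap *T
}

// Get returns the current snapshot (shared; treat as read-only).
func (c *Cache[T]) Get() *T {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.snap
}

// Update builds a new snapshot without holding the lock, then swaps
// it in atomically. The potentially slow build never blocks readers.
func (c *Cache[T]) Update(build func(old *T) *T) {
	old := c.Get()
	fresh := build(old) // lengthy work happens outside the lock
	c.mu.Lock()
	c.snap = fresh
	c.mu.Unlock()
}

func main() {
	c := &Cache[map[string]string]{}
	c.Update(func(old *map[string]string) *map[string]string {
		m := map[string]string{"device-1": "idle"}
		return &m
	})
	fmt.Println((*c.Get())["device-1"])
}
```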
Create a smart job buffer that stores a configurable window of recent jobs, plus all active jobs. Jobs that complete and leave the window are guaranteed to persist for a configurable amount of time, so no data is lost as long as scraping occurs with any reasonable frequency.
Serve the existing metrics directly from the cache, using iterators over an immutable view to provide the necessary filtering.
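Filtering over an immutable view might look like the following sketch; `JobRecord` and `countByState` are hypothetical names, standing in for whatever the exporter's metric handlers actually do. The point is that the handler only iterates the snapshot, never copying or mutating the cached data.

```go
package main

import "fmt"

// JobRecord is a hypothetical read-only view of one cached job.
type JobRecord struct {
	Device string
	State  string
}

// countByState walks an immutable snapshot and tallies jobs per state,
// optionally restricted to a single device. The snapshot is never
// modified, so this is safe to run concurrently with cache updates
// that swap in a new snapshot.
func countByState(snapshot []JobRecord, device string) map[string]int {
	counts := map[string]int{}
	for _, j := range snapshot {
		if device != "" && j.Device != device {
			continue
		}
		counts[j.State]++
	}
	return counts
}

func main() {
	snap := []JobRecord{
		{Device: "qemu-1", State: "running"},
		{Device: "qemu-1", State: "complete"},
		{Device: "bbb-1", State: "running"},
	}
	fmt.Println(countByState(snap, "qemu-1")["running"])
}
```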
Launch an async task from main that refreshes the cache periodically. A short delay in the loop prevents the queries from becoming continuous, and the initial update is synchronous so that metrics are only reported from the cache once it is fully initialized.