Add a time_safety_margin to the job cache, to avoid missing transitions

Edmund Smith requested to merge eds/lava-monitor:eds/cache_race into main

We're seeing linear growth in lava_running_job_seconds for many devices which are idle. The pattern is interrupted when the device becomes active, and resumes immediately when the device returns to the idle state. It's notable that devices begin this pattern at different times, and that a reset of the lava monitor is the only thing that causes the pattern to fully reset.

I don't have the necessary privileges to test this fix, but I'm fairly sure what happens is this:

  • A job is running normally, and present in the job cache.
  • The job transitions to finished very close to when an update query (1) is made. Call the moment of this query A.
  • In update (1), the job is not reported as finished, and does not have an end time.
  • In the next update (2), the job is reported as finished, and its end time is B.
  • If B is before A then the job stays in the cache as Running. It will never be updated.
  • On subsequent metric calls, the highest id job in the cache drives the lava_running_job_seconds metric. While a genuinely active job exists, that is the one; otherwise it is the most recent job to have hit this behaviour. This explains the graph shapes.
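A minimal sketch of the sequence above, to make the race concrete. It assumes the monitor only picks up jobs whose end time is later than the previous query time; that filter, the `finished_since` helper, the job id, and the timestamps are all hypothetical and are not taken from the lava monitor code.

```python
from datetime import datetime, timedelta


def finished_since(server_jobs, since):
    """Hypothetical server query: ids of jobs whose end time is later than `since`."""
    return {
        job_id
        for job_id, end_time in server_jobs.items()
        if end_time is not None and end_time > since
    }


start = datetime(2024, 1, 1, 12, 0, 0)
A = start + timedelta(seconds=60)  # moment of update query (1)
B = A - timedelta(seconds=1)       # end time the server eventually records

cache = {101: "Running"}           # job 101 is running and cached

# Update (1), made at A: the end time is not visible yet.
server_jobs = {101: None}
for job_id in finished_since(server_jobs, since=start):
    cache[job_id] = "Finished"
last_query = A

# Update (2): the server now reports end time B, but B is before A,
# so a query for "finished since A" never returns the job.
server_jobs = {101: B}
for job_id in finished_since(server_jobs, since=last_query):
    cache[job_id] = "Finished"

print(cache[101])  # the job is still cached as Running
```

Under these assumptions, every later update uses an even later `since` value, so the job can never leave the Running state without the cache being rebuilt.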

Why is B before A?

After all, the job was not reported as finished at moment A, and yet a later update reports it as having already finished before moment A.

There are two obvious possibilities for why B is not necessarily after A:

  • It takes some time for the end of a job to be marked in the database. The first query could've occurred while the change was in flight.
  • There is clock skew between the monitor and the lava server.

Neither of these possibilities causes a large time difference, so they were initially considered ignorable, but we can see from the existing data that they are not.
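One way to picture the time_safety_margin named in the title is as a widening of the query window, so that an end time slightly before the previous query is still picked up. The sketch below assumes the cache filters on "end time later than the last query"; the `apply_update` function, its parameters, and the 30 second margin are illustrative assumptions, not the actual change in this merge request.

```python
from datetime import datetime, timedelta


def apply_update(cache, server_jobs, last_query, time_safety_margin):
    """Mark cached jobs as Finished if they ended after (last_query - margin).

    Hypothetical helper: `server_jobs` maps job id to end time (or None).
    """
    since = last_query - time_safety_margin
    for job_id, end_time in server_jobs.items():
        if end_time is not None and end_time > since:
            cache[job_id] = "Finished"
    return cache


A = datetime(2024, 1, 1, 12, 1, 0)  # moment of the previous query
B = A - timedelta(seconds=1)        # recorded end time, slightly before A

# Without a margin the transition is missed.
no_margin = apply_update({101: "Running"}, {101: B}, A, timedelta(0))
print(no_margin[101])    # Running

# With a margin larger than the skew and write delay, it is caught.
with_margin = apply_update({101: "Running"}, {101: B}, A, timedelta(seconds=30))
print(with_margin[101])  # Finished
```

Overlapping windows mean a finished job may be seen more than once, which is harmless here because marking a job as Finished twice leaves the cache unchanged. The margin only needs to exceed the combined clock skew and database write delay.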
