Monitor device idleness approximately

Edmund Smith requested to merge eds/lava-monitor:eds/idle-dimension into main

Every time lava-monitor runs, we have the opportunity to ask the LAVA API whether a job is currently scheduled on each device. Over time, these point samples, though widely spaced, should give a reasonably accurate view of device utilization.

There are two obvious ways of implementing this, and they have different knock-on effects for querying the data in PromQL. This version adds an idle label to the lava_device metric. The label takes the value active or idle depending on whether any job was marked as running on that device at the instant lava-monitor ran. Doing it this way lets us filter the device metric by idle state together with its other labels (for example, to select only healthy machines when calculating utilization).

The alternative means of implementing this is to add a separate metric, for example lava_device_idle. This makes the calculation of utilization trivial for each machine, but prevents any filtering interaction between device health and device activity over time.
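As a rough illustration of the difference between the two shapes, the sketch below prints one sample per device in Prometheus text exposition format for each variant. It is not taken from lava-monitor: the device names, the constant sample value of 1 on lava_device, and the convention that lava_device_idle is 1 when idle are all assumptions made for the example.

```python
def labelled_sample(device: str, job_running: bool) -> str:
    """Variant in this MR: idle state is a label on lava_device.

    The sample value of 1 is an assumption for illustration.
    """
    state = "active" if job_running else "idle"
    return f'lava_device{{device="{device}",idle="{state}"}} 1'


def separate_sample(device: str, job_running: bool) -> str:
    """Alternative: a dedicated gauge, assumed to be 1 when idle, 0 when active."""
    value = 0 if job_running else 1
    return f'lava_device_idle{{device="{device}"}} {value}'


if __name__ == "__main__":
    # Hypothetical devices: one with a running job, one without.
    for device, running in [("board-a", True), ("board-b", False)]:
        print(labelled_sample(device, running))
        print(separate_sample(device, running))
```

With the label variant, each device switches between two distinct series over time; with the separate metric, each device has a single series whose value changes.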

The corresponding PromQL queries, for the separate-metric alternative and for the labelled version respectively, are:

avg_over_time(lava_device_idle[$__interval]) and on (device) lava_worker{worker="lava-rack-cbg-1"}

and

(
  (
    count_over_time(lava_device{idle="active"}[$__interval])
    or ignoring(idle)
    (count_over_time(lava_device{idle="idle"}[$__interval]) * 0)
  )
  / on(device)
  sum by (device) (
    count_over_time(lava_device{idle=~"active|idle"}[$__interval])
  )
)
and on (device) lava_worker{worker="lava-rack-cbg-1"}

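The complexity in the latter case (which is the version matching this PR) is mostly caused by the need to make sure the full set of devices appears on the left-hand side of the division. A device that was idle at every sample in the interval has no idle="active" series, so without the or branch (which contributes a zero for it) that device would drop out of the result instead of reporting zero utilization.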
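The effect of filling in the left-hand side can be checked with a small simulation. The sketch below mimics the counting and division in plain Python rather than running PromQL; the function name, the fill_idle flag and the device names are hypothetical, and the samples are made up.

```python
from collections import Counter


def utilization(samples, fill_idle=True):
    """Fraction of samples in which each device was active.

    samples: iterable of (device, state) pairs, state is "active" or "idle".
    fill_idle: when True, mimic the `or ignoring(idle) (... * 0)` branch.
    """
    samples = list(samples)
    # count_over_time(...) per (device, idle) series
    counts = Counter(samples)
    # sum by (device) (count_over_time(...)) over both states
    totals = Counter(device for device, _ in samples)

    # Left-hand side of the division: counts of active samples only.
    lhs = {dev: n for (dev, state), n in counts.items() if state == "active"}
    if fill_idle:
        # Add a zero for any device that has an idle series but no active one.
        for dev, state in counts:
            if state == "idle":
                lhs.setdefault(dev, 0)

    return {dev: n / totals[dev] for dev, n in lhs.items()}


if __name__ == "__main__":
    samples = [
        ("board-a", "active"), ("board-a", "idle"),
        ("board-a", "active"), ("board-a", "idle"),
        ("board-b", "idle"), ("board-b", "idle"),
    ]
    print(utilization(samples))                   # board-b reported as 0.0
    print(utilization(samples, fill_idle=False))  # board-b missing entirely
```

Without the fill step, the always-idle device has no entry on the left-hand side and vanishes from the result, which is the case the or branch in the query guards against.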

Note that we can filter at the end across instant vectors (to select a particular rack), but we cannot filter by another metric inside the over-time aggregations. The only kind of filtering that can be applied over time (without getting into the cost and complexity of subqueries) is by label. Operators like and, or and unless are defined only on instant vectors, and the expressions they yield are not vector selectors; only vector selectors can be used to form ranges.
