intel_rdt_ui.txt 13.2 KB
Newer Older
1 2 3 4 5 6
User Interface for Resource Allocation in Intel Resource Director Technology

Copyright (C) 2016 Intel Corporation

Fenghua Yu <fenghua.yu@intel.com>
Tony Luck <tony.luck@intel.com>
7
Vikas Shivappa <vikas.shivappa@intel.com>
8 9 10 11 12 13 14 15 16 17 18 19 20

This feature is enabled by the CONFIG_INTEL_RDT_A Kconfig and the
X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".

To use the feature mount the file system:

 # mount -t resctrl resctrl [-o cdp] /sys/fs/resctrl

mount options are:

"cdp": Enable code/data prioritization in L3 cache allocations.


21 22 23 24 25
Info directory
--------------

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
26 27
names reflect the resource names.
Cache resource(L3/L2)  subdirectory contains the following files:
28

29 30 31
"num_closids":  	The number of CLOSIDs which are valid for this
			resource. The kernel uses the smallest number of
			CLOSIDs of all enabled resources as limit.
32

33 34
"cbm_mask":     	The bitmask which is valid for this resource.
			This mask is equivalent to 100%.
35

36 37
"min_cbm_bits": 	The minimum number of consecutive bits which
			must be set when writing a mask.
38

39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
Memory bandwitdh(MB) subdirectory contains the following files:

"min_bandwidth":	The minimum memory bandwidth percentage which
			user can request.

"bandwidth_gran":	The granularity in which the memory bandwidth
			percentage is allocated. The allocated
			b/w percentage is rounded off to the next
			control step available on the hardware. The
			available bandwidth control steps are:
			min_bandwidth + N * bandwidth_gran.

"delay_linear": 	Indicates if the delay scale is linear or
			non-linear. This field is purely informational
			only.
54

55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
Resource groups
---------------
Resource groups are represented as directories in the resctrl file
system. The default group is the root directory. Other groups may be
created as desired by the system administrator using the "mkdir(1)"
command, and removed using "rmdir(1)".

There are three files associated with each group:

"tasks": A list of tasks that belongs to this group. Tasks can be
	added to a group by writing the task ID to the "tasks" file
	(which will automatically remove them from the previous
	group to which they belonged). New tasks created by fork(2)
	and clone(2) are added to the same group as their parent.
	If a pid is not in any sub partition, it is in root partition
	(i.e. default partition).

"cpus": A bitmask of logical CPUs assigned to this group. Writing
	a new mask can add/remove CPUs from this group. Added CPUs
	are removed from their previous group. Removed ones are
	given to the default (root) group. You cannot remove CPUs
	from the default group.

78 79 80
"cpus_list": One or more CPU ranges of logical CPUs assigned to this
	     group. Same rules apply like for the "cpus" file.

81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
"schemata": A list of all the resources available to this group.
	Each resource has its own line and format - see below for
	details.

When a task is running the following rules define which resources
are available to it:

1) If the task is a member of a non-default group, then the schemata
for that group is used.

2) Else if the task belongs to the default group, but is running on a
CPU that is assigned to some specific group, then the schemata for
the CPU's group is used.

3) Otherwise the schemata for the default group is used.


Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.

Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical cpus sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps).  To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id

Cache Bit Masks (CBM)
---------------------
For cache resources we describe the portion of the cache that is available
for allocation using a bitmask. The maximum value of the mask is defined
by each cpu model (and may be different for different cache levels). It
is found using CPUID, but is also provided in the "info" directory of
the resctrl file system in "info/{resource}/cbm_mask". X86 hardware
requires that these masks have all the '1' bits in a contiguous block. So
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
of the capacity of the cache. You could partition the cache into four
equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.

129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
Memory bandwidth(b/w) percentage
--------------------------------
For Memory b/w resource, user controls the resource by indicating the
percentage of total memory b/w.

The minimum bandwidth percentage value for each cpu model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the cpu model and can
be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are rounded
to the next control step available on the hardware.

The bandwidth throttling is a core specific mechanism on some of Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth.
145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166

L3 details (code and data prioritization disabled)
--------------------------------------------------
With CDP disabled the L3 schemata format is:

	L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 details (CDP enabled via mount option to resctrl)
----------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this:

	L3data:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
	L3code:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L2 details
----------
L2 cache does not support code and data prioritization, so the
schemata format is always:

	L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

167 168 169 170 171 172 173
Memory b/w Allocation details
-----------------------------

Memory b/w domain is L3 cache.

	MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

174 175 176 177 178 179 180 181 182 183 184 185 186 187
Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
on all domains. When writing you only need to specify those values
which you wish to change.  E.g.

# cat schemata
L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
# echo "L3DATA:2=3c0;" > schemata
# cat schemata
L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff

188 189 190
Example 1
---------
On a two socket machine (one L3 cache per socket) with just four bits
191 192
for cache bit masks, minimum b/w of 10% with a memory bandwidth
granularity of 10%
193 194 195 196

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
197 198
# echo "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata
199 200 201 202 203 204 205 206

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

207 208 209 210 211 212 213 214
Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocations specifies the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

215 216 217 218 219 220 221 222 223 224 225 226 227
Example 2
---------
Again two sockets, but this time with a more realistic 20-bit mask.

Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
neighbors, each of the two real-time tasks exclusively occupies one quarter
of L3 cache on socket 0.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
228 229
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
ordinary tasks:
230

231
# echo "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata
232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253

Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.

# mkdir p0
# echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.

# echo 1234 > p0/tasks
# taskset -cp 1 1234

Ditto for the second real time task (with the remaining 25% of cache):

# mkdir p1
# echo "L3:0=7c00;1=fffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2 5678

254 255 256 257 258 259 260 261 262 263 264 265 266 267
For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like(Assume min_bandwidth 10 and bandwidth_gran is
10):

For our first real time task this would request 20% memory b/w on socket
0.

# echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real time task this would request an other 20% memory b/w
on socket 0.

# echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

268 269 270 271 272 273 274 275 276 277 278 279 280
Example 3
---------

A single socket system which has real-time tasks running on core 4-7 and
non real-time workload assigned to core 0-3. The real-time tasks share text
and data, so a per task association is not required and due to interaction
with the kernel it's desired that the kernel on these cores shares L3 with
the tasks.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
281 282
50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
cannot be used by ordinary tasks:
283

284
# echo "L3:0=3ff\nMB:0=50" > schemata
285

286 287 288
Next we make a resource group for our real time cores and give it access
to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
socket 0.
289 290

# mkdir p0
291
# echo "L3:0=ffc00\nMB:0=50" > p0/schemata
292 293

Finally we move core 4-7 over to the new group and make sure that the
294 295 296
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.
297

298
# echo F0 > p0/cpus
299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412

4) Locking between applications

Certain operations on the resctrl filesystem, composed of read/writes
to/from multiple files, must be atomic.

As an example, the allocation of an exclusive reservation of L3 cache
involves:

  1. Read the cbmmasks from each directory
  2. Find a contiguous set of bits in the global CBM bitmask that is clear
     in any of the directory cbmmasks
  3. Create a new directory
  4. Set the bits found in step 2 to the new directory "schemata" file

If two applications attempt to allocate space concurrently then they can
end up allocating the same bits so the reservations are shared instead of
exclusive.

To coordinate atomic operations on the resctrlfs and to avoid the problem
above, the following locking procedure is recommended:

Locking is based on flock, which is available in libc and also as a shell
script command

Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) funlock

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) If success read the directory structure.
 C) funlock

Example with bash:

# Atomically read directory structure
$ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

# Read directory contents and create new subdirectory

$ cat create-dir.sh
find /sys/fs/resctrl/ > output.txt
mask = function-of(output.txt)
mkdir /sys/fs/resctrl/newres/
echo mask > /sys/fs/resctrl/newres/schemata

$ flock /sys/fs/resctrl/ ./create-dir.sh

Example with C:

/*
 * Example code do take advisory locks
 * before accessing resctrl filesystem
 */
#include <sys/file.h>
#include <stdlib.h>

void resctrl_take_shared_lock(int fd)
{
	int ret;

	/* take shared lock on resctrl filesystem */
	ret = flock(fd, LOCK_SH);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

void resctrl_take_exclusive_lock(int fd)
{
	int ret;

	/* release lock on resctrl filesystem */
	ret = flock(fd, LOCK_EX);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

void resctrl_release_lock(int fd)
{
	int ret;

	/* take shared lock on resctrl filesystem */
	ret = flock(fd, LOCK_UN);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

void main(void)
{
	int fd, ret;

	fd = open("/sys/fs/resctrl", O_DIRECTORY);
	if (fd == -1) {
		perror("open");
		exit(-1);
	}
	resctrl_take_shared_lock(fd);
	/* code to read directory contents */
	resctrl_release_lock(fd);

	resctrl_take_exclusive_lock(fd);
	/* code to read and write directory contents */
	resctrl_release_lock(fd);
}