    sched/topology: Introduce NUMA identity node sched domain · 051f3ca0
    Suravee Suthikulpanit authored
    
    
    On an AMD Family17h-based (EPYC) system, a logical NUMA node can contain
    up to 8 cores (16 threads) with the following topology:
    
                 ----------------------------
             C0  | T0 T1 |    ||    | T0 T1 | C4
                 --------|    ||    |--------
             C1  | T0 T1 | L3 || L3 | T0 T1 | C5
                 --------|    ||    |--------
             C2  | T0 T1 | #0 || #1 | T0 T1 | C6
                 --------|    ||    |--------
             C3  | T0 T1 |    ||    | T0 T1 | C7
                 ----------------------------
    
    Here, there are 2 last-level (L3) caches per logical NUMA node.
    A socket can contain up to 4 NUMA nodes, and a system can support
    up to 2 sockets. With a full system configuration, the current scheduler
    creates 4 sched domains:
    
      domain0 SMT       (span a core)
      domain1 MC        (span a last-level-cache)
      domain2 NUMA      (span a socket: 4 nodes)
      domain3 NUMA      (span a system: 8 nodes)
    
    Note that there is no domain to represent the cpus spanning a logical
    NUMA node. With this hierarchy of sched domains, the scheduler does
    not balance properly in the following cases:
    
    Case1:
    
     When running 8 tasks, a properly balanced system should
     schedule a task per logical NUMA node. This is not the case for
     the current scheduler.
    
    Case2:
    
     In some cases, threads are scheduled on the same cpu, while other
     cpus are idle. This results in run-to-run inconsistency. For example:
    
      taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
                              --cpu-max-prime=100000 run
    
    Total execution time ranges from 25.1s to 33.5s depending on thread
    placement, where 25.1s is the case in which all 8 threads are balanced
    properly across the 8 cpus.
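
One way to observe the placement problem in Case2 is to count how many threads of the benchmark last ran on each cpu. The following is a minimal sketch, not part of this patch; it assumes a Linux /proc filesystem, and the helper names (last_cpu, threads_per_cpu, crowded_cpus) are hypothetical. It reads the "processor" field (field 39) of /proc/<pid>/task/<tid>/stat.

```python
from collections import Counter
from pathlib import Path


def last_cpu(stat_line):
    """Return the 'processor' field (field 39) of a /proc/<pid>/stat line."""
    # comm (field 2) may contain spaces or ')', so split after the last ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 sits at index 36.
    return int(rest[36])


def threads_per_cpu(pid):
    """Count how many threads of `pid` last ran on each cpu (Linux only)."""
    counts = Counter()
    for stat in Path(f"/proc/{pid}/task").glob("*/stat"):
        try:
            counts[last_cpu(stat.read_text())] += 1
        except (OSError, IndexError, ValueError):
            continue  # thread exited, or the line was not in the expected form
    return counts


def crowded_cpus(counts):
    """Return only the cpus on which more than one thread last ran."""
    return {cpu: n for cpu, n in counts.items() if n > 1}
```

Calling crowded_cpus(threads_per_cpu(pid)) while the benchmark runs gives a snapshot only: the field records the cpu a thread last executed on, and an idle main thread is counted as well, so a single sample should be read with care.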
    
    Introduce a NUMA identity node sched domain, which is based on how the
    SRAT/SLIT tables define a logical NUMA node. This results in the following
    hierarchy of sched domains on the same system described above:
    
      domain0 SMT       (span a core)
      domain1 MC        (span a last-level-cache)
      domain2 NODE      (span a logical NUMA node)
      domain3 NUMA      (span a socket: 4 nodes)
      domain4 NUMA      (span a system: 8 nodes)
    
    This fixes the improper load balancing cases mentioned above.
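
The number of cpus each domain spans follows from the topology figures given in this description. The sketch below is illustrative arithmetic only, not kernel code; the constant and function names are hypothetical, and it assumes the fully populated 2-socket configuration.

```python
# Topology figures taken from the description above (full configuration).
THREADS_PER_CORE = 2
CORES_PER_L3 = 4
L3_PER_NODE = 2
NODES_PER_SOCKET = 4
SOCKETS = 2


def domain_spans(with_node_domain):
    """Return (domain name, number of cpus spanned) from lowest to highest."""
    smt = THREADS_PER_CORE
    mc = smt * CORES_PER_L3
    node = mc * L3_PER_NODE
    socket = node * NODES_PER_SOCKET
    system = socket * SOCKETS
    levels = [("SMT", smt), ("MC", mc)]
    if with_node_domain:
        levels.append(("NODE", node))
    levels += [("NUMA", socket), ("NUMA", system)]
    return levels


for name, cpus in domain_spans(with_node_domain=True):
    print(f"{name:5s} spans {cpus} cpus")
```

Without the NODE level, the hierarchy steps directly from 8 cpus (MC) to 64 cpus (socket); with it, there is an intermediate 16-cpu domain matching a logical NUMA node.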
    
    Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: bp@suse.de
    Link: http://lkml.kernel.org/r/1504768805-46716-1-git-send-email-suravee.suthikulpanit@amd.com
    
    
    Signed-off-by: Ingo Molnar <mingo@kernel.org>