Skip to content
  • Tang Chen's avatar
    mem-hotplug: handle node hole when initializing numa_meminfo. · 95cf82ec
    Tang Chen authored
    
    
    When parsing SRAT, all memory ranges are added into numa_meminfo.  In
    numa_init(), before entering numa_cleanup_meminfo(), all possible memory
    ranges are in numa_meminfo.  And numa_cleanup_meminfo() removes all
    ranges over max_pfn or empty.
    
    But, this only works if the nodes are continuous.  Let's have a look at
    the following example:
    
    We have an SRAT like this:
    SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff]
    SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff]
    SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff]
    SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug
    SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug
    SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug
    SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug
    SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug
    SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug
    
    On boot, only node 0,1,2,3 exist.
    
    And the numa_meminfo will look like this:
    numa_meminfo.nr_blks = 9
    1. on node 0: [0, 60000000]
    2. on node 0: [100000000, 20000000000]
    3. on node 1: [20000000000, 40000000000]
    4. on node 4: [40000000000, 60000000000]
    5. on node 5: [60000000000, 80000000000]
    6. on node 2: [80000000000, a0000000000]
    7. on node 3: [a0000000000, a0800000000]
    8. on node 6: [c0000000000, a0800000000]
    9. on node 7: [e0000000000, a0800000000]
    
    And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because the
    end address is over max_pfn, which is a0800000000.  But 4 and 5 are not
    removed because their end addresses are less then max_pfn.  But in fact,
    node 4 and 5 don't exist.
    
    In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.
    
    Since memory ranges in node 4 and 5 are in numa_meminfo, in
    numa_register_memblks(), node 4 and 5 will be mistakenly set to online.
    
    If you run lscpu, it will show:
    NUMA node0 CPU(s):     0-14,128-142
    NUMA node1 CPU(s):     15-29,143-157
    NUMA node2 CPU(s):
    NUMA node3 CPU(s):
    NUMA node4 CPU(s):     62-76,190-204
    NUMA node5 CPU(s):     78-92,206-220
    
    In this patch, we use memblock_overlaps_region() to check if ranges in
    numa_meminfo overlap with ranges in memory_block.  Since memory_block
    contains all available memory at boot time, if they overlap, it means the
    ranges exist.  If not, then remove them from numa_meminfo.
    
    After this patch, lscpu will show:
    NUMA node0 CPU(s):     0-14,128-142
    NUMA node1 CPU(s):     15-29,143-157
    NUMA node4 CPU(s):     62-76,190-204
    NUMA node5 CPU(s):     78-92,206-220
    
    Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
    Reviewed-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Luiz Capitulino <lcapitulino@redhat.com>
    Cc: Xishi Qiu <qiuxishi@huawei.com>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: Vladimir Murzin <vladimir.murzin@arm.com>
    Cc: Fabian Frederick <fabf@skynet.be>
    Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
    Cc: Baoquan He <bhe@redhat.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    95cf82ec