1. 09 May, 2017 1 commit
    • Andy Lutomirski's avatar
      x86/boot/32: Fix UP boot on Quark and possibly other platforms · d2b6dc61
      Andy Lutomirski authored
      This partially reverts commit:
       ("x86/boot/32: Defer resyncing initial_page_table until per-cpu is set up")
      That commit had one definite bug and one potential bug.  The
      definite bug is that setup_per_cpu_areas() uses a differnet generic
      implementation on UP kernels, so initial_page_table never got
      resynced.  This was fine for access to percpu data (it's in the
      identity map on UP), but it breaks other users of
      initial_page_table.  The potential bug is that helpers like
      efi_init() would be called before the tables were synced.
      Avoid both problems by just syncing the page tables in setup_arch()
      *and* setup_per_cpu_areas().
      Reported-by: default avatarJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
  2. 23 Mar, 2017 1 commit
    • Andy Lutomirski's avatar
      x86/boot/32: Defer resyncing initial_page_table until per-cpu is set up · 23b2a4dd
      Andy Lutomirski authored
      The x86 smpboot trampoline expects initial_page_table to have the
      GDT mapped.  If the GDT ends up in a virtually mapped per-cpu page,
      then it won't be in the page tables at all until perc-pu areas are
      set up.  The result will be a triple fault the first time that the
      CPU attempts to access the GDT after LGDT loads the perc-pu GDT.
      This appears to be an old bug, but somehow the GDT fixmap rework
      is triggering it.  This seems to have something to do with the
      memory layout.
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/a553264a5972c6a86f9b5caac237470a0c74a720.1490218061.git.luto@kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
  3. 16 Mar, 2017 1 commit
    • Thomas Garnier's avatar
      x86: Remap GDT tables in the fixmap section · 69218e47
      Thomas Garnier authored
      Each processor holds a GDT in its per-cpu structure. The sgdt
      instruction gives the base address of the current GDT. This address can
      be used to bypass KASLR memory randomization. With another bug, an
      attacker could target other per-cpu structures or deduce the base of
      the main memory section (PAGE_OFFSET).
      This patch relocates the GDT table for each processor inside the
      fixmap section. The space is reserved based on number of supported
      For consistency, the remapping is done by default on 32 and 64-bit.
      Each processor switches to its remapped GDT at the end of
      initialization. For hibernation, the main processor returns with the
      original GDT and switches back to the remapping at completion.
      This patch was tested on both architectures. Hibernation and KVM were
      both tested specially for their usage of the GDT.
      Thanks to Boris Ostrovsky <boris.ostrovsky@oracle.com> for testing and
      recommending changes for Xen support.
      Signed-off-by: default avatarThomas Garnier <thgarnie@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Luis R . Rodriguez <mcgrof@kernel.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Rafael J . Wysocki <rjw@rjwysocki.net>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: kasan-dev@googlegroups.com
      Cc: kernel-hardening@lists.openwall.com
      Cc: kvm@vger.kernel.org
      Cc: lguest@lists.ozlabs.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-efi@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-pm@vger.kernel.org
      Cc: xen-devel@lists.xenproject.org
      Cc: zijun_hu <zijun_hu@htc.com>
      Link: http://lkml.kernel.org/r/20170314170508.100882-2-thgarnie@google.com
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
  4. 15 Nov, 2016 1 commit
    • Paul Gortmaker's avatar
      x86/percpu: Remove unnecessary include of module.h, add asm/desc.h · 523d0fb4
      Paul Gortmaker authored
      This was originally a part of commit 186f4360
          ("x86/kernel: Audit and remove any unnecessary uses of module.h")
      ...but without the asm/desc.h addition.  As such, Ingo reported a
      build failure on i386 allnoconfig with SMP=y during his pre-merge
      testing.   For expediency the chunk was just dropped at that time.
      The failure was as follows:
        arch/x86/kernel/setup_percpu.c: In function ‘setup_percpu_segment’:
        arch/x86/kernel/setup_percpu.c:159:2: error: implicit declaration of function ‘pack_descriptor’ [-Werror=implicit-function-declaration]
        arch/x86/kernel/setup_percpu.c:162:2: error: implicit declaration of function ‘write_gdt_entry’ [-Werror=implicit-function-declaration]
        arch/x86/kernel/setup_percpu.c:162:18: error: implicit declaration of function ‘get_cpu_gdt_table’ [-Werror=implicit-function-declaration]
      As pack_descriptor(), write_gdt_entry() and get_cpu_gdt_table() all
      live in the file arch/x86/include/asm/desc.h -- calling that header
      out explicitly should fix things.
      Reported-by: default avatarIngo Molnar <mingo@redhat.com>
      Signed-off-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20161114190443.10873-1-paul.gortmaker@windriver.com
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
  5. 18 Aug, 2016 1 commit
  6. 10 Aug, 2016 1 commit
    • Kees Cook's avatar
      x86: Apply more __ro_after_init and const · 404f6aac
      Kees Cook authored
      Guided by grsecurity's analogous __read_only markings in arch/x86,
      this applies several uses of __ro_after_init to structures that are
      only updated during __init, and const for some structures that are
      never updated.  Additionally extends __init markings to some functions
      that are only used during __init, and cleans up some missing C99 style
      static initializers.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brad Spengler <spender@grsecurity.net>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Brown <david.brown@linaro.org>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Emese Revfy <re.emese@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: PaX Team <pageexec@freemail.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-hardening@lists.openwall.com
      Link: http://lkml.kernel.org/r/20160808232906.GA29731@www.outflux.net
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
  7. 25 Jul, 2016 1 commit
  8. 04 Nov, 2014 1 commit
  9. 14 Jun, 2012 1 commit
    • Vlad Zolotarov's avatar
      x86: Add read_mostly declaration/definition to variables from smp.h · 0816b0f0
      Vlad Zolotarov authored
      Add "read-mostly" qualifier to the following variables in
       - cpu_sibling_map
       - cpu_core_map
       - cpu_llc_shared_map
       - cpu_llc_id
       - cpu_number
       - x86_cpu_to_apicid
       - x86_bios_cpu_apicid
       - x86_cpu_to_logical_apicid
      As long as all the variables above are only written during the
      initialization, this change is meant to prevent the false
      sharing. More specifically, on vSMP Foundation platform
      x86_cpu_to_apicid shared the same internode_cache_line with
      frequently written lapic_events.
      From the analysis of the first 33 per_cpu variables out of 219
      (memories they describe, to be more specific) the 8 have read_mostly
      nature (tlb_vector_offset, cpu_loops_per_jiffy, xen_debug_irq, etc.)
      and 25 are frequently written (irq_stack_union, gdt_page,
      exception_stacks, idt_desc, etc.).
      Assuming that the spread of the rest of the per_cpu variables is
      similar, identifying the read mostly memories will make more sense
      in terms of long-term code maintenance comparing to identifying
      frequently written memories.
      Signed-off-by: default avatarVlad Zolotarov <vlad@scalemp.com>
      Acked-by: default avatarShai Fultheim <shai@scalemp.com>
      Cc: Shai Fultheim (Shai@ScaleMP.com) <Shai@scalemp.com>
      Cc: ido@wizery.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1719258.EYKzE4Zbq5@vlad
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
  10. 08 May, 2012 1 commit
    • Tejun Heo's avatar
      percpu, x86: don't use PMD_SIZE as embedded atom_size on 32bit · d5e28005
      Tejun Heo authored
      With the embed percpu first chunk allocator, x86 uses either PAGE_SIZE
      or PMD_SIZE for atom_size.  PMD_SIZE is used when CPU supports PSE so
      that percpu areas are aligned to PMD mappings and possibly allow using
      PMD mappings in vmalloc areas in the future.  Using larger atom_size
      doesn't waste actual memory; however, it does require larger vmalloc
      space allocation later on for !first chunks.
      With reasonably sized vmalloc area, PMD_SIZE shouldn't be a problem
      but x86_32 at this point is anything but reasonable in terms of
      address space and using larger atom_size reportedly leads to frequent
      percpu allocation failures on certain setups.
      As there is no reason to not use PMD_SIZE on x86_64 as vmalloc space
      is aplenty and most x86_64 configurations support PSE, fix the issue
      by always using PMD_SIZE on x86_64 and PAGE_SIZE on x86_32.
      v2: drop cpu_has_pse test and make x86_64 always use PMD_SIZE and
          x86_32 PAGE_SIZE as suggested by hpa.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarYanmin Zhang <yanmin.zhang@intel.com>
      Reported-by: default avatarShuoX Liu <shuox.liu@intel.com>
      Acked-by: default avatarH. Peter Anvin <hpa@zytor.com>
      LKML-Reference: <4F97BA98.6010001@intel.com>
      Cc: stable@vger.kernel.org
  11. 28 Jan, 2011 2 commits
    • Tejun Heo's avatar
      x86: Unify CPU -> NUMA node mapping between 32 and 64bit · 645a7919
      Tejun Heo authored
      Unlike 64bit, 32bit has been using its own cpu_to_node_map[] for
      CPU -> NUMA node mapping.  Replace it with early_percpu variable
      x86_cpu_to_node_map and share the mapping code with 64bit.
      * USE_PERCPU_NUMA_NODE_ID is now enabled for 32bit too.
      * x86_cpu_to_node_map and numa_set/clear_node() are moved from
        numa_64 to numa.  For now, on 32bit, x86_cpu_to_node_map is initialized
        with 0 instead of NUMA_NO_NODE.  This is to avoid introducing unexpected
        behavior change and will be updated once init path is unified.
      * srat_detect_node() is now enabled for x86_32 too.  It calls
        numa_set_node() and initializes the mapping making explicit
        cpu_to_node_map[] updates from map/unmap_cpu_to_node() unnecessary.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: eric.dumazet@gmail.com
      Cc: yinghai@kernel.org
      Cc: brgerst@gmail.com
      Cc: gorcunov@gmail.com
      Cc: penberg@kernel.org
      Cc: shaohui.zheng@intel.com
      Cc: rientjes@google.com
      LKML-Reference: <1295789862-25482-15-git-send-email-tj@kernel.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Cc: David Rientjes <rientjes@google.com>
    • Tejun Heo's avatar
      x86: Replace cpu_2_logical_apicid[] with early percpu variable · 4c321ff8
      Tejun Heo authored
      Unlike x86_64, on x86_32, the mapping from cpu to logical apicid
      may vary depending on apic in use.  cpu_2_logical_apicid[] array
      is used for this mapping.  Replace it with early percpu variable
      x86_cpu_to_logical_apicid to make it better aligned with other
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: eric.dumazet@gmail.com
      Cc: yinghai@kernel.org
      Cc: brgerst@gmail.com
      Cc: gorcunov@gmail.com
      Cc: penberg@kernel.org
      Cc: shaohui.zheng@intel.com
      Cc: rientjes@google.com
      LKML-Reference: <1295789862-25482-5-git-send-email-tj@kernel.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
  12. 27 Aug, 2010 1 commit
    • Yinghai Lu's avatar
      x86: Use memblock to replace early_res · 72d7c3b3
      Yinghai Lu authored
      1. replace find_e820_area with memblock_find_in_range
      2. replace reserve_early with memblock_x86_reserve_range
      3. replace free_early with memblock_x86_free_range.
      4. NO_BOOTMEM will switch to use memblock too.
      5. use _e820, _early wrap in the patch, in following patch, will
         replace them all
      6. because memblock_x86_free_range support partial free, we can remove some special care
      7. Need to make sure that memblock_find_in_range() is called after memblock_x86_fill()
         so adjust some calling later in setup.c::setup_arch()
         -- corruption_check and mptable_update
      -v2: Move reserve_brk() early
          Before fill_memblock_area, to avoid overlap between brk and memblock_find_in_range()
          that could happen We have more then 128 RAM entry in E820 tables, and
          memblock_x86_fill() could use memblock_find_in_range() to find a new place for
          memblock.memory.region array.
          and We don't need to use extend_brk() after fill_memblock_area()
          So move reserve_brk() early before fill_memblock_area().
      -v3: Move find_smp_config early
          To make sure memblock_find_in_range not find wrong place, if BIOS doesn't put mptable
          in right place.
      -v4: Treat RESERVED_KERN as RAM in memblock.memory. and they are already in
          memblock.reserved already..
          use __NOT_KEEP_MEMBLOCK to make sure memblock related code could be freed later.
      -v5: Generic version __memblock_find_in_range() is going from high to low, and for 32bit
          active_region for 32bit does include high pages
          need to replace the limit with memblock.default_alloc_limit, aka get_max_mapped()
      -v6: Use current_limit instead
      -v7: check with MEMBLOCK_ERROR instead of -1ULL or -1L
      -v8: Set memblock_can_resize early to handle EFI with more RAM entries
      -v9: update after kmemleak changes in mainline
      Suggested-by: default avatarDavid S. Miller <davem@davemloft.net>
      Suggested-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Suggested-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarYinghai Lu <yinghai@kernel.org>
      Signed-off-by: default avatarH. Peter Anvin <hpa@zytor.com>
  13. 12 Aug, 2010 1 commit
  14. 21 Jul, 2010 1 commit
  15. 20 Jul, 2010 1 commit
    • Yinghai Lu's avatar
      x86, numa: fix boot without RAM on node0 again · 9aebbdb6
      Yinghai Lu authored
      Commit e534c7c5
       ("numa: x86_64: use generic percpu var
      numa_node_id() implementation") broke numa systems that don't have ram
      on node0 when MEMORY_HOTPLUG is enabled, because cpu_up() will call
      cpu_to_node() before per_cpu(numa_node) is setup for APs.
      When Node0 doesn't have RAM, on x86, cpus already round it to nearest
      node with RAM in x86_cpu_to_node_map.  and per_cpu(numa_node) is not set
      up until in c_init for APs.
      When later cpu_up() calling cpu_to_node() will get 0 again, and make it
      online even there is no RAM on node0.  so later all APs can not booted up,
      and later will have panic.
      [    1.611101] On node 0 totalpages: 0
      [    2.608558] On node 0 totalpages: 0
      [    2.612065] Brought up 1 CPUs
      [    2.615199] Total of 1 processors activated (3990.31 BogoMIPS).
         93.225341] calling  loop_init+0x0/0x1a4 @ 1
      [   93.229314] PERCPU: allocation failed, size=80 align=8, failed to populate
      [   93.246539] Pid: 1, comm: swapper Tainted: G        W   2.6.35-rc4-tip-yh-04371-gd64e6c4-dirty #354
      [   93.264621] Call Trace:
      [   93.266533]  [<ffffffff81125e43>] pcpu_alloc+0x83a/0x8e7
      [   93.270710]  [<ffffffff81125f15>] __alloc_percpu+0x10/0x12
      [   93.285849]  [<ffffffff8140786c>] alloc_disk_node+0x94/0x16d
      [   93.291811]  [<ffffffff81407956>] alloc_disk+0x11/0x13
      [   93.306157]  [<ffffffff81503e51>] loop_alloc+0xa7/0x180
      [   93.310538]  [<ffffffff8277ef48>] loop_init+0x9b/0x1a4
      [   93.324909]  [<ffffffff8277eead>] ? loop_init+0x0/0x1a4
      [   93.329650]  [<ffffffff810001f2>] do_one_initcall+0x57/0x136
      [   93.345197]  [<ffffffff827486d0>] kernel_init+0x184/0x20e
      [   93.348146]  [<ffffffff81034954>] kernel_thread_helper+0x4/0x10
      [   93.365194]  [<ffffffff81c7cc3c>] ? restore_args+0x0/0x30
      [   93.369305]  [<ffffffff8274854c>] ? kernel_init+0x0/0x20e
      [   93.386011]  [<ffffffff81034950>] ? kernel_thread_helper+0x0/0x10
      [   93.392047] loop: out of memory
      Try to assign per_cpu(numa_node) early
      [akpm@linux-foundation.org: tidy up code comment]
      Signed-off-by: default avatarYinghai <yinghai@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Denys Vlasenko <vda.linux@googlemail.com>
      Acked-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  16. 31 May, 2010 1 commit
  17. 27 May, 2010 1 commit
  18. 03 Mar, 2010 1 commit
  19. 26 Feb, 2010 1 commit
  20. 10 Dec, 2009 1 commit
  21. 14 Aug, 2009 9 commits
    • Tejun Heo's avatar
      x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA · 4518e6a0
      Tejun Heo authored
      Embedding percpu first chunk allocator can now handle very sparse unit
      mapping.  Use embedding allocator instead of lpage for 64bit NUMA.
      This removes extra TLB pressure and the need to do complex and fragile
      dancing when changing page attributes.
      For 32bit, using very sparse unit mapping isn't a good idea because
      the vmalloc space is very constrained.  32bit NUMA machines aren't
      exactly the focus of optimization and it isn't very clear whether
      lpage performs better than page.  Use page first chunk allocator for
      32bit NUMAs.
      As this leaves setup_pcpu_*() functions pretty much empty, fold them
      into setup_per_cpu_areas().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andi Kleen <andi@firstfloor.org>
    • Tejun Heo's avatar
      percpu: update embedding first chunk allocator to handle sparse units · c8826dd5
      Tejun Heo authored
      Now that percpu core can handle very sparse units, given that vmalloc
      space is large enough, embedding first chunk allocator can use any
      memory to build the first chunk.  This patch teaches
      pcpu_embed_first_chunk() about distances between cpus and to use
      alloc/free callbacks to allocate node specific areas for each group
      and use them for the first chunk.
      This brings the benefits of embedding allocator to NUMA configurations
      - no extra TLB pressure with the flexibility of unified dynamic
      allocator and no need to restructure arch code to build memory layout
      suitable for percpu.  With units put into atom_size aligned groups
      according to cpu distances, using large page for dynamic chunks is
      also easily possible with falling back to reuglar pages if large
      allocation fails.
      Embedding allocator users are converted to specify NULL
      cpu_distance_fn, so this patch doesn't cause any visible behavior
      difference.  Following patches will convert them.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    • Tejun Heo's avatar
      percpu: add pcpu_unit_offsets[] · fb435d52
      Tejun Heo authored
      Currently units are mapped sequentially into address space.  This
      patch adds pcpu_unit_offsets[] which allows units to be mapped to
      arbitrary offsets from the chunk base address.  This is necessary to
      allow sparse embedding which might would need to allocate address
      ranges and memory areas which aren't aligned to unit size but
      allocation atom size (page or large page size).  This also simplifies
      things a bit by removing the need to calculate offset from unit
      With this change, there's no need for the arch code to know
      pcpu_unit_size.  Update pcpu_setup_first_chunk() and first chunk
      allocators to return regular 0 or -errno return code instead of unit
      size or -errno.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: David S. Miller <davem@davemloft.net>
    • Tejun Heo's avatar
      percpu: introduce pcpu_alloc_info and pcpu_group_info · fd1e8a1f
      Tejun Heo authored
      Till now, non-linear cpu->unit map was expressed using an integer
      array which maps each cpu to a unit and used only by lpage allocator.
      Although how many units have been placed in a single contiguos area
      (group) is known while building unit_map, the information is lost when
      the result is recorded into the unit_map array.  For lpage allocator,
      as all allocations are done by lpages and whether two adjacent lpages
      are in the same group or not is irrelevant, this didn't cause any
      problem.  Non-linear cpu->unit mapping will be used for sparse
      embedding and this grouping information is necessary for that.
      This patch introduces pcpu_alloc_info which contains all the
      information necessary for initializing percpu allocator.
      pcpu_alloc_info contains array of pcpu_group_info which describes how
      units are grouped and mapped to cpus.  pcpu_group_info also has
      base_offset field to specify its offset from the chunk's base address.
      pcpu_build_alloc_info() initializes this field as if all groups are
      allocated back-to-back as is currently done but this will be used to
      sparsely place groups.
      pcpu_alloc_info is a rather complex data structure which contains a
      flexible array which in turn points to nested cpu_map arrays.
      * pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
        help dealing with pcpu_alloc_info.
      * pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
        generalized and renamed to pcpu_build_alloc_info().
        @cpu_distance_fn may be NULL indicating that all cpus are of
      * pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
        generalized and renamed to pcpu_dump_alloc_info().  It now also
        prints which group each alloc unit belongs to.
      * pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
        separate parameters.  All first chunk allocators are updated to use
        pcpu_build_alloc_info() to build alloc_info and call
        pcpu_setup_first_chunk() with it.  This has the side effect of
        packing units for sparse possible cpus.  ie. if cpus 0, 2 and 4 are
        possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
      * x86 setup_pcpu_lpage() is updated to deal with alloc_info.
      * sparc64 setup_per_cpu_areas() is updated to build alloc_info.
      Although the changes made by this patch are pretty pervasive, it
      doesn't cause any behavior difference other than packing of sparse
      cpus.  It mostly changes how information is passed among
      initialization functions and makes room for more flexibility.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Miller <davem@davemloft.net>
    • Tejun Heo's avatar
      percpu: add @align to pcpu_fc_alloc_fn_t · 3cbc8565
      Tejun Heo authored
      pcpu_fc_alloc_fn_t is about to see more interesting usage, add @align
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    • Tejun Heo's avatar
      percpu: drop @static_size from first chunk allocators · 9a773769
      Tejun Heo authored
      First chunk allocators assume percpu areas have been linked using one
      of PERCPU_*() macros and depend on __per_cpu_load symbol defined by
      those macros, so there isn't much point in passing in static area size
      explicitly when it can be easily calculated from __per_cpu_start and
      __per_cpu_end.  Drop @static_size from all percpu first chunk
      allocators and helpers.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    • Tejun Heo's avatar
      percpu: generalize first chunk allocator selection · f58dc01b
      Tejun Heo authored
      Now that all first chunk allocators are in mm/percpu.c, it makes sense
      to make generalize percpu_alloc kernel parameter.  Define PCPU_FC_*
      and set pcpu_chosen_fc using early_param() in mm/percpu.c.  Arch code
      can use the set value to determine which first chunk allocator to use.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    • Tejun Heo's avatar
      percpu: rename 4k first chunk allocator to page · 00ae4064
      Tejun Heo authored
      Page size isn't always 4k depending on arch and configuration.  Rename
      4k first chunk allocator to page.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: David Howells <dhowells@redhat.com>
    • Tejun Heo's avatar
      percpu, sparc64: fix sparse possible cpu map handling · 74d46d6b
      Tejun Heo authored
      percpu code has been assuming num_possible_cpus() == nr_cpu_ids which
      is incorrect if cpu_possible_map contains holes.  This causes percpu
      code to access beyond allocated memories and vmalloc areas.  On a
      sparc64 machine with cpus 0 and 2 (u60), this triggers the following
      warning or fails boot.
       WARNING: at /devel/tj/os/work/mm/vmalloc.c:106 vmap_page_range_noflush+0x1f0/0x240()
       Modules linked in:
       Call Trace:
        [00000000004b17d0] vmap_page_range_noflush+0x1f0/0x240
        [00000000004b1840] map_vm_area+0x20/0x60
        [00000000004b1950] __vmalloc_area_node+0xd0/0x160
        [0000000000593434] deflate_init+0x14/0xe0
        [0000000000583b94] __crypto_alloc_tfm+0xd4/0x1e0
        [00000000005844f0] crypto_alloc_base+0x50/0xa0
        [000000000058b898] alg_test_comp+0x18/0x80
        [000000000058dad4] alg_test+0x54/0x180
        [000000000058af00] cryptomgr_test+0x40/0x60
        [0000000000473098] kthread+0x58/0x80
        [000000000042b590] kernel_thread+0x30/0x60
        [0000000000472fd0] kthreadd+0xf0/0x160
       ---[ end trace 429b268a213317ba ]---
      This patch fixes generic percpu functions and sparc64
      setup_per_cpu_areas() so that they handle sparse cpu_possible_map
      Please note that on x86, cpu_possible_map() doesn't contain holes and
      thus num_possible_cpus() == nr_cpu_ids and this patch doesn't cause
      any behavior difference.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@elte.hu>
  22. 03 Jul, 2009 4 commits
    • Tejun Heo's avatar
      percpu: teach large page allocator about NUMA · a530b795
      Tejun Heo authored
      Large page first chunk allocator is primarily used for NUMA machines;
      however, its NUMA handling is extremely simplistic.  Regardless of
      their proximity, each cpu is put into separate large page just to
      return most of the allocated space back wasting large amount of
      vmalloc space and increasing cache footprint.
      This patch teachs NUMA details to large page allocator.  Given
      processor proximity information, pcpu_lpage_build_unit_map() will find
      fitting cpu -> unit mapping in which cpus in LOCAL_DISTANCE share the
      same large page and not too much virtual address space is wasted.
      This greatly reduces the unit and thus chunk size and wastes much less
      address space for the first chunk.  For example, on 4/4 NUMA machine,
      the original code occupied 16MB of virtual space for the first chunk
      while the new code only uses 4MB - one 2MB page for each node.
      [ Impact: much better space efficiency on NUMA machines ]
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Miller <davem@davemloft.net>
    • Tejun Heo's avatar
      x86,percpu: generalize lpage first chunk allocator · 8c4bfc6e
      Tejun Heo authored
      Generalize and move x86 setup_pcpu_lpage() into
      pcpu_lpage_first_chunk().  setup_pcpu_lpage() now is a simple wrapper
      around the generalized version.  Other than taking size parameters and
      using arch supplied callbacks to allocate/free/map memory,
      pcpu_lpage_first_chunk() is identical to the original implementation.
      This simplifies arch code and will help converting more archs to
      dynamic percpu allocator.
      While at it, factor out pcpu_calc_fc_sizes() which is common to
      pcpu_embed_first_chunk() and pcpu_lpage_first_chunk().
      [ Impact: code reorganization and generalization ]
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
    • Tejun Heo's avatar
      x86,percpu: generalize 4k first chunk allocator · d4b95f80
      Tejun Heo authored
      Generalize and move x86 setup_pcpu_4k() into pcpu_4k_first_chunk().
      setup_pcpu_4k() now is a simple wrapper around the generalized
      version.  Other than taking size parameters and using arch supplied
      callbacks to allocate/free memory, pcpu_4k_first_chunk() is identical
      to the original implementation.
      This simplifies arch code and will help converting more archs to
      dynamic percpu allocator.
      While at it, s/pcpu_populate_pte_fn_t/pcpu_fc_populate_pte_fn_t/ for
      [ Impact: code reorganization and generalization ]
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
    • Tejun Heo's avatar
      percpu: drop @unit_size from embed first chunk allocator · 788e5abc
      Tejun Heo authored
      The only extra feature @unit_size provides is making dead space at the
      end of the first chunk which doesn't have any valid usecase.  Drop the
      parameter.  This will increase consistency with generalized 4k
      James Bottomley spotted missing conversion for the default
      setup_per_cpu_areas() which caused build breakage on all arcsh which
      use it.
      [ Impact: drop unused code path ]
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Ingo Molnar <mingo@elte.hu>
  23. 22 Jun, 2009 6 commits
    • Tejun Heo's avatar
      x86: ensure percpu lpage doesn't consume too much vmalloc space · 0017c869
      Tejun Heo authored
      On extreme configuration (e.g. 32bit 32-way NUMA machine), lpage
      percpu first chunk allocator can consume too much of vmalloc space.
      Make it fall back to 4k allocator if the consumption goes over 20%.
      [ Impact: add sanity check for lpage percpu first chunk allocator ]
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJan Beulich <JBeulich@novell.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
    • Tejun Heo's avatar
      x86: implement percpu_alloc kernel parameter · fa8a7094
      Tejun Heo authored
      According to Andi, it isn't clear whether lpage allocator is worth the
      trouble as there are many processors where PMD TLB is far scarcer than
      PTE TLB.  The advantage or disadvantage probably depends on the actual
      size of percpu area and specific processor.  As performance
      degradation due to TLB pressure tends to be highly workload specific
      and subtle, it is difficult to decide which way to go without more
      This patch implements percpu_alloc kernel parameter to allow selecting
      which first chunk allocator to use to ease debugging and testing.
      While at it, make sure all the failure paths report why something
      failed to help determining why certain allocator isn't working.  Also,
      kill the "Great future plan" comment which had already been realized
      quite some time ago.
      [ Impact: allow explicit percpu first chunk allocator selection ]
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJan Beulich <JBeulich@novell.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
    • Tejun Heo's avatar
      x86: fix pageattr handling for lpage percpu allocator and re-enable it · e59a1bb2
      Tejun Heo authored
      lpage allocator aliases a PMD page for each cpu and returns whatever
      is unused to the page allocator.  When the pageattr of the recycled
      pages are changed, this makes the two aliases point to the overlapping
      regions with different attributes which isn't allowed and known to
      cause subtle data corruption in certain cases.
      This can be handled in simliar manner to the x86_64 highmap alias.
      pageattr code should detect if the target pages have PMD alias and
      split the PMD alias and synchronize the attributes.
      pcpur allocator is updated to keep the allocated PMD pages map sorted
      in ascending address order and provide pcpu_lpage_remapped() function
      which binary searches the array to determine whether the given address
      is aliased and if so to which address.  pageattr is updated to use
      pcpu_lpage_remapped() to detect the PMD alias and split it up as
      necessary from cpa_process_alias().
      Jan Beulich spotted the original problem and incorrect usage of vaddr
      instead of laddr for lookup.
      With this, lpage percpu allocator should work correctly.  Re-enable
      [ Impact: fix subtle lpage pageattr bug and re-enable lpage ]
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJan Beulich <JBeulich@novell.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
    • Tejun Heo's avatar
      x86: prepare setup_pcpu_lpage() for pageattr fix · 0ff2587f
      Tejun Heo authored
      Make the following changes in preparation of coming pageattr updates.
      * Define and use array of struct pcpul_ent instead of array of
        pointers.  The only difference is ->cpu field which is set but
        unused yet.
      * Rename variables according to the above change.
      * Rename local variable vm to pcpul_vm and move it out of the
      [ Impact: no functional difference ]
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
    • Tejun Heo's avatar
      x86: rename remap percpu first chunk allocator to lpage · 97c9bf06
      Tejun Heo authored
      The "remap" allocator remaps large pages to build the first chunk;
      however, the name isn't very good because 4k allocator remaps too and
      the whole point of the remap allocator is using large page mapping.
      The allocator will be generalized and exported outside of x86, rename
      it to lpage before that happens.
      percpu_alloc kernel parameter is updated to accept both "remap" and
      "lpage" for lpage allocator.
      [ Impact: code cleanup, kernel parameter argument updated ]
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
    • Tejun Heo's avatar
      x86: fix duplicate free in setup_pcpu_remap() failure path · c5806df9
      Tejun Heo authored
      In the failure path, setup_pcpu_remap() tries to free the area which
      has already been freed to make holes in the large page.  Fix it.
      [ Impact: fix duplicate free in failure path ]
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>