1. 23 May, 2008 8 commits
    • ftrace: use dynamic patching for updating mcount calls · d61f82d0
      Steven Rostedt authored
      
      
      This patch replaces the indirect call to the mcount function
      pointer with a direct call that will be patched by the
      dynamic ftrace routines.
      
      On boot up, the mcount function calls the ftrace_stub function.
      When the dynamic ftrace code is initialized, the ftrace_stub
      call is replaced with a call to ftrace_record_ip, which records
      the instruction pointers of the locations that call it.
      
      Later, the ftraced daemon will call kstop_machine and patch all
      the locations to nops.
      
      When ftrace is enabled, the original calls to mcount will now
      be set to call ftrace_caller, which does a direct call
      to the registered ftrace function. This direct call is also patched
      when the function that should be called is updated.
      
      All patching is performed by a kstop_machine routine to prevent
      the race conditions associated with modifying code
      on the fly.
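      
      A minimal C sketch of the dispatch being replaced (ftrace_stub and
      ftrace_trace_function are real symbols; the surrounding scaffolding
      is illustrative):
      
        /* The tracer entry point type used throughout ftrace. */
        typedef void (*ftrace_func_t)(unsigned long ip,
                                      unsigned long parent_ip);
      
        /* Default target: does nothing. */
        static void ftrace_stub(unsigned long ip, unsigned long parent_ip)
        {
        }
      
        /* Before this patch every mcount hit went through this pointer: */
        static ftrace_func_t ftrace_trace_function = ftrace_stub;
      
        /* After this patch, mcount contains a direct `call` whose target
         * bytes are rewritten under kstop_machine: to ftrace_record_ip
         * while collecting call sites, to ftrace_caller when tracing is
         * enabled, and back to ftrace_stub otherwise. */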
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • ftrace: move memory management out of arch code · 3c1720f0
      Steven Rostedt authored
      
      
      This patch moves the memory management of the ftrace
      records out of the arch code and into the generic code,
      making the arch code simpler.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • ftrace: use nops instead of jmp · dfa60aba
      Steven Rostedt authored
      
      
      This patch overwrites the call to mcount with nops instead
      of a jmp over the mcount call.
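      
      In outline (a hedged sketch; the real code prefers the CPU's
      optimized multi-byte NOP sequences over five 0x90 bytes, and runs
      under kstop_machine):
      
        #include <linux/string.h>
      
        /* A 5-byte `call mcount` overwritten in place with NOPs.
         * 0x90 is the generic 1-byte x86 NOP. */
        static const unsigned char ftrace_nop[5] =
                { 0x90, 0x90, 0x90, 0x90, 0x90 };
      
        static void patch_to_nop(unsigned char *ip)
        {
                /* safe only while every other CPU is stopped */
                memcpy(ip, ftrace_nop, sizeof(ftrace_nop));
        }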
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • ftrace: dynamic enabling/disabling of function calls · 3d083395
      Steven Rostedt authored
      
      
      This patch adds a feature to dynamically replace the ftrace code
      with jmps, allowing a kernel with ftrace configured to run
      as fast as one without it configured.
      
      The way this works is: on bootup (if ftrace is enabled), an ftrace
      function is registered to record the instruction pointer of all
      places that call the function.
      
      Later, if there's still any code to patch, a kthread is awoken
      (rate limited to at most once a second) that performs a stop_machine,
      and replaces all the code that was called with a jmp over the call
      to ftrace. It only replaces what was found the previous time. Typically
      the system reaches equilibrium quickly after bootup and there's no code
      patching needed at all.
      
      e.g.
      
        call ftrace  /* 5 bytes */
      
      is replaced with
      
        jmp 3f  /* the jmp is 2 bytes and skips the 3 remaining bytes */
      3:
      
      When we want to enable ftrace for function tracing, the IP recording
      is removed, and stop_machine is called again to replace all the locations
      that were recorded back to the call of ftrace.  When it is disabled,
      we replace the code back to the jmp.
      
      Allocation is done by the kthread. If the ftrace recording function is
      called and we don't have any record slots available, then we simply
      skip that call. Once a second, a new page (if needed) is allocated for
      recording new ftrace function calls.  A large batch is allocated at
      boot up to capture most of the calls made there.
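      
      A condensed sketch of that allocation scheme (names and sizes are
      illustrative, not the exact kernel symbols):
      
        #define RECORDS_PER_BATCH 1024
      
        struct dyn_ftrace_rec {
                unsigned long ip;       /* address of the call site */
        };
      
        static struct dyn_ftrace_rec record_pool[RECORDS_PER_BATCH];
        static unsigned int next_record;
      
        /* Called from the traced path: never blocks, never allocates.
         * If the pool is exhausted the call site is simply dropped
         * until the kthread refills it (at most once a second). */
        static struct dyn_ftrace_rec *ftrace_record_slot(unsigned long ip)
        {
                if (next_record >= RECORDS_PER_BATCH)
                        return NULL;
                record_pool[next_record].ip = ip;
                return &record_pool[next_record++];
        }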
      
      Because we do this via stop_machine, we don't have to worry about another
      CPU executing an ftrace call as we modify it. But we do need to worry
      about NMIs, so all functions that might be called via NMI must be
      annotated with notrace_nmi. When this code is configured in, the NMI code
      will not call into the tracer.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • ftrace: trace preempt off critical timings · 6cd8a4bb
      Steven Rostedt authored
      
      
      Add preempt off timings. A lot of kernel core code is taken from the RT patch
      latency trace that was written by Ingo Molnar.
      
      This adds "preemptoff" and "preemptirqsoff" to /debugfs/tracing/available_tracers
      
      Now, instead of just tracing irqs off, preemption off can be selected
      to be recorded.
      
      When this is selected, it shares the same files as the irqs off timings.
      One can trace preemption off, irqs off, or the time during which either
      is off.
      
      By echoing "preemptoff" into /debugfs/tracing/current_tracer, recording
      of preempt off only is performed. "irqsoff" will only record the time
      irqs are disabled, and "preemptirqsoff" will take the total time irqs
      or preemption are disabled. Runtime switching of these options is now
      supported by simply echoing the appropriate tracer name into
      /debugfs/tracing/current_tracer.
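      
      For example, a minimal user-space equivalent of that echo (path as
      given above; error handling kept short):
      
        #include <fcntl.h>
        #include <unistd.h>
      
        /* Same effect as:
         *   echo preemptirqsoff > /debugfs/tracing/current_tracer */
        int main(void)
        {
                int fd = open("/debugfs/tracing/current_tracer", O_WRONLY);
      
                if (fd < 0)
                        return 1;
                write(fd, "preemptirqsoff", 14);
                close(fd);
                return 0;
        }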
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • ftrace: trace irq disabled critical timings · 81d68a96
      Steven Rostedt authored
      
      
      This patch adds latency tracing for critical timings
      (how long interrupts are disabled for).
      
       "irqsoff" is added to /debugfs/tracing/available_tracers
      
      Note:
        tracing_max_latency
          also holds the max latency for irqsoff (in usecs).
          (defaults to a large number, so latency tracing must be
           explicitly started)
      
        tracing_thresh
          threshold (in usecs): always print out if irqs are off
          for longer than stated here.
          If tracing_thresh is non-zero, then tracing_max_latency
          is ignored.
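      
      Assuming tracing_thresh is exposed as a writable file alongside the
      tracers (a hedged sketch), setting a 100 usec reporting threshold
      looks like:
      
        #include <fcntl.h>
        #include <unistd.h>
      
        /* Report every irqs-off section longer than 100 usecs,
         * independent of the running maximum. */
        int main(void)
        {
                int fd = open("/debugfs/tracing/tracing_thresh", O_WRONLY);
      
                if (fd < 0)
                        return 1;
                write(fd, "100", 3);
                close(fd);
                return 0;
        }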
      
      Here's an example of a trace with ftrace_enabled = 0
      
      =======
      preemption latency trace v1.1.5 on 2.6.24-rc7
      --------------------------------------------------------------------
       latency: 100 us, #3/3, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
          -----------------
          | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
          -----------------
       => started at: _spin_lock_irqsave+0x2a/0xb7
       => ended at:   _spin_unlock_irqrestore+0x32/0x5f
      
                       _------=> CPU#
                      / _-----=> irqs-off
                     | / _----=> need-resched
                     || / _---=> hardirq/softirq
                     ||| / _--=> preempt-depth
                     |||| /
                     |||||     delay
         cmd     pid ||||| time  |   caller
            \   /    |||||   \   |   /
       swapper-0     1d.s3    0us+: _spin_lock_irqsave+0x2a/0xb7 (e1000_update_stats+0x47/0x64c [e1000])
       swapper-0     1d.s3  100us : _spin_unlock_irqrestore+0x32/0x5f (e1000_update_stats+0x641/0x64c [e1000])
       swapper-0     1d.s3  100us : trace_hardirqs_on_caller+0x75/0x89 (_spin_unlock_irqrestore+0x32/0x5f)
      
      vim:ft=help
      =======
      
      And this is a trace with ftrace_enabled == 1
      
      =======
      preemption latency trace v1.1.5 on 2.6.24-rc7
      --------------------------------------------------------------------
       latency: 102 us, #12/12, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
          -----------------
          | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
          -----------------
       => started at: _spin_lock_irqsave+0x2a/0xb7
       => ended at:   _spin_unlock_irqrestore+0x32/0x5f
      
                       _------=> CPU#
                      / _-----=> irqs-off
                     | / _----=> need-resched
                     || / _---=> hardirq/softirq
                     ||| / _--=> preempt-depth
                     |||| /
                     |||||     delay
         cmd     pid ||||| time  |   caller
            \   /    |||||   \   |   /
       swapper-0     1dNs3    0us+: _spin_lock_irqsave+0x2a/0xb7 (e1000_update_stats+0x47/0x64c [e1000])
       swapper-0     1dNs3   46us : e1000_read_phy_reg+0x16/0x225 [e1000] (e1000_update_stats+0x5e2/0x64c [e1000])
       swapper-0     1dNs3   46us : e1000_swfw_sync_acquire+0x10/0x99 [e1000] (e1000_read_phy_reg+0x49/0x225 [e1000])
       swapper-0     1dNs3   46us : e1000_get_hw_eeprom_semaphore+0x12/0xa6 [e1000] (e1000_swfw_sync_acquire+0x36/0x99 [e1000])
       swapper-0     1dNs3   47us : __const_udelay+0x9/0x47 (e1000_read_phy_reg+0x116/0x225 [e1000])
       swapper-0     1dNs3   47us+: __delay+0x9/0x50 (__const_udelay+0x45/0x47)
       swapper-0     1dNs3   97us : preempt_schedule+0xc/0x84 (__delay+0x4e/0x50)
       swapper-0     1dNs3   98us : e1000_swfw_sync_release+0xc/0x55 [e1000] (e1000_read_phy_reg+0x211/0x225 [e1000])
       swapper-0     1dNs3   99us+: e1000_put_hw_eeprom_semaphore+0x9/0x35 [e1000] (e1000_swfw_sync_release+0x50/0x55 [e1000])
       swapper-0     1dNs3  101us : _spin_unlock_irqrestore+0xe/0x5f (e1000_update_stats+0x641/0x64c [e1000])
       swapper-0     1dNs3  102us : _spin_unlock_irqrestore+0x32/0x5f (e1000_update_stats+0x641/0x64c [e1000])
       swapper-0     1dNs3  102us : trace_hardirqs_on_caller+0x75/0x89 (_spin_unlock_irqrestore+0x32/0x5f)
      
      vim:ft=help
      =======
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • ftrace: add basic support for gcc profiler instrumentation · 16444a8a
      Arnaldo Carvalho de Melo authored
      
      
      If CONFIG_FTRACE is selected and /proc/sys/kernel/ftrace_enabled is
      set to a non-zero value, the ftrace routine will be called every time
      we enter a kernel function that is not marked with the "notrace"
      attribute.
      
      The ftrace routine will then call a registered function if a function
      happens to be registered.
      
      [ This code has been highly hacked by Steven Rostedt and Ingo Molnar,
        so don't blame Arnaldo for all of this ;-) ]
      
      Update:
        It is now possible to register more than one ftrace function.
        If only one ftrace function is registered, that will be the
        function that ftrace calls directly. If more than one function
        is registered, then ftrace will call a function that will loop
        through the functions to call.
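      
      A condensed sketch of that dispatch (struct ftrace_ops carries just
      the callback and a next pointer, as the commit describes; the
      surrounding details are illustrative):
      
        typedef void (*ftrace_func_t)(unsigned long ip,
                                      unsigned long parent_ip);
      
        struct ftrace_ops {
                ftrace_func_t      func;
                struct ftrace_ops  *next;
        };
      
        static struct ftrace_ops *ftrace_list;
      
        /* Installed as the traced target only when more than one
         * callback is registered; with a single callback, ftrace
         * calls that function directly instead. */
        static void ftrace_list_func(unsigned long ip, unsigned long parent_ip)
        {
                struct ftrace_ops *op;
      
                for (op = ftrace_list; op; op = op->next)
                        op->func(ip, parent_ip);
        }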
      Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86: add notrace annotations to vsyscall. · 23adec55
      Steven Rostedt authored
      
      
      Add the notrace annotations to the vsyscall functions - there we are
      not in kernel context yet, so the tracer function cannot (and must not)
      be called.
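      
      The annotation itself is a one-line GCC attribute; the example
      function below is illustrative:
      
        /* From the kernel's compiler glue: suppress profiling
         * instrumentation (the mcount call) for one function. */
        #define notrace __attribute__((no_instrument_function))
      
        static unsigned long notrace vsyscall_example(void)
        {
                /* runs before kernel context exists, so the compiler
                 * must not emit a call into the tracer here */
                return 0;
        }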
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  2. 19 May, 2008 1 commit
  3. 18 May, 2008 3 commits
  4. 17 May, 2008 3 commits
    • x86: disable mwait for AMD family 10H/11H CPUs · e9623b35
      Thomas Gleixner authored
      The previous revert of 0c07ee38 left
      out the mwait disable condition for AMD family 10H/11H CPUs.
      
      Andreas Herrman said:
      
      It depends on the CPU. For AMD CPUs that support MWAIT this is wrong.
      Family 0x10 and 0x11 CPUs will enter C1 on HLT. Powersavings then
      depend on a clock divisor and current Pstate of the core.
      
      If all cores of a processor are in halt state (C1) the processor can
      enter the C1E (C1 enhanced) state. If mwait is used this will never
      happen.
      
      Thus HLT saves more power than MWAIT here.
      
      It might be best to switch off the mwait flag for these AMD CPU
      families like it was introduced with commit f039b754
      (x86: Don't use MWAIT on AMD Family 10).
      
      Re-add the AMD families 10H/11H check and disable the mwait usage for
      those.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86: fix crash on cpu hotplug on pat-incapable machines · 31f4d870
      Avi Kivity authored
      
      
      pat_disable() is __init, which means it goes away after booting is complete.
      Unfortunately it is used by the hotplug code if the machine is not
      pat-capable, causing a crash.
      
      Fix by marking pat_disable() as __cpuinit.
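      
      In outline (the function body here is a plausible sketch with an
      illustrative flag name; the annotation change is the point):
      
        #include <linux/init.h>
        #include <linux/kernel.h>
      
        static int pat_enabled_flag = 1;        /* illustrative name */
      
        /* __cpuinit instead of __init: the function must survive
         * boot so that CPU hotplug can still call it. */
        void __cpuinit pat_disable(char *reason)
        {
                pat_enabled_flag = 0;
                printk(KERN_INFO "%s\n", reason);
        }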
      Signed-off-by: Avi Kivity <avi@qumranet.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: remove mwait capability C-state check · a738d897
      Ingo Molnar authored
      Vegard Nossum reports:
      
      | powertop shows between 200-400 wakeups/second with the description
      | "<kernel IPI>: Rescheduling interrupts" when all processors have load (e.g.
      | I need to run two busy-loops on my 2-CPU system for this to show up).
      |
      | The bisect resulted in this commit:
      |
      | commit 0c07ee38
      | Date:   Wed Jan 30 13:33:16 2008 +0100
      |
      |     x86: use the correct cpuid method to detect MWAIT support for C states
      
      Remove the functional effects of this patch and make mwait unconditional.
      
      A future patch will turn off mwait on specific CPUs where that causes
      power to be wasted.
      Bisected-by: Vegard Nossum <vegard.nossum@gmail.com>
      Tested-by: Vegard Nossum <vegard.nossum@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  5. 16 May, 2008 1 commit
  6. 13 May, 2008 7 commits
    • x86: user_regset_view table fix for ia32 on 64-bit · 1f465f4e
      Roland McGrath authored
      
      
      The user_regset_view table for the 32-bit regsets on the 64-bit build had
      the wrong sizes for the FP regsets.  This bug had no user-visible effect
      (just on kernel modules using the user_regset interfaces and the like).
      But the fix is trivial and risk-free.
      Signed-off-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: arch/x86/mm/pat.c - fix warning · afc85343
      Pranith Kumar authored
      
      
      fix this warning:
      
       arch/x86/mm/pat.c: In function `phys_mem_access_prot_allowed':
       arch/x86/mm/pat.c:558: warning: long long unsigned int format, long
       unsigned int arg (arg 6)
       arch/x86/mm/pat.c: In function `map_devmem':
       arch/x86/mm/pat.c:580: warning: long long unsigned int format, long
       unsigned int arg (arg 6)
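      
      The usual shape of this kind of fix (a hedged sketch, not the exact
      hunk from the commit): make the format specifier and the argument
      widths agree by casting the 64-bit value explicitly.
      
        #include <linux/kernel.h>
        #include <linux/types.h>
      
        static void report_range(u64 start, u64 end)
        {
                /* %llx expects unsigned long long; the cast keeps
                 * format and argument in agreement on both 32-bit
                 * and 64-bit builds */
                printk(KERN_INFO "range 0x%llx-0x%llx\n",
                       (unsigned long long)start,
                       (unsigned long long)end);
        }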
      Signed-off-by: D Pranith Kumar <bobby.prani@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: fix csum_partial() export · 89804c02
      Ingo Molnar authored
      
      
      Fix this symbol export problem:
      
          Building modules, stage 2.
          MODPOST 193 modules
          ERROR: "csum_partial" [fs/reiserfs/reiserfs.ko] undefined!
          make[1]: *** [__modpost] Error 1
          make: *** [modules] Error 2
      
      This is due to a known weakness of symbol exports: if a symbol's
      only in-core user is an EXPORT_SYMBOL from a lib-y section, the
      symbol is not linked in.
      
      The solution is to move the export to x8664_ksyms_64.c - but the real
      solution would be to fix kbuild.
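      
      The move itself is small (sketch of the x8664_ksyms_64.c side):
      
        #include <linux/module.h>
        #include <asm/checksum.h>
      
        /* Exporting from an always-linked file keeps the symbol
         * alive even though its only in-tree users are modules. */
        EXPORT_SYMBOL(csum_partial);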
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: early_init_centaur(): use set_cpu_cap() · 8c45a4e4
      Andrew Morton authored
      
      
      arch/x86/kernel/setup_64.c:954: warning: passing argument 2 of 'set_bit' from incompatible pointer type
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: fix app crashes after SMP resume · 61165d7a
      Hugh Dickins authored
      After resume on a 2cpu laptop, kernel builds collapse with a sed hang,
      sh or make segfault (often on 20295564), real-time signal to cc1 etc.
      
      Several hurdles to jump, but a manually-assisted bisect led to -rc1's
      d2bcbad5 (x86: do not zap_low_mappings in __smp_prepare_cpus).  Though
      the low mappings were removed at bootup, they were left behind (with
      Global flags helping to keep them in TLB) after resume or cpu online,
      causing the crashes seen.
      
      Reinstate zap_low_mappings (with local __flush_tlb_all) for each cpu_up
      on x86_32.  This used to be serialized by smp_commenced_mask: that's now
      gone, but a low_mappings flag will do.  No need for native_smp_cpus_done
      to repeat the zap: let mem_init zap BSP's low mappings just like on UP.
      
      (In passing, fix error code from native_cpu_up: do_boot_cpu returns a
      variety of diagnostic values, Dprintk what it says but convert to -EIO.
      And save_pg_dir separately before zap_low_mappings: doesn't matter now,
      but zapping twice in succession wiped out resume's swsusp_pg_dir.)
      
      That worked well on the duo and one quad, but wouldn't boot 3rd or 4th
      cpu on P4 Xeon, oopsing just after unlock_ipi_call_lock.  The TLB flush
      IPI now being sent reveals a long-standing bug: the booting cpu has its
      APIC readied in smp_callin at the top of start_secondary, but isn't put
      into the cpu_online_map until just before that unlock_ipi_call_lock.
      
      So native_smp_call_function_mask to online cpus would send_IPI_allbutself,
      including the cpu just coming up, though it has been excluded from the
      count to wait for: by the time it handles the IPI, the call data on
      native_smp_call_function_mask's stack may well have been overwritten.
      
      So fall back to send_IPI_mask while cpu_online_map does not match
      cpu_callout_map: perhaps there's a better APICological fix to be
      made at the start_secondary end, but I wouldn't know that.
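      
      A condensed sketch of that fallback, using the 2008-era cpumask names
      from the text (the exact function signatures are an assumption):
      
        /* Broadcasting would also hit a cpu that is online but not
         * yet counted in the callout mask; send targeted IPIs until
         * the two masks agree. */
        static void send_call_function_ipi(cpumask_t allbutself, int vector)
        {
                if (cpus_equal(cpu_online_map, cpu_callout_map))
                        send_IPI_allbutself(vector);
                else
                        send_IPI_mask(allbutself, vector);
        }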
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86/PCI: X86_PAT & mprotect · 77db9885
      Venki Pallipadi authored
      
      
      Some versions of X used the mprotect workaround to change the caching
      type from UC to WB, so that they could then use mtrr to program WC for
      that region [1].  Change the mmap of pci space through /sys or /proc
      interfaces from UC to UC_MINUS.  With this change, X will not need the
      mprotect workaround to get WC type since the MTRR mapping type will be
      honored.
      
      The bug in mprotect that clobbers PAT bits is fixed in a follow on patch. So,
      this X workaround will stop working as well.
      Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    • x86/PCI: fix broken ISA DMA · 4a367f3a
      Takashi Iwai authored
      Rene Herman reported:
      
      > commit 8779f2fc
      >
      > "x86: don't try to allocate from DMA zone at first"
      >
      > breaks all of ISA DMA. Or all of ALSA ISA DMA at least. All
      > ISA soundcards are silent following that commit -- no error
      > messages, everything appears fine, just silence.
      
      That patch is buggy: we had an implicit assumption that
      dev == NULL means an ISA device that requires 24-bit DMA.
      
      The recent work on x86 dma_alloc_coherent() breaks that ISA DMA buffer
      allocation, which is represented by "dev == NULL" and requires 24-bit
      DMA implicitly.
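      
      A hedged sketch of the contract being restored (DMA_24BIT_MASK is the
      era's name for the low-16MB mask; the helper is illustrative):
      
        #include <linux/dma-mapping.h>
        #include <linux/gfp.h>
      
        /* dev == NULL has historically meant "ISA device, 24-bit
         * DMA", so such allocations must start in ZONE_DMA rather
         * than merely fall back to it. */
        static gfp_t isa_dma_gfp(struct device *dev, gfp_t gfp)
        {
                if (dev == NULL || dev->coherent_dma_mask <= DMA_24BIT_MASK)
                        gfp |= GFP_DMA;
                return gfp;
        }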
      Bisected-by: Rene Herman <rene.herman@keyaccess.nl>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
  7. 12 May, 2008 3 commits
  8. 10 May, 2008 5 commits
  9. 09 May, 2008 1 commit
  10. 08 May, 2008 3 commits
  11. 07 May, 2008 1 commit
  12. 06 May, 2008 1 commit
    • x86: fix PAE pmd_bad bootup warning · aeed5fce
      Hugh Dickins authored
      Fix warning from pmd_bad() at bootup on a HIGHMEM64G HIGHPTE x86_32.
      
      That came from 9fc34113 (x86: debug pmd_bad());
      but we understand now that the typecasting was wrong for PAE in the previous
      version: pagetable pages above 4GB looked bad and stopped Arjan from booting.
      
      And revert that cded932b (x86: fix pmd_bad
      and pud_bad to support huge pages).  It was the wrong way round: we shouldn't
      weaken every pmd_bad and pud_bad check to let huge pages slip through - in
      part they check that we _don't_ have a huge page where it's not expected.
      
      Put the x86 pmd_bad() and pud_bad() definitions back to what they have long
      been: they can be improved (x86_32 should use PTE_MASK, to stop PAE thinking
      junk in the upper word is good; and x86_64 should follow x86_32's stricter
      comparison, to stop thinking any subset of required bits is good); but that
      should be a later patch.
      
      Fix Hans' good observation that follow_page() will never find pmd_huge()
      because that would have already failed the pmd_bad test: test pmd_huge in
      between the pmd_none and pmd_bad tests.  Tighten x86's pmd_huge() check?
      No, once it's a hugepage entry, it can get quite far from a good pmd: for
      example, PROT_NONE leaves it with only ACCESSED of the KERN_PGTABLE bits.
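      
      Condensed, the ordering follow_page() ends up with is (an
      excerpt-style sketch, not the full function):
      
        pmd = pmd_offset(pud, address);
        if (pmd_none(*pmd))
                goto no_page_table;
        if (pmd_huge(*pmd)) {
                /* handle the huge pmd before pmd_bad() can reject it */
                page = follow_huge_pmd(mm, address, pmd, write);
                goto out;
        }
        if (unlikely(pmd_bad(*pmd)))
                goto no_page_table;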
      
      However... though follow_page() contains this and another test for huge
      pages, so it's nice to keep it working on them, where does it actually get
      called on a huge page?  get_user_pages() checks is_vm_hugetlb_page(vma)
      to call alternative hugetlb processing, as does unmap_vmas() and others.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Earlier-version-tested-by: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jeff Chua <jeff.chua.linux@gmail.com>
      Cc: Hans Rosenfeld <hans.rosenfeld@amd.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 05 May, 2008 3 commits