1. 11 May, 2007 1 commit
    • Christoph Lameter's avatar
      VM statistics: Make timer deferrable · 39bf6270
      Christoph Lameter authored
      
      
      VM statistics updates do not matter if the kernel is in idle powersaving
      mode.  So allow the timer to be deferred.
      
      It would be better though if we could switch the timer between deferrable
      and nondeferrable based on differentials present.  The timer would start
      out nondeferrable and if we find that there were no updates in the last
      statistics interval then we would switch the timer to deferrable.  If the
      timer later finds again that there are differentials then go to
      nondeferrable again.
      
      And yet another way would be to run the timer shortly before going to idle?
      
      The solution here means that the VM counters may be slightly off during
      idle since differentials may be still pending while the timer is deferred.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      39bf6270
  2. 09 May, 2007 4 commits
    • Christoph Lameter's avatar
      Move remote node draining out of slab allocators · 4037d452
      Christoph Lameter authored
      
      
      Currently the slab allocators contain callbacks into the page allocator to
      perform the draining of pagesets on remote nodes.  This requires SLUB to have
      a whole subsystem in order to be compatible with SLAB.  Moving node draining
      out of the slab allocators avoids a section of code in SLUB.
      
      Move the node draining so that is is done when the vm statistics are updated.
      At that point we are already touching all the cachelines with the pagesets of
      a processor.
      
      Add a expire counter there.  If we have to update per zone or global vm
      statistics then assume that the pageset will require subsequent draining.
      
      The expire counter will be decremented on each vm stats update pass until it
      reaches zero.  Then we will drain one batch from the pageset.  The draining
      will cause vm counter updates which will then cause another expiration until
      the pcp is empty.  So we will drain a batch every 3 seconds.
      
      Note that remote node draining is a somewhat esoteric feature that is required
      on large NUMA systems because otherwise significant portions of system memory
      can become trapped in pcp queues.  The number of pcp is determined by the
      number of processors and nodes in a system.  A system with 4 processors and 2
      nodes has 8 pcps which is okay.  But a system with 1024 processors and 512
      nodes has 512k pcps with a high potential for large amount of memory being
      caught in them.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4037d452
    • Christoph Lameter's avatar
      Make vm statistics update interval configurable · 77461ab3
      Christoph Lameter authored
      
      
      Make it configurable.  Code in mm makes the vm statistics intervals
      independent from the cache reaper use that opportunity to make it
      configurable.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      77461ab3
    • Christoph Lameter's avatar
      vmstat: use our own timer events · d1187ed2
      Christoph Lameter authored
      
      
      vmstat is currently using the cache reaper to periodically bring the
      statistics up to date.  The cache reaper does only exists in SLUB as a way to
      provide compatibility with SLAB.  This patch removes the vmstat calls from the
      slab allocators and provides its own handling.
      
      The advantage is also that we can use a different frequency for the updates.
      Refreshing vm stats is a pretty fast job so we can run this every second and
      stagger this by only one tick.  This will lead to some overlap in large
      systems.  F.e a system running at 250 HZ with 1024 processors will have 4 vm
      updates occurring at once.
      
      However, the vm stats update only accesses per node information.  It is only
      necessary to stagger the vm statistics updates per processor in each node.  Vm
      counter updates occurring on distant nodes will not cause cacheline
      contention.
      
      We could implement an alternate approach that runs the first processor on each
      node at the second and then each of the other processor on a node on a
      subsequent tick.  That may be useful to keep a large amount of the second free
      of timer activity.  Maybe the timer folks will have some feedback on this one?
      
      [jirislaby@gmail.com: add missing break]
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarJiri Slaby <jirislaby@gmail.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d1187ed2
    • Rafael J. Wysocki's avatar
      Add suspend-related notifications for CPU hotplug · 8bb78442
      Rafael J. Wysocki authored
      
      
      Since nonboot CPUs are now disabled after tasks and devices have been
      frozen and the CPU hotplug infrastructure is used for this purpose, we need
      special CPU hotplug notifications that will help the CPU-hotplug-aware
      subsystems distinguish normal CPU hotplug events from CPU hotplug events
      related to a system-wide suspend or resume operation in progress.  This
      patch introduces such notifications and causes them to be used during
      suspend and resume transitions.  It also changes all of the
      CPU-hotplug-aware subsystems to take these notifications into consideration
      (for now they are handled in the same way as the corresponding "normal"
      ones).
      
      [oleg@tv-sign.ru: cleanups]
      Signed-off-by: default avatarRafael J. Wysocki <rjw@sisk.pl>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: default avatarOleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8bb78442
  3. 11 Feb, 2007 7 commits
  4. 07 Dec, 2006 2 commits
  5. 28 Oct, 2006 1 commit
    • Martin Bligh's avatar
      [PATCH] vmscan: Fix temp_priority race · 3bb1a852
      Martin Bligh authored
      
      
      The temp_priority field in zone is racy, as we can walk through a reclaim
      path, and just before we copy it into prev_priority, it can be overwritten
      (say with DEF_PRIORITY) by another reclaimer.
      
      The same bug is contained in both try_to_free_pages and balance_pgdat, but
      it is fixed slightly differently.  In balance_pgdat, we keep a separate
      priority record per zone in a local array.  In try_to_free_pages there is
      no need to do this, as the priority level is the same for all zones that we
      reclaim from.
      
      Impact of this bug is that temp_priority is copied into prev_priority, and
      setting this artificially high causes reclaimers to set distress
      artificially low.  They then fail to reclaim mapped pages, when they are,
      in fact, under severe memory pressure (their priority may be as low as 0).
      This causes the OOM killer to fire incorrectly.
      
      From: Andrew Morton <akpm@osdl.org>
      
      __zone_reclaim() isn't modifying zone->prev_priority.  But zone->prev_priority
      is used in the decision whether or not to bring mapped pages onto the inactive
      list.  Hence there's a risk here that __zone_reclaim() will fail because
      zone->prev_priority ir large (ie: low urgency) and lots of mapped pages end up
      stuck on the active list.
      
      Fix that up by decreasing (ie making more urgent) zone->prev_priority as
      __zone_reclaim() scans the zone's pages.
      
      This bug perhaps explains why ZONE_RECLAIM_PRIORITY was created.  It should be
      possible to remove that now, and to just start out at DEF_PRIORITY?
      
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      3bb1a852
  6. 04 Oct, 2006 1 commit
  7. 27 Sep, 2006 2 commits
  8. 26 Sep, 2006 3 commits
  9. 01 Sep, 2006 2 commits
    • Christoph Lameter's avatar
      [PATCH] ZVC: Scale thresholds depending on the size of the system · df9ecaba
      Christoph Lameter authored
      
      
      The ZVC counter update threshold is currently set to a fixed value of 32.
      This patch sets up the threshold depending on the number of processors and
      the sizes of the zones in the system.
      
      With the current threshold of 32, I was able to observe slight contention
      when more than 130-140 processors concurrently updated the counters.  The
      contention vanished when I either increased the threshold to 64 or used
      Andrew's idea of overstepping the interval (see ZVC overstep patch).
      
      However, we saw contention again at 220-230 processors.  So we need higher
      values for larger systems.
      
      But the current default is already a bit of an overkill for smaller
      systems.  Some systems have tiny zones where precision matters.  For
      example i386 and x86_64 have 16M DMA zones and either 900M ZONE_NORMAL or
      ZONE_DMA32.  These are even present on SMP and NUMA systems.
      
      The patch here sets up a threshold based on the number of processors in the
      system and the size of the zone that these counters are used for.  The
      threshold should grow logarithmically, so we use fls() as an easy
      approximation.
      
      Results of tests on a system with 1024 processors (4TB RAM)
      
      The following output is from a test allocating 1GB of memory concurrently
      on each processor (Forking the process.  So contention on mmap_sem and the
      pte locks is not a factor):
      
                             X                   MIN
      TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU
      fork                   1      0.552      0.552      0.540    0.012      0.552
      fork                   4      0.552      0.548      2.164    0.036      2.200
      fork                  16      0.564      0.548      8.812    0.164      8.976
      fork                 128      0.580      0.572     72.204    1.208     73.412
      fork                 256      1.300      0.660    310.400    2.160    312.560
      fork                 512      3.512      0.696   1526.836    4.816   1531.652
      fork                1020     20.024      0.700  17243.176    6.688  17249.863
      
      So a threshold of 32 is fine up to 128 processors. At 256 processors contention
      becomes a factor.
      
      Overstepping the counter (earlier patch) improves the numbers a bit:
      
      fork                   4      0.552      0.548      2.164    0.040      2.204
      fork                  16      0.552      0.548      8.640    0.148      8.788
      fork                 128      0.556      0.548     69.676    0.956     70.632
      fork                 256      0.876      0.636    212.468    2.108    214.576
      fork                 512      2.276      0.672    997.324    4.260   1001.584
      fork                1020     13.564      0.680  11586.436    6.088  11592.523
      
      Still contention at 512 and 1020. Contention at 1020 is down by a third.
      256 still has a slight bit of contention.
      
      After this patch the counter threshold will be set to 125 which reduces
      contention significantly:
      
      fork                 128      0.560      0.548     69.776    0.932     70.708
      fork                 256      0.636      0.556    143.460    2.036    145.496
      fork                 512      0.640      0.548    284.244    4.236    288.480
      fork                1020      1.500      0.588   1326.152    8.892   1335.044
      
      [akpm@osdl.org: !SMP build fix]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      df9ecaba
    • Christoph Lameter's avatar
      [PATCH] ZVC: Overstep counters · a302eb4e
      Christoph Lameter authored
      
      
      Increments and decrements are usually grouped rather than mixed.  We can
      optimize the inc and dec functions for that case.
      
      Increment and decrement the counters by 50% more than the threshold in
      those cases and set the differential accordingly.  This decreases the need
      to update the atomic counters.
      
      The idea came originally from Andrew Morton.  The overstepping alone was
      sufficient to address the contention issue found when updating the global
      and the per zone counters from 160 processors.
      
      Also remove some code in dec_zone_page_state.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      a302eb4e
  10. 10 Jul, 2006 1 commit
  11. 30 Jun, 2006 14 commits
    • Christoph Lameter's avatar
      [PATCH] Light weight event counters · f8891e5e
      Christoph Lameter authored
      The remaining counters in page_state after the zoned VM counter patches
      have been applied are all just for show in /proc/vmstat.  They have no
      essential function for the VM.
      
      We use a simple increment of per cpu variables.  In order to avoid the most
      severe races we disable preempt.  Preempt does not prevent the race between
      an increment and an interrupt handler incrementing the same statistics
      counter.  However, that race is exceedingly rare, we may only loose one
      increment or so and there is no requirement (at least not in kernel) that
      the vm event counters have to be accurate.
      
      In the non preempt case this results in a simple increment for each
      counter.  For many architectures this will be reduced by the compiler to a
      single instruction.  This single instruction is atomic for i386 and x86_64.
       And therefore even the rare race condition in an interrupt is avoided for
      both architectures in most cases.
      
      The patchset also adds an off switch for embedded systems that allows a
      building of linux kernels without these counters.
      
      The implementation of these counters is through inline code that hopefully
      results in only a single instruction increment instruction being emitted
      (i386, x86_64) or in the increment being hidden though instruction
      concurrency (EPIC architectures such as ia64 can get that done).
      
      Benefits:
      - VM event counter operations usually reduce to a single inline instruction
        on i386 and x86_64.
      - No interrupt disable, only preempt disable for the preempt case.
        Preempt disable can also be avoided by moving the counter into a spinlock.
      - Handling is similar to zoned VM counters.
      - Simple and easily extendable.
      - Can be omitted to reduce memory use for embedded use.
      
      References:
      
      RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
      RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
      local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
      V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
      V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
      V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2
      
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      f8891e5e
    • Christoph Lameter's avatar
      [PATCH] Use Zoned VM Counters for NUMA statistics · ca889e6c
      Christoph Lameter authored
      The numa statistics are really event counters.  But they are per node and
      so we have had special treatment for these counters through additional
      fields on the pcp structure.  We can now use the per zone nature of the
      zoned VM counters to realize these.
      
      This will shrink the size of the pcp structure on NUMA systems.  We will
      have some room to add additional per zone counters that will all still fit
      in the same cacheline.
      
       Bits	Prior pcp size	  	Size after patch	We can add
       ------------------------------------------------------------------
       64	128 bytes (16 words)	80 bytes (10 words)	48
       32	 76 bytes (19 words)	56 bytes (14 words)	8 (64 byte cacheline)
      							72 (128 byte)
      
      Remove the special statistics for numa and replace them with zoned vm
      counters.  This has the side effect that global sums of these events now
      show up in /proc/vmstat.
      
      Also take the opportunity to move the zone_statistics() function from
      page_alloc.c into vmstat.c.
      
      Discussions:
      V2 http://marc.theaimsgroup.com/?t=115048227000002&r=1&w=2
      
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Acked-by: default avatarAndi Kleen <ak@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      ca889e6c
    • Andrew Morton's avatar
      [PATCH] zoned-vm-counters: remove read_page_state() · bab1846a
      Andrew Morton authored
      
      
      No callers.
      
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      bab1846a
    • Christoph Lameter's avatar
      [PATCH] zoned vm counters: conversion of nr_bounce to per zone counter · d2c5e30c
      Christoph Lameter authored
      
      
      Conversion of nr_bounce to a per zone counter
      
      nr_bounce is only used for proc output.  So it could be left as an event
      counter.  However, the event counters may not be accurate and nr_bounce is
      categorizing types of pages in a zone.  So we really need this to also be a
      per zone counter.
      
      [akpm@osdl.org: bugfix]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      d2c5e30c
    • Christoph Lameter's avatar
      [PATCH] zoned vm counters: conversion of nr_unstable to per zone counter · fd39fc85
      Christoph Lameter authored
      
      
      Conversion of nr_unstable to a per zone counter
      
      We need to do some special modifications to the nfs code since there are
      multiple cases of disposition and we need to have a page ref for proper
      accounting.
      
      This converts the last critical page state of the VM and therefore we need to
      remove several functions that were depending on GET_PAGE_STATE_LAST in order
      to make the kernel compile again.  We are only left with event type counters
      in page state.
      
      [akpm@osdl.org: bugfixes]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      fd39fc85
    • Christoph Lameter's avatar
      [PATCH] zoned vm counters: conversion of nr_writeback to per zone counter · ce866b34
      Christoph Lameter authored
      
      
      Conversion of nr_writeback to per zone counter.
      
      This removes the last page_state counter from arch/i386/mm/pgtable.c so we
      drop the page_state from there.
      
      [akpm@osdl.org: bugfix]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      ce866b34
    • Christoph Lameter's avatar
      [PATCH] zoned vm counters: conversion of nr_dirty to per zone counter · b1e7a8fd
      Christoph Lameter authored
      
      
      This makes nr_dirty a per zone counter.  Looping over all processors is
      avoided during writeback state determination.
      
      The counter aggregation for nr_dirty had to be undone in the NFS layer since
      we summed up the page counts from multiple zones.  Someone more familiar with
      NFS should probably review what I have done.
      
      [akpm@osdl.org: bugfix]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      b1e7a8fd
    • Christoph Lameter's avatar
      [PATCH] zoned vm counters: conversion of nr_pagetables to per zone counter · df849a15
      Christoph Lameter authored
      
      
      Conversion of nr_page_table_pages to a per zone counter
      
      [akpm@osdl.org: bugfix]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      df849a15
    • Christoph Lameter's avatar
      [PATCH] zoned vm counters: conversion of nr_slab to per zone counter · 9a865ffa
      Christoph Lameter authored
      
      
      - Allows reclaim to access counter without looping over processor counts.
      
      - Allows accurate statistics on how many pages are used in a zone by
        the slab. This may become useful to balance slab allocations over
        various zones.
      
      [akpm@osdl.org: bugfix]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      9a865ffa
    • Christoph Lameter's avatar
      [PATCH] zoned vm counters: split NR_ANON_PAGES off from NR_FILE_MAPPED · f3dbd344
      Christoph Lameter authored
      
      
      The current NR_FILE_MAPPED is used by zone reclaim and the dirty load
      calculation as the number of mapped pagecache pages.  However, that is not
      true.  NR_FILE_MAPPED includes the mapped anonymous pages.  This patch
      separates those and therefore allows an accurate tracking of the anonymous
      pages per zone.
      
      It then becomes possible to determine the number of unmapped pages per zone
      and we can avoid scanning for unmapped pages if there are none.
      
      Also it may now be possible to determine the mapped/unmapped ratio in
      get_dirty_limit.  Isnt the number of anonymous pages irrelevant in that
      calculation?
      
      Note that this will change the meaning of the number of mapped pages reported
      in /proc/vmstat /proc/meminfo and in the per node statistics.  This may affect
      user space tools that monitor these counters!  NR_FILE_MAPPED works like
      NR_FILE_DIRTY.  It is only valid for pagecache pages.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      f3dbd344
    • Christoph Lameter's avatar
      [PATCH] zoned vm counters: conversion of nr_pagecache to per zone counter · 347ce434
      Christoph Lameter authored
      
      
      Currently a single atomic variable is used to establish the size of the page
      cache in the whole machine.  The zoned VM counters have the same method of
      implementation as the nr_pagecache code but also allow the determination of
      the pagecache size per zone.
      
      Remove the special implementation for nr_pagecache and make it a zoned counter
      named NR_FILE_PAGES.
      
      Updates of the page cache counters are always performed with interrupts off.
      We can therefore use the __ variant here.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      347ce434
    • Christoph Lameter's avatar
      [PATCH] zoned vm counters: convert nr_mapped to per zone counter · 65ba55f5
      Christoph Lameter authored
      
      
      nr_mapped is important because it allows a determination of how many pages of
      a zone are not mapped, which would allow a more efficient means of determining
      when we need to reclaim memory in a zone.
      
      We take the nr_mapped field out of the page state structure and define a new
      per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
      from NR_MAPPED in the next patch).
      
      We replace the use of nr_mapped in various kernel locations.  This avoids the
      looping over all processors in try_to_free_pages(), writeback, reclaim (swap +
      zone reclaim).
      
      [akpm@osdl.org: bugfix]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      65ba55f5
    • Christoph Lameter's avatar
      [PATCH] zoned vm counters: basic ZVC (zoned vm counter) implementation · 2244b95a
      Christoph Lameter authored
      
      
      Per zone counter infrastructure
      
      The counters that we currently have for the VM are split per processor.  The
      processor however has not much to do with the zone these pages belong to.  We
      cannot tell f.e.  how many ZONE_DMA pages are dirty.
      
      So we are blind to potentially inbalances in the usage of memory in various
      zones.  F.e.  in a NUMA system we cannot tell how many pages are dirty on a
      particular node.  If we knew then we could put measures into the VM to balance
      the use of memory between different zones and different nodes in a NUMA
      system.  For example it would be possible to limit the dirty pages per node so
      that fast local memory is kept available even if a process is dirtying huge
      amounts of pages.
      
      Another example is zone reclaim.  We do not know how many unmapped pages exist
      per zone.  So we just have to try to reclaim.  If it is not working then we
      pause and try again later.  It would be better if we knew when it makes sense
      to reclaim unmapped pages from a zone.  This patchset allows the determination
      of the number of unmapped pages per zone.  We can remove the zone reclaim
      interval with the counters introduced here.
      
      Futhermore the ability to have various usage statistics available will allow
      the development of new NUMA balancing algorithms that may be able to improve
      the decision making in the scheduler of when to move a process to another node
      and hopefully will also enable automatic page migration through a user space
      program that can analyse the memory load distribution and then rebalance
      memory use in order to increase performance.
      
      The counter framework here implements differential counters for each processor
      in struct zone.  The differential counters are consolidated when a threshold
      is exceeded (like done in the current implementation for nr_pageache), when
      slab reaping occurs or when a consolidation function is called.
      
      Consolidation uses atomic operations and accumulates counters per zone in the
      zone structure and also globally in the vm_stat array.  VM functions can
      access the counts by simply indexing a global or zone specific array.
      
      The arrangement of counters in an array also simplifies processing when output
      has to be generated for /proc/*.
      
      Counters can be updated by calling inc/dec_zone_page_state or
      _inc/dec_zone_page_state analogous to *_page_state.  The second group of
      functions can be called if it is known that interrupts are disabled.
      
      Special optimized increment and decrement functions are provided.  These can
      avoid certain checks and use increment or decrement instructions that an
      architecture may provide.
      
      We also add a new CONFIG_DMA_IS_NORMAL that signifies that an architecture can
      do DMA to all memory and therefore ZONE_NORMAL will not be populated.  This is
      only currently set for IA64 SGI SN2 and currently only affects
      node_page_state().  In the best case node_page_state can be reduced to
      retrieving a single counter for the one zone on the node.
      
      [akpm@osdl.org: cleanups]
      [akpm@osdl.org: export vm_stat[] for filesystems]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      2244b95a
    • Christoph Lameter's avatar
      [PATCH] zoned vm counters: create vmstat.c/.h from page_alloc.c/.h · f6ac2354
      Christoph Lameter authored
      NOTE: ZVC are *not* the lightweight event counters.  ZVCs are reliable whereas
      event counters do not need to be.
      
      Zone based VM statistics are necessary to be able to determine what the state
      of memory in one zone is.  In a NUMA system this can be helpful for local
      reclaim and other memory optimizations that may be able to shift VM load in
      order to get more balanced memory use.
      
      It is also useful to know how the computing load affects the memory
      allocations on various zones.  This patchset allows the retrieval of that data
      from userspace.
      
      The patchset introduces a framework for counters that is a cross between the
      existing page_stats --which are simply global counters split per cpu-- and the
      approach of deferred incremental updates implemented for nr_pagecache.
      
      Small per cpu 8 bit counters are added to struct zone.  If the counter exceeds
      certain thresholds then the counters are accumulated in an array of
      atomic_long in the zone and in a global array that sums up all zone values.
      The small 8 bit counters are next to the per cpu page pointers and so they
      will be in high in the cpu cache when pages are allocated and freed.
      
      Access to VM counter information for a zone and for the whole machine is then
      possible by simply indexing an array (Thanks to Nick Piggin for pointing out
      that approach).  The access to the total number of pages of various types does
      no longer require the summing up of all per cpu counters.
      
      Benefits of this patchset right now:
      
      - Ability for UP and SMP configuration to determine how memory
        is balanced between the DMA, NORMAL and HIGHMEM zones.
      
      - loops over all processors are avoided in writeback and
        reclaim paths. We can avoid caching the writeback information
        because the needed information is directly accessible.
      
      - Special handling for nr_pagecache removed.
      
      - zone_reclaim_interval vanishes since VM stats can now determine
        when it is worth to do local reclaim.
      
      - Fast inline per node page state determination.
      
      - Accurate counters in /sys/devices/system/node/node*/meminfo. Current
        counters are counting simply which processor allocated a page somewhere
        and guestimate based on that. So the counters were not useful to show
        the actual distribution of page use on a specific zone.
      
      - The swap_prefetch patch requires per node statistics in order to
        figure out when processors of a node can prefetch. This patch provides
        some of the needed numbers.
      
      - Detailed VM counters available in more /proc and /sys status files.
      
      References to earlier discussions:
      V1 http://marc.theaimsgroup.com/?l=linux-kernel&m=113511649910826&w=2
      V2 http://marc.theaimsgroup.com/?l=linux-kernel&m=114980851924230&w=2
      V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115014697910351&w=2
      V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767318740&w=2
      
      
      
      Performance tests with AIM7 did not show any regressions.  Seems to be a tad
      faster even.  Tested on ia64/NUMA.  Builds fine on i386, SMP / UP.  Includes
      fixes for s390/arm/uml arch code.
      
      This patch:
      
      Move counter code from page_alloc.c/page-flags.h to vmstat.c/h.
      
      Create vmstat.c/vmstat.h by separating the counter code and the proc
      functions.
      
      Move the vm_stat_text array before zoneinfo_show.
      
      [akpm@osdl.org: s390 build fix]
      [akpm@osdl.org: HOTPLUG_CPU build fix]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      f6ac2354