8 years agoprintk: fix bounds checking for log_prefix
William Douglas [Tue, 1 Nov 2011 00:11:29 +0000]
printk: fix bounds checking for log_prefix

Currently log_prefix is testing that the first character of the log level
and facility is less than '0' and greater than '9' (which is always
false).  It should be testing to see if the character less than '0' or
greater than '9' instead.  This patch makes that change.

The code being changed worked because strtoul bombs out (endp isn't
updated) and 0 is returned anyway.

Signed-off-by: William Douglas <william.douglas@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoprintk: add console_suspend module parameter
Yanmin Zhang [Tue, 1 Nov 2011 00:11:27 +0000]
printk: add console_suspend module parameter

We are enabling some power features on medfield.  To test suspend-2-RAM
conveniently, we need turn on/off console_suspend_enabled frequently.

Add a module parameter, so users could change it by:
/sys/module/printk/parameters/console_suspend

Signed-off-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoprintk: add module parameter ignore_loglevel to control ignore_loglevel
Yanmin Zhang [Tue, 1 Nov 2011 00:11:25 +0000]
printk: add module parameter ignore_loglevel to control ignore_loglevel

We are enabling some power features on medfield.  To test suspend-2-RAM
conveniently, we need turn on/off ignore_loglevel frequently without
rebooting.

Add a module parameter, so users can change it by:
/sys/module/printk/parameters/ignore_loglevel

Signed-off-by: Yanmin Zhang <yanmin.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agokernel/sysctl.c: add cap_last_cap to /proc/sys/kernel
Dan Ballard [Tue, 1 Nov 2011 00:11:20 +0000]
kernel/sysctl.c: add cap_last_cap to /proc/sys/kernel

Userspace needs to know the highest valid capability of the running
kernel, which right now cannot reliably be retrieved from the header files
only.  The fact that this value cannot be determined properly right now
creates various problems for libraries compiled on newer header files
which are run on older kernels.  They assume capabilities are available
which actually aren't.  libcap-ng is one example.  And we ran into the
same problem with systemd too.

Now the capability is exported in /proc/sys/kernel/cap_last_cap.

[akpm@linux-foundation.org: make cap_last_cap const, per Ulrich]
Signed-off-by: Dan Ballard <dan@mindstab.net>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Ulrich Drepper <drepper@akkadia.org>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agowatchdog: move watchdog_*_all_cpus under CONFIG_SYSCTL
Vasily Averin [Tue, 1 Nov 2011 00:11:18 +0000]
watchdog: move watchdog_*_all_cpus under CONFIG_SYSCTL

Fix compilation warnings for CONFIG_SYSCTL=n:

fixed compilation warnings in case of disabled CONFIG_SYSCTL
kernel/watchdog.c:483:13: warning: `watchdog_enable_all_cpus' defined but not used
kernel/watchdog.c:500:13: warning: `watchdog_disable_all_cpus' defined but not used

these functions are static and are used only in sysctl handler, so move
them inside #ifdef CONFIG_SYSCTL too

Signed-off-by: Vasily Averin <vvs@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agostop_machine: make stop_machine safe and efficient to call early
Jeremy Fitzhardinge [Tue, 1 Nov 2011 00:11:15 +0000]
stop_machine: make stop_machine safe and efficient to call early

Make stop_machine() safe to call early in boot, before SMP has been set
up, by simply calling the callback function directly if there's only one
CPU online.

[ Fixes from AKPM:
   - add comment
   - local_irq_flags, not save_flags
   - also call hard_irq_disable() for systems which need it

  Tejun suggested using an explicit flag rather than just looking at
  the online cpu count. ]

Cc: Tejun Heo <tj@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Tejun Heo <htejun@gmail.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agodrivers/misc/ad525x_dpot-i2c.c: add i2c support for AD5161
Peter Korsgaard [Tue, 1 Nov 2011 00:11:12 +0000]
drivers/misc/ad525x_dpot-i2c.c: add i2c support for AD5161

Commit 6c536e4ce8e ("ad525x_dpot: add support for SPI parts") added
support for the AD5161 through SPI, but the device supports both I2C and
SPI (depending on the DIS pin), so add it to -i2c as well.

Signed-off-by: Peter Korsgaard <jacmet@sunsite.dk>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: Michael Hennerich <michael.hennerich@analog.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agodriver/misc/fsa9480.c fix potential null-pointer dereference
Jonghwan Choi [Tue, 1 Nov 2011 00:11:09 +0000]
driver/misc/fsa9480.c fix potential null-pointer dereference

Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com>
Cc: Donggeun Kim <dg77.kim@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agolis3lv02d: make regulator API usage unconditional
Mark Brown [Tue, 1 Nov 2011 00:11:07 +0000]
lis3lv02d: make regulator API usage unconditional

The regulator API contains a range of features for stubbing itself out
when not in use and for transparently restricting the actual effect of
regulator API calls where they can't be supported on a particular system
so that drivers don't need to individually implement this.  Simplify the
driver slightly by making use of this idiom.

The only in tree user is ecovec24 which does not use the regulator API.

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Cc: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agolis3: remove the references to the global variable in core driver
Éric Piel [Tue, 1 Nov 2011 00:11:05 +0000]
lis3: remove the references to the global variable in core driver

[ilkka.koskinen@nokia.com: fix arg to lis3->read()]
Signed-off-by: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Witold Pilat <witold.pilat@gmail.com>
Cc: Lyall Pearce <lyall.pearce@hp.com>
Cc: Malte Starostik <m-starostik@versanet.de>
Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Christian Lamparter <chunkeey@googlemail.com>
Subject: lis3-remove-the-references-to-the-global-variable-in-core-driver-fix
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agolis3: change exported function to use passed parameter
Éric Piel [Tue, 1 Nov 2011 00:11:02 +0000]
lis3: change exported function to use passed parameter

Change exported functions to use the device given as parameter
instead of the global one.

Signed-off-by: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Witold Pilat <witold.pilat@gmail.com>
Cc: Lyall Pearce <lyall.pearce@hp.com>
Cc: Malte Starostik <m-starostik@versanet.de>
Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agolis3: use consistent naming of variables
Éric Piel [Tue, 1 Nov 2011 00:10:58 +0000]
lis3: use consistent naming of variables

Signed-off-by: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Witold Pilat <witold.pilat@gmail.com>
Cc: Lyall Pearce <lyall.pearce@hp.com>
Cc: Malte Starostik <m-starostik@versanet.de>
Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agolis3: free regulators if probe() fails
Éric Piel [Tue, 1 Nov 2011 00:10:54 +0000]
lis3: free regulators if probe() fails

Signed-off-by: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Witold Pilat <witold.pilat@gmail.com>
Cc: Lyall Pearce <lyall.pearce@hp.com>
Cc: Malte Starostik <m-starostik@versanet.de>
Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agohp_accel: add HP ProBook 655x
Éric Piel [Tue, 1 Nov 2011 00:10:50 +0000]
hp_accel: add HP ProBook 655x

Add axis correction for HP ProBook 6555b.

Signed-off-by: Malte Starostik <m-starostik@versanet.de>
Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Witold Pilat <witold.pilat@gmail.com>
Cc: Lyall Pearce <lyall.pearce@hp.com>
Cc: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agolis3: add support for HP EliteBook 8540w
Éric Piel [Tue, 1 Nov 2011 00:10:46 +0000]
lis3: add support for HP EliteBook 8540w

Add axis correction for HP EliteBook 8540w.

Reported-by: Lyall Pearce <lyall.pearce@hp.com>
Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Witold Pilat <witold.pilat@gmail.com>
Cc: Malte Starostik <m-starostik@versanet.de>
Cc: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agolis3: add support for HP EliteBook 2730p
Éric Piel [Tue, 1 Nov 2011 00:10:43 +0000]
lis3: add support for HP EliteBook 2730p

Add axis correction for HP EliteBook 2730p.

Tested-by: Witold Pilat <witold.pilat@gmail.com>
Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Lyall Pearce <lyall.pearce@hp.com>
Cc: Malte Starostik <m-starostik@versanet.de>
Cc: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agolis3: update maintainer information
Éric Piel [Tue, 1 Nov 2011 00:10:41 +0000]
lis3: update maintainer information

In the move of the lis3 driver, the hp_accel.c file got dropped from the
MAINTAINER file. Make it explicit again that this file is tied to lis3
again.

Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Witold Pilat <witold.pilat@gmail.com>
Cc: Lyall Pearce <lyall.pearce@hp.com>
Cc: Malte Starostik <m-starostik@versanet.de>
Cc: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agolis3lv02d: avoid divide by zero due to unchecked
Éric Piel [Tue, 1 Nov 2011 00:10:31 +0000]
lis3lv02d: avoid divide by zero due to unchecked

After an "unexpected" reboot, I found this Oops in my logs:

divide error: 0000 [#1] PREEMPT SMP=20
CPU 0=20
Modules linked in: lis3lv02d hp_wmi input_polldev [...]
Pid: 390, comm: modprobe Tainted: G         C  2.6.39-rc7-wl+=20
RIP: 0010:[<ffffffffa014b427>]  [<ffffffffa014b427>]
 lis3lv02d_poweron+0x4e/0x94 [lis3lv02d]
RSP: 0018:ffff8801d6407cf8  EFLAGS: 00010246
RAX: 0000000000000bb8 RBX: ffffffffa014e000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffea00066e4708 RDI: ffff8801df002700
RBP: ffff8801d6407d18 R08: ffffea00066c5a30 R09: ffffffff812498c9
R10: ffff8801d7bfcea0 R11: ffff8801d7bfce10 R12: 0000000000000bb8
R13: 00000000ffffffda R14: ffffffffa0154120 R15: ffffffffa0154030
=46S:  00007fc0705db700(0000) GS:ffff8801dfa00000(0000) knlGS:0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f33549174f0 CR3: 00000001d65c9000 CR4: 00000000000406f0
Process modprobe (pid: 390, threadinfo ffff8801d6406000, task ffff8801d6b40=
000)
Stack:
 ffffffffa0154120 62ffffffa0154030 ffffffffa014e000 00000000ffffffea
 ffff8801d6407d58 ffffffffa014bcc1 0000000000000000 0000000000000048
 ffff8801d8bae800 00000000ffffffea 00000000ffffffda ffffffffa0154120
Call Trace:
 [<ffffffffa014bcc1>] lis3lv02d_init_device+0x1ce/0x496 [lis3lv02d]
 [<ffffffffa01522ff>] lis3lv02d_add+0x10f/0x17c [hp_accel]
 [<ffffffff81233e11>] acpi_device_probe+0x49/0x117
[...]
Code: 3a 75 06 80 4d ef 50 eb 04 80 4d ef 40 0f b6 55 ef be 21
00 00 00 48 89 df ff 53 18 44 8b 63 6c e8 3e fc ff ff 89 c1 44
89 e0 99 <f7> f9 89 c7 e8 93 82 ef e0 48 83 7b 30 00 74 2d 45
31 e4 80 7b=20
RIP  [<ffffffffa014b427>] lis3lv02d_poweron+0x4e/0x94 [lis3lv02d]
 RSP <ffff8801d6407cf8>

>From my POV, it looks like the hardware is not working as expected
and returns a bogus data rate. The driver doesn't check the result
and directly uses it as some sort of divisor in some places:

msleep(lis3->pwron_delay / lis3lv02d_get_odr());

Under this circumstances, this could very well cause the
"divide by zero" exception from above.

For now, I fixed it the easiest and most obvious way:
Check if the result is sane and if it isn't use a sane default
instead. I went for "100" in the latter case, simply because
/sys/devices/platform/lis3lv02d/rate returns it on a successful
boot.

Signed-off-by: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Witold Pilat <witold.pilat@gmail.com>
Cc: Lyall Pearce <lyall.pearce@hp.com>
Cc: Malte Starostik <m-starostik@versanet.de>
Cc: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agodrivers/hwmon/hwmon.c: convert idr to ida and use ida_simple_get()
Jonathan Cameron [Tue, 1 Nov 2011 00:10:27 +0000]
drivers/hwmon/hwmon.c: convert idr to ida and use ida_simple_get()

A straightforward looking use of idr for a device id.

Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Guenter Roeck <guenter.roeck@ericsson.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Acked-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agohwmon: convert idr to ida and use ida_simple interface
Jonathan Cameron [Tue, 1 Nov 2011 00:10:09 +0000]
hwmon: convert idr to ida and use ida_simple interface

hwmon was using an idr with a NULL pointer, so convert to an
ida which then allows use of Rusty's ida_simple_get.

Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Guenter Roeck <guenter.roeck@ericsson.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agolib/Kconfig.debug: fix help message for DEFAULT_HUNG_TASK_TIMEOUT
Jiaju Zhang [Tue, 1 Nov 2011 00:10:07 +0000]
lib/Kconfig.debug: fix help message for DEFAULT_HUNG_TASK_TIMEOUT

Added missing _secs in the help message of config DEFAULT_HUNG_TASK_TIMEOUT.

Signed-off-by: Jiaju Zhang <jjzhang@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agofs/pipe.c: add ->statfs callback for pipefs
Pavel Emelyanov [Tue, 1 Nov 2011 00:10:04 +0000]
fs/pipe.c: add ->statfs callback for pipefs

Currently a statfs on a pipe's /proc/<pid>/fd/ link returns -ENOSYS.  Wire
pipfs up so that the statfs succeeds.

This is required by checkpoint-restart in the userspace to make it
possible to distinguish pipes from fifos.

When we dump information about task's open files we use the /proc/pid/fd
directoy's symlinks and the fact that opening any of them gives us exactly
the same dentry->inode pair as the original process has.  Now if a task
we're dumping has opened pipe and fifo we need to detect this and act
accordingly.  Knowing that an fd with type S_ISFIFO resides on a pipefs is
the most precise way.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoalpha: wire up sendmmsg syscall
Michael Cree [Tue, 1 Nov 2011 00:10:01 +0000]
alpha: wire up sendmmsg syscall

Signed-off-by: Michael Cree <mcree@orcon.net.nz>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoalpha: wire up accept4 syscall
Michael Cree [Tue, 1 Nov 2011 00:09:49 +0000]
alpha: wire up accept4 syscall

Somehow wiring up the accept4 syscall on Alpha was missed long ago.
This commit rectifies that oversight.

Signed-off-by: Michael Cree <mcree@orcon.net.nz>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm/vmstat.c: cache align vm_stat
Dimitri Sivanich [Tue, 1 Nov 2011 00:09:46 +0000]
mm/vmstat.c: cache align vm_stat

Avoid false sharing of the vm_stat array.

This was found to adversely affect tmpfs I/O performance.

Tests run on a 640 cpu UV system.

With 120 threads doing parallel writes, each to different tmpfs mounts:
No patch: ~300 MB/sec
With vm_stat alignment: ~430 MB/sec

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: munlock use mapcount to avoid terrible overhead
Hugh Dickins [Tue, 1 Nov 2011 00:09:43 +0000]
mm: munlock use mapcount to avoid terrible overhead

A process spent 30 minutes exiting, just munlocking the pages of a large
anonymous area that had been alternately mprotected into page-sized vmas:
for every single page there's an anon_vma walk through all the other
little vmas to find the right one.

A general fix to that would be a lot more complicated (use prio_tree on
anon_vma?), but there's one very simple thing we can do to speed up the
common case: if a page to be munlocked is mapped only once, then it is our
vma that it is mapped into, and there's no need whatever to walk through
all the others.

Okay, there is a very remote race in munlock_vma_pages_range(), if between
its follow_page() and lock_page(), another process were to munlock the
same page, then page reclaim remove it from our vma, then another process
mlock it again.  We would find it with page_mapcount 1, yet it's still
mlocked in another process.  But never mind, that's much less likely than
the down_read_trylock() failure which munlocking already tolerates (in
try_to_unmap_one()): in due course page reclaim will discover and move the
page to unevictable instead.

[akpm@linux-foundation.org: add comment]
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm/huge_memory: fix typo when updating mmu cache
Hillf Danton [Tue, 1 Nov 2011 00:09:40 +0000]
mm/huge_memory: fix typo when updating mmu cache

There are three cases of update_mmu_cache() in the file, and the case in
function collapse_huge_page() has a typo, namely the last parameter used,
which is corrected based on the other two cases.

Due to the define of update_mmu_cache by X86, the only arch that
implements THP currently, the change here has no really crystal point, but
one or two minutes of efforts could be saved for those archs that are
likely to support THP in future.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm/huge_memory: fix copying user highpage
Hillf Danton [Tue, 1 Nov 2011 00:09:38 +0000]
mm/huge_memory: fix copying user highpage

The THP copy-on-write handler falls back to regular-sized pages for a huge
page replacement upon allocation failure or if THP has been individually
disabled in the target VMA.  The loop responsible for copying page-sized
chunks accidentally uses multiples of PAGE_SHIFT instead of PAGE_SIZE as
the virtual address arg for copy_user_highpage().

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: do not drain pagevecs for mlockall(MCL_FUTURE)
Christoph Lameter [Tue, 1 Nov 2011 00:09:35 +0000]
mm: do not drain pagevecs for mlockall(MCL_FUTURE)

MCL_FUTURE does not move pages between lru list and draining the LRU per
cpu pagevecs is a nasty activity.  Avoid doing it unecessarily.

Signed-off-by: Christoph Lameter <cl@gentwo.org>
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agovmscan: abort reclaim/compaction if compaction can proceed
Mel Gorman [Tue, 1 Nov 2011 00:09:33 +0000]
vmscan: abort reclaim/compaction if compaction can proceed

If compaction can proceed, shrink_zones() stops doing any work but its
callers still call shrink_slab() which raises the priority and potentially
sleeps.  This is unnecessary and wasteful so this patch aborts direct
reclaim/compaction entirely if compaction can proceed.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agovmscan: limit direct reclaim for higher order allocations
Rik van Riel [Tue, 1 Nov 2011 00:09:31 +0000]
vmscan: limit direct reclaim for higher order allocations

When suffering from memory fragmentation due to unfreeable pages, THP page
faults will repeatedly try to compact memory.  Due to the unfreeable
pages, compaction fails.

Needless to say, at that point page reclaim also fails to create free
contiguous 2MB areas.  However, that doesn't stop the current code from
trying, over and over again, and freeing a minimum of 4MB (2UL <<
sc->order pages) at every single invocation.

This resulted in my 12GB system having 2-3GB free memory, a corresponding
amount of used swap and very sluggish response times.

This can be avoided by having the direct reclaim code not reclaim from
zones that already have plenty of free memory available for compaction.

If compaction still fails due to unmovable memory, doing additional
reclaim will only hurt the system, not help.

[jweiner@redhat.com: change comment to explain the order check]
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agovmscan: add barrier to prevent evictable page in unevictable list
Minchan Kim [Tue, 1 Nov 2011 00:09:28 +0000]
vmscan: add barrier to prevent evictable page in unevictable list

When a race between putback_lru_page() and shmem_lock with lock=0 happens,
progrom execution order is as follows, but clear_bit in processor #1 could
be reordered right before spin_unlock of processor #1.  Then, the page
would be stranded on the unevictable list.

spin_lock
SetPageLRU
spin_unlock
                                clear_bit(AS_UNEVICTABLE)
                                spin_lock
                                if PageLRU()
                                        if !test_bit(AS_UNEVICTABLE)
                                         move evictable list
smp_mb
if !test_bit(AS_UNEVICTABLE)
        move evictable list
                                spin_unlock

But, pagevec_lookup() in scan_mapping_unevictable_pages() has
rcu_read_[un]lock() so it could protect reordering before reaching
test_bit(AS_UNEVICTABLE) on processor #1 so this problem never happens.
But it's a unexpected side effect and we should solve this problem
properly.

This patch adds a barrier after mapping_clear_unevictable.

I didn't meet this problem but just found during review.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm/huge_memory.c: quiet sparse noise
H Hartley Sweeten [Tue, 1 Nov 2011 00:09:25 +0000]
mm/huge_memory.c: quiet sparse noise

Quiet the sparse noise:

warning: symbol 'khugepaged_scan' was not declared. Should it be static?
warning: context imbalance in 'khugepaged_scan_mm_slot' - unexpected unlock

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm/mempolicy.c: quiet sparse noise
H Hartley Sweeten [Tue, 1 Nov 2011 00:09:23 +0000]
mm/mempolicy.c: quiet sparse noise

Quiet the spares noise:

warning: symbol 'default_policy' was not declared. Should it be static?

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Stephen Wilson <wilsons@start.ca>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm/thrash.c: quiet sparse noise
H Hartley Sweeten [Tue, 1 Nov 2011 00:09:19 +0000]
mm/thrash.c: quiet sparse noise

Quiet the following sparse noise:

warning: symbol 'swap_token_memcg' was not declared. Should it be static?

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm/memblock.c: quiet sparse noise
H Hartley Sweeten [Tue, 1 Nov 2011 00:09:15 +0000]
mm/memblock.c: quiet sparse noise

Quiet the following sparse noise in this file:

warning: symbol 'memblock_overlaps_region' was not declared. Should it be static?

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers,com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Tomi Valkeinen <tomi.valkeinen@nokia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: disable user interface to manually rescue unevictable pages
Johannes Weiner [Tue, 1 Nov 2011 00:09:13 +0000]
mm: disable user interface to manually rescue unevictable pages

At one point, anonymous pages were supposed to go on the unevictable list
when no swap space was configured, and the idea was to manually rescue
those pages after adding swap and making them evictable again.  But
nowadays, swap-backed pages on the anon LRU list are not scanned without
available swap space anyway, so there is no point in moving them to a
separate list anymore.

The manual rescue could also be used in case pages were stranded on the
unevictable list due to race conditions.  But the code has been around for
a while now and newly discovered bugs should be properly reported and
dealt with instead of relying on such a manual fixup.

In addition to the lack of a usecase, the sysfs interface to rescue pages
from a specific NUMA node has been broken since its introduction, so it's
unlikely that anybody ever relied on that.

This patch removes the functionality behind the sysctl and the
node-interface and emits a one-time warning when somebody tries to access
either of them.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reported-by: Kautuk Consul <consul.kautuk@gmail.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agovmscan.c: fix invalid strict_strtoul() check in write_scan_unevictable_node()
Kautuk Consul [Tue, 1 Nov 2011 00:09:11 +0000]
vmscan.c: fix invalid strict_strtoul() check in write_scan_unevictable_node()

write_scan_unevictable_node() checks the value req returned by
strict_strtoul() and returns 1 if req is 0.

However, when strict_strtoul() returns 0, it means successful conversion
of buf to unsigned long.

Due to this, the function was not proceeding to scan the zones for
unevictable pages even though we write a valid value to the
scan_unevictable_pages sys file.

Change this check slightly to check for invalid value in buf as well as 0
value stored in res after successful conversion via strict_strtoul.  In
both cases, we do not perform the scanning of this node's zones.

Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: fix kunmap_high() comment
Li Haifeng [Tue, 1 Nov 2011 00:09:09 +0000]
mm: fix kunmap_high() comment

Signed-off-by: Li Haifeng <omycle@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: compaction: make compact_zone_order() static
Kyungmin Park [Tue, 1 Nov 2011 00:09:08 +0000]
mm: compaction: make compact_zone_order() static

There's no compact_zone_order() user outside file scope, so make it static.

Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoHWPOISON: convert pr_debug()s to pr_info()s
Dean Nelson [Tue, 1 Nov 2011 00:09:04 +0000]
HWPOISON: convert pr_debug()s to pr_info()s

Commit fb46e73520940b ("HWPOISON: Convert pr_debugs to pr_info) authored
by Andi Kleen converted a number of pr_debug()s to pr_info()s.

About the same time additional code with pr_debug()s was added by two
other commits 8c6c2ecb4466 ("HWPOSION, hugetlb: recover from free hugepage
error when !MF_COUNT_INCREASED") and d950b95882f3d ("HWPOISON, hugetlb:
soft offlining for hugepage").  And these pr_debug()s failed to get
converted to pr_info()s.

This patch converts them as well.  And does some minor related whitespace
cleanup.

Signed-off-by: Dean Nelson <dnelson@redhat.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agofs/buffer.c: add device information for error output in __find_get_block_slow()
Tao Ma [Tue, 1 Nov 2011 00:09:00 +0000]
fs/buffer.c: add device information for error output in __find_get_block_slow()

On the ext4 mailing list[1], we got some report about errors in
__find_get_block_slow(), but the information is very limited.

If the device information is given, we can know the name of the sick
volume.  Futhermore, we can get the corresponding status of that
block(group, inode block etc) by analyzing the disk layout.

[1] http://marc.info/?l=linux-ext4&m=131379831421147&w=2

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm/mmap.c: eliminate the ret variable from mm_take_all_locks()
Kautuk Consul [Tue, 1 Nov 2011 00:08:59 +0000]
mm/mmap.c: eliminate the ret variable from mm_take_all_locks()

The ret variable is really not needed in mm_take_all_locks().

Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agovmscan: fix shrinker callback bug in fs/super.c
Mikulas Patocka [Tue, 1 Nov 2011 00:08:57 +0000]
vmscan: fix shrinker callback bug in fs/super.c

The callback must not return -1 when nr_to_scan is zero. Fix the bug in
fs/super.c and add this requirement to the callback specification.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm-add-comment-explaining-task-state-setting-in-bdi_forker_thread-fix
Andrew Morton [Tue, 1 Nov 2011 00:08:54 +0000]
mm-add-comment-explaining-task-state-setting-in-bdi_forker_thread-fix

fiddle wording

Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoksm: fix the comment of try_to_unmap_one()
Wanlong Gao [Tue, 1 Nov 2011 00:08:51 +0000]
ksm: fix the comment of try_to_unmap_one()

try_to_unmap_one() is called by try_to_unmap_ksm(), too.

Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm/vmalloc.c: report more vmalloc failures
Joe Perches [Tue, 1 Nov 2011 00:08:48 +0000]
mm/vmalloc.c: report more vmalloc failures

Some vmalloc failure paths do not report OOM conditions.

Add warn_alloc_failed, which also does a dump_stack, to those failure
paths.

This allows more site specific vmalloc failure logging message printks to
be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agokswapd: assign new_order and new_classzone_idx after wakeup in sleeping
Alex,Shi [Tue, 1 Nov 2011 00:08:45 +0000]
kswapd: assign new_order and new_classzone_idx after wakeup in sleeping

There 2 places to read pgdat in kswapd.  One is return from a successful
balance, another is waked up from kswapd sleeping.  The new_order and
new_classzone_idx represent the balance input order and classzone_idx.

But current new_order and new_classzone_idx are not assigned after
kswapd_try_to_sleep(), that will cause a bug in the following scenario.

1: after a successful balance, kswapd goes to sleep, and new_order = 0;
   new_classzone_idx = __MAX_NR_ZONES - 1;

2: kswapd waked up with order = 3 and classzone_idx = ZONE_NORMAL

3: in the balance_pgdat() running, a new balance wakeup happened with
   order = 5, and classzone_idx = ZONE_NORMAL

4: the first wakeup(order = 3) finished successufly, return order = 3
   but, the new_order is still 0, so, this balancing will be treated as a
   failed balance.  And then the second tighter balancing will be missed.

So, to avoid the above problem, the new_order and new_classzone_idx need
to be assigned for later successful comparison.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm/memblock.c: small function definition fixes
Jonghwan Choi [Tue, 1 Nov 2011 00:08:42 +0000]
mm/memblock.c: small function definition fixes

warning: function 'memblock_memory_can_coalesce'
with external linkage has definition.

Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agokswapd: avoid unnecessary rebalance after an unsuccessful balancing
Alex,Shi [Tue, 1 Nov 2011 00:08:39 +0000]
kswapd: avoid unnecessary rebalance after an unsuccessful balancing

In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
when reclaiming successfully") , Mel Gorman said kswapd is better to sleep
after a unsuccessful balancing if there is tighter reclaim request pending
in the balancing.  But in the following scenario, kswapd do something that
is not matched our expectation.  The patch fixes this issue.

1, Read pgdat request A (classzone_idx, order = 3)
2, balance_pgdat()
3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
4, balance_pgdat() returns but failed since returned order = 0
5, pgdat of request A assigned to balance_pgdat(), and do balancing again.
   While the expectation behavior of kswapd should try to sleep.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Pádraig Brady <P@draigBrady.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agodebug-pagealloc: add support for highmem pages
Akinobu Mita [Tue, 1 Nov 2011 00:08:38 +0000]
debug-pagealloc: add support for highmem pages

This adds support for highmem pages poisoning and verification to the
debug-pagealloc feature for no-architecture support.

[akpm@linux-foundation.org: remove unneeded preempt_disable/enable]
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: neaten warn_alloc_failed
Joe Perches [Tue, 1 Nov 2011 00:08:35 +0000]
mm: neaten warn_alloc_failed

Add __attribute__((format (printf...) to the function to validate format
and arguments.  Use vsprintf extension %pV to avoid any possible message
interleaving.  Coalesce format string.  Convert printks/pr_warning to
pr_warn.

[akpm@linux-foundation.org: use the __printf() macro]
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoinclude/asm-generic/page.h: calculate virt_to_page and page_to_virt via predefined...
Sonic Zhang [Tue, 1 Nov 2011 00:08:31 +0000]
include/asm-generic/page.h: calculate virt_to_page and page_to_virt via predefined macro

On NOMMU architectures, if physical memory doesn't start from 0,
ARCH_PFN_OFFSET is defined to generate page index in mem_map array.
Because virtual address is equal to physical address, PAGE_OFFSET is
always 0.  virt_to_page and page_to_virt should not index page by
PAGE_OFFSET directly.

Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agothp: mremap support and TLB optimization
Andrea Arcangeli [Tue, 1 Nov 2011 00:08:30 +0000]
thp: mremap support and TLB optimization

This adds THP support to mremap (decreases the number of split_huge_page()
calls).

Here are also some benchmarks with a proggy like this:

===
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (5UL*1024*1024*1024)

int main()
{
        static struct timeval oldstamp, newstamp;
long diffsec;
char *p, *p2, *p3, *p4;
if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
perror("memalign"), exit(1);

memset(p, 0xff, SIZE);
memset(p2, 0xff, SIZE);
memset(p3, 0x77, 4096);
gettimeofday(&oldstamp, NULL);
p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
gettimeofday(&newstamp, NULL);
diffsec = newstamp.tv_sec - oldstamp.tv_sec;
diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
printf("usec %ld\n", diffsec);
if (p == MAP_FAILED || p4 != p3)
//if (p == MAP_FAILED)
perror("mremap"), exit(1);
if (memcmp(p4, p2, SIZE))
printf("mremap bug\n"), exit(1);
printf("ok\n");

return 0;
}
===

THP on

 Performance counter stats for './largepage13' (3 runs):

          69195836 dTLB-loads                 ( +-   3.546% )  (scaled from 50.30%)
             60708 dTLB-load-misses           ( +-  11.776% )  (scaled from 52.62%)
         676266476 dTLB-stores                ( +-   5.654% )  (scaled from 69.54%)
             29856 dTLB-store-misses          ( +-   4.081% )  (scaled from 89.22%)
        1055848782 iTLB-loads                 ( +-   4.526% )  (scaled from 80.18%)
              8689 iTLB-load-misses           ( +-   2.987% )  (scaled from 58.20%)

        7.314454164  seconds time elapsed   ( +-   0.023% )

THP off

 Performance counter stats for './largepage13' (3 runs):

        1967379311 dTLB-loads                 ( +-   0.506% )  (scaled from 60.59%)
           9238687 dTLB-load-misses           ( +-  22.547% )  (scaled from 61.87%)
        2014239444 dTLB-stores                ( +-   0.692% )  (scaled from 60.40%)
           3312335 dTLB-store-misses          ( +-   7.304% )  (scaled from 67.60%)
        6764372065 iTLB-loads                 ( +-   0.925% )  (scaled from 79.00%)
              8202 iTLB-load-misses           ( +-   0.475% )  (scaled from 70.55%)

        9.693655243  seconds time elapsed   ( +-   0.069% )

grep thp /proc/vmstat
thp_fault_alloc 35849
thp_fault_fallback 0
thp_collapse_alloc 3
thp_collapse_alloc_failed 0
thp_split 0

thp_split 0 confirms no thp split despite plenty of hugepages allocated.

The measurement of only the mremap time (so excluding the 3 long
memset and final long 10GB memory accessing memcmp):

THP on

usec 14824
usec 14862
usec 14859

THP off

usec 256416
usec 255981
usec 255847

With an older kernel without the mremap optimizations (the below patch
optimizes the non THP version too).

THP on

usec 392107
usec 390237
usec 404124

THP off

usec 444294
usec 445237
usec 445820

I guess with a threaded program that sends more IPI on large SMP it'd
create an even larger difference.

All debug options are off except DEBUG_VM to avoid skewing the
results.

The only problem for native 2M mremap like it happens above both the
source and destination address must be 2M aligned or the hugepmd can't be
moved without a split but that is an hardware limitation.

[akpm@linux-foundation.org: coding-style nitpicking]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomremap: avoid sending one IPI per page
Andrea Arcangeli [Tue, 1 Nov 2011 00:08:26 +0000]
mremap: avoid sending one IPI per page

This replaces ptep_clear_flush() with ptep_get_and_clear() and a single
flush_tlb_range() at the end of the loop, to avoid sending one IPI for
each page.

The mmu_notifier_invalidate_range_start/end section is enlarged
accordingly but this is not going to fundamentally change things.  It was
more by accident that the region under mremap was for the most part still
available for secondary MMUs: the primary MMU was never allowed to
reliably access that region for the duration of the mremap (modulo
trapping SIGSEGV on the old address range which sounds unpractical and
flakey).  If users wants secondary MMUs not to lose access to a large
region under mremap they should reduce the mremap size accordingly in
userland and run multiple calls.  Overall this will run faster so it's
actually going to reduce the time the region is under mremap for the
primary MMU which should provide a net benefit to apps.

For KVM this is a noop because the guest physical memory is never
mremapped, there's just no point it ever moving it while guest runs.  One
target of this optimization is JVM GC (so unrelated to the mmu notifier
logic).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomremap: check for overflow using deltas
Andrea Arcangeli [Tue, 1 Nov 2011 00:08:22 +0000]
mremap: check for overflow using deltas

Using "- 1" relies on the old_end to be page aligned and PAGE_SIZE > 1,
those are reasonable requirements but the check remains obscure and it
looks more like an off by one error than an overflow check.  This I feel
will improve readability.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomemblock: add NO_BOOTMEM config symbol
Sam Ravnborg [Tue, 1 Nov 2011 00:08:20 +0000]
memblock: add NO_BOOTMEM config symbol

With the NO_BOOTMEM symbol added architectures may now use the following
syntax to tell that they do not need bootmem:

select NO_BOOTMEM

This is much more convinient than adding a new kconfig symbol which was
otherwise required.

Adding this symbol does not conflict with the architctures that already
define their own symbol.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomemblock: add memblock_start_of_DRAM()
Sam Ravnborg [Tue, 1 Nov 2011 00:08:16 +0000]
memblock: add memblock_start_of_DRAM()

SPARC32 require access to the start address.  Add a new helper
memblock_start_of_DRAM() to give access to the address of the first
memblock - which contains the lowest address.

The awkward name was chosen to match the already present
memblock_end_of_DRAM().

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Yinghai Lu <yinghai@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: avoid null pointer access in vm_struct via /proc/vmallocinfo
Mitsuo Hayasaka [Tue, 1 Nov 2011 00:08:13 +0000]
mm: avoid null pointer access in vm_struct via /proc/vmallocinfo

The /proc/vmallocinfo shows information about vmalloc allocations in
vmlist that is a linklist of vm_struct.  It, however, may access pages
field of vm_struct where a page was not allocated.  This results in a null
pointer access and leads to a kernel panic.

Why this happens: In __vmalloc_node_range() called from vmalloc(), newly
allocated vm_struct is added to vmlist at __get_vm_area_node() and then,
some fields of vm_struct such as nr_pages and pages are set at
__vmalloc_area_node().  In other words, it is added to vmlist before it is
fully initialized.  At the same time, when the /proc/vmallocinfo is read,
it accesses the pages field of vm_struct according to the nr_pages field
at show_numa_info().  Thus, a null pointer access happens.

The patch adds the newly allocated vm_struct to the vmlist *after* it is
fully initialized.  So, it can avoid accessing the pages field with
unallocated page when show_numa_info() is called.

Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm/debug-pagealloc.c: use memchr_inv
Akinobu Mita [Tue, 1 Nov 2011 00:08:10 +0000]
mm/debug-pagealloc.c: use memchr_inv

Use newly introduced memchr_inv() for page verification.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agolib/string.c: introduce memchr_inv()
Akinobu Mita [Tue, 1 Nov 2011 00:08:07 +0000]
lib/string.c: introduce memchr_inv()

memchr_inv() is mainly used to check whether the whole buffer is filled
with just a specified byte.

The function name and prototype are stolen from logfs and the
implementation is from SLUB.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Acked-by: Joern Engel <joern@logfs.org>
Cc: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm/debug-pagealloc.c: use plain __ratelimit() instead of printk_ratelimit()
Akinobu Mita [Tue, 1 Nov 2011 00:08:05 +0000]
mm/debug-pagealloc.c: use plain __ratelimit() instead of printk_ratelimit()

printk_ratelimit() should not be used, because it shares ratelimiting
state with all other unrelated printk_ratelimit() callsites.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agovmscan: count pages into balanced for zone with good watermark
Shaohua Li [Tue, 1 Nov 2011 00:08:02 +0000]
vmscan: count pages into balanced for zone with good watermark

It's possible a zone watermark is ok when entering the balance_pgdat()
loop, while the zone is within the requested classzone_idx.  Count pages
from this zone into `balanced'.  In this way, we can skip shrinking zones
too much for high order allocation.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: vmscan: immediately reclaim end-of-LRU dirty pages when writeback completes
Mel Gorman [Tue, 1 Nov 2011 00:07:59 +0000]
mm: vmscan: immediately reclaim end-of-LRU dirty pages when writeback completes

When direct reclaim encounters a dirty page, it gets recycled around the
LRU for another cycle.  This patch marks the page PageReclaim similar to
deactivate_page() so that the page gets reclaimed almost immediately after
the page gets cleaned.  This is to avoid reclaiming clean pages that are
younger than a dirty page encountered at the end of the LRU that might
have been something like a use-once page.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: vmscan: throttle reclaim if encountering too many dirty pages under writeback
Mel Gorman [Tue, 1 Nov 2011 00:07:56 +0000]
mm: vmscan: throttle reclaim if encountering too many dirty pages under writeback

Workloads that are allocating frequently and writing files place a large
number of dirty pages on the LRU.  With use-once logic, it is possible for
them to reach the end of the LRU quickly requiring the reclaimer to scan
more to find clean pages.  Ordinarily, processes that are dirtying memory
will get throttled by dirty balancing but this is a global heuristic and
does not take into account that LRUs are maintained on a per-zone basis.
This can lead to a situation whereby reclaim is scanning heavily, skipping
over a large number of pages under writeback and recycling them around the
LRU consuming CPU.

This patch checks how many of the number of pages isolated from the LRU
were dirty and under writeback.  If a percentage of them under writeback,
the process will be throttled if a backing device or the zone is
congested.  Note that this applies whether it is anonymous or file-backed
pages that are under writeback meaning that swapping is potentially
throttled.  This is intentional due to the fact if the swap device is
congested, scanning more pages and dispatching more IO is not going to
help matters.

The percentage that must be in writeback depends on the priority.  At
default priority, all of them must be dirty.  At DEF_PRIORITY-1, 50% of
them must be, DEF_PRIORITY-2, 25% etc.  i.e.  as pressure increases the
greater the likelihood the process will get throttled to allow the flusher
threads to make some progress.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: vmscan: do not writeback filesystem pages in kswapd except in high priority
Mel Gorman [Tue, 1 Nov 2011 00:07:51 +0000]
mm: vmscan: do not writeback filesystem pages in kswapd except in high priority

It is preferable that no dirty pages are dispatched for cleaning from the
page reclaim path.  At normal priorities, this patch prevents kswapd
writing pages.

However, page reclaim does have a requirement that pages be freed in a
particular zone.  If it is failing to make sufficient progress (reclaiming
< SWAP_CLUSTER_MAX at any priority priority), the priority is raised to
scan more pages.  A priority of DEF_PRIORITY - 3 is considered to be the
point where kswapd is getting into trouble reclaiming pages.  If this
priority is reached, kswapd will dispatch pages for writing.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoext4: warn if direct reclaim tries to writeback pages
Mel Gorman [Tue, 1 Nov 2011 00:07:48 +0000]
ext4: warn if direct reclaim tries to writeback pages

Direct reclaim should never writeback pages.  Warn if an attempt is made.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoxfs: warn if direct reclaim tries to writeback pages
Mel Gorman [Tue, 1 Nov 2011 00:07:45 +0000]
xfs: warn if direct reclaim tries to writeback pages

Direct reclaim should never writeback pages.  For now, handle the
situation and warn about it.  Ultimately, this will be a BUG_ON.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: vmscan: remove dead code related to lumpy reclaim waiting on pages under writeback
Mel Gorman [Tue, 1 Nov 2011 00:07:42 +0000]
mm: vmscan: remove dead code related to lumpy reclaim waiting on pages under writeback

Lumpy reclaim worked with two passes - the first which queued pages for IO
and the second which waited on writeback.  As direct reclaim can no longer
write pages there is some dead code.  This patch removes it but direct
reclaim will continue to wait on pages under writeback while in
synchronous reclaim mode.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: vmscan: do not writeback filesystem pages in direct reclaim
Mel Gorman [Tue, 1 Nov 2011 00:07:38 +0000]
mm: vmscan: do not writeback filesystem pages in direct reclaim

Testing from the XFS folk revealed that there is still too much I/O from
the end of the LRU in kswapd.  Previously it was considered acceptable by
VM people for a small number of pages to be written back from reclaim with
testing generally showing about 0.3% of pages reclaimed were written back
(higher if memory was low).  That writing back a small number of pages is
ok has been heavily disputed for quite some time and Dave Chinner
explained it well;

It doesn't have to be a very high number to be a problem. IO
is orders of magnitude slower than the CPU time it takes to
flush a page, so the cost of making a bad flush decision is
very high. And single page writeback from the LRU is almost
always a bad flush decision.

To complicate matters, filesystems respond very differently to requests
from reclaim according to Christoph Hellwig;

xfs tries to write it back if the requester is kswapd
ext4 ignores the request if it's a delayed allocation
btrfs ignores the request

As a result, each filesystem has different performance characteristics
when under memory pressure and there are many pages being dirtied.  In
some cases, the request is ignored entirely so the VM cannot depend on the
IO being dispatched.

The objective of this series is to reduce writing of filesystem-backed
pages from reclaim, play nicely with writeback that is already in progress
and throttle reclaim appropriately when writeback pages are encountered.
The assumption is that the flushers will always write pages faster than if
reclaim issues the IO.

A secondary goal is to avoid the problem whereby direct reclaim splices
two potentially deep call stacks together.

There is a potential new problem as reclaim has less control over how long
before a page in a particularly zone or container is cleaned and direct
reclaimers depend on kswapd or flusher threads to do the necessary work.
However, as filesystems sometimes ignore direct reclaim requests already,
it is not expected to be a serious issue.

Patch 1 disables writeback of filesystem pages from direct reclaim
entirely. Anonymous pages are still written.

Patch 2 removes dead code in lumpy reclaim as it is no longer able
to synchronously write pages. This hurts lumpy reclaim but
there is an expectation that compaction is used for hugepage
allocations these days and lumpy reclaim's days are numbered.

Patches 3-4 add warnings to XFS and ext4 if called from
direct reclaim. With patch 1, this "never happens" and is
intended to catch regressions in this logic in the future.

Patch 5 disables writeback of filesystem pages from kswapd unless
the priority is raised to the point where kswapd is considered
to be in trouble.

Patch 6 throttles reclaimers if too many dirty pages are being
encountered and the zones or backing devices are congested.

Patch 7 invalidates dirty pages found at the end of the LRU so they
are reclaimed quickly after being written back rather than
waiting for a reclaimer to find them

I consider this series to be orthogonal to the writeback work but it is
worth noting that the writeback work affects the viability of patch 8 in
particular.

I tested this on ext4 and xfs using fs_mark, a simple writeback test based
on dd and a micro benchmark that does a streaming write to a large mapping
(exercises use-once LRU logic) followed by streaming writes to a mix of
anonymous and file-backed mappings.  The command line for fs_mark when
botted with 512M looked something like

./fs_mark -d  /tmp/fsmark-2676  -D  100  -N  150  -n  150  -L  25  -t  1  -S0  -s  10485760

The number of files was adjusted depending on the amount of available
memory so that the files created was about 3xRAM.  For multiple threads,
the -d switch is specified multiple times.

The test machine is x86-64 with an older generation of AMD processor with
4 cores.  The underlying storage was 4 disks configured as RAID-0 as this
was the best configuration of storage I had available.  Swap is on a
separate disk.  Dirty ratio was tuned to 40% instead of the default of
20%.

Testing was run with and without monitors to both verify that the patches
were operating as expected and that any performance gain was real and not
due to interference from monitors.

Here is a summary of results based on testing XFS.

512M1P-xfs           Files/s  mean                 32.69 ( 0.00%)     34.44 ( 5.08%)
512M1P-xfs           Elapsed Time fsmark                    51.41     48.29
512M1P-xfs           Elapsed Time simple-wb                114.09    108.61
512M1P-xfs           Elapsed Time mmap-strm                113.46    109.34
512M1P-xfs           Kswapd efficiency fsmark                 62%       63%
512M1P-xfs           Kswapd efficiency simple-wb              56%       61%
512M1P-xfs           Kswapd efficiency mmap-strm              44%       42%
512M-xfs             Files/s  mean                 30.78 ( 0.00%)     35.94 (14.36%)
512M-xfs             Elapsed Time fsmark                    56.08     48.90
512M-xfs             Elapsed Time simple-wb                112.22     98.13
512M-xfs             Elapsed Time mmap-strm                219.15    196.67
512M-xfs             Kswapd efficiency fsmark                 54%       56%
512M-xfs             Kswapd efficiency simple-wb              54%       55%
512M-xfs             Kswapd efficiency mmap-strm              45%       44%
512M-4X-xfs          Files/s  mean                 30.31 ( 0.00%)     33.33 ( 9.06%)
512M-4X-xfs          Elapsed Time fsmark                    63.26     55.88
512M-4X-xfs          Elapsed Time simple-wb                100.90     90.25
512M-4X-xfs          Elapsed Time mmap-strm                261.73    255.38
512M-4X-xfs          Kswapd efficiency fsmark                 49%       50%
512M-4X-xfs          Kswapd efficiency simple-wb              54%       56%
512M-4X-xfs          Kswapd efficiency mmap-strm              37%       36%
512M-16X-xfs         Files/s  mean                 60.89 ( 0.00%)     65.22 ( 6.64%)
512M-16X-xfs         Elapsed Time fsmark                    67.47     58.25
512M-16X-xfs         Elapsed Time simple-wb                103.22     90.89
512M-16X-xfs         Elapsed Time mmap-strm                237.09    198.82
512M-16X-xfs         Kswapd efficiency fsmark                 45%       46%
512M-16X-xfs         Kswapd efficiency simple-wb              53%       55%
512M-16X-xfs         Kswapd efficiency mmap-strm              33%       33%

Up until 512-4X, the FSmark improvements were statistically significant.
For the 4X and 16X tests the results were within standard deviations but
just barely.  The time to completion for all tests is improved which is an
important result.  In general, kswapd efficiency is not affected by
skipping dirty pages.

1024M1P-xfs          Files/s  mean                 39.09 ( 0.00%)     41.15 ( 5.01%)
1024M1P-xfs          Elapsed Time fsmark                    84.14     80.41
1024M1P-xfs          Elapsed Time simple-wb                210.77    184.78
1024M1P-xfs          Elapsed Time mmap-strm                162.00    160.34
1024M1P-xfs          Kswapd efficiency fsmark                 69%       75%
1024M1P-xfs          Kswapd efficiency simple-wb              71%       77%
1024M1P-xfs          Kswapd efficiency mmap-strm              43%       44%
1024M-xfs            Files/s  mean                 35.45 ( 0.00%)     37.00 ( 4.19%)
1024M-xfs            Elapsed Time fsmark                    94.59     91.00
1024M-xfs            Elapsed Time simple-wb                229.84    195.08
1024M-xfs            Elapsed Time mmap-strm                405.38    440.29
1024M-xfs            Kswapd efficiency fsmark                 79%       71%
1024M-xfs            Kswapd efficiency simple-wb              74%       74%
1024M-xfs            Kswapd efficiency mmap-strm              39%       42%
1024M-4X-xfs         Files/s  mean                 32.63 ( 0.00%)     35.05 ( 6.90%)
1024M-4X-xfs         Elapsed Time fsmark                   103.33     97.74
1024M-4X-xfs         Elapsed Time simple-wb                204.48    178.57
1024M-4X-xfs         Elapsed Time mmap-strm                528.38    511.88
1024M-4X-xfs         Kswapd efficiency fsmark                 81%       70%
1024M-4X-xfs         Kswapd efficiency simple-wb              73%       72%
1024M-4X-xfs         Kswapd efficiency mmap-strm              39%       38%
1024M-16X-xfs        Files/s  mean                 42.65 ( 0.00%)     42.97 ( 0.74%)
1024M-16X-xfs        Elapsed Time fsmark                   103.11     99.11
1024M-16X-xfs        Elapsed Time simple-wb                200.83    178.24
1024M-16X-xfs        Elapsed Time mmap-strm                397.35    459.82
1024M-16X-xfs        Kswapd efficiency fsmark                 84%       69%
1024M-16X-xfs        Kswapd efficiency simple-wb              74%       73%
1024M-16X-xfs        Kswapd efficiency mmap-strm              39%       40%

All FSMark tests up to 16X had statistically significant improvements.
For the most part, tests are completing faster with the exception of the
streaming writes to a mixture of anonymous and file-backed mappings which
were slower in two cases

In the cases where the mmap-strm tests were slower, there was more
swapping due to dirty pages being skipped.  The number of additional pages
swapped is almost identical to the fewer number of pages written from
reclaim.  In other words, roughly the same number of pages were reclaimed
but swapping was slower.  As the test is a bit unrealistic and stresses
memory heavily, the small shift is acceptable.

4608M1P-xfs          Files/s  mean                 29.75 ( 0.00%)     30.96 ( 3.91%)
4608M1P-xfs          Elapsed Time fsmark                   512.01    492.15
4608M1P-xfs          Elapsed Time simple-wb                618.18    566.24
4608M1P-xfs          Elapsed Time mmap-strm                488.05    465.07
4608M1P-xfs          Kswapd efficiency fsmark                 93%       86%
4608M1P-xfs          Kswapd efficiency simple-wb              88%       84%
4608M1P-xfs          Kswapd efficiency mmap-strm              46%       45%
4608M-xfs            Files/s  mean                 27.60 ( 0.00%)     28.85 ( 4.33%)
4608M-xfs            Elapsed Time fsmark                   555.96    532.34
4608M-xfs            Elapsed Time simple-wb                659.72    571.85
4608M-xfs            Elapsed Time mmap-strm               1082.57   1146.38
4608M-xfs            Kswapd efficiency fsmark                 89%       91%
4608M-xfs            Kswapd efficiency simple-wb              88%       82%
4608M-xfs            Kswapd efficiency mmap-strm              48%       46%
4608M-4X-xfs         Files/s  mean                 26.00 ( 0.00%)     27.47 ( 5.35%)
4608M-4X-xfs         Elapsed Time fsmark                   592.91    564.00
4608M-4X-xfs         Elapsed Time simple-wb                616.65    575.07
4608M-4X-xfs         Elapsed Time mmap-strm               1773.02   1631.53
4608M-4X-xfs         Kswapd efficiency fsmark                 90%       94%
4608M-4X-xfs         Kswapd efficiency simple-wb              87%       82%
4608M-4X-xfs         Kswapd efficiency mmap-strm              43%       43%
4608M-16X-xfs        Files/s  mean                 26.07 ( 0.00%)     26.42 ( 1.32%)
4608M-16X-xfs        Elapsed Time fsmark                   602.69    585.78
4608M-16X-xfs        Elapsed Time simple-wb                606.60    573.81
4608M-16X-xfs        Elapsed Time mmap-strm               1549.75   1441.86
4608M-16X-xfs        Kswapd efficiency fsmark                 98%       98%
4608M-16X-xfs        Kswapd efficiency simple-wb              88%       82%
4608M-16X-xfs        Kswapd efficiency mmap-strm              44%       42%

Unlike the other tests, the fsmark results are not statistically
significant but the min and max times are both improved and for the most
part, tests completed faster.

There are other indications that this is an improvement as well.  For
example, in the vast majority of cases, there were fewer pages scanned by
direct reclaim implying in many cases that stalls due to direct reclaim
are reduced.  KSwapd is scanning more due to skipping dirty pages which is
unfortunate but the CPU usage is still acceptable

In an earlier set of tests, I used blktrace and in almost all cases
throughput throughout the entire test was higher.  However, I ended up
discarding those results as recording blktrace data was too heavy for my
liking.

On a laptop, I plugged in a USB stick and ran a similar tests of tests
using it as backing storage.  A desktop environment was running and for
the entire duration of the tests, firefox and gnome terminal were
launching and exiting to vaguely simulate a user.

1024M-xfs            Files/s  mean               0.41 ( 0.00%)        0.44 ( 6.82%)
1024M-xfs            Elapsed Time fsmark               2053.52   1641.03
1024M-xfs            Elapsed Time simple-wb            1229.53    768.05
1024M-xfs            Elapsed Time mmap-strm            4126.44   4597.03
1024M-xfs            Kswapd efficiency fsmark              84%       85%
1024M-xfs            Kswapd efficiency simple-wb           92%       81%
1024M-xfs            Kswapd efficiency mmap-strm           60%       51%
1024M-xfs            Avg wait ms fsmark                5404.53     4473.87
1024M-xfs            Avg wait ms simple-wb             2541.35     1453.54
1024M-xfs            Avg wait ms mmap-strm             3400.25     3852.53

The mmap-strm results were hurt because firefox launching had a tendency
to push the test out of memory.  On the postive side, firefox launched
marginally faster with the patches applied.  Time to completion for many
tests was faster but more importantly - the "Avg wait" time as measured by
iostat was far lower implying the system would be more responsive.  It was
also the case that "Avg wait ms" on the root filesystem was lower.  I
tested it manually and while the system felt slightly more responsive
while copying data to a USB stick, it was marginal enough that it could be
my imagination.

This patch: do not writeback filesystem pages in direct reclaim.

When kswapd is failing to keep zones above the min watermark, a process
will enter direct reclaim in the same manner kswapd does.  If a dirty page
is encountered during the scan, this page is written to backing storage
using mapping->writepage.

This causes two problems.  First, it can result in very deep call stacks,
particularly if the target storage or filesystem are complex.  Some
filesystems ignore write requests from direct reclaim as a result.  The
second is that a single-page flush is inefficient in terms of IO.  While
there is an expectation that the elevator will merge requests, this does
not always happen.  Quoting Christoph Hellwig;

The elevator has a relatively small window it can operate on,
and can never fix up a bad large scale writeback pattern.

This patch prevents direct reclaim writing back filesystem pages by
checking if current is kswapd.  Anonymous pages are still written to swap
as there is not the equivalent of a flusher thread for anonymous pages.
If the dirty pages cannot be written back, they are placed back on the LRU
lists.  There is now a direct dependency on dirty page balancing to
prevent too many pages in the system being dirtied which would prevent
reclaim making forward progress.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: add comments to explain mm_struct fields
Christoph Lameter [Tue, 1 Nov 2011 00:07:34 +0000]
mm: add comments to explain mm_struct fields

Add comments to explain the page statistics field in the mm_struct.

[akpm@linux-foundation.org: add missing ;]
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: distinguish between mlocked and pinned pages
Christoph Lameter [Tue, 1 Nov 2011 00:07:30 +0000]
mm: distinguish between mlocked and pinned pages

Some kernel components pin user space memory (infiniband and perf) (by
increasing the page count) and account that memory as "mlocked".

The difference between mlocking and pinning is:

A. mlocked pages are marked with PG_mlocked and are exempt from
   swapping. Page migration may move them around though.
   They are kept on a special LRU list.

B. Pinned pages cannot be moved because something needs to
   directly access physical memory. They may not be on any
   LRU list.

I recently saw an mlockalled process where mm->locked_vm became
bigger than the virtual size of the process (!) because some
memory was accounted for twice:

Once when the page was mlocked and once when the Infiniband
layer increased the refcount because it needt to pin the RDMA
memory.

This patch introduces a separate counter for pinned pages and
accounts them seperately.

Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Mike Marciniszyn <infinipath@qlogic.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: vmscan: drop nr_force_scan[] from get_scan_count
Johannes Weiner [Tue, 1 Nov 2011 00:07:27 +0000]
mm: vmscan: drop nr_force_scan[] from get_scan_count

The nr_force_scan[] tuple holds the effective scan numbers for anon and
file pages in case the situation called for a forced scan and the
regularly calculated scan numbers turned out zero.

However, the effective scan number can always be assumed to be
SWAP_CLUSTER_MAX right before the division into anon and file.  The
numerators and denominator are properly set up for all cases, be it force
scan for just file, just anon, or both, to do the right thing.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: output a list of loaded modules when we hit bad_page()
Dave Jones [Tue, 1 Nov 2011 00:07:24 +0000]
mm: output a list of loaded modules when we hit bad_page()

When we get a bad_page bug report, it's useful to see what modules the
user had loaded.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agotmpfs: add "tmpfs" to the Kconfig prompt to make it obvious.
Robert P. J. Day [Tue, 1 Nov 2011 00:07:21 +0000]
tmpfs: add "tmpfs" to the Kconfig prompt to make it obvious.

Add the leading word "tmpfs" to the Kconfig string to make it blindingly
obvious that this selection refers to tmpfs.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agooom: fix race while temporarily setting current's oom_score_adj
David Rientjes [Tue, 1 Nov 2011 00:07:18 +0000]
oom: fix race while temporarily setting current's oom_score_adj

test_set_oom_score_adj() was introduced in 72788c385604 ("oom: replace
PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate
current's oom_score_adj for ksm and swapoff without requiring an
additional per-process flag.

Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
then reinstate the previous value is racy since it's possible that
userspace can set the value to something else itself before the old value
is reinstated.  That results in userspace setting current's oom_score_adj
to a different value and then the kernel immediately setting it back to
its previous value without notification.

To fix this, a new compare_swap_oom_score_adj() function is introduced
with the same semantics as the compare and swap CAS instruction, or
CMPXCHG on x86.  It is used to reinstate the previous value of
oom_score_adj if and only if the present value is the same as the old
value.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agooom: remove oom_disable_count
David Rientjes [Tue, 1 Nov 2011 00:07:15 +0000]
oom: remove oom_disable_count

This removes mm->oom_disable_count entirely since it's unnecessary and
currently buggy.  The counter was intended to be per-process but it's
currently decremented in the exit path for each thread that exits, causing
it to underflow.

The count was originally intended to prevent oom killing threads that
share memory with threads that cannot be killed since it doesn't lead to
future memory freeing.  The counter could be fixed to represent all
threads sharing the same mm, but it's better to remove the count since:

 - it is possible that the OOM_DISABLE thread sharing memory with the
   victim is waiting on that thread to exit and will actually cause
   future memory freeing, and

 - there is no guarantee that a thread is disabled from oom killing just
   because another thread sharing its mm is oom disabled.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agooom: avoid killing kthreads if they assume the oom killed thread's mm
David Rientjes [Tue, 1 Nov 2011 00:07:11 +0000]
oom: avoid killing kthreads if they assume the oom killed thread's mm

After selecting a task to kill, the oom killer iterates all processes and
kills all other threads that share the same mm_struct in different thread
groups.  It would not otherwise be helpful to kill a thread if its memory
would not be subsequently freed.

A kernel thread, however, may assume a user thread's mm by using
use_mm().  This is only temporary and should not result in sending a
SIGKILL to that kthread.

This patch ensures that only user threads and not kthreads are sent a
SIGKILL if they share the same mm_struct as the oom killed task.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agooom: thaw threads if oom killed thread is frozen before deferring
David Rientjes [Tue, 1 Nov 2011 00:07:07 +0000]
oom: thaw threads if oom killed thread is frozen before deferring

If a thread has been oom killed and is frozen, thaw it before returning to
the page allocator.  Otherwise, it can stay frozen indefinitely and no
memory will be freed.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm/page-writeback.c: document bdi_min_ratio
Johannes Weiner [Tue, 1 Nov 2011 00:07:05 +0000]
mm/page-writeback.c: document bdi_min_ratio

Looks like someone got distracted after adding the comment characters.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agovmscan: add block plug for page reclaim
Shaohua Li [Tue, 1 Nov 2011 00:07:03 +0000]
vmscan: add block plug for page reclaim

per-task block plug can reduce block queue lock contention and increase
request merge.  Currently page reclaim doesn't support it.  I originally
thought page reclaim doesn't need it, because kswapd thread count is
limited and file cache write is done at flusher mostly.

When I test a workload with heavy swap in a 4-node machine, each CPU is
doing direct page reclaim and swap.  This causes block queue lock
contention.  In my test, without below patch, the CPU utilization is about
2% ~ 7%.  With the patch, the CPU utilization is about 1% ~ 3%.  Disk
throughput isn't changed.  This should improve normal kswapd write and
file cache write too (increase request merge for example), but might not
be so obvious as I explain above.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoradix_tree: clean away saw_unset_tag leftovers
Hugh Dickins [Tue, 1 Nov 2011 00:07:02 +0000]
radix_tree: clean away saw_unset_tag leftovers

radix_tree_tag_get()'s BUG (when it sees a tag after saw_unset_tag) was
unsafe and removed in 2.6.34, but the pointless saw_unset_tag left behind.

Remove it now, and return 0 as soon as we see unset tag - we already rely
upon the root tag to be correct, returning 0 immediately if it's not set.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: migration: clean up unmap_and_move()
Minchan Kim [Tue, 1 Nov 2011 00:06:57 +0000]
mm: migration: clean up unmap_and_move()

unmap_and_move() is one a big messy function.  Clean it up.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: zone_reclaim: make isolate_lru_page() filter-aware
Minchan Kim [Tue, 1 Nov 2011 00:06:55 +0000]
mm: zone_reclaim: make isolate_lru_page() filter-aware

In __zone_reclaim case, we don't want to shrink mapped page.  Nonetheless,
we have isolated mapped page and re-add it into LRU's head.  It's
unnecessary CPU overhead and makes LRU churning.

Of course, when we isolate the page, the page might be mapped but when we
try to migrate the page, the page would be not mapped.  So it could be
migrated.  But race is rare and although it happens, it's no big deal.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: compaction: make isolate_lru_page() filter-aware
Minchan Kim [Tue, 1 Nov 2011 00:06:51 +0000]
mm: compaction: make isolate_lru_page() filter-aware

In async mode, compaction doesn't migrate dirty or writeback pages.  So,
it's meaningless to pick the page and re-add it to lru list.

Of course, when we isolate the page in compaction, the page might be dirty
or writeback but when we try to migrate the page, the page would be not
dirty, writeback.  So it could be migrated.  But it's very unlikely as
isolate and migration cycle is much faster than writeout.

So, this patch helps cpu overhead and prevent unnecessary LRU churning.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: change isolate mode from #define to bitwise type
Minchan Kim [Tue, 1 Nov 2011 00:06:47 +0000]
mm: change isolate mode from #define to bitwise type

Change ISOLATE_XXX macro with bitwise isolate_mode_t type.  Normally,
macro isn't recommended as it's type-unsafe and making debugging harder as
symbol cannot be passed throught to the debugger.

Quote from Johannes
" Hmm, it would probably be cleaner to fully convert the isolation mode
into independent flags.  INACTIVE, ACTIVE, BOTH is currently a
tri-state among flags, which is a bit ugly."

This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomm: compaction: trivial clean up in acct_isolated()
Minchan Kim [Tue, 1 Nov 2011 00:06:44 +0000]
mm: compaction: trivial clean up in acct_isolated()

acct_isolated of compaction uses page_lru_base_type which returns only
base type of LRU list so it never returns LRU_ACTIVE_ANON or
LRU_ACTIVE_FILE.  In addtion, cc->nr_[anon|file] is used in only
acct_isolated so it doesn't have fields in conpact_control.

This patch removes fields from compact_control and makes clear function of
acct_issolated which counts the number of anon|file pages isolated.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoCross Memory Attach
Christopher Yeoh [Tue, 1 Nov 2011 00:06:39 +0000]
Cross Memory Attach

The basic idea behind cross memory attach is to allow MPI programs doing
intra-node communication to do a single copy of the message rather than a
double copy of the message via shared memory.

The following patch attempts to achieve this by allowing a destination
process, given an address and size from a source process, to copy memory
directly from the source process into its own address space via a system
call.  There is also a symmetrical ability to copy from the current
process's address space into a destination process's address space.

- Use of /proc/pid/mem has been considered, but there are issues with
  using it:
  - Does not allow for specifying iovecs for both src and dest, assuming
    preadv or pwritev was implemented either the area read from or
  written to would need to be contiguous.
  - Currently mem_read allows only processes who are currently
  ptrace'ing the target and are still able to ptrace the target to read
  from the target. This check could possibly be moved to the open call,
  but its not clear exactly what race this restriction is stopping
  (reason  appears to have been lost)
  - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
  domain socket is a bit ugly from a userspace point of view,
  especially when you may have hundreds if not (eventually) thousands
  of processes  that all need to do this with each other
  - Doesn't allow for some future use of the interface we would like to
  consider adding in the future (see below)
  - Interestingly reading from /proc/pid/mem currently actually
  involves two copies! (But this could be fixed pretty easily)

As mentioned previously use of vmsplice instead was considered, but has
problems.  Since you need the reader and writer working co-operatively if
the pipe is not drained then you block.  Which requires some wrapping to
do non blocking on the send side or polling on the receive.  In all to all
communication it requires ordering otherwise you can deadlock.  And in the
example of many MPI tasks writing to one MPI task vmsplice serialises the
copying.

There are some cases of MPI collectives where even a single copy interface
does not get us the performance gain we could.  For example in an
MPI_Reduce rather than copy the data from the source we would like to
instead use it directly in a mathops (say the reduce is doing a sum) as
this would save us doing a copy.  We don't need to keep a copy of the data
from the source.  I haven't implemented this, but I think this interface
could in the future do all this through the use of the flags - eg could
specify the math operation and type and the kernel rather than just
copying the data would apply the specified operation between the source
and destination and store it in the destination.

Although we don't have a "second user" of the interface (though I've had
some nibbles from people who may be interested in using it for intra
process messaging which is not MPI).  This interface is something which
hardware vendors are already doing for their custom drivers to implement
fast local communication.  And so in addition to this being useful for
OpenMPI it would mean the driver maintainers don't have to fix things up
when the mm changes.

There was some discussion about how much faster a true zero copy would
go. Here's a link back to the email with some testing I did on that:

http://marc.info/?l=linux-mm&m=130105930902915&w=2

There is a basic man page for the proposed interface here:

http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt

This has been implemented for x86 and powerpc, other architecture should
mainly (I think) just need to add syscall numbers for the process_vm_readv
and process_vm_writev. There are 32 bit compatibility versions for
64-bit kernels.

For arch maintainers there are some simple tests to be able to quickly
verify that the syscalls are working correctly here:

http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <linux-man@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoipc/mqueue.c: fix wrong use of schedule_hrtimeout_range_clock()
Wanlong Gao [Tue, 1 Nov 2011 00:06:35 +0000]
ipc/mqueue.c: fix wrong use of schedule_hrtimeout_range_clock()

Fix the wrong use of schedule_hrtimeout_range_clock() in wq_sleep(),
although it is harmless for the syscall mq_timed* now.  It was introduced
by 9ca7d8e ("mqueue: Convert message queue timeout to use hrtimers").

Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Cc: Carsten Emde <C.Emde@osadl.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years ago/proc/self/numa_maps: restore "huge" tag for hugetlb vmas
Andrew Morton [Tue, 1 Nov 2011 00:06:32 +0000]
/proc/self/numa_maps: restore "huge" tag for hugetlb vmas

The display of the "huge" tag was accidentally removed in 29ea2f698 ("mm:
use walk_page_range() instead of custom page table walking code").

Reported-by: Stephen Hemminger <shemminger@vyatta.com>
Tested-by: Stephen Hemminger <shemminger@vyatta.com>
Reviewed-by: Stephen Wilson <wilsons@start.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoinclude/linux/dmar.h: forward-declare struct acpi_dmar_header
Andrew Morton [Tue, 1 Nov 2011 00:06:29 +0000]
include/linux/dmar.h: forward-declare struct acpi_dmar_header

x86_64 allnoconfig:

In file included from arch/x86/kernel/pci-dma.c:3:
include/linux/dmar.h:248: warning: 'struct acpi_dmar_header' declared inside parameter list
include/linux/dmar.h:248: warning: its scope is only this definition or declaration, which is probably not what you want

Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agodma-mapping: fix sync_single_range_* DMA debugging
Clemens Ladisch [Tue, 1 Nov 2011 00:06:28 +0000]
dma-mapping: fix sync_single_range_* DMA debugging

Commit 5fd75a7850b5 (dma-mapping: remove unnecessary sync_single_range_*
in dma_map_ops) unified not only the dma_map_ops but also the
corresponding debug_dma_sync_* calls.  This led to spurious WARN()ings
like the following because the DMA debug code was no longer able to detect
the DMA buffer base address without the separate offset parameter:

WARNING: at lib/dma-debug.c:911 check_sync+0xce/0x446()
firewire_ohci 0000:04:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x00000000cedaa400] [size=1024 bytes]
Call Trace: ...
 [<ffffffff811326a5>] check_sync+0xce/0x446
 [<ffffffff81132ad9>] debug_dma_sync_single_for_device+0x39/0x3b
 [<ffffffffa01d6e6a>] ohci_queue_iso+0x4f3/0x77d [firewire_ohci]
 ...

To fix this, unshare the sync_single_* and sync_single_range_*
implementations so that we are able to call the correct debug_dma_sync_*
functions.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Mon, 31 Oct 2011 22:22:44 +0000]
Merge git://git./linux/kernel/git/davem/net

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (27 commits)
  vlan: allow nested vlan_do_receive()
  ipv6: fix route lookup in addrconf_prefix_rcv()
  bonding: eliminate bond_close race conditions
  qlcnic: fix beacon and LED test.
  qlcnic: Updated License file
  qlcnic: updated reset sequence
  qlcnic: reset loopback mode if promiscous mode setting fails.
  qlcnic: skip IDC ack check in fw reset path.
  i825xx: Fix incorrect dependency for BVME6000_NET
  ipv6: fix route error binding peer in func icmp6_dst_alloc
  ipv6: fix error propagation in ip6_ufo_append_data()
  stmmac: update normal descriptor structure (v2)
  stmmac: fix NULL pointer dereference in capabilities fixup (v2)
  stmmac: fix a bug while checking the HW cap reg (v2)
  be2net: Changing MAC Address of a VF was broken.
  be2net: Refactored be_cmds.c file.
  bnx2x: update driver version to 1.70.30-0
  bnx2x: use FW 7.0.29.0
  bnx2x: Enable changing speed when port type is PORT_DA
  bnx2x: Fix 54618se LED behavior
  ...

8 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
Linus Torvalds [Mon, 31 Oct 2011 22:22:16 +0000]
Merge git://git./linux/kernel/git/davem/sparc

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc64: Fix masking and shifting in VIS fpcmp emulation.
  sparc32: Correct the return value of memcpy.
  sparc32: Remove uses of %g7 in memcpy implementation.
  sparc32: Remove non-kernel code from memcpy implementation.

8 years agoMerge branch 'for-linus' of git://neil.brown.name/md
Linus Torvalds [Mon, 31 Oct 2011 22:21:29 +0000]
Merge branch 'for-linus' of git://neil.brown.name/md

* 'for-linus' of git://neil.brown.name/md:
  md/raid10:  Fix bug when activating a hot-spare.

8 years agosparc64: Fix masking and shifting in VIS fpcmp emulation.
David S. Miller [Mon, 31 Oct 2011 08:05:49 +0000]
sparc64: Fix masking and shifting in VIS fpcmp emulation.

Signed-off-by: David S. Miller <davem@davemloft.net>

8 years agomd/raid10: Fix bug when activating a hot-spare.
NeilBrown [Mon, 31 Oct 2011 01:59:44 +0000]
md/raid10:  Fix bug when activating a hot-spare.

This is a fairly serious bug in RAID10.

When a RAID10 array is degraded and a hot-spare is activated, the
spare does not take up the empty slot, but rather replaces the first
working device.
This is likely to make the array non-functional.   It would normally
be possible to recover the data, but that would need care and is not
guaranteed.

This bug was introduced in commit
   2bb77736ae5dca0a189829fbb7379d43364a9dac
which first appeared in 3.1.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>

8 years agoMerge branch 'i2c-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvar...
Linus Torvalds [Sun, 30 Oct 2011 22:54:59 +0000]
Merge branch 'i2c-for-linus' of git://git./linux/kernel/git/jdelvare/staging

* 'i2c-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
  i2c: Functions for byte-swapped smbus_write/read_word_data
  i2c-algo-pca: Return standard fault codes
  i2c-algo-bit: Return standard fault codes
  i2c-algo-bit: Be verbose on bus testing failure
  i2c-algo-bit: Let user test buses without failing
  i2c/scx200_acb: Fix section mismatch warning in scx200_pci_drv
  i2c: I2C_ELEKTOR should depend on HAS_IOPORT

8 years agoMerge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Linus Torvalds [Sun, 30 Oct 2011 22:46:19 +0000]
Merge branch 'next' of git://git./linux/kernel/git/joro/iommu

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (33 commits)
  iommu/core: Remove global iommu_ops and register_iommu
  iommu/msm: Use bus_set_iommu instead of register_iommu
  iommu/omap: Use bus_set_iommu instead of register_iommu
  iommu/vt-d: Use bus_set_iommu instead of register_iommu
  iommu/amd: Use bus_set_iommu instead of register_iommu
  iommu/core: Use bus->iommu_ops in the iommu-api
  iommu/core: Convert iommu_found to iommu_present
  iommu/core: Add bus_type parameter to iommu_domain_alloc
  Driver core: Add iommu_ops to bus_type
  iommu/core: Define iommu_ops and register_iommu only with CONFIG_IOMMU_API
  iommu/amd: Fix wrong shift direction
  iommu/omap: always provide iommu debug code
  iommu/core: let drivers know if an iommu fault handler isn't installed
  iommu/core: export iommu_set_fault_handler()
  iommu/omap: Fix build error with !IOMMU_SUPPORT
  iommu/omap: Migrate to the generic fault report mechanism
  iommu/core: Add fault reporting mechanism
  iommu/core: Use PAGE_SIZE instead of hard-coded value
  iommu/core: use the existing IS_ALIGNED macro
  iommu/msm: ->unmap() should return order of unmapped page
  ...

Fixup trivial conflicts in drivers/iommu/Makefile: "move omap iommu to
dedicated iommu folder" vs "Rename the DMAR and INTR_REMAP config
options" just happened to touch lines next to each other.

8 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp
Linus Torvalds [Sun, 30 Oct 2011 22:43:32 +0000]
Merge branch 'for-linus' of git://git./linux/kernel/git/bp/bp

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp:
  amd64_edac: Cleanup return type of amd64_determine_edac_cap()
  amd64_edac: Add a fix for Erratum 505
  EDAC, MCE, AMD: Simplify NB MCE decoder interface
  EDAC, MCE, AMD: Drop local coreid reporting
  EDAC, MCE, AMD: Print valid addr when reporting an error
  EDAC, MCE, AMD: Print CPU number when reporting the error