David Howells [Tue, 7 Jun 2011 13:09:20 +0000]
VFS: Remove dentry->d_lock locking from shrink_dcache_for_umount_subtree()

Locks of the dcache_lock were replaced by locks of dentry->d_lock in commits
such as:

2304450783dfde7b0b94ae234edd0dbffa865073
2fd6b7f50797f2e993eea59e0a0b8c6399c811dc

as part of the RCU-based pathwalk changes, despite the fact that the caller
(shrink_dcache_for_umount()) notes in the banner comment the reasons that
d_lock is not necessary in these functions:

/*
 * destroy the dentries attached to a superblock on unmounting
 * - we don't need to use dentry->d_lock because:
 *   - the superblock is detached from all mountings and open files, so the
 *     dentry trees will not be rearranged by the VFS
 *   - s_umount is write-locked, so the memory pressure shrinker will ignore
 *     any dentries belonging to this superblock that it comes across
 *   - the filesystem itself is no longer permitted to rearrange the dentries
 *     in this superblock
 */

So remove these locks.  If the locks are actually necessary, then this banner
comment should be altered instead.

The hash table chains are protected by 1-bit locks in the hash table heads, so
those shouldn't be a problem.

Note that to make this work, __d_drop() has to be split so that the RCUwalk
barrier can be avoided.  This causes problems otherwise as it has an assertion
that dentry->d_lock is locked - but there is no need for that as no one else
can be trying to access this dentry, except to step over it (and that should
be handled by d_free(), I think).

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

David Howells [Tue, 7 Jun 2011 13:09:10 +0000]
VFS: Remove detached-dentry counter from shrink_dcache_for_umount_subtree()

Remove the detached-dentry counter from shrink_dcache_for_umount_subtree() as
the value it computes is no longer used as of commit
312d3ca856d369bb04d0443846b85b4cdde6fa8a which made the nr_dentry counters
summed per-CPU rather than global atomic.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Al Viro [Sat, 23 Jul 2011 23:03:11 +0000]
switch posix_acl_chmod() to umode_t

again, that's what all callers pass to it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Al Viro [Sat, 23 Jul 2011 23:01:48 +0000]
switch posix_acl_from_mode() to umode_t

... seeing that this is what all callers pass to it anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Al Viro [Sat, 23 Jul 2011 22:56:36 +0000]
switch posix_acl_equiv_mode() to umode_t *

... so that &inode->i_mode could be passed to it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Al Viro [Sat, 23 Jul 2011 22:37:50 +0000]
switch posix_acl_create() to umode_t *

so we can pass &inode->i_mode to it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Lachlan McIlroy [Thu, 30 Jun 2011 01:01:45 +0000]
block: initialise bd_super in bdget()

bd_super is currently reset to NULL in kill_block_super() so we rely on previous
users of the block_device object to initialise this value for the next user.
This quirk was exposed on RHEL5 when a third party filesystem did not always use
kill_block_super() and therefore bd_super wasn't being reset when a block_device
object was recycled within the cache.  This may not be a problem upstream but
it makes sense to be defensive.

Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Eric Dumazet [Thu, 28 Jul 2011 04:55:13 +0000]
vfs: avoid call to inode_lru_list_del() if possible

inode_lru_list_del() is expensive because of the per-superblock LRU locking,
yet some inodes are never on the LRU list.

Adding a check in iput_final() can speed up pipe/socket workloads on
SMP.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Eric Dumazet [Thu, 28 Jul 2011 04:41:09 +0000]
vfs: avoid taking inode_hash_lock on pipes and sockets

Some inodes (pipes, sockets, ...) are not hashed, so there is no need to take
the contended inode_hash_lock at dismantle time.

This gives a nice speedup on SMP machines for socket-intensive workloads.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Eric Dumazet [Thu, 28 Jul 2011 04:11:47 +0000]
vfs: conditionally call inode_wb_list_del()

Some inodes (pipes, sockets, ...) are not on the bdi writeback list.

evict() can avoid calling inode_wb_list_del() and its expensive spinlock
by checking whether the inode's i_wb_list is empty.

At this point, no other CPU or user can concurrently manipulate this inode's
i_wb_list.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
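
The check described above can be illustrated with a small user-space sketch. It is not the kernel code: the list helpers only imitate the kernel's list_head, and toy_inode, toy_wb_list_del(), toy_evict() and lock_acquisitions are hypothetical names introduced here to count how often the locked path would be taken.

```c
#include <stdbool.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
    struct list_head *next, *prev;
};

void init_list_head(struct list_head *h) { h->next = h; h->prev = h; }

bool list_empty(const struct list_head *h) { return h->next == h; }

void list_add(struct list_head *n, struct list_head *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

void list_del_init(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    init_list_head(e);
}

/* Hypothetical counter: how many times the "expensive" lock was taken. */
int lock_acquisitions;

/* Hypothetical stand-in for an inode that may sit on a writeback list. */
struct toy_inode {
    struct list_head i_wb_list;
};

void toy_wb_list_del(struct toy_inode *inode)
{
    lock_acquisitions++;    /* stands in for taking and releasing the spinlock */
    list_del_init(&inode->i_wb_list);
}

void toy_evict(struct toy_inode *inode)
{
    /* Skip the locked path when the inode was never put on the list. */
    if (!list_empty(&inode->i_wb_list))
        toy_wb_list_del(inode);
}
```

The unlocked list_empty() test is only safe under the condition the commit message states: nothing else can touch this inode's i_wb_list at that point.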

David Howells [Mon, 11 Jul 2011 13:20:57 +0000]
VFS: Fix automount for negative autofs dentries

Autofs may set the DCACHE_NEED_AUTOMOUNT flag on negative dentries.  These
need attention from the automounter daemon regardless of the LOOKUP_FOLLOW flag.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Josef Bacik [Tue, 28 Jun 2011 20:18:59 +0000]
Btrfs: load the key from the dir item in readdir into a fake dentry

In btrfs we have 2 indexes for inodes.  One is for readdir; it's in this nice
sequential order and works out brilliantly for readdir.  However, if you use ls,
it usually stats each file it gets from readdir.  This is where the second
index comes in, which is based on a hash of the name of the file.  So then the
lookup has to look up this index, and then look up the inode.  The index lookup is
going to be in random order (since it's based on the name hash), which gives us
less than stellar performance.  Since we know the inode location from the
readdir index, I create a dummy dentry and copy the location key into
dentry->d_fsdata.  Then on lookup, if we have d_fsdata we use that location to
look up the inode, avoiding looking up the other directory index.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Al Viro [Thu, 28 Jul 2011 02:27:33 +0000]
devtmpfs: missing initialization in never-hit case

create_path() on something without a single / in it will return err
without initializing it.  It actually can't happen (we call that thing
only if create on the same path returns -ENOENT, which won't happen
for a single-component path), but in this case initializing err
to 0 is more than making the compiler STFU - it would be the right thing
to return on such paths; the function creates the parent directory of
the given pathname and in that case it has no work to do...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
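
A loose user-space sketch of the situation described above, assuming a loop that creates each parent prefix of the path in turn. toy_create_path(), toy_mkdir() and toy_mkdir_calls are hypothetical names for illustration, not the devtmpfs functions; the point is only that err starts at 0, so a path with no '/' returns success without doing any work.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for directory creation: counts calls, always succeeds. */
int toy_mkdir_calls;

int toy_mkdir(const char *path)
{
    (void)path;
    toy_mkdir_calls++;
    return 0;
}

/* Create every parent directory of nodepath; the last component is left alone. */
int toy_create_path(const char *nodepath)
{
    int err = 0;    /* nothing to create means success */
    char *path = malloc(strlen(nodepath) + 1);
    char *s;

    if (!path)
        return -1;
    strcpy(path, nodepath);

    s = path;
    for (;;) {
        s = strchr(s, '/');
        if (!s)
            break;          /* no further separator: only the leaf remains */
        *s = '\0';          /* temporarily cut the path at this separator */
        err = toy_mkdir(path);
        if (err)
            break;
        *s = '/';           /* restore the separator and move past it */
        s++;
    }

    free(path);
    return err;
}
```

With a single-component name the loop body never runs, which is exactly the case where an uninitialized err would have been returned.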

Al Viro [Thu, 28 Jul 2011 02:21:58 +0000]
hppfs: missing include

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Linus Torvalds [Wed, 27 Jul 2011 01:30:20 +0000]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  merge fchmod() and fchmodat() guts, kill ancient broken kludge
  xfs: fix misspelled S_IS...()
  xfs: get rid of open-coded S_ISREG(), etc.
  vfs: document locking requirements for d_move, __d_move and d_materialise_unique
  omfs: fix (mode & S_IFDIR) abuse
  btrfs: S_ISREG(mode) is not mode & S_IFREG...
  ima: fmode_t misspelled as mode_t...
  pci-label.c: size_t misspelled as mode_t
  jffs2: S_ISLNK(mode & S_IFMT) is pointless
  snd_msnd ->mode is fmode_t, not mode_t
  v9fs_iop_get_acl: get rid of unused variable
  vfs: dont chain pipe/anon/socket on superblock s_inodes list
  Documentation: Exporting: update description of d_splice_alias
  fs: add missing unlock in default_llseek()

Linus Torvalds [Wed, 27 Jul 2011 00:42:18 +0000]
Merge branch 'next/devel2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc

* 'next/devel2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc: (47 commits)
  OMAP: Add debugfs node to show the summary of all clocks
  OMAP2+: hwmod: Follow the recommended PRCM module enable sequence
  OMAP2+: clock: allow per-SoC clock init code to prevent clockdomain calls from clock code
  OMAP2+: clockdomain: Add per clkdm lock to prevent concurrent state programming
  OMAP2+: PM: idle clkdms only if already in idle
  OMAP2+: clockdomain: add clkdm_in_hwsup()
  OMAP2+: clockdomain: Add 2 APIs to control clockdomain from hwmod framework
  OMAP: clockdomain: Remove redundant call to pwrdm_wait_transition()
  OMAP4: hwmod: Introduce the module control in hwmod control
  OMAP4: cm: Add two new APIs for modulemode control
  OMAP4: hwmod data: Add modulemode entry in omap_hwmod structure
  OMAP4: hwmod data: Add PRM context register offset
  OMAP4: prm: Remove deprecated functions
  OMAP4: prm: Replace warm reset API with the offset based version
  OMAP4: hwmod: Replace RSTCTRL absolute address with offset macros
  OMAP: hwmod: Wait the idle status to be disabled
  OMAP4: hwmod: Replace CLKCTRL absolute address with offset macros
  OMAP2+: hwmod: Init clkdm field at boot time
  OMAP4: hwmod data: Add clock domain attribute
  OMAP4: clock data: Add missing divider selection for auxclks
  ...

Linus Torvalds [Wed, 27 Jul 2011 00:41:04 +0000]
Merge branch 'next/devel' of ssh://master.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc

* 'next/devel' of ssh://master.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc: (128 commits)
  ARM: S5P64X0: External Interrupt Support
  ARM: EXYNOS4: Enable MFC on Samsung NURI
  ARM: EXYNOS4: Enable MFC on universal_c210
  ARM: S5PV210: Enable MFC on Goni
  ARM: S5P: Add support for MFC device
  ARM: EXYNOS4: Add support FIMD on SMDKC210
  ARM: EXYNOS4: Add platform device and helper functions for FIMD
  ARM: EXYNOS4: Add resource definition for FIMD
  ARM: EXYNOS4: Change devname for FIMD clkdev
  ARM: SAMSUNG: Add IRQ_I2S0 definition
  ARM: SAMSUNG: Add platform device for idma
  ARM: EXYNOS4: Add more registers to be saved and restored for PM
  ARM: EXYNOS4: Add more register addresses of CMU
  ARM: EXYNOS4: Add platform device for dwmci driver
  ARM: EXYNOS4: configure rtc-s3c on NURI
  ARM: EXYNOS4: configure MAX8903 secondary charger on NURI
  ARM: EXYNOS4: configure ADC on NURI
  ARM: EXYNOS4: configure MAX17042 fuel gauge on NURI
  ARM: EXYNOS4: configure regulators and PMIC(MAX8997) on NURI
  ARM: EXYNOS4: Increase NR_IRQS for devices with more IRQs
  ...

Fix up tons of silly conflicts:
 - arch/arm/mach-davinci/include/mach/psc.h
 - arch/arm/mach-exynos4/Kconfig
 - arch/arm/mach-exynos4/mach-smdkc210.c
 - arch/arm/mach-exynos4/pm.c
 - arch/arm/mach-imx/mm-imx1.c
 - arch/arm/mach-imx/mm-imx21.c
 - arch/arm/mach-imx/mm-imx25.c
 - arch/arm/mach-imx/mm-imx27.c
 - arch/arm/mach-imx/mm-imx31.c
 - arch/arm/mach-imx/mm-imx35.c
 - arch/arm/mach-mx5/mm.c
 - arch/arm/mach-s5pv210/mach-goni.c
 - arch/arm/mm/Kconfig

Linus Torvalds [Wed, 27 Jul 2011 00:13:04 +0000]
Merge branch 'next/board' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc

* 'next/board' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc:
  ARM: S3C64XX: Configure backup battery charger on Cragganmore
  ARM: S3C64XX: Fix WM8915 IRQ polarity on Cragganmore
  ARM: S3C64XX: Configure supplies for all Cragganmore regulators
  ARM: S3C64XX: Refresh Cragganmore support
  ARM: S3C64XX: Initial support for Wolfson/Simtec Cragganmore/Banff
  OMAP4: Keyboard: Mux changes in the board file
  omap: blaze: add mmc5/wl1283 device support
  omap: 4430SDP: Register the card detect GPIO properly
  arm: omap3: cm-t35: add support for cm-t3730
  OMAP3: beagle: add support for beagleboard xM revision C
  OMAP3: rx-51: Add full regulator definitions
  omap: rx51: Platform support for lp5523 led chip

Linus Torvalds [Wed, 27 Jul 2011 00:12:10 +0000]
Merge branch 'next/cross-platform' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc

* 'next/cross-platform' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc:
  ARM: Consolidate the clkdev header files
  ARM: set vga memory base at run-time
  ARM: convert PCI defines to variables
  ARM: pci: make pcibios_assign_all_busses use pci_has_flag
  ARM: remove unnecessary mach/hardware.h includes
  pci: move microblaze and powerpc pci flag functions into asm-generic
  powerpc: rename ppc_pci_*_flags to pci_*_flags

Fix up conflicts in arch/microblaze/include/asm/pci-bridge.h

Linus Torvalds [Wed, 27 Jul 2011 00:10:20 +0000]
Merge branch 'next/fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc

* 'next/fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc: (24 commits)
  ASoC: omap: McBSP: fix build breakage on OMAP1
  OMAP: hwmod: fix the i2c-reset timeout during bootup
  I2C: OMAP2+: add correct functionality flags to all omap2plus i2c dev_attr
  I2C: OMAP2+: Tag all OMAP2+ hwmod definitions with I2C IP revision
  I2C: OMAP1/OMAP2+: create omap I2C functionality flags for each cpu_... test
  I2C: OMAP2+: Introduce I2C IP versioning constants
  I2C: OMAP2+: increase omap_i2c_dev_attr flags from u8 to u32
  I2C: OMAP2+: Set hwmod flags to only allow 16-bit accesses to i2c
  OMAP4: hwmod data: Change DSS main_clk scheme
  OMAP4: powerdomain data: Remove unsupported MPU powerdomain state
  OMAP4: clock data: Keep GPMC clocks always enabled and hardware managed
  OMAP4: powerdomain data: Fix core mem states and missing cefuse flag
  OMAP2+: PM: Initialise sleep_switch to a non-valid value
  OMAP4: hwmod data: Modify DSS opt clocks
  OMAP4: iommu: fix clock name
  omap: iovmm: s/sg_dma_len(sg)/sg->length/
  omap: iommu: fix pte programming
  arm: omap3: cm-t35: fix slow path warning
  arm: omap3: cm-t35: minor comments fixes
  omap: ZOOM: QUART: Request reset GPIO
  ...

Linus Torvalds [Wed, 27 Jul 2011 00:09:31 +0000]
Merge branch 'next/soc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc

* 'next/soc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc:
  MAINTAINERS: add maintainer of CSR SiRFprimaII machine
  ARM: CSR: initializing L2 cache
  ARM: CSR: mapping early DEBUG_LL uart
  ARM: CSR: Adding CSR SiRFprimaII board support
  OMAP4: clocks: Update the clock tree with 4460 clock nodes
  OMAP4: PRCM: OMAP4460 specific PRM and CM register bitshifts
  OMAP4: ID: add omap_has_feature for max freq supported
  OMAP: ID: introduce chip detection for OMAP4460
  ARM: Xilinx: merge board file into main platform code
  ARM: Xilinx: Adding Xilinx board support

Fix up conflicts in arch/arm/mach-omap2/cm-regbits-44xx.h

Linus Torvalds [Tue, 26 Jul 2011 23:55:45 +0000]
Merge branch 'next-i2c' of git://git.fluff.org/bjdooks/linux

* 'next-i2c' of git://git.fluff.org/bjdooks/linux:
  i2c-eg20t : Fix the issue of Combined R/W transfer mode
  i2c-eg20t : Support Combined R/W transfer mode
  i2c: Tegra: Add DeviceTree support

Mike Frysinger [Tue, 26 Jul 2011 23:09:11 +0000]
asm-generic/atomic.h: allow SMP peeps to leverage this

Only a few core funcs need to be implemented for SMP systems, so allow the
arches to override them while getting the rest for free.

At least, this is enough to allow the Blackfin SMP port to use things.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Arun Sharma <asharma@fb.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Mike Frysinger [Tue, 26 Jul 2011 23:09:10 +0000]
asm-generic/atomic.h: add atomic_set_mask() helper

Since arches are expected to implement this guy, add a common version for
people the same way as atomic_clear_mask is handled.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Arun Sharma <asharma@fb.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
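
As a rough illustration of what a set-mask/clear-mask pair does, here is a user-space sketch built on C11 <stdatomic.h> rather than on the kernel's atomic_t. The toy_* names are hypothetical, and the kernel's generic version is implemented differently; only the observable effect (OR the mask in, AND the mask out, atomically) is what the sketch aims to show.

```c
#include <stdatomic.h>

/* Atomically set every bit of mask in *v (hypothetical toy_ helper). */
void toy_atomic_set_mask(unsigned int mask, atomic_uint *v)
{
    atomic_fetch_or(v, mask);
}

/* Atomically clear every bit of mask in *v (hypothetical toy_ helper). */
void toy_atomic_clear_mask(unsigned int mask, atomic_uint *v)
{
    atomic_fetch_and(v, ~mask);
}
```

Both helpers take a pointer to the atomic type rather than to a plain integer, so a caller cannot accidentally pass an ordinary unsigned long.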

Mike Frysinger [Tue, 26 Jul 2011 23:09:10 +0000]
asm-generic/atomic.h: fix type used in atomic_clear_mask

The atomic helpers are supposed to take an atomic_t pointer, not a random
unsigned long pointer.  So convert atomic_clear_mask over.

While we're here, also add some nice documentation to the func.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Arun Sharma <asharma@fb.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Mike Frysinger [Tue, 26 Jul 2011 23:09:09 +0000]
asm-generic/atomic.h: simplify inc/dec test helpers

We already declared inc/dec helpers, so we don't need to call the
atomic_{add,sub}_return funcs directly.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
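
The layering described above can be sketched in user space, with C11 atomics standing in for atomic_t. All toy_* names are hypothetical; the sketch only shows the test helpers being expressed through the inc/dec helpers instead of calling the add/sub primitives directly.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Increment or decrement and return the new value (hypothetical toy_ helpers). */
int toy_atomic_inc_return(atomic_int *v) { return atomic_fetch_add(v, 1) + 1; }
int toy_atomic_dec_return(atomic_int *v) { return atomic_fetch_sub(v, 1) - 1; }

/* The test helpers reuse the helpers above and report whether zero was reached. */
bool toy_atomic_inc_and_test(atomic_int *v) { return toy_atomic_inc_return(v) == 0; }
bool toy_atomic_dec_and_test(atomic_int *v) { return toy_atomic_dec_return(v) == 0; }
```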

Arun Sharma [Tue, 26 Jul 2011 23:09:08 +0000]
atomic: Update comments in atomic.h

This clarifies the differences between <linux/atomic.h> and
<asm-generic/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Suggested-by: Mike Frysinger <vapier.adi@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Arun Sharma [Tue, 26 Jul 2011 23:09:08 +0000]
atomic: cleanup asm-generic atomic*.h inclusion

After changing all consumers of atomics to include <linux/atomic.h>, we
ran into some compile time errors due to this dependency chain:

linux/atomic.h
  -> asm/atomic.h
    -> asm-generic/atomic-long.h

where atomic-long.h could use funcs defined later in linux/atomic.h
without a prototype.  This patch moves the code that includes
asm-generic/atomic*.h to linux/atomic.h.

Archs that need <asm-generic/atomic64.h> need to select
CONFIG_GENERIC_ATOMIC64 from now on (some of them used to include it
unconditionally).

Compile tested on i386 and x86_64 with allnoconfig.

Signed-off-by: Arun Sharma <asharma@fb.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Arun Sharma [Tue, 26 Jul 2011 23:09:07 +0000]
atomic: move atomic_add_unless to generic code

This is in preparation for more generic atomic primitives based on
__atomic_add_unless.

Signed-off-by: Arun Sharma <asharma@fb.com>
Signed-off-by: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
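
For readers unfamiliar with the primitive, the following user-space sketch shows the usual compare-and-swap loop behind an "add unless" operation and how a generic helper can be layered on it. It uses C11 atomics, not the kernel's per-architecture code, and every toy_* name is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Add a to *v unless *v equals u.
 * Returns the value observed before the (possible) addition.
 */
int toy_atomic_add_unless_old(atomic_int *v, int a, int u)
{
    int old = atomic_load(v);

    /* A failed compare-exchange stores the current value back into old. */
    while (old != u && !atomic_compare_exchange_weak(v, &old, old + a))
        ;
    return old;
}

/* Generic wrapper: true if the addition was performed. */
bool toy_atomic_add_unless(atomic_int *v, int a, int u)
{
    return toy_atomic_add_unless_old(v, a, u) != u;
}

/* Built on the wrapper: take a reference only if the count is not already zero. */
bool toy_atomic_inc_not_zero(atomic_int *v)
{
    return toy_atomic_add_unless(v, 1, 0);
}
```

Only the first function would need an architecture-specific implementation in this arrangement; the other two are plain C on top of it.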

Arun Sharma [Tue, 26 Jul 2011 23:09:06 +0000]
atomic: use <linux/atomic.h>

This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Akinobu Mita [Tue, 26 Jul 2011 23:09:04 +0000]
asm-generic: add another generic ext2 atomic bitops

The majority of architectures implement ext2 atomic bitops as
test_and_{set,clear}_bit() without spinlock.

This adds this type of generic implementation in ext2-atomic-setbit.h and
uses it wherever possible.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Suggested-by: Andreas Dilger <adilger@dilger.ca>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Akinobu Mita [Tue, 26 Jul 2011 23:09:03 +0000]
fail_make_request: cleanup should_fail_request

This changes should_fail_request() into a more usable wrapper function around
should_fail().  It avoids putting #ifdef CONFIG_FAIL_MAKE_REQUEST in
the middle of a function.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
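
The general pattern behind this cleanup can be sketched as follows. This is a simplified user-space illustration, not the block layer code: TOY_FAIL_MAKE_REQUEST is a hypothetical stand-in for the config option, and toy_should_fail_request() and toy_submit() are invented for the example. The conditional compilation lives around the wrapper, so the caller contains no #ifdef.

```c
#include <errno.h>
#include <stdbool.h>

#ifdef TOY_FAIL_MAKE_REQUEST
/* Fault injection compiled in: decide whether this request should fail. */
bool toy_should_fail_request(int make_it_fail, unsigned int bytes)
{
    return make_it_fail && bytes > 0;
}
#else
/* Fault injection compiled out: a stub the compiler can optimise away. */
static inline bool toy_should_fail_request(int make_it_fail, unsigned int bytes)
{
    (void)make_it_fail;
    (void)bytes;
    return false;
}
#endif

/* The caller reads the same in both configurations. */
int toy_submit(int make_it_fail, unsigned int bytes)
{
    if (toy_should_fail_request(make_it_fail, bytes))
        return -EIO;
    return 0;
}
```

Built without TOY_FAIL_MAKE_REQUEST defined, toy_submit() never fails; defining it switches in the injecting version without touching the caller.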

Akinobu Mita [Tue, 26 Jul 2011 23:09:03 +0000]
fail_page_alloc: simplify debugfs initialization

Now that cleanup_fault_attr_dentries() recursively removes a directory, we
can simplify the error handling in the initialization code, and there is no need
to hold dentry structs for each debugfs file.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Akinobu Mita [Tue, 26 Jul 2011 23:09:02 +0000]
failslab: simplify debugfs initialization

Now that cleanup_fault_attr_dentries() recursively removes a directory, we
can simplify the error handling in the initialization code, and there is no need
to hold dentry structs for each debugfs file.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Akinobu Mita [Tue, 26 Jul 2011 23:09:02 +0000]
fault-injection: use debugfs_remove_recursive

Use debugfs_remove_recursive() to simplify initialization and
deinitialization of fault injection debugfs files.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Akinobu Mita [Tue, 26 Jul 2011 23:09:01 +0000]
fault-injection: cleanup simple attribute of stacktrace_depth

Minor cosmetic changes for simple attribute of stacktrace_depth:

 - use min_t()
 - reduce #ifdef by moving a function
 - do not use partly capitalized function name

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Akinobu Mita [Tue, 26 Jul 2011 23:09:00 +0000]
fault-injection: remove nonexistent function extern

should_fail_srandom() does not exist.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Akinobu Mita [Tue, 26 Jul 2011 23:09:00 +0000]
fault-injection: do not include unneeded header

No need to include linux/kallsyms.h.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Sergiu Iordache [Tue, 26 Jul 2011 23:08:59 +0000]
ramoops: make record_size a module parameter

The size of the dump is currently set using the RECORD_SIZE macro which
is set to a page size.  This patch makes the record size a module
parameter and allows it to be set through platform data as well to allow
larger dumps if needed.

Signed-off-by: Sergiu Iordache <sergiu@chromium.org>
Acked-by: Marco Stornelli <marco.stornelli@gmail.com>
Cc: "Ahmed S. Darwish" <darwish.07@gmail.com>
Cc: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Sergiu Iordache [Tue, 26 Jul 2011 23:08:58 +0000]
ramoops: move dump_oops into platform data

The platform driver currently allows setting the mem_size and
mem_address.

Since dump_oops is also a module parameter it would be more consistent if
it could be set through platform data as well.

Signed-off-by: Sergiu Iordache <sergiu@chromium.org>
Acked-by: Marco Stornelli <marco.stornelli@gmail.com>
Cc: "Ahmed S. Darwish" <darwish.07@gmail.com>
Cc: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Marco Stornelli [Tue, 26 Jul 2011 23:08:57 +0000]
ramoops: add new line to each print

Add new line to each print.

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Reported-by: Stevie Trujillo <stevie.trujillo@gmail.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Américo Wang <xiyou.wangcong@gmail.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Marco Stornelli [Tue, 26 Jul 2011 23:08:57 +0000]
ramoops: use module parameters instead of platform data if not available

Use generic module parameters instead of platform data, if platform data
are not available.  This limitation has been introduced with commit
c3b92ce9e75 ("ramoops: use the platform data structure instead of module
params").

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Américo Wang <xiyou.wangcong@gmail.com>
Reported-by: Stevie Trujillo <stevie.trujillo@gmail.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Dmitry Torokhov [Tue, 26 Jul 2011 23:08:56 +0000]
Vmware balloon: switch to using system-wide freezable workqueue

With the arrival of concurrency-managed workqueues there is no need for
our driver to use a dedicated workqueue; the system-wide one should suffice just
fine.

[akpm@linux-foundation.org: fix comment layout & grammar]
Signed-off-by: Dmitry Torokhov <dtor@vmware.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Christian Glindkamp [Tue, 26 Jul 2011 23:08:55 +0000]
drivers/w1/slaves/w1_therm.c: add support for DS28EA00

Signed-off-by: Christian Glindkamp <christian.glindkamp@taskit.de>
Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Mandeep Singh Baines [Tue, 26 Jul 2011 23:08:54 +0000]
panic, vt: do not force oops output when panic_timeout < 0

Don't force output if you intend to reboot immediately.

In this patch, I'm disabling the functionality enabled by
vc->vc_panic_force_write if panic_timeout < 0 (i.e.  no timeout).
vc_panic_force_write is only enabled for fb video consoles if the
FBINFO_CAN_FORCE_OUTPUT flag is set.

For our application, we're using ram_oops to preserve the panic in
memory.  We want to machine_restart reliably and as fast as possible.
The vc_panic_force_write flag results in a bunch of graphics driver code
being invoked, which slows down restart and decreases reliability.  Since
we're already storing the panic in RAM and are going to reboot
immediately, there is no benefit in mode switching back to the vc in
order to display the panic output.  The log buffer will get flushed by
the console_unblank() call so remote management consoles should see all
output.

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Olaf Hering <olaf@aepfle.de>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Hugh Dickins [Tue, 26 Jul 2011 23:08:52 +0000]
panic: panic=-1 for immediate reboot

When a kernel BUG or oops occurs, ChromeOS intends to panic and
immediately reboot, with stacktrace and other messages preserved in RAM
across reboot.

But the longer we delay, the more likely the user is to poweroff and
lose the info.

panic_timeout (seconds before rebooting) is set by panic= boot option or
sysctl or /proc/sys/kernel/panic; but 0 means wait forever, so at
present we have to delay at least 1 second.

Let a negative number mean reboot immediately (with the small cosmetic
benefit of suppressing that newline-less "Rebooting in %d seconds.."
message).

Signed-off-by: Hugh Dickins <hughd@chromium.org>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Olaf Hering <olaf@aepfle.de>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoDocumentation/DMA-API-HOWTO.txt: fix misleading example
Michal Miroslaw [Tue, 26 Jul 2011 23:08:51 +0000]
Documentation/DMA-API-HOWTO.txt: fix misleading example

See: DMA-API.txt, part Id, DMA_FROM_DEVICE description.

Signed-off-by: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoinclude/linux/dma-mapping.h: remove DMA_xxBIT_MASK macros
WANG Cong [Tue, 26 Jul 2011 23:08:50 +0000]
include/linux/dma-mapping.h: remove DMA_xxBIT_MASK macros

git grep shows there are no users in tree, so we can remove them safely.

Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Jiri Slaby <jslaby@suse.cz>
Acked-by: Vinod Koul <vinod.koul@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agogcov: disable CONSTRUCTORS for UML
Vitaliy Ivanov [Tue, 26 Jul 2011 23:08:49 +0000]
gcov: disable CONSTRUCTORS for UML

Selecting GCOV for UML causes a configuration mismatch:

  warning: (GCOV_KERNEL) selects CONSTRUCTORS which has unmet direct dependencies (!UML)

Constructors are not needed for UML.

Signed-off-by: Vitaliy Ivanov <vitalivanov@gmail.com>
Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Acked-by: Richard Weinberger <richard@nod.at>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agodrivers/edac/mpc85xx_edac.c: correct offset_in_page mask bits in edac_mc_handle_ce()
Kai.Jiang [Tue, 26 Jul 2011 23:08:49 +0000]
drivers/edac/mpc85xx_edac.c: correct offset_in_page mask bits in edac_mc_handle_ce()

Parameter offset_in_page in edac_mc_handle_ce() should mask the higher
bits above the page size, not the lower bits.  The original input
sometimes causes a crash.
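
The difference between the two masks can be sketched in userspace C.
The SKETCH_* macros are stand-ins that assume 4 KiB pages; they mirror
the usual definition of a page mask but are not the driver's code.

```c
#include <assert.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK  (~(SKETCH_PAGE_SIZE - 1))

/* Intended behaviour: drop the page-number bits, keep the in-page offset. */
unsigned long sketch_offset_in_page(unsigned long addr)
{
	unsigned long off = addr & ~SKETCH_PAGE_MASK;

	/* An in-page offset is always smaller than the page size. */
	assert(off < SKETCH_PAGE_SIZE);
	return off;
}

/* Opposite mask: drops the low bits and keeps the page-number bits. */
unsigned long sketch_page_base(unsigned long addr)
{
	return addr & SKETCH_PAGE_MASK;
}
```

Passing the second value where an offset is expected yields a number far
larger than a page, which is the kind of input that can crash a consumer.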

Signed-off-by: Kai.Jiang <Kai.Jiang@freescale.com>
Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com>
Cc: Anton Vorontsov <avorontsov@mvista.com>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoipc: introduce shm_rmid_forced sysctl
Vasiliy Kulikov [Tue, 26 Jul 2011 23:08:48 +0000]
ipc: introduce shm_rmid_forced sysctl

Add support for the shm_rmid_forced sysctl.  If set to 1, all shared
memory objects in current ipc namespace will be automatically forced to
use IPC_RMID.

The POSIX way of handling shmem allows one to create shm objects and
call shmdt(), leaving shm object associated with no process, thus
consuming memory not counted via rlimits.

With shm_rmid_forced=1 the shared memory object is counted at least for
one process, so OOM killer may effectively kill the fat process holding
the shared memory.

It obviously breaks POSIX - some programs relying on the feature would
stop working.  So set shm_rmid_forced=1 only if you're sure nobody uses
"orphaned" memory.  Use shm_rmid_forced=0 by default for compatibility
reasons.

The feature was previously implemented in -ow as a configure option.

[akpm@linux-foundation.org: fix documentation, per Randy]
[akpm@linux-foundation.org: fix warning]
[akpm@linux-foundation.org: readability/conventionality tweaks]
[akpm@linux-foundation.org: fix shm_rmid_forced/shm_forced_rmid confusion, use standard comment layout]
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serge.hallyn@canonical.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Solar Designer <solar@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoipc/mqueue.c: fix mq_open() return value
Jiri Slaby [Tue, 26 Jul 2011 23:08:47 +0000]
ipc/mqueue.c: fix mq_open() return value

We return ENOMEM from mqueue_get_inode even when we have enough memory.
Namely in case the system rlimit of mqueue was reached.  This error
propagates to mq_open() and the user sees the error unexpectedly.  So fix
this up to properly return EMFILE as described in the manpage:

EMFILE The process already has the maximum number of files and
       message queues open.

instead of:

ENOMEM Insufficient memory.

With the previous patch we just switch to ERR_PTR/PTR_ERR/IS_ERR error
handling here.
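
For readers unfamiliar with the ERR_PTR/PTR_ERR/IS_ERR idiom, here is a
minimal userspace re-implementation.  All sketch_* names are
hypothetical, and sketch_get_object() is only a stand-in for an
allocator that can fail for two different reasons; it is not
mqueue_get_inode().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *sketch_err_ptr(long error)
{
	assert(error < 0 && error >= -SKETCH_MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an encoded pointer. */
static inline long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer lies in the range reserved for error codes. */
static inline int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

static int sketch_object;

/* Hypothetical allocator: reports why it failed instead of returning NULL. */
void *sketch_get_object(int over_rlimit, int out_of_memory)
{
	if (out_of_memory)
		return sketch_err_ptr(-ENOMEM);
	if (over_rlimit)
		return sketch_err_ptr(-EMFILE);
	return &sketch_object;
}
```

A caller checks sketch_is_err() and passes sketch_ptr_err() upwards, so
the rlimit case and the allocation failure no longer look the same.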

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoipc/mqueue.c: refactor failure handling
Jiri Slaby [Tue, 26 Jul 2011 23:08:46 +0000]
ipc/mqueue.c: refactor failure handling

If new_inode fails to allocate an inode we need only to return with
NULL.  But now we test the opposite and have all the work in a nested
block.  So do the opposite to save one indentation level (and remove
unnecessary line breaks).

This is only a preparation/cleanup for the next patch where we fix up
return values from mqueue_get_inode.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agocpumask: add cpumask_var_t documentation
KOSAKI Motohiro [Tue, 26 Jul 2011 23:08:45 +0000]
cpumask: add cpumask_var_t documentation

cpumask_var_t has one notable difference from cpumask_t.  Add the
explanation.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Thiago Farina <tfransosi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agocpumask: alloc_cpumask_var() use NUMA_NO_NODE
KOSAKI Motohiro [Tue, 26 Jul 2011 23:08:44 +0000]
cpumask: alloc_cpumask_var() use NUMA_NO_NODE

NUMA_NO_NODE and numa_node_id() have different meanings.  NUMA_NO_NODE is
obviously the recommended fallback.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agocpumask: convert for_each_cpumask() with for_each_cpu()
KOSAKI Motohiro [Tue, 26 Jul 2011 23:08:44 +0000]
cpumask: convert for_each_cpumask() with for_each_cpu()

Adapt to the new API style.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agofs/exec.c:acct_arg_size(): ptl is no longer needed for add_mm_counter()
Oleg Nesterov [Tue, 26 Jul 2011 23:08:43 +0000]
fs/exec.c:acct_arg_size(): ptl is no longer needed for add_mm_counter()

acct_arg_size() takes ->page_table_lock around add_mm_counter() if
!SPLIT_RSS_COUNTING.  This is not needed after commit 172703b08cd0 ("mm:
delete non-atomic mm counter implementation").

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoexec: do not retry load_binary method if CONFIG_MODULES=n
Tetsuo Handa [Tue, 26 Jul 2011 23:08:42 +0000]
exec: do not retry load_binary method if CONFIG_MODULES=n

If CONFIG_MODULES=n, it makes no sense to retry the list of binary format
handlers because the list will not be modified by request_module().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Richard Weinberger <richard@nod.at>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoexec: do not call request_module() twice from search_binary_handler()
Tetsuo Handa [Tue, 26 Jul 2011 23:08:41 +0000]
exec: do not call request_module() twice from search_binary_handler()

Currently, search_binary_handler() tries to load a binary loader module
using request_module() if a loader for the requested program is not yet
loaded.  But the second attempt of request_module() does not affect the
result of search_binary_handler().

If request_module() triggered recursion, calling request_module() twice
causes 2 to the power of MAX_KMOD_CONCURRENT (= 50) repetitions.  It is
not an infinite loop but is long enough for users to consider it a hang.

Therefore, this patch stops calling request_module() twice, which makes 1
to the power of MAX_KMOD_CONCURRENT repetitions in case of recursion.
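
The growth can be illustrated with a toy model that is not kernel code:
each level of recursion makes calls_per_level further attempts until the
depth limit is reached.  The function name is hypothetical, and the
depth is capped at 20 only to keep the demonstration fast.

```c
#include <assert.h>

/* Count the attempts made at the deepest level of the recursion. */
unsigned long long sketch_leaf_attempts(unsigned int calls_per_level,
					unsigned int depth)
{
	unsigned long long total = 0;
	unsigned int i;

	assert(depth <= 20);	/* keep the toy model cheap to run */
	if (depth == 0)
		return 1;
	for (i = 0; i < calls_per_level; i++)
		total += sketch_leaf_attempts(calls_per_level, depth - 1);
	return total;
}
```

With two calls per level and depth 10 the model already reaches 1024
attempts, while one call per level stays at 1 for any depth.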

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Richard Weinberger <richard@nod.at>
Tested-by: Richard Weinberger <richard@nod.at>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agofs/exec.c: use BUILD_BUG_ON for VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP
Michal Hocko [Tue, 26 Jul 2011 23:08:40 +0000]
fs/exec.c: use BUILD_BUG_ON for VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP

Commit a8bef8ff6ea1 ("mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks") introduced a BUG_ON() to ensure that VM_STACK_FLAGS
and VM_STACK_INCOMPLETE_SETUP do not overlap.  The check is a compile
time one, so BUILD_BUG_ON is more appropriate.
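
A small userspace sketch of the difference between a run-time and a
compile-time check follows.  SKETCH_BUILD_BUG_ON uses the classic
negative-array-size trick, and the two flag values are invented for the
example; they are not the real VM_STACK_* values.

```c
#include <assert.h>

/* A true condition gives a negative array size, which fails to compile. */
#define SKETCH_BUILD_BUG_ON(cond) ((void)sizeof(char[1 - 2 * !!(cond)]))

/* Hypothetical flag values, chosen only so that they do not overlap. */
#define SKETCH_STACK_FLAGS            0x0133UL
#define SKETCH_STACK_INCOMPLETE_SETUP 0x0C00UL

/* Run-time variant: the check only fires if this code is executed. */
unsigned long sketch_combine_runtime(unsigned long a, unsigned long b)
{
	assert(!(a & b));
	return a | b;
}

/* Compile-time variant: an overlap would break the build instead. */
unsigned long sketch_initial_stack_flags(void)
{
	SKETCH_BUILD_BUG_ON(SKETCH_STACK_FLAGS & SKETCH_STACK_INCOMPLETE_SETUP);
	return SKETCH_STACK_FLAGS | SKETCH_STACK_INCOMPLETE_SETUP;
}
```

Since both operands are constants, the compile-time form costs nothing
at run time and cannot be missed by a code path that is rarely taken.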

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agokernel/fork.c: fix a few coding style issues
Daniel Rebelo de Oliveira [Tue, 26 Jul 2011 23:08:39 +0000]
kernel/fork.c: fix a few coding style issues

Signed-off-by: Daniel Rebelo de Oliveira <psykon@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoproc: fix a race in do_io_accounting()
Vasiliy Kulikov [Tue, 26 Jul 2011 23:08:38 +0000]
proc: fix a race in do_io_accounting()

If an inode's mode permits opening /proc/PID/io and the resulting file
descriptor is kept across execve() of a setuid or similar binary, the
ptrace_may_access() check tries to prevent using this fd against the
task with escalated privileges.

Unfortunately, there is a race in the check against execve().  If
execve() is processed after the ptrace check, but before the actual io
information gathering, io statistics will be gathered from the
privileged process.  At least in theory this might lead to gathering
sensitive information (like ssh/ftp password length) that wouldn't be
available otherwise.

Holding task->signal->cred_guard_mutex while gathering the io
information should protect against the race.

The order of locking is similar to the one inside of ptrace_attach():
first goes cred_guard_mutex, then lock_task_sighand().

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoprocfs: return ENOENT on opening a being-removed proc entry
Daisuke Ogino [Tue, 26 Jul 2011 23:08:37 +0000]
procfs: return ENOENT on opening a being-removed proc entry

Change the return value to ENOENT.  This return value is then returned
when opening a proc entry that has been removed.  For example,
open("/proc/bus/pci/XX/YY") when the corresponding device is being
hot-removed.

Signed-off-by: Daisuke Ogino <ogino.daisuke@jp.fujitsu.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoh8300/m68k/xtensa: __FD_ISSET should return 0/1
Andrew Morton [Tue, 26 Jul 2011 23:08:35 +0000]
h8300/m68k/xtensa: __FD_ISSET should return 0/1

Harmonise these return values with other architectures.  In some cases
this affects all compilers and in other cases non-gcc compilers only.
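
One plausible way such a test can return something other than 0/1 is by
handing back the masked word itself.  The sketch below contrasts that
with a normalised result; the structure and function names are
hypothetical and do not reproduce any architecture's header.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_NFDBITS (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_SETSIZE 1024

struct sketch_fd_set {
	unsigned long bits[SKETCH_SETSIZE / SKETCH_NFDBITS];
};

void sketch_fd_set_bit(unsigned int fd, struct sketch_fd_set *set)
{
	assert(fd < SKETCH_SETSIZE);
	set->bits[fd / SKETCH_NFDBITS] |= 1UL << (fd % SKETCH_NFDBITS);
}

/* Raw masked word: nonzero when set, but not necessarily 1. */
unsigned long sketch_fd_isset_raw(unsigned int fd,
				  const struct sketch_fd_set *set)
{
	return set->bits[fd / SKETCH_NFDBITS] & (1UL << (fd % SKETCH_NFDBITS));
}

/* Normalised result: always 0 or 1. */
int sketch_fd_isset(unsigned int fd, const struct sketch_fd_set *set)
{
	return sketch_fd_isset_raw(fd, set) != 0;
}
```

Callers that only test for truth see no difference, but code that
compares the result with 1 or stores it in a narrow type does.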

Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Ulrich Drepper <drepper@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agodo_coredump: fix the "ispipe" error check
Oleg Nesterov [Tue, 26 Jul 2011 23:08:34 +0000]
do_coredump: fix the "ispipe" error check

do_coredump() assumes that if format_corename() fails it should return
-ENOMEM.  This is not true, for example cn_print_exe_file() can propagate
the error from d_path.  Even if it was true, this is too fragile.  Change
the code to check "ispipe < 0".

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Reviewed-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agocoredump: escape / in hostname and comm
Jiri Slaby [Tue, 26 Jul 2011 23:08:33 +0000]
coredump: escape / in hostname and comm

Change every occurrence of / in comm and hostname to !.  If the process
changes its name to contain /, the core is not dumped (if the directory
tree doesn't exist like that).  The same with hostname being something
like myhost/3.  Fix this behaviour by using the escape loop used in %E.
(We extract it to a separate function.)

Now both with comm == myprocess/1 and hostname == myhost/1, the core is
dumped like (kernel.core_pattern='core.%p.%e.%h'):
core.2349.myprocess!1.myhost!1
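
The escaping step itself is simple; a minimal userspace sketch is shown
here.  The helper name is hypothetical and the kernel's extracted
function may differ in signature.

```c
#include <assert.h>
#include <stddef.h>

/* Replace every '/' in buf with '!' in place. */
void sketch_escape_slashes(char *buf)
{
	assert(buf != NULL);
	for (; *buf; buf++)
		if (*buf == '/')
			*buf = '!';
}
```

Applied to "myprocess/1" this yields "myprocess!1", so the name can no
longer be interpreted as a path with a missing directory.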

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agocoredump: use task comm instead of (unknown)
Jiri Slaby [Tue, 26 Jul 2011 23:08:32 +0000]
coredump: use task comm instead of (unknown)

If we don't know the file corresponding to the binary (i.e.  exe_file is
unknown), use "task->comm (path unknown)" instead of simple "(unknown)"
as suggested by ak.

The fallback is the same as %e except it will append "(path unknown)".

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoptrace: unify show_regs() prototype
Mike Frysinger [Tue, 26 Jul 2011 23:08:31 +0000]
ptrace: unify show_regs() prototype

[ oleg@redhat.com: no need to declare show_regs() in ptrace.h, sched.h does this ]
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agocpusets: randomize node rotor used in cpuset_mem_spread_node()
Michal Hocko [Tue, 26 Jul 2011 23:08:30 +0000]
cpusets: randomize node rotor used in cpuset_mem_spread_node()

[ This patch has already been accepted as commit 0ac0c0d0f837 but later
  reverted (commit 35926ff5fba8) because it introduced arch specific
  __node_random which was defined only for x86 code so it broke other
  archs.  This is a followup without any arch specific code.  Other than
  that there are no functional changes.]

Some workloads that create a large number of small files tend to assign
too many pages to node 0 (multi-node systems).  Part of the reason is
that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts
at node 0 for newly created tasks.

This patch changes the rotor to be initialized to a random node number
of the cpuset.

[akpm@linux-foundation.org: fix layout]
[Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
[mhocko@suse.cz: Make it arch independent]
[akpm@linux-foundation.org: fix CONFIG_NUMA=y, MAX_NUMNODES>1 build]
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Paul Menage <menage@google.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Paul Menage <menage@google.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomemcg: get rid of percpu_charge_mutex lock
Michal Hocko [Tue, 26 Jul 2011 23:08:29 +0000]
memcg: get rid of percpu_charge_mutex lock

percpu_charge_mutex protects from multiple simultaneous per-cpu charge
caches draining because we might end up having too many work items.  At
least this was the case until commit 26fe61684449 ("memcg: fix percpu
cached charge draining frequency") when we introduced a more targeted
draining for async mode.

Now that sync draining is also targeted, we can safely remove the mutex
because we will not send more work than the current number of CPUs.
FLUSHING_CACHED_CHARGE protects from sending the same work multiple
times and stock->nr_pages == 0 protects from pointlessly sending work if
there is obviously nothing to be done.  This is of course racy but we
can live with it as the race window is really small (we would have to
see FLUSHING_CACHED_CHARGE cleared while nr_pages would be still
non-zero).

The only remaining place where we can race is synchronous mode when we
rely on FLUSHING_CACHED_CHARGE test which might have been set by other
drainer on the same group but we should wait in that case as well.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomemcg: add mem_cgroup_same_or_subtree() helper
Michal Hocko [Tue, 26 Jul 2011 23:08:29 +0000]
memcg: add mem_cgroup_same_or_subtree() helper

We are checking whether two given groups are the same or at least in the
same subtree of a hierarchy at several places.  Let's make a helper for
it to make code easier to read.
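
The idea behind such a helper can be sketched with plain parent
pointers.  This is an assumption made for illustration: the structure
and function names are hypothetical, and the kernel helper may rely on
different machinery to find ancestors.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_group {
	struct sketch_group *parent;	/* NULL for the top of the hierarchy */
};

/* True if 'child' is 'root' itself or lies somewhere below it. */
int sketch_same_or_subtree(const struct sketch_group *root,
			   const struct sketch_group *child)
{
	assert(root != NULL);
	for (; child; child = child->parent)
		if (child == root)
			return 1;
	return 0;
}
```

Call sites then read as a single question instead of an open-coded
equality test followed by an ancestor check.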

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomemcg: unify sync and async per-cpu charge cache draining
Michal Hocko [Tue, 26 Jul 2011 23:08:28 +0000]
memcg: unify sync and async per-cpu charge cache draining

Currently we have two ways to drain per-CPU caches for charges.
drain_all_stock_sync will synchronously drain all caches while
drain_all_stock_async will asynchronously drain only those that refer to
a given memory cgroup or its subtree in hierarchy.  Targeted async
draining has been introduced by 26fe6168 (memcg: fix percpu cached
charge draining frequency) to reduce the cpu workers number.

sync draining is currently triggered only from mem_cgroup_force_empty
which is triggered only by userspace (mem_cgroup_force_empty_write) or
when a cgroup is removed (mem_cgroup_pre_destroy).  Although these are
not usually frequent operations it still makes some sense to do targeted
draining as well, especially if the box has many CPUs.

This patch unifies both methods to use the single code (drain_all_stock)
which relies on the original async implementation and just adds
flush_work to wait on all caches that are still under work for the sync
mode.  We are using FLUSHING_CACHED_CHARGE bit check to prevent from
waiting on a work that we haven't triggered.  Please note that both sync
and async functions are currently protected by percpu_charge_mutex so we
cannot race with other drainers.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomemcg: do not try to drain per-cpu caches without pages
Michal Hocko [Tue, 26 Jul 2011 23:08:27 +0000]
memcg: do not try to drain per-cpu caches without pages

drain_all_stock_async tries to optimize a work to be done on the work
queue by excluding any work for the current CPU because it assumes that
the context we are called from already tried to charge from that cache
and it's failed so it must be empty already.

While the assumption is correct we can optimize it even more by checking
the current number of pages in the cache.  This will also reduce a work
on other CPUs with an empty stock.

For the current CPU we can simply call drain_local_stock rather than
deferring it to the work queue.

[kamezawa.hiroyu@jp.fujitsu.com: use drain_local_stock for current CPU optimization]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomemcg: add memory.vmscan_stat
KAMEZAWA Hiroyuki [Tue, 26 Jul 2011 23:08:26 +0000]
memcg: add memory.vmscan_stat

The commit log of 0ae5e89c60c9 ("memcg: count the soft_limit reclaim
in...") says it adds scanning stats to memory.stat file.  But it doesn't
because we considered that we needed to reach a consensus on such new APIs.

This patch is a trial to add memory.vmscan_stat.  This shows
  - the number of scanned pages (total, anon, file)
  - the number of rotated pages (total, anon, file)
  - the number of freed pages (total, anon, file)
  - the elapsed time (including sleep/pause time)

  for both of direct/soft reclaim.

The biggest difference from Ying's original one is that this file
can be reset by some write, as

  # echo 0 > ...../memory.vmscan_stat

Example of output is here. This is a result after make -j 6 kernel
under 300M limit.

  [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
  scanned_pages_by_limit 9471864
  scanned_anon_pages_by_limit 6640629
  scanned_file_pages_by_limit 2831235
  rotated_pages_by_limit 4243974
  rotated_anon_pages_by_limit 3971968
  rotated_file_pages_by_limit 272006
  freed_pages_by_limit 2318492
  freed_anon_pages_by_limit 962052
  freed_file_pages_by_limit 1356440
  elapsed_ns_by_limit 351386416101
  scanned_pages_by_system 0
  scanned_anon_pages_by_system 0
  scanned_file_pages_by_system 0
  rotated_pages_by_system 0
  rotated_anon_pages_by_system 0
  rotated_file_pages_by_system 0
  freed_pages_by_system 0
  freed_anon_pages_by_system 0
  freed_file_pages_by_system 0
  elapsed_ns_by_system 0
  scanned_pages_by_limit_under_hierarchy 9471864
  scanned_anon_pages_by_limit_under_hierarchy 6640629
  scanned_file_pages_by_limit_under_hierarchy 2831235
  rotated_pages_by_limit_under_hierarchy 4243974
  rotated_anon_pages_by_limit_under_hierarchy 3971968
  rotated_file_pages_by_limit_under_hierarchy 272006
  freed_pages_by_limit_under_hierarchy 2318492
  freed_anon_pages_by_limit_under_hierarchy 962052
  freed_file_pages_by_limit_under_hierarchy 1356440
  elapsed_ns_by_limit_under_hierarchy 351386416101
  scanned_pages_by_system_under_hierarchy 0
  scanned_anon_pages_by_system_under_hierarchy 0
  scanned_file_pages_by_system_under_hierarchy 0
  rotated_pages_by_system_under_hierarchy 0
  rotated_anon_pages_by_system_under_hierarchy 0
  rotated_file_pages_by_system_under_hierarchy 0
  freed_pages_by_system_under_hierarchy 0
  freed_anon_pages_by_system_under_hierarchy 0
  freed_file_pages_by_system_under_hierarchy 0
  elapsed_ns_by_system_under_hierarchy 0

total_xxxx is for hierarchy management.

This will be useful for further memcg developments and needs to be
developed before we do some complicated rework on LRU/softlimit
management.

This patch adds a new struct memcg_scanrecord into scan_control struct.
sc->nr_scanned et al. are not designed for exporting information.  For
example, nr_scanned is reset frequently and incremented by 2 when
scanning mapped pages.

To avoid complexity, I added a new param in scan_control which is for
exporting scanning score.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Andrew Bresticker <abrestic@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomemcg: fix behavior of mem_cgroup_resize_limit()
Daisuke Nishimura [Tue, 26 Jul 2011 23:08:25 +0000]
memcg: fix behavior of mem_cgroup_resize_limit()

Commit 22a668d7c3ef ("memcg: fix behavior under memory.limit equals to
memsw.limit") introduced "memsw_is_minimum" flag, which becomes true
when mem_limit == memsw_limit.  The flag is checked at the beginning of
reclaim, and "noswap" is set if the flag is true, because using swap is
meaningless in this case.

This works well in most cases, but when we try to shrink mem_limit,
which is the same as memsw_limit now, we might fail to shrink mem_limit
because swap isn't used.

This patch fixes this behavior by:
 - check MEM_CGROUP_RECLAIM_SHRINK at the beginning of reclaim
 - If it is set, don't set "noswap" flag even if memsw_is_minimum is true.
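
The two steps above amount to a small decision, sketched here in
userspace C.  The flag value and the function name are hypothetical, and
the sketch models only the shrink flag and memsw_is_minimum, not the
other inputs the real reclaim path may consider.

```c
#include <assert.h>

/* Hypothetical flag value standing in for MEM_CGROUP_RECLAIM_SHRINK. */
#define SKETCH_RECLAIM_SHRINK 0x2u

/* Return 1 if reclaim should avoid swap, 0 otherwise. */
int sketch_noswap(unsigned int flags, int memsw_is_minimum)
{
	int shrink;

	/* Only one flag is modelled in this sketch. */
	assert((flags & ~SKETCH_RECLAIM_SHRINK) == 0);
	shrink = (flags & SKETCH_RECLAIM_SHRINK) != 0;

	/* Swap is pointless when the limits are equal, unless shrinking. */
	return memsw_is_minimum && !shrink;
}
```

When the limit is being shrunk, swap stays available even though
mem_limit equals memsw_limit, so the shrink has a chance to succeed.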

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomemcg: fix vmscan count in small memcgs
KAMEZAWA Hiroyuki [Tue, 26 Jul 2011 23:08:24 +0000]
memcg: fix vmscan count in small memcgs

Commit 246e87a93934 ("memcg: fix get_scan_count() for small targets")
fixes the memcg/kswapd behavior against small targets and prevents the
vmscan priority from getting too high.

But the implementation is too naive and adds another problem to small
memcgs.  It always forces a scan of 32 pages of file/anon and doesn't
handle swappiness and other rotate_info.  It makes vmscan scan the anon
LRU regardless of swappiness and makes reclaim bad.  This patch fixes it
by adjusting the scanning count with regard to swappiness et al.

At a test "cat 1G file under 300M limit." (swappiness=20)
 before patch
        scanned_pages_by_limit 360919
        scanned_anon_pages_by_limit 180469
        scanned_file_pages_by_limit 180450
        rotated_pages_by_limit 31
        rotated_anon_pages_by_limit 25
        rotated_file_pages_by_limit 6
        freed_pages_by_limit 180458
        freed_anon_pages_by_limit 19
        freed_file_pages_by_limit 180439
        elapsed_ns_by_limit 429758872
 after patch
        scanned_pages_by_limit 180674
        scanned_anon_pages_by_limit 24
        scanned_file_pages_by_limit 180650
        rotated_pages_by_limit 35
        rotated_anon_pages_by_limit 24
        rotated_file_pages_by_limit 11
        freed_pages_by_limit 180634
        freed_anon_pages_by_limit 0
        freed_file_pages_by_limit 180634
        elapsed_ns_by_limit 367119089
        scanned_pages_by_system 0

The number of scanned anon pages is decreased (as expected), and elapsed
time is reduced.  With this patch, small memcgs will work better.
(*) Because the amount of file cache is much bigger than anon,
    reclaim_stat's rotate-scan counter makes vmscan scan files more.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomemcg: change memcg_oom_mutex to spinlock
Michal Hocko [Tue, 26 Jul 2011 23:08:24 +0000]
memcg: change memcg_oom_mutex to spinlock

memcg_oom_mutex is used to protect memcg OOM path and eventfd interface
for oom_control.  None of the critical sections which it protects sleep
(eventfd_signal works from atomic context and the rest are simple linked
list resp.  oom_lock atomic operations).

Mutex is also too heavyweight for those code paths because it triggers a
lot of scheduling.  It also makes convoying effects more visible
when we have a large number of OOM kills because we take the lock
multiple times during mem_cgroup_handle_oom so we have multiple places
where many processes can sleep.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomemcg: make oom_lock 0 and 1 based rather than counter
Michal Hocko [Tue, 26 Jul 2011 23:08:23 +0000]
memcg: make oom_lock 0 and 1 based rather than counter

Commit 867578cb ("memcg: fix oom kill behavior") introduced an oom_lock
counter which is incremented by mem_cgroup_oom_lock when we are about to
handle memcg OOM situation.  mem_cgroup_handle_oom falls back to a sleep
if oom_lock > 1 to prevent multiple oom kills at the same time.
The counter is then decremented by mem_cgroup_oom_unlock called from the
same function.

This works correctly but it can lead to serious starvations when we have
many processes triggering OOM and many CPUs available for them (I have
tested with 16 CPUs).

Consider a process (call it A) which gets the oom_lock (the first one
that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and other
processes that are blocked on the mutex.  While A releases the mutex and
calls mem_cgroup_out_of_memory others will wake up (one after another)
and increase the counter and fall into sleep (memcg_oom_waitq).

Once A finishes mem_cgroup_out_of_memory it takes the mutex again and
decreases oom_lock and wakes other tasks (if releasing memory by
somebody else - e.g.  killed process - hasn't done it yet).

A testcase would look like:
  Assume malloc XXX is a program allocating XXX Megabytes of memory
  which touches all allocated pages in a tight loop
  # swapoff SWAP_DEVICE
  # cgcreate -g memory:A
  # cgset -r memory.oom_control=0   A
  # cgset -r memory.limit_in_bytes=200M A
  # for i in `seq 100`
  # do
  #     cgexec -g memory:A   malloc 10 &
  # done

The main problem here is that all processes still race for the mutex and
there is no guarantee that the counter will get back to 0 for those that
got back to mem_cgroup_handle_oom.  In the end the whole convoy
increments and decrements the counter but it never reaches 1, which
would enable killing, so nothing useful can be done.  The time is
basically unbounded because it depends heavily on scheduling and on the
ordering on the mutex (I have seen this taking hours...).

This patch replaces the counter by a simple {un}lock semantic.  As
mem_cgroup_oom_{un}lock works on a subtree of a hierarchy we have to
make sure that nobody else races with us, which is guaranteed by the
memcg_oom_mutex.

We have to be careful while locking subtrees because we can encounter a
subtree which is already locked.  Consider this hierarchy:

          A
        /   \
       B     \
      /\      \
     C  D     E

The B - C - D subtree might already be locked.  We want to allow locking
the E subtree, because the two OOM situations cannot influence each
other, but we definitely do not want to allow locking A.

Therefore we have to refuse the lock if any subtree is already locked and
clear the lock on all nodes that were set up to the failure point.

On the other hand we have to make sure that the rest of the world will
recognize that a group is under OOM even though it doesn't have a lock.
Therefore we have to introduce an under_oom variable which is incremented
and decremented for the whole subtree when we enter and leave
mem_cgroup_handle_oom, respectively.  under_oom, unlike oom_lock, doesn't
need to be updated under memcg_oom_mutex because its users only check a
single group and they use atomic operations for that.

This can be checked easily by the following test case:

  # cgcreate -g memory:A
  # cgset -r memory.use_hierarchy=1 A
  # cgset -r memory.oom_control=1   A
  # cgset -r memory.limit_in_bytes=100M A
  # cgset -r memory.memsw.limit_in_bytes=100M A
  # cgcreate -g memory:A/B
  # cgset -r memory.oom_control=1 A/B
  # cgset -r memory.limit_in_bytes=20M A/B
  # cgset -r memory.memsw.limit_in_bytes=20M A/B
  # cgexec -g memory:A/B malloc 30  &    #->this will be blocked by OOM of group B
  # cgexec -g memory:A   malloc 80  &    #->this will be blocked by OOM of group A

While B gets oom_lock, A will not get it.  Both of them go to sleep and
wait for an external action.  We can raise the limit for A to force it
to wake up:

  # cgset -r memory.memsw.limit_in_bytes=300M A
  # cgset -r memory.limit_in_bytes=300M A

malloc in A has to wake up even though it doesn't have oom_lock.

Finally, the unlock path is very easy because we always unlock only the
subtree we have locked previously while we always decrement under_oom.
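
The locking rules above can be modelled in user space.  The following is a
minimal sketch, not the kernel code: struct group, the subtree_* helpers and
the MAX_* limits are hypothetical names chosen for illustration, the caller
is assumed to hold the equivalent of memcg_oom_mutex, and under_oom is a
plain int here although the commit message says its users rely on atomic
operations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 4	/* illustrative limits, not kernel values */
#define MAX_NODES    16

/* Hypothetical stand-in for one memory cgroup in a hierarchy. */
struct group {
	bool oom_lock;		/* 0/1 lock, no longer a counter */
	int under_oom;		/* number of OOM handlers covering this group */
	size_t nr_children;
	struct group *children[MAX_CHILDREN];
};

/* Flatten the subtree rooted at g into out[] in pre-order. */
static size_t collect(struct group *g, struct group **out, size_t n)
{
	assert(n < MAX_NODES);
	out[n++] = g;
	for (size_t i = 0; i < g->nr_children; i++)
		n = collect(g->children[i], out, n);
	return n;
}

/*
 * Lock every group in the subtree, or none of them: if a group is
 * already locked, undo the locks taken before the failure point.
 */
static bool subtree_oom_trylock(struct group *root)
{
	struct group *nodes[MAX_NODES];
	size_t n = collect(root, nodes, 0);

	for (size_t i = 0; i < n; i++) {
		if (nodes[i]->oom_lock) {
			while (i-- > 0)
				nodes[i]->oom_lock = false;
			return false;
		}
		nodes[i]->oom_lock = true;
	}
	return true;
}

/* Only call this on a subtree that subtree_oom_trylock() locked. */
static void subtree_oom_unlock(struct group *root)
{
	struct group *nodes[MAX_NODES];
	size_t n = collect(root, nodes, 0);

	for (size_t i = 0; i < n; i++)
		nodes[i]->oom_lock = false;
}

/* under_oom is adjusted for the whole subtree, lock or no lock. */
static void subtree_mark_under_oom(struct group *root, int delta)
{
	struct group *nodes[MAX_NODES];
	size_t n = collect(root, nodes, 0);

	for (size_t i = 0; i < n; i++)
		nodes[i]->under_oom += delta;
}
```

With the A/B/C/D/E hierarchy shown earlier, locking B and then E succeeds,
while a later attempt on A fails and leaves A itself unlocked.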

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomemcg: consolidate memory cgroup lru stat functions
KAMEZAWA Hiroyuki [Tue, 26 Jul 2011 23:08:22 +0000]
memcg: consolidate memory cgroup lru stat functions

In mm/memcontrol.c, there are many LRU stat functions, such as:

  mem_cgroup_zone_nr_lru_pages
  mem_cgroup_node_nr_file_lru_pages
  mem_cgroup_nr_file_lru_pages
  mem_cgroup_node_nr_anon_lru_pages
  mem_cgroup_nr_anon_lru_pages
  mem_cgroup_node_nr_unevictable_lru_pages
  mem_cgroup_nr_unevictable_lru_pages
  mem_cgroup_node_nr_lru_pages
  mem_cgroup_nr_lru_pages
  mem_cgroup_get_local_zonestat

Some of them are under #ifdef MAX_NUMNODES >1 and others are not.
This seems bad. This patch consolidates all functions into

  mem_cgroup_zone_nr_lru_pages()
  mem_cgroup_node_nr_lru_pages()
  mem_cgroup_nr_lru_pages()

For these functions, "which LRU?" information is passed by a mask.

example:
  mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON))

I also added some macros: ALL_LRU, ALL_LRU_FILE and ALL_LRU_ANON.

example:
  mem_cgroup_nr_lru_pages(mem, ALL_LRU)

BTW, considering the NUMA memory placement of the counters, this patch
seems to be better.

Now, when we gather all LRU information, we scan in the following order
    for_each_lru -> for_each_node -> for_each_zone.

This means we'll touch cache lines in different nodes in turn.

After the patch, we'll scan
    for_each_node -> for_each_zone -> for_each_lru(mask)

Then, we'll gather information in the same cacheline at once.
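
A minimal user-space sketch of the mask-based interface and the new loop
order follows.  It is an illustration only: struct zone_stat, NR_NODES,
NR_ZONES and the two helper functions are hypothetical, and the macro
definitions are one plausible way to write ALL_LRU, ALL_LRU_FILE and
ALL_LRU_ANON, not necessarily the definitions used by the patch.

```c
#include <assert.h>
#include <stddef.h>

enum lru_list {
	LRU_INACTIVE_ANON,
	LRU_ACTIVE_ANON,
	LRU_INACTIVE_FILE,
	LRU_ACTIVE_FILE,
	LRU_UNEVICTABLE,
	NR_LRU_LISTS
};

#define BIT(nr)		(1UL << (nr))
/* Assumed definitions, for illustration only. */
#define ALL_LRU_FILE	(BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE))
#define ALL_LRU_ANON	(BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
#define ALL_LRU		(BIT(NR_LRU_LISTS) - 1)

#define NR_NODES 2	/* hypothetical topology */
#define NR_ZONES 3

/* Counters for all LRU lists of one zone sit next to each other. */
struct zone_stat {
	unsigned long count[NR_LRU_LISTS];
};

/* Sum the lists selected by mask within a single zone. */
static unsigned long zone_nr_lru_pages(const struct zone_stat *zs,
				       unsigned long mask)
{
	unsigned long total = 0;

	assert((mask & ~ALL_LRU) == 0);
	for (int lru = 0; lru < NR_LRU_LISTS; lru++)
		if (mask & BIT(lru))
			total += zs->count[lru];
	return total;
}

/* for_each_node -> for_each_zone -> for_each_lru(mask) */
static unsigned long nr_lru_pages(struct zone_stat stats[NR_NODES][NR_ZONES],
				  unsigned long mask)
{
	unsigned long total = 0;

	for (size_t node = 0; node < NR_NODES; node++)
		for (size_t zone = 0; zone < NR_ZONES; zone++)
			total += zone_nr_lru_pages(&stats[node][zone], mask);
	return total;
}
```

Because the innermost loop walks one zone's counters, a single pass per zone
serves any combination of lists, which is the locality argument made above.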

[akpm@linux-foundation.org: fix warnings, build error]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomemcg: export memory cgroup's swappiness with mem_cgroup_swappiness()
KAMEZAWA Hiroyuki [Tue, 26 Jul 2011 23:08:21 +0000]
memcg: export memory cgroup's swappiness with mem_cgroup_swappiness()

Each memory cgroup has a 'swappiness' value which can be accessed by
get_swappiness(memcg).  The major user is try_to_free_mem_cgroup_pages(),
to which swappiness is passed as an argument and then propagated via
scan_control.

get_swappiness() is a static function, but some planned updates will need
to get swappiness from files other than memcontrol.c.  This patch exports
get_swappiness() as mem_cgroup_swappiness().  With this, we can remove the
swappiness argument from try_to_free...  and drop swappiness from
scan_control, since only memcg uses it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agortc: fix hrtimer deadlock
Thomas Gleixner [Tue, 26 Jul 2011 23:08:20 +0000]
rtc: fix hrtimer deadlock

Ben reported a lockup related to rtc. The lockup happens due to:

CPU0                                        CPU1

rtc_irq_set_state()                         __run_hrtimer()
  spin_lock_irqsave(&rtc->irq_task_lock)      rtc_handle_legacy_irq();
                                                spin_lock(&rtc->irq_task_lock);
  hrtimer_cancel()
    while (callback_running);

So the running callback never finishes as it's blocked on
rtc->irq_task_lock.

Use hrtimer_try_to_cancel() instead and drop rtc->irq_task_lock while
waiting for the callback.  Fix this for both rtc_irq_set_state() and
rtc_irq_set_freq().
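
The retry pattern can be illustrated with a single-threaded user-space
model.  This is only a sketch under stated assumptions: struct model and the
model_* helpers are hypothetical, the "other CPU" is simulated by a function
call, and model_try_to_cancel() merely mirrors the return convention of
hrtimer_try_to_cancel() (-1 while the callback is running, 0 if the timer
was not active, 1 if it was active and has been cancelled).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the lock and of the timer that needs it. */
struct model {
	bool lock_held;		/* stands in for rtc->irq_task_lock */
	int callback_steps;	/* > 0: callback running, still needs the lock */
	bool timer_active;
};

/* Same return convention as hrtimer_try_to_cancel(). */
static int model_try_to_cancel(struct model *m)
{
	if (m->callback_steps > 0)
		return -1;
	if (!m->timer_active)
		return 0;
	m->timer_active = false;
	return 1;
}

/* The "other CPU": the callback only progresses while the lock is free. */
static void model_other_cpu_runs(struct model *m)
{
	if (!m->lock_held && m->callback_steps > 0)
		m->callback_steps--;
}

/* Cancel the timer, dropping the lock whenever the callback is running. */
static int cancel_dropping_lock(struct model *m)
{
	int retries = 0;

	for (;;) {
		assert(!m->lock_held);
		m->lock_held = true;		/* take the lock */
		if (model_try_to_cancel(m) >= 0)
			break;
		m->lock_held = false;		/* drop it so the callback can finish */
		model_other_cpu_runs(m);	/* where the real code would relax */
		retries++;
	}
	m->lock_held = false;
	return retries;
}
```

In this model a blocking cancel with the lock held would never return,
because callback_steps only decreases while the lock is free.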

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Ben Greear <greearb@candelatech.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agortc: limit frequency
Thomas Gleixner [Tue, 26 Jul 2011 23:08:19 +0000]
rtc: limit frequency

Due to the hrtimer self rearming mode a user can DoS the machine simply
because it's starved by hrtimer events.

The RTC hrtimer is self rearming.  We really need to limit the frequency
to something sensible.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ben Greear <greearb@candelatech.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agortc: handle errors correctly in rtc_irq_set_state()
Thomas Gleixner [Tue, 26 Jul 2011 23:08:18 +0000]
rtc: handle errors correctly in rtc_irq_set_state()

The code checks the correctness of the parameters, but unconditionally
arms/disarms the hrtimer.

The result is that a random task might arm/disarm rtc timer and surprise
the real owner by either generating events or by stopping them.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ben Greear <greearb@candelatech.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agomn10300, exec: remove redundant set_fs(USER_DS)
Mathias Krause [Tue, 26 Jul 2011 23:08:17 +0000]
mn10300, exec: remove redundant set_fs(USER_DS)

The address limit is already set in flush_old_exec() so this
set_fs(USER_DS) is redundant.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agodrivers/base/power/opp.c: fix dev_opp initial value
Jonghwan Choi [Tue, 26 Jul 2011 23:08:16 +0000]
drivers/base/power/opp.c: fix dev_opp initial value

The initial value of dev_opp should be ERR_PTR(), because IS_ERR() is
used to check for errors.
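
Why the initial value matters can be seen in a user-space re-creation of the
error-pointer helpers.  This is a sketch, not the kernel headers: the
helpers follow the usual kernel convention that the top MAX_ERRNO addresses
encode errors, and find_entry() with its table is a hypothetical lookup.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *ERR_PTR(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True only for pointers in the error range; NULL is NOT an error here. */
static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static int table[1] = { 42 };

/* Hypothetical lookup whose callers test the result with IS_ERR(). */
static int *find_entry(bool present)
{
	int *entry = ERR_PTR(-ENODEV);	/* start from an error, not NULL */

	if (present)
		entry = &table[0];
	return entry;
}
```

Had find_entry() started from NULL, a failed lookup would pass the caller's
IS_ERR() check and be dereferenced.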

Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agofrv, exec: remove redundant set_fs(USER_DS)
Mathias Krause [Tue, 26 Jul 2011 23:08:15 +0000]
frv, exec: remove redundant set_fs(USER_DS)

The address limit is already set in flush_old_exec() so those calls to
set_fs(USER_DS) are redundant.

Also removed the dead code in flush_thread().

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoi2c-eg20t : Fix the issue of Combined R/W transfer mode
Tomoya MORINAGA [Thu, 23 Jun 2011 07:17:10 +0000]
i2c-eg20t : Fix the issue of Combined R/W transfer mode

issue-1
If a combined transfer fails partway through, processing must stop at that point.
Currently, however, processing continues.
This patch stops processing on failure.

issue-2
Currently, pch_i2c_xfer returns the read/write size at that time.
However, pch_i2c_xfer must return the number of messages to be read/written.
This patch corrects the return value.
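
Both fixes can be shown in a small user-space sketch.  It is not the
i2c-eg20t driver: struct msg, transfer_one() and xfer() are hypothetical
stand-ins that only model the convention of returning the message count on
success and a negative errno on the first failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message: a length and a flag to simulate a bus error. */
struct msg {
	size_t len;
	bool will_fail;
};

static int transfers_attempted;

/* Pretend to move one message; returns bytes moved or a negative errno. */
static int transfer_one(const struct msg *m)
{
	transfers_attempted++;
	return m->will_fail ? -EIO : (int)m->len;
}

/*
 * Issue 1: stop at the first failing message instead of carrying on.
 * Issue 2: on success return the number of messages, not a byte count.
 */
static int xfer(const struct msg *msgs, int num)
{
	assert(num >= 0);
	for (int i = 0; i < num; i++) {
		int ret = transfer_one(&msgs[i]);

		if (ret < 0)
			return ret;
	}
	return num;
}
```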

Signed-off-by: Tomoya MORINAGA <tomoya-linux@dsn.okisemi.com>
Signed-off-by: Ben Dooks <ben-linux@fluff.org>

8 years agoi2c-eg20t : Support Combined R/W transfer mode
Tomoya MORINAGA [Thu, 9 Jun 2011 02:29:29 +0000]
i2c-eg20t : Support Combined R/W transfer mode

Currently, Combined R/W transfer mode is not supported.
This patch enables Combined R/W transfer mode.

Signed-off-by: Tomoya MORINAGA <tomoya-linux@dsn.okisemi.com>
Signed-off-by: Ben Dooks <ben-linux@fluff.org>

8 years agoi2c: Tegra: Add DeviceTree support
John Bonesio [Wed, 22 Jun 2011 16:16:56 +0000]
i2c: Tegra: Add DeviceTree support

This patch modifies the tegra i2c driver so that it can be initialized
using the device tree along with the devices connected to the i2c bus.

Signed-off-by: John Bonesio <bones@secretlab.ca>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: OIof Johansson <olof@lixom.net>
Signed-off-by: Ben Dooks <ben-linux@fluff.org>

8 years agoMerge branch 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus
Linus Torvalds [Tue, 26 Jul 2011 21:17:28 +0000]
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus

* 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus: (31 commits)
  MIPS: Close races in TLB modify handlers.
  MIPS: Add uasm UASM_i_SRL_SAFE macro.
  MIPS: RB532: Use hex_to_bin()
  MIPS: Enable cpu_has_clo_clz for MIPS Technologies' platforms
  MIPS: PowerTV: Provide cpu-feature-overrides.h
  MIPS: Remove pointless return statement from empty void functions.
  MIPS: Limit fixrange_init() to the FIXMAP region
  MIPS: Install handlers for software IRQs
  MIPS: Move FIXADDR_TOP into spaces.h
  MIPS: Add SYNC after cacheflush
  MIPS: pfn_valid() is broken on low memory HIGHMEM systems
  MIPS: HIGHMEM DMA on noncoherent MIPS32 processors
  MIPS: topdown mmap support
  MIPS: Remove redundant addr_limit assignment on exec.
  MIPS: AR7: Replace __attribute__((__packed__)) with __packed
  MIPS: AR7: Remove 'space before tabs' in platform.c
  MIPS: Lantiq: Add missing clk_enable and clk_disable functions.
  MIPS: AR7: Fix trailing semicolon bug in clock.c
  MAINTAINERS: Update MIPS entry.
  MIPS: BCM63xx: Remove duplicate PERF_IRQSTAT_REG definition
  ...

8 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph...
Linus Torvalds [Tue, 26 Jul 2011 20:38:50 +0000]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
  ceph: document unlocked d_parent accesses
  ceph: explicitly reference rename old_dentry parent dir in request
  ceph: document locking for ceph_set_dentry_offset
  ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug
  ceph: protect d_parent access in ceph_d_revalidate
  ceph: protect access to d_parent
  ceph: handle racing calls to ceph_init_dentry
  ceph: set dir complete frag after adding capability
  rbd: set blk_queue request sizes to object size
  ceph: set up readahead size when rsize is not passed
  rbd: cancel watch request when releasing the device
  ceph: ignore lease mask
  ceph: fix ceph_lookup_open intent usage
  ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC
  ceph: fix bad parent_inode calc in ceph_lookup_open
  ceph: avoid carrying Fw cap during write into page cache
  libceph: don't time out osd requests that haven't been received
  ceph: report f_bfree based on kb_avail rather than diffing.
  ceph: only queue capsnap if caps are dirty
  ceph: fix snap writeback when racing with writes
  ...

8 years agomerge fchmod() and fchmodat() guts, kill ancient broken kludge
Al Viro [Tue, 26 Jul 2011 08:15:54 +0000]
merge fchmod() and fchmodat() guts, kill ancient broken kludge

The kludge in question is undocumented and doesn't work for 32bit
binaries on amd64, sparc64 and s390.  Passing (mode_t)-1 as
mode has (since 0.99.14v and contrary to the behaviour of any
other Unix, the prescriptions of POSIX, SuS and our own manpages)
been a kinda-sorta no-op.  Note that any software relying on
that (and looking for examples shows none) would be visibly
broken on sparc64, where practically all userland is built
32bit.  No such complaints noticed...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8 years agoxfs: fix misspelled S_IS...()
Al Viro [Tue, 26 Jul 2011 00:54:24 +0000]
xfs: fix misspelled S_IS...()

mode_t is not a bitmap...
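
To illustrate the point with a user-space sketch (not the xfs code; both
helper names are hypothetical): the file type occupies a multi-bit field of
mode_t selected by S_IFMT, so testing a single S_IF* value as if it were a
flag gives false positives.  The block-device example relies on the Linux
values of S_IFDIR and S_IFBLK, which share a bit.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* make sure the S_IF* constants are visible */
#endif
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Wrong: treats the type field as if each type had its own flag bit. */
static bool naive_is_dir(mode_t mode)
{
	return (mode & S_IFDIR) != 0;
}

/* Right: compare the whole type field, which is what S_ISDIR() does. */
static bool is_dir(mode_t mode)
{
	bool by_mask = (mode & S_IFMT) == S_IFDIR;

	assert(by_mask == (S_ISDIR(mode) != 0));
	return by_mask;
}
```

On Linux, naive_is_dir(S_IFBLK | 0660) is true although the mode describes a
block device, while is_dir() correctly reports false.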

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8 years agoxfs: get rid of open-coded S_ISREG(), etc.
Al Viro [Tue, 26 Jul 2011 06:31:30 +0000]
xfs: get rid of open-coded S_ISREG(), etc.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8 years agogma500: udelay(20000) is too long again
Stephen Rothwell [Mon, 25 Jul 2011 05:18:44 +0000]
gma500: udelay(20000) is too long again

so replace it with mdelay(20).

Fixes build error:

  ERROR: "__bad_udelay" [drivers/staging/gma500/psb_gfx.ko] undefined!

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoUSB / Renesas: Fix build issue related to struct scatterlist
Rafael J. Wysocki [Tue, 26 Jul 2011 18:51:01 +0000]
USB / Renesas: Fix build issue related to struct scatterlist

Fix build issue caused by undefined struct scatterlist in
drivers/usb/renesas_usbhs/fifo.c.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoMMC / TMIO: Fix build issue related to struct scatterlist
Rafael J. Wysocki [Tue, 26 Jul 2011 18:50:23 +0000]
MMC / TMIO: Fix build issue related to struct scatterlist

Fix build issue caused by undefined struct scatterlist in
drivers/mmc/host/tmio_mmc.c.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8 years agoMerge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux...
Linus Torvalds [Tue, 26 Jul 2011 18:34:40 +0000]
Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  jbd: change the field "b_cow_tid" of struct journal_head from type unsigned to tid_t
  ext3.txt: update the links in the section "useful links" to the latest ones
  ext3: Fix data corruption in inodes with journalled data
  ext2: check xattr name_len before acquiring xattr_sem in ext2_xattr_get
  ext3: Fix compilation with -DDX_DEBUG
  quota: Remove unused declaration
  jbd: Use WRITE_SYNC in journal checkpoint.
  jbd: Fix oops in journal_remove_journal_head()
  ext3: Return -EINVAL when start is beyond the end of fs in ext3_trim_fs()
  ext3/ioctl.c: silence sparse warnings about different address spaces
  ext3/ext4 Documentation: remove bh/nobh since it has been deprecated
  ext3: Improve truncate error handling
  ext3: use proper little-endian bitops
  ext2: include fs.h into ext2_fs.h
  ext3: Fix oops in ext3_try_to_allocate_with_rsv()
  jbd: fix a bug of leaking jh->b_jcount
  jbd: remove dependency on __GFP_NOFAIL
  ext3: Convert ext3 to new truncate calling convention
  jbd: Add fixed tracepoints
  ext3: Add fixed tracepoints

Resolve conflicts in fs/ext3/fsync.c due to fsync locking push-down and
new fixed tracepoints.

8 years agoceph: document unlocked d_parent accesses
Sage Weil [Tue, 26 Jul 2011 18:31:26 +0000]
ceph: document unlocked d_parent accesses

For the most part we don't care about racing with rename when directing
MDS requests; either the old or new parent is fine.  Document that, and
do some minor cleanup.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

8 years agoceph: explicitly reference rename old_dentry parent dir in request
Sage Weil [Tue, 26 Jul 2011 18:31:14 +0000]
ceph: explicitly reference rename old_dentry parent dir in request

We carry a pin on the parent directory for the rename source and dest
dentries.  For the source it's r_locked_dir; we need to explicitly
reference the old_dentry parent as well, since the dentry's d_parent may
change between when the request was created and pinned and when it is
freed.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>