7 years agoMerge branch 'audit' of git://git.linaro.org/people/rmk/linux-arm
Linus Torvalds [Wed, 1 Aug 2012 23:35:37 +0000]
Merge branch 'audit' of git://git.linaro.org/people/rmk/linux-arm

Pull ARM audit/signal updates from Russell King:
 "ARM audit/signal handling updates from Al and Will.  This improves on
  the work Viro did last merge window, and sorts out some of the issues
  found with that work."

* 'audit' of git://git.linaro.org/people/rmk/linux-arm:
  ARM: 7475/1: sys_trace: allow all syscall arguments to be updated via ptrace
  ARM: 7474/1: get rid of TIF_SYSCALL_RESTARTSYS
  ARM: 7473/1: deal with handlerless restarts without leaving the kernel
  ARM: 7472/1: pull all work_pending logics into C function
  ARM: 7471/1: Revert "7442/1: Revert "remove unused restart trampoline""
  ARM: 7470/1: Revert "7443/1: Revert "new way of handling ERESTART_RESTARTBLOCK""

7 years agoMerge branch 'fixes' of git://git.linaro.org/people/rmk/linux-arm
Linus Torvalds [Wed, 1 Aug 2012 23:30:45 +0000]
Merge branch 'fixes' of git://git.linaro.org/people/rmk/linux-arm

Pull ARM fixes from Russell King:
 "This fixes various issues found during July"

* 'fixes' of git://git.linaro.org/people/rmk/linux-arm:
  ARM: 7479/1: mm: avoid NULL dereference when flushing gate_vma with VIVT caches
  ARM: Fix undefined instruction exception handling
  ARM: 7480/1: only call smp_send_stop() on SMP
  ARM: 7478/1: errata: extend workaround for erratum #720789
  ARM: 7477/1: vfp: Always save VFP state in vfp_pm_suspend on UP
  ARM: 7476/1: vfp: only clear vfp state for current cpu in vfp_pm_suspend
  ARM: 7468/1: ftrace: Trace function entry before updating index
  ARM: 7467/1: mutex: use generic xchg-based implementation for ARMv6+
  ARM: 7466/1: disable interrupt before spinning endlessly
  ARM: 7465/1: Handle >4GB memory sizes in device tree and mem=size@start option

7 years agoKVM: VMX: Fix ds/es corruption on i386 with preemption
Avi Kivity [Wed, 1 Aug 2012 13:48:03 +0000]
KVM: VMX: Fix ds/es corruption on i386 with preemption

Commit b2da15ac26a0c ("KVM: VMX: Optimize %ds, %es reload") broke i386
in the following scenario:

  vcpu_load
  ...
  vmx_save_host_state
  vmx_vcpu_run
  (ds.rpl, es.rpl cleared by hardware)

  interrupt
    push ds, es  # pushes bad ds, es
    schedule
      vmx_vcpu_put
        vmx_load_host_state
          reload ds, es (with __USER_DS)
    pop ds, es  # of other thread's stack
    iret
  # other thread runs
  interrupt
    push ds, es
    schedule  # back in vcpu thread
    pop ds, es  # now with rpl=0
    iret
  ...
  vcpu_put
  resume_userspace
  iret  # clears ds, es due to mismatched rpl

(instead of resume_userspace, we might return with SYSEXIT and then
take an exception; when the exception IRETs we end up with cleared
ds, es)

Fix by avoiding the optimization on i386 and reloading ds, es on the
lightweight exit path.

Reported-by: Chris Clayron <chris2553@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

7 years agox86-64, kcmp: The kcmp system call can be common
H. Peter Anvin [Wed, 1 Aug 2012 22:59:58 +0000]
x86-64, kcmp: The kcmp system call can be common

We already use the same system call handler for i386 and x86-64, there
is absolutely no reason x32 can't use the same system call, too.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: <stable@vger.kernel.org> v3.5
Link: http://lkml.kernel.org/n/tip-vwzk3qbcr3yjyxjg2j38vgy9@git.kernel.org

7 years agoum: Add arch/x86/um to MAINTAINERS
Richard Weinberger [Wed, 1 Aug 2012 23:00:47 +0000]
um: Add arch/x86/um to MAINTAINERS

Signed-off-by: Richard Weinberger <richard@nod.at>

7 years agoum: pass siginfo to guest process
Martin Pärtel [Wed, 1 Aug 2012 22:49:17 +0000]
um: pass siginfo to guest process

UML guest processes now get correct siginfo_t for SIGTRAP, SIGFPE,
SIGILL and SIGBUS. Specifically, si_addr and si_code are now correct
where previously they were si_addr = NULL and si_code = 128.

Signed-off-by: Martin Pärtel <martin.partel@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>

7 years agoum: fix ubd_file_size for read-only files
Martin Pärtel [Wed, 1 Aug 2012 22:44:22 +0000]
um: fix ubd_file_size for read-only files

Made ubd_file_size not request write access. Fixes use of read-only images.

Signed-off-by: Martin Pärtel <martin.partel@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>

7 years agomd/dm-raid: DM_RAID should select MD_RAID10
NeilBrown [Wed, 1 Aug 2012 22:35:43 +0000]
md/dm-raid: DM_RAID should select MD_RAID10

Now that DM_RAID supports raid10, it needs to select that code
to ensure it is included.

Cc: Jonathan Brassow <jbrassow@redhat.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

7 years agomd/raid1: submit IO from originating thread instead of md thread.
NeilBrown [Wed, 1 Aug 2012 22:33:20 +0000]
md/raid1: submit IO from originating thread instead of md thread.

queuing writes to the md thread means that all requests go through the
one processor which may not be able to keep up with very high request
rates.

So use the plugging infrastructure to submit all requests on unplug.
If a 'schedule' is needed, we fall back on the old approach of handing
the requests to the thread for it to handle.

Signed-off-by: NeilBrown <neilb@suse.de>

7 years agoraid5: raid5d handle stripe in batch way
Shaohua Li [Wed, 1 Aug 2012 22:33:15 +0000]
raid5: raid5d handle stripe in batch way

Let raid5d handle stripe in batch way to reduce conf->device_lock locking.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

7 years agoraid5: make_request use batch stripe release
Shaohua Li [Wed, 1 Aug 2012 22:33:00 +0000]
raid5: make_request use batch stripe release

make_request() does stripe release for every stripe and the stripe usually has
count 1, which makes previous release_stripe() optimization not work. In my
test, this release_stripe() becomes the heaviest pleace to take
conf->device_lock after previous patches applied.

Below patch makes stripe release batch. All the stripes will be released in
unplug. The STRIPE_ON_UNPLUG_LIST bit is to protect concurrent access stripe
lru.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

7 years agoum: pull interrupt_end() into userspace()
Al Viro [Wed, 23 May 2012 04:25:15 +0000]
um: pull interrupt_end() into userspace()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Richard Weinberger <richard@nod.at>

7 years agoum: split syscall_trace(), pass pt_regs to it
Al Viro [Wed, 23 May 2012 04:18:33 +0000]
um: split syscall_trace(), pass pt_regs to it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[richard@nod.at: Fixed some minor build issues]
Signed-off-by: Richard Weinberger <richard@nod.at>

7 years agoum: switch UPT_SET_RETURN_VALUE and regs_return_value to pt_regs
Al Viro [Wed, 23 May 2012 01:16:35 +0000]
um: switch UPT_SET_RETURN_VALUE and regs_return_value to pt_regs

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Richard Weinberger <richard@nod.at>

7 years agoKVM: x86: apply kvmclock offset to guest wall clock time
Bruce Rogers [Fri, 20 Jul 2012 16:44:24 +0000]
KVM: x86: apply kvmclock offset to guest wall clock time

When a guest migrates to a new host, the system time difference from the
previous host is used in the updates to the kvmclock system time visible
to the guest, resulting in a continuation of correct kvmclock based guest
timekeeping.

The wall clock component of the kvmclock provided time is currently not
updated with this same time offset. Since the Linux guest caches the
wall clock based time, this discrepency is not noticed until the guest is
rebooted. After reboot the guest's time calculations are off.

This patch adjusts the wall clock by the kvmclock_offset, resulting in
correct guest time after a reboot.

Cc: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Bruce Rogers <brogers@suse.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

7 years agoMerge tag 'fbdev-updates-for-3.6' of git://github.com/schandinat/linux-2.6
Linus Torvalds [Wed, 1 Aug 2012 17:45:12 +0000]
Merge tag 'fbdev-updates-for-3.6' of git://github.com/schandinat/linux-2.6

Pull fbdev updates from Florian Tobias Schandinat:
 - large updates for OMAP
   - support for LCD3 overlay manager (omap5)
   - omapdss output cleanup
   - removal of passive matrix LCD support as there are no drivers for
     such panels for DSS or DSS2 and nobody complained (cleanup)
 - large updates for SH Mobile
   - overlay support
   - separating MERAM (cache) from framebuffer driver
 - some updates for Exynos and da8xx-fb
 - various other small patches

* tag 'fbdev-updates-for-3.6' of git://github.com/schandinat/linux-2.6: (78 commits)
  da8xx-fb: fix compile issue due to missing include
  fbdev: Make pixel_to_pat() failure mode more friendly
  da8xx-fb: do not turn ON LCD backlight unless LCDC is enabled
  fbdev: sh_mobile_lcdc: Fix vertical panning step
  video: exynos mipi dsi: Fix mipi dsi regulators handling issue
  video: da8xx-fb: do clock reset of revision 2 LCDC before enabling
  arm: da850: configure LCDC fifo threshold
  video: da8xx-fb: configure FIFO threshold to reduce underflow errors
  video: da8xx-fb: fix flicker due to 1 frame delay in updated frame
  video: da8xx-fb rev2: fix disabling of palette completion interrupt
  da8xx-fb: add missing FB_BLANK operations
  video: exynos_dp: use usleep_range instead of delay
  video: exynos_dp: check the only INTERLANE_ALIGN_DONE bit during Link Training
  fb: epson1355fb: Fix section mismatch
  video: exynos_dp: fix wrong DPCD address during Link Training
  video/smscufx: fix line counting in fb_write
  aty128fb: Fix coding style issues
  fbdev: sh_mobile_lcdc: Fix pan offset computation in YUV mode
  fbdev: sh_mobile_lcdc: Fix overlay registers update during pan operation
  fbdev: sh_mobile_lcdc: Support horizontal panning
  ...

7 years agoMerge tag 'sound-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Linus Torvalds [Wed, 1 Aug 2012 17:42:26 +0000]
Merge tag 'sound-3.6' of git://git./linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "A collection of small fixes that have been found recently.  Most of
  the commits are regression fixes in HD-audio and some other random
  drivers."

* tag 'sound-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: snd-usb: fix clock source validity index
  ALSA: hda - Fix mute-LED GPIO initialization for IDT codecs
  ALSA: hda - Add descriptions for missing IDT 92HD83x models
  ALSA: hda - Fix polarity of mute LED on HP Mini 210
  ALSA: es1688 - freeup resources on init failure
  ALSA: hda - Workaround for silent output on VAIO Z with ALC889
  ALSA: hda - Fix WARNING from HDMI/DP parser
  ALSA: hda - Detach from converter at closing in patch_hdmi.c
  ALSA: hda - Fix mute-LED GPIO setup for HP Mini 210
  ALSA: mpu401: Fix missing initialization of irq field
  ALSA: hda - Fix invalid D3 of headphone DAC on VT202x codecs

7 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Wed, 1 Aug 2012 17:26:23 +0000]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs

Pull second vfs pile from Al Viro:
 "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
  deadlock reproduced by xfstests 068), symlink and hardlink restriction
  patches, plus assorted cleanups and fixes.

  Note that another fsfreeze deadlock (emergency thaw one) is *not*
  dealt with - the series by Fernando conflicts a lot with Jan's, breaks
  userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
  for massive vfsmount leak; this is going to be handled next cycle.
  There probably will be another pull request, but that stuff won't be
  in it."

Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
  delousing target_core_file a bit
  Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
  fs: Remove old freezing mechanism
  ext2: Implement freezing
  btrfs: Convert to new freezing mechanism
  nilfs2: Convert to new freezing mechanism
  ntfs: Convert to new freezing mechanism
  fuse: Convert to new freezing mechanism
  gfs2: Convert to new freezing mechanism
  ocfs2: Convert to new freezing mechanism
  xfs: Convert to new freezing code
  ext4: Convert to new freezing mechanism
  fs: Protect write paths by sb_start_write - sb_end_write
  fs: Skip atime update on frozen filesystem
  fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
  fs: Improve filesystem freezing handling
  switch the protection of percpu_counter list to spinlock
  nfsd: Push mnt_want_write() outside of i_mutex
  btrfs: Push mnt_want_write() outside of i_mutex
  fat: Push mnt_want_write() outside of i_mutex
  ...

7 years agoMIPS: Loongson 2: Sort out clock managment.
Ralf Baechle [Wed, 1 Aug 2012 15:15:32 +0000]
MIPS: Loongson 2: Sort out clock managment.

For unexplainable reasons the Loongson 2 clock API was implemented in a
module so fixing this involved shifting large amounts of code around.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

7 years agoMerge branch 'for-3.6/drivers' of git://git.kernel.dk/linux-block
Linus Torvalds [Wed, 1 Aug 2012 16:06:47 +0000]
Merge branch 'for-3.6/drivers' of git://git.kernel.dk/linux-block

Pull block driver changes from Jens Axboe:

 - Making the plugging support for drivers a bit more sane from Neil.
   This supersedes the plugging change from Shaohua as well.

 - The usual round of drbd updates.

 - Using a tail add instead of a head add in the request completion for
   ndb, making us find the most completed request more quickly.

 - A few floppy changes, getting rid of a duplicated flag and also
   running the floppy init async (since it takes forever in boot terms)
   from Andi.

* 'for-3.6/drivers' of git://git.kernel.dk/linux-block:
  floppy: remove duplicated flag FD_RAW_NEED_DISK
  blk: pass from_schedule to non-request unplug functions.
  block: stack unplug
  blk: centralize non-request unplug handling.
  md: remove plug_cnt feature of plugging.
  block/nbd: micro-optimization in nbd request completion
  drbd: announce FLUSH/FUA capability to upper layers
  drbd: fix max_bio_size to be unsigned
  drbd: flush drbd work queue before invalidate/invalidate remote
  drbd: fix potential access after free
  drbd: call local-io-error handler early
  drbd: do not reset rs_pending_cnt too early
  drbd: reset congestion information before reporting it in /proc/drbd
  drbd: report congestion if we are waiting for some userland callback
  drbd: differentiate between normal and forced detach
  drbd: cleanup, remove two unused global flags
  floppy: Run floppy initialization asynchronous

7 years agoMerge branch 'for-3.6/core' of git://git.kernel.dk/linux-block
Linus Torvalds [Wed, 1 Aug 2012 16:02:41 +0000]
Merge branch 'for-3.6/core' of git://git.kernel.dk/linux-block

Pull core block IO bits from Jens Axboe:
 "The most complicated part if this is the request allocation rework by
  Tejun, which has been queued up for a long time and has been in
  for-next ditto as well.

  There are a few commits from yesterday and today, mostly trivial and
  obvious fixes.  So I'm pretty confident that it is sound.  It's also
  smaller than usual."

* 'for-3.6/core' of git://git.kernel.dk/linux-block:
  block: remove dead func declaration
  block: add partition resize function to blkpg ioctl
  block: uninitialized ioc->nr_tasks triggers WARN_ON
  block: do not artificially constrain max_sectors for stacking drivers
  blkcg: implement per-blkg request allocation
  block: prepare for multiple request_lists
  block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv
  blkcg: inline bio_blkcg() and friends
  block: allocate io_context upfront
  block: refactor get_request[_wait]()
  block: drop custom queue draining used by scsi_transport_{iscsi|fc}
  mempool: add @gfp_mask to mempool_create_node()
  blkcg: make root blkcg allocation use %GFP_KERNEL
  blkcg: __blkg_lookup_create() doesn't need radix preload

7 years agoMerge branch 'for-next' of git://neil.brown.name/md
Linus Torvalds [Wed, 1 Aug 2012 16:02:01 +0000]
Merge branch 'for-next' of git://neil.brown.name/md

Pull md updates from NeilBrown.

* 'for-next' of git://neil.brown.name/md:
  DM RAID: Add support for MD RAID10
  md/RAID1: Add missing case for attempting to repair known bad blocks.
  md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE.
  md/raid1: don't abort a resync on the first badblock.
  md: remove duplicated test on ->openers when calling do_md_stop()
  raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer
  md/raid1: prevent merging too large request
  md/raid1: read balance chooses idlest disk for SSD
  md/raid1: make sequential read detection per disk based
  MD RAID10: Export md_raid10_congested
  MD: Move macros from raid1*.h to raid1*.c
  MD RAID1: rename mirror_info structure
  MD RAID10: rename mirror_info structure
  MD RAID10: Fix compiler warning.
  raid5: add a per-stripe lock
  raid5: remove unnecessary bitmap write optimization
  raid5: lockless access raid5 overrided bi_phys_segments
  raid5: reduce chance release_stripe() taking device_lock

7 years agolocks: remove unused lm_release_private
J. Bruce Fields [Wed, 1 Aug 2012 11:56:16 +0000]
locks: remove unused lm_release_private

In commit 3b6e2723f32d ("locks: prevent side-effects of
locks_release_private before file_lock is initialized") we removed the
last user of lm_release_private without removing the field itself.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agoMIPS: Loongson 1: more clk support and add select HAVE_CLK
Yoichi Yuasa [Wed, 1 Aug 2012 06:42:16 +0000]
MIPS: Loongson 1: more clk support and add select HAVE_CLK

This fixes a redefinition of clk_*:

arch/mips/loongson1/common/clock.c:23:13: error: redefinition of 'clk_get'
include/linux/clk.h:281:27: note: previous definition of 'clk_get' was here
arch/mips/loongson1/common/clock.c:41:15: error: redefinition of 'clk_get_rate'
include/linux/clk.h:302:29: note: previous definition of 'clk_get_rate' was here
make[3]: *** [arch/mips/loongson1/common/clock.o] Error 1

Signed-off-by: Yoichi Yuasa <yuasa@linux-mips.org>
Cc: linux-mips@linux-mips.org
Reviewed-by: John Crispin <blogic@openwrt.org>
Acked-by: Kelvin Cheung <keguang.zhang@gmail.com>
Patchwork: https://patchwork.linux-mips.org/patch/4143/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

7 years agoMIPS: txx9: Fix redefinition of clk_* by adding select HAVE_CLK
Yoichi Yuasa [Wed, 1 Aug 2012 06:41:06 +0000]
MIPS: txx9: Fix redefinition of clk_* by adding select HAVE_CLK

arch/mips/txx9/generic/setup.c:87:13: error: redefinition of 'clk_get'
include/linux/clk.h:281:27: note: previous definition of 'clk_get' was here
arch/mips/txx9/generic/setup.c:97:5: error: redefinition of 'clk_enable'
include/linux/clk.h:295:19: note: previous definition of 'clk_enable' was here
arch/mips/txx9/generic/setup.c:103:6: error: redefinition of 'clk_disable'
include/linux/clk.h:300:20: note: previous definition of 'clk_disable' was here
arch/mips/txx9/generic/setup.c:108:15: error: redefinition of 'clk_get_rate'
include/linux/clk.h:302:29: note: previous definition of 'clk_get_rate' was here
arch/mips/txx9/generic/setup.c:114:6: error: redefinition of 'clk_put'
include/linux/clk.h:291:20: note: previous definition of 'clk_put' was here
make[3]: *** [arch/mips/txx9/generic/setup.o] Error 1

Signed-off-by: Yoichi Yuasa <yuasa@linux-mips.org>
Cc: linux-mips@linux-mips.org
Reviewed-by: John Crispin <blogic@openwrt.org>
Patchwork: https://patchwork.linux-mips.org/patch/4142/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

7 years agoMIPS: BCM63xx: Fix redefinition of clk_* by adding select HAVE_CLK
Yoichi Yuasa [Wed, 1 Aug 2012 06:39:52 +0000]
MIPS: BCM63xx: Fix redefinition of clk_* by adding select HAVE_CLK

arch/mips/bcm63xx/clk.c:249:5: error: redefinition of 'clk_enable'
include/linux/clk.h:295:19: note: previous definition of 'clk_enable' was here
arch/mips/bcm63xx/clk.c:259:6: error: redefinition of 'clk_disable'
include/linux/clk.h:300:20: note: previous definition of 'clk_disable' was here
arch/mips/bcm63xx/clk.c:268:15: error: redefinition of 'clk_get_rate'
include/linux/clk.h:302:29: note: previous definition of 'clk_get_rate' was here
arch/mips/bcm63xx/clk.c:275:13: error: redefinition of 'clk_get'
include/linux/clk.h:281:27: note: previous definition of 'clk_get' was here
arch/mips/bcm63xx/clk.c:302:6: error: redefinition of 'clk_put'
include/linux/clk.h:291:20: note: previous definition of 'clk_put' was here
make[2]: *** [arch/mips/bcm63xx/clk.o] Error 1

Signed-off-by: Yoichi Yuasa <yuasa@linux-mips.org>
Cc: linux-mips@linux-mips.org
Reviewed-by: John Crispin <blogic@openwrt.org>
Patchwork: https://patchwork.linux-mips.org/patch/4141/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

7 years agoMIPS: AR7: Fix redefinition of clk_* by adding select HAVE_CLK
Yoichi Yuasa [Wed, 1 Aug 2012 06:38:00 +0000]
MIPS: AR7: Fix redefinition of clk_* by adding select HAVE_CLK

arch/mips/ar7/clock.c:420:5: error: redefinition of 'clk_enable'
include/linux/clk.h:295:19: note: previous definition of 'clk_enable' was here
arch/mips/ar7/clock.c:426:6: error: redefinition of 'clk_disable'
include/linux/clk.h:300:20: note: previous definition of 'clk_disable' was here
arch/mips/ar7/clock.c:431:15: error: redefinition of 'clk_get_rate'
include/linux/clk.h:302:29: note: previous definition of 'clk_get_rate' was here
arch/mips/ar7/clock.c:437:13: error: redefinition of 'clk_get'
include/linux/clk.h:281:27: note: previous definition of 'clk_get' was here
arch/mips/ar7/clock.c:454:6: error: redefinition of 'clk_put'
include/linux/clk.h:291:20: note: previous definition of 'clk_put' was here
make[2]: *** [arch/mips/ar7/clock.o] Error 1

Signed-off-by: Yoichi Yuasa <yuasa@linux-mips.org>
Cc: linux-mips@linux-mips.org
Reviewed-by: John Crispin <blogic@openwrt.org>
Acked-by: Florian Fainelli <florian@openwrt.org>
Patchwork: https://patchwork.linux-mips.org/patch/4140/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

7 years agoMIPS: Lantiq: Platform specific CLK fixup
John Crispin [Sun, 22 Jul 2012 06:56:01 +0000]
MIPS: Lantiq: Platform specific CLK fixup

As we use CLKDEV_LOOKUP but dont have support for COMMON_CLK yet, we need to
provide our own version of of_clk_get_from_provider().

Signed-off-by: John Crispin <blogic@openwrt.org>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/4117/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

7 years agoMIPS: Lantiq: Add device_tree_init function
John Crispin [Sun, 22 Jul 2012 06:56:00 +0000]
MIPS: Lantiq: Add device_tree_init function

Add a lantiq specific version of device_tree_init. The generic MIPS version
was removed by.

commit 594e966bc412d64eec9282d28ce511bdd62fea39
Author: David Daney <david.daney@cavium.com>
Date:   Thu Jul 5 18:12:38 2012 +0200

MIPS: Prune some target specific code out of prom.c

Signed-off-by: John Crispin <blogic@openwrt.org>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/4116/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

7 years agoMIPS: Lantiq: Fix interface clock and PCI control register offset
John Crispin [Sun, 22 Jul 2012 06:55:57 +0000]
MIPS: Lantiq: Fix interface clock and PCI control register offset

The XRX200 based SoC have a different register offset for the interface
clock and PCI control registers. This patch detects the SoC and sets the
register offset at runtime. This make PCI work on the VR9 SoC.

Signed-off-by: John Crispin <blogic@openwrt.org>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/4113/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

7 years agodelousing target_core_file a bit
Al Viro [Wed, 1 Aug 2012 09:07:04 +0000]
delousing target_core_file a bit

* set_fs(KERNEL_DS) + getname() is probably the weirdest implementation
of strdup() I've seen.  Especially since they don't to copy it at all...
* filp_open() never returns NULL; it's ERR_PTR(-E...) on failure.
* file->f_dentry is never going to be NULL, TYVM.
* match_strdup() + snprintf() + kfree() is a bloody weird way to spell
match_strlcpy().

Pox on cargo-cult programmers...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

7 years agoDM RAID: Add support for MD RAID10
Jonathan Brassow [Wed, 1 Aug 2012 02:44:26 +0000]
DM RAID: Add support for MD RAID10

Support the MD RAID10 personality through dm-raid.c

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

7 years agoMerge commit 'c039c332f23e794deb6d6f37b9f07ff3b27fb2cf' into md
NeilBrown [Wed, 1 Aug 2012 10:40:02 +0000]
Merge commit 'c039c332f23e794deb6d6f37b9f07ff3b27fb2cf' into md

Pull in pre-requisites for adding raid10 support to dm-raid.

7 years agoblock: remove dead func declaration
Yuanhan Liu [Wed, 1 Aug 2012 10:25:54 +0000]
block: remove dead func declaration

__generic_unplug_device() function is removed with commit
7eaceaccab5f40bbfda044629a6298616aeaed50, which forgot to
remove the declaration at meantime. Here remove it.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7 years agoblock: add partition resize function to blkpg ioctl
Vivek Goyal [Wed, 1 Aug 2012 10:24:18 +0000]
block: add partition resize function to blkpg ioctl

Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
allows altering the size of an existing partition, even if it is currently
in use.

This patch converts hd_struct->nr_sects into sequence counter because
One might extend a partition while IO is happening to it and update of
nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This
can lead to issues like reading inconsistent size of a partition. Sequence
counter have been used so that readers don't have to take bdev mutex lock
as we call sector_in_part() very frequently.

Now all the access to hd_struct->nr_sects should happen using sequence
counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
There is one exception though, set_capacity()/get_capacity(). I think
theoritically race should exist there too but this patch does not
modify set_capacity()/get_capacity() due to sheer number of call sites
and I am afraid that change might break something. I have left that as a
TODO item. We can handle it later if need be. This patch does not introduce
any new races as such w.r.t set_capacity()/get_capacity().

v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Phillip Susi <psusi@ubuntu.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7 years agoblock: uninitialized ioc->nr_tasks triggers WARN_ON
Olof Johansson [Wed, 1 Aug 2012 10:17:27 +0000]
block: uninitialized ioc->nr_tasks triggers WARN_ON

Hi,

I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the
below warning as of 3.5-rc1 and later (3.4 is fine):

[   10.886893] ------------[ cut here ]------------
[   10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560()
[   10.886905] Hardware name: Bochs
[   10.886906] Modules linked in:
[   10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27
[   10.886908] Call Trace:
[   10.886911]  [<ffffffff8107ce8a>] warn_slowpath_common+0x7a/0xb0
[   10.886912]  [<ffffffff8107ced5>] warn_slowpath_null+0x15/0x20
[   10.886913]  [<ffffffff8107c088>] copy_process+0x1488/0x1560
[   10.886914]  [<ffffffff8107c244>] do_fork+0xb4/0x340
[   10.886918]  [<ffffffff8108effa>] ? recalc_sigpending+0x1a/0x50
[   10.886919]  [<ffffffff8108f6b2>] ? __set_task_blocked+0x32/0x80
[   10.886920]  [<ffffffff81091afa>] ? __set_current_blocked+0x3a/0x60
[   10.886923]  [<ffffffff81051db3>] sys_clone+0x23/0x30
[   10.886925]  [<ffffffff8179bd73>] stub_clone+0x13/0x20
[   10.886927]  [<ffffffff8179baa2>] ? system_call_fastpath+0x16/0x1b
[   10.886928] ---[ end trace 32a14af7ee6a590b ]---

Reproducing is easy, I can hit it on a KVM system with a very basic
config (x86_64 make defconfig + enable the drivers needed). To hit it,
just install dump (on debian/ubuntu, not sure what the package might be
called on Fedora), and:

dump -o -f /tmp/foo /

You'll see the warning in dmesg once it forks off the I/O process and
starts dumping filesystem contents.

I bisected it down to the following commit:

commit f6e8d01bee036460e03bd4f6a79d014f98ba712e
Author: Tejun Heo <tj@kernel.org>
Date:   Mon Mar 5 13:15:26 2012 -0800

    block: add io_context->active_ref

    Currently ioc->nr_tasks is used to decide two things - whether an ioc
    is done issuing IOs and whether it's shared by multiple tasks.  This
    patch separate out the first into ioc->active_ref, which is acquired
    and released using {get|put}_io_context_active() respectively.

    This will be used to associate bio's with a given task.  This patch
    doesn't introduce any visible behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

It seems like the init of ioc->nr_tasks was removed in that patch,
so it starts out at 0 instead of 1.

Tejun, is the right thing here to add back the init, or should something else
be done?

The below patch removes the warning, but I haven't done any more extensive
testing on it.

Signed-off-by: Olof Johansson <olof@lixom.net>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7 years agoblock: do not artificially constrain max_sectors for stacking drivers
Mike Snitzer [Wed, 1 Aug 2012 08:44:28 +0000]
block: do not artificially constrain max_sectors for stacking drivers

blk_set_stacking_limits is intended to allow stacking drivers to build
up the limits of the stacked device based on the underlying devices'
limits.  But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024)
doesn't allow the stacking driver to inherit a max_sectors larger than
1024 -- due to blk_stack_limits' use of min_not_zero.

It is now clear that this artificial limit is getting in the way so
change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows
stacking drivers like dm-multipath to inherit 'max_sectors' from the
underlying paths).

Reported-by: Vijay Chauhan <vijay.chauhan@netapp.com>
Tested-by: Vijay Chauhan <vijay.chauhan@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7 years agoALSA: snd-usb: fix clock source validity index
Daniel Mack [Wed, 1 Aug 2012 08:16:53 +0000]
ALSA: snd-usb: fix clock source validity index

uac_clock_source_is_valid() uses the control selector value to access
the bmControls bitmap of the clock source unit. This is wrong, as
control selector values start from 1, while the bitmap uses all
available bits.

In other words, "Clock Validity Control" is stored in D3..2, not D5..4
of the clock selector unit's bmControls.

Signed-off-by: Daniel Mack <zonque@gmail.com>
Reported-by: Andreas Koch <andreas@akdesigninc.com>
Cc: stable@kernel.org
Signed-off-by: Takashi Iwai <tiwai@suse.de>

7 years agoMerge branch 'common/irqdomain' into sh-latest
Paul Mundt [Wed, 1 Aug 2012 08:14:52 +0000]
Merge branch 'common/irqdomain' into sh-latest

7 years agosh: ecovec: care CN5 VBUS if USB host mode
Kuninori Morimoto [Wed, 1 Aug 2012 07:54:58 +0000]
sh: ecovec: care CN5 VBUS if USB host mode

renesas_usbhs driver can control both USB Host/Gadget,
but it needs VBUS output if Host mode.

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>

7 years agosh: sh7724: fixup renesas_usbhs clock settings
Kuninori Morimoto [Wed, 1 Aug 2012 07:54:22 +0000]
sh: sh7724: fixup renesas_usbhs clock settings

8cc88a55b03bd4940390125c2521c99368513be5
(sh: sh7724: use runtime PM implementation) broke sh7724 clocks.

renesas_usbhs needs HWBLK_USB0/1 clock on sh7724

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>

7 years agosh: intc: initial irqdomain support.
Paul Mundt [Wed, 1 Aug 2012 08:13:46 +0000]
sh: intc: initial irqdomain support.

Trivial support for irq domains, using either a linear map or radix tree
depending on the vector layout.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>

7 years agoMerge branch 'common/pinctrl' into sh-latest
Paul Mundt [Wed, 1 Aug 2012 07:28:37 +0000]
Merge branch 'common/pinctrl' into sh-latest

7 years agosh: pfc: Fix up init ordering mess.
Paul Mundt [Wed, 1 Aug 2012 07:27:38 +0000]
sh: pfc: Fix up init ordering mess.

Commit ca5481c68e9fbcea62bb3c78ae6cccf99ca8fb73 ("sh: pfc: Rudimentary
pinctrl-backed GPIO support.") introduced a regression for platforms that
were doing early GPIO API calls (from arch_initcall() or earlier),
leading to a situation where our two-stage registration logic would trip
itself up and we'd -ENODEV out of the pinctrl registration path,
resulting in endless -EPROBE_DEFER errors. Further lack of checking any
sort of errors from gpio_request() resulted in boot time warnings,
tripping on the FLAG_REQUESTED test-and-set in gpio_ensure_requested().

As it turns out there's no particular need to bother with the two-stage
registration, as the platform bus is already available at the point that
we have to start caring. As such, it's easiest to simply fold these
together in to a single init path, the ordering of which is ensured
through the platform's mux registration, as usual.

Reported-by: Rafael J. Wysocki <rjw@sisk.pl>
Reported-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>

7 years agoMerge branch 'sh/dmaengine' into sh-latest
Paul Mundt [Wed, 1 Aug 2012 04:49:13 +0000]
Merge branch 'sh/dmaengine' into sh-latest

7 years agoserial: sh-sci: fix compilation breakage, when DMA is enabled
Guennadi Liakhovetski [Mon, 30 Jul 2012 19:28:47 +0000]
serial: sh-sci: fix compilation breakage, when DMA is enabled

A recent commit:

commit d6fa5a4e7ab605370fd6c982782f84ef2e6660e7
Author: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
    serial: sh-sci: prepare for conversion to the shdma base library

is not sufficient to update the sh-sci driver to the new shdma driver
layout. This caused compilation breakage, when CONFIG_SERIAL_SH_SCI_DMA
is enabled. This patch trivially fixes the problem by updating the DMA
descriptor manipulation code.

Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>

7 years agodmaengine: shdma: restore partial transfer calculation
Guennadi Liakhovetski [Mon, 30 Jul 2012 19:28:27 +0000]
dmaengine: shdma: restore partial transfer calculation

The recent shdma driver split has mistakenly removed support for partial
DMA transfer size calculation on forced termination. This patch restores
it.

Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
Acked-by: Vinod Koul <vinod.koul@linux.intel.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>

7 years agoMerge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6
Linus Torvalds [Wed, 1 Aug 2012 03:44:03 +0000]
Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6

Pull irqdomain changes from Grant Likely:
 "Round of refactoring and enhancements to irq_domain infrastructure.
  This series starts the process of simplifying irqdomain.  The ultimate
  goal is to merge LEGACY, LINEAR and TREE mappings into a single
  system, but had to back off from that after some last minute bugs.
  Instead it mainly reorganizes the code and ensures that the reverse
  map gets populated when the irq is mapped instead of the first time it
  is looked up.

  Merging of the irq_domain types is deferred to v3.7

  In other news, this series adds helpers for creating static mappings
  on a linear or tree mapping."

* tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6:
  irqdomain: Improve diagnostics when a domain mapping fails
  irqdomain: eliminate slow-path revmap lookups
  irqdomain: Fix irq_create_direct_mapping() to test irq_domain type.
  irqdomain: Eliminate dedicated radix lookup functions
  irqdomain: Support for static IRQ mapping and association.
  irqdomain: Always update revmap when setting up a virq
  irqdomain: Split disassociating code into separate function
  irq_domain: correct a minor wrong comment for linear revmap
  irq_domain: Standardise legacy/linear domain selection
  irqdomain: Make ops->map hook optional
  irqdomain: Remove unnecessary test for IRQ_DOMAIN_MAP_LEGACY
  irqdomain: Simple NUMA awareness.
  devicetree: add helper inline for retrieving a node's full name

7 years agox86: OLPC: move s/r-related EC cmds to EC driver
Andres Salomon [Tue, 17 Jul 2012 08:26:10 +0000]
x86: OLPC: move s/r-related EC cmds to EC driver

The new EC driver calls platform-specific suspend and resume hooks; run
XO-1-specific EC commands from there, rather than deep in s/r code.  If we
attempt to run EC commands after the new EC driver has suspended, it is
refused by the ec->suspended checks.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Acked-by: Paul Fox <pgf@laptop.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

7 years agoPlatform: OLPC: move global variables into priv struct
Andres Salomon [Sat, 14 Jul 2012 00:10:45 +0000]
Platform: OLPC: move global variables into priv struct

Populate olpc_ec_priv with variables that were previously global.  This
makes things a tad bit clearer, IMO.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Acked-by: Paul Fox <pgf@laptop.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

7 years agoPlatform: OLPC: move debugfs support from x86 EC driver
Andres Salomon [Fri, 13 Jul 2012 03:45:14 +0000]
Platform: OLPC: move debugfs support from x86 EC driver

There's nothing about the debugfs interface for the EC driver that is
architecture-specific, so move it into the arch-independent driver.

The code is mostly unchanged with the exception of renamed variables, coding
style changes, and API updates.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Acked-by: Paul Fox <pgf@laptop.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

7 years agox86: OLPC: switch over to using new EC driver on x86
Andres Salomon [Fri, 13 Jul 2012 00:57:28 +0000]
x86: OLPC: switch over to using new EC driver on x86

This uses the new EC driver framework in drivers/platform/olpc.  The
XO-1 and XO-1.5-specific code is still in arch/x86, but the generic stuff
(including a new workqueue; no more running EC commands with IRQs disabled!)
can be shared with other architectures.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Acked-by: Paul Fox <pgf@laptop.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

7 years agoPlatform: OLPC: add a suspended flag to the EC driver
Andres Salomon [Fri, 13 Jul 2012 22:54:25 +0000]
Platform: OLPC: add a suspended flag to the EC driver

A problem we've noticed on XO-1.75 is when we suspend in the middle of
an EC command.  Don't allow that.

In the process, create a private object for the generic EC driver to use;
we have a framework for passing around a struct, use that rather than a
proliferation of global variables.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Acked-by: Paul Fox <pgf@laptop.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

7 years agoPlatform: OLPC: turn EC driver into a platform_driver
Andres Salomon [Fri, 13 Jul 2012 12:57:17 +0000]
Platform: OLPC: turn EC driver into a platform_driver

The 1.75-based OLPC EC driver already does this; let's do it for all EC
drivers.  This gives us nice suspend/resume hooks, amongst other things.

We want to run the EC's suspend hooks later than other drivers (which may
be setting wakeup masks or be running EC commands).  We also want to run
the EC's resume hooks earlier than other drivers (which may want to run EC
commands).

Signed-off-by: Andres Salomon <dilinger@queued.net>
Acked-by: Paul Fox <pgf@laptop.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

7 years agoPlatform: OLPC: allow EC cmd to be overridden, and create a workqueue to call it
Andres Salomon [Thu, 12 Jul 2012 00:40:25 +0000]
Platform: OLPC: allow EC cmd to be overridden, and create a workqueue to call it

This provides a new API allows different OLPC architectures to override the
EC driver.  x86 and ARM OLPC machines use completely different EC backends.

The olpc_ec_cmd is synchronous, and waits for the workqueue to send the
command to the EC.  Multiple callers can run olpc_ec_cmd() at once, and
they will by serialized and sleep while only one executes on the EC at a time.

We don't provide an unregister function, as that doesn't make sense within
the context of OLPC machines - there's only ever 1 EC, it's critical to
functionality, and it certainly not hotpluggable.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Acked-by: Paul Fox <pgf@laptop.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

7 years agodrivers: OLPC: update various drivers to include olpc-ec.h
Andres Salomon [Wed, 11 Jul 2012 08:16:29 +0000]
drivers: OLPC: update various drivers to include olpc-ec.h

Switch over to using olpc-ec.h in multiple steps, so as not to break builds.
This covers every driver that calls olpc_ec_cmd().

Signed-off-by: Andres Salomon <dilinger@queued.net>
Acked-by: Paul Fox <pgf@laptop.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

7 years agoPlatform: OLPC: add a stub to drivers/platform/ for the OLPC EC driver
Andres Salomon [Wed, 11 Jul 2012 02:31:51 +0000]
Platform: OLPC: add a stub to drivers/platform/ for the OLPC EC driver

The OLPC EC driver has outgrown arch/x86/platform/.  It's time to both
share common code amongst different architectures, as well as move it out
of arch/x86/.  The XO-1.75 is ARM-based, and the EC driver shares a lot of
code with the x86 code.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Acked-by: Paul Fox <pgf@laptop.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

7 years agoKVM: fold kvm_pit_timer into kvm_kpit_state
Avi Kivity [Thu, 26 Jul 2012 15:01:53 +0000]
KVM: fold kvm_pit_timer into kvm_kpit_state

One structure nests inside the other, providing no value at all.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

7 years agoKVM: Simplify kvm_pit_timer
Avi Kivity [Thu, 26 Jul 2012 15:01:52 +0000]
KVM: Simplify kvm_pit_timer

'timer_mode_mask' is unused
'tscdeadline' is unused
't_ops' only adds needless indirection
'vcpu' is unused

Remove.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

7 years agoKVM: Simplify kvm_timer
Avi Kivity [Thu, 26 Jul 2012 15:01:51 +0000]
KVM: Simplify kvm_timer

'reinject' is never initialized
't_ops' only serves as indirection to lapic_is_periodic; call that directly
   instead
'kvm' is never used
'vcpu' can be derived via container_of

Remove these fields.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

7 years agoKVM: Remove internal timer abstraction
Avi Kivity [Thu, 26 Jul 2012 15:01:50 +0000]
KVM: Remove internal timer abstraction

kvm_timer_fn(), the sole inhabitant of timer.c, is only used by lapic.c. Move
it there to make it easier to hack on it.

struct kvm_timer is a thin wrapper around hrtimer, and only adds obfuscation.
Move near its two users (with different names) to prepare for simplification.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

7 years agoMerge branch 'akpm' (Andrew's patch-bomb)
Linus Torvalds [Wed, 1 Aug 2012 02:25:39 +0000]
Merge branch 'akpm' (Andrew's patch-bomb)

Merge Andrew's second set of patches:
 - MM
 - a few random fixes
 - a couple of RTC leftovers

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
  rtc/rtc-88pm80x: remove unneed devm_kfree
  rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
  mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
  tmpfs: distribute interleave better across nodes
  mm: remove redundant initialization
  mm: warn if pg_data_t isn't initialized with zero
  mips: zero out pg_data_t when it's allocated
  memcg: gix memory accounting scalability in shrink_page_list
  mm/sparse: remove index_init_lock
  mm/sparse: more checks on mem_section number
  mm/sparse: optimize sparse_index_alloc
  memcg: add mem_cgroup_from_css() helper
  memcg: further prevent OOM with too many dirty pages
  memcg: prevent OOM with too many dirty pages
  mm: mmu_notifier: fix freed page still mapped in secondary MMU
  mm: memcg: only check anon swapin page charges for swap cache
  mm: memcg: only check swap cache pages for repeated charging
  mm: memcg: split swapin charge function into private and public part
  mm: memcg: remove needless !mm fixup to init_mm when charging
  mm: memcg: remove unneeded shmem charge type
  ...

7 years agoMerge tag 'vfio-for-v3.6' of git://github.com/awilliam/linux-vfio
Linus Torvalds [Wed, 1 Aug 2012 02:17:27 +0000]
Merge tag 'vfio-for-v3.6' of git://github.com/awilliam/linux-vfio

Pull VFIO core from Alex Williamson:
 "This series includes the VFIO userspace driver interface for the 3.6
  kernel merge window.  This driver is intended to provide a secure
  interface for device access using IOMMU protection for applications
  like assignment of physical devices to virtual machines.

  Qemu will be the first user of this interface, enabling assignment of
  PCI devices to Qemu guests.  This interface is intended to eventually
  replace the x86-specific assignment mechanism currently available in
  KVM.

  This interface has the advantage of being more secure, by working with
  IOMMU groups to ensure device isolation and providing it's own
  filtered resource access mechanism, and also more flexible, in not
  being x86 or KVM specific (extensions to enable POWER are already
  working).

  This driver is originally the work of Tom Lyon, but has since been
  handed over to me and gone through a complete overhaul thanks to the
  input from David Gibson, Ben Herrenschmidt, Chris Wright, Joerg
  Roedel, and others.  This driver has been available in linux-next for
  the last month."

Paul Mackerras says:
 "I would be glad to see it go in since we want to use it with KVM on
  PowerPC.  If possible we'd like the PowerPC bits for it to go in as
  well."

* tag 'vfio-for-v3.6' of git://github.com/awilliam/linux-vfio:
  vfio: Add PCI device driver
  vfio: Type1 IOMMU implementation
  vfio: Add documentation
  vfio: VFIO core

7 years agoMerge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso...
Linus Torvalds [Wed, 1 Aug 2012 02:07:42 +0000]
Merge tag 'random_for_linus' of git://git./linux/kernel/git/tytso/random

Pull random subsystem patches from Ted Ts'o:
 "This patch series contains a major revamp of how we collect entropy
  from interrupts for /dev/random and /dev/urandom.

  The goal is to addresses weaknesses discussed in the paper "Mining
  your Ps and Qs: Detection of Widespread Weak Keys in Network Devices",
  by Nadia Heninger, Zakir Durumeric, Eric Wustrow, J.  Alex Halderman,
  which will be published in the Proceedings of the 21st Usenix Security
  Symposium, August 2012.  (See https://factorable.net for more
  information and an extended version of the paper.)"

Fix up trivial conflicts due to nearby changes in
drivers/{mfd/ab3100-core.c, usb/gadget/omap_udc.c}

* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: (33 commits)
  random: mix in architectural randomness in extract_buf()
  dmi: Feed DMI table to /dev/random driver
  random: Add comment to random_initialize()
  random: final removal of IRQF_SAMPLE_RANDOM
  um: remove IRQF_SAMPLE_RANDOM which is now a no-op
  sparc/ldc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  [ARM] pxa: remove IRQF_SAMPLE_RANDOM which is now a no-op
  board-palmz71: remove IRQF_SAMPLE_RANDOM which is now a no-op
  isp1301_omap: remove IRQF_SAMPLE_RANDOM which is now a no-op
  pxa25x_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  omap_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  goku_udc: remove IRQF_SAMPLE_RANDOM which was commented out
  uartlite: remove IRQF_SAMPLE_RANDOM which is now a no-op
  drivers: hv: remove IRQF_SAMPLE_RANDOM which is now a no-op
  xen-blkfront: remove IRQF_SAMPLE_RANDOM which is now a no-op
  n2_crypto: remove IRQF_SAMPLE_RANDOM which is now a no-op
  pda_power: remove IRQF_SAMPLE_RANDOM which is now a no-op
  i2c-pmcmsp: remove IRQF_SAMPLE_RANDOM which is now a no-op
  input/serio/hp_sdc.c: remove IRQF_SAMPLE_RANDOM which is now a no-op
  mfd: remove IRQF_SAMPLE_RANDOM which is now a no-op
  ...

7 years agoMerge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland...
Linus Torvalds [Wed, 1 Aug 2012 01:59:55 +0000]
Merge tag 'rdma-for-linus' of git://git./linux/kernel/git/roland/infiniband

Pull final RDMA changes from Roland Dreier:
 - Fix IPoIB to stop using unsafe linkage between networking neighbour
   layer and private path database.
 - Small fixes for bugs found by Fengguang Wu's automated builds.

* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
  IPoIB: Use a private hash table for path lookup in xmit path
  IB/qib: Fix size of cc_supported_table_entries
  RDMA/ucma: Convert open-coded equivalent to memdup_user()
  RDMA/ocrdma: Fix check of GSI CQs
  RDMA/cma: Use PTR_RET rather than if (IS_ERR(...)) + PTR_ERR

7 years agoMerge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab...
Linus Torvalds [Wed, 1 Aug 2012 01:47:44 +0000]
Merge branch 'v4l_for_linus' of git://git./linux/kernel/git/mchehab/linux-media

Pull second set of media updates from Mauro Carvalho Chehab:

 - radio API: add support to work with radio frequency bands

 - new AM/FM radio drivers: radio-shark, radio-shark2

 - new Remote Controller USB driver: iguanair

 - conversion of several drivers to the v4l2 core control framework

 - new board additions at existing drivers

 - the remaining (and vast majority of the patches) are due to
   drivers/DocBook fixes/cleanups.

* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (154 commits)
  [media] radio-tea5777: use library for 64bits div
  [media] tlg2300: Declare MODULE_FIRMWARE usage
  [media] lgs8gxx: Declare MODULE_FIRMWARE usage
  [media] xc5000: Add MODULE_FIRMWARE statements
  [media] s2255drv: Add MODULE_FIRMWARE statement
  [media] dib8000: move dereference after check for NULL
  [media] Documentation: Update cardlists
  [media] bttv: add support for Aposonic W-DVR
  [media] cx25821: Remove bad strcpy to read-only char*
  [media] pms.c: remove duplicated include
  [media] smiapp-core.c: remove duplicated include
  [media] via-camera: pass correct format settings to sensor
  [media] rtl2832.c: minor cleanup
  [media] Add support for the IguanaWorks USB IR Transceiver
  [media] Minor cleanups for MCE USB
  [media] drivers/media/dvb/siano/smscoreapi.c: use list_for_each_entry
  [media] Use a named union in struct v4l2_ioctl_info
  [media] mceusb: Add Twisted Melon USB IDs
  [media] staging/media/solo6x10: use module_pci_driver macro
  [media] staging/media/dt3155v4l: use module_pci_driver macro
  ...

Conflicts:
Documentation/feature-removal-schedule.txt

7 years agoMerge tag 'nfs-for-3.6-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Linus Torvalds [Wed, 1 Aug 2012 01:45:44 +0000]
Merge tag 'nfs-for-3.6-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull second wave of NFS client updates from Trond Myklebust:

 - Patches from Bryan to allow splitting of the NFSv2/v3/v4 code into
   separate modules.

 - Fix Oopses in the NFSv4 idmapper

 - Fix a deadlock whereby rpciod tries to allocate a new socket and ends
   up recursing into the NFS code due to memory reclaim.

 - Increase the number of permitted callback connections.

* tag 'nfs-for-3.6-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: explicitly reject LOCK_MAND flock() requests
  nfs: increase number of permitted callback connections.
  SUNRPC: return negative value in case rpcbind client creation error
  NFS: Convert v4 into a module
  NFS: Convert v3 into a module
  NFS: Convert v2 into a module
  NFS: Keep module parameters in the generic NFS client
  NFS: Split out remaining NFS v4 inode functions
  NFS: Pass super operations and xattr handlers in the nfs_subversion
  NFS: Only initialize the ACL client in the v3 case
  NFS: Create a try_mount rpc op
  NFS: Remove the NFS v4 xdev mount function
  NFS: Add version registering framework
  NFS: Fix a number of bugs in the idmapper
  nfs: skip commit in releasepage if we're freeing memory for fs-related reasons
  sunrpc: clarify comments on rpc_make_runnable
  pnfsblock: bail out partial page IO

7 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Wed, 1 Aug 2012 01:43:13 +0000]
Merge git://git./linux/kernel/git/davem/net

Pull networking update from David S. Miller:
 "I think Eric Dumazet and I have dealt with all of the known routing
  cache removal fallout.  Some other minor fixes all around.

  1) Fix RCU of cached routes, particular of output routes which require
     liberation via call_rcu() instead of call_rcu_bh().  From Eric
     Dumazet.

  2) Make sure we purge net device references in cached routes properly.

  3) TG3 driver bug fixes from Michael Chan.

  4) Fix reported 'expires' value in ipv6 routes, from Li Wei.

  5) TUN driver ioctl leaks kernel bytes to userspace, from Mathias
     Krause."

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (22 commits)
  ipv4: Properly purge netdev references on uncached routes.
  ipv4: Cache routes in nexthop exception entries.
  ipv4: percpu nh_rth_output cache
  ipv4: Restore old dst_free() behavior.
  bridge: make port attributes const
  ipv4: remove rt_cache_rebuild_count
  net: ipv4: fix RCU races on dst refcounts
  net: TCP early demux cleanup
  tun: Fix formatting.
  net/tun: fix ioctl() based info leaks
  tg3: Update version to 3.124
  tg3: Fix race condition in tg3_get_stats64()
  tg3: Add New 5719 Read DMA workaround
  tg3: Fix Read DMA workaround for 5719 A0.
  tg3: Request APE_LOCK_PHY before PHY access
  ipv6: fix incorrect route 'expires' value passed to userspace
  mISDN: Bugfix only few bytes are transfered on a connection
  seeq: use PTR_RET at init_module of driver
  bnx2x: remove cast around the kmalloc in bnx2x_prev_mark_path
  ipv4: clean up put_child
  ...

7 years agortc/rtc-88pm80x: remove unneed devm_kfree
Devendra Naga [Tue, 31 Jul 2012 23:46:24 +0000]
rtc/rtc-88pm80x: remove unneed devm_kfree

devm_kzalloc() doesn't need a matching devm_kfree(), the freeing mechanism
will trigger when driver unloads.

Signed-off-by: Devendra Naga <devendra.aaru@gmail.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Ashish Jangam <ashish.jangam@kpitcummins.com>
Cc: David Dajun Chen <dchen@diasemi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agortc/rtc-88pm80x: assign ret only when rtc_register_driver fails
Devendra Naga [Tue, 31 Jul 2012 23:46:22 +0000]
rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails

At the probe we are assigning ret to return value of PTR_ERR right after
the rtc_register_drive()r, as we would have done it in the if
(IS_ERR(ptr)) check, since the function fails and goes inside that case

Signed-off-by: Devendra Naga <devendra.aaru@gmail.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Ashish Jangam <ashish.jangam@kpitcummins.com>
Cc: David Dajun Chen <dchen@diasemi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
Mel Gorman [Tue, 31 Jul 2012 23:46:20 +0000]
mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables

If a process creates a large hugetlbfs mapping that is eligible for page
table sharing and forks heavily with children some of whom fault and
others which destroy the mapping then it is possible for page tables to
get corrupted.  Some teardowns of the mapping encounter a "bad pmd" and
output a message to the kernel log.  The final teardown will trigger a
BUG_ON in mm/filemap.c.

This was reproduced in 3.4 but is known to have existed for a long time
and goes back at least as far as 2.6.37.  It was probably was introduced
in 2.6.20 by [39dde65c: shared page table for hugetlb page].  The messages
look like this;

[  ..........] Lots of bad pmd messages followed by this
[  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
[  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
[  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
[  127.186778] ------------[ cut here ]------------
[  127.186781] kernel BUG at mm/filemap.c:134!
[  127.186782] invalid opcode: 0000 [#1] SMP
[  127.186783] CPU 7
[  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
[  127.186801]
[  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
[  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
[  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
[  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
[  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
[  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
[  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
[  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
[  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
[  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
[  127.186821] Stack:
[  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
[  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
[  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
[  127.186827] Call Trace:
[  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
[  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
[  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
[  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
[  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
[  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
[  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
[  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
[  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
[  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
[  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
[  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
[  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
[  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
[  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
[  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
[  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[  127.186870]  RSP <ffff8804144b5c08>
[  127.186871] ---[ end trace 7cbac5d1db69f426 ]---

The bug is a race and not always easy to reproduce.  To reproduce it I was
doing the following on a single socket I7-based machine with 16G of RAM.

$ hugeadm --pool-pages-max DEFAULT:13G
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
$ for i in `seq 1 9000`; do ./hugetlbfs-test; done

On my particular machine, it usually triggers within 10 minutes but
enabling debug options can change the timing such that it never hits.
Once the bug is triggered, the machine is in trouble and needs to be
rebooted.  The machine will respond but processes accessing proc like "ps
aux" will hang due to the BUG_ON.  shutdown will also hang and needs a
hard reset or a sysrq-b.

The basic problem is a race between page table sharing and teardown.  For
the most part page table sharing depends on i_mmap_mutex.  In some cases,
it is also taking the mm->page_table_lock for the PTE updates but with
shared page tables, it is the i_mmap_mutex that is more important.

Unfortunately it appears to be also insufficient. Consider the following
situation

Process A Process B
--------- ---------
hugetlb_fault shmdt
   LockWrite(mmap_sem)
       do_munmap
    unmap_region
      unmap_vmas
        unmap_single_vma
          unmap_hugepage_range
                   Lock(i_mmap_mutex)
    Lock(mm->page_table_lock)
    huge_pmd_unshare/unmap tables <--- (1)
    Unlock(mm->page_table_lock)
                   Unlock(i_mmap_mutex)
  huge_pte_alloc       ...
    Lock(i_mmap_mutex)       ...
    vma_prio_walk, find svma, spte       ...
    Lock(mm->page_table_lock)       ...
    share spte       ...
    Unlock(mm->page_table_lock)       ...
    Unlock(i_mmap_mutex)       ...
  hugetlb_no_page   <--- (2)
      free_pgtables
        unlink_file_vma
hugetlb_free_pgd_range
    remove_vma_list

In this scenario, it is possible for Process A to share page tables with
Process B that is trying to tear them down.  The i_mmap_mutex on its own
does not prevent Process A walking Process B's page tables.  At (1) above,
the page tables are not shared yet so it unmaps the PMDs.  Process A sets
up page table sharing and at (2) faults a new entry.  Process B then trips
up on it in free_pgtables.

This patch fixes the problem by adding a new function
__unmap_hugepage_range_final that is only called when the VMA is about to
be destroyed.  This function clears VM_MAYSHARE during
unmap_hugepage_range() under the i_mmap_mutex.  This makes the VMA
ineligible for sharing and avoids the race.  Superficially this looks like
it would then be vunerable to truncate and madvise issues but hugetlbfs
has its own truncate handlers so does not use unmap_mapping_range() and
does not support madvise(DONTNEED).

This should be treated as a -stable candidate if it is merged.

Test program is as follows. The test case was mostly written by Michal
Hocko with a few minor changes to reproduce this bug.

==== CUT HERE ====

static size_t huge_page_size = (2UL << 20);
static size_t nr_huge_page_A = 512;
static size_t nr_huge_page_B = 5632;

unsigned int get_random(unsigned int max)
{
struct timeval tv;

gettimeofday(&tv, NULL);
srandom(tv.tv_usec);
return random() % max;
}

static void play(void *addr, size_t size)
{
unsigned char *start = addr,
      *end = start + size,
      *a;
start += get_random(size/2);

/* we could itterate on huge pages but let's give it more time. */
for (a = start; a < end; a += 4096)
*a = 0;
}

int main(int argc, char **argv)
{
key_t key = IPC_PRIVATE;
size_t sizeA = nr_huge_page_A * huge_page_size;
size_t sizeB = nr_huge_page_B * huge_page_size;
int shmidA, shmidB;
void *addrA = NULL, *addrB = NULL;
int nr_children = 300, n = 0;

if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
perror("shmget:");
return 1;
}

if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
perror("shmat");
return 1;
}
if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
perror("shmget:");
return 1;
}

if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
perror("shmat");
return 1;
}

fork_child:
switch(fork()) {
case 0:
switch (n%3) {
case 0:
play(addrA, sizeA);
break;
case 1:
play(addrB, sizeB);
break;
case 2:
break;
}
break;
case -1:
perror("fork:");
break;
default:
if (++n < nr_children)
goto fork_child;
play(addrA, sizeA);
break;
}
shmdt(addrA);
shmdt(addrB);
do {
wait(NULL);
} while (--n > 0);
shmctl(shmidA, IPC_RMID, NULL);
shmctl(shmidB, IPC_RMID, NULL);
return 0;
}

[akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agotmpfs: distribute interleave better across nodes
Nathan Zimmer [Tue, 31 Jul 2012 23:46:17 +0000]
tmpfs: distribute interleave better across nodes

When tmpfs has the interleave memory policy, it always starts allocating
for each file from node 0 at offset 0.  When there are many small files,
the lower nodes fill up disproportionately.

This patch spreads out node usage by starting files at nodes other than 0,
by using the inode number to bias the starting node for interleave.

Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm: remove redundant initialization
Minchan Kim [Tue, 31 Jul 2012 23:46:16 +0000]
mm: remove redundant initialization

pg_data_t is zeroed before reaching free_area_init_core(), so remove the
now unnecessary initializations.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm: warn if pg_data_t isn't initialized with zero
Minchan Kim [Tue, 31 Jul 2012 23:46:14 +0000]
mm: warn if pg_data_t isn't initialized with zero

Warn if memory-hotplug/boot code doesn't initialize pg_data_t with zero
when it is allocated.  Arch code and memory hotplug already initiailize
pg_data_t.  So this warning should never happen.  I select fields randomly
near the beginning, middle and end of pg_data_t for checking.

This patch isn't for performance but for removing initialization code
which is necessary to add whenever we adds new field to pg_data_t or zone.

Firstly, Andrew suggested clearing out of pg_data_t in MM core part but
Tejun doesn't like it because in the future, some archs can initialize
some fields in arch code and pass them into general MM part so blindly
clearing it out in mm core part would be very annoying.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomips: zero out pg_data_t when it's allocated
Minchan Kim [Tue, 31 Jul 2012 23:46:11 +0000]
mips: zero out pg_data_t when it's allocated

This patch is preparation for the next patch which removes the zeroing of
the pg_data_t in core MM.  All archs except MIPS already do this.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Tejun Heo <tj@kernel.org>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomemcg: gix memory accounting scalability in shrink_page_list
Tim Chen [Tue, 31 Jul 2012 23:46:08 +0000]
memcg: gix memory accounting scalability in shrink_page_list

I noticed in a multi-process parallel files reading benchmark I ran on a 8
socket machine, throughput slowed down by a factor of 8 when I ran the
benchmark within a cgroup container.  I traced the problem to the
following code path (see below) when we are trying to reclaim memory from
file cache.  The res_counter_uncharge function is called on every page
that's reclaimed and created heavy lock contention.  The patch below
allows the reclaimed pages to be uncharged from the resource counter in
batch and recovered the regression.

Tim

     40.67%           usemem  [kernel.kallsyms]                   [k] _raw_spin_lock
                      |
                      --- _raw_spin_lock
                         |
                         |--92.61%-- res_counter_uncharge
                         |          |
                         |          |--100.00%-- __mem_cgroup_uncharge_common
                         |          |          |
                         |          |          |--100.00%-- mem_cgroup_uncharge_cache_page
                         |          |          |          __remove_mapping
                         |          |          |          shrink_page_list
                         |          |          |          shrink_inactive_list
                         |          |          |          shrink_mem_cgroup_zone
                         |          |          |          shrink_zone
                         |          |          |          do_try_to_free_pages
                         |          |          |          try_to_free_pages
                         |          |          |          __alloc_pages_nodemask
                         |          |          |          alloc_pages_current

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm/sparse: remove index_init_lock
Gavin Shan [Tue, 31 Jul 2012 23:46:06 +0000]
mm/sparse: remove index_init_lock

sparse_index_init() uses the index_init_lock spinlock to protect root
mem_section assignment.  The lock is not necessary anymore because the
function is called only during boot (during paging init which is executed
only from a single CPU) and from the hotplug code (by add_memory() via
arch_add_memory()) which uses mem_hotplug_mutex.

The lock was introduced by 28ae55c9 ("sparsemem extreme: hotplug
preparation") and sparse_index_init() was used only during boot at that
time.

Later when the hotplug code (and add_memory()) was introduced there was no
synchronization so it was possible to online more sections from the same
root probably (though I am not 100% sure about that).  The first
synchronization has been added by 6ad696d2 ("mm: allow memory hotplug and
hibernation in the same kernel") which was later replaced by the
mem_hotplug_mutex - 20d6c96b ("mem-hotplug: introduce
{un}lock_memory_hotplug()").

Let's remove the lock as it is not needed and it makes the code more
confusing.

[mhocko@suse.cz: changelog]
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm/sparse: more checks on mem_section number
Gavin Shan [Tue, 31 Jul 2012 23:46:04 +0000]
mm/sparse: more checks on mem_section number

__section_nr() was implemented to retrieve the corresponding memory
section number according to its descriptor.  It's possible that the
specified memory section descriptor doesn't exist in the global array.  So
add more checking on that and report an error for a wrong case.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm/sparse: optimize sparse_index_alloc
Gavin Shan [Tue, 31 Jul 2012 23:46:02 +0000]
mm/sparse: optimize sparse_index_alloc

With CONFIG_SPARSEMEM_EXTREME, the two levels of memory section
descriptors are allocated from slab or bootmem.  When allocating from
slab, let slab/bootmem allocator clear the memory chunk.  We needn't clear
it explicitly.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomemcg: add mem_cgroup_from_css() helper
Wanpeng Li [Tue, 31 Jul 2012 23:46:01 +0000]
memcg: add mem_cgroup_from_css() helper

Add a mem_cgroup_from_css() helper to replace open-coded invokations of
container_of().  To clarify the code and to add a little more type safety.

[akpm@linux-foundation.org: fix extensive breakage]
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomemcg: further prevent OOM with too many dirty pages
Hugh Dickins [Tue, 31 Jul 2012 23:45:59 +0000]
memcg: further prevent OOM with too many dirty pages

The may_enter_fs test turns out to be too restrictive: though I saw no
problem with it when testing on 3.5-rc6, it very soon OOMed when I tested
on 3.5-rc6-mm1.  I don't know what the difference there is, perhaps I just
slightly changed the way I started off the testing: dd if=/dev/zero
of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync repeatedly, in 20M
memory.limit_in_bytes cgroup to ext4 on USB stick.

ext4 (and gfs2 and xfs) turn out to allocate new pages for writing with
AOP_FLAG_NOFS: that seems a little worrying, and it's unclear to me why
the transaction needs to be started even before allocating pagecache
memory.  But it may not be worth worrying about these days: if direct
reclaim avoids FS writeback, does __GFP_FS now mean anything?

Anyway, we insisted on the may_enter_fs test to avoid hangs with the loop
device; but since that also masks off __GFP_IO, we can test for __GFP_IO
directly, ignoring may_enter_fs and __GFP_FS.

But even so, the test still OOMs sometimes: when originally testing on
3.5-rc6, it OOMed about one time in five or ten; when testing just now on
3.5-rc6-mm1, it OOMed on the first iteration.

This residual problem comes from an accumulation of pages under ordinary
writeback, not marked PageReclaim, so rightly not causing the memcg check
to wait on their writeback: these too can prevent shrink_page_list() from
freeing any pages, so many times that memcg reclaim fails and OOMs.

Deal with these in the same way as direct reclaim now deals with dirty FS
pages: mark them PageReclaim.  It is appropriate to rotate these to tail
of list when writepage completes, but more importantly, the PageReclaim
flag makes memcg reclaim wait on them if encountered again.  Increment
NR_VMSCAN_IMMEDIATE?  That's arguable: I chose not.

Setting PageReclaim here may occasionally race with end_page_writeback()
clearing it: lru_deactivate_fn() already faced the same race, and
correctly concluded that the window is small and the issue non-critical.

With these changes, the test runs indefinitely without OOMing on ext4,
ext3 and ext2: I'll move on to test with other filesystems later.

Trivia: invert conditions for a clearer block without an else, and goto
keep_locked to do the unlock_page.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomemcg: prevent OOM with too many dirty pages
Michal Hocko [Tue, 31 Jul 2012 23:45:55 +0000]
memcg: prevent OOM with too many dirty pages

The current implementation of dirty pages throttling is not memcg aware
which makes it easy to have memcg LRUs full of dirty pages.  Without
throttling, these LRUs can be scanned faster than the rate of writeback,
leading to memcg OOM conditions when the hard limit is small.

This patch fixes the problem by throttling the allocating process
(possibly a writer) during the hard limit reclaim by waiting on
PageReclaim pages.  We are waiting only for PageReclaim pages because
those are the pages that made one full round over LRU and that means that
the writeback is much slower than scanning.

The solution is far from being ideal - long term solution is memcg aware
dirty throttling - but it is meant to be a band aid until we have a real
fix.  We are seeing this happening during nightly backups which are placed
into containers to prevent from eviction of the real working set.

The change affects only memcg reclaim and only when we encounter
PageReclaim pages which is a signal that the reclaim doesn't catch up on
with the writers so somebody should be throttled.  This could be
potentially unfair because it could be somebody else from the group who
gets throttled on behalf of the writer but as writers need to allocate as
well and they allocate in higher rate the probability that only innocent
processes would be penalized is not that high.

I have tested this change by a simple dd copying /dev/zero to tmpfs or
ext3 running under small memcg (1G copy under 5M, 60M, 300M and 2G
containers) and dd got killed by OOM killer every time.  With the patch I
could run the dd with the same size under 5M controller without any OOM.
The issue is more visible with slower devices for output.

* With the patch
================
* tmpfs size=2G
---------------
$ vim cgroup_cache_oom_test.sh
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s

* ext3
------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s

* Without the patch
===================
* tmpfs size=2G
---------------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
./cgroup_cache_oom_test.sh: line 46:  4668 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s

* ext3
------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
./cgroup_cache_oom_test.sh: line 46:  4689 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
./cgroup_cache_oom_test.sh: line 46:  4692 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s

[akpm@linux-foundation.org: tweak changelog, reordered the test to optimize for CONFIG_CGROUP_MEM_RES_CTLR=n]
[hughd@google.com: fix deadlock with loop driver]
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm: mmu_notifier: fix freed page still mapped in secondary MMU
Xiao Guangrong [Tue, 31 Jul 2012 23:45:52 +0000]
mm: mmu_notifier: fix freed page still mapped in secondary MMU

mmu_notifier_release() is called when the process is exiting.  It will
delete all the mmu notifiers.  But at this time the page belonging to the
process is still present in page tables and is present on the LRU list, so
this race will happen:

      CPU 0                 CPU 1
mmu_notifier_release:    try_to_unmap:
   hlist_del_init_rcu(&mn->hlist);
                            ptep_clear_flush_notify:
                                  mmu nofifler not found
                            free page  !!!!!!
                            /*
                             * At the point, the page has been
                             * freed, but it is still mapped in
                             * the secondary MMU.
                             */

  mn->ops->release(mn, mm);

Then the box is not stable and sometimes we can get this bug:

[  738.075923] BUG: Bad page state in process migrate-perf  pfn:03bec
[  738.075931] page:ffffea00000efb00 count:0 mapcount:0 mapping:          (null) index:0x8076
[  738.075936] page flags: 0x20000000000014(referenced|dirty)

The same issue is present in mmu_notifier_unregister().

We can call ->release before deleting the notifier to ensure the page has
been unmapped from the secondary MMU before it is freed.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm: memcg: only check anon swapin page charges for swap cache
Johannes Weiner [Tue, 31 Jul 2012 23:45:50 +0000]
mm: memcg: only check anon swapin page charges for swap cache

shmem knows for sure that the page is in swap cache when attempting to
charge a page, because the cache charge entry function has a check for it.
Only anon pages may be removed from swap cache already when trying to
charge their swapin.

Adjust the comment, though: '4969c11 mm: fix swapin race condition' added
a stable PageSwapCache check under the page lock in the do_swap_page()
before calling the memory controller, so it's unuse_pte()'s pte_same()
that may fail.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Wanpeng Li <liwp.linux@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm: memcg: only check swap cache pages for repeated charging
Johannes Weiner [Tue, 31 Jul 2012 23:45:47 +0000]
mm: memcg: only check swap cache pages for repeated charging

Only anon and shmem pages in the swap cache are attempted to be charged
multiple times, from every swap pte fault or from shmem_unuse().  No other
pages require checking PageCgroupUsed().

Charging pages in the swap cache is also serialized by the page lock, and
since both the try_charge and commit_charge are called under the same page
lock section, the PageCgroupUsed() check might as well happen before the
counter charging, let alone reclaim.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Wanpeng Li <liwp.linux@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm: memcg: split swapin charge function into private and public part
Johannes Weiner [Tue, 31 Jul 2012 23:45:43 +0000]
mm: memcg: split swapin charge function into private and public part

When shmem is charged upon swapin, it does not need to check twice whether
the memory controller is enabled.

Also, shmem pages do not have to be checked for everything that regular
anon pages have to be checked for, so let shmem use the internal version
directly and allow future patches to move around checks that are only
required when swapping in anon pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Wanpeng Li <liwp.linux@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm: memcg: remove needless !mm fixup to init_mm when charging
Johannes Weiner [Tue, 31 Jul 2012 23:45:40 +0000]
mm: memcg: remove needless !mm fixup to init_mm when charging

It does not matter to __mem_cgroup_try_charge() if the passed mm is NULL
or init_mm, it will charge the root memcg in either case.

Also fix up the comment in __mem_cgroup_try_charge() that claimed the
init_mm would be charged when no mm was passed.  It's not really
incorrect, but confusing.  Clarify that the root memcg is charged in this
case.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Wanpeng Li <liwp.linux@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm: memcg: remove unneeded shmem charge type
Johannes Weiner [Tue, 31 Jul 2012 23:45:39 +0000]
mm: memcg: remove unneeded shmem charge type

shmem page charges have not needed a separate charge type to tell them
from regular file pages since 08e552c ("memcg: synchronized LRU").

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Wanpeng Li <liwp.linux@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm: memcg: move swapin charge functions above callsites
Johannes Weiner [Tue, 31 Jul 2012 23:45:36 +0000]
mm: memcg: move swapin charge functions above callsites

Charging cache pages may require swapin in the shmem case.  Save the
forward declaration and just move the swapin functions above the cache
charging functions.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Wanpeng Li <liwp.linux@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm: memcg: only check for PageSwapCache when uncharging anon
Johannes Weiner [Tue, 31 Jul 2012 23:45:34 +0000]
mm: memcg: only check for PageSwapCache when uncharging anon

Only anon pages that are uncharged at the time of the last page table
mapping vanishing may be in swapcache.

When shmem pages, file pages, swap-freed anon pages, or just migrated
pages are uncharged, they are known for sure to be not in swapcache.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Wanpeng Li <liwp.linux@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm: memcg: push down PageSwapCache check into uncharge entry functions
Johannes Weiner [Tue, 31 Jul 2012 23:45:31 +0000]
mm: memcg: push down PageSwapCache check into uncharge entry functions

Not all uncharge paths need to check if the page is swapcache, some of
them can know for sure.

Push down the check into all callsites of uncharge_common() so that the
patch that removes some of them is more obvious.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wanpeng Li <liwp.linux@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm: swapfile: clean up unuse_pte race handling
Johannes Weiner [Tue, 31 Jul 2012 23:45:28 +0000]
mm: swapfile: clean up unuse_pte race handling

The conditional mem_cgroup_cancel_charge_swapin() is a leftover from when
the function would continue to reestablish the page even after
mem_cgroup_try_charge_swapin() failed.  After 85d9fc8 "memcg: fix refcnt
handling at swapoff", the condition is always true when this code is
reached.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Wanpeng Li <liwp.linux@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm: memcg: fix compaction/migration failing due to memcg limits
Johannes Weiner [Tue, 31 Jul 2012 23:45:25 +0000]
mm: memcg: fix compaction/migration failing due to memcg limits

Compaction (and page migration in general) can currently be hindered
through pages being owned by memory cgroups that are at their limits and
unreclaimable.

The reason is that the replacement page is being charged against the limit
while the page being replaced is also still charged.  But this seems
unnecessary, given that only one of the two pages will still be in use
after migration finishes.

This patch changes the memcg migration sequence so that the replacement
page is not charged.  Whatever page is still in use after successful or
failed migration gets to keep the charge of the page that was going to be
replaced.

The replacement page will still show up temporarily in the rss/cache
statistics, this can be fixed in a later patch as it's less urgent.

Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wanpeng Li <liwp.linux@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agoswapfile: avoid dereferencing bd_disk during swap_entry_free for network storage
Mel Gorman [Tue, 31 Jul 2012 23:45:20 +0000]
swapfile: avoid dereferencing bd_disk during swap_entry_free for network storage

Commit b3a27d ("swap: Add swap slot free callback to
block_device_operations") dereferences p->bdev->bd_disk but this is a NULL
dereference if using swap-over-NFS.  This patch checks SWP_BLKDEV on the
swap_info_struct before dereferencing.

With reference to this callback, Christoph Hellwig stated "Please just
remove the callback entirely.  It has no user outside the staging tree and
was added clearly against the rules for that staging tree".  This would
also be my preference but there was not an obvious way of keeping zram in
staging/ happy.

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agonfs: prevent page allocator recursions with swap over NFS.
Mel Gorman [Tue, 31 Jul 2012 23:45:16 +0000]
nfs: prevent page allocator recursions with swap over NFS.

GFP_NOFS is _more_ permissive than GFP_NOIO in that it will initiate IO,
just not of any filesystem data.

The problem is that previously NOFS was correct because that avoids
recursion into the NFS code.  With swap-over-NFS, it is no longer correct
as swap IO can lead to this recursion.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agonfs: enable swap on NFS
Mel Gorman [Tue, 31 Jul 2012 23:45:12 +0000]
nfs: enable swap on NFS

Implement the new swapfile a_ops for NFS and hook up ->direct_IO.  This
will set the NFS socket to SOCK_MEMALLOC and run socket reconnect under
PF_MEMALLOC as well as reset SOCK_MEMALLOC before engaging the protocol
->connect() method.

PF_MEMALLOC should allow the allocation of struct socket and related
objects and the early (re)setting of SOCK_MEMALLOC should allow us to
receive the packets required for the TCP connection buildup.

[jlayton@redhat.com: Restore PF_MEMALLOC task flags in all cases]
[dfeng@redhat.com: Fix handling of multiple swap files]
[a.p.zijlstra@chello.nl: Original patch]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agonfs: disable data cache revalidation for swapfiles
Mel Gorman [Tue, 31 Jul 2012 23:45:10 +0000]
nfs: disable data cache revalidation for swapfiles

The VM does not like PG_private set on PG_swapcache pages.  As suggested
by Trond in http://lkml.org/lkml/2006/8/25/348, this patch disables NFS
data cache revalidation on swap files.  as it does not make sense to have
other clients change the file while it is being used as swap.  This avoids
setting PG_private on swap pages, since there ought to be no further races
with invalidate_inode_pages2() to deal with.

Since we cannot set PG_private we cannot use page->private which is
already used by PG_swapcache pages to store the nfs_page.  Thus augment
the new nfs_page_find_request logic.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agonfs: teach the NFS client how to treat PG_swapcache pages
Mel Gorman [Tue, 31 Jul 2012 23:45:06 +0000]
nfs: teach the NFS client how to treat PG_swapcache pages

Replace all relevant occurences of page->index and page->mapping in the
NFS client with the new page_file_index() and page_file_mapping()
functions.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm: add support for direct_IO to highmem pages
Mel Gorman [Tue, 31 Jul 2012 23:45:02 +0000]
mm: add support for direct_IO to highmem pages

The patch "mm: add support for a filesystem to activate swap files and use
direct_IO for writing swap pages" added support for using direct_IO to
write swap pages but it is insufficient for highmem pages.

To support highmem pages, this patch kmaps() the page before calling the
direct_IO() handler.  As direct_IO deals with virtual addresses an
additional helper is necessary for get_kernel_pages() to lookup the struct
page for a kmap virtual address.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agomm: swap: implement generic handler for swap_activate
Mel Gorman [Tue, 31 Jul 2012 23:44:57 +0000]
mm: swap: implement generic handler for swap_activate

The version of swap_activate introduced is sufficient for swap-over-NFS
but would not provide enough information to implement a generic handler.
This patch shuffles things slightly to ensure the same information is
available for aops->swap_activate() as is available to the core.

No functionality change.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>