Sasha Levin [Sun, 4 Dec 2011 17:36:28 +0000]
KVM: Use kmemdup() instead of kmalloc/memcpy

Switch to kmemdup() in two places to shorten the code and avoid possible bugs.
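
As a rough userspace illustration of the pattern being replaced (a sketch
only: sketch_memdup() is a hypothetical stand-in for kmemdup(), and malloc()
is used because kmalloc() and GFP flags do not exist outside the kernel):

```c
#include <stdlib.h>
#include <string.h>

/* Open-coded pattern: allocate, check, then copy (three steps to get right). */
static void *dup_open_coded(const void *src, size_t len)
{
    void *p = malloc(len);

    if (!p)
        return NULL;
    memcpy(p, src, len);
    return p;
}

/* Hypothetical stand-in for kmemdup(): callers make one call and one check. */
static void *sketch_memdup(const void *src, size_t len)
{
    void *p = malloc(len);

    if (p)
        memcpy(p, src, len);
    return p;
}
```

Both helpers return NULL on allocation failure, so a caller only has to test
the result once.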

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Avi Kivity [Wed, 7 Dec 2011 10:42:47 +0000]
KVM: Document KVM_NMI

Signed-off-by: Avi Kivity <avi@redhat.com>

Jan Kiszka [Fri, 2 Dec 2011 17:26:28 +0000]
KVM: x86 emulator: Remove set-but-unused cr4 from check_cr_write

This was probably copy&pasted from the cr0 case, but it's unneeded here.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Jan Kiszka [Fri, 2 Dec 2011 17:35:24 +0000]
KVM: MMU: Drop unused return value of kvm_mmu_remove_some_alloc_mmu_pages

freed_pages is never evaluated, so remove it as well as the return code
kvm_mmu_remove_some_alloc_mmu_pages so far delivered to its only user.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Alex,Shi [Thu, 20 Oct 2011 07:34:01 +0000]
KVM: use this_cpu_xxx replace percpu_xxx funcs

The percpu_xxx functions duplicate the this_cpu_xxx functions, so replace
them as a step toward further code cleanup.

In preempt-safe contexts, the __this_cpu_xxx functions also perform slightly
better, since they have no redundant preempt_disable().

Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Wed, 30 Nov 2011 09:43:24 +0000]
KVM: MMU: audit: inline audit function

Inline the audit function and do a little cleanup.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Mon, 28 Nov 2011 12:43:18 +0000]
KVM: MMU: remove oos_shadow parameter

The unsync code should be stable by now, so it may be time to remove this
parameter and clean up the code a little.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Mon, 28 Nov 2011 12:42:16 +0000]
KVM: MMU: move the relevant mmu code to mmu.c

Move the mmu code in kvm_arch_vcpu_init() to kvm_mmu_create()

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Mon, 28 Nov 2011 12:41:38 +0000]
KVM: x86: remove the dead code of KVM_EXIT_HYPERCALL

KVM_EXIT_HYPERCALL is not used anymore, so remove the code

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Mon, 28 Nov 2011 12:41:00 +0000]
KVM: MMU: audit: replace mmu audit tracepoint with jump-label

The tracepoint is used only to audit the mmu code and should not be exposed
to users, so replace it with a jump label.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Mon, 28 Nov 2011 12:39:59 +0000]
jump-label: export jump_label_inc/jump_label_dec

Export these two symbols; they will be used by the KVM mmu audit code.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Sasha Levin [Mon, 28 Nov 2011 09:20:29 +0000]
KVM: Refactor and simplify kvm_dev_ioctl_get_supported_cpuid

This patch cleans up and simplifies kvm_dev_ioctl_get_supported_cpuid by using
a table instead of duplicating code, as Avi suggested.

This patch also fixes a bug where kvm_dev_ioctl_get_supported_cpuid would return
-E2BIG when the number of entries passed was just right.
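
One way such a boundary bug can arise, and how a table-driven loop avoids it,
is sketched below. This is not the kernel code: struct cpuid_entry, leaf_table
and both functions are hypothetical names, and the leaf values are only
illustrative.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct cpuid_entry { uint32_t function; };

/* Hypothetical table of CPUID leaves to report. */
static const uint32_t leaf_table[] = { 0x0, 0x80000000u, 0xC0000000u };
#define NLEAVES (sizeof(leaf_table) / sizeof(leaf_table[0]))

/* Late check: also fires when the buffer was filled exactly. */
static int fill_entries_late_check(struct cpuid_entry *out, size_t max)
{
    size_t n = 0;

    for (size_t i = 0; i < NLEAVES && n < max; i++)
        out[n++].function = leaf_table[i];
    if (n >= max)
        return -E2BIG;
    return (int)n;
}

/* Check before each write: fails only if an entry really does not fit. */
static int fill_entries(struct cpuid_entry *out, size_t max)
{
    size_t n = 0;

    for (size_t i = 0; i < NLEAVES; i++) {
        if (n >= max)
            return -E2BIG;
        out[n++].function = leaf_table[i];
    }
    return (int)n;
}
```

With three table entries, a caller passing room for exactly three entries
succeeds with fill_entries() but gets -E2BIG from the late-check variant.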

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Liu, Jinsong [Mon, 28 Nov 2011 11:55:19 +0000]
KVM: expose latest Intel cpu new features (BMI1/BMI2/FMA/AVX2) to guest

Intel's latest CPUs add 6 new features; see http://software.intel.com/file/36945
The CPUID bits for the new features are listed below:

1. FMA CPUID.EAX=01H:ECX.FMA[bit 12]
2. MOVBE CPUID.EAX=01H:ECX.MOVBE[bit 22]
3. BMI1 CPUID.EAX=07H,ECX=0H:EBX.BMI1[bit 3]
4. AVX2 CPUID.EAX=07H,ECX=0H:EBX.AVX2[bit 5]
5. BMI2 CPUID.EAX=07H,ECX=0H:EBX.BMI2[bit 8]
6. LZCNT CPUID.EAX=80000001H:ECX.LZCNT[bit 5]

This patch exposes these features to the guest.
Among them, FMA/MOVBE/LZCNT have already been defined, and MOVBE/LZCNT have
already been exposed.

This patch defines BMI1/AVX2/BMI2, and exposes FMA/BMI1/AVX2/BMI2 to the guest.
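
The bit positions listed above can be written down as masks. The following is
a minimal sketch, assuming the register values have already been read; struct
cpuid_regs and the helper names are hypothetical and are not the kernel's
definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT(n) (UINT32_C(1) << (n))

/* Bit positions copied from the list in this commit message. */
#define LEAF1_ECX_FMA           BIT(12)
#define LEAF1_ECX_MOVBE         BIT(22)
#define LEAF7_EBX_BMI1          BIT(3)
#define LEAF7_EBX_AVX2          BIT(5)
#define LEAF7_EBX_BMI2          BIT(8)
#define LEAF80000001_ECX_LZCNT  BIT(5)

/* Hypothetical container for the three registers of interest. */
struct cpuid_regs {
    uint32_t leaf1_ecx;         /* CPUID.EAX=01H:ECX */
    uint32_t leaf7_ebx;         /* CPUID.EAX=07H,ECX=0H:EBX */
    uint32_t leaf80000001_ecx;  /* CPUID.EAX=80000001H:ECX */
};

static bool has_fma(const struct cpuid_regs *r)
{
    return (r->leaf1_ecx & LEAF1_ECX_FMA) != 0;
}

static bool has_avx2(const struct cpuid_regs *r)
{
    return (r->leaf7_ebx & LEAF7_EBX_AVX2) != 0;
}

/* Keep only the leaf 7 EBX bits named above; all other bits are cleared. */
static uint32_t mask_leaf7_ebx(uint32_t host_ebx)
{
    return host_ebx & (LEAF7_EBX_BMI1 | LEAF7_EBX_AVX2 | LEAF7_EBX_BMI2);
}
```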

Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Avi Kivity [Wed, 23 Nov 2011 14:30:32 +0000]
KVM: Move cpuid code to new file

The cpuid code has grown; put it into a separate file.

Signed-off-by: Avi Kivity <avi@redhat.com>

Takuya Yoshikawa [Wed, 23 Nov 2011 03:27:39 +0000]
KVM: x86 emulator: Use opcode::execute for INS/OUTS from/to port in DX

INSB       : 6C
INSW/INSD  : 6D
OUTSB      : 6E
OUTSW/OUTSD: 6F

The I/O port address is read from the DX register when we decode the
operand because we see the SrcDX/DstDX flag is set.
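
The general idea of dispatching through a per-opcode execute callback can be
sketched as a table of function pointers. Everything below (struct ctxt, the
handler names, the table layout) is hypothetical and much simpler than the
emulator's real opcode tables; only the opcode bytes come from the list above.

```c
#include <stddef.h>
#include <stdint.h>

enum { OP_NONE, OP_INS, OP_OUTS };

/* Hypothetical emulator state: DX holds the port, as noted above. */
struct ctxt {
    uint16_t dx;
    uint16_t port;
    int last_op;
};

typedef int (*execute_fn)(struct ctxt *);

static int em_ins(struct ctxt *c)
{
    c->port = c->dx;        /* port address comes from DX */
    c->last_op = OP_INS;
    return 0;
}

static int em_outs(struct ctxt *c)
{
    c->port = c->dx;
    c->last_op = OP_OUTS;
    return 0;
}

/* One-byte opcode table; entries without a handler stay NULL. */
static const execute_fn opcode_table[256] = {
    [0x6c] = em_ins,  [0x6d] = em_ins,   /* INSB, INSW/INSD */
    [0x6e] = em_outs, [0x6f] = em_outs,  /* OUTSB, OUTSW/OUTSD */
};

static int emulate(struct ctxt *c, uint8_t opcode)
{
    execute_fn fn = opcode_table[opcode];

    return fn ? fn(c) : -1;  /* -1: no callback registered in this sketch */
}
```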

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>

Julian Stecklina [Wed, 23 Nov 2011 12:54:30 +0000]
KVM: Allow aligned byte and word writes to IOAPIC registers.

This fixes byte accesses to IOAPIC_REG_SELECT as mandated by at least the
ICH10 and Intel Series 5 chipset specs. It also makes ioapic_mmio_write
consistent with ioapic_mmio_read, which also allows byte and word accesses.
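
A sketch of what accepting byte, word and wider writes can look like is shown
below. It is an illustration under assumptions, not ioapic_mmio_write itself:
the function name, the natural-alignment check, and the choice to read only
four bytes of an 8-byte write are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Returns true and stores the written value in *data if the access is
 * accepted; returns false for unsupported sizes or misaligned addresses.
 */
static bool ioapic_write_value(uint64_t addr, int len, const void *val,
                               uint32_t *data)
{
    uint32_t v32;
    uint16_t v16;
    uint8_t v8;

    switch (len) {
    case 8:
    case 4:
        memcpy(&v32, val, sizeof(v32));  /* registers are 32 bits wide */
        *data = v32;
        break;
    case 2:
        memcpy(&v16, val, sizeof(v16));
        *data = v16;
        break;
    case 1:
        memcpy(&v8, val, sizeof(v8));
        *data = v8;
        break;
    default:
        return false;
    }

    /* Assumption: require natural alignment for the access size. */
    return (addr & (uint64_t)(len - 1)) == 0;
}
```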

Signed-off-by: Julian Stecklina <js@alien8.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Thu, 24 Nov 2011 10:09:11 +0000]
KVM: IA64: fix struct redefinition

There is the same struct definition in ia64 and kvm common code:
arch/ia64/kvm//kvm-ia64.c: At top level:
arch/ia64/kvm//kvm-ia64.c:777:8: error: redefinition of ‘struct kvm_io_range’
include/linux/kvm_host.h:62:8: note: originally defined here

So, rename kvm_io_range to kvm_ia64_io_range in ia64 code

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Thu, 24 Nov 2011 09:41:54 +0000]
KVM: introduce a table to map slot id to index in memslots array

The operation of getting the dirty log is frequent when framebuffer-based
displays are used (for example, X Window), so we introduce a mapping table
to speed up id_to_memslot()
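
The mapping table can be sketched as an array indexed by slot id. The struct
layouts, NSLOTS and rebuild_index() below are hypothetical simplifications
used only to show the constant-time lookup; they are not the kernel's data
structures.

```c
#include <stddef.h>

#define NSLOTS 4  /* hypothetical; stands in for the real slot count */

struct slot {
    int id;
    unsigned long npages;
};

struct slot_table {
    struct slot slots[NSLOTS];  /* may be stored in any order */
    int id_to_index[NSLOTS];    /* slot id -> position in slots[] */
};

/* Rebuild the id -> index mapping after slots[] has been reordered. */
static void rebuild_index(struct slot_table *t)
{
    for (int i = 0; i < NSLOTS; i++)
        t->id_to_index[t->slots[i].id] = i;
}

/* Constant-time lookup instead of scanning slots[] for a matching id. */
static struct slot *id_to_slot(struct slot_table *t, int id)
{
    return &t->slots[t->id_to_index[id]];
}
```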

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Thu, 24 Nov 2011 09:40:57 +0000]
KVM: sort memslots by its size and use line search

Sort memslots based on their size and use a linear search to find them, so
that the larger memslots are matched sooner
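
Sorting by size and then scanning in order can be sketched as follows. This
uses qsort() and a hypothetical struct gfn_slot purely for illustration; the
kernel has its own sort helper and memslot layout.

```c
#include <stdlib.h>

struct gfn_slot {
    unsigned long base_gfn;
    unsigned long npages;
};

/* qsort comparator: larger slots first. */
static int cmp_by_size_desc(const void *a, const void *b)
{
    const struct gfn_slot *x = a;
    const struct gfn_slot *y = b;

    if (x->npages == y->npages)
        return 0;
    return x->npages < y->npages ? 1 : -1;
}

static void sort_slots(struct gfn_slot *slots, size_t n)
{
    qsort(slots, n, sizeof(*slots), cmp_by_size_desc);
}

/* Scan in sorted order; returns NULL if no slot contains gfn. */
static struct gfn_slot *search_slots(struct gfn_slot *slots, size_t n,
                                     unsigned long gfn)
{
    for (size_t i = 0; i < n; i++) {
        if (gfn >= slots[i].base_gfn &&
            gfn < slots[i].base_gfn + slots[i].npages)
            return &slots[i];
    }
    return NULL;
}
```

Because the largest slot is checked first, a lookup that falls inside it ends
after a single comparison.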

The idea is from Avi

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Thu, 24 Nov 2011 11:04:35 +0000]
KVM: introduce id_to_memslot function

Introduce id_to_memslot to get memslot by slot id

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Thu, 24 Nov 2011 09:39:18 +0000]
KVM: introduce kvm_for_each_memslot macro

Introduce kvm_for_each_memslot to walk all valid memslots

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Thu, 24 Nov 2011 09:38:24 +0000]
KVM: introduce update_memslots function

Introduce update_memslots to update the slot that is about to be written to
kvm->memslots

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Thu, 24 Nov 2011 09:37:48 +0000]
KVM: introduce KVM_MEM_SLOTS_NUM macro

Introduce the KVM_MEM_SLOTS_NUM macro to replace
KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Takuya Yoshikawa [Tue, 22 Nov 2011 06:21:33 +0000]
KVM: x86 emulator: Use opcode::execute for BSF/BSR

BSF: 0F BC
BSR: 0F BD

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Takuya Yoshikawa [Tue, 22 Nov 2011 06:20:47 +0000]
KVM: x86 emulator: Use opcode::execute for CMPXCHG

CMPXCHG: 0F B0, 0F B1

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Takuya Yoshikawa [Tue, 22 Nov 2011 06:20:03 +0000]
KVM: x86 emulator: Use opcode::execute for WRMSR/RDMSR

WRMSR: 0F 30
RDMSR: 0F 32

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Takuya Yoshikawa [Tue, 22 Nov 2011 06:19:19 +0000]
KVM: x86 emulator: Use opcode::execute for MOV to cr/dr

MOV: 0F 22 (move to control registers)
MOV: 0F 23 (move to debug registers)

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Takuya Yoshikawa [Tue, 22 Nov 2011 06:18:35 +0000]
KVM: x86 emulator: Use opcode::execute for CALL

CALL: E8

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Takuya Yoshikawa [Tue, 22 Nov 2011 06:17:48 +0000]
KVM: x86 emulator: Use opcode::execute for BT family

BT : 0F A3
BTS: 0F AB
BTR: 0F B3
BTC: 0F BB

Group 8: 0F BA

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Takuya Yoshikawa [Tue, 22 Nov 2011 06:16:54 +0000]
KVM: x86 emulator: Use opcode::execute for IN/OUT

IN : E4, E5, EC, ED
OUT: E6, E7, EE, EF

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Gleb Natapov [Thu, 17 Nov 2011 08:56:09 +0000]
KVM: VMX: remove unneeded vmx_load_host_state() calls.

vmx_load_host_state() does not handle msrs switching (except
MSR_KERNEL_GS_BASE) since commit 26bb0981b3f. Remove the calls to it
where they no longer make sense.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Takuya Yoshikawa [Mon, 14 Nov 2011 09:24:50 +0000]
KVM: Optimize dirty logging by rmap_write_protect()

Currently, write protecting a slot needs to walk all the shadow pages
and check the ones which have a pte mapping a page in it.

The walk is overly heavy when there are not many dirty pages in that slot,
and checking the shadow pages would result in unwanted cache pollution.

To mitigate this problem, we use rmap_write_protect() and check only
the sptes which can be reached from gfns marked in the dirty bitmap
when the number of dirty pages is less than that of shadow pages.

This criterion is reasonable in its meaning and worked well in our test:
write protection became several times faster than before when the ratio of
dirty pages was low and was not worse even when the ratio was near the
criterion.

Note that the locking for this write protection becomes fine grained.
The reason why this is safe is described in the comments.
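
The selection criterion can be sketched as a comparison of two counts. The
names below are hypothetical, and the sketch counts dirty bits by scanning the
bitmap only to stay self-contained; it does not reproduce the locking or the
actual write-protection paths.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Count set bits in a dirty bitmap covering npages pages. */
static unsigned long count_dirty(const unsigned long *bitmap, size_t npages)
{
    unsigned long n = 0;

    for (size_t i = 0; i < npages; i++) {
        if (bitmap[i / BITS_PER_ULONG] & (1UL << (i % BITS_PER_ULONG)))
            n++;
    }
    return n;
}

enum wp_method {
    WP_WALK_SHADOW_PAGES,  /* walk every shadow page */
    WP_RMAP_PER_GFN        /* visit only sptes reachable from dirty gfns */
};

/* Use the rmap path only when there are fewer dirty pages than shadow pages. */
static enum wp_method choose_method(unsigned long nr_dirty,
                                    unsigned long nr_shadow_pages)
{
    return nr_dirty < nr_shadow_pages ? WP_RMAP_PER_GFN
                                      : WP_WALK_SHADOW_PAGES;
}
```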

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>

Takuya Yoshikawa [Mon, 14 Nov 2011 09:23:34 +0000]
KVM: Count the number of dirty pages for dirty logging

Needed for the next patch which uses this number to decide how to write
protect a slot.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>

Takuya Yoshikawa [Mon, 14 Nov 2011 09:22:28 +0000]
KVM: MMU: Split gfn_to_rmap() into two functions

rmap_write_protect() calls gfn_to_rmap() for each level with gfn fixed.
This results in calling gfn_to_memslot() repeatedly with that gfn.

This patch introduces __gfn_to_rmap() which takes the slot as an
argument to avoid this.

This is also needed for the following dirty logging optimization.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>

Takuya Yoshikawa [Mon, 14 Nov 2011 09:21:34 +0000]
KVM: MMU: Clean up BUG_ON() conditions in rmap_write_protect()

Remove redundant checks and use is_large_pte() macro.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>

Thomas Meyer [Tue, 8 Nov 2011 19:32:19 +0000]
KVM: Use kmemdup rather than duplicating its implementation

 Use kmemdup rather than duplicating its implementation

 The semantic patch that makes this change is available
 in scripts/coccinelle/api/memdup.cocci.

 More information about semantic patching is available at
 http://coccinelle.lip6.fr/

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Chris Wright [Wed, 2 Nov 2011 00:31:18 +0000]
KVM: MMU: remove KVM host pv mmu support

The host side pv mmu support has been marked for feature removal in
January 2011.  It's not in use, is slower than shadow or hardware
assisted paging, and a maintenance burden.  It's November 2011, time to
remove it.

Signed-off-by: Chris Wright <chrisw@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Chris Wright [Wed, 2 Nov 2011 00:28:47 +0000]
KVM guest: remove KVM guest pv mmu support

This has not been used for some years now.  It's time to remove it.

Signed-off-by: Chris Wright <chrisw@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Dan Carpenter [Wed, 19 Oct 2011 06:15:10 +0000]
KVM: make checks stricter in coalesced_mmio_in_range()

My testing version of Smatch complains that addr and len come from
the user and they can wrap.  The path is:
  -> kvm_vm_ioctl()
     -> kvm_vm_ioctl_unregister_coalesced_mmio()
        -> coalesced_mmio_in_range()

I don't know what the implications are of wrapping here, but we may
as well fix it, if only to silence the warning.
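
A wrap-safe range check of this kind can be sketched as below. The struct and
function names are hypothetical, and the exact set of checks in the real patch
may differ; the point is only that addr + len is tested for wrapping before it
is compared against the zone bounds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of a registered zone. */
struct mmio_zone {
    uint64_t addr;
    uint32_t size;
};

static bool zone_contains(const struct mmio_zone *z, uint64_t addr, int len)
{
    if (len < 0)
        return false;
    if (addr + (uint64_t)len < addr)  /* addr + len wrapped around */
        return false;
    if (addr < z->addr)
        return false;
    if (addr + (uint64_t)len > z->addr + z->size)
        return false;
    return true;
}
```

Without the second check, an address near the top of the 64-bit range plus a
small length wraps to a small value and could slip past the upper-bound test.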

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Jan Kiszka [Wed, 14 Sep 2011 07:58:32 +0000]
KVM: x86: Simplify kvm timer handler

The vcpu reference of a kvm_timer can't become NULL while the timer is
valid, so drop this redundant test. This also makes it pointless to
carry a separate __kvm_timer_fn, fold it into kvm_timer_fn.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Eric B Munson [Mon, 10 Oct 2011 15:46:15 +0000]
KVM: Fix include dependency for mmu_notifier

The kvm_host struct can include an mmu_notifier struct but mmu_notifier.h is
not included directly.

Signed-off-by: Eric B Munson <emunson@mgebm.net>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Thu, 22 Sep 2011 08:58:36 +0000]
KVM: MMU: improve write flooding detected

Write-flooding detection does not work well. When handling a write to a page,
if the last speculative spte has not been accessed, we treat the page as
write-flooded. However, sptes are created speculatively on many paths, such as
pte prefetch and page sync, so the last speculative spte may not point to the
written page, and the written page can be accessed via other sptes. Relying on
the Accessed bit of the last speculative spte is therefore not enough.

Instead of detecting whether the page is accessed, detect whether the spte is
accessed after it is written. If the spte is not accessed but is written
frequently, treat the page as not being a page table, or as not having been
used for a long time.
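
The counting scheme described above can be sketched with a per-page counter
that is reset on access. The struct, the function names and the threshold of 3
are assumptions made for illustration only.

```c
#include <stdbool.h>

#define FLOOD_THRESHOLD 3  /* assumption for this sketch */

/* Hypothetical per-shadow-page state. */
struct shadow_page {
    unsigned int write_flooding_count;
};

/* Called when the guest is seen using the page through its spte. */
static void note_access(struct shadow_page *sp)
{
    sp->write_flooding_count = 0;
}

/* Called on each emulated write; true means "treat as write-flooded". */
static bool note_write(struct shadow_page *sp)
{
    return ++sp->write_flooding_count >= FLOOD_THRESHOLD;
}
```

A page that is written repeatedly without any intervening access crosses the
threshold, while a page that is accessed between writes never does.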

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Thu, 22 Sep 2011 08:57:55 +0000]
KVM: MMU: fix detecting misaligned accessed

Sometimes only the last byte of a pte is modified to update a status bit. For
example, the Linux kernel uses clear_bit to clear the r/w bit, and that
function uses the 'andb' instruction. In this case, kvm_mmu_pte_write treats
the write as a misaligned access and zaps the shadow page table
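
One possible shape for such a check is sketched below. It is an assumption
about how a single-byte status update could be exempted, not a copy of the
kernel function; the name and the exact conditions are hypothetical, and
pte_size is assumed to be a power of two (4 or 8).

```c
#include <stdbool.h>

static bool write_is_misaligned(unsigned offset, unsigned bytes,
                                unsigned pte_size)
{
    /* A one-byte update at the start of a pte (e.g. from 'andb') is fine. */
    if ((offset & (pte_size - 1)) == 0 && bytes == 1)
        return false;

    /* Misaligned if the first and last byte fall in different ptes... */
    bool crosses = ((offset ^ (offset + bytes - 1)) & ~(pte_size - 1)) != 0;

    /* ...or if the write is too small to be a whole-entry update. */
    return crosses || bytes < 4;
}
```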

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Thu, 22 Sep 2011 08:57:23 +0000]
KVM: MMU: split kvm_mmu_pte_write function

kvm_mmu_pte_write is too long; split it for better readability

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Thu, 22 Sep 2011 08:56:58 +0000]
KVM: MMU: remove unnecessary kvm_mmu_free_some_pages

In kvm_mmu_pte_write, we do not need to allocate a shadow page, so calling
kvm_mmu_free_some_pages is unnecessary

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Thu, 22 Sep 2011 08:56:39 +0000]
KVM: MMU: fast prefetch spte on invlpg path

Fast prefetch the spte for the unsync shadow page on the invlpg path

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Thu, 22 Sep 2011 08:56:06 +0000]
KVM: MMU: cleanup FNAME(invlpg)

Use mmu_page_zap_pte directly to zap the spte in FNAME(invlpg), and remove the
code duplicated between FNAME(invlpg) and FNAME(sync_page)

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Thu, 22 Sep 2011 08:55:36 +0000]
KVM: MMU: do not mark accessed bit on pte write path

In the current code, the accessed bit is always set when a page fault occurs,
so there is no need to set it on the pte write path

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Thu, 22 Sep 2011 08:55:10 +0000]
KVM: x86: cleanup port-in/port-out emulated

Remove the code duplicated between emulator_pio_in_emulated and
emulator_pio_out_emulated

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Thu, 22 Sep 2011 09:02:48 +0000]
KVM: x86: retry non-page-table writing instructions

If the emulation is caused by a #PF and the instruction is not one that writes
page tables, the VM-EXIT was caused by a write-protected shadow page, so we can
zap the shadow page and retry the instruction directly

The idea is from Avi

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Thu, 22 Sep 2011 08:53:46 +0000]
KVM: x86: tag the instructions which are used to write page table

The idea is from Avi:
| tag instructions that are typically used to modify the page tables, and
| drop shadow if any other instruction is used.
| The list would include, I'd guess, and, or, bts, btc, mov, xchg, cmpxchg,
| and cmpxchg8b.

This patch tags the instructions; in a later patch, the shadow page is dropped
if it is written by any other instruction

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Xiao Guangrong [Thu, 22 Sep 2011 08:53:17 +0000]
KVM: MMU: avoid pte_list_desc running out in kvm_mmu_pte_write

kvm_mmu_pte_write is unsafe because it has to allocate pte_list_desc objects
when sptes are prefetched. Unfortunately, we cannot know how many sptes need to
be prefetched on this path, so we can run out of free pte_list_desc objects in
the cache and trigger BUG_ON(). In addition, some paths do not fill the cache,
such as emulation of an INS instruction that does not trigger a page fault

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Nadav Har'El [Thu, 22 Sep 2011 10:53:26 +0000]
KVM: nVMX: Fix warning-causing idt-vectoring-info behavior

When L0 wishes to inject an interrupt while L2 is running, it emulates an exit
to L1 with EXIT_REASON_EXTERNAL_INTERRUPT. This was explained in the original
nVMX patch 23, titled "Correct handling of interrupt injection".

Unfortunately, it is possible (though rare) that at this point there is valid
idt_vectoring_info in vmcs02. For example, L1 injected some interrupt to L2,
and when L2 tried to run this interrupt's handler, it got a page fault - so
it returns the original interrupt vector in idt_vectoring_info. The problem
is that if this is the case, we cannot exit to L1 with EXTERNAL_INTERRUPT
like we wished to, because the VMX spec guarantees that idt_vectoring_info
and exit_reason_external_interrupt can never happen together. This is not
just specified in the spec - a KVM L1 actually prints a kernel warning
"unexpected, valid vectoring info" if we violate this guarantee, and some
users noticed these warnings in L1's logs.

In order to better emulate a processor, which would never return the external
interrupt and the idt-vectoring-info together, we need to separate the two
injection steps: First, complete L1's injection into L2 (i.e., enter L2,
injecting to it the idt-vectoring-info); Second, after entry into L2 succeeds
and it exits back to L0, exit to L1 with the EXIT_REASON_EXTERNAL_INTERRUPT.
Most of this is already in the code - the only change we need is to remain
in L2 (and not exit to L1) in this case.

Note that the previous patch ensures (by using KVM_REQ_IMMEDIATE_EXIT) that
although we do enter L2 first, it will exit immediately after processing its
injection, allowing us to promptly inject to L1.

Note how we test vmcs12->idt_vectoring_info_field; This isn't really the
vmcs12 value (we haven't exited to L1 yet, so vmcs12 hasn't been updated),
but rather the place we save, at the end of vmx_vcpu_run, the vmcs02 value
of this field. This was explained in patch 25 ("Correct handling of idt
vectoring info") of the original nVMX patch series.

Thanks to Dave Allan and to Federico Simoncelli for reporting this bug,
to Abel Gordon for helping me figure out the solution, and to Avi Kivity
for helping to improve it.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Nadav Har'El [Thu, 22 Sep 2011 10:52:56 +0000]
KVM: nVMX: Add KVM_REQ_IMMEDIATE_EXIT

This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT.
This bit requests that when next entering the guest, we should run it only
for as little as possible, and exit again.

We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1
to continue running so it can inject an event to it, we unfortunately cannot
just pretend to have run L2 for a little while - We must really launch L2,
otherwise certain one-off vmcs12 parameters (namely, L1 injection into L2)
will be lost. So the existing code runs L2 in this case.
But L2 could potentially run for a long time until it exits, and the
injection into L1 will be delayed. The new KVM_REQ_IMMEDIATE_EXIT allows us
to request that L2 will be entered, as necessary, but will exit as soon as
possible after entry.

Our implementation of this request uses smp_send_reschedule() to send a
self-IPI, with interrupts disabled. The interrupts remain disabled until the
guest is entered, and then, after the entry is complete (often including
processing an injection and jumping to the relevant handler), the physical
interrupt is noticed and causes an exit.

On recent Intel processors, we could have achieved the same goal by using
MTF instead of a self-IPI. Another technique worth considering in the future
is to use VM_EXIT_ACK_INTR_ON_EXIT and a highest-priority vector IPI - to
slightly improve performance by avoiding the useless interrupt handler
which ends up being called when smp_send_reschedule() is used.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Keith Packard [Tue, 27 Dec 2011 01:02:11 +0000]
drm/i915: Disable RC6 on Sandybridge by default

RC6 fails again.

> I found my system freeze mostly during starting up X and KDE. Sometimes it
> works for some minutes, sometimes it freezes immediatly. When the freeze
> happens, everything is dead (even the reset button does not work, I need to
> power cycle).

> I disabled RC6, and my system runs wonderfully.

> The system is a Z68 Pro board with Sandybridge i5-2500K processor, 8
> GB of RAM and UEFI firmware.

Reported-by: Kai Krakow <hurikhan77@gmail.com>
Signed-off-by: Keith Packard <keithp@keithp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Keith Packard [Tue, 27 Dec 2011 01:02:10 +0000]
drm/i915: Disable semaphores by default on SNB

Semaphores still cause problems on some machines:

> From Udo Steinberg:
>
> With Linux-3.2-rc6 I'm frequently seeing GPU hangs when large amounts of
> text scroll in an xterm, such as when extracting a tar archive. Such as this
> one (note the timestamps):
>
>  I can reproduce it fairly easily with something
>  as simple as:
>
>   while true; do dmesg; done

This patch turns them off on SNB while leaving them on for IVB.

Reported-by: Udo Steinberg <udo@hypervisor.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Eugeni Dodonov <eugeni@dodonov.net>
Signed-off-by: Keith Packard <keithp@keithp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agoMerge branch 'kvm-updates/3.2' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Mon, 26 Dec 2011 21:17:00 +0000]
Merge branch 'kvm-updates/3.2' of git://git./virt/kvm/kvm

* 'kvm-updates/3.2' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: PPC: e500: include linux/export.h
  KVM: PPC: fix kvmppc_start_thread() for CONFIG_SMP=N
  KVM: PPC: protect use of kvmppc_h_pr
  KVM: PPC: move compute_tlbie_rb to book3s_64 common header
  KVM: Don't automatically expose the TSC deadline timer in cpuid
  KVM: Device assignment permission checks
  KVM: Remove ability to assign a device without iommu support
  KVM: x86: Prevent starting PIT timers in the absence of irqchip support

7 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394
Linus Torvalds [Mon, 26 Dec 2011 20:46:17 +0000]
Merge tag 'for-linus' of git://git./linux/kernel/git/ieee1394/linux1394

post 3.2-rc7 pull request

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
  MAINTAINERS: firewire git URL update

Linus Torvalds [Mon, 26 Dec 2011 18:25:26 +0000]
vfs: fix handling of lock allocation failure in lease-break case

Bruce Fields notes that commit 778fc546f749 ("locks: fix tracking of
inprogress lease breaks") introduced a possible error pointer
dereference on failure to allocate memory.  locks_conflict() will
dereference the passed-in new lease lock structure that may be an error pointer.

This means an open (without O_NONBLOCK set) on a file with a lease
applied (generally only done when Samba or nfsd (with v4) is running)
could crash if a kmalloc() fails.

So instead of playing games with IS_ERR() all over the place, just
check the allocation failure early.  That makes the code more
straightforward, and avoids this possible bad pointer dereference.
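
The "check the allocation failure early" approach can be sketched in userspace
as follows. The helpers err_ptr(), ptr_err() and is_err() are simplified
imitations of the kernel's error-pointer macros, and struct lease,
lease_alloc() and break_lease_sketch() are hypothetical; none of this is the
actual locks code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, and decode it again. */
static void *err_ptr(long err) { return (void *)(intptr_t)err; }
static long ptr_err(const void *p) { return (long)(intptr_t)p; }
static bool is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct lease { int type; };

/* Hypothetical allocator that returns an error pointer instead of NULL. */
static struct lease *lease_alloc(int type, bool simulate_oom)
{
    struct lease *l = simulate_oom ? NULL : malloc(sizeof(*l));

    if (!l)
        return err_ptr(-ENOMEM);
    l->type = type;
    return l;
}

/* Check once, up front, so later code never sees an error pointer. */
static int break_lease_sketch(int type, bool simulate_oom)
{
    struct lease *new_fl = lease_alloc(type, simulate_oom);
    int t;

    if (is_err(new_fl))
        return (int)ptr_err(new_fl);

    t = new_fl->type;  /* safe: new_fl is a real object here */
    free(new_fl);
    return t;
}
```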

Based-on-patch-by: J. Bruce Fields <bfields@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Scott Wood [Tue, 20 Dec 2011 14:43:45 +0000]
KVM: PPC: e500: include linux/export.h

This is required for THIS_MODULE.  We recently stopped acquiring
it via some other header.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>

Michael Neuling [Thu, 10 Nov 2011 16:03:20 +0000]
KVM: PPC: fix kvmppc_start_thread() for CONFIG_SMP=N

Currently kvmppc_start_thread() tries to wake other SMT threads via
xics_wake_cpu().  Unfortunately xics_wake_cpu only exists when
CONFIG_SMP=Y so when compiling with CONFIG_SMP=N we get:

  arch/powerpc/kvm/built-in.o: In function `.kvmppc_start_thread':
  book3s_hv.c:(.text+0xa1e0): undefined reference to `.xics_wake_cpu'

The following should be fine since kvmppc_start_thread() shouldn't be
called to start non-zero threads when SMP=N since threads_per_core=1.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Alexander Graf <agraf@suse.de>

Andreas Schwab [Tue, 8 Nov 2011 07:17:39 +0000]
KVM: PPC: protect use of kvmppc_h_pr

kvmppc_h_pr is only available if CONFIG_KVM_BOOK3S_64_PR.

Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Alexander Graf <agraf@suse.de>

Andreas Schwab [Tue, 8 Nov 2011 07:08:52 +0000]
KVM: PPC: move compute_tlbie_rb to book3s_64 common header

compute_tlbie_rb is only used on ppc64 and cannot be compiled on ppc32.

Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Alexander Graf <agraf@suse.de>

Jan Kiszka [Wed, 21 Dec 2011 11:28:29 +0000]
KVM: Don't automatically expose the TSC deadline timer in cpuid

Unlike all of the other cpuid bits, the TSC deadline timer bit is set
unconditionally, regardless of what userspace wants.

This is broken in several ways:
 - if userspace doesn't use KVM_CREATE_IRQCHIP, and doesn't emulate the TSC
   deadline timer feature, a guest that uses the feature will break
 - live migration to older host kernels that don't support the TSC deadline
   timer will cause the feature to be pulled from under the guest's feet,
   breaking it
 - guests that are broken wrt the feature will fail.

Fix by not enabling the feature automatically; instead report it to userspace.
Because the feature depends on KVM_CREATE_IRQCHIP, which we cannot guarantee
will be called, we expose it via a KVM_CAP_TSC_DEADLINE_TIMER and not
KVM_GET_SUPPORTED_CPUID.

Fixes the Illumos guest kernel, which uses the TSC deadline timer feature.

[avi: add the KVM_CAP + documentation]

Reported-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

7 years agoKVM: Device assignment permission checks
Alex Williamson [Wed, 21 Dec 2011 04:59:09 +0000]
KVM: Device assignment permission checks

Only allow KVM device assignment to attach to devices which:

 - Are not bridges
 - Have BAR resources (assume others are special devices)
 - The user has permissions to use

Assigning a bridge is a configuration error, it's not supported, and
typically doesn't result in the behavior the user is expecting anyway.
Devices without BAR resources are typically chipset components that
also don't have host drivers.  We don't want users to hold such devices
captive or cause system problems by fencing them off into an iommu
domain.  We determine "permission to use" by testing whether the user
has access to the PCI sysfs resource files.  By default a normal user
will not have access to these files, so it provides a good indication
that an administration agent has granted the user access to the device.

[Yang Bai: add missing #include]
[avi: fix comment style]

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yang Bai <hamo.by@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

7 years agoKVM: Remove ability to assign a device without iommu support
Alex Williamson [Wed, 21 Dec 2011 04:59:03 +0000]
KVM: Remove ability to assign a device without iommu support

This option has no users and it exposes a security hole in that we
can allow devices to be assigned without iommu protection.  Make
KVM_DEV_ASSIGN_ENABLE_IOMMU a mandatory option.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

7 years agoKVM: x86: Prevent starting PIT timers in the absence of irqchip support
Jan Kiszka [Wed, 14 Dec 2011 18:25:13 +0000]
KVM: x86: Prevent starting PIT timers in the absence of irqchip support

User space may create the PIT and forget about setting up the irqchips.
In that case, firing PIT IRQs will crash the host:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000128
IP: [<ffffffffa10f6280>] kvm_set_irq+0x30/0x170 [kvm]
...
Call Trace:
 [<ffffffffa11228c1>] pit_do_work+0x51/0xd0 [kvm]
 [<ffffffff81071431>] process_one_work+0x111/0x4d0
 [<ffffffff81071bb2>] worker_thread+0x152/0x340
 [<ffffffff81075c8e>] kthread+0x7e/0x90
 [<ffffffff815a4474>] kernel_thread_helper+0x4/0x10

Prevent this by checking the irqchip mode before starting a timer. We
can't deny creating the PIT if the irqchips aren't set up yet as
current user land expects this order to work.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

7 years agoMAINTAINERS: firewire git URL update
Stefan Richter [Tue, 20 Dec 2011 20:23:28 +0000]
MAINTAINERS: firewire git URL update

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

7 years agoMerge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Linus Torvalds [Sat, 24 Dec 2011 21:34:44 +0000]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
  vmwgfx: fix incorrect VRAM size check in vmw_kms_fb_create()
  drm/radeon/kms: bail on BTC parts if MC ucode is missing

7 years agoLinux 3.2-rc7
Linus Torvalds [Sat, 24 Dec 2011 05:51:06 +0000]
Linux 3.2-rc7

7 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Sat, 24 Dec 2011 05:47:28 +0000]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: Fix race between CPU hotplug and lglocks

7 years agoMerge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux
Linus Torvalds [Sat, 24 Dec 2011 04:25:36 +0000]
Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

for linus: writeback reason binary tracing format fix

* tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  writeback: show writeback reason with __print_symbolic

7 years agoMerge branch 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Linus Torvalds [Fri, 23 Dec 2011 23:01:24 +0000]
Merge branch 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild

* 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
  kconfig: adapt update-po-config to new UML layout

7 years agoMerge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab...
Linus Torvalds [Fri, 23 Dec 2011 22:59:08 +0000]
Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  [media] omap3isp: Fix crash caused by subdevs now having a pointer to devnodes

7 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux...
Linus Torvalds [Fri, 23 Dec 2011 22:58:39 +0000]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: call d_instantiate after all ops are setup
  Btrfs: fix worker lock misuse in find_worker

7 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
Linus Torvalds [Fri, 23 Dec 2011 22:58:14 +0000]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc64: Fix MSIQ HV call ordering in pci_sun4v_msiq_build_irq().

7 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Fri, 23 Dec 2011 22:57:55 +0000]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  netfilter: xt_connbytes: handle negation correctly
  net: relax rcvbuf limits
  rps: fix insufficient bounds checking in store_rps_dev_flow_table_cnt()
  net: introduce DST_NOPEER dst flag
  mqprio: Avoid panic if no options are provided
  bridge: provide a mtu() method for fake_dst_ops

7 years agoMerge branch 'nf' of git://1984.lsi.us.es/net
David S. Miller [Fri, 23 Dec 2011 19:29:20 +0000]
Merge branch 'nf' of git://1984.lsi.us.es/net

7 years agonetfilter: xt_connbytes: handle negation correctly
Florian Westphal [Fri, 16 Dec 2011 17:35:15 +0000]
netfilter: xt_connbytes: handle negation correctly

"! --connbytes 23:42" should match if the packet/byte count is not in range.

As there is no explicit "invert match" toggle in the match structure,
userspace swaps the from and to arguments
(i.e., as if "--connbytes 42:23" were given).

However, "what <= 23 && what >= 42" will always be false.

Change things so we use "||" in case "from" is larger than "to".

This change may look like it breaks backwards compatibility when "to" is 0.
However, older iptables binaries will refuse "connbytes 42:0",
and current releases treat it to mean "! --connbytes 0:42",
so we should be fine.
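
As an illustration only, the range test described above can be sketched in
plain C.  The helper name connbytes_in_range() and its parameters are
hypothetical and are not taken from the kernel source; the sketch assumes
only what the message states: userspace encodes negation by swapping the
bounds, so "from" larger than "to" means "match outside the range".

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel function.
 *
 * from <= to: ordinary inclusive range, both conditions joined with "&&".
 * from >  to: userspace swapped the bounds to signal negation, so the
 *             value matches when it lies outside [to, from]; the two
 *             conditions must be joined with "||".
 */
static bool connbytes_in_range(uint64_t what, uint64_t from, uint64_t to)
{
	if (from <= to)
		return what >= from && what <= to;

	/* inverted: equivalent to "! --connbytes to:from" */
	return what < to || what > from;
}
```

With "! --connbytes 23:42" the helper would be called with from=42 and
to=23, so a count of 30 does not match while counts of 10 or 50 do.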

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

7 years agoBtrfs: call d_instantiate after all ops are setup
Al Viro [Fri, 23 Dec 2011 12:58:13 +0000]
Btrfs: call d_instantiate after all ops are setup

This closes races where btrfs is calling d_instantiate too soon during
inode creation.  All of the callers of btrfs_add_nondir are updated to
instantiate after the inode is fully setup in memory.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

7 years agoBtrfs: fix worker lock misuse in find_worker
Chris Mason [Fri, 23 Dec 2011 12:53:00 +0000]
Btrfs: fix worker lock misuse in find_worker

Dan Carpenter noticed that we were doing a double unlock on the worker
lock, and sometimes picking a worker thread without the lock held.

This fixes both errors.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

7 years agonet: relax rcvbuf limits
Eric Dumazet [Wed, 21 Dec 2011 07:11:44 +0000]
net: relax rcvbuf limits

skb->truesize might be big even for a small packet.

It's even bigger after commit 87fb4b7b533 (net: more accurate skb
truesize) and big MTU.

We should allow queueing at least one packet per receiver, even with a
low RCVBUF setting.

Reported-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7 years agorps: fix insufficient bounds checking in store_rps_dev_flow_table_cnt()
Xi Wang [Thu, 22 Dec 2011 13:35:22 +0000]
rps: fix insufficient bounds checking in store_rps_dev_flow_table_cnt()

Setting a large rps_flow_cnt like (1 << 30) on 32-bit platform will
cause a kernel oops due to insufficient bounds checking.

    if (count > 1<<30) {
        /* Enforce a limit to prevent overflow */
        return -EINVAL;
    }
    count = roundup_pow_of_two(count);
    table = vmalloc(RPS_DEV_FLOW_TABLE_SIZE(count));

Note that the macro RPS_DEV_FLOW_TABLE_SIZE(count) is defined as:

... + (count * sizeof(struct rps_dev_flow))

where sizeof(struct rps_dev_flow) is 8.  (1 << 30) * 8 will overflow
32 bits.

This patch replaces the magic number (1 << 30) with a symbolic bound.
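
The arithmetic can be reproduced in userspace with fixed-width types.  This
is a sketch under stated assumptions, not the kernel change itself: the
helper names are made up, the entry size of 8 comes from the message above,
and the "header" parameter merely stands in for the fixed part of the macro
that the message elides.

```c
#include <stdbool.h>
#include <stdint.h>

#define ENTRY_SIZE 8u	/* sizeof(struct rps_dev_flow), per the message above */

/*
 * Table size computed in 32-bit arithmetic, as on a 32-bit platform.
 * "header" is a placeholder for the elided fixed part of the macro.
 * The result wraps modulo 2^32 when count is too large.
 */
static uint32_t table_size_32(uint32_t header, uint32_t count)
{
	return header + count * ENTRY_SIZE;
}

/* Largest count for which header + count * ENTRY_SIZE still fits in 32 bits. */
static uint32_t max_count_32(uint32_t header)
{
	return (UINT32_MAX - header) / ENTRY_SIZE;
}

/*
 * Bound derived from the size computation instead of a magic number.
 * It has to be applied to the value that is really used for the
 * allocation, i.e. after any rounding up.
 */
static bool count_is_safe(uint32_t header, uint32_t count)
{
	return count <= max_count_32(header);
}
```

For example, table_size_32(0, 1u << 30) wraps to 0, which is why the old
check against (1 << 30) did not prevent the overflow, whereas
count_is_safe(0, 1u << 30) rejects the value.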

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7 years agonet: introduce DST_NOPEER dst flag
Eric Dumazet [Thu, 22 Dec 2011 04:15:53 +0000]
net: introduce DST_NOPEER dst flag

Chris Boot reported crashes occurring in ipv6_select_ident().

[  461.457562] RIP: 0010:[<ffffffff812dde61>]  [<ffffffff812dde61>]
ipv6_select_ident+0x31/0xa7

[  461.578229] Call Trace:
[  461.580742] <IRQ>
[  461.582870]  [<ffffffff812efa7f>] ? udp6_ufo_fragment+0x124/0x1a2
[  461.589054]  [<ffffffff812dbfe0>] ? ipv6_gso_segment+0xc0/0x155
[  461.595140]  [<ffffffff812700c6>] ? skb_gso_segment+0x208/0x28b
[  461.601198]  [<ffffffffa03f236b>] ? ipv6_confirm+0x146/0x15e
[nf_conntrack_ipv6]
[  461.608786]  [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77
[  461.614227]  [<ffffffff81271d64>] ? dev_hard_start_xmit+0x357/0x543
[  461.620659]  [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111
[  461.626440]  [<ffffffffa0379745>] ? br_parse_ip_options+0x19a/0x19a
[bridge]
[  461.633581]  [<ffffffff812722ff>] ? dev_queue_xmit+0x3af/0x459
[  461.639577]  [<ffffffffa03747d2>] ? br_dev_queue_push_xmit+0x72/0x76
[bridge]
[  461.646887]  [<ffffffffa03791e3>] ? br_nf_post_routing+0x17d/0x18f
[bridge]
[  461.653997]  [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77
[  461.659473]  [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge]
[  461.665485]  [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111
[  461.671234]  [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge]
[  461.677299]  [<ffffffffa0379215>] ?
nf_bridge_update_protocol+0x20/0x20 [bridge]
[  461.684891]  [<ffffffffa03bb0e5>] ? nf_ct_zone+0xa/0x17 [nf_conntrack]
[  461.691520]  [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge]
[  461.697572]  [<ffffffffa0374812>] ? NF_HOOK.constprop.8+0x3c/0x56
[bridge]
[  461.704616]  [<ffffffffa0379031>] ?
nf_bridge_push_encap_header+0x1c/0x26 [bridge]
[  461.712329]  [<ffffffffa037929f>] ? br_nf_forward_finish+0x8a/0x95
[bridge]
[  461.719490]  [<ffffffffa037900a>] ?
nf_bridge_pull_encap_header+0x1c/0x27 [bridge]
[  461.727223]  [<ffffffffa0379974>] ? br_nf_forward_ip+0x1c0/0x1d4 [bridge]
[  461.734292]  [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77
[  461.739758]  [<ffffffffa03748cc>] ? __br_deliver+0xa0/0xa0 [bridge]
[  461.746203]  [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111
[  461.751950]  [<ffffffffa03748cc>] ? __br_deliver+0xa0/0xa0 [bridge]
[  461.758378]  [<ffffffffa037533a>] ? NF_HOOK.constprop.4+0x56/0x56
[bridge]

This is caused by bridge netfilter special dst_entry (fake_rtable), a
special shared entry, where attaching an inetpeer makes no sense.

Problem is present since commit 87c48fa3b46 (ipv6: make fragment
identifications less predictable)

Introduce DST_NOPEER dst flag and make sure ipv6_select_ident() and
__ip_select_ident() fallback to the 'no peer attached' handling.

Reported-by: Chris Boot <bootc@bootc.net>
Tested-by: Chris Boot <bootc@bootc.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7 years agomqprio: Avoid panic if no options are provided
Thomas Graf [Thu, 22 Dec 2011 02:05:07 +0000]
mqprio: Avoid panic if no options are provided

Userspace may not provide TCA_OPTIONS, in fact tc currently does
not do so if no arguments are specified on the command line.
Return EINVAL instead of panicking.

Signed-off-by: Thomas Graf <tgraf@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7 years agobridge: provide a mtu() method for fake_dst_ops
Eric Dumazet [Wed, 21 Dec 2011 20:00:32 +0000]
bridge: provide a mtu() method for fake_dst_ops

Commit 618f9bc74a039da76 (net: Move mtu handling down to the protocol
depended handlers) forgot the bridge netfilter case, adding a NULL
dereference in ip_fragment().

Reported-by: Chris Boot <bootc@bootc.net>
CC: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7 years agoMerge branch 'for-linus' of git://neil.brown.name/md
Linus Torvalds [Thu, 22 Dec 2011 23:36:17 +0000]
Merge branch 'for-linus' of git://neil.brown.name/md

* 'for-linus' of git://neil.brown.name/md:
  md/bitmap: It is OK to clear bits during recovery.
  md: don't give up looking for spares on first failure-to-add
  md/raid5: ensure correct assessment of drives during degraded reshape.
  md/linear: fix hot-add of devices to linear arrays.

7 years agomd/bitmap: It is OK to clear bits during recovery.
NeilBrown [Thu, 22 Dec 2011 22:57:48 +0000]
md/bitmap: It is OK to clear bits during recovery.

commit d0a4bb492772ce5c4bdfba3744a99ed6f6fb238f introduced a
regression which is annoying but fairly harmless.

When writing to an array that is undergoing recovery (a spare
is being integrated into the array), writing to the array will
set bits in the bitmap, but they will not be cleared when the
write completes.

For bits covering areas that have not been recovered yet this is not a
problem as the recovery will clear the bits.  However bits set in
already-recovered regions will stay set and never be cleared.
This doesn't risk data integrity.  The only negatives are:
 - next time there is a crash, more resyncing than necessary will
   be done.
 - the bitmap doesn't look clean, which is confusing.

While an array is recovering we don't want to update the
'events_cleared' setting in the bitmap but we do still want to clear
bits that have very recently been set - providing they were written to
the recovering device.

So split those two needs - which previously both depended on 'success' -
and always clear the bit if the write went to all devices.

Signed-off-by: NeilBrown <neilb@suse.de>

7 years agomd: don't give up looking for spares on first failure-to-add
NeilBrown [Thu, 22 Dec 2011 22:57:19 +0000]
md: don't give up looking for spares on first failure-to-add

Before performing a recovery we try to remove any spares that
might not be working, then add any that might have become relevant.

Currently we abort on the first spare that cannot be added.
This is a false optimisation.
It is conceivable that - depending on rules in the personality - a
subsequent spare might be accepted.
Also the loop does other things like count the available spares and
reset the 'recovery_offset' value.

If we abort early these might not happen properly.

So remove the early abort.

In particular if you have an array that is undergoing recovery and
which has extra spares, then the recovery may not restart after a
reboot as the count of 'spares' might end up as zero.
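
The effect of the early abort on the spare count can be shown with a small
userspace sketch.  Everything here is hypothetical: struct candidate and the
two scan functions are illustrative stand-ins and do not correspond to the
md data structures or functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a candidate device; not a kernel structure. */
struct candidate {
	bool accepted;	/* would the personality accept this device? */
	bool added;	/* set once the device has been added */
};

/* Old behaviour: stop at the first candidate that cannot be added. */
static int scan_abort_early(struct candidate *c, size_t n)
{
	int spares = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!c[i].accepted)
			break;		/* later candidates are never examined */
		c[i].added = true;
		spares++;
	}
	return spares;
}

/* Behaviour after the change: skip the failure and keep scanning. */
static int scan_all(struct candidate *c, size_t n)
{
	int spares = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!c[i].accepted)
			continue;	/* this one failed, try the next */
		c[i].added = true;
		spares++;
	}
	return spares;
}
```

If the first of three candidates is rejected and the other two are
acceptable, scan_abort_early() reports zero spares while scan_all() reports
two, which mirrors the "count of spares ends up as zero" symptom.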

Reported-by: Anssi Hannula <anssi.hannula@iki.fi>
Signed-off-by: NeilBrown <neilb@suse.de>

7 years agomd/raid5: ensure correct assessment of drives during degraded reshape.
NeilBrown [Thu, 22 Dec 2011 22:57:00 +0000]
md/raid5: ensure correct assessment of drives during degraded reshape.

While reshaping a degraded array (as when reshaping a RAID0 by first
converting it to a degraded RAID4) we currently get confused about
which devices are in_sync.  In most cases we get it right, but in the
region that is being reshaped we need to treat non-failed devices as
in-sync when we have the data but haven't actually written it out yet.

Reported-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

7 years agomd/linear: fix hot-add of devices to linear arrays.
NeilBrown [Thu, 22 Dec 2011 22:56:55 +0000]
md/linear: fix hot-add of devices to linear arrays.

commit d70ed2e4fafdbef0800e73942482bb075c21578b
broke hot-add to a linear array.
After that commit, metadata is not written to devices until they
have been fully integrated into the array as determined by
saved_raid_disk.  That patch arranged to clear that field after
a recovery completed.

However for linear arrays, there is no recovery - the integration is
instantaneous.  So we need to explicitly clear the saved_raid_disk
field.

Signed-off-by: NeilBrown <neilb@suse.de>

7 years agosparc64: Fix MSIQ HV call ordering in pci_sun4v_msiq_build_irq().
David S. Miller [Thu, 22 Dec 2011 21:23:59 +0000]
sparc64: Fix MSIQ HV call ordering in pci_sun4v_msiq_build_irq().

This silently was working for many years and stopped working on
Niagara-T3 machines.

We need to set the MSIQ to VALID before we can set its state to IDLE.

On Niagara-T3, setting the state to IDLE first was causing HV_EINVAL
errors.  The hypervisor documentation says, rather ambiguously, that
the MSIQ must be "initialized" before one can set the state.

I previously understood this to mean merely that a successful setconf()
operation has been performed on the MSIQ, which we have done at this
point.  But it seems to also mean that it has been set VALID too.

Signed-off-by: David S. Miller <davem@davemloft.net>

7 years agoMerge branch 'usb-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Linus Torvalds [Thu, 22 Dec 2011 20:59:47 +0000]
Merge branch 'usb-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

* 'usb-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  USB: Fix usb/isp1760 build on sparc
  usb: gadget: epautoconf: do not change number of streams
  usb: dwc3: core: fix cached revision on our structure
  usb: musb: fix reset issue with full speed device

7 years agoMerge branch 'upstream-linus' of git://github.com/jgarzik/libata-dev
Linus Torvalds [Thu, 22 Dec 2011 20:53:32 +0000]
Merge branch 'upstream-linus' of git://github.com/jgarzik/libata-dev

* 'upstream-linus' of git://github.com/jgarzik/libata-dev:
  pata_of_platform: Add missing CONFIG_OF_IRQ dependency.

7 years agopata_of_platform: Add missing CONFIG_OF_IRQ dependency.
David Miller [Wed, 21 Dec 2011 22:38:10 +0000]
pata_of_platform: Add missing CONFIG_OF_IRQ dependency.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>

7 years agoipv4: using prefetch requires including prefetch.h
Stephen Rothwell [Thu, 22 Dec 2011 06:03:29 +0000]
ipv4: using prefetch requires including prefetch.h

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7 years agovmwgfx: fix incorrect VRAM size check in vmw_kms_fb_create()
Xi Wang [Wed, 21 Dec 2011 10:18:33 +0000]
vmwgfx: fix incorrect VRAM size check in vmw_kms_fb_create()

Commit e133e737 didn't correctly fix the integer overflow issue.

-   unsigned int required_size;
+   u64 required_size;
    ...
    required_size = mode_cmd->pitch * mode_cmd->height;
-   if (unlikely(required_size > dev_priv->vram_size)) {
+   if (unlikely(required_size > (u64) dev_priv->vram_size)) {

Note that both pitch and height are u32.  Their product is still u32 and
would overflow before being assigned to required_size.  A correct way is
to convert pitch and height to u64 before the multiplication.

required_size = (u64)mode_cmd->pitch * (u64)mode_cmd->height;

This patch calls the existing vmw_kms_validate_mode_vram() for
validation.
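
The difference between the two multiplications can be checked in userspace.
This is only a sketch with made-up function names; it assumes a platform
where int is 32 bits wide, so that the product of two uint32_t values is
computed in 32 bits.

```c
#include <stdint.h>

/*
 * Pattern criticised above: the product is computed in 32 bits and wraps
 * before it is widened by the assignment to a 64-bit variable.
 */
static uint64_t size_wrapped(uint32_t pitch, uint32_t height)
{
	uint64_t required_size = pitch * height;

	return required_size;
}

/* Widening the operands first makes the multiplication itself 64-bit. */
static uint64_t size_widened(uint32_t pitch, uint32_t height)
{
	return (uint64_t)pitch * (uint64_t)height;
}
```

With pitch = height = 0x10000, size_wrapped() yields 0 and would pass any
VRAM size check, while size_widened() yields 0x100000000.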

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Reviewed-and-tested-by: Thomas Hellstrom <thellstrom@vmware.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>

7 years agodrm/radeon/kms: bail on BTC parts if MC ucode is missing
Alex Deucher [Wed, 21 Dec 2011 16:58:17 +0000]
drm/radeon/kms: bail on BTC parts if MC ucode is missing

We already do this for cayman, need to also do it for
BTC parts.  The default memory and voltage setup is not
adequate for advanced operation.  Continuing will
result in an unusable display.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@kernel.org
Cc: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>

7 years agoVFS: Fix race between CPU hotplug and lglocks
Srivatsa S. Bhat [Wed, 21 Dec 2011 21:15:29 +0000]
VFS: Fix race between CPU hotplug and lglocks

Currently, the *_global_[un]lock_online() routines are not at all synchronized
with CPU hotplug. Soft-lockups detected as a consequence of this race were
reported earlier at https://lkml.org/lkml/2011/8/24/185. (Thanks to Cong Meng
for finding out that the root-cause of this issue is the race condition
between br_write_[un]lock() and CPU hotplug, which results in the lock states
getting messed up).

Fixing this race by just adding {get,put}_online_cpus() at appropriate places
in *_global_[un]lock_online() is not a good option, because, then suddenly
br_write_[un]lock() would become blocking, whereas they have been kept as
non-blocking all this time, and we would want to keep them that way.

So, overall, we want to ensure 3 things:
1. br_write_lock() and br_write_unlock() must remain as non-blocking.
2. The corresponding lock and unlock of the per-cpu spinlocks must not happen
   for different sets of CPUs.
3. Either prevent any new CPU online operation in between this lock-unlock, or
   ensure that the newly onlined CPU does not proceed with its corresponding
   per-cpu spinlock unlocked.

To achieve all this:
(a) We introduce a new spinlock that is taken by the *_global_lock_online()
    routine and released by the *_global_unlock_online() routine.
(b) We register a callback for CPU hotplug notifications, and this callback
    takes the same spinlock as above.
(c) We maintain a bitmap which is close to the cpu_online_mask, and once it is
    initialized in the lock_init() code, all future updates to it are done in
    the callback, under the above spinlock.
(d) The above bitmap is used (instead of cpu_online_mask) while locking and
    unlocking the per-cpu locks.

The callback takes the spinlock upon the CPU_UP_PREPARE event. So, if the
br_write_lock-unlock sequence is in progress, the callback keeps spinning,
thus preventing the CPU online operation till the lock-unlock sequence is
complete. This takes care of requirement (3).

The bitmap that we maintain remains unmodified throughout the lock-unlock
sequence, since all updates to it are managed by the callback, which takes
the same spinlock as the one taken by the lock code and released only by the
unlock routine. Combining this with (d) above, satisfies requirement (2).

Overall, since we use a spinlock (mentioned in (a)) to prevent CPU hotplug
operations from racing with br_write_lock-unlock, requirement (1) is also
taken care of.

By the way, it is to be noted that a CPU offline operation can actually run
in parallel with our lock-unlock sequence, because our callback doesn't react
to notifications earlier than CPU_DEAD (in order to maintain our bitmap
properly). And this means, since we use our own bitmap (which is stale, on
purpose) during the lock-unlock sequence, we could end up unlocking the
per-cpu lock of an offline CPU (because we had locked it earlier, when the
CPU was online), in order to satisfy requirement (2). But this is harmless,
though it looks a bit awkward.

Debugged-by: Cong Meng <mc@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org

7 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Thu, 22 Dec 2011 02:29:26 +0000]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net: Add a flow_cache_flush_deferred function
  ipv4: reintroduce route cache garbage collector
  net: have ipconfig not wait if no dev is available
  sctp: Do not account for sizeof(struct sk_buff) in estimated rwnd
  asix: new device id
  davinci-cpdma: fix locking issue in cpdma_chan_stop
  sctp: fix incorrect overflow check on autoclose
  r8169: fix Config2 MSIEnable bit setting.
  llc: llc_cmsg_rcv was getting called after sk_eat_skb.
  net: bpf_jit: fix an off-one bug in x86_64 cond jump target
  iwlwifi: update SCD BC table for all SCD queues
  Revert "Bluetooth: Revert: Fix L2CAP connection establishment"
  Bluetooth: Clear RFCOMM session timer when disconnecting last channel
  Bluetooth: Prevent uninitialized data access in L2CAP configuration
  iwlwifi: allow to switch to HT40 if not associated
  iwlwifi: tx_sync only on PAN context
  mwifiex: avoid double list_del in command cancel path
  ath9k: fix max phy rate at rate control init
  nfc: signedness bug in __nci_request()
  iwlwifi: do not set the sequence control bit is not needed