summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)Author
2013-05-07s390: move dummy io_remap_pfn_range() to asm/pgtable.hLinus Torvalds
commit 4f2e29031e6c67802e7370292dd050fd62f337ee upstream. Commit b4cbb197c7e7 ("vm: add vm_iomap_memory() helper function") added a helper function wrapper around io_remap_pfn_range(), and every other architecture defined it in <asm/pgtable.h>. The s390 choice of <asm/io.h> may make sense, but is not very convenient for this case, and gratuitous differences like that cause unexpected errors like this: mm/memory.c: In function 'vm_iomap_memory': mm/memory.c:2439:2: error: implicit declaration of function 'io_remap_pfn_range' [-Werror=implicit-function-declaration] Glory be the kbuild test robot who noticed this, bisected it, and reported it to the guilty parties (ie me). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-07x86: Eliminate irq_mis_count counted in arch_irq_statLi Fei
commit f7b0e1055574ce06ab53391263b4e205bf38daf3 upstream. With the current implementation, kstat_cpu(cpu).irqs_sum is also increased in case of irq_mis_count increment. So there is no need to count irq_mis_count in arch_irq_stat, otherwise irq_mis_count will be counted twice in the sum of /proc/stat. Reported-by: Liu Chuansheng <chuansheng.liu@intel.com> Signed-off-by: Li Fei <fei.li@intel.com> Acked-by: Liu Chuansheng <chuansheng.liu@intel.com> Cc: tomoki.sekiyama.qu@hitachi.com Cc: joe@perches.com Link: http://lkml.kernel.org/r/1366980611.32469.7.camel@fli24-HP-Compaq-8100-Elite-CMT-PC Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-07Wrong asm register contraints in the kvm implementationStephan Schreiber
commit de53e9caa4c6149ef4a78c2f83d7f5b655848767 upstream. The Linux Kernel contains some inline assembly source code which has wrong asm register constraints in arch/ia64/kvm/vtlb.c. I observed this on Kernel 3.2.35 but it is also true on the most recent Kernel 3.9-rc1. File arch/ia64/kvm/vtlb.c: u64 guest_vhpt_lookup(u64 iha, u64 *pte) { u64 ret; struct thash_data *data; data = __vtr_lookup(current_vcpu, iha, D_TLB); if (data != NULL) thash_vhpt_insert(current_vcpu, data->page_flags, data->itir, iha, D_TLB); asm volatile ( "rsm psr.ic|psr.i;;" "srlz.d;;" "ld8.s r9=[%1];;" "tnat.nz p6,p7=r9;;" "(p6) mov %0=1;" "(p6) mov r9=r0;" "(p7) extr.u r9=r9,0,53;;" "(p7) mov %0=r0;" "(p7) st8 [%2]=r9;;" "ssm psr.ic;;" "srlz.d;;" "ssm psr.i;;" "srlz.d;;" : "=r"(ret) : "r"(iha), "r"(pte):"memory"); return ret; } The list of output registers is : "=r"(ret) : "r"(iha), "r"(pte):"memory"); The constraint "=r" means that the GCC has to maintain that these vars are in registers and contain valid info when the program flow leaves the assembly block (output registers). But "=r" also means that GCC can put them in registers that are used as input registers. Input registers are iha, pte on the example. If the predicate p7 is true, the 8th assembly instruction "(p7) mov %0=r0;" is the first one which writes to a register which is maintained by the register constraints; it sets %0. %0 means the first register operand; it is ret here. This instruction might overwrite the %2 register (pte) which is needed by the next instruction: "(p7) st8 [%2]=r9;;" Whether it really happens depends on how GCC decides what registers it uses and how it optimizes the code. The attached patch fixes the register operand constraints in arch/ia64/kvm/vtlb.c. The register constraints should be : "=&r"(ret) : "r"(iha), "r"(pte):"memory"); The & means that GCC must not use any of the input registers to place this output register in. This is Debian bug#702639 (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=702639). The patch is applicable on Kernel 3.9-rc1, 3.2.35 and many other versions. Signed-off-by: Stephan Schreiber <info@fs-driver.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-07Wrong asm register contraints in the futex implementationStephan Schreiber
commit 136f39ddc53db3bcee2befbe323a56d4fbf06da8 upstream. The Linux Kernel contains some inline assembly source code which has wrong asm register constraints in arch/ia64/include/asm/futex.h. I observed this on Kernel 3.2.23 but it is also true on the most recent Kernel 3.9-rc1. File arch/ia64/include/asm/futex.h: static inline int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, u32 oldval, u32 newval) { if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))) return -EFAULT; { register unsigned long r8 __asm ("r8"); unsigned long prev; __asm__ __volatile__( " mf;; \n" " mov %0=r0 \n" " mov ar.ccv=%4;; \n" "[1:] cmpxchg4.acq %1=[%2],%3,ar.ccv \n" " .xdata4 \"__ex_table\", 1b-., 2f-. \n" "[2:]" : "=r" (r8), "=r" (prev) : "r" (uaddr), "r" (newval), "rO" ((long) (unsigned) oldval) : "memory"); *uval = prev; return r8; } } The list of output registers is : "=r" (r8), "=r" (prev) The constraint "=r" means that the GCC has to maintain that these vars are in registers and contain valid info when the program flow leaves the assembly block (output registers). But "=r" also means that GCC can put them in registers that are used as input registers. Input registers are uaddr, newval, oldval on the example. The second assembly instruction " mov %0=r0 \n" is the first one which writes to a register; it sets %0 to 0. %0 means the first register operand; it is r8 here. (The r0 is read-only and always 0 on the Itanium; it can be used if an immediate zero value is needed.) This instruction might overwrite one of the other registers which are still needed. Whether it really happens depends on how GCC decides what registers it uses and how it optimizes the code. The objdump utility can give us disassembly. The futex_atomic_cmpxchg_inatomic() function is inline, so we have to look for a module that uses the funtion. This is the cmpxchg_futex_value_locked() function in kernel/futex.c: static int cmpxchg_futex_value_locked(u32 *curval, u32 __user *uaddr, u32 uval, u32 newval) { int ret; pagefault_disable(); ret = futex_atomic_cmpxchg_inatomic(curval, uaddr, uval, newval); pagefault_enable(); return ret; } Now the disassembly. At first from the Kernel package 3.2.23 which has been compiled with GCC 4.4, remeber this Kernel seemed to work: objdump -d linux-3.2.23/debian/build/build_ia64_none_mckinley/kernel/futex.o 0000000000000230 <cmpxchg_futex_value_locked>: 230: 0b 18 80 1b 18 21 [MMI] adds r3=3168,r13;; 236: 80 40 0d 00 42 00 adds r8=40,r3 23c: 00 00 04 00 nop.i 0x0;; 240: 0b 50 00 10 10 10 [MMI] ld4 r10=[r8];; 246: 90 08 28 00 42 00 adds r9=1,r10 24c: 00 00 04 00 nop.i 0x0;; 250: 09 00 00 00 01 00 [MMI] nop.m 0x0 256: 00 48 20 20 23 00 st4 [r8]=r9 25c: 00 00 04 00 nop.i 0x0;; 260: 08 10 80 06 00 21 [MMI] adds r2=32,r3 266: 00 00 00 02 00 00 nop.m 0x0 26c: 02 08 f1 52 extr.u r16=r33,0,61 270: 05 40 88 00 08 e0 [MLX] addp4 r8=r34,r0 276: ff ff 0f 00 00 e0 movl r15=0xfffffffbfff;; 27c: f1 f7 ff 65 280: 09 70 00 04 18 10 [MMI] ld8 r14=[r2] 286: 00 00 00 02 00 c0 nop.m 0x0 28c: f0 80 1c d0 cmp.ltu p6,p7=r15,r16;; 290: 08 40 fc 1d 09 3b [MMI] cmp.eq p8,p9=-1,r14 296: 00 00 00 02 00 40 nop.m 0x0 29c: e1 08 2d d0 cmp.ltu p10,p11=r14,r33 2a0: 56 01 10 00 40 10 [BBB] (p10) br.cond.spnt.few 2e0 <cmpxchg_futex_value_locked+0xb0> 2a6: 02 08 00 80 21 03 (p08) br.cond.dpnt.few 2b0 <cmpxchg_futex_value_locked+0x80> 2ac: 40 00 00 41 (p06) br.cond.spnt.few 2e0 <cmpxchg_futex_value_locked+0xb0> 2b0: 0a 00 00 00 22 00 [MMI] mf;; 2b6: 80 00 00 00 42 00 mov r8=r0 2bc: 00 00 04 00 nop.i 0x0 2c0: 0b 00 20 40 2a 04 [MMI] mov.m ar.ccv=r8;; 2c6: 10 1a 85 22 20 00 cmpxchg4.acq r33=[r33],r35,ar.ccv 2cc: 00 00 04 00 nop.i 0x0;; 2d0: 10 00 84 40 90 11 [MIB] st4 [r32]=r33 2d6: 00 00 00 02 00 00 nop.i 0x0 2dc: 20 00 00 40 br.few 2f0 <cmpxchg_futex_value_locked+0xc0> 2e0: 09 40 c8 f9 ff 27 [MMI] mov r8=-14 2e6: 00 00 00 02 00 00 nop.m 0x0 2ec: 00 00 04 00 nop.i 0x0;; 2f0: 0b 58 20 1a 19 21 [MMI] adds r11=3208,r13;; 2f6: 20 01 2c 20 20 00 ld4 r18=[r11] 2fc: 00 00 04 00 nop.i 0x0;; 300: 0b 88 fc 25 3f 23 [MMI] adds r17=-1,r18;; 306: 00 88 2c 20 23 00 st4 [r11]=r17 30c: 00 00 04 00 nop.i 0x0;; 310: 11 00 00 00 01 00 [MIB] nop.m 0x0 316: 00 00 00 02 00 80 nop.i 0x0 31c: 08 00 84 00 br.ret.sptk.many b0;; The lines 2b0: 0a 00 00 00 22 00 [MMI] mf;; 2b6: 80 00 00 00 42 00 mov r8=r0 2bc: 00 00 04 00 nop.i 0x0 2c0: 0b 00 20 40 2a 04 [MMI] mov.m ar.ccv=r8;; 2c6: 10 1a 85 22 20 00 cmpxchg4.acq r33=[r33],r35,ar.ccv 2cc: 00 00 04 00 nop.i 0x0;; are the instructions of the assembly block. The line 2b6: 80 00 00 00 42 00 mov r8=r0 sets the r8 register to 0 and after that 2c0: 0b 00 20 40 2a 04 [MMI] mov.m ar.ccv=r8;; prepares the 'oldvalue' for the cmpxchg but it takes it from r8. This is wrong. What happened here is what I explained above: An input register is overwritten which is still needed. The register operand constraints in futex.h are wrong. (The problem doesn't occur when the Kernel is compiled with GCC 4.6.) The attached patch fixes the register operand constraints in futex.h. The code after patching of it: static inline int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, u32 oldval, u32 newval) { if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))) return -EFAULT; { register unsigned long r8 __asm ("r8") = 0; unsigned long prev; __asm__ __volatile__( " mf;; \n" " mov ar.ccv=%4;; \n" "[1:] cmpxchg4.acq %1=[%2],%3,ar.ccv \n" " .xdata4 \"__ex_table\", 1b-., 2f-. \n" "[2:]" : "+r" (r8), "=&r" (prev) : "r" (uaddr), "r" (newval), "rO" ((long) (unsigned) oldval) : "memory"); *uval = prev; return r8; } } I also initialized the 'r8' var with the C programming language. The _asm qualifier on the definition of the 'r8' var forces GCC to use the r8 processor register for it. I don't believe that we should use inline assembly for zeroing out a local variable. The constraint is "+r" (r8) what means that it is both an input register and an output register. Note that the page fault handler will modify the r8 register which will be the return value of the function. The real fix is "=&r" (prev) The & means that GCC must not use any of the input registers to place this output register in. Patched the Kernel 3.2.23 and compiled it with GCC4.4: 0000000000000230 <cmpxchg_futex_value_locked>: 230: 0b 18 80 1b 18 21 [MMI] adds r3=3168,r13;; 236: 80 40 0d 00 42 00 adds r8=40,r3 23c: 00 00 04 00 nop.i 0x0;; 240: 0b 50 00 10 10 10 [MMI] ld4 r10=[r8];; 246: 90 08 28 00 42 00 adds r9=1,r10 24c: 00 00 04 00 nop.i 0x0;; 250: 09 00 00 00 01 00 [MMI] nop.m 0x0 256: 00 48 20 20 23 00 st4 [r8]=r9 25c: 00 00 04 00 nop.i 0x0;; 260: 08 10 80 06 00 21 [MMI] adds r2=32,r3 266: 20 12 01 10 40 00 addp4 r34=r34,r0 26c: 02 08 f1 52 extr.u r16=r33,0,61 270: 05 40 00 00 00 e1 [MLX] mov r8=r0 276: ff ff 0f 00 00 e0 movl r15=0xfffffffbfff;; 27c: f1 f7 ff 65 280: 09 70 00 04 18 10 [MMI] ld8 r14=[r2] 286: 00 00 00 02 00 c0 nop.m 0x0 28c: f0 80 1c d0 cmp.ltu p6,p7=r15,r16;; 290: 08 40 fc 1d 09 3b [MMI] cmp.eq p8,p9=-1,r14 296: 00 00 00 02 00 40 nop.m 0x0 29c: e1 08 2d d0 cmp.ltu p10,p11=r14,r33 2a0: 56 01 10 00 40 10 [BBB] (p10) br.cond.spnt.few 2e0 <cmpxchg_futex_value_locked+0xb0> 2a6: 02 08 00 80 21 03 (p08) br.cond.dpnt.few 2b0 <cmpxchg_futex_value_locked+0x80> 2ac: 40 00 00 41 (p06) br.cond.spnt.few 2e0 <cmpxchg_futex_value_locked+0xb0> 2b0: 0b 00 00 00 22 00 [MMI] mf;; 2b6: 00 10 81 54 08 00 mov.m ar.ccv=r34 2bc: 00 00 04 00 nop.i 0x0;; 2c0: 09 58 8c 42 11 10 [MMI] cmpxchg4.acq r11=[r33],r35,ar.ccv 2c6: 00 00 00 02 00 00 nop.m 0x0 2cc: 00 00 04 00 nop.i 0x0;; 2d0: 10 00 2c 40 90 11 [MIB] st4 [r32]=r11 2d6: 00 00 00 02 00 00 nop.i 0x0 2dc: 20 00 00 40 br.few 2f0 <cmpxchg_futex_value_locked+0xc0> 2e0: 09 40 c8 f9 ff 27 [MMI] mov r8=-14 2e6: 00 00 00 02 00 00 nop.m 0x0 2ec: 00 00 04 00 nop.i 0x0;; 2f0: 0b 88 20 1a 19 21 [MMI] adds r17=3208,r13;; 2f6: 30 01 44 20 20 00 ld4 r19=[r17] 2fc: 00 00 04 00 nop.i 0x0;; 300: 0b 90 fc 27 3f 23 [MMI] adds r18=-1,r19;; 306: 00 90 44 20 23 00 st4 [r17]=r18 30c: 00 00 04 00 nop.i 0x0;; 310: 11 00 00 00 01 00 [MIB] nop.m 0x0 316: 00 00 00 02 00 80 nop.i 0x0 31c: 08 00 84 00 br.ret.sptk.many b0;; Much better. There is a 270: 05 40 00 00 00 e1 [MLX] mov r8=r0 which was generated by C code r8 = 0. Below 2b6: 00 10 81 54 08 00 mov.m ar.ccv=r34 what means that oldval is no longer overwritten. This is Debian bug#702641 (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=702641). The patch is applicable on Kernel 3.9-rc1, 3.2.23 and many other versions. Signed-off-by: Stephan Schreiber <info@fs-driver.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-07Fix initialization of CMCI/CMCP interruptsTony Luck
commit d303e9e98fce56cdb3c6f2ac92f626fc2bd51c77 upstream. Back 2010 during a revamp of the irq code some initializations were moved from ia64_mca_init() to ia64_mca_late_init() in commit c75f2aa13f5b268aba369b5dc566088b5194377c Cannot use register_percpu_irq() from ia64_mca_init() But this was hideously wrong. First of all these initializations are now down far too late. Specifically after all the other cpus have been brought up and initialized their own CMC vectors from smp_callin(). Also ia64_mca_late_init() may be called from any cpu so the line: ia64_mca_cmc_vector_setup(); /* Setup vector on BSP */ is generally not executed on the BSP, and so the CMC vector isn't setup at all on that processor. Make use of the arch_early_irq_init() hook to get this code executed at just the right moment: not too early, not too late. Reported-by: Fred Hartnett <fred.hartnett@hp.com> Tested-by: Fred Hartnett <fred.hartnett@hp.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-07xen/time: Fix kasprintf splat when allocating timer%d IRQ line.Konrad Rzeszutek Wilk
commit 7918c92ae9638eb8a6ec18e2b4a0de84557cccc8 upstream. When we online the CPU, we get this splat: smpboot: Booting Node 0 Processor 1 APIC 0x2 installing Xen timer for CPU 1 BUG: sleeping function called from invalid context at /home/konrad/ssd/konrad/linux/mm/slab.c:3179 in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/1 Pid: 0, comm: swapper/1 Not tainted 3.9.0-rc6upstream-00001-g3884fad #1 Call Trace: [<ffffffff810c1fea>] __might_sleep+0xda/0x100 [<ffffffff81194617>] __kmalloc_track_caller+0x1e7/0x2c0 [<ffffffff81303758>] ? kasprintf+0x38/0x40 [<ffffffff813036eb>] kvasprintf+0x5b/0x90 [<ffffffff81303758>] kasprintf+0x38/0x40 [<ffffffff81044510>] xen_setup_timer+0x30/0xb0 [<ffffffff810445af>] xen_hvm_setup_cpu_clockevents+0x1f/0x30 [<ffffffff81666d0a>] start_secondary+0x19c/0x1a8 The solution to that is use kasprintf in the CPU hotplug path that 'online's the CPU. That is, do it in in xen_hvm_cpu_notify, and remove the call to in xen_hvm_setup_cpu_clockevents. Unfortunatly the later is not a good idea as the bootup path does not use xen_hvm_cpu_notify so we would end up never allocating timer%d interrupt lines when booting. As such add the check for atomic() to continue. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-07powerpc/spufs: Initialise inode->i_ino in spufs_new_inode()Michael Ellerman
commit 6747e83235caecd30b186d1282e4eba7679f81b7 upstream. In commit 85fe402 (fs: do not assign default i_ino in new_inode), the initialisation of i_ino was removed from new_inode() and pushed down into the callers. However spufs_new_inode() was not updated. This exhibits as no files appearing in /spu, because all our dirents have a zero inode, which readdir() seems to dislike. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-07powerpc: Add isync to copy_and_flushMichael Neuling
commit 29ce3c5073057991217916abc25628e906911757 upstream. In __after_prom_start we copy the kernel down to zero in two calls to copy_and_flush. After the first call (copy from 0 to copy_to_here:) we jump to the newly copied code soon after. Unfortunately there's no isync between the copy of this code and the jump to it. Hence it's possible that stale instructions could still be in the icache or pipeline before we branch to it. We've seen this on real machines and it's results in no console output after: calling quiesce... returning from prom_init The below adds an isync to ensure that the copy and flushing has completed before any branching to the new instructions occurs. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-01sparc32: support atomic64_tSam Ravnborg
commit aea1181b0bd0a09c54546399768f359d1e198e45 upstream, Needed to compile ext4 for sparc32 since commit 503f4bdcc078e7abee273a85ce322de81b18a224 There is no-one that really require atomic64_t support on sparc32. But several drivers fails to build without proper atomic64 support. And for an allyesconfig build for sparc32 this is annoying. Include the generic atomic64_t support for sparc32. This has a text footprint cost: $size vmlinux (before atomic64_t support) text data bss dec hex filename 3578860 134260 108781 3821901 3a514d vmlinux $size vmlinux (after atomic64_t support) text data bss dec hex filename 3579892 130684 108781 3819357 3a475d vmlinux text increase (3579892 - 3578860) = 1032 bytes data decreases - but I fail to explain why! I have rebuild twice to check my numbers. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andreas Larsson <andreas@gaisler.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-01sparc64: Fix race in TLB batch processing.David S. Miller
[ Commits f36391d2790d04993f48da6a45810033a2cdf847 and f0af97070acbad5d6a361f485828223a4faaa0ee upstream. ] As reported by Dave Kleikamp, when we emit cross calls to do batched TLB flush processing we have a race because we do not synchronize on the sibling cpus completing the cross call. So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.) and either flushes are missed or flushes will flush the wrong addresses. Fix this by using generic infrastructure to synchonize on the completion of the cross call. This first required getting the flush_tlb_pending() call out from switch_to() which operates with locks held and interrupts disabled. The problem is that smp_call_function_many() cannot be invoked with IRQs disabled and this is explicitly checked for with WARN_ON_ONCE(). We get the batch processing outside of locked IRQ disabled sections by using some ideas from the powerpc port. Namely, we only batch inside of arch_{enter,leave}_lazy_mmu_mode() calls. If we're not in such a region, we flush TLBs synchronously. 1) Get rid of xcall_flush_tlb_pending and per-cpu type implementations. 2) Do TLB batch cross calls instead via: smp_call_function_many() tlb_pending_func() __flush_tlb_pending() 3) Batch only in lazy mmu sequences: a) Add 'active' member to struct tlb_batch b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE c) Set 'active' in arch_enter_lazy_mmu_mode() d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode() e) Check 'active' in tlb_batch_add_one() and do a synchronous flush if it's clear. 4) Add infrastructure for synchronous TLB page flushes. a) Implement __flush_tlb_page and per-cpu variants, patch as needed. b) Likewise for xcall_flush_tlb_page. c) Implement smp_flush_tlb_page() to invoke the cross-call. d) Wire up global_flush_tlb_page() to the right routine based upon CONFIG_SMP 5) It turns out that singleton batches are very common, 2 out of every 3 batch flushes have only a single entry in them. The batch flush waiting is very expensive, both because of the poll on sibling cpu completeion, as well as because passing the tlb batch pointer to the sibling cpus invokes a shared memory dereference. Therefore, in flush_tlb_pending(), if there is only one entry in the batch perform a completely asynchronous global_flush_tlb_page() instead. Reported-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-25ARM: 7696/1: Fix kexec by setting outer_cache.inv_all for FeroceonIllia Ragozin
commit cd272d1ea71583170e95dde02c76166c7f9017e6 upstream. On Feroceon the L2 cache becomes non-coherent with the CPU when the L1 caches are disabled. Thus the L2 needs to be invalidated after both L1 caches are disabled. On kexec before the starting the code for relocation the kernel, the L1 caches are disabled in cpu_froc_fin (cpu_v7_proc_fin for Feroceon), but after L2 cache is never invalidated, because inv_all is not set in cache-feroceon-l2.c. So kernel relocation and decompression may has (and usually has) errors. Setting the function enables L2 invalidation and fixes the issue. Signed-off-by: Illia Ragozin <illia.ragozin@grapecom.com> Acked-by: Jason Cooper <jason@lakedaemon.net> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-25KVM: Allow cross page reads and writes from cached translations.Andrew Honig
commit 8f964525a121f2ff2df948dac908dcc65be21b5b upstream. This patch adds support for kvm_gfn_to_hva_cache_init functions for reads and writes that will cross a page. If the range falls within the same memslot, then this will be a fast operation. If the range is split between two memslots, then the slower kvm_read_guest and kvm_write_guest are used. Tested: Test against kvm_clock unit tests. Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-25KVM: x86: Convert MSR_KVM_SYSTEM_TIME to use gfn_to_hva_cache functions ↵Andy Honig
(CVE-2013-1797) commit 0b79459b482e85cb7426aa7da683a9f2c97aeae1 upstream. There is a potential use after free issue with the handling of MSR_KVM_SYSTEM_TIME. If the guest specifies a GPA in a movable or removable memory such as frame buffers then KVM might continue to write to that address even after it's removed via KVM_SET_USER_MEMORY_REGION. KVM pins the page in memory so it's unlikely to cause an issue, but if the user space component re-purposes the memory previously used for the guest, then the guest will be able to corrupt that memory. Tested: Tested against kvmclock unit test Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-25KVM: x86: fix for buffer overflow in handling of MSR_KVM_SYSTEM_TIME ↵Andy Honig
(CVE-2013-1796) commit c300aa64ddf57d9c5d9c898a64b36877345dd4a9 upstream. If the guest sets the GPA of the time_page so that the request to update the time straddles a page then KVM will write onto an incorrect page. The write is done byusing kmap atomic to get a pointer to the page for the time structure and then performing a memcpy to that page starting at an offset that the guest controls. Well behaved guests always provide a 32-byte aligned address, however a malicious guest could use this to corrupt host kernel memory. Tested: Tested against kvmclock unit test. Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-16x86, mm: Patch out arch_flush_lazy_mmu_mode() when running on bare metalBoris Ostrovsky
commit 511ba86e1d386f671084b5d0e6f110bb30b8eeb2 upstream. Invoking arch_flush_lazy_mmu_mode() results in calls to preempt_enable()/disable() which may have performance impact. Since lazy MMU is not used on bare metal we can patch away arch_flush_lazy_mmu_mode() so that it is never called in such environment. [ hpa: the previous patch "Fix vmalloc_fault oops during lazy MMU updates" may cause a minor performance regression on bare metal. This patch resolves that performance regression. It is somewhat unclear to me if this is a good -stable candidate. ] Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: http://lkml.kernel.org/r/1364045796-10720-2-git-send-email-konrad.wilk@oracle.com Tested-by: Josh Boyer <jwboyer@redhat.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-16x86, mm, paravirt: Fix vmalloc_fault oops during lazy MMU updatesSamu Kallio
commit 1160c2779b826c6f5c08e5cc542de58fd1f667d5 upstream. In paravirtualized x86_64 kernels, vmalloc_fault may cause an oops when lazy MMU updates are enabled, because set_pgd effects are being deferred. One instance of this problem is during process mm cleanup with memory cgroups enabled. The chain of events is as follows: - zap_pte_range enables lazy MMU updates - zap_pte_range eventually calls mem_cgroup_charge_statistics, which accesses the vmalloc'd mem_cgroup per-cpu stat area - vmalloc_fault is triggered which tries to sync the corresponding PGD entry with set_pgd, but the update is deferred - vmalloc_fault oopses due to a mismatch in the PUD entries The OOPs usually looks as so: ------------[ cut here ]------------ kernel BUG at arch/x86/mm/fault.c:396! invalid opcode: 0000 [#1] SMP .. snip .. CPU 1 Pid: 10866, comm: httpd Not tainted 3.6.10-4.fc18.x86_64 #1 RIP: e030:[<ffffffff816271bf>] [<ffffffff816271bf>] vmalloc_fault+0x11f/0x208 .. snip .. Call Trace: [<ffffffff81627759>] do_page_fault+0x399/0x4b0 [<ffffffff81004f4c>] ? xen_mc_extend_args+0xec/0x110 [<ffffffff81624065>] page_fault+0x25/0x30 [<ffffffff81184d03>] ? mem_cgroup_charge_statistics.isra.13+0x13/0x50 [<ffffffff81186f78>] __mem_cgroup_uncharge_common+0xd8/0x350 [<ffffffff8118aac7>] mem_cgroup_uncharge_page+0x57/0x60 [<ffffffff8115fbc0>] page_remove_rmap+0xe0/0x150 [<ffffffff8115311a>] ? vm_normal_page+0x1a/0x80 [<ffffffff81153e61>] unmap_single_vma+0x531/0x870 [<ffffffff81154962>] unmap_vmas+0x52/0xa0 [<ffffffff81007442>] ? pte_mfn_to_pfn+0x72/0x100 [<ffffffff8115c8f8>] exit_mmap+0x98/0x170 [<ffffffff810050d9>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e [<ffffffff81059ce3>] mmput+0x83/0xf0 [<ffffffff810624c4>] exit_mm+0x104/0x130 [<ffffffff8106264a>] do_exit+0x15a/0x8c0 [<ffffffff810630ff>] do_group_exit+0x3f/0xa0 [<ffffffff81063177>] sys_exit_group+0x17/0x20 [<ffffffff8162bae9>] system_call_fastpath+0x16/0x1b Calling arch_flush_lazy_mmu_mode immediately after set_pgd makes the changes visible to the consistency checks. RedHat-Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=914737 Tested-by: Josh Boyer <jwboyer@redhat.com> Reported-and-Tested-by: Krishna Raman <kraman@redhat.com> Signed-off-by: Samu Kallio <samu.kallio@aberdeencloud.com> Link: http://lkml.kernel.org/r/1364045796-10720-1-git-send-email-konrad.wilk@oracle.com Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-16x86-32, mm: Rip out x86_32 NUMA remapping codeDave Hansen
commit f03574f2d5b2d6229dcdf2d322848065f72953c7 upstream. [was already included in 3.0, but I missed the patch hunk for arch/x86/mm/numa_32.c - gregkh] This code was an optimization for 32-bit NUMA systems. It has probably been the cause of a number of subtle bugs over the years, although the conditions to excite them would have been hard to trigger. Essentially, we remap part of the kernel linear mapping area, and then sometimes part of that area gets freed back in to the bootmem allocator. If those pages get used by kernel data structures (say mem_map[] or a dentry), there's no big deal. But, if anyone ever tried to use the linear mapping for these pages _and_ cared about their physical address, bad things happen. For instance, say you passed __GFP_ZERO to the page allocator and then happened to get handed one of these pages, it zero the remapped page, but it would make a pte to the _old_ page. There are probably a hundred other ways that it could screw with things. We don't need to hang on to performance optimizations for these old boxes any more. All my 32-bit NUMA systems are long dead and buried, and I probably had access to more than most people. This code is causing real things to break today: https://lkml.org/lkml/2013/1/9/376 I looked in to actually fixing this, but it requires surgery to way too much brittle code, as well as stuff like per_cpu_ptr_to_phys(). [ hpa: Cc: this for -stable, since it is a memory corruption issue. However, an alternative is to simply mark NUMA as depends BROKEN rather than EXPERIMENTAL in the X86_32 subclause... ] Link: http://lkml.kernel.org/r/20130131005616.1C79F411@kernel.stglabs.ibm.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-12x86-32, mm: Rip out x86_32 NUMA remapping codeDave Hansen
commit f03574f2d5b2d6229dcdf2d322848065f72953c7 upstream. This code was an optimization for 32-bit NUMA systems. It has probably been the cause of a number of subtle bugs over the years, although the conditions to excite them would have been hard to trigger. Essentially, we remap part of the kernel linear mapping area, and then sometimes part of that area gets freed back in to the bootmem allocator. If those pages get used by kernel data structures (say mem_map[] or a dentry), there's no big deal. But, if anyone ever tried to use the linear mapping for these pages _and_ cared about their physical address, bad things happen. For instance, say you passed __GFP_ZERO to the page allocator and then happened to get handed one of these pages, it zero the remapped page, but it would make a pte to the _old_ page. There are probably a hundred other ways that it could screw with things. We don't need to hang on to performance optimizations for these old boxes any more. All my 32-bit NUMA systems are long dead and buried, and I probably had access to more than most people. This code is causing real things to break today: https://lkml.org/lkml/2013/1/9/376 I looked in to actually fixing this, but it requires surgery to way too much brittle code, as well as stuff like per_cpu_ptr_to_phys(). [ hpa: Cc: this for -stable, since it is a memory corruption issue. However, an alternative is to simply mark NUMA as depends BROKEN rather than EXPERIMENTAL in the X86_32 subclause... ] Link: http://lkml.kernel.org/r/20130131005616.1C79F411@kernel.stglabs.ibm.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-12powerpc: pSeries_lpar_hpte_remove fails from Adjunct partition being ↵Michael Wolf
performed before the ANDCOND test commit 9fb2640159f9d4f5a2a9d60e490482d4cbecafdb upstream. Some versions of pHyp will perform the adjunct partition test before the ANDCOND test. The result of this is that H_RESOURCE can be returned and cause the BUG_ON condition to occur. The HPTE is not removed. So add a check for H_RESOURCE, it is ok if this HPTE is not removed as pSeries_lpar_hpte_remove is looking for an HPTE to remove and not a specific HPTE to remove. So it is ok to just move on to the next slot and try again. Signed-off-by: Michael Wolf <mjw@linux.vnet.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-12alpha: Add irongate_io to PCI bus resourcesJay Estabrook
commit aa8b4be3ac049c8b1df2a87e4d1d902ccfc1f7a9 upstream. Fixes a NULL pointer dereference at boot on UP1500. Reviewed-and-Tested-by: Matt Turner <mattst88@gmail.com> Signed-off-by: Jay Estabrook <jay.estabrook@gmail.com> Signed-off-by: Matt Turner <mattst88@gmail.com> Signed-off-by: Michael Cree <mcree@orcon.net.nz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05iommu/amd: Make sure dma_ops are set for hotplug devicesJoerg Roedel
commit c2a2876e863356b092967ea62bebdb4dd663af80 upstream. There is a bug introduced with commit 27c2127 that causes devices which are hot unplugged and then hot-replugged to not have per-device dma_ops set. This causes these devices to not function correctly. Fixed with this patch. Reported-by: Andreas Degert <andreas.degert@googlemail.com> Signed-off-by: Joerg Roedel <joro@8bytes.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05KVM: x86: invalid opcode oops on SET_SREGS with OSXSAVE bit set (CVE-2012-4461)Petr Matousek
commit 6d1068b3a98519247d8ba4ec85cd40ac136dbdf9 upstream. On hosts without the XSAVE support unprivileged local user can trigger oops similar to the one below by setting X86_CR4_OSXSAVE bit in guest cr4 register using KVM_SET_SREGS ioctl and later issuing KVM_RUN ioctl. invalid opcode: 0000 [#2] SMP Modules linked in: tun ip6table_filter ip6_tables ebtable_nat ebtables ... Pid: 24935, comm: zoog_kvm_monito Tainted: G D 3.2.0-3-686-pae EIP: 0060:[<f8b9550c>] EFLAGS: 00210246 CPU: 0 EIP is at kvm_arch_vcpu_ioctl_run+0x92a/0xd13 [kvm] EAX: 00000001 EBX: 000f387e ECX: 00000000 EDX: 00000000 ESI: 00000000 EDI: 00000000 EBP: ef5a0060 ESP: d7c63e70 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process zoog_kvm_monito (pid: 24935, ti=d7c62000 task=ed84a0c0 task.ti=d7c62000) Stack: 00000001 f70a1200 f8b940a9 ef5a0060 00000000 00200202 f8769009 00000000 ef5a0060 000f387e eda5c020 8722f9c8 00015bae 00000000 ed84a0c0 ed84a0c0 c12bf02d 0000ae80 ef7f8740 fffffffb f359b740 ef5a0060 f8b85dc1 0000ae80 Call Trace: [<f8b940a9>] ? kvm_arch_vcpu_ioctl_set_sregs+0x2fe/0x308 [kvm] ... [<c12bfb44>] ? syscall_call+0x7/0xb Code: 89 e8 e8 14 ee ff ff ba 00 00 04 00 89 e8 e8 98 48 ff ff 85 c0 74 1e 83 7d 48 00 75 18 8b 85 08 07 00 00 31 c9 8b 95 0c 07 00 00 <0f> 01 d1 c7 45 48 01 00 00 00 c7 45 1c 01 00 00 00 0f ae f0 89 EIP: [<f8b9550c>] kvm_arch_vcpu_ioctl_run+0x92a/0xd13 [kvm] SS:ESP 0068:d7c63e70 QEMU first retrieves the supported features via KVM_GET_SUPPORTED_CPUID and then sets them later. So guest's X86_FEATURE_XSAVE should be masked out on hosts without X86_FEATURE_XSAVE, making kvm_set_cr4 with X86_CR4_OSXSAVE fail. Userspaces that allow specifying guest cpuid with X86_FEATURE_XSAVE even on hosts that do not support it, might be susceptible to this attack from inside the guest as well. Allow setting X86_CR4_OSXSAVE bit only if host has XSAVE support. Signed-off-by: Petr Matousek <pmatouse@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05KVM: Ensure all vcpus are consistent with in-kernel irqchip settingsAvi Kivity
commit 3e515705a1f46beb1c942bb8043c16f8ac7b1e9e upstream. If some vcpus are created before KVM_CREATE_IRQCHIP, then irqchip_in_kernel() and vcpu->arch.apic will be inconsistent, leading to potential NULL pointer dereferences. Fix by: - ensuring that no vcpus are installed when KVM_CREATE_IRQCHIP is called - ensuring that a vcpu has an apic if it is installed after KVM_CREATE_IRQCHIP This is somewhat long winded because vcpu->arch.apic is created without kvm->lock held. Based on earlier patch by Michael Ellerman. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05KVM: x86: Prevent starting PIT timers in the absence of irqchip supportJan Kiszka
commit 0924ab2cfa98b1ece26c033d696651fd62896c69 upstream. User space may create the PIT and forgets about setting up the irqchips. In that case, firing PIT IRQs will crash the host: BUG: unable to handle kernel NULL pointer dereference at 0000000000000128 IP: [<ffffffffa10f6280>] kvm_set_irq+0x30/0x170 [kvm] ... Call Trace: [<ffffffffa11228c1>] pit_do_work+0x51/0xd0 [kvm] [<ffffffff81071431>] process_one_work+0x111/0x4d0 [<ffffffff81071bb2>] worker_thread+0x152/0x340 [<ffffffff81075c8e>] kthread+0x7e/0x90 [<ffffffff815a4474>] kernel_thread_helper+0x4/0x10 Prevent this by checking the irqchip mode before starting a timer. We can't deny creating the PIT if the irqchips aren't set up yet as current user land expects this order to work. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05KVM: Clean up error handling during VCPU creationJan Kiszka
commit d780592b99d7d8a5ff905f6bacca519d4a342c76 upstream. So far kvm_arch_vcpu_setup is responsible for freeing the vcpu struct if it fails. Move this confusing resonsibility back into the hands of kvm_vm_ioctl_create_vcpu. Only kvm_arch_vcpu_setup of x86 is affected, all other archs cannot fail. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05signal: Define __ARCH_HAS_SA_RESTORER so we know whether to clear sa_restorerBen Hutchings
Vaguely based on upstream commit 574c4866e33d 'consolidate kernel-side struct sigaction declarations'. flush_signal_handlers() needs to know whether sigaction::sa_restorer is defined, not whether SA_RESTORER is defined. Define the __ARCH_HAS_SA_RESTORER macro to indicate this. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-28x86-64: Fix the failure case in copy_user_handle_tail()CQ Tang
commit 66db3feb486c01349f767b98ebb10b0c3d2d021b upstream. The increment of "to" in copy_user_handle_tail() will have incremented before a failure has been noted. This causes us to skip a byte in the failure case. Only do the increment when assured there is no failure. Signed-off-by: CQ Tang <cq.tang@intel.com> Link: http://lkml.kernel.org/r/20130318150221.8439.993.stgit@phlsvslse11.ph.intel.com Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-20s390/mm: fix flush_tlb_kernel_range()Heiko Carstens
commit f6a70a07079518280022286a1dceb797d12e1edf upstream. Our flush_tlb_kernel_range() implementation calls __tlb_flush_mm() with &init_mm as argument. __tlb_flush_mm() however will only flush tlbs for the passed in mm if its mm_cpumask is not empty. For the init_mm however its mm_cpumask has never any bits set. Which in turn means that our flush_tlb_kernel_range() implementation doesn't work at all. This can be easily verified with a vmalloc/vfree loop which allocates a page, writes to it and then frees the page again. A crash will follow almost instantly. To fix this remove the cpumask_empty() check in __tlb_flush_mm() since there shouldn't be too many mms with a zero mm_cpumask, besides the init_mm of course. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-20perf,x86: fix wrmsr_on_cpu() warning on suspend/resumeLinus Torvalds
commit 2a6e06b2aed6995af401dcd4feb5e79a0c7ea554 upstream. Commit 1d9d8639c063 ("perf,x86: fix kernel crash with PEBS/BTS after suspend/resume") fixed a crash when doing PEBS performance profiling after resuming, but in using init_debug_store_on_cpu() to restore the DS_AREA mtrr it also resulted in a new WARN_ON() triggering. init_debug_store_on_cpu() uses "wrmsr_on_cpu()", which in turn uses CPU cross-calls to do the MSR update. Which is not really valid at the early resume stage, and the warning is quite reasonable. Now, it all happens to _work_, for the simple reason that smp_call_function_single() ends up just doing the call directly on the CPU when the CPU number matches, but we really should just do the wrmsr() directly instead. This duplicates the wrmsr() logic, but hopefully we can just remove the wrmsr_on_cpu() version eventually. Reported-and-tested-by: Parag Warudkar <parag.lkml@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-20perf,x86: fix kernel crash with PEBS/BTS after suspend/resumeStephane Eranian
commit 1d9d8639c063caf6efc2447f5f26aa637f844ff6 upstream. This patch fixes a kernel crash when using precise sampling (PEBS) after a suspend/resume. Turns out the CPU notifier code is not invoked on CPU0 (BP). Therefore, the DS_AREA (used by PEBS) is not restored properly by the kernel and keeps it power-on/resume value of 0 causing any PEBS measurement to crash when running on CPU0. The workaround is to add a hook in the actual resume code to restore the DS Area MSR value. It is invoked for all CPUS. So for all but CPU0, the DS_AREA will be restored twice but this is harmless. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Shuah Khan <shuah.khan@hp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-20powerpc: Fix cputable entry for 970MP rev 1.0Benjamin Herrenschmidt
commit d63ac5f6cf31c8a83170a9509b350c1489a7262b upstream. Commit 44ae3ab3358e962039c36ad4ae461ae9fb29596c forgot to update the entry for the 970MP rev 1.0 processor when moving some CPU features bits to the MMU feature bit mask. This breaks booting on some rare G5 models using that chip revision. Reported-by: Phileas Fogg <phileas-fogg@mail.ru> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-14ARM: fix scheduling while atomic warning in alignment handling codeRussell King
commit b255188f90e2bade1bd11a986dd1ca4861869f4d upstream. Paolo Pisati reports that IPv6 triggers this warning: BUG: scheduling while atomic: swapper/0/0/0x40000100 Modules linked in: [<c001b1c4>] (unwind_backtrace+0x0/0xf0) from [<c0503c5c>] (__schedule_bug+0x48/0x5c) [<c0503c5c>] (__schedule_bug+0x48/0x5c) from [<c0508608>] (__schedule+0x700/0x740) [<c0508608>] (__schedule+0x700/0x740) from [<c007007c>] (__cond_resched+0x24/0x34) [<c007007c>] (__cond_resched+0x24/0x34) from [<c05086dc>] (_cond_resched+0x3c/0x44) [<c05086dc>] (_cond_resched+0x3c/0x44) from [<c0021f6c>] (do_alignment+0x178/0x78c) [<c0021f6c>] (do_alignment+0x178/0x78c) from [<c00083e0>] (do_DataAbort+0x34/0x98) [<c00083e0>] (do_DataAbort+0x34/0x98) from [<c0509a60>] (__dabt_svc+0x40/0x60) Exception stack(0xc0763d70 to 0xc0763db8) 3d60: e97e805e e97e806e 2c000000 11000000 3d80: ea86bb00 0000002c 00000011 e97e807e c076d2a8 e97e805e e97e806e 0000002c 3da0: 3d000000 c0763dbc c04b98fc c02a8490 00000113 ffffffff [<c0509a60>] (__dabt_svc+0x40/0x60) from [<c02a8490>] (__csum_ipv6_magic+0x8/0xc8) Fix this by using probe_kernel_address() stead of __get_user(). Reported-by: Paolo Pisati <p.pisati@gmail.com> Tested-by: Paolo Pisati <p.pisati@gmail.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-14ARM: VFP: fix emulation of second VFP instructionRussell King
commit 5e4ba617c1b584b2e376f31a63bd4e734109318a upstream. Martin Storsjö reports that the sequence: ee312ac1 vsub.f32 s4, s3, s2 ee702ac0 vsub.f32 s5, s1, s0 e59f0028 ldr r0, [pc, #40] ee111a90 vmov r1, s3 on Raspberry Pi (implementor 41 architecture 1 part 20 variant b rev 5) where s3 is a denormal and s2 is zero results in incorrect behaviour - the instruction "vsub.f32 s5, s1, s0" is not executed: VFP: bounce: trigger ee111a90 fpexc d0000780 VFP: emulate: INST=0xee312ac1 SCR=0x00000000 ... As we can see, the instruction triggering the exception is the "vmov" instruction, and we emulate the "vsub.f32 s4, s3, s2" but fail to properly take account of the FPEXC_FP2V flag in FPEXC. This is because the test for the second instruction register being valid is bogus, and will always skip emulation of the second instruction. Reported-by: Martin Storsjö <martin@martin.st> Tested-by: Martin Storsjö <martin@martin.st> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04s390/kvm: Fix store status for ACRS/FPRS fixJiri Slaby
In 3.0.67, commit 58c9ce6fad8e00d9726447f939fe7e78e2aec891 (s390/kvm: Fix store status for ACRS/FPRS), upstream commit 15bc8d8457875f495c59d933b05770ba88d1eacb, added a call to save_access_regs to save ACRS. But we do not have ARCS in kvm_run in 3.0 yet, so this results in: arch/s390/kvm/kvm-s390.c: In function 'kvm_s390_vcpu_store_status': arch/s390/kvm/kvm-s390.c:593: error: 'struct kvm_run' has no member named 's' Fix it by saving guest_acrs which is where ARCS are in 3.0. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04x86/apic: Work around boot failure on HP ProLiant DL980 G7 Server systemsStoney Wang
commit cb214ede7657db458fd0b2a25ea0b28dbf900ebc upstream. When a HP ProLiant DL980 G7 Server boots a regular kernel, there will be intermittent lost interrupts which could result in a hang or (in extreme cases) data loss. The reason is that this system only supports x2apic physical mode, while the kernel boots with a logical-cluster default setting. This bug can be worked around by specifying the "x2apic_phys" or "nox2apic" boot option, but we want to handle this system without requiring manual workarounds. The BIOS sets ACPI_FADT_APIC_PHYSICAL in FADT table. As all apicids are smaller than 255, BIOS need to pass the control to the OS with xapic mode, according to x2apic-spec, chapter 2.9. Current code handle x2apic when BIOS pass with xapic mode enabled: When user specifies x2apic_phys, or FADT indicates PHYSICAL: 1. During madt oem check, apic driver is set with xapic logical or xapic phys driver at first. 2. enable_IR_x2apic() will enable x2apic_mode. 3. if user specifies x2apic_phys on the boot line, x2apic_phys_probe() will install the correct x2apic phys driver and use x2apic phys mode. Otherwise it will skip the driver will let x2apic_cluster_probe to take over to install x2apic cluster driver (wrong one) even though FADT indicates PHYSICAL, because x2apic_phys_probe does not check FADT PHYSICAL. Add checking x2apic_fadt_phys in x2apic_phys_probe() to fix the problem. Signed-off-by: Stoney Wang <song-bo.wang@hp.com> [ updated the changelog and simplified the code ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Zhang Lin-Bao <Linbao.zhang@hp.com> [ make a patch specially for 3.0.66] Link: http://lkml.kernel.org/r/1360263182-16226-1-git-send-email-yinghai@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04x86: Do not leak kernel page mapping locationsKees Cook
commit e575a86fdc50d013bf3ad3aa81d9100e8e6cc60d upstream. Without this patch, it is trivial to determine kernel page mappings by examining the error code reported to dmesg[1]. Instead, declare the entire kernel memory space as a violation of a present page. Additionally, since show_unhandled_signals is enabled by default, switch branch hinting to the more realistic expectation, and unobfuscate the setting of the PF_PROT bit to improve readability. [1] http://vulnfactory.org/blog/2013/02/06/a-linux-memory-trick/ Reported-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> Suggested-by: Brad Spengler <spender@grsecurity.net> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20130207174413.GA12485@www.outflux.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: CAI Qian <caiqian@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04x86: Make sure we can boot in the case the BDA contains pure garbageH. Peter Anvin
commit 7c10093692ed2e6f318387d96b829320aa0ca64c upstream. On non-BIOS platforms it is possible that the BIOS data area contains garbage instead of being zeroed or something equivalent (firmware people: we are talking of 1.5K here, so please do the sane thing.) We need on the order of 20-30K of low memory in order to boot, which may grow up to < 64K in the future. We probably want to avoid the lowest of the low memory. At the same time, it seems extremely unlikely that a legitimate EBDA would ever reach down to the 128K (which would require it to be over half a megabyte in size.) Thus, pick 128K as the cutoff for "this is insane, ignore." We may still end up reserving a bunch of extra memory on the low megabyte, but that is not really a major issue these days. In the worst case we lose 512K of RAM. This code really should be merged with trim_bios_range() in arch/x86/kernel/setup.c, but that is a bigger patch for a later merge window. Reported-by: Darren Hart <dvhart@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/n/tip-oebml055yyfm8yxmria09rja@git.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04iommu/amd: Initialize device table after dma_opsJoerg Roedel
commit f528d980c17b8714aedc918ba86e058af914d66b upstream. When dma_ops are initialized the unity mappings are created. The init_device_table_dma() function makes sure DMA from all devices is blocked by default. This opens a short window in time where DMA to unity mapped regions is blocked by the IOMMU. Make sure this does not happen by initializing the device table after dma_ops. Signed-off-by: Joerg Roedel <joro@8bytes.org> Signed-off-by: Shuah Khan <shuah.khan@hp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28Purge existing TLB entries in set_pte_at and ptep_set_wrprotectJohn David Anglin
commit 7139bc1579901b53db7e898789e916ee2fb52d78 upstream. This patch goes a long way toward fixing the minifail bug, and it  significantly improves the stability of SMP machines such as the rp3440.  When write  protecting a page for COW, we need to purge the existing translation.  Otherwise, the COW break doesn't occur as expected because the TLB may still have a stale entry which allows writes. [jejb: fix up checkpatch errors] Signed-off-by: John David Anglin <dave.anglin@bell.net> Signed-off-by: James Bottomley <JBottomley@Parallels.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28powerpc/kexec: Disable hard IRQ before kexecPhileas Fogg
commit 8520e443aa56cc157b015205ea53e7b9fc831291 upstream. Disable hard IRQ before kexec a new kernel image. Not doing it can result in corrupted data in the memory segments reserved for the new kernel. Signed-off-by: Phileas Fogg <phileas-fogg@mail.ru> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28ARM: PXA3xx: program the CSMSADRCFG registerIgor Grinberg
commit d107a204154ddd79339203c2deeb7433f0cf6777 upstream. The Chip Select Configuration Register must be programmed to 0x2 in order to achieve the correct behavior of the Static Memory Controller. Without this patch devices wired to DFI and accessed through SMC cannot be accessed after resume from S2. Do not rely on the boot loader to program the CSMSADRCFG register by programming it in the kernel smemc module. Signed-off-by: Igor Grinberg <grinberg@compulab.co.il> Acked-by: Eric Miao <eric.y.miao@gmail.com> Signed-off-by: Haojian Zhuang <haojian.zhuang@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28s390/kvm: Fix store status for ACRS/FPRSChristian Borntraeger
commit 15bc8d8457875f495c59d933b05770ba88d1eacb upstream. On store status we need to copy the current state of registers into a save area. Currently we might save stale versions: The sie state descriptor doesnt have fields for guest ACRS,FPRS, those registers are simply stored in the host registers. The host program must copy these away if needed. We do that in vcpu_put/load. If we now do a store status in KVM code between vcpu_put/load, the saved values are not up-to-date. Lets collect the ACRS/FPRS before saving them. This also fixes some strange problems with hotplug and virtio-ccw, since the low level machine check handler (on hotplug a machine check will happen) will revalidate all registers with the content of the save area. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28xen: Send spinlock IPI to all waitersStefan Bader
commit 76eaca031f0af2bb303e405986f637811956a422 upstream. There is a loophole between Xen's current implementation of pv-spinlocks and the scheduler. This was triggerable through a testcase until v3.6 changed the TLB flushing code. The problem potentially is still there just not observable in the same way. What could happen was (is): 1. CPU n tries to schedule task x away and goes into a slow wait for the runq lock of CPU n-# (must be one with a lower number). 2. CPU n-#, while processing softirqs, tries to balance domains and goes into a slow wait for its own runq lock (for updating some records). Since this is a spin_lock_irqsave in softirq context, interrupts will be re-enabled for the duration of the poll_irq hypercall used by Xen. 3. Before the runq lock of CPU n-# is unlocked, CPU n-1 receives an interrupt (e.g. endio) and when processing the interrupt, tries to wake up task x. But that is in schedule and still on_cpu, so try_to_wake_up goes into a tight loop. 4. The runq lock of CPU n-# gets unlocked, but the message only gets sent to the first waiter, which is CPU n-# and that is busily stuck. 5. CPU n-# never returns from the nested interruption to take and release the lock because the scheduler uses a busy wait. And CPU n never finishes the task migration because the unlock notification only went to CPU n-#. To avoid this and since the unlocking code has no real sense of which waiter is best suited to grab the lock, just send the IPI to all of them. This causes the waiters to return from the hyper- call (those not interrupted at least) and do active spinlocking. BugLink: http://bugs.launchpad.net/bugs/1011792 Acked-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28x86-32, mm: Remove reference to resume_map_numa_kva()H. Peter Anvin
commit bb112aec5ee41427e9b9726e3d57b896709598ed upstream. Remove reference to removed function resume_map_numa_kva(). Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20130131005616.1C79F411@kernel.stglabs.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-17x86/xen: don't assume %ds is usable in xen_iret for 32-bit PVOPS.Jan Beulich
commit 13d2b4d11d69a92574a55bfd985cfb0ca77aebdc upstream. This fixes CVE-2013-0228 / XSA-42 Drew Jones while working on CVE-2013-0190 found that that unprivileged guest user in 32bit PV guest can use to crash the > guest with the panic like this: ------------- general protection fault: 0000 [#1] SMP last sysfs file: /sys/devices/vbd-51712/block/xvda/dev Modules linked in: sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xen_netfront ext4 mbcache jbd2 xen_blkfront dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 1250, comm: r Not tainted 2.6.32-356.el6.i686 #1 EIP: 0061:[<c0407462>] EFLAGS: 00010086 CPU: 0 EIP is at xen_iret+0x12/0x2b EAX: eb8d0000 EBX: 00000001 ECX: 08049860 EDX: 00000010 ESI: 00000000 EDI: 003d0f00 EBP: b77f8388 ESP: eb8d1fe0 DS: 0000 ES: 007b FS: 0000 GS: 00e0 SS: 0069 Process r (pid: 1250, ti=eb8d0000 task=c2953550 task.ti=eb8d0000) Stack: 00000000 0027f416 00000073 00000206 b77f8364 0000007b 00000000 00000000 Call Trace: Code: c3 8b 44 24 18 81 4c 24 38 00 02 00 00 8d 64 24 30 e9 03 00 00 00 8d 76 00 f7 44 24 08 00 00 02 80 75 33 50 b8 00 e0 ff ff 21 e0 <8b> 40 10 8b 04 85 a0 f6 ab c0 8b 80 0c b0 b3 c0 f6 44 24 0d 02 EIP: [<c0407462>] xen_iret+0x12/0x2b SS:ESP 0069:eb8d1fe0 general protection fault: 0000 [#2] ---[ end trace ab0d29a492dcd330 ]--- Kernel panic - not syncing: Fatal exception Pid: 1250, comm: r Tainted: G D --------------- 2.6.32-356.el6.i686 #1 Call Trace: [<c08476df>] ? panic+0x6e/0x122 [<c084b63c>] ? oops_end+0xbc/0xd0 [<c084b260>] ? do_general_protection+0x0/0x210 [<c084a9b7>] ? error_code+0x73/ ------------- Petr says: " I've analysed the bug and I think that xen_iret() cannot cope with mangled DS, in this case zeroed out (null selector/descriptor) by either xen_failsafe_callback() or RESTORE_REGS because the corresponding LDT entry was invalidated by the reproducer. " Jan took a look at the preliminary patch and came up a fix that solves this problem: "This code gets called after all registers other than those handled by IRET got already restored, hence a null selector in %ds or a non-null one that got loaded from a code or read-only data descriptor would cause a kernel mode fault (with the potential of crashing the kernel as a whole, if panic_on_oops is set)." The way to fix this is to realize that the we can only relay on the registers that IRET restores. The two that are guaranteed are the %cs and %ss as they are always fixed GDT selectors. Also they are inaccessible from user mode - so they cannot be altered. This is the approach taken in this patch. Another alternative option suggested by Jan would be to relay on the subtle realization that using the %ebp or %esp relative references uses the %ss segment. In which case we could switch from using %eax to %ebp and would not need the %ss over-rides. That would also require one extra instruction to compensate for the one place where the register is used as scaled index. However Andrew pointed out that is too subtle and if further work was to be done in this code-path it could escape folks attention and lead to accidents. Reviewed-by: Petr Matousek <pmatouse@redhat.com> Reported-by: Petr Matousek <pmatouse@redhat.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-17x86/mm: Check if PUD is large when validating a kernel addressMel Gorman
commit 0ee364eb316348ddf3e0dfcd986f5f13f528f821 upstream. A user reported the following oops when a backup process reads /proc/kcore: BUG: unable to handle kernel paging request at ffffbb00ff33b000 IP: [<ffffffff8103157e>] kern_addr_valid+0xbe/0x110 [...] Call Trace: [<ffffffff811b8aaa>] read_kcore+0x17a/0x370 [<ffffffff811ad847>] proc_reg_read+0x77/0xc0 [<ffffffff81151687>] vfs_read+0xc7/0x130 [<ffffffff811517f3>] sys_read+0x53/0xa0 [<ffffffff81449692>] system_call_fastpath+0x16/0x1b Investigation determined that the bug triggered when reading system RAM at the 4G mark. On this system, that was the first address using 1G pages for the virt->phys direct mapping so the PUD is pointing to a physical address, not a PMD page. The problem is that the page table walker in kern_addr_valid() is not checking pud_large() and treats the physical address as if it was a PMD. If it happens to look like pmd_none then it'll silently fail, probably returning zeros instead of real data. If the data happens to look like a present PMD though, it will be walked resulting in the oops above. This patch adds the necessary pud_large() check. Unfortunately the problem was not readily reproducible and now they are running the backup program without accessing /proc/kcore so the patch has not been validated but I think it makes sense. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.coM> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20130211145236.GX21389@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-11x86-64: Replace left over sti/cli in ia32 audit exit codeJan Beulich
commit 40a1ef95da85843696fc3ebe5fce39b0db32669f upstream. For some reason they didn't get replaced so far by their paravirt equivalents, resulting in code to be run with interrupts disabled that doesn't expect so (causing, in the observed case, a BUG_ON() to trigger) when syscall auditing is enabled. David (Cc-ed) came up with an identical fix, so likely this can be taken to count as an ack from him. Reported-by: Peter Moody <pmoody@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/5108E01902000078000BA9C5@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Tested-by: Peter Moody <pmoody@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-03x86/Sandy Bridge: Sandy Bridge workaround depends on CONFIG_PCIH. Peter Anvin
commit e43b3cec711a61edf047adf6204d542f3a659ef8 upstream. early_pci_allowed() and read_pci_config_16() are only available if CONFIG_PCI is defined. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Abdallah Chatila <abdallah.chatila@ericsson.com>
2013-02-03efi, x86: Pass a proper identity mapping in efi_call_phys_prelogNathan Zimmer
commit b8f2c21db390273c3eaf0e5308faeaeb1e233840 upstream. Update efi_call_phys_prelog to install an identity mapping of all available memory. This corrects a bug on very large systems with more then 512 GB in which bios would not be able to access addresses above not in the mapping. The result is a crash that looks much like this. BUG: unable to handle kernel paging request at 000000effd870020 IP: [<0000000078bce331>] 0x78bce330 PGD 0 Oops: 0000 [#1] SMP Modules linked in: CPU 0 Pid: 0, comm: swapper/0 Tainted: G W 3.8.0-rc1-next-20121224-medusa_ntz+ #2 Intel Corp. Stoutland Platform RIP: 0010:[<0000000078bce331>] [<0000000078bce331>] 0x78bce330 RSP: 0000:ffffffff81601d28 EFLAGS: 00010006 RAX: 0000000078b80e18 RBX: 0000000000000004 RCX: 0000000000000004 RDX: 0000000078bcf958 RSI: 0000000000002400 RDI: 8000000000000000 RBP: 0000000078bcf760 R08: 000000effd870000 R09: 0000000000000000 R10: 0000000000000000 R11: 00000000000000c3 R12: 0000000000000030 R13: 000000effd870000 R14: 0000000000000000 R15: ffff88effd870000 FS: 0000000000000000(0000) GS:ffff88effe400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000effd870020 CR3: 000000000160c000 CR4: 00000000000006b0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper/0 (pid: 0, threadinfo ffffffff81600000, task ffffffff81614400) Stack: 0000000078b80d18 0000000000000004 0000000078bced7b ffff880078b81fff 0000000000000000 0000000000000082 0000000078bce3a8 0000000000002400 0000000060000202 0000000078b80da0 0000000078bce45d ffffffff8107cb5a Call Trace: [<ffffffff8107cb5a>] ? on_each_cpu+0x77/0x83 [<ffffffff8102f4eb>] ? change_page_attr_set_clr+0x32f/0x3ed [<ffffffff81035946>] ? efi_call4+0x46/0x80 [<ffffffff816c5abb>] ? efi_enter_virtual_mode+0x1f5/0x305 [<ffffffff816aeb24>] ? start_kernel+0x34a/0x3d2 [<ffffffff816ae5ed>] ? repair_env_string+0x60/0x60 [<ffffffff816ae2be>] ? x86_64_start_reservations+0xba/0xc1 [<ffffffff816ae120>] ? early_idt_handlers+0x120/0x120 [<ffffffff816ae419>] ? x86_64_start_kernel+0x154/0x163 Code: Bad RIP value. RIP [<0000000078bce331>] 0x78bce330 RSP <ffffffff81601d28> CR2: 000000effd870020 ---[ end trace ead828934fef5eab ]--- Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-03x86/msr: Add capabilities checkAlan Cox
commit c903f0456bc69176912dee6dd25c6a66ee1aed00 upstream. At the moment the MSR driver only relies upon file system checks. This means that anything as root with any capability set can write to MSRs. Historically that wasn't very interesting but on modern processors the MSRs are such that writing to them provides several ways to execute arbitary code in kernel space. Sample code and documentation on doing this is circulating and MSR attacks are used on Windows 64bit rootkits already. In the Linux case you still need to be able to open the device file so the impact is fairly limited and reduces the security of some capability and security model based systems down towards that of a generic "root owns the box" setup. Therefore they should require CAP_SYS_RAWIO to prevent an elevation of capabilities. The impact of this is fairly minimal on most setups because they don't have heavy use of capabilities. Those using SELinux, SMACK or AppArmor rules might want to consider if their rulesets on the MSR driver could be tighter. Signed-off-by: Alan Cox <alan@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>