summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2017-09-22Merge tag 'v4.1.44' into 4.1-2.0.x-imxOtavio Salvador
Linux 4.1.44 * tag 'v4.1.44': (180 commits) Linux 4.1.44 mtd: fsl-quadspi: fix macro collision problems with READ/WRITE pinctrl: samsung: Remove bogus irq_[un]mask from resource management pinctrl: sunxi: add a missing function of A10/A20 pinctrl driver pnfs/blocklayout: require 64-bit sector_t iio: adc: vf610_adc: Fix VALT selection value for REFSEL bits usb:xhci:Add quirk for Certain failing HP keyboard on reset after resume usb: quirks: Add no-lpm quirk for Moshi USB to Ethernet Adapter USB: Check for dropped connection before switching to full speed uas: Add US_FL_IGNORE_RESIDUE for Initio Corporation INIC-3069 iio: light: tsl2563: use correct event code staging:iio:resolver:ad2s1210 fix negative IIO_ANGL_VEL read USB: hcd: Mark secondary HCD as dead if the primary one died USB: serial: pl2303: add new ATEN device id USB: serial: cp210x: add support for Qivicon USB ZigBee dongle USB: serial: option: add D-Link DWM-222 device ID nfs/flexfiles: fix leak of nfs4_ff_ds_version arrays fuse: initialize the flock flag in fuse_file on allocation iscsi-target: Fix iscsi_np reset hung task during parallel delete iscsi-target: fix memory leak in iscsit_setup_text_cmd() ... Signed-off-by: Otavio Salvador <otavio@ossystems.com.br>
2017-09-10x86/boot: Add missing declaration of string functionsNicholas Mc Guire
[ Upstream commit fac69d0efad08fc15e4dbfc116830782acc0dc9a ] Add the missing declarations of basic string functions to string.h to allow a clean build. Fixes: 5be865661516 ("String-handling functions for the new x86 setup code.") Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Link: http://lkml.kernel.org/r/1483781911-21399-1-git-send-email-hofrat@osadl.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-09-10KVM: async_pf: make rcu irq exit if not triggered from idle taskWanpeng Li
[ Upstream commit 337c017ccdf2653d0040099433fc1a2b1beb5926 ] WARNING: CPU: 5 PID: 1242 at kernel/rcu/tree_plugin.h:323 rcu_note_context_switch+0x207/0x6b0 CPU: 5 PID: 1242 Comm: unity-settings- Not tainted 4.13.0-rc2+ #1 RIP: 0010:rcu_note_context_switch+0x207/0x6b0 Call Trace: __schedule+0xda/0xba0 ? kvm_async_pf_task_wait+0x1b2/0x270 schedule+0x40/0x90 kvm_async_pf_task_wait+0x1cc/0x270 ? prepare_to_swait+0x22/0x70 do_async_page_fault+0x77/0xb0 ? do_async_page_fault+0x77/0xb0 async_page_fault+0x28/0x30 RIP: 0010:__d_lookup_rcu+0x90/0x1e0 I encounter this when trying to stress the async page fault in L1 guest w/ L2 guests running. Commit 9b132fbe5419 (Add rcu user eqs exception hooks for async page fault) adds rcu_irq_enter/exit() to kvm_async_pf_task_wait() to exit cpu idle eqs when needed, to protect the code that needs use rcu. However, we need to call the pair even if the function calls schedule(), as seen from the above backtrace. This patch fixes it by informing the RCU subsystem exit/enter the irq towards/away from idle for both n.halted and !n.halted. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-09-10x86/mce/AMD: Make the init code more robustThomas Gleixner
[ Upstream commit 0dad3a3014a0b9e72521ff44f17e0054f43dcdea ] If mce_device_init() fails then the mce device pointer is NULL and the AMD mce code happily dereferences it. Add a sanity check. Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-09-10x86/acpi: Prevent out of bound access caused by broken ACPI tablesSeunghun Han
[ Upstream commit dad5ab0db8deac535d03e3fe3d8f2892173fa6a4 ] The bus_irq argument of mp_override_legacy_irq() is used as the index into the isa_irq_to_gsi[] array. The bus_irq argument originates from ACPI_MADT_TYPE_IO_APIC and ACPI_MADT_TYPE_INTERRUPT items in the ACPI tables, but is nowhere sanity checked. That allows broken or malicious ACPI tables to overwrite memory, which might cause malfunction, panic or arbitrary code execution. Add a sanity check and emit a warning when that triggers. [ tglx: Added warning and rewrote changelog ] Signed-off-by: Seunghun Han <kkamagui@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: security@kernel.org Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: stable@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-09-10x86/xen: allow userspace access during hypercallsMarek Marczykowski-Górecki
[ Upstream commit c54590cac51db8ab5fd30156bdaba34af915e629 ] Userspace application can do a hypercall through /dev/xen/privcmd, and some for some hypercalls argument is a pointers to user-provided structure. When SMAP is supported and enabled, hypervisor can't access. So, lets allow it. The same applies to HYPERVISOR_dm_op, where additionally privcmd driver carefully verify buffer addresses. Cc: stable@vger.kernel.org Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-08-06Merge tag 'v4.1.43' into 4.1-2.0.x-imxOtavio Salvador
Linux 4.1.43 * tag 'v4.1.43': (182 commits) Linux 4.1.43 HID: core: prevent out-of-bound readings ipvs: SNAT packet replies only for NATed connections Revert "dmaengine: ep93xx: Don't drain the transfers in terminate_all()" staging: comedi: ni_mio_common: fix E series ni_ai_insn_read() data kvm: vmx: Do not disable intercepts for BNDCFGS tracing: Use SOFTIRQ_OFFSET for softirq dectection for more accurate results PM / QoS: return -EINVAL for bogus strings sched/topology: Optimize build_group_mask() sched/topology: Fix overlapping sched_group_mask crypto: caam - fix signals handling crypto: atmel - only treat EBUSY as transient if backlog crypto: talitos - Extend max key length for SHA384/512-HMAC and AEAD Add "shutdown" to "struct class". mnt: Make propagate_umount less slow for overlapping mount propagation trees mnt: In propgate_umount handle visiting mounts in any order mnt: In umount propagation reparent in a separate pass vt: fix unchecked __put_user() in tioclinux ioctls exec: Limit arg stack to at most 75% of _STK_LIM s390: reduce ELF_ET_DYN_BASE ...
2017-07-31kvm: vmx: Do not disable intercepts for BNDCFGSJim Mattson
[ Upstream commit a8b6fda38f80e75afa3b125c9e7f2550b579454b ] The MSR permission bitmaps are shared by all VMs. However, some VMs may not be configured to support MPX, even when the host does. If the host supports VMX and the guest does not, we should intercept accesses to the BNDCFGS MSR, so that we can synthesize a #GP fault. Furthermore, if the host does not support MPX and the "ignore_msrs" kvm kernel parameter is set, then we should intercept accesses to the BNDCFGS MSR, so that we can skip over the rdmsr/wrmsr without raising a #GP fault. Fixes: da8999d31818fdc8 ("KVM: x86: Intel MPX vmx and msr handle") Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31binfmt_elf: use ELF_ET_DYN_BASE only for PIEKees Cook
[ Upstream commit eab09532d40090698b05a07c1c87f39fdbc5fab5 ] The ELF_ET_DYN_BASE position was originally intended to keep loaders away from ET_EXEC binaries. (For example, running "/lib/ld-linux.so.2 /bin/cat" might cause the subsequent load of /bin/cat into where the loader had been loaded.) With the advent of PIE (ET_DYN binaries with an INTERP Program Header), ELF_ET_DYN_BASE continued to be used since the kernel was only looking at ET_DYN. However, since ELF_ET_DYN_BASE is traditionally set at the top 1/3rd of the TASK_SIZE, a substantial portion of the address space is unused. For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs are loaded above the mmap region. This means they can be made to collide (CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with pathological stack regions. Lowering ELF_ET_DYN_BASE solves both by moving programs below the mmap region in all cases, and will now additionally avoid programs falling back to the mmap region by enforcing MAP_FIXED for program loads (i.e. if it would have collided with the stack, now it will fail to load instead of falling back to the mmap region). To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP) are loaded into the mmap region, leaving space available for either an ET_EXEC binary with a fixed location or PIE being loaded into mmap by the loader. Only PIE programs are loaded offset from ELF_ET_DYN_BASE, which means architectures can now safely lower their values without risk of loaders colliding with their subsequently loaded programs. For 64-bit, ELF_ET_DYN_BASE is best set to 4GB to allow runtimes to use the entire 32-bit address space for 32-bit pointers. Thanks to PaX Team, Daniel Micay, and Rik van Riel for inspiration and suggestions on how to implement this solution. Fixes: d1fd836dcf00 ("mm: split ET_DYN ASLR from mmap ASLR") Link: http://lkml.kernel.org/r/20170621173201.GA114489@beast Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Qualys Security Advisory <qsa@qualys.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Pratyush Anand <panand@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31x86/uaccess: Optimize copy_user_enhanced_fast_string() for short stringsPaolo Abeni
[ Upstream commit 236222d39347e0e486010f10c1493e83dbbdfba8 ] According to the Intel datasheet, the REP MOVSB instruction exposes a pretty heavy setup cost (50 ticks), which hurts short string copy operations. This change tries to avoid this cost by calling the explicit loop available in the unrolled code for strings shorter than 64 bytes. The 64 bytes cutoff value is arbitrary from the code logic point of view - it has been selected based on measurements, as the largest value that still ensures a measurable gain. Micro benchmarks of the __copy_from_user() function with lengths in the [0-63] range show this performance gain (shorter the string, larger the gain): - in the [55%-4%] range on Intel Xeon(R) CPU E5-2690 v4 - in the [72%-9%] range on Intel Core i7-4810MQ Other tested CPUs - namely Intel Atom S1260 and AMD Opteron 8216 - show no difference, because they do not expose the ERMS feature bit. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/4533a1d101fd460f80e21329a34928fad521c1d4.1498744345.git.pabeni@redhat.com [ Clarified the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31x86/tools: Fix gcc-7 warning in relocs.cMarkus Trippelsdorf
[ Upstream commit 7ebb916782949621ff6819acf373a06902df7679 ] gcc-7 warns: In file included from arch/x86/tools/relocs_64.c:17:0: arch/x86/tools/relocs.c: In function ‘process_64’: arch/x86/tools/relocs.c:953:2: warning: argument 1 null where non-null expected [-Wnonnull] qsort(r->offset, r->count, sizeof(r->offset[0]), cmp_relocs); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In file included from arch/x86/tools/relocs.h:6:0, from arch/x86/tools/relocs_64.c:1: /usr/include/stdlib.h:741:13: note: in a call to function ‘qsort’ declared here extern void qsort This happens because relocs16 is not used for ELF_BITS == 64, so there is no point in trying to sort it. Make the sort_relocs(&relocs16) call 32bit only. Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de> Link: http://lkml.kernel.org/r/20161215124513.GA289@x4 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31KVM: nVMX: Fix exception injectionWanpeng Li
[ Upstream commit d4912215d1031e4fb3d1038d2e1857218dba0d0a ] WARNING: CPU: 3 PID: 2840 at arch/x86/kvm/vmx.c:10966 nested_vmx_vmexit+0xdcd/0xde0 [kvm_intel] CPU: 3 PID: 2840 Comm: qemu-system-x86 Tainted: G OE 4.12.0-rc3+ #23 RIP: 0010:nested_vmx_vmexit+0xdcd/0xde0 [kvm_intel] Call Trace: ? kvm_check_async_pf_completion+0xef/0x120 [kvm] ? rcu_read_lock_sched_held+0x79/0x80 vmx_queue_exception+0x104/0x160 [kvm_intel] ? vmx_queue_exception+0x104/0x160 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x1171/0x1ce0 [kvm] ? kvm_arch_vcpu_load+0x47/0x240 [kvm] ? kvm_arch_vcpu_load+0x62/0x240 [kvm] kvm_vcpu_ioctl+0x384/0x7b0 [kvm] ? kvm_vcpu_ioctl+0x384/0x7b0 [kvm] ? __fget+0xf3/0x210 do_vfs_ioctl+0xa4/0x700 ? __fget+0x114/0x210 SyS_ioctl+0x79/0x90 do_syscall_64+0x81/0x220 entry_SYSCALL64_slow_path+0x25/0x25 This is triggered occasionally by running both win7 and win2016 in L2, in addition, EPT is disabled on both L1 and L2. It can't be reproduced easily. Commit 0b6ac343fc (KVM: nVMX: Correct handling of exception injection) mentioned that "KVM wants to inject page-faults which it got to the guest. This function assumes it is called with the exit reason in vmcs02 being a #PF exception". Commit e011c663 (KVM: nVMX: Check all exceptions for intercept during delivery to L2) allows to check all exceptions for intercept during delivery to L2. However, there is no guarantee the exit reason is exception currently, when there is an external interrupt occurred on host, maybe a time interrupt for host which should not be injected to guest, and somewhere queues an exception, then the function nested_vmx_check_exception() will be called and the vmexit emulation codes will try to emulate the "Acknowledge interrupt on exit" behavior, the warning is triggered. Reusing the exit reason from the L2->L0 vmexit is wrong in this case, the reason must always be EXCEPTION_NMI when injecting an exception into L1 as a nested vmexit. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Fixes: e011c663b9c7 ("KVM: nVMX: Check all exceptions for intercept during delivery to L2") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31KVM: x86: zero base3 of unusable segmentsRadim Krčmář
[ Upstream commit f0367ee1d64d27fa08be2407df5c125442e885e3 ] Static checker noticed that base3 could be used uninitialized if the segment was not present (useable). Random stack values probably would not pass VMCS entry checks. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 1aa366163b8b ("KVM: x86 emulator: consolidate segment accessors") Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31x86/mpx: Use compatible types in comparison to fix sparse errorTobias Klauser
[ Upstream commit 453828625731d0ba7218242ef6ec88f59408f368 ] info->si_addr is of type void __user *, so it should be compared against something from the same address space. This fixes the following sparse error: arch/x86/mm/mpx.c:296:27: error: incompatible types in comparison expression (different address spaces) Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31KVM: x86: fix fixing of hypercallsDmitry Vyukov
[ Upstream commit ce2e852ecc9a42e4b8dabb46025cfef63209234a ] emulator_fix_hypercall() replaces hypercall with vmcall instruction, but it does not handle GP exception properly when writes the new instruction. It can return X86EMUL_PROPAGATE_FAULT without setting exception information. This leads to incorrect emulation and triggers WARN_ON(ctxt->exception.vector > 0x1f) in x86_emulate_insn() as discovered by syzkaller fuzzer: WARNING: CPU: 2 PID: 18646 at arch/x86/kvm/emulate.c:5558 Call Trace: warn_slowpath_null+0x2c/0x40 kernel/panic.c:582 x86_emulate_insn+0x16a5/0x4090 arch/x86/kvm/emulate.c:5572 x86_emulate_instruction+0x403/0x1cc0 arch/x86/kvm/x86.c:5618 emulate_instruction arch/x86/include/asm/kvm_host.h:1127 [inline] handle_exception+0x594/0xfd0 arch/x86/kvm/vmx.c:5762 vmx_handle_exit+0x2b7/0x38b0 arch/x86/kvm/vmx.c:8625 vcpu_enter_guest arch/x86/kvm/x86.c:6888 [inline] vcpu_run arch/x86/kvm/x86.c:6947 [inline] Set exception information when write in emulator_fix_hypercall() fails. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Cc: kvm@vger.kernel.org Cc: syzkaller@googlegroups.com Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-10Merge tag 'v4.1.42' into 4.1-2.0.x-imxOtavio Salvador
Linux 4.1.42 * tag 'v4.1.42': (146 commits) Linux 4.1.42 mm: fix new crash in unmapped_area_topdown() mm: larger stack guard gap, between vmas alarmtimer: Rate limit periodic intervals MIPS: Fix bnezc/jialc return address calculation usb: dwc3: exynos fix axius clock error path to do cleanup genirq: Release resources in __setup_irq() error path swap: cond_resched in swap_cgroup_prepare() mm/memory-failure.c: use compound_head() flags for huge pages USB: gadgetfs, dummy-hcd, net2280: fix locking for callbacks usb: xhci: ASMedia ASM1042A chipset need shorts TX quirk drivers/misc/c2port/c2port-duramar2150.c: checking for NULL instead of IS_ERR() usb: r8a66597-hcd: decrease timeout usb: r8a66597-hcd: select a different endpoint on timeout USB: gadget: dummy_hcd: fix hub-descriptor removable fields [media] pvrusb2: reduce stack usage pvr2_eeprom_analyze() usb: core: fix potential memory leak in error path during hcd creation USB: hub: fix SS max number of ports iio: proximity: as3935: recalibrate RCO after resume staging: rtl8188eu: prevent an underflow in rtw_check_beacon_data() ...
2017-06-28mm: larger stack guard gap, between vmasSasha Levin
[ Upstream commit 1be7107fbe18eed3e319a6c3e83c78254b693acb ] Stack guard page is a useful feature to reduce a risk of stack smashing into a different mapping. We have been using a single page gap which is sufficient to prevent having stack adjacent to a different mapping. But this seems to be insufficient in the light of the stack usage in userspace. E.g. glibc uses as large as 64kB alloca() in many commonly used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX] which is 256kB or stack strings with MAX_ARG_STRLEN. This will become especially dangerous for suid binaries and the default no limit for the stack size limit because those applications can be tricked to consume a large portion of the stack and a single glibc call could jump over the guard page. These attacks are not theoretical, unfortunatelly. Make those attacks less probable by increasing the stack guard gap to 1MB (on systems with 4k pages; but make it depend on the page size because systems with larger base pages might cap stack allocations in the PAGE_SIZE units) which should cover larger alloca() and VLA stack allocations. It is obviously not a full fix because the problem is somehow inherent, but it should reduce attack space a lot. One could argue that the gap size should be configurable from userspace, but that can be done later when somebody finds that the new 1MB is wrong for some special case applications. For now, add a kernel command line option (stack_guard_gap) to specify the stack gap size (in page units). Implementation wise, first delete all the old code for stack guard page: because although we could get away with accounting one extra page in a stack vma, accounting a larger gap can break userspace - case in point, a program run with "ulimit -S -v 20000" failed when the 1MB gap was counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK and strict non-overcommit mode. Instead of keeping gap inside the stack vma, maintain the stack guard gap as a gap between vmas: using vm_start_gap() in place of vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places which need to respect the gap - mainly arch_get_unmapped_area(), and and the vma tree's subtree_gap support for that. Original-patch-by: Oleg Nesterov <oleg@redhat.com> Original-patch-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25x86/mm/32: Set the '__vmalloc_start_set' flag in initmem_init()Laura Abbott
[ Upstream commit 861ce4a3244c21b0af64f880d5bfe5e6e2fb9e4a ] '__vmalloc_start_set' currently only gets set in initmem_init() when !CONFIG_NEED_MULTIPLE_NODES. This breaks detection of vmalloc address with virt_addr_valid() with CONFIG_NEED_MULTIPLE_NODES=y, causing a kernel crash: [mm/usercopy] 517e1fbeb6: kernel BUG at arch/x86/mm/physaddr.c:78! Set '__vmalloc_start_set' appropriately for that case as well. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Laura Abbott <labbott@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: dc16ecf7fd1f ("x86-32: use specific __vmalloc_start_set flag in __virt_addr_valid") Link: http://lkml.kernel.org/r/1494278596-30373-1-git-send-email-labbott@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25KVM: async_pf: avoid async pf injection when in guest modeWanpeng Li
[ Upstream commit 9bc1f09f6fa76fdf31eb7d6a4a4df43574725f93 ] INFO: task gnome-terminal-:1734 blocked for more than 120 seconds. Not tainted 4.12.0-rc4+ #8 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. gnome-terminal- D 0 1734 1015 0x00000000 Call Trace: __schedule+0x3cd/0xb30 schedule+0x40/0x90 kvm_async_pf_task_wait+0x1cc/0x270 ? __vfs_read+0x37/0x150 ? prepare_to_swait+0x22/0x70 do_async_page_fault+0x77/0xb0 ? do_async_page_fault+0x77/0xb0 async_page_fault+0x28/0x30 This is triggered by running both win7 and win2016 on L1 KVM simultaneously, and then gives stress to memory on L1, I can observed this hang on L1 when at least ~70% swap area is occupied on L0. This is due to async pf was injected to L2 which should be injected to L1, L2 guest starts receiving pagefault w/ bogus %cr2(apf token from the host actually), and L1 guest starts accumulating tasks stuck in D state in kvm_async_pf_task_wait() since missing PAGE_READY async_pfs. This patch fixes the hang by doing async pf when executing L1 guest. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25KVM: cpuid: Fix read/write out-of-bounds vulnerability in cpuid emulationWanpeng Li
[ Upstream commit a3641631d14571242eec0d30c9faa786cbf52d44 ] If "i" is the last element in the vcpu->arch.cpuid_entries[] array, it potentially can be exploited the vulnerability. this will out-of-bounds read and write. Luckily, the effect is small: /* when no next entry is found, the current entry[i] is reselected */ for (j = i + 1; ; j = (j + 1) % nent) { struct kvm_cpuid_entry2 *ej = &vcpu->arch.cpuid_entries[j]; if (ej->function == e->function) { It reads ej->maxphyaddr, which is user controlled. However... ej->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT; After cpuid_entries there is int maxphyaddr; struct x86_emulate_ctxt emulate_ctxt; /* 16-byte aligned */ So we have: - cpuid_entries at offset 1B50 (6992) - maxphyaddr at offset 27D0 (6992 + 3200 = 10192) - padding at 27D4...27DF - emulate_ctxt at 27E0 And it writes in the padding. Pfew, writing the ops field of emulate_ctxt would have been much worse. This patch fixes it by modding the index to avoid the out-of-bounds access. Worst case, i == j and ej->function == e->function, the loop can bail out. Reported-by: Moguofang <moguofang@huawei.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Guofang Mo <moguofang@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25kvm: async_pf: fix rcu_irq_enter() with irqs enabledPaolo Bonzini
[ Upstream commit bbaf0e2b1c1b4f88abd6ef49576f0efb1734eae5 ] native_safe_halt enables interrupts, and you just shouldn't call rcu_irq_enter() with interrupts enabled. Reorder the call with the following local_irq_disable() to respect the invariant. Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-17Merge tag 'v4.1.41' into 4.1-2.0.x-imxOtavio Salvador
Linux 4.1.41 * tag 'v4.1.41': (473 commits) Linux 4.1.41 mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thp xc2028: Fix use-after-free bug properly iio: proximity: as3935: fix as3935_write ipx: call ipxitf_put() in ioctl error path sched/fair: Initialize throttle_count for new task-groups lazily sched/fair: Do not announce throttled next buddy in dequeue_task_fair() iio: dac: ad7303: fix channel description mwifiex: pcie: fix cmd_buf use-after-free in remove/reset rtlwifi: rtl8821ae: setup 8812ae RFE according to device type ARM: tegra: paz00: Mark panel regulator as enabled on boot fs/xattr.c: zero out memory copied to userspace in getxattr vfio/type1: Remove locked page accounting workqueue crypto: algif_aead - Require setkey before accept(2) staging: gdm724x: gdm_mux: fix use-after-free on module unload drm/ttm: fix use-after-free races in vm fault handling f2fs: sanity check segment count ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf ipv6: initialize route null entry in addrconf_init() rtnetlink: NUL-terminate IFLA_PHYS_PORT_NAME string ... Signed-off-by: Otavio Salvador <otavio@ossystems.com.br>
2017-06-13KVM: nVMX: initialize PML fields in vmcs02Ladi Prosek
[ Upstream commit 1fb883bb827ee8efc1cc9ea0154f953f8a219d38 ] L2 was running with uninitialized PML fields which led to incomplete dirty bitmap logging. This manifested as all kinds of subtle erratic behavior of the nested guest. Fixes: 843e4330573c ("KVM: VMX: Add PML support in VMX") Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-13Revert "KVM: nested VMX: disable perf cpuid reporting"Jim Mattson
[ Upstream commit 0b4c208d443ba2af82b4c70f99ca8df31e9a0020 ] This reverts commit bc6134942dbbf31c25e9bd7c876be5da81c9e1ce. A CPUID instruction executed in VMX non-root mode always causes a VM-exit, regardless of the leaf being queried. Fixes: bc6134942dbb ("KVM: nested VMX: disable perf cpuid reporting") Signed-off-by: Jim Mattson <jmattson@google.com> [The issue solved by bc6134942dbb has been resolved with ff651cb613b4 ("KVM: nVMX: Add nested msr load/restore algorithm").] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-13x86/platform/intel-mid: Correct MSI IRQ line for watchdog deviceAndy Shevchenko
[ Upstream commit 80354c29025833acd72ddac1ffa21c6cb50128cd ] The interrupt line used for the watchdog is 12, according to the official Intel Edison BSP code. And indeed after fixing it we start getting an interrupt and thus the watchdog starts working again: [ 191.699951] Kernel panic - not syncing: Kernel Watchdog Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: David Cohen <david.a.cohen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 78a3bb9e408b ("x86: intel-mid: add watchdog platform code for Merrifield") Link: http://lkml.kernel.org/r/20170312150744.45493-1-andriy.shevchenko@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-13kprobes/x86: Fix kernel panic when certain exception-handling addresses are ↵Masami Hiramatsu
probed [ Upstream commit 75013fb16f8484898eaa8d0b08fed942d790f029 ] Fix to the exception table entry check by using probed address instead of the address of copied instruction. This bug may cause unexpected kernel panic if user probe an address where an exception can happen which should be fixup by __ex_table (e.g. copy_from_user.) Unless user puts a kprobe on such address, this doesn't cause any problem. This bug has been introduced years ago, by commit: 464846888d9a ("x86/kprobes: Fix a bug which can modify kernel code permanently"). Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 464846888d9a ("x86/kprobes: Fix a bug which can modify kernel code permanently") Link: http://lkml.kernel.org/r/148829899399.28855.12581062400757221722.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-13x86/pci-calgary: Fix iommu_free() comparison of unsigned expression >= 0Nikola Pajkovsky
[ Upstream commit 68dee8e2f2cacc54d038394e70d22411dee89da2 ] commit 8fd524b355da ("x86: Kill bad_dma_address variable") has killed bad_dma_address variable and used instead of macro DMA_ERROR_CODE which is always zero. Since dma_addr is unsigned, the statement dma_addr >= DMA_ERROR_CODE is always true, and not needed. arch/x86/kernel/pci-calgary_64.c: In function ‘iommu_free’: arch/x86/kernel/pci-calgary_64.c:299:2: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits] if (unlikely((dma_addr >= DMA_ERROR_CODE) && (dma_addr < badend))) { Fixes: 8fd524b355da ("x86: Kill bad_dma_address variable") Signed-off-by: Nikola Pajkovsky <npajkovsky@suse.cz> Cc: iommu@lists.linux-foundation.org Cc: Jon Mason <jdmason@kudzu.us> Cc: Muli Ben-Yehuda <mulix@mulix.org> Link: http://lkml.kernel.org/r/7612c0f9dd7c1290407dbf8e809def922006920b.1479161177.git.npajkovsky@suse.cz Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-13ftrace/x86: Fix triple fault with graph tracing and suspend-to-ramJosh Poimboeuf
[ Upstream commit 34a477e5297cbaa6ecc6e17c042a866e1cbe80d6 ] On x86-32, with CONFIG_FIRMWARE and multiple CPUs, if you enable function graph tracing and then suspend to RAM, it will triple fault and reboot when it resumes. The first fault happens when booting a secondary CPU: startup_32_smp() load_ucode_ap() prepare_ftrace_return() ftrace_graph_is_dead() (accesses 'kill_ftrace_graph') The early head_32.S code calls into load_ucode_ap(), which has an an ftrace hook, so it calls prepare_ftrace_return(), which calls ftrace_graph_is_dead(), which tries to access the global 'kill_ftrace_graph' variable with a virtual address, causing a fault because the CPU is still in real mode. The fix is to add a check in prepare_ftrace_return() to make sure it's running in protected mode before continuing. The check makes sure the stack pointer is a virtual kernel address. It's a bit of a hack, but it's not very intrusive and it works well enough. For reference, here are a few other (more difficult) ways this could have potentially been fixed: - Move startup_32_smp()'s call to load_ucode_ap() down to *after* paging is enabled. (No idea what that would break.) - Track down load_ucode_ap()'s entire callee tree and mark all the functions 'notrace'. (Probably not realistic.) - Pause graph tracing in ftrace_suspend_notifier_call() or bringup_cpu() or __cpu_up(), and ensure that the pause facility can be queried from real mode. Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Tested-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: "Rafael J . Wysocki" <rjw@rjwysocki.net> Cc: linux-acpi@vger.kernel.org Cc: Borislav Petkov <bp@alien8.de> Cc: stable@kernel.org Cc: Len Brown <lenb@kernel.org> Link: http://lkml.kernel.org/r/5c1272269a580660703ed2eccf44308e790c7a98.1492123841.git.jpoimboe@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-13xen/x86: don't lose event interruptsStefano Stabellini
[ Upstream commit c06b6d70feb32d28f04ba37aa3df17973fd37b6b ] On slow platforms with unreliable TSC, such as QEMU emulated machines, it is possible for the kernel to request the next event in the past. In that case, in the current implementation of xen_vcpuop_clockevent, we simply return -ETIME. To be precise the Xen returns -ETIME and we pass it on. However the result of this is a missed event, which simply causes the kernel to hang. Instead it is better to always ask the hypervisor for a timer event, even if the timeout is in the past. That way there are no lost interrupts and the kernel survives. To do that, remove the VCPU_SSHOTTMR_future flag. Signed-off-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-13x86/mce/AMD: Give a name to MCA bank 3 when accessed with legacy MSRsYazen Ghannam
[ Upstream commit 29f72ce3e4d18066ec75c79c857bee0618a3504b ] MCA bank 3 is reserved on systems pre-Fam17h, so it didn't have a name. However, MCA bank 3 is defined on Fam17h systems and can be accessed using legacy MSRs. Without a name we get a stack trace on Fam17h systems when trying to register sysfs files for bank 3 on kernels that don't recognize Scalable MCA. Call MCA bank 3 "decode_unit" since this is what it represents on Fam17h. This will allow kernels without SMCA support to see this bank on Fam17h+ and prevent the stack trace. This will not affect older systems since this bank is reserved on them, i.e. it'll be ignored. Tested on AMD Fam15h and Fam17h systems. WARNING: CPU: 26 PID: 1 at lib/kobject.c:210 kobject_add_internal kobject: (ffff88085bb256c0): attempted to be registered with empty name! ... Call Trace: kobject_add_internal kobject_add kobject_create_and_add threshold_create_device threshold_init_device Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1490102285-3659-1-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-13mm: Tighten x86 /dev/mem with zeroing readsKees Cook
[ Upstream commit a4866aa812518ed1a37d8ea0c881dc946409de94 ] Under CONFIG_STRICT_DEVMEM, reading System RAM through /dev/mem is disallowed. However, on x86, the first 1MB was always allowed for BIOS and similar things, regardless of it actually being System RAM. It was possible for heap to end up getting allocated in low 1MB RAM, and then read by things like x86info or dd, which would trip hardened usercopy: usercopy: kernel memory exposure attempt detected from ffff880000090000 (dma-kmalloc-256) (4096 bytes) This changes the x86 exception for the low 1MB by reading back zeros for System RAM areas instead of blindly allowing them. More work is needed to extend this to mmap, but currently mmap doesn't go through usercopy, so hardened usercopy won't Oops the kernel. Reported-by: Tommi Rantala <tommi.t.rantala@nokia.com> Tested-by: Tommi Rantala <tommi.t.rantala@nokia.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-13kvm: fix page struct leak in handle_vmonPaolo Bonzini
[ Upstream commit 06ce521af9558814b8606c0476c54497cf83a653 ] handle_vmon gets a reference on VMXON region page, but does not release it. Release the reference. Found by syzkaller; based on a patch by Dmitry. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-08x86/mce: Don't use percpu workqueuesChen, Gong
[ Upstream commit 061120aed7081b9a4393fbe07b558192f40ad911 ] An MCE is a rare event. Therefore, there's no need to have per-CPU instances of both normal and IRQ workqueues. Make them both global. Signed-off-by: Chen, Gong <gong.chen@linux.intel.com> [ Fold in subsequent patch from Rui/Boris/Tony for early boot logging. ] Signed-off-by: Tony Luck <tony.luck@intel.com> [ Massage commit message. ] Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1439396985-12812-4-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-08KVM: X86: Fix read out-of-bounds vulnerability in kvm pio emulationWanpeng Li
[ Upstream commit cbfc6c9184ce71b52df4b1d82af5afc81a709178 ] Huawei folks reported a read out-of-bounds vulnerability in kvm pio emulation. - "inb" instruction to access PIT Mod/Command register (ioport 0x43, write only, a read should be ignored) in guest can get a random number. - "rep insb" instruction to access PIT register port 0x43 can control memcpy() in emulator_pio_in_emulated() to copy max 0x400 bytes but only read 1 bytes, which will disclose the unimportant kernel memory in host but no crash. The similar test program below can reproduce the read out-of-bounds vulnerability: void hexdump(void *mem, unsigned int len) { unsigned int i, j; for(i = 0; i < len + ((len % HEXDUMP_COLS) ? (HEXDUMP_COLS - len % HEXDUMP_COLS) : 0); i++) { /* print offset */ if(i % HEXDUMP_COLS == 0) { printf("0x%06x: ", i); } /* print hex data */ if(i < len) { printf("%02x ", 0xFF & ((char*)mem)[i]); } else /* end of block, just aligning for ASCII dump */ { printf(" "); } /* print ASCII dump */ if(i % HEXDUMP_COLS == (HEXDUMP_COLS - 1)) { for(j = i - (HEXDUMP_COLS - 1); j <= i; j++) { if(j >= len) /* end of block, not really printing */ { putchar(' '); } else if(isprint(((char*)mem)[j])) /* printable char */ { putchar(0xFF & ((char*)mem)[j]); } else /* other char */ { putchar('.'); } } putchar('\n'); } } } int main(void) { int i; if (iopl(3)) { err(1, "set iopl unsuccessfully\n"); return -1; } static char buf[0x40]; /* test ioport 0x40,0x41,0x42,0x43,0x44,0x45 */ memset(buf, 0xab, sizeof(buf)); asm volatile("push %rdi;"); asm volatile("mov %0, %%rdi;"::"q"(buf)); asm volatile ("mov $0x40, %rdx;"); asm volatile ("in %dx,%al;"); asm volatile ("stosb;"); asm volatile ("mov $0x41, %rdx;"); asm volatile ("in %dx,%al;"); asm volatile ("stosb;"); asm volatile ("mov $0x42, %rdx;"); asm volatile ("in %dx,%al;"); asm volatile ("stosb;"); asm volatile ("mov $0x43, %rdx;"); asm volatile ("in %dx,%al;"); asm volatile ("stosb;"); asm volatile ("mov $0x44, %rdx;"); asm volatile ("in %dx,%al;"); asm volatile ("stosb;"); asm volatile ("mov $0x45, %rdx;"); asm volatile ("in %dx,%al;"); asm volatile ("stosb;"); asm volatile ("pop %rdi;"); hexdump(buf, 0x40); printf("\n"); /* ins port 0x40 */ memset(buf, 0xab, sizeof(buf)); asm volatile("push %rdi;"); asm volatile("mov %0, %%rdi;"::"q"(buf)); asm volatile ("mov $0x20, %rcx;"); asm volatile ("mov $0x40, %rdx;"); asm volatile ("rep insb;"); asm volatile ("pop %rdi;"); hexdump(buf, 0x40); printf("\n"); /* ins port 0x43 */ memset(buf, 0xab, sizeof(buf)); asm volatile("push %rdi;"); asm volatile("mov %0, %%rdi;"::"q"(buf)); asm volatile ("mov $0x20, %rcx;"); asm volatile ("mov $0x43, %rdx;"); asm volatile ("rep insb;"); asm volatile ("pop %rdi;"); hexdump(buf, 0x40); printf("\n"); return 0; } The vcpu->arch.pio_data buffer is used by both in/out instrutions emulation w/o clear after using which results in some random datas are left over in the buffer. Guest reads port 0x43 will be ignored since it is write only, however, the function kernel_pio() can't distigush this ignore from successfully reads data from device's ioport. There is no new data fill the buffer from port 0x43, however, emulator_pio_in_emulated() will copy the stale data in the buffer to the guest unconditionally. This patch fixes it by clearing the buffer before in instruction emulation to avoid to grant guest the stale data in the buffer. In addition, string I/O is not supported for in kernel device. So there is no iteration to read ioport %RCX times for string I/O. The function kernel_pio() just reads one round, and then copy the io size * %RCX to the guest unconditionally, actually it copies the one round ioport data w/ other random datas which are left over in the vcpu->arch.pio_data buffer to the guest. This patch fixes it by introducing the string I/O support for in kernel device in order to grant the right ioport datas to the guest. Before the patch: 0x000000: fe 38 93 93 ff ff ab ab .8...... 0x000008: ab ab ab ab ab ab ab ab ........ 0x000010: ab ab ab ab ab ab ab ab ........ 0x000018: ab ab ab ab ab ab ab ab ........ 0x000020: ab ab ab ab ab ab ab ab ........ 0x000028: ab ab ab ab ab ab ab ab ........ 0x000030: ab ab ab ab ab ab ab ab ........ 0x000038: ab ab ab ab ab ab ab ab ........ 0x000000: f6 00 00 00 00 00 00 00 ........ 0x000008: 00 00 00 00 00 00 00 00 ........ 0x000010: 00 00 00 00 4d 51 30 30 ....MQ00 0x000018: 30 30 20 33 20 20 20 20 00 3 0x000020: ab ab ab ab ab ab ab ab ........ 0x000028: ab ab ab ab ab ab ab ab ........ 0x000030: ab ab ab ab ab ab ab ab ........ 0x000038: ab ab ab ab ab ab ab ab ........ 0x000000: f6 00 00 00 00 00 00 00 ........ 0x000008: 00 00 00 00 00 00 00 00 ........ 0x000010: 00 00 00 00 4d 51 30 30 ....MQ00 0x000018: 30 30 20 33 20 20 20 20 00 3 0x000020: ab ab ab ab ab ab ab ab ........ 0x000028: ab ab ab ab ab ab ab ab ........ 0x000030: ab ab ab ab ab ab ab ab ........ 0x000038: ab ab ab ab ab ab ab ab ........ After the patch: 0x000000: 1e 02 f8 00 ff ff ab ab ........ 0x000008: ab ab ab ab ab ab ab ab ........ 0x000010: ab ab ab ab ab ab ab ab ........ 0x000018: ab ab ab ab ab ab ab ab ........ 0x000020: ab ab ab ab ab ab ab ab ........ 0x000028: ab ab ab ab ab ab ab ab ........ 0x000030: ab ab ab ab ab ab ab ab ........ 0x000038: ab ab ab ab ab ab ab ab ........ 0x000000: d2 e2 d2 df d2 db d2 d7 ........ 0x000008: d2 d3 d2 cf d2 cb d2 c7 ........ 0x000010: d2 c4 d2 c0 d2 bc d2 b8 ........ 0x000018: d2 b4 d2 b0 d2 ac d2 a8 ........ 0x000020: ab ab ab ab ab ab ab ab ........ 0x000028: ab ab ab ab ab ab ab ab ........ 0x000030: ab ab ab ab ab ab ab ab ........ 0x000038: ab ab ab ab ab ab ab ab ........ 0x000000: 00 00 00 00 00 00 00 00 ........ 0x000008: 00 00 00 00 00 00 00 00 ........ 0x000010: 00 00 00 00 00 00 00 00 ........ 0x000018: 00 00 00 00 00 00 00 00 ........ 0x000020: ab ab ab ab ab ab ab ab ........ 0x000028: ab ab ab ab ab ab ab ab ........ 0x000030: ab ab ab ab ab ab ab ab ........ 0x000038: ab ab ab ab ab ab ab ab ........ Reported-by: Moguofang <moguofang@huawei.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Moguofang <moguofang@huawei.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Cc: stable@vger.kernel.org Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-05-17um: Fix PTRACE_POKEUSER on x86_64Richard Weinberger
[ Upstream commit 9abc74a22d85ab29cef9896a2582a530da7e79bf ] This is broken since ever but sadly nobody noticed. Recent versions of GDB set DR_CONTROL unconditionally and UML dies due to a heap corruption. It turns out that the PTRACE_POKEUSER was copy&pasted from i386 and assumes that addresses are 4 bytes long. Fix that by using 8 as address size in the calculation. Cc: <stable@vger.kernel.org> Reported-by: jie cao <cj3054@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-05-17x86/boot: Fix BSS corruption/overwrite bug in early x86 kernel startupAshish Kalra
[ Upstream commit d594aa0277e541bb997aef0bc0a55172d8138340 ] The minimum size for a new stack (512 bytes) setup for arch/x86/boot components when the bootloader does not setup/provide a stack for the early boot components is not "enough". The setup code executing as part of early kernel startup code, uses the stack beyond 512 bytes and accidentally overwrites and corrupts part of the BSS section. This is exposed mostly in the early video setup code, where it was corrupting BSS variables like force_x, force_y, which in-turn affected kernel parameters such as screen_info (screen_info.orig_video_cols) and later caused an exception/panic in console_init(). Most recent boot loaders setup the stack for early boot components, so this stack overwriting into BSS section issue has not been exposed. Signed-off-by: Ashish Kalra <ashish@bluestacks.com> Cc: <stable@vger.kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170419152015.10011-1-ashishkalra@Ashishs-MacBook-Pro.local Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-05-17x86/vdso: Plug race between mapping and ELF header setupThomas Gleixner
[ Upstream commit 6fdc6dd90272ce7e75d744f71535cfbd8d77da81 ] The vsyscall32 sysctl can racy against a concurrent fork when it switches from disabled to enabled: arch_setup_additional_pages() if (vdso32_enabled) --> No mapping sysctl.vsysscall32() --> vdso32_enabled = true create_elf_tables() ARCH_DLINFO_IA32 if (vdso32_enabled) { --> Add VDSO entry with NULL pointer Make ARCH_DLINFO_IA32 check whether the VDSO mapping has been set up for the newly forked process or not. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mathias Krause <minipli@googlemail.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20170410151723.602367196@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-05-17x86/perf: Fix CR4.PCE propagation to use active_mm instead of mmAndy Lutomirski
[ Upstream commit 5dc855d44c2ad960a86f593c60461f1ae1566b6d ] If one thread mmaps a perf event while another thread in the same mm is in some context where active_mm != mm (which can happen in the scheduler, for example), refresh_pce() would write the wrong value to CR4.PCE. This broke some PAPI tests. Reported-and-tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: 7911d3f7af14 ("perf/x86: Only allow rdpmc if a perf_event is mapped") Link: http://lkml.kernel.org/r/0c5b38a76ea50e405f9abe07a13dfaef87c173a1.1489694270.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-05-17x86/platform/goldfish: Prevent unconditional loadingThomas Gleixner
[ Upstream commit 47512cfd0d7a8bd6ab71d01cd89fca19eb2093eb ] The goldfish platform code registers the platform device unconditionally which causes havoc in several ways if the goldfish_pdev_bus driver is enabled: - Access to the hardcoded physical memory region, which is either not available or contains stuff which is completely unrelated. - Prevents that the interrupt of the serial port can be requested - In case of a spurious interrupt it goes into a infinite loop in the interrupt handler of the pdev_bus driver (which needs to be fixed seperately). Add a 'goldfish' command line option to make the registration opt-in when the platform is compiled in. I'm seriously grumpy about this engineering trainwreck, which has seven SOBs from Intel developers for 50 lines of code. And none of them figured out that this is broken. Impressive fail! Fixes: ddd70cf93d78 ("goldfish: platform device for x86") Reported-by: Gabriel C <nix.or.die@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-03-17Merge tag 'v4.1.39' into 4.1-2.0.x-imxOtavio Salvador
Linux 4.1.39 * tag 'v4.1.39': (138 commits) Linux 4.1.39 KVM: x86: remove data variable from kvm_get_msr_common KVM: VMX: Fix host initiated access to guest MSR_TSC_AUX KVM: x86: pass host_initiated to functions that read MSRs perf/core: Fix the perf_cpu_time_max_percent check perf/core: Make sysctl_perf_cpu_time_max_percent conform to documentation perf/core: Fix implicitly enable dynamic interrupt throttle perf/core: Fix dynamic interrupt throttle Fix missing sanity check in /dev/sg printk: use rcuidle console tracepoint vfs: fix uninitialized flags in splice_to_pipe() drm/radeon: Use mode h/vdisplay fields to hide out of bounds HW cursor ARM: 8658/1: uaccess: fix zeroing of 64-bit get_user() drm/dp/mst: fix kernel oops when turning off secondary monitor [media] siano: make it work again with CONFIG_VMAP_STACK mmc: core: fix multi-bit bus width without high-speed mode futex: Move futex_init() to core_initcall xen-netfront: Delete rx_refill_timer in xennet_disconnect_backend() scsi: aacraid: Fix INTx/MSI-x issue with older controllers cpumask: use nr_cpumask_bits for parsing functions ... Signed-off-by: Otavio Salvador <otavio@ossystems.com.br>
2017-03-06KVM: x86: remove data variable from kvm_get_msr_commonNicolas Iooss
[ Upstream commit b0996ae48285364710bce812e70ce67771ea6ef7 ] Commit 609e36d372ad ("KVM: x86: pass host_initiated to functions that read MSRs") modified kvm_get_msr_common function to use msr_info->data instead of data but missed one occurrence. Replace it and remove the unused local variable. Fixes: 609e36d372ad ("KVM: x86: pass host_initiated to functions that read MSRs") Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-03-06KVM: VMX: Fix host initiated access to guest MSR_TSC_AUXHaozhong Zhang
[ Upstream commit 81b1b9ca6d5ca5f3ce91c0095402def657cf5db3 ] The current handling of accesses to guest MSR_TSC_AUX returns error if vcpu does not support rdtscp, though those accesses are initiated by host. This can result in the reboot failure of some versions of QEMU. This patch fixes this issue by passing those host initiated accesses for further handling instead. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-03-06KVM: x86: pass host_initiated to functions that read MSRsPaolo Bonzini
[ Upstream commit 609e36d372ad9329269e4a1467bd35311893d1d6 ] SMBASE is only readable from SMM for the VCPU, but it must be always accessible if userspace is accessing it. Thus, all functions that read MSRs are changed to accept a struct msr_data; the host_initiated and index fields are pre-initialized, while the data field is filled on return. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-03-06KVM: x86: flush pending lapic jump label updates on module unloadDavid Matlack
[ Upstream commit cef84c302fe051744b983a92764d3fcca933415d ] KVM's lapic emulation uses static_key_deferred (apic_{hw,sw}_disabled). These are implemented with delayed_work structs which can still be pending when the KVM module is unloaded. We've seen this cause kernel panics when the kvm_intel module is quickly reloaded. Use the new static_key_deferred_flush() API to flush pending updates on module unload. Signed-off-by: David Matlack <dmatlack@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-03-06x86/PCI: Ignore _CRS on Supermicro X8DTH-i/6/iF/6FBjorn Helgaas
[ Upstream commit 89e9f7bcd8744ea25fcf0ac671b8d72c10d7d790 ] Martin reported that the Supermicro X8DTH-i/6/iF/6F advertises incorrect host bridge windows via _CRS: pci_root PNP0A08:00: host bridge window [io 0xf000-0xffff] pci_root PNP0A08:01: host bridge window [io 0xf000-0xffff] Both bridges advertise the 0xf000-0xffff window, which cannot be correct. Work around this by ignoring _CRS on this system. The downside is that we may not assign resources correctly to hot-added PCI devices (if they are possible on this system). Link: https://bugzilla.kernel.org/show_bug.cgi?id=42606 Reported-by: Martin Burnicki <martin.burnicki@meinberg.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> CC: stable@vger.kernel.org Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-01-23Merge tag 'v4.1.38' into 4.1-2.0.x-imxOtavio Salvador
Linux 4.1.38 * tag 'v4.1.38': (109 commits) Linux 4.1.38 gro: Allow tunnel stacking in the case of FOU/GUE tunnels: Don't apply GRO to multiple layers of encapsulation. net: ipv4: Convert IP network timestamps to be y2038 safe ipip: Properly mark ipip GRO packets as encapsulated. sg_write()/bsg_write() is not fit to be called under KERNEL_DS fs: exec: apply CLOEXEC before changing dumpable task flags IB/cma: Fix a race condition in iboe_addr_get_sgid() Revert "ALSA: usb-audio: Fix race at stopping the stream" kvm: nVMX: Allow L1 to intercept software exceptions (#BP and #OF) drivers/gpu/drm/ast: Fix infinite loop if read fails target/user: Fix use-after-free of tcmu_cmds if they are expired kernel/debug/debug_core.c: more properly delay for secondary CPUs scsi: avoid a permanent stop of the scsi device's request queue IB/multicast: Check ib_find_pkey() return value IPoIB: Avoid reading an uninitialized member variable block_dev: don't test bdev->bd_contains when it is not stable btrfs: limit async_work allocation and worker func duration mm/vmscan.c: set correct defer count for shrinker Input: drv260x - fix input device's parent assignment ...
2017-01-12kvm: nVMX: Allow L1 to intercept software exceptions (#BP and #OF)Jim Mattson
[ Upstream commit ef85b67385436ddc1998f45f1d6a210f935b3388 ] When L2 exits to L0 due to "exception or NMI", software exceptions (#BP and #OF) for which L1 has requested an intercept should be handled by L1 rather than L0. Previously, only hardware exceptions were forwarded to L1. Signed-off-by: Jim Mattson <jmattson@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-12-22x86/kexec: add -fno-PIESebastian Andrzej Siewior
[ Upstream commit 90944e40ba1838de4b2a9290cf273f9d76bd3bdd ] If the gcc is configured to do -fPIE by default then the build aborts later with: | Unsupported relocation type: unknown type rel type name (29) Tagging it stable so it is possible to compile recent stable kernels as well. Cc: stable@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Michal Marek <mmarek@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-12-21x86/init: Fix cr4_init_shadow() on CR4-less machinesAndy Lutomirski
[ Upstream commit e1bfc11c5a6f40222a698a818dc269113245820e ] cr4_init_shadow() will panic on 486-like machines without CR4. Fix it using __read_cr4_safe(). Reported-by: david@saggiorato.net Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: 1e02ce4cccdc ("x86: Store a per-cpu shadow copy of CR4") Link: http://lkml.kernel.org/r/43a20f81fb504013bf613913dc25574b45336a61.1475091074.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-12-17Merge tag 'v4.1.36' into 4.1-2.0.x-imxOtavio Salvador
Linux 4.1.36 * tag 'v4.1.36': (72 commits) Linux 4.1.36 kbuild: add -fno-PIE firewire: net: fix fragmented datagram_size off-by-one firewire: net: guard against rx buffer overflows parisc: Ensure consistent state when switching to kernel stack at syscall entry ovl: fsync after copy-up virtio: console: Unlock vqs while freeing buffers md: be careful not lot leak internal curr_resync value into metadata. -- (all) md: sync sync_completed has correct value as recovery finishes. scsi: arcmsr: Send SYNCHRONIZE_CACHE command to firmware scsi: scsi_debug: Fix memory leak if LBP enabled and module is unloaded drm/radeon/si_dpm: workaround for SI kickers drm/dp/mst: Check peer device type before attempting EDID read drm/dp/mst: add some defines for logical/physical ports drm/dp/mst: Clear port->pdt when tearing down the i2c adapter KVM: MIPS: Precalculate MMIO load resume PC KVM: MIPS: Make ERET handle ERL before EXL drm/radeon: drop register readback in cayman_cp_int_cntl_setup scsi: megaraid_sas: Fix data integrity failure for JBOD (passthrough) devices Revert "drm/radeon: fix DP link training issue with second 4K monitor" ...