summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2012-01-31net: remove mm.h inclusion from netdevice.hAlexey Dobriyan
Remove linux/mm.h inclusion from netdevice.h -- it's unused (I've checked manually). To prevent mm.h inclusion via other channels also extract "enum dma_data_direction" definition into separate header. This tiny piece is what gluing netdevice.h with mm.h via "netdevice.h => dmaengine.h => dma-mapping.h => scatterlist.h => mm.h". Removal of mm.h from scatterlist.h was tried and was found not feasible on most archs, so the link was cutoff earlier. Hope people are OK with tiny include file. Note, that mm_types.h is still dragged in, but it is a separate story. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-21xen: only limit memory map to maximum reservation for domain 0.Ian Campbell
commit d3db728125c4470a2d061ac10fa7395e18237263 upstream. d312ae878b6a "xen: use maximum reservation to limit amount of usable RAM" clamped the total amount of RAM to the current maximum reservation. This is correct for dom0 but is not correct for guest domains. In order to boot a guest "pre-ballooned" (e.g. with memory=1G but maxmem=2G) in order to allow for future memory expansion the guest must derive max_pfn from the e820 provided by the toolstack and not the current maximum reservation (which can reflect only the current maximum, not the guest lifetime max). The existing algorithm already behaves this correctly if we do not artificially limit the maximum number of pages for the guest case. For a guest booted with maxmem=512, memory=128 this results in: [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable) [ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved) -[ 0.000000] Xen: 0000000000100000 - 0000000008100000 (usable) -[ 0.000000] Xen: 0000000008100000 - 0000000020800000 (unusable) +[ 0.000000] Xen: 0000000000100000 - 0000000020800000 (usable) ... [ 0.000000] NX (Execute Disable) protection: active [ 0.000000] DMI not present or invalid. [ 0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved) [ 0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable) -[ 0.000000] last_pfn = 0x8100 max_arch_pfn = 0x1000000 +[ 0.000000] last_pfn = 0x20800 max_arch_pfn = 0x1000000 [ 0.000000] initial memory mapped : 0 - 027ff000 [ 0.000000] Base memory trampoline at [c009f000] 9f000 size 4096 -[ 0.000000] init_memory_mapping: 0000000000000000-0000000008100000 -[ 0.000000] 0000000000 - 0008100000 page 4k -[ 0.000000] kernel direct mapping tables up to 8100000 @ 27bb000-27ff000 +[ 0.000000] init_memory_mapping: 0000000000000000-0000000020800000 +[ 0.000000] 0000000000 - 0020800000 page 4k +[ 0.000000] kernel direct mapping tables up to 20800000 @ 26f8000-27ff000 [ 0.000000] xen: setting RW the range 27e8000 - 27ff000 [ 0.000000] 0MB HIGHMEM available. -[ 0.000000] 129MB LOWMEM available. -[ 0.000000] mapped low ram: 0 - 08100000 -[ 0.000000] low ram: 0 - 08100000 +[ 0.000000] 520MB LOWMEM available. +[ 0.000000] mapped low ram: 0 - 20800000 +[ 0.000000] low ram: 0 - 20800000 With this change "xl mem-set <domain> 512M" will successfully increase the guest RAM (by reducing the balloon). There is no change for dom0. Reported-and-Tested-by: George Shuklin <george.shuklin@gmail.com> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-21x86, hpet: Immediately disable HPET timer 1 if rtc irq is maskedMark Langsdorf
commit 2ded6e6a94c98ea453a156748cb7fabaf39a76b9 upstream. When HPET is operating in RTC mode, the TN_ENABLE bit on timer1 controls whether the HPET or the RTC delivers interrupts to irq8. When the system goes into suspend, the RTC driver sends a signal to the HPET driver so that the HPET releases control of irq8, allowing the RTC to wake the system from suspend. The switchover is accomplished by a write to the HPET configuration registers which currently only occurs while servicing the HPET interrupt. On some systems, I have seen the system suspend before an HPET interrupt occurs, preventing the write to the HPET configuration register and leaving the HPET in control of the irq8. As the HPET is not active during suspend, it does not generate a wake signal and RTC alarms do not work. This patch forces the HPET driver to immediately transfer control of the irq8 channel to the RTC instead of waiting until the next interrupt event. Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com> Link: http://lkml.kernel.org/r/20111118153306.GB16319@alberich.amd.com Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com> Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-21thp: add compound tail page _mapcount when mappedYouquan Song
commit b6999b19120931ede364fa3b685e698a61fed31d upstream. With the 3.2-rc kernel, IOMMU 2M pages in KVM works. But when I tried to use IOMMU 1GB pages in KVM, I encountered an oops and the 1GB page failed to be used. The root cause is that 1GB page allocation calls gup_huge_pud() while 2M page calls gup_huge_pmd. If compound pages are used and the page is a tail page, gup_huge_pmd() increases _mapcount to record tail page are mapped while gup_huge_pud does not do that. So when the mapped page is relesed, it will result in kernel oops because the page is not marked mapped. This patch add tail process for compound page in 1GB huge page which keeps the same process as 2M page. Reproduce like: 1. Add grub boot option: hugepagesz=1G hugepages=8 2. mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages 3. qemu-kvm -m 2048 -hda os-kvm.img -cpu kvm64 -smp 4 -mem-path /dev/hugepages -net none -device pci-assign,host=07:00.1 kernel BUG at mm/swap.c:114! invalid opcode: 0000 [#1] SMP Call Trace: put_page+0x15/0x37 kvm_release_pfn_clean+0x31/0x36 kvm_iommu_put_pages+0x94/0xb1 kvm_iommu_unmap_memslots+0x80/0xb6 kvm_assign_device+0xba/0x117 kvm_vm_ioctl_assigned_device+0x301/0xa47 kvm_vm_ioctl+0x36c/0x3a2 do_vfs_ioctl+0x49e/0x4e4 sys_ioctl+0x5a/0x7c system_call_fastpath+0x16/0x1b RIP put_compound_page+0xd4/0x168 Signed-off-by: Youquan Song <youquan.song@intel.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-09oprofile, x86: Fix crash when unloading module (nmi timer mode)Robert Richter
commit 97f7f8189fe54e3cfe324ef9ad35064f3d2d3bff upstream. If oprofile uses the nmi timer interrupt there is a crash while unloading the module. The bug can be triggered with oprofile build as module and kernel parameter nolapic set. This patch fixes this. oprofile: using NMI timer interrupt. BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [<ffffffff8123c226>] unregister_syscore_ops+0x41/0x58 PGD 42dbca067 PUD 41da6a067 PMD 0 Oops: 0002 [#1] PREEMPT SMP CPU 5 Modules linked in: oprofile(-) [last unloaded: oprofile] Pid: 2518, comm: modprobe Not tainted 3.1.0-rc7-00019-gb2fb49d #19 Advanced Micro Device Anaheim/Anaheim RIP: 0010:[<ffffffff8123c226>] [<ffffffff8123c226>] unregister_syscore_ops+0x41/0x58 RSP: 0018:ffff88041ef71e98 EFLAGS: 00010296 RAX: 0000000000000000 RBX: ffffffffa0017100 RCX: dead000000200200 RDX: 0000000000000000 RSI: dead000000100100 RDI: ffffffff8178c620 RBP: ffff88041ef71ea8 R08: 0000000000000001 R09: 0000000000000082 R10: 0000000000000000 R11: ffff88041ef71de8 R12: 0000000000000080 R13: fffffffffffffff5 R14: 0000000000000001 R15: 0000000000610210 FS: 00007fc902f20700(0000) GS:ffff88042fd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000008 CR3: 000000041cdb6000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process modprobe (pid: 2518, threadinfo ffff88041ef70000, task ffff88041d348040) Stack: ffff88041ef71eb8 ffffffffa0017790 ffff88041ef71eb8 ffffffffa0013532 ffff88041ef71ec8 ffffffffa00132d6 ffff88041ef71ed8 ffffffffa00159b2 ffff88041ef71f78 ffffffff81073115 656c69666f72706f 0000000000610200 Call Trace: [<ffffffffa0013532>] op_nmi_exit+0x15/0x17 [oprofile] [<ffffffffa00132d6>] oprofile_arch_exit+0xe/0x10 [oprofile] [<ffffffffa00159b2>] oprofile_exit+0x1e/0x20 [oprofile] [<ffffffff81073115>] sys_delete_module+0x1c3/0x22f [<ffffffff811bf09e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff8148070b>] system_call_fastpath+0x16/0x1b Code: 20 c6 78 81 e8 c5 cc 23 00 48 8b 13 48 8b 43 08 48 be 00 01 10 00 00 00 ad de 48 b9 00 02 20 00 00 00 ad de 48 c7 c7 20 c6 78 81 89 42 08 48 89 10 48 89 33 48 89 4b 08 e8 a6 c0 23 00 5a 5b RIP [<ffffffff8123c226>] unregister_syscore_ops+0x41/0x58 RSP <ffff88041ef71e98> CR2: 0000000000000008 ---[ end trace 43a541a52956b7b0 ]--- Signed-off-by: Robert Richter <robert.richter@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-09perf/x86: Fix PEBS instruction unwindPeter Zijlstra
commit 57d1c0c03c6b48b2b96870d831b9ce6b917f53ac upstream. Masami spotted that we always try to decode the instruction stream as 64bit instructions when running a 64bit kernel, this doesn't work for ia32-compat proglets. Use TIF_IA32 to detect if we need to use the 32bit instruction decoder. Reported-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-09x86/paravirt: PTE updates in k(un)map_atomic need to be synchronous, ↵Konrad Rzeszutek Wilk
regardless of lazy_mmu mode commit 2cd1c8d4dc7ecca9e9431e2dabe41ae9c7d89e51 upstream. Fix an outstanding issue that has been reported since 2.6.37. Under a heavy loaded machine processing "fork()" calls could crash with: BUG: unable to handle kernel paging request at f573fc8c IP: [<c01abc54>] swap_count_continued+0x104/0x180 *pdpt = 000000002a3b9027 *pde = 0000000001bed067 *pte = 0000000000000000 Oops: 0000 [#1] SMP Modules linked in: Pid: 1638, comm: apache2 Not tainted 3.0.4-linode37 #1 EIP: 0061:[<c01abc54>] EFLAGS: 00210246 CPU: 3 EIP is at swap_count_continued+0x104/0x180 .. snip.. Call Trace: [<c01ac222>] ? __swap_duplicate+0xc2/0x160 [<c01040f7>] ? pte_mfn_to_pfn+0x87/0xe0 [<c01ac2e4>] ? swap_duplicate+0x14/0x40 [<c01a0a6b>] ? copy_pte_range+0x45b/0x500 [<c01a0ca5>] ? copy_page_range+0x195/0x200 [<c01328c6>] ? dup_mmap+0x1c6/0x2c0 [<c0132cf8>] ? dup_mm+0xa8/0x130 [<c013376a>] ? copy_process+0x98a/0xb30 [<c013395f>] ? do_fork+0x4f/0x280 [<c01573b3>] ? getnstimeofday+0x43/0x100 [<c010f770>] ? sys_clone+0x30/0x40 [<c06c048d>] ? ptregs_clone+0x15/0x48 [<c06bfb71>] ? syscall_call+0x7/0xb The problem is that in copy_page_range() we turn lazy mode on, and then in swap_entry_free() we call swap_count_continued() which ends up in: map = kmap_atomic(page, KM_USER0) + offset; and then later we touch *map. Since we are running in batched mode (lazy) we don't actually set up the PTE mappings and the kmap_atomic is not done synchronously and ends up trying to dereference a page that has not been set. Looking at kmap_atomic_prot_pfn(), it uses 'arch_flush_lazy_mmu_mode' and doing the same in kmap_atomic_prot() and __kunmap_atomic() makes the problem go away. Interestingly, commit b8bcfe997e4615 ("x86/paravirt: remove lazy mode in interrupts") removed part of this to fix an interrupt issue - but it went to far and did not consider this scenario. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-09x86: Fix "Acer Aspire 1" reboot hangPeter Chubb
commit 1ef03890969932e9359b9a4c658f7f87771910ac upstream. Looks like on some Acer Aspire 1s with older bioses, reboot via bios fails. It works on my machine, (with BIOS version 0.3310) but not on some others (BIOS version 0.3309). There's a log of problems at: https://bbs.archlinux.org/viewtopic.php?id=124136 This patch adds a different callback to the reboot quirk table, to allow rebooting via keybaord controller. Reported-by: Uroš Vampl <mobile.leecher@gmail.com> Tested-by: Vasily Khoruzhick <anarsoul@gmail.com> Signed-off-by: Peter Chubb <peter.chubb@nicta.com.au> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1323093233-9481-1-git-send-email-anarsoul@gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-09x86/mpparse: Account for bus types other than ISA and PCIBjorn Helgaas
commit 9e6866686bdf2dcf3aeb0838076237ede532dcc8 upstream. In commit f8924e770e04 ("x86: unify mp_bus_info"), the 32-bit and 64-bit versions of MP_bus_info were rearranged to match each other better. Unfortunately it introduced a regression: prior to that change we used to always set the mp_bus_not_pci bit, then clear it if we found a PCI bus. After it, we set mp_bus_not_pci for ISA buses, clear it for PCI buses, and leave it alone otherwise. In the cases of ISA and PCI, there's not much difference. But ISA is not the only non-PCI bus, so it's better to always set mp_bus_not_pci and clear it only for PCI. Without this change, Dan's Dell PowerEdge 4200 panics on boot with a log indicating interrupt routing trouble unless the "noapic" option is supplied. With this change, the machine boots reliably without "noapic". Fixes http://bugs.debian.org/586494 Reported-bisected-and-tested-by: Dan McGrath <troubledaemon@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Dan McGrath <troubledaemon@gmail.com> Cc: Alexey Starikovskiy <aystarik@gmail.com> [jrnieder@gmail.com: clarified commit message] Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Link: http://lkml.kernel.org/r/20111122215000.GA9151@elie.hsd1.il.comcast.net Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-09sched, x86: Avoid unnecessary overflow in sched_clockSalman Qazi
commit 4cecf6d401a01d054afc1e5f605bcbfe553cb9b9 upstream. (Added the missing signed-off-by line) In hundreds of days, the __cycles_2_ns calculation in sched_clock has an overflow. cyc * per_cpu(cyc2ns, cpu) exceeds 64 bits, causing the final value to become zero. We can solve this without losing any precision. We can decompose TSC into quotient and remainder of division by the scale factor, and then use this to convert TSC into nanoseconds. Signed-off-by: Salman Qazi <sqazi@google.com> Acked-by: John Stultz <johnstul@us.ibm.com> Reviewed-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20111115221121.7262.88871.stgit@dungbeetle.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-21xen:pvhvm: enable PVHVM VCPU placement when using more than 32 CPUs.Zhenzhong Duan
commit 90d4f5534d14815bd94c10e8ceccc57287657ecc upstream. PVHVM running with more than 32 vcpus and pv_irq/pv_time enabled need VCPU placement to work, or else it will softlockup. Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-21x86, mrst: use a temporary variable for SFI irqMika Westerberg
commit 153b19a3b9fd8b9478495b9ee1f93f6a77c564f9 upstream. SFI tables reside in RAM and should not be modified once they are written. Current code went to set pentry->irq to zero which causes subsequent reads to fail with invalid SFI table checksum. This will break kexec as the second kernel fails to validate SFI tables. To fix this we use temporary variable for irq number. Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-21sfi: table irq 0xFF means 'no interrupt'Kirill A. Shutemov
commit a94cc4e6c0a26a7c8f79a432ab2c89534aa674d5 upstream. According to the SFI specification irq number 0xFF means device has no interrupt or interrupt attached via GPIO. Currently, we don't handle this special case and set irq field in *_board_info structs to 255. It leads to confusion in some drivers. Accelerometer driver tries to register interrupt 255, fails and prints "Cannot get IRQ" to dmesg. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11thp: share get_huge_page_tail()Andrea Arcangeli
commit b35a35b556f5e6b7993ad0baf20173e75c09ce8c upstream. This avoids duplicating the function in every arch gup_fast. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11mm: thp: tail page refcounting fixAndrea Arcangeli
commit 70b50f94f1644e2aa7cb374819cfd93f3c28d725 upstream. Michel while working on the working set estimation code, noticed that calling get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe, if the pfn ended up being a tail page of a transparent hugepage under splitting by __split_huge_page_refcount(). He then found the problem could also theoretically materialize with page_cache_get_speculative() during the speculative radix tree lookups that uses get_page_unless_zero() in SMP if the radix tree page is freed and reallocated and get_user_pages is called on it before page_cache_get_speculative has a chance to call get_page_unless_zero(). So the best way to fix the problem is to keep page_tail->_count zero at all times. This will guarantee that get_page_unless_zero() can never succeed on any tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail pages of a compound page, so we can simply account the tail page references there and transfer them to tail_page->_count in __split_huge_page_refcount() (in addition to the head_page->_mapcount). While debugging this s/_count/_mapcount/ change I also noticed get_page is called by direct-io.c on pages returned by get_user_pages. That wasn't entirely safe because the two atomic_inc in get_page weren't atomic. As opposed to other get_user_page users like secondary-MMU page fault to establish the shadow pagetables would never call any superflous get_page after get_user_page returns. It's safer to make get_page universally safe for tail pages and to use get_page_foll() within follow_page (inside get_user_pages()). get_page_foll() is safe to do the refcounting for tail pages without taking any locks because it is run within PT lock protected critical sections (PT lock for pte and page_table_lock for pmd_trans_huge). The standard get_page() as invoked by direct-io instead will now take the compound_lock but still only for tail pages. The direct-io paths are usually I/O bound and the compound_lock is per THP so very finegrined, so there's no risk of scalability issues with it. A simple direct-io benchmarks with all lockdep prove locking and spinlock debugging infrastructure enabled shows identical performance and no overhead. So it's worth it. Ideally direct-io should stop calling get_page() on pages returned by get_user_pages(). The spinlock in get_page() is already optimized away for no-THP builds but doing get_page() on tail pages returned by GUP is generally a rare operation and usually only run in I/O paths. This new refcounting on page_tail->_mapcount in addition to avoiding new RCU critical sections will also allow the working set estimation code to work without any further complexity associated to the tail page refcounting with THP. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Michel Lespinasse <walken@google.com> Reviewed-by: Michel Lespinasse <walken@google.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11iommu/amd: Fix wrong shift directionJoerg Roedel
commit fcd0861db1cf4e6ed99f60a815b7b72c2ed36ea4 upstream. The shift direction was wrong because the function takes a page number and i is the address is the loop. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11apic, i386/bigsmp: Fix false warnings regarding logical APIC ID mismatchesJan Beulich
commit 838312be46f3abfbdc175f81c3e54a857994476d upstream. These warnings (generally one per CPU) are a result of initializing x86_cpu_to_logical_apicid while apic_default is still in use, but the check in setup_local_APIC() being done when apic_bigsmp was already used as an override in default_setup_apic_routing(): Overriding APIC driver with bigsmp Enabling APIC mode: Physflat. Using 5 I/O APICs ------------[ cut here ]------------ WARNING: at .../arch/x86/kernel/apic/apic.c:1239 ... CPU 1 irqstacks, hard=f1c9a000 soft=f1c9c000 Booting Node 0, Processors #1 smpboot cpu 1: start_ip = 9e000 Initializing CPU#1 ------------[ cut here ]------------ WARNING: at .../arch/x86/kernel/apic/apic.c:1239 setup_local_APIC+0x137/0x46b() Hardware name: ... CPU1 logical APIC ID: 2 != 8 ... Fix this (for the time being, i.e. until x86_32_early_logical_apicid() will get removed again, as Tejun says ought to be possible) by overriding the previously stored values at the point where the APIC driver gets overridden. v2: Move this and the pre-existing override logic into arch/x86/kernel/apic/bigsmp_32.c. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/4E835D16020000780005844C@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11x86: Fix compilation bug in kprobes' twobyte_is_boostableJosh Stone
commit 315eb8a2a1b7f335d40ceeeb11b9e067475eb881 upstream. When compiling an i386_defconfig kernel with gcc-4.6.1-9.fc15.i686, I noticed a warning about the asm operand for test_bit in kprobes' can_boost. I discovered that this caused only the first long of twobyte_is_boostable[] to be output. Jakub filed and fixed gcc PR50571 to correct the warning and this output issue. But to solve it for less current gcc, we can make kprobes' twobyte_is_boostable[] non-const, and it won't be optimized out. Before: CC arch/x86/kernel/kprobes.o In file included from include/linux/bitops.h:22:0, from include/linux/kernel.h:17, from [...]/arch/x86/include/asm/percpu.h:44, from [...]/arch/x86/include/asm/current.h:5, from [...]/arch/x86/include/asm/processor.h:15, from [...]/arch/x86/include/asm/atomic.h:6, from include/linux/atomic.h:4, from include/linux/mutex.h:18, from include/linux/notifier.h:13, from include/linux/kprobes.h:34, from arch/x86/kernel/kprobes.c:43: [...]/arch/x86/include/asm/bitops.h: In function ‘can_boost.part.1’: [...]/arch/x86/include/asm/bitops.h:319:2: warning: use of memory input without lvalue in asm operand 1 is deprecated [enabled by default] $ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt 551: 0f a3 05 00 00 00 00 bt %eax,0x0 554: R_386_32 .rodata.cst4 $ objdump -s -j .rodata.cst4 -j .data arch/x86/kernel/kprobes.o arch/x86/kernel/kprobes.o: file format elf32-i386 Contents of section .data: 0000 48000000 00000000 00000000 00000000 H............... Contents of section .rodata.cst4: 0000 4c030000 L... Only a single long of twobyte_is_boostable[] is in the object file. After, without the const on twobyte_is_boostable: $ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt 551: 0f a3 05 20 00 00 00 bt %eax,0x20 554: R_386_32 .data $ objdump -s -j .rodata.cst4 -j .data arch/x86/kernel/kprobes.o arch/x86/kernel/kprobes.o: file format elf32-i386 Contents of section .data: 0000 48000000 00000000 00000000 00000000 H............... 0010 00000000 00000000 00000000 00000000 ................ 0020 4c030000 0f000200 ffff0000 ffcff0c0 L............... 0030 0000ffff 3bbbfff8 03ff2ebb 26bb2e77 ....;.......&..w Now all 32 bytes are output into .data instead. Signed-off-by: Josh Stone <jistone@redhat.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11x86: uv2: Workaround for UV2 Hub bug (system global address format)Jack Steiner
commit 6a469e4665bc158599de55d64388861d0a9f10f4 upstream. This is a workaround for a UV2 hub bug that affects the format of system global addresses. The GRU API for UV2 was inadvertently broken by a hardware change. The format of the physical address used for TLB dropins and for addresses used with instructions running in unmapped mode has changed. This change was not documented and became apparent only when diags failed running on system simulators. For UV1, TLB and GRU instruction physical addresses are identical to socket physical addresses (although high NASID bits must be OR'ed into the address). For UV2, socket physical addresses need to be converted. The NODE portion of the physical address needs to be shifted so that the low bit is in bit 39 or bit 40, depending on an MMR value. It is not yet clear if this bug will be fixed in a silicon respin. If it is fixed, the hub revision will be incremented & the workaround disabled. Signed-off-by: Jack Steiner <steiner@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-25x86: Fix S4 regressionTakashi Iwai
commit 8548c84da2f47e71bbbe300f55edb768492575f7 upstream. Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4 regression since 2.6.39, namely the machine reboots occasionally at S4 resume. It doesn't happen always, overall rate is about 1/20. But, like other bugs, once when this happens, it continues to happen. This patch fixes the problem by essentially reverting the memory assignment in the older way. Signed-off-by: Takashi Iwai <tiwai@suse.de> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Yinghai Lu <yinghai.lu@oracle.com> [ We'll hopefully find the real fix, but that's too late for 3.1 now ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-16x86/PCI: use host bridge _CRS info on ASUS M2V-MX SEPaul Menzel
commit 29cf7a30f8a0ce4af2406d93d5a332099be26923 upstream. In summary, this DMI quirk uses the _CRS info by default for the ASUS M2V-MX SE by turning on `pci=use_crs` and is similar to the quirk added by commit 2491762cfb47 ("x86/PCI: use host bridge _CRS info on ASRock ALiveSATA2-GLAN") whose commit message should be read for further information. Since commit 3e3da00c01d0 ("x86/pci: AMD one chain system to use pci read out res") Linux gives the following oops: parport0: PC-style at 0x378, irq 7 [PCSPP,TRISTATE] HDA Intel 0000:20:01.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17 HDA Intel 0000:20:01.0: setting latency timer to 64 BUG: unable to handle kernel paging request at ffffc90011c08000 IP: [<ffffffffa0578402>] azx_probe+0x3ad/0x86b [snd_hda_intel] PGD 13781a067 PUD 13781b067 PMD 1300ba067 PTE 800000fd00000173 Oops: 0009 [#1] SMP last sysfs file: /sys/module/snd_pcm/initstate CPU 0 Modules linked in: snd_hda_intel(+) snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event tpm_tis tpm snd_seq tpm_bios psmouse parport_pc snd_timer snd_seq_device parport processor evdev snd i2c_viapro thermal_sys amd64_edac_mod k8temp i2c_core soundcore shpchp pcspkr serio_raw asus_atk0110 pci_hotplug edac_core button snd_page_alloc edac_mce_amd ext3 jbd mbcache sha256_generic cryptd aes_x86_64 aes_generic cbc dm_crypt dm_mod raid1 md_mod usbhid hid sg sd_mod crc_t10dif sr_mod cdrom ata_generic uhci_hcd sata_via pata_via libata ehci_hcd usbcore scsi_mod via_rhine mii nls_base [last unloaded: scsi_wait_scan] Pid: 1153, comm: work_for_cpu Not tainted 2.6.37-1-amd64 #1 M2V-MX SE/System Product Name RIP: 0010:[<ffffffffa0578402>] [<ffffffffa0578402>] azx_probe+0x3ad/0x86b [snd_hda_intel] RSP: 0018:ffff88013153fe50 EFLAGS: 00010286 RAX: ffffc90011c08000 RBX: ffff88013029ec00 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246 RBP: ffff88013341d000 R08: 0000000000000000 R09: 0000000000000040 R10: 0000000000000286 R11: 0000000000003731 R12: ffff88013029c400 R13: 0000000000000000 R14: 0000000000000000 R15: ffff88013341d090 FS: 0000000000000000(0000) GS:ffff8800bfc00000(0000) knlGS:00000000f7610ab0 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffffc90011c08000 CR3: 0000000132f57000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process work_for_cpu (pid: 1153, threadinfo ffff88013153e000, task ffff8801303c86c0) Stack: 0000000000000005 ffffffff8123ad65 00000000000136c0 ffff88013029c400 ffff8801303c8998 ffff88013341d000 ffff88013341d090 ffff8801322d9dc8 ffff88013341d208 0000000000000000 0000000000000000 ffffffff811ad232 Call Trace: [<ffffffff8123ad65>] ? __pm_runtime_set_status+0x162/0x186 [<ffffffff811ad232>] ? local_pci_probe+0x49/0x92 [<ffffffff8105afc5>] ? do_work_for_cpu+0x0/0x1b [<ffffffff8105afc5>] ? do_work_for_cpu+0x0/0x1b [<ffffffff8105afd0>] ? do_work_for_cpu+0xb/0x1b [<ffffffff8105fd3f>] ? kthread+0x7a/0x82 [<ffffffff8100a824>] ? kernel_thread_helper+0x4/0x10 [<ffffffff8105fcc5>] ? kthread+0x0/0x82 [<ffffffff8100a820>] ? kernel_thread_helper+0x0/0x10 Code: f4 01 00 00 ef 31 f6 48 89 df e8 29 dd ff ff 85 c0 0f 88 2b 03 00 00 48 89 ef e8 b4 39 c3 e0 8b 7b 40 e8 fc 9d b1 e0 48 8b 43 38 <66> 8b 10 66 89 14 24 8b 43 14 83 e8 03 83 f8 01 77 32 31 d2 be RIP [<ffffffffa0578402>] azx_probe+0x3ad/0x86b [snd_hda_intel] RSP <ffff88013153fe50> CR2: ffffc90011c08000 ---[ end trace 8d1f3ebc136437fd ]--- Trusting the ACPI _CRS information (`pci=use_crs`) fixes this problem. $ dmesg | grep -i crs # with the quirk PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug The match has to be against the DMI board entries though since the vendor entries are not populated. DMI: System manufacturer System Product Name/M2V-MX SE, BIOS 0304 10/30/2007 This quirk should be removed when `pci=use_crs` is enabled for machines from 2006 or earlier or some other solution is implemented. Using coreboot [1] with this board the problem does not exist but this quirk also does not affect it either. To be safe though the check is tightened to only take effect when the BIOS from American Megatrends is used. 15:13 < ruik> but coreboot does not need that 15:13 < ruik> because i have there only one root bus 15:13 < ruik> the audio is behind a bridge $ sudo dmidecode BIOS Information Vendor: American Megatrends Inc. Version: 0304 Release Date: 10/30/2007 [1] http://www.coreboot.org/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=30552 Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: x86@kernel.org Signed-off-by: Paul Menzel <paulepanter@users.sourceforge.net> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03perf, x86: Add model 45 SandyBridge supportYouquan Song
commit a34668f6beb4ab01e07683276d6a24bab6c175e0 upstream. Add support to Romely-EP SandyBridge. Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Anhua Xu <anhua.xu@intel.com> Signed-off-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1312264895-2010-1-git-send-email-youquan.song@intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03xen/e820: if there is no dom0_mem=, don't tweak extra_pages.David Vrabel
commit e3b73c4a25e9a5705b4ef28b91676caf01f9bc9f upstream. The patch "xen: use maximum reservation to limit amount of usable RAM" (d312ae878b6aed3912e1acaaf5d0b2a9d08a4f11) breaks machines that do not use 'dom0_mem=' argument with: reserve RAM buffer: 000000133f2e2000 - 000000133fffffff (XEN) mm.c:4976:d0 Global bit is set to kernel page fffff8117e (XEN) domain_crash_sync called from entry.S (XEN) Domain 0 (vcpu#0) crashed on cpu#0: ... The reason being that the last E820 entry is created using the 'extra_pages' (which is based on how many pages have been freed). The mentioned git commit sets the initial value of 'extra_pages' using a hypercall which returns the number of pages (if dom0_mem has been used) or -1 otherwise. If the later we return with MAX_DOMAIN_PAGES as basis for calculation: return min(max_pages, MAX_DOMAIN_PAGES); and use it: extra_limit = xen_get_max_pages(); if (extra_limit >= max_pfn) extra_pages = extra_limit - max_pfn; else extra_pages = 0; which means we end up with extra_pages = 128GB in PFNs (33554432) - 8GB in PFNs (2097152, on this specific box, can be larger or smaller), and then we add that value to the E820 making it: Xen: 00000000ff000000 - 0000000100000000 (reserved) Xen: 0000000100000000 - 000000133f2e2000 (usable) which is clearly wrong. It should look as so: Xen: 00000000ff000000 - 0000000100000000 (reserved) Xen: 0000000100000000 - 000000027fbda000 (usable) Naturally this problem does not present itself if dom0_mem=max:X is used. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03xen: use maximum reservation to limit amount of usable RAMDavid Vrabel
commit d312ae878b6aed3912e1acaaf5d0b2a9d08a4f11 upstream. Use the domain's maximum reservation to limit the amount of extra RAM for the memory balloon. This reduces the size of the pages tables and the amount of reserved low memory (which defaults to about 1/32 of the total RAM). On a system with 8 GiB of RAM with the domain limited to 1 GiB the kernel reports: Before: Memory: 627792k/4472000k available After: Memory: 549740k/11132224k available A increase of about 76 MiB (~1.5% of the unused 7 GiB). The reserved low memory is also reduced from 253 MiB to 32 MiB. The total additional usable RAM is 329 MiB. For dom0, this requires at patch to Xen ('x86: use 'dom0_mem' to limit the number of pages for dom0') (c/s 23790) Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03iommu/amd: Make sure iommu->need_sync contains correct valueJoerg Roedel
commit f1ca1512e765337a7c09eb875eedef8ea4e07654 upstream. The value is only set to true but never set back to false, which causes to many completion-wait commands to be sent to hardware. Fix it with this patch. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03iommu/amd: Don't take domain->lock recursivlyJoerg Roedel
commit e33acde91140f1809952d1c135c36feb66a51887 upstream. The domain_flush_devices() function takes the domain->lock. But this function is only called from update_domain() which itself is already called unter the domain->lock. This causes a deadlock situation when the dma-address-space of a domain grows larger than 1GB. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03xen/smp: Warn user why they keel over - nosmp or noapic and what to use instead.Konrad Rzeszutek Wilk
commit ed467e69f16e6b480e2face7bc5963834d025f91 upstream. We have hit a couple of customer bugs where they would like to use those parameters to run an UP kernel - but both of those options turn of important sources of interrupt information so we end up not being able to boot. The correct way is to pass in 'dom0_max_vcpus=1' on the Xen hypervisor line and the kernel will patch itself to be a UP kernel. Fixes bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=637308 Acked-by: Ian Campbell <Ian.Campbell@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03xen: x86_32: do not enable iterrupts when returning from exception in ↵Igor Mammedov
interrupt context commit d198d499148a0c64a41b3aba9e7dd43772832b91 upstream. If vmalloc page_fault happens inside of interrupt handler with interrupts disabled then on exit path from exception handler when there is no pending interrupts, the following code (arch/x86/xen/xen-asm_32.S:112): cmpw $0x0001, XEN_vcpu_info_pending(%eax) sete XEN_vcpu_info_mask(%eax) will enable interrupts even if they has been previously disabled according to eflags from the bounce frame (arch/x86/xen/xen-asm_32.S:99) testb $X86_EFLAGS_IF>>8, 8+1+ESP_OFFSET(%esp) setz XEN_vcpu_info_mask(%eax) Solution is in setting XEN_vcpu_info_mask only when it should be set according to cmpw $0x0001, XEN_vcpu_info_pending(%eax) but not clearing it if there isn't any pending events. Reproducer for bug is attached to RHBZ 707552 Signed-off-by: Igor Mammedov <imammedo@redhat.com> Acked-by: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03x86, perf: Check that current->mm is alive before getting user callchainAndrey Vagin
commit 20afc60f892d285fde179ead4b24e6a7938c2f1b upstream. An event may occur when an mm is already released. I added an event in dequeue_entity() and caught a panic with the following backtrace: [ 434.421110] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 [ 434.421258] IP: [<ffffffff810464ac>] __get_user_pages_fast+0x9c/0x120 ... [ 434.421258] Call Trace: [ 434.421258] [<ffffffff8101ae81>] copy_from_user_nmi+0x51/0xf0 [ 434.421258] [<ffffffff8109a0d5>] ? sched_clock_local+0x25/0x90 [ 434.421258] [<ffffffff8101b048>] perf_callchain_user+0x128/0x170 [ 434.421258] [<ffffffff811154cd>] ? __perf_event_header__init_id+0xed/0x100 [ 434.421258] [<ffffffff81116690>] perf_prepare_sample+0x200/0x280 [ 434.421258] [<ffffffff81118da8>] __perf_event_overflow+0x1b8/0x290 [ 434.421258] [<ffffffff81065240>] ? tg_shares_up+0x0/0x670 [ 434.421258] [<ffffffff8104fe1a>] ? walk_tg_tree+0x6a/0xb0 [ 434.421258] [<ffffffff81118f44>] perf_swevent_overflow+0xc4/0xf0 [ 434.421258] [<ffffffff81119150>] do_perf_sw_event+0x1e0/0x250 [ 434.421258] [<ffffffff81119204>] perf_tp_event+0x44/0x70 [ 434.421258] [<ffffffff8105701f>] ftrace_profile_sched_block+0xdf/0x110 [ 434.421258] [<ffffffff8106121d>] dequeue_entity+0x2ad/0x2d0 [ 434.421258] [<ffffffff810614ec>] dequeue_task_fair+0x1c/0x60 [ 434.421258] [<ffffffff8105818a>] dequeue_task+0x9a/0xb0 [ 434.421258] [<ffffffff810581e2>] deactivate_task+0x42/0xe0 [ 434.421258] [<ffffffff814bc019>] thread_return+0x191/0x808 [ 434.421258] [<ffffffff81098a44>] ? switch_task_namespaces+0x24/0x60 [ 434.421258] [<ffffffff8106f4c4>] do_exit+0x464/0x910 [ 434.421258] [<ffffffff8106f9c8>] do_group_exit+0x58/0xd0 [ 434.421258] [<ffffffff8106fa57>] sys_exit_group+0x17/0x20 [ 434.421258] [<ffffffff8100b202>] system_call_fastpath+0x16/0x1b Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1314693156-24131-1-git-send-email-avagin@openvz.org Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29x86, UV: Remove UV delay in starting slave cpusJack Steiner
commit 05e33fc20ea5e493a2a1e7f1d04f43cdf89f83ed upstream. Delete the 10 msec delay between the INIT and SIPI when starting slave cpus. I can find no requirement for this delay. BIOS also has similar code sequences without the delay. Removing the delay reduces boot time by 40 sec. Every bit helps. Signed-off-by: Jack Steiner <steiner@sgi.com> Link: http://lkml.kernel.org/r/20110805140900.GA6774@sgi.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29x86-32, vdso: On system call restart after SYSENTER, use int $0x80H. Peter Anvin
commit 7ca0758cdb7c241cb4e0490a8d95f0eb5b861daf upstream. When we enter a 32-bit system call via SYSENTER or SYSCALL, we shuffle the arguments to match the int $0x80 calling convention. This was probably a design mistake, but it's what it is now. This causes errors if the system call as to be restarted. For SYSENTER, we have to invoke the instruction from the vdso as the return address is hardcoded. Accordingly, we can simply replace the jump in the vdso with an int $0x80 instruction and use the slower entry point for a post-restart. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Link: http://lkml.kernel.org/r/CA%2B55aFztZ=r5wa0x26KJQxvZOaQq8s2v3u50wCyJcA-Sc4g8gQ@mail.gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29x86, olpc: Wait for last byte of EC command to be acceptedPaul Fox
commit a3ea14df0e383f44dcb2e61badb71180dbffe526 upstream. When executing EC commands, only waiting when there are still more bytes to write is usually fine. However, if the system suspends very quickly after a call to olpc_ec_cmd(), the last data byte may not yet be transferred to the EC, and the command will not complete. This solves a bug where the SCI wakeup mask was not correctly written when going into suspend. It means that sometimes, on XO-1.5 (but not XO-1), the devices that were marked as wakeup sources can't wake up the system. e.g. you ask for wifi wakeups, suspend, but then incoming wifi frames don't wake up the system as they should. Signed-off-by: Paul Fox <pgf@laptop.org> Signed-off-by: Daniel Drake <dsd@laptop.org> Acked-by: Andres Salomon <dilinger@queued.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29xen: Do not enable PV IPIs when vector callback not presentStefano Stabellini
commit 3c05c4bed4ccce3f22f6d7899b308faae24ad198 upstream. Fix regression for HVM case on older (<4.1.1) hypervisors caused by commit 99bbb3a84a99cd04ab16b998b20f01a72cfa9f4f Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Date: Thu Dec 2 17:55:10 2010 +0000 xen: PV on HVM: support PV spinlocks and IPIs This change replaced the SMP operations with event based handlers without taking into account that this only works when the hypervisor supports callback vectors. This causes unexplainable hangs early on boot for HVM guests with more than one CPU. BugLink: http://bugs.launchpad.net/bugs/791850 Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Tested-and-Reported-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29xen/x86: replace order-based range checking of M2P table by linear oneJan Beulich
commit ccbcdf7cf1b5f6c6db30d84095b9c6c53043af55 upstream. The order-based approach is not only less efficient (requiring a shift and a compare, typical generated code looking like this mov eax, [machine_to_phys_order] mov ecx, eax shr ebx, cl test ebx, ebx jnz ... whereas a direct check requires just a compare, like in cmp ebx, [machine_to_phys_nr] jae ... ), but also slightly dangerous in the 32-on-64 case - the element address calculation can wrap if the next power of two boundary is sufficiently far away from the actual upper limit of the table, and hence can result in user space addresses being accessed (with it being unknown what may actually be mapped there). Additionally, the elimination of the mistaken use of fls() here (should have been __fls()) fixes a latent issue on x86-64 that would trigger if the code was run on a system with memory extending beyond the 44-bit boundary. Signed-off-by: Jan Beulich <jbeulich@novell.com> [v1: Based on Jeremy's feedback] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29x86, mtrr: lock stop machine during MTRR rendezvous sequenceSuresh Siddha
commit 6d3321e8e2b3bf6a5892e2ef673c7bf536e3f904 upstream. MTRR rendezvous sequence using stop_one_cpu_nowait() can potentially happen in parallel with another system wide rendezvous using stop_machine(). This can lead to deadlock (The order in which works are queued can be different on different cpu's. Some cpu's will be running the first rendezvous handler and others will be running the second rendezvous handler. Each set waiting for the other set to join for the system wide rendezvous, leading to a deadlock). MTRR rendezvous sequence is not implemented using stop_machine() as this gets called both from the process context aswell as the cpu online paths (where the cpu has not come online and the interrupts are disabled etc). stop_machine() works with only online cpus. For now, take the stop_machine mutex in the MTRR rendezvous sequence that gets called from an online cpu (here we are in the process context and can potentially sleep while taking the mutex). And the MTRR rendezvous that gets triggered during cpu online doesn't need to take this stop_machine lock (as the stop_machine() already ensures that there is no cpu hotplug going on in parallel by doing get_online_cpus()) TBD: Pursue a cleaner solution of extending the stop_machine() infrastructure to handle the case where the calling cpu is still not online and use this for MTRR rendezvous sequence. fixes: https://bugzilla.novell.com/show_bug.cgi?id=672008 Reported-by: Vadim Kotelnikov <vadimuzzz@inbox.ru> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/20110623182056.807230326@sbsiddha-MOBL3.sc.intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29x86, intel, power: Correct the MSR_IA32_ENERGY_PERF_BIAS messageLen Brown
commit 17edf2d79f1ea6dfdb4c444801d928953b9f98d6 upstream. Fix the printk_once() so that it actually prints (didn't print before due to a stray comma.) [ hpa: changed to an incremental patch and adjusted the description accordingly. ] Signed-off-by: Len Brown <len.brown@intel.com> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1107151732480.18606@x980 Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-15xen: allow enable use of VGA console on dom0Jeremy Fitzhardinge
commit c2419b4a4727f67af2fc2cd68b0d878b75e781bb upstream. Get the information about the VGA console hardware from Xen, and put it into the form the bootloader normally generates, so that the rest of the kernel can deal with VGA as usual. [ Impact: make VGA console work in dom0 ] Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> [v1: Rebased on 2.6.39] [v2: Removed incorrect comments and fixed compile warnings] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04oprofile, x86: Fix nmi-unsafe callgraph supportRobert Richter
commit a0e3e70243f5b270bc3eca718f0a9fa5e6b8262e upstream. Current oprofile's x86 callgraph support may trigger page faults throwing the BUG_ON(in_nmi()) message below. This patch fixes this by using the same nmi-safe copy-from-user code as in perf. ------------[ cut here ]------------ kernel BUG at .../arch/x86/kernel/traps.c:436! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/devices/pci0000:00/0000:00:0a.0/0000:07:00.0/0000:08:04.0/net/eth0/broadcast CPU 5 Modules linked in: Pid: 8611, comm: opcontrol Not tainted 2.6.39-00007-gfe47ae7 #1 Advanced Micro Device Anaheim/Anaheim RIP: 0010:[<ffffffff813e8e35>] [<ffffffff813e8e35>] do_nmi+0x22/0x1ee RSP: 0000:ffff88042fd47f28 EFLAGS: 00010002 RAX: ffff88042c0a7fd8 RBX: 0000000000000001 RCX: 00000000c0000101 RDX: 00000000ffff8804 RSI: ffffffffffffffff RDI: ffff88042fd47f58 RBP: ffff88042fd47f48 R08: 0000000000000004 R09: 0000000000001484 R10: 0000000000000001 R11: 0000000000000000 R12: ffff88042fd47f58 R13: 0000000000000000 R14: ffff88042fd47d98 R15: 0000000000000020 FS: 00007fca25e56700(0000) GS:ffff88042fd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000074 CR3: 000000042d28b000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process opcontrol (pid: 8611, threadinfo ffff88042c0a6000, task ffff88042c532310) Stack: 0000000000000000 0000000000000001 ffff88042c0a7fd8 0000000000000000 ffff88042fd47de8 ffffffff813e897a 0000000000000020 ffff88042fd47d98 0000000000000000 ffff88042c0a7fd8 ffff88042fd47de8 0000000000000074 Call Trace: <NMI> [<ffffffff813e897a>] nmi+0x1a/0x20 [<ffffffff813f08ab>] ? bad_to_user+0x25/0x771 <<EOE>> Code: ff 59 5b 41 5c 41 5d c9 c3 55 65 48 8b 04 25 88 b5 00 00 48 89 e5 41 55 41 54 49 89 fc 53 48 83 ec 08 f6 80 47 e0 ff ff 04 74 04 <0f> 0b eb fe 81 80 44 e0 ff ff 00 00 01 04 65 ff 04 25 c4 0f 01 RIP [<ffffffff813e8e35>] do_nmi+0x22/0x1ee RSP <ffff88042fd47f28> ---[ end trace ed6752185092104b ]--- Kernel panic - not syncing: Fatal exception in interrupt Pid: 8611, comm: opcontrol Tainted: G D 2.6.39-00007-gfe47ae7 #1 Call Trace: <NMI> [<ffffffff813e5e0a>] panic+0x8c/0x188 [<ffffffff813e915c>] oops_end+0x81/0x8e [<ffffffff8100403d>] die+0x55/0x5e [<ffffffff813e8c45>] do_trap+0x11c/0x12b [<ffffffff810023c8>] do_invalid_op+0x91/0x9a [<ffffffff813e8e35>] ? do_nmi+0x22/0x1ee [<ffffffff8131e6fa>] ? oprofile_add_sample+0x83/0x95 [<ffffffff81321670>] ? op_amd_check_ctrs+0x4f/0x2cf [<ffffffff813ee4d5>] invalid_op+0x15/0x20 [<ffffffff813e8e35>] ? do_nmi+0x22/0x1ee [<ffffffff813e8e7a>] ? do_nmi+0x67/0x1ee [<ffffffff813e897a>] nmi+0x1a/0x20 [<ffffffff813f08ab>] ? bad_to_user+0x25/0x771 <<EOE>> Cc: John Lumby <johnlumby@hotmail.com> Cc: Maynard Johnson <maynardj@us.ibm.com> Signed-off-by: Robert Richter <robert.richter@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04kexec, x86: Fix incorrect jump back address if not preserving contextHuang Ying
commit 050438ed5a05b25cdf287f5691e56a58c2606997 upstream. In kexec jump support, jump back address passed to the kexeced kernel via function calling ABI, that is, the function call return address is the jump back entry. Furthermore, jump back entry == 0 should be used to signal that the jump back or preserve context is not enabled in the original kernel. But in the current implementation the stack position used for function call return address is not cleared context preservation is disabled. The patch fixes this bug. Reported-and-tested-by: Yin Kangkai <kangkai.yin@intel.com> Signed-off-by: Huang Ying <ying.huang@intel.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Link: http://lkml.kernel.org/r/1310607277-25029-1-git-send-email-ying.huang@intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04x86, intel, power: Initialize MSR_IA32_ENERGY_PERF_BIASLen Brown
commit abe48b108247e9b90b4c6739662a2e5c765ed114 upstream. Since 2.6.36 (23016bf0d25), Linux prints the existence of "epb" in /proc/cpuinfo, Since 2.6.38 (d5532ee7b40), the x86_energy_perf_policy(8) utility has been available in-tree to update MSR_IA32_ENERGY_PERF_BIAS. However, the typical BIOS fails to initialize the MSR, presumably because this is handled by high-volume shrink-wrap operating systems... Linux distros, on the other hand, do not yet invoke x86_energy_perf_policy(8). As a result, WSM-EP, SNB, and later hardware from Intel will run in its default hardware power-on state (performance), which assumes that users care for performance at all costs and not for energy efficiency. While that is fine for performance benchmarks, the hardware's intended default operating point is "normal" mode... Initialize the MSR to the "normal" by default during kernel boot. x86_energy_perf_policy(8) is available to change the default after boot, should the user have a different preference. Signed-off-by: Len Brown <len.brown@intel.com> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1107140051020.18606@x980 Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-07-21x86: Make Dell Latitude E6420 use reboot=pciH. Peter Anvin
Yet another variant of the Dell Latitude series which requires reboot=pci. From the E5420 bug report by Daniel J Blueman: > The E6420 is affected also (same platform, different casing and > features), which provides an external confirmation of the issue; I can > submit a patch for that later or include it if you prefer: > http://linux.koolsolutions.com/2009/08/04/howto-fix-linux-hangfreeze-during-reboots-and-restarts/ Reported-by: Daniel J Blueman <daniel.blueman@gmail.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: <stable@kernel.org>
2011-07-21x86: Make Dell Latitude E5420 use reboot=pciDaniel J Blueman
Rebooting on the Dell E5420 often hangs with the keyboard or ACPI methods, but is reliable via the PCI method. [ hpa: this was deferred because we believed for a long time that the recent reshuffling of the boot priorities in commit 660e34cebf0a11d54f2d5dd8838607452355f321 fixed this platform. Unfortunately that turned out to be incorrect. ] Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com> Link: http://lkml.kernel.org/r/1305248699-2347-1-git-send-email-daniel.blueman@gmail.com Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: <stable@kernel.org>
2011-07-20Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86. reboot: Make Dell Latitude E6320 use reboot=pci x86, doc only: Correct real-mode kernel header offset for init_size x86: Disable AMD_NUMA for 32bit for now
2011-07-12x86. reboot: Make Dell Latitude E6320 use reboot=pciMaxime Ripard
The Dell Latitude E6320 doesn't reboot unless reboot=pci is set. Force it thanks to DMI. Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com> Link: http://lkml.kernel.org/r/1309269451-4966-1-git-send-email-maxime.ripard@free-electrons.com Cc: Matthew Garrett <mjg@redhat.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-07-12mm: Move definition of MIN_MEMORY_BLOCK_SIZE to a headerBenjamin Herrenschmidt
The macro MIN_MEMORY_BLOCK_SIZE is currently defined twice in two .c files, and I need it in a third one to fix a powerpc bug, so let's first move it into a header Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Ingo Molnar <mingo@elte.hu>
2011-07-11x86: Disable AMD_NUMA for 32bit for nowTejun Heo
Commit 2706a0bf7b ("x86, NUMA: Enable CONFIG_AMD_NUMA on 32bit too") enabled AMD NUMA for 32bit too. Unfortunately, SPARSEMEM on 32bit had rather coarse (512MiB) addr->node mapping granularity due to lack of space in page->flags. This led to boot failure on certain AMD NUMA machines which had 128MiB alignment on nodes. Patches to properly detect this condition and reject NUMA configuration are posted[1] but deemed too pervasive for merge at this point (-rc6). Disable AMD NUMA for 32bit for now and re-enable once the detection logic is merged. [1] http://thread.gmane.org/gmane.linux.kernel/1161279/focus=1162583 Reported-by: Hans Rosenfeld <hans.rosenfeld@amd.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Conny Seidel <conny.seidel@amd.com> Link: http://lkml.kernel.org/r/20110711083432.GC943@htj.dyndns.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-07Merge branch 'stable/bug.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen * 'stable/bug.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/pci: Move check for acpi_sci_override_gsi to xen_setup_acpi_sci.
2011-07-07Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Don't use the EFI reboot method by default x86, suspend: Restore MISC_ENABLE MSR in realmode wakeup x86, reboot: Acer Aspire One A110 reboot quirk x86-32, NUMA: Fix boot regression caused by NUMA init unification on highmem machines
2011-07-07Merge branches 'core-urgent-for-linus', 'perf-urgent-for-linus' and ↵Linus Torvalds
'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: debugobjects: Fix boot crash when kmemleak and debugobjects enabled * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: jump_label: Fix jump_label update for modules oprofile, x86: Fix race in nmi handler while starting counters * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Disable (revert) SCHED_LOAD_SCALE increase sched, cgroups: Fix MIN_SHARES on 64-bit boxen
2011-07-07xen/pci: Move check for acpi_sci_override_gsi to xen_setup_acpi_sci.Konrad Rzeszutek Wilk
Previously we would check for acpi_sci_override_gsi == gsi every time a PCI device was enabled. That works during early bootup, but later on it could lead to triggering unnecessarily the acpi_gsi_to_irq(..) lookup. The reason is that acpi_sci_override_gsi was declared in __initdata and after early bootup could contain bogus values. This patch moves the check for acpi_sci_override_gsi to the site where the ACPI SCI is preset. CC: stable@kernel.org Reported-by: Raghavendra D Prabhu <rprabhu@wnohang.net> Tested-by: Raghavendra D Prabhu <rprabhu@wnohang.net> [http://lists.xensource.com/archives/html/xen-devel/2011-07/msg00154.html] Suggested-by: Ian Campbell <ijc@hellion.org.uk> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>