summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/x86.c
AgeCommit message (Collapse)Author
2012-05-06KVM: ensure async PF event wakes up vcpu from haltGleb Natapov
If vcpu executes hlt instruction while async PF is waiting to be delivered vcpu can block and deliver async PF only after another even wakes it up. This happens because kvm_check_async_pf_completion() will remove completion event from vcpu->async_pf.done before entering kvm_vcpu_block() and this will make kvm_arch_vcpu_runnable() return false. The solution is to make vcpu runnable when processing completion. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-21kill mm argument of vm_munmap()Al Viro
it's always current->mm Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-04-20VM: add "vm_mmap()" helper functionLinus Torvalds
This continues the theme started with vm_brk() and vm_munmap(): vm_mmap() does the same thing as do_mmap(), but additionally does the required VM locking. This uninlines (and rewrites it to be clearer) do_mmap(), which sadly duplicates it in mm/mmap.c and mm/nommu.c. But that way we don't have to export our internal do_mmap_pgoff() function. Some day we hopefully don't have to export do_mmap() either, if all modular users can become the simpler vm_mmap() instead. We're actually very close to that already, with the notable exception of the (broken) use in i810, and a couple of stragglers in binfmt_elf. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-20VM: add "vm_munmap()" helper functionLinus Torvalds
Like the vm_brk() function, this is the same as "do_munmap()", except it does the VM locking for the caller. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-28Merge branch 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Avi Kivity: "Changes include timekeeping improvements, support for assigning host PCI devices that share interrupt lines, s390 user-controlled guests, a large ppc update, and random fixes." This is with the sign-off's fixed, hopefully next merge window we won't have rebased commits. * 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits) KVM: Convert intx_mask_lock to spin lock KVM: x86: fix kvm_write_tsc() TSC matching thinko x86: kvmclock: abstract save/restore sched_clock_state KVM: nVMX: Fix erroneous exception bitmap check KVM: Ignore the writes to MSR_K7_HWCR(3) KVM: MMU: make use of ->root_level in reset_rsvds_bits_mask KVM: PMU: add proper support for fixed counter 2 KVM: PMU: Fix raw event check KVM: PMU: warn when pin control is set in eventsel msr KVM: VMX: Fix delayed load of shared MSRs KVM: use correct tlbs dirty type in cmpxchg KVM: Allow host IRQ sharing for assigned PCI 2.3 devices KVM: Ensure all vcpus are consistent with in-kernel irqchip settings KVM: x86 emulator: Allow PM/VM86 switch during task switch KVM: SVM: Fix CPL updates KVM: x86 emulator: VM86 segments must have DPL 3 KVM: x86 emulator: Fix task switch privilege checks arch/powerpc/kvm/book3s_hv.c: included linux/sched.h twice KVM: x86 emulator: correctly mask pmc index bits in RDPMC instruction emulation KVM: mmu_notifier: Flush TLBs before releasing mmu_lock ...
2012-03-22Merge branch 'x86-fpu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/fpu changes from Ingo Molnar. * 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: i387: Split up <asm/i387.h> into exported and internal interfaces i387: Uninline the generic FP helpers that we expose to kernel modules
2012-03-20x86: remove the second argument of k[un]map_atomic()Cong Wang
Acked-by: Avi Kivity <avi@redhat.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20KVM: x86: fix kvm_write_tsc() TSC matching thinkoMarcelo Tosatti
kvm_write_tsc() converts from guest TSC to microseconds, not nanoseconds as intended. The result is that the window for matching is 1000 seconds, not 1 second. Microsecond precision is enough for checking whether the TSC write delta is within the heuristic values, so use it instead of nanoseconds. Noted by Avi Kivity. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Ignore the writes to MSR_K7_HWCR(3)Nicolae Mogoreanu
When CPUID Fn8000_0001_EAX reports 0x00100f22 Windows 7 x64 guest tries to set bit 3 in MSRC001_0015 in nt!KiDisableCacheErrataSource and fails. This patch will ignore this step and allow things to move on without having to fake CPUID value. Signed-off-by: Nicolae Mogoreanu <mogoreanu@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Allow host IRQ sharing for assigned PCI 2.3 devicesJan Kiszka
PCI 2.3 allows to generically disable IRQ sources at device level. This enables us to share legacy IRQs of such devices with other host devices when passing them to a guest. The new IRQ sharing feature introduced here is optional, user space has to request it explicitly. Moreover, user space can inform us about its view of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the interrupt and signaling it if the guest masked it via the virtualized PCI config space. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Ensure all vcpus are consistent with in-kernel irqchip settingsAvi Kivity
If some vcpus are created before KVM_CREATE_IRQCHIP, then irqchip_in_kernel() and vcpu->arch.apic will be inconsistent, leading to potential NULL pointer dereferences. Fix by: - ensuring that no vcpus are installed when KVM_CREATE_IRQCHIP is called - ensuring that a vcpu has an apic if it is installed after KVM_CREATE_IRQCHIP This is somewhat long winded because vcpu->arch.apic is created without kvm->lock held. Based on earlier patch by Michael Ellerman. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: x86 emulator: Allow PM/VM86 switch during task switchKevin Wolf
Task switches can switch between Protected Mode and VM86. The current mode must be updated during the task switch emulation so that the new segment selectors are interpreted correctly. In order to let privilege checks succeed, rflags needs to be updated in the vcpu struct as this causes a CPL update. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: x86 emulator: Fix task switch privilege checksKevin Wolf
Currently, all task switches check privileges against the DPL of the TSS. This is only correct for jmp/call to a TSS. If a task gate is used, the DPL of this take gate is used for the check instead. Exceptions, external interrupts and iret shouldn't perform any check. [avi: kill kvm-kmod remnants] Signed-off-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Introduce kvm_memory_slot::arch and move lpage_info into itTakuya Yoshikawa
Some members of kvm_memory_slot are not used by every architecture. This patch is the first step to make this difference clear by introducing kvm_memory_slot::arch; lpage_info is moved into it. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Fix write protection race during dirty loggingTakuya Yoshikawa
This patch fixes a race introduced by: commit 95d4c16ce78cb6b7549a09159c409d52ddd18dae KVM: Optimize dirty logging by rmap_write_protect() During protecting pages for dirty logging, other threads may also try to protect a page in mmu_sync_children() or kvm_mmu_get_page(). In such a case, because get_dirty_log releases mmu_lock before flushing TLB's, the following race condition can happen: A (get_dirty_log) B (another thread) lock(mmu_lock) clear pte.w unlock(mmu_lock) lock(mmu_lock) pte.w is already cleared unlock(mmu_lock) skip TLB flush return ... TLB flush Though thread B assumes the page has already been protected when it returns, the remaining TLB entry will break that assumption. This patch fixes this problem by making get_dirty_log hold the mmu_lock until it flushes the TLB's. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Track TSC synchronization in generationsZachary Amsden
This allows us to track the original nanosecond and counter values at each phase of TSC writing by the guest. This gets us perfect offset matching for stable TSC systems, and perfect software computed TSC matching for machines with unstable TSC. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Dont mark TSC unstable due to S4 suspendZachary Amsden
During a host suspend, TSC may go backwards, which KVM interprets as an unstable TSC. Technically, KVM should not be marking the TSC unstable, which causes the TSC clocksource to go bad, but we need to be adjusting the TSC offsets in such a case. Dealing with this issue is a little tricky as the only place we can reliably do it is before much of the timekeeping infrastructure is up and running. On top of this, we are not in a KVM thread context, so we may not be able to safely access VCPU fields. Instead, we compute our best known hardware offset at power-up and stash it to be applied to all VCPUs when they actually start running. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Allow adjust_tsc_offset to be in host or guest cyclesMarcelo Tosatti
Redefine the API to take a parameter indicating whether an adjustment is in host or guest cycles. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Add last_host_tsc tracking back to KVMZachary Amsden
The variable last_host_tsc was removed from upstream code. I am adding it back for two reasons. First, it is unnecessary to use guest TSC computation to conclude information about the host TSC. The guest may set the TSC backwards (this case handled by the previous patch), but the computation of guest TSC (and fetching an MSR) is significanlty more work and complexity than simply reading the hardware counter. In addition, we don't actually need the guest TSC for any part of the computation, by always recomputing the offset, we can eliminate the need to deal with the current offset and any scaling factors that may apply. The second reason is that later on, we are going to be using the host TSC value to restore TSC offsets after a host S4 suspend, so we need to be reading the host values, not the guest values here. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Fix last_guest_tsc / tsc_offset semanticsZachary Amsden
The variable last_guest_tsc was being used as an ad-hoc indicator that guest TSC has been initialized and recorded correctly. However, it may not have been, it could be that guest TSC has been set to some large value, the back to a small value (by, say, a software reboot). This defeats the logic and causes KVM to falsely assume that the guest TSC has gone backwards, marking the host TSC unstable, which is undesirable behavior. In addition, rather than try to compute an offset adjustment for the TSC on unstable platforms, just recompute the whole offset. This allows us to get rid of one callsite for adjust_tsc_offset, which is problematic because the units it takes are in guest units, but here, the computation was originally being done in host units. Doing this, and also recording last_guest_tsc when the TSC is written allow us to remove the tricky logic which depended on last_guest_tsc being zero to indicate a reset of uninitialized value. Instead, we now have the guarantee that the guest TSC offset is always at least something which will get us last_guest_tsc. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Leave TSC synchronization window open with each new syncZachary Amsden
Currently, when the TSC is written by the guest, the variable ns is updated to force the current write to appear to have taken place at the time of the first write in this sync phase. This leaves a cliff at the end of the match window where updates will fall of the end. There are two scenarios where this can be a problem in practe - first, on a system with a large number of VCPUs, the sync period may last for an extended period of time. The second way this can happen is if the VM reboots very rapidly and we catch a VCPU TSC synchronization just around the edge. We may be unaware of the reboot, and thus the first VCPU might synchronize with an old set of the timer (at, say 0.97 seconds ago, when first powered on). The second VCPU can come in 0.04 seconds later to try to synchronize, but it misses the window because it is just over the threshold. Instead, stop doing this artificial setback of the ns variable and just update it with every write of the TSC. It may be observed that doing so causes values computed by compute_guest_tsc to diverge slightly across CPUs - note that the last_tsc_ns and last_tsc_write variable are used here, and now they last_tsc_ns will be different for each VCPU, reflecting the actual time of the update. However, compute_guest_tsc is used only for guests which already have TSC stability issues, and further, note that the previous patch has caused last_tsc_write to be incremented by the difference in nanoseconds, converted back into guest cycles. As such, only boundary rounding errors should be visible, which given the resolution in nanoseconds, is going to only be a few cycles and only visible in cross-CPU consistency tests. The problem can be fixed by adding a new set of variables to track the start offset and start write value for the current sync cycle. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Improve TSC offset matchingZachary Amsden
There are a few improvements that can be made to the TSC offset matching code. First, we don't need to call the 128-bit multiply (especially on a constant number), the code works much nicer to do computation in nanosecond units. Second, the way everything is setup with software TSC rate scaling, we currently have per-cpu rates. Obviously this isn't too desirable to use in practice, but if for some reason we do change the rate of all VCPUs at runtime, then reset the TSCs, we will only want to match offsets for VCPUs running at the same rate. Finally, for the case where we have an unstable host TSC, but rate scaling is being done in hardware, we should call the platform code to compute the TSC offset, so the math is reorganized to recompute the base instead, then transform the base into an offset using the existing API. [avi: fix 64-bit division on i386] Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> KVM: Fix 64-bit division in kvm_write_tsc() Breaks i386 build. Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Infrastructure for software and hardware based TSC rate scalingZachary Amsden
This requires some restructuring; rather than use 'virtual_tsc_khz' to indicate whether hardware rate scaling is in effect, we consider each VCPU to always have a virtual TSC rate. Instead, there is new logic above the vendor-specific hardware scaling that decides whether it is even necessary to use and updates all rate variables used by common code. This means we can simply query the virtual rate at any point, which is needed for software rate scaling. There is also now a threshold added to the TSC rate scaling; minor differences and variations of measured TSC rate can accidentally provoke rate scaling to be used when it is not needed. Instead, we have a tolerance variable called tsc_tolerance_ppm, which is the maximum variation from user requested rate at which scaling will be used. The default is 250ppm, which is the half the threshold for NTP adjustment, allowing for some hardware variation. In the event that hardware rate scaling is not available, we can kludge a bit by forcing TSC catchup to turn on when a faster than hardware speed has been requested, but there is nothing available yet for the reverse case; this requires a trap and emulate software implementation for RDTSC, which is still forthcoming. [avi: fix 64-bit division on i386] Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: SVM: Add support for AMD's OSVW feature in guestsBoris Ostrovsky
In some cases guests should not provide workarounds for errata even when the physical processor is affected. For example, because of erratum 400 on family 10h processors a Linux guest will read an MSR (resulting in VMEXIT) before going to idle in order to avoid getting stuck in a non-C0 state. This is not necessary: HLT and IO instructions are intercepted and therefore there is no reason for erratum 400 workaround in the guest. This patch allows us to present a guest with certain errata as fixed, regardless of the state of actual hardware. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: s390: ucontrol: export SIE control block to userCarsten Otte
This patch exports the s390 SIE hardware control block to userspace via the mapping of the vcpu file descriptor. In order to do so, a new arch callback named kvm_arch_vcpu_fault is introduced for all architectures. It allows to map architecture specific pages. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: s390: add parameter for KVM_CREATE_VMCarsten Otte
This patch introduces a new config option for user controlled kernel virtual machines. It introduces a parameter to KVM_CREATE_VM that allows to set bits that alter the capabilities of the newly created virtual machine. The parameter is passed to kvm_arch_init_vm for all architectures. The only valid modifier bit for now is KVM_VM_S390_UCONTROL. This requires CAP_SYS_ADMIN privileges and creates a user controlled virtual machine on s390 architectures. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-02-21i387: Split up <asm/i387.h> into exported and internal interfacesLinus Torvalds
While various modules include <asm/i387.h> to get access to things we actually *intend* for them to use, most of that header file was really pretty low-level internal stuff that we really don't want to expose to others. So split the header file into two: the small exported interfaces remain in <asm/i387.h>, while the internal definitions that are only used by core architecture code are now in <asm/fpu-internal.h>. The guiding principle for this was to expose functions that we export to modules, and leave them in <asm/i387.h>, while stuff that is used by task switching or was marked GPL-only is in <asm/fpu-internal.h>. The fpu-internal.h file could be further split up too, especially since arch/x86/kvm/ uses some of the remaining stuff for its module. But that kvm usage should probably be abstracted out a bit, and at least now the internal FPU accessor functions are much more contained. Even if it isn't perhaps as contained as it _could_ be. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211340330.5354@i5.linux-foundation.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-02-01KVM: do not #GP on perf MSR writes when vPMU is disabledGleb Natapov
Return to behaviour perf MSR had before introducing vPMU in case vPMU is disabled. Some guests access those registers unconditionally and do not expect it to fail. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-02-01KVM: x86: extend "struct x86_emulate_ops" with "get_cpuid"Stephan Bärwolf
In order to be able to proceed checks on CPU-specific properties within the emulator, function "get_cpuid" is introduced. With "get_cpuid" it is possible to virtually call the guests "cpuid"-opcode without changing the VM's context. [mtosatti: cleanup/beautify code] Signed-off-by: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-01-13module_param: make bool parameters really bool (arch)Rusty Russell
module_param(bool) used to counter-intuitively take an int. In fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy trick. It's time to remove the int/unsigned int option. For this version it'll simply give a warning, but it'll break next kernel version. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2011-12-27KVM: x86 emulator: implement RDPMC (0F 33)Avi Kivity
Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: Add generic RDPMC supportAvi Kivity
Add a helper function that emulates the RDPMC instruction operation. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: Expose a version 2 architectural PMU to a guestsGleb Natapov
Use perf_events to emulate an architectural PMU, version 2. Based on PMU version 1 emulation by Avi Kivity. [avi: adjust for cpuid.c] [jan: fix anonymous field initialization for older gcc] Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: Use memdup_user instead of kmalloc/copy_from_userSasha Levin
Switch to using memdup_user when possible. This makes code more smaller and compact, and prevents errors. Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: Use kmemdup() instead of kmalloc/memcpySasha Levin
Switch to kmemdup() in two places to shorten the code and avoid possible bugs. Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: use this_cpu_xxx replace percpu_xxx funcsAlex,Shi
percpu_xxx funcs are duplicated with this_cpu_xxx funcs, so replace them for further code clean up. And in preempt safe scenario, __this_cpu_xxx funcs has a bit better performance since __this_cpu_xxx has no redundant preempt_disable() Signed-off-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: MMU: move the relevant mmu code to mmu.cXiao Guangrong
Move the mmu code in kvm_arch_vcpu_init() to kvm_mmu_create() Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: x86: remove the dead code of KVM_EXIT_HYPERCALLXiao Guangrong
KVM_EXIT_HYPERCALL is not used anymore, so remove the code Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: Move cpuid code to new fileAvi Kivity
The cpuid code has grown; put it into a separate file. Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: introduce id_to_memslot functionXiao Guangrong
Introduce id_to_memslot to get memslot by slot id Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: introduce update_memslots functionXiao Guangrong
Introduce update_memslots to update slot which will be update to kvm->memslots Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: Optimize dirty logging by rmap_write_protect()Takuya Yoshikawa
Currently, write protecting a slot needs to walk all the shadow pages and checks ones which have a pte mapping a page in it. The walk is overly heavy when dirty pages in that slot are not so many and checking the shadow pages would result in unwanted cache pollution. To mitigate this problem, we use rmap_write_protect() and check only the sptes which can be reached from gfns marked in the dirty bitmap when the number of dirty pages are less than that of shadow pages. This criterion is reasonable in its meaning and worked well in our test: write protection became some times faster than before when the ratio of dirty pages are low and was not worse even when the ratio was near the criterion. Note that the locking for this write protection becomes fine grained. The reason why this is safe is descripted in the comments. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: Count the number of dirty pages for dirty loggingTakuya Yoshikawa
Needed for the next patch which uses this number to decide how to write protect a slot. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: MMU: remove KVM host pv mmu supportChris Wright
The host side pv mmu support has been marked for feature removal in January 2011. It's not in use, is slower than shadow or hardware assisted paging, and a maintenance burden. It's November 2011, time to remove it. Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: MMU: fast prefetch spte on invlpg pathXiao Guangrong
Fast prefetch spte for the unsync shadow page on invlpg path Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: x86: cleanup port-in/port-out emulatedXiao Guangrong
Remove the same code between emulator_pio_in_emulated and emulator_pio_out_emulated Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: x86: retry non-page-table writing instructionsXiao Guangrong
If the emulation is caused by #PF and it is non-page_table writing instruction, it means the VM-EXIT is caused by shadow page protected, we can zap the shadow page and retry this instruction directly The idea is from Avi Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: nVMX: Add KVM_REQ_IMMEDIATE_EXITNadav Har'El
This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT. This bit requests that when next entering the guest, we should run it only for as little as possible, and exit again. We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1 to continue running so it can inject an event to it, we unfortunately cannot just pretend to have run L2 for a little while - We must really launch L2, otherwise certain one-off vmcs12 parameters (namely, L1 injection into L2) will be lost. So the existing code runs L2 in this case. But L2 could potentially run for a long time until it exits, and the injection into L1 will be delayed. The new KVM_REQ_IMMEDIATE_EXIT allows us to request that L2 will be entered, as necessary, but will exit as soon as possible after entry. Our implementation of this request uses smp_send_reschedule() to send a self-IPI, with interrupts disabled. The interrupts remain disabled until the guest is entered, and then, after the entry is complete (often including processing an injection and jumping to the relevant handler), the physical interrupt is noticed and causes an exit. On recent Intel processors, we could have achieved the same goal by using MTF instead of a self-IPI. Another technique worth considering in the future is to use VM_EXIT_ACK_INTR_ON_EXIT and a highest-priority vector IPI - to slightly improve performance by avoiding the useless interrupt handler which ends up being called when smp_send_reschedule() is used. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-26KVM: Don't automatically expose the TSC deadline timer in cpuidJan Kiszka
Unlike all of the other cpuid bits, the TSC deadline timer bit is set unconditionally, regardless of what userspace wants. This is broken in several ways: - if userspace doesn't use KVM_CREATE_IRQCHIP, and doesn't emulate the TSC deadline timer feature, a guest that uses the feature will break - live migration to older host kernels that don't support the TSC deadline timer will cause the feature to be pulled from under the guest's feet; breaking it - guests that are broken wrt the feature will fail. Fix by not enabling the feature automatically; instead report it to userspace. Because the feature depends on KVM_CREATE_IRQCHIP, which we cannot guarantee will be called, we expose it via a KVM_CAP_TSC_DEADLINE_TIMER and not KVM_GET_SUPPORTED_CPUID. Fixes the Illumos guest kernel, which uses the TSC deadline timer feature. [avi: add the KVM_CAP + documentation] Reported-by: Alexey Zaytsev <alexey.zaytsev@gmail.com> Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-10-30Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommuLinus Torvalds
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (33 commits) iommu/core: Remove global iommu_ops and register_iommu iommu/msm: Use bus_set_iommu instead of register_iommu iommu/omap: Use bus_set_iommu instead of register_iommu iommu/vt-d: Use bus_set_iommu instead of register_iommu iommu/amd: Use bus_set_iommu instead of register_iommu iommu/core: Use bus->iommu_ops in the iommu-api iommu/core: Convert iommu_found to iommu_present iommu/core: Add bus_type parameter to iommu_domain_alloc Driver core: Add iommu_ops to bus_type iommu/core: Define iommu_ops and register_iommu only with CONFIG_IOMMU_API iommu/amd: Fix wrong shift direction iommu/omap: always provide iommu debug code iommu/core: let drivers know if an iommu fault handler isn't installed iommu/core: export iommu_set_fault_handler() iommu/omap: Fix build error with !IOMMU_SUPPORT iommu/omap: Migrate to the generic fault report mechanism iommu/core: Add fault reporting mechanism iommu/core: Use PAGE_SIZE instead of hard-coded value iommu/core: use the existing IS_ALIGNED macro iommu/msm: ->unmap() should return order of unmapped page ... Fixup trivial conflicts in drivers/iommu/Makefile: "move omap iommu to dedicated iommu folder" vs "Rename the DMAR and INTR_REMAP config options" just happened to touch lines next to each other.