summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2012-02-03block: fail SCSI passthrough ioctls on partition devicesPaolo Bonzini
commit 0bfc96cb77224736dfa35c3c555d37b3646ef35e upstream. [ Changes with respect to 3.3: return -ENOTTY from scsi_verify_blk_ioctl and -ENOIOCTLCMD from sd_compat_ioctl. ] Linux allows executing the SG_IO ioctl on a partition or LVM volume, and will pass the command to the underlying block device. This is well-known, but it is also a large security problem when (via Unix permissions, ACLs, SELinux or a combination thereof) a program or user needs to be granted access only to part of the disk. This patch lets partitions forward a small set of harmless ioctls; others are logged with printk so that we can see which ioctls are actually sent. In my tests only CDROM_GET_CAPABILITY actually occurred. Of course it was being sent to a (partition on a) hard disk, so it would have failed with ENOTTY and the patch isn't changing anything in practice. Still, I'm treating it specially to avoid spamming the logs. In principle, this restriction should include programs running with CAP_SYS_RAWIO. If for example I let a program access /dev/sda2 and /dev/sdb, it still should not be able to read/write outside the boundaries of /dev/sda2 independent of the capabilities. However, for now programs with CAP_SYS_RAWIO will still be allowed to send the ioctls. Their actions will still be logged. This patch does not affect the non-libata IDE driver. That driver however already tests for bd != bd->bd_contains before issuing some ioctl; it could be restricted further to forbid these ioctls even for programs running with CAP_SYS_ADMIN/CAP_SYS_RAWIO. Cc: linux-scsi@vger.kernel.org Cc: Jens Axboe <axboe@kernel.dk> Cc: James Bottomley <JBottomley@parallels.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [ Make it also print the command name when warning - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backport to 2.6.32 - ENOIOCTLCMD does not get converted to ENOTTY, so we must return ENOTTY directly] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-03drm: Fix authentication kernel crashThomas Hellstrom
commit 598781d71119827b454fd75d46f84755bca6f0c6 upstream. If the master tries to authenticate a client using drm_authmagic and that client has already closed its drm file descriptor, either wilfully or because it was terminated, the call to drm_authmagic will dereference a stale pointer into kmalloc'ed memory and corrupt it. Typically this results in a hard system hang. This patch fixes that problem by removing any authentication tokens (struct drm_magic_entry) open for a file descriptor when that file descriptor is closed. Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-01-25kernel.h: add printk_ratelimited and pr_<level>_rlJoe Perches
commit 8a64f336bc1d4aa203b138d29d5a9c414a9fbb47 upstream. Add a printk_ratelimited statement expression macro that uses a per-call ratelimit_state so that multiple subsystems output messages are not suppressed by a global __ratelimit state. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: s/_rl/_ratelimited/g] Signed-off-by: Joe Perches <joe@perches.com> Cc: Naohiro Ooiwa <nooiwa@miraclelinux.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25block: add and use scsi_blk_cmd_ioctlPaolo Bonzini
commit 577ebb374c78314ac4617242f509e2f5e7156649 upstream. Introduce a wrapper around scsi_cmd_ioctl that takes a block device. The function will then be enhanced to detect partition block devices and, in that case, subject the ioctls to whitelisting. Cc: linux-scsi@vger.kernel.org Cc: Jens Axboe <axboe@kernel.dk> Cc: James Bottomley <JBottomley@parallels.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> [bwh: Backport to 2.6.32 - adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-01-25svcrpc: destroy server sockets all at onceJ. Bruce Fields
commit 2fefb8a09e7ed251ae8996e0c69066e74c5aa560 upstream. There's no reason I can see that we need to call sv_shutdown between closing the two lists of sockets. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25V4L/DVB: v4l2-ioctl: integer overflow in video_usercopy()Dan Carpenter
commit 6c06108be53ca5e94d8b0e93883d534dd9079646 upstream. If ctrls->count is too high the multiplication could overflow and array_size would be lower than expected. Mauro and Hans Verkuil suggested that we cap it at 1024. That comes from the maximum number of controls with lots of room for expantion. $ grep V4L2_CID include/linux/videodev2.h | wc -l 211 Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25xen/xenbus: Reject replies with payload > XENSTORE_PAYLOAD_MAX.Ian Campbell
commit 9e7860cee18241633eddb36a4c34c7b61d8cecbc upstream. Haogang Chen found out that: There is a potential integer overflow in process_msg() that could result in cross-domain attack. body = kmalloc(msg->hdr.len + 1, GFP_NOIO | __GFP_HIGH); When a malicious guest passes 0xffffffff in msg->hdr.len, the subsequent call to xb_read() would write to a zero-length buffer. The other end of this connection is always the xenstore backend daemon so there is no guest (malicious or otherwise) which can do this. The xenstore daemon is a trusted component in the system. However this seem like a reasonable robustness improvement so we should have it. And Ian when read the API docs found that: The payload length (len field of the header) is limited to 4096 (XENSTORE_PAYLOAD_MAX) in both directions. If a client exceeds the limit, its xenstored connection will be immediately killed by xenstored, which is usually catastrophic from the client's point of view. Clients (particularly domains, which cannot just reconnect) should avoid this. so this patch checks against that instead. This also avoids a potential integer overflow pointed out by Haogang Chen. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Haogang Chen <haogangchen@gmail.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25PCI: Fix PCI_EXP_TYPE_RC_EC valueAlex Williamson
commit 1830ea91c20b06608f7cdb2455ce05ba834b3214 upstream. Spec shows this as 1010b = 0xa Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-21linux/log2.h: Fix rounddown_pow_of_two(1)Linus Torvalds
commit 13c07b0286d340275f2d97adf085cecda37ede37 upstream. Exactly like roundup_pow_of_two(1), the rounddown version was buggy for the case of a compile-time constant '1' argument. Probably because it originated from the same code, sharing history with the roundup version from before the bugfix (for that one, see commit 1a06a52ee1b0: "Fix roundup_pow_of_two(1)"). However, unlike the roundup version, the fix for rounddown is to just remove the broken special case entirely. It's simply not needed - the generic code 1UL << ilog2(n) does the right thing for the constant '1' argment too. The only reason roundup needed that special case was because rounding up does so by subtracting one from the argument (and then adding one to the result) causing the obvious problems with "ilog2(0)". But rounddown doesn't do any of that, since ilog2() naturally truncates (ie "rounds down") to the right rounded down value. And without the ilog2(0) case, there's no reason for the special case that had the wrong value. tl;dr: rounddown_pow_of_two(1) should be 1, not 0. Acked-by: Dmitry Torokhov <dtor@vmware.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-26tty: Make tiocgicount a handlerAlan Cox
commit d281da7ff6f70efca0553c288bb883e8605b3862 upstream Dan Rosenberg noted that various drivers return the struct with uncleared fields. Instead of spending forever trying to stomp all the drivers that get it wrong (and every new driver) do the job in one place. This first patch adds the needed operations and hooks them up, including the needed USB midlayer and serial core plumbing. Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2011-11-26mm: avoid null pointer access in vm_struct via /proc/vmallocinfoMitsuo Hayasaka
commit f5252e009d5b87071a919221e4f6624184005368 upstream The /proc/vmallocinfo shows information about vmalloc allocations in vmlist that is a linklist of vm_struct. It, however, may access pages field of vm_struct where a page was not allocated. This results in a null pointer access and leads to a kernel panic. Why this happen: In __vmalloc_node() called from vmalloc(), newly allocated vm_struct is added to vmlist at __get_vm_area_node() and then, some fields of vm_struct such as nr_pages and pages are set at __vmalloc_area_node(). In other words, it is added to vmlist before it is fully initialized. At the same time, when the /proc/vmallocinfo is read, it accesses the pages field of vm_struct according to the nr_pages field at show_numa_info(). Thus, a null pointer access happens. Patch: This patch adds newly allocated vm_struct to the vmlist *after* it is fully initialized. So, it can avoid accessing the pages field with unallocated page when show_numa_info() is called. Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Namhyung Kim <namhyung@gmail.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-26genirq: Add IRQF_RESUME_EARLY and resume such IRQs earlierIan Campbell
commit 9bab0b7fbaceec47d32db51cd9e59c82fb071f5a upstream This adds a mechanism to resume selected IRQs during syscore_resume instead of dpm_resume_noirq. Under Xen we need to resume IRQs associated with IPIs early enough that the resched IPI is unmasked and we can therefore schedule ourselves out of the stop_machine where the suspend/resume takes place. This issue was introduced by 676dc3cf5bc3 "xen: Use IRQF_FORCE_RESUME". Back ported to 2.6.32 (which lacks syscore support) by calling the relavant resume function directly from sysdev_resume). v2: Fixed non-x86 build errors. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Jeremy Fitzhardinge <Jeremy.Fitzhardinge@citrix.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/1318713254.11016.52.camel@dagon.hellion.org.uk Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-08Revert "genirq: Add IRQF_RESUME_EARLY and resume such IRQs earlier"Greg Kroah-Hartman
This reverts commit 0f12a6ad9fa3a03f2bcee36c9cb704821e244c40. It causes too many build errors and needs to be done properly. Reported-by: Jiri Slaby <jslaby@suse.cz> Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Jeremy Fitzhardinge <Jeremy.Fitzhardinge@citrix.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-07ext2,ext3,ext4: don't inherit APPEND_FL or IMMUTABLE_FL for new inodesTheodore Ts'o
commit 1cd9f0976aa4606db8d6e3dc3edd0aca8019372a upstream. This doesn't make much sense, and it exposes a bug in the kernel where attempts to create a new file in an append-only directory using O_CREAT will fail (but still leave a zero-length file). This was discovered when xfstests #79 was generalized so it could run on all file systems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-07NLM: Don't hang forever on NLM unlock requestsTrond Myklebust
commit 0b760113a3a155269a3fba93a409c640031dd68f upstream. If the NLM daemon is killed on the NFS server, we can currently end up hanging forever on an 'unlock' request, instead of aborting. Basically, if the rpcbind request fails, or the server keeps returning garbage, we really want to quit instead of retrying. Tested-by: Vasily Averin <vvs@sw.ru> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-07scm: lower SCM_MAX_FDEric Dumazet
commit bba14de98753cb6599a2dae0e520714b2153522d upstream. Lower SCM_MAX_FD from 255 to 253 so that allocations for scm_fp_list are halved. (commit f8d570a4 added two pointers in this structure) scm_fp_dup() should not copy whole structure (and trigger kmemcheck warnings), but only the used part. While we are at it, only allocate needed size. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-07genirq: Add IRQF_RESUME_EARLY and resume such IRQs earlierIan Campbell
commit 9bab0b7fbaceec47d32db51cd9e59c82fb071f5a upstream This adds a mechanism to resume selected IRQs during syscore_resume instead of dpm_resume_noirq. Under Xen we need to resume IRQs associated with IPIs early enough that the resched IPI is unmasked and we can therefore schedule ourselves out of the stop_machine where the suspend/resume takes place. This issue was introduced by 676dc3cf5bc3 "xen: Use IRQF_FORCE_RESUME". Back ported to 2.6.32 (which lacks syscore support) by calling the relavant resume function directly from sysdev_resume). Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Jeremy Fitzhardinge <Jeremy.Fitzhardinge@citrix.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/1318713254.11016.52.camel@dagon.hellion.org.uk Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-07cfq: calculate the seek_mean per cfq_queue not per cfq_io_contextJeff Moyer
commit b2c18e1e08a5a9663094d57bb4be2f02226ee61c upstream. async cfq_queue's are already shared between processes within the same priority, and forthcoming patches will change the mapping of cic to sync cfq_queue from 1:1 to 1:N. So, calculate the seekiness of a process based on the cfq_queue instead of the cfq_io_context. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-07time: Change jiffies_to_clock_t() argument type to unsigned longhank
commit cbbc719fccdb8cbd87350a05c0d33167c9b79365 upstream. The parameter's origin type is long. On an i386 architecture, it can easily be larger than 0x80000000, causing this function to convert it to a sign-extended u64 type. Change the type to unsigned long so we get the correct result. Signed-off-by: hank <pyu@redhat.com> Cc: John Stultz <john.stultz@linaro.org> [ build fix ] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-15net: Compute protocol sequence numbers and fragment IDs using MD5.David S. Miller
Computers have become a lot faster since we compromised on the partial MD4 hash which we use currently for performance reasons. MD5 is a much safer choice, and is inline with both RFC1948 and other ISS generators (OpenBSD, Solaris, etc.) Furthermore, only having 24-bits of the sequence number be truly unpredictable is a very serious limitation. So the periodic regeneration and 8-bit counter have been removed. We compute and use a full 32-bit sequence number. For ipv6, DCCP was found to use a 32-bit truncated initial sequence number (it needs 43-bits) and that is fixed here as well. Reported-by: Dan Kaminsky <dan@doxpara.com> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-15crypto: Move md5_transform to lib/md5.cDavid S. Miller
We are going to use this for TCP/IP sequence number and fragment ID generation. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-08gro: Only reset frag0 when skb can be pulledHerbert Xu
commit 17dd759c67f21e34f2156abcf415e1f60605a188 upstream. Currently skb_gro_header_slow unconditionally resets frag0 and frag0_len. However, when we can't pull on the skb this leaves the GRO fields in an inconsistent state. This patch fixes this by only resetting those fields after the pskb_may_pull test. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-07-13mm: prevent concurrent unmap_mapping_range() on the same inodeMiklos Szeredi
commit 2aa15890f3c191326678f1bd68af61ec6b8753ec upstream. Michael Leun reported that running parallel opens on a fuse filesystem can trigger a "kernel BUG at mm/truncate.c:475" Gurudas Pai reported the same bug on NFS. The reason is, unmap_mapping_range() is not prepared for more than one concurrent invocation per inode. For example: thread1: going through a big range, stops in the middle of a vma and stores the restart address in vm_truncate_count. thread2: comes in with a small (e.g. single page) unmap request on the same vma, somewhere before restart_address, finds that the vma was already unmapped up to the restart address and happily returns without doing anything. Another scenario would be two big unmap requests, both having to restart the unmapping and each one setting vm_truncate_count to its own value. This could go on forever without any of them being able to finish. Truncate and hole punching already serialize with i_mutex. Other callers of unmap_mapping_range() do not, and it's difficult to get i_mutex protection for all callers. In particular ->d_revalidate(), which calls invalidate_inode_pages2_range() in fuse, may be called with or without i_mutex. This patch adds a new mutex to 'struct address_space' to prevent running multiple concurrent unmap_mapping_range() on the same mapping. [ We'll hopefully get rid of all this with the upcoming mm preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex lockbreak" patch in particular. But that is for 2.6.39 ] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reported-by: Michael Leun <lkml20101129@newton.leun.net> Reported-by: Gurudas Pai <gurudas.pai@oracle.com> Tested-by: Gurudas Pai <gurudas.pai@oracle.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-07-13af_packet: prevent information leakEric Dumazet
[ Upstream commit 13fcb7bd322164c67926ffe272846d4860196dc6 ] In 2.6.27, commit 393e52e33c6c2 (packet: deliver VLAN TCI to userspace) added a small information leak. Add padding field and make sure its zeroed before copy to user. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-07-13bug.h: Add WARN_RATELIMITJoe Perches
[ Upstream commit b3eec79b0776e5340a3db75b34953977c7e5086e ] Add a generic mechanism to ratelimit WARN(foo, fmt, ...) messages using a hidden per call site static struct ratelimit_state. Also add an __WARN_RATELIMIT variant to be able to use a specific struct ratelimit_state. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-07-13netlink: Make nlmsg_find_attr take a const nlmsghdr*.Nelson Elhage
commit 6b8c92ba07287578718335ce409de8e8d7217e40 upstream. This will let us use it on a nlmsghdr stored inside a netlink_callback. Signed-off-by: Nelson Elhage <nelhage@ksplice.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-07-13clocksource: Make watchdog robust vs. interruptionThomas Gleixner
commit b5199515c25cca622495eb9c6a8a1d275e775088 upstream. The clocksource watchdog code is interruptible and it has been observed that this can trigger false positives which disable the TSC. The reason is that an interrupt storm or a long running interrupt handler between the read of the watchdog source and the read of the TSC brings the two far enough apart that the delta is larger than the unstable treshold. Move both reads into a short interrupt disabled region to avoid that. Reported-and-tested-by: Vernon Mauery <vernux@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-06-23genirq: Add IRQF_FORCE_RESUMEThomas Gleixner
commit dc5f219e88294b93009eef946251251ffffb6d60 upstream. Xen needs to reenable interrupts which are marked IRQF_NO_SUSPEND in the resume path. Add a flag to force the reenabling in the resume code. Tested-and-acked-by: Ian Campbell <Ian.Campbell@eu.citrix.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-06-23pata_cm64x: fix boot crash on pariscJames Bottomley
commit 9281b16caac1276817b77033c5b8a1f5ca30102c upstream. The old IDE cmd64x checks the status of the CNTRL register to see if the ports are enabled before probing them. pata_cmd64x doesn't do this, which causes a HPMC on parisc when it tries to poke at the secondary port because apparently the BAR isn't wired up (and a non-responding piece of memory causes a HPMC). Fix this by porting the CNTRL register port detection logic from IDE cmd64x. In addition, following converns from Alan Cox, add a check to see if a mobility electronics bridge is the immediate parent and forgo the check if it is (prevents problems on hotplug controllers). Signed-off-by: James Bottomley <James.Bottomley@suse.de> Signed-off-by: Jeff Garzik <jgarzik@pobox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-06-23seqlock: Don't smp_rmb in seqlock reader spin loopMilton Miller
commit 5db1256a5131d3b133946fa02ac9770a784e6eb2 upstream. Move the smp_rmb after cpu_relax loop in read_seqlock and add ACCESS_ONCE to make sure the test and return are consistent. A multi-threaded core in the lab didn't like the update from 2.6.35 to 2.6.36, to the point it would hang during boot when multiple threads were active. Bisection showed af5ab277ded04bd9bc6b048c5a2f0e7d70ef0867 (clockevents: Remove the per cpu tick skew) as the culprit and it is supported with stack traces showing xtime_lock waits including tick_do_update_jiffies64 and/or update_vsyscall. Experimentation showed the combination of cpu_relax and smp_rmb was significantly slowing the progress of other threads sharing the core, and this patch is effective in avoiding the hang. A theory is the rmb is affecting the whole core while the cpu_relax is causing a resource rebalance flush, together they cause an interfernce cadance that is unbroken when the seqlock reader has interrupts disabled. At first I was confused why the refactor in 3c22cd5709e8143444a6d08682a87f4c57902df3 (kernel: optimise seqlock) didn't affect this patch application, but after some study that affected seqcount not seqlock. The new seqcount was not factored back into the seqlock. I defer that the future. While the removal of the timer interrupt offset created contention for the xtime lock while a cpu does the additonal work to update the system clock, the seqlock implementation with the tight rmb spin loop goes back much further, and is just waiting for the right trigger. Signed-off-by: Milton Miller <miltonm@bga.com> Cc: <linuxppc-dev@lists.ozlabs.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Anton Blanchard <anton@samba.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Link: http://lkml.kernel.org/r/%3Cseqlock-rmb%40mdm.bga.com%3E Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-05-09af_unix: limit recursion levelEric Dumazet
commit 25888e30319f8896fc656fc68643e6a078263060 upstream. Its easy to eat all kernel memory and trigger NMI watchdog, using an exploit program that queues unix sockets on top of others. lkml ref : http://lkml.org/lkml/2010/11/25/8 This mechanism is used in applications, one choice we have is to have a recursion limit. Other limits might be needed as well (if we queue other types of files), since the passfd mechanism is currently limited by socket receive queue sizes only. Add a recursion_level to unix socket, allowing up to 4 levels. Each time we send an unix socket through sendfd mechanism, we copy its recursion level (plus one) to receiver. This recursion level is cleared when socket receive queue is emptied. Reported-by: Марк Коренберг <socketpair@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> [bwh: Adjust for 2.6.32] Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-05-09mmc: fix all hangs related to mmc/sd card insert/removal during suspend/resumeMaxim Levitsky
commit 4c2ef25fe0b847d2ae818f74758ddb0be1c27d8e upstream. If you don't use CONFIG_MMC_UNSAFE_RESUME, as soon as you attempt to suspend, the card will be removed, therefore this patch doesn't change the behavior of this option. However the removal will be done by pm notifier, which runs while userspace is still not frozen and thus can freely use del_gendisk, without the risk of deadlock which would happen otherwise. Card detect workqueue is now disabled while userspace is frozen, Therefore if you do use CONFIG_MMC_UNSAFE_RESUME, and remove the card during suspend, the removal will be detected as soon as userspace is unfrozen, again at the moment it is safe to call del_gendisk. Tested with and without CONFIG_MMC_UNSAFE_RESUME with suspend and hibernate. [akpm@linux-foundation.org: clean up function prototype] [akpm@linux-foundation.org: fix CONFIG_PM-n linkage, small cleanups] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Maxim Levitsky <maximlevitsky@gmail.com> Cc: David Brownell <david-b@pacbell.net> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: <linux-mmc@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Adjust for 2.6.32] Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-05-09mac80211: Add define for TX headroom reserved by mac80211 itself.Gertjan van Wingerde
commit d24deb2580823ab0b8425790c6f5d18e2ff749d8 upstream. Add a definition of the amount of TX headroom reserved by mac80211 itself for its own purposes. Also add BUILD_BUG_ON to validate the value. This define can then be used by drivers to request additional TX headroom in the most efficient manner. Signed-off-by: Gertjan van Wingerde <gwingerde@gmail.com> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com> [bwh: Adjust context for 2.6.32] Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-04-22next_pidmap: fix overflow conditionLinus Torvalds
commit c78193e9c7bcbf25b8237ad0dec82f805c4ea69b upstream. next_pidmap() just quietly accepted whatever 'last' pid that was passed in, which is not all that safe when one of the users is /proc. Admittedly the proc code should do some sanity checking on the range (and that will be the next commit), but that doesn't mean that the helper functions should just do that pidmap pointer arithmetic without checking the range of its arguments. So clamp 'last' to PID_MAX_LIMIT. The fact that we then do "last+1" doesn't really matter, the for-loop does check against the end of the pidmap array properly (it's only the actual pointer arithmetic overflow case we need to worry about, and going one bit beyond isn't going to overflow). [ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ] Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com> Analyzed-by: Robert Święcki <robert@swiecki.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-04-14exec: copy-and-paste the fixes into compat_do_execve() pathsOleg Nesterov
commit 114279be2120a916e8a04feeb2ac976a10016f2f upstream. Note: this patch targets 2.6.37 and tries to be as simple as possible. That is why it adds more copy-and-paste horror into fs/compat.c and uglifies fs/exec.c, this will be cleanuped later. compat_copy_strings() plays with bprm->vma/mm directly and thus has two problems: it lacks the RLIMIT_STACK check and argv/envp memory is not visible to oom killer. Export acct_arg_size() and get_arg_page(), change compat_copy_strings() to use get_arg_page(), change compat_do_execve() to do acct_arg_size(0) as do_execve() does. Add the fatal_signal_pending/cond_resched checks into compat_count() and compat_copy_strings(), this matches the code in fs/exec.c and certainly makes sense. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Moritz Muehlenhoff <jmm@debian.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-04-14exec: make argv/envp memory visible to oom-killerOleg Nesterov
commit 3c77f845722158206a7209c45ccddc264d19319c upstream. Brad Spengler published a local memory-allocation DoS that evades the OOM-killer (though not the virtual memory RLIMIT): http://www.grsecurity.net/~spender/64bit_dos.c execve()->copy_strings() can allocate a lot of memory, but this is not visible to oom-killer, nobody can see the nascent bprm->mm and take it into account. With this patch get_arg_page() increments current's MM_ANONPAGES counter every time we allocate the new page for argv/envp. When do_execve() succeds or fails, we change this counter back. Technically this is not 100% correct, we can't know if the new page is swapped out and turn MM_ANONPAGES into MM_SWAPENTS, but I don't think this really matters and everything becomes correct once exec changes ->mm or fails. Reported-by: Brad Spengler <spender@grsecurity.net> Reviewed-and-discussed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Moritz Muehlenhoff <jmm@debian.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-04-14ASoC: Explicitly say registerless widgets have no registerMark Brown
commit 0ca03cd7d0fa3bfbd56958136a10f19733c4ce12 upstream. This stops code that handles widgets generically from attempting to access registers for these widgets. Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Acked-by: Liam Girdwood <lrg@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-04-14ses: Avoid kernel panic when lun 0 is not mappedKrishnasamy, Somasundaram
commit d1e12de804f9d8ad114786ca7c2ce593cba79891 upstream. During device discovery, scsi mid layer sends INQUIRY command to LUN 0. If the LUN 0 is not mapped to host, it creates a temporary scsi_device with LUN id 0 and sends REPORT_LUNS command to it. After the REPORT_LUNS succeeds, it walks through the LUN table and adds each LUN found to sysfs. At the end of REPORT_LUNS lun table scan, it will delete the temporary scsi_device of LUN 0. When scsi devices are added to sysfs, it calls add_dev function of all the registered class interfaces. If ses driver has been registered, ses_intf_add() of ses module will be called. This function calls scsi_device_enclosure() to check the inquiry data for EncServ bit. Since inquiry was not allocated for temporary LUN 0 scsi_device, it will cause NULL pointer exception. To fix the problem, sdev->inquiry is checked for NULL before reading it. Signed-off-by: Somasundaram Krishnasamy <Somasundaram.Krishnasamy@lsi.com> Signed-off-by: Babu Moger <babu.moger@lsi.com> Signed-off-by: James Bottomley <James.Bottomley@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-23ftrace: Fix memory leak with function graph and cpu hotplugSteven Rostedt
commit 868baf07b1a259f5f3803c1dc2777b6c358f83cf upstream. When the fuction graph tracer starts, it needs to make a special stack for each task to save the real return values of the tasks. All running tasks have this stack created, as well as any new tasks. On CPU hot plug, the new idle task will allocate a stack as well when init_idle() is called. The problem is that cpu hotplug does not create a new idle_task. Instead it uses the idle task that existed when the cpu went down. ftrace_graph_init_task() will add a new ret_stack to the task that is given to it. Because a clone will make the task have a stack of its parent it does not check if the task's ret_stack is already NULL or not. When the CPU hotplug code starts a CPU up again, it will allocate a new stack even though one already existed for it. The solution is to treat the idle_task specially. In fact, the function_graph code already does, just not at init_idle(). Instead of using the ftrace_graph_init_task() for the idle task, which that function expects the task to be a clone, have a separate ftrace_graph_init_idle_task(). Also, we will create a per_cpu ret_stack that is used by the idle task. When we call ftrace_graph_init_idle_task() it will check if the idle task's ret_stack is NULL, if it is, then it will assign it the per_cpu ret_stack. Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-14net: don't allow CAP_NET_ADMIN to load non-netdev kernel modulesVasiliy Kulikov
commit 8909c9ad8ff03611c9c96c9a92656213e4bb495b upstream. Since a8f80e8ff94ecba629542d9b4b5f5a8ee3eb565c any process with CAP_NET_ADMIN may load any module from /lib/modules/. This doesn't mean that CAP_NET_ADMIN is a superset of CAP_SYS_MODULE as modules are limited to /lib/modules/**. However, CAP_NET_ADMIN capability shouldn't allow anybody load any module not related to networking. This patch restricts an ability of autoloading modules to netdev modules with explicit aliases. This fixes CVE-2011-1019. Arnd Bergmann suggested to leave untouched the old pre-v2.6.32 behavior of loading netdev modules by name (without any prefix) for processes with CAP_SYS_MODULE to maintain the compatibility with network scripts that use autoloading netdev modules by aliases like "eth0", "wlan0". Currently there are only three users of the feature in the upstream kernel: ipip, ip_gre and sit. root@albatros:~# capsh --drop=$(seq -s, 0 11),$(seq -s, 13 34) -- root@albatros:~# grep Cap /proc/$$/status CapInh: 0000000000000000 CapPrm: fffffff800001000 CapEff: fffffff800001000 CapBnd: fffffff800001000 root@albatros:~# modprobe xfs FATAL: Error inserting xfs (/lib/modules/2.6.38-rc6-00001-g2bf4ca3/kernel/fs/xfs/xfs.ko): Operation not permitted root@albatros:~# lsmod | grep xfs root@albatros:~# ifconfig xfs xfs: error fetching interface information: Device not found root@albatros:~# lsmod | grep xfs root@albatros:~# lsmod | grep sit root@albatros:~# ifconfig sit sit: error fetching interface information: Device not found root@albatros:~# lsmod | grep sit root@albatros:~# ifconfig sit0 sit0 Link encap:IPv6-in-IPv4 NOARP MTU:1480 Metric:1 root@albatros:~# lsmod | grep sit sit 10457 0 tunnel4 2957 1 sit For CAP_SYS_MODULE module loading is still relaxed: root@albatros:~# grep Cap /proc/$$/status CapInh: 0000000000000000 CapPrm: ffffffffffffffff CapEff: ffffffffffffffff CapBnd: ffffffffffffffff root@albatros:~# ifconfig xfs xfs: error fetching interface information: Device not found root@albatros:~# lsmod | grep xfs xfs 745319 0 Reference: https://lkml.org/lkml/2011/2/24/203 Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Kees Cook <kees.cook@canonical.com> Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-14RxRPC: Fix v1 keysAnton Blanchard
commit f009918a1c1bbf8607b8aab3959876913a30193a upstream. commit 339412841d7 (RxRPC: Allow key payloads to be passed in XDR form) broke klog for me. I notice the v1 key struct had a kif_version field added: -struct rxkad_key { - u16 security_index; /* RxRPC header security index */ - u16 ticket_len; /* length of ticket[] */ - u32 expiry; /* time at which expires */ - u32 kvno; /* key version number */ - u8 session_key[8]; /* DES session key */ - u8 ticket[0]; /* the encrypted ticket */ -}; +struct rxrpc_key_data_v1 { + u32 kif_version; /* 1 */ + u16 security_index; + u16 ticket_length; + u32 expiry; /* time_t */ + u32 kvno; + u8 session_key[8]; + u8 ticket[0]; +}; However the code in rxrpc_instantiate strips it away: data += sizeof(kver); datalen -= sizeof(kver); Removing kif_version fixes my problem. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-07sctp: Fix oops when sending queued ASCONF chunksVlad Yasevich
commit c0786693404cffd80ca3cb6e75ee7b35186b2825 upstream. When we finish processing ASCONF_ACK chunk, we try to send the next queued ASCONF. This action runs the sctp state machine recursively and it's not prepared to do so. kernel BUG at kernel/timer.c:790! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/module/ipv6/initstate Modules linked in: sha256_generic sctp libcrc32c ipv6 dm_multipath uinput 8139too i2c_piix4 8139cp mii i2c_core pcspkr virtio_net joydev floppy virtio_blk virtio_pci [last unloaded: scsi_wait_scan] Pid: 0, comm: swapper Not tainted 2.6.34-rc4 #15 /Bochs EIP: 0060:[<c044a2ef>] EFLAGS: 00010286 CPU: 0 EIP is at add_timer+0xd/0x1b EAX: cecbab14 EBX: 000000f0 ECX: c0957b1c EDX: 03595cf4 ESI: cecba800 EDI: cf276f00 EBP: c0957aa0 ESP: c0957aa0 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process swapper (pid: 0, ti=c0956000 task=c0988ba0 task.ti=c0956000) Stack: c0957ae0 d1851214 c0ab62e4 c0ab5f26 0500ffff 00000004 00000005 00000004 <0> 00000000 d18694fd 00000004 1666b892 cecba800 cecba800 c0957b14 00000004 <0> c0957b94 d1851b11 ceda8b00 cecba800 cf276f00 00000001 c0957b14 000000d0 Call Trace: [<d1851214>] ? sctp_side_effects+0x607/0xdfc [sctp] [<d1851b11>] ? sctp_do_sm+0x108/0x159 [sctp] [<d1863386>] ? sctp_pname+0x0/0x1d [sctp] [<d1861a56>] ? sctp_primitive_ASCONF+0x36/0x3b [sctp] [<d185657c>] ? sctp_process_asconf_ack+0x2a4/0x2d3 [sctp] [<d184e35c>] ? sctp_sf_do_asconf_ack+0x1dd/0x2b4 [sctp] [<d1851ac1>] ? sctp_do_sm+0xb8/0x159 [sctp] [<d1863334>] ? sctp_cname+0x0/0x52 [sctp] [<d1854377>] ? sctp_assoc_bh_rcv+0xac/0xe1 [sctp] [<d1858f0f>] ? sctp_inq_push+0x2d/0x30 [sctp] [<d186329d>] ? sctp_rcv+0x797/0x82e [sctp] Tested-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: Yuansong Qiao <ysqiao@research.ait.ie> Signed-off-by: Shuaijun Zhang <szhang@research.ait.ie> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: maximilian attems <max@stro.at> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-07drm: fix unsigned vs signed comparison issue in modeset ctl ioctl.Dave Airlie
commit 1922756124ddd53846877416d92ba4a802bc658f upstream. This fixes CVE-2011-1013. Reported-by: Matthiew Herrb (OpenBSD X.org team) Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-02CRED: Fix get_task_cred() and task_state() to not resurrect dead credentialsDavid Howells
commit de09a9771a5346029f4d11e4ac886be7f9bfdd75 upstream. It's possible for get_task_cred() as it currently stands to 'corrupt' a set of credentials by incrementing their usage count after their replacement by the task being accessed. What happens is that get_task_cred() can race with commit_creds(): TASK_1 TASK_2 RCU_CLEANER -->get_task_cred(TASK_2) rcu_read_lock() __cred = __task_cred(TASK_2) -->commit_creds() old_cred = TASK_2->real_cred TASK_2->real_cred = ... put_cred(old_cred) call_rcu(old_cred) [__cred->usage == 0] get_cred(__cred) [__cred->usage == 1] rcu_read_unlock() -->put_cred_rcu() [__cred->usage == 1] panic() However, since a tasks credentials are generally not changed very often, we can reasonably make use of a loop involving reading the creds pointer and using atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero. If successful, we can safely return the credentials in the knowledge that, even if the task we're accessing has released them, they haven't gone to the RCU cleanup code. We then change task_state() in procfs to use get_task_cred() rather than calling get_cred() on the result of __task_cred(), as that suffers from the same problem. Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be tripped when it is noticed that the usage count is not zero as it ought to be, for example: kernel BUG at kernel/cred.c:168! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/kernel/mm/ksm/run CPU 0 Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex 745 RIP: 0010:[<ffffffff81069881>] [<ffffffff81069881>] __put_cred+0xc/0x45 RSP: 0018:ffff88019e7e9eb8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0 RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0 R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001 FS: 00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0) Stack: ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45 <0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000 <0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246 Call Trace: [<ffffffff810698cd>] put_cred+0x13/0x15 [<ffffffff81069b45>] commit_creds+0x16b/0x175 [<ffffffff8106aace>] set_current_groups+0x47/0x4e [<ffffffff8106ac89>] sys_setgroups+0xf6/0x105 [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00 48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b 04 25 00 cc 00 00 48 3b b8 58 04 00 00 75 RIP [<ffffffff81069881>] __put_cred+0xc/0x45 RSP <ffff88019e7e9eb8> ---[ end trace df391256a100ebdd ]--- Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Use group weight, idle cpu metrics to fix imbalances during idleSuresh Siddha
Commit: aae6d3ddd8b90f5b2c8d79a2b914d1706d124193 upstream Currently we consider a sched domain to be well balanced when the imbalance is less than the domain's imablance_pct. As the number of cores and threads are increasing, current values of imbalance_pct (for example 25% for a NUMA domain) are not enough to detect imbalances like: a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical threads), 24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on another socket. Leading to an idle HT cpu. b) On a hypothetial 2 socket NHM-EX system (each socket having 8 cores and 16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one socket and 7 on another socket. Leaving one core in a socket idle whereas in another socket we have a core having both its HT siblings busy. While this issue can be fixed by decreasing the domain's imbalance_pct (by making it a function of number of logical cpus in the domain), it can potentially cause more task migrations across sched groups in an overloaded case. Fix this by using imbalance_pct only during newly_idle and busy load balancing. And during idle load balancing, check if there is an imbalance in number of idle cpu's across the busiest and this sched_group or if the busiest group has more tasks than its weight that the idle cpu in this_group can pull. Reported-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1284760952.2676.11.camel@sbsiddha-MOBL3.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched, cgroup: Fixup broken cgroup movementPeter Zijlstra
Commit: b2b5ce022acf5e9f52f7b78c5579994fdde191d4 upstream Dima noticed that we fail to correct the ->vruntime of sleeping tasks when we move them between cgroups. Reported-by: Dima Zavin <dima@android.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Mike Galbraith <efault@gmx.de> LKML-Reference: <1287150604.29097.1513.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq timeVenkatesh Pallipadi
Commit: b52bfee445d315549d41eacf2fa7c156e7d153d5 upstream s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does the fine granularity accounting of user, system, hardirq, softirq times. Adding that option on archs like x86 will be challenging however, given the state of TSC reliability on various platforms and also the overhead it will add in syscall entry exit. Instead, add a lighter variant that only does finer accounting of hardirq and softirq times, providing precise irq times (instead of timer tick based samples). This accounting is added with a new config option CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not interested in paying the perf penalty. This accounting is based on sched_clock, with the code being generic. So, other archs may find it useful as well. This patch just adds the core logic and does not enable this logic yet. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Add a PF flag for ksoftirqd identificationVenkatesh Pallipadi
Commit: 6cdd5199daf0cb7b0fcc8dca941af08492612887 upstream To account softirq time cleanly in scheduler, we need to identify whether softirq is invoked in ksoftirqd context or softirq at hardirq tail context. Add PF_KSOFTIRQD for that purpose. As all PF flag bits are currently taken, create space by moving one of the infrequently used bits (PF_THREAD_BOUND) down in task_struct to be along with some other state fields. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Remove unused PF_ALIGNWARN flagDave Young
Commit: 637bbdc5b83615ef9f45f50399d1c7f27473c713 upstream PF_ALIGNWARN is not implemented and it is for 486 as the comment. It is not likely someone will implement this flag feature. So here remove this flag and leave the valuable 0x00000001 for future use. Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20100913121903.GB22238@darkstar> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Consolidate account_system_vtime extern declarationVenkatesh Pallipadi
Commit: e1e10a265d28273ab8c70be19d43dcbdeead6c5a upstream Just a minor cleanup patch that makes things easier to the following patches. No functionality change in this patch. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>