summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2017-07-27vfio: New external user group/file matchAlex Williamson
commit 5d6dee80a1e94cc284d03e06d930e60e8d3ecf7d upstream. At the point where the kvm-vfio pseudo device wants to release its vfio group reference, we can't always acquire a new reference to make that happen. The group can be in a state where we wouldn't allow a new reference to be added. This new helper function allows a caller to match a file to a group to facilitate this. Given a file and group, report if they match. Thus the caller needs to already have a group reference to match to the file. This allows the deletion of a group without acquiring a new reference. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-27iscsi-target: Add login_keys_workaround attribute for non RFC initiatorsNicholas Bellinger
commit 138d351eefb727ab9e41a3dc5f112ceb4f6e59f2 upstream. This patch re-introduces part of a long standing login workaround that was recently dropped by: commit 1c99de981f30b3e7868b8d20ce5479fa1c0fea46 Author: Nicholas Bellinger <nab@linux-iscsi.org> Date: Sun Apr 2 13:36:44 2017 -0700 iscsi-target: Drop work-around for legacy GlobalSAN initiator Namely, the workaround for FirstBurstLength ended up being required by Mellanox Flexboot PXE boot ROMs as reported by Robert. So this patch re-adds the work-around for FirstBurstLength within iscsi_check_proposer_for_optional_reply(), and makes the key optional to respond when the initiator does not propose, nor respond to it. Also as requested by Arun, this patch introduces a new TPG attribute named 'login_keys_workaround' that controls the use of both the FirstBurstLength workaround, as well as the two other existing workarounds for gPXE iSCSI boot client. By default, the workaround is enabled with login_keys_workaround=1, since Mellanox FlexBoot requires it, and Arun has verified the Qlogic MSFT initiator already proposes FirstBurstLength, so it's uneffected by this re-adding this part of the original work-around. Reported-by: Robert LeBlanc <robert@leblancnet.us> Cc: Robert LeBlanc <robert@leblancnet.us> Reviewed-by: Arun Easi <arun.easi@cavium.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-27scsi: Add STARGET_CREATED_REMOVE state to scsi_target_stateEwan D. Milne
commit f9279c968c257ee39b0d7bd2571a4d231a67bcc1 upstream. The addition of the STARGET_REMOVE state had the side effect of introducing a race condition that can cause a crash. scsi_target_reap_ref_release() checks the starget->state to see if it still in STARGET_CREATED, and if so, skips calling transport_remove_device() and device_del(), because the starget->state is only set to STARGET_RUNNING after scsi_target_add() has called device_add() and transport_add_device(). However, if an rport loss occurs while a target is being scanned, it can happen that scsi_remove_target() will be called while the starget is still in the STARGET_CREATED state. In this case, the starget->state will be set to STARGET_REMOVE, and as a result, scsi_target_reap_ref_release() will take the wrong path. The end result is a panic: [ 1255.356653] Oops: 0000 [#1] SMP [ 1255.360154] Modules linked in: x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32c_intel ghash_clmulni_i [ 1255.393234] CPU: 5 PID: 149 Comm: kworker/u96:4 Tainted: G W 4.11.0+ #8 [ 1255.401879] Hardware name: Dell Inc. PowerEdge R320/08VT7V, BIOS 2.0.22 11/19/2013 [ 1255.410327] Workqueue: scsi_wq_6 fc_scsi_scan_rport [scsi_transport_fc] [ 1255.417720] task: ffff88060ca8c8c0 task.stack: ffffc900048a8000 [ 1255.424331] RIP: 0010:kernfs_find_ns+0x13/0xc0 [ 1255.429287] RSP: 0018:ffffc900048abbf0 EFLAGS: 00010246 [ 1255.435123] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 1255.443083] RDX: 0000000000000000 RSI: ffffffff8188d659 RDI: 0000000000000000 [ 1255.451043] RBP: ffffc900048abc10 R08: 0000000000000000 R09: 0000012433fe0025 [ 1255.459005] R10: 0000000025e5a4b5 R11: 0000000025e5a4b5 R12: ffffffff8188d659 [ 1255.466972] R13: 0000000000000000 R14: ffff8805f55e5088 R15: 0000000000000000 [ 1255.474931] FS: 0000000000000000(0000) GS:ffff880616b40000(0000) knlGS:0000000000000000 [ 1255.483959] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1255.490370] CR2: 0000000000000068 CR3: 0000000001c09000 CR4: 00000000000406e0 [ 1255.498332] Call Trace: [ 1255.501058] kernfs_find_and_get_ns+0x31/0x60 [ 1255.505916] sysfs_unmerge_group+0x1d/0x60 [ 1255.510498] dpm_sysfs_remove+0x22/0x60 [ 1255.514783] device_del+0xf4/0x2e0 [ 1255.518577] ? device_remove_file+0x19/0x20 [ 1255.523241] attribute_container_class_device_del+0x1a/0x20 [ 1255.529457] transport_remove_classdev+0x4e/0x60 [ 1255.534607] ? transport_add_class_device+0x40/0x40 [ 1255.540046] attribute_container_device_trigger+0xb0/0xc0 [ 1255.546069] transport_remove_device+0x15/0x20 [ 1255.551025] scsi_target_reap_ref_release+0x25/0x40 [ 1255.556467] scsi_target_reap+0x2e/0x40 [ 1255.560744] __scsi_scan_target+0xaa/0x5b0 [ 1255.565312] scsi_scan_target+0xec/0x100 [ 1255.569689] fc_scsi_scan_rport+0xb1/0xc0 [scsi_transport_fc] [ 1255.576099] process_one_work+0x14b/0x390 [ 1255.580569] worker_thread+0x4b/0x390 [ 1255.584651] kthread+0x109/0x140 [ 1255.588251] ? rescuer_thread+0x330/0x330 [ 1255.592730] ? kthread_park+0x60/0x60 [ 1255.596815] ret_from_fork+0x29/0x40 [ 1255.600801] Code: 24 08 48 83 42 40 01 5b 41 5c 5d c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 [ 1255.621876] RIP: kernfs_find_ns+0x13/0xc0 RSP: ffffc900048abbf0 [ 1255.628479] CR2: 0000000000000068 [ 1255.632756] ---[ end trace 34a69ba0477d036f ]--- Fix this by adding another scsi_target state STARGET_CREATED_REMOVE to distinguish this case. Fixes: f05795d3d771 ("scsi: Add intermediate STARGET_REMOVE state to scsi_target_state") Reported-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Ewan D. Milne <emilne@redhat.com> Reviewed-by: Laurence Oberman <loberman@redhat.com> Tested-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21sched/fair, cpumask: Export for_each_cpu_wrap()Peter Zijlstra
commit c6508a39640b9a27fc2bc10cb708152672c82045 upstream. commit c743f0a5c50f2fcbc628526279cfa24f3dabe182 upstream. More users for for_each_cpu_wrap() have appeared. Promote the construct to generic cpumask interface. The implementation is slightly modified to reduce arguments. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Lauro Ramos Venancio <lvenanci@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: lwang@redhat.com Link: http://lkml.kernel.org/r/20170414122005.o35me2h5nowqkxbv@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21mm/list_lru.c: fix list_lru_count_node() to be race freeSahitya Tummala
commit 2c80cd57c74339889a8752b20862a16c28929c3a upstream. list_lru_count_node() iterates over all memcgs to get the total number of entries on the node but it can race with memcg_drain_all_list_lrus(), which migrates the entries from a dead cgroup to another. This can return incorrect number of entries from list_lru_count_node(). Fix this by keeping track of entries per node and simply return it in list_lru_count_node(). Link: http://lkml.kernel.org/r/1498707555-30525-1-git-send-email-stummala@codeaurora.org Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Alexander Polakov <apolyakov@beget.ru> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21net: ipv6: Compare lwstate in detecting duplicate nexthopsDavid Ahern
commit f06b7549b79e29a672336d4e134524373fb7a232 upstream. Lennert reported a failure to add different mpls encaps in a multipath route: $ ip -6 route add 1234::/16 \ nexthop encap mpls 10 via fe80::1 dev ens3 \ nexthop encap mpls 20 via fe80::1 dev ens3 RTNETLINK answers: File exists The problem is that the duplicate nexthop detection does not compare lwtunnel configuration. Add it. Fixes: 19e42e451506 ("ipv6: support for fib route lwtunnel encap attributes") Signed-off-by: David Ahern <dsahern@gmail.com> Reported-by: João Taveira Araújo <joao.taveira@gmail.com> Reported-by: Lennert Buytenhek <buytenh@wantstofly.org> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Tested-by: Lennert Buytenhek <buytenh@wantstofly.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21vxlan: fix hlist corruptionJiri Benc
[ Upstream commit 69e766612c4bcb79e19cebed9eed61d4222c1d47 ] It's not a good idea to add the same hlist_node to two different hash lists. This leads to various hard to debug memory corruptions. Fixes: b1be00a6c39f ("vxlan: support both IPv4 and IPv6 sockets in a single vxlan device") Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21net/mlx5: Cancel delayed recovery work when unloading the driverMohamad Haj Yahia
commit 2a0165a034ac024b60cca49c61e46f4afa2e4d98 upstream. Draining the health workqueue will ignore future health works including the one that report hardware failure and thus we can't enter error state Instead cancel the recovery flow and make sure only recovery flow won't be scheduled. Fixes: 5e44fca50470 ('net/mlx5: Only cancel recovery work when cleaning up device') Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-15Add "shutdown" to "struct class".Josh Zimmerman
commit f77af15165847406b15d8f70c382c4cb15846b2a upstream. The TPM class has some common shutdown code that must be executed for all drivers. This adds some needed functionality for that. Signed-off-by: Josh Zimmerman <joshz@google.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Fixes: 74d6b3ceaa17 ("tpm: fix suspend/resume paths for TPM 2.0") Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Tested-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: James Morris <james.l.morris@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-12usb: Fix typo in the definition of Endpoint[out]RequestBenjamin Herrenschmidt
commit 7cf916bd639bd26db7214f2205bccdb4b9306256 upstream. The current definition is wrong. This breaks my upcoming Aspeed virtual hub driver. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-12fs: add a VALID_OPEN_FLAGSChristoph Hellwig
commit 80f18379a7c350c011d30332658aa15fe49a8fa5 upstream. Add a central define for all valid open flags, and use it in the uniqueness check. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05xfrm: fix stack access out of bounds with CONFIG_XFRM_SUB_POLICYSabrina Dubroca
commit 9b3eb54106cf6acd03f07cf0ab01c13676a226c2 upstream. When CONFIG_XFRM_SUB_POLICY=y, xfrm_dst stores a copy of the flowi for that dst. Unfortunately, the code that allocates and fills this copy doesn't care about what type of flowi (flowi, flowi4, flowi6) gets passed. In multiple code paths (from raw_sendmsg, from TCP when replying to a FIN, in vxlan, geneve, and gre), the flowi that gets passed to xfrm is actually an on-stack flowi4, so we end up reading stuff from the stack past the end of the flowi4 struct. Since xfrm_dst->origin isn't used anywhere following commit ca116922afa8 ("xfrm: Eliminate "fl" and "pol" args to xfrm_bundle_ok()."), just get rid of it. xfrm_dst->partner isn't used either, so get rid of that too. Fixes: 9d6ec938019c ("ipv4: Use flowi4 in public route lookup interfaces.") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05coredump: Ensure proper size of sparse core filesDave Kleikamp
[ Upstream commit 4d22c75d4c7b5c5f4bd31054f09103ee490878fd ] If the last section of a core file ends with an unmapped or zero page, the size of the file does not correspond with the last dump_skip() call. gdb complains that the file is truncated and can be confusing to users. After all of the vma sections are written, make sure that the file size is no smaller than the current file position. This problem can be demonstrated with gdb's bigcore testcase on the sparc architecture. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05bpf: don't trigger OOM killer under pressure with map allocDaniel Borkmann
[ Upstream commit d407bd25a204bd66b7346dde24bd3d37ef0e0b05 ] This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(), that are to be used for map allocations. Using kmalloc() for very large allocations can cause excessive work within the page allocator, so i) fall back earlier to vmalloc() when the attempt is considered costly anyway, and even more importantly ii) don't trigger OOM killer with any of the allocators. Since this is based on a user space request, for example, when creating maps with element pre-allocation, we really want such requests to fail instead of killing other user space processes. Also, don't spam the kernel log with warnings should any of the allocations fail under pressure. Given that, we can make backend selection in bpf_map_area_alloc() generic, and convert all maps over to use this API for spots with potentially large allocation requests. Note, replacing the one kmalloc_array() is fine as overflow checks happen earlier in htab_map_alloc(), since it must also protect the multiplication for vmalloc() should kmalloc_array() fail. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05dt: bindings: net: use boolean dt properties for eee broken modesjbrunet
[ Upstream commit 308d3165d8b2b98d3dc3d97d6662062735daea67 ] The patches regarding eee-broken-modes was merged before all people involved could find an agreement on the best way to move forward. While we agreed on having a DT property to mark particular modes as broken, the value used for eee-broken-modes mapped the phy register in very direct way. Because of this, the concern is that it could be used to implement configuration policies instead of describing a broken HW. In the end, having a boolean property for each mode seems to be preferred over one bit field value mapping the register (too) directly. Cc: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05dt-bindings: net: add EEE capability constantsjbrunet
[ Upstream commit 1fc31357ad194fb98691f3d122bcd47e59239e83 ] Signed-off-by: Jerome Brunet <jbrunet@baylibre.com> Tested-by: Yegor Yefremov <yegorslists@googlemail.com> Tested-by: Andreas Färber <afaerber@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05net: phy: add an option to disable EEE advertisementjbrunet
[ Upstream commit d853d145ea3e63387a2ac759aa41d5e43876e561 ] This patch adds an option to disable EEE advertisement in the generic PHY by providing a mask of prohibited modes corresponding to the value found in the MDIO_AN_EEE_ADV register. On some platforms, PHY Low power idle seems to be causing issues, even breaking the link some cases. The patch provides a convenient way for these platforms to disable EEE advertisement and work around the issue. Signed-off-by: Jerome Brunet <jbrunet@baylibre.com> Tested-by: Yegor Yefremov <yegorslists@googlemail.com> Tested-by: Andreas Färber <afaerber@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05net: ethtool: add support for 2500BaseT and 5000BaseT link modesPavel Belous
[ Upstream commit 94842b4fc4d6b1691cfc86c6f5251f299d27f4ba ] This patch introduce support for 2500BaseT and 5000BaseT link modes. These modes are included in the new IEEE 802.3bz standard. Signed-off-by: Pavel Belous <pavel.s.belous@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-29time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accountingJohn Stultz
commit 3d88d56c5873f6eebe23e05c3da701960146b801 upstream. Due to how the MONOTONIC_RAW accumulation logic was handled, there is the potential for a 1ns discontinuity when we do accumulations. This small discontinuity has for the most part gone un-noticed, but since ARM64 enabled CLOCK_MONOTONIC_RAW in their vDSO clock_gettime implementation, we've seen failures with the inconsistency-check test in kselftest. This patch addresses the issue by using the same sub-ns accumulation handling that CLOCK_MONOTONIC uses, which avoids the issue for in-kernel users. Since the ARM64 vDSO implementation has its own clock_gettime calculation logic, this patch reduces the frequency of errors, but failures are still seen. The ARM64 vDSO will need to be updated to include the sub-nanosecond xtime_nsec values in its calculation for this issue to be completely fixed. Signed-off-by: John Stultz <john.stultz@linaro.org> Tested-by: Daniel Mentz <danielmentz@google.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Stephen Boyd <stephen.boyd@linaro.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Miroslav Lichvar <mlichvar@redhat.com> Link: http://lkml.kernel.org/r/1496965462-20003-3-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-29time: Fix clock->read(clock) race around clocksource changesJohn Stultz
commit ceea5e3771ed2378668455fa21861bead7504df5 upstream. In tests, which excercise switching of clocksources, a NULL pointer dereference can be observed on AMR64 platforms in the clocksource read() function: u64 clocksource_mmio_readl_down(struct clocksource *c) { return ~(u64)readl_relaxed(to_mmio_clksrc(c)->reg) & c->mask; } This is called from the core timekeeping code via: cycle_now = tkr->read(tkr->clock); tkr->read is the cached tkr->clock->read() function pointer. When the clocksource is changed then tkr->clock and tkr->read are updated sequentially. The code above results in a sequential load operation of tkr->read and tkr->clock as well. If the store to tkr->clock hits between the loads of tkr->read and tkr->clock, then the old read() function is called with the new clock pointer. As a consequence the read() function dereferences a different data structure and the resulting 'reg' pointer can point anywhere including NULL. This problem was introduced when the timekeeping code was switched over to use struct tk_read_base. Before that, it was theoretically possible as well when the compiler decided to reload clock in the code sequence: now = tk->clock->read(tk->clock); Add a helper function which avoids the issue by reading tk_read_base->clock once into a local variable clk and then issue the read function via clk->read(clk). This guarantees that the read() function always gets the proper clocksource pointer handed in. Since there is now no use for the tkr.read pointer, this patch also removes it, and to address stopping the fast timekeeper during suspend/resume, it introduces a dummy clocksource to use rather then just a dummy read function. Signed-off-by: John Stultz <john.stultz@linaro.org> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Stephen Boyd <stephen.boyd@linaro.org> Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: Daniel Mentz <danielmentz@google.com> Link: http://lkml.kernel.org/r/1496965462-20003-2-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-24mm: larger stack guard gap, between vmasHugh Dickins
commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream. Stack guard page is a useful feature to reduce a risk of stack smashing into a different mapping. We have been using a single page gap which is sufficient to prevent having stack adjacent to a different mapping. But this seems to be insufficient in the light of the stack usage in userspace. E.g. glibc uses as large as 64kB alloca() in many commonly used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX] which is 256kB or stack strings with MAX_ARG_STRLEN. This will become especially dangerous for suid binaries and the default no limit for the stack size limit because those applications can be tricked to consume a large portion of the stack and a single glibc call could jump over the guard page. These attacks are not theoretical, unfortunatelly. Make those attacks less probable by increasing the stack guard gap to 1MB (on systems with 4k pages; but make it depend on the page size because systems with larger base pages might cap stack allocations in the PAGE_SIZE units) which should cover larger alloca() and VLA stack allocations. It is obviously not a full fix because the problem is somehow inherent, but it should reduce attack space a lot. One could argue that the gap size should be configurable from userspace, but that can be done later when somebody finds that the new 1MB is wrong for some special case applications. For now, add a kernel command line option (stack_guard_gap) to specify the stack gap size (in page units). Implementation wise, first delete all the old code for stack guard page: because although we could get away with accounting one extra page in a stack vma, accounting a larger gap can break userspace - case in point, a program run with "ulimit -S -v 20000" failed when the 1MB gap was counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK and strict non-overcommit mode. Instead of keeping gap inside the stack vma, maintain the stack guard gap as a gap between vmas: using vm_start_gap() in place of vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places which need to respect the gap - mainly arch_get_unmapped_area(), and and the vma tree's subtree_gap support for that. Original-patch-by: Oleg Nesterov <oleg@redhat.com> Original-patch-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [wt: backport to 4.11: adjust context] [wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide] Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-24USB: hub: fix SS max number of portsJohan Hovold
commit 93491ced3c87c94b12220dbac0527e1356702179 upstream. Add define for the maximum number of ports on a SuperSpeed hub as per USB 3.1 spec Table 10-5, and use it when verifying the retrieved hub descriptor. This specifically avoids benign attempts to update the DeviceRemovable mask for non-existing ports (should we get that far). Fixes: dbe79bbe9dcb ("USB 3.0 Hub Changes") Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17netfilter: nft_log: restrict the log prefix length to 127Liping Zhang
[ Upstream commit 5ce6b04ce96896e8a79e6f60740ced911eaac7a4 ] First, log prefix will be truncated to NF_LOG_PREFIXLEN-1, i.e. 127, at nf_log_packet(), so the extra part is useless. Second, after adding a log rule with a very very long prefix, we will fail to dump the nft rules after this _special_ one, but acctually, they do exist. For example: # name_65000=$(printf "%0.sQ" {1..65000}) # nft add rule filter output log prefix "$name_65000" # nft add rule filter output counter # nft add rule filter output counter # nft list chain filter output table ip filter { chain output { type filter hook output priority 0; policy accept; } } So now, restrict the log prefix length to NF_LOG_PREFIXLEN-1. Fixes: 96518518cc41 ("netfilter: add nftables") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17kernel/watchdog: prevent false hardlockup on overloaded systemDon Zickus
[ Upstream commit b94f51183b0617e7b9b4fb4137d4cf1cab7547c2 ] On an overloaded system, it is possible that a change in the watchdog threshold can be delayed long enough to trigger a false positive. This can easily be achieved by having a cpu spinning indefinitely on a task, while another cpu updates watchdog threshold. What happens is while trying to park the watchdog threads, the hrtimers on the other cpus trigger and reprogram themselves with the new slower watchdog threshold. Meanwhile, the nmi watchdog is still programmed with the old faster threshold. Because the one cpu is blocked, it prevents the thread parking on the other cpus from completing, which is needed to shutdown the nmi watchdog and reprogram it correctly. As a result, a false positive from the nmi watchdog is reported. Fix this by setting a park_in_progress flag to block all lockups until the parking is complete. Fix provided by Ulrich Obergfell. [akpm@linux-foundation.org: s/park_in_progress/watchdog_park_in_progress/] Link: http://lkml.kernel.org/r/1481041033-192236-1-git-send-email-dzickus@redhat.com Signed-off-by: Don Zickus <dzickus@redhat.com> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17kernel/watchdog.c: move shared definitions to nmi.hBabu Moger
[ Upstream commit 249e52e35580fcfe5dad53a7dcd7c1252788749c ] Patch series "Clean up watchdog handlers", v2. This is an attempt to cleanup watchdog handlers. Right now, kernel/watchdog.c implements both softlockup and hardlockup detectors. Softlockup code is generic. Hardlockup code is arch specific. Some architectures don't use hardlockup detectors. They use their own watchdog detectors. To make both these combination work, we have numerous #ifdefs in kernel/watchdog.c. We are trying here to make these handlers independent of each other. Also provide an interface for architectures to implement their own handlers. watchdog_nmi_enable and watchdog_nmi_disable will be defined as weak such that architectures can override its definitions. Thanks to Don Zickus for his suggestions. Here are our previous discussions http://www.spinics.net/lists/sparclinux/msg16543.html http://www.spinics.net/lists/sparclinux/msg16441.html This patch (of 3): Move shared macros and definitions to nmi.h so that watchdog.c, new file watchdog_hld.c or any other architecture specific handler can use those definitions. Link: http://lkml.kernel.org/r/1478034826-43888-2-git-send-email-babu.moger@oracle.com Signed-off-by: Babu Moger <babu.moger@oracle.com> Acked-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Andi Kleen <andi@firstfloor.org> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Josh Hunt <johunt@akamai.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17net: phy: micrel: add support for KSZ8795Sean Nyekjaer
[ Upstream commit 9d162ed69f51cbd9ee5a0c7e82aba7acc96362ff ] This is adds support for the PHYs in the KSZ8795 5port managed switch. It will allow to detect the link between the switch and the soc and uses the same read_status functions as the KSZ8873MLL switch. Signed-off-by: Sean Nyekjaer <sean.nyekjaer@prevas.dk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17drm: Don't race connector registrationDaniel Vetter
[ Upstream commit e6e7b48b295afa5a5ab440de0a94d9ad8b3ce2d0 ] I was under the misconception that the sysfs dev stuff can be fully set up, and then registered all in one step with device_add. That's true for properties and property groups, but not for parents and child devices. Those must be fully registered before you can register a child. Add a bit of tracking to make sure that asynchronous mst connector hotplugging gets this right. For consistency we rely upon the implicit barriers of the connector->mutex, which is taken anyway, to ensure that at least either the connector or device registration call will work out. Mildly tested since I can't reliably reproduce this on my mst box here. Reported-by: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Acked-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1484237756-2720-1-git-send-email-daniel.vetter@ffwll.ch Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17drm: prevent double-(un)registration for connectorsDaniel Vetter
[ Upstream commit e73ab00e9a0f1731f34d0620a9c55f5c30c4ad4e ] If we're unlucky then the registration from a hotplugged connector might race with the final registration step on driver load. And since MST topology discover is asynchronous that's even somewhat likely. v2: Also update the kerneldoc for @registered! v3: Review from Chris: - Improve kerneldoc for late_register/early_unregister callbacks. - Use mutex_destroy. Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Sean Paul <seanpaul@chromium.org> Reported-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20161218133545.2106-1-daniel.vetter@ffwll.ch Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17ipv6: fix flow labels when the traffic class is non-0Dimitris Michailidis
[ Upstream commit 90427ef5d2a4b9a24079889bf16afdcdaebc4240 ] ip6_make_flowlabel() determines the flow label for IPv6 packets. It's supposed to be passed a flow label, which it returns as is if non-0 and in some other cases, otherwise it calculates a new value. The problem is callers often pass a flowi6.flowlabel, which may also contain traffic class bits. If the traffic class is non-0 ip6_make_flowlabel() mistakes the non-0 it gets as a flow label and returns the whole thing. Thus it can return a 'flow label' longer than 20b and the low 20b of that is typically 0 resulting in packets with 0 label. Moreover, different packets of a flow may be labeled differently. For a TCP flow with ECN non-payload and payload packets get different labels as exemplified by this pair of consecutive packets: (pure ACK) Internet Protocol Version 6, Src: 2002:af5:11a3::, Dst: 2002:af5:11a2:: 0110 .... = Version: 6 .... 0000 0000 .... .... .... .... .... = Traffic Class: 0x00 (DSCP: CS0, ECN: Not-ECT) .... 0000 00.. .... .... .... .... .... = Differentiated Services Codepoint: Default (0) .... .... ..00 .... .... .... .... .... = Explicit Congestion Notification: Not ECN-Capable Transport (0) .... .... .... 0001 1100 1110 0100 1001 = Flow Label: 0x1ce49 Payload Length: 32 Next Header: TCP (6) (payload) Internet Protocol Version 6, Src: 2002:af5:11a3::, Dst: 2002:af5:11a2:: 0110 .... = Version: 6 .... 0000 0010 .... .... .... .... .... = Traffic Class: 0x02 (DSCP: CS0, ECN: ECT(0)) .... 0000 00.. .... .... .... .... .... = Differentiated Services Codepoint: Default (0) .... .... ..10 .... .... .... .... .... = Explicit Congestion Notification: ECN-Capable Transport codepoint '10' (2) .... .... .... 0000 0000 0000 0000 0000 = Flow Label: 0x00000 Payload Length: 688 Next Header: TCP (6) This patch allows ip6_make_flowlabel() to be passed more than just a flow label and has it extract the part it really wants. This was simpler than modifying the callers. With this patch packets like the above become Internet Protocol Version 6, Src: 2002:af5:11a3::, Dst: 2002:af5:11a2:: 0110 .... = Version: 6 .... 0000 0000 .... .... .... .... .... = Traffic Class: 0x00 (DSCP: CS0, ECN: Not-ECT) .... 0000 00.. .... .... .... .... .... = Differentiated Services Codepoint: Default (0) .... .... ..00 .... .... .... .... .... = Explicit Congestion Notification: Not ECN-Capable Transport (0) .... .... .... 1010 1111 1010 0101 1110 = Flow Label: 0xafa5e Payload Length: 32 Next Header: TCP (6) Internet Protocol Version 6, Src: 2002:af5:11a3::, Dst: 2002:af5:11a2:: 0110 .... = Version: 6 .... 0000 0010 .... .... .... .... .... = Traffic Class: 0x02 (DSCP: CS0, ECN: ECT(0)) .... 0000 00.. .... .... .... .... .... = Differentiated Services Codepoint: Default (0) .... .... ..10 .... .... .... .... .... = Explicit Congestion Notification: ECN-Capable Transport codepoint '10' (2) .... .... .... 1010 1111 1010 0101 1110 = Flow Label: 0xafa5e Payload Length: 688 Next Header: TCP (6) Signed-off-by: Dimitris Michailidis <dmichail@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17fscache: Fix dead object requeueDavid Howells
[ Upstream commit e26bfebdfc0d212d366de9990a096665d5c0209a ] Under some circumstances, an fscache object can become queued such that it fscache_object_work_func() can be called once the object is in the OBJECT_DEAD state. This results in the kernel oopsing when it tries to invoke the handler for the state (which is hard coded to 0x2). The way this comes about is something like the following: (1) The object dispatcher is processing a work state for an object. This is done in workqueue context. (2) An out-of-band event comes in that isn't masked, causing the object to be queued, say EV_KILL. (3) The object dispatcher finishes processing the current work state on that object and then sees there's another event to process, so, without returning to the workqueue core, it processes that event too. It then follows the chain of events that initiates until we reach OBJECT_DEAD without going through a wait state (such as WAIT_FOR_CLEARANCE). At this point, object->events may be 0, object->event_mask will be 0 and oob_event_mask will be 0. (4) The object dispatcher returns to the workqueue processor, and in due course, this sees that the object's work item is still queued and invokes it again. (5) The current state is a work state (OBJECT_DEAD), so the dispatcher jumps to it - resulting in an OOPS. When I'm seeing this, the work state in (1) appears to have been either LOOK_UP_OBJECT or CREATE_OBJECT (object->oob_table is fscache_osm_lookup_oob). The window for (2) is very small: (A) object->event_mask is cleared whilst the event dispatch process is underway - though there's no memory barrier to force this to the top of the function. The window, therefore is from the time the object was selected by the workqueue processor and made requeueable to the time the mask was cleared. (B) fscache_raise_event() will only queue the object if it manages to set the event bit and the corresponding event_mask bit was set. The enqueuement is then deferred slightly whilst we get a ref on the object and get the per-CPU variable for workqueue congestion. This slight deferral slightly increases the probability by allowing extra time for the workqueue to make the item requeueable. Handle this by giving the dead state a processor function and checking the for the dead state address rather than seeing if the processor function is address 0x2. The dead state processor function can then set a flag to indicate that it's occurred and give a warning if it occurs more than once per object. If this race occurs, an oops similar to the following is seen (note the RIP value): BUG: unable to handle kernel NULL pointer dereference at 0000000000000002 IP: [<0000000000000002>] 0x1 PGD 0 Oops: 0010 [#1] SMP Modules linked in: ... CPU: 17 PID: 16077 Comm: kworker/u48:9 Not tainted 3.10.0-327.18.2.el7.x86_64 #1 Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015 Workqueue: fscache_object fscache_object_work_func [fscache] task: ffff880302b63980 ti: ffff880717544000 task.ti: ffff880717544000 RIP: 0010:[<0000000000000002>] [<0000000000000002>] 0x1 RSP: 0018:ffff880717547df8 EFLAGS: 00010202 RAX: ffffffffa0368640 RBX: ffff880edf7a4480 RCX: dead000000200200 RDX: 0000000000000002 RSI: 00000000ffffffff RDI: ffff880edf7a4480 RBP: ffff880717547e18 R08: 0000000000000000 R09: dfc40a25cb3a4510 R10: dfc40a25cb3a4510 R11: 0000000000000400 R12: 0000000000000000 R13: ffff880edf7a4510 R14: ffff8817f6153400 R15: 0000000000000600 FS: 0000000000000000(0000) GS:ffff88181f420000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000002 CR3: 000000000194a000 CR4: 00000000001407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffffffffa0363695 ffff880edf7a4510 ffff88093f16f900 ffff8817faa4ec00 ffff880717547e60 ffffffff8109d5db 00000000faa4ec18 0000000000000000 ffff8817faa4ec18 ffff88093f16f930 ffff880302b63980 ffff88093f16f900 Call Trace: [<ffffffffa0363695>] ? fscache_object_work_func+0xa5/0x200 [fscache] [<ffffffff8109d5db>] process_one_work+0x17b/0x470 [<ffffffff8109e4ac>] worker_thread+0x21c/0x400 [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400 [<ffffffff810a5acf>] kthread+0xcf/0xe0 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 [<ffffffff816460d8>] ret_from_fork+0x58/0x90 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jeremy McNicoll <jeremymc@redhat.com> Tested-by: Frank Sorenson <sorenson@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17net: fix ndo_features_check/ndo_fix_features comment orderingDimitris Michailidis
[ Upstream commit 1a2a14444d32b89b28116daea86f63ced1716668 ] Commit cdba756f5803a2 ("net: move ndo_features_check() close to ndo_start_xmit()") inadvertently moved the doc comment for .ndo_fix_features instead of .ndo_features_check. Fix the comment ordering. Fixes: cdba756f5803a2 ("net: move ndo_features_check() close to ndo_start_xmit()") Signed-off-by: Dimitris Michailidis <dmichail@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17log2: make order_base_2() behave correctly on const input value zeroArd Biesheuvel
commit 29905b52fad0854351f57bab867647e4982285bf upstream. The function order_base_2() is defined (according to the comment block) as returning zero on input zero, but subsequently passes the input into roundup_pow_of_two(), which is explicitly undefined for input zero. This has gone unnoticed until now, but optimization passes in GCC 7 may produce constant folded function instances where a constant value of zero is passed into order_base_2(), resulting in link errors against the deliberately undefined '____ilog2_NaN'. So update order_base_2() to adhere to its own documented interface. [ See http://marc.info/?l=linux-kernel&m=147672952517795&w=2 and follow-up discussion for more background. The gcc "optimization pass" is really just broken, but now the GCC trunk problem seems to have escaped out of just specially built daily images, so we need to work around it in mainline. - Linus ] Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17PCI/PM: Add needs_resume flag to avoid suspend complete optimizationImre Deak
commit 4d071c3238987325b9e50e33051a40d1cce311cc upstream. Some drivers - like i915 - may not support the system suspend direct complete optimization due to differences in their runtime and system suspend sequence. Add a flag that when set resumes the device before calling the driver's system suspend handlers which effectively disables the optimization. Needed by a future patch fixing suspend/resume on i915. Suggested by Rafael. Signed-off-by: Imre Deak <imre.deak@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: stable@vger.kernel.org (rebased on v4.8, added kernel version to commit message stable tag) Signed-off-by: Imre Deak <imre.deak@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14fs: add i_blocksize()Fabian Frederick
commit 93407472a21b82f39c955ea7787e5bc7da100642 upstream. Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs branch. This patch also fixes multiple checkpatch warnings: WARNING: Prefer 'unsigned int' to bare use of 'unsigned' Thanks to Andrew Morton for suggesting more appropriate function instead of macro. [geliangtang@gmail.com: truncate: use i_blocksize()] Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14cpuset: consider dying css as offlineTejun Heo
commit 41c25707d21716826e3c1f60967f5550610ec1c9 upstream. In most cases, a cgroup controller don't care about the liftimes of cgroups. For the controller, a css becomes online when ->css_online() is called on it and offline when ->css_offline() is called. However, cpuset is special in that the user interface it exposes cares whether certain cgroups exist or not. Combined with the RCU delay between cgroup removal and css offlining, this can lead to user visible behavior oddities where operations which should succeed after cgroup removals fail for some time period. The effects of cgroup removals are delayed when seen from userland. This patch adds css_is_dying() which tests whether offline is pending and updates is_cpuset_online() so that the function returns false also while offline is pending. This gets rid of the userland visible delays. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com> Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14cgroup: Prevent kill_css() from being called more than onceWaiman Long
commit 33c35aa4817864e056fd772230b0c6b552e36ea2 upstream. The kill_css() function may be called more than once under the condition that the css was killed but not physically removed yet followed by the removal of the cgroup that is hosting the css. This patch prevents any harmm from being done when that happens. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14ptrace: Properly initialize ptracer_cred on forkEric W. Biederman
commit c70d9d809fdeecedb96972457ee45c49a232d97f upstream. When I introduced ptracer_cred I failed to consider the weirdness of fork where the task_struct copies the old value by default. This winds up leaving ptracer_cred set even when a process forks and the child process does not wind up being ptraced. Because ptracer_cred is not set on non-ptraced processes whose parents were ptraced this has broken the ability of the enlightenment window manager to start setuid children. Fix this by properly initializing ptracer_cred in ptrace_init_task This must be done with a little bit of care to preserve the current value of ptracer_cred when ptrace carries through fork. Re-reading the ptracer_cred from the ptracing process at this point is inconsistent with how PT_PTRACE_CAP has been maintained all of these years. Tested-by: Takashi Iwai <tiwai@suse.de> Fixes: 64b875f7ac8a ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14net: ping: do not abuse udp_poll()Eric Dumazet
[ Upstream commit 77d4b1d36926a9b8387c6b53eeba42bcaaffcea3 ] Alexander reported various KASAN messages triggered in recent kernels The problem is that ping sockets should not use udp_poll() in the first place, and recent changes in UDP stack finally exposed this old bug. Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind") Fixes: 6d0bfe226116 ("net: ipv6: Add IPv6 support to the ping socket.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Sasha Levin <alexander.levin@verizon.com> Cc: Solar Designer <solar@openwall.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: Lorenzo Colitti <lorenzo@google.com> Acked-By: Lorenzo Colitti <lorenzo@google.com> Tested-By: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07mm: consider memblock reservations for deferred memory initialization sizingMichal Hocko
commit 864b9a393dcb5aed09b8fd31b9bbda0fdda99374 upstream. We have seen an early OOM killer invocation on ppc64 systems with crashkernel=4096M: kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0 kthreadd cpuset=/ mems_allowed=7 CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1 Call Trace: dump_stack+0xb0/0xf0 (unreliable) dump_header+0xb0/0x258 out_of_memory+0x5f0/0x640 __alloc_pages_nodemask+0xa8c/0xc80 kmem_getpages+0x84/0x1a0 fallback_alloc+0x2a4/0x320 kmem_cache_alloc_node+0xc0/0x2e0 copy_process.isra.25+0x260/0x1b30 _do_fork+0x94/0x470 kernel_thread+0x48/0x60 kthreadd+0x264/0x330 ret_from_kernel_thread+0x5c/0xa4 Mem-Info: active_anon:0 inactive_anon:0 isolated_anon:0 active_file:0 inactive_file:0 isolated_file:0 unevictable:0 dirty:0 writeback:0 unstable:0 slab_reclaimable:5 slab_unreclaimable:73 mapped:0 shmem:0 pagetables:0 bounce:0 free:0 free_pcp:0 free_cma:0 Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB 0 total pagecache pages 0 pages in swap cache Swap cache stats: add 0, delete 0, find 0/0 Free swap = 0kB Total swap = 0kB 819200 pages RAM 0 pages HighMem/MovableOnly 817481 pages reserved 0 pages cma reserved 0 pages hwpoisoned the reason is that the managed memory is too low (only 110MB) while the rest of the the 50GB is still waiting for the deferred intialization to be done. update_defer_init estimates the initial memoty to initialize to 2GB at least but it doesn't consider any memory allocated in that range. In this particular case we've had Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB) so the low 2GB is mostly depleted. Fix this by considering memblock allocations in the initial static initialization estimation. Move the max_initialise to reset_deferred_meminit and implement a simple memblock_reserved_memory helper which iterates all reserved blocks and sums the size of all that start below the given address. The cumulative size is than added on top of the initial estimation. This is still not ideal because reset_deferred_meminit doesn't consider holes and so reservation might be above the initial estimation whihch we ignore but let's make the logic simpler until we really need to handle more complicated cases. Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set") Link: http://lkml.kernel.org/r/20170531104010.GI27783@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07ipv4: add reference counting to metricsEric Dumazet
[ Upstream commit 3fb07daff8e99243366a081e5129560734de4ada ] Andrey Konovalov reported crashes in ipv4_mtu() I could reproduce the issue with KASAN kernels, between 10.246.7.151 and 10.246.7.152 : 1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 & 2) At the same time run following loop : while : do ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500 ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500 done Cong Wang attempted to add back rt->fi in commit 82486aa6f1b9 ("ipv4: restore rt->fi for reference counting") but this proved to add some issues that were complex to solve. Instead, I suggested to add a refcount to the metrics themselves, being a standalone object (in particular, no reference to other objects) I tried to make this patch as small as possible to ease its backport, instead of being super clean. Note that we believe that only ipv4 dst need to take care of the metric refcount. But if this is wrong, this patch adds the basic infrastructure to extend this to other families. Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang for his efforts on this problem. Fixes: 2860583fe840 ("ipv4: Kill rt->fi") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07vlan: Fix tcp checksum offloads in Q-in-Q vlansVlad Yasevich
[ Upstream commit 35d2f80b07bbe03fb358afb0bdeff7437a7d67ff ] It appears that TCP checksum offloading has been broken for Q-in-Q vlans. The behavior was execerbated by the series commit afb0bc972b52 ("Merge branch 'stacked_vlan_tso'") that that enabled accleleration features on stacked vlans. However, event without that series, it is possible to trigger this issue. It just requires a lot more specialized configuration. The root cause is the interaction between how netdev_intersect_features() works, the features actually set on the vlan devices and HW having the ability to run checksum with longer headers. The issue starts when netdev_interesect_features() replaces NETIF_F_HW_CSUM with a combination of NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM, if the HW advertises IP|IPV6 specific checksums. This happens for tagged and multi-tagged packets. However, HW that enables IP|IPV6 checksum offloading doesn't gurantee that packets with arbitrarily long headers can be checksummed. This patch disables IP|IPV6 checksums on the packet for multi-tagged packets. CC: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> CC: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07net/mlx5: Avoid using pending command interface slotsMohamad Haj Yahia
[ Upstream commit 73dd3a4839c1d27c36d4dcc92e1ff44225ecbeb7 ] Currently when firmware command gets stuck or it takes long time to complete, the driver command will get timeout and the command slot is freed and can be used for new commands, and if the firmware receive new command on the old busy slot its behavior is unexpected and this could be harmful. To fix this when the driver command gets timeout we return failure, but we don't free the command slot and we wait for the firmware to explicitly respond to that command. Once all the entries are busy we will stop processing new firmware commands. Fixes: 9cba4ebcf374 ('net/mlx5: Fix potential deadlock in command mode change') Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Cc: kernel-team@fb.com Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25tracing/kprobes: Enforce kprobes teardown after testingThomas Gleixner
commit 30e7d894c1478c88d50ce94ddcdbd7f9763d9cdd upstream. Enabling the tracer selftest triggers occasionally the warning in text_poke(), which warns when the to be modified page is not marked reserved. The reason is that the tracer selftest installs kprobes on functions marked __init for testing. These probes are removed after the tests, but that removal schedules the delayed kprobes_optimizer work, which will do the actual text poke. If the work is executed after the init text is freed, then the warning triggers. The bug can be reproduced reliably when the work delay is increased. Flush the optimizer work and wait for the optimizing/unoptimizing lists to become empty before returning from the kprobes tracer selftest. That ensures that all operations which were queued due to the probes removal have completed. Link: http://lkml.kernel.org/r/20170516094802.76a468bb@gandalf.local.home Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Fixes: 6274de498 ("kprobes: Support delayed unoptimizing") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25iio: hid-sensor: Store restore poll and hysteresis on S3Srinivas Pandruvada
commit 5d9854eaea776441b38a9a45b4e6879524c4f48c upstream. This change undo the change done by 'commit 3bec24747446 ("iio: hid-sensor-trigger: Change get poll value function order to avoid sensor properties losing after resume from S3")' as this breaks some USB/i2c sensor hubs. Instead of relying on HW for restoring poll and hysteresis, driver stores and restores on resume (S3). In this way user space modified settings are not lost for any kind of sensor hub behavior. In this change, whenever user space modifies sampling frequency or hysteresis driver will get the feature value from the hub and store in the per device hid_sensor_common data structure. On resume callback from S3, system will set the feature to sensor hub, if user space ever modified the feature value. Fixes: 3bec24747446 ("iio: hid-sensor-trigger: Change get poll value function order to avoid sensor properties losing after resume from S3") Reported-by: Ritesh Raj Sarraf <rrs@researchut.com> Tested-by: Ritesh Raj Sarraf <rrs@researchut.com> Tested-by: Song, Hongyan <hongyan.song@intel.com> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Jonathan Cameron <jic23@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-14block: get rid of blk_integrity_revalidate()Ilya Dryomov
commit 19b7ccf8651df09d274671b53039c672a52ad84d upstream. Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk") introduced blk_integrity_revalidate(), which seems to assume ownership of the stable pages flag and unilaterally clears it if no blk_integrity profile is registered: if (bi->profile) disk->queue->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES; else disk->queue->backing_dev_info->capabilities &= ~BDI_CAP_STABLE_WRITES; It's called from revalidate_disk() and rescan_partitions(), making it impossible to enable stable pages for drivers that support partitions and don't use blk_integrity: while the call in revalidate_disk() can be trivially worked around (see zram, which doesn't support partitions and hence gets away with zram_revalidate_disk()), rescan_partitions() can be triggered from userspace at any time. This breaks rbd, where the ceph messenger is responsible for generating/verifying CRCs. Since blk_integrity_{un,}register() "must" be used for (un)registering the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES setting there. This way drivers that call blk_integrity_register() and use integrity infrastructure won't interfere with drivers that don't but still want stable pages. Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk") Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@redhat.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> [idryomov@gmail.com: backport to < 4.11: bdi is embedded in queue] Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-14xen: Revert commits da72ff5bfcb0 and 72a9b186292dBoris Ostrovsky
commit 84d582d236dc1f9085e741affc72e9ba061a67c2 upstream. Recent discussion (http://marc.info/?l=xen-devel&m=149192184523741) established that commit 72a9b186292d ("xen: Remove event channel notification through Xen PCI platform device") (and thus commit da72ff5bfcb0 ("partially revert "xen: Remove event channel notification through Xen PCI platform device"")) are unnecessary and, in fact, prevent HVM guests from booting on Xen releases prior to 4.0 Therefore we revert both of those commits. The summary of that discussion is below: Here is the brief summary of the current situation: Before the offending commit (72a9b186292): 1) INTx does not work because of the reset_watches path. 2) The reset_watches path is only taken if you have Xen > 4.0 3) The Linux Kernel by default will use vector inject if the hypervisor support. So even INTx does not work no body running the kernel with Xen > 4.0 would notice. Unless he explicitly disabled this feature either in the kernel or in Xen (and this can only be disabled by modifying the code, not user-supported way to do it). After the offending commit (+ partial revert): 1) INTx is no longer support for HVM (only for PV guests). 2) Any HVM guest The kernel will not boot on Xen < 4.0 which does not have vector injection support. Since the only other mode supported is INTx which. So based on this summary, I think before commit (72a9b186292) we were in much better position from a user point of view. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Julien Grall <julien.grall@arm.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Ross Lagerwall <ross.lagerwall@citrix.com> Cc: xen-devel@lists.xenproject.org Cc: linux-kernel@vger.kernel.org Cc: linux-pci@vger.kernel.org Cc: Anthony Liguori <aliguori@amazon.com> Cc: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-14f2fs: sanity check segment countJin Qian
commit b9dd46188edc2f0d1f37328637860bb65a771124 upstream. F2FS uses 4 bytes to represent block address. As a result, supported size of disk is 16 TB and it equals to 16 * 1024 * 1024 / 2 segments. Signed-off-by: Jin Qian <jinqian@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-14ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notfWANG Cong
[ Upstream commit 242d3a49a2a1a71d8eb9f953db1bcaa9d698ce00 ] For each netns (except init_net), we initialize its null entry in 3 places: 1) The template itself, as we use kmemdup() 2) Code around dst_init_metrics() in ip6_route_net_init() 3) ip6_route_dev_notify(), which is supposed to initialize it after loopback registers Unfortunately the last one still happens in a wrong order because we expect to initialize net->ipv6.ip6_null_entry->rt6i_idev to net->loopback_dev's idev, thus we have to do that after we add idev to loopback. However, this notifier has priority == 0 same as ipv6_dev_notf, and ipv6_dev_notf is registered after ip6_route_dev_notifier so it is called actually after ip6_route_dev_notifier. This is similar to commit 2f460933f58e ("ipv6: initialize route null entry in addrconf_init()") which fixes init_net. Fix it by picking a smaller priority for ip6_route_dev_notifier. Also, we have to release the refcnt accordingly when unregistering loopback_dev because device exit functions are called before subsys exit functions. Acked-by: David Ahern <dsahern@gmail.com> Tested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-14ipv6: initialize route null entry in addrconf_init()WANG Cong
[ Upstream commit 2f460933f58eee3393aba64f0f6d14acb08d1724 ] Andrey reported a crash on init_net.ipv6.ip6_null_entry->rt6i_idev since it is always NULL. This is clearly wrong, we have code to initialize it to loopback_dev, unfortunately the order is still not correct. loopback_dev is registered very early during boot, we lose a chance to re-initialize it in notifier. addrconf_init() is called after ip6_route_init(), which means we have no chance to correct it. Fix it by moving this initialization explicitly after ipv6_add_dev(init_net.loopback_dev) in addrconf_init(). Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-14usb: chipidea: Handle extcon events properlyStephen Boyd
commit a89b94b53371bbfa582787c2fa3378000ea4263d upstream. We're currently emulating the vbus and id interrupts in the OTGSC read API, but we also need to make sure that if we're handling the events with extcon that we don't enable the interrupts for those events in the hardware. Therefore, properly emulate this register if we're using extcon, but don't enable the interrupts. This allows me to get my cable connect/disconnect working properly without getting spurious interrupts on my device that uses an extcon for these two events. Acked-by: Peter Chen <peter.chen@nxp.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Ivan T. Ivanov" <iivanov.xz@gmail.com> Fixes: 3ecb3e09b042 ("usb: chipidea: Use extcon framework for VBUS and ID detect") Signed-off-by: Stephen Boyd <stephen.boyd@linaro.org> Signed-off-by: Peter Chen <peter.chen@nxp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>