summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2018-04-13blk-mq: fix kernel oops in blk_mq_tag_idle()Ming Lei
[ Upstream commit 8ab0b7dc73e1b3e2987d42554b2bff503f692772 ] HW queues may be unmapped in some cases, such as blk_mq_update_nr_hw_queues(), then we need to check it before calling blk_mq_tag_idle(), otherwise the following kernel oops can be triggered, so fix it by checking if the hw queue is unmapped since it doesn't make sense to idle the tags any more after hw queues are unmapped. [ 440.771298] Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma] [ 440.779104] task: ffff894bae755ee0 ti: ffff893bf9bc8000 task.ti: ffff893bf9bc8000 [ 440.788359] RIP: 0010:[<ffffffffb730e2b4>] [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40 [ 440.798697] RSP: 0018:ffff893bf9bcbd10 EFLAGS: 00010286 [ 440.805538] RAX: 0000000000000000 RBX: ffff895bb131dc00 RCX: 000000000000011f [ 440.814426] RDX: 00000000ffffffff RSI: 0000000000000120 RDI: ffff895bb131dc00 [ 440.823301] RBP: ffff893bf9bcbd10 R08: 000000000001b860 R09: 4a51d361c00c0000 [ 440.832193] R10: b5907f32b4cc7003 R11: ffffd6cabfb57000 R12: ffff894bafd1e008 [ 440.841091] R13: 0000000000000001 R14: ffff895baf770000 R15: 0000000000000080 [ 440.849988] FS: 0000000000000000(0000) GS:ffff894bbdcc0000(0000) knlGS:0000000000000000 [ 440.859955] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 440.867274] CR2: 0000000000000008 CR3: 000000103d098000 CR4: 00000000001407e0 [ 440.876169] Call Trace: [ 440.879818] [<ffffffffb7309d68>] blk_mq_exit_hctx+0xd8/0xe0 [ 440.887051] [<ffffffffb730dc40>] blk_mq_free_queue+0xf0/0x160 [ 440.894465] [<ffffffffb72ff679>] blk_cleanup_queue+0xd9/0x150 [ 440.901881] [<ffffffffc08a802b>] nvme_ns_remove+0x5b/0xb0 [nvme_core] [ 440.910068] [<ffffffffc08a811b>] nvme_remove_namespaces+0x3b/0x60 [nvme_core] [ 440.919026] [<ffffffffc08b817b>] __nvme_rdma_remove_ctrl+0x2b/0xb0 [nvme_rdma] [ 440.928079] [<ffffffffc08b8237>] nvme_rdma_del_ctrl_work+0x17/0x20 [nvme_rdma] [ 440.937126] [<ffffffffb70ab58a>] process_one_work+0x17a/0x440 [ 440.944517] [<ffffffffb70ac3a8>] worker_thread+0x278/0x3c0 [ 440.951607] [<ffffffffb70ac130>] ? manage_workers.isra.24+0x2a0/0x2a0 [ 440.959760] [<ffffffffb70b352f>] kthread+0xcf/0xe0 [ 440.966055] [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40 [ 440.973715] [<ffffffffb76d8658>] ret_from_fork+0x58/0x90 [ 440.980586] [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40 [ 440.988229] Code: 5b 41 5c 5d c3 66 90 0f 1f 44 00 00 48 8b 87 20 01 00 00 f0 0f ba 77 40 01 19 d2 85 d2 75 08 c3 0f 1f 80 00 00 00 00 55 48 89 e5 <f0> ff 48 08 48 8d 78 10 e8 7f 0f 05 00 5d c3 0f 1f 00 66 2e 0f [ 441.011620] RIP [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40 [ 441.019301] RSP <ffff893bf9bcbd10> [ 441.024052] CR2: 0000000000000008 Reported-by: Zhang Yi <yizhan@redhat.com> Tested-by: Zhang Yi <yizhan@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13bio-integrity: Do not allocate integrity context for bio w/o dataDmitry Monakhov
[ Upstream commit 3116a23bb30272d74ea81baf5d0ee23f602dd15b ] If bio has no data, such as ones from blkdev_issue_flush(), then we have nothing to protect. This patch prevent bugon like follows: kfree_debugcheck: out of range ptr ac1fa1d106742a5ah kernel BUG at mm/slab.c:2773! invalid opcode: 0000 [#1] SMP Modules linked in: bcache CPU: 0 PID: 4428 Comm: xfs_io Tainted: G W 4.11.0-rc4-ext4-00041-g2ef0043-dirty #43 Hardware name: Virtuozzo KVM, BIOS seabios-1.7.5-11.vz7.4 04/01/2014 task: ffff880137786440 task.stack: ffffc90000ba8000 RIP: 0010:kfree_debugcheck+0x25/0x2a RSP: 0018:ffffc90000babde0 EFLAGS: 00010082 RAX: 0000000000000034 RBX: ac1fa1d106742a5a RCX: 0000000000000007 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013f3ccb40 RBP: ffffc90000babde8 R08: 0000000000000000 R09: 0000000000000000 R10: 00000000fcb76420 R11: 00000000725172ed R12: 0000000000000282 R13: ffffffff8150e766 R14: ffff88013a145e00 R15: 0000000000000001 FS: 00007fb09384bf40(0000) GS:ffff88013f200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fd0172f9e40 CR3: 0000000137fa9000 CR4: 00000000000006f0 Call Trace: kfree+0xc8/0x1b3 bio_integrity_free+0xc3/0x16b bio_free+0x25/0x66 bio_put+0x14/0x26 blkdev_issue_flush+0x7a/0x85 blkdev_fsync+0x35/0x42 vfs_fsync_range+0x8e/0x9f vfs_fsync+0x1c/0x1e do_fsync+0x31/0x4a SyS_fsync+0x10/0x14 entry_SYSCALL_64_fastpath+0x1f/0xc2 Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13blk-mq: fix race between updating nr_hw_queues and switching io schedMing Lei
[ Upstream commit fb350e0ad99359768e1e80b4784692031ec340e4 ] In both elevator_switch_mq() and blk_mq_update_nr_hw_queues(), sched tags can be allocated, and q->nr_hw_queue is used, and race is inevitable, for example: blk_mq_init_sched() may trigger use-after-free on hctx, which is freed in blk_mq_realloc_hw_ctxs() when nr_hw_queues is decreased. This patch fixes the race be holding q->sysfs_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reported-by: Yi Zhang <yi.zhang@redhat.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13block: fix an error code in add_partition()Dan Carpenter
[ Upstream commit 7bd897cfce1eb373892d35d7f73201b0f9b221c4 ] We don't set an error code on this path. It means that we return NULL instead of an error pointer and the caller does a NULL dereference. Fixes: 6d1d8050b4bc ("block, partition: add partition_meta_info to hd_struct") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13blk-mq: NVMe 512B/4K+T10 DIF/DIX format returns I/O error on dd with split opWen Xiong
[ Upstream commit f36ea50ca0043e7b1204feaf1d2ba6bd68c08d36 ] When formatting NVMe to 512B/4K + T10 DIf/DIX, dd with split op returns "Input/output error". Looks block layer split the bio after calling bio_integrity_prep(bio). This patch fixes the issue. Below is how we debug this issue: (1)format nvme to 4K block # size with type 2 DIF (2)dd with block size bigger than 1024k. oflag=direct dd: error writing '/dev/nvme0n1': Input/output error We added some debug code in nvme device driver. It showed us the first op and the second op have the same bi and pi address. This is not correct. 1st op: nvme0n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400, dsmgmt=0x0, AT=0x0 & RT=0x505 Guard 0x00b1, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828 2nd op: nvme0n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0, AT=0x0 & RT=0x605 ==> This op fails and subsequent 5 retires.. Guard 0x00b1, AT 0x0000, RT physical 0x00000605 RT virtual 0x00002828 With the fix, It showed us both of the first op and the second op have correct bi and pi address. 1st op: nvme2n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400, dsmgmt=0x0, AT=0x0 & RT=0x505 Guard 0x5ccb, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828 2nd op: nvme2n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0, AT=0x0 & RT=0x605 Guard 0xab4c, AT 0x0000, RT physical 0x00000605 RT virtual 0x00003028 Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-08Fix slab name "biovec-(1<<(21-12))"Mikulas Patocka
commit bd5c4facf59648581d2f1692dad7b107bf429954 upstream. I'm getting a slab named "biovec-(1<<(21-12))". It is caused by unintended expansion of the macro BIO_MAX_PAGES. This patch renames it to biovec-max. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-08partitions/msdos: Unable to mount UFS 44bsd partitionsRichard Narron
commit 5f15684bd5e5ef39d4337988864fec8012471dda upstream. UFS partitions from newer versions of FreeBSD 10 and 11 use relative addressing for their subpartitions. But older versions of FreeBSD still use absolute addressing just like OpenBSD and NetBSD. Instead of simply testing for a FreeBSD partition, the code needs to also test if the starting offset of the C subpartition is zero. https://bugzilla.kernel.org/show_bug.cgi?id=197733 Signed-off-by: Richard Narron <comet.berkeley@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24block/mq: Cure cpu hotplug lock inversionPeter Zijlstra
[ Upstream commit eabe06595d62cfa9278e2cd012df614bc68a7042 ] By poking at /debug/sched_features I triggered the following splat: [] ====================================================== [] WARNING: possible circular locking dependency detected [] 4.11.0-00873-g964c8b7-dirty #694 Not tainted [] ------------------------------------------------------ [] bash/2109 is trying to acquire lock: [] (cpu_hotplug_lock.rw_sem){++++++}, at: [<ffffffff8120cb8b>] static_key_slow_dec+0x1b/0x50 [] [] but task is already holding lock: [] (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170 [] [] which lock already depends on the new lock. [] [] [] the existing dependency chain (in reverse order) is: [] [] -> #2 (&sb->s_type->i_mutex_key#4){+++++.}: [] lock_acquire+0x100/0x210 [] down_write+0x28/0x60 [] start_creating+0x5e/0xf0 [] debugfs_create_dir+0x13/0x110 [] blk_mq_debugfs_register+0x21/0x70 [] blk_mq_register_dev+0x64/0xd0 [] blk_register_queue+0x6a/0x170 [] device_add_disk+0x22d/0x440 [] loop_add+0x1f3/0x280 [] loop_init+0x104/0x142 [] do_one_initcall+0x43/0x180 [] kernel_init_freeable+0x1de/0x266 [] kernel_init+0xe/0x100 [] ret_from_fork+0x31/0x40 [] [] -> #1 (all_q_mutex){+.+.+.}: [] lock_acquire+0x100/0x210 [] __mutex_lock+0x6c/0x960 [] mutex_lock_nested+0x1b/0x20 [] blk_mq_init_allocated_queue+0x37c/0x4e0 [] blk_mq_init_queue+0x3a/0x60 [] loop_add+0xe5/0x280 [] loop_init+0x104/0x142 [] do_one_initcall+0x43/0x180 [] kernel_init_freeable+0x1de/0x266 [] kernel_init+0xe/0x100 [] ret_from_fork+0x31/0x40 [] *** DEADLOCK *** [] [] 3 locks held by bash/2109: [] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff81292bcd>] vfs_write+0x17d/0x1a0 [] #1: (debugfs_srcu){......}, at: [<ffffffff8155a90d>] full_proxy_write+0x5d/0xd0 [] #2: (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170 [] [] stack backtrace: [] CPU: 9 PID: 2109 Comm: bash Not tainted 4.11.0-00873-g964c8b7-dirty #694 [] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013 [] Call Trace: [] lock_acquire+0x100/0x210 [] get_online_cpus+0x2a/0x90 [] static_key_slow_dec+0x1b/0x50 [] static_key_disable+0x20/0x30 [] sched_feat_write+0x131/0x170 [] full_proxy_write+0x97/0xd0 [] __vfs_write+0x28/0x120 [] vfs_write+0xb5/0x1a0 [] SyS_write+0x49/0xa0 [] entry_SYSCALL_64_fastpath+0x23/0xc2 This is because of the cpu hotplug lock rework. Break the chain at #1 by reversing the lock acquisition order. This way i_mutex_key#4 no longer depends on cpu_hotplug_lock and things are good. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22blk-throttle: make sure expire time isn't too bigShaohua Li
[ Upstream commit 06cceedcca67a93ac7f7aa93bbd9980c7496d14e ] cgroup could be throttled to a limit but when all cgroups cross high limit, queue enters a higher state and so the group should be throttled to a higher limit. It's possible the cgroup is sleeping because of throttle and other cgroups don't dispatch IO any more. In this case, nobody can trigger current downgrade/upgrade logic. To fix this issue, we could either set up a timer to wakeup the cgroup if other cgroups are idle or make sure this cgroup doesn't sleep too long. Setting up a timer means we must change the timer very frequently. This patch chooses the latter. Making cgroup sleep time not too big wouldn't change cgroup bps/iops, but could make it wakeup more frequently, which isn't a big issue because throtl_slice * 8 is already quite big. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22blkcg: fix double free of new_blkg in blkcg_init_queueHou Tao
commit 9b54d816e00425c3a517514e0d677bb3cec49258 upstream. If blkg_create fails, new_blkg passed as an argument will be freed by blkg_create, so there is no need to free it again. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Jens Axboe <axboe@fb.com> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25blk_rq_map_user_iov: fix error overrideDouglas Gilbert
commit 69e0927b3774563c19b5fb32e91d75edc147fb62 upstream. During stress tests by syzkaller on the sg driver the block layer infrequently returns EINVAL. Closer inspection shows the block layer was trying to return ENOMEM (which is much more understandable) but for some reason overroad that useful error. Patch below does not show this (unchanged) line: ret =__blk_rq_map_user_iov(rq, map_data, &i, gfp_mask, copy); That 'ret' was being overridden when that function failed. Signed-off-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20badblocks: fix wrong return value in badblocks_set if badblocks are disabledLiu Bo
[ Upstream commit 39b4954c0a1556f8f7f1fdcf59a227117fcd8a0b ] MD's rdev_set_badblocks() expects that badblocks_set() returns 1 if badblocks are disabled, otherwise, rdev_set_badblocks() will record superblock changes and return success in that case and md will fail to report an IO error which it should. This bug has existed since badblocks were introduced in commit 9e0e252a048b ("badblocks: Add core badblock management code"). Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20blk-mq: Fix tagset reinit in the presence of cpu hot-unplugSagi Grimberg
[ Upstream commit 0067d4b020ea07a58540acb2c5fcd3364bf326e0 ] In case cpu was unplugged, we need to make sure not to assume that the tags for that cpu are still allocated. so check for null tags when reinitializing a tagset. Reported-by: Yi Zhang <yizhan@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14block: wake up all tasks blocked in get_request()Ming Lei
[ Upstream commit 34d9715ac1edd50285168dd8d80c972739a4f6a4 ] Once blk_set_queue_dying() is done in blk_cleanup_queue(), we call blk_freeze_queue() and wait for q->q_usage_counter becoming zero. But if there are tasks blocked in get_request(), q->q_usage_counter can never become zero. So we have to wake up all these tasks in blk_set_queue_dying() first. Fixes: 3ef28e83ab157997 ("block: generic request_queue reference counting") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14blk-mq: initialize mq kobjects in blk_mq_init_allocated_queue()Ming Lei
[ Upstream commit 737f98cfe7de8df7433a4d846850aa8efa44bd48 ] Both q->mq_kobj and sw queues' kobjects should have been initialized once, instead of doing that each add_disk context. Also this patch removes clearing of ctx in blk_mq_init_cpu_queues() because percpu allocator fills zero to allocated variable. This patch fixes one issue[1] reported from Omar. [1] kernel wearning when doing unbind/bind on one scsi-mq device [ 19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong. [ 19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34 [ 19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014 [ 19.350920] Workqueue: events_unbound async_run_entry_fn [ 19.350920] Call Trace: [ 19.350920] dump_stack+0x63/0x83 [ 19.350920] kobject_init+0x77/0x90 [ 19.350920] blk_mq_register_dev+0x40/0x130 [ 19.350920] blk_register_queue+0xb6/0x190 [ 19.350920] device_add_disk+0x1ec/0x4b0 [ 19.350920] sd_probe_async+0x10d/0x1c0 [sd_mod] [ 19.350920] async_run_entry_fn+0x48/0x150 [ 19.350920] process_one_work+0x1d0/0x480 [ 19.350920] worker_thread+0x48/0x4e0 [ 19.350920] kthread+0x101/0x140 [ 19.350920] ? process_one_work+0x480/0x480 [ 19.350920] ? kthread_create_on_node+0x60/0x60 [ 19.350920] ret_from_fork+0x2c/0x40 Cc: Omar Sandoval <osandov@osandov.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30block: Fix a race between blk_cleanup_queue() and timeout handlingBart Van Assche
commit 4e9b6f20828ac880dbc1fa2fdbafae779473d1af upstream. Make sure that if the timeout timer fires after a queue has been marked "dying" that the affected requests are finished. Reported-by: chenxiang (M) <chenxiang66@hisilicon.com> Fixes: commit 287922eb0b18 ("block: defer timeouts to a workqueue") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Tested-by: chenxiang (M) <chenxiang66@hisilicon.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Keith Busch <keith.busch@intel.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-21Revert "bsg-lib: don't free job in bsg_prepare_job"Greg Kroah-Hartman
This reverts commit eb4375e1969c48d454998b2a284c2e6a5dc9eb68 which was commit f507b54dccfd8000c517d740bc45f20c74532d18 upstream. Ben reports: That function doesn't exist here (it was introduced in 4.13). Instead, this backport has modified bsg_create_job(), creating a leak. Please revert this on the 3.18, 4.4 and 4.9 stable branches. So I'm dropping it from here. Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
2017-10-18bio_copy_user_iov(): don't ignore ->iov_offsetAl Viro
commit 1cfd0ddd82232804e03f3023f6a58b50dfef0574 upstream. Since "block: support large requests in blk_rq_map_user_iov" we started to call it with partially drained iter; that works fine on the write side, but reads create a copy of iter for completion time. And that needs to take the possibility of ->iov_iter != 0 into account... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-18more bio_map_user_iov() leak fixesAl Viro
commit 2b04e8f6bbb196cab4b232af0f8d48ff2c7a8058 upstream. we need to take care of failure exit as well - pages already in bio should be dropped by analogue of bio_unmap_pages(), since their refcounts had been bumped only once per reference in bio. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-18fix unbalanced page refcounting in bio_map_user_iovVitaly Mayatskikh
commit 95d78c28b5a85bacbc29b8dba7c04babb9b0d467 upstream. bio_map_user_iov and bio_unmap_user do unbalanced pages refcounting if IO vector has small consecutive buffers belonging to the same page. bio_add_pc_page merges them into one, but the page reference is never dropped. Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-08partitions/efi: Fix integer overflow in GPT size calculationAlden Tondettar
[ Upstream commit c5082b70adfe8e1ea1cf4a8eff92c9f260e364d2 ] If a GUID Partition Table claims to have more than 2**25 entries, the calculation of the partition table size in alloc_read_gpt_entries() will overflow a 32-bit integer and not enough space will be allocated for the table. Nothing seems to get written out of bounds, but later efi_partition() will read up to 32768 bytes from a 128 byte buffer, possibly OOPSing or exposing information to /proc/partitions and uevents. The problem exists on both 64-bit and 32-bit platforms. Fix the overflow and also print a meaningful debug message if the table size is too large. Signed-off-by: Alden Tondettar <alden.tondettar@gmail.com> Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05bsg-lib: don't free job in bsg_prepare_jobChristoph Hellwig
commit f507b54dccfd8000c517d740bc45f20c74532d18 upstream. The job structure is allocated as part of the request, so we should not free it in the error path of bsg_prepare_job. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-27block: Relax a check in blk_start_queue()Bart Van Assche
commit 4ddd56b003f251091a67c15ae3fe4a5c5c5e390a upstream. Calling blk_start_queue() from interrupt context with the queue lock held and without disabling IRQs, as the skd driver does, is safe. This patch avoids that loading the skd driver triggers the following warning: WARNING: CPU: 11 PID: 1348 at block/blk-core.c:283 blk_start_queue+0x84/0xa0 RIP: 0010:blk_start_queue+0x84/0xa0 Call Trace: skd_unquiesce_dev+0x12a/0x1d0 [skd] skd_complete_internal+0x1e7/0x5a0 [skd] skd_complete_other+0xc2/0xd0 [skd] skd_isr_completion_posted.isra.30+0x2a5/0x470 [skd] skd_isr+0x14f/0x180 [skd] irq_forced_thread_fn+0x2a/0x70 irq_thread+0x144/0x1a0 kthread+0x125/0x140 ret_from_fork+0x2a/0x40 Fixes: commit a038e2536472 ("[PATCH] blk_start_queue() must be called with irq disabled - add warning") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Andrew Morton <akpm@osdl.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-24blk-mq-pci: add a fallback when pci_irq_get_affinity returns NULLChristoph Hellwig
commit c005390374957baacbc38eef96ea360559510aa7 upstream. While pci_irq_get_affinity should never fail for SMP kernel that implement the affinity mapping, it will always return NULL in the UP case, so provide a fallback mapping of all queues to CPU 0 in that case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17partitions/msdos: FreeBSD UFS2 file systems are not recognizedRichard
commit 223220356d5ebc05ead9a8d697abb0c0a906fc81 upstream. The code in block/partitions/msdos.c recognizes FreeBSD, OpenBSD and NetBSD partitions and does a reasonable job picking out OpenBSD and NetBSD UFS subpartitions. But for FreeBSD the subpartitions are always "bad". Kernel: <bsd:bad subpartition - ignored Though all 3 of these BSD systems use UFS as a file system, only FreeBSD uses relative start addresses in the subpartition declarations. The following patch fixes this for FreeBSD partitions and leaves the code for OpenBSD and NetBSD intact: Signed-off-by: Richard Narron <comet.berkeley@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14cfq-iosched: fix the delay of cfq_group's vdisktime under iops modeHou Tao
commit 5be6b75610cefd1e21b98a218211922c2feb6e08 upstream. When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY as the delay of cfq_group's vdisktime if there have been other cfq_groups already. When cfq is under iops mode, commit 9a7f38c42c2b ("cfq-iosched: Convert from jiffies to nanoseconds") could result in a large iops delay and lead to an abnormal io schedule delay for the added cfq_group. To fix it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5 when iops mode is enabled. Despite having the same value, the delay of a cfq_queue in idle class and the delay of cfq_group are different things, so I define two new macros for the delay of a cfq_group under time-slice mode and iops mode. Fixes: 9a7f38c42c2b ("cfq-iosched: Convert from jiffies to nanoseconds") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20block: fix blk_integrity_register to use template's interval_exp if not 0Mike Snitzer
commit 2859323e35ab5fc42f351fbda23ab544eaa85945 upstream. When registering an integrity profile: if the template's interval_exp is not 0 use it, otherwise use the ilog2() of logical block size of the provided gendisk. This fixes a long-standing DM linear target bug where it cannot pass integrity data to the underlying device if its logical block size conflicts with the underlying device's logical block size. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-14block: get rid of blk_integrity_revalidate()Ilya Dryomov
commit 19b7ccf8651df09d274671b53039c672a52ad84d upstream. Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk") introduced blk_integrity_revalidate(), which seems to assume ownership of the stable pages flag and unilaterally clears it if no blk_integrity profile is registered: if (bi->profile) disk->queue->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES; else disk->queue->backing_dev_info->capabilities &= ~BDI_CAP_STABLE_WRITES; It's called from revalidate_disk() and rescan_partitions(), making it impossible to enable stable pages for drivers that support partitions and don't use blk_integrity: while the call in revalidate_disk() can be trivially worked around (see zram, which doesn't support partitions and hence gets away with zram_revalidate_disk()), rescan_partitions() can be triggered from userspace at any time. This breaks rbd, where the ceph messenger is responsible for generating/verifying CRCs. Since blk_integrity_{un,}register() "must" be used for (un)registering the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES setting there. This way drivers that call blk_integrity_register() and use integrity infrastructure won't interfere with drivers that don't but still want stable pages. Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk") Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@redhat.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> [idryomov@gmail.com: backport to < 4.11: bdi is embedded in queue] Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-18blk-mq: Avoid memory reclaim when remapping queuesGabriel Krisman Bertazi
commit 36e1f3d107867b25c616c2fd294f5a1c9d4e5d09 upstream. While stressing memory and IO at the same time we changed SMT settings, we were able to consistently trigger deadlocks in the mm system, which froze the entire machine. I think that under memory stress conditions, the large allocations performed by blk_mq_init_rq_map may trigger a reclaim, which stalls waiting on the block layer remmaping completion, thus deadlocking the system. The trace below was collected after the machine stalled, waiting for the hotplug event completion. The simplest fix for this is to make allocations in this path non-reclaimable, with GFP_NOIO. With this patch, We couldn't hit the issue anymore. This should apply on top of Jens's for-next branch cleanly. Changes since v1: - Use GFP_NOIO instead of GFP_NOWAIT. Call Trace: [c000000f0160aaf0] [c000000f0160ab50] 0xc000000f0160ab50 (unreliable) [c000000f0160acc0] [c000000000016624] __switch_to+0x2e4/0x430 [c000000f0160ad20] [c000000000b1a880] __schedule+0x310/0x9b0 [c000000f0160ae00] [c000000000b1af68] schedule+0x48/0xc0 [c000000f0160ae30] [c000000000b1b4b0] schedule_preempt_disabled+0x20/0x30 [c000000f0160ae50] [c000000000b1d4fc] __mutex_lock_slowpath+0xec/0x1f0 [c000000f0160aed0] [c000000000b1d678] mutex_lock+0x78/0xa0 [c000000f0160af00] [d000000019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs] [c000000f0160b0b0] [d000000019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs] [c000000f0160b0f0] [d0000000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs] [c000000f0160b120] [c0000000003172c8] super_cache_scan+0x1f8/0x210 [c000000f0160b190] [c00000000026301c] shrink_slab.part.13+0x21c/0x4c0 [c000000f0160b2d0] [c000000000268088] shrink_zone+0x2d8/0x3c0 [c000000f0160b380] [c00000000026834c] do_try_to_free_pages+0x1dc/0x520 [c000000f0160b450] [c00000000026876c] try_to_free_pages+0xdc/0x250 [c000000f0160b4e0] [c000000000251978] __alloc_pages_nodemask+0x868/0x10d0 [c000000f0160b6f0] [c000000000567030] blk_mq_init_rq_map+0x160/0x380 [c000000f0160b7a0] [c00000000056758c] blk_mq_map_swqueue+0x33c/0x360 [c000000f0160b820] [c000000000567904] blk_mq_queue_reinit+0x64/0xb0 [c000000f0160b850] [c00000000056a16c] blk_mq_queue_reinit_notify+0x19c/0x250 [c000000f0160b8a0] [c0000000000f5d38] notifier_call_chain+0x98/0x100 [c000000f0160b8f0] [c0000000000c5fb0] __cpu_notify+0x70/0xe0 [c000000f0160b930] [c0000000000c63c4] notify_prepare+0x44/0xb0 [c000000f0160b9b0] [c0000000000c52f4] cpuhp_invoke_callback+0x84/0x250 [c000000f0160ba10] [c0000000000c570c] cpuhp_up_callbacks+0x5c/0x120 [c000000f0160ba60] [c0000000000c7cb8] _cpu_up+0xf8/0x1d0 [c000000f0160bac0] [c0000000000c7eb0] do_cpu_up+0x120/0x150 [c000000f0160bb40] [c0000000006fe024] cpu_subsys_online+0x64/0xe0 [c000000f0160bb90] [c0000000006f5124] device_online+0xb4/0x120 [c000000f0160bbd0] [c0000000006f5244] online_store+0xb4/0xc0 [c000000f0160bc20] [c0000000006f0a68] dev_attr_store+0x68/0xa0 [c000000f0160bc60] [c0000000003ccc30] sysfs_kf_write+0x80/0xb0 [c000000f0160bca0] [c0000000003cbabc] kernfs_fop_write+0x17c/0x250 [c000000f0160bcf0] [c00000000030fe6c] __vfs_write+0x6c/0x1e0 [c000000f0160bd90] [c000000000311490] vfs_write+0xd0/0x270 [c000000f0160bde0] [c0000000003131fc] SyS_write+0x6c/0x110 [c000000f0160be30] [c000000000009204] system_call+0x38/0xec Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> Cc: Brian King <brking@linux.vnet.ibm.com> Cc: Douglas Miller <dougmill@linux.vnet.ibm.com> Cc: linux-block@vger.kernel.org Cc: linux-scsi@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-08blk: Ensure users for current->bio_list can see the full list.NeilBrown
commit f5fe1b51905df7cfe4fdfd85c5fb7bc5b71a094f upstream. Commit 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()") changed current->bio_list so that it did not contain *all* of the queued bios, but only those submitted by the currently running make_request_fn. There are two places which walk the list and requeue selected bios, and others that check if the list is empty. These are no longer correct. So redefine current->bio_list to point to an array of two lists, which contain all queued bios, and adjust various code to test or walk both lists. Signed-off-by: NeilBrown <neilb@suse.com> Fixes: 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()") Signed-off-by: Jens Axboe <axboe@fb.com> Cc: Jack Wang <jinpu.wang@profitbricks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-08blk: improve order of bio handling in generic_make_request()NeilBrown
commit 79bd99596b7305ab08109a8bf44a6a4511dbf1cd upstream. To avoid recursion on the kernel stack when stacked block devices are in use, generic_make_request() will, when called recursively, queue new requests for later handling. They will be handled when the make_request_fn for the current bio completes. If any bios are submitted by a make_request_fn, these will ultimately be handled seqeuntially. If the handling of one of those generates further requests, they will be added to the end of the queue. This strict first-in-first-out behaviour can lead to deadlocks in various ways, normally because a request might need to wait for a previous request to the same device to complete. This can happen when they share a mempool, and can happen due to interdependencies particular to the device. Both md and dm have examples where this happens. These deadlocks can be erradicated by more selective ordering of bios. Specifically by handling them in depth-first order. That is: when the handling of one bio generates one or more further bios, they are handled immediately after the parent, before any siblings of the parent. That way, when generic_make_request() calls make_request_fn for some particular device, we can be certain that all previously submited requests for that device have been completely handled and are not waiting for anything in the queue of requests maintained in generic_make_request(). An easy way to achieve this would be to use a last-in-first-out stack instead of a queue. However this will change the order of consecutive bios submitted by a make_request_fn, which could have unexpected consequences. Instead we take a slightly more complex approach. A fresh queue is created for each call to a make_request_fn. After it completes, any bios for a different device are placed on the front of the main queue, followed by any bios for the same device, followed by all bios that were already on the queue before the make_request_fn was called. This provides the depth-first approach without reordering bios on the same level. This, by itself, it not enough to remove all deadlocks. It just makes it possible for drivers to take the extra step required themselves. To avoid deadlocks, drivers must never risk waiting for a request after submitting one to generic_make_request. This includes never allocing from a mempool twice in the one call to a make_request_fn. A common pattern in drivers is to call bio_split() in a loop, handling the first part and then looping around to possibly split the next part. Instead, a driver that finds it needs to split a bio should queue (with generic_make_request) the second part, handle the first part, and then return. The new code in generic_make_request will ensure the requests to underlying bios are processed first, then the second bio that was split off. If it splits again, the same process happens. In each case one bio will be completely handled before the next one is attempted. With this is place, it should be possible to disable the punt_bios_to_recover() recovery thread for many block devices, and eventually it may be possible to remove it completely. Ref: http://www.spinics.net/lists/raid/msg54680.html Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com> Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com> Cc: Jack Wang <jinpu.wang@profitbricks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-30blk-mq: don't complete un-started request in timeout handlerMing Lei
commit 95a49603707d982b25d17c5b70e220a05556a2f9 upstream. When iterating busy requests in timeout handler, if the STARTED flag of one request isn't set, that means the request is being processed in block layer or driver, and isn't submitted to hardware yet. In current implementation of blk_mq_check_expired(), if the request queue becomes dying, un-started requests are handled as being completed/freed immediately. This way is wrong, and can cause rq corruption or double allocation[1][2], when doing I/O and removing&resetting NVMe device at the sametime. This patch fixes several issues reported by Yi Zhang. [1]. oops log 1 [ 581.789754] ------------[ cut here ]------------ [ 581.789758] kernel BUG at block/blk-mq.c:374! [ 581.789760] invalid opcode: 0000 [#1] SMP [ 581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror dm_region_hash dm_log dm_mod [ 581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4 [ 581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016 [ 581.789804] Workqueue: kblockd blk_mq_timeout_work [ 581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000 [ 581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70 [ 581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202 [ 581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88 [ 581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000 [ 581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00 [ 581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb [ 581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80 [ 581.789817] FS: 0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000 [ 581.789818] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0 [ 581.789820] Call Trace: [ 581.789825] blk_mq_check_expired+0x76/0x80 [ 581.789828] bt_iter+0x45/0x50 [ 581.789830] blk_mq_queue_tag_busy_iter+0xdd/0x1f0 [ 581.789832] ? blk_mq_rq_timed_out+0x70/0x70 [ 581.789833] ? blk_mq_rq_timed_out+0x70/0x70 [ 581.789840] ? __switch_to+0x140/0x450 [ 581.789841] blk_mq_timeout_work+0x88/0x170 [ 581.789845] process_one_work+0x165/0x410 [ 581.789847] worker_thread+0x137/0x4c0 [ 581.789851] kthread+0x101/0x140 [ 581.789853] ? rescuer_thread+0x3b0/0x3b0 [ 581.789855] ? kthread_park+0x90/0x90 [ 581.789860] ret_from_fork+0x2c/0x40 [ 581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48 8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f> 0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00 [ 581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50 [ 581.789889] ---[ end trace bcaf03d9a14a0a70 ]--- [2]. oops log2 [ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 [ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme] [ 6984.857373] PGD 0 [ 6984.857374] [ 6984.857376] Oops: 0000 [#1] SMP [ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror dm_region_hash dm_log dm_mod [ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted 4.10.0-2.el7.bz1420297.x86_64 #1 [ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016 [ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn [ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000 [ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme] [ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246 [ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000 [ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950 [ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000 [ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000 [ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780 [ 6984.857439] FS: 0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000 [ 6984.857440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0 [ 6984.857443] Call Trace: [ 6984.857451] ? mempool_free+0x2b/0x80 [ 6984.857455] ? bio_free+0x4e/0x60 [ 6984.857459] blk_mq_dispatch_rq_list+0xf5/0x230 [ 6984.857462] blk_mq_process_rq_list+0x133/0x170 [ 6984.857465] __blk_mq_run_hw_queue+0x8c/0xa0 [ 6984.857467] blk_mq_run_work_fn+0x12/0x20 [ 6984.857473] process_one_work+0x165/0x410 [ 6984.857475] worker_thread+0x137/0x4c0 [ 6984.857478] kthread+0x101/0x140 [ 6984.857480] ? rescuer_thread+0x3b0/0x3b0 [ 6984.857481] ? kthread_park+0x90/0x90 [ 6984.857489] ret_from_fork+0x2c/0x40 [ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44 89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c> 8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff [ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50 [ 6984.857512] CR2: 0000000000000010 [ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]--- Reported-by: Yi Zhang <yizhan@redhat.com> Tested-by: Yi Zhang <yizhan@redhat.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-22block: allow WRITE_SAME commands with the SG_IO ioctlMauricio Faria de Oliveira
[ Upstream commit 25cdb64510644f3e854d502d69c73f21c6df88a9 ] The WRITE_SAME commands are not present in the blk_default_cmd_filter write_ok list, and thus are failed with -EPERM when the SG_IO ioctl() is executed without CAP_SYS_RAWIO capability (e.g., unprivileged users). [ sg_io() -> blk_fill_sghdr_rq() > blk_verify_command() -> -EPERM ] The problem can be reproduced with the sg_write_same command # sg_write_same --num 1 --xferlen 512 /dev/sda # # capsh --drop=cap_sys_rawio -- -c \ 'sg_write_same --num 1 --xferlen 512 /dev/sda' Write same: pass through os error: Operation not permitted # For comparison, the WRITE_VERIFY command does not observe this problem, since it is in that list: # capsh --drop=cap_sys_rawio -- -c \ 'sg_write_verify --num 1 --ilen 512 --lba 0 /dev/sda' # So, this patch adds the WRITE_SAME commands to the list, in order for the SG_IO ioctl to finish successfully: # capsh --drop=cap_sys_rawio -- -c \ 'sg_write_same --num 1 --xferlen 512 /dev/sda' # That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices (qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2]), which employs the SG_IO ioctl() and runs as an unprivileged user (libvirt-qemu). In that scenario, when a filesystem (e.g., ext4) performs its zero-out calls, which are translated to write-same calls in the guest kernel, and then into SG_IO ioctls to the host kernel, SCSI I/O errors may be observed in the guest: [...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [...] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current] [...] sd 0:0:0:0: [sda] tag#0 Add. Sense: I/O process terminated [...] sd 0:0:0:0: [sda] tag#0 CDB: Write Same(10) 41 00 01 04 e0 78 00 00 08 00 [...] blk_update_request: I/O error, dev sda, sector 17096824 Links: [1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52 [2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device') Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com> Signed-off-by: Brahadambal Srinivasan <latha@linux.vnet.ibm.com> Reported-by: Manjunatha H R <manjuhr1@in.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19blk-mq: Always schedule hctx->next_cpuGabriel Krisman Bertazi
commit c02ebfdddbafa9a6a0f52fbd715e6bfa229af9d3 upstream. Commit 0e87e58bf60e ("blk-mq: improve warning for running a queue on the wrong CPU") attempts to avoid triggering the WARN_ON in __blk_mq_run_hw_queue when the expected CPU is dead. Problem is, in the last batch execution before round robin, blk_mq_hctx_next_cpu can schedule a dead CPU and also update next_cpu to the next alive CPU in the mask, which will trigger the WARN_ON despite the previous workaround. The following patch fixes this scenario by always scheduling the value in hctx->next_cpu. This changes the moment when we round-robin the CPU running the hctx, but it really doesn't matter, since it still executes BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU. Fixes: 0e87e58bf60e ("blk-mq: improve warning for running a queue on the wrong CPU") Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19block: cfq_cpd_alloc() should use @gfpTejun Heo
commit ebc4ff661fbe76781c6b16dfb7b754a5d5073f8e upstream. cfq_cpd_alloc() which is the cpd_alloc_fn implementation for cfq was incorrectly hard coding GFP_KERNEL instead of using the mask specified through the @gfp parameter. This currently doesn't cause any actual issues because all current callers specify GFP_KERNEL. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: e4a9bde9589f ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods") Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-09sg_write()/bsg_write() is not fit to be called under KERNEL_DSAl Viro
commit 128394eff343fc6d2f32172f03e24829539c5835 upstream. Both damn things interpret userland pointers embedded into the payload; worse, they are actually traversing those. Leaving aside the bad API design, this is very much _not_ safe to call with KERNEL_DS. Bail out early if that happens. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06blk-mq: Do not invoke .queue_rq() for a stopped queueBart Van Assche
commit bc27c01b5c46d3bfec42c96537c7a3fae0bb2cc4 upstream. The meaning of the BLK_MQ_S_STOPPED flag is "do not call .queue_rq()". Hence modify blk_mq_make_request() such that requests are queued instead of issued if a queue has been stopped. Reported-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-07Don't feed anything but regular iovec's to blk_rq_map_user_iovLinus Torvalds
In theory we could map other things, but there's a reason that function is called "user_iov". Using anything else (like splice can do) just confuses it. Reported-and-tested-by: Johannes Thumshirn <jthumshirn@suse.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-27blk-mq: update hardware and software queues for sleeping allocJens Axboe
If we end up sleeping due to running out of requests, we should update the hardware and software queues in the map ctx structure. Otherwise we could end up having rq->mq_ctx point to the pre-sleep context, and risk corrupting ctx->rq_list since we'll be grabbing the wrong lock when inserting the request. Reported-by: Dave Jones <davej@codemonkey.org.uk> Reported-by: Chris Mason <clm@fb.com> Tested-by: Chris Mason <clm@fb.com> Fixes: 63581af3f31e ("blk-mq: remove non-blocking pass in blk_mq_map_request") Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-26block: flush: fix IO hang in case of flood fua reqMing Lei
This patch fixes one issue reported by Kent, which can be triggered in bcachefs over sata disk. Actually it is a generic issue in block flush vs. blk-tag. Cc: Christoph Hellwig <hch@infradead.org> Reported-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-21badblocks: badblocks_set/clear update unacked_existShaohua Li
When bandblocks_set acknowledges a range or badblocks_clear a range, it's possible all badblocks are acknowledged. We should update unacked_exist if this occurs. Signed-off-by: Shaohua Li <shli@fb.com> Reviewed-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Tested-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-21Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "A set of fixes that missed the merge window, mostly due to me being away around that time. Nothing major here, a mix of nvme cleanups and fixes, and one fix for the badblocks handling" * 'for-linus' of git://git.kernel.dk/linux-block: nvmet: use symbolic constants for CNS values nvme: use symbolic constants for CNS values nvme.h: add an enum for cns values nvme.h: don't use uuid_be nvme.h: resync with nvme-cli nvme: Add tertiary number to NVME_VS nvme : Add sysfs entry for NVMe CMBs when appropriate nvme: don't schedule multiple resets nvme: Delete created IO queues on reset nvme: Stop probing a removed device badblocks: fix overlapping check for clearing
2016-10-15Merge tag 'gcc-plugins-v4.9-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull gcc plugins update from Kees Cook: "This adds a new gcc plugin named "latent_entropy". It is designed to extract as much possible uncertainty from a running system at boot time as possible, hoping to capitalize on any possible variation in CPU operation (due to runtime data differences, hardware differences, SMP ordering, thermal timing variation, cache behavior, etc). At the very least, this plugin is a much more comprehensive example for how to manipulate kernel code using the gcc plugin internals" * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: latent_entropy: Mark functions with __latent_entropy gcc-plugins: Add latent_entropy plugin
2016-10-14Merge branch 'for-4.9' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - tracepoints for basic cgroup management operations added - kernfs and cgroup path formatting functions updated to behave in the style of strlcpy() - non-critical bug fixes * 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent() cpuset: fix error handling regression in proc_cpuset_show() cgroup: add tracepoints for basic operations cgroup: make cgroup_path() and friends behave in the style of strlcpy() kernfs: remove kernfs_path_len() kernfs: make kernfs_path*() behave in the style of strlcpy() kernfs: add dummy implementation of kernfs_path_from_node()
2016-10-12badblocks: fix overlapping check for clearingTomasz Majchrzak
Current bad block clear implementation assumes the range to clear overlaps with at least one bad block already stored. If given range to clear precedes first bad block in a list, the first entry is incorrectly updated. Check not only if stored block end is past clear block end but also if stored block start is before clear block end. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Acked-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-11block: require write_same and discard requests align to logical block sizeDarrick J. Wong
Make sure that the offset and length arguments that we're using to construct WRITE SAME and DISCARD requests are actually aligned to the logical block size. Failure to do this causes other errors in other parts of the block layer or the SCSI layer because disks don't support partial logical block writes. Link: http://lkml.kernel.org/r/147518379026.22791.4437508871355153928.stgit@birch.djwong.org Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Mike Snitzer <snitzer@redhat.com> # tweaked header Cc: Brian Foster <bfoster@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11block: invalidate the page cache when issuing BLKZEROOUTDarrick J. Wong
Patch series "fallocate for block devices", v11. This is a patchset to fix page cache coherency with BLKZEROOUT and implement fallocate for block devices. The first patch is a fix to the existing BLKZEROOUT ioctl to invalidate the page cache if the zeroing command to the underlying device succeeds. Without this patch we still have the pagecache coherence bug that's been in the kernel forever. The second patch changes the internal block device functions to reject attempts to discard or zeroout that are not aligned to the logical block size. Previously, we only checked that the start/len parameters were 512-byte aligned, which caused kernel BUG_ONs for unaligned IOs to 4k-LBA devices. The third patch creates an fallocate handler for block devices, wires up the FALLOC_FL_PUNCH_HOLE flag to zeroing-discard, and connects FALLOC_FL_ZERO_RANGE to write-same so that we can have a consistent fallocate interface between files and block devices. It also allows the combination of PUNCH_HOLE and NO_HIDE_STALE to invoke non-zeroing discard. Test cases for the new block device fallocate are now in xfstests as generic/349-351. This patch (of 3): Invalidate the page cache (as a regular O_DIRECT write would do) to avoid returning stale cache contents at a later time. Link: http://lkml.kernel.org/r/147518378313.22791.16649519283678515021.stgit@birch.djwong.org Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-10latent_entropy: Mark functions with __latent_entropyEmese Revfy
The __latent_entropy gcc attribute can be used only on functions and variables. If it is on a function then the plugin will instrument it for gathering control-flow entropy. If the attribute is on a variable then the plugin will initialize it with random contents. The variable must be an integer, an integer array type or a structure with integer fields. These specific functions have been selected because they are init functions (to help gather boot-time entropy), are called at unpredictable times, or they have variable loops, each of which provide some level of latent entropy. Signed-off-by: Emese Revfy <re.emese@gmail.com> [kees: expanded commit message] Signed-off-by: Kees Cook <keescook@chromium.org>
2016-10-09Merge branch 'for-4.9/block-smp' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull blk-mq CPU hotplug update from Jens Axboe: "This is the conversion of blk-mq to the new hotplug state machine" * 'for-4.9/block-smp' of git://git.kernel.dk/linux-block: blk-mq: fixup "Convert to new hotplug state machine" blk-mq: Convert to new hotplug state machine blk-mq/cpu-notif: Convert to new hotplug state machine
2016-10-09Merge branch 'for-4.9/block-irq' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull blk-mq irq/cpu mapping updates from Jens Axboe: "This is the block-irq topic branch for 4.9-rc. It's mostly from Christoph, and it allows drivers to specify their own mappings, and more importantly, to share the blk-mq mappings with the IRQ affinity mappings. It's a good step towards making this work better out of the box" * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block: blk_mq: linux/blk-mq.h does not include all the headers it depends on blk-mq: kill unused blk_mq_create_mq_map() blk-mq: get rid of the cpumask in struct blk_mq_tags nvme: remove the post_scan callout nvme: switch to use pci_alloc_irq_vectors blk-mq: provide a default queue mapping for PCI device blk-mq: allow the driver to pass in a queue mapping blk-mq: remove ->map_queue blk-mq: only allocate a single mq_map per tag_set blk-mq: don't redistribute hardware queues on a CPU hotplug event