summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2017-09-10md/raid5: add thread_group worker async_tx_issue_pending_allOfer Heifetz
[ Upstream commit 7e96d559634b73a8158ee99a7abece2eacec2668 ] Since thread_group worker and raid5d kthread are not in sync, if worker writes stripe before raid5d then requests will be waiting for issue_pendig. Issue observed when building raid5 with ext4, in some build runs jbd2 would get hung and requests were waiting in the HW engine waiting to be issued. Fix this by adding a call to async_tx_issue_pending_all in the raid5_do_work. Signed-off-by: Ofer Heifetz <oferh@marvell.com> Cc: stable@vger.kernel.org Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-09-10Raid5 should update rdev->sectors after reshapeXiao Ni
[ Upstream commit b5d27718f38843a74552e9a93d32e2391fd3999f ] The raid5 md device is created by the disks which we don't use the total size. For example, the size of the device is 5G and it just uses 3G of the devices to create one raid5 device. Then change the chunksize and wait reshape to finish. After reshape finishing stop the raid and assemble it again. It fails. mdadm -CR /dev/md0 -l5 -n3 /dev/loop[0-2] --size=3G --chunk=32 --assume-clean mdadm /dev/md0 --grow --chunk=64 wait reshape to finish mdadm -S /dev/md0 mdadm -As The error messages: [197519.814302] md: loop1 does not have a valid v1.2 superblock, not importing! [197519.821686] md: md_import_device returned -22 After reshape the data offset is changed. It selects backwards direction in this condition. In function super_1_load it compares the available space of the underlying device with sb->data_size. The new data offset gets bigger after reshape. So super_1_load returns -EINVAL. rdev->sectors is updated in md_finish_reshape. Then sb->data_size is set in super_1_sync based on rdev->sectors. So add md_finish_reshape in end_reshape. Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Guoqing Jiang <gqjiang@suse.com> Cc: stable@vger.kernel.org Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-09-10md: don't use flush_signals in userspace processesMikulas Patocka
[ Upstream commit f9c79bc05a2a91f4fba8bfd653579e066714b1ec ] The function flush_signals clears all pending signals for the process. It may be used by kernel threads when we need to prepare a kernel thread for responding to signals. However using this function for an userspaces processes is incorrect - clearing signals without the program expecting it can cause misbehavior. The raid1 and raid5 code uses flush_signals in its request routine because it wants to prepare for an interruptible wait. This patch drops flush_signals and uses sigprocmask instead to block all signals (including SIGKILL) around the schedule() call. The signals are not lost, but the schedule() call won't respond to them. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Acked-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31md: fix super_offset endianness in super_1_rdev_size_changeJason Yan
[ Upstream commit 3fb632e40d7667d8bedfabc28850ac06d5493f54 ] The sb->super_offset should be big-endian, but the rdev->sb_start is in host byte order, so fix this by adding cpu_to_le64. Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-13md:raid1: fix a dead loop when read from a WriteMostly diskWei Fang
[ Upstream commit 816b0acf3deb6d6be5d0519b286fdd4bafade905 ] If first_bad == this_sector when we get the WriteMostly disk in read_balance(), valid disk will be returned with zero max_sectors. It'll lead to a dead loop in make_request(), and OOM will happen because of endless allocation of struct bio. Since we can't get data from this disk in this case, so continue for another disk. Signed-off-by: Wei Fang <fangwei1@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-08dm bufio: make the parameter "retain_bytes" unsigned longMikulas Patocka
[ Upstream commit 13840d38016203f0095cd547b90352812d24b787 ] Change the type of the parameter "retain_bytes" from unsigned to unsigned long, so that on 64-bit machines the user can set more than 4GiB of data to be retained. Also, change the type of the variable "count" in the function "__evict_old_buffers" to unsigned long. The assignment "count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];" could result in unsigned long to unsigned overflow and that could result in buffers not being freed when they should. While at it, avoid division in get_retain_buffers(). Division is slow, we can change it to shift because we have precalculated the log2 of block size. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-08dm space map disk: fix some book keeping in the disk space mapJoe Thornber
[ Upstream commit 0377a07c7a035e0d033cd8b29f0cb15244c0916a ] When decrementing the reference count for a block, the free count wasn't being updated if the reference count went to zero. Cc: stable@vger.kernel.org Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-08dm thin metadata: call precommit before saving the rootsJoe Thornber
[ Upstream commit 91bcdb92d39711d1adb40c26b653b7978d93eb98 ] These calls were the wrong way round in __write_initial_superblock. Cc: stable@vger.kernel.org Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-05-17dm bufio: check new buffer allocation watermark every 30 secondsMikulas Patocka
[ Upstream commit 390020ad2af9ca04844c4f3b1f299ad8746d84c8 ] dm-bufio checks a watermark when it allocates a new buffer in __bufio_new(). However, it doesn't check the watermark when the user changes /sys/module/dm_bufio/parameters/max_cache_size_bytes. This may result in a problem - if the watermark is high enough so that all possible buffers are allocated and if the user lowers the value of "max_cache_size_bytes", the watermark will never be checked against the new value because no new buffer would be allocated. To fix this, change __evict_old_buffers() so that it checks the watermark. __evict_old_buffers() is called every 30 seconds, so if the user reduces "max_cache_size_bytes", dm-bufio will react to this change within 30 seconds and decrease memory consumption. Depends-on: 1b0fb5a5b2 ("dm bufio: avoid a possible ABBA deadlock") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-05-17dm bufio: avoid a possible ABBA deadlockMikulas Patocka
[ Upstream commit 1b0fb5a5b2dc0dddcfa575060441a7176ba7ac37 ] __get_memory_limit() tests if dm_bufio_cache_size changed and calls __cache_size_refresh() if it did. It takes dm_bufio_clients_lock while it already holds the client lock. However, lock ordering is violated because in cleanup_old_buffers() dm_bufio_clients_lock is taken before the client lock. This results in a possible deadlock and lockdep engine warning. Fix this deadlock by changing mutex_lock() to mutex_trylock(). If the lock can't be taken, it will be re-checked next time when a new buffer is allocated. Also add "unlikely" to the if condition, so that the optimizer assumes that the condition is false. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-05-17dm ioctl: prevent stack leak in dm ioctl callAdrian Salido
[ Upstream commit 4617f564c06117c7d1b611be49521a4430042287 ] When calling a dm ioctl that doesn't process any data (IOCTL_FLAGS_NO_PARAMS), the contents of the data field in struct dm_ioctl are left initialized. Current code is incorrectly extending the size of data copied back to user, causing the contents of kernel stack to be leaked to user. Fix by only copying contents before data and allow the functions processing the ioctl to override. Cc: stable@vger.kernel.org Signed-off-by: Adrian Salido <salidoa@google.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-05-17dm era: save spacemap metadata root after the pre-commitSomasundaram Krishnasamy
[ Upstream commit 117aceb030307dcd431fdcff87ce988d3016c34a ] When committing era metadata to disk, it doesn't always save the latest spacemap metadata root in superblock. Due to this, metadata is getting corrupted sometimes when reopening the device. The correct order of update should be, pre-commit (shadows spacemap root), save the spacemap root (newly shadowed block) to in-core superblock and then the final commit. Cc: stable@vger.kernel.org Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-05-17dm btree: fix for dm_btree_find_lowest_key()Vinothkumar Raja
[ Upstream commit 7d1fedb6e96a960aa91e4ff70714c3fb09195a5a ] dm_btree_find_lowest_key() is giving incorrect results. find_key() traverses the btree correctly for finding the highest key, but there is an error in the way it traverses the btree for retrieving the lowest key. dm_btree_find_lowest_key() fetches the first key of the rightmost block of the btree instead of fetching the first key from the leftmost block. Fix this by conditionally passing the correct parameter to value64() based on the @find_highest flag. Cc: stable@vger.kernel.org Signed-off-by: Erez Zadok <ezk@fsl.cs.sunysb.edu> Signed-off-by: Vinothkumar Raja <vinraja@cs.stonybrook.edu> Signed-off-by: Nidhi Panpalia <npanpalia@cs.stonybrook.edu> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-05-17md: update slab_cache before releasing new stripes when stripes resizingDennis Yang
[ Upstream commit 583da48e388f472e8818d9bb60ef6a1d40ee9f9d ] When growing raid5 device on machine with small memory, there is chance that mdadm will be killed and the following bug report can be observed. The same bug could also be reproduced in linux-4.10.6. [57600.075774] BUG: unable to handle kernel NULL pointer dereference at (null) [57600.083796] IP: [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20 [57600.110378] PGD 421cf067 PUD 4442d067 PMD 0 [57600.114678] Oops: 0002 [#1] SMP [57600.180799] CPU: 1 PID: 25990 Comm: mdadm Tainted: P O 4.2.8 #1 [57600.187849] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS QV05AR66 03/06/2013 [57600.197490] task: ffff880044e47240 ti: ffff880043070000 task.ti: ffff880043070000 [57600.204963] RIP: 0010:[<ffffffff81a6aa87>] [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20 [57600.213057] RSP: 0018:ffff880043073810 EFLAGS: 00010046 [57600.218359] RAX: 0000000000000000 RBX: 000000000000000c RCX: ffff88011e296dd0 [57600.225486] RDX: 0000000000000001 RSI: ffffe8ffffcb46c0 RDI: 0000000000000000 [57600.232613] RBP: ffff880043073878 R08: ffff88011e5f8170 R09: 0000000000000282 [57600.239739] R10: 0000000000000005 R11: 28f5c28f5c28f5c3 R12: ffff880043073838 [57600.246872] R13: ffffe8ffffcb46c0 R14: 0000000000000000 R15: ffff8800b9706a00 [57600.253999] FS: 00007f576106c700(0000) GS:ffff88011e280000(0000) knlGS:0000000000000000 [57600.262078] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [57600.267817] CR2: 0000000000000000 CR3: 00000000428fe000 CR4: 00000000001406e0 [57600.274942] Stack: [57600.276949] ffffffff8114ee35 ffff880043073868 0000000000000282 000000000000eb3f [57600.284383] ffffffff81119043 ffff880043073838 ffff880043073838 ffff88003e197b98 [57600.291820] ffffe8ffffcb46c0 ffff88003e197360 0000000000000286 ffff880043073968 [57600.299254] Call Trace: [57600.301698] [<ffffffff8114ee35>] ? cache_flusharray+0x35/0xe0 [57600.307523] [<ffffffff81119043>] ? __page_cache_release+0x23/0x110 [57600.313779] [<ffffffff8114eb53>] kmem_cache_free+0x63/0xc0 [57600.319344] [<ffffffff81579942>] drop_one_stripe+0x62/0x90 [57600.324915] [<ffffffff81579b5b>] raid5_cache_scan+0x8b/0xb0 [57600.330563] [<ffffffff8111b98a>] shrink_slab.part.36+0x19a/0x250 [57600.336650] [<ffffffff8111e38c>] shrink_zone+0x23c/0x250 [57600.342039] [<ffffffff8111e4f3>] do_try_to_free_pages+0x153/0x420 [57600.348210] [<ffffffff8111e851>] try_to_free_pages+0x91/0xa0 [57600.353959] [<ffffffff811145b1>] __alloc_pages_nodemask+0x4d1/0x8b0 [57600.360303] [<ffffffff8157a30b>] check_reshape+0x62b/0x770 [57600.365866] [<ffffffff8157a4a5>] raid5_check_reshape+0x55/0xa0 [57600.371778] [<ffffffff81583df7>] update_raid_disks+0xc7/0x110 [57600.377604] [<ffffffff81592b73>] md_ioctl+0xd83/0x1b10 [57600.382827] [<ffffffff81385380>] blkdev_ioctl+0x170/0x690 [57600.388307] [<ffffffff81195238>] block_ioctl+0x38/0x40 [57600.393525] [<ffffffff811731c5>] do_vfs_ioctl+0x2b5/0x480 [57600.399010] [<ffffffff8115e07b>] ? vfs_write+0x14b/0x1f0 [57600.404400] [<ffffffff811733cc>] SyS_ioctl+0x3c/0x70 [57600.409447] [<ffffffff81a6ad97>] entry_SYSCALL_64_fastpath+0x12/0x6a [57600.415875] Code: 00 00 00 00 55 48 89 e5 8b 07 85 c0 74 04 31 c0 5d c3 ba 01 00 00 00 f0 0f b1 17 85 c0 75 ef b0 01 5d c3 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 85 d1 63 ff 5d [57600.435460] RIP [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20 [57600.441208] RSP <ffff880043073810> [57600.444690] CR2: 0000000000000000 [57600.448000] ---[ end trace cbc6b5cc4bf9831d ]--- The problem is that resize_stripes() releases new stripe_heads before assigning new slab cache to conf->slab_cache. If the shrinker function raid5_cache_scan() gets called after resize_stripes() starting releasing new stripes but right before new slab cache being assigned, it is possible that these new stripe_heads will be freed with the old slab_cache which was already been destoryed and that triggers this bug. Signed-off-by: Dennis Yang <dennisyang@qnap.com> Fixes: edbe83ab4c27 ("md/raid5: allow the stripe_cache to grow and shrink.") Cc: stable@vger.kernel.org (4.1+) Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-05-17md/raid1/10: fix potential deadlockShaohua Li
[ Upstream commit 61eb2b43b99ebdc9bc6bc83d9792257b243e7cb3 ] Neil Brown pointed out a potential deadlock in raid 10 code with bio_split/chain. The raid1 code could have the same issue, but recent barrier rework makes it less likely to happen. The deadlock happens in below sequence: 1. generic_make_request(bio), this will set current->bio_list 2. raid10_make_request will split bio to bio1 and bio2 3. __make_request(bio1), wait_barrer, add underlayer disk bio to current->bio_list 4. __make_request(bio2), wait_barrer If raise_barrier happens between 3 & 4, since wait_barrier runs at 3, raise_barrier waits for IO completion from 3. And since raise_barrier sets barrier, 4 waits for raise_barrier. But IO from 3 can't be dispatched because raid10_make_request() doesn't finished yet. The solution is to adjust the IO ordering. Quotes from Neil: " It is much safer to: if (need to split) { split = bio_split(bio, ...) bio_chain(...) make_request_fn(split); generic_make_request(bio); } else make_request_fn(mddev, bio); This way we first process the initial section of the bio (in 'split') which will queue some requests to the underlying devices. These requests will be queued in generic_make_request. Then we queue the remainder of the bio, which will be added to the end of the generic_make_request queue. Then we return. generic_make_request() will pop the lower-level device requests off the queue and handle them first. Then it will process the remainder of the original bio once the first section has been fully processed. " Note, this only happens in read path. In write path, the bio is flushed to underlaying disks either by blk flush (from schedule) or offladed to raid1/10d. It's queued in current->bio_list. Cc: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org (v3.14+, only the raid10 part) Suggested-by: NeilBrown <neilb@suse.com> Reviewed-by: Jack Wang <jinpu.wang@profitbricks.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-05-17dm cache: fix corruption seen when using cache > 2TBJoe Thornber
[ Upstream commit ca763d0a53b264a650342cee206512bc92ac7050 ] A rounding bug due to compiler generated temporary being 32bit was found in remap_to_cache(). A localized cast in remap_to_cache() fixes the corruption but this preferred fix (changing from uint32_t to sector_t) eliminates potential for future rounding errors elsewhere. Cc: stable@vger.kernel.org Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-01-12dm space map metadata: fix 'struct sm_metadata' leak on failed createBenjamin Marzinski
[ Upstream commit 314c25c56c1ee5026cf99c570bdfe01847927acb ] In dm_sm_metadata_create() we temporarily change the dm_space_map operations from 'ops' (whose .destroy function deallocates the sm_metadata) to 'bootstrap_ops' (whose .destroy function doesn't). If dm_sm_metadata_create() fails in sm_ll_new_metadata() or sm_ll_extend(), it exits back to dm_tm_create_internal(), which calls dm_sm_destroy() with the intention of freeing the sm_metadata, but it doesn't (because the dm_space_map operations is still set to 'bootstrap_ops'). Fix this by setting the dm_space_map operations back to 'ops' if dm_sm_metadata_create() fails when it is set to 'bootstrap_ops'. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-01-12md/raid5: limit request size according to implementation limitsKonstantin Khlebnikov
[ Upstream commit e8d7c33232e5fdfa761c3416539bc5b4acd12db5 ] Current implementation employ 16bit counter of active stripes in lower bits of bio->bi_phys_segments. If request is big enough to overflow this counter bio will be completed and freed too early. Fortunately this not happens in default configuration because several other limits prevent that: stripe_cache_size * nr_disks effectively limits count of active stripes. And small max_sectors_kb at lower disks prevent that during normal read/write operations. Overflow easily happens in discard if it's enabled by module parameter "devices_handle_discard_safely" and stripe_cache_size is set big enough. This patch limits requests size with 256Mb - 8Kb to prevent overflows. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Shaohua Li <shli@kernel.org> Cc: Neil Brown <neilb@suse.com> Cc: stable@vger.kernel.org Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-01-12dm crypt: mark key as invalid until properly loadedOndrej Kozina
[ Upstream commit 265e9098bac02bc5e36cda21fdbad34cb5b2f48d ] In crypt_set_key(), if a failure occurs while replacing the old key (e.g. tfm->setkey() fails) the key must not have DM_CRYPT_KEY_VALID flag set. Otherwise, the crypto layer would have an invalid key that still has DM_CRYPT_KEY_VALID flag set. Cc: stable@vger.kernel.org Signed-off-by: Ondrej Kozina <okozina@redhat.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-11-25md: be careful not lot leak internal curr_resync value into metadata. -- (all)NeilBrown
[ Upstream commit 1217e1d1999ed6c9c1e1b1acae0a74ab70464ae2 ] mddev->curr_resync usually records where the current resync is up to, but during the starting phase it has some "magic" values. 1 - means that the array is trying to start a resync, but has yielded to another array which shares physical devices, and also needs to start a resync 2 - means the array is trying to start resync, but has found another array which shares physical devices and has already started resync. 3 - means that resync has commensed, but it is possible that nothing has actually been resynced yet. It is important that this value not be visible to user-space and particularly that it doesn't get written to the metadata, as the resync or recovery checkpoint. In part, this is because it may be slightly higher than the correct value, though this is very rare. In part, because it is not a multiple of 4K, and some devices only support 4K aligned accesses. There are two places where this value is propagates into either ->curr_resync_completed or ->recovery_cp or ->recovery_offset. These currently avoid the propagation of values 1 and 3, but will allow 3 to leak through. Change them to only propagate the value if it is > 3. As this can cause an array to fail, the patch is suitable for -stable. Cc: stable@vger.kernel.org (v3.7+) Reported-by: Viswesh <viswesh.vichu@gmail.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-11-25md: sync sync_completed has correct value as recovery finishes.NeilBrown
[ Upstream commit 5ed1df2eacc0ba92c8c7e2499c97594b5ef928a8 ] There can be a small window between the moment that recovery actually writes the last block and the time when various sysfs and /proc/mdstat attributes report that it has finished. During this time, 'sync_completed' can have the wrong value. This can confuse monitoring software. So: - don't set curr_resync_completed beyond the end of the devices, - set it correctly when resync/recovery has completed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-11-23dm table: fix missing dm_put_target_type() in dm_table_add_target()tang.junhui
[ Upstream commit dafa724bf582181d9a7d54f5cb4ca0bf8ef29269 ] dm_get_target_type() was previously called so any error returned from dm_table_add_target() must first call dm_put_target_type(). Otherwise the DM target module's reference count will leak and the associated kernel module will be unable to be removed. Also, leverage the fact that r is already -EINVAL and remove an extra newline. Fixes: 36a0456 ("dm table: add immutable feature") Fixes: cc6cbe1 ("dm table: add always writeable feature") Fixes: 3791e2f ("dm table: add singleton feature") Cc: stable@vger.kernel.org # 3.2+ Signed-off-by: tang.junhui <tang.junhui@zte.com.cn> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-09-15dm crypt: fix free of bad values after tfm allocation failureEric Biggers
[ Upstream commit 5d0be84ec0cacfc7a6d6ea548afdd07d481324cd ] If crypt_alloc_tfms() had to allocate multiple tfms and it failed before the last allocation, then it would call crypt_free_tfms() and could free pointers from uninitialized memory -- due to the crypt_free_tfms() check for non-zero cc->tfms[i]. Fix by allocating zeroed memory. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-09-15dm crypt: fix error with too large biosMikulas Patocka
[ Upstream commit 4e870e948fbabf62b78e8410f04c67703e7c816b ] When dm-crypt processes writes, it allocates a new bio in crypt_alloc_buffer(). The bio is allocated from a bio set and it can have at most BIO_MAX_PAGES vector entries, however the incoming bio can be larger (e.g. if it was allocated by bcache). If the incoming bio is larger, bio_alloc_bioset() fails and an error is returned. To avoid the error, we test for a too large bio in the function crypt_map() and use dm_accept_partial_bio() to split the bio. dm_accept_partial_bio() trims the current bio to the desired size and asks DM core to send another bio with the rest of the data. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v3.16+ Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-09-15dm log writes: fix check of kthread_run() return valueVladimir Zapolskiy
[ Upstream commit 91e630d9ae6de6f740ef7c8176736eb55366833e ] The kthread_run() function returns either a valid task_struct or ERR_PTR() value, check for NULL is invalid. This change fixes potential for oops, e.g. in OOM situation. Signed-off-by: Vladimir Zapolskiy <vz@mleia.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-09-15dm log writes: fix bug with too large biosMikulas Patocka
[ Upstream commit 7efb367320f56fc4d549875b6f3a6940018ef2e5 ] bio_alloc() can allocate a bio with at most BIO_MAX_PAGES (256) vector entries. However, the incoming bio may have more vector entries if it was allocated by other means. For example, bcache submits bios with more than BIO_MAX_PAGES entries. This results in bio_alloc() failure. To avoid the failure, change the code so that it allocates bio with at most BIO_MAX_PAGES entries. If the incoming bio has more entries, bio_add_page() will fail and a new bio will be allocated - the code that handles bio_add_page() failure already exists in the dm-log-writes target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Josef Bacik <jbacik@fb,com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v4.1+ Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-09-15dm log writes: move IO accounting earlier to fix error pathMikulas Patocka
[ Upstream commit a5d60783df61fbb67b7596b8a0f6b4b2e05251d5 ] Move log_one_block()'s atomic_inc(&lc->io_blocks) before bio_alloc() to fix a bug that the target hangs if bio_alloc() fails. The error path does put_io_block(lc), so atomic_inc(&lc->io_blocks) must occur before invoking the error path to avoid underflow of lc->io_blocks. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Josef Bacik <jbacik@fb,com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-08-31dm flakey: fix reads to be issued if drop_writes configuredMike Snitzer
[ Upstream commit 299f6230bc6d0ccd5f95bb0fb865d80a9c7d5ccc ] v4.8-rc3 commit 99f3c90d0d ("dm flakey: error READ bios during the down_interval") overlooked the 'drop_writes' feature, which is meant to allow reads to be issued rather than errored, during the down_interval. Fixes: 99f3c90d0d ("dm flakey: error READ bios during the down_interval") Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-08-31bcache: RESERVE_PRIO is too small by one when prio_buckets() is a power of two.Kent Overstreet
[ Upstream commit acc9cf8c66c66b2cbbdb4a375537edee72be64df ] This patch fixes a cachedev registration-time allocation deadlock. This can deadlock on boot if your initrd auto-registeres bcache devices: Allocator thread: [ 720.727614] INFO: task bcache_allocato:3833 blocked for more than 120 seconds. [ 720.732361] [<ffffffff816eeac7>] schedule+0x37/0x90 [ 720.732963] [<ffffffffa05192b8>] bch_bucket_alloc+0x188/0x360 [bcache] [ 720.733538] [<ffffffff810e6950>] ? prepare_to_wait_event+0xf0/0xf0 [ 720.734137] [<ffffffffa05302bd>] bch_prio_write+0x19d/0x340 [bcache] [ 720.734715] [<ffffffffa05190bf>] bch_allocator_thread+0x3ff/0x470 [bcache] [ 720.735311] [<ffffffff816ee41c>] ? __schedule+0x2dc/0x950 [ 720.735884] [<ffffffffa0518cc0>] ? invalidate_buckets+0x980/0x980 [bcache] Registration thread: [ 720.710403] INFO: task bash:3531 blocked for more than 120 seconds. [ 720.715226] [<ffffffff816eeac7>] schedule+0x37/0x90 [ 720.715805] [<ffffffffa05235cd>] __bch_btree_map_nodes+0x12d/0x150 [bcache] [ 720.716409] [<ffffffffa0522d30>] ? bch_btree_insert_check_key+0x1c0/0x1c0 [bcache] [ 720.717008] [<ffffffffa05236e4>] bch_btree_insert+0xf4/0x170 [bcache] [ 720.717586] [<ffffffff810e6950>] ? prepare_to_wait_event+0xf0/0xf0 [ 720.718191] [<ffffffffa0527d9a>] bch_journal_replay+0x14a/0x290 [bcache] [ 720.718766] [<ffffffff810cc90d>] ? ttwu_do_activate.constprop.94+0x5d/0x70 [ 720.719369] [<ffffffff810cf684>] ? try_to_wake_up+0x1d4/0x350 [ 720.719968] [<ffffffffa05317d0>] run_cache_set+0x580/0x8e0 [bcache] [ 720.720553] [<ffffffffa053302e>] register_bcache+0xe2e/0x13b0 [bcache] [ 720.721153] [<ffffffff81354cef>] kobj_attr_store+0xf/0x20 [ 720.721730] [<ffffffff812a2dad>] sysfs_kf_write+0x3d/0x50 [ 720.722327] [<ffffffff812a225a>] kernfs_fop_write+0x12a/0x180 [ 720.722904] [<ffffffff81225177>] __vfs_write+0x37/0x110 [ 720.723503] [<ffffffff81228048>] ? __sb_start_write+0x58/0x110 [ 720.724100] [<ffffffff812cedb3>] ? security_file_permission+0x23/0xa0 [ 720.724675] [<ffffffff812258a9>] vfs_write+0xa9/0x1b0 [ 720.725275] [<ffffffff8102479c>] ? do_audit_syscall_entry+0x6c/0x70 [ 720.725849] [<ffffffff81226755>] SyS_write+0x55/0xd0 [ 720.726451] [<ffffffff8106a390>] ? do_page_fault+0x30/0x80 [ 720.727045] [<ffffffff816f2cae>] system_call_fastpath+0x12/0x71 The fifo code in upstream bcache can't use the last element in the buffer, which was the cause of the bug: if you asked for a power of two size, it'd give you a fifo that could hold one less than what you asked for rather than allocating a buffer twice as big. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-08-31bcache: register_bcache(): call blkdev_put() when cache_alloc() failsEric Wheeler
[ Upstream commit d9dc1702b297ec4a6bb9c0326a70641b322ba886 ] register_cache() is supposed to return an error string on error so that register_bcache() will will blkdev_put and cleanup other user counters, but it does not set 'char *err' when cache_alloc() fails (eg, due to memory pressure) and thus register_bcache() performs no cleanup. register_bcache() <----------\ <- no jump to err_close, no blkdev_put() | | +->register_cache() | <- fails to set char *err | | +->cache_alloc() ---/ <- returns error This patch sets `char *err` for this failure case so that register_cache() will cause register_bcache() to correctly jump to err_close and do cleanup. This was tested under OOM conditions that triggered the bug. Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-08-19dm: set DMF_SUSPENDED* _before_ clearing DMF_NOFLUSH_SUSPENDINGMike Snitzer
[ Upstream commit eaf9a7361f47727b166688a9f2096854eef60fbe ] Otherwise, there is potential for both DMF_SUSPENDED* and DMF_NOFLUSH_SUSPENDING to not be set during dm_suspend() -- which is definitely _not_ a valid state. This fix, in conjuction with "dm rq: fix the starting and stopping of blk-mq queues", addresses the potential for request-based DM multipath's __multipath_map() to see !dm_noflush_suspending() during suspend. Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-08-19dm rq: fix the starting and stopping of blk-mq queuesSasha Levin
[ Upstream commit 7d9595d848cdff5c7939f68eec39e0c5d36a1d67 ] Improve dm_stop_queue() to cancel any requeue_work. Also, have dm_start_queue() and dm_stop_queue() clear/set the QUEUE_FLAG_STOPPED for the blk-mq request_queue. On suspend dm_stop_queue() handles stopping the blk-mq request_queue BUT: even though the hw_queues are marked BLK_MQ_S_STOPPED at that point there is still a race that is allowing block/blk-mq.c to call ->queue_rq against a hctx that it really shouldn't. Add a check to dm_mq_queue_rq() that guards against this rarity (albeit _not_ race-free). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # must patch dm.c on < 4.8 kernels Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-08-19dm flakey: error READ bios during the down_intervalMike Snitzer
[ Upstream commit 99f3c90d0d85708e7401a81ce3314e50bf7f2819 ] When the corrupt_bio_byte feature was introduced it caused READ bios to no longer be errored with -EIO during the down_interval. This had to do with the complexity of needing to submit READs if the corrupt_bio_byte feature was used. Fix it so READ bios are properly errored with -EIO; doing so early in flakey_map() as long as there isn't a match for the corrupt_bio_byte feature. Fixes: a3998799fb4df ("dm flakey: add corrupt_bio_byte feature") Reported-by: Akira Hayakawa <ruby.wktk@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-08-19dm: fix second blk_delay_queue() parameter to be in msec units not jiffiesSasha Levin
[ Upstream commit bd9f55ea1cf6e14eb054b06ea877d2d1fa339514 ] Commit d548b34b062 ("dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms") always intended the value to be 10 msecs -- it just expressed it in jiffies because earlier commit 7eaceaccab ("block: remove per-queue plugging") did. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Fixes: d548b34b062 ("dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms") Cc: stable@vger.kernel.org # 4.1+ -- stable@ backports must be applied to drivers/md/dm.c Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-07-10dm snapshot: disallow the COW and origin devices from being identicalDingXiang
[ Upstream commit 4df2bf466a9c9c92f40d27c4aa9120f4e8227bfc ] Otherwise loading a "snapshot" table using the same device for the origin and COW devices, e.g.: echo "0 20971520 snapshot 253:3 253:3 P 8" | dmsetup create snap will trigger: BUG: unable to handle kernel NULL pointer dereference at 0000000000000098 [ 1958.979934] IP: [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot] [ 1958.989655] PGD 0 [ 1958.991903] Oops: 0000 [#1] SMP ... [ 1959.059647] CPU: 9 PID: 3556 Comm: dmsetup Tainted: G IO 4.5.0-rc5.snitm+ #150 ... [ 1959.083517] task: ffff8800b9660c80 ti: ffff88032a954000 task.ti: ffff88032a954000 [ 1959.091865] RIP: 0010:[<ffffffffa040efba>] [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot] [ 1959.104295] RSP: 0018:ffff88032a957b30 EFLAGS: 00010246 [ 1959.110219] RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000001 [ 1959.118180] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff880329334a00 [ 1959.126141] RBP: ffff88032a957b50 R08: 0000000000000000 R09: 0000000000000001 [ 1959.134102] R10: 000000000000000a R11: f000000000000000 R12: ffff880330884d80 [ 1959.142061] R13: 0000000000000008 R14: ffffc90001c13088 R15: ffff880330884d80 [ 1959.150021] FS: 00007f8926ba3840(0000) GS:ffff880333440000(0000) knlGS:0000000000000000 [ 1959.159047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1959.165456] CR2: 0000000000000098 CR3: 000000032f48b000 CR4: 00000000000006e0 [ 1959.173415] Stack: [ 1959.175656] ffffc90001c13040 ffff880329334a00 ffff880330884ed0 ffff88032a957bdc [ 1959.183946] ffff88032a957bb8 ffffffffa040f225 ffff880329334a30 ffff880300000000 [ 1959.192233] ffffffffa04133e0 ffff880329334b30 0000000830884d58 00000000569c58cf [ 1959.200521] Call Trace: [ 1959.203248] [<ffffffffa040f225>] dm_exception_store_create+0x1d5/0x240 [dm_snapshot] [ 1959.211986] [<ffffffffa040d310>] snapshot_ctr+0x140/0x630 [dm_snapshot] [ 1959.219469] [<ffffffffa0005c44>] ? dm_split_args+0x64/0x150 [dm_mod] [ 1959.226656] [<ffffffffa0005ea7>] dm_table_add_target+0x177/0x440 [dm_mod] [ 1959.234328] [<ffffffffa0009203>] table_load+0x143/0x370 [dm_mod] [ 1959.241129] [<ffffffffa00090c0>] ? retrieve_status+0x1b0/0x1b0 [dm_mod] [ 1959.248607] [<ffffffffa0009e35>] ctl_ioctl+0x255/0x4d0 [dm_mod] [ 1959.255307] [<ffffffff813304e2>] ? memzero_explicit+0x12/0x20 [ 1959.261816] [<ffffffffa000a0c3>] dm_ctl_ioctl+0x13/0x20 [dm_mod] [ 1959.268615] [<ffffffff81215eb6>] do_vfs_ioctl+0xa6/0x5c0 [ 1959.274637] [<ffffffff81120d2f>] ? __audit_syscall_entry+0xaf/0x100 [ 1959.281726] [<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70 [ 1959.288814] [<ffffffff81216449>] SyS_ioctl+0x79/0x90 [ 1959.294450] [<ffffffff8167e4ae>] entry_SYSCALL_64_fastpath+0x12/0x71 ... [ 1959.323277] RIP [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot] [ 1959.333090] RSP <ffff88032a957b30> [ 1959.336978] CR2: 0000000000000098 [ 1959.344121] ---[ end trace b049991ccad1169e ]--- Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1195899 Cc: stable@vger.kernel.org Signed-off-by: Ding Xiang <dingxiang@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-05-17MD: make bio mergeableShaohua Li
[ Upstream commit 9c573de3283af007ea11c17bde1e4568d9417328 ] blk_queue_split marks bio unmergeable, which makes sense for normal bio. But if dispatching the bio to underlayer disk, the blk_queue_split checks are invalid, hence it's possible the bio becomes mergeable. In the reported bug, this bug causes trim against raid0 performance slash https://bugzilla.kernel.org/show_bug.cgi?id=117051 Reported-and-tested-by: Park Ju Hyung <qkrwngud825@gmail.com> Fixes: 6ac45aeb6bca(block: avoid to merge splitted bio) Cc: stable@vger.kernel.org (v4.3+) Cc: Ming Lei <ming.lei@canonical.com> Cc: Neil Brown <neilb@suse.de> Reviewed-by: Jens Axboe <axboe@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18md: multipath: don't hardcopy bio in .make_request pathMing Lei
[ Upstream commit fafcde3ac1a418688a734365203a12483b83907a ] Inside multipath_make_request(), multipath maps the incoming bio into low level device's bio, but it is totally wrong to copy the bio into mapped bio via '*mapped_bio = *bio'. For example, .__bi_remaining is kept in the copy, especially if the incoming bio is chained to via bio splitting, so .bi_end_io can't be called for the mapped bio at all in the completing path in this kind of situation. This patch fixes the issue by using clone style. Cc: stable@vger.kernel.org (v3.14+) Reported-and-tested-by: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18dm thin metadata: don't issue prefetches if a transaction abort has failedJoe Thornber
[ Upstream commit 2eae9e4489b4cf83213fa3bd508b5afca3f01780 ] If a transaction abort has failed then we can no longer use the metadata device. Typically this happens if the superblock is unreadable. This fix addresses a crash seen during metadata device failure testing. Fixes: 8a01a6af75 ("dm thin: prefetch missing metadata pages") Cc: stable@vger.kernel.org # 3.19+ Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_listNeilBrown
[ Upstream commit 550da24f8d62fe81f3c13e3ec27602d6e44d43dc ] break_stripe_batch_list breaks up a batch and copies some flags from the batch head to the members, preserving others. It doesn't preserve or copy STRIPE_PREREAD_ACTIVE. This is not normally a problem as STRIPE_PREREAD_ACTIVE is cleared when a stripe_head is added to a batch, and is not set on stripe_heads already in a batch. However there is no locking to ensure one thread doesn't set the flag after it has just been cleared in another. This does occasionally happen. md/raid5 maintains a count of the number of stripe_heads with STRIPE_PREREAD_ACTIVE set: conf->preread_active_stripes. When break_stripe_batch_list clears STRIPE_PREREAD_ACTIVE inadvertently this could becomes incorrect and will never again return to zero. md/raid5 delays the handling of some stripe_heads until preread_active_stripes becomes zero. So when the above mention race happens, those stripe_heads become blocked and never progress, resulting is write to the array handing. So: change break_stripe_batch_list to preserve STRIPE_PREREAD_ACTIVE in the members of a batch. URL: https://bugzilla.kernel.org/show_bug.cgi?id=108741 URL: https://bugzilla.redhat.com/show_bug.cgi?id=1258153 URL: http://thread.gmane.org/5649C0E9.2030204@zoner.cz Reported-by: Martin Svec <martin.svec@zoner.cz> (and others) Tested-by: Tom Weber <linux@junkyard.4t2.com> Fixes: 1b956f7a8f9a ("md/raid5: be more selective about distributing flags across batch.") Cc: stable@vger.kernel.org (v4.1 and later) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18bcache: fix cache_set_flush() NULL pointer dereference on OOMEric Wheeler
[ Upstream commit f8b11260a445169989d01df75d35af0f56178f95 ] When bch_cache_set_alloc() fails to kzalloc the cache_set, the asyncronous closure handling tries to dereference a cache_set that hadn't yet been allocated inside of cache_set_flush() which is called by __cache_set_unregister() during cleanup. This appears to happen only during an OOM condition on bcache_register. Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18bcache: cleaned up error handling around register_cache()Eric Wheeler
[ Upstream commit 9b299728ed777428b3908ac72ace5f8f84b97789 ] Fix null pointer dereference by changing register_cache() to return an int instead of being void. This allows it to return -ENOMEM or -ENODEV and enables upper layers to handle the OOM case without NULL pointer issues. See this thread: http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3521 Fixes this error: gargamel:/sys/block/md5/bcache# echo /dev/sdh2 > /sys/fs/bcache/register bcache: register_cache() error opening sdh2: cannot allocate memory BUG: unable to handle kernel NULL pointer dereference at 00000000000009b8 IP: [<ffffffffc05a7e8d>] cache_set_flush+0x102/0x15c [bcache] PGD 120dff067 PUD 1119a3067 PMD 0 Oops: 0000 [#1] SMP Modules linked in: veth ip6table_filter ip6_tables (...) CPU: 4 PID: 3371 Comm: kworker/4:3 Not tainted 4.4.2-amd64-i915-volpreempt-20160213bc1 #3 Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013 Workqueue: events cache_set_flush [bcache] task: ffff88020d5dc280 ti: ffff88020b6f8000 task.ti: ffff88020b6f8000 RIP: 0010:[<ffffffffc05a7e8d>] [<ffffffffc05a7e8d>] cache_set_flush+0x102/0x15c [bcache] Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Tested-by: Marc MERLIN <marc@merlins.org> Cc: <stable@vger.kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18bcache: fix race of writeback thread starting before complete initializationEric Wheeler
[ Upstream commit 07cc6ef8edc47f8b4fc1e276d31127a0a5863d4d ] The bch_writeback_thread might BUG_ON in read_dirty() if dc->sb==BDEV_STATE_DIRTY and bch_sectors_dirty_init has not yet completed its related initialization. This patch downs the dc->writeback_lock until after initialization is complete, thus preventing bch_writeback_thread from proceeding prematurely. See this thread: http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3453 Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Tested-by: Marc MERLIN <marc@merlins.org> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18RAID5: check_reshape() shouldn't call mddev_suspendShaohua Li
[ Upstream commit 27a353c026a879a1001e5eac4bda75b16262c44a ] check_reshape() is called from raid5d thread. raid5d thread shouldn't call mddev_suspend(), because mddev_suspend() waits for all IO finish but IO is handled in raid5d thread, we could easily deadlock here. This issue is introduced by 738a273 ("md/raid5: fix allocation of 'scribble' array.") Cc: stable@vger.kernel.org (v4.1+) Reported-and-tested-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18md/raid5: Compare apples to apples (or sectors to sectors)Jes Sorensen
[ Upstream commit e7597e69dec59b65c5525db1626b9d34afdfa678 ] 'max_discard_sectors' is in sectors, while 'stripe' is in bytes. This fixes the problem where DISCARD would get disabled on some larger RAID5 configurations (6 or more drives in my testing), while it worked as expected with smaller configurations. Fixes: 620125f2bf8 ("MD: raid5 trim support") Cc: stable@vger.kernel.org v3.7+ Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18dm: fix excessive dm-mq context switchingMike Snitzer
[ Upstream commit 6acfe68bac7e6f16dc312157b1fa6e2368985013 ] Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower than if an underlying null_blk device were used directly. One of the reasons for this drop in performance is that blk_insert_clone_request() was calling blk_mq_insert_request() with @async=true. This forced the use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues which ushered in ping-ponging between process context (fio in this case) and kblockd's kworker to submit the cloned request. The ftrace function_graph tracer showed: kworker-2013 => fio-12190 fio-12190 => kworker-2013 ... kworker-2013 => fio-12190 fio-12190 => kworker-2013 ... Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to _not_ use kblockd to submit the cloned requests isn't enough to eliminate the observed context switches. In addition to this dm-mq specific blk-core fix, there are 2 DM core fixes to dm-mq that (when paired with the blk-core fix) completely eliminate the observed context switching: 1) don't blk_mq_run_hw_queues in blk-mq request completion Motivated by desire to reduce overhead of dm-mq, punting to kblockd just increases context switches. In my testing against a really fast null_blk device there was no benefit to running blk_mq_run_hw_queues() on completion (and no other blk-mq driver does this). So hopefully this change doesn't induce the need for yet another revert like commit 621739b00e16ca2d ! 2) use blk_mq_complete_request() in dm_complete_request() blk_complete_request() doesn't offer the traditional q->mq_ops vs .request_fn branching pattern that other historic block interfaces do (e.g. blk_get_request). Using blk_mq_complete_request() for blk-mq requests is important for performance. It should be noted that, like blk_complete_request(), blk_mq_complete_request() doesn't natively handle partial completions -- but the request-based DM-multipath target does provide the required partial completion support by dm.c:end_clone_bio() triggering requeueing of the request via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE. dm-mq fix #2 is _much_ more important than #1 for eliminating the context switches. Before: cpu : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475 After: cpu : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472 With these changes multithreaded async read IOPs improved from ~950K to ~1350K for this dm-mq stacked on null_blk test-case. The raw read IOPs of the underlying null_blk device for the same workload is ~1950K. Fixes: 7fb4898e0 ("block: add blk-mq support to blk_insert_cloned_request()") Fixes: bfebd1cdb ("dm: add full blk-mq support to request-based DM") Cc: stable@vger.kernel.org # 4.1+ Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-03-07dm: fix dm_rq_target_io leak on faults with .request_fn DM w/ blk-mq pathsMike Snitzer
[ Upstream commit 4328daa2e79ed904a42ce00a9f38b9c36b44b21a ] Using request-based DM mpath configured with the following stacking (.request_fn DM mpath ontop of scsi-mq paths): echo Y > /sys/module/scsi_mod/parameters/use_blk_mq echo N > /sys/module/dm_mod/parameters/use_blk_mq 'struct dm_rq_target_io' would leak if a request is requeued before a blk-mq clone is allocated (or fails to allocate). free_rq_tio() wasn't being called. kmemleak reported: unreferenced object 0xffff8800b90b98c0 (size 112): comm "kworker/7:1H", pid 5692, jiffies 4295056109 (age 78.589s) hex dump (first 32 bytes): 00 d0 5c 2c 03 88 ff ff 40 00 bf 01 00 c9 ff ff ..\,....@....... e0 d9 b1 34 00 88 ff ff 00 00 00 00 00 00 00 00 ...4............ backtrace: [<ffffffff81672b6e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811dbb63>] kmem_cache_alloc+0xc3/0x1e0 [<ffffffff8117eae5>] mempool_alloc_slab+0x15/0x20 [<ffffffff8117ec1e>] mempool_alloc+0x6e/0x170 [<ffffffffa00029ac>] dm_old_prep_fn+0x3c/0x180 [dm_mod] [<ffffffff812fbd78>] blk_peek_request+0x168/0x290 [<ffffffffa0003e62>] dm_request_fn+0xb2/0x1b0 [dm_mod] [<ffffffff812f66e3>] __blk_run_queue+0x33/0x40 [<ffffffff812f9585>] blk_delay_work+0x25/0x40 [<ffffffff81096fff>] process_one_work+0x14f/0x3d0 [<ffffffff81097715>] worker_thread+0x125/0x4b0 [<ffffffff8109ce88>] kthread+0xd8/0xf0 [<ffffffff8167cb8f>] ret_from_fork+0x3f/0x70 [<ffffffffffffffff>] 0xffffffffffffffff crash> struct -o dm_rq_target_io struct dm_rq_target_io { ... } SIZE: 112 Fixes: e5863d9ad7 ("dm: allocate requests in target when stacking on blk-mq devices") Cc: stable@vger.kernel.org # 4.0+ Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-02-03dm snapshot: fix hung bios when copy error occursMikulas Patocka
[ Upstream commit 385277bfb57faac44e92497104ba542cdd82d5fe ] When there is an error copying a chunk dm-snapshot can incorrectly hold associated bios indefinitely, resulting in hung IO. The function copy_callback sets pe->error if there was error copying the chunk, and then calls complete_exception. complete_exception calls pending_complete on error, otherwise it calls commit_exception with commit_callback (and commit_callback calls complete_exception). The persistent exception store (dm-snap-persistent.c) assumes that calls to prepare_exception and commit_exception are paired. persistent_prepare_exception increases ps->pending_count and persistent_commit_exception decreases it. If there is a copy error, persistent_prepare_exception is called but persistent_commit_exception is not. This results in the variable ps->pending_count never returning to zero and that causes some pending exceptions (and their associated bios) to be held forever. Fix this by unconditionally calling commit_exception regardless of whether the copy was successful. A new "valid" parameter is added to commit_exception -- when the copy fails this parameter is set to zero so that the chunk that failed to copy (and all following chunks) is not recorded in the snapshot store. Also, remove commit_callback now that it is merely a wrapper around pending_complete. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-02-01bcache: Change refill_dirty() to always scan entire disk if necessaryKent Overstreet
[ Upstream commit 627ccd20b4ad3ba836472468208e2ac4dfadbf03 ] Previously, it would only scan the entire disk if it was starting from the very start of the disk - i.e. if the previous scan got to the end. This was broken by refill_full_stripes(), which updates last_scanned so that refill_dirty was never triggering the searched_from_start path. But if we change refill_dirty() to always scan the entire disk if necessary, regardless of what last_scanned was, the code gets cleaner and we fix that bug too. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-02-01bcache: prevent crash on changing writeback_runningStefan Bader
[ Upstream commit 8d16ce540c94c9d366eb36fc91b7154d92d6397b ] Added a safeguard in the shutdown case. At least while not being attached it is also possible to trigger a kernel bug by writing into writeback_running. This change adds the same check before trying to wake up the thread for that case. Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-02-01bcache: allows use of register in udev to avoid "device_busy" error.Gabriel de Perthuis
[ Upstream commit d7076f21629f8f329bca4a44dc408d94670f49e2 ] Allows to use register, not register_quiet in udev to avoid "device_busy" error. The initial patch proposed at https://lkml.org/lkml/2013/8/26/549 by Gabriel de Perthuis <g2p.code@gmail.com> does not unlock the mutex and hangs the kernel. See http://thread.gmane.org/gmane.linux.kernel.bcache.devel/2594 for the discussion. Cc: Denis Bychkov <manover@gmail.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Gabriel de Perthuis <g2p.code@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>