summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2015-06-06md: fix race when unfreezing sync_actionNeilBrown
commit 56ccc1125bc141cf63927eda7febff4216dea2d3 upstream. A recent change removed the need for locking around writing to "sync_action" (and various other places), but introduced a subtle race. When e.g. setting 'reshape' on a 'frozen' array, the 'frozen' flag is cleared before 'reshape' is set, so the md thread can get in and start trying recovery - which isn't wanted. So instead of clearing MD_RECOVERY_FROZEN for any command except 'frozen', only clear it when each specific command is parsed. This allows the handling of 'reshape' to clear the bit while a lock is held. Also remove some places where we set MD_RECOVERY_NEEDED, as it is always set on non-error exit of the function. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: 6791875e2e53 ("md: make reconfig_mutex optional for writes to md sysfs files.") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-06dm: fix reload failure of 0 path multipath mapping on blk-mq devicesJunichi Nomura
commit 15b94a690470038aa08247eedbebbe7e2218d5ee upstream. dm-multipath accepts 0 path mapping. # echo '0 2097152 multipath 0 0 0 0' | dmsetup create newdev Such a mapping can be used to release underlying devices while still holding requests in its queue until working paths come back. However, once the multipath device is created over blk-mq devices, it rejects reloading of 0 path mapping: # echo '0 2097152 multipath 0 0 1 1 queue-length 0 1 1 /dev/sda 1' \ | dmsetup create mpath1 # echo '0 2097152 multipath 0 0 0 0' | dmsetup load mpath1 device-mapper: reload ioctl on mpath1 failed: Invalid argument Command failed With following kernel message: device-mapper: ioctl: can't change device type after initial table load. DM tries to inherit the current table type using dm_table_set_type() but it doesn't work as expected because of unnecessary check about whether the target type is hybrid or not. Hybrid type is for targets that work as either request-based or bio-based and not required for blk-mq or non blk-mq checking. Fixes: 65803c205983 ("dm table: train hybrid target type detection to select blk-mq if appropriate") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-06dm: fix casting bug in dm_merge_bvec()Joe Thornber
commit 1c220c69ce0dcc0f234a9f263ad9c0864f971852 upstream. dm_merge_bvec() was originally added in f6fccb ("dm: introduce merge_bvec_fn"). In that commit a value in sectors is converted to bytes using << 9, and then assigned to an int. This code made assumptions about the value of BIO_MAX_SECTORS. A later commit 148e51 ("dm: improve documentation and code clarity in dm_merge_bvec") was meant to have no functional change but it removed the use of BIO_MAX_SECTORS in favor of using queue_max_sectors(). At this point the cast from sector_t to int resulted in a zero value. The fallout being dm_merge_bvec() would only allow a single page to be added to a bio. This interim fix is minimal for the benefit of stable@ because the more comprehensive cleanup of passing a sector_t to all DM targets' merge function will impact quite a few DM targets. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-06md/raid0: fix restore to sector variable in raid0_make_requestEric Work
commit a81157768a00e8cf8a7b43b5ea5cac931262374f upstream. The variable "sector" in "raid0_make_request()" was improperly updated by a call to "sector_div()" which modifies its first argument in place. Commit 47d68979cc968535cb87f3e5f2e6a3533ea48fbd restored this variable after the call for later re-use. Unfortunetly the restore was done after the referenced variable "bio" was advanced. This lead to the original value and the restored value being different. Here we move this line to the proper place. One observed side effect of this bug was discarding a file though unlinking would cause an unrelated file's contents to be discarded. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: 47d68979cc96 ("md/raid0: fix bug with chunksize not a power of 2.") URL: https://bugzilla.kernel.org/show_bug.cgi?id=98501 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-06md/raid5: don't record new size if resize_stripes fails.NeilBrown
commit 6e9eac2dcee5e19f125967dd2be3e36558c42fff upstream. If any memory allocation in resize_stripes fails we will return -ENOMEM, but in some cases we update conf->pool_size anyway. This means that if we try again, the allocations will be assumed to be larger than they are, and badness results. So only update pool_size if there is no error. This bug was introduced in 2.6.17 and the patch is suitable for -stable. Fixes: ad01c9e3752f ("[PATCH] md: Allow stripes to be expanded in preparation for expanding an array") Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-17block: destroy bdi before blockdev is unregistered.NeilBrown
commit 6cd18e711dd8075da9d78cfc1239f912ff28968a upstream. Because of the peculiar way that md devices are created (automatically when the device node is opened), a new device can be created and registered immediately after the blk_unregister_region(disk_devt(disk), disk->minors); call in del_gendisk(). Therefore it is important that all visible artifacts of the previous device are removed before this call. In particular, the 'bdi'. Since: commit c4db59d31e39ea067c32163ac961e9c80198fd37 Author: Christoph Hellwig <hch@lst.de> fs: don't reassign dirty inodes to default_backing_dev_info moved the device_unregister(bdi->dev); call from bdi_unregister() to bdi_destroy() it has been quite easy to lose a race and have a new (e.g.) "md127" be created after the blk_unregister_region() call and before bdi_destroy() is ultimately called by the final 'put_disk', which must come after del_gendisk(). The new device finds that the bdi name is already registered in sysfs and complains > [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70() > [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127' We can fix this by moving the bdi_destroy() call out of blk_release_queue() (which can happen very late when a refcount reaches zero) and into blk_cleanup_queue() - which happens exactly when the md device driver calls it. Then it is only necessary for md to call blk_cleanup_queue() before del_gendisk(). As loop.c devices are also created on demand by opening the device node, we make the same change there. Fixes: c4db59d31e39ea067c32163ac961e9c80198fd37 Reported-by: Azat Khuzhin <a3at.mail@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-17Revert "dm crypt: fix deadlock when async crypto algorithm returns -EBUSY"Rabin Vincent
commit c0403ec0bb5a8c5b267fb7e16021bec0b17e4964 upstream. This reverts Linux 4.1-rc1 commit 0618764cb25f6fa9fb31152995de42a8a0496475. The problem which that commit attempts to fix actually lies in the Freescale CAAM crypto driver not dm-crypt. dm-crypt uses CRYPTO_TFM_REQ_MAY_BACKLOG. This means the the crypto driver should internally backlog requests which arrive when the queue is full and process them later. Until the crypto hw's queue becomes full, the driver returns -EINPROGRESS. When the crypto hw's queue if full, the driver returns -EBUSY, and if CRYPTO_TFM_REQ_MAY_BACKLOG is set, is expected to backlog the request and process it when the hardware has queue space. At the point when the driver takes the request from the backlog and starts processing it, it calls the completion function with a status of -EINPROGRESS. The completion function is called (for a second time, in the case of backlogged requests) with a status/err of 0 when a request is done. Crypto drivers for hardware without hardware queueing use the helpers, crypto_init_queue(), crypto_enqueue_request(), crypto_dequeue_request() and crypto_get_backlog() helpers to implement this behaviour correctly, while others implement this behaviour without these helpers (ccp, for example). dm-crypt (before the patch that needs reverting) uses this API correctly. It queues up as many requests as the hw queues will allow (i.e. as long as it gets back -EINPROGRESS from the request function). Then, when it sees at least one backlogged request (gets -EBUSY), it waits till that backlogged request is handled (completion gets called with -EINPROGRESS), and then continues. The references to af_alg_wait_for_completion() and af_alg_complete() in that commit's commit message are irrelevant because those functions only handle one request at a time, unlink dm-crypt. The problem is that the Freescale CAAM driver, which that commit describes as having being tested with, fails to implement the backlogging behaviour correctly. In cam_jr_enqueue(), if the hardware queue is full, it simply returns -EBUSY without backlogging the request. What the observed deadlock was is not described in the commit message but it is obviously the wait_for_completion() in crypto_convert() where dm-crypto would wait for the completion being called with -EINPROGRESS in the case of backlogged requests. This completion will never be completed due to the bug in the CAAM driver. Commit 0618764cb25 incorrectly made dm-crypt wait for every request, even when the driver/hardware queues are not full, which means that dm-crypt will never see -EBUSY. This means that that commit will cause a performance regression on all crypto drivers which implement the API correctly. Revert it. Correct backlog handling should be implemented in the CAAM driver instead. Cc'ing stable purely because commit 0618764cb25 did. If for some reason a stable@ kernel did pick up commit 0618764cb25 it should get reverted. Signed-off-by: Rabin Vincent <rabin.vincent@axis.com> Reviewed-by: Horia Geanta <horia.geanta@freescale.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06dm crypt: fix deadlock when async crypto algorithm returns -EBUSYBen Collins
commit 0618764cb25f6fa9fb31152995de42a8a0496475 upstream. I suspect this doesn't show up for most anyone because software algorithms typically don't have a sense of being too busy. However, when working with the Freescale CAAM driver it will return -EBUSY on occasion under heavy -- which resulted in dm-crypt deadlock. After checking the logic in some other drivers, the scheme for crypt_convert() and it's callback, kcryptd_async_done(), were not correctly laid out to properly handle -EBUSY or -EINPROGRESS. Fix this by using the completion for both -EBUSY and -EINPROGRESS. Now crypt_convert()'s use of completion is comparable to af_alg_wait_for_completion(). Similarly, kcryptd_async_done() follows the pattern used in af_alg_complete(). Before this fix dm-crypt would lockup within 1-2 minutes running with the CAAM driver. Fix was regression tested against software algorithms on PPC32 and x86_64, and things seem perfectly happy there as well. Signed-off-by: Ben Collins <ben.c@servergy.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06md/raid0: fix bug with chunksize not a power of 2.NeilBrown
commit 47d68979cc968535cb87f3e5f2e6a3533ea48fbd upstream. Since commit 20d0189b1012a37d2533a87fb451f7852f2418d1 in v3.14-rc1 RAID0 has performed incorrect calculations when the chunksize is not a power of 2. This happens because "sector_div()" modifies its first argument, but this wasn't taken into account in the patch. So restore that first arg before re-using the variable. Reported-by: Joe Landman <joe.landman@gmail.com> Reported-by: Dave Chinner <david@fromorbit.com> Fixes: 20d0189b1012a37d2533a87fb451f7852f2418d1 Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06md: fix md io stats accounting brokenGu Zheng
commit 74672d069b298b03e9f657fd70915e055739882e upstream. Simon reported the md io stats accounting issue: " I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5. It's not abnormal other than it's 3-disk, with one being SSD (sdc) and the other two being write-mostly: Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 345.00 0.00 0.00 0.00 0.00 100.00 md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 58779.00 0.00 0.00 0.00 0.00 100.00 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 100.00 " The cause is commit "18c0b223cf9901727ef3b02da6711ac930b4e5d4" uses the generic_start_io_acct to account the disk stats rather than the open code, but it also introduced the increase to .in_flight[rw] which is needless to md. So we re-use the open code here to fix it. Reported-by: Simon Kirby <sim@hostway.ca> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-26Merge tag 'dm-4.0-fix-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fix from Mike Snitzer: "Fix DM core device cleanup regression -- due to a latent race that was exposed by the bdi changes that were introduced during the 4.0 merge" * tag 'dm-4.0-fix-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: fix add_disk() NULL pointer due to race with free_dev()
2015-03-23dm: fix add_disk() NULL pointer due to race with free_dev()Mike Snitzer
Commit c4db59d31e39 ("fs: don't reassign dirty inodes to default_backing_dev_info") exposed DM to a latent race in free_dev() vs add_disk() in relation to management of the device's minor number. Fix this by refactoring free_dev() to match cleanup order of the alloc_dev() error path. Move cleanup of the gendisk, queue, and bdev to _before_ the cleanup of the idr managed minor number. Also, purely due to cleanup that fell out during the free_dev() audit: - adjust dm_blk_close() to access the gendisk's private_data under the _minor_lock spinlock. - move __dm_destroy()'s dm_get_live_table() call out from under the _minor_lock spinlock. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1202449 Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Reported-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-03-22Merge tag 'md/4.0-rc4-fix' of git://neil.brown.name/mdLinus Torvalds
Pull bugfix for md from Neil Brown: "One fix for md in 4.0-rc4 Regression in recent patch causes crash on error path" * tag 'md/4.0-rc4-fix' of git://neil.brown.name/md: md: fix problems with freeing private data after ->run failure.
2015-03-21Merge tag 'dm-4.0-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull devicemapper fixes from Mike Snitzer: "A handful of stable fixes for DM: - fix thin target to always zero-fill reads to unprovisioned blocks - fix to interlock device destruction's suspend from internal suspends - fix 2 snapshot exception store handover bugs - fix dm-io to cope with DISCARD and WRITE_SAME capabilities changing" * tag 'dm-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm io: deal with wandering queue limits when handling REQ_DISCARD and REQ_WRITE_SAME dm snapshot: suspend merging snapshot when doing exception handover dm snapshot: suspend origin when doing exception handover dm: hold suspend_lock while suspending device during device deletion dm thin: fix to consistently zero-fill reads to unprovisioned blocks
2015-03-21md: fix problems with freeing private data after ->run failure.NeilBrown
If ->run() fails, it can either free the data structures it allocated, or leave that task to ->free() which will be called on failures. However: md.c calls ->free() even if ->private_data is NULL, which causes problems in some personalities. raid0.c frees the data, but doesn't clear ->private_data, which will become a problem when we fix md.c So better fix both these issues at once. Reported-by: Richard W.M. Jones <rjones@redhat.com> Fixes: 5aa61f427e4979be733e4847b9199ff9cc48a47e URL: https://bugzilla.kernel.org/show_bug.cgi?id=94381 Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-27dm io: deal with wandering queue limits when handling REQ_DISCARD and ↵Darrick J. Wong
REQ_WRITE_SAME Since it's possible for the discard and write same queue limits to change while the upper level command is being sliced and diced, fix up both of them (a) to reject IO if the special command is unsupported at the start of the function and (b) read the limits once and let the commands error out on their own if the status happens to change. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-02-27dm snapshot: suspend merging snapshot when doing exception handoverMikulas Patocka
The "dm snapshot: suspend origin when doing exception handover" commit fixed a exception store handover bug associated with pending exceptions to the "snapshot-origin" target. However, a similar problem exists in snapshot merging. When snapshot merging is in progress, we use the target "snapshot-merge" instead of "snapshot-origin". Consequently, during exception store handover, we must find the snapshot-merge target and suspend its associated mapped_device. To avoid lockdep warnings, the target must be suspended and resumed without holding _origins_lock. Introduce a dm_hold() function that grabs a reference on a mapped_device, but unlike dm_get(), it doesn't crash if the device has the DMF_FREEING flag set, it returns an error in this case. In snapshot_resume() we grab the reference to the origin device using dm_hold() while holding _origins_lock (_origins_lock guarantees that the device won't disappear). Then we release _origins_lock, suspend the device and grab _origins_lock again. NOTE to stable@ people: When backporting to kernels 3.18 and older, use dm_internal_suspend and dm_internal_resume instead of dm_internal_suspend_fast and dm_internal_resume_fast. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-02-27dm snapshot: suspend origin when doing exception handoverMikulas Patocka
In the function snapshot_resume we perform exception store handover. If there is another active snapshot target, the exception store is moved from this target to the target that is being resumed. The problem is that if there is some pending exception, it will point to an incorrect exception store after that handover, causing a crash due to dm-snap-persistent.c:get_exception()'s BUG_ON. This bug can be triggered by repeatedly changing snapshot permissions with "lvchange -p r" and "lvchange -p rw" while there are writes on the associated origin device. To fix this bug, we must suspend the origin device when doing the exception store handover to make sure that there are no pending exceptions: - introduce _origin_hash that keeps track of dm_origin structures. - introduce functions __lookup_dm_origin, __insert_dm_origin and __remove_dm_origin that manipulate the origin hash. - modify snapshot_resume so that it calls dm_internal_suspend_fast() and dm_internal_resume_fast() on the origin device. NOTE to stable@ people: When backporting to kernels 3.12-3.18, use dm_internal_suspend and dm_internal_resume instead of dm_internal_suspend_fast and dm_internal_resume_fast. When backporting to kernels older than 3.12, you need to pick functions dm_internal_suspend and dm_internal_resume from the commit fd2ed4d252701d3bbed4cd3e3d267ad469bb832a. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-02-27dm: hold suspend_lock while suspending device during device deletionMikulas Patocka
__dm_destroy() must take the suspend_lock so that its presuspend and postsuspend calls do not race with an internal suspend. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-02-27dm thin: fix to consistently zero-fill reads to unprovisioned blocksJoe Thornber
It was always intended that a read to an unprovisioned block will return zeroes regardless of whether the pool is in read-only or read-write mode. thin_bio_map() was inconsistent with its handling of such reads when the pool is in read-only mode, it now properly zero-fills the bios it returns in response to unprovisioned block reads. Eliminate thin_bio_map()'s special read-only mode handling of -ENODATA and just allow the IO to be deferred to the worker which will result in pool->process_bio() handling the IO (which already properly zero-fills reads to unprovisioned blocks). Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-02-25md: mark some attributes as pre-allocNeilBrown
Since __ATTR_PREALLOC was introduced in v3.19-rc1~78^2~18 it can now be used by md. This ensure that writing to these sysfs attributes will never block due to a memory allocation. Such blocking could become a deadlock if mdmon is trying to reconfigure an array after a failure prior to re-enabling writes. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-25raid5: check faulty flag for array status during recovery.Eric Mei
When we have more than 1 drive failure, it's possible we start rebuild one drive while leaving another faulty drive in array. To determine whether array will be optimal after building, current code only check whether a drive is missing, which could potentially lead to data corruption. This patch is to add checking Faulty flag. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-25md/raid1: fix read balance when a drive is write-mostly.Tomáš Hodek
When a drive is marked write-mostly it should only be the target of reads if there is no other option. This behaviour was broken by commit 9dedf60313fa4dddfd5b9b226a0ef12a512bf9dc md/raid1: read balance chooses idlest disk for SSD which causes a write-mostly device to be *preferred* is some cases. Restore correct behaviour by checking and setting best_dist_disk and best_pending_disk rather than best_disk. We only need to test one of these as they are both changed from -1 or >=0 at the same time. As we leave min_pending and best_dist unchanged, any non-write-mostly device will appear better than the write-mostly device. Reported-by: Tomáš Hodek <tomas.hodek@volny.cz> Reported-by: Dark Penguin <darkpenguin@yandex.ru> Signed-off-by: NeilBrown <neilb@suse.de> Link: http://marc.info/?l=linux-raid&m=135982797322422 Fixes: 9dedf60313fa4dddfd5b9b226a0ef12a512bf9dc Cc: stable@vger.kernel.org (3.6+)
2015-02-21Merge tag 'dm-3.20-changes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull more device mapper changes from Mike Snitzer: - Significant dm-crypt CPU scalability performance improvements thanks to changes that enable effective use of an unbound workqueue across all available CPUs. A large battery of tests were performed to validate these changes, summary of results is available here: https://www.redhat.com/archives/dm-devel/2015-February/msg00106.html - A few additional stable fixes (to DM core, dm-snapshot and dm-mirror) and a small fix to the dm-space-map-disk. * tag 'dm-3.20-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm snapshot: fix a possible invalid memory access on unload dm: fix a race condition in dm_get_md dm crypt: sort writes dm crypt: add 'submit_from_crypt_cpus' option dm crypt: offload writes to thread dm crypt: remove unused io_pool and _crypt_io_pool dm crypt: avoid deadlock in mempools dm crypt: don't allocate pages for a partial request dm crypt: use unbound workqueue for request processing dm io: reject unsupported DISCARD requests with EOPNOTSUPP dm mirror: do not degrade the mirror on discard error dm space map disk: fix sm_disk_count_is_more_than_one()
2015-02-19Merge branch 'kconfig' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild Pull kconfig updates from Michal Marek: "Yann E Morin was supposed to take over kconfig maintainership, but this hasn't happened. So I'm sending a few kconfig patches that I collected: - Fix for missing va_end in kconfig - merge_config.sh displays used if given too few arguments - s/boolean/bool/ in Kconfig files for consistency, with the plan to only support bool in the future" * 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: kconfig: use va_end to match corresponding va_start merge_config.sh: Display usage if given too few arguments kconfig: use bool instead of boolean for type definition attributes
2015-02-18dm snapshot: fix a possible invalid memory access on unloadMikulas Patocka
When the snapshot target is unloaded, snapshot_dtr() waits until pending_exceptions_count drops to zero. Then, it destroys the snapshot. Therefore, the function that decrements pending_exceptions_count should not touch the snapshot structure after the decrement. pending_complete() calls free_pending_exception(), which decrements pending_exceptions_count, and then it performs up_write(&s->lock) and it calls retry_origin_bios() which dereferences s->origin. These two memory accesses to the fields of the snapshot may touch the dm_snapshot struture after it is freed. This patch moves the call to free_pending_exception() to the end of pending_complete(), so that the snapshot will not be destroyed while pending_complete() is in progress. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-02-18dm: fix a race condition in dm_get_mdMikulas Patocka
The function dm_get_md finds a device mapper device with a given dev_t, increases the reference count and returns the pointer. dm_get_md calls dm_find_md, dm_find_md takes _minor_lock, finds the device, tests that the device doesn't have DMF_DELETING or DMF_FREEING flag, drops _minor_lock and returns pointer to the device. dm_get_md then calls dm_get. dm_get calls BUG if the device has the DMF_FREEING flag, otherwise it increments the reference count. There is a possible race condition - after dm_find_md exits and before dm_get is called, there are no locks held, so the device may disappear or DMF_FREEING flag may be set, which results in BUG. To fix this bug, we need to call dm_get while we hold _minor_lock. This patch renames dm_find_md to dm_get_md and changes it so that it calls dm_get while holding the lock. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-02-17Merge tag 'md/3.20-fixes' of git://neil.brown.name/mdLinus Torvalds
Pull md bugfixes from Neil Brown: "Three bug md fixes for 3.20 yet-another-livelock in raid5, and a problem with write errors to 4K-block devices" * tag 'md/3.20-fixes' of git://neil.brown.name/md: md/raid5: Fix livelock when array is both resyncing and degraded. md/raid10: round up to bdev_logical_block_size in narrow_write_error. md/raid1: round up to bdev_logical_block_size in narrow_write_error
2015-02-18md/raid5: Fix livelock when array is both resyncing and degraded.NeilBrown
Commit a7854487cd7128a30a7f4f5259de9f67d5efb95f: md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write. Causes an RCW cycle to be forced even when the array is degraded. A degraded array cannot support RCW as that requires reading all data blocks, and one may be missing. Forcing an RCW when it is not possible causes a live-lock and the code spins, repeatedly deciding to do something that cannot succeed. So change the condition to only force RCW on non-degraded arrays. Reported-by: Manibalan P <pmanibalan@amiindia.co.in> Bisected-by: Jes Sorensen <Jes.Sorensen@redhat.com> Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de> Fixes: a7854487cd7128a30a7f4f5259de9f67d5efb95f Cc: stable@vger.kernel.org (v3.7+)
2015-02-16dm crypt: sort writesMikulas Patocka
Write requests are sorted in a red-black tree structure and are submitted in the sorted order. In theory the sorting should be performed by the underlying disk scheduler, however, in practice the disk scheduler only accepts and sorts a finite number of requests. To allow the sorting of all requests, dm-crypt needs to implement its own sorting. The overhead associated with rbtree-based sorting is considered negligible so it is not used conditionally. Even on SSD sorting can be beneficial since in-order request dispatch promotes lower latency IO completion to the upper layers. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-16dm crypt: add 'submit_from_crypt_cpus' optionMikulas Patocka
Make it possible to disable offloading writes by setting the optional 'submit_from_crypt_cpus' table argument. There are some situations where offloading write bios from the encryption threads to a single thread degrades performance significantly. The default is to offload write bios to the same thread because it benefits CFQ to have writes submitted using the same IO context. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-16dm crypt: offload writes to threadMikulas Patocka
Submitting write bios directly in the encryption thread caused serious performance degradation. On a multiprocessor machine, encryption requests finish in a different order than they were submitted. Consequently, write requests would be submitted in a different order and it could cause severe performance degradation. Move the submission of write requests to a separate thread so that the requests can be sorted before submitting. But this commit improves dm-crypt performance even without having dm-crypt perform request sorting (in particular it enables IO schedulers like CFQ to sort more effectively). Note: it is required that a previous commit ("dm crypt: don't allocate pages for a partial request") be applied before applying this patch. Otherwise, this commit could introduce a crash. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-16dm crypt: remove unused io_pool and _crypt_io_poolMikulas Patocka
The previous commit ("dm crypt: don't allocate pages for a partial request") stopped using the io_pool slab mempool and backing _crypt_io_pool kmem cache. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-16dm crypt: avoid deadlock in mempoolsMikulas Patocka
Fix a theoretical deadlock introduced in the previous commit ("dm crypt: don't allocate pages for a partial request"). The function crypt_alloc_buffer may be called concurrently. If we allocate from the mempool concurrently, there is a possibility of deadlock. For example, if we have mempool of 256 pages, two processes, each wanting 256, pages allocate from the mempool concurrently, it may deadlock in a situation where both processes have allocated 128 pages and the mempool is exhausted. To avoid such a scenario we allocate the pages under a mutex. In order to not degrade performance with excessive locking, we try non-blocking allocations without a mutex first and if that fails, we fallback to a blocking allocations with a mutex. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-16dm crypt: don't allocate pages for a partial requestMikulas Patocka
Change crypt_alloc_buffer so that it only ever allocates pages for a full request. This is a prerequisite for the commit "dm crypt: offload writes to thread". This change simplifies the dm-crypt code at the expense of reduced throughput in low memory conditions (where allocation for a partial request is most useful). Note: the next commit ("dm crypt: avoid deadlock in mempools") is needed to fix a theoretical deadlock. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-16dm crypt: use unbound workqueue for request processingMikulas Patocka
Use unbound workqueue by default so that work is automatically balanced between available CPUs. The original behavior of encrypting using the same cpu that IO was submitted on can still be enabled by setting the optional 'same_cpu_crypt' table argument. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-16md/raid10: round up to bdev_logical_block_size in narrow_write_error.NeilBrown
RAID10 version of earlier fix for RAID1. We must never initiate IO with sizes less that logical_block_size. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-16md/raid1: round up to bdev_logical_block_size in narrow_write_errorNate Dailey
This modifies raid1's narrow_write_error to round up block_sectors to the device's logical block size. This prevents sd complaining about "Bad block number requested" for non-512-byte sector disks. Signed-off-by: Nate Dailey <nate.dailey@stratus.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-13dm io: reject unsupported DISCARD requests with EOPNOTSUPPDarrick J. Wong
I created a dm-raid1 device backed by a device that supports DISCARD and another device that does NOT support DISCARD with the following dm configuration: # echo '0 2048 mirror core 1 512 2 /dev/sda 0 /dev/sdb 0' | dmsetup create moo # lsblk -D NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO sda 0 4K 1G 0 `-moo (dm-0) 0 4K 1G 0 sdb 0 0B 0B 0 `-moo (dm-0) 0 4K 1G 0 Notice that the mirror device /dev/mapper/moo advertises DISCARD support even though one of the mirror halves doesn't. If I issue a DISCARD request (via fstrim, mount -o discard, or ioctl BLKDISCARD) through the mirror, kmirrord gets stuck in an infinite loop in do_region() when it tries to issue a DISCARD request to sdb. The problem is that when we call do_region() against sdb, num_sectors is set to zero because q->limits.max_discard_sectors is zero. Therefore, "remaining" never decreases and the loop never terminates. To fix this: before entering the loop, check for the combination of REQ_DISCARD and no discard and return -EOPNOTSUPP to avoid hanging up the mirror device. This bug was found by the unfortunate coincidence of pvmove and a discard operation in the RHEL 6.5 kernel; upstream is also affected. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-02-13dm mirror: do not degrade the mirror on discard errorMikulas Patocka
It may be possible that a device claims discard support but it rejects discards with -EOPNOTSUPP. It happens when using loopback on ext2/ext3 filesystem driven by the ext4 driver. It may also happen if the underlying devices are moved from one disk on another. If discard error happens, we reject the bio with -EOPNOTSUPP, but we do not degrade the array. This patch fixes failed test shell/lvconvert-repair-transient.sh in the lvm2 testsuite if the testsuite is extracted on an ext2 or ext3 filesystem and it is being driven by the ext4 driver. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-02-13dm space map disk: fix sm_disk_count_is_more_than_one()Mike Snitzer
dm_tm_shadow_block() is the only caller of dm_sm_count_is_more_than_one() which only ever operates on a metadata space-map. So in practice, sm_disk_count_is_more_than_one() isn't actually used (which explains why this bug never amounted to anything). But fix sm_disk_count_is_more_than_one() to properly set *result and return 0. Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-12Merge tag 'dm-3.20-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper changes from Mike Snitzer: - The most significant change this cycle is request-based DM now supports stacking ontop of blk-mq devices. This blk-mq support changes the model request-based DM uses for cloning a request to relying on calling blk_get_request() directly from the underlying blk-mq device. An early consumer of this code is Intel's emerging NVMe hardware; thanks to Keith Busch for working on, and pushing for, these changes. - A few other small fixes and cleanups across other DM targets. * tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: inherit QUEUE_FLAG_SG_GAPS flags from underlying queues dm snapshot: remove unnecessary NULL checks before vfree() calls dm mpath: simplify failure path of dm_multipath_init() dm thin metadata: remove unused dm_pool_get_data_block_size() dm ioctl: fix stale comment above dm_get_inactive_table() dm crypt: update url in CONFIG_DM_CRYPT help text dm bufio: fix time comparison to use time_after_eq() dm: use time_in_range() and time_after() dm raid: fix a couple integer overflows dm table: train hybrid target type detection to select blk-mq if appropriate dm: allocate requests in target when stacking on blk-mq devices dm: prepare for allocating blk-mq clone requests in target dm: submit stacked requests in irq enabled context dm: split request structure out from dm_rq_target_io structure dm: remove exports for request-based interfaces without external callers
2015-02-12Merge branch 'for-3.20/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block IO changes from Jens Axboe: "This contains: - A series from Christoph that cleans up and refactors various parts of the REQ_BLOCK_PC handling. Contributions in that series from Dongsu Park and Kent Overstreet as well. - CFQ: - A bug fix for cfq for realtime IO scheduling from Jeff Moyer. - A stable patch fixing a potential crash in CFQ in OOM situations. From Konstantin Khlebnikov. - blk-mq: - Add support for tag allocation policies, from Shaohua. This is a prep patch enabling libata (and other SCSI parts) to use the blk-mq tagging, instead of rolling their own. - Various little tweaks from Keith and Mike, in preparation for DM blk-mq support. - Minor little fixes or tweaks from me. - A double free error fix from Tony Battersby. - The partition 4k issue fixes from Matthew and Boaz. - Add support for zero+unprovision for blkdev_issue_zeroout() from Martin" * 'for-3.20/core' of git://git.kernel.dk/linux-block: (27 commits) block: remove unused function blk_bio_map_sg block: handle the null_mapped flag correctly in blk_rq_map_user_iov blk-mq: fix double-free in error path block: prevent request-to-request merging with gaps if not allowed blk-mq: make blk_mq_run_queues() static dm: fix multipath regression due to initializing wrong request cfq-iosched: handle failure of cfq group allocation block: Quiesce zeroout wrapper block: rewrite and split __bio_copy_iov() block: merge __bio_map_user_iov into bio_map_user_iov block: merge __bio_map_kern into bio_map_kern block: pass iov_iter to the BLOCK_PC mapping functions block: add a helper to free bio bounce buffer pages block: use blk_rq_map_user_iov to implement blk_rq_map_user block: simplify bio_map_kern block: mark blk-mq devices as stackable block: keep established cmd_flags when cloning into a blk-mq request block: add blk-mq support to blk_insert_cloned_request() block: require blk_rq_prep_clone() be given an initialized clone request blk-mq: add tag allocation policy ...
2015-02-12Merge tag 'md/3.20' of git://neil.brown.name/mdLinus Torvalds
Pull md updates from Neil Brown: - assorted locking changes so that access to /proc/mdstat and much of /sys/block/mdXX/md/* is protected by a spinlock rather than a mutex and will never block indefinitely. - Make an 'if' condition in RAID5 - which has been implicated in recent bugs - more readable. - misc minor fixes * tag 'md/3.20' of git://neil.brown.name/md: (28 commits) md/raid10: fix conversion from RAID0 to RAID10 md: wakeup thread upon rdev_dec_pending() md: make reconfig_mutex optional for writes to md sysfs files. md: move mddev_lock and related to md.h md: use mddev->lock to protect updates to resync_{min,max}. md: minor cleanup in safe_delay_store. md: move GET_BITMAP_FILE ioctl out from mddev_lock. md: tidy up set_bitmap_file md: remove unnecessary 'buf' from get_bitmap_file. md: remove mddev_lock from rdev_attr_show() md: remove mddev_lock() from md_attr_show() md/raid5: use ->lock to protect accessing raid5 sysfs attributes. md: remove need for mddev_lock() in md_seq_show() md/bitmap: protect clearing of ->bitmap by mddev->lock md: protect ->pers changes with mddev->lock md: level_store: group all important changes into one place. md: rename ->stop to ->free md: split detach operation out from ->stop. md/linear: remove rcu protections in favour of suspend/resume md: make merge_bvec_fn more robust in face of personality changes. ...
2015-02-12md/raid10: fix conversion from RAID0 to RAID10NeilBrown
A RAID0 array (like a LINEAR array) does not have a concept of 'size' being the amount of each device that is in use. Rather, as much of each device as is available is used. So the 'size' is set to 0 and ignored. RAID10 does have this concept and needs it to be set correctly. So when we convert RAID0 to RAID10 we must determine the 'size' (that being the size of the first 'strip_zone' in the RAID0), and set it correctly. Reported-and-tested-by: Xiao Ni <xni@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-11dm: inherit QUEUE_FLAG_SG_GAPS flags from underlying queuesKeith Busch
A DM device must inherit the QUEUE_FLAG_SG_GAPS flags from its underlying block devices' request queues. This fixes problems when submitting cloned requests to multipathed devices requiring virtually contiguous buffers. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "The main RCU changes in this cycle are: - Documentation updates. - Miscellaneous fixes. - Preemptible-RCU fixes, including fixing an old bug in the interaction of RCU priority boosting and CPU hotplug. - SRCU updates. - RCU CPU stall-warning updates. - RCU torture-test updates" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) rcu: Initialize tiny RCU stall-warning timeouts at boot rcu: Fix RCU CPU stall detection in tiny implementation rcu: Add GP-kthread-starvation checks to CPU stall warnings rcu: Make cond_resched_rcu_qs() apply to normal RCU flavors rcu: Optionally run grace-period kthreads at real-time priority ksoftirqd: Use new cond_resched_rcu_qs() function ksoftirqd: Enable IRQs and call cond_resched() before poking RCU rcutorture: Add more diagnostics in rcu_barrier() test failure case torture: Flag console.log file to prevent holdovers from earlier runs torture: Add "-enable-kvm -soundhw pcspk" to qemu command line rcutorture: Handle different mpstat versions rcutorture: Check from beginning to end of grace period rcu: Remove redundant rcu_batches_completed() declaration rcutorture: Drop rcu_torture_completed() and friends rcu: Provide rcu_batches_completed_sched() for TINY_RCU rcutorture: Use unsigned for Reader Batch computations rcutorture: Make build-output parsing correctly flag RCU's warnings rcu: Make _batches_completed() functions return unsigned long rcutorture: Issue warnings on close calls due to Reader Batch blows documentation: Fix smp typo in memory-barriers.txt ...
2015-02-09dm snapshot: remove unnecessary NULL checks before vfree() callsMarkus Elfring
The vfree() function performs input parameter validation. Thus the NULL pointer test around vfree() calls is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09dm mpath: simplify failure path of dm_multipath_init()Johannes Thumshirn
Currently the cleanup of all error cases are open-coded. Introduce a common exit path and labels. Signed-off-by: Johannes Thumshirn <morbidrsa@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09dm thin metadata: remove unused dm_pool_get_data_block_size()Rickard Strandqvist
The thin-pool target doesn't display the data block size as part of its table status, unlike the dm-cache target, so there is no need for dm_pool_get_data_block_size(). This was found using cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> Signed-off-by: Mike Snitzer <snitzer@redhat.com>