summaryrefslogtreecommitdiff
path: root/drivers/md/md.c
AgeCommit message (Collapse)Author
2014-01-30Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block IO changes from Jens Axboe: "The major piece in here is the immutable bio_ve series from Kent, the rest is fairly minor. It was supposed to go in last round, but various issues pushed it to this release instead. The pull request contains: - Various smaller blk-mq fixes from different folks. Nothing major here, just minor fixes and cleanups. - Fix for a memory leak in the error path in the block ioctl code from Christian Engelmayer. - Header export fix from CaiZhiyong. - Finally the immutable biovec changes from Kent Overstreet. This enables some nice future work on making arbitrarily sized bios possible, and splitting more efficient. Related fixes to immutable bio_vecs: - dm-cache immutable fixup from Mike Snitzer. - btrfs immutable fixup from Muthu Kumar. - bio-integrity fix from Nic Bellinger, which is also going to stable" * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits) xtensa: fixup simdisk driver to work with immutable bio_vecs block/blk-mq-cpu.c: use hotcpu_notifier() blk-mq: for_each_* macro correctness block: Fix memory leak in rw_copy_check_uvector() handling bio-integrity: Fix bio_integrity_verify segment start bug block: remove unrelated header files and export symbol blk-mq: uses page->list incorrectly blk-mq: use __smp_call_function_single directly btrfs: fix missing increment of bi_remaining Revert "block: Warn and free bio if bi_end_io is not set" block: Warn and free bio if bi_end_io is not set blk-mq: fix initializing request's start time block: blk-mq: don't export blk_mq_free_queue() block: blk-mq: make blk_sync_queue support mq block: blk-mq: support draining mq queue dm cache: increment bi_remaining when bi_end_io is restored block: fixup for generic bio chaining block: Really silence spurious compiler warnings block: Silence spurious compiler warnings block: Kill bio_pair_split() ...
2014-01-24Merge tag 'md/3.14' of git://neil.brown.name/mdLinus Torvalds
Pull md updates from Neil Brown: "All bug fixes, two tagged for -stable" * tag 'md/3.14' of git://neil.brown.name/md: md/raid5: close recently introduced race in stripe_head management. md/raid5: fix long-standing problem with bitmap handling on write failure. md: check command validity early in md_ioctl(). md: ensure metadata is writen after raid level change. md/raid10: avoid fullsync when not necessary. md: allow a partially recovered device to be hot-added to an array. md: Change handling of save_raid_disk and metadata update during recovery.
2014-01-16md: check command validity early in md_ioctl().Nicolas Schichan
Verify that the cmd parameter passed to md_ioctl() is valid before doing anything. This fixes mddev->hold_active being set to 0 when an invalid ioctl command is passed to md_ioctl() before the array has been configured. Clearing mddev->hold_active in that case can lead to a livelock situation when an invalid ioctl number is given to md_ioctl() by a process when the mddev is currently being opened by another process: Process 1 Process 2 --------- --------- md_alloc() mddev_find() -> returns a new mddev with hold_active == UNTIL_IOCTL add_disk() -> sends KOBJ_ADD uevent (sees KOBJ_ADD uevent for device) md_open() md_ioctl(INVALID_IOCTL) -> returns ENODEV and clears mddev->hold_active md_release() md_put() -> deletes the mddev as hold_active is 0 md_open() mddev_find() -> returns a newly allocated mddev with mddev->gendisk == NULL -> returns with ERESTARTSYS (kernel restarts the open syscall) Signed-off-by: Nicolas Schichan <nschichan@freebox.fr> Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-15Merge tag 'md/3.13-fixes' of git://neil.brown.name/mdLinus Torvalds
Pull late md fixes from Neil Brown: "Half a dozen md bug fixes. All of these fix real bugs the people have hit, and are tagged for -stable. Sorry they are late .... Christmas holidays and all that. Hopefully they can still squeak into 3.13" * tag 'md/3.13-fixes' of git://neil.brown.name/md: md: fix problem when adding device to read-only array with bitmap. md/raid10: fix bug when raid10 recovery fails to recover a block. md/raid5: fix a recently broken BUG_ON(). md/raid1: fix request counting bug in new 'barrier' code. md/raid10: fix two bugs in handling of known-bad-blocks. md/raid5: Fix possible confusion when multiple write errors occur.
2014-01-14md: ensure metadata is writen after raid level change.NeilBrown
level_store() currently does not make sure the metadata is updates to reflect the new raid level. It simply sets MD_CHANGE_DEVS. Any level with a ->thread will quickly notice this and update the metadata. However RAID0 and Linear do not have a thread so no metadata update happens until the array is stopped. At that point the metadata is written. This is later that we would like. While the delay doesn't risk any data it can cause confusion. So if there is no md thread, immediately update the metadata after a level change. Reported-by: Richard Michael <rmichael@edgeofthenet.org> Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14md: allow a partially recovered device to be hot-added to an array.NeilBrown
When adding a new device into an array it is normally important to clear any stale data from ->recovery_offset else the new device may not be recovered properly. However when re-adding a device which is known to be nearly in-sync, this is not needed and can be detrimental. The (bitmap-based) resync will still happen, and further recovery is only needed from where-ever it was already up to. So if save_raid_disk is set, signifying a re-add, don't clear ->recovery_offset. Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14md: Change handling of save_raid_disk and metadata update during recovery.NeilBrown
Since commit d70ed2e4fafdbef0800e739 MD: Allow restarting an interrupted incremental recovery. we don't write out the metadata to devices while they are recovering. This had a good reason, but has unfortunate consequences. This patch changes things to make them work better. At issue is what happens if the array is shut down while a recovery is happening, particularly a bitmap-guided recovery. Ideally the recovery should pick up where it left off. However the metadata cannot represent the state "A recovery is in process which is guided by the bitmap". Before the above mentioned commit, we wrote metadata to the device which said "this is being recovered and it is up to <here>". So after a restart, a full recovery (not bitmap-guided) would happen from where-ever it was up to. After the commit the metadata wasn't updated so it still said "This device is fully in sync with <this> event count". That leads to a bitmap-based recovery following the whole bitmap, which should be a lot less work than a full recovery from some starting point. So this was an improvement. However updates some metadata but not all leads to other problems. In particular, the metadata written to the fully-up-to-date device record that the array has all devices present (even though some are recovering). So on restart, mdadm wants to find all devices and expects them to have current event counts. Obviously it doesn't (some have old event counts) so (when assembling with --incremental) it waits indefinitely for the rest of the expected devices. It really is wrong to not update all the metadata together. Do that is bound to cause confusion. Instead, we should make it possible to record the truth in the metadata. i.e. we need to be able to record that a device is being recovered based on the bitmap. We already have a Feature flag to say that recovery is happening. We now add another one to say that it is a bitmap-based recovery. With this we can remove the code that disables the write-out of metadata on some devices. So this patch: - moves the setting of 'saved_raid_disk' from add_new_disk to the validate_super methods. This makes sure it is always set properly, both when adding a new device to an array, and when assembling an array from a collection of devices. - Adds a metadata flag MD_FEATURE_RECOVERY_BITMAP which is only used if MD_FEATURE_RECOVERY_OFFSET is set, and record that a bitmap-based recovery is allowed. This is only present in v1.x metadata. v0.90 doesn't support devices which are in the middle of recovery at all. - Only skips writing metadata to Faulty devices. - Also allows rdev state to be set to "-insync" via sysfs. This can be used for external-metadata arrays. When the 'role' is set the device is assumed to be in-sync. If, after setting the role, we set the state to "-insync", the role is moved to saved_raid_disk which effectively says the device is partly in-sync with that slot and needs a bitmap recovery. Cc: Andrei Warkentin <andreiw@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14md: fix problem when adding device to read-only array with bitmap.NeilBrown
If an array is started degraded, and then the missing device is found it can be re-added and a minimal bitmap-based recovery will bring it fully up-to-date. If the array is read-only a recovery would not be allowed. But also if the array is read-only and the missing device was present very recently, then there could be no need for any recovery at all, so we simply include the device in the read-only array without any recovery. However... if the missing device was removed a little longer ago it could be missing some updates, but if a bitmap is present it will be conditionally accepted pending a bitmap-based update. We don't currently detect this case properly and will include that old device into the read-only array with no recovery even though it really needs a recovery. This patch keeps track of whether a bitmap-based-recovery is really needed or not in the new Bitmap_sync rdev flag. If that is set, then the device will not be added to a read-only array. Cc: Andrei Warkentin <andreiw@vmware.com> Fixes: d70ed2e4fafdbef0800e73942482bb075c21578b Cc: stable@vger.kernel.org (3.2+) Signed-off-by: NeilBrown <neilb@suse.de>
2013-12-31Merge tag 'v3.13-rc6' into for-3.14/coreJens Axboe
Needed to bring blk-mq uptodate, since changes have been going in since for-3.14/core was established. Fixup merge issues related to the immutable biovec changes. Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: block/blk-flush.c fs/btrfs/check-integrity.c fs/btrfs/extent_io.c fs/btrfs/scrub.c fs/logfs/dev_bdev.c
2013-12-05Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer fixes from Jens Axboe: "A small collection of fixes for the current series. It contains: - A fix for a use-after-free of a request in blk-mq. From Ming Lei - A fix for a blk-mq bug that could attempt to dereference a NULL rq if allocation failed - Two xen-blkfront small fixes - Cleanup of submit_bio_wait() type uses in the kernel, unifying that. From Kent - A fix for 32-bit blkg_rwstat reading. I apologize for this one looking mangled in the shortlog, it's entirely my fault for missing an empty line between the description and body of the text" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: fix use-after-free of request blk-mq: fix dereference of rq->mq_ctx if allocation fails block: xen-blkfront: Fix possible NULL ptr dereference xen-blkfront: Silence pfn maybe-uninitialized warning block: submit_bio_wait() conversions Update of blkg_stat and blkg_rwstat may happen in bh context
2013-11-28md: test mddev->flags more safely in md_check_recovery.NeilBrown
commit 7a0a5355cbc71efa md: Don't test all of mddev->flags at once. made most tests on mddev->flags safer, but missed one. When commit 260fa034ef7a4ff8b7306 md: avoid deadlock when dirty buffers during md_stop. added MD_STILL_CLOSED, this caused md_check_recovery to misbehave. It can think there is something to do but find nothing. This can lead to the md thread spinning during array shutdown. https://bugzilla.kernel.org/show_bug.cgi?id=65721 Reported-and-tested-by: Richard W.M. Jones <rjones@redhat.com> Fixes: 260fa034ef7a4ff8b7306 Cc: stable@vger.kernel.org (3.12) Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-24block: submit_bio_wait() conversionsKent Overstreet
It was being open coded in a few places. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: Chris Mason <chris.mason@fusionio.com> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-23block: Abstract out bvec iteratorKent Overstreet
Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6
2013-11-23block: submit_bio_wait() conversionsKent Overstreet
It was being open coded in a few places. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: Chris Mason <chris.mason@fusionio.com> Acked-by: NeilBrown <neilb@suse.de>
2013-11-20Merge tag 'md/3.13' of git://neil.brown.name/mdLinus Torvalds
Pull md update from Neil Brown: "Mostly optimisations and obscure bug fixes. - raid5 gets less lock contention - raid1 gets less contention between normal-io and resync-io during resync" * tag 'md/3.13' of git://neil.brown.name/md: md/raid5: Use conf->device_lock protect changing of multi-thread resources. md/raid5: Before freeing old multi-thread worker, it should flush them. md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE. UAPI: include <asm/byteorder.h> in linux/raid/md_p.h raid1: Rewrite the implementation of iobarrier. raid1: Add some macros to make code clearly. raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array. raid1: Add a field array_frozen to indicate whether raid in freeze state. md: Convert use of typedef ctl_table to struct ctl_table md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes. md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread. md: fix some places where mddev_lock return value is not checked. raid5: Retry R5_ReadNoMerge flag when hit a read error. raid5: relieve lock contention in get_active_stripe() raid5: relieve lock contention in get_active_stripe() wait: add wait_event_cmd() md/raid5.c: add proper locking to error path of raid5_start_reshape. md: fix calculation of stacking limits on level change. raid5: Use slow_path to release stripe when mddev->thread is null
2013-11-19md: Convert use of typedef ctl_table to struct ctl_tableJoe Perches
This typedef is unnecessary and should just be removed. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-19md/raid5: avoid deadlock when raid5 array has unack badblocks during ↵NeilBrown
md_stop_writes. When raid5 recovery hits a fresh badblock, this badblock will flagged as unack badblock until md_update_sb() is called. But md_stop will take reconfig lock which means raid5d can't call md_update_sb() in md_check_recovery(), the badblock will always be unack, so raid5d thread enters an infinite loop and md_stop_write() can never stop sync_thread. This causes deadlock. To solve this, when STOP_ARRAY ioctl is issued and sync_thread is running, we need set md->recovery FROZEN and INTR flags and wait for sync_thread to stop before we (re)take reconfig lock. This requires that raid5 reshape_request notices MD_RECOVERY_INTR (which it probably should have noticed anyway) and stops waiting for a metadata update in that case. Reported-by: Jianpeng Ma <majianpeng@gmail.com> Reported-by: Bian Yu <bianyu@kedacom.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-19md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.NeilBrown
We currently use kthread_should_stop() in various places in the sync/reshape code to abort early. However some places set MD_RECOVERY_INTR but don't immediately call md_reap_sync_thread() (and we will shortly get another one). When this happens we are relying on md_check_recovery() to reap the thread and that only happen when it finishes normally. So MD_RECOVERY_INTR must lead to a normal finish without the kthread_should_stop() test. So replace all relevant tests, and be more careful when the thread is interrupted not to acknowledge that latest step in a reshape as it may not be fully committed yet. Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop so we don't wait have to wait for the speed to drop before we can abort. Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-19md: fix some places where mddev_lock return value is not checked.NeilBrown
Sometimes we need to lock and mddev and cannot cope with failure due to interrupt. In these cases we should use mutex_lock, not mutex_lock_interruptible. Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14md: fix calculation of stacking limits on level change.NeilBrown
The various ->run routines of md personalities assume that the 'queue' has been initialised by the blk_set_stacking_limits() call in md_alloc(). However when the level is changed (by level_store()) the ->run routine for the new level is called for an array which has already had the stacking limits modified. This can result in incorrect final settings. So call blk_set_stacking_limits() before ->run in level_store(). A specific consequence of this bug is that it causes discard_granularity to be set incorrectly when reshaping a RAID4 to a RAID0. This is suitable for any -stable kernel since 3.3 in which blk_set_stacking_limits() was introduced. Cc: stable@vger.kernel.org (3.3+) Reported-and-tested-by: "Baldysiak, Pawel" <pawel.baldysiak@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block IO core updates from Jens Axboe: "This is the pull request for the core changes in the block layer for 3.13. It contains: - The new blk-mq request interface. This is a new and more scalable queueing model that marries the best part of the request based interface we currently have (which is fully featured, but scales poorly) and the bio based "interface" which the new drivers for high IOPS devices end up using because it's much faster than the request based one. The bio interface has no block layer support, since it taps into the stack much earlier. This means that drivers end up having to implement a lot of functionality on their own, like tagging, timeout handling, requeue, etc. The blk-mq interface provides all these. Some drivers even provide a switch to select bio or rq and has code to handle both, since things like merging only works in the rq model and hence is faster for some workloads. This is a huge mess. Conversion of these drivers nets us a substantial code reduction. Initial results on converting SCSI to this model even shows an 8x improvement on single queue devices. So while the model was intended to work on the newer multiqueue devices, it has substantial improvements for "classic" hardware as well. This code has gone through extensive testing and development, it's now ready to go. A pull request is coming to convert virtio-blk to this model will be will be coming as well, with more drivers scheduled for 3.14 conversion. - Two blktrace fixes from Jan and Chen Gang. - A plug merge fix from Alireza Haghdoost. - Conversion of __get_cpu_var() from Christoph Lameter. - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven. - A fix for a race between request completion and the timeout handling from Jeff Moyer. This is what caused the merge conflict with blk-mq/core, in case you are looking at that. - A dm stacking fix from Mike Snitzer. - A code consolidation fix and duplicated code removal from Kent Overstreet. - A handful of block bug fixes from Mikulas Patocka, fixing a loop crash and memory corruption on blk cg. - Elevator switch bug fix from Tomoki Sekiyama. A heads-up that I had to rebase this branch. Initially the immutable bio_vecs had been queued up for inclusion, but a week later, it became clear that it wasn't fully cooked yet. So the decision was made to pull this out and postpone it until 3.14. It was a straight forward rebase, just pruning out the immutable series and the later fixes of problems with it. The rest of the patches applied directly and no further changes were made" * 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits) block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO block: Do not call sector_div() with a 64-bit divisor kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup() block: Consolidate duplicated bio_trim() implementations block: Use rw_copy_check_uvector() block: Enable sysfs nomerge control for I/O requests in the plug list block: properly stack underlying max_segment_size to DM device elevator: acquire q->sysfs_lock in elevator_change() elevator: Fix a race in elevator switching and md device initialization block: Replace __get_cpu_var uses bdi: test bdi_init failure block: fix a probe argument to blk_register_region loop: fix crash if blk_alloc_queue fails blk-core: Fix memory corruption if blkcg_init_queue fails block: fix race between request completion and timeout handling blktrace: Send BLK_TN_PROCESS events to all running traces blk-mq: don't disallow request merges for req->special being set blk-mq: mq plug list breakage blk-mq: fix for flush deadlock ...
2013-11-08block: Consolidate duplicated bio_trim() implementationsKent Overstreet
Someone cut and pasted md's md_trim_bio() into xen-blkfront.c. Come on, we should know better than this. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-07Merge tag 'driver-core-3.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core / sysfs patches from Greg KH: "Here's the big driver core / sysfs update for 3.13-rc1. There's lots of dev_groups updates for different subsystems, as they all get slowly migrated over to the safe versions of the attribute groups (removing userspace races with the creation of the sysfs files.) Also in here are some kobject updates, devres expansions, and the first round of Tejun's sysfs reworking to enable it to be used by other subsystems as a backend for an in-kernel filesystem. All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (83 commits) sysfs: rename sysfs_assoc_lock and explain what it's about sysfs: use generic_file_llseek() for sysfs_file_operations sysfs: return correct error code on unimplemented mmap() mdio_bus: convert bus code to use dev_groups device: Make dev_WARN/dev_WARN_ONCE print device as well as driver name sysfs: separate out dup filename warning into a separate function sysfs: move sysfs_hash_and_remove() to fs/sysfs/dir.c sysfs: remove unused sysfs_get_dentry() prototype sysfs: honor bin_attr.attr.ignore_lockdep sysfs: merge sysfs_elem_bin_attr into sysfs_elem_attr devres: restore zeroing behavior of devres_alloc() sysfs: fix sysfs_write_file for bin file input: gameport: convert bus code to use dev_groups input: serio: remove bus usage of dev_attrs input: serio: use DEVICE_ATTR_RO() i2o: convert bus code to use dev_groups memstick: convert bus code to use dev_groups tifm: convert bus code to use dev_groups virtio: convert bus code to use dev_groups ipack: convert bus code to use dev_groups ...
2013-10-24md: avoid deadlock when md_set_badblocks.Bian Yu
When operate harddisk and hit errors, md_set_badblocks is called after scsi_restart_operations which already disabled the irq. but md_set_badblocks will call write_sequnlock_irq and enable irq. so softirq can preempt the current thread and that may cause a deadlock. I think this situation should use write_sequnlock_irqsave/irqrestore instead. I met the situation and the call trace is below: [ 638.919974] BUG: spinlock recursion on CPU#0, scsi_eh_13/1010 [ 638.921923] lock: 0xffff8800d4d51fc8, .magic: dead4ead, .owner: scsi_eh_13/1010, .owner_cpu: 0 [ 638.923890] CPU: 0 PID: 1010 Comm: scsi_eh_13 Not tainted 3.12.0-rc5+ #37 [ 638.925844] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS 4.6.5 03/05/2013 [ 638.927816] ffff880037ad4640 ffff880118c03d50 ffffffff8172ff85 0000000000000007 [ 638.929829] ffff8800d4d51fc8 ffff880118c03d70 ffffffff81730030 ffff8800d4d51fc8 [ 638.931848] ffffffff81a72eb0 ffff880118c03d90 ffffffff81730056 ffff8800d4d51fc8 [ 638.933884] Call Trace: [ 638.935867] <IRQ> [<ffffffff8172ff85>] dump_stack+0x55/0x76 [ 638.937878] [<ffffffff81730030>] spin_dump+0x8a/0x8f [ 638.939861] [<ffffffff81730056>] spin_bug+0x21/0x26 [ 638.941836] [<ffffffff81336de4>] do_raw_spin_lock+0xa4/0xc0 [ 638.943801] [<ffffffff8173f036>] _raw_spin_lock+0x66/0x80 [ 638.945747] [<ffffffff814a73ed>] ? scsi_device_unbusy+0x9d/0xd0 [ 638.947672] [<ffffffff8173fb1b>] ? _raw_spin_unlock+0x2b/0x50 [ 638.949595] [<ffffffff814a73ed>] scsi_device_unbusy+0x9d/0xd0 [ 638.951504] [<ffffffff8149ec47>] scsi_finish_command+0x37/0xe0 [ 638.953388] [<ffffffff814a75e8>] scsi_softirq_done+0xa8/0x140 [ 638.955248] [<ffffffff8130e32b>] blk_done_softirq+0x7b/0x90 [ 638.957116] [<ffffffff8104fddd>] __do_softirq+0xfd/0x330 [ 638.958987] [<ffffffff810b964f>] ? __lock_release+0x6f/0x100 [ 638.960861] [<ffffffff8174a5cc>] call_softirq+0x1c/0x30 [ 638.962724] [<ffffffff81004c7d>] do_softirq+0x8d/0xc0 [ 638.964565] [<ffffffff8105024e>] irq_exit+0x10e/0x150 [ 638.966390] [<ffffffff8174ad4a>] smp_apic_timer_interrupt+0x4a/0x60 [ 638.968223] [<ffffffff817499af>] apic_timer_interrupt+0x6f/0x80 [ 638.970079] <EOI> [<ffffffff810b964f>] ? __lock_release+0x6f/0x100 [ 638.971899] [<ffffffff8173fa6a>] ? _raw_spin_unlock_irq+0x3a/0x50 [ 638.973691] [<ffffffff8173fa60>] ? _raw_spin_unlock_irq+0x30/0x50 [ 638.975475] [<ffffffff81562393>] md_set_badblocks+0x1f3/0x4a0 [ 638.977243] [<ffffffff81566e07>] rdev_set_badblocks+0x27/0x80 [ 638.978988] [<ffffffffa00d97bb>] raid5_end_read_request+0x36b/0x4e0 [raid456] [ 638.980723] [<ffffffff811b5a1d>] bio_endio+0x1d/0x40 [ 638.982463] [<ffffffff81304ff3>] req_bio_endio.isra.65+0x83/0xa0 [ 638.984214] [<ffffffff81306b9f>] blk_update_request+0x7f/0x350 [ 638.985967] [<ffffffff81306ea1>] blk_update_bidi_request+0x31/0x90 [ 638.987710] [<ffffffff813085e0>] __blk_end_bidi_request+0x20/0x50 [ 638.989439] [<ffffffff8130862f>] __blk_end_request_all+0x1f/0x30 [ 638.991149] [<ffffffff81308746>] blk_peek_request+0x106/0x250 [ 638.992861] [<ffffffff814a62a9>] ? scsi_kill_request.isra.32+0xe9/0x130 [ 638.994561] [<ffffffff814a633a>] scsi_request_fn+0x4a/0x3d0 [ 638.996251] [<ffffffff813040a7>] __blk_run_queue+0x37/0x50 [ 638.997900] [<ffffffff813045af>] blk_run_queue+0x2f/0x50 [ 638.999553] [<ffffffff814a5750>] scsi_run_queue+0xe0/0x1c0 [ 639.001185] [<ffffffff814a7721>] scsi_run_host_queues+0x21/0x40 [ 639.002798] [<ffffffff814a2e87>] scsi_restart_operations+0x177/0x200 [ 639.004391] [<ffffffff814a4fe9>] scsi_error_handler+0xc9/0xe0 [ 639.005996] [<ffffffff814a4f20>] ? scsi_unjam_host+0xd0/0xd0 [ 639.007600] [<ffffffff81072f6b>] kthread+0xdb/0xe0 [ 639.009205] [<ffffffff81072e90>] ? flush_kthread_worker+0x170/0x170 [ 639.010821] [<ffffffff81748cac>] ret_from_fork+0x7c/0xb0 [ 639.012437] [<ffffffff81072e90>] ? flush_kthread_worker+0x170/0x170 This bug was introduce in commit 2e8ac30312973dd20e68073653 (the first time rdev_set_badblock was call from interrupt context), so this patch is appropriate for 3.5 and subsequent kernels. Cc: <stable@vger.kernel.org> (3.5+) Signed-off-by: Bian Yu <bianyu@kedacom.com> Reviewed-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-09-26sysfs: clean up sysfs_get_dirent()Tejun Heo
The pre-existing sysfs interfaces which take explicit namespace argument are weird in that they place the optional @ns in front of @name which is contrary to the established convention. For example, we end up forcing vast majority of sysfs_get_dirent() users to do sysfs_get_dirent(parent, NULL, name), which is silly and error-prone especially as @ns and @name may be interchanged without causing compilation warning. This renames sysfs_get_dirent() to sysfs_get_dirent_ns() and swap the positions of @name and @ns, and sysfs_get_dirent() is now a wrapper around sysfs_get_dirent_ns(). This makes confusions a lot less likely. There are other interfaces which take @ns before @name. They'll be updated by following patches. This patch doesn't introduce any functional changes. v2: EXPORT_SYMBOL_GPL() wasn't updated leading to undefined symbol error on module builds. Reported by build test robot. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-27md: avoid deadlock when dirty buffers during md_stop.NeilBrown
When the last process closes /dev/mdX sync_blockdev will be called so that all buffers get flushed. So if it is then opened for the STOP_ARRAY ioctl to be sent there will be nothing to flush. However if we open /dev/mdX in order to send the STOP_ARRAY ioctl just moments before some other process which was writing closes their file descriptor, then there won't be a 'last close' and the buffers might not get flushed. So do_md_stop() calls sync_blockdev(). However at this point it is holding ->reconfig_mutex. So if the array is currently 'clean' then the writes from sync_blockdev() will not complete until the array can be marked dirty and that won't happen until some other thread can get ->reconfig_mutex. So we deadlock. We need to move the sync_blockdev() call to before we take ->reconfig_mutex. However then some other thread could open /dev/mdX and write to it after we call sync_blockdev() and before we actually stop the array. This can leave dirty data in the page cache which is awkward. So introduce new flag MD_STILL_CLOSED. Set it before calling sync_blockdev(), clear it if anyone does open the file, and abort the STOP_ARRAY attempt if it gets set before we lock against further opens. It is still possible to get problems if you open /dev/mdX, write to it, then issue the STOP_ARRAY ioctl. Just don't do that. Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-27md: Don't test all of mddev->flags at once.NeilBrown
mddev->flags is mostly used to record if an update of the metadata is needed. Sometimes the whole field is tested instead of just the important bits. This makes it difficult to introduce more state bits. So replace all bare tests of mddev->flags with tests for the bits that actually need testing. Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-27md: Fix apparent cut-and-paste error in super_90_validateDave Jones
Setting a variable to itself probably wasn't the intention here. Signed-off-by: Dave Jones <davej@fedoraproject.org> Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-27md: fix safe_mode buglet.NeilBrown
Whe we set the safe_mode_timeout to a smaller value we trigger a timeout immediately - otherwise the small value might not be honoured. However if the previous timeout was 0 meaning "no timeout", we didn't. This would mean that no timeout happens until the next write completes, which could be a long time. Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-27md: don't call md_allow_write in get_bitmap_file.NeilBrown
There is no really need as GFP_NOIO is very likely sufficient, and failure is not catastrophic. Calling md_allow_write here will convert a read-auto array to read/write which could be confusing when you are just performing a read operation. Signed-off-by: NeilBrown <neilb@suse.de>
2013-07-18md: Remove recent change which allows devices to skip recovery.NeilBrown
commit 7ceb17e87bde79d285a8b988cfed9eaeebe60b86 md: Allow devices to be re-added to a read-only array. allowed a bit more than just that. It also allows devices to be added to a read-write array and to end up skipping recovery. This patch removes the offending piece of code pending a rewrite for a subsequent release. More specifically: If the array has a bitmap, then the device will still need a bitmap based resync ('saved_raid_disk' is set under different conditions is a bitmap is present). If the array doesn't have a bitmap, then this is correct as long as nothing has been written to the array since the metadata was checked by ->validate_super. However there is no locking to ensure that there was no write. Bug was introduced in 3.10 and causes data corruption so patch is suitable for 3.10-stable. Cc: stable@vger.kernel.org (3.10) Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-26MD: Remember the last sync operation that was performedJonathan Brassow
MD: Remember the last sync operation that was performed This patch adds a field to the mddev structure to track the last sync operation that was performed. This is especially useful when it comes to what is recorded in mismatch_cnt in sysfs. If the last operation was "data-check", then it reports the number of descrepancies found by the user-initiated check. If it was a "repair" operation, then it is reporting the number of descrepancies repaired. etc. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14md: replace strict_strto*() with kstrto*()Jingoo Han
The usage of strict_strtoul() is not preferred, because strict_strtoul() is obsolete. Thus, kstrtoul() should be used. Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14md: Wait for md_check_recovery before attempting device removal.Hannes Reinecke
When a device has failed, it needs to be removed from the personality module before it can be removed from the array as a whole. The first step is performed by md_check_recovery() which is called from the raid management thread. So when a HOT_REMOVE ioctl arrives, wait briefly for md_check_recovery to have run. This increases the chance that the ioctl will succeed. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Neil Brown <nfbrown@suse.de>
2013-06-13Merge tag 'md-3.10-fixes' of git://neil.brown.name/mdLinus Torvalds
Pull md bugfixes from Neil Brown: "A few bugfixes for md Some tagged for -stable" * tag 'md-3.10-fixes' of git://neil.brown.name/md: md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place md/raid1,raid10: use freeze_array in place of raise_barrier in various places. md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it. md: md_stop_writes() should always freeze recovery.
2013-06-13md: md_stop_writes() should always freeze recovery.NeilBrown
__md_stop_writes() will currently sometimes freeze recovery. So any caller must be ready for that to happen, and indeed they are. However if __md_stop_writes() doesn't freeze_recovery, then a recovery could start before mddev_suspend() is called, which could be awkward. This can particularly cause problems or dm-raid. So change __md_stop_writes() to always freeze recovery. This is safe and more predicatable. Reported-by: Brassow Jonathan <jbrassow@redhat.com> Tested-by: Brassow Jonathan <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-05-08Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block core updates from Jens Axboe: - Major bit is Kents prep work for immutable bio vecs. - Stable candidate fix for a scheduling-while-atomic in the queue bypass operation. - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging discard bios. - Tejuns changes to convert the writeback thread pool to the generic workqueue mechanism. - Runtime PM framework, SCSI patches exists on top of these in James' tree. - A few random fixes. * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits) relay: move remove_buf_file inside relay_close_buf partitions/efi.c: replace useless kzalloc's by kmalloc's fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read() block: fix max discard sectors limit blkcg: fix "scheduling while atomic" in blk_queue_bypass_start Documentation: cfq-iosched: update documentation help for cfq tunables writeback: expose the bdi_wq workqueue writeback: replace custom worker pool implementation with unbound workqueue writeback: remove unused bdi_pending_list aoe: Fix unitialized var usage bio-integrity: Add explicit field for owner of bip_buf block: Add an explicit bio flag for bios that own their bvec block: Add bio_alloc_pages() block: Convert some code to bio_for_each_segment_all() block: Add bio_for_each_segment_all() bounce: Refactor __blk_queue_bounce to not use bi_io_vec raid1: use bio_copy_data() pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage pktcdvd: use bio_copy_data() block: Add bio_copy_data() ...
2013-05-07block_device_operations->release() should return voidAl Viro
The value passed is 0 in all but "it can never happen" cases (and those only in a couple of drivers) *and* it would've been lost on the way out anyway, even if something tried to pass something meaningful. Just don't bother. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-30md: bad block list should default to disabled.NeilBrown
Maintenance of a bad-block-list currently defaults to 'enabled' and is then disabled when it cannot be supported. This is backwards and causes problem for dm-raid which didn't know to disable it. So fix the defaults, and only enabled for v1.x metadata which explicitly has bad blocks enabled. The problem with dm-raid has been present since badblock support was added in v3.1, so this patch is suitable for any -stable from 3.1 onwards. Cc: stable@vger.kernel.org (3.1+) Reported-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24MD: Export 'md_reap_sync_thread' functionJonathan Brassow
MD: Export 'md_reap_sync_thread' function Make 'md_reap_sync_thread' available to other files, specifically dm-raid.c. - rename reap_sync_thread to md_reap_sync_thread - move the fn after md_check_recovery to match md.h declaration placement - export md_reap_sync_thread Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24md: don't update metadata when stopping a read-only array.NeilBrown
read-only arrays should stay that way as much as possible. Updating the metadata - which could be triggered by a re-add while assembling the array metadata - should be avoided. Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24md: Allow devices to be re-added to a read-only array.NeilBrown
When assembling an array incrementally we might want to make it device available when "enough" devices are present, but maybe not "all" devices are present. If the remaining devices appear before the array is actually used, they should be added transparently. We do this by using the "read-auto" mode where the array acts like it is read-only until a write request arrives. Current an add-device request switches a read-auto array to active. This means that only one device can be added after the array is first made read-auto. This isn't a problem for RAID5, but is not ideal for RAID6 or RAID10. Also we don't really want to switch the array to read-auto at all when re-adding a device as this doesn't really imply any change. So: - remove the "md_update_sb()" call from add_new_disk(). This isn't really needed as just adding a disk doesn't require a metadata update. Instead, just set MD_CHANGE_DEVS. This will effect a metadata update soon enough, once the array is not read-only. - Allow the ADD_NEW_DISK ioctl to succeed without activating a read-auto array, providing the MD_DISK_SYNC flag is set. In this case, the device will be rejected if it cannot be added with the correct device number, or has an incorrect event count. - Teach remove_and_add_spares() to be careful about adding spares when the array is read-only (or read-mostly) - only add devices that are thought to be in-sync, and only do it if the array is in-sync itself. - In md_check_recovery, use remove_and_add_spares in the read-only case, rather than open coding just the 'remove' part of it. Reported-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24md: HOT_DISK_REMOVE shouldn't make a read-auto device active.NeilBrown
If a fail device or a spare is removed from an array, there is not need to make the array 'active'. If/when the array does become active for some other reason the metadata will be update to reflect the removal. If that never happens and the array is stopped while still read-auto, then there is no loss in forgetting the that the device had 'failed'. A read-only array will leave failed devices attached to the array personality, so we need to explicitly call remove_and_add_spares() to free it (clearing Blocked just like we do in store_slot()). Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24md: use common code for all calls to ->hot_remove_disk()NeilBrown
slot_store and remove_and_add_spares both call ->hot_remove_disk(), but with slightly different tests and consequences, which is at least untidy and might be buggy. So modify remove_and_add_spaces() so that it can be asked to remove a specific device, and call it from slot_store(). We also clear the Blocked flag to ensure that doesn't prevent removal. The purpose of Blocked is to prevent automatic removal by the kernel before an error is acknowledged. If the array is read/write then user-space would have not reason to remove a device unless it was known to be 'spare' or 'faulty' in which it would have already cleared the Blocked flag. If the array is read-only, the flag might still be blocked, but there is no harm in clearing the flag for read-only arrays. Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24md: never update metadata when array is read-only.NeilBrown
Normally we don't even try to update the metadata if the array is read-only. However future patches will increase the number of things that can happen on a read-only array, so it is safest to explicitly disable this. Every time that mddev->ro is set to 0, either - md_update_sb will be called again (at least if MD_CHANGE_DEVS is set) or - the mddev->thread is scheduled, which will also run md_update_sb if needed. So this is safe: if the array ever become read-write the metadata will be updated. Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-02Merge branch 'writeback-workqueue' of ↵Jens Axboe
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-3.10/core Tejun writes: ----- This is the pull request for the earlier patchset[1] with the same name. It's only three patches (the first one was committed to workqueue tree) but the merge strategy is a bit involved due to the dependencies. * Because the conversion needs features from wq/for-3.10, block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those workqueue conflicts from flaring up in block tree. * Resolving the issue that Jan and Dave raised about debugging requires arch-wide changes. The patchset is being worked on[2] but it'll have to go through -mm after these changes show up in -next, and not included in this pull request. The three commits are located in the following git branch. git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue Pulling it into block/for-3.10/core produces a conflict in drivers/md/raid5.c between the following two commits. e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available") 2f6db2a707 ("raid5: use bio_reset()") The conflict is trivial - one removes an "if ()" conditional while the other removes "rbi->bi_next = NULL" right above it. We just need to remove both. The merged branch is available at git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge so that you can use it for verification. The test merge commit has proper merge description. While these changes are a bit of pain to route, they make code simpler and even have, while minute, measureable performance gain[3] even on a workload which isn't particularly favorable to showing the benefits of this conversion. ---- Fixed up the conflict. Conflicts: drivers/md/raid5.c Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-23md: Convert md_trim_bio() to use bio_advance()Kent Overstreet
Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de> Acked-by: NeilBrown <neilb@suse.de>
2013-03-20MD: Prevent sysfs operations on uninitialized kobjectsJonathan Brassow
MD: Prevent sysfs operations on uninitialized kobjects Device-mapper does not use sysfs; but when device-mapper is leveraging MD's RAID personalities, MD sometimes attempts to update sysfs. This patch adds checks for 'mddev-kobj.sd' in sysfs_[un]link_rdev to ensure it is about to operate on something valid. This patch also checks for 'mddev->kobj.sd' before calling 'sysfs_notify' in 'remove_and_add_spares'. Although 'sysfs_notify' already makes this check, doing so in 'remove_and_add_spares' prevents an additional mutex operation. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-28md: expedite metadata update when switching read-auto -> activeNeilBrown
If something has failed while the array was read-auto, then when we switch to 'active' we need to update the metadata. This will happen anyway but it is good to expedite it, and also to ensure any failed device has been released by the underlying device before we try to action the ioctl which caused us to switch to 'active' mode. Reported-by: Joe Lawrence <Joe.Lawrence@stratus.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26md: fix two bugs when attempting to resize RAID0 array.NeilBrown
You cannot resize a RAID0 array (in terms of making the devices bigger), but the code doesn't entirely stop you. So: disable setting of the available size on each device for RAID0 and Linear devices. This must not change as doing so can change the effective layout of data. Make sure that the size that raid0_size() reports is accurate, but rounding devices sizes to chunk sizes. As the device sizes cannot change now, this isn't so important, but it is best to be safe. Without this change: mdadm --grow /dev/md0 -z max mdadm --grow /dev/md0 -Z max then read to the end of the array can cause a BUG in a RAID0 array. These bugs have been present ever since it became possible to resize any device, which is a long time. So the fix is suitable for any -stable kerenl. Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>