summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2012-01-11block: initialize request_queue's numa node duringMike Snitzer
commit 5151412dd4338b273afdb107c3772528e9e67d92 upstream. struct request_queue is allocated with __GFP_ZERO so its "node" field is zero before initialization. This causes an oops if node 0 is offline in the page allocator because its zonelists are not initialized. From Dave Young's dmesg: SRAT: Node 1 PXM 2 0-d0000000 SRAT: Node 1 PXM 2 100000000-330000000 SRAT: Node 0 PXM 1 330000000-630000000 Initmem setup node 1 0000000000000000-000000000affb000 ... Built 1 zonelists in Node order, mobility grouping on. ... BUG: unable to handle kernel paging request at 0000000000001c08 IP: [<ffffffff8111c355>] __alloc_pages_nodemask+0xb5/0x870 and __alloc_pages_nodemask+0xb5 translates to a NULL pointer on zonelist->_zonerefs. The fix is to initialize q->node at the time of allocation so the correct node is passed to the slab allocator later. Since blk_init_allocated_queue_node() is no longer needed, merge it with blk_init_allocated_queue(). [rientjes@google.com: changelog, initializing q->node] Reported-by: Dave Young <dyoung@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Tested-by: Dave Young <dyoung@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Change-Id: I24b14588aef6226f3bcdf37e78af61cbe9a31fd2 Reviewed-on: http://git-master/r/74168 Reviewed-by: Varun Wadekar <vwadekar@nvidia.com> Tested-by: Varun Wadekar <vwadekar@nvidia.com>
2012-01-11cfq-iosched: fix cfq_cic_link() race confitionYasuaki Ishimatsu
commit 5eb46851de3904cd1be9192fdacb8d34deadc1fc upstream. cfq_cic_link() has race condition. When some processes which shared ioc issue I/O to same block device simultaneously, cfq_cic_link() returns -EEXIST sometimes. The race condition might stop I/O by following steps: step 1: Process A: Issue an I/O to /dev/sda step 2: Process A: Get an ioc (iocA here) in get_io_context() which does not linked with a cic for the device step 3: Process A: Get a new cic for the device (cicA here) in cfq_alloc_io_context() step 4: Process B: Issue an I/O to /dev/sda step 5: Process B: Get iocA in get_io_context() since process A and B share the same ioc step 6: Process B: Get a new cic for the device (cicB here) in cfq_alloc_io_context() since iocA has not been linked with a cic for the device yet step 7: Process A: Link cicA to iocA in cfq_cic_link() step 8: Process A: Dispatch I/O to driver and finish it step 9: Process B: Try to link cicB to iocA in cfq_cic_link() But it fails with showing "cfq: cic link failed!" kernel message, since iocA has already linked with cicA at step 7. step 10: Process B: Wait for finishig I/O in get_request_wait() The function does not wake up, when there is no I/O to the device. When cfq_cic_link() returns -EEXIST, it means ioc has already linked with cic. So when cfq_cic_link() return -EEXIST, retry cfq_cic_lookup(). Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Change-Id: I679b98b517dcddd7c3568081b50948a786884ad1 Reviewed-on: http://git-master/r/74162 Reviewed-by: Varun Wadekar <vwadekar@nvidia.com> Tested-by: Varun Wadekar <vwadekar@nvidia.com>
2012-01-11cfq-iosched: free cic_index if blkio_alloc_blkg_stats failsmajianpeng
commit 2984ff38ccf6cbc02a7a996a36c7d6f69f3c6146 upstream. If we fail allocating the blkpg stats, we free cfqd and cfgq. But we need to free the IDA cfqd->cic_index as well. Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Change-Id: Ie0b58526fabbd53e2343f9ee0474f2070d717967 Reviewed-on: http://git-master/r/74161 Reviewed-by: Varun Wadekar <vwadekar@nvidia.com> Tested-by: Varun Wadekar <vwadekar@nvidia.com>
2011-11-30block: genhd: Add disk/partition specific uevent callbacks for partition infoSan Mehat
For disk devices, a new uevent parameter 'NPARTS' specifies the number of partitions detected by the kernel. Partition devices get 'PARTN' which specifies the partitions index in the table. Signed-off-by: San Mehat <san@google.com>
2011-11-21block: Always check length of all iov entries in blk_rq_map_user_iov()Ben Hutchings
commit 6b76106d8ef31111d6fc469564b83b5f5542794f upstream. Even after commit 5478755616ae2ef1ce144dded589b62b2a50d575 ("block: check for proper length of iov entries earlier ...") we still won't check for zero-length entries after an unaligned entry. Remove the break-statement, so all entries are checked. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11blk-flush: move the queue kick intoJeff Moyer
commit e67b77c791ca2778198c9e7088f3266ed2da7a55 upstream. A dm-multipath user reported[1] a problem when trying to boot a kernel with commit 4853abaae7e4a2af938115ce9071ef8684fb7af4 (block: fix flush machinery for stacking drivers with differring flush flags) applied. It turns out that an empty flush request can be sent into blk_insert_flush. When the BUG_ON was fixed to allow for this, I/O on the underlying device would stall. The reason is that blk_insert_cloned_request does not kick the queue. In the aforementioned commit, I had added a special case to kick the queue if data was sent down but the queue flags did not require a flush. A better solution is to push the queue kick up into blk_insert_cloned_request. This patch, along with a follow-on which fixes the BUG_ON, fixes the issue reported. [1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.html Reported-by: Christophe Saout <christophe@saout.de> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11blk-flush: fix invalid BUG_ON in blk_insert_flushJeff Moyer
commit 834f9f61a525d2f6d3d0c93894e26326c8d3ceed upstream. A user reported a regression due to commit 4853abaae7e4a2af938115ce9071ef8684fb7af4 (block: fix flush machinery for stacking drivers with differring flush flags). Part of the problem is that blk_insert_flush required a single bio be attached to the request. In reality, having no attached bio is also a valid case, as can be observed with an empty flush. [1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.html Reported-by: Christophe Saout <christophe@saout.de> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11block: make gendisk hold a reference to its queueTejun Heo
commit f992ae801a7dec34a4ed99a6598bbbbfb82af4fb upstream. The following command sequence triggers an oops. # mount /dev/sdb1 /mnt # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete # umount /mnt general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs RIP: 0010:[<ffffffff810d0879>] [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60 ... Call Trace: [<ffffffff810d2845>] lock_acquire+0x95/0x140 [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50 [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70 [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0 [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0 [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0 [<ffffffff811c40df>] blkdev_put+0x5f/0x190 [<ffffffff8118f18d>] kill_block_super+0x4d/0x80 [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70 [<ffffffff8119003a>] deactivate_super+0x4a/0x70 [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130 [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0 [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b This is because bdev holds on to disk but disk doesn't pin the associated queue. If a SCSI device is removed while the device is still open, the sdev puts the base reference to the queue on release. When the bdev is finally released, the associated queue is already gone along with the bdi and bdev_inode_switch_bdi() ends up dereferencing already freed bdi. Even if it were not for this bug, disk not holding onto the associated queue is very unusual and error-prone. Fix it by making add_disk() take an extra reference to its queue and put it on disk_release() and ensuring that disk and its fops owner are put in that order after all accesses to the disk and queue are complete. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-09-28block: Free queue resources at blk_release_queue()Hannes Reinecke
A kernel crash is observed when a mounted ext3/ext4 filesystem is physically removed. The problem is that blk_cleanup_queue() frees up some resources eg by calling elevator_exit(), which are not checked for in normal operation. So we should rather move these calls to the destructor function blk_release_queue() as at that point all remaining references are gone. However, in doing so we have to ensure that any externally supplied queue_lock is disconnected as the driver might free up the lock after the call of blk_cleanup_queue(), Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-09-21blk-cgroup: be able to remove the record of unplugged deviceWanlong Gao
The bug is we're not able to remove the device from blkio cgroup's per-device control files if it gets unplugged. To reproduce the bug: # mount -t cgroup -o blkio xxx /cgroup # cd /cgroup # echo "8:0 1000" > blkio.throttle.read_bps_device # unplug the device # cat blkio.throttle.read_bps_device 8:0 1000 # echo "8:0 0" > blkio.throttle.read_bps_device -bash: echo: write error: No such device After patching, the device removal will succeed. Thanks for the comments of Paul, Zefan, and Vivek. Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <paul@paulmenage.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-09-14block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_requestTao Ma
In __blk_complete_request, we check both QUEUE_FLAG_SAME_COMP and req->cpu to decide whether we should use req->cpu. Actually the user can also select the complete cpu by either setting BIO_CPU_AFFINE or by calling bio_set_completion_cpu. Current solution makes these 2 ways don't work any more. So we'd better just check req->cpu. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-08-24block: simplify force plug flush code a little bitShaohua Li
Cleaning up the code a little bit. attempt_plug_merge() traverses the plug list anyway, we can do the request counting there, so stack size is reduced a little bit. The motivation here is I suspect if we should count the requests for each queue (task could handle multiple disks in the meantime), but my test doesn't show it's worthy doing. If somebody proves we should do it, below change will make that more easier. Signed-off-by: Shaohua Li <shli@kernel.org> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-24block: change force plug flush call orderShaohua Li
Do blk_flush_plug_list() first and then add new request aDo blk_flush_plug_list() first and then add new request aDo blk_flush_plug_list() first and then add new request at the tail. New request can't be merged to existing requests, but later new requests might be merged with this new one. If blk_flush_plug_list() is done later, the merge doesn't happen. Believe it or not, this fixes a 10% regression running sysbench workload. Signed-off-by: Shaohua Li <shli@kernel.org> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-24block: Fix queue_flag update when rq_affinity goes from 2 to 1Eric Seppanen
Commit 5757a6d76cdf added the QUEUE_FLAG_SAME_FORCE flag, but fails to clear that flag when the current state is '2' (SAME_COMP + SAME_FORCE) and the new state is '1' (SAME_COMP). Acked-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Roland Dreier <roland@purestorage.com> Signed-off-by: Eric Seppanen <eric@purestorage.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-23block: separate priority boosting from REQ_METAChristoph Hellwig
Add a new REQ_PRIO to let requests preempt others in the cfq I/O schedule, and lave REQ_META purely for marking requests as metadata in blktrace. All existing callers of REQ_META except for XFS are updated to also set REQ_PRIO for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-19Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-block: (23 commits) Revert "cfq: Remove special treatment for metadata rqs." block: fix flush machinery for stacking drivers with differring flush flags block: improve rq_affinity placement blktrace: add FLUSH/FUA support Move some REQ flags to the common bio/request area allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH xen/blkback: Make description more obvious. cfq-iosched: Add documentation about idling block: Make rq_affinity = 1 work as expected block: swim3: fix unterminated of_device_id table block/genhd.c: remove useless cast in diskstats_show() drivers/cdrom/cdrom.c: relax check on dvd manufacturer value drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse bsg-lib: add module.h include cfq-iosched: Reduce linked group count upon group destruction blk-throttle: correctly determine sync bio loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices loop: add management interface for on-demand device allocation loop: replace linked list of allocated devices with an idr index ...
2011-08-19Revert "cfq: Remove special treatment for metadata rqs."Jens Axboe
We have a kernel build regression since 3.1-rc1, which is about 10% regression. The kernel source is in an ext3 filesystem. Alex Shi bisect it to commit: commit a07405b7802691d29ab3b23bdc76ee6d006aad0b Author: Justin TerAvest <teravest@google.com> Date: Sun Jul 10 22:09:19 2011 +0200 cfq: Remove special treatment for metadata rqs. Apparently this is caused by lack metadata preemption, where ext3/ext4 do use READ_META. I didn't see a way to fix the issue, so suggest reverting the patch. This reverts commit a07405b7802691d29ab3b23bdc76ee6d006aad0b. Reported-by: Alex Shi<alex.shi@intel.com> Reported-by: Shaohua Li<shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-15block: fix flush machinery for stacking drivers with differring flush flagsJeff Moyer
Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae, block: reimplement FLUSH/FUA to support merge, introduced a performance regression when running any sort of fsyncing workload using dm-multipath and certain storage (in our case, an HP EVA). The test I ran was fs_mark, and it dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out that dm-multipath always advertised flush+fua support, and passed commands on down the stack, where those flags used to get stripped off. The above commit changed that behavior: static inline struct request *__elv_next_request(struct request_queue *q) { struct request *rq; while (1) { - while (!list_empty(&q->queue_head)) { + if (!list_empty(&q->queue_head)) { rq = list_entry_rq(q->queue_head.next); - if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) || - (rq->cmd_flags & REQ_FLUSH_SEQ)) - return rq; - rq = blk_do_flush(q, rq); - if (rq) - return rq; + return rq; } Note that previously, a command would come in here, have REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush: struct request *blk_do_flush(struct request_queue *q, struct request *rq) { unsigned int fflags = q->flush_flags; /* may change, cache it */ bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA; bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH); bool do_postflush = has_flush && !has_fua && (rq->cmd_flags & REQ_FUA); unsigned skip = 0; ... if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) { rq->cmd_flags &= ~REQ_FLUSH; if (!has_fua) rq->cmd_flags &= ~REQ_FUA; return rq; } So, the flush machinery was bypassed in such cases (q->flush_flags == 0 && rq->cmd_flags & (REQ_FLUSH|REQ_FUA)). Now, however, we don't get into the flush machinery at all. Instead, __elv_next_request just hands a request with flush and fua bits set to the scsi_request_fn, even if the underlying request_queue does not support flush or fua. The agreed upon approach is to fix the flush machinery to allow stacking. While this isn't used in practice (since there is only one request-based dm target, and that target will now reflect the flush flags of the underlying device), it does future-proof the solution, and make it function as designed. In order to make this work, I had to add a field to the struct request, inside the flush structure (to store the original req->end_io). Shaohua had suggested overloading the union with rb_node and completion_data, but the completion data is used by device mapper and can also be used by other drivers. So, I didn't see a way around the additional field. I tested this patch on an HP EVA with both ext4 and xfs, and it recovers the lost performance. Comments and other testers, as always, are appreciated. Cheers, Jeff Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-11block: improve rq_affinity placementShaohua Li
This patch reverts commit 35ae66e0a09ab70ed(block: Make rq_affinity = 1 work as expected). The purpose is to avoid an unnecessary IPI. Let's take an example. My test box has cpu 0-7, one socket. Say request is added from CPU 1, blk_complete_request() occurs at CPU 7. Without the reverted patch, softirq will be done at CPU 7. With it, an IPI will be directed to CPU 0, and softirq will be done at CPU 0. In this case, doing softirq at CPU 0 and CPU 7 have no difference from cache sharing point view and we can avoid an ipi if doing it in CPU 7. An immediate concern is this is just like QUEUE_FLAG_SAME_FORCE, but actually not. blk_complete_request() is running in interrupt handler, and currently I/O controller doesn't support multiple interrupts (I checked several LSI cards and AHCI), so only one CPU can run blk_complete_request(). This is still quite different as QUEUE_FLAG_SAME_FORCE. Since only one CPU runs softirq, the only difference with below patch is softirq not always runs at the first CPU of a group. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-09allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSHJeff Moyer
blk_insert_flush has the following check: /* * If there's data but flush is not necessary, the request can be * processed directly without going through flush machinery. Queue * for normal execution. */ if ((policy & REQ_FSEQ_DATA) && !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) { list_add_tail(&rq->queuelist, &q->queue_head); return; } However, blk_flush_policy will not return with policy set to only REQ_FSEQ_DATA: static unsigned int blk_flush_policy(unsigned int fflags, struct request *rq) { unsigned int policy = 0; if (fflags & REQ_FLUSH) { if (rq->cmd_flags & REQ_FLUSH) policy |= REQ_FSEQ_PREFLUSH; if (blk_rq_sectors(rq)) policy |= REQ_FSEQ_DATA; if (!(fflags & REQ_FUA) && (rq->cmd_flags & REQ_FUA)) policy |= REQ_FSEQ_POSTFLUSH; } return policy; } Notice that REQ_FSEQ_DATA is only set if REQ_FLUSH is set. Fix this mismatch by moving the setting of REQ_FSEQ_DATA outside of the REQ_FLUSH check. Tejun notes: Hmmm... yes, this can become a correctness issue if (and only if) blk_queue_flush() is called to change q->flush_flags while requests are in-flight; otherwise, requests wouldn't reach the function at all. Also, I think it would be a generally good idea to always set FSEQ_DATA if the request has data. Cheers, Jeff Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-05block: Make rq_affinity = 1 work as expectedTao Ma
Commit 5757a6d76c introduced a new rq_affinity = 2 so as to make the request completed in the __make_request cpu. But it makes the old rq_affinity = 1 not work any more. The root cause is that if the 'cpu' and 'req->cpu' is in the same group and cpu != req->cpu, ccpu will be the same as group_cpu, so the completion will be excuted in the 'cpu' not 'group_cpu'. This patch fix problem by simpling removing group_cpu and the codes are more explicit now. If ccpu == cpu, we complete in cpu, otherwise we raise_blk_irq to ccpu. Cc: Christoph Hellwig <hch@infradead.org> Cc: Roland Dreier <roland@purestorage.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Reviewed-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-03fault-injection: add ability to export fault_attr in arbitrary directoryAkinobu Mita
init_fault_attr_dentries() is used to export fault_attr via debugfs. But it can only export it in debugfs root directory. Per Forlin is working on mmc_fail_request which adds support to inject data errors after a completed host transfer in MMC subsystem. The fault_attr for mmc_fail_request should be defined per mmc host and export it in debugfs directory per mmc host like /sys/kernel/debug/mmc0/mmc_fail_request. init_fault_attr_dentries() doesn't help for mmc_fail_request. So this introduces fault_create_debugfs_attr() which is able to create a directory in the arbitrary directory and replace init_fault_attr_dentries(). [akpm@linux-foundation.org: extraneous semicolon, per Randy] Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Tested-by: Per Forlin <per.forlin@linaro.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-02block/genhd.c: remove useless cast in diskstats_show()Herbert Poetzl
Remove the (unsigned long long) cast in diskstats_show() and adjusts the seq_printf() format string to 'unsigned long' diskstats_show() uses part_stat_read() to get the stats, which either accesses the specified field in the struct disk_stats directly (non SMP) or sums up the per CPU values in a variable of the same type as the field, so in any case the result will have the same type and range as the specified field which for all disk_stats entries is unsigned long Also, for unsigned long ranges the output of %lu should be identical to the one of %llu, so no change in the actual proc entry contents. Signed-off-by: Herbert Poetzl <herbert@13thfloor.at> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-02bsg-lib: add module.h includeJens Axboe
Due to conflicts with the moduleh tree in linux-next, we run into an include file mess. We really need export.h in that tree, but if we add module.h locally then the issue is easier to resolve. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-02cfq-iosched: Reduce linked group count upon group destructionVivek Goyal
FQ keeps track of number of groups which are linked on blkcg->blkg_list. This is useful to avoid races between queue exit and cgroup exit code paths. So if at the request queue exit time linked group count is not zero, that means there are some group out there which is yet to be deleted under rcu read period and queue exit code should wait for on rcu period. In my previous patch I forgot to decrease the number of group count. So in current form, we nr_blkcg_linked_grps is always non-zero and we will always wait one rcu period (if BLK_CGROUP=y). The side effect of this is that it can increase boot time. I am surprised, nobody complained so far. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-01blk-throttle: correctly determine sync bioShaohua Li
read request is always sync. Using rw_is_sync() to determine if a bio is sync. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-31block: add bsg helper libraryMike Christie
This moves the FC classes bsg code to the block layer and makes it a lib so that other classes like iscsi and SAS can use it. It is helpful because working with the request queue, bios, creating scatterlists, etc are a pain that the LLD does not have to worry about with normal IOs and should not have to worry about for bsg requests. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-26fail_make_request: cleanup should_fail_requestAkinobu Mita
This changes should_fail_request() to more usable wrapper function of should_fail(). It can avoid putting #ifdef CONFIG_FAIL_MAKE_REQUEST in the middle of a function. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26block: fix warning with calling smp_processor_id() in preemptible sectionJens Axboe
After commit 5757a6d7 introduced an unsafe calling of smp_processor_id(), with preempt debuggin turned on we spew a lot of: BUG: using smp_processor_id() in preemptible [00000000] code: kjournald/514 caller is __make_request+0x1b8/0x308 [<c0019f44>] (unwind_backtrace+0x0/0xe8) from [<c024b4cc>] (debug_smp_processor_id+0xbc/0xf0) [<c024b4cc>] (debug_smp_processor_id+0xbc/0xf0) from [<c0223d14>] (__make_request+0x1b8/0x308) [<c0223d14>] (__make_request+0x1b8/0x308) from [<c02215ac>] (generic_make_request+0x4dc/0x558) [<c02215ac>] (generic_make_request+0x4dc/0x558) from [<c022173c>] (submit_bio+0x114/0x138) [<c022173c>] (submit_bio+0x114/0x138) from [<c011f504>] (submit_bh+0x148/0x16c) [<c011f504>] (submit_bh+0x148/0x16c) from [<c0121ed8>] (__sync_dirty_buffer+0x88/0xd8) [<c0121ed8>] (__sync_dirty_buffer+0x88/0xd8) from [<c01aff78>] (journal_commit_transaction+0x1198/0x1688) [<c01aff78>] (journal_commit_transaction+0x1198/0x1688) from [<c01b4034>] (kjournald+0xb4/0x224) [<c01b4034>] (kjournald+0xb4/0x224) from [<c0069ea0>] (kthread+0x8c/0x94) [<c0069ea0>] (kthread+0x8c/0x94) from [<c00137f8>] (kernel_thread_exit+0x0/0x8) Fix this by just using raw_smp_processor_id(), it's just a hint after all. There's no pinning of the CPU or accessing per-cpu structures involved. Reported-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-25Merge branch 'for-3.1/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds
* 'for-3.1/drivers' of git://git.kernel.dk/linux-block: cciss: do not attempt to read from a write-only register xen/blkback: Add module alias for autoloading xen/blkback: Don't let in-flight requests defer pending ones. bsg: fix address space warning from sparse bsg: remove unnecessary conditional expressions bsg: fix bsg_poll() to return POLLOUT properly
2011-07-25Merge branch 'for-3.1/core' of git://git.kernel.dk/linux-blockLinus Torvalds
* 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits) block: strict rq_affinity backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu block: fix patch import error in max_discard_sectors check block: reorder request_queue to remove 64 bit alignment padding CFQ: add think time check for group CFQ: add think time check for service tree CFQ: move think time check variables to a separate struct fixlet: Remove fs_excl from struct task. cfq: Remove special treatment for metadata rqs. block: document blk_plug list access block: avoid building too big plug list compat_ioctl: fix make headers_check regression block: eliminate potential for infinite loop in blkdev_issue_discard compat_ioctl: fix warning caused by qemu block: flush MEDIA_CHANGE from drivers on close(2) blk-throttle: Make total_nr_queued unsigned block: Add __attribute__((format(printf...) and fix fallout fs/partitions/check.c: make local symbols static block:remove some spare spaces in genhd.c block:fix the comment error in blkdev.h ...
2011-07-23block: strict rq_affinityDan Williams
Some systems benefit from completions always being steered to the strict requester cpu rather than the looser "per-socket" steering that blk_cpu_to_group() attempts by default. This is because the first CPU in the group mask ends up being completely overloaded with work, while the others (including the original submitter) has power left to spare. Allow the strict mode to be set by writing '2' to the sysfs control file. This is identical to the scheme used for the nomerges file, where '2' is a more aggressive setting than just being turned on. echo 2 > /sys/block/<bdev>/queue/rq_affinity Cc: Christoph Hellwig <hch@infradead.org> Cc: Roland Dreier <roland@purestorage.com> Tested-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-23block: fix patch import error in max_discard_sectors checkJens Axboe
A '!' snuck in before the unlikely, rendering it useless. Reported-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (77 commits) [SCSI] fix crash in scsi_dispatch_cmd() [SCSI] sr: check_events() ignore GET_EVENT when TUR says otherwise [SCSI] bnx2i: Fixed kernel panic due to illegal usage of sc->request->cpu [SCSI] bfa: Update the driver version to 3.0.2.1 [SCSI] bfa: Driver and BSG enhancements. [SCSI] bfa: Added support to query PHY. [SCSI] bfa: Added HBA diagnostics support. [SCSI] bfa: Added support for flash configuration [SCSI] bfa: Added support to obtain SFP info. [SCSI] bfa: Added support for CEE info and stats query. [SCSI] bfa: Extend BSG interface. [SCSI] bfa: FCS bug fixes. [SCSI] bfa: DMA memory allocation enhancement. [SCSI] bfa: Brocade-1860 Fabric Adapter vHBA support. [SCSI] bfa: Brocade-1860 Fabric Adapter PLL init fixes. [SCSI] bfa: Added Fabric Assigned Address(FAA) support [SCSI] bfa: IOC bug fixes. [SCSI] bfa: Enable ASIC block configuration and query. [SCSI] bnx2i: Updated copyright and bump version [SCSI] bnx2i: Modified to skip CNIC registration if iSCSI is not supported ... Fix up some trivial conflicts in: - drivers/scsi/bnx2fc/{bnx2fc.h,bnx2fc_fcoe.c}: Crazy broadcom version number conflicts - drivers/target/tcm_fc/tfc_cmd.c Just trivial cleanups done on adjacent lines
2011-07-21[SCSI] fix crash in scsi_dispatch_cmd()James Bottomley
USB surprise removal of sr is triggering an oops in scsi_dispatch_command(). What seems to be happening is that USB is hanging on to a queue reference until the last close of the upper device, so the crash is caused by surprise remove of a mounted CD followed by attempted unmount. The problem is that USB doesn't issue its final commands as part of the SCSI teardown path, but on last close when the block queue is long gone. The long term fix is probably to make sr do the teardown in the same way as sd (so remove all the lower bits on ejection, but keep the upper disk alive until last close of user space). However, the current oops can be simply fixed by not allowing any commands to be sent to a dead queue. Cc: stable@kernel.org Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2011-07-20block,rcu: Convert call_rcu(disk_free_ptbl_rcu_cb) to kfree_rcu()Lai Jiangshan
The rcu callback disk_free_ptbl_rcu_cb() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(disk_free_ptbl_rcu_cb). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-12CFQ: add think time check for groupShaohua Li
Currently when the last queue of a group has no request, we don't expire the queue to hope request from the group comes soon, so the group doesn't miss its share. But if the think time is big, the assumption isn't correct and we just waste bandwidth. In such case, we don't do idle. [global] runtime=30 direct=1 [test1] cgroup=test1 cgroup_weight=1000 rw=randread ioengine=libaio size=500m runtime=30 directory=/mnt filename=file1 thinktime=9000 [test2] cgroup=test2 cgroup_weight=1000 rw=randread ioengine=libaio size=500m runtime=30 directory=/mnt filename=file2 patched base test1 64k 39k test2 548k 540k total 604k 578k group1 gets much better throughput because it waits less time. To check if the patch changes behavior of queue without think time. I also tried to give test1 2ms think time or no think time. The test result is stable. The thoughput doesn't change with/without the patch. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-12CFQ: add think time check for service treeShaohua Li
Currently when the last queue of a service tree has no request, we don't expire the queue to hope request from the service tree comes soon, so the service tree doesn't miss its share. But if the think time is big, the assumption isn't correct and we just waste bandwidth. In such case, we don't do idle. [global] runtime=10 direct=1 [test1] rw=randread ioengine=libaio size=500m directory=/mnt filename=file1 thinktime=9000 [test2] rw=read ioengine=libaio size=1G directory=/mnt filename=file2 patched base test1 41k/s 33k/s test2 15868k/s 15789k/s total 15902k/s 15817k/s A slightly better To check if the patch changes behavior of queue without think time. I also tried to give test1 2ms think time or no think time. The test has variation even without the patch, but the average throughput doesn't change with/without the patch. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-12CFQ: move think time check variables to a separate structShaohua Li
Move the variables to do think time check to a sepatate struct. This is to prepare adding think time check for service tree and group. No functional change. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-12fixlet: Remove fs_excl from struct task.Justin TerAvest
fs_excl is a poor man's priority inheritance for filesystems to hint to the block layer that an operation is important. It was never clearly specified, not widely adopted, and will not prevent starvation in many cases (like across cgroups). fs_excl was introduced with the time sliced CFQ IO scheduler, to indicate when a process held FS exclusive resources and thus needed a boost. It doesn't cover all file systems, and it was never fully complete. Lets kill it. Signed-off-by: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-10cfq: Remove special treatment for metadata rqs.Justin TerAvest
There is no consistency among filesystems from what bios (or requests) are marked as being metadata. It's interesting to expose this in traces, but we shouldn't schedule the requests differently based on whether or not they're marked as being metadata. Signed-off-by: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-08block: avoid building too big plug listShaohua Li
When I test fio script with big I/O depth, I found the total throughput drops compared to some relative small I/O depth. The reason is the thread accumulates big requests in its plug list and causes some delays (surely this depends on CPU speed). I thought we'd better have a threshold for requests. When a threshold reaches, this means there is no request merge and queue lock contention isn't severe when pushing per-task requests to queue, so the main advantages of blk plug don't exist. We can force a plug list flush in this case. With this, my test throughput actually increases and almost equals to small I/O depth. Another side effect is irq off time decreases in blk_flush_plug_list() for big I/O depth. The BLK_MAX_REQUEST_COUNT is choosen arbitarily, but 16 is efficiently to reduce lock contention to me. But I'm open here, 32 is ok in my test too. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-06block: eliminate potential for infinite loop in blkdev_issue_discardMike Snitzer
Due to the recently identified overflow in read_capacity_16() it was possible for max_discard_sectors to be zero but still have discards enabled on the associated device's queue. Eliminate the possibility for blkdev_issue_discard to infinitely loop. Interestingly this issue wasn't identified until a device, whose discard_granularity was 0 due to read_capacity_16 overflow, was consumed by blk_stack_limits() to construct limits for a higher-level DM multipath device. The multipath device's resulting limits never had the discard limits stacked because blk_stack_limits() will only do so if the bottom device's discard_granularity != 0. This resulted in the multipath device's limits.max_discard_sectors being 0. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-01compat_ioctl: fix warning caused by qemuJohannes Stezenbach
On Linux x86_64 host with 32bit userspace, running qemu or even just "qemu-img create -f qcow2 some.img 1G" causes a kernel warning: ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(00005326){t:'S';sz:0} arg(7fffffff) on some.img ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(801c0204){t:02;sz:28} arg(fff77350) on some.img ioctl 00005326 is CDROM_DRIVE_STATUS, ioctl 801c0204 is FDGETPRM. The warning appears because the Linux compat-ioctl handler for these ioctls only applies to block devices, while qemu also uses the ioctls on plain files. Signed-off-by: Johannes Stezenbach <js@sig21.net> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-01block: flush MEDIA_CHANGE from drivers on close(2)Tejun Heo
Currently, only open(2) is defined as the 'clearing' point. It has two roles - first, it's an acknowledgement from userland indicating that the event has been received and kernel can clear pending states and proceed to generate more events. Secondly, it's passed on to device drivers as a hint indicating that a synchronization point has been reached and it might want to take a deeper look at the device. The latter currently is only used by sr which uses two different mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY to discover events, where the former is lighter weight and safe to be used repeatedly but may not provide full coverage. Among other things, GET_EVENT can't detect media removal while TUR can. This patch makes close(2) - blkdev_put() - indicate clearing hint for MEDIA_CHANGE to drivers. disk_check_events() is renamed to disk_flush_events() and updated to take @mask for events to flush which is or'd to ev->clearing and will be passed to the driver on the next ->check_events() invocation. This change makes sr generate MEDIA_CHANGE when media is ejected from userland - e.g. with eject(1). Note: Given the current usage, it seems @clearing hint is needlessly complex. disk_clear_events() can simply clear all events and the hint can be boolean @flush. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-01Merge branch 'for-linus' into for-3.1/coreJens Axboe
Conflicts: block/blk-throttle.c block/cfq-iosched.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-27cfq-iosched: make code consistentShaohua Li
ioc->ioc_data is rcu protectd, so uses correct API to access it. This doesn't change any behavior, but just make code consistent. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: stable@kernel.org # after ab4bd22d Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-27cfq-iosched: fix a rcu warningShaohua Li
I got a rcu warnning at boot. the ioc->ioc_data is rcu_deferenced, but doesn't hold rcu_read_lock. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: stable@kernel.org # after ab4bd22d Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-20bsg: fix address space warning from sparseNamhyung Kim
copy_from/to_user() and blk_rq_map_user() want __user pointer. This patch fixes following warnings from sparse: CHECK block/bsg.c block/bsg.c:185:38: warning: incorrect type in argument 2 (different address spaces) block/bsg.c:185:38: expected void const [noderef] <asn:1>*from block/bsg.c:185:38: got void *<noident> block/bsg.c:295:58: warning: incorrect type in argument 4 (different address spaces) block/bsg.c:295:58: expected void [noderef] <asn:1>*<noident> block/bsg.c:295:58: got void *[assigned] dxferp block/bsg.c:311:52: warning: incorrect type in argument 4 (different address spaces) block/bsg.c:311:52: expected void [noderef] <asn:1>*<noident> block/bsg.c:311:52: got void *[assigned] dxferp block/bsg.c:448:37: warning: incorrect type in argument 1 (different address spaces) block/bsg.c:448:37: expected void [noderef] <asn:1>*dst block/bsg.c:448:37: got void *<noident> Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-20bsg: remove unnecessary conditional expressionsNamhyung Kim
Second condition in OR always implies first condition is false thus bytes_read in the second is not needed. The same goes to bytes_written. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>