summaryrefslogtreecommitdiff
path: root/drivers/infiniband/core/uverbs_main.c
AgeCommit message (Collapse)Author
2018-02-28RDMA/uverbs: Protect from command mask overflowLeon Romanovsky
commit 3f802b162dbf4a558ff98986449eddc717826209 upstream. The command number is not bounds checked against the command mask before it is shifted, resulting in an ubsan hit. This does not cause malfunction since the command number is eventually bounds checked, but we can make this ubsan clean by moving the bounds check to before the mask check. ================================================================================ UBSAN: Undefined behaviour in drivers/infiniband/core/uverbs_main.c:647:21 shift exponent 207 is too large for 64-bit type 'long long unsigned int' CPU: 0 PID: 446 Comm: syz-executor3 Not tainted 4.15.0-rc2+ #61 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 Call Trace: dump_stack+0xde/0x164 ? dma_virt_map_sg+0x22c/0x22c ubsan_epilogue+0xe/0x81 __ubsan_handle_shift_out_of_bounds+0x293/0x2f7 ? debug_check_no_locks_freed+0x340/0x340 ? __ubsan_handle_load_invalid_value+0x19b/0x19b ? lock_acquire+0x440/0x440 ? lock_acquire+0x19d/0x440 ? __might_fault+0xf4/0x240 ? ib_uverbs_write+0x68d/0xe20 ib_uverbs_write+0x68d/0xe20 ? __lock_acquire+0xcf7/0x3940 ? uverbs_devnode+0x110/0x110 ? cyc2ns_read_end+0x10/0x10 ? sched_clock_cpu+0x18/0x200 ? sched_clock_cpu+0x18/0x200 __vfs_write+0x10d/0x700 ? uverbs_devnode+0x110/0x110 ? kernel_read+0x170/0x170 ? __fget+0x35b/0x5d0 ? security_file_permission+0x93/0x260 vfs_write+0x1b0/0x550 SyS_write+0xc7/0x1a0 ? SyS_read+0x1a0/0x1a0 ? trace_hardirqs_on_thunk+0x1a/0x1c entry_SYSCALL_64_fastpath+0x18/0x85 RIP: 0033:0x448e29 RSP: 002b:00007f033f567c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00007f033f5686bc RCX: 0000000000448e29 RDX: 0000000000000060 RSI: 0000000020001000 RDI: 0000000000000012 RBP: 000000000070bea0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff R13: 00000000000056a0 R14: 00000000006e8740 R15: 0000000000000000 ================================================================================ Cc: syzkaller <syzkaller@googlegroups.com> Cc: <stable@vger.kernel.org> # 4.5 Fixes: 2dbd5186a39c ("IB/core: IB/core: Allow legacy verbs through extended interfaces") Reported-by: Noa Osherovich <noaos@mellanox.com> Reviewed-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-16IB/uverbs: Fix leak of XRC target QPsTariq Toukan
The real QP is destroyed in case of the ref count reaches zero, but for XRC target QPs this call was missed and caused to QP leaks. Let's call to destroy for all flows. Fixes: 0e0ec7e0638e ('RDMA/core: Export ib_open_qp() to share XRC...') Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Noa Osherovich <noaos@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-08-04Merge branches 'misc' and 'rxe' into k.o/for-4.8-1Doug Ledford
2016-08-03IB/uverbs: Fix race between uverbs_close and remove_oneJason Gunthorpe
Fixes an oops that might happen if uverbs_close races with remove_one. Both contexts may run ib_uverbs_cleanup_ucontext, it depends on the flow. Currently, there is no protection for a case that remove_one didn't make the cleanup it runs to its end, the underlying ib_device was freed then uverbs_close will call ib_uverbs_cleanup_ucontext and OOPs. Above might happen if uverbs_close deleted the file from the list then remove_one didn't find it and runs to its end. Fixes to protect against that case by a new cleanup lock so that ib_uverbs_cleanup_ucontext will be called always before that remove_one is ended. Fixes: 35d4a0b63dc0 ("IB/uverbs: Fix race between ib_uverbs_open and remove_one") Reported-by: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-06-23IB/uverbs: Introduce RWQ Indirection tableYishai Hadas
User applications that want to spread traffic on several WQs, need to create an indirection table, by using already created WQs. Adding uverbs API in order to create and destroy this table. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Matan Barak <matanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-06-23IB/uverbs: Add WQ supportYishai Hadas
User space applications which use RSS functionality need to create a work queue object (WQ). The lifetime of such an object is: * Create a WQ * Modify the WQ from reset to init state. * Use the WQ (by downstream patches). * Destroy the WQ. These commands are added to the uverbs API. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Matan Barak <matanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@rimberg.me> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28IB/security: Restrict use of the write() interfaceJason Gunthorpe
The drivers/infiniband stack uses write() as a replacement for bi-directional ioctl(). This is not safe. There are ways to trigger write calls that result in the return structure that is normally written to user space being shunted off to user specified kernel memory instead. For the immediate repair, detect and deny suspicious accesses to the write API. For long term, update the user space libraries and the kernel API to something that doesn't present the same security vulnerabilities (likely a structured ioctl() interface). The impacted uAPI interfaces are generally only available if hardware from drivers/infiniband is installed in the system. Reported-by: Jann Horn <jann@thejh.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> [ Expanded check to all known write() entry points ] Cc: stable@vger.kernel.org Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-03IB/core: trivial prink cleanup.Parav Pandit
1. Replaced printk with appropriate pr_warn, pr_err, pr_info. 2. Removed unnecessary prints around memory allocation failure which are not required, as reported by the checkpatch script. Signed-off-by: Parav Pandit <pandit.parav@gmail.com> Reviewed-by: Haggai Eran <haggaie@mellanox.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-03IB/core: Modify conditional on ucontext existenceEli Cohen
Since we allow to call legacy verbs using their extended counterpart, the check on ucontext has to move up to a common area in case this verb is ever extended. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-03IB/core: IB/core: Allow legacy verbs through extended interfacesEli Cohen
When an extended verb is an extension to a legacy verb, the original functionality is preserved. Hence we do not require each hardware driver to set the extended capability. This will allow the use of the extended verb in its simple form with drivers that do not support the extended capability. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-03IB/core: Avoid duplicate codeEli Cohen
Move the check on the validity of the command to a common area. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-12-23IB: remove in-kernel support for memory windowsChristoph Hellwig
Remove the unused ib_allow_mw and ib_bind_mw functions, remove the unused IB_WR_BIND_MW and IB_WC_BIND_MW opcodes and move ib_dealloc_mw into the uverbs module. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> [core] Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-21IB/core: Extend ib_uverbs_create_qpEran Ben Elisha
ib_uverbs_ex_create_qp follows the extension verbs mechanism. New features (for example, QP creation flags field which is added in a downstream patch) could used via user-space libraries without breaking the ABI. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30IB/uverbs: Enable device removal when there are active user space applicationsYishai Hadas
Enables the uverbs_remove_one to succeed despite the fact that there are running IB applications working with the given ib device. This functionality enables a HW device to be unbind/reset despite the fact that there are running user space applications using it. It exposes a new IB kernel API named 'disassociate_ucontext' which lets a driver detaching its HW resources from a given user context without crashing/terminating the application. In case a driver implemented the above API and registered with ib_uverb there will be no dependency between its device to its uverbs_device. Upon calling remove_one of ib_uverbs the call should return after disassociating the open HW resources without waiting to clients disconnecting. In case driver didn't implement this API there will be no change to current behaviour and uverbs_remove_one will return only when last client has disconnected and reference count on uverbs device became 0. In case the lower driver device was removed any application will continue working over some zombie HCA, further calls will ended with an immediate error. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Shachar Raindel <raindel@mellanox.com> Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30IB/uverbs: Explicitly pass ib_dev to uverbs commandsYishai Hadas
Done in preparation for deploying RCU for the device removal flow. Allows isolating the RCU handling to the uverb_main layer and keeping the uverbs_cmd code as is. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Shachar Raindel <raindel@mellanox.com> Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30IB/uverbs: Fix race between ib_uverbs_open and remove_oneYishai Hadas
Fixes: 2a72f212263701b927559f6850446421d5906c41 ("IB/uverbs: Remove dev_table") Before this commit there was a device look-up table that was protected by a spin_lock used by ib_uverbs_open and by ib_uverbs_remove_one. When it was dropped and container_of was used instead, it enabled the race with remove_one as dev might be freed just after: dev = container_of(inode->i_cdev, struct ib_uverbs_device, cdev) but before the kref_get. In addition, this buggy patch added some dead code as container_of(x,y,z) can never be NULL and so dev can never be NULL. As a result the comment above ib_uverbs_open saying "the open method will either immediately run -ENXIO" is wrong as it can never happen. The solution follows Jason Gunthorpe suggestion from below URL: https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg25692.html cdev will hold a kref on the parent (the containing structure, ib_uverbs_device) and only when that kref is released it is guaranteed that open will never be called again. In addition, fixes the active count scheme to use an atomic not a kref to prevent WARN_ON as pointed by above comment from Jason. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Shachar Raindel <raindel@mellanox.com> Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30IB/uverbs: Fix reference counting usage of event filesYishai Hadas
Fix the reference counting usage to be handled in the event file creation/destruction function, instead of being done by the caller. This is done for both async/non-async event files. Based on Jason Gunthorpe report at https://www.mail-archive.com/ linux-rdma@vger.kernel.org/msg24680.html: "The existing code for this is broken, in ib_uverbs_get_context all the error paths between ib_uverbs_alloc_event_file and the kref_get(file->ref) are wrong - this will result in fput() which will call ib_uverbs_event_close, which will try to do kref_put and ib_unregister_event_handler - which are no longer paired." Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Shachar Raindel <raindel@mellanox.com> Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30IB/core: lock client data with lists_rwsemHaggai Eran
An ib_client callback that is called with the lists_rwsem locked only for read is protected from changes to the IB client lists, but not from ib_unregister_device() freeing its client data. This is because ib_unregister_device() will remove the device from the device list with lists_rwsem locked for write, but perform the rest of the cleanup, including the call to remove() without that lock. Mark client data that is undergoing de-registration with a new going_down flag in the client data context. Lock the client data list with lists_rwsem for write in addition to using the spinlock, so that functions calling the callback would be able to lock only lists_rwsem for read and let callbacks sleep. Since ib_unregister_client() now marks the client data context, no need for remove() to search the context again, so pass the client data directly to remove() callbacks. Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-06-12IB/core: Extend ib_uverbs_create_cqMatan Barak
ib_uverbs_ex_create_cq follows the extension verbs mechanism. New features (for example, CQ creation flags field which is added in a downstream patch) could used via user-space libraries without breaking the ABI. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15ib_uverbs: Fix pages leak when using XRC SRQsSébastien Dugué
Hello, When an application using XRCs abruptly terminates, the mmaped pages of the CQ buffers are leaked. This comes from the fact that when resources are released in ib_uverbs_cleanup_ucontext(), we fail to release the CQs because their refcount is not 0. When creating an XRC SRQ, we increment the associated CQ refcount. This refcount is only decremented when the SRQ is released. Therefore we need to release the SRQs prior to the CQs to make sure that all references to the CQs are gone before trying to release these. Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-02-18IB/core: Add support for extended query device capsEli Cohen
Add extensible query device capabilities verb to allow adding new features. ib_uverbs_ex_query_device is added and copy_query_dev_fields is used to copy capability fields to be used by both ib_uverbs_query_device and ib_uverbs_ex_query_device. Following the discussion about this patch [1], the code now validates the command's comp_mask is zero, returning -EINVAL for unknown values, in order to allow extending the verb in the future. The verb also checks the user-space provided response buffer size and only fills in capabilities that will fit in the buffer. In attempt to follow the spirit of presentation [2] by Tzahi Oved that was presented during OpenFabrics Alliance International Developer Workshop 2013, the comp_mask bits will only describe which fields are valid. Furthermore, fields that can simply be cleared when they are not supported, do not require a comp_mask bit at all. The verb returns a response_length field containing the actual number of bytes written by the kernel, so that a newer version running on an older kernel can tell which fields were actually returned. [1] [PATCH v1 0/5] IB/core: extended query device caps cleanup for v3.19 http://thread.gmane.org/gmane.linux.kernel.api/7889/ [2] https://www.openfabrics.org/images/docs/2013_Dev_Workshop/Tues_0423/2013_Workshop_Tues_0830_Tzahi_Oved-verbs_extensions_ofa_2013-tzahio.pdf Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Haggai Eran <haggaie@mellanox.com> Reviewed-by: Yann Droneaud <ydroneaud@opteya.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2015-02-03IB/core: Temporarily disable ex_query_device uverbHaggai Eran
Commit 5a77abf9a97a ("IB/core: Add support for extended query device caps") added a new extended verb to query the capabilities of RDMA devices, but the semantics of this verb are still under debate [1]. Don't expose this verb to userspace until the ABI is nailed down. [1] [PATCH v1 0/5] IB/core: extended query device caps cleanup for v3.19 http://www.spinics.net/lists/linux-rdma/msg22904.html Signed-off-by: Haggai Eran <haggaie@mellanox.com> Reviewed-by: Yann Droneaud <ydroneaud@opteya.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-12-15IB/core: Add support for on demand paging regionsShachar Raindel
* Extend the umem struct to keep the ODP related data. * Allocate and initialize the ODP related information in the umem (page_list, dma_list) and freeing as needed in the end of the run. * Store a reference to the process PID struct in the ucontext. Used to safely obtain the task_struct and the mm during fault handling, without preventing the task destruction if needed. * Add 2 helper functions: ib_umem_odp_map_dma_pages and ib_umem_odp_unmap_dma_pages. These functions get the DMA addresses of specific pages of the umem (and, currently, pin them). * Support for page faults only - IB core will keep the reference on the pages used and call put_page when freeing an ODP umem area. Invalidations support will be added in a later patch. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Shachar Raindel <raindel@mellanox.com> Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Majd Dibbiny <majd@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-12-15IB/core: Add support for extended query device capsEli Cohen
Add extensible query device capabilities verb to allow adding new features. ib_uverbs_ex_query_device is added and copy_query_dev_fields is used to copy capability fields to be used by both ib_uverbs_query_device and ib_uverbs_ex_query_device. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-10-14Merge branches 'core', 'cxgb4', 'iser', 'mlx5' and 'ocrdma' into for-nextRoland Dreier
2014-10-14IB/core: Fix XRC race condition in ib_uverbs_open_qpJack Morgenstein
In ib_uverbs_open_qp, the sharable xrc target qp is created as a "pseudo" qp and added to a list of qp's sharing the same physical QP. This is done before the "pseudo" qp is assigned a uobject. There is a race condition here if an async event arrives at the physical qp. If the event is handled after the pseudo qp is added to the list, but before it is assigned a uobject, the kernel crashes in ib_uverbs_qp_event_handler, due to trying to dereference a NULL uobject pointer. Note that simply checking for non-NULL is not enough, due to error flows in ib_uverbs_open_qp. If the failure is after assigning the uobject, but before the qp has fully been created, we still have a problem. Thus, in ib_uverbs_qp_event_handler, we test that the uobject is present, and also that it is live. Reported-by: Matthew Finlay <matt@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-10-09IB/core: Avoid leakage from kernel to user spaceEli Cohen
Clear the reserved field of struct ib_uverbs_async_event_desc which is copied to user space. Signed-off-by: Eli Cohen <eli@mellanox.com> Reviewed-by: Yann Droneaud <ydroneaud@opteya.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-08-01IB/core: Add user MR re-registration supportMatan Barak
Memory re-registration is a feature that enables changing the attributes of a memory region registered by user-space, including PD, translation (address and length) and access flags. Add the required support in uverbs and the kernel verbs API. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-12-20IB/uverbs: Check access to userspace response buffer in extended commandYann Droneaud
This patch adds a check on the output buffer with access_ok(VERIFY_WRITE, ...) to ensure the whole buffer is in userspace memory before using the pointer in uverbs functions. If the buffer or a subset of it is not valid, returns -EFAULT to the caller. This will also catch invalid buffer before the final call to copy_to_user() which happen late in most uverb functions. Just like the check in read(2) syscall, it's a sanity check to detect invalid parameters provided by userspace. This particular check was added in vfs_read() by Linus Torvalds for v2.6.12 with following commit message: https://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/?id=fd770e66c9a65b14ce114e171266cf6f393df502 Make read/write always do the full "access_ok()" tests. The actual user copy will do them too, but only for the range that ends up being actually copied. That hides bugs when the range has been clamped by file size or other issues. Note: there's no need to check input buffer since vfs_write() already does access_ok(VERIFY_READ, ...) as part of write() syscall. Link: http://marc.info/?i=cover.1387273677.git.ydroneaud@opteya.com Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-12-20IB/uverbs: Check reserved field in extended command headerYann Droneaud
As noted by Daniel Vetter in its article "Botching up ioctls"[1] "Check *all* unused fields and flags and all the padding for whether it's 0, and reject the ioctl if that's not the case. Otherwise your nice plan for future extensions is going right down the gutters since someone *will* submit an ioctl struct with random stack garbage in the yet unused parts. Which then bakes in the ABI that those fields can never be used for anything else but garbage." It's important to ensure that reserved fields are set to known value, so that it will be possible to use them latter to extend the ABI. The same reasonning apply to comp_mask field present in newer uverbs command: per commit 22878dbc9173 ("IB/core: Better checking of userspace values for receive flow steering"), unsupported values in comp_mask are rejected. [1] http://blog.ffwll.ch/2013/11/botching-up-ioctls.html Link: http://marc.info/?i=cover.1386798254.git.ydroneaud@opteya.com> Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-12-20IB/uverbs: New macro to set pointers to NULL if length is 0 in INIT_UDATA()Roland Dreier
Trying to have a ternary operator to choose between NULL (or 0) and the real pointer value in invocations leads to an impossible choice between a sparse error about a literal 0 used as a NULL pointer, and a gcc warning about "pointer/integer type mismatch in conditional expression." Rather than clutter the source with more casts, move the ternary operator into a new INIT_UDATA_BUF_OR_NULL() macro, which makes it easier to use and simplifies its callers. Reported-by: Yann Droneaud <ydroneaud@opteya.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-17IB/core: Re-enable create_flow/destroy_flow uverbsMatan Barak
This commit reverts commit 7afbddfae993 ("IB/core: Temporarily disable create_flow/destroy_flow uverbs"). Since the uverbs extensions functionality was experimental for v3.12, this patch re-enables the support for them and flow-steering for v3.13. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-17IB/core: extended command: an improved infrastructure for uverbs commandsYann Droneaud
Commit 400dbc96583f ("IB/core: Infrastructure for extensible uverbs commands") added an infrastructure for extensible uverbs commands while later commit 436f2ad05a0b ("IB/core: Export ib_create/destroy_flow through uverbs") exported ib_create_flow()/ib_destroy_flow() functions using this new infrastructure. According to the commit 400dbc96583f, the purpose of this infrastructure is to support passing around provider (eg. hardware) specific buffers when userspace issue commands to the kernel, so that it would be possible to extend uverbs (eg. core) buffers independently from the provider buffers. But the new kernel command function prototypes were not modified to take advantage of this extension. This issue was exposed by Roland Dreier in a previous review[1]. So the following patch is an attempt to a revised extensible command infrastructure. This improved extensible command infrastructure distinguish between core (eg. legacy)'s command/response buffers from provider (eg. hardware)'s command/response buffers: each extended command implementing function is given a struct ib_udata to hold core (eg. uverbs) input and output buffers, and another struct ib_udata to hold the hw (eg. provider) input and output buffers. Having those buffers identified separately make it easier to increase one buffer to support extension without having to add some code to guess the exact size of each command/response parts: This should make the extended functions more reliable. Additionally, instead of relying on command identifier being greater than IB_USER_VERBS_CMD_THRESHOLD, the proposed infrastructure rely on unused bits in command field: on the 32 bits provided by command field, only 6 bits are really needed to encode the identifier of commands currently supported by the kernel. (Even using only 6 bits leaves room for about 23 new commands). So this patch makes use of some high order bits in command field to store flags, leaving enough room for more command identifiers than one will ever need (eg. 256). The new flags are used to specify if the command should be processed as an extended one or a legacy one. While designing the new command format, care was taken to make usage of flags itself extensible. Using high order bits of the commands field ensure that newer libibverbs on older kernel will properly fail when trying to call extended commands. On the other hand, older libibverbs on newer kernel will never be able to issue calls to extended commands. The extended command header includes the optional response pointer so that output buffer length and output buffer pointer are located together in the command, allowing proper parameters checking. This should make implementing functions easier and safer. Additionally the extended header ensure 64bits alignment, while making all sizes multiple of 8 bytes, extending the maximum buffer size: legacy extended Maximum command buffer: 256KBytes 1024KBytes (512KBytes + 512KBytes) Maximum response buffer: 256KBytes 1024KBytes (512KBytes + 512KBytes) For the purpose of doing proper buffer size accounting, the headers size are no more taken in account in "in_words". One of the odds of the current extensible infrastructure, reading twice the "legacy" command header, is fixed by removing the "legacy" command header from the extended command header: they are processed as two different parts of the command: memory is read once and information are not duplicated: it's making clear that's an extended command scheme and not a different command scheme. The proposed scheme will format input (command) and output (response) buffers this way: - command: legacy header + extended header + command data (core + hw): +----------------------------------------+ | flags | 00 00 | command | | in_words | out_words | +----------------------------------------+ | response | | response | | provider_in_words | provider_out_words | | padding | +----------------------------------------+ | | . <uverbs input> . . (in_words * 8) . | | +----------------------------------------+ | | . <provider input> . . (provider_in_words * 8) . | | +----------------------------------------+ - response, if present: +----------------------------------------+ | | . <uverbs output space> . . (out_words * 8) . | | +----------------------------------------+ | | . <provider output space> . . (provider_out_words * 8) . | | +----------------------------------------+ The overall design is to ensure that the extensible infrastructure is itself extensible while begin more reliable with more input and bound checking. Note: The unused field in the extended header would be perfect candidate to hold the command "comp_mask" (eg. bit field used to handle compatibility). This was suggested by Roland Dreier in a previous review[2]. But "comp_mask" field is likely to be present in the uverb input and/or provider input, likewise for the response, as noted by Matan Barak[3], so it doesn't make sense to put "comp_mask" in the header. [1]: http://marc.info/?i=CAL1RGDWxmM17W2o_era24A-TTDeKyoL6u3NRu_=t_dhV_ZA9MA@mail.gmail.com [2]: http://marc.info/?i=CAL1RGDXJtrc849M6_XNZT5xO1+ybKtLWGq6yg6LhoSsKpsmkYA@mail.gmail.com [3]: http://marc.info/?i=525C1149.6000701@mellanox.com Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com [ Convert "ret ? ret : 0" to the equivalent "ret". - Roland ] Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-10-21IB/core: Temporarily disable create_flow/destroy_flow uverbsYann Droneaud
The create_flow/destroy_flow uverbs and the associated extensions to the user-kernel verbs ABI are under review and are too experimental to freeze at this point. So userspace is not exposed to experimental features and an uinstable ABI, temporarily disable this for v3.12 (with a Kconfig option behind staging to reenable it if desired). The feature will be enabled after proper cleanup for v3.13. Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Link: http://marc.info/?i=cover.1381351016.git.ydroneaud@opteya.com Link: http://marc.info/?i=cover.1381177342.git.ydroneaud@opteya.com [ Add a Kconfig option to reenable these verbs. - Roland ] Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-28IB/core: Export ib_create/destroy_flow through uverbsHadar Hen Zion
Implement ib_uverbs_create_flow() and ib_uverbs_destroy_flow() to support flow steering for user space applications. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-28IB/core: Infrastructure for extensible uverbs commandsIgor Ivanov
Add infrastructure to support extended uverbs capabilities in a forward/backward manner. Uverbs command opcodes which are based on the verbs extensions approach should be greater or equal to IB_USER_VERBS_CMD_THRESHOLD. They have new header format and processed a bit differently. Whenever a specific IB_USER_VERBS_CMD_XXX is extended, which practically means it needs to have additional arguments, we will be able to add them without creating a completely new IB_USER_VERBS_CMD_YYY command or bumping the uverbs ABI version. This patch for itself doesn't provide the whole scheme which is also dependent on adding a comp_mask field to each extended uverbs command struct. The new header framework allows for future extension of the CMD arguments (ib_uverbs_cmd_hdr.in_words, ib_uverbs_cmd_hdr.out_words) for an existing new command (that is a command that supports the new uverbs command header format suggested in this patch) w/o bumping ABI version and with maintaining backward and formward compatibility to new and old libibverbs versions. In the uverbs command we are passing both uverbs arguments and the provider arguments. We split the ib_uverbs_cmd_hdr.in_words to ib_uverbs_cmd_hdr.in_words which will now carry only uverbs input argument struct size and ib_uverbs_cmd_hdr.provider_in_words that will carry the provider input argument size. Same goes for the response (the uverbs CMD output argument). For example take the create_cq call and the mlx4_ib provider: The uverbs layer gets libibverb's struct ibv_create_cq (named struct ib_uverbs_create_cq in the kernel), mlx4_ib gets libmlx4's struct mlx4_create_cq (which includes struct ibv_create_cq and is named struct mlx4_ib_create_cq in the kernel) and in_words = sizeof(mlx4_create_cq)/4 . Thus ib_uverbs_cmd_hdr.in_words carry both uverbs plus mlx4_ib input argument sizes, where uverbs assumes it knows the size of its input argument - struct ibv_create_cq. Now, if we wish to add a variable to struct ibv_create_cq, we can add a comp_mask field to the struct which is basically bit field indicating which fields exists in the struct (as done for the libibverbs API extension), but we need a way to tell what is the total size of the struct and not assume the struct size is predefined (since we may get different struct sizes from different user libibverbs versions). So we know at which point the provider input argument (struct mlx4_create_cq) begins. Same goes for extending the provider struct mlx4_create_cq. Thus we split the ib_uverbs_cmd_hdr.in_words to ib_uverbs_cmd_hdr.in_words which will now carry only uverbs input argument struct size and ib_uverbs_cmd_hdr.provider_in_words that will carry the provider (mlx4_ib) input argument size. Signed-off-by: Igor Ivanov <Igor.Ivanov@itseez.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-02-21IB/uverbs: Implement memory windows support in uverbsShani Michaeli
The existing user/kernel uverbs API has IB_USER_VERBS_CMD_ALLOC/DEALLOC_MW. Implement these calls, along with destroying user memory windows during process cleanup. Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Shani Michaeli <shanim@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-26switch simple cases of fget_light to fdgetAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26switch infinibarf users of fget() to fget_light()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03switch device_get_devnode() and ->devnode() to umode_t *Al Viro
both callers of device_get_devnode() are only interested in lower 16bits and nobody tries to return anything wider than 16bit anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-11-01Merge branches 'amso1100', 'cma', 'cxgb3', 'cxgb4', 'fdr', 'ipath', 'ipoib', ↵Roland Dreier
'misc', 'mlx4', 'misc', 'nes', 'qib' and 'xrc' into for-next
2011-10-13RDMA/uverbs: Export ib_open_qp() capability to user spaceSean Hefty
Allow processes that share the same XRC domain to open an existing shareable QP. This permits those processes to receive events on the shared QP and transfer ownership, so that any process may modify the QP. The latter allows the creating process to exit, while a remaining process can still transition it for path migration purposes. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-13RDMA/core: Export ib_open_qp() to share XRC TGT QPsSean Hefty
XRC TGT QPs are shared resources among multiple processes. Since the creating process may exit, allow other processes which share the same XRC domain to open an existing QP. This allows us to transfer ownership of an XRC TGT QP to another process. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-13RDMA/uverbs: Export XRC TGT QPs to user spaceSean Hefty
Allow user space to operate on XRC TGT QPs the same way as other types of QPs, with one notable exception: since XRC TGT QPs may be shared among multiple processes, the XRC TGT QP is allowed to exist beyond the lifetime of the creating process. The process that creates the QP is allowed to destroy it, but if the process exits without destroying the QP, then the QP will be left bound to the lifetime of the XRCD. TGT QPs are not associated with CQs or a PD. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-13RDMA/uverbs: Export XRC SRQs to user spaceSean Hefty
We require additional information to create XRC SRQs than we can exchange using the existing create SRQ ABI. Provide an enhanced create ABI for extended SRQ types. Based on patches by Jack Morgenstein <jackm@dev.mellanox.co.il> and Roland Dreier <roland@purestorage.com> Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-13RDMA/uverbs: Export XRC domains to user spaceSean Hefty
Allow user space to create XRC domains. Because XRCDs are expected to be shared among multiple processes, we use inodes to identify an XRCD. Based on patches by Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-06RDMA/ucm: Removed checks for unsigned value < 0Hefty, Sean
cmd is unsigned, no need to check for < 0. Found by code inspection. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-07-04RDMA: Check for NULL mode in .devnode methodsGoldwyn Rodrigues
Commits 71c29bd5c235 ("IB/uverbs: Add devnode method to set path/mode") and c3af0980ce01 ("IB: Add devnode methods to cm_class and umad_class") added devnode methods that set the mode. However, these methods don't check for a NULL mode, and so we get a crash when unloading modules because devtmpfs_delete_node() calls device_get_devnode() with mode == NULL. Add the missing checks. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de> [ Also fix cm.c. - Roland ] Signed-off-by: Roland Dreier <roland@purestorage.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-23IB/uverbs: Add devnode method to set path/modeRoland Dreier
We want udev to create a device node under /dev/infiniband with permission 0666 for uverbsX devices, so add a devnode method to set the appropriate info. Signed-off-by: Roland Dreier <roland@purestorage.com>
2010-04-21IB: Explicitly rule out llseek to avoid BKL in default_llseek()Roland Dreier
Several RDMA user-access drivers have file_operations structures with no .llseek method set. None of the drivers actually do anything with f_pos, so this means llseek is essentially a NOP, instead of returning an error as leaving other file_operations methods unimplemented would do. This is mostly harmless, except that a NULL .llseek means that default_llseek() is used, and this function grabs the BKL, which we would like to avoid. Since llseek does nothing useful on these files, we would like it to return an error to userspace instead of silently grabbing the BKL and succeeding. For nearly all of the file types, we take the belt-and-suspenders approach of setting the .llseek method to no_llseek and also calling nonseekable_open(); the exception is the uverbs_event files, which are created with anon_inode_getfile(), which already sets f_mode the same way as nonseekable_open() would. This work is motivated by Arnd Bergmann's bkl-removal tree. Signed-off-by: Roland Dreier <rolandd@cisco.com>