This is the second part of the series about RDMA. In the first part I talked about RDMA's history, evolution, and current status.
As I mentioned in the last post (link), iWARP is RDMA on top of IP-based networks, which can use Ethernet as the L2 technology but are not limited to it. SoftiWARP is a pure software implementation of the iWARP protocol. In the next section I will give a quick overview of the technical details of the iWARP protocol and describe how it is implemented in SoftiWARP.
Technical Details: The key advantages offered by RDMA are zero-copy networking and removal of the OS and application from the fast data path. RDMA achieves this by pinning the user buffer pages involved and marking every network segment with a buffer identifier and an offset. Hence each received packet can be placed independently and immediately by identifying its position in the user buffer. For the NIC to do this, however, it must first process the IP, TCP, and RDMA headers. Traditional Ethernet NICs only process Ethernet headers and know nothing about higher-layer protocols such as IP and TCP. TCP socket based communication is the de facto standard for data exchange on the Internet, but processing TCP headers in hardware (aka stateful offload) is a risky business (see Mogul'03 HotOS) and has met with fierce opposition from the community. That is why Linux has no out-of-the-box support for RDMA, and using it requires some patching or knowledge on the part of the user. For example, port space collision between the in-kernel TCP stack and the offloaded stack in the NIC is still an unresolved issue. RDMA-capable NICs (which can process IP, TCP, and RDMA headers in hardware) are called RDMA NICs or RNICs. RDMA traffic cannot be mixed with normal socket-based TCP traffic, because it carries additional headers and information that enable an RNIC to place each segment directly in the user buffer.
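To make the "buffer identifier plus offset" idea a bit more concrete, here is a toy shell sketch. It is not how an RNIC or SoftiWARP works internally; it only illustrates that once every segment carries its own offset, segments can be written into the destination buffer in any arrival order. The function name place_segment and the use of a temporary file as the "user buffer" are my own assumptions for illustration.

```shell
#!/bin/bash
# Toy model of tagged placement: each "segment" is (offset, payload).
# place_segment is a hypothetical helper, not part of any RDMA library.
place_segment() {
    local buffer_file="$1" offset="$2" payload="$3"
    # conv=notrunc keeps bytes that earlier segments already placed.
    printf '%s' "$payload" | dd of="$buffer_file" bs=1 seek="$offset" conv=notrunc 2>/dev/null
}

buffer=$(mktemp)   # stands in for the pinned user buffer
# Segments arrive out of order; the offset alone decides where each one lands.
place_segment "$buffer" 6 "world"
place_segment "$buffer" 0 "hello"
place_segment "$buffer" 5 " "
cat "$buffer"; echo
rm -f "$buffer"
```

No reassembly queue is needed in this model: the receiver never has to wait for an earlier segment before placing a later one.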
SoftiWARP is a pure software implementation of the iWARP protocol on top of an unmodified Linux kernel. It enables an ordinary NIC without RDMA capabilities to handle RDMA traffic in software. It is wire compatible with an RNIC, so you can use it in a mixed setup. SoftiWARP is just another RDMA provider inside the OFED stack and consists of a kernel driver and a user space library. It uses in-kernel TCP sockets for data transmission. Some more details about its transmit and receive paths:
- Transmission Path: SoftiWARP uses per-core kernel threads to perform data transmission on behalf of user processes. When a user process posts a transmission request (post syscall), a request small enough to fit into the socket buffer is handed over to the TCP socket directly; otherwise data is pushed to the socket in non-blocking mode until it hits -EAGAIN. At this point the post syscall returns and the QP is put on a wait work queue. The application is now free to do anything else. Each QP also registers a write space callback in order to be notified when more free space becomes available in the socket buffer. Upon receiving this notification, it schedules the kernel thread to push data on behalf of the user process. The kernel thread pushes data until it hits -EAGAIN and then moves on to the next QP. This process repeats until the complete user buffer has been transmitted. At the end, the user process is notified asynchronously about the successful data transmission. Depending on the data transmission semantics, send or sendpage (zcopy) can be used. For example, a read response is always transmitted zero-copy using tcp_sendpage.
- Receive Path: The SoftiWARP receive path is very simple. Each QP registers a socket callback function (sk->sk_data_ready), which is called at the end of the netstack's TCP processing (at the end of tcp_rcv_established). In the SoftiWARP code this function is siw_qp_llp_data_ready(). It processes the RDMA header, locates the pinned user buffers, calculates the buffer offset, and, after checking access permissions, copies the data by calling skb_copy_bits().
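The push-until-EAGAIN loop of the transmission path can be sketched as a toy model in shell. This is only a simulation for a single QP under simplifying assumptions of mine: the socket buffer is completely free again every time the write space callback fires, and there is no second QP to move on to. The name simulate_tx and both of its parameters are hypothetical and do not appear in the SoftiWARP code.

```shell
#!/bin/bash
# Toy model of the transmit loop for one QP (not SoftiWARP code).
# $1 = message size in bytes, $2 = socket buffer size in bytes.
# Assumption: every write-space callback finds the whole buffer free again.
simulate_tx() {
    local remaining="$1" sndbuf="$2" round=0 chunk who
    while [ "$remaining" -gt 0 ]; do
        round=$((round + 1))
        # Push as much as currently fits into the socket buffer.
        if [ "$remaining" -le "$sndbuf" ]; then chunk=$remaining; else chunk=$sndbuf; fi
        remaining=$((remaining - chunk))
        # The first push runs in the post syscall, later ones in the kernel thread.
        if [ "$round" -eq 1 ]; then who="post syscall"; else who="kernel thread"; fi
        if [ "$remaining" -gt 0 ]; then
            echo "round $round ($who): pushed $chunk, -EAGAIN, $remaining left, QP waits for write space"
        else
            echo "round $round ($who): pushed $chunk, done, completion signalled to user"
        fi
    done
}

simulate_tx 10000 4096
```

With a 10000 byte message and a 4096 byte buffer the model needs three rounds: only the first one costs the application any time, the other two run in the kernel thread after write space notifications.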
Setup on Debian 6.0: In this section I will outline how to install OFED and SoftiWARP from scratch on a Debian 6.0 machine. I like to install things from source so that later I can check what is happening inside the code. I have a freshly installed Debian system with a vanilla kernel (2.6.36.2). Nothing fancy here. Make sure you compile in (from make menuconfig) Device Drivers -> InfiniBand Support -> InfiniBand userspace MAD support and InfiniBand userspace access (verbs and CM).
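If you are not sure whether your kernel was built with these options, a small check against the kernel's .config file can help. The sketch below assumes the menu entries above correspond to the Kconfig symbols CONFIG_INFINIBAND, CONFIG_INFINIBAND_USER_MAD and CONFIG_INFINIBAND_USER_ACCESS; the function name check_ib_config is made up for this example, and where your config file lives depends on how you built the kernel.

```shell
#!/bin/bash
# Report whether the InfiniBand options are enabled (=y or =m) in a kernel
# .config file. check_ib_config is a hypothetical helper for this post.
check_ib_config() {
    local config_file="$1" opt status=0
    for opt in CONFIG_INFINIBAND CONFIG_INFINIBAND_USER_MAD CONFIG_INFINIBAND_USER_ACCESS; do
        if grep -Eq "^${opt}=[ym]$" "$config_file"; then
            echo "ok       $opt"
        else
            echo "MISSING  $opt"
            status=1
        fi
    done
    return "$status"
}

# Example (path is an assumption, adjust to your kernel tree):
#   check_ib_config /usr/src/linux-2.6.36.2/.config
```

The function returns non-zero if any option is missing, so it can also be used as a guard at the top of a setup script.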
Step 1: Install the OFED environment. This consists of installing librdmacm and libibverbs. I will install them from source.
# apt-get source libibverbs
# cd libibverbs-1.1.3
# ./configure
# make
# make install
Repeat the same steps for librdmacm. Both libraries should now be installed in /usr/local/lib. If required, add this directory to your LD_LIBRARY_PATH by putting the following in your .bashrc file:
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LD_LIBRARY_PATH
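One small caveat with the two lines above: if LD_LIBRARY_PATH is empty they produce ":/usr/local/lib", and the dynamic loader treats the empty entry before the colon as the current directory. Re-sourcing .bashrc also appends the directory again each time. A slightly more careful variant could look like the following sketch; append_path_once is a name I made up for it.

```shell
#!/bin/bash
# Append a directory to a colon-separated path list only if it is not
# already present, and without creating an empty entry.
# append_path_once is a hypothetical helper, not a standard command.
append_path_once() {
    local list="$1" dir="$2"
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;        # already present
        "::")       printf '%s\n' "$dir" ;;         # list was empty
        *)          printf '%s\n' "$list:$dir" ;;   # append
    esac
}

LD_LIBRARY_PATH=$(append_path_once "${LD_LIBRARY_PATH:-}" /usr/local/lib)
export LD_LIBRARY_PATH
```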
Step 2: Install libsiw (the user space driver for the SoftiWARP device). You should have autoconf, automake, and libtool installed. The steps are the same as before, but get the source from git:
# cd 'your directory of choice'
# git clone git://www.gitorious.org/softiwarp/userlib.git
# cd userlib
# ./autogen.sh
# ./autogen.sh (again)
# ./configure
# make
# make install
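Before running autogen.sh it can save some confusion to check that the build tools are actually on your PATH. A minimal sketch follows; missing_tools is a hypothetical helper, and the exact list of commands is my assumption (in particular I check for libtoolize, which is what autogen-style scripts commonly call, but I have not verified what this particular autogen.sh invokes).

```shell
#!/bin/bash
# Print every command from the argument list that cannot be found on PATH.
# missing_tools is a made-up helper name for this post.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed tool list; adjust it to whatever your build actually needs.
missing_tools git autoconf automake libtoolize make
```

If the last command prints nothing, all the listed tools were found.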
Step 3: Get the kernel driver compiled. Nothing fancy here. Everything should go without any problem.
# cd 'your directory of choice'
# git clone git://www.gitorious.org/softiwarp/kernel.git
# cd kernel/softiwarp
# make
I do not recommend installing the module into your system and tainting your kernel. It is handier to write a shell script that runs insmod from this build location. See step 6.
Step 4: Set up the udev rules before inserting the kernel driver modules and the OFED modules. Here is my copy of the udev rules in /etc/udev/rules.d/90-ib.rules:
KERNEL=="umad*", NAME="infiniband/%k", MODE="0666"
KERNEL=="issm*", NAME="infiniband/%k", MODE="0666"
KERNEL=="ucm*", NAME="infiniband/%k", MODE="0666"
KERNEL=="uverbs*", NAME="infiniband/%k", MODE="0666"
KERNEL=="ucma", NAME="infiniband/%k", MODE="0666"
KERNEL=="rdma_cm", NAME="infiniband/%k", MODE="0666"
Step 5: Load the OFED kernel environment by loading these modules. I have them in a shell script:
#!/bin/bash
modprobe ib_addr
modprobe ib_cm
modprobe ib_core
modprobe ib_mad
modprobe ib_sa
modprobe ib_ucm
modprobe ib_umad
modprobe ib_uverbs
modprobe iw_cm
modprobe rdma_cm
modprobe rdma_ucm
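After running the script it is worth confirming that the modules really are loaded. The sketch below compares a list of module names against /proc/modules; missing_modules is a hypothetical helper name. Note that anything compiled into the kernel (=y) instead of built as a module will not show up in /proc/modules, so a name reported here is not necessarily a problem.

```shell
#!/bin/bash
# Print every module name from the argument list that has no entry in a
# /proc/modules-style file. missing_modules is a made-up helper name.
missing_modules() {
    local modules_file="$1" mod
    shift
    for mod in "$@"; do
        # Each line of /proc/modules starts with the module name and a space.
        grep -q "^${mod} " "$modules_file" || echo "$mod"
    done
}

# Example:
#   missing_modules /proc/modules ib_core ib_uverbs iw_cm rdma_cm rdma_ucm
```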
Step 6: Load siw.ko from where you built it in step 3:
# insmod siw.ko
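The helper script mentioned in step 3 could look roughly like this. It is only a sketch: load_siw and the DRY_RUN variable are names I made up, and the build directory in the example is an assumption. With DRY_RUN set, it only prints the command instead of running it, which is convenient for checking the path without being root.

```shell
#!/bin/bash
# Load siw.ko straight from the build directory instead of installing it.
# load_siw and DRY_RUN are hypothetical names for this sketch.
load_siw() {
    local build_dir="${1:-$PWD}"
    local module="$build_dir/siw.ko"
    if [ ! -f "$module" ]; then
        echo "siw.ko not found in $build_dir (did make in step 3 succeed?)" >&2
        return 1
    fi
    if [ -n "${DRY_RUN:-}" ]; then
        echo "insmod $module"      # only show what would be done
    else
        insmod "$module"           # needs root
    fi
}

# Example (directory is an assumption):
#   DRY_RUN=1 load_siw "$HOME/src/kernel/softiwarp"
```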
Result:
$ ibv_devices
    device              node GUID
    ------              ----------------
    siw_eth0            4437e668130c0000
    siw_lo              7369775f6c6f0000
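If you want to check this from a script, you can filter the ibv_devices output for SoftiWARP devices. The sketch below assumes the devices are named siw_ followed by the network interface name, as in the output above; siw_devices is a hypothetical helper name.

```shell
#!/bin/bash
# Read ibv_devices output on stdin and print the SoftiWARP device names.
# Assumption: SoftiWARP devices are named siw_<interface>.
siw_devices() {
    awk '$1 ~ /^siw_/ { print $1 }'
}

# Example:
#   ibv_devices | siw_devices
```

An empty result means either the siw module is not loaded or libibverbs could not find the user space driver.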
You can also start RDMA traffic on the local system:
$ rping -s
(in another shell)
$ rping -c -a 127.0.0.1 -v
This should print a long stream of lines like:
ping data: rdma-ping-7799: \]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP
ping data: rdma-ping-7800: ]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQ
ping data: rdma-ping-7801: ^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQR
You can press Ctrl-C after some time. At this point you are good to go. I will soon post an example RDMA server-client program in part 3 of the series.
Troubleshooting:
- "ibv_devices: error while loading shared libraries: libibverbs.so.1: cannot open shared object file: No such file or directory"
  - Check that the libraries are in /usr/local/lib.
  - Check $LD_LIBRARY_PATH.
  - Run ldconfig on a newly installed system so that it learns about the new libraries.
- "libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1"
  - In some peculiar cases, due to a wrong installation, the driver file is not where libibverbs looks for it. For example, on my system all driver files are in the /usr/local/etc/libibverbs.d/ directory. A driver file is nothing special: it is a simple file that tells libibverbs the name of the driver (and hence the library file name). I have the following files in my /usr/local/etc/libibverbs.d/:
    -rw-r--r-- 1 root staff 13 Aug 30 11:38 cxgb4.driver
    -rw-r--r-- 1 root staff 11 Aug 30 12:31 siw.driver
    cxgb4.driver is for the Chelsio T4 RNIC, and siw.driver is for SoftiWARP. There is nothing fancy inside the file. siw.driver contains one line of text:
    driver siw
    Check the strace log to see whether libibverbs finds and tries to open this file for the device. If the file is missing, just create it yourself.
- Permission denied errors such as:
  "rping -s
  CMA: unable to open RDMA device
  Segmentation fault"
  usually mean that you missed setting up the udev rules (step 4). In this case only root can access the RDMA devices. Also, on some Debian systems 50-udev.rules contains some RDMA-related rules. Delete them!
- For more detailed debugging, try using strace with -f, something like:
  $ strace -f rping -s
  It will give you tons of detail about what the system is doing, which files it opens, and which failed call led to the failure of the RDMA program. It is also useful for spotting misconfigured library paths, since you can see whether ld is checking all of them or not.
- If nothing works, then drop me an email! (atr AT zurich DOT ibm DOT com) :)
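Two of the checks above are easy to script. The following sketch creates the siw.driver file if it is missing and lists device nodes that are not readable and writable by everyone, which is what the MODE="0666" udev rules from step 4 are supposed to guarantee. Both function names are made up for this post, and the directories in the examples are the ones from my system, so yours may differ.

```shell
#!/bin/bash
# Create <dir>/siw.driver containing the single line "driver siw" if it
# does not exist yet. ensure_siw_driver_file is a hypothetical helper.
ensure_siw_driver_file() {
    local dir="$1"
    if [ -f "$dir/siw.driver" ]; then
        return 0
    fi
    mkdir -p "$dir" && printf 'driver siw\n' > "$dir/siw.driver"
}

# Print entries below <dir> that lack read/write permission for "others".
# restricted_nodes is a hypothetical helper.
restricted_nodes() {
    local dir="$1"
    find "$dir" ! -type d ! -perm -006
}

# Examples (paths are from my setup):
#   ensure_siw_driver_file /usr/local/etc/libibverbs.d
#   restricted_nodes /dev/infiniband
```

If restricted_nodes prints anything, the udev rules were probably not applied to those nodes, and non-root users will hit the permission errors described above.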
References: SoftiWARP is developed at the Systems Software Group, IBM Research Lab, Zurich. More details about it can be found at:
- IBM website: http://www.zurich.ibm.com/sys/software/
- Gitorious: http://gitorious.org/softiwarp
- Wimpy Nodes with 10GbE: Leveraging One-Sided Operations in Soft RDMA to Boost Memcached. Patrick Stuedi, Animesh Trivedi, Bernard Metzler, Usenix ATC'12 (Short Paper), Boston, USA, June 2012.
- A Case for RDMA in Clouds: Turning Supercomputer Networking into Commodity, Animesh Trivedi, Bernard Metzler, Patrick Stuedi ACM SIGOPS Asia-Pacific Workshop on Systems (APSys 2011), Shanghai, China, July 2011.
--
Disclaimer: The opinions expressed here are my own and do not necessarily represent those of my past or current employers.