Tuesday 13 December 2011

Setting up SoftiWARP in Debian 6.0 from scratch

This is the second part of the series about RDMA. In the first part I talked about RDMA's history, evolution and current status.

As I mentioned in the last post (link), iWARP is RDMA on top of IP based networks. These can use Ethernet as the L2 technology but are not limited to it. SoftiWARP is a pure software implementation of the iWARP protocol. In the next section I will give a quick overview of the technical details of the iWARP protocol and describe how it is implemented in SoftiWARP.

Technical Details: The key advantages offered by RDMA are zero-copy networking and the removal of the OS and the application from the fast data path. RDMA achieves this by pinning the user buffer pages involved and by marking every network segment with a buffer identifier and an offset. Hence each received packet can be placed independently and immediately, because its position in the user buffer is known. However, for the NIC to do this, it must first process the IP, TCP and RDMA headers. Traditional Ethernet NICs can only process Ethernet headers and have no idea about higher layer protocols such as IP and TCP.

TCP socket based network communication is the de facto standard for data exchange on the Internet. Processing TCP headers in hardware (aka stateful offload) is a risky business (see Mogul'03 HotOS) and has met with fierce opposition from the community. That is why Linux has no out-of-the-box support for RDMA, and using it requires some patching or knowledge on the part of the user. For example, the port space collision between the in-kernel TCP stack and the offloaded stack in the NIC is still an unresolved issue. RDMA capable NICs (which can process IP, TCP and RDMA headers in hardware) are called RDMA NICs or RNICs. Hence RDMA traffic cannot be mixed with normal socket based TCP traffic, as it carries additional headers and information that enable an RNIC to place each segment directly in the user buffer.

SoftiWARP is a pure software implementation of the iWARP protocol on top of an unmodified Linux kernel. It enables an ordinary NIC without RDMA capabilities to handle RDMA traffic in software. It is wire compatible with an RNIC, hence you can use it in a mixed setup. SoftiWARP is just another RDMA provider inside the OFED stack and consists of a kernel driver and a user space library. It uses in-kernel TCP sockets for data transmission. Some more details about its transmit and receive paths:

  • Transmission Path: SoftiWARP uses per-core kernel threads to perform data transmission on behalf of user processes. When a user process posts a transmission request (post syscall), a request small enough to fit into the socket buffer is handed over to the TCP socket; otherwise the data is pushed to the socket in non-blocking mode until it hits -EAGAIN. At this point the post syscall returns and the QP is put on a wait work queue. The application is now free to do anything else. Each QP also registers a write space callback in order to be notified when more space becomes free in the socket buffer. Upon receiving that notification, it schedules the kernel thread to push data on behalf of the user process. The kernel thread pushes data until it hits -EAGAIN and then moves on to the next QP. This process repeats until the complete user buffer has been transmitted. At the end, the user process is notified asynchronously about the successful data transmission. Depending on the data transmission semantics, send or sendpage (zero copy) can be used. For example, a read response is always transmitted zero-copy using tcp_sendpage.
  • Receive Path: The SoftiWARP receive path is very simple. Each QP registers a socket callback function (sk->sk_data_ready), which is called at the end of the network stack's TCP processing (at the end of tcp_rcv_established). In the SoftiWARP code this function is siw_qp_llp_data_ready(). It processes the RDMA header, locates the pinned user buffers, calculates the buffer offset and, after checking access permissions, copies the data by calling skb_copy_bits().
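
The scheduling idea behind the transmission path can be mimicked with a toy model in plain shell. This is only a sketch, not SoftiWARP code: simulate_tx is a made-up name, each QP is reduced to a byte counter, and the free socket buffer space is assumed to be the same fixed number of bytes at every wakeup, which a real socket does not guarantee.

```shell
# Toy model of the transmit path (hypothetical helper, not part of SoftiWARP).
# Usage: simulate_tx SPACE BYTES_QP0 [BYTES_QP1 ...]
#   SPACE is the assumed free socket buffer space per push, in bytes.
simulate_tx() {
    space=$1
    shift
    [ "$space" -gt 0 ] || return 1

    # Build the work queue as "name:bytes_left" words.
    queue=""
    id=0
    for bytes in "$@"; do
        queue="$queue qp$id:$bytes"
        id=$((id + 1))
    done

    # Each pass stands for one round of write-space notifications: every QP
    # with data left pushes until the simulated socket buffer is full.
    while [ -n "$queue" ]; do
        next=""
        for entry in $queue; do
            name=${entry%%:*}
            left=${entry#*:}
            [ "$left" -gt 0 ] || continue
            chunk=$space
            if [ "$left" -lt "$chunk" ]; then chunk=$left; fi
            left=$((left - chunk))
            if [ "$left" -eq 0 ]; then
                echo "$name: pushed $chunk bytes, done, completion posted"
            else
                echo "$name: pushed $chunk bytes, hit EAGAIN, $left left"
                next="$next $name:$left"   # wait for the next notification
            fi
        done
        queue=$next
    done
}
```

Calling simulate_tx 4 10 3 models two QPs with 10 and 3 bytes to send and 4 bytes of free space per push: the small request completes on its first push, while the large one needs three pushes with other QPs served in between.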

Setup on Debian 6.0: In this section I will outline how to install OFED and SoftiWARP from scratch on a Debian 6.0 machine. I like to install things from source so that later I can check and see what is happening inside the code. I have a freshly installed Debian system with a vanilla kernel (2.6.36.2). Nothing fancy here. Make sure the kernel is compiled with (in make menuconfig) Device Drivers -> InfiniBand Support -> InfiniBand userspace MAD support and InfiniBand userspace access (verbs and CM).
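
If you are not sure how the running kernel was configured, a quick check along these lines may help. It is a sketch under two assumptions: the two menuconfig entries correspond to CONFIG_INFINIBAND_USER_MAD and CONFIG_INFINIBAND_USER_ACCESS, and a copy of the kernel config is available as a file (often /boot/config-$(uname -r), or .config in the source tree of a self-built kernel). The function name check_ib_config is made up for this example.

```shell
# Report whether the two userspace InfiniBand options are built in (=y) or
# built as modules (=m) in the given kernel config file.
# Returns non-zero if either option is missing.
check_ib_config() {
    cfg=$1
    rc=0
    for opt in CONFIG_INFINIBAND_USER_MAD CONFIG_INFINIBAND_USER_ACCESS; do
        if grep -Eq "^$opt=[ym]$" "$cfg"; then
            echo "$opt: enabled"
        else
            echo "$opt: missing"
            rc=1
        fi
    done
    return $rc
}

# Example call (the path is an assumption, adjust it to your system):
#   check_ib_config "/boot/config-$(uname -r)"
```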

Step 1: Install the OFED environment. This consists of installing librdmacm and libibverbs. I will install them from source.
# apt-get source libibverbs
# cd libibverbs-1.1.3
# ./configure
# make
# make install
Repeat the same steps for librdmacm. Both libraries should now be installed in /usr/local/lib. If required, add this directory to your LD_LIBRARY_PATH by putting the following in your .bashrc file:

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LD_LIBRARY_PATH
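
A slightly more defensive variant for .bashrc is sketched below. It appends the directory only once, even if the file is sourced several times, and avoids a leading ":" when the variable starts out empty (the loader treats an empty entry as the current directory). The helper name add_lib_path is made up for this example.

```shell
# Append a directory to LD_LIBRARY_PATH unless it is already listed.
add_lib_path() {
    dir=$1
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*)
            ;;  # already present, nothing to do
        *)
            # ${VAR:+...} adds the separator only when VAR is non-empty
            LD_LIBRARY_PATH=${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$dir
            ;;
    esac
    export LD_LIBRARY_PATH
}

# Example call:
#   add_lib_path /usr/local/lib
```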

Step 2: Install libsiw (the user space driver for the SoftiWARP device). You should have autoconf, automake and libtool installed. The steps are the same as before, but this time get the source from git:
# cd 'your directory of choice'
# git clone git://www.gitorious.org/softiwarp/userlib.git
# cd userlib
# ./autogen.sh
# ./autogen.sh      (yes, run it a second time)
# ./configure
# make
# make install

Step 3: Compile the kernel driver. Nothing fancy here. Everything should go through without any problem.
# cd 'your directory of choice'
# git clone git://www.gitorious.org/softiwarp/kernel.git
# cd kernel/softiwarp
# make

I do not recommend installing the module and tainting your kernel. It might be handy to write a shell script that runs insmod from this build location. See step 6.

Step 4: Set up the udev rules before inserting the kernel driver module and the OFED modules. Here is my copy of the udev rules in /etc/udev/rules.d/90-ib.rules:
KERNEL=="umad*", NAME="infiniband/%k", MODE="0666"
KERNEL=="issm*", NAME="infiniband/%k", MODE="0666"
KERNEL=="ucm*", NAME="infiniband/%k", MODE="0666"
KERNEL=="uverbs*", NAME="infiniband/%k", MODE="0666"
KERNEL=="ucma", NAME="infiniband/%k", MODE="0666"
KERNEL=="rdma_cm", NAME="infiniband/%k", MODE="0666"
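
Once the modules from the following steps are loaded, you can check whether these rules took effect. The sketch below assumes that the device nodes end up under /dev/infiniband, as the NAME= entries above request, and lists every node that is not readable and writable by everyone (MODE="0666"). The function name restricted_rdma_nodes is made up for this example.

```shell
# Print every node below DIR (default: /dev/infiniband) whose permissions
# are more restrictive than 0666. No output means the rules were applied.
restricted_rdma_nodes() {
    dir=${1:-/dev/infiniband}
    find "$dir" ! -type d ! -perm -0666
}

# Example call:
#   restricted_rdma_nodes
```

If a node shows up here, the rules file was most likely not picked up when the device was created; unloading and reloading the modules makes udev create the nodes again.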

Step 5: Load the OFED kernel environment by loading the following modules. I keep them in a shell script:
#!/bin/bash 
modprobe ib_addr
modprobe ib_cm
modprobe ib_core
modprobe ib_mad
modprobe ib_sa
modprobe ib_ucm
modprobe ib_umad
modprobe ib_uverbs
modprobe iw_cm
modprobe rdma_cm
modprobe rdma_ucm

Step 6: Load siw.ko from the directory where you built it in step 3:
# insmod siw.ko
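
Steps 5 and 6 can be combined into one small helper, along the lines of the script suggested in step 3. This is a sketch: load_rdma_stack and the DRY_RUN switch are made-up names, the module list is copied from step 5, and the build directory is whatever you chose in step 3. It has to run as root unless DRY_RUN is set.

```shell
# Modules from step 5, in the same order.
OFED_MODULES="ib_addr ib_cm ib_core ib_mad ib_sa ib_ucm ib_umad ib_uverbs iw_cm rdma_cm rdma_ucm"

# Usage: load_rdma_stack /path/to/kernel/softiwarp
# With DRY_RUN=1 the commands are only printed, not executed.
load_rdma_stack() {
    siw_dir=$1
    run=${DRY_RUN:+echo}
    for mod in $OFED_MODULES; do
        $run modprobe "$mod" || return 1   # stop at the first failure
    done
    # Load the SoftiWARP driver straight from the build directory.
    $run insmod "$siw_dir/siw.ko"
}
```

Running it once with DRY_RUN=1 is a cheap way to see exactly which commands would be executed before touching the running kernel.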

Result:
$ ibv_devices
    device          node GUID
    ------          ----------------
    siw_eth0        4437e668130c0000
    siw_lo          7369775f6c6f0000
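
If you want to check for the SoftiWARP devices from a script rather than by eye, a small filter over the ibv_devices output is enough. This sketch assumes the device names start with siw_, as in the listing above; list_siw_devices is a made-up name.

```shell
# Read ibv_devices output on stdin and print only the SoftiWARP device names.
list_siw_devices() {
    awk '$1 ~ /^siw_/ { print $1 }'
}

# Example call:
#   ibv_devices | list_siw_devices
```

An empty result means that siw.ko is not loaded or did not attach to any network interface.
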
You can also start RDMA traffic on the local system by running
$ rping -s
(in another shell)
$ rping -c -a 127.0.0.1 -v

This should print a long stream of lines like:
ping data: rdma-ping-7799: \]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP
ping data: rdma-ping-7800: ]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQ
ping data: rdma-ping-7801: ^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQR

You can stop it with Ctrl+C after some time. At this point you are good to go. Soon I will put up an example RDMA server-client program in part 3 of the series.

Troubleshooting: 
- "ibv_devices: error while loading shared libraries: libibverbs.so.1: cannot open shared object file: No such file or directory" 
  • Check if libs are in /usr/local/lib 
  • Check $LD_LIBRARY_PATH 
  • Run ldconfig on a newly installed system so that the loader learns about the new libs
- "libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1"
  • In some cases a faulty installation leaves the driver file in the wrong place. On my system, for example, all driver files are in the /usr/local/etc/libibverbs.d/ directory. A driver file is nothing special, just a simple file that tells libibverbs the name of the driver (and hence the library file name). I have the following files in my /usr/local/etc/libibverbs.d/
    -rw-r--r-- 1 root staff 13 Aug 30 11:38 cxgb4.driver
    -rw-r--r-- 1 root staff 11 Aug 30 12:31 siw.driver
    cxgb4.driver is for the Chelsio T4 RNIC, and siw.driver is for SoftiWARP. There is nothing fancy inside the file. siw.driver contains one line of text:
    driver siw
    Check the strace log to see whether libibverbs finds and tries to open this file for the device. If the file is missing, just create it yourself.
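
Creating the missing driver file by hand takes one line. The helper below is a sketch with a made-up name (write_siw_driver_file); the default directory is the one from my system and may differ on yours, for example if libibverbs was configured with another prefix.

```shell
# Create siw.driver with the single line "driver siw" in the given directory
# (default: /usr/local/etc/libibverbs.d). Needs write access to that directory.
write_siw_driver_file() {
    dir=${1:-/usr/local/etc/libibverbs.d}
    mkdir -p "$dir" && printf 'driver siw\n' > "$dir/siw.driver"
}

# Example call (as root):
#   write_siw_driver_file
```
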
- Permission denied errors such as:
  "rping -s
   CMA: unable to open RDMA device
   Segmentation fault"
  usually mean that you missed setting up the udev rules. In that case only root can access the RDMA devices. Also, on some Debian systems 50-udev.rules contains some RDMA related rules. Delete them!

- For more detailed debugging, try strace with -f, something like
  $ strace -f rping -s
  It shows in great detail what the system is doing: which files it opens, and which failed call led to the failure of the RDMA program. It is also useful for spotting misconfigured library paths, since you can see whether the dynamic loader checks all of them or not.
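
Since the strace output is long, it helps to filter it down to the file opens that failed. The sketch below is a plain grep over strace's usual line format; failed_opens is a made-up name, and the pattern only looks at open and openat calls.

```shell
# Read strace output on stdin and keep only open()/openat() calls that
# returned an error (strace prints these as "= -1 E...").
failed_opens() {
    grep -E 'open(at)?\(.*= -1 E'
}

# Example call (strace writes to stderr, hence the 2>&1):
#   strace -f rping -s 2>&1 | failed_opens
```

A missing siw.driver file or an inaccessible node under /dev/infiniband then shows up as an ENOENT or EACCES line.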

- If nothing works, then drop me an email! (atr AT zurich DOT ibm DOT com) :)

References: SoftiWARP is developed at the Systems Software Group, IBM Research Lab, Zurich. More details about it can be found at
- IBM website: http://www.zurich.ibm.com/sys/software/
- Gitorious: http://gitorious.org/softiwarp
- Wimpy Nodes with 10GbE: Leveraging One-Sided Operations in Soft RDMA to Boost Memcached. Patrick Stuedi, Animesh Trivedi, Bernard Metzler. Usenix ATC'12 (Short Paper), Boston, USA, June 2012.
- A Case for RDMA in Clouds: Turning Supercomputer Networking into Commodity. Animesh Trivedi, Bernard Metzler, Patrick Stuedi. ACM SIGOPS Asia-Pacific Workshop on Systems (APSys 2011), Shanghai, China, July 2011.


--
Disclaimer: The opinions expressed here are my own and do not necessarily represent those of my past or current employers. 

7 comments:

  1. Thanks a lot for this post. This worked like a charm!

  2. When I compiled the kernel part I got the error below

    make[1]: Entering directory `/usr/src/linux-headers-2.6.32-24-generic'
    CC [M] /home/alok/softiwarp/kernel/softiwarp/siw_main.o
    /home/alok/softiwarp/kernel/softiwarp/siw_main.c:600: error: unknown field 'dma_ops' specified in initializer
    make[2]: *** [/home/alok/softiwarp/kernel/softiwarp/siw_main.o] Error 1

    So I commented out the line below
    struct device siw_generic_dma_device = {
    //.archdata.dma_ops = &siw_dma_generic_ops,
    .init_name = "software-rdma",
    .release = siw_device_release
    };

    Compilation went fine, but now when I do rping 127.0.0.1 it crashes and throws this error

    [ 206.656119] BUG: unable to handle kernel NULL pointer dereference at 00000004
    [ 206.656119] IP: [] nommu_map_sg+0xb8/0x130
    [ 206.656119] *pde = 00000000
    [ 206.656119] Oops: 0000 [#1] SMP
    [ 206.656119] last sysfs file: /sys/devices/software-rdma/infiniband/siw_lo/node_guid
    [ 206.656119] Modules linked in: siw rdma_ucm rdma_cm iw_cm ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad ib_core ib_addr binfmt_misc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd ppdev fbcon tileblit font bitblit softcursor parport_pc joydev vga16fb lp soundcore snd_page_alloc psmouse serio_raw vgastate i2c_piix4 parport usbhid e1000 ahci hid
    [ 206.656119]
    [ 206.656119] Pid: 1581, comm: rping Not tainted (2.6.32-24-generic #39-Ubuntu) VirtualBox
    [ 206.656119] EIP: 0060:[] EFLAGS: 00010246 CPU: 0
    [ 206.656119] EIP is at nommu_map_sg+0xb8/0x130
    [ 206.656119] EAX: f52353d0 EBX: f8397220 ECX: 00000000 EDX: 3e6b0000
    [ 206.656119] ESI: 00001000 EDI: 00000000 EBP: f5121e04 ESP: f5121dd4
    [ 206.656119] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
    [ 206.656119] Process rping (pid: 1581, ti=f5120000 task=f4ecf2c0 task.ti=f5120000)
    [ 206.656119] Stack:
    [ 206.656119] c012a498 f5121dec c058ccdf f5121dec 00000282 ff839000 f5121e04 00000001
    [ 206.656119] <0> 00000000 c0766c80 00000001 00000000 f5121e74 f82a22b6 00000000 f5121e64
    [ 206.656119] <0> 00000000 f4467000 f4468000 00000001 f5235598 00000000 c0766c80 f8397220
    [ 206.656119] Call Trace:
    [ 206.656119] [] ? default_spin_lock_flags+0x8/0x10
    [ 206.656119] [] ? _spin_lock_irqsave+0x2f/0x50
    [ 206.656119] [] ? ib_umem_get+0x2d6/0x464 [ib_core]
    [ 206.656119] [] ? siw_reg_user_mr+0xdc/0x1f0 [siw]
    [ 206.656119] [] ? down_read+0x10/0x20
    [ 206.656119] [] ? ib_uverbs_reg_mr+0x14b/0x250 [ib_uverbs]
    [ 206.656119] [] ? copy_from_user+0x3d/0x130
    [ 206.656119] [] ? ib_uverbs_reg_mr+0x0/0x250 [ib_uverbs]
    [ 206.656119] [] ? ib_uverbs_write+0xb3/0xd0 [ib_uverbs]
    [ 206.656119] [] ? vfs_write+0xa2/0x1a0
    [ 206.656119] [] ? ib_uverbs_write+0x0/0xd0 [ib_uverbs]
    [ 206.656119] [] ? sys_write+0x42/0x70
    [ 206.656119] [] ? syscall_call+0x7/0xb
    [ 206.656119] Code: 8b d0 00 00 00 85 c9 74 22 01 f2 3b 79 04 72 b3 3b 11 76 af c7 45 ec 00 00 00 00 8b 45 ec 83 c4 24 5b 5e 5f 5d c3 90 8d 74 26 00 <8b> 41 04 8b 19 83 f8 00 76 36 89 5c 24 14 89 44 24 18 89 74 24
    [ 206.656119] EIP: [] nommu_map_sg+0xb8/0x130 SS:ESP 0068:f5121dd4
    [ 206.656119] CR2: 0000000000000004
    [ 206.656119] ---[ end trace 26d3f13b4e290b77 ]---

  3. Looks like CONFIG_X86_DEV_DMA_OPS is enabled by default for 32-bit. I am using 32-bit Ubuntu 10.04. Trying to use 64-bit.
    dma_ops

  4. Typo: CONFIG_X86_DEV_DMA_OPS is enabled by default for 64-bit, not for 32-bit.

  5. Now stuck at rping if I do the following

    root@ubuntu:~# rping -c -a 127.0.0.1
    cma event RDMA_CM_EVENT_CONNECT_ERROR, error -22

  6. Excellent! This article is nice...
    Thanks for sharing....

  7. Good post.. Thanks for sharing with us.
