Wednesday, 5 December 2012

Manage CPU cores in Linux

This is going to be a short post. There are two things I need to do frequently with the many-core machines that I have:

* Switch cores on/off to perform experiments in a more controlled environment.
* Scale their frequency to match performance requirements.

For the first, there is a simple trick. All CPU cores in Linux show up in sysfs:

#ls /sys/devices/system/cpu/
cpu0/       cpu1/       cpu2/       cpu3/       cpufreq/    cpuidle/    kernel_max  microcode/  modalias    offline     online      possible    present     probe       release     uevent      

As you can see, there are 4 directories for the 4-core system that I have. In each of the cpu* directories I have:

#ls /sys/devices/system/cpu/cpu0/
cache  cpufreq crash_notes  microcode node0  online  subsystem thermal_throttle  topology  uevent

So one just has to echo 1 (for ON) or 0 (for OFF) into the online file. This is a sysfs file, and writing to it triggers an action inside the kernel; in our case, switching the CPU off:

#echo 0 > /sys/devices/system/cpu/cpu0/online

For this to work, your kernel must support dynamic hotplugging of CPUs (CONFIG_HOTPLUG_CPU). Mind that CPU0 has a special status and you cannot switch it off. Linux is smart enough that, when running on a single core, it switches to uniprocessor (UP) code.
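The echo trick above can be wrapped in a small shell helper. This is just a sketch: the CPU_SYSFS override is my own addition, purely so the function can be exercised against a fake directory tree; on a real machine, leave it unset and run as root.

```shell
# set_cpu CORE STATE - echo 1 (online) or 0 (offline) into the core's
# sysfs 'online' file. CPU_SYSFS may point at a fake tree for dry runs.
set_cpu() {
    cpu=$1; state=$2
    f="${CPU_SYSFS:-/sys/devices/system/cpu}/cpu${cpu}/online"
    if [ -w "$f" ]; then
        echo "$state" > "$f"
    else
        echo "cannot write $f (are you root? CPU0 is not hot-pluggable)" >&2
        return 1
    fi
}
```

For example, `for c in 1 2 3; do set_cpu "$c" 0; done` would leave only CPU0 running.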

For the second, you need the associated processor power-state driver. Most modern Intel processors work with the P-state acpi_cpufreq.ko driver. For other driver options, check:

'make menuconfig' -> Power Management and ACPI Options -> CPU Frequency Scaling
-> x86 CPU Frequency Scaling Drivers  

Now load the driver. To regulate the frequency from userspace you need a tool called 'cpufrequtils'. Install it. With the driver and the tool in place, cpufreq-info reports something like:

cpufrequtils 007: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to, please.
analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0 1 2 3
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 10.0 us.
  hardware limits: 1.60 GHz - 2.53 GHz
  available frequency steps: 2.53 GHz, 2.39 GHz, 2.26 GHz, 2.13 GHz, 2.00 GHz, 1.86 GHz, 1.73 GHz, 1.60 GHz
  available cpufreq governors: conservative, ondemand, userspace, performance
  current policy: frequency should be within 1.60 GHz and 2.53 GHz.
                  The governor "userspace" may decide which speed to use
                  within this range.
  current CPU frequency is 1.86 GHz (asserted by call to hardware).

As you can see, there are multiple frequencies I can choose from. You cannot set an arbitrary frequency; it has to be one from the set. To set the frequency:

#cpufreq-set -c 0 -f 1.86GHz

where -c is the core number and -f is the frequency.
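cpufreq-set is just a front-end for the cpufreq files in sysfs, so the same thing can be done by hand. A sketch, assuming the acpi-cpufreq driver is loaded; sysfs frequencies are in kHz, and the FREQ_SYSFS override is my own addition so the function can be tried against a fake tree.

```shell
# set_core_freq CORE KHZ - select the userspace governor and pin the
# core to one of the steps listed in scaling_available_frequencies.
set_core_freq() {
    core=$1; khz=$2
    d="${FREQ_SYSFS:-/sys/devices/system/cpu}/cpu${core}/cpufreq"
    echo userspace > "$d/scaling_governor" || return 1
    echo "$khz" > "$d/scaling_setspeed" || return 1
}
```

For example, `set_core_freq 0 1600000` (as root) pins core 0 to 1.60 GHz.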

And that's about it. For more details, see the Linux kernel documentation. Most of the other ACPI and power/performance related options are in
'make menuconfig' -> Power Management and ACPI Options


Tuesday, 13 December 2011

Setting up SoftiWARP in Debian 6.0 from scratch

This is the second part of a series about RDMA. In the first part I talked about RDMA's history, evolution, and current status.

As I mentioned in the last post (link), iWARP is RDMA on top of IP-based networks; it can use Ethernet as the L2 technology but is not necessarily limited to it. SoftiWARP is a pure software implementation of the iWARP protocol. In the next section I will give a quick overview of the technical details of the iWARP protocol and describe how it is implemented in SoftiWARP.

Technical Details: The key advantages offered by RDMA are zero-copy networking and removal of the OS/application from the fast data path. RDMA achieves this by pinning the user buffer pages involved and marking all network segments with buffer-identifier and offset information. Hence each received packet can be placed independently and immediately by identifying its position in the user buffer. However, for the NIC to do this, it must first process the IP, TCP, and RDMA headers. Traditional Ethernet NICs can only process Ethernet headers and have no idea about higher-layer protocols such as IP and TCP. TCP-socket-based communication is the de-facto standard for data exchange on the Internet, and processing TCP headers in hardware (aka stateful offload) is risky business (see Mogul, HotOS'03); it has met with fierce opposition from the community. That is why there is no out-of-the-box support for it in Linux, and using RDMA requires some patching or knowledge on the part of the users. For example, the port-space collision between the in-kernel TCP stack and the offloaded stack in the NIC is still an unresolved issue. RDMA-capable NICs (which can process IP, TCP, and RDMA headers in hardware) are called RDMA NICs, or RNICs. RDMA traffic cannot be mixed with normal socket-based TCP traffic, as it carries the additional headers and information that enable an RNIC to place each segment directly in the user buffer.

SoftiWARP is a pure software implementation of the iWARP protocol on top of an unmodified Linux kernel. It enables an ordinary NIC, without RDMA capabilities, to handle RDMA traffic in software. It is wire-compatible with an RNIC, so you can use it in a mixed setup. SoftiWARP is just another RDMA provider inside the OFED stack and consists of a kernel driver and a user-space library. It uses in-kernel TCP sockets for data transmission. Some more details about its transmit and receive paths:

  • Transmission Path: SoftiWARP uses per-core kernel threads to perform data transmission on behalf of user processes. When a user process posts a transmission request (a post syscall), if the request is small enough to fit into the socket buffer, it is handed over to the TCP socket; otherwise data is pushed to the socket in non-blocking mode until it hits -EAGAIN. At this point the post syscall returns and the QP is put on a wait work queue; the application is now free to do anything else. Each QP also registers a write-space callback in order to be notified when there is more free space in the socket buffer. Upon such a notification, the kernel thread is scheduled to push data on behalf of the user process. The kernel thread pushes data until it hits -EAGAIN and then moves to the next QP; this is repeated until the complete user buffer is transmitted, and the user process is notified asynchronously about the successful transmission in the end. Depending on the data transmission semantics, send or sendpage (zero-copy) can be used. For example, a read-response transmission is always zero-copy, using tcp_sendpage.
  • Receive Path: The SoftiWARP receive path is very simple. Each QP registers a socket callback (sk->sk_data_ready) which is invoked at the end of the network stack's TCP processing (called from the end of tcp_rcv_established). In the SoftiWARP code this function is siw_qp_llp_data_ready(). It processes the RDMA header, locates the pinned user buffers, calculates the buffer offset and, after checking access permissions, copies the data by calling skb_copy_bits().

Setup on Debian 6.0: In this section I will outline how to install OFED and SoftiWARP from scratch on a Debian 6.0 machine. I like to install things from source, so that later you can check and see what is happening inside the code. I have a freshly installed Debian system with a vanilla kernel. Nothing fancy here. Make sure you compile in (from make menuconfig) -> Device Drivers -> InfiniBand support -> InfiniBand userspace MAD support, and InfiniBand userspace access (verbs and CM).

Step 1: Install the OFED environment. This consists of installing librdmacm and libibverbs. I will install them from source.
#apt-get source libibverbs 
# cd libibverbs-1.1.3
# ./configure 
# make 
# make install 
Same steps for librdmacm. Now they should be installed in /usr/local/lib. If required, include this directory in your LD_LIBRARY_PATH by adding a line like this to your .bashrc:

export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

Step 2: Install libsiw (the user-space driver for the SoftiWARP device). You should have autoconf, automake, and libtool installed. Same steps as above, but get the source from git:
# cd 'your directory of choice' 
#git clone git:// 
#cd userlib 
#./ (again) 
#make install 

Step 3: Compile the kernel driver. Nothing fancy here; everything should go through without problems.
# cd 'your directory of choice' 
#git clone git:// 
# cd kernel/softiwarp

I do not recommend installing the module and tainting your kernel. It might be handy to write a shell script that does the insmod from this build location. See step 6.

Step 4: Setup udev rules before inserting the kernel driver modules and OFED modules. Here is my copy of udev rules at /etc/udev/rules.d/90-ib.rules 
KERNEL=="umad*", NAME="infiniband/%k", MODE="0666"
KERNEL=="issm*", NAME="infiniband/%k", MODE="0666"
KERNEL=="ucm*", NAME="infiniband/%k", MODE="0666"
KERNEL=="uverbs*", NAME="infiniband/%k", MODE="0666"
KERNEL=="ucma", NAME="infiniband/%k", MODE="0666"
KERNEL=="rdma_cm", NAME="infiniband/%k", MODE="0666"

Step 5: Load the OFED kernel environment by inserting these modules. I have them in a shell script:
modprobe ib_addr
modprobe ib_cm
modprobe ib_core
modprobe ib_mad
modprobe ib_sa
modprobe ib_ucm
modprobe ib_umad
modprobe ib_uverbs
modprobe iw_cm
modprobe rdma_cm
modprobe rdma_ucm

Step 6: Load siw.ko from where you built it in step 3:
#insmod siw.ko

Result:  $ibv_devices 
    device             node GUID
    ------           ----------------
    siw_eth0       4437e668130c0000
    siw_lo           7369775f6c6f0000
Moreover, you can also run RDMA traffic on the local system:
$rping -s 
(on another shell) 
$rping -c -a <server_ip> -v 

This should give you lots of output, something like:
ping data: rdma-ping-7799: \]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP
ping data: rdma-ping-7800: ]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQ
ping data: rdma-ping-7801: ^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQR

You can Ctrl-C it after some time. At this point you are good to go. Soon I will put an example RDMA server-client program in part 3 of the series.

Troubleshooting:
- "ibv_devices: error while loading shared libraries: cannot open shared object file: No such file or directory"
  • Check if libs are in /usr/local/lib 
  • Do ldconfig on a newly installed system, so it can learn about new libs 
- "libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1"
  • In some peculiar cases, due to a wrong installation, the driver file is not properly located. For example, on my system all driver files are in the /usr/local/etc/libibverbs.d/ directory. A driver file is nothing special, just a simple file that tells libibverbs the name of the driver (and hence the library file name). I have the following files in my /usr/local/etc/libibverbs.d/:
    -rw-r--r-- 1 root staff 13 Aug 30 11:38 cxgb4.driver
    -rw-r--r-- 1 root staff 11 Aug 30 12:31 siw.driver
    cxgb4.driver is for the Chelsio T4 RNIC, and siw.driver is for SoftiWARP. Inside the file there is nothing fancy; siw.driver contains one line of text:
    driver siw
    Check an strace log to see whether libibverbs finds and tries to open this file for the device. If the file is missing, just create it yourself.
- Permission denied errors such as:
  "rping -s 
   CMA: unable to open RDMA device
   Segmentation fault"
are related to missing udev rules. In this case only root can access the RDMA devices. Also, on some Debian systems, 50-udev.rules contains some RDMA-related rules too. Delete them!

- For more detailed debugging, try using strace with -f, something like:
  $strace -f rping -s 
It will give you tons of detail about what the system is doing: which files it opens, and which one failed, leading to the failure of the RDMA program. It is also useful for checking misconfigured library paths, as you can see whether ld looks in all of them or not.

- If nothing works then drop me an email !( atr AT zurich DOT ibm DOT com)  :)

References: SoftiWARP is developed at the Systems Software group, IBM Research - Zurich. More details about it can be found at
- IBM website -
- Gitorious
- Wimpy Nodes with 10GbE: Leveraging One-Sided Operations in Soft RDMA to Boost Memcached. Patrick Stuedi, Animesh Trivedi, Bernard Metzler. USENIX ATC'12 (short paper), Boston, USA, June 2012.
- A Case for RDMA in Clouds: Turning Supercomputer Networking into Commodity. Animesh Trivedi, Bernard Metzler, Patrick Stuedi. ACM SIGOPS Asia-Pacific Workshop on Systems (APSys 2011), Shanghai, China, July 2011.

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of my past or current employers. 

Tuesday, 22 November 2011

Remote Direct Memory Access (RDMA) 101 - Quick History Lesson and Introduction

Remote Direct Memory Access, or RDMA, is a cool technology which enables applications to read and write remote memory directly from the NIC (of course, after some sort of setup). Think of it as networked DMA. People often associate RDMA with Infiniband, which is fair, but there is a subtle difference. RDMA has its roots in the Virtual Interface Architecture (VIA), which was essentially developed to support user-level, fast, low-latency, zero-copy networking. However, VIA was an abstraction, not a concrete implementation. Infiniband was one of the first concrete implementations of VIA and led to the development of concrete RDMA stacks. In the beginning, though, Infiniband itself was badly fragmented. This was back in the late 90s and early 2000s.

Now fast forward to 2007. Infiniband was a commercial success and had found an easy way into the HPC community, with its ultra-low latency and stringent performance demands. It is also popular in other high-end data-intensive appliances. But what about the commodity world, such as data centers, which runs mostly on IP-based networks? Enter iWARP, or Internet Wide Area RDMA Protocol (don't ask me why it is called iWARP). It defines RDMA semantics on top of IP-based networks, which lets it run on commodity interconnect technologies such as the widely popular Ethernet/IP/TCP.

Today there is another stack making a lot of buzz: RoCE (pronounced "Rocky"), or RDMA over Converged Enhanced Ethernet (CEE). The main line of argument here is that since the L2 layer (Ethernet) is lossless, there is no need for the complicated IP and TCP stacks. RoCE puts RDMA semantics directly on top of Ethernet packets.

The point I am trying to make is that the RDMA specification itself is just a set of abstractions and semantics; it is totally up to the developer of a stack how to implement it. There are also numerous proprietary implementations of RDMA around. And just as with the low-level stuff, there is no final word on the higher-level stuff either, such as user-level APIs and libraries. Early on this led to severe fragmentation of the RDMA userspace: every Infiniband vendor (in those days the only RDMA implementation was IB) seemed to have its own user-space libraries and access mechanisms for its RDMA hardware. But these days the situation is much more coherent: the OpenFabrics Alliance (OFA) distribution of user-space libraries and APIs seems to be the de-facto RDMA standard (although certainly not the only one). It provides kernel-level support for RDMA as well as user-level libraries. The distribution is called the OFA Enterprise Distribution, or OFED.

Since RDMA was initially developed for Infiniband (IB), for historical reasons much of the RDMA code base and its abbreviations still use _ib_ or _IB_. But RDMA is most certainly not tied to it. There should be a clear separation between the RDMA concept and the transport (e.g. Infiniband, iWARP, RoCE) which implements it. Also, today the term RDMA is used as an umbrella term which includes fancier operations beyond plain remote memory reads and writes. Not all of these operations are available on every RDMA transport, but there are ways for applications to probe a transport's capabilities.

Another important aspect of this discussion is how to write transport-agnostic RDMA code, considering that there is now more than one RDMA transport out there. RDMA transports differ in how they initiate and manage connections. To hide this complexity, OFA has developed the RDMA connection manager (distributed as librdmacm). The standard RDMA library (which, as you might have guessed, is called libibverbs; btw, "verbs" is nothing but a fancy name for the API calls) also contains connection management calls, but those are IB-specific and do not make much sense for iWARP (running on TCP/IP). So legacy IB code has to be rewritten to link against librdmacm in order to be transport-agnostic. To one's surprise, quite a bit of code out there (including some benchmarks inside OFED) is IB-specific and will not run on iWARP.

With the standard OFED development environment, one just has to do:
gcc your_rdma_app.c -lrdmacm

I will soon write about how to set up an RDMA development environment entirely in software. No need for any fancy hardware, and one can still see RDMA in action!

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

Saturday, 6 August 2011

Gigabit Ethernet efficiency with standard 1500 bytes MTU

In my experiments, I have found that using jumbo frames (9K MTU) eclipses many of the gains from other improvements in the network stack, such as GRO, LRO, interrupt coalescing, etc. Because of the large MTU, the per-packet overhead is very small, which renders the other receive-side optimizations a little less effective.

Now I have started experimenting with the standard MTU size (1500 bytes) on 10 GbE. With a 9K MTU, I can easily reach line speed of ~9850-9900 Mbps with 4K or bigger message sizes. But with a 1500-byte MTU, I cannot get past some odd ~9400 Mbps, even with a large message size of 1 MB. It is not CPU bound: both rx- and tx-side CPUs were less than 100% loaded. Upon further investigation and calculation, I understood that ~9400 Mbps is the theoretical application-data limit on 10 GbE with a 1500-byte MTU. Let's break it down point by point:

- 10 Gigabits per second or 10^10 bits/sec transmission speed refers to the raw bit transmission capacity of the link. This is layer 1 (L1) in the OSI model.
- L2 is Ethernet. For every 1500-byte (MTU) payload transferred, Ethernet requires transmitting an additional 38 bytes: 7 bytes (preamble) + 1 byte (start-of-frame delimiter) + 12 bytes (src + dst MAC addresses) + 2 bytes (EtherType) + 4 bytes (CRC) + 12 bytes (interframe gap; yes, this also counts as transmission time).
- L3: routing, aka the omnipresent IP stack. Add another 20 bytes of protocol overhead.
- L4: transport, aka everyone's favorite TCP stack. By default, Linux enables the timestamp option for TCP, which adds another 12 bytes to the standard 20-byte (5-word) TCP header. So in total the TCP header becomes 32 bytes.

So for every 1500 bytes transmitted on the wire, Ethernet transmits an additional 38 bytes, and of the 1500-byte payload, 52 (20 + 32) bytes are TCP and IP headers rather than user data. So the net efficiency of the stack becomes:

(1500 - 52) / (1500 + 38) = 0.9414, or 94.14%, which is exactly what you get as the end-to-end application data rate. This is called "protocol overhead". With jumbo frames the calculation is the same but with a 9K MTU, so:

(9000 - 52) / (9000 + 38) = 0.9900, or 99%. Less than 1% protocol overhead.

In these calculations I have ignored the VLAN extension to Ethernet, which adds another (optional) 4 bytes to the Ethernet frame, as well as various optional TCP/IP headers. Any additional options would only increase the protocol overhead.
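The two calculations above are easy to reproduce for any MTU, for example with awk:

```shell
# Efficiency = (MTU - TCP/IP headers) / (MTU + Ethernet framing overhead)
awk 'BEGIN {
    eth = 38;        # preamble+SFD+MACs+EtherType+CRC+interframe gap
    hdr = 52;        # IP (20) + TCP with timestamps (32)
    mtus[1] = 1500; mtus[2] = 9000;
    for (i = 1; i <= 2; i++)
        printf "MTU %d: %.6f\n", mtus[i], (mtus[i] - hdr) / (mtus[i] + eth)
}'
# MTU 1500: 0.941482
# MTU 9000: 0.990042
```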

Thursday, 30 June 2011

bash commands shenanigans

I needed to do some profiling using netcat and time, and it turns out that there is more to those commands than meets the eye ;)

Case netcat: netcat is a super awesome networking utility. I wanted to test how long it takes to transfer a large file from a cold cache start. So on the server side I did

nc -v -l 5001 

and then on the client side I had 

nc -v ip 5001 < file_name 

and this is perfectly sane. It worked like a charm. But then I moved to another pair of boxes (btw, both systems run Debian testing, Wheezy; there are 2 pairs of boxes in total). On the new pair, when I start nc in listen mode I get:

5001: inverse host lookup failed: Unknown host
listening on [any] 47022 ...

This is not what I wanted. Additionally, the file transfer does not terminate properly, which was essential for my benchmarking. After a couple of hours of staring at the nc code, I realized that there are a couple of variants around, notably nc.traditional and nc.openbsd. Things work fine with the openbsd version. The man page is written for the openbsd version (the file transfer example above is copied from the man page). So I installed the openbsd version with apt-get install netcat-openbsd, which updated the nc link in /bin to point to it. Since then things have been back to normal. I still don't know exactly what is missing or what the difference between the two is, but for now the examples from the world of man pages make sense again!
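On Debian the nc name is managed through the alternatives system, so checking which variant a box actually has is quick. A sketch; the paths shown in the comment are the Debian convention, and nc may not be installed at all:

```shell
# Resolve which netcat binary the plain 'nc' command points at.
if command -v nc >/dev/null 2>&1; then
    readlink -f "$(command -v nc)"   # e.g. /bin/nc.openbsd or /bin/nc.traditional
else
    echo "nc not installed"
fi
```

`update-alternatives --config nc` (as root) switches between the variants without reinstalling anything.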

Case time: time is another useful command, which can give you a lot of information about a process's stats. But apparently the version built into bash is badly out of sync with the man pages. From what I gathered, there are two versions of time: one built into bash, the other at /usr/bin/time. When you just run time, the built-in one gets invoked; it ignores all the parameters and even complains about them. This is not what you would expect after reading the man page of time. For example, I was trying:

$time -f  "%P"  ls
-bash: -f: command not found

It even refuses to accept the command. After a while I figured out the difference between the two versions:

$ /usr/bin/time -f "%P" ls
. ..

It worked perfectly! The world of magnets and miracles, aka man pages, started to make sense again.
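The root cause is that time in bash is not even a regular builtin but a reserved word, which is only special when it starts a pipeline; anything that stops bash from parsing it as a keyword reaches the external binary. A quick check:

```shell
# Ask bash what 'time' resolves to: it is a shell keyword, not a command.
bash -c 'type time'
# prints: time is a shell keyword

# Ways to bypass the keyword and run the external time (if installed):
#   /usr/bin/time -f "%P" ls
#   command time -f "%P" ls
#   \time -f "%P" ls
```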

Gurr ! 

Wednesday, 4 May 2011

Understanding TCP behavior in Linux 2.6

Lately I have been doing a lot of TCP performance analysis with different configuration settings. TCP is a very complex protocol and has plenty of knobs to play with. Linux's TCP implementation itself is messy (in the most positive sense) enough and requires quite a bit of know-how and tools expertise. I am playing with:
- 1GbE and 10GbE
- variable send and recv user buffer sizes (what is passed to the send and recv calls)
- variable send and recv socket buffer sizes (what is passed to setsockopt as SO_SNDBUF and SO_RCVBUF)
- different MTU sizes (for now just sticking with 1500 and 9000 bytes)
- different interrupt coalescing and offloading settings (primarily LRO and GRO)
And to make matters worse, I have 2 pairs of machines from different generations of CPUs and memory bandwidth.

tcpdump is an excellent tool which gives basic information about TCP behavior on the wire, showing all the standard information that can be extracted from a TCP header. Its primary limitation, for me, is that it does not give any peek into the Linux implementation (which, I guess, it is not supposed to do either). It also has a non-negligible measurement overhead (I will post numbers soon). When collecting snapshots, it is sometimes desirable to see internal details of how a particular OS (here Linux) sees a connection. Two options come to the rescue:

a) Use the getsockopt call with TCP_INFO. The call returns the current TCP information maintained by the kernel, filled into struct tcp_info (the kernel populates it in tcp_get_info()). Although very neat, this has certain issues. For example, sometimes I need to export some very specific internal data from struct tcp_sock. I could add that to the header file, but that requires recompiling the kernel, the application, and what not. So I opted for option (b).

b) Use tcp_probe.ko to hook into the TCP stream processing inside the kernel. The main advantage of this approach is that it allows you to selectively recompile the kernel module, without recompiling the whole kernel, to export something peculiar. A quick tutorial:

Step 1: Insert tcp_probe.ko (if you are going to selectively recompile the module, then I recommend going to linux/net/ipv4/ and doing insmod tcp_probe.ko instead of modprobe tcp_probe). At insertion time it takes two parameters: port and full. port is the port you want to see activity on (or 0 for all); it can be either the source or the destination port. full selects between logging only when the congestion window changes and complete logging. Complete logging "can be" expensive (but I don't find it that bad), and I prefer it.

Step 2: iperf uses port 5001. Hence use something like:
cat /proc/net/tcpprobe > dump &

Step 3: When done running the TCP experiment, kill the background dumping process. Now parse the dump file with your favorite column parser and use gnuplot to get pretty graphs. A (kind-of) non-working example is here:
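As a sketch of step 3, here is the column extraction on a made-up sample line. The field positions (time in $1, snd_cwnd in $7) match the kernels I used, but the /proc/net/tcpprobe layout has changed across versions, so verify against a real line from your own dump first.

```shell
# One fabricated tcp_probe-style line, purely for illustration:
printf '0.019245 10.0.0.1:5001 10.0.0.2:33210 1514 0x2a 0x1c 17 2147483647 262144 4096\n' > dump

# Keep (timestamp, congestion window) pairs for gnuplot:
awk '{ print $1, $7 }' dump > cwnd.dat
cat cwnd.dat
# 0.019245 17
```

Then inside gnuplot: plot "cwnd.dat" using 1:2 with lines.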

A few things have changed in tcp_probe.c since then, mainly the introduction of the full parameter in the 2.6.22 series. With full, the logging condition becomes:
/* Only update if port matches */
if ((port == 0 || ntohs(inet->inet_dport) == port ||
     ntohs(inet->inet_sport) == port) &&
    (full || tp->snd_cwnd != tcp_probe.lastcwnd))

Between straightforwardly end-to-end connected hosts (switch-less), there are not many packet losses and the window grows quickly. Hence you get a straight line, or no samples at all, in the log. So I prefer full logging.

Now, how do you change it to export some custom stuff from inside the kernel? Simple.

Step #1: See struct tcp_log at the top of tcp_probe.c. Add your variable there.

Step #2: Export it from the kernel in jtcp_rcv_established(). In this function you have access to all the cool stuff inside the Linux kernel: struct sock, struct sk_buff, and struct tcp_sock. Get whatever you want to export and save it in the log. For example, to export the total number of retransmissions, just add p->total_retrans = tp->total_retrans;

Step #3: Dump it out when the log is read with cat /proc/net/tcpprobe. It is as simple as adding the extra variable to a printf-style call.

Recompile (for the in-tree case that just updates tcp_probe.ko) or rebuild your standalone module, insert it, and you are good to go: the new variable shows up in the output of tcp_probe.

Thursday, 28 April 2011

gvim neat coding style

Often, working in GUI mode, I cannot help overflowing the "standard" good practice of sticking to 80-character columns. gvim comes to the rescue: set this up in your ~/.gvimrc or ~/.vimrc and it will highlight text that overflows, as a visual aid:

highlight OverLength ctermbg=red ctermfg=white guibg=#592929
match OverLength /\%81v.\+/

vim never lets me down and keeps surprising me with its customizability ;)