Wednesday 4 May 2011

Understanding TCP behavior in Linux 2.6

Lately I have been doing a lot of TCP performance analysis on different configuration settings. TCP is a very complex protocol and has plenty of knobs to play with. Linux's TCP implementation is itself messy enough (in the most positive sense) that it requires quite a bit of know-how and tool expertise. I am playing with:
- 1GbE and 10GbE
- Variable send and recv user buffer sizes (what is passed to the send and recv calls)
- Variable send and recv socket buffer sizes (what is passed to the setsockopt call with SO_SNDBUF and SO_RCVBUF)
- Different MTU sizes (for now just sticking with 1500 and 9000 bytes)
- Different interrupt coalescing and offloading settings (primarily LRO and GRO)
And to make matters worse, I have 2 pairs of machines with different generations of CPUs and memory bandwidth.

tcpdump is an excellent tool that gives basic information about TCP behavior on the wire, showing all the standard information that can be extracted from a TCP header. The primary limitation I found is that it gives no peek into the Linux implementation (which, I guess, it is not supposed to do anyway). It also has non-negligible measurement overhead (I will post numbers soon). When collecting snapshots, it is sometimes desirable to see internal details of how a particular OS (here Linux) sees the connection. Two options come to the rescue:

a) Use the getsockopt call with TCP_INFO. The call returns the current TCP information maintained by the kernel in a struct tcp_info (http://lxr.linux.no/linux+v2.6.38/include/linux/tcp.h#L129). Here is how this structure is filled in inside the kernel: http://lxr.linux.no/linux+v2.6.38/net/ipv4/tcp.c#L2433. A more detailed example is given here: http://linuxgazette.tuwien.ac.at/136/pfeiffer.html. Although very neat, this approach has certain issues. For example, sometimes I need to export some very specific internal data from struct tcp_sock. I can add a field to the header file, but that requires recompiling the kernel, the application, and what not. So I opted for option b).
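As a rough illustration of option a), here is a minimal user-space sketch of reading struct tcp_info from a TCP socket. The helper names read_tcp_info and print_tcp_info are my own (hypothetical), and the choice of fields to print is arbitrary. It assumes the glibc definition of struct tcp_info from <netinet/tcp.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <netinet/in.h>    /* IPPROTO_TCP */
#include <netinet/tcp.h>   /* struct tcp_info, TCP_INFO */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: fill *out with the kernel's view of TCP socket fd.
 * Returns 0 on success, -1 on failure (errno is set by getsockopt). */
int read_tcp_info(int fd, struct tcp_info *out)
{
    socklen_t len = sizeof(*out);

    assert(out != NULL);
    memset(out, 0, sizeof(*out));
    return getsockopt(fd, IPPROTO_TCP, TCP_INFO, out, &len);
}

/* Hypothetical helper: print a few fields that are handy for
 * performance work. tcpi_rtt and tcpi_rttvar are in microseconds. */
void print_tcp_info(const struct tcp_info *ti)
{
    assert(ti != NULL);
    printf("state=%u rtt=%uus rttvar=%uus snd_cwnd=%u snd_ssthresh=%u "
           "total_retrans=%u\n",
           (unsigned)ti->tcpi_state, (unsigned)ti->tcpi_rtt,
           (unsigned)ti->tcpi_rttvar, (unsigned)ti->tcpi_snd_cwnd,
           (unsigned)ti->tcpi_snd_ssthresh,
           (unsigned)ti->tcpi_total_retrans);
}
```

Calling read_tcp_info on the connected socket at regular intervals during a transfer gives a coarse time series. Anything that is not already a member of struct tcp_info stays invisible, which is the limitation described above.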

b) Use tcp_probe.ko to hook into TCP stream processing inside the kernel. The main advantage of this approach is that exporting something unusual only requires recompiling this one kernel module, not the whole kernel. A quick tutorial:

Step 1: Insert tcp_probe.ko (if you are going to selectively recompile the module, I recommend going to linux/net/ipv4/ and doing insmod tcp_probe.ko instead of modprobe tcp_probe). At insertion time it takes two parameters, port and full. port is the port you want to see activity on (or 0 for all); it is matched against both the source and the destination port. full selects between logging only when the congestion window changes and logging every packet. Complete logging "can be" expensive (but I don't find it that much). I prefer complete logging.

Step 2: iPerf uses port 5001, so pass port=5001 when inserting the module in step 1. Then start dumping the log in the background with something like
cat /proc/net/tcpprobe > dump &

Step 3: When done running the TCP experiment, kill the background dumping process. Now parse the dump file with your favorite column parser and use gnuplot to get pretty graphs. A (kind of non-working) example is here: http://www.linuxfoundation.org/collaborate/workgroups/networking/tcpprobe
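If you would rather not depend on a generic column parser, the dump can be split with a few lines of C. The sketch below assumes the line layout produced by the printf format in tcpprobe_sprint() in 2.6.38 (timestamp, source addr:port, destination addr:port, length, snd_nxt, snd_una, snd_cwnd, ssthresh, snd_wnd, srtt). Check the format string in your own kernel before trusting it. The names probe_sample, parse_probe_line and extract_cwnd are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed sample; field order is an assumption taken from the
 * 2.6.38 format string, not a stable interface. */
struct probe_sample {
    double   time_s;            /* timestamp in seconds */
    char     src[64];           /* source "addr:port" */
    char     dst[64];           /* destination "addr:port" */
    int      length;            /* packet length in bytes */
    unsigned snd_nxt, snd_una;  /* printed in hex by the module */
    unsigned snd_cwnd, ssthresh, snd_wnd, srtt;
};

/* Parse one line of the dump. Returns 0 on success, -1 if the line
 * does not have the expected ten columns. */
int parse_probe_line(const char *line, struct probe_sample *out)
{
    int n;

    assert(line != NULL && out != NULL);
    n = sscanf(line, "%lf %63s %63s %d %x %x %u %u %u %u",
               &out->time_s, out->src, out->dst, &out->length,
               &out->snd_nxt, &out->snd_una, &out->snd_cwnd,
               &out->ssthresh, &out->snd_wnd, &out->srtt);
    return n == 10 ? 0 : -1;
}

/* Copy "time snd_cwnd" pairs from in to out, one per line, which
 * gnuplot can plot directly. Returns the number of samples written. */
int extract_cwnd(FILE *in, FILE *out)
{
    char line[512];
    struct probe_sample s;
    int count = 0;

    assert(in != NULL && out != NULL);
    while (fgets(line, sizeof(line), in) != NULL) {
        if (parse_probe_line(line, &s) == 0) {
            fprintf(out, "%.9f %u\n", s.time_s, s.snd_cwnd);
            count++;
        }
    }
    return count;
}
```

Running extract_cwnd over the dump file and writing to a second file yields two columns (time, congestion window) that gnuplot can draw with a plain "plot ... using 1:2 with lines".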

A few things have changed in tcp_probe.c since then, mainly the addition of the full parameter in the 22 series. With full, the logging condition becomes:
/* Only update if port matches */
if ((port == 0 || ntohs(inet->inet_dport) == port ||
     ntohs(inet->inet_sport) == port) &&
    (full || tp->snd_cwnd != tcp_probe.lastcwnd))
http://lxr.linux.no/linux+v2.6.38/net/ipv4/tcp_probe.c#L97

On straightforward, directly connected hosts (no switch in between) there is little packet loss and the window grows quickly, after which the congestion window rarely changes. Without full you therefore get a straight line or no samples at all in the log. So I prefer using full logging.

Now, how do you change it to export some custom data from inside the kernel? Simple.

Step #1: See struct tcp_log at http://lxr.linux.no/linux+v2.6.38/net/ipv4/tcp_probe.c#L53. Add your variable here.

Step #2: Work out the value to export here: http://lxr.linux.no/linux+v2.6.38/net/ipv4/tcp_probe.c#L91. In this function (jtcp_rcv_established) you have access to all the cool stuff inside the Linux kernel: struct sock, struct sk_buff and struct tcp_sock. Get whatever you want to export and save it in the log. For example, to export the total number of retransmissions, just add p->total_retrans = tp->total_retrans;

Step #3: Print it out where the log is formatted for reading, which is what you see when doing cat on /proc/net/tcpprobe: http://lxr.linux.no/linux+v2.6.38/net/ipv4/tcp_probe.c#L150. It is as simple as putting an extra variable in a printf call.

Recompile the kernel (that would just update tcp_probe.ko) or your standalone module code. Insert it and you are good to go: the new variable shows up in the output of tcp_probe.
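The three steps can be mimicked in user space to make the shape of the change concrete. This is only a mock-up: fake_tcp_sock and fake_tcp_log are hypothetical stand-ins for struct tcp_sock and struct tcp_log, reduced to two fields, and fill_log and sprint_log stand in for the corresponding places in jtcp_rcv_established and the print routine. The real module has more fields and uses kernel helpers instead of snprintf.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the few members of struct tcp_sock read here. */
struct fake_tcp_sock {
    uint32_t snd_cwnd;
    uint32_t total_retrans;
};

/* Step #1: the log record gains a new member. */
struct fake_tcp_log {
    uint32_t snd_cwnd;
    uint32_t total_retrans;   /* newly exported field */
};

/* Step #2: copy the value into the record while handling a packet. */
void fill_log(struct fake_tcp_log *p, const struct fake_tcp_sock *tp)
{
    assert(p != NULL && tp != NULL);
    p->snd_cwnd = tp->snd_cwnd;
    p->total_retrans = tp->total_retrans;
}

/* Step #3: append the new field to the formatted output line.
 * Returns the number of characters written, like snprintf. */
int sprint_log(char *buf, size_t n, const struct fake_tcp_log *p)
{
    assert(buf != NULL && p != NULL);
    return snprintf(buf, n, "%u %u\n",
                    (unsigned)p->snd_cwnd, (unsigned)p->total_retrans);
}
```

The point of the mock-up is that all three edits are mechanical: one struct member, one assignment, one extra format specifier with its argument. Any script that parses the dump needs the matching extra column.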

1 comment:

  1. very helpful !

    Often I have been looking for something simple to get me information about tcp states inside the kernel to better understand the bottleneck.
    This quick tutorial looks really good.

    Thanks !
