Remote Direct Memory Access, or RDMA, is a cool technology that enables applications to read and write remote memory directly from the NIC (after some setup, of course). Think of it as networked DMA. People often associate RDMA with Infiniband, which is fair, but there is a subtle difference. RDMA has its roots in the Virtual Interface Architecture (VIA), which was developed to support user-level, fast, low-latency, zero-copy networking. VIA, however, was an abstraction, not a concrete implementation. Infiniband was one of the first concrete implementations of VIA and led to the development of concrete RDMA stacks. In the beginning, though, Infiniband itself was badly fragmented. This was back in the late 90s and early 2000s.
Now fast forward to 2007. Infiniband was a commercial success and had found an easy way into the HPC community, thanks to its ultra-low latency and that community's stringent performance demands. It was also popular in other high-end, data-intensive appliances. But what about the commodity world, which mostly runs on IP-based networks, such as data centers? Enter iWARP, the Internet Wide Area RDMA Protocol (don't ask me why it is called iWARP). It defines RDMA semantics on top of IP-based networks, which lets RDMA run on widely popular commodity interconnects such as Ethernet/IP/TCP.
Today there is another stack making a lot of buzz: RoCE (pronounced "Rocky"), or RDMA over Converged Enhanced Ethernet (CEE). The main line of argument here is that since the L2 layer (Ethernet) is lossless in a converged fabric, there is no need for the complicated IP and TCP stacks; RoCE puts RDMA semantics directly on top of Ethernet frames.
The point I am trying to make here is that the RDMA specification itself is just a set of abstractions and semantics; it is entirely up to the developer of a stack how to implement it. There are also numerous proprietary implementations of RDMA around. And just as with the low-level stuff, there is no final word on the higher-level stuff, such as user-level APIs and libraries. Early on this led to heavy fragmentation of the RDMA userspace: every Infiniband vendor (in those days IB was the only RDMA implementation) seemed to have its own user-space libraries and its own access mechanism for its RDMA hardware. These days the situation is much more coherent: the OpenFabrics Alliance (OFA) distribution of user-space libraries and APIs is the de-facto RDMA standard (although certainly not the only one). It provides kernel-level RDMA support as well as the user-level libraries, and the distribution is called the OpenFabrics Enterprise Distribution, or OFED.
Since RDMA was initially developed for Infiniband (IB), much of the RDMA code base and many of its abbreviations still use _ib_ or _IB_ for historical reasons. But RDMA is most certainly not tied to IB. There should be a clear separation between the RDMA concept and the transport (e.g. Infiniband, iWARP, RoCE) that implements it. Today the term RDMA is also used as an umbrella term that covers fancier operations beyond plain remote memory reads and writes, such as remote atomics. Not all of these operations are available on every RDMA transport, but there are ways for applications to probe the transport's capabilities.
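As an illustration of such probing, here is a minimal sketch (assuming a standard libibverbs installation; the file name probe_caps.c is just a placeholder) that lists the locally visible RDMA devices and checks, for example, whether they support atomic operations:

/* A minimal sketch of probing RDMA device capabilities.
 * Build with: gcc probe_caps.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        if (!ctx)
            continue;

        struct ibv_device_attr attr;
        if (!ibv_query_device(ctx, &attr)) {
            /* Print a few of the many capability fields. */
            printf("%s: max_qp=%d max_mr_size=%llu atomics=%s\n",
                   ibv_get_device_name(list[i]),
                   attr.max_qp,
                   (unsigned long long)attr.max_mr_size,
                   attr.atomic_cap != IBV_ATOMIC_NONE ? "yes" : "no");
        }
        ibv_close_device(ctx);
    }

    ibv_free_device_list(list);
    return 0;
}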
Another important aspect of this discussion is understanding how to write transport-agnostic RDMA code, given that there is now more than one RDMA transport out there. RDMA transports differ in how they initiate and manage connections. To hide this complexity, OFA has developed the RDMA connection manager (distributed as librdmacm). The standard RDMA library (which, as you might have guessed, is called libibverbs; by the way, "verbs" is nothing but a fancy name for the API calls) also contains connection management calls, but those are IB-specific and do not make much sense for iWARP (which runs on TCP/IP). So legacy IB code has to be rewritten to link against librdmacm to become transport agnostic. Surprisingly, quite a bit of code out there (including some of the benchmarks shipped inside OFED) is IB-specific and will not run on iWARP.
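To give a feel for the librdmacm style, here is a rough, incomplete sketch of the client-side connection setup. The address, port, and timeouts are placeholders, and error handling plus the actual QP and memory setup are omitted; the point is that the same sequence of calls works over IB, iWARP, or RoCE.

/* Transport-agnostic connection setup sketch using librdmacm.
 * Build with: gcc rdma_cm_sketch.c -lrdmacm */
#include <netdb.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    struct addrinfo *addr;
    struct rdma_event_channel *ec;
    struct rdma_cm_id *id;
    struct rdma_cm_event *event;

    /* Resolve the server address with the ordinary sockets machinery;
     * "192.168.1.10" and port "7471" are made-up placeholders. */
    getaddrinfo("192.168.1.10", "7471", NULL, &addr);

    ec = rdma_create_event_channel();
    rdma_create_id(ec, &id, NULL, RDMA_PS_TCP);

    /* The connection manager picks the transport-specific mechanism
     * underneath these generic calls. */
    rdma_resolve_addr(id, NULL, addr->ai_addr, 2000 /* ms */);
    rdma_get_cm_event(ec, &event);   /* expect RDMA_CM_EVENT_ADDR_RESOLVED */
    rdma_ack_cm_event(event);

    rdma_resolve_route(id, 2000);
    rdma_get_cm_event(ec, &event);   /* expect RDMA_CM_EVENT_ROUTE_RESOLVED */
    rdma_ack_cm_event(event);

    /* ... allocate PD/CQ, create the QP with rdma_create_qp(), register
     * memory, then rdma_connect() and wait for RDMA_CM_EVENT_ESTABLISHED ... */

    rdma_destroy_id(id);
    rdma_destroy_event_channel(ec);
    freeaddrinfo(addr);
    return 0;
}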
With the standard OFED development environment one just has to do:
gcc your_rdma_app.c -lrdmacm
I will soon write about how to set up an RDMA development environment entirely in software. No fancy hardware needed, and one can still see RDMA in action!
Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.