AccelTCP is a highly scalable, hardware-assisted TCP stack that harnesses the NIC as a TCP accelerator. It is optimized for handling short-lived connections and application-level proxying. The key idea behind AccelTCP is that we can free a significant amount of CPU cycles for TCP applications by offloading repetitive, mechanical operations such as connection setup, teardown, and splicing to NIC hardware.
The performance of modern key-value servers or layer-7 (L7) load balancers often depends heavily on the efficiency of the underlying TCP stack. Despite numerous optimizations such as kernel bypass and zero-copy I/O, we observe that the performance improvement of a TCP stack is fundamentally limited by the protocol conformance overhead of compatible TCP operations.
While recent kernel-bypass TCP stacks (e.g., mTCP [NSDI'14], IX [OSDI'14]) substantially improve the performance of short RPC transactions, they still spend a significant amount of CPU cycles tracking flow states during connection setup and teardown. For example, Redis-mTCP, a popular in-memory key-value store ported to mTCP, spends over half of its CPU cycles on TCP stack operations such as connection state management and handling of TCP control packets.
L7 proxying is widely used in middlebox applications such as L7 load balancers and application gateways. While the key functionality of an L7 proxy is to map a client-side connection to a back-end server, it spends most of its CPU cycles on relaying packets between the two connections, and the DMA operations between host memory and the NIC are unavoidable.
AccelTCP is a dual-stack TCP architecture that harnesses NIC hardware as a TCP protocol accelerator. It divides TCP stack operations into two categories and targets the peripheral operations for offloading to the NIC stack.
(i) The host stack performs central TCP operations, which cover all aspects of application data transfer: reliable data transfer with loss inference and packet retransmission, flow reassembly, and congestion/flow control. These operations are typically complex and subject to flexible policies, which demands a variable amount of compute cycles.
(ii) The NIC stack performs peripheral operations, which refer to any other tasks logically independent of the application. These include traditional partial NIC offload tasks, connection setup/teardown, and blind relaying of packets between two connections that requires no application-level intervention. These tasks are either stateless operations with a fixed processing cost or only lightly stateful operations.
Stateful TCP Offloads in AccelTCP
Connection management offload: State synchronization at a connection's boundary is a key requirement of TCP, but it is pure overhead from the application's perspective. While NIC offload is logically desirable, conventional wisdom suggests otherwise due to its complexity. Our position is that one can tame this complexity on recent smart NICs. First, connection setup can be made stateless with SYN cookies. Second, the common case of connection teardown is a simple state transition, and modern smart NICs have enough resources to handle the few exceptional cases.
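To illustrate why SYN cookies make setup stateless, here is a minimal sketch: all state needed to establish the connection is encoded into the 32-bit initial sequence number itself, so the NIC keeps nothing per-flow until the final ACK arrives. The bit layout (5-bit coarse timestamp, 2-bit MSS index, 25-bit keyed hash), the MSS table, and all names below are our own illustrative assumptions, not AccelTCP's actual encoding.

```python
import hashlib
import struct

MSS_TABLE = [536, 1220, 1440, 1460]  # candidate MSS values, indexed by 2 bits
SECRET = 0x5EED                      # per-host secret keying the hash (illustrative)

def _mac(saddr, sport, daddr, dport, t):
    """25-bit keyed hash over the 4-tuple and the coarse timestamp."""
    msg = struct.pack("!IHIHIQ", saddr, sport, daddr, dport, t, SECRET)
    return int.from_bytes(hashlib.sha256(msg).digest()[:4], "big") & 0x1FFFFFF

def make_cookie(saddr, sport, daddr, dport, mss_idx, t):
    """Encode setup state into a 32-bit ISN: [5b time][2b MSS idx][25b MAC]."""
    t &= 0x1F
    return (t << 27) | (mss_idx << 25) | _mac(saddr, sport, daddr, dport, t)

def check_cookie(cookie, saddr, sport, daddr, dport, now):
    """Validate the cookie echoed in the ACK; return the MSS, or None if bogus."""
    t = (cookie >> 27) & 0x1F
    mss_idx = (cookie >> 25) & 0x3
    if (now - t) & 0x1F > 2:          # reject cookies older than ~2 time ticks
        return None
    if (cookie & 0x1FFFFFF) != _mac(saddr, sport, daddr, dport, t):
        return None
    return MSS_TABLE[mss_idx]
```

The SYN+ACK carries `make_cookie(...)` as its sequence number; when the ACK returns, `check_cookie` recovers the negotiated MSS and verifies authenticity without any stored state, which is what makes the setup path offloadable at a fixed cost.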
Connection splicing offload: Offloading connection splicing to the NIC is conceptually complex, as it requires managing the state of two separate connections on the NIC. However, if the application does not modify the relayed content, as is often the case with L7 load balancers, we can splice the two physical connections into a single logical connection. This lets the NIC operate as a fast packet forwarder that simply translates packet headers, at a fixed compute cost with only a small amount of per-splice state.
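The fixed per-packet work after splicing can be sketched as a NAT-like header translation: rewrite addresses and ports, and shift sequence/acknowledgment numbers by precomputed deltas between the two connections' sequence spaces. The `SpliceEntry` fields and the dict-based packet below are our own simplified rendering, not AccelTCP's actual NIC data structures; the point is that the payload is never touched and the state per splice is tiny.

```python
from dataclasses import dataclass

@dataclass
class SpliceEntry:
    # rewrite rule for one direction of a spliced connection pair
    new_saddr: int
    new_sport: int
    new_daddr: int
    new_dport: int
    seq_delta: int   # offset between the two connections' send sequence spaces
    ack_delta: int   # offset for the reverse direction's sequence space

def translate(pkt, e):
    """Rewrite header fields in place; the payload is relayed untouched."""
    pkt["saddr"], pkt["sport"] = e.new_saddr, e.new_sport
    pkt["daddr"], pkt["dport"] = e.new_daddr, e.new_dport
    pkt["seq"] = (pkt["seq"] + e.seq_delta) & 0xFFFFFFFF  # 32-bit wraparound
    pkt["ack"] = (pkt["ack"] + e.ack_delta) & 0xFFFFFFFF
    return pkt
```

Since the deltas are computed once at splice time, every subsequent packet costs the same constant amount of work on the NIC, with no host DMA for the relayed bytes.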
Host Stack Optimizations
We optimize the host networking stack to accelerate small message processing. While these optimizations are orthogonal to NIC offload, they bring a significant performance benefit to short-lived connections.
- Lazy TCB creation reduces the overhead of maintaining large TCBs for short-lived connections.
- Opportunistic zero-copy enables efficient I/O while maintaining the standard socket API semantics.
- User-level threading eliminates the high context switching overhead of kernel-level threads.
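As a rough illustration of the first optimization, lazy TCB creation can be sketched as keeping only a few fields per flow at setup time and allocating the full control block (buffers, congestion state) on the first data packet. The `MiniTCB`/`FullTCB` split and all names below are our own assumptions for illustration, not AccelTCP's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class MiniTCB:
    # minimal state recorded at connection setup (cheap to allocate)
    irs: int    # initial receive sequence number
    iss: int    # initial send sequence number
    mss: int

@dataclass
class FullTCB:
    # full control block, allocated only when the flow actually carries data
    mini: MiniTCB
    recv_buf: bytearray = field(default_factory=bytearray)
    send_buf: bytearray = field(default_factory=bytearray)
    cwnd: int = 10 * 1460   # initial congestion window in bytes (assumption)

flows = {}  # 4-tuple key -> MiniTCB or FullTCB

def on_setup(key, irs, iss, mss):
    flows[key] = MiniTCB(irs, iss, mss)       # no buffers for data-less flows

def on_data(key, payload):
    tcb = flows[key]
    if isinstance(tcb, MiniTCB):              # promote lazily on first data
        tcb = flows[key] = FullTCB(mini=tcb)
    tcb.recv_buf += payload
    return tcb
```

Short-lived connections that exchange little or no data never pay for the full allocation, which is where the savings for connection-heavy workloads would come from.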
Experiment Setup
- CPU: Intel Xeon Gold 6142 @ 2.6 GHz (16 cores)
- Memory: 128 GB DDR4 DRAM
- NIC: Netronome Agilio LX (dual-port 40 GbE NIC)
(*) Only the first port is enabled for the Redis test; both ports are enabled for the HAProxy test.
- Ubuntu 16.04 (Linux 4.11), DPDK 17.08
Key-value store (Redis)
We first evaluate the effectiveness of AccelTCP with Redis. We use Redis ported to mTCP as the baseline server and port it to AccelTCP for comparison. We test with the USR workload from Facebook, which consists of 99.8% GET and 0.2% SET requests with short keys and values (< 20 B). Redis-AccelTCP achieves up to 2.3x higher throughput than Redis-mTCP, and its performance scales well with the number of CPU cores. We also find that AccelTCP saves up to 75% of the CPU cycles spent on TCP processing.
Layer-7 load balancer (HAProxy)
We next examine whether AccelTCP improves the performance of HAProxy, a widely used HTTP-based L7 load balancer. We port HAProxy to use mTCP and AccelTCP, respectively, and evaluate their performance with the SpecWeb2009 workload. HAProxy-AccelTCP achieves 11.9x higher throughput than HAProxy-mTCP, and its average response time is 13.6x lower (0.98 ms vs. 13.33 ms).
AccelTCP: Accelerating Network Applications with Stateful TCP Offloading
YoungGyoun Moon, Seungeon Lee, Muhammad Jamshed, and Kyoungsoo Park
To appear in USENIX NSDI 2020
Accelerating Flow Processing Middleboxes with Programmable NICs
YoungGyoun Moon, Ilwoo Park, Seungeon Lee, and Kyoungsoo Park
In Proceedings of ACM APSys 2018
Check out the latest release of AccelTCP on our GitHub!