Accelerating Network Applications with Stateful TCP Offloading

What is AccelTCP?

AccelTCP is a highly scalable, hardware-assisted TCP stack that harnesses the NIC as a TCP accelerator. AccelTCP is optimized for handling short-lived connections and application-level proxying. The key idea behind AccelTCP is that we can free a significant amount of CPU cycles for TCP applications by offloading repetitive, mechanical operations, such as connection setup, teardown, and splicing, to NIC hardware.

Motivation

The performance of modern key-value servers and layer-7 (L7) load balancers often depends heavily on the efficiency of the underlying TCP stack. Despite numerous optimizations such as kernel bypassing and zero copying, we observe that the performance improvement of a TCP stack is fundamentally limited by the protocol conformance overhead required for compatible TCP operation.

Short-lived Connections

While recent kernel-bypass TCP stacks (e.g., mTCP [NSDI '14], IX [OSDI '14]) have substantially improved the performance of short RPC transactions, they still spend a significant amount of CPU cycles tracking flow states during connection setup and teardown. For example, Redis-mTCP, a popular in-memory key-value store ported to mTCP, consumes over half of its CPU cycles on TCP stack operations such as connection state management and handling of TCP control packets.

Figure 1. CPU breakdown of Redis with mTCP

Application-level Proxying

L7 proxying is widely used in middlebox applications such as L7 load balancers and application gateways. While the key functionality of an L7 proxy is to map a client-side connection to a back-end server connection, it spends most of its CPU cycles relaying packets between the two connections, where DMA operations between host memory and the NIC are unavoidable.

Figure 2. L7 proxying (mTCP) and L4 switch (DPDK) performance on a single CPU core

AccelTCP Design

AccelTCP is a dual-stack TCP architecture that harnesses NIC hardware as a TCP protocol accelerator. AccelTCP divides TCP stack operations into two categories, and we mainly target the peripheral operations for offloading to the NIC stack.

(i) The host stack performs central TCP operations, which cover all aspects of application data transfer: reliable data transfer with loss detection and packet retransmission, flow reassembly, and congestion/flow control. These operations are typically complex and subject to flexible policies, and thus demand a variable amount of compute cycles.

(ii) The NIC stack performs peripheral operations, which are tasks logically independent of the application. These include traditional partial NIC offload tasks, connection setup/teardown, and blind relaying of packets between two connections that requires no application-level intervention. These tasks are either stateless operations with a fixed processing cost or only lightly stateful.

Figure 3. Split of TCP functionality in AccelTCP
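
To make the split concrete, the sketch below tabulates the two categories in C. The names and the classification are paraphrased from the description above; they are illustrative and do not come from AccelTCP's code.

```c
/* Illustrative classification of TCP operations into the two
 * categories described above. Names are ours, not AccelTCP's code. */
enum op_class { CENTRAL_HOST, PERIPHERAL_NIC };

struct tcp_op { const char *name; enum op_class where; };

static const struct tcp_op ops[] = {
    { "loss detection & retransmission", CENTRAL_HOST },   /* variable cost */
    { "flow reassembly",                 CENTRAL_HOST },
    { "congestion/flow control",         CENTRAL_HOST },   /* policy-driven */
    { "partial offloads (e.g., checksum)", PERIPHERAL_NIC }, /* stateless */
    { "connection setup/teardown",       PERIPHERAL_NIC }, /* lightly stateful */
    { "blind connection splicing",       PERIPHERAL_NIC },
};
```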

Stateful TCP Offloads in AccelTCP

Connection management offload: State synchronization at a connection's boundary is a key requirement of TCP, but it is pure overhead from the application's perspective. While NIC offload is logically desirable, conventional wisdom suggests otherwise due to its complexity. Our position is that this complexity can be tamed on recent smart NICs. First, connection setup can be made stateless with SYN cookies. Second, the common case of connection teardown is a simple state transition, and modern smart NICs have enough resources to handle the few exceptions.

Figure 4. Connection setup offload
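
To illustrate the setup side, below is a minimal C sketch of stateless setup with SYN cookies: no per-connection state is kept until the final ACK of the handshake arrives. The mix32 hash and the cookie layout (a 3-bit MSS index in the low bits, a coarse timestamp folded into the hash) are our own illustration, not AccelTCP's actual format; a production stack would use a keyed cryptographic hash.

```c
#include <stdint.h>

static uint32_t mix32(uint32_t a, uint32_t b, uint32_t c)
{
    /* Toy mixer standing in for a keyed cryptographic hash. */
    a ^= b * 0x9e3779b9u; a = (a << 13) | (a >> 19);
    a ^= c * 0x85ebca6bu; a = (a << 17) | (a >> 15);
    return a * 0xc2b2ae35u;
}

/* Mint the server ISN for a SYN: fold the 4-tuple, the client ISN, and
 * a coarse timestamp into a hash; encode an MSS index in the low 3 bits. */
static uint32_t syn_cookie(uint32_t saddr, uint32_t daddr,
                           uint16_t sport, uint16_t dport,
                           uint32_t client_isn, uint32_t tick, uint32_t mss_idx)
{
    uint32_t h = mix32(saddr ^ daddr,
                       ((uint32_t)sport << 16) | dport,
                       client_isn ^ tick);
    return (h & ~0x7u) | (mss_idx & 0x7u);
}

/* Validate the final ACK: recompute the cookie for recent ticks and
 * return the encoded MSS index, or -1 if the cookie is stale/forged. */
static int syn_cookie_check(uint32_t saddr, uint32_t daddr,
                            uint16_t sport, uint16_t dport,
                            uint32_t client_isn, uint32_t cookie, uint32_t now)
{
    for (uint32_t back = 0; back < 2; back++) {
        uint32_t expect = syn_cookie(saddr, daddr, sport, dport,
                                     client_isn, now - back, cookie & 0x7u);
        if (expect == cookie)
            return (int)(cookie & 0x7u);
    }
    return -1;
}
```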

Figure 5. Connection teardown offload
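
The common-case teardown is likewise a small state machine. The sketch below follows the standard TCP FSM for the side that initiates the close; the punt flag, which hands rare or complex cases back to the host stack, is a hypothetical hook, not AccelTCP's actual interface.

```c
#include <stdbool.h>

enum td_state { FIN_WAIT_1, FIN_WAIT_2, CLOSING, TIME_WAIT, CLOSED };
enum td_event { EV_ACK_OF_FIN, EV_FIN, EV_TIMEOUT };

/* Advance the teardown FSM for one event. Anything outside the common
 * path sets *punt so the host stack can take the connection back. */
static enum td_state teardown_step(enum td_state s, enum td_event e, bool *punt)
{
    *punt = false;
    switch (s) {
    case FIN_WAIT_1:
        if (e == EV_ACK_OF_FIN) return FIN_WAIT_2;  /* our FIN acked */
        if (e == EV_FIN)        return CLOSING;     /* simultaneous close */
        break;
    case FIN_WAIT_2:
        if (e == EV_FIN)        return TIME_WAIT;   /* peer closed too */
        break;
    case CLOSING:
        if (e == EV_ACK_OF_FIN) return TIME_WAIT;
        break;
    case TIME_WAIT:
        if (e == EV_TIMEOUT)    return CLOSED;      /* 2*MSL elapsed */
        break;
    default:
        break;
    }
    *punt = true;   /* rare/complex case: hand back to the host stack */
    return s;
}
```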

Connection splicing offload: Offloading connection splicing to the NIC is conceptually complex, as it requires managing the state of two separate connections on the NIC. However, if the application does not modify the relayed content, as is often the case with L7 load balancers, we can splice the two physical connections into a single logical connection. This allows the NIC to operate as a fast packet forwarder that simply translates packet headers. The per-packet compute cost is fixed, and only a small amount of state is required per spliced connection.

Figure 6. Connection splicing offload
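
Once two connections are spliced, each relayed packet only needs its addresses, ports, and sequence/acknowledgment numbers rewritten. Below is a minimal C sketch of such per-packet translation; the structures and field names are illustrative, not AccelTCP's NIC-stack code.

```c
#include <stdint.h>

struct splice_rule {
    /* Rewrite targets for the opposite connection. */
    uint32_t new_saddr, new_daddr;
    uint16_t new_sport, new_dport;
    /* Deltas bridging the two connections' sequence-number spaces,
     * computed once when the splice is installed. */
    uint32_t seq_delta, ack_delta;
};

struct pkt_hdr {
    uint32_t saddr, daddr;   /* IP addresses (flattened here for brevity) */
    uint16_t sport, dport;   /* TCP ports */
    uint32_t seq, ack;       /* TCP sequence/acknowledgment numbers */
};

static void splice_translate(struct pkt_hdr *h, const struct splice_rule *r)
{
    h->saddr = r->new_saddr;  h->daddr = r->new_daddr;
    h->sport = r->new_sport;  h->dport = r->new_dport;
    /* Unsigned 32-bit addition handles sequence-number wraparound. */
    h->seq += r->seq_delta;
    h->ack += r->ack_delta;
    /* A real implementation would also update the IP/TCP checksums
     * incrementally (RFC 1624) rather than recomputing them. */
}
```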

Host Stack Optimizations

We optimize the host networking stack to accelerate small message processing. While these optimizations are orthogonal to NIC offload, they bring a significant performance benefit to short-lived connections.

Evaluation

System configuration:

Key-value store (Redis)

We first evaluate the effectiveness of AccelTCP with Redis. We use Redis ported to mTCP as the baseline server and port it to AccelTCP for comparison. We test with the USR workload from Facebook, which consists of 99.8% GET and 0.2% SET requests with short keys and values (< 20B). Redis-AccelTCP achieves up to 2.3x higher throughput than Redis-mTCP, and its performance scales well with the number of CPU cores. We also find that AccelTCP saves up to 75% of the CPU cycles spent on TCP processing.

Figure 7. Throughput performance of Redis with mTCP vs. AccelTCP
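
For reference, the snippet below sketches a toy generator for the request mix described above (99.8% GET, 0.2% SET, keys and values under 20 bytes). It is purely illustrative and is not the load generator used in the evaluation.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    srand(42);
    for (int i = 0; i < 1000; i++) {
        char key[20];                       /* short key, < 20 bytes */
        snprintf(key, sizeof(key), "usr:%011d", rand() % 1000000);
        if (rand() % 1000 < 2)              /* 0.2% of requests: SET */
            printf("SET %s v%d\r\n", key, i);
        else                                /* 99.8% of requests: GET */
            printf("GET %s\r\n", key);
    }
    return 0;
}
```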

Layer-7 load balancer (HAProxy)

We next examine whether AccelTCP improves the performance of HAProxy, a widely used HTTP-based L7 load balancer. We port HAProxy to use mTCP and AccelTCP respectively, and evaluate their performance with the SpecWeb2009 workload. HAProxy-AccelTCP achieves 11.9x higher throughput than HAProxy-mTCP, and it reduces the average response time by 13.6x (from 13.33 ms to 0.98 ms).

Figure 8. Throughput performance of HAProxy with mTCP vs. AccelTCP

Publications
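
YoungGyoun Moon, SeungEon Lee, Muhammad Asim Jamshed, and KyoungSoo Park, "AccelTCP: Accelerating Network Applications with Stateful TCP Offloading," in Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2020.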

Source code

Check out the latest release of AccelTCP on our GitHub!

People

YoungGyoun Moon, SeungEon Lee, Muhammad Asim Jamshed, and KyoungSoo Park