Packet I/O Performance on a Low-end Desktop
- CPU: Intel Core i3 540 Processor (3.07 GHz, 2 cores, Hyper-Threading disabled)
- RAM: 4 GB (two dual-channeled 2 GB DIMMs)
- NIC: 1 Intel X520-DA2 (Intel 82599 Controller, two ports, SFP+ Direct Attach)
- OS: Linux 22.214.171.124 (64 bit, built with x64_def_config)
The DUT (Device Under Test) was directly connected to a packet generator via two SFP+ Direct Attach cables. The packet generator was a commodity server that also ran Packet I/O Engine, rather than a hardware-based measurement instrument. The input traffic was generated by the packet_generator program included in the samples.
In this configuration, the packet generator transmits randomly generated UDP packets with varying IP addresses and port numbers. The DUT receives the packets and silently drops them without any further processing. To make the DUT work as a sink, we ran the echo program with the -s option. We repeated the experiment over packet sizes ranging from 60 to 1514 bytes. The packet size is measured from the start of the Ethernet header to the end of the UDP payload (excluding the CRC checksum).
All Ethernet overheads (24 bytes per packet) were counted in the throughput measurement. Therefore, the theoretical maximum throughput with two 10 GbE links is exactly 20.0 Gbps. For 60-byte packets (the minimum Ethernet frame size, excluding the CRC), 10 Gbps translates to approximately 14.9 Mpps (million packets per second).
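The line-rate figure above can be reproduced with a short calculation. The following is a minimal sketch, not part of the Packet I/O Engine samples; the function name `line_rate_pps` is hypothetical, and the breakdown of the 24-byte overhead into preamble, CRC, and inter-frame gap follows standard Ethernet framing.

```python
# Per-packet Ethernet overhead counted in the throughput measurement:
# preamble and start-of-frame delimiter (8) + CRC (4) + inter-frame gap (12).
ETHERNET_OVERHEAD_BYTES = 24


def line_rate_pps(packet_size, link_bps=10_000_000_000):
    """Maximum packets per second on one link.

    packet_size is in bytes, from the Ethernet header to the end of the
    UDP payload (excluding the CRC), as in the measurements above.
    """
    bits_on_wire = (packet_size + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


if __name__ == "__main__":
    for size in (60, 1514):
        mpps = line_rate_pps(size) / 1e6
        print(f"{size:5d} bytes: {mpps:.2f} Mpps per 10 GbE link")
```

For 60-byte packets this gives about 14.88 Mpps per link, which matches the approximate 14.9 Mpps quoted above.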
The above graph shows the RX throughput of the DUT. The DUT achieves nearly line-rate performance except in two packet-size ranges, 65-87 and 129-139 bytes. In the former range, the packet generator could not generate line-rate traffic. In the latter range, the packet generator transmitted line-rate traffic, but the DUT could not keep up with it. This performance degradation is caused by cache line alignment. For example, a 129-byte packet requires 192 bytes of memory access, since a cache line is 64 bytes and a 129-byte packet occupies three cache lines.
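The cache line arithmetic in the example above can be sketched as follows. This is an illustration only; the helper names are hypothetical, and it assumes that each packet buffer starts on a cache line boundary.

```python
CACHE_LINE_BYTES = 64  # cache line size stated in the text


def cache_lines_touched(packet_size):
    """Number of cache lines a packet occupies (assumes an aligned buffer)."""
    return -(-packet_size // CACHE_LINE_BYTES)  # ceiling division


def bytes_accessed(packet_size):
    """Memory actually transferred, rounded up to whole cache lines."""
    return cache_lines_touched(packet_size) * CACHE_LINE_BYTES


if __name__ == "__main__":
    for size in (128, 129):
        print(size, cache_lines_touched(size), bytes_accessed(size))
```

A 128-byte packet fits in two cache lines, while one extra byte pushes a 129-byte packet to three cache lines and 192 bytes of memory access.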
In this configuration, we swapped the roles of the packet generator and the DUT. The DUT generates packets with packet_generator, and the machine that was previously the packet generator now works as a traffic sink.
The graph shows a throughput pattern similar to that of the above graph.
Performance: Simple forwarding
In this test scenario, the DUT receives packets from the packet generator and echoes them back to the generator with the sample program echo. The generator then ignores the echoed traffic. This test represents the minimal work required for packet forwarding.
(The RX line completely overlaps with the TX line, which means the DUT transmitted all the packets it received.)
For all packet sizes, the throughput falls short of line rate, leveling off at 17-18 Gbps. We suspect that the performance is bound by memory bandwidth. Simple forwarding requires four memory transfers per packet: NIC -> RAM, RAM -> CPU, CPU -> RAM, and RAM -> NIC.
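To get a feel for the memory traffic implied by four transfers per packet, the rough estimate below may help. It is a simplified model and not a measurement: it assumes each of the four transfers moves the whole packet rounded up to 64-byte cache lines, and it ignores descriptors, metadata, and any caching effects. The function name and the 17.5 Gbps input are hypothetical, chosen from within the 17-18 Gbps range reported above.

```python
import math

ETHERNET_OVERHEAD_BYTES = 24  # per-packet overhead counted in the throughput
CACHE_LINE_BYTES = 64
TRANSFERS_PER_PACKET = 4      # NIC -> RAM, RAM -> CPU, CPU -> RAM, RAM -> NIC


def forwarding_memory_traffic(throughput_gbps, packet_size):
    """Estimated memory traffic in bytes per second (simplified model)."""
    bits_on_wire = (packet_size + ETHERNET_OVERHEAD_BYTES) * 8
    pps = throughput_gbps * 1e9 / bits_on_wire
    lines = math.ceil(packet_size / CACHE_LINE_BYTES)
    return pps * TRANSFERS_PER_PACKET * lines * CACHE_LINE_BYTES


if __name__ == "__main__":
    # Hypothetical input: 17.5 Gbps of 1514-byte packets.
    traffic = forwarding_memory_traffic(17.5, 1514)
    print(f"about {traffic / 1e9:.2f} GB/s of memory traffic")
```

Under these assumptions the memory traffic is roughly four times the forwarded payload rate, which is why forwarding stresses the memory system far more than receiving or transmitting alone.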
The plot shows spikes at every 64 bytes. They are caused by the cache line effect mentioned above. The spikes become smaller as the packet size grows, since the memory access wasted on partially filled cache lines becomes marginal for larger packets.
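The shrinking spikes can be illustrated by computing how much extra memory access a packet incurs when it is one byte past a cache line boundary. This is a sketch under the same assumption of cache-line-aligned buffers; the function name `wasted_fraction` is hypothetical.

```python
CACHE_LINE_BYTES = 64


def wasted_fraction(packet_size):
    """Extra memory access relative to the packet size (aligned buffer assumed)."""
    lines = -(-packet_size // CACHE_LINE_BYTES)  # ceiling division
    accessed = lines * CACHE_LINE_BYTES
    return (accessed - packet_size) / packet_size


if __name__ == "__main__":
    # Sizes one byte past a multiple of 64, where the waste is largest.
    for size in (65, 129, 257, 513, 1025):
        print(f"{size:5d} bytes: {wasted_fraction(size):.1%} extra memory access")
```

The wasted bytes stay at 63 for each of these sizes, so their share of the total falls from nearly 100% at 65 bytes to about 6% at 1025 bytes.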