Estimated reading time: 8 minutes
“There are never any shortcuts, and anyone who tells you there’s a shortcut is blowing smoke.”
— Malcolm Gladwell (Journalist & Author)
“Malcolm Gladwell says it takes 10,000 hours to master something. This is [nonsense]. Total, utter [nonsense].”
— James Altucher (Entrepreneur, Hedge Fund Manager, Comedian & Author)
Remember when people used to complain about airport customs lines? They’d research airport wait-time ranking websites and check the best and worst times to travel internationally. But then along came “Global Entry” in ’08, which let travelers skip the line.
Remember when Malcolm Gladwell popularized the research behind the “10,000 hour rule” to mastery? But then along came James Altucher in 2021 with several shortcuts to that rule[1] in his book “Skip the Line: The 10,000 Experiments Rule and Other Surprising Advice for Reaching Your Goals”.
Main memory has been the computing world’s version of the customs line and the “10,000 hour rule” combined... but then along came Intel DDIO to help skip the line. What is Intel DDIO, and how exactly does it help?
The Memory Wall
In the “von Neumann” architecture upon which most servers today are based, the compute unit is dependent on bus-connected memory for both instructions and data. However, the ever-increasing processor/memory performance gap, known as “The Memory Wall”, poses scaling challenges for this model. And while bandwidth has improved incrementally, memory latency is at a virtual standstill. Chip designers mitigate this performance gap with larger core caches, on-die memory controllers, Simultaneous Multithreading (SMT), additional cores, larger out-of-order windows, etc. But the gulf in performance remains a prominent issue.
The situation is particularly problematic for storage and network I/O. Take a network adapter (NIC), for instance. Not only does it incur 150+ ns of PCIe packet-forwarding latency, but it must DMA that packet into RAM, which adds another 60 – 100ns of latency per memory access before the CPU can process it.[2] That’s a lotta extra overhead, not to mention the memory bandwidth it robs from running applications. This bottleneck will only worsen as 100Gb+ NICs become more widely deployed. Hmm, if only the NIC could skip the line, right? Oh wait... it can! Enter Intel DDIO.
What is Intel DDIO?
DDIO, or Data Direct I/O, is Intel’s latest implementation of Direct Cache Access (DCA). DCA is a technique that Intel researchers proposed in ’05 to deal with the growing adoption of 10GbE. Processing budget for the smallest Ethernet frame at a 10Gb/s rate is only 67.2ns, smaller than typical RAM access latency. Therefore, DCA aims to minimize, and ideally eliminate, main memory involvement in the I/O critical path.
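To see where that 67.2ns figure comes from, here’s a quick back-of-envelope sketch in Python, assuming the minimum 64-byte Ethernet frame plus the 8-byte preamble/SFD and 12-byte inter-frame gap that also occupy the wire:

# Per-frame processing budget at 10 Gb/s line rate.
# Assumes minimum 64-byte frame + 8-byte preamble/SFD + 12-byte inter-frame gap.
LINE_RATE_BPS = 10e9            # 10 Gb/s
MIN_FRAME_BYTES = 64 + 8 + 12   # frame + preamble/SFD + IFG on the wire

budget_ns = MIN_FRAME_BYTES * 8 / LINE_RATE_BPS * 1e9
print(f"Per-frame budget at 10 Gb/s: {budget_ns:.1f} ns")   # -> 67.2 ns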
Early implementations relied on PCIe TLP Processing Hints, which made it possible for chipsets to prefetch I/O data into targeted cache destinations and, thus, reduce CPU stall time. While this reduces latency, it does little for bandwidth since the packet must still be DMA-ed into main memory. Also, DCA requires I/O device, chipset, CPU, and OS support (plus a signed permission slip from a legal guardian) in order to function.
Intel improved upon this with DDIO, which transparently supports DMA directly into and out of Last Level Cache (LLC), bypassing main memory. However, unlike early DCA, DDIO doesn’t support granular targeting of I/O data (i.e., per TLP), nor does it support DMA into remote socket LLCs. Still, DDIO as it stands delivers substantial benefits to I/O-intensive workloads by reducing both memory bandwidth usage and I/O latency.
Intel DDIO Demo
We’ll demonstrate DDIO with a SockPerf “ping pong” test between hosts on the same VLAN equipped with Solarflare 10GbE NICs. The system-under-test (SUT) is a dual-socket Cascade Lake machine running CentOS 7.8 and using Solarflare OpenOnload 7.1.x. DDIO is toggled off/on between tests using ddio-bench. For both scenarios (DDIO on and off), the following command lines are used:
On the SUT host:
taskset -c 3 onload -p latency sockperf ping-pong -i 10.1.1.3 -p 5001 --msg-size=256 -t 10
This pins the SockPerf ping-pong client to core 3 on Socket #1, where the PCIe port of the Solarflare NIC is connected. It bypasses the Linux kernel with the “low latency” profile configuration of OpenOnload, and exchanges 256-byte messages with the server at 10.1.1.3:5001 for a duration of 10 seconds.
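If you want to verify which socket a NIC’s PCIe port hangs off of before pinning, sysfs exposes the adapter’s NUMA node and local CPU list. A minimal sketch, where the interface name is a placeholder for the Solarflare port on your system:

# Report the NUMA node and local CPUs for a NIC's PCIe device.
# "eth2" is a hypothetical interface name; substitute your own.
from pathlib import Path

iface = "eth2"
numa_node = Path(f"/sys/class/net/{iface}/device/numa_node").read_text().strip()
local_cpus = Path(f"/sys/class/net/{iface}/device/local_cpulist").read_text().strip()
print(f"{iface}: NUMA node {numa_node}, local CPUs {local_cpus}")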
On the SockPerf Server host (an old Westmere):
taskset -c 3 onload -p latency sockperf server -i 10.1.1.3 -p 5001
This spins up a waiting server on its own 10.1.1.3 address, pins it to core 3 on Socket #1, and bypasses the kernel network stack in the same fashion as the SUT.
Here are the latency metrics as reported by SockPerf with DDIO disabled:
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=9.550 sec; SentMessages=1793530; ReceivedMessages=1793530
sockperf: ====> avg-latency=2.653 (std-dev=0.062)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 2.653 usec
sockperf: Total 1793530 observations; each percentile contains 17935.30 observations
sockperf: ---> <MAX> observation = 5.960
sockperf: ---> percentile 99.999 = 5.435
sockperf: ---> percentile 99.990 = 4.146
sockperf: ---> percentile 99.900 = 3.633
sockperf: ---> percentile 99.000 = 2.815
sockperf: ---> percentile 90.000 = 2.687
sockperf: ---> percentile 75.000 = 2.658
sockperf: ---> percentile 50.000 = 2.641
sockperf: ---> percentile 25.000 = 2.629
sockperf: ---> <MIN> observation = 2.572
Notice the difference in latency and message throughput with DDIO enabled:
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=9.550 sec; SentMessages=1849114; ReceivedMessages=1849114
sockperf: ====> avg-latency=2.572 (std-dev=0.048)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 2.572 usec
sockperf: Total 1849114 observations; each percentile contains 18491.14 observations
sockperf: ---> <MAX> observation = 5.717
sockperf: ---> percentile 99.999 = 5.152
sockperf: ---> percentile 99.990 = 3.915
sockperf: ---> percentile 99.900 = 3.451
sockperf: ---> percentile 99.000 = 2.666
sockperf: ---> percentile 90.000 = 2.598
sockperf: ---> percentile 75.000 = 2.579
sockperf: ---> percentile 50.000 = 2.566
sockperf: ---> percentile 25.000 = 2.556
sockperf: ---> <MIN> observation = 2.512
The DDIO-disabled run has a 99% confidence interval of 2.653μs +/- 0.000119, while the DDIO-enabled run has a 99% confidence interval of 2.572μs +/- 0.000091. That’s roughly an 80ns improvement (~290 cycles on a 3.6GHz CPU), which translates into 55,584 (1,849,114 – 1,793,530) more messages processed within the same 10s duration.
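For reference, those intervals follow from the usual half-width formula z × std-dev / √n, with z ≈ 2.576 for 99% confidence. A quick sketch that reproduces them from the sockperf summaries above:

# Reproduce the 99% confidence intervals from the sockperf summaries:
# half-width = z * stddev / sqrt(n), with z ~= 2.576 for 99%.
from math import sqrt

Z_99 = 2.576
runs = {
    "DDIO off": (2.653, 0.062, 1_793_530),  # avg us, std-dev us, messages
    "DDIO on":  (2.572, 0.048, 1_849_114),
}
for name, (avg, std, n) in runs.items():
    half_width = Z_99 * std / sqrt(n)
    print(f"{name}: {avg:.3f} us +/- {half_width:.6f}")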
Active Benchmarking
As I outlined in last month’s article, proper benchmarking incorporates low-impact observability tools to determine the “Why?” behind benchmark results. Since DDIO provides direct paths to and from the LLC, let’s look at LLC miss ratios during longer SockPerf runs.
Perf Stat
Notice the LLC miss rate and IPC for SockPerf with DDIO disabled:
[root@eltoro]# perf stat -e cycles:u,instructions:u,mem_load_retired.l3_miss:u,mem_load_retired.l2_miss:u -p 28461 -- sleep 10
Performance counter stats for process id '28461':
42,901,050,092 cycles:u
111,549,903,148 instructions:u # 2.60 insn per cycle
5,213,372 mem_load_retired.l3_miss:u
9,775,507 mem_load_retired.l2_miss:u
10.000510912 seconds time elapsed
Compare that to the LLC miss rate and IPC with DDIO enabled:
[root@eltoro]# perf stat -e cycles:u,instructions:u,mem_load_retired.l3_miss:u,mem_load_retired.l2_miss:u -p 27711 -- sleep 10
Performance counter stats for process id '27711':
42,901,181,882 cycles:u
117,305,434,741 instructions:u # 2.73 insn per cycle
37,401 mem_load_retired.l3_miss:u
6,629,667 mem_load_retired.l2_miss:u
10.000545539 seconds time elapsed
Because DDIO DMAs directly into the LLC instead of RAM, the L3 miss count drops by two orders of magnitude, taking the overall LLC miss rate (l3_miss / l2_miss) from 53% with DDIO disabled down to 0.6% with it enabled. This helps explain the bump in IPC from 2.60 to 2.73.
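The quoted miss rates are simply the ratio of retired loads that missed L3 to those that missed L2, i.e., the fraction of loads reaching the LLC that also missed it. The arithmetic, using the counters above:

# LLC miss rate = loads that missed L3 / loads that missed L2
counters = {
    "DDIO off": (5_213_372, 9_775_507),   # l3_miss, l2_miss
    "DDIO on":  (37_401, 6_629_667),
}
for name, (l3_miss, l2_miss) in counters.items():
    print(f"{name}: LLC miss rate = {l3_miss / l2_miss:.1%}")
# -> DDIO off: 53.3%, DDIO on: 0.6%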
Perf Top
And where is CPU time spent during these runs? Let’s fire up perf top and see:
[root@eltoro]# perf top
Samples: 34K of event 'cycles:ppp', 4000 Hz, Event count (approx.): 28510377758 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
47.15% libonload.so [.] ci_udp_recvmsg_common
13.93% libonload.so [.] ef_eventq_has_event
11.47% libonload.so [.] ci_udp_recvmsg_get
6.65% libonload.so [.] ci_netif_unlock
5.86% libonload.so [.] ci_udp_recvmsg_try_os
3.36% libonload.so [.] ef10_ef_vi_transmitv_ctpio_paced
0.94% sockperf [.] Client<IoRecvfrom, SwitchOff, SwitchOff, SwitchOff, SwitchOff, PongModeAlways>::doSendThenReceiveLoop
0.91% libonload.so [.] ef10_ef_eventq_poll
0.84% libonload.so [.] ci_udp_sendmsg
0.76% libonload.so [.] ip_csum64_partialv
0.66% libonload.so [.] ci_netif_poll_n
0.59% libonload.so [.] ci_netif_poll_evq
0.48% libonload.so [.] ci_netif_poll_intf_future
0.42% libc-2.17.so [.] __memcpy_ssse3_back
0.35% libonload.so [.] process_post_poll_list
0.34% libonload.so [.] ci_netif_filter_for_each_match
0.25% libonload.so [.] __ci_netif_send
0.25% libonload.so [.] ci_udp_sendmsg_send
0.22% libonload.so [.] recvfrom
0.20% libonload.so [.] sendto
0.18% libonload.so [.] ci_netif_rx_post
Virtually all CPU time is spent in OpenOnload’s kernel-bypass networking code, exactly where we’d expect it during a ping-pong benchmark.
NOTE: For an in-depth discussion on using perf for profiling and reading hardware PMU counters for application performance analysis, check out our book Performance Analysis and Tuning on Modern CPUs.[3]
Intel Performance Counter Monitor (PCM)
If DDIO works as advertised, then we’ll also see a sharp drop in memory read/write throughput between runs. Below are pcm-memory.x snapshots with DDIO disabled (first block) and enabled (second block):
|---------------------------------------||---------------------------------------|
|-- System DRAM Read Throughput(MB/s): 33.27 --|
|-- System DRAM Write Throughput(MB/s): 88.43 --|
|-- System PMM Read Throughput(MB/s): 0.00 --|
|-- System PMM Write Throughput(MB/s): 0.00 --|
|-- System Read Throughput(MB/s): 33.27 --|
|-- System Write Throughput(MB/s): 88.43 --|
|-- System Memory Throughput(MB/s): 121.70 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|-- System DRAM Read Throughput(MB/s): 4.08 --|
|-- System DRAM Write Throughput(MB/s): 3.96 --|
|-- System PMM Read Throughput(MB/s): 0.00 --|
|-- System PMM Write Throughput(MB/s): 0.00 --|
|-- System Read Throughput(MB/s): 4.08 --|
|-- System Write Throughput(MB/s): 3.96 --|
|-- System Memory Throughput(MB/s): 8.04 --|
|---------------------------------------||---------------------------------------|
Not only do we achieve a higher message rate with DDIO enabled, but we do it while drastically reducing memory controller burden (122MB/s vs. 8MB/s), which limits contention and thereby lowers RAM access latency for all other running threads on this socket.
Intel DDIO Tradeoffs
Performance optimization is all about tradeoffs – there are no solutions in this racket. That is also true for the default configuration of DDIO, which restricts PCIe writes that miss in the cache to 2 of the 11 LLC ways.[4] This limit ensures that I/O caching doesn’t unduly hinder application caching. However, research shows that the 2-way limit renders DDIO ineffective at rates above 100Gb/s.[5] The issue can be mitigated by increasing the number of ways via MSR 0xC8B, which holds a bitmask of the 11 ways. It defaults to 0x600, or 11000000000 in binary, for two enabled ways. You can raise that limit from the default, but you can’t lower it below 2.
For example, if you wanted to add two additional DDIO LLC ways for a total of four, the bitmask you’d use would be 11110000000, or 0x780:
[root@eltoro]# wrmsr -a 0xC8B 0x780
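The mask simply sets the top N of the 11 way bits, so 2 ways gives 0x600 and 4 ways gives 0x780. Here’s a small helper sketch that builds the value to hand to wrmsr, assuming the 11-way LLC of this platform (way counts vary by SKU):

# Build the MSR 0xC8B bitmask for N DDIO-eligible LLC ways: the top N of
# the 11 way bits are set. Assumes an 11-way LLC; adjust LLC_WAYS per SKU.
LLC_WAYS = 11

def ddio_way_mask(n_ways: int) -> int:
    if not 2 <= n_ways <= LLC_WAYS:
        raise ValueError("DDIO requires between 2 and LLC_WAYS ways")
    return ((1 << n_ways) - 1) << (LLC_WAYS - n_ways)

for n in (2, 4, 8):
    print(f"{n} ways -> {ddio_way_mask(n):#x}")
# -> 2 ways -> 0x600, 4 ways -> 0x780, 8 ways -> 0x7f8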
If all else fails, you can disable DDIO on a PCIe port basis or globally using ddio-bench or the BIOS.
Wrapping Up
People always say that there are no shortcuts in life, but life repeatedly disproves that. You don’t have to toil for 10,000 hours before achieving mastery. And packets don’t have to wait outside in the long RAM line to gain entry into trendy Club CPU. Heck, packets don’t even have to wait in the long PCIe line anymore thanks to SmartNICs, nor must they wait in the fiber<=>transceiver<=>NIC line now that FPGA-enabled, in-line processing network devices are here! Yes, in both life and computing, we very well can skip the line.
1. Actually, David Epstein beat James to it in his 2013 book “The Sports Gene: Inside the Science of Extraordinary Athletic Performance” and then again in his 2019 book (one of my favorites) “Range: Why Generalists Triumph in a Specialized World”. Epstein has also directly debated the research behind the “10,000 hour rule” with one of its main contributors, Anders Ericsson.
2. DMA in a single direction may comprise multiple memory accesses.
3. Paid affiliate link.
4. However, it can update any cache line in the event of a cache hit.
5. “Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks” – USENIX Annual Technical Conference, July 15–17, 2020 – https://www.usenix.org/conference/atc20/presentation/farshin