“Noisy Neighbor Effect” and Ways of Handling It

Noisy Neighbor Effect

Maybe you’re in your dorm room studying for mid-terms while a loud party rages next door. Or you’re a homeowner who has just rocked your newborn to sleep when the dysfunctional couple next door decides to take their argument right outside the fence you both share. That party didn’t take place in YOUR dorm, and that couple’s spat didn’t happen in YOUR house. Yet it disturbed you nonetheless because of shared space. This phenomenon, known as the “Noisy Neighbor Effect”, shows up in the computing world, too. Just as those neighbors affected your study time or your baby’s bedtime, noisy platform neighbors can affect your application.

As is the case with anyone in a dorm or a house, your app only has the illusion of seclusion. In reality, your app shares resources much as apartments share walls or houses share fences. Optimal performance therefore demands active management of this noisy neighbor nuisance. We’ll consider just a few of the more common sources of noisy neighbors¹, along with soundproofing techniques for each. These noise sources include:

  • Simultaneous Multithreading (SMT)
  • Last Level Cache (LLC)
  • Memory Controller
  • DRAM

NOTE: All demos through the remainder of this article will use various workloads from the stress-ng tool. Each section starts with a workload run without interference, followed by a subsequent run alongside a noisy neighbor. Finally, the effects of various soundproofing techniques are evaluated.
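If you’d like to follow along, the demos rely on a handful of standard tools. Exact package names vary by distribution; on a RHEL-like system, something along these lines should pull them in (the package list here is my assumption about your repositories, not part of the original setup; intel-cmt-cat provides the pqos utility used later):

# stress-ng for the workloads, perf for counters, pqos for CAT/MBA, numactl for NUMA binding
dnf install stress-ng perf numactl intel-cmt-cat dmidecode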

Noisy Neighbor #1: SMT

Given the way they’re treated in online writeups, you’d be forgiven for confusing SMT siblings with full-fledged cores. We can thank the industry’s overemphasis on throughput over latency for that. In reality, SMT is a method of partitioning a core for simultaneous execution of multiple application threads. This is useful because relatively few applications exploit the maximum instruction-level parallelism of 4–5 instructions per cycle (IPC) offered by modern CPUs. So SMT trades instruction-level parallelism for thread-level parallelism to make more efficient use of underutilized core resources. But how much sense does SMT make for low-latency applications that play in the > 3.0 IPC space?
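Before pinning workloads, it helps to know which logical CPUs are siblings of the same physical core. A minimal way to check on Linux (assuming the usual sysfs layout; the CPU number below is just an example):

# Map each logical CPU to its physical core and socket to spot SMT sibling pairs
lscpu --extended=CPU,CORE,SOCKET

# Or ask sysfs directly which logical CPUs share a physical core with CPU 1
cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list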

Demo

This first demo will run on an i9-10940X 14C/28T desktop. We’ll use the vm workload method “flip”, which sequentially works through memory 8 times, each pass inverting one bit (effectively inverting each byte over the 8 passes).

Notice the total bogo ops (2,733,696) and IPC (3.78) achieved when running the “flip” workload on only one of the two SMT siblings (CPUs 1 and 15) of core 1:

[mdawson@eltoro]$ perf stat -e cycles:u,instructions:u stress-ng --vm 1 --taskset 1 --vm-bytes 1m --vm-method flip --metrics-brief --timeout 30s

stress-ng: info:  [41721] dispatching hogs: 1 vm
stress-ng: info:  [41721] successful run completed in 30.00s
stress-ng: info:  [41721] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [41721]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [41721] vm              2733696     30.00     28.11      1.99     91122.80     90820.47

 Performance counter stats for 'stress-ng --vm 1 --taskset 1 --vm-bytes 1m --vm-method flip --metrics-brief --timeout 30s':

   123,284,656,301      cycles:u
   465,984,735,649      instructions:u            #    3.78  insn per cycle

      30.019700231 seconds time elapsed

      28.121419000 seconds user
       2.007005000 seconds sys

Now notice what happens when we spawn a second “flip” thread so that each SMT sibling of core 1 runs its own thread. Total bogo ops drops to 1,467,360 (2,934,720 / 2) per “flip” thread, and IPC drops from 3.78 to 1.95.

[mdawson@eltoro]$ perf stat -e cycles:u,instructions:u stress-ng --vm 2 --taskset 1,15 --vm-bytes 1m --vm-method flip --metrics-brief --timeout 30s

stress-ng: info:  [41853] dispatching hogs: 2 vm
stress-ng: info:  [41853] successful run completed in 30.00s
stress-ng: info:  [41853] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [41853]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [41853] vm              2934720     30.00     56.76      3.44     97823.48     48749.50

 Performance counter stats for 'stress-ng --vm 2 --taskset 1,15 --vm-bytes 1m --vm-method flip --metrics-brief --timeout 30s':

   256,586,712,633      cycles:u
   500,395,671,193      instructions:u            #    1.95  insn per cycle

      30.020655553 seconds time elapsed

      56.779145000 seconds user
       3.459457000 seconds sys

SMT Soundproofing

  • Skip a Sibling – notice what happens when we leave the SMT sibling on core 1 vacant and use one of the siblings of core 2 instead. We essentially double the total bogo ops while keeping IPC at the same high level as the initial run.
[mdawson@eltoro]$ perf stat -e cycles:u,instructions:u stress-ng --vm 2 --taskset 1,2 --vm-bytes 1m --vm-method flip --metrics-brief --timeout 30s

stress-ng: info:  [41787] dispatching hogs: 2 vm
stress-ng: info:  [41787] successful run completed in 30.00s
stress-ng: info:  [41787] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [41787]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [41787] vm              5424768     30.00     56.32      3.88    180824.74     90112.43

 Performance counter stats for 'stress-ng --vm 2 --taskset 1,2 --vm-bytes 1m --vm-method flip --metrics-brief --timeout 30s':

   243,994,747,398      cycles:u
   924,967,676,094      instructions:u            #    3.79  insn per cycle

      30.020230453 seconds time elapsed

      56.334931000 seconds user
       3.903555000 seconds sys
  • Disable SMT – while skipping a sibling seems like the optimal approach, remember that various SMT implementations apportion core resources among siblings both statically and dynamically. In other words, merely having SMT enabled can rob a thread of resources it could otherwise use had SMT been completely disabled at the BIOS (a runtime alternative is sketched below).
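If a trip to the BIOS isn’t practical, recent Linux kernels expose a runtime SMT switch. A minimal sketch, assuming your kernel supports the SMT control interface (behavior may still differ from a true BIOS-level disable on some platforms, so verify on your own hardware):

# Check the current SMT state (on / off / forceoff / notsupported)
cat /sys/devices/system/cpu/smt/control

# Take all SMT siblings offline at runtime
echo off > /sys/devices/system/cpu/smt/control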

Noisy Neighbor #2: LLC

While the L1/L2 caches are private to each core of a multicore CPU, all the cores share the LLC. This shared space forms yet another point of contention rife with noisy neighbor potential.
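You can see this sharing directly in the cache topology the kernel exports. A quick check (index3 is typically the LLC, but index numbering can vary by CPU):

# Private L1 data cache: shared only by the SMT siblings of a single core
cat /sys/devices/system/cpu/cpu1/cache/index0/shared_cpu_list

# LLC (L3): shared by every core on the socket
cat /sys/devices/system/cpu/cpu1/cache/index3/shared_cpu_list
cat /sys/devices/system/cpu/cpu1/cache/index3/size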

Interestingly, Intel microarchitectures prior to Skylake posed a rather insidious form of LLC noisy neighbor potential via their inclusive LLC. Inclusive LLCs require that cache lines present in the private L1/L2 also reside in the LLC. The benefit is simplified cache coherency, since snoop traffic largely stays at the LLC level. The drawback (other than wasted space) is that any cache line evicted from the LLC *must* also be evicted from the private L1/L2 to maintain the inclusive property. Imagine a memory-hungry application causing evictions from your private caches due to thrashing behavior at the LLC!

Intel originally planned to address this in the Broadwell (BDW) release using Query Based Selection (QBS), whereby the private caches are queried during the LLC line eviction algorithm. However, due to “unresolved issues”², it was abandoned before beta release. Luckily, Intel shifted to non-inclusive LLCs starting with Skylake while offering additional soundproofing techniques, which we’ll discuss below.

Demo

The next several demos will run on a 2-socket Xeon Gold 6244 system with SMT disabled. We’ll be using cores from the second socket, which are all odd-numbered (i.e., 1, 3, 5, 7, 9, 11, 13, and 15). We’ll be using the “zero-one” stress-ng workload, which sets all memory bits to zero and checks that they’re all zero, then sets them all to one and checks them again. Notice that the baseline run produces 8,226,560 total bogo ops, an IPC of 1.74, and an LLC miss ratio of 65% (mem_load_retired.l3_miss / mem_load_retired.l2_miss) when running “zero-one” on core 1 of that socket:

[root@serverA]# perf stat -e cycles:u,instructions:u,mem_load_retired.l3_miss:u,mem_load_retired.l2_miss:u stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s

stress-ng: info:  [12670] dispatching hogs: 1 vm
stress-ng: info:  [12670] successful run completed in 30.00s
stress-ng: info:  [12670] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [12670]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [12670] vm              8226560     30.00     29.92      0.00    274207.30    274951.87

 Performance counter stats for 'stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s':

   128,680,542,954      cycles:u
   223,262,419,426      instructions:u            #    1.74  insn per cycle
         3,077,063      mem_load_retired.l3_miss:u
         4,722,952      mem_load_retired.l2_miss:u

      30.006803071 seconds time elapsed

      29.926971000 seconds user
       0.009710000 seconds sys

Now notice what happens when I run a competing workload on core 15 of that same socket. Total bogo ops drops by 1,749,760, IPC drops from 1.74 to 1.37, and the LLC miss ratio jumps from 65% to 84%.

[root@serverA]# perf stat -e cycles:u,instructions:u,mem_load_retired.l3_miss:u,mem_load_retired.l2_miss:u stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s

stress-ng: info:  [12773] dispatching hogs: 1 vm
stress-ng: info:  [12773] successful run completed in 30.00s
stress-ng: info:  [12773] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [12773]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [12773] vm              6476800     30.00     29.92      0.00    215884.08    216470.59

 Performance counter stats for 'stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s':

   128,672,001,885      cycles:u
   175,790,168,325      instructions:u            #    1.37  insn per cycle
         7,099,347      mem_load_retired.l3_miss:u
         8,407,384      mem_load_retired.l2_miss:u

      30.007200248 seconds time elapsed

      29.927424000 seconds user
       0.009620000 seconds sys

LLC Soundproofing

  • Intel Cache Allocation Technology (CAT) – since all the cores of a socket share the LLC, Intel introduced an LLC partitioning technology known as CAT, which lets segments of the cache be apportioned among the cores. Here we restrict core 15 to only 2 of the 11 LLC ways available by first creating an LLC class-of-service covering the lowest 2 ways (i.e., pqos -e "llc:1=0x003") and then assigning core 15 to that class-of-service (i.e., pqos -a "llc:1=15"); the full pqos sequence, including verification and reset, is sketched after this list. Notice how much of the original workload performance is recovered:
[root@serverA]# perf stat -e cycles:u,instructions:u,mem_load_retired.l3_miss:u,mem_load_retired.l2_miss:u stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s

stress-ng: info:  [13080] dispatching hogs: 1 vm
stress-ng: info:  [13080] successful run completed in 30.00s
stress-ng: info:  [13080] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [13080]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [13080] vm              7279360     30.00     29.92      0.00    242635.07    243294.12

 Performance counter stats for 'stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s':

   128,677,485,832      cycles:u
   197,556,939,861      instructions:u            #    1.54  insn per cycle
         4,264,701      mem_load_retired.l3_miss:u
         5,862,294      mem_load_retired.l2_miss:u

      30.007018799 seconds time elapsed

      29.926168000 seconds user
       0.010653000 seconds sys
  • Socket Partitioning – Intel CAT works pretty well, but it’s not perfect. Both Intel and AMD offer socket partitioning capabilities that create separate NUMA nodes out of a single socket, effectively dividing the LLC among the resulting partitions. Below, Intel’s Sub-NUMA Cluster (SNC) mode splits the socket into 2 NUMA nodes. Both runs below use the same “zero-one” workload from before on core 1, but the second one runs alongside a competing workload on core 15. Yet both runs have virtually identical total bogo ops, IPC, and LLC miss ratios – this is due to the effective separation of LLC resources in SNC mode (cores 1 and 15 end up in separate NUMA nodes). The downside is that the LLC size per node is cut in half, hence the higher LLC miss ratio.
[root@serverA]# perf stat -e cycles:u,instructions:u,mem_load_retired.l3_miss:u,mem_load_retired.l2_miss:u stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s

stress-ng: info:  [8393] dispatching hogs: 1 vm
stress-ng: info:  [8393] successful run completed in 30.00s
stress-ng: info:  [8393] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [8393]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [8393] vm              6880000     30.00     29.92      0.00    229323.60    229946.52

 Performance counter stats for 'stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s':

   128,675,142,769      cycles:u
   186,714,635,811      instructions:u            #    1.45  insn per cycle
        12,784,466      mem_load_retired.l3_miss:u
        13,523,594      mem_load_retired.l2_miss:u

      30.006948214 seconds time elapsed

      29.925508000 seconds user
       0.011315000 seconds sys
[root@serverA]# perf stat -e cycles:u,instructions:u,mem_load_retired.l3_miss:u,mem_load_retired.l2_miss:u stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s

stress-ng: info:  [8184] dispatching hogs: 1 vm
stress-ng: info:  [8184] successful run completed in 30.00s
stress-ng: info:  [8184] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [8184]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [8184] vm              6877440     30.00     29.92      0.00    229238.19    229860.96

 Performance counter stats for 'stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s':

   128,674,644,912      cycles:u
   186,637,049,382      instructions:u            #    1.45  insn per cycle
        14,528,126      mem_load_retired.l3_miss:u
        15,188,233      mem_load_retired.l2_miss:u

      30.007001555 seconds time elapsed

      29.926171000 seconds user
       0.010716000 seconds sys
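For reference, the CAT setup used in the first bullet above boils down to a short pqos sequence. A minimal sketch, using the class-of-service and mask values from this demo (adjust them for your own LLC; pqos -R puts everything back when you’re done):

# Define class-of-service (COS) 1 with an LLC mask covering the two lowest ways
pqos -e "llc:1=0x003"

# Assign core 15 (the noisy neighbor) to COS 1
pqos -a "llc:1=15"

# Show the resulting allocation configuration
pqos -s

# Reset all allocation classes back to their defaults
pqos -R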

Noisy Neighbor #3: Memory Controller

Memory Controllers (MC) are complex orchestrators. They must juggle requests fairly from every core on the socket, rearrange those requests to maximize DIMM row buffer hits, and respect DIMM timing constraints and bank refresh scheduling, all while preserving memory ordering guarantees. Space constraints limit how much intelligence can be implemented in the MC to protect against bandwidth-hogging applications. Fortunately for us, other sources of mitigation exist.

Demo

This demo will continue on the Xeon Gold 6244 system. We’ll use the “read64” workload, which sequentially reads memory using 32 x 64-bit reads per bogo loop. Notice the baseline total bogo ops of 350,208 when run unperturbed on core 1:

[root@serverA]# stress-ng --vm 1 --taskset 1 --vm-keep --vm-method read64 --vm-bytes 2g --metrics-brief --timeout 30s

stress-ng: info:  [14969] dispatching hogs: 1 vm
stress-ng: info:  [14969] successful run completed in 30.11s
stress-ng: info:  [14969] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [14969]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [14969] vm               350208     30.11     29.34      0.69     11631.11     11661.94

Now, notice the drop in total bogo ops (~25,000) when competing workloads are run on cores 7, 11, and 15 of the same socket.

[root@serverA]# stress-ng --vm 1 --taskset 1 --vm-keep --vm-method read64 --vm-bytes 2g --metrics-brief --timeout 30s

stress-ng: info:  [14619] dispatching hogs: 1 vm
stress-ng: info:  [14619] successful run completed in 30.11s
stress-ng: info:  [14619] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [14619]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [14619] vm               325632     30.11     29.33      0.70     10814.63     10843.56

Memory Controller Soundproofing

  • Intel Memory Bandwidth Allocation (MBA) – MBA is a best-effort, per-core throttling feature that can restrict available memory bandwidth using class-of-service controls similar to those available with LLC CAT. Notice how much of the initial total bogo ops drop is recovered below once I apply a 10% bandwidth throttle to cores 7, 11, and 15 using pqos -e "mba:1=10"; pqos -a "core:1=7,11,15":
[root@serverA]# stress-ng --vm 1 --taskset 1 --vm-keep --vm-method read64 --vm-bytes 2g --metrics-brief --timeout 30s

stress-ng: info:  [15246] dispatching hogs: 1 vm
stress-ng: info:  [15246] successful run completed in 30.11s
stress-ng: info:  [15246] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [15246]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [15246] vm               329728     30.11     29.33      0.70     10950.71     10979.95
  • Socket Partitioning – while Intel MBA provides some benefit, you’ll notice it only recovered ~4,000 of the ~25,000 lost total bogo ops. Again, this is because MBA is only a best-effort technology. Socket partitioning technologies like Intel SNC, on the other hand, provide stricter controls, as shown earlier in the LLC section. Since this Xeon Gold 6244 chip has two (2) memory controllers per socket, SNC mode splits the two controllers between the resulting NUMA nodes. Both runs below were taken in SNC mode, the first unperturbed and the second alongside competing runs on cores 7, 11, and 15; keeping each workload bound to its own node (see the sketch after this list) is what makes this work. Not only is there no difference in total bogo ops between the two runs, but each run achieves higher total bogo ops thanks to the higher bandwidth and lower access latency achievable in SNC mode.
[root@serverA]# stress-ng --vm 1 --taskset 1 --vm-keep --vm-method read64 --vm-bytes 2g --metrics-brief --timeout 30s

stress-ng: info:  [10029] dispatching hogs: 1 vm
stress-ng: info:  [10029] successful run completed in 30.11s
stress-ng: info:  [10029] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [10029]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [10029] vm               417792     30.11     29.36      0.67     13875.72     13912.49
[root@serverA]# stress-ng --vm 1 --taskset 1 --vm-keep --vm-method read64 --vm-bytes 2g --metrics-brief --timeout 30s

stress-ng: info:  [10155] dispatching hogs: 1 vm
stress-ng: info:  [10155] successful run completed in 30.11s
stress-ng: info:  [10155] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [10155]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [10155] vm               417792     30.11     29.35      0.68     13875.48     13912.49
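SNC only protects you if each workload stays on its own node for both CPU and memory. A minimal sketch using numactl (the node numbers are illustrative and depend on how the BIOS enumerates the SNC nodes on your system):

# Inspect the NUMA layout that SNC produced
numactl --hardware

# Keep the latency-sensitive workload and its memory on one SNC node
numactl --cpunodebind=2 --membind=2 stress-ng --vm 1 --vm-keep --vm-method read64 --vm-bytes 2g --metrics-brief --timeout 30s

# Confine the noisy neighbors to the other node on the same socket
numactl --cpunodebind=3 --membind=3 stress-ng --vm 3 --vm-keep --vm-method read64 --vm-bytes 2g --metrics-brief --timeout 30s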

Noisy Neighbor #4: DRAM

Server memory subsystems organize into a well-known hierarchy: memory controller -> memory channel -> DIMM rank -> DRAM chip -> DRAM bank. Read/write concurrency operates at DRAM bank granularity; therefore, the more banks available, the higher the request concurrency level.³

In addition to the bank-level concurrency constraint, DRAM requires periodic bank refreshes that preclude concurrent read/write access. While LPDDR and DDR5 offer finer-grained, bank-level refresh modes, the more commonly deployed DDR3/DDR4 refresh at the rank level, blocking all concurrent read/write access to that rank for the duration of the refresh.

Therefore, not only do concurrent bank-level requests pose a noisy neighbor danger, but so does the refresh schedule of the DIMM itself. Let’s look at a demo, and then at some mitigation strategies.
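Before the demo, it’s worth checking how many DIMM channels a box actually has populated, since that caps the available bank-level parallelism. A quick way to look (requires root; the exact locator strings vary by platform):

# List every DIMM slot with its size and channel/bank locator strings
dmidecode -t memory | grep -E 'Size|Locator'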

Demo

Our final demo will use two (2) Xeon Gold 6244 systems – one (serverA) with all six (6) DIMM channels per socket populated, and the other (serverB) with only one (1) DIMM channel per socket populated. Each test begins with a single “read64” run, followed by a run with two simultaneous instances, and finally a run with three.

Notice the steady decline in total bogo ops as the number of simultaneous runs increases on serverB:

[root@serverB]# nohup stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s &

[root@serverB]# cat nohup.out

stress-ng: info:  [228145] dispatching hogs: 1 vm
stress-ng: info:  [228145] successful run completed in 30.07s
stress-ng: info:  [228145] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [228145]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [228145] vm               424960     30.07     29.62      0.37     14132.78     14170.06

[root@serverB]# nohup stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s & nohup stress-ng --vm 1 --taskset 3 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s &

[root@serverB]# cat nohup.out

stress-ng: info:  [228290] dispatching hogs: 1 vm
stress-ng: info:  [228289] dispatching hogs: 1 vm
stress-ng: info:  [228290] successful run completed in 30.08s
stress-ng: info:  [228290] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [228290]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [228290] vm               297984     30.08     29.61      0.38      9907.69      9936.11
stress-ng: info:  [228289] successful run completed in 30.08s
stress-ng: info:  [228289] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [228289]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [228289] vm               302080     30.08     29.61      0.39     10044.03     10069.33

[root@serverB]# nohup stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s & nohup stress-ng --vm 1 --taskset 3 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s & nohup stress-ng --vm 1 --taskset 5 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s &

[root@serverB]# cat nohup.out

stress-ng: info:  [228381] dispatching hogs: 1 vm
stress-ng: info:  [228382] dispatching hogs: 1 vm
stress-ng: info:  [228380] dispatching hogs: 1 vm
stress-ng: info:  [228381] successful run completed in 30.08s
stress-ng: info:  [228381] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [228381]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [228381] vm               210944     30.08     29.61      0.39      7012.99      7031.47
stress-ng: info:  [228382] successful run completed in 30.08s
stress-ng: info:  [228382] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [228382]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [228382] vm               217088     30.08     29.61      0.39      7216.89      7236.27
stress-ng: info:  [228380] successful run completed in 30.08s
stress-ng: info:  [228380] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [228380]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [228380] vm               215040     30.08     29.60      0.39      7148.96      7170.39

DRAM Soundproofing

  • Memory Interleaving – with only one DIMM channel per socket, it didn’t take long for contention to build up on serverB at the DRAM bank level as multiple requests competed for access. On serverA below, we use the maximum level of interleaving, 6-way, to reduce bank-level contention and note the results. Prior to the introduction of Intel 3D XPoint DIMMs, Intel interleaved memory at a granularity of one or two cache lines (64 or 128 bytes); since 3D XPoint DIMMs, interleaving occurs at a granularity of four (4) cache lines, or 256 bytes. Notice how steady the total bogo ops remains as the number of simultaneous “read64” runs increases with memory interleaving enabled:
[root@serverA]# nohup stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s &

[root@serverA]# cat nohup.out

stress-ng: info:  [26338] dispatching hogs: 1 vm
stress-ng: info:  [26338] successful run completed in 30.05s
stress-ng: info:  [26338] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [26338]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [26338] vm               355328     30.05     29.63      0.34     11822.78     11856.12


[root@serverA]# nohup stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s & nohup stress-ng --vm 1 --taskset 3 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s &

[root@serverA]# cat nohup.out

stress-ng: info:  [26453] dispatching hogs: 1 vm
stress-ng: info:  [26452] dispatching hogs: 1 vm
stress-ng: info:  [26453] successful run completed in 30.07s
stress-ng: info:  [26453] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [26453]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [26453] vm               342016     30.07     29.62      0.37     11374.17     11404.33
stress-ng: info:  [26452] successful run completed in 30.07s
stress-ng: info:  [26452] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [26452]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [26452] vm               345088     30.07     29.62      0.37     11476.34     11506.77

[root@serverA]# nohup stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s & nohup stress-ng --vm 1 --taskset 3 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s & nohup stress-ng --vm 1 --taskset 5 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s &

[root@serverA]# cat nohup.out

stress-ng: info:  [27058] dispatching hogs: 1 vm
stress-ng: info:  [27059] dispatching hogs: 1 vm
stress-ng: info:  [27057] dispatching hogs: 1 vm
stress-ng: info:  [27058] successful run completed in 30.07s
stress-ng: info:  [27058] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [27058]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [27058] vm               344064     30.07     29.62      0.37     11441.32     11472.62
stress-ng: info:  [27059] successful run completed in 30.08s
stress-ng: info:  [27059] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [27059]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [27059] vm               350208     30.08     29.61      0.38     11644.13     11677.49
stress-ng: info:  [27057] successful run completed in 30.08s
stress-ng: info:  [27057] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [27057]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [27057] vm               347136     30.08     29.61      0.38     11542.07     11575.06
  • Socket Partitioning – do you notice a theme developing here? Intel SNC helps in this case as well, since it partitions the socket into two NUMA nodes, each with 3-way interleaving. Keeping the workloads on separate sides of the partition, and verifying that their memory stays there (see the sketch below), provides protection similar to the LLC and memory controller scenarios.
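To confirm each workload really stayed on its side of the partition, you can check where its pages actually landed. A minimal sketch with numastat, which ships alongside numactl (the pgrep expression is just one way to grab a stress-ng PID):

# Per-node memory breakdown for one stress-ng instance
numastat -p $(pgrep -o stress-ng)

# System-wide per-node allocation hit/miss counters
numastat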

Soundproof Home

At great expense, you can eliminate all danger of noisy neighbors by living like a hermit, or by limiting every server purchase to a single application per machine. However, there is a middle ground, occupied by soundproofing building materials on one side and server technologies like CAT, MBA, and SNC on the other. With the low barrier to entry of these modern techniques, you no longer have to put up with noisy neighbors, nor must you pay an arm and a leg to achieve that level of quiet.

  • 1
    Software sources such as the OS, along with mitigation techniques that include cgroups and namespaces, deserve a dedicated article to address adequately.
  • 2
    At the Spring 2015 Intel HPC Roundtable, Intel Engineering informed us that Broadwell would include QBS. When I flew back home, I wrote a test harness to simulate the LLC noisy neighbor issue. My plan was to use it to test the pre-release BDW E5-2699 v4 eval chip I was due to receive. However, upon delivery, I couldn’t detect a behavior difference. Only after reaching back out to Intel did they reveal that QBS had been abandoned due to “unresolved issues”. I wasted a lot of time with that testing.
  • 3
    DDR3 offers a max of 8 banks, DDR4 a max of 16 banks, and DDR5 a max of 32 banks.
