Noisy Neighbor Effect: How to Manage It

Maybe you’re in your dorm room studying for mid-terms while a loud party rages next door. Or you’re a homeowner finally rocking your newborn to sleep when the dysfunctional couple next door decides to take their argument just outside the fence you both share. That party didn’t take place in YOUR dorm, and that couple’s spat didn’t happen in YOUR house. Yet it still disturbed you, because you share space. This phenomenon, known as the “Noisy Neighbor Effect”, shows up in the computing world, too. Just as those neighbors disrupted your study time or your baby’s bedtime, noisy neighbors can disrupt your application.

As is the case with anyone in a dorm or a house, your app only has the illusion of seclusion. In reality, it shares resources much the way apartments share walls and houses share fences. Optimal performance therefore demands active management of these noisy neighbors. We’ll consider just a few of the more common sources of noisy neighbors¹, along with soundproofing techniques for each. These noise sources include:

  • Simultaneous Multithreading (SMT)
  • Last Level Cache (LLC)
  • Memory Controller
  • DRAM

NOTE: All demos throughout the rest of this article use various workloads from the stress-ng tool. Each section starts with a run without interference, followed by another run with a noisy neighbor. Finally, we’ll consider the benefits of various soundproofing techniques.

Noisy Neighbor Effect: SMT

Given the way they’re treated in online articles, you’d be forgiven for mistaking SMT siblings for full-fledged cores. We can thank the industry’s overemphasis on throughput over latency for that. In reality, SMT is a method of partitioning a core for simultaneous execution of multiple application threads. This is useful because relatively few applications exploit the maximum instruction-level parallelism of modern CPUs, which can retire 4 – 5 instructions per cycle (IPC). So SMT trades instruction-level parallelism for thread-level parallelism to make better use of underutilized core resources. But how much sense does SMT make for low-latency applications that already play in the > 3.0 IPC space? In other words, does it impose a noisy neighbor effect?
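
Before measuring anything, it helps to know which logical CPUs are SMT siblings of the same physical core. A quick way to check on Linux (a minimal sketch; CPU numbering varies by platform):

# map each logical CPU to its core and socket
lscpu --extended=CPU,CORE,SOCKET

# or query the siblings of a specific logical CPU, e.g., CPU 1
cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list

On the i9-10940X used below, logical CPUs 1 and 15 report as siblings of core 1, which is why the demo pins to them.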

Demo

This first demo will run on an i9-10940X 14C/28T desktop. We’ll use the vm workload method “flip”, which sequentially works through memory 8 times, each time inverting one bit (effectively inverting each byte over 8 passes).

Notice the number of total bogo ops (2,733,696) and IPC (3.78) achieved when running the “flip” workload on only one of the siblings (1 or 15) on core 1:

[mdawson@eltoro]$ perf stat -e cycles:u,instructions:u stress-ng --vm 1 --taskset 1 --vm-bytes 1m --vm-method flip --metrics-brief --timeout 30s

stress-ng: info:  [41721] dispatching hogs: 1 vm
stress-ng: info:  [41721] successful run completed in 30.00s
stress-ng: info:  [41721] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [41721]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [41721] vm              2733696     30.00     28.11      1.99     91122.80     90820.47

 Performance counter stats for 'stress-ng --vm 1 --taskset 1 --vm-bytes 1m --vm-method flip --metrics-brief --timeout 30s':

   123,284,656,301      cycles:u
   465,984,735,649      instructions:u            #    3.78  insn per cycle

      30.019700231 seconds time elapsed

      28.121419000 seconds user
       2.007005000 seconds sys

Now notice what happens when we spawn a second “flip” thread so that both SMT siblings run a separate thread. Total bogo ops per “flip” thread drops to 1,467,360 (2,934,720 / 2), and the IPC drops from 3.78 to 1.95.

[mdawson@eltoro]$ perf stat -e cycles:u,instructions:u stress-ng --vm 2 --taskset 1,15 --vm-bytes 1m --vm-method flip --metrics-brief --timeout 30s

stress-ng: info:  [41853] dispatching hogs: 2 vm
stress-ng: info:  [41853] successful run completed in 30.00s
stress-ng: info:  [41853] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [41853]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [41853] vm              2934720     30.00     56.76      3.44     97823.48     48749.50

 Performance counter stats for 'stress-ng --vm 2 --taskset 1,15 --vm-bytes 1m --vm-method flip --metrics-brief --timeout 30s':

   256,586,712,633      cycles:u
   500,395,671,193      instructions:u            #    1.95  insn per cycle

      30.020655553 seconds time elapsed

      56.779145000 seconds user
       3.459457000 seconds sys

SMT Soundproofing

Skip A Sibling

Notice what happens when we leave the SMT sibling vacant on core 1, and use one of the siblings of core 2 instead. We basically double the total bogo ops while keeping the IPC at the same high level as the initial run.

[mdawson@eltoro]$ perf stat -e cycles:u,instructions:u stress-ng --vm 2 --taskset 1,2 --vm-bytes 1m --vm-method flip --metrics-brief --timeout 30s

stress-ng: info:  [41787] dispatching hogs: 2 vm
stress-ng: info:  [41787] successful run completed in 30.00s
stress-ng: info:  [41787] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [41787]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [41787] vm              5424768     30.00     56.32      3.88    180824.74     90112.43

 Performance counter stats for 'stress-ng --vm 2 --taskset 1,2 --vm-bytes 1m --vm-method flip --metrics-brief --timeout 30s':

   243,994,747,398      cycles:u
   924,967,676,094      instructions:u            #    3.79  insn per cycle

      30.020230453 seconds time elapsed

      56.334931000 seconds user
       3.903555000 seconds sys

Disable SMT

Skipping a sibling seems like the optimal approach. But remember that various SMT implementations split core resources among siblings both statically and dynamically. In other words, enabling SMT at all robs siblings of resources they could access were it disabled in the BIOS.
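
If you decide the trade-off isn’t worth it, recent Linux kernels also expose an SMT control knob (a hedged sketch; disabling SMT in the BIOS, as suggested above, remains the most thorough option):

# check whether SMT is currently active and how it is controlled
cat /sys/devices/system/cpu/smt/active
cat /sys/devices/system/cpu/smt/control

# take the sibling threads offline at runtime (run as root)
echo off > /sys/devices/system/cpu/smt/control

# or disable it from boot onward with the nosmt kernel parameter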

Noisy Neighbor Effect: LLC

While the L1/L2 caches are private to each core of a multicore CPU, all of the cores share the LLC. This shared space forms yet another point of contention rife with noisy neighbor potential.
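
You can see exactly which cores share an LLC straight from sysfs (a minimal sketch; on most x86 servers the LLC is cache index3):

# size of the LLC and the list of logical CPUs that share it, as seen from cpu0
cat /sys/devices/system/cpu/cpu0/cache/index3/size
cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list

# newer util-linux versions can summarize the cache hierarchy too
lscpu -C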

Brief History Lesson

Interestingly, Intel microarchitectures prior to Skylake posed a rather vexing form of LLC noisy neighbor potential via their inclusive LLC. Inclusive LLCs require that cache lines present in the private L1/L2 also reside in the LLC. The benefit is simplified cache coherency, since snoop traffic largely stays at the LLC level. However, the catch (other than wasted space) is that any cache line evicted at the LLC level *must* also be evicted from the private caches to maintain the inclusive property. Imagine a memory-hungry application causing evictions from your private L1 because of its thrashing behavior at the LLC!

Intel planned to address the issue in the Broadwell (BDW) release using Query Based Selection (QBS), whereby private L1 caches are queried during the LLC line eviction algorithm. However, due to “unresolved issues”², it was abandoned before beta release. Luckily, Intel shifted to non-inclusive LLCs starting with Skylake while offering other soundproofing techniques, which we’ll discuss below.

Demo

The next several demos will run on a 2-socket Xeon Gold 6244 server with SMT disabled. We’ll use cores from the 2nd socket, which are all odd-numbered (i.e., 1, 3, 5, 7, 9, 11, 13, and 15), along with the “zero-one” stress-ng workload, which sets all memory bits to zero and checks that they’re all zero, then sets them to one and checks them again. Notice that the baseline run produces 8,226,560 total bogo ops, an IPC of 1.74, and an LLC miss ratio of 65% (mem_load_retired.l3_miss / mem_load_retired.l2_miss) when running “zero-one” on core 1 of that socket:

[root@serverA]# perf stat -e cycles:u,instructions:u,mem_load_retired.l3_miss:u,mem_load_retired.l2_miss:u stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s

stress-ng: info:  [12670] dispatching hogs: 1 vm
stress-ng: info:  [12670] successful run completed in 30.00s
stress-ng: info:  [12670] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [12670]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [12670] vm              8226560     30.00     29.92      0.00    274207.30    274951.87

 Performance counter stats for 'stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s':

   128,680,542,954      cycles:u
   223,262,419,426      instructions:u            #    1.74  insn per cycle
         3,077,063      mem_load_retired.l3_miss:u
         4,722,952      mem_load_retired.l2_miss:u

      30.006803071 seconds time elapsed

      29.926971000 seconds user
       0.009710000 seconds sys
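
That 65% miss ratio is simply the ratio of the two counters above:

# mem_load_retired.l3_miss / mem_load_retired.l2_miss
echo "scale=2; 3077063 / 4722952" | bc     # -> .65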

Now notice what happens when I run a competing workload on core 15 of that same socket. Total bogo ops drops by 1,749,760 (to 6,476,800), IPC drops from 1.74 to 1.37, and the LLC miss ratio jumps from 65% to 84%.

[root@serverA]# perf stat -e cycles:u,instructions:u,mem_load_retired.l3_miss:u,mem_load_retired.l2_miss:u stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s

stress-ng: info:  [12773] dispatching hogs: 1 vm
stress-ng: info:  [12773] successful run completed in 30.00s
stress-ng: info:  [12773] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [12773]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [12773] vm              6476800     30.00     29.92      0.00    215884.08    216470.59

 Performance counter stats for 'stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s':

   128,672,001,885      cycles:u
   175,790,168,325      instructions:u            #    1.37  insn per cycle
         7,099,347      mem_load_retired.l3_miss:u
         8,407,384      mem_load_retired.l2_miss:u

      30.007200248 seconds time elapsed

      29.927424000 seconds user
       0.009620000 seconds sys

LLC Soundproofing

Intel Cache Allocation Technology (CAT)

Since all the cores of a socket share the LLC, Intel introduced an LLC partitioning technology known as CAT so that portions (ways) of the cache can be apportioned among the cores. Notice how much of the original workload performance is recovered after restricting core 15 to only 2 of the 11 LLC ways available, by first creating an LLC class-of-service using the first 2 ways (i.e., pqos -e "llc:1=0x003") and then assigning core 15 to that class-of-service (i.e., pqos -a "llc:1=15"):

[root@serverA]# perf stat -e cycles:u,instructions:u,mem_load_retired.l3_miss:u,mem_load_retired.l2_miss:u stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s

stress-ng: info:  [13080] dispatching hogs: 1 vm
stress-ng: info:  [13080] successful run completed in 30.00s
stress-ng: info:  [13080] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [13080]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [13080] vm              7279360     30.00     29.92      0.00    242635.07    243294.12

 Performance counter stats for 'stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s':

   128,677,485,832      cycles:u
   197,556,939,861      instructions:u            #    1.54  insn per cycle
         4,264,701      mem_load_retired.l3_miss:u
         5,862,294      mem_load_retired.l2_miss:u

      30.007018799 seconds time elapsed

      29.926168000 seconds user
       0.010653000 seconds sys
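
For reference, here is the full pqos sequence sketched end-to-end (assuming the intel-cmt-cat package that provides pqos is installed; the number of ways and classes of service depends on the CPU):

# show the current allocation configuration
pqos -s

# define class-of-service 1 as the lowest 2 LLC ways (bitmask 0x003)
pqos -e "llc:1=0x003"

# associate core 15 (the noisy neighbor) with class-of-service 1
pqos -a "llc:1=15"

# reset everything back to the default configuration when done
pqos -R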

Socket Partitioning

Intel CAT works pretty well, but it’s not perfect. Both Intel and AMD offer socket partitioning, which creates multiple NUMA nodes out of a single socket and splits the LLC among them. Below, Intel’s Sub-NUMA Clustering (SNC) mode splits the socket into 2 NUMA nodes. Both runs below use the same “zero-one” workload from before on core 1, but the second one runs alongside a competing workload on core 15. Yet both runs have virtually identical total bogo ops, IPC, and LLC miss ratios. This is due to the effective separation of LLC resources in SNC mode (cores 1 and 15 end up in separate NUMA nodes). However, the catch is that the LLC size per node is cut in half, hence the higher LLC miss ratio.
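
Since SNC is a BIOS option, it’s worth confirming after reboot that each socket really does present two NUMA nodes (a quick check; node numbering varies by platform):

# list NUMA nodes along with their CPUs and memory
numactl --hardware

# lscpu shows the same NUMA-to-CPU mapping
lscpu | grep -i numa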

[root@serverA]# perf stat -e cycles:u,instructions:u,mem_load_retired.l3_miss:u,mem_load_retired.l2_miss:u stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s

stress-ng: info:  [8393] dispatching hogs: 1 vm
stress-ng: info:  [8393] successful run completed in 30.00s
stress-ng: info:  [8393] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [8393]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [8393] vm              6880000     30.00     29.92      0.00    229323.60    229946.52

 Performance counter stats for 'stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s':

   128,675,142,769      cycles:u
   186,714,635,811      instructions:u            #    1.45  insn per cycle
        12,784,466      mem_load_retired.l3_miss:u
        13,523,594      mem_load_retired.l2_miss:u

      30.006948214 seconds time elapsed

      29.925508000 seconds user
       0.011315000 seconds sys

[root@serverA]# perf stat -e cycles:u,instructions:u,mem_load_retired.l3_miss:u,mem_load_retired.l2_miss:u stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s

stress-ng: info:  [8184] dispatching hogs: 1 vm
stress-ng: info:  [8184] successful run completed in 30.00s
stress-ng: info:  [8184] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [8184]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [8184] vm              6877440     30.00     29.92      0.00    229238.19    229860.96

 Performance counter stats for 'stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 20m --vm-method zero-one --metrics-brief --timeout 30s':

   128,674,644,912      cycles:u
   186,637,049,382      instructions:u            #    1.45  insn per cycle
        14,528,126      mem_load_retired.l3_miss:u
        15,188,233      mem_load_retired.l2_miss:u

      30.007001555 seconds time elapsed

      29.926171000 seconds user
       0.010716000 seconds sys

NOTE: For a deeper discussion of using perf for profiling and reading hardware PMU counters for application performance analysis, check out our book Performance Analysis and Tuning on Modern CPUs.³

Noisy Neighbor Effect: Memory Controller

Memory controllers (MCs) are complex traffic cops. They must juggle requests fairly from every core on the socket, reorder them to maximize DIMM row-buffer hits, and respect DIMM timing constraints and bank refresh scheduling, all while preserving memory ordering guarantees. Space constraints limit how much intelligence can be built into the MC to protect against bandwidth-hogging applications. All of these factors contribute to a noisy neighbor effect. Fortunately for us, other sources of mitigation exist.

Demo

This demo will continue on the Xeon Gold 6244 server. We’ll use the “read64” workload, which sequentially reads memory using 32 x 64-bit reads per bogo loop. Notice the baseline total bogo ops of 350,208 when run undisturbed on core 1:

[root@serverA]# stress-ng --vm 1 --taskset 1 --vm-keep --vm-method read64 --vm-bytes 2g --metrics-brief --timeout 30s

stress-ng: info:  [14969] dispatching hogs: 1 vm
stress-ng: info:  [14969] successful run completed in 30.11s
stress-ng: info:  [14969] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [14969]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [14969] vm               350208     30.11     29.34      0.69     11631.11     11661.94

Now, notice the drop in total bogo ops (~25,000) when competing workloads run on cores 7, 11, and 15 of the same socket.

[root@serverA]# stress-ng --vm 1 --taskset 1 --vm-keep --vm-method read64 --vm-bytes 2g --metrics-brief --timeout 30s

stress-ng: info:  [14619] dispatching hogs: 1 vm
stress-ng: info:  [14619] successful run completed in 30.11s
stress-ng: info:  [14619] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [14619]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [14619] vm               325632     30.11     29.33      0.70     10814.63     10843.56

Memory Controller Soundproofing

Intel Memory Bandwidth Allocation (MBA)

MBA is a best-effort, per-core throttling feature that offers the ability to limit memory bandwidth usage using class-of-service controls similar to those available with LLC CAT. Notice how much of the initial total bogo ops drop is recovered once I cap cores 7, 11, and 15 at 10% of memory bandwidth using pqos -e "mba:1=10"; pqos -a "core:1=7,11,15":

[root@serverA]# stress-ng --vm 1 --taskset 1 --vm-keep --vm-method read64 --vm-bytes 2g --metrics-brief --timeout 30s

stress-ng: info:  [15246] dispatching hogs: 1 vm
stress-ng: info:  [15246] successful run completed in 30.11s
stress-ng: info:  [15246] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [15246]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [15246] vm               329728     30.11     29.33      0.70     10950.71     10979.95
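
The MBA flow mirrors the CAT flow shown earlier (a hedged sketch; MBA percentages are rounded to whatever throttling granularity the CPU supports):

# cap class-of-service 1 at roughly 10% of memory bandwidth
pqos -e "mba:1=10"

# place the noisy cores into that class-of-service
pqos -a "core:1=7,11,15"

# verify, then reset to defaults when done
pqos -s
pqos -R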

Socket Partitioning

While Intel MBA provides some benefit, you’ll notice it only recovered ~4,000 of the lost ~25,000 total bogo ops. Again, this is because MBA is only a best-effort technology. On the other hand, socket partitioning technologies like Intel SNC provide stricter controls, as shown before in the LLC section. Since this Xeon Gold 6244 chip has two (2) memory controllers per socket, SNC mode splits the two controllers between the resulting NUMA nodes. Both runs below are taken in SNC mode, the first undisturbed and the second alongside competing runs on cores 7, 11, and 15. Not only is there no difference in total bogo ops between these two runs, but each run achieves higher total bogo ops than the earlier baseline due to the higher bandwidth and lower access latency achievable in SNC mode.

[root@serverA]# stress-ng --vm 1 --taskset 1 --vm-keep --vm-method read64 --vm-bytes 2g --metrics-brief --timeout 30s

stress-ng: info:  [10029] dispatching hogs: 1 vm
stress-ng: info:  [10029] successful run completed in 30.11s
stress-ng: info:  [10029] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [10029]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [10029] vm               417792     30.11     29.36      0.67     13875.72     13912.49

[root@serverA]# stress-ng --vm 1 --taskset 1 --vm-keep --vm-method read64 --vm-bytes 2g --metrics-brief --timeout 30s

stress-ng: info:  [10155] dispatching hogs: 1 vm
stress-ng: info:  [10155] successful run completed in 30.11s
stress-ng: info:  [10155] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [10155]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [10155] vm               417792     30.11     29.35      0.68     13875.48     13912.49

Noisy Neighbor Effect: DRAM

Server memory subsystems organize into a hierarchy: memory controller -> memory channel -> DIMM rank -> DRAM chip -> DRAM bank. Read/write concurrency operates at DRAM bank granularity. Therefore, the more banks available, the higher the request concurrency level.⁴

In addition to the bank-level concurrency constraint, DRAM requires periodic bank refreshes that preclude concurrent r/w access. While LPDDR and DDR5 offer finer-grained bank-level refresh modes, the more commonly deployed DDR3/DDR4 refreshes at the rank level, blocking all r/w access to the rank for the duration of the refresh.

Therefore, not only do concurrent bank-level requests pose a noisy neighbor effect, but also the refresh schedule of the DIMM itself. Let’s look at a demo, and then look at some mitigation strategies.

Demo

Our final demo will use two (2) Xeon Gold 6244 servers. One (serverA) has all six (6) DIMM channels per socket populated. The other (serverB) has only one (1) DIMM channel per socket populated. Each test begins with a single “read64” workload run, the next executes two simultaneous runs, and a third executes three simultaneous runs.
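
If you’re unsure how a box is populated, the DIMM layout can be read from the SMBIOS tables (a hedged sketch; locator naming varies by vendor, and empty slots report “No Module Installed”):

# list every DIMM slot with its size and channel/slot locator
dmidecode -t memory | grep -E 'Size|Locator'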

Notice the steady decline in total bogo ops as the number of simultaneous runs increases on serverB:

[root@serverB]# nohup stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s &

[root@serverB]# cat nohup.out

stress-ng: info:  [228145] dispatching hogs: 1 vm
stress-ng: info:  [228145] successful run completed in 30.07s
stress-ng: info:  [228145] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [228145]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [228145] vm               424960     30.07     29.62      0.37     14132.78     14170.06

[root@serverB]# nohup stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s & nohup stress-ng --vm 1 --taskset 3 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s &

[root@serverB]# cat nohup.out

stress-ng: info:  [228290] dispatching hogs: 1 vm
stress-ng: info:  [228289] dispatching hogs: 1 vm
stress-ng: info:  [228290] successful run completed in 30.08s
stress-ng: info:  [228290] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [228290]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [228290] vm               297984     30.08     29.61      0.38      9907.69      9936.11
stress-ng: info:  [228289] successful run completed in 30.08s
stress-ng: info:  [228289] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [228289]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [228289] vm               302080     30.08     29.61      0.39     10044.03     10069.33

[root@serverB]# nohup stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s & nohup stress-ng --vm 1 --taskset 3 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s & nohup stress-ng --vm 1 --taskset 5 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s &

[root@serverB]# cat nohup.out

stress-ng: info:  [228381] dispatching hogs: 1 vm
stress-ng: info:  [228382] dispatching hogs: 1 vm
stress-ng: info:  [228380] dispatching hogs: 1 vm
stress-ng: info:  [228381] successful run completed in 30.08s
stress-ng: info:  [228381] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [228381]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [228381] vm               210944     30.08     29.61      0.39      7012.99      7031.47
stress-ng: info:  [228382] successful run completed in 30.08s
stress-ng: info:  [228382] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [228382]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [228382] vm               217088     30.08     29.61      0.39      7216.89      7236.27
stress-ng: info:  [228380] successful run completed in 30.08s
stress-ng: info:  [228380] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [228380]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [228380] vm               215040     30.08     29.60      0.39      7148.96      7170.39

DRAM Soundproofing

Memory Interleaving

With only one DIMM channel per socket, it didn’t take long to inflict DRAM bank contention on serverB. On serverA below, we’ll use the maximum available level of memory interleaving (6-way) to reduce bank-level contention and note the results. Prior to the introduction of Intel 3D XPoint DIMMs, Intel interleaved memory at the granularity of one or two cache lines. Since 3D XPoint DIMMs, however, interleaving occurs at the level of four (4) cache lines, or 256 bytes. Notice how steady the total bogo ops remains as the number of simultaneous “read64” runs increases with memory interleaving enabled:

[root@serverA]# nohup stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s &

[root@serverA]# cat nohup.out

stress-ng: info:  [26338] dispatching hogs: 1 vm
stress-ng: info:  [26338] successful run completed in 30.05s
stress-ng: info:  [26338] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [26338]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [26338] vm               355328     30.05     29.63      0.34     11822.78     11856.12

[root@serverA]# nohup stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s & nohup stress-ng --vm 1 --taskset 3 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s &

[root@serverA]# cat nohup.out

stress-ng: info:  [26453] dispatching hogs: 1 vm
stress-ng: info:  [26452] dispatching hogs: 1 vm
stress-ng: info:  [26453] successful run completed in 30.07s
stress-ng: info:  [26453] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [26453]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [26453] vm               342016     30.07     29.62      0.37     11374.17     11404.33
stress-ng: info:  [26452] successful run completed in 30.07s
stress-ng: info:  [26452] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [26452]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [26452] vm               345088     30.07     29.62      0.37     11476.34     11506.77

[root@serverA]# nohup stress-ng --vm 1 --taskset 1 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s & nohup stress-ng --vm 1 --taskset 3 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s & nohup stress-ng --vm 1 --taskset 5 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s &

[root@serverA]# cat nohup.out

stress-ng: info:  [27058] dispatching hogs: 1 vm
stress-ng: info:  [27059] dispatching hogs: 1 vm
stress-ng: info:  [27057] dispatching hogs: 1 vm
stress-ng: info:  [27058] successful run completed in 30.07s
stress-ng: info:  [27058] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [27058]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [27058] vm               344064     30.07     29.62      0.37     11441.32     11472.62
stress-ng: info:  [27059] successful run completed in 30.08s
stress-ng: info:  [27059] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [27059]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [27059] vm               350208     30.08     29.61      0.38     11644.13     11677.49
stress-ng: info:  [27057] successful run completed in 30.08s
stress-ng: info:  [27057] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [27057]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [27057] vm               347136     30.08     29.61      0.38     11542.07     11575.06

Socket Partitioning

Do you notice a theme here? Intel SNC also offers benefits in this case, since it splits the socket into two NUMA nodes, each with 3-way interleaving. Keeping the workloads on separate nodes provides protection similar to what we saw in the LLC and memory controller scenarios.
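
A minimal sketch of that separation, assuming SNC has carved the second socket into nodes 2 and 3 (the node numbers here are hypothetical and platform-dependent):

# pin the latency-sensitive workload's CPUs and memory to node 2
numactl --cpunodebind=2 --membind=2 stress-ng --vm 1 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s

# pin the noisy neighbor to node 3 so the two never share LLC capacity, a memory controller, or DRAM banks
numactl --cpunodebind=3 --membind=3 stress-ng --vm 1 --vm-keep --vm-bytes 1g --vm-method read64 --metrics-brief --timeout 30s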

Soundproof Home

At great expense, you can remove all danger of noisy neighbors by living like a hermit, or by limiting every machine to just one application per box. However, there is a middle ground where soundproofing materials and server technologies like CAT, MBA, and SNC live. With such a low barrier to entry for these modern techniques, the noisy neighbor effect is no longer a given these days. Neither is the need to pay an arm and a leg to achieve that level of quiet.

  • ¹ Software sources such as the OS, along with mitigation techniques that include cgroups and namespaces, deserve a dedicated article to address adequately.
  • ² At the Spring 2015 Intel HPC Roundtable, Intel Engineering informed us that Broadwell would include QBS. When I flew back home, I wrote a test harness to simulate the LLC noisy neighbor issue. My plan was to use it to test the pre-release BDW E5-2699 v4 eval chip I was due to receive. However, upon delivery, I couldn’t detect a behavior difference. Only after reaching back out to Intel did they reveal that QBS had been abandoned due to “unresolved issues”. I wasted a lot of time with that testing.
  • ³ Paid affiliate link.
  • ⁴ DDR3 offers a max of 8 banks, DDR4 a max of 16 banks, and DDR5 a max of 32 banks.
