Last Level Cache: Where It’s Bad To Be Inclusive


Estimated reading time: 8 minutes

I remember when Big Tech focused all their recruitment efforts on prestigious engineering colleges and universities. They’ve since evolved to be more inclusive, casting a wider net that encompasses places like HBCUs and code bootcamps. Corporations traditionally reserved “Openness to Feedback” for execs and upwardly mobile hotshot employees only. But nowadays, companies boast of flat management structures and tout an “open door policy”, inclusive of all employee levels, as a major selling point. Such efforts toward inclusivity generally improve reputation and produce positive outcomes. On the other hand, if the CPU you select for your latency-sensitive application contains an inclusive Last Level Cache, then you got problems, buddy!

And you’ll find these CPUs in the wild even today. All the major cloud vendors still offer them as options. Heck, you may even have a few reliably chuggin’ along in your own datacenter.

But what exactly does it mean for a Last Level Cache to be “inclusive”? And what problem does it pose for latency-sensitive apps? Read on to find out. And don’t worry – I *will* provide a demo.

Last Level Cache: Final Stop Before Main Memory

I’ve written previously about the “Memory Wall” stemming from a widening CPU <=> Main Memory performance gap. Among the steps taken by chip designers to mitigate the issue is the placement of smaller, faster pockets of SRAM nearer the CPU (illustrated below):

[Figure: Last Level Cache hierarchy]

Level 3 (L3) represents the Last Level Cache (LLC) in the example above, and is the last (and slowest) stop within the cache hierarchy before the system must endure the long trek out to Main Memory. Among LLC design choices is the “inclusion policy” – i.e., whether or not the contents of the smaller caches shall be a subset of the LLC.
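If you’d like to see those latency steps on your own hardware, lmbench’s lat_mem_rd makes the staircase obvious. The 512MB span and 128-byte stride below are just illustrative choices, and the binary path will depend on where your distro installs lmbench:

# Chase pointers through progressively larger working sets (up to 512MB)
# with a 128-byte stride, pinned to one core. The reported load latency
# typically plateaus at each level: L1, L2, LLC, then main memory.
taskset -c 3 ./lat_mem_rd 512 128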

NOTE: For an in-depth discussion of CPU microarchitecture and how to squeeze the most performance from it, check out our book Performance Analysis and Tuning on Modern CPUs (paid affiliate link).

Inclusion Policy

LLC inclusion policy falls into three camps: inclusive, exclusive, and non-inclusive. If all cache blocks of the smaller caches must also reside in the LLC, then that LLC is “inclusive”. If the LLC only contains blocks which are *not* present in the smaller caches, then that LLC is “exclusive”. And finally, if the LLC is neither strictly inclusive nor exclusive of the smaller caches, it is labeled “non-inclusive”.
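You don’t have to guess which policy your chip uses, either. CPUID leaf 4 exposes an inclusiveness bit for each cache level, and the Linux cpuid utility decodes it. The exact label wording can differ between cpuid versions, so treat the grep pattern below as an assumption:

# Dump the deterministic cache parameters for one CPU and pick out the
# inclusiveness flag for each cache level. On an inclusive-LLC part like
# Haswell the level-3 entry should read true; on Skylake-SP and later
# server parts it should read false.
cpuid -1 | grep -iE "cache level|inclusive"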

The chief benefit of an inclusive LLC is greatly simplified cache coherency: because the LLC holds a copy of every block cached anywhere on the chip, coherency traffic doesn’t have to probe every level of the hierarchy. Simply put, the LLC becomes the “one stop shop” for coherency info. One of the drawbacks is wasted capacity, and in fact a long-held belief pinpointed squandered capacity as the main downside of an inclusive policy. But its true disadvantage is a more insidious side-effect – “backward invalidations”.

Inclusive LLC & Backward Invalidations

Recall that an inclusive policy dictates that all blocks of the smaller caches *must* also reside in the LLC. This means that any block evicted from the LLC must be evicted from the smaller caches to maintain compliance. This is referred to as a “backward invalidation”.

[Figure: Inclusive Last Level Cache invalidation issue]

Imagine a hypothetical CPU as pictured above with the L2 designated as its inclusive LLC. Letters ‘a’ through ‘e’ depict cache blocks in the cache hierarchy. If the CPU core references blocks in the pattern depicted (a -> b -> a -> c -> a -> d and so forth), the LLC fills up with each of these blocks until the core requests block ‘e’. The LLC reaches max capacity at that point, and so must evict another block based on its LRU history table. Since the repeated hits on block ‘a’ are serviced entirely by the L1, the LLC never sees them – so ‘a’ sits at the LRU end of the LLC’s history table even though it remains at the MRU end of the L1’s. The LLC therefore selects block ‘a’ as the inclusion victim, and in compliance with inclusion policy, the L1 evicts block ‘a’ as well. Imagine the performance hit incurred from this repeated L1 eviction of hot cache block ‘a’!

Filtered temporal information between the L1 and LLC forms the crux of the issue. The LLC only learns about a block when a miss brings it into the hierarchy; it never hears about the subsequent L1 hits that keep that block hot. Mitigating this issue, therefore, requires opening that channel of communication back to the LLC. Intel attempted at least two different solutions to this issue: Temporal Locality Hints (TLH) and Query Based Selection (QBS).

Temporal Locality Hints

TLH conveys temporal info about hot L1 cache blocks back to the LLC. This makes it far less likely for the LLC to choose those blocks for eviction. The drawback, however, is all that extra bandwidth required between the L1 and LLC. In fact, this feature was once configurable as a BIOS option on CPUs as recent as Westmere. It was called “Data Reuse Optimization”:

[Figure: “Data Reuse Optimization” BIOS option]

However, that BIOS option disappeared on subsequent CPU releases. Is this because Intel replaced TLH with something else? Or did they just remove it as a configuration option? I don’t know. Worse still, I have no Westmere system on which to perform a demo for you. Sorry, guys.

Query Based Selection

Each year, I’d get invited to the Intel HPC Roundtable where we’d discuss microarchitectural details of upcoming chip releases. These intimate workshops with Intel Fellows and Distinguished Engineers facilitated the kind of deep dives and Q&As that weren’t possible on public forums.

Here’s what I scribbled in my notes from one of the speakers on the subject of the upcoming Broadwell server CPU release at Intel HPC Roundtable 2015:

“posted interrupts, page modification logging, cache QoS enforcement, memory BW monitoring, HW-controlled power mgmt., improved page miss handling/page table walking, Query Based Selection (L3 won’t evict before querying core)”

And that’s exactly how QBS works – before selecting a block as an inclusion victim, it first queries the L1 for approval:

[Figure: Inclusive Last Level Cache with Query Based Selection]

I flew back home to Chicago excited and eager to get my hands on the pre-release Broadwell evaluation CPU for testing (lots of HFT firms are on Early Release Programs with chip manufacturers to test CPUs prior to GA release). But my benchmark results left me scratching my head. Maybe QBS was not all it was touted to be. So, I reached out to Intel Engineering with my benchmark code and test results, only to hear back that they’d given up on QBS prior to release due to “unresolved issues.” Well, at least Intel came through with the “Cache QoS Enforcement” promise as a workaround.
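That “Cache QoS Enforcement” promise shipped as what’s now called Intel Cache Allocation Technology (CAT), which lets you fence off LLC ways per core so a noisy neighbor can’t evict your hot data. Here’s a rough sketch using the pqos utility from the intel-cmt-cat package – the core numbers and capacity masks below are made-up examples (mask width and CAT support vary by SKU), so treat it as a starting point rather than a recipe:

# Carve the LLC: class-of-service 1 gets a few ways to itself, and the
# default class 0 (where every other core lives) is restricted to the rest.
pqos -e "llc:1=0x00f"
pqos -e "llc:0=0xff0"

# Associate the latency-sensitive core (core 3 here) with class 1. The
# LLC-hogging neighbor stays in class 0 and can no longer evict core 3's lines.
pqos -a "llc:1=3"

# Show the resulting allocation configuration.
pqos -s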

Embracing Non-inclusive Last Level Caches

After Broadwell, Intel finally joined the AMD camp and adopted non-inclusive LLCs with the release of Skylake. This permitted them to reduce the LLC footprint while considerably boosting L2 size. But does it live up to billing? Let’s see!

Demo

Our demo includes two machines: one Haswell-based (inclusive LLC) and the other Cascade Lake-based (non-inclusive LLC). I’ll grab my favorite all-purpose benchmark tool, stress-ng, and use its ‘flip’ VM stressor as a stand-in for our “low latency application”. The LLC-hogging application will be played by the ‘read64’ VM stressor. We’ll conduct both tests on the 2nd socket of each machine (all odd-numbered cores) where all cores are isolated from the scheduler. We’ll use core 3 for ‘flip’ and core 7 for ‘read64’.

“That’s odd. Why would you skip core 1, the first core on the 2nd socket?” Oh, you know full well why I’m not using that core! Don’t play with me!
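Before kicking off the runs, it doesn’t hurt to confirm the isolation and the socket-to-core mapping. A quick sanity check might look like this (standard sysfs/procfs paths; the isolcpus list is whatever you passed at boot):

# Which cores has the kernel been told to leave alone?
cat /sys/devices/system/cpu/isolated
tr ' ' '\n' < /proc/cmdline | grep isolcpus

# Confirm which cores live on the 2nd socket (NUMA node 1) before pinning.
lscpu | grep "NUMA node1"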

Haswell: Inclusive Last Level Cache

This Haswell system contains 32KB of L1d and 20MB of LLC as shown below:

[mdawson@haswell ~]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
Stepping:              2
CPU MHz:               3199.738
BogoMIPS:              6403.88
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15

Let’s grab a baseline run of ‘flip’ on core 3 using a 32KB working set which neatly fits the L1d:

[mdawson@haswell ~]$ perf stat -r 5 -d numactl --membind=1 stress-ng --vm 1 --taskset 3 --vm-keep --vm-bytes 32k --vm-method flip --metrics-brief --timeout 15s
stress-ng: info:  [80547] dispatching hogs: 1 vm
stress-ng: info:  [80547] successful run completed in 15.00s
stress-ng: info:  [80547] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [80547]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [80547] vm              1052649     15.00     14.87      0.12     70175.86     70223.42
stress-ng: info:  [80568] dispatching hogs: 1 vm
stress-ng: info:  [80568] successful run completed in 15.00s
stress-ng: info:  [80568] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [80568]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [80568] vm              1051884     15.00     14.87      0.12     70124.85     70172.38
stress-ng: info:  [80584] dispatching hogs: 1 vm
stress-ng: info:  [80584] successful run completed in 15.00s
stress-ng: info:  [80584] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [80584]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [80584] vm              1052379     15.00     14.87      0.12     70157.86     70205.40
stress-ng: info:  [80601] dispatching hogs: 1 vm
stress-ng: info:  [80601] successful run completed in 15.00s
stress-ng: info:  [80601] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [80601]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [80601] vm              1052289     15.00     14.87      0.12     70151.86     70199.40
stress-ng: info:  [80618] dispatching hogs: 1 vm
stress-ng: info:  [80618] successful run completed in 15.00s
stress-ng: info:  [80618] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [80618]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [80618] vm              1052280     15.00     14.87      0.12     70151.25     70198.80

 Performance counter stats for 'numactl --membind=1 stress-ng --vm 1 --taskset 3 --vm-keep --vm-bytes 32k --vm-method flip --metrics-brief --timeout 15s' (5 runs):

         15,005.64 msec task-clock                #    1.000 CPUs utilized            ( +-  0.00% )
                14      context-switches          #    0.001 K/sec                    ( +-  2.71% )
                 0      cpu-migrations            #    0.000 K/sec
             1,704      page-faults               #    0.114 K/sec
    50,584,401,411      cycles                    #    3.371 GHz                      ( +-  0.01% )  (49.99%)
   181,359,934,141      instructions              #    3.59  insn per cycle           ( +-  0.01% )  (62.49%)
    17,583,120,821      branches                  # 1171.768 M/sec                    ( +-  0.01% )  (74.99%)
         2,244,595      branch-misses             #    0.01% of all branches          ( +-  0.76% )  (87.50%)
    44,492,963,211      L1-dcache-loads           # 2965.083 M/sec                    ( +-  0.01% )  (37.52%)
        61,653,565      L1-dcache-load-misses     #    0.14% of all L1-dcache hits    ( +-  0.85% )  (37.51%)
           254,253      LLC-loads                 #    0.017 M/sec                    ( +-  1.34% )  (37.50%)
           146,656      LLC-load-misses           #   57.68% of all LL-cache hits     ( +-  1.51% )  (37.48%)

         15.007112 +- 0.000626 seconds time elapsed  ( +-  0.00% )

Bogo ops/s comes in consistently at slightly over 70,000 per run. The workload maintains 3.59 IPC, an L1d load rate of roughly 2.97 billion loads/s, and an LLC load rate of only about 17K loads/s.
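Those derived figures fall straight out of the counters above – if you want to double-check them, it’s just counter totals divided by cycles or elapsed time (values copied from the perf summary; awk stands in for a calculator):

# IPC = instructions / cycles
awk 'BEGIN { printf "IPC: %.2f\n", 181359934141 / 50584401411 }'    # ~3.59

# L1d load rate = L1-dcache-loads / elapsed seconds (~2.97 billion loads/s)
awk 'BEGIN { printf "L1d loads/s: %.0f\n", 44492963211 / 15.005 }'

# LLC load rate = LLC-loads / elapsed seconds (~17K loads/s)
awk 'BEGIN { printf "LLC loads/s: %.0f\n", 254253 / 15.005 }'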

Now, let’s re-run ‘flip’ with ‘read64’ concurrently executing on core 7 with a 21MB working set size (read64 is launched with: perf stat -r 5 -d numactl --membind=1 stress-ng --vm 1 --taskset 7 --vm-keep --vm-bytes 21m --vm-method read64 --metrics-brief --timeout 15s):

[mdawson@haswell ~]$ perf stat -r 5 -d numactl --membind=1 stress-ng --vm 1 --taskset 3 --vm-keep --vm-bytes 32k --vm-method flip --metrics-brief --timeout 15s
stress-ng: info:  [80393] dispatching hogs: 1 vm
stress-ng: info:  [80393] successful run completed in 15.00s
stress-ng: info:  [80393] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [80393]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [80393] vm              1028772     15.00     14.79      0.20     68583.61     68630.55
stress-ng: info:  [80416] dispatching hogs: 1 vm
stress-ng: info:  [80416] successful run completed in 15.00s
stress-ng: info:  [80416] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [80416]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [80416] vm              1028232     15.00     14.77      0.22     68547.73     68594.53
stress-ng: info:  [80441] dispatching hogs: 1 vm
stress-ng: info:  [80441] successful run completed in 15.00s
stress-ng: info:  [80441] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [80441]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [80441] vm              1026774     15.00     14.78      0.21     68450.44     68497.26
stress-ng: info:  [80462] dispatching hogs: 1 vm
stress-ng: info:  [80462] successful run completed in 15.00s
stress-ng: info:  [80462] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [80462]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [80462] vm              1018467     15.00     14.75      0.24     67896.67     67943.10
stress-ng: info:  [80484] dispatching hogs: 1 vm
stress-ng: info:  [80484] successful run completed in 15.00s
stress-ng: info:  [80484] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [80484]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [80484] vm              1020240     15.00     14.76      0.23     68014.82     68061.37

 Performance counter stats for 'numactl --membind=1 stress-ng --vm 1 --taskset 3 --vm-keep --vm-bytes 32k --vm-method flip --metrics-brief --timeout 15s' (5 runs):

         15,006.57 msec task-clock                #    1.000 CPUs utilized            ( +-  0.00% )
                15      context-switches          #    0.001 K/sec                    ( +-  2.60% )
                 0      cpu-migrations            #    0.000 K/sec
             1,704      page-faults               #    0.114 K/sec
    50,357,946,125      cycles                    #    3.356 GHz                      ( +-  0.04% )  (49.98%)
   176,607,210,201      instructions              #    3.51  insn per cycle           ( +-  0.20% )  (62.48%)
    17,122,614,281      branches                  # 1141.008 M/sec                    ( +-  0.20% )  (74.99%)
         2,241,031      branch-misses             #    0.01% of all branches          ( +-  0.96% )  (87.49%)
    43,313,418,811      L1-dcache-loads           # 2886.296 M/sec                    ( +-  0.22% )  (37.52%)
        59,635,656      L1-dcache-load-misses     #    0.14% of all L1-dcache hits    ( +-  0.80% )  (37.52%)
         1,894,194      LLC-loads                 #    0.126 M/sec                    ( +-  7.24% )  (37.50%)
         1,750,423      LLC-load-misses           #   92.41% of all LL-cache hits     ( +-  7.08% )  (37.48%)

         15.007929 +- 0.000884 seconds time elapsed  ( +-  0.01% )

With core 7 polluting the shared LLC, ‘flip’ drops from ~70,000 to ~68,000 bogo ops/s. Notice the drop in IPC from 3.59 to 3.51, the drop in L1d load rate from ~2.97 billion to ~2.89 billion loads/s, and the jump in LLC load rate from ~17K to ~126K loads/s. Despite a small, L1d-sized working set (32KB), messiness at the LLC level still adversely impacts core 3’s private core caches.
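If you want to quantify just how much of that shared 20MB the neighbor is squatting on, Intel’s Cache Monitoring Technology can report per-core LLC occupancy via the same pqos utility – assuming your particular SKU, kernel, and pqos build all support CMT, which isn’t a given on every Haswell part:

# Sample LLC occupancy for the victim core (3) and the aggressor core (7)
# for 10 seconds. Expect core 7's occupancy to balloon toward its working
# set size while 'read64' runs, squeezing everything else out.
pqos -m "llc:3,7" -t 10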

How does a non-inclusive LLC change matters, if at all?

Cascade Lake: Non-inclusive Last Level Cache

This Cascade Lake system contains 32KB of L1d cache and 25MB of LLC as depicted below:

[mdawson@cascadelake ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  1
Core(s) per socket:  8
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz
Stepping:            7
CPU MHz:             4299.863
BogoMIPS:            7200.00
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15

Just like in our previous Haswell demo, we’ll grab a baseline run of ‘flip’ on core 3 with a 32KB working set which fits our L1d cache:

[mdawson@cascadelake ~]$ perf stat -r 5 -d numactl --membind=1 stress-ng --vm 1 --taskset 3 --vm-keep --vm-bytes 32k --vm-method flip --metrics-brief --timeout 15s
stress-ng: info:  [389059] setting to a 15 second run per stressor
stress-ng: info:  [389059] dispatching hogs: 1 vm
stress-ng: info:  [389059] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [389059]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [389059] vm              1361427     15.00     14.62      0.31     90760.78       91187.34
stress-ng: info:  [389059] successful run completed in 15.00s
stress-ng: info:  [389064] setting to a 15 second run per stressor
stress-ng: info:  [389064] dispatching hogs: 1 vm
stress-ng: info:  [389064] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [389064]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [389064] vm              1361232     15.00     14.62      0.31     90747.84       91174.28
stress-ng: info:  [389064] successful run completed in 15.00s
stress-ng: info:  [389069] setting to a 15 second run per stressor
stress-ng: info:  [389069] dispatching hogs: 1 vm
stress-ng: info:  [389069] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [389069]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [389069] vm              1385590     15.00     14.61      0.32     92371.71       92805.76
stress-ng: info:  [389069] successful run completed in 15.00s
stress-ng: info:  [389077] setting to a 15 second run per stressor
stress-ng: info:  [389077] dispatching hogs: 1 vm
stress-ng: info:  [389077] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [389077]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [389077] vm              1361349     15.00     14.62      0.31     90755.72       91182.12
stress-ng: info:  [389077] successful run completed in 15.00s
stress-ng: info:  [389081] setting to a 15 second run per stressor
stress-ng: info:  [389081] dispatching hogs: 1 vm
stress-ng: info:  [389081] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [389081]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [389081] vm              1361366     15.00     14.62      0.31     90756.78       91183.26
stress-ng: info:  [389081] successful run completed in 15.00s

 Performance counter stats for 'numactl --membind=1 stress-ng --vm 1 --taskset 3 --vm-keep --vm-bytes 32k --vm-method flip --metrics-brief --timeout 15s' (5 runs):

         15,003.53 msec task-clock:u              #    1.000 CPUs utilized            ( +-  0.00% )
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               917      page-faults:u             #   61.118 /sec
    62,471,828,843      cycles:u                  #    4.164 GHz                      ( +-  0.01% )  (87.50%)
   252,455,743,745      instructions:u            #    4.04  insn per cycle           ( +-  0.15% )  (87.50%)
    28,372,743,612      branches:u                #    1.891 G/sec                    ( +-  0.08% )  (87.50%)
         2,840,043      branch-misses:u           #    0.01% of all branches          ( +-  1.82% )  (87.50%)
    62,138,602,359      L1-dcache-loads:u         #    4.142 G/sec                    ( +-  0.23% )  (87.50%)
       165,323,553      L1-dcache-load-misses:u   #    0.27% of all L1-dcache accesses  ( +-  1.64% )  (87.50%)
            22,070      LLC-loads:u               #    1.471 K/sec                    ( +-  0.18% )  (87.50%)
            15,785      LLC-load-misses:u         #   71.86% of all LL-cache accesses  ( +-  0.13% )  (87.50%)

         15.004840 +- 0.000385 seconds time elapsed  ( +-  0.00% )

In this case, bogo ops/s clocks in at around 91,000 per run. The workload maintains 4.04 IPC, an L1d load rate of roughly 4.14 billion loads/s, and an LLC load rate of only about 1.5K loads/s.

Now, let’s re-run ‘flip’ with ‘read64’ concurrently executing on core 7 with a 26MB working set size (read64 is launched with: perf stat -r 5 -d numactl --membind=1 stress-ng --vm 1 --taskset 7 --vm-keep --vm-bytes 26m --vm-method read64 --metrics-brief --timeout 15s):

[mdawson@cascadelake ~]$ perf stat -r 5 -d numactl --membind=1 stress-ng --vm 1 --taskset 3 --vm-keep --vm-bytes 32k --vm-method flip --metrics-brief --timeout 15s
stress-ng: info:  [388919] setting to a 15 second run per stressor
stress-ng: info:  [388919] dispatching hogs: 1 vm
stress-ng: info:  [388919] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [388919]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [388919] vm              1360767     15.00     14.61      0.32     90716.70       91143.13
stress-ng: info:  [388919] successful run completed in 15.00s
stress-ng: info:  [388928] setting to a 15 second run per stressor
stress-ng: info:  [388928] dispatching hogs: 1 vm
stress-ng: info:  [388928] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [388928]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [388928] vm              1385074     15.00     14.61      0.32     92337.25       92771.20
stress-ng: info:  [388928] successful run completed in 15.00s
stress-ng: info:  [388936] setting to a 15 second run per stressor
stress-ng: info:  [388936] dispatching hogs: 1 vm
stress-ng: info:  [388936] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [388936]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [388936] vm              1385027     15.00     14.60      0.32     92334.09       92830.23
stress-ng: info:  [388936] successful run completed in 15.00s
stress-ng: info:  [388944] setting to a 15 second run per stressor
stress-ng: info:  [388944] dispatching hogs: 1 vm
stress-ng: info:  [388944] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [388944]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [388944] vm              1361188     15.00     14.62      0.31     90744.86       91171.33
stress-ng: info:  [388944] successful run completed in 15.00s
stress-ng: info:  [388952] setting to a 15 second run per stressor
stress-ng: info:  [388952] dispatching hogs: 1 vm
stress-ng: info:  [388952] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [388952]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [388952] vm              1361205     15.00     14.61      0.32     90746.03       91172.47
stress-ng: info:  [388952] successful run completed in 15.00s

 Performance counter stats for 'numactl --membind=1 stress-ng --vm 1 --taskset 3 --vm-keep --vm-bytes 32k --vm-method flip --metrics-brief --timeout 15s' (5 runs):

         15,003.56 msec task-clock:u              #    1.000 CPUs utilized            ( +-  0.00% )
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               917      page-faults:u             #   61.118 /sec
    62,457,928,284      cycles:u                  #    4.163 GHz                      ( +-  0.02% )  (87.49%)
   252,469,328,307      instructions:u            #    4.04  insn per cycle           ( +-  0.17% )  (87.50%)
    28,369,514,296      branches:u                #    1.891 G/sec                    ( +-  0.09% )  (87.50%)
         2,889,046      branch-misses:u           #    0.01% of all branches          ( +-  0.69% )  (87.50%)
    62,125,500,337      L1-dcache-loads:u         #    4.141 G/sec                    ( +-  0.27% )  (87.50%)
       162,790,289      L1-dcache-load-misses:u   #    0.26% of all L1-dcache accesses  ( +-  2.14% )  (87.50%)
            22,027      LLC-loads:u               #    1.468 K/sec                    ( +-  0.13% )  (87.50%)
            15,768      LLC-load-misses:u         #   71.65% of all LL-cache accesses  ( +-  0.58% )  (87.50%)

        15.0046476 +- 0.0000781 seconds time elapsed  ( +-  0.00% )

Even though core 7 swamps the LLC with reads, the ‘flip’ workload’s throughput never drops, and the IPC and L1d/LLC load rates remain essentially unchanged between runs! Cascade Lake’s non-inclusive policy protected the performance of our low-latency application!

Be Inclusive Everywhere Except the LLC

We should strive for inclusivity in our personal and professional lives, and in society as a whole, for the betterment of humanity. But when it comes to your CPU’s Last Level Cache, you might want to reconsider. And don’t forget to check your chosen cloud instance types. If they’re backed by anything older than Skylake on the Intel side, you may just be suffering from a form of noisy neighbor you never anticipated.
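A quick way to vet an instance (or a hand-me-down server) is to check the CPU model against the two lscpu dumps above – model 63 is Haswell-EP with its inclusive LLC, while model 85 covers Skylake-SP and Cascade Lake-SP with their non-inclusive LLCs:

# Family 6, model 63 = Haswell-EP (inclusive LLC);
# family 6, model 85 = Skylake-SP / Cascade Lake-SP (non-inclusive LLC).
lscpu | grep -E "^CPU family|^Model"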


Do you enjoy this kind of content? Share on your social media.
