Google Benchmark w/o PMU Counters? Congratulations, you played yourself!



DJ Khaled, the internationally famous Hip Hop DJ, record executive, and music producer, is the Tony Robbins of Rap. He’s almost as famous for his motivational catchphrases as he is for his chart-topping hits. Who from Gen X to Gen Z hasn’t pushed through a final set in the gym with his mantra “Anotha one” chanting in his head? Or brushed his teeth in the bathroom mirror while affirming to himself, “We the best”? Of all his catchphrases, my personal favorite is his reprimanding, “Congratulations, you played yourself.” It’s useful for all those times you helped “the haters” by getting in your *own* way (we’ve all done it). One of the more frequent ways we “play ourselves” is by using microbenchmark tools like Google Benchmark w/o gathering PMU metrics. Since the addition of PMU Counter support to Google Benchmark last year, ignoring them is even more of a face-palm.

“What’s the big deal with PMU counters? Isn’t it enough that I’m isolating functions and measuring runtime differences?” On the surface, relying on runtime comparisons seems reasonable. But we’re going to examine how runtime alone can lead you down the wrong path in surprising ways. Then, we’ll show how PMU counters help avoid this pitfall.

Using Google Benchmark

“What’s so great about Google Benchmark, anyway?” Similar to tools like hayai, Nonius, and Celero, Google Benchmark is a framework for benchmarking isolated C++ components. Think C++ Unit Tests but with runtime statistics added. Let’s look at the simple example taken from the online documentation which compares std::string creation vs. std::string copy:

mdawson@eltoro (main *%=)$ ./mytestbenchmark
2022-03-30T16:41:39-05:00
Running ./mytestbenchmark
Run on (16 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 25344 KiB (x2)
Load Average: 0.23, 0.14, 0.08
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
BM_StringCreation       4.20 ns         4.19 ns    158420904
BM_StringCopy           7.93 ns         7.91 ns     88528830

Not bad, huh? You get a nice output of per-operation wall-clock time and CPU time. You can also add basic statistics to the output like so:

mdawson@eltoro (main *%=)$ ./mytestbenchmark --benchmark_repetitions=30 --benchmark_report_aggregates_only=true
2022-03-30T16:42:34-05:00
Running ./mytestbenchmark
Run on (16 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 25344 KiB (x2)
Load Average: 0.09, 0.11, 0.08
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_StringCreation_mean         4.30 ns         4.29 ns           30
BM_StringCreation_median       4.28 ns         4.27 ns           30
BM_StringCreation_stddev      0.094 ns        0.094 ns           30
BM_StringCreation_cv           2.18 %          2.18 %            30
BM_StringCopy_mean             7.93 ns         7.92 ns           30
BM_StringCopy_median           7.93 ns         7.91 ns           30
BM_StringCopy_stddev          0.025 ns        0.025 ns           30
BM_StringCopy_cv               0.31 %          0.31 %            30

We’ve requested 30 repetitions of each benchmark, and it returns the mean, median, standard deviation, and coefficient of variation. It even supports custom statistical measures. Fantastic!

“Hey, I can use this to compare the performance of multiple functions that perform the same operation but in different ways, right?” Yes, you can! Excited yet? Good. . . because that’s exactly where we begin to run into issues.

Google Benchmark Gotchas

We’ve all read blogs with interesting microbenchmark showdowns between various C++ idioms and STL algorithms. They begin by explaining the idea behind each function’s way of performing the same operation, then run the microbenchmark comparison, and finally declare a victor based on the reported runtimes. Seems reasonable, right? Well, my colleague Bartosz Adamczewski wrote an article that will disabuse you of that notion and give you microbenchmark nightmares, to boot. But, while I have you here, let me show you just one of the pitfalls that I see rather frequently in practice.

Is this function slower or just misaligned?

Let’s say you want to test 2 functions that essentially accomplish the same thing in different ways, and you will select one based on performance. Let’s sanity check first by running the exact same function side-by-side, one named with an A suffix and the other with a B suffix:

mdawson@eltoro (main *%=)$ ./samefuncbenchmark1
2022-03-30T17:09:00-05:00
Running ./samefuncbenchmark1
Run on (16 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 25344 KiB (x2)
Load Average: 0.00, 0.01, 0.05
-------------------------------------------------------------
Benchmark                   Time             CPU   Iterations
-------------------------------------------------------------
BM_StringCreationA       3.96 ns         3.95 ns    177058640
BM_StringCreationB       3.96 ns         3.95 ns    177058640

Ok, this is what we’d expect since we simply duplicated the function and gave them different names. Now, let’s do the same for the BM_StringCopy function:

mdawson@eltoro (main *%=)$ ./samefuncbenchmark2
2022-03-30T17:11:28-05:00
Running ./samefuncbenchmark2
Run on (16 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 25344 KiB (x2)
Load Average: 0.00, 0.01, 0.05
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
BM_StringCopyA       7.69 ns         7.67 ns     91184115
BM_StringCopyB       8.16 ns         8.14 ns     85947611

“Wait. . . what? It’s the same function! How is that even possible?” Well, maybe it’s because I didn’t use multiple runs to get statistics. Let’s try that:

mdawson@eltoro (main *%=)$ ./samefuncbenchmark2 --benchmark_repetitions=30 --benchmark_report_aggregates_only=true
2022-03-30T17:14:14-05:00
Running ./samefuncbenchmark2
Run on (16 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 25344 KiB (x2)
Load Average: 0.01, 0.02, 0.05
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM_StringCopyA_mean         7.76 ns         7.74 ns           30
BM_StringCopyA_median       7.69 ns         7.67 ns           30
BM_StringCopyA_stddev      0.208 ns        0.208 ns           30
BM_StringCopyA_cv           2.68 %          2.68 %            30
BM_StringCopyB_mean         8.12 ns         8.10 ns           30
BM_StringCopyB_median       8.16 ns         8.14 ns           30
BM_StringCopyB_stddev      0.136 ns        0.136 ns           30
BM_StringCopyB_cv           1.68 %          1.68 %            30

Nope, stats don’t illuminate the issue, either. Now, imagine you were actually comparing 2 distinct functions, then used the reported runtimes to select a victor. Then afterward, because you’re not a “Knowledge Hoarder”, you wrote up your findings in an article that got shared on Hacker News, resulting in a faulty performance tip which techies blindly follow & repeat over the next 5 years.

Google Benchmark Article Image

It happens all the time. That’s why PMU Counter support is such a boon for Google Benchmark. Not only does it help explain why one function performs better than another, but it also helps us ensure that we’re testing exactly what we intend to test, not some residual effect of the framework itself. Let’s see how it does this.

PMU Counters to the Rescue

Let’s re-run both experiments from the previous section, but add the INSTRUCTIONS and CYCLES counters (from which we can compute IPC), along with IDQ_UOPS_NOT_DELIVERED:CORE (the total runtime count of unused execution slots, of which there are 4 per cycle), on the command line:

mdawson@eltoro (main *%=)$ ./samefuncbenchmark1 --benchmark_perf_counters=CYCLES,INSTRUCTIONS,IDQ_UOPS_NOT_DELIVERED:CORE
2022-03-30T17:24:06-05:00
Running ./samefuncbenchmark1
Run on (16 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 25344 KiB (x2)
Load Average: 0.00, 0.03, 0.05
-----------------------------------------------------------------------------
Benchmark                   Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------
BM_StringCreationA       3.96 ns         3.95 ns    177058381 CYCLES=17 IDQ_UOPS_NOT_DELIVERED:CORE=30 INSTRUCTIONS=37
BM_StringCreationB       3.96 ns         3.95 ns    177058413 CYCLES=17 IDQ_UOPS_NOT_DELIVERED:CORE=31 INSTRUCTIONS=37

All metrics are roughly equivalent across both functions, with an identical IPC of 2.18 (37 / 17). Now, let’s try it for BM_StringCopy:

mdawson@eltoro (main *%=)$ ./samefuncbenchmark2 --benchmark_perf_counters=CYCLES,INSTRUCTIONS,IDQ_UOPS_NOT_DELIVERED:CORE
2022-03-30T17:26:01-05:00
Running ./samefuncbenchmark2
Run on (16 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 25344 KiB (x2)
Load Average: 0.08, 0.03, 0.05
-------------------------------------------------------------------------
Benchmark               Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------
BM_StringCopyA       7.69 ns         7.67 ns     91210815 CYCLES=33 IDQ_UOPS_NOT_DELIVERED:CORE=39 INSTRUCTIONS=97
BM_StringCopyB       8.16 ns         8.14 ns     85999045 CYCLES=35 IDQ_UOPS_NOT_DELIVERED:CORE=46 INSTRUCTIONS=97

AHA! The 2nd function stalls the CPU frontend more often, delivering fewer μops per cycle to the CPU backend! This is a classic symptom of code misalignment, usually resulting from underutilization of the Decoded Stream Buffer (DSB) or Loop Stream Detector (LSD), and here it leads to a ~0.5 ns (~6%) per-iteration runtime hit.

To be clear, both functions are heavily frontend bound, as measured by IDQ_UOPS_NOT_DELIVERED:CORE / (4 × CYCLES). But the 2nd function exhibits more frontend-boundedness (33% vs. 30%), and thus a higher runtime and lower IPC (2.77 vs. 2.94). So, instead of testing function A vs. function B as we intended, we actually tested a residual effect of the microbenchmark framework itself. How would we have caught this w/o grabbing PMU Counters?

Of course, Google Benchmark is not limited to the aforementioned 3 PMU counters. We can specify any counter of which libpfm4 is aware. As a result, it gives us all we need to follow the discipline of “Active Benchmarking”, about which I proselytize constantly on this site.

Why can’t I just use perf stat?

For simple cases like those above, you can compile each function into its own benchmark and execute them under perf stat just fine. However, more complex microbenchmarks that include setup and teardown code would muddy up your perf stat output. That’s where Google Benchmark w/PMU Counters helps – it only gathers PMU metrics during the timed aspects of the microbenchmark.

Building Google Benchmark w/PMU Counter Support

Including PMU Counter support in a Google Benchmark build is pretty straightforward, but for completeness’ sake, I’ll list the steps here anyway:

  1. git clone https://github.com/google/benchmark.git
  2. git clone https://github.com/google/googletest.git benchmark/googletest
  3. cd benchmark
  4. Edit CMakeLists.txt and toggle BENCHMARK_ENABLE_LIBPFM to "ON"
  5. cmake -E make_directory "build"
  6. cmake -DCMAKE_BUILD_TYPE=Release -S . -B "build"
  7. cmake --build "build" --config Release

Easy peasy! From this point on, build your benchmark as you normally would, except add -lpfm to the link command line.

Don’t Play Yourself

Microbenchmark frameworks provide value to both Software and Performance Engineers alike. But using them in a fire-and-forget manner is a recipe for disaster, as we learned above. PMU Counters help sidestep potential microbenchmark landmines, and C++ frameworks like Google Benchmark have evolved to support them. They’re not only the best way to understand the reasons behind runtime differences, but they also help determine whether we’re testing what we think we are.

So, don’t be that guy who imparts false wisdom that will live on for years in “Performance Urban Legends” whispered among credulous techies around the world, all due to lazy benchmark reporting. Apply the due diligence of “Active Benchmarking” to your microbenchmarks by utilizing PMU Counter support. And avoid having anyone post a comment of reprimand under your article that reads:

“Congratulations. . . you played yourself!”
