“I’ma let you finish but. . .” – Hunting Down System Interrupts

Kanye Interrupts Taylor

Estimated reading time: 13 minutes

“Taylor Swift, I’ma let you finish but Beyoncé had one of the best videos of all time!”

Kanye West (2009 MTV Video Music Awards)

Up til that point, the musical performances and award presentations had been proceeding without a hitch. Sure, Kanye West and his then-girlfriend Amber Rose sat near the stage drinking from an open bottle of Hennessy. But what could go wrong from that? But then Taylor Lautner opened the envelope to announce the winner of the 2009 Female Video of The Year Award. “Taylor Swift!” he shouted. Taylor had just defeated her co-nominee Beyoncé to Kanye’s drunken shock. Then, midway into her heartfelt acceptance speech, an appalled and cognac-fueled West stormed on stage, commandeered her mic, and infamously delivered the aforementioned quote. That notorious pop culture interruption ruined her moment, let alone her speech. System interrupts pose a similar disruptive risk to our applications.

Think about it. Your application thread grabs a core expecting its exclusive use as it performs its duties. But little does it realize, a system interrupt waits in the wings, drunk off the Henny, eager to yank the core away for its own purposes. That thread’s moment, and performance, would be ruined. What can we do to guard against these digital Kanyes in our systems and preserve our latency Service Level Objectives (SLO)?

Sources of System Interrupts

Out of the box, our x86-based Linux systems conspire to interrupt our applications in a variety of ways. These system interruptions include:

  • Timer Tick – a per-core interrupt that cues the CPU to reconsider which thread should be running, catch up on read-copy-update (RCU) callbacks, and other necessary housekeeping. By default, this interrupt fires anywhere between 100 to 1000 times a second.
  • Scheduler – switches both user and kernel threads on/off cores as necessary to apportion runtime fairness as much as possible.
  • Device Interrupts – signals generated from peripheral devices to request service from the CPU.
  • Inter-processor Interrupts (IPI) – special interrupts sent from one core to another core.
  • System Management Interrupts (SMI) – special interrupts originating from the underlying system (outside the Linux kernel) to provide extended system management features (e.g., DRAM ECC functions, CPU temperature and fan control, etc.). Systems BIOS and firmware routines service these interrupts, unlike device interrupts which are serviced by OS device driver routines.

The latency impact of these system interrupt sources can range between single-digit microseconds to many milliseconds. Obviously, if you define SLOs within that aforementioned range, hunting down these digital Kanyes is critical for your application. So, let the hunting begin!

But first. . . let’s not reinvent the wheel

Tackling each source of interrupts from scratch can be a fun and engaging exercise. But I’m gonna guess that you don’t have unlimited free time or indefinite project deadlines at work. So, please, don’t reinvent the wheel. Follow best practice guidelines from technologists who’ve already laid a lot of the initial groundwork in this area. Plenty of opportunities will remain to locate and eradicate mysterious interrupts on your own, I assure you. Here are some online resources to get you started:

  • Red Hat Guidelines – this tuning document is based on RHEL 7 but still provides a solid foundation upon which to build
  • Erik Rigtorp’s Blog – this is a more up-to-date and comprehensive guideline which addresses the full stack of interrupt sources
  • HP BIOS Guidelines – An example “best practices” document from HP for recommended settings for best performance and SMI elimination
  • Dell BIOS Guidelines – Sections 1.10.1 and 1.10.2 of this document from Dell outlines recommended settings for optimal machine performance and SMI elimination

Now that we’ve set ourselves up to stand on the shoulder of giants, how do we hunt down any residual sources of system interrupts?

System Interrupt Hunting Tools

Let’s review a few tools which I’m certain you’ll find useful on your system interrupt hunting expeditions. Please note that this is *not* intended to be an exhaustive list – it simply represents the set of utilities I’ve found most helpful over the years. Now let’s get started!

Hardware-level System Interrupt Hunting Tools

perf stat

Linux perf never ceases to amaze me. Just when I think I’m familiar with all its functionality, I learn about a new wrinkle. It’s an amazing tool that every Performance Engineer should devote time to learning.

Now, when it came to detecting SMIs on a newly racked machine, I’d always either fire up the turbostat utility or directly read the MSR_SMI_COUNT register on Intel via rdmsr -a 0x34. That is, until I learned about the ‘–smi-cost’ option of perf stat:

[root@eltoro]# perf stat -a -A --smi-cost -- sleep 10

 Performance counter stats for 'system wide':

                   SMI cycles%                 SMI#
CPU0                      0.0%                    0
CPU1                      0.0%                    0
CPU2                      0.0%                    0
CPU3                      0.0%                    0
CPU4                      0.0%                    0
CPU5                      0.0%                    0
CPU6                      0.0%                    0
CPU7                      0.0%                    0
CPU8                      0.0%                    0
CPU9                      0.0%                    0
CPU10                     0.0%                    0
CPU11                     0.0%                    0
CPU12                     0.0%                    0
CPU13                     0.0%                    0
CPU14                     0.0%                    0
CPU15                     0.0%                    0
CPU16                     0.0%                    0
CPU17                     0.0%                    0

      10.000486424 seconds time elapsed

Notice that it not only produces a Count but also the % of CPU cycles imposed by the SMIs, offering a glimpse into its impact in duration. And unlike the aforementioned rdmsr alternative, this perf stat approach works on AMD processors, too.

hwlat

FTrace offers the hwlat tracer as an alternative method of detecting hardware/firmware-based system interrupts such as SMIs:

[root@eltoro]# echo hwlat > /sys/kernel/debug/tracing/current_tracer
[root@eltoro]# echo 1 > /sys/kernel/debug/tracing/tracing_on

[root@eltoro]# cat /sys/kernel/debug/tracing/trace
# tracer: hwlat
#
# entries-in-buffer/entries-written: 0/0   #P:18
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |

[root@eltoro]# echo 0 > /sys/kernel/debug/tracing/tracing_on
[root@eltoro]# echo nop > /sys/kernel/debug/tracing/current_tracer

Alternatively, you can use the more user-friendly trace-cmd FTrace frontend to do the same thing without requiring interaction with the SysFS directory hierarchy:

[root@eltoro]# trace-cmd start -p hwlat
  plugin 'hwlat'
[root@eltoro]# trace-cmd show
# tracer: hwlat
#
# entries-in-buffer/entries-written: 0/0   #P:18
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
[root@eltoro]# trace-cmd stop

OS-level System Interrupt Hunting Tools

/proc/interrupts

We can’t talk about system interrupts without discussing the procfs file which has “interrupts” in its name: /proc/interrupts. This file contains columns of CPU#s and rows of interrupt sources with interrupt counts at the intersection. Below is a snapshot taken from a CentOS 7 machine on which I debugged an interrupt issue where I discovered the root cause as IPIs emanating from TLB Shootdowns from the “Auto NUMA Balancing” feature:

          CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       

 45:          0          0          0          0          0          0          0          0   PCI-MSI-edge      eth0
 46:     192342          0          0          0          0          0          0          0   PCI-MSI-edge      ahci
 47:         14          0          0          0          0          0          0          0   PCI-MSI-edge      mei

NMI:          0          0          0          0          0          0          0          0   Non-maskable interrupts
LOC:     552219    1010298    2272333    3179890    1445484    1226202    1800191    1894825   Local timer interrupts
SPU:          0          0          0          0          0          0          0          0   Spurious interrupts

IWI:          0          0          0          0          0          0          0          0   IRQ work interrupts
RTR:          7          0          0          0          0          0          0          0   APIC ICR read retries
RES:      18708       9550        771        528        129        170        151        139   Rescheduling interrupts
CAL:        711        934       1312       1261       1446       1411       1433       1432   Function call interrupts
TLB:       4493       6108      73789       5014       1788       2327       1967        914   TLB shootdowns

This is often the first place I go when a thread affined to a certain core is exhibiting anomalous latency behavior. You can even use the watch command to get a feel for incoming rates from interrupts of interest:

[root@eltoro]# watch -d -n2 "egrep 'TLB|CAL' /proc/interrupts" #Show delta of interrupt counts from TLB Shootdowns and Function Call Interrupts every 2 secs

oslat

Part of the Real-time Tests repo, the oslat tool used to be my favorite utility in the interrupt hunting game (you’ll see later why I wrote “used to be”). It detects OS-level thread latency introduced by system scheduling or other interruptions. Here’s an example output from a system with cores 4 thru 7 isolated and timer tick reduced (i.e., isolcpus=4-7 nohz_full=4-7 boot options) while stress-ng –cpu 8 runs in the background:

[root@eltoro rt-tests]# ./oslat -c 0-7 -D 30
oslat V 2.40
Total runtime:          30 seconds
Thread priority:        default
CPU list:               0-7
CPU for main thread:    0
Workload:               no
Workload mem:           0 (KiB)
Preheat cores:          8

Pre-heat for 1 seconds...
Test starts...
Test completed.

        Core:    0 1 2 3 4 5 6 7
Counter Freq:    3096 3096 3096 3096 3096 3096 3095 3095 (Mhz)
    001 (us):    297462870 707961371 681256247 914343488 1496193369 1491253718 1496120812 1490616245
    002 (us):    4852 12319 11245 2069 4 5 4 9
    003 (us):    400 837 851 428 1 1 1 3
    004 (us):    178 82 34 29 0 1 1 0
    005 (us):    6 42 1 19 1 1 1 1
    006 (us):    1 28 0 8 0 0 0 0
    007 (us):    0 22 0 1 1 0 0 1
    008 (us):    0 3 0 0 0 1 1 0
    009 (us):    2 1 0 0 0 0 0 1
    010 (us):    1 0 0 0 0 0 0 0
    011 (us):    0 0 0 0 0 0 0 0
    012 (us):    0 0 0 0 0 0 0 0
    013 (us):    0 0 0 0 0 0 0 0
    014 (us):    0 0 0 0 0 0 0 0
    015 (us):    1 0 0 0 0 0 0 0
    016 (us):    0 0 0 0 0 0 0 0
    017 (us):    0 0 0 0 0 0 0 0
    018 (us):    2 0 0 0 0 0 0 0
    019 (us):    9 0 0 0 0 0 0 0
    020 (us):    7 0 0 0 0 0 0 0
    021 (us):    2 0 0 0 0 0 0 0
    022 (us):    0 0 0 0 0 0 0 0
    023 (us):    0 0 0 0 0 0 0 0
    024 (us):    0 0 0 0 0 0 0 0
    025 (us):    0 0 0 0 0 0 0 0
    026 (us):    0 0 0 0 0 0 0 0
    027 (us):    0 0 0 0 0 0 0 0
    028 (us):    0 0 0 0 0 0 0 0
    029 (us):    0 0 0 0 0 0 0 0
    030 (us):    0 0 0 0 0 0 0 0
    031 (us):    0 0 0 0 0 0 0 0
    032 (us):    525 865 1089 869 0 0 0 0 (including overflows)
     Minimum:    1 1 1 1 1 1 1 1 (us)
     Average:    1.081 1.034 1.035 1.026 1.016 1.016 1.016 1.016 (us)
     Maximum:    142851 81917 28973 23977 7 8 8 9 (us)
     Max-Min:    142850 81916 28972 23976 6 7 7 8 (us)
    Duration:    30.049 29.978 29.962 29.966 29.961 29.961 29.971 29.971 (sec)

It’s clear from the breakdown that cores 0 thru 3 exhibit higher MAXes and longer tails than cores 4 thru 7. That alone is pretty cool, but it still leaves the search for root cause(s) as an exercise for the user.

That all changed when the maintainers added the “-T” option, which works along with FTrace! So, let’s say you have FTrace running in the background. This “-T” option allows for setting a latency threshold above which oslat will make an entry in the FTrace recording buffer using the trace marker facility, after which it will stop the FTrace. Cool, huh? This way, you can just locate the oslat marker in the FTrace file and note the surrounding entries to deduce the offending interrupt!

Not bad, huh? But it still requires coordinating across two separate utilities (oslat and FTrace). Also, it only deals with OS-level interrupts. Wouldn’t it be cool if it included this source-tracking capability for the lower level interrupts, too?

The Infinity Gauntlet of System Interrupt Hunting Tools

osnoise

In the previous section, I noted that oslat “used to be” a favorite of mine for system interrupt tracking. And now I’ll tell you why: the introduction of the new osnoise tracer in FTrace!

Not only does it include the OS-level interrupt tracking capability of oslat, but it can also trace HW/firmware level noise. Even more, since it’s included in the kernel as both a tracer *and* workload generator itself (oslat is only the latter), it tracks root causes like the “-T” option of oslat without requiring the juggling of multiple tools. Below is a snapshot of the basic functionality of osnoise:

[root@eltoro tracing]# echo osnoise > current_tracer
[root@eltoro tracing]# cat trace
# tracer: osnoise
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth                            MAX
#                              || /                                             SINGLE      Interference counters:
#                              ||||               RUNTIME      NOISE  %% OF CPU  NOISE    +-----------------------------+
#           TASK-PID      CPU# ||||   TIMESTAMP    IN US       IN US  AVAILABLE  IN US     HW    NMI    IRQ   SIRQ THREAD
#              | |         |   ||||      |           |             |    |            |      |      |      |      |      |
           <...>-4731  [000] ....  3625.723652: 1000000         71  99.99290      19      0      0   1001     18      5
           <...>-4732  [001] ....  3625.723672: 1000000          8  99.99920       8      0      0   1000     83     61
           <...>-4734  [003] ....  3625.723732: 1000000          0 100.00000       0      0      1   1000     14      0
           <...>-4735  [004] ....  3625.723760: 1000000          0 100.00000       0      0      0      4      0      0
           <...>-4736  [005] ....  3625.723790: 1000000          0 100.00000       0      0      0      4      0      0
           <...>-4737  [006] ....  3625.723814: 1000000          0 100.00000       0      0      0      4      0      0
           <...>-4738  [007] ....  3625.723836: 1000000          0 100.00000       0      0      0      4      0      0

Not only does it report on total noise duration and single max noise readings, but there’s an entire section to the right dedicated to counting the types of interrupts observed – HW, NMI (Non-Maskable Interrupts), IRQ, SIRQ (Soft IRQ), and THREAD. It truly encompasses all possible levels of system noise.

Now let’s see if osnoise will report the root cause for interrupts on cores 0 thru 3. Below, I set the tracer to “osnoise” and also enable all “osnoise” events for root cause tracking. Lastly, I tell “osnoise” to stop tracing once a latency threshold of 10μs is breached:

[root@eltoro tracing]# echo osnoise > current_tracer
[root@eltoro tracing]# echo osnoise > set_event
[root@eltoro tracing]# echo 10 > osnoise/stop_tracing_us
[root@eltoro tracing]# echo 1 > tracing_on
[root@eltoro tracing]# cat trace
# tracer: osnoise
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth                            MAX
#                              || /                                             SINGLE      Interference counters:
#                              ||||               RUNTIME      NOISE  %% OF CPU  NOISE    +-----------------------------+
#           TASK-PID      CPU# ||||   TIMESTAMP    IN US       IN US  AVAILABLE  IN US     HW    NMI    IRQ   SIRQ THREAD
#              | |         |   ||||      |           |             |    |            |      |      |      |      |      |
       osnoise/0-5024  [000] d.h.  5328.056759: irq_noise: local_timer:236 start 5328.056757596 duration 1360 ns
       osnoise/1-5025  [001] d.h.  5328.056775: irq_noise: local_timer:236 start 5328.056773292 duration 1179 ns
           <...>-5026  [002] d.h.  5328.056790: irq_noise: local_timer:236 start 5328.056788930 duration 894 ns
       osnoise/0-5024  [000] ..s.  5328.057760: softirq_noise:      RCU:9 start 5328.057759392 duration 199 ns
       osnoise/1-5025  [001] d.h.  5328.057774: irq_noise: local_timer:236 start 5328.057773277 duration 902 ns
       osnoise/1-5025  [001] ..s.  5328.057774: softirq_noise:      RCU:9 start 5328.057774329 duration 83 ns
           <...>-5026  [002] d.h.  5328.057790: irq_noise: local_timer:236 start 5328.057788856 duration 622 ns
           <...>-5026  [002] ..s.  5328.057790: softirq_noise:      RCU:9 start 5328.057789621 duration 62 ns
           <...>-5027  [003] d.h.  5328.057805: irq_noise: local_timer:236 start 5328.057804436 duration 607 ns
           <...>-5027  [003] ..s.  5328.057805: softirq_noise:      RCU:9 start 5328.057805192 duration 64 ns
. . .
. . .
. . .
       osnoise/0-5024  [000] ..s.  5328.238757: softirq_noise:    TIMER:1 start 5328.238756704 duration 723 ns
       osnoise/1-5025  [001] d.h.  5328.238772: irq_noise: local_timer:236 start 5328.238771210 duration 790 ns
       osnoise/0-5024  [000] dNh.  5328.239757: irq_noise: local_timer:236 start 5328.239755524 duration 1064 ns
     kworker/0:2-241   [000] d...  5328.239766: thread_noise: kworker/0:2:241 start 5328.239757025 duration 9334 ns
       osnoise/0-5024  [000] ....  5328.239767: sample_threshold: start 5328.239755276 duration 11322 ns interference 2

The trace automatically ends at the final line where we breach the 10μs threshold on core 0. It measures the total interrupt duration as 11,322ns, and points to the two prior lines as root causes (interference 2). Those interruptions include “thread_noise” from kworker/0:2 (9,334ns) and “irq_noise” from its local_timer (1,064ns).

Now that’s what I call a “System Interrupt Hunting Tool”! If you want to learn more about osnoise and all its neat bells and whistles, check out this talk from Daniel Bristot de Oliveira, Principal Software Engineer at Red Hat:

“I’ma let you finish. . . period”

You scoped out this hardware and tuned the OS specifically for your IT department’s hand-crafted and thoroughly tested production application. Once your app gets scheduled on a core, that’s its moment. Its time to shine. Don’t let rowdy system interrupts storm the stage, snatch the core away for heaven-only-knows-how-long, and trample all over your app’s day in the “CPU time-slice” sun. Track down those lurking digital Kanyes and cut ’em off at the pass. Your SRE Team and, more importantly, your paying customers will thank you for it.

Do you enjoy this kind of content? Share on your social media.

Facebook
Twitter
LinkedIn

You can also follow us on

or 

and get notified as soon as new posts become available.