Estimated reading time: 10 minutes
“Taylor Swift, I’ma let you finish but Beyoncé had one of the best videos of all time!”
Kanye West (2009 MTV Video Music Awards)
Up until that point, the musical performances and award presentations had proceeded without a hitch. Sure, Kanye West and his then-girlfriend Amber Rose sat near the stage drinking from an open bottle of Hennessy. But what could possibly go wrong? Then Taylor Lautner opened the envelope to announce the winner of the 2009 Best Female Video award. "Taylor Swift!" he shouted. Taylor had just beaten her fellow nominee Beyoncé, much to Kanye's drunken shock. Midway into her heartfelt acceptance speech, an appalled and cognac-fueled West stormed the stage, commandeered her mic, and infamously delivered the quote above. That notorious pop culture interruption ruined not just her speech but her whole moment. System interrupts pose a similar disruptive risk to our applications.
Think about it. Your application thread grabs a core expecting exclusive use of it while it performs its duties. But little does it realize, a system interrupt waits in the wings, drunk off the Henny, eager to yank the core away for its own purposes. That thread's moment, and its performance, would be ruined. What can we do to guard against these digital Kanyes in our systems and preserve our latency Service Level Objectives (SLOs)?
Sources of System Interrupts
Out of the box, our x86-based Linux systems conspire to interrupt our applications in a variety of ways. These system interruptions include:
- Timer Tick – a per-core interrupt that cues the CPU to reconsider which thread should be running, catch up on read-copy-update (RCU) callbacks, and handle other necessary housekeeping. By default, this interrupt fires between 100 and 1,000 times per second, depending on the kernel's CONFIG_HZ setting (a quick check for this follows the list).
- Scheduler – switches both user and kernel threads on/off cores as necessary to apportion runtime fairness as much as possible.
- Device Interrupts – signals generated from peripheral devices to request service from the CPU.
- Inter-processor Interrupts (IPI) – special interrupts sent from one core to another core.
- System Management Interrupts (SMI) – special interrupts originating from the underlying system (outside the Linux kernel) to provide extended system management features (e.g., DRAM ECC functions, CPU temperature and fan control, etc.). System BIOS and firmware routines service these interrupts, unlike device interrupts, which are serviced by OS device driver routines.
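If you're curious what a couple of these sources look like on your own box, here's a quick sanity check. It's a minimal sketch assuming a RHEL/CentOS-style layout (the kernel config file lives elsewhere on some distros):
[root@eltoro]# grep 'CONFIG_HZ=' /boot/config-$(uname -r)   # How often the timer tick CAN fire per core
[root@eltoro]# grep 'Local timer' /proc/interrupts          # How often it HAS fired so far, per CPU (the LOC row)
[root@eltoro]# cat /proc/cmdline                            # Look for isolcpus=, nohz_full=, rcu_nocbs= and friends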
The latency impact of these system interrupt sources ranges from single-digit microseconds to many milliseconds. Obviously, if your SLOs fall within that range, hunting down these digital Kanyes is critical for your application. So, let the hunting begin!
But first... let's not reinvent the wheel
Tackling each source of interrupts from scratch can be a fun and engaging exercise. But I’m gonna guess that you don’t have unlimited free time or indefinite project deadlines at work. So, please, don’t reinvent the wheel. Follow best practice guidelines from technologists who’ve already laid a lot of the initial groundwork in this area. Plenty of opportunities will remain to locate and eradicate mysterious interrupts on your own, I assure you. Here are some online resources to get you started:
- Red Hat Guidelines – this tuning document is based on RHEL 7 but still provides a solid foundation upon which to build
- Erik Rigtorp’s Blog – this is a more up-to-date and comprehensive guideline which addresses the full stack of interrupt sources
- HP BIOS Guidelines – An example “best practices” document from HP for recommended settings for best performance and SMI elimination
- Dell BIOS Guidelines – Sections 1.10.1 and 1.10.2 of this document from Dell outlines recommended settings for optimal machine performance and SMI elimination
Now that we've set ourselves up to stand on the shoulders of giants, how do we hunt down any residual sources of system interrupts?
System Interrupts Hunting Tools
Let’s review a few tools which I’m certain you’ll find useful on your system interrupt hunting expeditions. Please note that this is *not* intended to be an exhaustive list. It simply represents the set of utilities I’ve found most helpful over the years. Now let’s get started!
Hardware-level System Interrupts Hunting Tools
perf stat
Linux perf never ceases to amaze me. Just when I think I’m familiar with all its functionality, I learn about a new wrinkle. It’s an amazing tool that every Performance Engineer should devote time to learning.
Now, when it came to detecting SMIs on a newly racked machine, I'd always either fire up the turbostat utility or directly read the MSR_SMI_COUNT register on Intel via rdmsr -a 0x34. That is, until I learned about the --smi-cost option of perf stat:
[root@eltoro]# perf stat -a -A --smi-cost -- sleep 10
Performance counter stats for 'system wide':
SMI cycles% SMI#
CPU0 0.0% 0
CPU1 0.0% 0
CPU2 0.0% 0
CPU3 0.0% 0
CPU4 0.0% 0
CPU5 0.0% 0
CPU6 0.0% 0
CPU7 0.0% 0
CPU8 0.0% 0
CPU9 0.0% 0
CPU10 0.0% 0
CPU11 0.0% 0
CPU12 0.0% 0
CPU13 0.0% 0
CPU14 0.0% 0
CPU15 0.0% 0
CPU16 0.0% 0
CPU17 0.0% 0
10.000486424 seconds time elapsed
Notice that it produces not only a count but also the percentage of CPU cycles consumed by SMIs. And unlike the aforementioned rdmsr alternative, this perf stat approach works on AMD processors too, though the command options differ slightly:
[root@amd]# perf stat -e ls_smi_rx -I 1000
# time counts unit events
1.001009858 0 ls_smi_rx
2.002295208 0 ls_smi_rx
3.003531738 0 ls_smi_rx
4.004755954 0 ls_smi_rx
5.005970763 0 ls_smi_rx
6.006391858 0 ls_smi_rx
7.007612107 0 ls_smi_rx
8.008825914 0 ls_smi_rx
9.010042997 0 ls_smi_rx
10.010390676 0 ls_smi_rx
. . .
. . .
. . .
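And for completeness, here's roughly what the older turbostat/rdmsr route looks like on Intel. Treat it as a sketch: turbostat flag and column support varies by version, and rdmsr comes from the msr-tools package and needs the msr kernel module loaded:
[root@eltoro]# turbostat --quiet --show SMI sleep 10   # SMI counts per CPU over a 10-second window
[root@eltoro]# modprobe msr
[root@eltoro]# rdmsr -a 0x34                           # Raw MSR_SMI_COUNT reading from each CPU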
NOTE: For an in-depth discussion on using perf to read MSRs and hardware PMU counters for application performance analysis, check out our book Performance Analysis and Tuning on Modern CPUs.¹
hwlat
FTrace offers the hwlat tracer as an alternative method of detecting hardware/firmware-based system interrupts such as SMIs:
[root@eltoro]# echo hwlat > /sys/kernel/debug/tracing/current_tracer
[root@eltoro]# echo 1 > /sys/kernel/debug/tracing/tracing_on
[root@eltoro]# cat /sys/kernel/debug/tracing/trace
# tracer: hwlat
#
# entries-in-buffer/entries-written: 0/0 #P:18
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
[root@eltoro]# echo 0 > /sys/kernel/debug/tracing/tracing_on
[root@eltoro]# echo nop > /sys/kernel/debug/tracing/current_tracer
Alternatively, you can use the more user-friendly trace-cmd FTrace frontend to do the same thing without requiring interaction with the SysFS directory hierarchy:
[root@eltoro]# trace-cmd start -p hwlat
plugin 'hwlat'
[root@eltoro]# trace-cmd show
# tracer: hwlat
#
# entries-in-buffer/entries-written: 0/0 #P:18
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
[root@eltoro]# trace-cmd stop
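One more hwlat tip: the tracer only spins for part of each sampling window and only records gaps above a threshold. On the kernels I've used, those knobs live right alongside the tracer; the file names below are what I see on my machines, so confirm them in your own tracing directory:
[root@eltoro]# cd /sys/kernel/debug/tracing
[root@eltoro tracing]# cat hwlat_detector/width    # usecs spent spinning (interrupts disabled) per window
[root@eltoro tracing]# cat hwlat_detector/window   # usecs in each sampling window
[root@eltoro tracing]# echo 5 > tracing_thresh     # Record gaps above 5us (hwlat defaults this to 10us on my machines)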
OS-level System Interrupt Hunting Tools
/proc/interrupts
We can't talk about system interrupts without discussing the procfs file with "interrupts" in its name: /proc/interrupts. This file contains columns of CPUs and rows of interrupt sources, with interrupt counts at each intersection. Below is a snapshot from a CentOS 7 machine on which I debugged an interrupt issue whose root cause turned out to be IPIs from TLB shootdowns triggered by "Auto NUMA Balancing":
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
45: 0 0 0 0 0 0 0 0 PCI-MSI-edge eth0
46: 192342 0 0 0 0 0 0 0 PCI-MSI-edge ahci
47: 14 0 0 0 0 0 0 0 PCI-MSI-edge mei
NMI: 0 0 0 0 0 0 0 0 Non-maskable interrupts
LOC: 552219 1010298 2272333 3179890 1445484 1226202 1800191 1894825 Local timer interrupts
SPU: 0 0 0 0 0 0 0 0 Spurious interrupts
IWI: 0 0 0 0 0 0 0 0 IRQ work interrupts
RTR: 7 0 0 0 0 0 0 0 APIC ICR read retries
RES: 18708 9550 771 528 129 170 151 139 Rescheduling interrupts
CAL: 711 934 1312 1261 1446 1411 1433 1432 Function call interrupts
TLB: 4493 6108 73789 5014 1788 2327 1967 914 TLB shootdowns
This is often the first place I go when an affinitized thread is exhibiting anomalous latency behavior. You can even use the watch command to get a feel for incoming rates from interrupts of interest:
[root@eltoro]# watch -d -n2 "egrep 'TLB|CAL' /proc/interrupts" # Show the delta of TLB Shootdown and Function Call interrupt counts every 2 secs
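When I need more than eyeballing, I snapshot /proc/interrupts before and after a latency-sensitive run and diff the two. Here's a rough sketch assuming an 8-CPU box with cores 4 thru 7 isolated (columns $6-$9; your column numbers will shift with CPU count, and run_my_latency_test is just a stand-in for your workload):
[root@eltoro]# awk '{print $1, $6, $7, $8, $9}' /proc/interrupts > /tmp/irq.before
[root@eltoro]# ./run_my_latency_test   # Hypothetical latency-sensitive workload pinned to cores 4-7
[root@eltoro]# awk '{print $1, $6, $7, $8, $9}' /proc/interrupts > /tmp/irq.after
[root@eltoro]# diff /tmp/irq.before /tmp/irq.after   # Any row that changed fired on the isolated cores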
oslat
Part of the Real-time Tests (rt-tests) repo, the oslat tool used to be my favorite utility in the interrupt hunting game (you'll see later why I wrote "used to be"). It detects OS-level thread latency introduced by system scheduling or other interruptions. Here's an example output from a system with cores 4 thru 7 isolated and the timer tick reduced (i.e., the isolcpus=4-7 nohz_full=4-7 boot options) while stress-ng --cpu 8 runs in the background:
[root@eltoro rt-tests]# ./oslat -c 0-7 -D 30
oslat V 2.40
Total runtime: 30 seconds
Thread priority: default
CPU list: 0-7
CPU for main thread: 0
Workload: no
Workload mem: 0 (KiB)
Preheat cores: 8
Pre-heat for 1 seconds...
Test starts...
Test completed.
Core: 0 1 2 3 4 5 6 7
Counter Freq: 3096 3096 3096 3096 3096 3096 3095 3095 (Mhz)
001 (us): 297462870 707961371 681256247 914343488 1496193369 1491253718 1496120812 1490616245
002 (us): 4852 12319 11245 2069 4 5 4 9
003 (us): 400 837 851 428 1 1 1 3
004 (us): 178 82 34 29 0 1 1 0
005 (us): 6 42 1 19 1 1 1 1
006 (us): 1 28 0 8 0 0 0 0
007 (us): 0 22 0 1 1 0 0 1
008 (us): 0 3 0 0 0 1 1 0
009 (us): 2 1 0 0 0 0 0 1
010 (us): 1 0 0 0 0 0 0 0
011 (us): 0 0 0 0 0 0 0 0
012 (us): 0 0 0 0 0 0 0 0
013 (us): 0 0 0 0 0 0 0 0
014 (us): 0 0 0 0 0 0 0 0
015 (us): 1 0 0 0 0 0 0 0
016 (us): 0 0 0 0 0 0 0 0
017 (us): 0 0 0 0 0 0 0 0
018 (us): 2 0 0 0 0 0 0 0
019 (us): 9 0 0 0 0 0 0 0
020 (us): 7 0 0 0 0 0 0 0
021 (us): 2 0 0 0 0 0 0 0
022 (us): 0 0 0 0 0 0 0 0
023 (us): 0 0 0 0 0 0 0 0
024 (us): 0 0 0 0 0 0 0 0
025 (us): 0 0 0 0 0 0 0 0
026 (us): 0 0 0 0 0 0 0 0
027 (us): 0 0 0 0 0 0 0 0
028 (us): 0 0 0 0 0 0 0 0
029 (us): 0 0 0 0 0 0 0 0
030 (us): 0 0 0 0 0 0 0 0
031 (us): 0 0 0 0 0 0 0 0
032 (us): 525 865 1089 869 0 0 0 0 (including overflows)
Minimum: 1 1 1 1 1 1 1 1 (us)
Average: 1.081 1.034 1.035 1.026 1.016 1.016 1.016 1.016 (us)
Maximum: 142851 81917 28973 23977 7 8 8 9 (us)
Max-Min: 142850 81916 28972 23976 6 7 7 8 (us)
Duration: 30.049 29.978 29.962 29.966 29.961 29.961 29.971 29.971 (sec)
It’s clear here that cores 0 thru 3 exhibit higher MAXes and longer tails than cores 4 thru 7. That alone is pretty cool, but it still leaves the search for root cause(s) as an exercise for the user.
That all changed when the maintainers added the "-T" option, which works hand-in-hand with FTrace. So, let's say you have FTrace running in the background. The "-T" option lets you set a latency threshold above which oslat writes an entry into the FTrace ring buffer via the trace marker facility and then stops the trace. Cool, huh? This way, you can just locate the oslat marker in the FTrace output and inspect the surrounding entries to deduce the offending interrupt!
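Here's roughly what that two-tool dance looks like in practice. This is a sketch only: the events recorded and the threshold are illustrative, and option spellings can vary between rt-tests versions, so check oslat --help on your system:
[root@eltoro rt-tests]# trace-cmd start -e irq -e sched   # Record IRQ and scheduler events in the background
[root@eltoro rt-tests]# ./oslat -c 4-7 -D 30 -T 10        # Drop a trace marker and stop tracing on any >10us hiccup
[root@eltoro rt-tests]# trace-cmd show | tail -50         # The entries just before the oslat marker are your suspects
[root@eltoro rt-tests]# trace-cmd stop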
Not bad, huh? But it still requires coordinating across two separate utilities (oslat and FTrace). Also, it only deals with OS-level interrupts. Wouldn’t it be cool if it included this source-tracking capability for the lower level interrupts, too?
The Infinity Gauntlet of System Interrupts Hunting Tools
osnoise
In the previous section, I noted that oslat “used to be” a favorite of mine for system interrupt tracking. And now I’ll tell you why: the introduction of the new osnoise tracer in FTrace!
Not only does it include the OS-level interrupt tracking capability of oslat, it can also trace HW/firmware-level noise. What's more, since it's built into the kernel as both a tracer *and* a workload generator (oslat is only the latter), it tracks root causes like oslat's "-T" option without requiring the juggling of multiple tools. Below is a snapshot of the basic functionality of osnoise:
[root@eltoro tracing]# echo osnoise > current_tracer
[root@eltoro tracing]# cat trace
# tracer: osnoise
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth MAX
# || / SINGLE Interference counters:
# |||| RUNTIME NOISE %% OF CPU NOISE +-----------------------------+
# TASK-PID CPU# |||| TIMESTAMP IN US IN US AVAILABLE IN US HW NMI IRQ SIRQ THREAD
# | | | |||| | | | | | | | | | |
<...>-4731 [000] .... 3625.723652: 1000000 71 99.99290 19 0 0 1001 18 5
<...>-4732 [001] .... 3625.723672: 1000000 8 99.99920 8 0 0 1000 83 61
<...>-4734 [003] .... 3625.723732: 1000000 0 100.00000 0 0 1 1000 14 0
<...>-4735 [004] .... 3625.723760: 1000000 0 100.00000 0 0 0 4 0 0
<...>-4736 [005] .... 3625.723790: 1000000 0 100.00000 0 0 0 4 0 0
<...>-4737 [006] .... 3625.723814: 1000000 0 100.00000 0 0 0 4 0 0
<...>-4738 [007] .... 3625.723836: 1000000 0 100.00000 0 0 0 4 0 0
Not only does it report on total noise duration and single max noise readings, but there’s an entire section to the right dedicated to counting the types of interrupts observed – HW, NMI (Non-Maskable Interrupts), IRQ, SIRQ (Soft IRQ), and THREAD. It truly encompasses all possible levels of system noise.
Now let’s see if osnoise will report the root cause for interrupts on cores 0 thru 3. Below, I set the tracer to “osnoise” and also enable all “osnoise” events for root cause tracking. Lastly, I tell “osnoise” to stop tracing once a latency threshold of 10μs is breached:
[root@eltoro tracing]# echo osnoise > current_tracer
[root@eltoro tracing]# echo osnoise > set_event
[root@eltoro tracing]# echo 10 > osnoise/stop_tracing_us
[root@eltoro tracing]# echo 1 > tracing_on
[root@eltoro tracing]# cat trace
# tracer: osnoise
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth MAX
# || / SINGLE Interference counters:
# |||| RUNTIME NOISE %% OF CPU NOISE +-----------------------------+
# TASK-PID CPU# |||| TIMESTAMP IN US IN US AVAILABLE IN US HW NMI IRQ SIRQ THREAD
# | | | |||| | | | | | | | | | |
osnoise/0-5024 [000] d.h. 5328.056759: irq_noise: local_timer:236 start 5328.056757596 duration 1360 ns
osnoise/1-5025 [001] d.h. 5328.056775: irq_noise: local_timer:236 start 5328.056773292 duration 1179 ns
<...>-5026 [002] d.h. 5328.056790: irq_noise: local_timer:236 start 5328.056788930 duration 894 ns
osnoise/0-5024 [000] ..s. 5328.057760: softirq_noise: RCU:9 start 5328.057759392 duration 199 ns
osnoise/1-5025 [001] d.h. 5328.057774: irq_noise: local_timer:236 start 5328.057773277 duration 902 ns
osnoise/1-5025 [001] ..s. 5328.057774: softirq_noise: RCU:9 start 5328.057774329 duration 83 ns
<...>-5026 [002] d.h. 5328.057790: irq_noise: local_timer:236 start 5328.057788856 duration 622 ns
<...>-5026 [002] ..s. 5328.057790: softirq_noise: RCU:9 start 5328.057789621 duration 62 ns
<...>-5027 [003] d.h. 5328.057805: irq_noise: local_timer:236 start 5328.057804436 duration 607 ns
<...>-5027 [003] ..s. 5328.057805: softirq_noise: RCU:9 start 5328.057805192 duration 64 ns
. . .
. . .
. . .
osnoise/0-5024 [000] ..s. 5328.238757: softirq_noise: TIMER:1 start 5328.238756704 duration 723 ns
osnoise/1-5025 [001] d.h. 5328.238772: irq_noise: local_timer:236 start 5328.238771210 duration 790 ns
osnoise/0-5024 [000] dNh. 5328.239757: irq_noise: local_timer:236 start 5328.239755524 duration 1064 ns
kworker/0:2-241 [000] d... 5328.239766: thread_noise: kworker/0:2:241 start 5328.239757025 duration 9334 ns
osnoise/0-5024 [000] .... 5328.239767: sample_threshold: start 5328.239755276 duration 11322 ns interference 2
The trace automatically ends at the final line, where we breach the 10μs threshold on core 0. It measures the total interruption as 11,322ns and points to the two prior lines as the root causes (interference 2): "thread_noise" from kworker/0:2 (9,334ns) and "irq_noise" from its local timer (1,064ns).
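By the way, if you want to confine the hunt to just your isolated cores, osnoise exposes a few more knobs alongside stop_tracing_us. A minimal sketch, with file names taken from the kernels I've used (osnoise landed in 5.14), so check your own tracing directory:
[root@eltoro tracing]# echo 4-7 > osnoise/cpus                   # Only run the osnoise workload threads on the isolated cores
[root@eltoro tracing]# echo 50 > osnoise/stop_tracing_total_us   # Also stop if accumulated noise in one period tops 50us
[root@eltoro tracing]# echo osnoise > current_tracer
[root@eltoro tracing]# echo osnoise > set_event
[root@eltoro tracing]# echo 1 > tracing_on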
Now that's what I call a "System Interrupt Hunting Tool"! I demonstrate its usage in a real-world performance debugging scenario in another article. If you wanna learn more about osnoise and all its bells and whistles, check out this talk from Daniel Bristot de Oliveira.
Let Taylor Swift Finish!
You scoped out this hardware and tuned the OS specifically for your IT department’s hand-crafted and thoroughly tested production application. Once your app gets scheduled on a core, that’s its moment. Its time to shine. Don’t let system interrupts storm the stage, snatch the core away for heaven-only-knows-how-long, and trample over your app’s day in the “CPU-time-slice” sun. Track down those lurking digital Kanyes and cut ’em off at the pass. Your SRE Team and, more importantly, your paying customers will thank you for it.
¹ Paid affiliate link