“Who hurt you?”
That’s the recurring question every confirmed bachelor, Lothario, and playboy expects to hear around this Valentine’s Day season. That time of year when couples exchange gifts of love, share romantic dinners, or whisk away on a mini-trip. Meanwhile, free-wheeling casual daters skillfully avoid the festivities with their multiple paramours, armed with well-rehearsed scripts criticizing the holiday’s emptiness and commercialism. Once that Celebration of Exclusivity passes, they come back out of hiding, ready to mingle! But why is that? Could it just be that some of us enjoy the excitement of sparking a romance with new and interesting people? Or might it stem from a fear of commitment due to some past heartbreak? “Who hurt you?” assumes the latter.
You and I have formed a pretty solid bond over these past couple of years, so I’ll come clean with you: I fear commitment. . . to the 1st CPU core. And not just core 0, either, which we all know usually catches the brunt of interrupts and is typically singled out as a housekeeping core. I’m talkin’ about the 1st core on *any* CPU socket. And you’re right – it’s not due to some free-wheeling excitement gained from the willy-nilly scheduling of threads across cores. It’s because the 1st CPU core once hurt me deeply, and I haven’t looked at it the same way since. Here’s my story.
The 1st CPU Core and How We Met
My meet-cute with the 1st CPU core began around 2014. RHEL 7 had just been released, with very few in the HFT space considering a migration from RHEL 6.8 and its 2.6.32 kernel until they could suss out all the latency gotchas in RHEL 7’s 3.10.0 kernel. It was against this backdrop that I worked with a startup Options Trading Group of a Chicago-based HFT firm.
I deployed the usual configurations you’d expect for low latency, many of which I’ve alluded to in past articles. Most important was core isolation (isolcpus) on the 2nd CPU (odd-numbered cores), home to the NIC’s PCIe lanes. Free from the vagaries of the Linux kernel scheduler, these cores served as low-jitter locations for latency-critical application threads. One of the more important threads we pinned to core 1, the 1st core on that 2nd socket. And for a few weeks, it was a fairy-tale affair. High IPC, low jitter, low latency – the pairing seemed destined! But nothing lasts forever, does it?
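For anyone who hasn’t set this kind of thing up before, the two halves of that recipe look roughly like the following. This is just a sketch: the core list assumes an 8-core-per-socket box with the odd-numbered cores on the 2nd socket, and TradeApp is a stand-in name rather than the client’s actual binary.
[root@eltoro]# grep -o 'isolcpus=[^ ]*' /proc/cmdline        # confirm the isolation list handed to the kernel at boot, e.g. isolcpus=1,3,5,7,9,11,13,15
[root@eltoro]# taskset -pc 1 $(pgrep TradeApp)               # pin the latency-critical thread to isolated core 1 (assumes a single TradeApp process)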
Cracks Begin To Show
“Mark, we’re suddenly experiencing sporadic latency spikes of over 1ms on core 1!”
Weeks had gone by without a hitch on the Options Trading Group’s system. So how could this be? Now, for those of you who have never serviced customers in IT, I’ll let you in on a little secret – they *never* tell you the whole story. It’s always “No, we didn’t change anything. We were just moseying along and then BOOM!” But hey, it is what it is. So, I fired up the FTrace facility and began tracking scheduler events on that core:
[root@eltoro]# echo sched_switch > /sys/kernel/debug/tracing/current_tracer
[root@eltoro]# echo latency-format > /sys/kernel/debug/tracing/trace_options
Here’s some of the tracing output I found after enabling those FTrace options:
TradeApp-31594 1d.s.. 70251098us!: 31594:115:R + [001] 6:120:R ksoftirqd
TradeApp-31594 1d.... 70252191us!: 31594:115:R ==> [001] 6:120:R ksoftirqd
ksoftirqd-6 1d.... 70252599us!: 6:120:S ==> [001] 31594:115:R TradeApp
<<< above represents ~400 microsecond scheduling latency >>>
TradeApp-31594 1d.s.. 85246192us!: 31594:115:R + [001] 6:120:R ksoftirqd
TradeApp-31594 1d.... 85246652us!: 31594:115:R ==> [001] 6:120:R ksoftirqd
ksoftirqd-6 1d.... 85262686us!: 6:120:R ==> [001] 31594:115:R TradeApp
<<< above represents a whopping 16 millisecond scheduling latency >>>
Ksoftirqd was taking as much as 16 *milliseconds* of CPU time from the trading application thread! This kernel thread jumps into action in process context when the interrupt-context softirq is overwhelmed with work and, therefore, needs to offload it to a schedulable entity. But what type of work overloaded the softirq so much that it had to enlist the help of ksoftirqd?
For this, I enabled the function_graph tracer in FTrace to track the flow of kernel functions called during ksoftirqd’s runtime. What did I find? A bunch of block request queue completion work (i.e., the cleaning of block layer structures created during the process of handling I/O) from the megasas disk I/O controller.
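For the curious, that function_graph setup looked roughly like this – a sketch, with the PID of 6 taken from the ksoftirqd entries in the trace output above:
[root@eltoro]# echo 6 > /sys/kernel/debug/tracing/set_ftrace_pid          # only trace that ksoftirqd thread
[root@eltoro]# echo function_graph > /sys/kernel/debug/tracing/current_tracer
[root@eltoro]# cat /sys/kernel/debug/tracing/trace_pipe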
But why? No one pinned the megasas driver IRQ to core 1. So why was all its associated background processing running there?
The 1st CPU Core and its Wandering Eye
Every now and then, you might miss a configuration step or two, even while using configuration management tools like Puppet or Ansible. So I verified my IRQ-assignment assumptions by checking /proc/interrupts, which confirmed my understanding. However, I *did* find a torrent of “Function Call Interrupts” (or CAL in /proc/interrupts parlance) hitting our beloved core 1:
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
0: 47 0 0 0 0 0 0 0 IO-APIC-edge timer
1: 25 0 0 0 0 0 0 0 IO-APIC-edge i8042
8: 1 0 0 0 0 0 0 0 IO-APIC-edge rtc0
9: 528 0 0 287 0 0 0 0 IO-APIC-fasteoi acpi
12: 207 0 0 0 0 0 0 0 IO-APIC-edge i8042
16: 493 0 187 0 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb1, nouveau
17: 29465 0 0 0 0 0 0 0 IO-APIC-fasteoi ath9k, snd_hda_intel
19: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi xhci_hcd:usb3
23: 9352 0 0 0 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb2
45: 0 0 0 0 0 0 0 0 PCI-MSI-edge eth0
46: 192342 0 0 0 0 0 0 0 PCI-MSI-edge ahci
47: 14 0 0 0 0 0 0 0 PCI-MSI-edge mei
48: 294 0 0 0 0 0 0 0 PCI-MSI-edge snd_hda_intel
NMI: 0 0 0 0 0 0 0 0 Non-maskable interrupts
LOC: 552219 1010298 2272333 3179890 1445484 1226202 1800191 1894825 Local timer interrupts
SPU: 0 0 0 0 0 0 0 0 Spurious interrupts
PMI: 0 0 0 0 0 0 0 0 Performance monitoring interrupts
IWI: 0 0 0 0 0 0 0 0 IRQ work interrupts
RTR: 7 0 0 0 0 0 0 0 APIC ICR read retries
RES: 18708 9550 771 528 129 170 151 139 Rescheduling interrupts
CAL: 711 9153934 1312 1261 1446 1411 1433 1432 Function call interrupts
TLB: 4493 6108 4981 5014 1788 2327 1967 914 TLB shootdowns
TRM: 0 0 0 0 0 0 0 0 Thermal event interrupts
That’s over three orders of magnitude more CALs on core 1 than on any other core! Further debugging and source code reading (explained in the next section) revealed the culprit as the /sys/block/sda/queue/rq_affinity setting. From the kernel docs:
“If this option is ‘1’, the block layer will migrate request completions to the
cpu “group” that originally submitted the request. For some workloads this
provides a significant reduction in CPU cycles due to caching effects.”
And of course, it was at this point that I noticed another core on that same CPU socket performing a heavy stream of writes to the disk attached to the megasas driver. “Oh yeah, Mark, we pinned a new thread to core 3 on that socket to perform our data writes. Did we forget to mention that?” *sigh*
Just to make sure you all are following me, here’s a recap: Core 1 and I agreed that it would dedicate itself to our trading workload, but then weeks later some other I/O workload came struttin’ by and stole its attention away from me. Really? Just like that, huh? Whatever happened to trust? To fidelity? To ’til decommission do us part’?!
This I/O-heavy workload on core 3 adversely impacted latency on core 1 due to rq_affinity interrupting it via CALs to perform block request completion cleanup (as we’ll learn in the next section). But why was it always core 1? Why not share it with other cores within that same LLC-sharing “group”? To answer that, we gotta read the source.
How rq_affinity *Used* to Work
The workhorse for rq_affinity is the __blk_complete_request() function shown below from a 2.6.32 kernel release:
void __blk_complete_request(struct request *req)
{
    struct request_queue *q = req->q;
    unsigned long flags;
    int ccpu, cpu, group_cpu;

    BUG_ON(!q->softirq_done_fn);

    local_irq_save(flags);
    cpu = smp_processor_id();
    group_cpu = blk_cpu_to_group(cpu);

    /*
     * Select completion CPU
     */
    if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1)
        ccpu = req->cpu;
    else
        ccpu = cpu;

    if (ccpu == cpu || ccpu == group_cpu) {
        struct list_head *list;
do_local:
        list = &__get_cpu_var(blk_cpu_done);
        list_add_tail(&req->csd.list, list);

        /*
         * if the list only contains our just added request,
         * signal a raise of the softirq. If there are already
         * entries there, someone already raised the irq but it
         * hasn't run yet.
         */
        if (list->next == &req->csd.list)
            raise_softirq_irqoff(BLOCK_SOFTIRQ);
    } else if (raise_blk_irq(ccpu, req))
        goto do_local;

    local_irq_restore(flags);
}
Here, it finds the core that is currently servicing the IRQ (the cpu variable), compares it to the core identified in the block I/O request struct as the originator of the I/O (the req->cpu variable), and checks whether rq_affinity is set (i.e., the QUEUE_FLAG_SAME_COMP flag). If it is set, and the two cores are neither one and the same nor members of the same LLC-sharing group, it interrupts the originating core to process the block request completion using the raise_blk_irq() function:
/*
* Setup and invoke a run of 'trigger_softirq' on the given cpu.
*/
static int raise_blk_irq(int cpu, struct request *rq)
{
    if (cpu_online(cpu)) {
        struct call_single_data *data = &rq->csd;

        data->func = trigger_softirq;
        data->info = rq;
        data->flags = 0;

        __smp_call_function_single(cpu, data, 0);
        return 0;
    }

    return 1;
}
Aha, so now we know the source of all those CALs we saw in /proc/interrupts – the __smp_call_function_single() function!
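(If you ever need to catch those IPI senders in the act on a live box, one option is the plain function tracer with stack traces enabled – a rough sketch, assuming the symbol names match your kernel build, and noting that per-call stack dumps are heavy, so save this for a lab machine:)
[root@eltoro]# echo '*smp_call_function_single*' > /sys/kernel/debug/tracing/set_ftrace_filter
[root@eltoro]# echo function > /sys/kernel/debug/tracing/current_tracer
[root@eltoro]# echo 1 > /sys/kernel/debug/tracing/options/func_stack_trace    # print the caller's stack at every hit
[root@eltoro]# cat /sys/kernel/debug/tracing/trace_pipe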
Finally, how is that req->cpu variable set up in the first place? Well, let’s take a look at a snippet of the function that starts it all, __make_request():
static int __make_request(struct request_queue *q, struct bio *bio)
{
    struct request *req;
    int el_ret;
    unsigned int bytes = bio->bi_size;
    const unsigned short prio = bio_prio(bio);
    const int sync = bio_sync(bio);
    const int unplug = bio_unplug(bio);
    int rw_flags;

    spin_lock_irq(q->queue_lock);
    . . .
    . . .
    . . .
    if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) ||
        bio_flagged(bio, BIO_CPU_AFFINE))
        req->cpu = blk_cpu_to_group(smp_processor_id());
    . . .
    . . .
    . . .
Ok, so this all seems reasonable now, right? We had cores 1 and 3 in the same “group”, with rq_affinity enabled, and the megasas IRQ pinned to a core which was NOT part of that group. Case closed, right? Wrong! We still don’t know why core 1 is *always* the victim among all the cores on that second CPU socket.
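(As an aside, you don’t have to take the kernel’s word for which cores are cache-mates – sysfs will tell you. A sketch, with index3 typically being the last-level cache on these boxes:)
[root@eltoro]# cat /sys/devices/system/cpu/cpu1/cache/index3/shared_cpu_list
[root@eltoro]# cat /sys/devices/system/cpu/cpu3/cache/index3/shared_cpu_list
If the two lists match, cores 1 and 3 sit in the same LLC-sharing “group”.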
Let’s take a look at the blk_cpu_to_group() function referenced in the code snippets above for more insight:
static inline int blk_cpu_to_group(int cpu)
{
#ifdef CONFIG_SCHED_MC
    const struct cpumask *mask = cpu_coregroup_mask(cpu);
    return cpumask_first(mask);
#elif defined(CONFIG_SCHED_SMT)
    return cpumask_first(topology_thread_cpumask(cpu));
#else
    return cpu;
#endif
}
Uh, what’s with that “_first” suffix to the cpumask_first() function? Makes me nervous. Hmm, maybe I’m being paranoid. Let’s check out its definition:
/**
 * cpumask_first - get the first cpu in a cpumask
 * @srcp: the cpumask pointer
 *
 * Returns >= nr_cpu_ids if no cpus set.
 */
static inline unsigned int cpumask_first(const struct cpumask *srcp)
{
    return find_first_bit(cpumask_bits(srcp), nr_cpumask_bits);
}
NOOOO!!! It *always* picks the first core in the group, thus causing core 1’s milliseconds-long hiccups! You broke my heart, core 1. You broke my heart. I’ll never trust the 1st core on any CPU socket with a low-latency assignment ever again, especially not when there’s a cpumask_first() function defined in the kernel just waiting for someone to use in the future!
The 1st CPU Core Attempts To Rekindle Our Flame
Nowadays, rq_affinity offers an extra option that makes the core generating all the I/O clean up after itself:
“For storage configurations that need to maximize distribution of completion
processing setting this option to ‘2’ forces the completion to run on the
requesting cpu (bypassing the “group” aggregation logic).”
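The knob itself lives in sysfs, so checking or flipping it is a one-liner – a sketch, assuming sda is the disk behind the controller in question (and note the setting doesn’t survive a reboot unless you persist it via something like a udev rule or a tuned profile):
[root@eltoro]# cat /sys/block/sda/queue/rq_affinity        # '1' is the default group-aggregation behavior
[root@eltoro]# echo 2 > /sys/block/sda/queue/rq_affinity   # force completions back onto the submitting cpu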
The updated code also does away with the “first core in the group” logic, as depicted in this snippet from the same __blk_complete_request() function in a more recent kernel release:
    /*
     * Select completion CPU
     */
    if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && ccpu != -1) {
        if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
            shared = cpus_share_cache(cpu, ccpu);
    } else
        ccpu = cpu;

    /*
     * If current CPU and requested CPU share a cache, run the softirq on
     * the current CPU. One might concern this is just like
     * QUEUE_FLAG_SAME_FORCE, but actually not. blk_complete_request() is
     * running in interrupt handler, and currently I/O controller doesn't
     * support multiple interrupts, so current CPU is unique actually. This
     * avoids IPI sending from current CPU to the first CPU of a group.
     */
    if (ccpu == cpu || shared) {
        struct list_head *list;
But how can I be sure? When you go through a betrayal of this magnitude, you’re not eager to jump right back into that relationship. Let’s set up a test for this new setting on Rocky 8.6 (kernel release 4.18.0-372.9.1) using the fio microbenchmark and the osnoise latency tracker.
Testing the Updated rq_affinity = 1
For this test, we have isolated the cores of the 2nd CPU socket (all odd-numbered cores). The target disk IRQ is assigned to core 7, our fio microbenchmark is pinned to core 3, and rq_affinity is set to ‘1’ (the default).
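For reference, steering the disk controller’s IRQ is just another procfs write – a sketch, with the ahci IRQ number (47 on this particular boot, per the osnoise output below) looked up first, since it will differ from system to system:
[root@eltoro ~]# grep ahci /proc/interrupts                  # find the disk controller's IRQ number
[root@eltoro ~]# echo 7 > /proc/irq/47/smp_affinity_list     # steer it to core 7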
In a separate window, I’ve set up the osnoise tracer to track any disturbance lasting for longer than a millisecond against any core on the 2nd CPU socket *except* core 3, where our fio workload will run:
[root@eltoro ~]# cd /sys/kernel/tracing/
[root@eltoro tracing]# echo osnoise > current_tracer
[root@eltoro tracing]# echo osnoise > set_event
[root@eltoro tracing]# echo "1,5,7,9,11,13,15" > osnoise/cpus
[root@eltoro tracing]# echo 1000 > osnoise/stop_tracing_us
In another window, I fire up fio to perform random writes for 300 seconds against a 5GB file, using one thread pinned to core 3:
[root@eltoro ~]# fio -name=blogpost -numjobs=1 -cpus_allowed=3 -filename=/corefiles/mytest.file -direct=0 -rw=randwrite -bs=4k -size=5G -iodepth=64 -ioengine=libaio -time_based=1 -runtime=300
Once I hit ‘ENTER’ on the fio command, I start the trace in the other window like so:
[root@eltoro tracing]# echo 1 > tracing_on
[root@eltoro tracing]# cat trace_pipe
After a few seconds, the trace automatically stops and displays this output as its final few lines:
. . .
. . .
. . .
osnoise/7-13353 [007] d.h. 37384.833988: irq_noise: ahci[0000:87:00.0]:47 start 37384.833985696 duration 2390 ns
osnoise/7-13353 [007] ..s. 37384.840123: softirq_noise: BLOCK:4 start 37384.833988294 duration 6134479 ns
osnoise/7-13353 [007] .... 37384.840124: sample_threshold: start 37384.833985490 duration 6138539 ns interference 2
[root@eltoro tracing]#
I keep re-enabling the trace during the fio run, but it keeps stopping at lines showing disturbances only on core 7, from softirqs executing block-related code. This is great because it means the block completion work isn’t bugging core 1 like in older kernel releases! We get further proof from the preponderance of reported “BLOCK” softirqs going to core 7 in /proc/softirqs – 256,724 in all, versus zero on every other isolated core:
[root@eltoro ~]# cat /proc/softirqs
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11 CPU12 CPU13 CPU14 CPU15
HI: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
TIMER: 1311064 4732 4732 4733 4732 4731 4731 4732 4731 4731 4731 4731 4731 4731 4731 4731
NET_TX: 14369 1 2 1 2 0 1 0 1 0 0 0 2 0 1 0
NET_RX: 547910 525 151351 2080 3 0 3 0 3 2 4 1 4 1 3 0
BLOCK: 106 0 0 0 0 0 0 256724 0 0 0 0 0 0 0 0
IRQ_POLL: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
TASKLET: 33514 0 0 0 0 0 0 0 19 0 64 0 0 0 0 0
SCHED: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
HRTIMER: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
RCU: 999445 4754 4733 5599 4734 4765 4732 5147 4732 4823 4732 4767 4732 4816 4732 4751
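(If you’d rather watch those BLOCK counters climb in real time while fio runs, something like this does the trick – just a convenience, not part of the test:)
[root@eltoro ~]# watch -n1 -d 'grep -E "CPU|BLOCK" /proc/softirqs'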
Now let’s see what happens after a clean reboot with rq_affinity set to ‘2’.
Testing the Updated rq_affinity = 2
Everything is configured the exact same way as in the previous section except rq_affinity is now set to ‘2’. I fire up fio, then I commence tracing and begin reading from the pipe:
[root@eltoro tracing]# echo 1 > tracing_on
[root@eltoro tracing]# cat trace_pipe
osnoise/5-2065 [005] d.h. 475.145177: irq_noise: local_timer:236 start 475.145176545 duration 584 ns
osnoise/5-2065 [005] ..s. 475.145181: softirq_noise: TIMER:1 start 475.145177244 duration 3489 ns
osnoise/5-2065 [005] d.H. 475.145183: irq_noise: irq_work:246 start 475.145182922 duration 425 ns
osnoise/5-2065 [005] .... 475.145184: sample_threshold: start 475.145176397 duration 8003 ns interference 3
osnoise/7-2066 [007] d.h. 475.156856: irq_noise: ahci[0000:87:00.0]:73 start 475.156846144 duration 9333 ns
osnoise/7-2066 [007] d.h. 475.156858: irq_noise: irq_work:246 start 475.156856920 duration 1280 ns
osnoise/7-2066 [007] .... 475.156859: sample_threshold: start 475.156845776 duration 12710 ns interference 2
osnoise/7-2066 [007] d.h. 475.160222: irq_noise: ahci[0000:87:00.0]:73 start 475.160219278 duration 2770 ns
osnoise/7-2066 [007] d.h. 475.160225: irq_noise: irq_work:246 start 475.160223535 duration 1506 ns
osnoise/7-2066 [007] .... 475.160225: sample_threshold: start 475.160218911 duration 6392 ns interference 2
osnoise/15-2070 [015] dNh. 475.184177: irq_noise: reschedule:253 start 475.184176989 duration 170 ns
osnoise/15-2070 [015] dNh. 475.184180: irq_noise: irq_work:246 start 475.184178651 duration 1389 ns
. . .
. . .
. . .
It just goes on and on until I explicitly disable tracing. No block softirq hiccups to be had on cores 1, 5, 7, 9, 11, 13, or 15 – and nothing anywhere near the 1ms stop threshold. So, then, where did they go? They shifted to the core responsible for all the disk I/O (fio), core 3, with 114,729 in total, according to /proc/softirqs:
[root@eltoro ~]# cat /proc/softirqs
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11 CPU12 CPU13 CPU14 CPU15
HI: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
TIMER: 110220 88 88 89 88 88 88 89 88 88 87 87 87 87 87 87
NET_TX: 658 1 2 1 0 0 1 0 1 0 2 0 2 0 1 0
NET_RX: 47970 86 2752 88 2 0 1 0 2 2 3 1 2 1 1 0
BLOCK: 6387 0 0 114729 0 0 0 4348 0 0 0 0 0 0 0 0
IRQ_POLL: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
TASKLET: 3140 0 0 0 0 0 0 0 19 0 64 0 0 0 0 0
SCHED: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
HRTIMER: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
RCU: 51021 98 90 707 89 118 89 214 89 132 88 118 88 119 88 119
It looks like the updated rq_affinity code works as advertised. So. . . does this mean the 1st CPU core can be trusted again?
What We Had Was Beautiful But. . .
You guys wanted to know why I can’t commit to the 1st CPU core. You wanted my answer to, “Who hurt you?” Well, now you know the history. True, the 1st CPU core has come a long way toward rebuilding the foundation of trust we once had. But for as long as the kernel defines a cpumask_first() function, I will always worry that some other Linux feature might employ it to force unrelated, latency-spiking workloads upon that core. For this reason, my fear of commitment to the 1st CPU core remains. And I will continue admonishing my clients to avoid the 1st CPU core for any latency-sensitive application threads.
Will I continue maintaining my wildly varied “confirmed bachelor” lifestyle, too? Well, that’s a topic for another day on another blog. Happy Valentine’s Day!