Thursday, March 24, 2022

Racing against the clock -- hitting a tiny kernel race window

TL;DR:

How to make a tiny kernel race window really large even on kernels without CONFIG_PREEMPT:

  • use a cache miss to widen the race window a little bit
  • make a timerfd expire in that window (which will run in an interrupt handler - in other words, in hardirq context)
  • make sure that the wakeup triggered by the timerfd has to churn through 50000 waitqueue items created by epoll

Racing one thread against a timer also avoids accumulating timing variations from two threads in each race attempt - hence the title. On the other hand, it also means you now have to deal with how hardware timers actually work, which introduces its own flavors of weird timing variations.

Introduction

I recently discovered a race condition (https://crbug.com/project-zero/2247) in the Linux kernel. (While trying to explain to someone how the fix for CVE-2021-0920 worked - I was explaining why the Unix GC is now safe, and then got confused because I couldn't actually figure out why it's safe after that fix, eventually realizing that it actually isn't safe.) It's a fairly narrow race window, so I was wondering whether it could be hit with a small number of attempts - especially on kernels that aren't built with CONFIG_PREEMPT, which would make it possible to preempt a thread with another thread, as I described at LSSEU2019.

This is a writeup of how I managed to hit the race on a normal Linux desktop kernel, with a hit rate somewhere around 30% if the proof of concept has been tuned for the specific machine. I didn't do a full exploit though, I stopped at getting evidence of use-after-free (UAF) accesses (with the help of a very large file descriptor table and userfaultfd, which might not be available to normal users depending on system configuration) because that's the part I was curious about.

This also demonstrates that even very small race conditions can still be exploitable if someone sinks enough time into writing an exploit, so be careful if you dismiss very small race windows as unexploitable or don't treat such issues as security bugs.

The UAF reproducer is in our bugtracker.

The bug

In the UNIX domain socket garbage collection code (which is needed to deal with reference loops formed by UNIX domain sockets that use SCM_RIGHTS file descriptor passing), the kernel tries to figure out whether it can account for all references to some file by comparing the file's refcount with the number of references from inflight SKBs (socket buffers). If they are equal, it assumes that the UNIX domain sockets subsystem effectively has exclusive access to the file because it owns all references.

(The same pattern also appears for files as an optimization in __fdget_pos(), see this LKML thread.)

The problem is that struct file can also be referenced from an RCU read-side critical section (which you can't detect by looking at the refcount), and such an RCU reference can be upgraded into a refcounted reference using get_file_rcu() / get_file_rcu_many() by __fget_files() as long as the refcount is non-zero. For example, when this happens in the dup() syscall, the resulting reference will then be installed in the FD table and be available for subsequent syscalls.

When the garbage collector (GC) believes that it has exclusive access to a file, it will perform operations on that file that violate the locking rules used in normal socket-related syscalls such as recvmsg() - unix_stream_read_generic() assumes that queued SKBs can only be removed under the ->iolock mutex, but the GC removes queued SKBs without using that mutex. (Thanks to Xingyu Jin for explaining that to me.)

One way of looking at this bug is that the GC is working correctly - here's a state diagram showing some of the possible states of a struct file, with more specific states nested under less specific ones and with the state transition in the GC marked:

All relevant states are RCU-accessible. An RCU-accessible object can have either a zero refcount or a positive refcount. Objects with a positive refcount can be either live or owned by the garbage collector. When the GC attempts to grab a file, it transitions from the state "live" to the state "owned by GC" by getting exclusive ownership of all references to the file.

While __fget_files() is making an incorrect assumption about the state of the struct file while it is trying to narrow down its possible states - it checks whether get_file_rcu() / get_file_rcu_many() succeeds, which narrows the file's state down a bit but not far enough:

__fget_files() first uses get_file_rcu() to conditionally narrow the state of a file from "any RCU-accessible state" to "any refcounted state". Then it has to narrow the state from "any refcounted state" to "live", but instead it just assumes that they are equivalent.

And this directly leads to how the bug was fixed (there's another follow-up patch, but that one just tries to clarify the code and recoup some of the resulting performance loss) - the fix adds another check in __fget_files() to properly narrow down the state of the file such that the file is guaranteed to be live:

The fix is to properly narrow the state from "any refcounted state" to "live" by checking whether the file is still referenced by a file descriptor table entry.

The fix ensures that a live reference can only be derived from another live reference by comparing with an FD table entry, which is guaranteed to point to a live object.

[Sidenote: This scheme is similar to the one used for struct page - gup_pte_range() also uses the "grab pointer, increment refcount, recheck pointer" pattern for locklessly looking up a struct page from a page table entry while ensuring that new refcounted references can't be created without holding an existing reference. This is really important for struct page because a page can be given back to the page allocator and reused while gup_pte_range() holds an uncounted reference to it - freed pages still have their struct page, so there's no need to delay freeing of the page - so if this went wrong, you'd get a page UAF.]

My initial suggestion was to instead fix the issue by changing how unix_gc() ensures that it has exclusive access, letting it set the file's refcount to zero to prevent turning RCU references into refcounted ones; this would have avoided adding any code in the hot __fget_files() path, but it would have only fixed unix_gc(), not the __fdget_pos() case I discovered later, so it's probably a good thing this isn't how it was fixed:

[Sidenote: In my original bug report I wrote that you'd have to wait an RCU grace period in the GC for this, but that wouldn't be necessary as long as the GC ensures that a reaped socket's refcount never becomes non-zero again.]

The race

There are multiple race conditions involved in exploiting this bug, but by far the trickiest to hit is that we have to race an operation into the tiny race window in the middle of __fget_files() (which can e.g. be reached via dup()), between the file descriptor table lookup and the refcount increment:

static struct file *__fget_files(struct files_struct *files, unsigned int fd,

                                 fmode_t mask, unsigned int refs)

{

        struct file *file;

        rcu_read_lock();

loop:

        file = files_lookup_fd_rcu(files, fd); // race window start

        if (file) {

                /* File object ref couldn't be taken.

                 * dup2() atomicity guarantee is the reason

                 * we loop to catch the new file (or NULL pointer)

                 */

                if (file->f_mode & mask)

                        file = NULL;

                else if (!get_file_rcu_many(file, refs)) // race window end

                        goto loop;

        }

        rcu_read_unlock();

        return file;

}

In this race window, the file descriptor must be closed (to drop the FD's reference to the file) and a unix_gc() run must get past the point where it checks the file's refcount ("total_refs = file_count(u->sk.sk_socket->file)").

In the Debian 5.10.0-9-amd64 kernel at version 5.10.70-1, that race window looks as follows:

<__fget_files+0x1e> cmp    r10,rax

<__fget_files+0x21> sbb    rax,rax

<__fget_files+0x24> mov    rdx,QWORD PTR [r11+0x8]

<__fget_files+0x28> and    eax,r8d

<__fget_files+0x2b> lea    rax,[rdx+rax*8]

<__fget_files+0x2f> mov    r12,QWORD PTR [rax] ; RACE WINDOW START

; r12 now contains file*

<__fget_files+0x32> test   r12,r12

<__fget_files+0x35> je     ffffffff812e3df7 <__fget_files+0x77>

<__fget_files+0x37> mov    eax,r9d

<__fget_files+0x3a> and    eax,DWORD PTR [r12+0x44] ; LOAD (for ->f_mode)

<__fget_files+0x3f> jne    ffffffff812e3df7 <__fget_files+0x77>

<__fget_files+0x41> mov    rax,QWORD PTR [r12+0x38] ; LOAD (for ->f_count)

<__fget_files+0x46> lea    rdx,[r12+0x38]

<__fget_files+0x4b> test   rax,rax

<__fget_files+0x4e> je     ffffffff812e3def <__fget_files+0x6f>

<__fget_files+0x50> lea    rcx,[rsi+rax*1]

<__fget_files+0x54> lock cmpxchg QWORD PTR [rdx],rcx ; RACE WINDOW END (on cmpxchg success)

As you can see, the race window is fairly small - around 12 instructions, assuming that the cmpxchg succeeds.

Missing some cache

Luckily for us, the race window contains the first few memory accesses to the struct file; therefore, by making sure that the struct file is not present in the fastest CPU caches, we can widen the race window by as much time as the memory accesses take. The standard way to do this is to use an eviction pattern / eviction set; but instead we can also make the cache line dirty on another core (see Anders Fogh's blogpost for more detail). (I'm not actually sure about the intricacies of how much latency this adds on different manufacturers' CPU cores, or on different CPU generations - I've only tested different versions of my proof-of-concept on Intel Skylake and Tiger Lake. Differences in cache coherency protocols or snooping might make a big difference.)

For the cache line containing the flags and refcount of a struct file, this can be done by, on another CPU, temporarily bumping its refcount up and then changing it back down, e.g. with close(dup(fd)) (or just by accessing the FD in pretty much any way from a multithreaded process).

However, when we're trying to hit the race in __fget_files() via dup(), we don't want any cache misses to occur before we hit the race window - that would slow us down and probably make us miss the race. To prevent that from happening, we can call dup() with a different FD number for a warm-up run shortly before attempting the race. Because we also want the relevant cache line in the FD table to be hot, we should choose the FD number for the warm-up run such that it uses the same cache line of the file descriptor table.

An interruption

Okay, a cache miss might be something like a few dozen or maybe hundred nanoseconds or so - that's better, but it's not great. What else can we do to make this tiny piece of code much slower to execute?

On Android, kernels normally set CONFIG_PREEMPT, which would've allowed abusing the scheduler to somehow interrupt the execution of this code. The way I've done this in the past was to give the victim thread a low scheduler priority and pin it to a specific CPU core together with another high-priority thread that is blocked on a read() syscall on an empty pipe (or eventfd); when data is written to the pipe from another CPU core, the pipe becomes readable, so the high-priority thread (which is registered on the pipe's waitqueue) becomes schedulable, and an inter-processor interrupt (IPI) is sent to the victim's CPU core to force it to enter the scheduler immediately.

One problem with that approach, aside from its reliance on CONFIG_PREEMPT, is that any timing variability in the kernel code involved in sending the IPI makes it harder to actually preempt the victim thread in the right spot.

(Thanks to the Xen security team - I think the first time I heard the idea of using an interrupt to widen a race window might have been from them.)

Setting an alarm

A better way to do this on an Android phone would be to trigger the scheduler not from an IPI, but from an expiring high-resolution timer on the same core, although I didn't get it to work (probably because my code was broken in unrelated ways).

High-resolution timers (hrtimers) are exposed through many userspace APIs. Even the timeout of select()/pselect() uses an hrtimer, although this is an hrtimer that normally has some slack applied to it to allow batching it with timers that are scheduled to expire a bit later. An example of a non-hrtimer-based API is the timeout used for reading from a UNIX domain socket (and probably also other types of sockets?), which can be set via SO_RCVTIMEO.

The thing that makes hrtimers "high-resolution" is that they don't just wait for the next periodic clock tick to arrive; instead, the expiration time of the next hrtimer on the CPU core is programmed into a hardware timer. So we could set an absolute hrtimer for some time in the future via something like timer_settime() or timerfd_settime(), and then at exactly the programmed time, the hardware will raise an interrupt! We've made the timing behavior of the OS irrelevant for the second side of the race, the only thing that matters is the hardware! Or... well, almost...

[Sidenote] Absolute timers: Not quite absolute

So we pick some absolute time at which we want to be interrupted, and tell the kernel using a syscall that accepts an absolute time, in nanoseconds. And then when that timer is the next one scheduled, the OS converts the absolute time to whatever clock base/scale the hardware timer is based on, and programs it into hardware. And the hardware usually supports programming timers with absolute time - e.g. on modern X86 (with X86_FEATURE_TSC_DEADLINE_TIMER), you can simply write an absolute Time Stamp Counter(TSC) deadline into MSR_IA32_TSC_DEADLINE, and when that deadline is reached, you get an interrupt. The situation on arm64 is similar, using the timer's comparator register (CVAL).

However, on both X86 and arm64, even though the clockevent subsystem is theoretically able to give absolute timestamps to clockevent drivers (via ->set_next_ktime()), the drivers instead only implement ->set_next_event(), which takes a relative time as argument. This means that the absolute timestamp has to be converted into a relative one, only to be converted back to absolute a short moment later. The delay between those two operations is essentially added to the timer's expiration time.

Luckily this didn't really seem to be a problem for me; if it was, I would have tried to repeatedly call timerfd_settime() shortly before the planned expiry time to ensure that the last time the hardware timer is programmed, the relevant code path is hot in the caches. (I did do some experimentation on arm64, where this seemed to maybe help a tiny bit, but I didn't really analyze it properly.)

A really big list of things to do

Okay, so all the stuff I said above would be helpful on an Android phone with CONFIG_PREEMPT, but what if we're trying to target a normal desktop/server kernel that doesn't have that turned on?

Well, we can still trigger hrtimer interrupts the same way - we just can't use them to immediately enter the scheduler and preempt the thread anymore. But instead of using the interrupt for preemption, we could just try to make the interrupt handler run for a really long time.

Linux has the concept of a "timerfd", which is a file descriptor that refers to a timer. You can e.g. call read() on a timerfd, and that operation will block until the timer has expired. Or you can monitor the timerfd using epoll, and it will show up as readable when the timer expires.

When a timerfd becomes ready, all the timerfd's waiters (including epoll watches), which are queued up in a linked list, are woken up via the wake_up() path - just like when e.g. a pipe becomes readable. Therefore, if we can make the list of waiters really long, the interrupt handler will have to spend a lot of time iterating over that list.

And for any waitqueue that is wired up to a file descriptor, it is fairly easy to add a ton of entries thanks to epoll. Epoll ties its watches to specific FD numbers, so if you duplicate an FD with hundreds of dup() calls, you can then use a single epoll instance to install hundreds of waiters on the file. Additionally, a single process can have lots of epoll instances. I used 500 epoll instances and 100 duplicate FDs, resulting in 50 000 waitqueue items.

Measuring race outcomes

A nice aspect of this race condition is that if you only hit the difficult race (close() the FD and run unix_gc() while dup() is preempted between FD table lookup and refcount increment), no memory corruption happens yet, but you can observe that the GC has incorrectly removed a socket buffer (SKB) from the victim socket. Even better, if the race fails, you can also see in which direction it failed, as long as no FDs below the victim FD are unused:

  • If dup() returns -1, it was called too late / the interrupt happened too soon: The file* was already gone from the FD table when __fget_files() tried to load it.
  • If dup() returns a file descriptor:
  • If it returns an FD higher than the victim FD, this implies that the victim FD was only closed after dup() had already elevated the refcount and allocated a new FD. This means dup() was called too soon / the interrupt happened too late.
  • If it returns the old victim FD number:
  • If recvmsg() on the FD returned by dup() returns no data, it means the race succeeded: The GC wrongly removed the queued SKB.
  • If recvmsg() returns data, the interrupt happened between the refcount increment and the allocation of a new FD. dup() was called a little bit too soon / the interrupt happened a little bit too late.

Based on this, I repeatedly tested different timing offsets, using a spinloop with a variable number of iterations to skew the timing, and plotted what outcomes the race attempts had depending on the timing skew.

Results: Debian kernel, on Tiger Lake

I tested this on a Tiger Lake laptop, with the same kernel as shown in the disassembly. Note that "0" on the X axis is offset -300 ns relative to the timer's programmed expiry.

This graph shows histograms of race attempt outcomes (too early, success, or too late), with the timing offset at which the outcome occurred on the X axis. The graph shows that depending on the timing offset, up to around 1/3 of race attempts succeeded.

Results: Other kernel, on Skylake

This graph shows similar histograms for a Skylake processor. The exact distribution is different, but again, depending on the timing offset, around 1/3 of race attempts succeeded.

These measurements are from an older laptop with a Skylake CPU, running a different kernel. Here "0" on the X axis is offset -1 us relative to the timer. (These timings are from a system that's running a different kernel from the one shown above, but I don't think that makes a difference.)

The exact timings of course look different between CPUs, and they probably also change based on CPU frequency scaling? But still, if you know what the right timing is (or measure the machine's timing before attempting to actually exploit the bug), you could hit this narrow race with a success rate of about 30%!

How important is the cache miss?

The previous section showed that with the right timing, the race succeeds with a probability around 30% - but it doesn't show whether the cache miss is actually important for that, or whether the race would still work fine without it. To verify that, I patched my test code to try to make the file's cache line hot (present in the cache) instead of cold (not present in the cache):

@@ -312,8 +312,10 @@

     }

 

+#if 0

     // bounce socket's file refcount over to other cpu

     pin_to(2);

     close(SYSCHK(dup(RESURRECT_FD+1-1)));

     pin_to(1);

+#endif

 

     //printf("setting timer\n");

@@ -352,5 +354,5 @@

     close(loop_root);

     while (ts_is_in_future(spin_stop))

-      close(SYSCHK(dup(FAKE_RESURRECT_FD)));

+      close(SYSCHK(dup(RESURRECT_FD)));

     while (ts_is_in_future(my_launch_ts)) /*spin*/;

With that patch, the race outcomes look like this on the Tiger Lake laptop:

This graph is a histogram of race outcomes depending on timing offset; it looks similar to the previous graphs, except that almost no race attempts succeed anymore.

But wait, those graphs make no sense!

If you've been paying attention, you may have noticed that the timing graphs I've been showing are really weird. If we were deterministically hitting the race in exactly the same way every time, the timing graph should look like this (looking just at the "too-early" and "too-late" cases for simplicity):