Monday, June 20, 2016

Exploiting Recursion in the Linux Kernel

Posted by Jann Horn, Google Project Zero

On June 1st, I reported an arbitrary recursion bug in the Linux kernel that can be triggered by a local user on Ubuntu if the system was installed with home directory encryption support. If you want to see the crasher, the exploit code and the shorter bug report, go to https://bugs.chromium.org/p/project-zero/issues/detail?id=836.

Prerequisites

On Linux, userland processes typically have a stack that is around 8 MB long. If a program overflows the stack, e.g. using infinite recursion, this is normally caught by a guard page below the stack.
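
The 8 MB figure is the common default soft limit for RLIMIT_STACK rather than a fixed property of the system. As a minimal sketch (the helper names stack_soft_limit() and print_stack_limit() are made up for this example), the limit that applies to the current process can be queried like this:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Fetch the soft stack limit of the calling process.
 * Returns 0 on success and -1 if getrlimit() fails. */
int stack_soft_limit(rlim_t *out)
{
    struct rlimit lim;

    assert(out != NULL);
    if (getrlimit(RLIMIT_STACK, &lim) != 0)
        return -1;
    *out = lim.rlim_cur;
    return 0;
}

/* Print the limit in a human-readable form. */
void print_stack_limit(void)
{
    rlim_t cur;

    if (stack_soft_limit(&cur) != 0) {
        perror("getrlimit");
        return;
    }
    if (cur == RLIM_INFINITY)
        puts("stack limit: unlimited");
    else
        printf("stack limit: %llu KiB\n", (unsigned long long)cur / 1024);
}
```

Calling print_stack_limit() from main() reports whatever limit the shell or service manager has configured, which may differ from the default.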

Linux kernel stacks, which are e.g. used when handling system calls, are very different. They are relatively short: 8192 bytes on 32-bit x86, 16384 bytes on x86-64. (The kernel stack size is specified by THREAD_SIZE_ORDER and THREAD_SIZE.) They are allocated using the kernel's buddy allocator, which is the kernel's normal allocator for page-sized allocations (and power-of-two numbers of pages) and doesn't create guard pages. This means that if kernel stacks overflow, they overlap with normal data. For this reason, kernel code must be (and usually is) very careful to not make big allocations on the stack and has to prevent excessive recursion.
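
The relationship between the two constants can be sketched as a shift of the page size. This is an illustration only: PAGE_SIZE_BYTES and thread_size() are names made up for the example, and the order of 2 for x86-64 is an assumption that matches the 16384-byte figure with 4096-byte pages.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096u /* base page size on x86 */

/* Stack size for a given order: PAGE_SIZE shifted left by THREAD_SIZE_ORDER.
 * Because the size is a power-of-two number of pages, the buddy allocator
 * can serve it directly. */
size_t thread_size(unsigned int thread_size_order)
{
    assert(thread_size_order < 8); /* keep the shift sane for this sketch */
    return (size_t)PAGE_SIZE_BYTES << thread_size_order;
}
```

With an order of 2, thread_size(2) yields the 16384 bytes quoted for x86-64.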

Most filesystems on Linux either don't use an underlying device (the pseudo-filesystems, like sysfs, procfs, tmpfs and so on) or use a block device (typically a partition on a hard disk) as the backing device. However, two filesystems, ecryptfs and overlayfs, are exceptions: They are stacking filesystems, filesystems that use a folder in another filesystem (or, in the case of overlayfs, multiple other folders, often from different filesystems) as the backing device. (The filesystem that serves as the backing device is called the lower filesystem, and the files in the lower filesystem are called lower files.) The idea with a stacking filesystem is that it more or less forwards accesses to the lower filesystem, but performs some modification on the passed-through data. overlayfs merges multiple filesystems into a common view; ecryptfs performs transparent encryption.

A potential danger with stacking filesystems is that because their virtual filesystem handlers often call into the handlers of the underlying filesystem, the stacked filesystem increases the stack usage compared to a direct access to the underlying filesystem. If it was possible to use the stacked filesystem as the backing device for another stacked filesystem and so on, at some point, the kernel stack would overflow because every layer of filesystem stacking increases the kernel stack depth. However, this is averted by placing a limit (FILESYSTEM_MAX_STACK_DEPTH) on the number of filesystem nesting layers - only up to two layers of nested filesystems can be placed on top of a non-stacking filesystem.
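
The depth limit can be modelled in a few lines of userspace C. This is a toy sketch, not kernel code: struct toy_sb and toy_stack_on() are hypothetical names, the -EINVAL return value is an assumption, and only the nesting counter is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define FILESYSTEM_MAX_STACK_DEPTH 2

/* Toy stand-in for a mounted filesystem: only the nesting depth is tracked. */
struct toy_sb {
    int stack_depth; /* 0 for a non-stacking filesystem */
};

/* Try to stack a new filesystem on top of `lower`.
 * Returns 0 on success, -EINVAL if the result would be nested too deeply. */
int toy_stack_on(const struct toy_sb *lower, struct toy_sb *upper)
{
    assert(lower != NULL && upper != NULL);

    int depth = lower->stack_depth + 1;
    if (depth > FILESYSTEM_MAX_STACK_DEPTH)
        return -EINVAL;
    upper->stack_depth = depth;
    return 0;
}
```

Starting from a depth of 0, two calls succeed and the third is rejected, which corresponds to the two permitted layers on top of a non-stacking filesystem.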

The procfs pseudo-filesystem contains one directory per running process on the system, and each directory contains various files describing the process. Of interest here are the per-process "mem", "environ" and "cmdline" files because reading from them causes synchronous accesses to the virtual memory of the target process. The files expose different virtual address ranges:

  • "mem" exposes the whole virtual address range (and requires PTRACE_MODE_ATTACH access)
  • "environ" exposes the memory range from mm->env_start to mm->env_end (and requires PTRACE_MODE_READ access)
  • "cmdline" exposes the memory range from mm->arg_start to mm->arg_end if the last byte before mm->arg_end is a nullbyte; otherwise, it gets a bit more complicated
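
To make the "environ" case concrete, the mapping from a file offset to a virtual address in the target process can be sketched as follows. This is a simplified model under stated assumptions: struct toy_mm and environ_window() are hypothetical names, and the real handler does more work (permission checks, copying in chunks) than this arithmetic shows.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the two mm fields that bound the environment area. */
struct toy_mm {
    unsigned long env_start;
    unsigned long env_end;
};

/* Map a read of `count` bytes at file offset `pos` onto the target's
 * address space. Stores the first virtual address in *addr and returns
 * the number of readable bytes (0 at or past the end of the area). */
size_t environ_window(const struct toy_mm *mm, unsigned long pos,
                      size_t count, unsigned long *addr)
{
    assert(mm != NULL && addr != NULL);
    assert(mm->env_start <= mm->env_end);

    unsigned long len = mm->env_end - mm->env_start;
    if (pos >= len)
        return 0;
    *addr = mm->env_start + pos;

    unsigned long avail = len - pos;
    return count < avail ? count : (size_t)avail;
}
```

The point of the model is that every byte returned by a read comes from the target's virtual memory at env_start plus the file offset, so the read can itself trigger a pagefault there.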

If it was possible to mmap() the "mem" file (which wouldn't make much sense; don't think too hard about it), you could e.g. set up mappings like this:

[Figure: mem mmap correct.png - hypothetical /proc/$pid/mem mappings chained across processes A, B and C]

Then, assuming that the /proc/$pid/mem mappings would have to be faulted in, a reading pagefault on the mapping in process C would cause pages to be faulted in from process B, which would cause another pagefault in process B, which in turn would cause pages to be faulted in from the memory of process A - a recursive pagefault.

However, this doesn't actually work - the mem, environ and cmdline files only have VFS handlers for normal reads and writes, not for mmap:

static const struct file_operations proc_pid_cmdline_ops = {
.read   = proc_pid_cmdline_read,
.llseek = generic_file_llseek,
};
[...]
static const struct file_operations proc_mem_operations = {
.llseek  = mem_lseek,
.read    = mem_read,
.write   = mem_write,
.open    = mem_open,
.release = mem_release,
};
[...]
static const struct file_operations proc_environ_operations = {
.open    = environ_open,
.read    = environ_read,
.llseek  = generic_file_llseek,
.release = mem_release,
};
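
The effect of the missing .mmap entry can be modelled in userspace. This is a toy sketch: struct toy_file_operations, toy_environ_ops and toy_do_mmap() are made-up names, and the -ENODEV result reflects how mmap() on a file without an mmap handler typically fails on Linux rather than anything quoted from the kernel source above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy handler table with just the two operations of interest. */
struct toy_file_operations {
    long (*read)(char *buf, size_t len);
    int (*mmap)(void);
};

/* Placeholder read handler; it reports end-of-file immediately. */
static long toy_environ_read(char *buf, size_t len)
{
    (void)buf;
    (void)len;
    return 0;
}

/* Same shape as the tables above: .read is set, .mmap is left NULL. */
static const struct toy_file_operations toy_environ_ops = {
    .read = toy_environ_read,
};

/* Dispatch an mmap request; files without a handler cannot be mapped. */
int toy_do_mmap(const struct toy_file_operations *ops)
{
    assert(ops != NULL);
    if (ops->mmap == NULL)
        return -ENODEV;
    return ops->mmap();
}
```

With this table, toy_do_mmap(&toy_environ_ops) fails even though reads through the same table would work.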

An interesting detail about ecryptfs is that it supports mmap(). Because the memory mappings seen by the user have to be decrypted while memory mappings from the underlying filesystem would be encrypted, ecryptfs can't just forward mmap() operations to the underlying filesystem's mmap() handler. Instead, ecryptfs needs to use its own page cache for mappings.

When ecryptfs handles a page fault, it has to somehow read the encrypted page from the underlying filesystem. It could do this by reading through the lower filesystem's page cache (using the lower filesystem's mmap handler), but that would be a waste of memory. Instead, ecryptfs simply uses the lower filesystem's VFS read handler (via kernel_read()). This is more efficient and straightforward, but also has the side effect that it is possible to mmap() decrypted views of files that normally wouldn't be mappable (because the mmap handler of an ecryptfs file works as long as the lower file has a read handler and contains valid encrypted data).
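
The read-then-decrypt flow can be sketched with a toy model. Everything here is hypothetical: the names are made up, the page size is shrunk to 16 bytes, and XOR with a constant stands in for the real cipher purely to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 16
#define TOY_KEY 0x5a /* stand-in "cipher": XOR with a constant, NOT real crypto */

/* Toy lower file: a byte buffer that only supports reads. */
struct toy_lower_file {
    const unsigned char *data;
    size_t size;
};

/* Lower "read handler": copy up to len bytes starting at offset pos. */
size_t toy_lower_read(const struct toy_lower_file *f, size_t pos,
                      unsigned char *buf, size_t len)
{
    assert(f != NULL && buf != NULL);
    if (pos >= f->size)
        return 0;
    size_t n = f->size - pos;
    if (n > len)
        n = len;
    memcpy(buf, f->data + pos, n);
    return n;
}

/* Upper "readpage": fetch ciphertext through the lower read handler,
 * then decrypt it in place into the upper layer's own page. */
int toy_upper_readpage(const struct toy_lower_file *f, size_t index,
                       unsigned char page[TOY_PAGE_SIZE])
{
    size_t n = toy_lower_read(f, index * TOY_PAGE_SIZE, page, TOY_PAGE_SIZE);
    if (n != TOY_PAGE_SIZE)
        return -1; /* short read: no complete encrypted page available */
    for (size_t i = 0; i < TOY_PAGE_SIZE; i++)
        page[i] ^= TOY_KEY;
    return 0;
}
```

Note that toy_upper_readpage() never asks whether the lower file is mappable; a working read handler and well-formed data are all it needs, which mirrors the side effect described above.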

The vulnerability

At this point, we can piece together the attack. We start by creating a process A with PID $A. Then, we create an ecryptfs mount /tmp/$A with /proc/$A as lower filesystem. (ecryptfs should be used with only one key so that filename encryption is disabled.) Now, /tmp/$A/mem, /tmp/$A/environ and /tmp/$A/cmdline are mappable if the corresponding files in /proc/$A/ contain valid ecryptfs headers at the start of the file. Unless we already have root privileges, we can't map anything to address 0x0 in process A, which corresponds to offset 0 in /proc/$A/mem, so a read from the start of /proc/$A/mem will always return -EIO and /proc/$A/mem never contains a valid ecryptfs header. Therefore, in a potential attack, only the environ and cmdline files are interesting.

On kernels that are compiled with CONFIG_CHECKPOINT_RESTORE (which is the case at least for Ubuntu's distro kernels), the arg_start, arg_end, env_start and env_end properties of the mm_struct can easily be set by an unprivileged process using prctl(PR_SET_MM, PR_SET_MM_MAP, &mm_map, sizeof(mm_map), 0). This permits pointing /proc/$A/environ and /proc/$A/cmdline at arbitrary virtual memory ranges. (On kernels without checkpoint-restore support, the attack is slightly more annoying to execute, but still possible by re-executing with the desired argument area and environment area lengths and then replacing part of the stack memory mapping.)

If a valid encrypted ecryptfs file is loaded into the memory of process A and its environment area is then configured to point to that area, the decrypted view of the data in the environment area becomes accessible at /tmp/$A/environ. This file can then be mapped into the memory of another process, process B. In order to be able to repeat the process, some data has to be encrypted with ecryptfs repeatedly, creating an ecryptfs matryoshka that can be loaded into the memory of process A. Now, a chain of processes that map each other's decrypted environment area can be set up:

[Figure: ecryptfs-mem chain relayouted.png - chain of processes that map each other's decrypted environment area through ecryptfs]

If no data is present in the relevant ranges of the memory mappings in process C and process B, a pagefault in the memory of process C (either caused by a pagefault from userspace or caused by a userspace access in the kernel, e.g. via copy_from_user()) causes ecryptfs to read from /proc/$B/environ, causing a pagefault in process B, causing a read from /proc/$A/environ via ecryptfs, causing a pagefault in process A. This can be repeated arbitrarily, causing a stack overflow that crashes the kernel. The stack looks like this:

[...]
[<ffffffff811bfb5b>] handle_mm_fault+0xf8b/0x1820
[<ffffffff811bac05>] __get_user_pages+0x135/0x620
[<ffffffff811bb4f2>] get_user_pages+0x52/0x60
[<ffffffff811bba06>] __access_remote_vm+0xe6/0x2d0
[<ffffffff811e084c>] ? alloc_pages_current+0x8c/0x110
[<ffffffff811c1ebf>] access_remote_vm+0x1f/0x30
[<ffffffff8127a892>] environ_read+0x122/0x1a0
[<ffffffff8133ca80>] ? security_file_permission+0xa0/0xc0
[<ffffffff8120c1a8>] __vfs_read+0x18/0x40
[<ffffffff8120c776>] vfs_read+0x86/0x130
[<ffffffff812126b0>] kernel_read+0x50/0x80
[<ffffffff81304d53>] ecryptfs_read_lower+0x23/0x30
[<ffffffff81305df2>] ecryptfs_decrypt_page+0x82/0x130
[<ffffffff813040fd>] ecryptfs_readpage+0xcd/0x110
[<ffffffff8118f99b>] filemap_fault+0x23b/0x3f0
[<ffffffff811bc120>] __do_fault+0x50/0xe0
[<ffffffff811bfb5b>] handle_mm_fault+0xf8b/0x1820
[<ffffffff811bac05>] __get_user_pages+0x135/0x620
[<ffffffff811bb4f2>] get_user_pages+0x52/0x60
[<ffffffff811bba06>] __access_remote_vm+0xe6/0x2d0
[<ffffffff811e084c>] ? alloc_pages_current+0x8c/0x110
[<ffffffff811c1ebf>] access_remote_vm+0x1f/0x30
[<ffffffff8127a892>] environ_read+0x122/0x1a0
[...]

Regarding reachability of the bug: Exploiting the bug requires the ability to mount ecryptfs filesystems with /proc/$pid as source. When the ecryptfs-utils package is installed (Ubuntu installs it if you enable home directory encryption during the setup), this can be done using the /sbin/mount.ecryptfs_private setuid helper. (It verifies that the user owns the source directory, but that's not a problem because the user "owns" the procfs directories of their own processes.)

Exploiting the bug

The following explanation is sometimes architecture-specific; when that is the case, it refers to amd64.

It used to be pretty simple to exploit bugs like this: As described in Jon Oberheide's "The Stack is Back" slides, it used to be possible to overflow into the thread_info structure at the bottom of the stack, overwrite either the restart_block or the addr_limit in there with appropriate values and then, depending on which one is being attacked, either execute code from an executable userspace mapping or use copy_from_user() and copy_to_user() to read and write kernel data.

However, restart_block was moved out of the thread_info struct, and because the stack overflow is triggered with a stack that contains kernel_read() frames, the addr_limit is already KERNEL_DS and will be reset to USER_DS on the way back out. Additionally, at least Ubuntu Xenial's distro kernel turns on the CONFIG_SCHED_STACK_END_CHECK kernel config option, which causes a canary directly above the thread_info struct to be checked whenever the scheduler is invoked; if the canary is incorrect, the kernel recursively oopses and then panics (fixed to directly panic in 29d6455178a0).

Since it is hard to find anything worth targeting in the thread_info struct (and because it would be nice to show that moving stuff out of thread_info is not a sufficient mitigation), I chose a different strategy: Overflow the stack into an allocation in front of the stack, then exploit the fact that the stack and the other allocation overlap. The problem with this approach is that the canary and some components of the thread_info struct must not be overwritten. The layout looks like this (green is fine to clobber, red must not be clobbered, yellow might be unpleasant if clobbered depending on the value):

[Figure: kernel_stack_bottom.png - layout of the bottom of the kernel stack, color-coded as described above]

Luckily, there are stack frames that contain holes - if the bottom of the recursion uses cmdline instead of environ, there is a five-QWORD hole that isn't touched on the way down the recursion, which is sufficient to cover everything from the STACK_END_MAGIC to the flags. This can be seen when using a safe recursion level together with a debugging helper kernel module that sprays the stack with markers to make holes in the stack visible (the holes are the surviving 0xdead... marker values, shown in green):


[...]
0xffff88015d115030: 0x0000000000000020 0xffff880077494640
0xffff88015d115040: 0xffffea000531feb0 0xffff88015d115118
0xffff88015d115050: 0xffffffff811bfc2b 0xdead505cdead5058
0xffff88015d115060: 0xdead5064dead5060 0xdead506cdead5068
0xffff88015d115070: 0xffff88014e3dff70 0xffff88015d1150d8
[...]
0xffff88015d115120: 0xffffffff811bacd5 0xdead512cdead5128
0xffff88015d115130: 0xdead5134dead5130 0xdead513cdead5138
0xffff88015d115140: 0xdead5144dead5140 0xdead514cdead5148
0xffff88015d115150: 0xffff8800d8364b00 0xffff88015d118000
[...]
0xffff88015d1154d0: 0xffffffff811bfc2b 0xdead54dcdead54d8
[...]
0xffff88015d1155a0: 0xffffffff811bacd5 0xdead55acdead55a8
0xffff88015d1155b0: 0xdead55b4dead55b0 0xdead55bcdead55b8
0xffff88015d1155c0: 0xdead55c4dead55c0 0xdead55ccdead55c8
[...]
0xffff88015d115950: 0xffffffff811bfc2b 0xdead595cdead5958
0xffff88015d115960: 0xdead5964dead5960 0xdead596cdead5968
[...]
0xffff88015d115a20: 0xffffffff811bacd5 0xdead5a2cdead5a28
0xffff88015d115a30: 0xdead5a34dead5a30 0xdead5a3cdead5a38
0xffff88015d115a40: 0xdead5a44dead5a40 0xdead5a4cdead5a48
[...]
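
The dump is periodic: the same value, 0xffffffff811bfc2b (presumably a saved return address), shows up at three addresses that are evenly spaced. A small sketch for checking that spacing follows; the helper names are made up, and treating the spacing as the stack cost of one recursion level is an interpretation of the dump, not something it states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Addresses in the dump above that hold the value 0xffffffff811bfc2b. */
static const uint64_t marker_addrs[] = {
    UINT64_C(0xffff88015d115050),
    UINT64_C(0xffff88015d1154d0),
    UINT64_C(0xffff88015d115950),
};

/* Returns the common spacing between consecutive entries,
 * or 0 if there are fewer than two entries or the spacing varies. */
uint64_t uniform_stride(const uint64_t *addrs, size_t n)
{
    assert(addrs != NULL);
    if (n < 2)
        return 0;
    uint64_t stride = addrs[1] - addrs[0];
    for (size_t i = 2; i < n; i++)
        if (addrs[i] - addrs[i - 1] != stride)
            return 0;
    return stride;
}

/* How many whole frames of `stride` bytes fit into `stack_size` bytes. */
uint64_t levels_that_fit(uint64_t stack_size, uint64_t stride)
{
    assert(stride != 0);
    return stack_size / stride;
}
```

For these three addresses the spacing is 0x480 bytes, so at most 14 such levels fit into a 16384-byte stack, ignoring whatever the entry path into the recursion consumes.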

The next problem is that this hole only appears at specific stack depths while, for successful exploitation, it needs to be precisely in the right position. However, there are several tricks that can be used together to align the stack:

  • On every recursion level, it is possible to choose whether the "environ" file or the "cmdline" file is used, and their stackframe sizes and hole patterns differ.
  • Any copy_from_user() is a valid entry point to a page fault. Even better, it is possible to combine an arbitrary write syscall with an arbitrary VFS write handler so that both the write syscall and the VFS write handler influence the depth. (And the combined depth can be calculated without having to test every variant.)

After testing various combinations, I ended up with a mix of environ files and cmdline files, the write() syscall and uid_map's VFS write handler.

At this point, we can recurse down into the previous allocation without touching any of the dangerous fields. Now, execution of the kernel thread has to be paused while the stack pointer points into the preceding allocation, the allocation the stack pointer points to should be overwritten with a new stack, and then execution of the kernel code should be resumed.

To pause the kernel thread while it is inside the recursion, after setting up the chain of mappings, the anonymous mapping at the end of the chain can be replaced with a FUSE mapping (userfaultfd won't work here; it doesn't catch remote memory accesses).

For the preceding allocation, my exploit uses pipes. When data is written to a newly allocated empty pipe, a page will be allocated for that data using the buddy allocator. My exploit simply spams the memory with pipe page allocations while creating the process that will trigger the pagefault using clone(). Using clone() instead of fork() has the advantage that with appropriate flags, less information has to be duplicated, so there is less memory allocation noise. During clone(), all of the pipe pages are filled up to, but excluding, the first saved RIP behind the expected RSP of the recursing process when it pauses in FUSE. Writing less would cause the second pipe write to clobber stack data that is used before RIP control is achieved, potentially crashing the kernel. Then, when the recursing process has paused in FUSE, a second write is performed on all pipes to overwrite the saved RIP and the data behind it with a completely attacker-controlled new stack.

[Figure: oob_stack_overwrite.png - the two pipe writes relative to the out-of-bounds stack]

At this point, the last line of defense should be KASLR. Ubuntu, as described on its Security Features page, supports KASLR on x86 and amd64 - but you have to enable it manually because doing so breaks hibernation. This incompatibility has been fixed very recently, so now distros should be able to turn on KASLR by default - even if the security benefit isn't huge, it probably makes sense to turn on the feature, since it doesn't really cost anything. Since most machines probably aren't configured to pass special parameters on the kernel command line, I'm assuming that KASLR, while compiled into the kernel, is inactive, and so the attacker knows the addresses of kernel text and static data.

Now that it is trivial to do arbitrary things in the kernel with ROP, there are two ways to continue with the exploit. You could e.g. use ROP to do the equivalent of commit_creds(prepare_kernel_cred(NULL)). I chose to go a different way. Note that during the stack overflow, the original KERNEL_DS value of addr_limit was preserved. Returning back through all the saved stackframes would eventually reset the addr_limit to USER_DS - but if we just return back to userspace directly, addr_limit will stay KERNEL_DS. So my exploit writes the following new stack, which is more or less a copy of the data at the top of the stack:

unsigned long new_stack[] = {
 0xffffffff818252f2, /* return pointer of syscall handler */
 /* 16 useless registers */
 0x1515151515151515, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 (unsigned long) post_corruption_user_code, /* user RIP */
 0x33, /* user CS */
 0x246, /* EFLAGS: most importantly, turn interrupts on */
 /* user RSP */
 (unsigned long) (post_corruption_user_stack + sizeof(post_corruption_user_stack)),
 0x2b /* user SS */
};
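
The last five entries of that array form the frame that is consumed when returning to userspace. As a reading aid, the constants can be decoded with a few helpers; the names below are made up for this sketch, while the bit positions are the standard x86 ones (interrupt flag in bit 9 of EFLAGS, requested privilege level in the low two bits of a selector).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define X86_EFLAGS_IF 0x200u /* bit 9: interrupt enable flag */

/* Hypothetical view of the five-word return frame at the end of new_stack. */
struct toy_iret_frame {
    uint64_t rip, cs, eflags, rsp, ss;
};

/* Nonzero if the flags value leaves interrupts enabled. */
int eflags_interrupts_enabled(uint64_t eflags)
{
    return (eflags & X86_EFLAGS_IF) != 0;
}

/* Requested privilege level: the low two bits of a segment selector. */
unsigned selector_rpl(uint16_t sel)
{
    return sel & 3u;
}

/* Descriptor table index: the selector without RPL and table-indicator bits. */
unsigned selector_index(uint16_t sel)
{
    return sel >> 3;
}

/* Nonzero if both code and stack selectors request ring 3. */
int frame_targets_user_mode(const struct toy_iret_frame *f)
{
    assert(f != NULL);
    return selector_rpl((uint16_t)f->cs) == 3 &&
           selector_rpl((uint16_t)f->ss) == 3;
}
```

Decoded this way, 0x33 and 0x2b are ring-3 selectors (indexes 6 and 5), and 0x246 has the interrupt flag set, matching the comments in the array.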

After the recursing process is resumed by killing the FUSE server process, it resumes execution at the post_corruption_user_code method. This method can then e.g. use pipes to write to arbitrary kernel addresses because the address check in copy_to_user() is disabled:

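#include <err.h>
#include <unistd.h>

/* Copy len bytes from buf to the kernel address addr by bouncing them through a pipe. */
void kernel_write(unsigned long addr, char *buf, size_t len) {
 int pipefds[2];
 if (pipe(pipefds))
   err(1, "pipe");
 if (write(pipefds[1], buf, len) != (ssize_t)len)
   errx(1, "pipe write");
 close(pipefds[1]);
 /* relies on addr_limit still being KERNEL_DS, so the destination is not range-checked */
 if (read(pipefds[0], (char*)addr, len) != (ssize_t)len)
   errx(1, "pipe read to kernelspace");
 close(pipefds[0]);
}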

Now you can comfortably perform arbitrary reads and writes from userland. If you want a root shell, you can e.g. overwrite the coredump handler, which is stored in a static variable, and then raise SIGSEGV to execute the coredump handler with root privileges:

 char *core_handler = "|/tmp/crash_to_root";
 kernel_write(0xffffffff81e87a60, core_handler, strlen(core_handler)+1);

Fixing the bug

The bug was fixed using two separate patches: 2f36db710093 disallows opening files without an mmap handler through ecryptfs, and just to be sure, e54ad7f1ee26 disallows stacking anything on top of procfs because there is a lot of magic going on in procfs and there really isn't any good reason to stack anything on top of it.

However, the reason why I wrote a full root exploit for this not exactly widely exploitable bug is that I wanted to demonstrate that Linux stack overflows can occur in very non-obvious ways, and even with the existing mitigations turned on, they're still exploitable. In my bug report, I asked the kernel security list to add guard pages to kernel stacks and remove the thread_info struct from the bottom of the stack to more reliably mitigate this bug class, similar to what other operating systems and grsecurity are already doing. Andy Lutomirski had actually already started working on this, and he has now published patches that add guard pages: https://lkml.org/lkml/2016/6/15/1064