Thursday, March 19, 2015

Taming the wild copy: Parallel Thread Corruption

Posted by Chris Evans, Winner of the occasional race

Back in 2002, a very interesting vulnerability was found and fixed in the Apache web server. Relating to a bug in chunked encoding handing, the vulnerability caused a memcpy() call with a negative length with the destination on the stack. Of course, many parties were quick to proclaim the vulnerability unexploitable beyond a denial of service. After all, a negative length to memcpy() represents a huge copy which is surely guaranteed to hit an unmapped page and terminate the process. So it was a surprise when a working remote code execution exploit turned up for FreeBSD and other BSD variants, due to a quirk in their memcpy() implementations! This GIAC paper archives both the exploit as well as an analysis (appendix D) of how it works.

For the purposes of this post, we define “wild copy” as a copy where both:

The attacker has limited control over the length of the copy.
The copy will always end up being so large as to be guaranteed to hit an unmapped page.

There have been plenty of other interesting cases where a wild copy has ended up being exploitable:

A general attack against corruption of structured exception handlers (SEH) on Windows.
The BSD variants don’t have a monopoly on memcpy() quirks. In 2011, there was a fascinating outright eglibc vulnerability in Ubuntu, triggering for any very large length to memcpy(). A memset() variant was exposed via the Chromium Vulnerability Rewards Program.
The Java Runtime Environment has a very complicated crash handler installed, which is called when a wild copy inevitably causes an access violation. If the crash handler is not careful to be robust in the presence of arbitrarily corrupted heap state, it might crash itself in a more controlled and exploitable way.

All of the above cases have one thing in common: the wild copy becomes exploitable due to a secondary issue or quirk.

Is it possible to exploit a wild copy without relying on a secondary issue? This is an interesting question for Project Zero to tackle -- when we explore exploitation, we usually only do so for cases where we think we can learn new insights or advance the public state of the art. Today, we’re going to explore a wild copy exploit in the Flash plug-in for 64-bit Chrome Linux.

It’s time to introduce issue 122 and “parallel thread corruption”. Issue 122 is an Adobe Flash vulnerability that was fixed back in November 2014. It’s a curious bug wherein the G711 sound codec is straight up broken on some platforms, always resulting in a wild copy when the codec tries to decode some samples. When triggered, it causes a memmove() on the heap with a small negative value as the length parameter. On modern Linux, the memmove() implementation appears to be solid, and the inexorable SIGSEGV appears to take the process down cleanly. We have our work cut out!

Step 1: choose an approach

We’re going to attempt a “parallel thread corruption”, which is where we observe and abuse the side effects in thread A of an asynchronous memory corruption in thread B. Since thread B’s corruption will always end in a crash, we need to run both threads in parallel. We also need to win a tricky race condition: get code execution before thread B’s crash takes out thread A.

As is common in Flash exploits, we’re going to attempt to have our memory corruption corrupt the header of a Vector.<uint> buffer object, expanding its length beyond reality thus leading to the ability to read and write out of bounds. Vector.<uint> buffer objects have their data allocated inline after the header, and in the case of this specific vulnerability, the errant memmove() always ends up shifting memory content backwards by 2048 bytes. So if we allocate a large enough Vector.<uint>, we can attempt to corrupt its header with its own data:

| len: 2000 | some ptr | data[0] = 0, data[1] = 0, …, data[508] = 0xfee1dead, … |

^ |

|-----------------------------------------------------------

The offsets and alignments work out such that the 508th 4-byte uint data element will clobber the length in the header.

Step 2: wrestle the Flash heap into submission

Before we get too carried away, we need to remember that the corruption thread is going to be in the middle of a wild memmove(). Even my laptop can copy memory at a rate of almost 10 gigabytes per second, so we’re going to need to land the copy on to a large runway of heap so that we have enough time to complete the exploit before the copy hits an unmapped page. Getting a large, contiguous heap mapping sounds easy but it is not. If you naively just spray the Flash heap with 512KB ByteArray buffers, you’ll see a large number of process mappings looking like this:

7fd138039000-7fd138679000 rw-p 00000000 00:00 0

7fd138679000-7fd139039000 ---p 00000000 00:00 0

7fd139039000-7fd139fb9000 rw-p 00000000 00:00 0

7fd139fb9000-7fd13a039000 ---p 00000000 00:00 0

7fd13a039000-7fd13ae79000 rw-p 00000000 00:00 0

7fd13ae79000-7fd13b039000 ---p 00000000 00:00 0

[...]

At first glance, as security professionals, we might proclaim “great! guard pages!” but what we are looking at here is an accidental inefficiency of the Flash heap. When trying to extend the limit of the heap forwards and failing due to another mapping being in the way, it lets the operating system decide where to put a new mapping. The default Linux heuristic happens to be “just before the last successful mapping”, so the Flash heap will enter a permanent cycle of failing to extend the heap forwards. The “guard pages” come from the fact that the Flash heap (on 64-bit) will reserve address space in 16MB chunks, and commit regions of that as needed, and with granularity of 128KB. It’s quite common for the series of commit requests to not add up exactly to 16MB, leaving a hole in the address space. Such a hole will cause a crash before our exploit is done!

To avoid all this and try and obtain a heap that grows forward, we follow the following heuristic:

Allocate a 1GB object, which will be placed just before the current heap region.
Spray 4KB allocations (the Flash heap block size) to fill in existing free blocks.
Spray some more 4KB allocations to cause the heap to extend backwards, into a new 16MB region allocated before the 1GB object.
Free the 1GB object.

If all has gone well, the heap will look something like this, with green representing the space available for uninterrupted linear growth forward:

Step 3: use heap grooming to align our exploitation objects

We’re now in good shape to use heap grooming to set up a row of important object allocations in a line, like this:

.----- wild copy -------------------------------------------------->

In red is the vector we decided to corrupt in step 1, and in green is an object we will deliberately free now that the heap is set up. Before we proceed, there’s one more heap quirk that we need to beware of: as the heap is extended, inline heap metadata is also extended to describe all the new 4KB blocks. This process of extending in-heap metadata creates free regions of blocks inside the heap. This wouldn’t be an issue, but for another quirk: when a block range is returned to the heap free list, it isn’t reallocated in MRU (most recently used) order. Therefore, we mount another heap spray to make sure all 8KB (2 block) heap chunks are allocated, leaving that particular freelist empty. We then free the middle 8KB buffer (in green above) to leave a heap free chunk in a useful place. Since we leave two allocated 8KB buffers either side, the freed chunk will not be coalesced and will be the chosen option for the next 8KB allocation.

After all that heap manipulation, we leave ourselves with a range of important objects lined up in a neat row, and with a preceding and free 8KB heap chunk primed to be re-used at the next opportunity.

Step 4: light the fuse

It’s worth noting that there are platform specific differences in how the sound subsystem initializes. On some platforms, the subsystem initializes and starts asynchronously when play() is called for the first time. For the Chrome Flash plug-in, this is not the case. If you look at the exploit, you’ll see that we have to kick off the subsystem by playing a different sound and then returning to the main event loop. This dance ensures that the second call to play() actually does kick off the wild copy asynchronously.

With our heap in nice shape, we trigger the wild memmove() by playing the G711 sound. The wild copy will start immediately and proceed asynchronously so we need to get a move on! The G711 sound decoder object is about 4424 bytes in size, but the heap allocation granularity for sizes >= 4096 is 4096. So, the object will be placed in an 8192-sized chunk -- specifically the one we just freed up if all is well. The wild heap corruption will start within this object and proceed forwards, very quickly corrupting all the following objects that we lined up, including the Vector.<uint> and Vector.<Object>.

The moment we’ve pulled the trigger, we spin in a tight ActionScript loop that waits for the length of our Vector.<uint> to change from 2000 to 0xfee1dead.

Step 5: defeat ASLR

The moment we detect the corrupted Vector.<uint>, we are in business: we can read and write arbitrary 4-byte values off-the-end of the object. As per the heap grooming in step 3, the object directly off the end is the buffer for a Vector.<Object>, which is ideal for us because it contains two important things for us to read:

A vtable value into the Flash library, which gives us the location of the code.
A pointer to itself, which is very important because it allows us to turn a relative read/write into an absolute read/write.

Step 6: ROP-free code execution

The exploit proceeds not via a ROP chain, but by rewiring some function pointers in the GOT / PLT machinery. (Note that this relies on RELRO not being present, which it rarely is on common Linux variants and binaries. If RELRO were present, we’d likely simply target vtable pointer rewiring instead.) We have the address of a vtable, and from this, the GOT / PLT function pointers we want are a fixed offset away. There is quite an elegant and simple sequence to get us code execution:

Read the function pointer for memmove(), which we know is resolved.
Calculate the address of __libc_system, which is a fixed offset from __memmove_ssse3_back which we just read.
Write the address of __libc_system to the function pointer for munmap().

The effect of this wiring is that when the program next calls munmap(buffer), it will actually be calling system(buffer). This is a useful primitive because large ByteArray buffers are backed by mmap() / munmap(), and we can control the exact content of the buffer and when it is freed. So we can pass any string we want to system() at will.

Step 7: continuation of execution and sandbox

To illustrate the level of control we now have over the process, our payload to system() is “killall -STOP chrome; gnome-calculator -e 31337”. This has two effects:

Displays the obligatory calculator to prove we have arbitrary code execution.
Sends a STOP signal to the existing Chrome processes. This will have the effect of stopping the wild memmove() before it has a chance to crash.

As a fun data point, if we attach to the stopped process and look at the wild copy thread, it has managed to copy 50MB(!) of data, even though we tried to make our exploit as fast and simple as possible.

At this point, a real exploit that didn’t wish to leave any traces could easily attach to the thread in the middle of the copy and stop the copy without having to kill the thread. Furthermore, the memory corrupted in the process of exploitation could be easily restored.

Of course, the sandbox escape is left as an exercise to the user. The system() library call won’t work at all in the Chrome sandbox; when demonstrating the exploit, we’re using the --no-sandbox flag to focus purely on the remote code execution aspect. An exploit inside the sandbox would instead likely rewire function pointers to attack the IPC channels with bad messages, or deploy a ROP chain after all to attack the kernel.

Closing notes

As always, Project Zero recommends that all memory corruptions are initially assumed to be exploitable. As we showed before in our posts on the Pwn4Fun Safari exploit, the glibc off-by-one NUL byte corruption and the Flash regex exploit, situations that initially look dire for exploitation can sometimes be turned around with a trick or two.

Wild copies have a long history of being declared unexploitable, but based on our findings here, we think it likely that most wild copies in complicated multi-threaded software (browsers, browser plug-ins) are likely to be exploitable using the parallel thread corruption method. We publish this finding to encourage developers and vendors to triage wild copies as more serious than previously believed.

Monday, March 9, 2015

Exploiting the DRAM rowhammer bug to gain kernel privileges

Rowhammer blog post (draft)

Posted by Mark Seaborn, sandbox builder and breaker, with contributions by Thomas Dullien, reverse engineer

[This guest post continues Project Zero’s practice of promoting excellence in security research on the Project Zero blog]

Overview

“Rowhammer” is a problem with some recent DRAM devices in which repeatedly accessing a row of memory can cause bit flips in adjacent rows. We tested a selection of laptops and found that a subset of them exhibited the problem. We built two working privilege escalation exploits that use this effect. One exploit uses rowhammer-induced bit flips to gain kernel privileges on x86-64 Linux when run as an unprivileged userland process. When run on a machine vulnerable to the rowhammer problem, the process was able to induce bit flips in page table entries (PTEs). It was able to use this to gain write access to its own page table, and hence gain read-write access to all of physical memory.

We don’t know for sure how many machines are vulnerable to this attack, or how many existing vulnerable machines are fixable. Our exploit uses the x86 CLFLUSH instruction to generate many accesses to the underlying DRAM, but other techniques might work on non-x86 systems too.

We expect our PTE-based exploit could be made to work on other operating systems; it is not inherently Linux-specific. Causing bit flips in PTEs is just one avenue of exploitation; other avenues for exploiting bit flips can be practical too. Our other exploit demonstrates this by escaping from the Native Client sandbox.

Introduction to the rowhammer problem

We learned about the rowhammer problem from Yoongu Kim et al’s paper, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors” (Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, Onur Mutlu).

They demonstrate that, by repeatedly accessing two “aggressor” memory locations within the process’s virtual address space, they can cause bit flips in a third, “victim” location. The victim location is potentially outside the virtual address space of the process — it is in a different DRAM row from the aggressor locations, and hence in a different 4k page (since rows are larger than 4k in modern systems).

This works because DRAM cells have been getting smaller and closer together. As DRAM manufacturing scales down chip features to smaller physical dimensions, to fit more memory capacity onto a chip, it has become harder to prevent DRAM cells from interacting electrically with each other. As a result, accessing one location in memory can disturb neighbouring locations, causing charge to leak into or out of neighbouring cells. With enough accesses, this can change a cell’s value from 1 to 0 or vice versa.

The paper explains that this tiny snippet of code can cause bit flips:

code1a:
  mov (X), %eax  // Read from address X
  mov (Y), %ebx  // Read from address Y
  clflush (X)  // Flush cache for address X
  clflush (Y)  // Flush cache for address Y
  jmp code1a

Two ingredients are required for this routine to cause bit flips:

Address selection: For code1a to cause bit flips, addresses X and Y must map to different rows of DRAM in the same bank.

Some background: Each DRAM chip contains many rows of cells. Accessing a byte in memory involves transferring data from the row into the chip’s “row buffer” (discharging the row’s cells in the process), reading or writing the row buffer’s contents, and then copying the row buffer’s contents back to the original row’s cells (recharging the cells).

It is this process of “activating” a row (discharging and recharging it) that can disturb adjacent rows. If this is done enough times, in between automatic refreshes of the adjacent rows (which usually occur every 64ms), this can cause bit flips in the adjacent rows.

The row buffer acts as a cache, so if addresses X and Y point to the same row, then code1a will just read from the row buffer without activating the row repeatedly.

Furthermore, each bank of DRAM has its own notion of a “currently activated row”. So if addresses X and Y point to different banks, code1a will just read from those banks’ row buffers without activating rows repeatedly. (Banks are groups of DRAM chips whose rows are activated in lockstep.)

However, if X and Y point to different rows in the same bank, code1a will cause X and Y’s rows to be repeatedly activated. This is termed “row hammering”.
Bypassing the cache: Without code1a’s CLFLUSH instructions, the memory reads (MOVs) will be served from the CPU’s cache. Flushing the cache using CLFLUSH forces the memory accesses to be sent to the underlying DRAM, which is necessary to cause the rows to be repeatedly activated.

Note that the paper’s version of code1a also includes an MFENCE instruction. However, we found that using MFENCE was unnecessary and actually reduced the number of bit flips we saw. Yoongu Kim’s modified memtest also omits the MFENCE from its row hammering code.

Refining the selection of addresses to hammer

Using the physical address mapping

How can we pick pairs of addresses that satisfy the “different row, same bank” requirements?

One possibility is to use knowledge of how the CPU’s memory controller maps physical addresses to DRAM’s row, column and bank numbers, along with knowledge of either:

The absolute physical addresses of memory we have access to. Linux allows this via /proc/PID/pagemap.
The relative physical addresses of memory we have access to. Linux can allow this via its support for “huge pages”, which cover 2MB of contiguous physical address space per page. Whereas a normal 4k page is smaller than a typical DRAM row, a 2MB page will typically cover multiple rows, some of which will be in the same bank.

Yoongu Kim et al take this approach. They pick Y = X + 8MByte based on knowledge of the physical address mapping used by the memory controllers in Intel and AMD’s CPUs.

Random address selection

The CPU’s physical address mapping can be difficult to determine, though, and features such as /proc/PID/pagemap and huge pages are not available everywhere. Furthermore, if our guesses about the address mapping are wrong, we might pick an offset that pessimises our chances of successful row hammering. (For example, Y = X + 8 kByte might always give addresses in different banks.)

A simpler approach is to pick address pairs at random. We allocate a large block of memory (e.g. 1GB) and then pick random virtual addresses within that block. On a machine with 16 DRAM banks (as one of our test machines has: 2 DIMMs with 8 banks per DIMM), this gives us a 1/16 chance that the chosen addresses are in the same bank, which is quite high. (The chance of picking two addresses in the same row is negligible.)

Furthermore, we can increase our chances of successful row hammering by modifying code1a to hammer more addresses per loop iteration. We find we can hammer 4 or 8 addresses without slowing down the time per iteration.

Selecting addresses using timing

Another way to determine whether a pair of addresses has the “different row, same bank” property would be to time uncached accesses to those addresses using a fine-grained timer such as the RDTSC instruction. The access time will be slower for pairs that satisfy this property than those that don’t.

Double-sided hammering

We have found that we can increase the chances of getting bit flips in row N by row-hammering both of its neighbours (rows N-1 and N+1), rather than by hammering one neighbour and a more-distant row. We dub this “double-sided hammering”.

For many machines, double-sided hammering is the only way of producing bit flips in reasonable time. For machines where random selection is already sufficient to cause bit flips, double-sided hammering can lead to a vastly increased number of bits flipped. We have observed 25+ bits flipped in one row on one particularly fragile machine.

Performing double-sided hammering is made more complicated by the underlying memory geometry. It requires the attacker to know or guess what the offset will be, in physical address space, between two rows that are in the same bank and are adjacent. Let’s call this the “row offset”.

From our testing, we were able to naively extrapolate that the row offset for laptop Model #4 (see the table below) is 256k. We did this by observing the likelihood of bit flips relative to the distance the selected physical memory pages had from the victim page. This likelihood was maximized when we hammered the locations 256k below and above a given target row.

This “256k target memory area, 256k victim memory area, 256k target memory area” setup has shown itself to be quite effective on other laptops by the same vendor. It is likely that this setup needs to be tweaked for other vendors.

This 256k row offset could probably be explained as being a product of the row size (number of columns), number of banks, number of channels, etc., of the DRAM in this machine, though this requires further knowledge of how the hardware maps physical addresses to row and bank numbers.

Doing double-sided hammering does require that we can pick physically-contiguous pages (e.g. via /proc/PID/pagemap or huge pages).

Exploiting rowhammer bit flips

Yoongu Kim et al say that “With some engineering effort, we believe we can develop Code 1a into a disturbance attack that … hijacks control of the system”, but say that they leave this research task for the future. We took on this task!

We found various machines that exhibit bit flips (see the experimental results below). Having done that, we wrote two exploits:

The first runs as a Native Client (NaCl) program and escalates privilege to escape from NaCl’s x86-64 sandbox, acquiring the ability to call the host OS’s syscalls directly. We have mitigated this by changing NaCl to disallow the CLFLUSH instruction. (I picked NaCl as the first exploit target because I work on NaCl and have written proof-of-concept NaCl sandbox escapes before.)
The second runs as a normal x86-64 process on Linux and escalates privilege to gain access to all of physical memory. This is harder to mitigate on existing machines.

NaCl sandbox escape

Native Client is a sandboxing system that allows running a subset of x86-64 machine code (among other architectures) inside a sandbox. Before running an x86-64 executable, NaCl uses a validator to check that its code conforms to a subset of x86 instructions that NaCl deems to be safe.

However, NaCl assumes that the hardware behaves correctly. It assumes that memory locations don’t change without being written to! NaCl’s approach of validating machine code is particularly vulnerable to bit flips, because:

A bit flip in validated code can turn a safe instruction sequence into an unsafe one.
Under NaCl, the sandboxed program’s code segment is readable by the program. This means the program can check whether a bit flip has occurred and determine whether or how it can exploit the change.

Our exploit targets NaCl’s instruction sequence for sandboxed indirect jumps, which looks like this:

  andl $~31, %eax  // Truncate address to 32 bits and mask to be 32-byte-aligned.
  addq %r15, %rax  // Add %r15, the sandbox base address.
  jmp *%rax  // Indirect jump.

The exploit works by triggering bit flips in that code sequence. It knows how to exploit 13% of the possible bit flips. Currently it only handles bit flips that modify register numbers. (With more work, it could handle more exploitable cases, such as opcode changes.) For example, if a bit flip occurs in bit 0 of the register number in “jmp *%rax”, this morphs to “jmp *%rcx”, which is easily exploitable — since %rcx is unconstrained, this allows jumping to any address. Normally NaCl only allows indirect jumps to 32-byte-aligned addresses (and it ensures that instructions do not cross 32-byte bundle boundaries). Once a program can jump to an unaligned address, it can escape the sandbox, because it is possible to hide unsafe x86 instructions inside safe ones. For example:

  20ea0:       48 b8 0f 05 eb 0c f4 f4 f4 f4    movabs $0xf4f4f4f40ceb050f,%rax

This hides a SYSCALL instruction (0f 05) at address 0x20ea2.

Our NaCl exploit does the following:

It fills the sandbox’s dynamic code area with 250MB of NaClized indirect jump instruction sequences using NaCl’s dyncode_create() API.
In a loop:
- It row-hammers the dynamic code area using CLFLUSH, picking random pairs of addresses.
- It searches the dynamic code area for bit flips. If it sees an exploitable bit flip, it uses it to jump to shell code hidden inside NaCl-validated instructions. Otherwise, if the bit flip isn’t exploitable, it continues.

We have mitigated this by changing NaCl’s x86 validator to disallow the CLFLUSH instruction (tracked by CVE-2015-0565). However, there might be other ways to cause row hammering besides CLFLUSH (see below).

Prior to disallowing CLFLUSH in NaCl, it may have been possible to chain this NaCl exploit together with the kernel privilege escalation below so that a NaCl app in the Chrome Web Store app could gain kernel privileges, using just one underlying hardware bug for the whole chain. To our knowledge there was no such app in the Chrome Web Store. PNaCl — which is available on the open web — has an extra layer of protection because an attacker would have had to find an exploit in the PNaCl translator before being able to emit a CLFLUSH instruction.

Kernel privilege escalation

Our kernel privilege escalation works by using row hammering to induce a bit flip in a page table entry (PTE) that causes the PTE to point to a physical page containing a page table of the attacking process. This gives the attacking process read-write access to one of its own page tables, and hence to all of physical memory.

There are two things that help ensure that the bit flip has a high probability of being exploitable:

Rowhammer-induced bit flips tend to be repeatable. This means we can tell in advance if a DRAM cell tends to flip and whether this bit location will be useful for the exploit.

For example, bit 51 in a 64-bit word is the top bit of the physical page number in a PTE on x86-64. If this changes from 0 to 1, that will produce a page number that's bigger than the system's physical memory, which isn't useful for our exploit, so we can skip trying to use this bit flip. However, bit 12 is the bottom bit of the PTE's physical page number. If that changes from 0 to 1 or from 1 to 0, the PTE will still point to a valid physical page.
We spray most of physical memory with page tables. This means that when a PTE's physical page number changes, there's a high probability that it will point to a page table for our process.

We do this spraying by mmap()ing the same file repeatedly. This can be done quite quickly: filling 3GB of memory with page tables takes about 3 seconds on our test machine.

There are two caveats:

Our exploit runs in a normal Linux process. More work may be required for this to work inside a sandboxed Linux process (such as a Chromium renderer process).
We tested on a machine with low memory pressure. Making this work on a heavily-loaded machine may involve further work.

Break it down: exploit steps

The first step is to search for aggressor/victim addresses that produce useful bit flips:

mmap() a large block of memory.
Search this block for aggressor/victim addresses by row-hammering random address pairs. Alternatively, we can use aggressor/victim physical addresses that were discovered and recorded on a previous run; we use /proc/self/pagemap to search for these in memory.
If we find aggressor/victim addresses where the bit flipped within the 64-bit word isn’t useful for the exploit, just skip that address set.
Otherwise, munmap() all but the aggressor and victim pages and begin the exploit attempt.

In preparation for spraying page tables, we create a file in /dev/shm (a shared memory segment) that we will mmap() repeatedly. (See later for how we determine its size.) We write a marker value at the start of each 4k page in the file so that we can easily identify these pages later, when checking for PTE changes.

Note that we don’t want these data pages to be allocated from sequential physical addresses, because then flips in the lower bits of physical page numbers would tend to be unexploitable: A PTE pointing to one data page would likely change to pointing to another data page.

To avoid that problem, we first deliberately fragment physical memory so that the kernel’s allocations from physical memory are randomised:

mmap() (with MAP_POPULATE) a block of memory that’s a large fraction of the machine’s physical memory size.
Later, whenever we do something that will cause the kernel to allocate a 4k page (such as a page table), we release a page from this block using madvise() + MADV_DONTNEED.

We are now ready to spray memory with page tables. To do this, we mmap() the data file repeatedly:

We want each mapping to be at a 2MB-aligned virtual address, since each 4k page table covers a 2MB region of virtual address space. We use MAP_FIXED for this.
We cause the kernel to populate some of the PTEs by accessing their corresponding pages. We only need to populate one PTE per page table: We know our bit flip hits the Nth PTE in a page table, so, for speed, we only fault in the Nth 4k page in each 2MB chunk.
Linux imposes a limit of about 2^16 on the number of VMAs (mmap()’d regions) a process can have. This means that our /dev/shm data file must be large enough such that, when mapped 2^16 times, the mappings create enough page tables to fill most of physical memory. At the same time, we want to keep the data file as small as possible so as not to waste memory that could instead be filled with page tables. We pick its size accordingly.
In the middle of this, we munmap() the victim page. With a high probability, the kernel will reuse this physical page as a page table. We can’t touch this page directly any more, but we can potentially modify it via row hammering.

Having finished spraying, it’s hammer time. We hammer the aggressor addresses. Hopefully this induces the bit flip in the victim page. We can’t observe the bit flip directly (unlike in the NaCl exploit).

Now we can check whether PTEs changed exploitably. We scan the large region we mapped to see whether any of the PTEs now point to pages other than our data file. Again, for speed, we only need to check the Nth page within each 2MB chunk. We can check for the marker value we wrote earlier. If we find no marker mismatches, our attempt failed (and we could retry).

If we find a marker mismatch, then we have gained illicit access to a physical page. Hopefully this is one of the page tables for our address space. If we want to be careful, we can verify whether this page looks like one of our page tables. The Nth 64-bit field should look like a PTE (certain bits will be set or unset) and the rest should be zero. If not, our attempt failed (and we could retry).

At this point, we have write access to a page table, probably our own. However, we don’t yet know which virtual address this is the page table for. We can determine that as follows:

Write a PTE to the page (e.g. pointing to physical page 0).
Do a second scan of address space to find a second virtual page that now points to somewhere other than our data file. If we don’t find it, our attempt failed (and we could retry).

Exploiting write access to page tables

We now have write access to one of our process’s page tables. By modifying the page table, we can get access to any page in physical memory. We now have many options for how to exploit that, varying in portability, convenience and speed. The portable options work without requiring knowledge of kernel data structures. Faster options work in O(1) time, whereas slower options might require scanning all of physical memory to locate a data structure.

Some options are:

Currently implemented option: Modify a SUID-root executable such as /bin/ping, overwriting its entry point with our shell code, and then run it. Our shell code will then run as root. This approach is fast and portable, but it does require access to /proc/PID/pagemap: We load /bin/ping (using open() and mmap()+MAP_POPULATE) and query which physical pages it was loaded into using /proc/self/pagemap.
A similar approach is to modify a library that a SUID executable uses, such as /lib64/ld-linux-x86-64.so.2. (On some systems, SUID executables such as /bin/ping can’t be open()’d because their permissions have been locked down.)
Other, less portable approaches are to modify kernel code or kernel data structures.
- We could modify our process’s UID field. This would require locating the “struct cred” for the current process and knowing its layout.
- We could modify the kernel’s syscall handling code. We can quickly determine its physical address using the SIDT instruction, which is exposed to unprivileged code.

Routes for causing row hammering

Our proof-of-concept exploits use the x86 CLFLUSH instruction, because it’s the easiest way to force memory accesses to be sent to the underlying DRAM and thus cause row hammering.

The fact that CLFLUSH is usable from unprivileged code is surprising, because the number of legitimate uses for it outside of a kernel or device driver is probably very small. For comparison, ARM doesn’t have an unprivileged cache-flush instruction. (ARM Linux does have a cacheflush() syscall, used by JITs, for synchronising instruction and data caches. On x86, the i-cache and d-cache are synchronised automatically, so CLFLUSH isn’t needed for this purpose.)

We have changed NaCl’s x86 validator to disallow CLFLUSH. Unfortunately, kernels can’t disable CLFLUSH for normal userland code. Currently, CLFLUSH can’t be intercepted or disabled, even using VMX (x86 virtualisation). (For example, RDTSC can be intercepted without VMX support. VMX allows intercepting more instructions, including WBINVD and CPUID, but not CLFLUSH.) There might be a case for changing the x86 architecture to allow CLFLUSH to be intercepted. From a security engineering point of view, removing unnecessary attack surface is good practice.

However, there might be ways of causing row hammering without CLFLUSH, which might work on non-x86 architectures too:

Normal memory accesses: Is it possible that normal memory accesses, in sufficient quantity or in the right pattern, can trigger enough cache misses to cause rowhammer-induced bit flips? This would require generating cache misses at every cache level (L1, L2, L3, etc.). Whether this is feasible could depend on the associativity of these caches.

If this is possible, it would be a serious problem, because it might be possible to generate bit flips from JavaScript code on the open web, perhaps via JavaScript typed arrays.
Non-temporal memory accesses: On x86, these include non-temporal stores (MOVNTI, MOVNTQ, MOVNTDQ(A), MOVNTPD, MOVNTSD and MOVNTSS) and non-temporals reads (via prefetches — PREFETCHNTA).
Atomic memory accesses: Some reports claim that non-malicious use of spinlocks can cause row hammering, although the reports have insufficient detail and we’ve not been able to verify this. (See “The Known Failure Mechanism in DDR3 memory called ‘Row Hammer’”, Barbara Aichinger.) This seems unlikely on a multi-core system where cores share the highest-level cache. However, it might be possible on multi-socket systems where some pairs of cores don’t share any cache.
Misaligned atomic memory accesses: x86 CPUs guarantee that instructions with a LOCK prefix access memory atomically, even if the address being accessed is misaligned, and even if it crosses a cache line boundary. (See section 8.1.2.2, “Software Controlled Bus Locking”, in Intel’s architecture reference, which says “The integrity of a bus lock is not affected by the alignment of the memory field”.) This is done for backwards compatibility. In this case, the CPU doesn’t use modern cache coherency protocols for atomicity. Instead, the CPU falls back to the older mechanism of locking the bus, and we believe it might use uncached memory accesses. (On some multi-CPU-socket NUMA machines, this locking is implemented via the QPI protocol rather than via a physical #LOCK pin.)

If misaligned atomic ops generate uncached DRAM accesses, they might be usable for row hammering.

Initial investigation suggests that these atomic ops do bypass the cache, but that they are too slow for this to generate enough memory accesses, within a 64ms refresh period, to generate bit flips.
Uncached pages: For example, Windows’ CreateFileMapping() API has a SEC_NOCACHE flag for requesting a non-cacheable page mapping.
Other OS interfaces: There might be cases in which kernels or device drivers, such as GPU drivers, do uncached memory accesses on behalf of userland code.

Experimental results

We tested a selection of x86 laptops that were readily available to us (all with non-ECC memory) using CLFLUSH with the “random address selection” approach above. We found that a large subset of these machines exhibited rowhammer-induced bit flips. The results are shown in the table below.

The testing was done using the rowhammer-test program available here:
https://github.com/google/rowhammer-test

Note that:

Our sample size was not large enough that it can be considered representative.
A negative result (an absence of bit flips) on a given machine does not definitively mean that it is not possible for rowhammer to cause bit flips on that machine. We have not performed enough testing to determine that a given machine is not vulnerable.

As a result, we have decided to anonymize our results below.

All of the machines tested used DDR3 DRAM. It was not possible to identify the age of the DRAM in all cases.

	Laptop model	Laptop year	CPU family (microarchitecture)	DRAM manufacturer	Saw bit flip
1	Model #1	2010	Family V	DRAM vendor E	yes
2	Model #2	2011	Family W	DRAM vendor A	yes
3	Model #2	2011	Family W	DRAM vendor A	yes
4	Model #2	2011	Family W	DRAM vendor E	no
5	Model #3	2011	Family W	DRAM vendor A	yes
6	Model #4	2012	Family W	DRAM vendor A	yes
7	Model #5	2012	Family X	DRAM vendor C	no
8	Model #5	2012	Family X	DRAM vendor C	no
9	Model #5	2013	Family X	DRAM vendor B	yes
10	Model #5	2013	Family X	DRAM vendor B	yes
11	Model #5	2013	Family X	DRAM vendor B	yes
12	Model #5	2013	Family X	DRAM vendor B	yes
13	Model #5	2013	Family X	DRAM vendor B	yes
14	Model #5	2013	Family X	DRAM vendor B	yes
15	Model #5	2013	Family X	DRAM vendor B	yes
16	Model #5	2013	Family X	DRAM vendor B	yes
17	Model #5	2013	Family X	DRAM vendor C	no
18	Model #5	2013	Family X	DRAM vendor C	no
19	Model #5	2013	Family X	DRAM vendor C	no
20	Model #5	2013	Family X	DRAM vendor C	no
21	Model #5	2013	Family X	DRAM vendor C	yes
22	Model #5	2013	Family X	DRAM vendor C	yes
23	Model #6	2013	Family Y	DRAM vendor A	no
24	Model #6	2013	Family Y	DRAM vendor B	no
25	Model #6	2013	Family Y	DRAM vendor B	no
26	Model #6	2013	Family Y	DRAM vendor B	no
27	Model #6	2013	Family Y	DRAM vendor B	no
28	Model #7	2012	Family W	DRAM vendor D	no
29	Model #8	2014	Family Z	DRAM vendor A	no

We also tested some desktop machines, but did not see any bit flips on those. That could be because they were all relatively high-end machines with ECC memory. The ECC could be hiding bit flips.

Testing your own machine

Users may wish to test their own machines using the rowhammer-test tool above. If a machine produces bit flips during testing, users may wish to adjust security and trust decisions regarding the machine accordingly.

While an absence of bit flips during testing on a given machine does not automatically imply safety, it does provide some baseline assurance that causing bit flips is at least difficult on that machine.

Mitigations

Targeted refreshes of adjacent rows

Some schemes have been proposed for preventing rowhammer-induced bit flips by changing DRAM, memory controllers, or both.

A system could ensure that, within a given refresh period, it does not activate any given row too many times without also ensuring that neighbouring rows are refreshed. Yoongu Kim et al discuss this in their paper. They refer to proposals “to maintain an array of counters” (either in the memory controller or in DRAM) for counting activations. The paper proposes an alternative, probabilistic scheme called “PARA”, which is stateless and thus does not require maintaining counters.

There are signs that some newer hardware implements mitigations:

JEDEC’s recently-published LPDDR4 standard for DRAM (where “LP” = “Low Power”) specifies two rowhammer mitigation features that a memory controller would be expected to use. (See JEDEC document JESD209-4 — registration is required to download specs from the JEDEC site, but it’s free.)
- “Targeted Row Refresh” (TRR) mode, which allows the memory controller to ask the DRAM device to refresh a row’s neighbours.
- A “Maximum Activate Count” (MAC) metadata field, which specifies how many activations a row can safely endure before its neighbours need refreshing.
(The LPDDR4 spec does not mention “rowhammer” by name, but it does use the term “victim row”.)
We found that at least one DRAM vendor indicates, in their public data sheets, that they implement rowhammer mitigations internally within a DRAM device, requiring no special memory controller support.

Some of the newer models of laptops that we tested did not exhibit bit flips. A possible explanation is that these laptops implement some rowhammer mitigations.

BIOS updates and increasing refresh rates

Have hardware vendors silently rolled out any BIOS updates to mitigate the rowhammer problem by changing how the BIOS configures the CPU’s memory controller?

As an experiment, we measured the time required to cause a bit flip via double-sided hammering on one Model #4 laptop. This ran in the “less than 5 minutes” range. Then we updated the laptop’s BIOS to the latest version and re-ran the hammering test.

We initially thought this BIOS update had fixed the issue. However, after almost 40 minutes of sequentially hammering memory, some locations exhibited bit flips.

We conjecture that the BIOS update increased the DRAM refresh rate, making it harder — but not impossible — to cause enough disturbance between DRAM refresh cycles. This fits with data from Yoongu Kim et al’s paper (see Figure 4) which shows that, for some DRAM modules, a refresh period of 32ms is not short enough to reduce the error rate to zero.

We have not done a wider test of BIOS updates on other laptops.

Monitoring for row hammering using perf counters

It might be possible to detect row hammering attempts using CPUs’ performance counters. In order to hammer an area of DRAM effectively, an attacker must generate a large number of accesses to the underlying DRAM in a short amount of time. Whether this is done using CLFLUSH or using only normal memory accesses, it will generate a large number of cache misses.

Modern CPUs provide mechanisms that allow monitoring of cache misses for purposes of performance analysis. These mechanisms can be repurposed by a defender to monitor the system for sudden bursts of cache misses, as truly cache-pessimal access patterns appear to be rare in typical laptop and desktop workloads. By measuring “time elapsed per N cache misses” and monitoring for abnormal changes, we have been able to detect aggressive hammering even on systems that were running under a heavy load (a multi-core Linux kernel compile) during the attack. Unfortunately, while detection seems possible for aggressive hammering, it is unclear what to do in response, and unclear how common false positives will be.

While it is likely that attackers can adapt their attacks to evade such monitoring, this would increase the required engineering effort, making this monitoring somewhat comparable to an intrusion detection system.

On disclosures

The computing industry (of which Google is a part) is accustomed to security bugs in software. It has developed an understanding of the importance of public discussion and disclosure of security issues. Through these public discussions, it has developed a better understanding of when bugs have security implications. Though the industry is less accustomed to hardware bugs, hardware security can benefit from the same processes of public discussion and disclosure.

With this in mind, we can draw two lessons:

Exploitability of the bug: Looking backward, had there been more public disclosures about the rowhammer problem, it might have been identified as an exploitable security issue sooner. It appears that vendors have known about rowhammer for a while, as shown by the presence of rowhammer mitigations in LPDDR4. It may be that vendors only considered rowhammer to be a reliability problem.
Evaluating machines: Looking forward, the release of more technical information about rowhammer would aid evaluation of which machines are vulnerable and which are not. At the time of writing, it is difficult to tell which machines are definitely safe from rowhammer. Testing can show that a machine is vulnerable, but not that it is invulnerable.

We explore these two points in more detail below.

On exploitability of bugs

Vendors may have considered rowhammer to be only a reliability issue, and assumed that it is too difficult to exploit. None of the public material we have seen on rowhammer (except for the paper by Yoongu Kim et al) discusses security implications.

However, many bugs that appear to be difficult to exploit have turned out to be exploitable. These bugs might initially appear to be “only” reliability issues, but are really security issues.

An extreme example of a hard-to-exploit bug is described in a recent Project Zero blog post (see “The poisoned NUL byte, 2014 edition”). This shows how an off-by-one NUL byte overwrite could be exploited to gain root privileges from a normal user account.

To many security researchers, especially those who practice writing proof-of-concept exploits, it is well known that bit flips can be exploitable. For example, a 2003 paper explains how to use random bit flips to escape from a Java VM. (See “Using Memory Errors to Attack a Virtual Machine” by Sudhakar Govindavajhala and Andrew W. Appel.)

Furthermore, as we have shown, rowhammer-induced bit flips are sometimes more easily exploitable than random bit flips, because they are repeatable.

On vulnerability of machines

We encourage vendors to publicly release information about past, current and future devices so that security researchers, and the public at large, can evaluate them with reference to the rowhammer problem.

The following information would be helpful:

For each model of DRAM device:
- Is the DRAM device susceptible to rowhammer-induced bit flips at the physical level?
- What rowhammer mitigations does the DRAM device implement? Does it implement TRR and MAC? Does it implement mitigations that require support from the memory controller, or internal mitigations that don't require this?
For each model of CPU:
- What mitigations does the CPU's memory controller implement? Do these mitigations require support from the DRAM devices?
- Is there public documentation for how to program the memory controller on machine startup?
- Is it possible to read or write the memory controller's settings after startup, to verify mitigations or enable mitigations?
- What scheme does the memory controller use for mapping physical addresses to DRAM row, bank and column numbers? This is useful for determining which memory access patterns can cause row hammering.
For each BIOS: What rowhammer mitigations does the BIOS enable in the CPU's memory controller settings? For example, does the BIOS enable a double refresh rate, or enable use of TRR? Is it possible to review this?

At the time of writing, we weren't able to find publicly available information on the above in most cases.

If more of this information were available, it would be easier to assess which machines are vulnerable. It would be easier to evaluate a negative test result, i.e. the absence of bit flips during testing. We could explain that a negative result for a machine is because (for example) its DRAM implements mitigations internally, or because its DRAM isn't susceptible at the physical level (because it was manufactured using an older process), or because its BIOS enables 2x refresh. Such an explanation would give us more confidence that the negative test result occurred not because our end-to-end testing was insufficient in some way, but because the machine is genuinely not vulnerable to rowhammer.

We expect researchers will be interested in evaluating the details of rowhammer mitigation algorithms. For example, does a device count row activations (as the MAC scheme suggests they should), or does it use probabilistic methods like PARA? Will the mitigations be effective against double-sided row hammering as well as single-sided hammering? Could there be any problems if both the DRAM device and memory controller independently implement their own rowhammer mitigations?

Conclusion

We have shown two ways in which the DRAM rowhammer problem can be exploited to escalate privileges. History has shown that issues that are thought to be “only” reliability issues often have significant security implications, and the rowhammer problem is a good example of this. Many layers of software security rest on the assumption the contents of memory locations don't change unless the locations are written to.

The public discussion of software flaws and their exploitation has greatly expanded our industry’s understanding of computer security in past decades, and responsible software vendors advise users when their software is vulnerable and provide updates. Though the industry is less accustomed to hardware bugs than to software bugs, we would like to encourage hardware vendors to take the same approach: thoroughly analyse the security impact of “reliability” issues, provide explanations of impact, offer mitigation strategies and — when possible — supply firmware or BIOS updates. Such discussion will lead to more secure hardware, which will benefit all users.

Credits

Matthew Dempsky proposed that bit flips in PTEs could be an effective route for exploiting rowhammer.
Thomas Dullien helped with investigating how many machines are affected, came up with double-sided hammering, ran the BIOS upgrade experiment, and helped fill in the details of the PTE bit flipping exploit.