Project Zero: MTE As Implemented, Part 1: Implementation Testing

By Mark Brand, Project Zero

Background

In 2018, in the v8.5a version of the ARM architecture, ARM proposed a hardware implementation of tagged memory, referred to as MTE (Memory Tagging Extensions).

Through mid-2022 and early 2023, Project Zero had access to pre-production hardware implementing this instruction set extension to evaluate the security properties of the implementation. In particular, we're interested in whether it's possible to use this instruction set extension to implement effective security mitigations, or whether its use is limited to debugging/fault detection purposes.

As of the v8.5a specification, MTE can operate in two distinct modes, which are switched between on a per-thread basis. The first mode is sync-MTE, where tag-check failure on a memory access will cause the instruction performing the access to deliver a fault at retirement. The second mode is async-MTE, where tag-check failure does not directly (at the architectural level) cause a fault. Instead, tag-check failure will cause the setting of a per-core flag, which can then be polled from the kernel context to detect when an invalid access has occurred.

This blog post documents the tests that we have performed so far, and the conclusions that we've drawn from them, together with the code necessary to repeat these tests. This testing was intended to explore both the details of the hardware implementation of MTE, and the current state of the software support for MTE in the Linux kernel. All of the testing is based on manually implemented tagging in statically-linked standalone binaries, so it should be easy to reproduce these results on any compatible hardware.

Terminology

When designing and implementing security features, it's important to be conscious of the specific protection goal. In order to provide clarity in the rest of this post, we'll define some specific terminology that we use when talking about this:

Mitigation - A mitigation is something that reduces real exploitability of a vulnerability or class of vulnerability. The expectation is that attackers can (and eventually will) find their way around it. Examples would be DEP, ASLR, CFI.

"Soft" Mitigation - We consider a mitigation to be "soft" if the expectation is that an attacker pays a one-time cost in order to bypass the mitigation. Typically this would be a per-target cost, for example developing a ROP chain to bypass DEP, which can usually be largely re-used between different exploits for the same target software.
"Hard" Mitigation - We consider a mitigation to be "hard" if the expectation is that an attacker cannot develop a bypass technique which is reusable between different vulnerabilities (without, e.g. incorporating an additional vulnerability). An example would be ASLR, which is typically bypassed either by the use of a separate information leak vulnerability, or by developing an information leak primitive using the same vulnerability.

Note that the context can have an impact on the "hardness" of a mitigation -
if a codebase is particularly rich in code-patterns which allow the construction of

information leaks, it's quite possible that an attacker can develop a reliable, reusable

technique for turning various memory corruption primitives into an information leak

within that codebase.

Solution - a solution is something that eliminates exploitability of a vulnerability or class of vulnerability. The expectation is that the only way for an attacker to bypass a solution would be an unintended implementation weakness in the solution.

For example (based purely on a theoretical implementation of memory tagging):

- Randomly-allocating tags to heap allocations cannot be considered a solution for any class of heap-related memory corruption, since this provides at best probabilistic protection.

- Allocating odd and even tags to adjacent heap allocations might theoretically be able to provide a solution for linear heap-buffer overflows.

Hardware Implementation

The main hardware implementation question we had was, does a speculative side-channel exist that would allow leaking whether or not a tag-check has succeeded, without needing to architecturally execute the tag-check?

It is expected that Spectre-type speculative execution side-channels will still allow an attacker to leak pointer values from memory, indirectly leaking the tags. Here we consider whether the implementation introduces additional speculative side-channels that would allow an attacker to more efficiently leak tag-check success/failure.

TL; DR; In our testing we did not identify an additional1 speculative side-channel that would allow such an attack to be performed.

1. Does MTE block Spectre? (NO)

The only way that MTE could prevent exploitation of Spectre-type weaknesses would be to have speculative execution stall the pipeline until the tag-check completes2. This might sound desirable ("prevent exploitation of Spectre-type weaknesses") but it's not - this would create a much stronger side-channel to allow an attacker to create an oracle for tag-check success, weakening the overall security properties of the MTE implementation.

This is easy to test. If we can still leak data out of speculative execution using Spectre when the pointer used for the speculative access has an incorrect tag, then this is not the case.

We wrote a small patch to the safeside demo spectre_v1_pht_sa that demonstrates this:

mte_device:/data/local/tmp $ ./spectre_v1_pht_sa_mte

Leaking the string: 1t's a s3kr3t!!!

Done!

Checking that we would crash during architectural access:

Segmentation fault

2. Does tag-check success/failure have a measurable impact on the speculation window length? (Probably NOT)

There is a deeper question that we'd like to understand: is the length of speculative execution after a memory access influenced by whether the tag-check for that access succeeds or fails? If this was the case, then we might be able to build a more complex speculative side-channel that we could use in a similar way.

In order to measure this, we need to force a mis-speculation, and then perform a read access to memory for which a tag-check would either succeed or fail, and then we need to use a speculative side-channel to measure how many further instructions were successfully speculatively executed. We wrote and tested a self-contained harness to measure this, which can be found in speculation_window.c

This harness works by generating code for a test function with a variable number of no-op instructions at runtime, and then repeatedly executing this test function in a loop to train the branch-predictor before finally triggering mis-speculation. The test function is as follows:

ldr x0, [x0] ; this load is slow (*x0 is uncached)

cbnz x0, speculation: ; this branch is always taken during warmup

ret

speculation:

ldr x1, [x1] ; this load is fast (*x1 is cached)

; the tag-check success or fail will happen on

; this access, but during warmup the tag-check

; will always be a success.

orr x2, x2, x1 ; this is a no-op (as x1 is always 0) but it

... n times ... ; maintains a data dependency between the

orr x2, x2, x1 ; loads (and the no-ops), hopefully preventing

; too much re-ordering.

ldr x2, [x2] ; *x2 is uncached, if it is cached later then

; this instruction was (probably) executed.

ret

In this way we can measure the number of no-ops that we can insert before the final load no longer executes (ie. the result from the first load is received and we realise that we've mispredicted the branch so we flush the pipeline before the final load is executed).

In order to reduce bias due to branch-predictor behaviour, the test program is run separately for the tag-check success and tag-check failure case, and the control-flow being run is identical in both cases.

In order to reduce bias due to cpu power-management/throttling behaviour, each time the test program is run, it collects a single set of samples and then exits. We then run this program repeatedly, and by interleaving the runs for tag-check success and tag-check fail we look to reduce this influence on the results. In addition, the cores are down-clocked to the lowest supported frequency using the standard linux cpu scaling interface, which should reduce the impact of cpu throttling to a minimum.

Most modern mobile devices also have non-homogenous core designs, with for example a 4+4, 2+2+4 or 1+3+4 core design. Since the device under test is no exception, the program needs to pin itself to a specific core and we need to run these tests for each core design separately.

Reading from the virtual timer is extremely slow (relative to memory access instructions), and this results in noisy data when attempting to measure single-cache-hit/single-cache-miss. In a normal real-world environment, a shared-memory timer is often a better approach to timing with this level of granularity, and that's an approach that we've used previously with success. In this case, since we want to get data for each core, it was difficult to get consistency across all cores, and we had better results using an amplification approach performing accesses to multiple cache-lines.

However, this approach adds more complexity to the measuring code, and more possibilities for smart prefetching or similar to interfere with our measurements. Since we are trying to test the properties of the hardware, rather than develop a practical attack that needs to run in a timely fashion, we decided to minimise these risks and instead collect enough data that we'd be able to see the patterns through the noise.

The first graphs below show this data for a region of the x-axis (nop-count) on each core around where the speculation window ends (so the x-axis is anchored differently for each core), and on the y-axis we plot the observed probability that our measurement results in a cache-miss. Two separate lines are plotted in each graph; one in red for the "tag-check-fail" case and one in green for the "tag-check-pass" case. We can't really see the green line at all - the two lines overlap so closely as to be indistinguishable.

If we really zoom in on measurements from the "Biggest" core, we can see the two lines deviating due to the noise:

Another interesting point is to notice that the graph we have for the biggest core tops out at a probability of ~30%; this is (probably) not because we're hitting the cache, but instead that the timer read is slow that we can't use it to reliably differentiate between cache misses and cache hits most of the time. If we look at a graph of raw latency measurements, we can see this:

The important thing is that there's no significant difference between the results for the failure and success cases, which suggests that there's no clear speculative side-channel that would allow an attacker to build a side-channel tag-check oracle directly.

Software Implementation

There are several ways that we identified that the current software implementation around the use of MTE could lead to bypasses if MTE were used as a security mitigation.

As described below, Issues 1 and 2 are quirks of the current implementation that it may or may not be possible to address with kernel changes. They are both of limited scope and only apply to very specific conditions, so they likely don't have a particularly significant impact on the applicability of MTE as a security mitigation (although they have some implications for the coverage of such a mitigation.)

Issue 3 is a detail of the current implementation of Breakpad, and serves to highlight a particular class of weakness that will require careful audit of signal handling code in any application that wishes to use MTE as a security mitigation.

Issue 4 is a fundamental weakness that likely cannot be effectively addressed. We would not characterize this as meaning that MTE could not be applied as a security mitigation, but rather a limitation on the claims that one could make about the effectiveness of such a mitigation.

To be more specific, both issues 3 and 4 are bound by some fairly significant limitations, which would have varying impact on exploitability of memory corruption issues depending on the specific context in which the issue occurs. See below for more discussion on this topic.

TL;DR; In our testing we identified some areas for improvement on the software side, and demonstrated that it is possible to use the narrow "exploitation window" to avoid the side-effects of tag-check failures under some circumstances, meaning that any mitigation based on async MTE is likely limited to a "soft mitigation" regardless of tagging strategy.

1. Handling of system call arguments [ASYNC]

There's a documented limitation of the current handling of the async MTE check mode in the linux kernel, which is that the kernel does not catch invalid accesses to userspace pointers. This means that kernel-space accesses to incorrectly tagged user-space pointers during a system call will not be detected.

There are likely good technical reasons for this limitation, but we plan to investigate improving this coverage in the future.

We provide a sample that demonstrates this limitation, software_issue_1.c.

mte_enable(false, DEFAULT_TAG_MASK);

uint64_t* ptr = mmap(NULL, 0x1000,

PROT_READ|PROT_WRITE|PROT_MTE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

uint64_t* tagged_ptr = mte_tag_and_zero(ptr, 0x1000);

memset(tagged_ptr, 0x23, 0x1000);

int fd = open("/dev/urandom", O_RDONLY);

fprintf(stderr, "%p %p\n", ptr, tagged_ptr);

read(fd, ptr, 0x1000);

assert(*tagged_ptr == 0x2323232323232323ull);

taro:/ $ /data/local/tmp/software_issue_1

0x7722c5d000 0x800007722c5d000

software_issue_1: ./software_issue_1.c:46: int main(int, char **): Assertion `*tagged_ptr == 0x2323232323232323ull' failed.

2. Handling of system call arguments [SYNC]

The way that the sync MTE check mode is currently implemented in the linux kernel means that kernel-space accesses to incorrectly tagged user-space pointers result in the system call returning EFAULT. While this is probably the cleanest and simplest approach, and is consistent with current kernel behaviour especially when it comes to handling results for partial operations, this has the potential to lead to bypasses/oracles in some circumstances.

We plan to investigate replacing this behaviour with SIGSEGV delivery instead (specifically for MTE tag check failures).

The provided sample, software_issue_2.c is a very simple demonstration of this situation.

size_t readn(int fd, void* ptr, size_t len) {

char* start_ptr = ptr;

char* read_ptr = ptr;

while (read_ptr < start_ptr + len) {

ssize_t result = read(fd, read_ptr, start_ptr + len - read_ptr);

if (result <= 0) {

return read_ptr - start_ptr;

} else {

read_ptr += result;

}

return len;

}

int main(int argc, char** argv) {

int pipefd[2];

mte_enable(true, DEFAULT_TAG_MASK);

uint64_t* ptr = mmap(NULL, 0x1000,

PROT_READ|PROT_WRITE|PROT_MTE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

ptr = mte_tag(ptr, 0x10);

strcpy(ptr, "AAAAAAAAAAAAAAA");

assert(!pipe(pipefd));

write(pipefd[1], "BBBBBBBBBBBBBBB", 0x10);

// In sync MTE mode, kernel MTE tag-check failures cause system calls to fail

// with EFAULT rather than triggering a SIGSEGV. Existing code doesn't

// generally expect to receive EFAULT, and is very unlikely to handle it as a

// critical error.

uint64_t* new_ptr = ptr;

while (!strcmp(new_ptr, "AAAAAAAAAAAAAAA")) {

// Simulate a use-after-free, where new_ptr is repeatedly free'd and ptr

// is accessed after the free via a syscall.

new_ptr = mte_tag(new_ptr, 0x10);

strcpy(new_ptr, "AAAAAAAAAAAAAAA");

size_t bytes_read = readn(pipefd[0], ptr, 0x10);

fprintf(stderr, "read %zu bytes\nnew_ptr string is %s\n", bytes_read, new_ptr);

}

taro:/ $ /data/local/tmp/software_issue_2

read 0 bytes