Wednesday, May 10, 2017

Exploiting the Linux kernel via packet sockets

Guest blog post, posted by Andrey Konovalov


Lately I’ve been spending some time fuzzing network-related Linux kernel interfaces with syzkaller. Besides the recently discovered vulnerability in DCCP sockets, I also found another one, this time in packet sockets. This post describes how the bug was discovered and how we can exploit it to escalate privileges.

The bug itself (CVE-2017-7308) is a signedness issue, which leads to an exploitable heap-out-of-bounds write. It can be triggered by providing specific parameters to the PACKET_RX_RING option on an AF_PACKET socket with a TPACKET_V3 ring buffer version enabled. As a result the following sanity check in the packet_set_ring() function in net/packet/af_packet.c can be bypassed, which later leads to an out-of-bounds access.

                if (po->tp_version >= TPACKET_V3 &&
                    (int)(req->tp_block_size -
                          BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
                        goto out;

The bug was introduced on Aug 19, 2011 in the commit f6fb8f10 ("af-packet: TPACKET_V3 flexible buffer implementation") together with the TPACKET_V3 implementation. There was an attempt to fix it on Aug 15, 2014 in commit dc808110 ("packet: handle too big packets for PACKET_V3") by adding additional checks, but this was not sufficient, as shown below. The bug was fixed in 2b6867c2 ("net/packet: fix overflow in check for priv area size") on Mar 29, 2017.

The bug affects a kernel if it has AF_PACKET sockets enabled (CONFIG_PACKET=y), which is the case for many Linux kernel distributions. Exploitation requires the CAP_NET_RAW privilege to be able to create such sockets. However, it's possible to do that from a user namespace if user namespaces are enabled (CONFIG_USER_NS=y) and accessible to unprivileged users.

Since packet sockets are a quite widely used kernel feature, this vulnerability affects a number of popular Linux kernel distributions including Ubuntu and Android. It should be noted that access to AF_PACKET sockets is expressly disallowed to any untrusted code within Android, although it is available to some privileged components. Updated Ubuntu kernels are already out, and Android’s update is scheduled for July.


The bug was found with syzkaller, a coverage-guided syscall fuzzer, and KASAN, a dynamic memory error detector. I’m going to provide some details on how syzkaller works and how to use it to fuzz a kernel interface, in case someone decides to try this.

Let’s start with a quick overview of how the syzkaller fuzzer works. Syzkaller is able to generate random programs (sequences of syscalls) based on manually written template descriptions for each syscall. The fuzzer executes these programs and collects code coverage for each of them. Using the coverage information, syzkaller keeps a corpus of programs, which trigger different code paths in the kernel. Whenever a new program triggers a new code path (i.e. gives new coverage), syzkaller adds it to the corpus. Besides generating completely new programs, syzkaller is able to mutate the existing ones from the corpus.

Syzkaller is meant to be used together with dynamic bug detectors like KASAN (detects memory bugs like out-of-bounds and use-after-frees, available upstream since 4.0), KMSAN (detects uses of uninitialized memory, prototype was just released) or KTSAN (detects data races, prototype is available). The idea is that syzkaller stresses the kernel and executes various interesting code paths and the detectors detect and report bugs.

The usual workflow for finding bugs with syzkaller is as follows:
  1. Set up syzkaller and make sure it works. The README and wiki provide quite extensive information on how to do that.
  2. Write template descriptions for the particular kernel interface you want to test.
  3. Specify the syscalls that are used in this interface in the syzkaller config.
  4. Run syzkaller until it finds bugs. Usually this happens quite fast for interfaces that haven’t been tested with it previously.

Syzkaller uses its own declarative language to describe syscall templates. Check out sys/sys.txt for an example or sys/ for information on the syntax. Here’s an excerpt from the syzkaller descriptions for AF_PACKET sockets that I used to discover the bug:

resource sock_packet[sock]

define ETH_P_ALL_BE htons(ETH_P_ALL)

socket$packet(domain const[AF_PACKET], type flags[packet_socket_type], proto const[ETH_P_ALL_BE]) sock_packet

packet_socket_type = SOCK_RAW, SOCK_DGRAM

setsockopt$packet_rx_ring(fd sock_packet, level const[SOL_PACKET], optname const[PACKET_RX_RING], optval ptr[in, tpacket_req_u], optlen len[optval])
setsockopt$packet_tx_ring(fd sock_packet, level const[SOL_PACKET], optname const[PACKET_TX_RING], optval ptr[in, tpacket_req_u], optlen len[optval])

tpacket_req {
        tp_block_size int32
        tp_block_nr int32
        tp_frame_size int32
        tp_frame_nr int32
}

tpacket_req3 {
        tp_block_size int32
        tp_block_nr int32
        tp_frame_size int32
        tp_frame_nr int32
        tp_retire_blk_tov int32
        tp_sizeof_priv int32
        tp_feature_req_word int32
}

tpacket_req_u [
        req tpacket_req
        req3 tpacket_req3
] [varlen]

The syntax is mostly self-explanatory. First, we declare a new type sock_packet. This type is inherited from an existing type sock. That way syzkaller will use syscalls which have arguments of type sock on sock_packet sockets as well.

After that, we declare a new syscall socket$packet. The part before the $ sign tells syzkaller what syscall it should use, and the part after the $ sign is used to differentiate between different kinds of the same syscall. This is particularly useful when dealing with syscalls like ioctl. The socket$packet syscall returns a sock_packet socket.

Then setsockopt$packet_rx_ring and setsockopt$packet_tx_ring are declared. These syscalls set the PACKET_RX_RING and PACKET_TX_RING socket options on a sock_packet socket. I’ll talk about these options in detail below. Both of them use the tpacket_req_u union as the socket option value. This union has two struct members, tpacket_req and tpacket_req3.

Once the descriptions are added, syzkaller can be instructed to fuzz packet-related syscalls specifically. This is what I provided in the syzkaller manager config:

"enable_syscalls": [
        "socket$packet", "socketpair$packet", "accept$packet", "accept4$packet", "bind$packet", "connect$packet", "sendto$packet", "recvfrom$packet", "getsockname$packet", "getpeername$packet", "listen", "setsockopt", "getsockopt", "syz_emit_ethernet"
]

After a few minutes of running syzkaller with these descriptions I started getting kernel crashes. Here’s one of the syzkaller programs that triggered the mentioned bug:

mmap(&(0x7f0000000000/0xc8f000)=nil, (0xc8f000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
r0 = socket$packet(0x11, 0x3, 0x300)
setsockopt$packet_int(r0, 0x107, 0xa, &(0x7f000061f000)=0x2, 0x4)
setsockopt$packet_rx_ring(r0, 0x107, 0x5, &(0x7f0000c8b000)=@req3={0x10000, 0x3, 0x10000, 0x3, 0x4, 0xfffffffffffffffe, 0x5}, 0x1c)
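
For readers less used to syzkaller’s output, the numeric arguments in the program above decode to the constants checked below. This is only a sketch, assuming the UAPI definitions from <linux/if_packet.h>; the helper name repro_req() is made up for illustration, and it just fills in the struct without passing it to the kernel.

```c
#include <assert.h>           /* static_assert */
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>  /* struct tpacket_req3, PACKET_*, TPACKET_V3 */
#include <sys/socket.h>       /* AF_PACKET, SOCK_RAW, SOL_PACKET */

/* socket$packet(0x11, 0x3, 0x300) */
static_assert(AF_PACKET == 0x11, "domain is AF_PACKET");
static_assert(SOCK_RAW == 0x3, "type is SOCK_RAW");
/* 0x300 is ETH_P_ALL with its two bytes swapped, which is what
 * htons(ETH_P_ALL) yields on a little-endian machine. */
static_assert((((ETH_P_ALL & 0xff) << 8) | (ETH_P_ALL >> 8)) == 0x300,
              "proto is ETH_P_ALL in network byte order");

/* setsockopt$packet_int(r0, 0x107, 0xa, &0x2, 0x4) */
static_assert(SOL_PACKET == 0x107, "level is SOL_PACKET");
static_assert(PACKET_VERSION == 0xa, "option is PACKET_VERSION");
static_assert(TPACKET_V3 == 0x2, "value is TPACKET_V3");

/* setsockopt$packet_rx_ring(r0, 0x107, 0x5, &req3, 0x1c) */
static_assert(PACKET_RX_RING == 0x5, "option is PACKET_RX_RING");
static_assert(sizeof(struct tpacket_req3) == 0x1c, "optlen is sizeof(req3)");

/* The @req3 value from the program, field by field (hypothetical helper). */
struct tpacket_req3 repro_req(void)
{
        struct tpacket_req3 req = {
                .tp_block_size       = 0x10000,
                .tp_block_nr         = 0x3,
                .tp_frame_size       = 0x10000,
                .tp_frame_nr         = 0x3,
                .tp_retire_blk_tov   = 0x4,
                /* the field is 32 bits wide, so only 0xfffffffe is kept */
                .tp_sizeof_priv      = 0xfffffffeu,
                .tp_feature_req_word = 0x5,
        };
        return req;
}
```

The interesting value is tp_sizeof_priv: it has its highest bit set, which is exactly what the check discussed later fails to handle.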

And here’s one of the KASAN reports. It should be noted that since the access is quite far past the block bounds, the allocation and deallocation stacks don’t correspond to the overflown object.

BUG: KASAN: slab-out-of-bounds in prb_close_block net/packet/af_packet.c:808
Write of size 4 at addr ffff880054b70010 by task syz-executor0/30839

CPU: 0 PID: 30839 Comm: syz-executor0 Not tainted 4.11.0-rc2+ #94
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:16 [inline]
dump_stack+0x292/0x398 lib/dump_stack.c:52
print_address_description+0x73/0x280 mm/kasan/report.c:246
kasan_report_error mm/kasan/report.c:345 [inline]
kasan_report.part.3+0x21f/0x310 mm/kasan/report.c:368
kasan_report mm/kasan/report.c:393 [inline]
__asan_report_store4_noabort+0x2c/0x30 mm/kasan/report.c:393
prb_close_block net/packet/af_packet.c:808 [inline]
prb_retire_current_block+0x6ed/0x820 net/packet/af_packet.c:970
__packet_lookup_frame_in_block net/packet/af_packet.c:1093 [inline]
packet_current_rx_frame net/packet/af_packet.c:1122 [inline]
tpacket_rcv+0x9c1/0x3750 net/packet/af_packet.c:2236
packet_rcv_fanout+0x527/0x810 net/packet/af_packet.c:1493
deliver_skb net/core/dev.c:1834 [inline]
__netif_receive_skb_core+0x1cff/0x3400 net/core/dev.c:4117
__netif_receive_skb+0x2a/0x170 net/core/dev.c:4244
netif_receive_skb_internal+0x1d6/0x430 net/core/dev.c:4272
netif_receive_skb+0xae/0x3b0 net/core/dev.c:4296
tun_rx_batched.isra.39+0x5e5/0x8c0 drivers/net/tun.c:1155
tun_get_user+0x100d/0x2e20 drivers/net/tun.c:1327
tun_chr_write_iter+0xd8/0x190 drivers/net/tun.c:1353
call_write_iter include/linux/fs.h:1733 [inline]
new_sync_write fs/read_write.c:497 [inline]
__vfs_write+0x483/0x760 fs/read_write.c:510
vfs_write+0x187/0x530 fs/read_write.c:558
SYSC_write fs/read_write.c:605 [inline]
SyS_write+0xfb/0x230 fs/read_write.c:597
RIP: 0033:0x40b031
RSP: 002b:00007faacbc3cb50 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000002a RCX: 000000000040b031
RDX: 000000000000002a RSI: 0000000020002fd6 RDI: 0000000000000015
RBP: 00000000006e2960 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000708000
R13: 000000000000002a R14: 0000000020002fd6 R15: 0000000000000000

Allocated by task 30534:
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:513
set_track mm/kasan/kasan.c:525 [inline]
kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:617
kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:555
slab_post_alloc_hook mm/slab.h:456 [inline]
slab_alloc_node mm/slub.c:2720 [inline]
slab_alloc mm/slub.c:2728 [inline]
kmem_cache_alloc+0x1af/0x250 mm/slub.c:2733
getname_flags+0xcb/0x580 fs/namei.c:137
getname+0x19/0x20 fs/namei.c:208
do_sys_open+0x2ff/0x720 fs/open.c:1045
SYSC_open fs/open.c:1069 [inline]
SyS_open+0x2d/0x40 fs/open.c:1064

Freed by task 30534:
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:513
set_track mm/kasan/kasan.c:525 [inline]
kasan_slab_free+0x72/0xc0 mm/kasan/kasan.c:590
slab_free_hook mm/slub.c:1358 [inline]
slab_free_freelist_hook mm/slub.c:1381 [inline]
slab_free mm/slub.c:2963 [inline]
kmem_cache_free+0xb5/0x2d0 mm/slub.c:2985
putname+0xee/0x130 fs/namei.c:257
do_sys_open+0x336/0x720 fs/open.c:1060
SYSC_open fs/open.c:1069 [inline]
SyS_open+0x2d/0x40 fs/open.c:1064

Object at ffff880054b70040 belongs to cache names_cache of size 4096
The buggy address belongs to the page:
page:ffffea000152dc00 count:1 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
flags: 0x500000000008100(slab|head)
raw: 0500000000008100 0000000000000000 0000000000000000 0000000100070007
raw: ffffea0001549a20 ffffea0001b3cc20 ffff88003eb44f40 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff880054b6ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff880054b6ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff880054b70000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
ffff880054b70080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff880054b70100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

You can find more details about syzkaller in its repository and more details about KASAN in the kernel documentation. If you decide to try syzkaller or KASAN and run into any trouble, drop an email to or to

Introduction to AF_PACKET sockets

To better understand the bug, the vulnerability it leads to and how to exploit it, we need to understand what AF_PACKET sockets are and how they are implemented in the kernel.


AF_PACKET sockets allow users to send or receive packets on the device driver level. This, for example, lets them implement their own protocol on top of the physical layer or sniff packets including Ethernet and higher-level protocol headers. To create an AF_PACKET socket a process must have the CAP_NET_RAW capability in the user namespace that governs its network namespace. More details can be found in the packet sockets documentation. It should be noted that if a kernel has unprivileged user namespaces enabled, then an unprivileged user is able to create packet sockets.

To send and receive packets on a packet socket, a process can use the send and recv syscalls. However, packet sockets provide a way to do this faster by using a ring buffer that’s shared between the kernel and userspace. A ring buffer can be created via the PACKET_TX_RING and PACKET_RX_RING socket options. The ring buffer can then be mmapped by the user, and packet data can then be read from or written to it directly.

There are a few different variants of the way the ring buffer is handled by the kernel. The user chooses the variant via the PACKET_VERSION socket option. The difference between ring buffer versions can be found in the kernel documentation (search for “TPACKET versions”).

One of the widely known users of AF_PACKET sockets is the tcpdump utility. This is roughly what happens when tcpdump is used to sniff all packets on a particular interface:

# strace tcpdump -i eth0
socket(PF_PACKET, SOCK_RAW, 768)        = 3
bind(3, {sa_family=AF_PACKET, proto=0x03, if2, pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
setsockopt(3, SOL_PACKET, PACKET_VERSION, [1], 4) = 0
setsockopt(3, SOL_PACKET, PACKET_RX_RING, {block_size=131072, block_nr=31, frame_size=65616, frame_nr=31}, 16) = 0
mmap(NULL, 4063232, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x7f73a6817000

This sequence of syscalls corresponds to the following actions:
  1. A socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)) is created.
  2. The socket is bound to the eth0 interface.
  3. Ring buffer version is set to TPACKET_V2 via the PACKET_VERSION socket option.
  4. A ring buffer is created via the PACKET_RX_RING socket option.
  5. The ring buffer is mmapped in the userspace.

After that the kernel will start putting all packets coming through the eth0 interface in the ring buffer and tcpdump will read them from the mmapped region in the userspace.
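
The five steps above can be sketched in C roughly as follows. This is a minimal sketch rather than what tcpdump actually does internally: the helper names ring_map_len() and open_rx_ring_v2() are made up for illustration, error handling is reduced to returning -1, and the call needs CAP_NET_RAW to succeed.

```c
#include <arpa/inet.h>        /* htons() */
#include <assert.h>
#include <errno.h>
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>  /* struct tpacket_req, struct sockaddr_ll */
#include <net/if.h>           /* if_nametoindex() */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

/* The ring consists of tp_block_nr blocks of tp_block_size bytes each. */
size_t ring_map_len(const struct tpacket_req *req)
{
        return (size_t)req->tp_block_size * req->tp_block_nr;
}

/* Returns the socket fd and stores the mapping in *ring, or returns -1
 * with errno set. Hypothetical helper, TPACKET_V2 only. */
int open_rx_ring_v2(const char *ifname, const struct tpacket_req *req,
                    void **ring)
{
        assert(ifname && req && ring);

        /* 1. Create the packet socket. */
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0)
                return -1;

        /* 2. Bind it to the interface. */
        struct sockaddr_ll sll;
        memset(&sll, 0, sizeof(sll));
        sll.sll_family = AF_PACKET;
        sll.sll_protocol = htons(ETH_P_ALL);
        sll.sll_ifindex = (int)if_nametoindex(ifname);
        if (sll.sll_ifindex == 0 ||
            bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0)
                goto fail;

        /* 3. Select the ring buffer version. */
        int version = TPACKET_V2;
        if (setsockopt(fd, SOL_PACKET, PACKET_VERSION,
                       &version, sizeof(version)) < 0)
                goto fail;

        /* 4. Create the receive ring. */
        if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req)) < 0)
                goto fail;

        /* 5. Map the ring into userspace. */
        *ring = mmap(NULL, ring_map_len(req), PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
        if (*ring == MAP_FAILED)
                goto fail;
        return fd;

fail:
        {
                int saved = errno;  /* close() may overwrite errno */
                close(fd);
                errno = saved;
        }
        return -1;
}
```

With the parameters from the strace output (block_size=131072, block_nr=31), ring_map_len() yields 4063232, which matches the length passed to mmap there.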

Ring buffers

Let’s see how to use ring buffers for packet sockets. For consistency all of the kernel code snippets below will come from the Linux kernel 4.8. This is the version the latest Ubuntu 16.04.2 kernel is based on.

The existing documentation mostly focuses on TPACKET_V1 and TPACKET_V2 ring buffer versions. Since the mentioned bug only affects the TPACKET_V3 version, I’m going to assume that we deal with that particular version for the rest of the post. Also I’m going to mostly focus on PACKET_RX_RING ignoring PACKET_TX_RING.

A ring buffer is a memory region used to store packets. Each packet is stored in a separate frame. Frames are grouped into blocks. In TPACKET_V3 ring buffers the frame size is not fixed and can have an arbitrary value as long as a frame fits into a block.

To create a TPACKET_V3 ring buffer via the PACKET_RX_RING socket option a user must provide the exact parameters for the ring buffer. These parameters are passed to the setsockopt call via a pointer to a request struct called tpacket_req3, which is defined as:

struct tpacket_req3 {
        unsigned int    tp_block_size;  /* Minimal size of contiguous block */
        unsigned int    tp_block_nr;    /* Number of blocks */
        unsigned int    tp_frame_size;  /* Size of frame */
        unsigned int    tp_frame_nr;    /* Total number of frames */
        unsigned int    tp_retire_blk_tov; /* timeout in msecs */
        unsigned int    tp_sizeof_priv; /* offset to private data area */
        unsigned int    tp_feature_req_word;
};

Here’s what each field means in the tpacket_req3 struct:
  1. tp_block_size - the size of each block.
  2. tp_block_nr - the number of blocks.
  3. tp_frame_size - the size of each frame, ignored for TPACKET_V3.
  4. tp_frame_nr - the number of frames, ignored for TPACKET_V3.
  5. tp_retire_blk_tov - timeout after which a block is retired, even if it’s not fully filled with data (see below).
  6. tp_sizeof_priv - the size of per-block private area. This area can be used by a user to store arbitrary information associated with each block.
  7. tp_feature_req_word - a set of flags (actually just one at the moment), which allows enabling some additional functionality.

Each block has an associated header, which is stored at the very beginning of the memory area allocated for the block. The block header struct is called tpacket_block_desc and has a block_status field, which indicates whether the block is currently being used by the kernel or available to the user. The usual workflow is that the kernel stores packets into a block until it’s full and then sets block_status to TP_STATUS_USER. The user then reads required data from the block and releases it back to the kernel by setting block_status to TP_STATUS_KERNEL.

struct tpacket_hdr_v1 {
        __u32   block_status;
        __u32   num_pkts;
        __u32   offset_to_first_pkt;
        /* ... */
};

union tpacket_bd_header_u {
        struct tpacket_hdr_v1 bh1;
};

struct tpacket_block_desc {
        __u32 version;
        __u32 offset_to_priv;
        union tpacket_bd_header_u hdr;
};

Each frame also has an associated header described by the struct tpacket3_hdr. The tp_next_offset field points to the next frame within the same block.

struct tpacket3_hdr {
        __u32 tp_next_offset;
        /* ... */
};

When a block is fully filled with data (a new packet doesn’t fit into the remaining space), it’s closed and released to userspace or “retired” by the kernel. Since the user usually wants to see packets as soon as possible, the kernel can release a block even if it’s not filled with data completely. This is done by setting up a timer that retires the current block with a timeout controlled by the tp_retire_blk_tov parameter.

There’s also a way to specify a per-block private area, which the kernel won’t touch and the user can use to store any information associated with a block. The size of this area is passed via the tp_sizeof_priv parameter.

If you’d like to better understand how a userspace program can use a TPACKET_V3 ring buffer, you can read the example provided in the documentation (search for “TPACKET_V3 example”).
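
The consumer side of a single block can be sketched as follows, assuming the struct layouts from <linux/if_packet.h>. consume_block() is a hypothetical helper, not part of any API; a real reader would also look at the packet data itself, which this sketch skips in favor of just summing the captured lengths.

```c
#include <assert.h>
#include <linux/if_packet.h>  /* tpacket_block_desc, tpacket3_hdr, TP_STATUS_* */
#include <stdint.h>

/* Walks all frames of one block and returns the total number of captured
 * bytes, then hands the block back to the kernel. Returns -1 if the block
 * has not been released to userspace yet. */
long consume_block(struct tpacket_block_desc *pbd)
{
        assert(pbd);

        if (!(pbd->hdr.bh1.block_status & TP_STATUS_USER))
                return -1;

        long bytes = 0;
        uint8_t *frame = (uint8_t *)pbd + pbd->hdr.bh1.offset_to_first_pkt;

        for (uint32_t i = 0; i < pbd->hdr.bh1.num_pkts; i++) {
                struct tpacket3_hdr *hdr = (struct tpacket3_hdr *)frame;

                bytes += hdr->tp_snaplen;
                /* tp_next_offset is relative to the current frame */
                frame += hdr->tp_next_offset;
        }

        /* Finish all reads before giving the block back to the kernel. */
        __sync_synchronize();
        pbd->hdr.bh1.block_status = TP_STATUS_KERNEL;
        return bytes;
}
```

A reader would typically call this for each block of the mmapped ring in turn, waiting with poll() while the current block still belongs to the kernel.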

Implementation of AF_PACKET sockets

Let’s take a quick look at how some of this is implemented in the kernel.

Struct definitions

Whenever a packet socket is created, an associated packet_sock struct is allocated in the kernel:

struct packet_sock {
        struct sock             sk;
        /* ... */
        struct packet_ring_buffer       rx_ring;
        struct packet_ring_buffer       tx_ring;
        /* ... */
        enum tpacket_versions   tp_version;
        /* ... */
        int                     (*xmit)(struct sk_buff *skb);
        /* ... */
};

The tp_version field in this struct holds the ring buffer version, which in our case is set to TPACKET_V3 by a PACKET_VERSION setsockopt call. The rx_ring and tx_ring fields describe the receive and transmit ring buffers in case they are created via PACKET_RX_RING and PACKET_TX_RING setsockopt calls. These two fields have type packet_ring_buffer, which is defined as:

struct packet_ring_buffer {
        struct pgv              *pg_vec;
        /* ... */
        struct tpacket_kbdq_core        prb_bdqc;
};

The pg_vec field is a pointer to an array of pgv structs, each of which holds a reference to a block. Blocks are actually allocated separately, not as one contiguous memory region.

struct pgv {
        char *buffer;
};

The prb_bdqc field is of type tpacket_kbdq_core and its fields describe the current state of the ring buffer:

struct tpacket_kbdq_core {
        /* ... */
        unsigned short  blk_sizeof_priv;
        /* ... */
        char            *nxt_offset;
        /* ... */
        struct timer_list retire_blk_timer;
};

The blk_sizeof_priv field contains the size of the per-block private area. The nxt_offset field points inside the currently active block and shows where the next packet should be saved. The retire_blk_timer field has type timer_list and describes the timer which retires the current block on timeout.

struct timer_list {
        /* ... */
        struct hlist_node       entry;
        unsigned long           expires;
        void                    (*function)(unsigned long);
        unsigned long           data;
        /* ... */
};

Ring buffer setup

The kernel uses the packet_setsockopt() function to handle setting socket options for packet sockets. When the PACKET_VERSION socket option is used, the kernel sets po->tp_version to the provided value.

With the PACKET_RX_RING socket option a receive ring buffer is created. Internally it’s done by the packet_set_ring() function. This function does a lot of things, so I’ll just show the important parts. First, packet_set_ring() performs a bunch of sanity checks on the provided ring buffer parameters:

                err = -EINVAL;
                if (unlikely((int)req->tp_block_size <= 0))
                        goto out;
                if (unlikely(!PAGE_ALIGNED(req->tp_block_size)))
                        goto out;
                if (po->tp_version >= TPACKET_V3 &&
                    (int)(req->tp_block_size -
                          BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
                        goto out;
                if (unlikely(req->tp_frame_size < po->tp_hdrlen +
                                        po->tp_reserve))
                        goto out;
                if (unlikely(req->tp_frame_size & (TPACKET_ALIGNMENT - 1)))
                        goto out;

                rb->frames_per_block = req->tp_block_size / req->tp_frame_size;
                if (unlikely(rb->frames_per_block == 0))
                        goto out;
                if (unlikely((rb->frames_per_block * req->tp_block_nr) !=
                                        req->tp_frame_nr))
                        goto out;
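
To get a feel for these checks, here is a userspace model of them, leaving out the TPACKET_V3 private area check, which is examined separately later. It is only a sketch: ring_params_ok() is a made-up name, and hdr_and_reserve and page_size are parameters standing in for po->tp_hdrlen + po->tp_reserve and PAGE_SIZE, which are not available outside the kernel.

```c
#include <assert.h>
#include <linux/if_packet.h>  /* struct tpacket_req, TPACKET_ALIGNMENT */
#include <stdbool.h>

/* Returns true if the request would pass the checks modelled here. */
bool ring_params_ok(const struct tpacket_req *req,
                    unsigned int hdr_and_reserve, unsigned int page_size)
{
        /* page_size must be a power of two; a non-zero header length also
         * guarantees that a zero frame size is rejected before dividing. */
        assert(req && hdr_and_reserve > 0);
        assert(page_size > 0 && (page_size & (page_size - 1)) == 0);

        /* Sizes above INT_MAX turn negative here on common compilers. */
        if ((int)req->tp_block_size <= 0)
                return false;
        if (req->tp_block_size & (page_size - 1))       /* !PAGE_ALIGNED() */
                return false;
        if (req->tp_frame_size < hdr_and_reserve)
                return false;
        if (req->tp_frame_size & (TPACKET_ALIGNMENT - 1))
                return false;

        unsigned int frames_per_block =
                req->tp_block_size / req->tp_frame_size;
        if (frames_per_block == 0)
                return false;
        return frames_per_block * req->tp_block_nr == req->tp_frame_nr;
}
```

The tcpdump request shown earlier (block_size=131072, block_nr=31, frame_size=65616, frame_nr=31) passes: only one 65616-byte frame fits into a 131072-byte block, so 31 blocks give exactly 31 frames.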

Then, it allocates the ring buffer blocks:

                err = -ENOMEM;
                order = get_order(req->tp_block_size);
                pg_vec = alloc_pg_vec(req, order);
                if (unlikely(!pg_vec))
                        goto out;

It should be noted that alloc_pg_vec() uses the kernel page allocator to allocate blocks (we’ll use this in the exploit):

static char *alloc_one_pg_vec_page(unsigned long order)
{
        /* ... */
        buffer = (char *) __get_free_pages(gfp_flags, order);
        if (buffer)
                return buffer;
        /* ... */
}

static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
{
        /* ... */
        for (i = 0; i < block_nr; i++) {
                pg_vec[i].buffer = alloc_one_pg_vec_page(order);
                /* ... */
        }
        /* ... */
}

Finally, packet_set_ring() calls init_prb_bdqc(), which performs some additional steps to set up a TPACKET_V3 receive ring buffer specifically:

                switch (po->tp_version) {
                case TPACKET_V3:
                        /* ... */
                        if (!tx_ring)
                                init_prb_bdqc(po, rb, pg_vec, req_u);
                        break;
                default:
                        break;
                }

The init_prb_bdqc() function copies provided ring buffer parameters to the prb_bdqc field of the ring buffer struct, calculates some other parameters based on them, sets up the block retire timer and calls prb_open_block() to initialize the first block:

static void init_prb_bdqc(struct packet_sock *po,
                        struct packet_ring_buffer *rb,
                        struct pgv *pg_vec,
                        union tpacket_req_u *req_u)
{
        struct tpacket_kbdq_core *p1 = GET_PBDQC_FROM_RB(rb);
        struct tpacket_block_desc *pbd;
        /* ... */
        pbd = (struct tpacket_block_desc *)pg_vec[0].buffer;
        p1->pkblk_start = pg_vec[0].buffer;
        p1->kblk_size = req_u->req3.tp_block_size;
        /* ... */
        p1->blk_sizeof_priv = req_u->req3.tp_sizeof_priv;

        p1->max_frame_len = p1->kblk_size - BLK_PLUS_PRIV(p1->blk_sizeof_priv);
        prb_init_ft_ops(p1, req_u);
        prb_setup_retire_blk_timer(po);
        prb_open_block(p1, pbd);
}

One of the things the prb_open_block() function does is set the nxt_offset field of the tpacket_kbdq_core struct to point right after the per-block private area:

static void prb_open_block(struct tpacket_kbdq_core *pkc1,
        struct tpacket_block_desc *pbd1)
{
        /* ... */
        pkc1->pkblk_start = (char *)pbd1;
        pkc1->nxt_offset = pkc1->pkblk_start + BLK_PLUS_PRIV(pkc1->blk_sizeof_priv);
        /* ... */
}

Packet reception

Whenever a new packet is received, the kernel is supposed to save it into the ring buffer. The key function here is __packet_lookup_frame_in_block(), which does the following:
  1. Checks whether the currently active block has enough space for the packet.
  2. If yes, saves the packet to the current block and returns.
  3. If no, dispatches the next block and saves the packet there.

static void *__packet_lookup_frame_in_block(struct packet_sock *po,
                                            struct sk_buff *skb,
                                                int status,
                                            unsigned int len
                                            )
{
        struct tpacket_kbdq_core *pkc;
        struct tpacket_block_desc *pbd;
        char *curr, *end;

        pkc = GET_PBDQC_FROM_RB(&po->rx_ring);
        pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
        /* ... */
        curr = pkc->nxt_offset;
        pkc->skb = skb;
        end = (char *)pbd + pkc->kblk_size;

        /* first try the current block */
        if (curr+TOTAL_PKT_LEN_INCL_ALIGN(len) < end) {
                prb_fill_curr_block(curr, pkc, pbd, len);
                return (void *)curr;
        }

        /* Ok, close the current block */
        prb_retire_current_block(pkc, po, 0);

        /* Now, try to dispatch the next block */
        curr = (char *)prb_dispatch_next_block(pkc, po);
        if (curr) {
                pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
                prb_fill_curr_block(curr, pkc, pbd, len);
                return (void *)curr;
        }
        /* ... */
}



Let’s look closely at the following check from packet_set_ring():

                if (po->tp_version >= TPACKET_V3 &&
                    (int)(req->tp_block_size -
                          BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
                        goto out;

This is supposed to ensure that the length of the block header together with the per-block private data is not bigger than the size of the block. This makes sense: otherwise there wouldn’t be enough space in the block for them, let alone for the packet data.

However, it turns out that this check can be bypassed. If req_u->req3.tp_sizeof_priv has its highest bit set, casting the expression to int results in a big positive value instead of a negative one. To illustrate this behavior:

A = req->tp_block_size = 4096 = 0x1000
B = req_u->req3.tp_sizeof_priv = (1 << 31) + 4096 = 0x80001000
BLK_PLUS_PRIV(B) = (1 << 31) + 4096 + 48 = 0x80001030
A - BLK_PLUS_PRIV(B) = 0x1000 - 0x80001030 = 0x7fffffd0
(int)0x7fffffd0 = 0x7fffffd0 > 0
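
The same arithmetic can be reproduced in userspace. The sketch below models the check with 32-bit wraparound, assuming a 48-byte block header and 8-byte alignment of the private area, as in the numbers above. The function names are made up for illustration, and wide_check_rejects() is just one way to avoid the wraparound by comparing in 64 bits; it is not a quote of the actual fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLK_HDR_LEN 48u  /* assumed header size, as in the example above */

/* Header plus private area rounded up to 8 bytes, in 32-bit arithmetic. */
static uint32_t blk_plus_priv32(uint32_t priv)
{
        return BLK_HDR_LEN + ((priv + 7u) & ~7u);
}

/* The same quantity computed in 64 bits, so nothing can wrap. */
static uint64_t blk_plus_priv64(uint64_t priv)
{
        return BLK_HDR_LEN + ((priv + 7u) & ~UINT64_C(7));
}

/* Model of the quoted check: true if the request is rejected. The
 * subtraction wraps modulo 2^32 and only then is viewed as signed
 * (the conversion behaves as two's complement on common compilers). */
bool old_check_rejects(uint32_t block_size, uint32_t sizeof_priv)
{
        uint32_t diff = block_size - blk_plus_priv32(sizeof_priv);

        assert(block_size > 0);
        return (int32_t)diff <= 0;
}

/* Hypothetical hardened variant: a plain comparison in 64 bits. */
bool wide_check_rejects(uint32_t block_size, uint32_t sizeof_priv)
{
        assert(block_size > 0);
        return (uint64_t)block_size <= blk_plus_priv64(sizeof_priv);
}
```

With A = 0x1000 and B = 0x80001000 the first function accepts the request, while a private area of 0x2000 bytes, which is merely too big for the block, is rejected as intended. The 64-bit comparison rejects both.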

Later, when req_u->req3.tp_sizeof_priv is copied to p1->blk_sizeof_priv in init_prb_bdqc() (see the snippet above), it’s truncated to its two lower bytes, since the type of the latter is unsigned short. So this bug basically allows us to set the blk_sizeof_priv field of the tpacket_kbdq_core struct to an arbitrary value, bypassing all sanity checks.


If we search through the net/packet/af_packet.c source looking for blk_sizeof_priv usage, we’ll find that it’s being used in the following two places.

The first one is in init_prb_bdqc() right after it gets assigned (see the code snippet above) to set max_frame_len. The value of p1->max_frame_len denotes the maximum size of a frame that can be saved into a block. Since we control p1->blk_sizeof_priv, we can make BLK_PLUS_PRIV(p1->blk_sizeof_priv) bigger than p1->kblk_size. This will result in p1->max_frame_len having a huge value, higher than the size of a block. This allows us to bypass the size check when a frame is being copied into a block, thus causing a kernel heap out-of-bounds write.

That’s not all. Another user of blk_sizeof_priv is prb_open_block(), which initializes a block (the code snippet is above as well). There pkc1->nxt_offset denotes the address where the kernel will write a new packet when it’s received. The kernel doesn’t intend to overwrite the block header and per-block private data, so it makes this address point right after them. Since we control blk_sizeof_priv, we can control the lowest two bytes of nxt_offset. This allows us to control the offset of the out-of-bounds write.

To sum up, this bug leads to a kernel heap out-of-bounds write with a controlled maximum size and a controlled offset of up to about 64k bytes.
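
The effect of the truncation on both derived values can be modelled in a few lines. This is a sketch under stated assumptions: a 48-byte block header, a 32-bit unsigned max_frame_len, and a made-up ring_model struct and model_ring() helper that only do the arithmetic and never touch the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define BLK_HDR_LEN 48u  /* assumed header size, as in the earlier example */

static_assert(sizeof(unsigned short) == sizeof(uint16_t),
              "the model assumes a 16-bit unsigned short");

struct ring_model {                /* hypothetical, derived fields only */
        uint16_t blk_sizeof_priv;  /* unsigned short in tpacket_kbdq_core */
        uint32_t max_frame_len;    /* kblk_size - BLK_PLUS_PRIV(...) */
        uint32_t first_write_off;  /* nxt_offset - pkblk_start */
};

struct ring_model model_ring(uint32_t block_size, uint32_t sizeof_priv)
{
        struct ring_model m;

        /* Only the two lower bytes survive the assignment. */
        m.blk_sizeof_priv = (uint16_t)sizeof_priv;

        uint32_t blk_plus_priv =
                BLK_HDR_LEN + ((m.blk_sizeof_priv + 7u) & ~7u);

        /* Wraps to a huge value once blk_plus_priv exceeds block_size. */
        m.max_frame_len = block_size - blk_plus_priv;
        /* Where the first packet of a freshly opened block is written. */
        m.first_write_off = blk_plus_priv;
        return m;
}
```

For the example values (block size 0x1000, tp_sizeof_priv 0x80001000) the model gives a first write offset of 0x1030, already past the end of the block, and a max_frame_len of 0xffffffd0. The largest possible offset is 48 + 0x10000 = 65584 bytes, which is where the “about 64k” figure comes from.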


Let’s see how we can exploit this vulnerability. I’m going to be targeting x86-64 Ubuntu 16.04.2 with the 4.8.0-41-generic kernel version with KASLR, SMEP and SMAP enabled. The Ubuntu kernel has user namespaces available to unprivileged users (CONFIG_USER_NS=y and no restrictions on its usage), so the bug can be exploited by an unprivileged user to gain root privileges. All of the exploitation steps below are performed from within a user namespace.

The Linux kernel has support for a few hardening features that make exploitation more difficult. KASLR (Kernel Address Space Layout Randomization) puts the kernel text at a random offset to make jumping to a particular fixed address useless. SMEP (Supervisor Mode Execution Protection) causes an oops whenever the kernel tries to execute code from the userspace memory and SMAP (Supervisor Mode Access Prevention) does the same whenever the kernel tries to access the userspace memory directly.

Shaping heap

The idea of the exploit is to use the heap out-of-bounds write to overwrite a function pointer in the memory adjacent to the overflown block. For that we need to specifically shape the heap, so some object with a triggerable function pointer is placed right after a ring buffer block. I chose the already mentioned packet_sock struct to be this object. We need to find a way to make the kernel allocate a ring buffer block and a packet_sock struct one next to the other.

As I mentioned above, ring buffer blocks are allocated with the kernel page allocator (buddy allocator). It allocates blocks of 2^n contiguous memory pages. The allocator keeps a freelist of such blocks for each n and returns the freelist head when a block is requested. If the freelist for some n is empty, it finds the first m > n for which the freelist is not empty and splits that block in halves until the required size is reached. Therefore, if we start repeatedly allocating blocks of size 2^n, at some point they will start coming from one high-order memory block being split, and each one will be adjacent to the next.
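
The n for a given block size can be computed as in the sketch below, which is a userspace take on what get_order() returns. It assumes 4 KiB pages, as on x86-64, and order_for_size() is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT_4K 12  /* assumption: 4 KiB pages */

/* Smallest n such that 2^n pages can hold `size` bytes. */
unsigned int order_for_size(size_t size)
{
        assert(size > 0);

        /* number of pages needed, rounding up */
        size_t pages = (size + ((size_t)1 << PAGE_SHIFT_4K) - 1)
                       >> PAGE_SHIFT_4K;
        unsigned int order = 0;

        while (((size_t)1 << order) < pages)
                order++;
        return order;
}
```

A ring buffer block of 0x8000 bytes is therefore an order-3 allocation (8 pages), while the 131072-byte blocks from the tcpdump example are order 5.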

A packet_sock is allocated via the kmalloc() function by the slab allocator. The slab allocator is mostly used to allocate objects of a smaller-than-one-page size. It uses the page allocator to allocate a big block of memory and splits this block into smaller objects. The big blocks are called slabs, hence the name of the allocator. A set of slabs together with their current state and a set of operations like “allocate an object” and “free an object” is called a cache. The slab allocator creates a set of general purpose caches for objects of size 2^n. Whenever kmalloc(size) is called, the slab allocator rounds size up to the nearest power of 2 and uses the cache of that size.

Since the kernel uses kmalloc() all the time, if we try to allocate an object it will most likely come from one of the slabs already created during previous usage. However, if we start allocating objects of the same size, at some point the slab allocator will run out of slabs for this size and will have to allocate another one via the page allocator.

The size of a newly allocated slab depends on the size of objects this slab is meant for. The size of the packet_sock struct is ~1920 and 1024 < 1920 <= 2048, which means that it’ll be rounded to 2048 and the kmalloc-2048 cache will be used. Turns out, for this particular cache the SLUB allocator (which is the kind of slab allocator used in Ubuntu) uses slabs of size 0x8000. So whenever the allocator runs out of slabs for the kmalloc-2048 cache, it allocates 0x8000 bytes with the page allocator.

Keeping all that in mind, this is how we can allocate a kmalloc-2048 slab next to a ring buffer block:
  1. Allocate a lot (512 worked for me) of objects of size 2048 to fill currently existing slabs in the kmalloc-2048 cache. To do that we can create a bunch of packet sockets to cause allocation of packet_sock structs.
  2. Allocate a lot (1024 worked for me) of page blocks of size 0x8000 to drain the page allocator freelists and cause some high-order page block to be split. To do that we can create another packet socket and attach a ring buffer with 1024 blocks of size 0x8000.
  3. Create a packet socket and attach a ring buffer with blocks of size 0x8000. The last one of these blocks (I’m using 2 blocks, the reason is explained below) is the one we’re going to overflow.
  4. Create a bunch of packet sockets to allocate packet_sock structs and cause an allocation of at least one new slab.
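This way we can shape the heap so that a kmalloc-2048 slab filled with packet_sock structs is placed right after the ring buffer block we’re going to overflow.
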
The exact number of allocations to drain freelists and shape the heap the way we want might be different for different setups and depend on the memory usage activity. The numbers above are for a mostly idle Ubuntu machine.

Controlling the overwrite

Above I explained that the bug results in a write of a controlled maximum size at a controlled offset out of the bounds of a ring buffer block. It turns out that not only can we control the maximum size and offset, we can also control the exact data (and its size) that’s being written. Since the data stored in a ring buffer block is the packet passing through a particular network interface, we can manually send packets with arbitrary content on a raw socket through the loopback interface. If we do that in an isolated network namespace, no external traffic will interfere.

There are a few caveats though.

First, it seems that the size of a packet must be at least 14 bytes (apparently 12 bytes for two MAC addresses and 2 bytes for the EtherType) for it to be passed to the packet socket layer. That means that we have to overwrite at least 14 bytes. The data in the packet itself can be arbitrary.

Then, the lowest 3 bits of nxt_offset always have the value of 2 due to the alignment. That means that we can’t start overwriting at an 8-byte aligned offset.

Besides that, when a packet is being received and saved into a block, the kernel updates some fields in the block and frame headers. If we point nxt_offset to some particular offset we want to overwrite, some data where the block and frame headers end up will probably be corrupted.

Another issue is that if we make nxt_offset point past the block end, the first block will be immediately closed when the first packet is being received, since the kernel will (correctly) decide that there’s no space left in the first block (see the __packet_lookup_frame_in_block() snippet). This is not really an issue, since we can create a ring buffer with 2 blocks. The first one will be closed, the second one will be overflown.

Executing code

Now, we need to figure out which function pointers to overwrite. There are a few function pointer fields in the packet_sock struct, but I ended up using the following two:
  1. packet_sock->xmit
  2. packet_sock->rx_ring->prb_bdqc->retire_blk_timer->func

The first one is called whenever a user tries to send a packet via a packet socket. The usual way to elevate privileges to root is to execute the commit_creds(prepare_kernel_cred(0)) payload in a process context. The xmit pointer is called from a process context, which means we can simply point it to the executable memory region, which contains the payload.

To do that we need to put our payload into some executable memory region. One possible way is to put the payload in userspace, either by mmapping an executable memory page or by just defining a global function within our exploit program. However, SMEP & SMAP will prevent the kernel from accessing and executing user memory directly, so we need to deal with them first.

For that I used the retire_blk_timer field (the same field used by Philip Pettersson in his CVE-2016-8655 exploit). It contains a function pointer that’s triggered whenever the retire timer times out. During normal packet socket operation, retire_blk_timer->func points to prb_retire_rx_blk_timer_expired() and it’s called with retire_blk_timer->data as an argument, which contains the address of the packet_sock struct. Since we can overwrite the data field along with the func field, we get a very nice func(data) primitive.

The state of SMEP & SMAP on the current CPU core is controlled by the 20th and 21st bits of the CR4 register. To disable them we should zero out these two bits. For this we can use the func(data) primitive to call native_write_cr4(X), where X has the 20th and 21st bits set to 0. The exact value of X might depend on what other CPU features are enabled. On the machine where I tested the exploit, the value of CR4 is 0x1407f0 (only the SMEP bit is enabled since the CPU has no SMAP support), so I used X = 0x407f0. We can use the sched_setaffinity syscall to force the exploit program to be executed on one CPU core and thus make sure that the userspace payload will be executed on the same core where we disabled SMAP & SMEP.

Putting this all together, here are the exploitation steps:
  1. Figure out the kernel text address to bypass KASLR (described below).
  2. Pad heap as described above.
  3. Disable SMEP & SMAP.
    1. Allocate a packet_sock after a ring buffer block.
    2. Schedule a block retire timer on the packet_sock by attaching a receive ring buffer to it.
    3. Overflow the block and overwrite retire_blk_timer field. Make retire_blk_timer->func point to native_write_cr4 and make retire_blk_timer->data equal to the desired CR4 value.
    4. Wait for the timer to fire. Now we have SMEP & SMAP disabled on the current core.
  4. Get root privileges.
    1. Allocate another pair of a packet_sock and a ring buffer block.
    2. Overflow the block and overwrite the xmit field. Make xmit point to a commit_creds(prepare_kernel_cred(0)) payload placed in userspace.
    3. Send a packet on the corresponding packet socket. xmit will get triggered and the current process will obtain root privileges.

The exploit code can be found here.

It should be noted that when we overwrite these two fields in the packet_sock structs, we’ll end up corrupting some of the fields before them (the kernel will write some values to the block and frame headers), which can lead to a kernel crash. However, as long as these other fields don’t get used by the kernel we should be good. I found that one of the fields that caused crashes when closing all packet sockets after the exploit finished is the mclist field, but simply zeroing it out helps.

KASLR bypass

I didn’t bother to come up with some elaborate KASLR bypass technique which exploits the same bug. Since Ubuntu doesn’t restrict dmesg by default, we can just grep the kernel syslog for the “Freeing SMP” string, which contains a kernel pointer that looks suspiciously similar to the kernel text address:

# Boot #1
$ dmesg | grep 'Freeing SMP'
[    0.012520] Freeing SMP alternatives memory: 32K (ffffffffa58ee000 - ffffffffa58f6000)
$ sudo cat /proc/kallsyms | grep 'T _text'
ffffffffa4800000 T _text

# Boot #2
$ dmesg | grep 'Freeing SMP'
[    0.017487] Freeing SMP alternatives memory: 32K (ffffffff85aee000 - ffffffff85af6000)
$ sudo cat /proc/kallsyms | grep 'T _text'
ffffffff84a00000 T _text

By doing simple math we can calculate the kernel text address based on the one exposed through dmesg. This way of figuring out the kernel text location works only for some time after boot, as syslog only stores a fixed number of lines and starts dropping them at some point.

There are a few Linux kernel hardening features that can be used to prevent this kind of information disclosure. The first one is called dmesg_restrict and it restricts the ability of unprivileged users to read the kernel syslog. It should be noted that even with dmesg restricted, the first user on Ubuntu can still read the syslog from /var/log/kern.log and /var/log/syslog, since that user belongs to the adm group.

Another feature is called kptr_restrict and it doesn’t allow unprivileged users to see pointers printed by the kernel with the %pK format specifier. However in 4.8 the free_reserved_area() function uses %p, so kptr_restrict doesn’t help in this case. In 4.10 free_reserved_area() was fixed not to print address ranges at all, but the change was not backported to older kernels.


Let’s take a look at the fix. The vulnerable code as it was before the fix is below. Remember that the user fully controls both tp_block_size and tp_sizeof_priv.

        if (po->tp_version >= TPACKET_V3 &&
            (int)(req->tp_block_size -
                  BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
                goto out;

When thinking about a way to fix this, the first idea that comes to mind is that we can compare the two values as is without that weird conversion to int:

        if (po->tp_version >= TPACKET_V3 &&
            req->tp_block_size <=
                  BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv))
                goto out;

Funnily enough, this doesn’t actually help. The reason is that an overflow can happen while evaluating BLK_PLUS_PRIV in case tp_sizeof_priv is close to the unsigned int maximum value.

#define BLK_PLUS_PRIV(sz_of_priv) \
        (BLK_HDR_LEN + ALIGN((sz_of_priv), V3_ALIGNMENT))

One of the ways to fix this overflow is to cast tp_sizeof_priv to uint64 before passing it to BLK_PLUS_PRIV. That’s exactly what I did in the fix that was sent upstream.

        if (po->tp_version >= TPACKET_V3 &&
            req->tp_block_size <=
                  BLK_PLUS_PRIV((u64)req_u->req3.tp_sizeof_priv))
                goto out;


Creating a packet socket requires the CAP_NET_RAW privilege, which can be acquired by an unprivileged user inside a user namespace. Unprivileged user namespaces expose a huge kernel attack surface, which has resulted in quite a few exploitable vulnerabilities (CVE-2017-7184, CVE-2016-8655, ...). This kind of kernel vulnerability can be mitigated by completely disabling user namespaces or by disallowing their use by unprivileged users.

To disable user namespaces completely you can rebuild your kernel with CONFIG_USER_NS disabled. Restricting user namespace usage to privileged users only can be done by writing 0 to /proc/sys/kernel/unprivileged_userns_clone in Debian-based kernels. Since version 4.9 the upstream kernel has a similar /proc/sys/user/max_user_namespaces setting.


Right now the Linux kernel has a huge number of poorly tested (from a security standpoint) interfaces and a lot of them are enabled and exposed to unprivileged users in popular Linux distributions like Ubuntu. This is obviously not good and they need to be tested or restricted.

Syzkaller is an amazing tool that allows testing kernel interfaces via fuzzing. Even adding barebone descriptions for another syscall usually uncovers a number of bugs. We certainly need people writing syscall descriptions and fixing existing ones, since there’s a huge surface that’s still not covered and probably a ton of security bugs buried in the kernel. If you decide to contribute, we’ll be glad to see a pull request.


Just a bunch of related links.

Our Linux kernel bug finding tools:

A collection of Linux kernel exploitation materials:

1 comment:

  1. It seems to be common that the systemserver context on Android has packet capabilities. I was able to use dirtycow to escalate from app context to systemserver context and run these packet sock exploits.