Pages

Thursday, November 21, 2024

Simple macOS kernel extension fuzzing in userspace with IDA and TinyInst

Posted by Ivan Fratric, Google Project Zero

Recently, one of the projects I was involved in had to do with video decoding on Apple platforms, specifically AV1 decoding. On Apple devices that support AV1 video format (starting from Apple A17 iOS / M3 macOS), decoding is done in hardware. However, despite this, during decoding, a large part of the AV1 format parsing happens in software, inside the kernel, more specifically inside the AppleAVD kernel extension (or at least, that used to be the case in macOS 14/ iOS 17). As fuzzing is one of the techniques we employ regularly, the question of how to effectively fuzz this code inevitably came up.

It should be noted that I wasn’t the first person to look into the problem of Apple kernel extension fuzzing, so before going into the details of my approach, other projects in this space should be mentioned.

In the Fairplay research project, @pwn0rz utilized a custom loader to load the kernel extension into userspace. A coworker tried to run this code on the current AppleAVD extension, however it didn’t work for them (at least not out of the box) so we didn’t end up using it. It should be noted here that my approach also loads the kernel code into userspace, albeit in a more lightweight way.

In the Cinema time! presentation at Hexacon 2022, Andrey Labunets and Nikita Tarakanov presented their approach for fuzzing AppleAVD where the decompiled code was first extracted using IDA and then rebuilt. I used this approach in the past in some more constrained scenarios, however the decompiled code from IDA is not perfect and manual fixing was often required (such as, for example, when IDA would get the stack layout of a function incorrectly).

In the KextFuzz project, Tingting Yin with the co-authors statically instrumented kernel extensions by replacing pointer authentication instructions with a jump to a coverage-collecting trampoline, which results in a partial coverage.

Most recently, the Pishi project by Meysam Firouzi was released just before this research. The project statically instruments kernel extension code by using Ghidra to identify all basic blocks, and then replacing one instruction from each basic block with a branch to a dedicated trampoline. The trampoline records the coverage, executes the replaced instruction and jumps back to the address of the next instruction. This was reported to run on a real device.

Given the existence of these other projects, it is worth saying that my goal was not to create necessarily the “best” method for kernel extension fuzzing, but what for me was the simplest (if we don’t count the underlying complexity of the off-the shelf tools being used). In short, my approach, that will be discussed in detail in other sections, was

  1. Load AppleAVD extension or full kernelcache into IDA
  2. Rebase the module to an address that can be reliably allocated in userspace
  3. Export raw memory using an IDA Python script
  4. Load exported bytes using custom loader
  5. Use custom TinyInst module to hook and instrument the extension
  6. Use Jackalope for fuzzing

All the project code can be found here. Various components will be explained in more detail throughout the rest of the blog post.

Extracting kernel extension code

Normally, on macOS, kernel extensions are packaged inside “kernel collections” files that serve as containers for multiple extensions. At first OS boot (and whenever something is changed with regards to kernel extensions), the kernel extensions needed by the machine are repackaged into what is called the “kernel cache” (kernelcache file on the filesystem). Kernel extensions can be extracted from these caches and collections, but existing tooling can’t really produce individual .dylib files that can be loaded into userspace and run without issues.

However, reverse engineering tooling, specifically IDA Pro which I used in this research, comes with a surprisingly good loader for Apple kernel cache. I haven’t tried how other reverse engineering tools compare, but if they are comparable and someone would like to contribute to the project, I would gladly accept export scripts for these other tools.

So, instead of writing our own loader, we can simply piggyback on IDA’s. The idea is simple:

  • we let IDA load the kernel extension we want (or even the entire kernelcache)
  • we use IDA to rebase the code so it’s in memory range that is mappable in userspace (see image)
  • using a simple IDA Python script, we export for each memory segment its start address, end address, protection flags and raw bytes
  • optionally, we can also, using the same script, export all the symbol names and the corresponding addresses so we can later refer to symbols by name

The following image shows rebasing of the kernel extension. This functionality is accessible in IDA via Edit->Segments->Rebase program… menu. When choosing the new base address, it is convenient to only change the high bits which makes it easy to manually convert rebased to original addresses and vice versa when needed. In the example below the image base was changed from 0xFFFFFE000714C470 to  0xAB0714C470.

screenshot with image base selected, with the value set at  0xAB0714C470, with both the fix up the program and rebase the whole image options selected

Figure 1: Rebasing the extension

The IDA script for exporting the data can be found here. You can run it using the following commands in IDA:

sys.path.append('/directory/containing/export/script')

import segexport

segexport.export('/path/to/output/file)

Loading and running

Loading the exported data should now be only the matter of memory mapping the correct addresses and copying the corresponding data from the exported file. You can see it in the load() function here.

However, since we are now loading and running kernel code in userspace, there will be functions that won’t run well or that we would want to change. One example for this are the kernel allocator functions that we’ll want to replace with the system malloc.

One way of replacing these functions would be to rewrite the prolog of each function we want to replace with a jump to its replacement. However, since we will later be using TinyInst to extract code coverage, there is a simpler way. We will simply write a breakpoint instruction to each function we want to replace. Since TinyInst is (among other things) a debugger, it will catch each of these breakpoints and, from the TinyInst process, we can replace the instruction pointer with the address of the corresponding replacement function. More details on this can be found in the next section.

Besides replacing the memory allocation functions, logging functions etc., we will also need to replace all functions that interact with the hardware that we can’t access from userspace (or, in our case, that isn’t even present on the machine). In the case of AV1 parsing code in the AppleAVD kernel extension, a function called AppleAVD::sendCommandToCommandGate gets called, which I assume is meant to communicate with the decoding hardware. Thus, as a part of the harness, this function was replaced with a function that always returns 0 (success).

The final code of the AV1 harness code can be found here. It can be compiled as

clang++ loader.cpp avdharness.cpp -oloader

and might need some additional entitlements to run which can be applied with

codesign --entitlements entitlements.txt -f -s - ./loader

Note that, in the harness code, although I tried to rely on symbol names instead of hardcoded offsets wherever possible, it still contains some struct offsets. This version of the harness was based on macos 14.5 kernel, which was the most recent OS version at the time the loader was written.

Writing a custom TinyInst module

This section explains the custom TinyInst module that accompanies the loader (and is required for the correct functioning of the loader). This code doesn’t contain anything specific for a particular kernel extension and thus can be reused as is. If you are not interested in how it works or writing custom TinyInst modules, then you can skip this section.

Firstly, since we will want to extract code coverage for the purposes of fuzzing, we will base our custom module on LiteCov, the “default” TinyInst module for code coverage:

class AVDInst : public LiteCov {

 

};

Secondly, we need a way for our custom loader to communicate with the TinyInst module

  • It needs to tell TinyInst which functions in the kext should be replaced with which replacement functions.
  • It needs to tell TinyInst where the kext was loaded so that TinyInst can instrument it.

While TinyInst provides an API for function hooking that we could use here, there is also a more direct (albeit, also more low-level) way. From our loader, we will simply call a function at some hardcoded non-mapped address. This will, once again, cause an exception that TinyInst (being a debugger) will catch, read the parameters from registers, do the required action and “return” (by replacing the instruction pointer with the value inside the link register). The loader uses the hardcoded address 0x747265706C616365 to register a replacement and 0x747265706C616366 to tell TinyInst about the address range to instrument:

#define TINYINST_REGISTER_REPLACEMENT 0x747265706C616365

#define TINYINST_CUSTOM_INSTRUMENT 0x747265706C616366

We can catch those in the exception handler of our custom module

bool AVDInst::OnException(Exception *exception_record) {

  …

  if(exception_address == TINYINST_REGISTER_REPLACEMENT) {

    RegisterReplacementHook();

    return true;

  }

  if(exception_address == TINYINST_CUSTOM_INSTRUMENT) {

    InstrumentCustomRange();

    return true;

  }

  …

}

and then read the parameters and do the required action

void AVDInst::RegisterReplacementHook() {

  uint64_t original_address = GetRegister(X0);

  uint64_t replacement_address = GetRegister(X1);

  redirects[original_address] = replacement_address;

  SetRegister(ARCH_PC, GetRegister(LR));

}

void AVDInst::InstrumentCustomRange() {

  uint64_t min_address = GetRegister(X0);

  uint64_t max_address = GetRegister(X1);

  InstrumentAddressRange("__custom_range__", min_address, max_address);

  SetRegister(ARCH_PC, GetRegister(LR));

}

Where InstrumentAddressRange is a recently added TinyInst function that will instrument all code between addresses given in its parameters. “__custom_range__” is simply a name that we give to this region of memory so we can differentiate between multiple instrumented modules (if there are more than one).

Next, TinInst needs to perform the actual function replacements. As explained above, this can be done in the exception handler of our module.

  auto iter = redirects.find(exception_address);

  if(iter != redirects.end()) {

    // printf("Redirecting...\n");

    SetRegister(ARCH_PC, iter->second);

    return true;

  }

This is mostly sufficient for running the kernel extension without instrumenting it (e.g. to collect coverage). However, if we also want to instrument the extension, then the process of instrumentation involves rewriting the extension code in another location and inserting e.g. additional instructions to record coverage. The consequence of this is that our breakpoint instructions (that we inserted for the purpose of redirects) will be rewritten at different addresses. We need to make TinyInst aware of this (as a side note, TinyInst Hook API does this under the hood, but it wasn’t used in this module). We can do this in the InstrumentInstruction function which gets called for every instruction as it’s being instrumented:

InstructionResult AVDInst::InstrumentInstruction(ModuleInfo *module,

                                        Instruction& inst,

                                        size_t bb_address,

                                        size_t instruction_address)

{

  auto iter = redirects.find(instruction_address);

  if(iter != redirects.end()) {

    instrumented_redirects[assembler_->Breakpoint(module)] = iter->second;

    return INST_STOPBB;

  }

  return LiteCov::InstrumentInstruction(module, inst, bb_address, instruction_address);

}

The INST_STOPBB return value tells TinyInst to stop processing the current basic blocks. Since on breakpoints/redirects, we redirect the execution to another function, no other instructions from the same basic block ever get executed and are thus unneeded.

After this, we now know the addresses of breakpoints (and the corresponding replacements) in both instrumented and non-instrumented code. The final exception handler of our custom module looks like this:

bool AVDInst::OnException(Exception *exception_record) {

  size_t exception_address;

  if(exception_record->type == BREAKPOINT)

  {

    exception_address = (size_t)exception_record->ip;

  } else if(exception_record->type == ACCESS_VIOLATION) {

    exception_address = (size_t)exception_record->access_address;

  } else {

    return LiteCov::OnException(exception_record);

  }

  if(exception_address == TINYINST_REGISTER_REPLACEMENT) {

    RegisterReplacementHook();

    return true;

  }

  if(exception_address == TINYINST_CUSTOM_INSTRUMENT) {

    InstrumentCustomRange();

    return true;

  }

  auto iter = redirects.find(exception_address);

  if(iter != redirects.end()) {

    // printf("Redirecting...\n");

    SetRegister(ARCH_PC, iter->second);

    return true;

  }

  iter = instrumented_redirects.find(exception_address);

  if(iter != instrumented_redirects.end()) {

    // printf("Redirecting...\n");

    SetRegister(ARCH_PC, iter->second);

    return true;

  }

  return LiteCov::OnException(exception_record);

}

The entire code, with all the housekeeping functions can be found here.

Fuzzing and findings

Once our custom module is ready, we still need to make sure TinyInst and Jackalope will use this module instead of the default LiteCov module. See the appropriate patches for TinyInst and Jackalope.

Our harness should now run correctly under TinyInst, both without and with coverage instrumentation:

./Jackalope/build/TinyInst/Release/litecov -- ./loader avd_rebased.dat -f <sample>

Where avd_rebased.dat contains the kernel extension code exported from IDA. We can also add the -trace_basic_blocks flag to trace basic blocks as they are being executed (primarily useful for debugging). We can also run a fuzzing session with Jackalope like this:

./Jackalope/build/Release/fuzzer -in in -out out -t 1000 -delivery shmem -target_module loader -target_method __Z4fuzzPc -nargs 1 -iterations 5000 -persist -loop -cmp_coverage -mute_child -nthreads 6 -- ./loader avd_rebased.dat -m @@

This tells jackalope to run in persistent mode (with the function “fuzz” being looped), with sample delivery over shared memory (-delivery shmem fuzzer flag and -m being implemented in the harness code).

Fuzzing is useful not only for finding bugs in the target, but in our case also for finding bugs in the harness, e.g. finding other kernel functions we need to replace in order for the target to work correctly.

After several iterations of fixups, the harness appeared to be working correctly. However, the fuzzer also caught some crashes that appeared to have been caused by genuine issues in the AV1 parsing code. I did a root cause analysis and reported the issues to Apple. The reports can be seen in the following entries in the Project Zero issue tracker:

Unfortunately, at the time of reporting these issues I still didn’t have access to a machine with AV1 decoding capabilities. Thus, instead of full end-to-end PoCs, the issues were reported in the form of a full root cause analysis and a binary stream that causes a crash when used as a parameter to a particular decoding function. Eventually, we did get a Macbook with a M3 chip that supports AV1 hardware decoding and tried to reproduce the reported issues. Unsurprisingly, all three issues reproduced exactly the same on the real hardware as in the userspace harness.

Conclusion

The goal of this project was to create userspace kernel extension fuzzing tooling that was as simple as possible, and at least one of the reasons for this simplicity was that it could be easily adapted to other pieces of kernel code. The process is versatile enough that it allowed us to fuzz AV1 parsing code that normally requires hardware we didn’t even have at the time. While the three issues found during this research are not critical, they demonstrate the correctness of the approach and the potential for finding other issues.

Friday, November 1, 2024

From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code

Posted by the Big Sleep team

Introduction

In our previous post, Project Naptime: Evaluating Offensive Security Capabilities of Large Language Models, we introduced our framework for large-language-model-assisted vulnerability research and demonstrated its potential by improving the state-of-the-art performance on Meta's CyberSecEval2 benchmarks. Since then, Naptime has evolved into Big Sleep, a collaboration between Google Project Zero and Google DeepMind.

Today, we're excited to share the first real-world vulnerability discovered by the Big Sleep agent: an exploitable stack buffer underflow in SQLite, a widely used open source database engine. We discovered the vulnerability and reported it to the developers in early October, who fixed it on the same day. Fortunately, we found this issue before it appeared in an official release, so SQLite users were not impacted.

We believe this is the first public example of an AI agent finding a previously unknown exploitable memory-safety issue in widely used real-world software. Earlier this year at the DARPA AIxCC event, Team Atlanta discovered a null-pointer dereference in SQLite, which inspired us to use it for our testing to see if we could find a more serious vulnerability.

We think that this work has tremendous defensive potential. Finding vulnerabilities in software before it's even released, means that there's no scope for attackers to compete: the vulnerabilities are fixed before attackers even have a chance to use them. Fuzzing has helped significantly, but we need an approach that can help defenders to find the bugs that are difficult (or impossible) to find by fuzzing, and we're hopeful that AI can narrow this gap. We think that this is a promising path towards finally turning the tables and achieving an asymmetric advantage for defenders.

The vulnerability itself is quite interesting, along with the fact that the existing testing infrastructure for SQLite (both through OSS-Fuzz, and the project's own infrastructure) did not find the issue, so we did some further investigation.

Methodology

A key motivating factor for Naptime and now for Big Sleep has been the continued in-the-wild discovery of exploits for variants of previously found and patched vulnerabilities. As this trend continues, it's clear that fuzzing is not succeeding at catching such variants, and that for attackers, manual variant analysis is a cost-effective approach.

We also feel that this variant-analysis task is a better fit for current LLMs than the more general open-ended vulnerability research problem. By providing a starting point – such as the details of a previously fixed vulnerability – we remove a lot of ambiguity from vulnerability research, and start from a concrete, well-founded theory: "This was a previous bug; there is probably another similar one somewhere".

Our project is still in the research stage, and we are currently using small programs with known vulnerabilities to evaluate progress. Recently, we decided to put our models and tooling to the test by running our first extensive, real-world variant analysis experiment on SQLite. We collected a number of recent commits to the SQLite repository, manually removing trivial and documentation-only changes. We then adjusted the prompt to provide the agent with both the commit message and a diff for the change, and asked the agent to review the current repository (at HEAD) for related issues that might not have been fixed.

Discovered Vulnerability

The vulnerability is an interesting one where a special sentinel value -1 is used in an (otherwise) index-typed field iColumn:

7476:   struct sqlite3_index_constraint {

7477:      int iColumn;              /* Column constrained.  -1 for ROWID */

7478:      unsigned char op;         /* Constraint operator */

7479:      unsigned char usable;     /* True if this constraint is usable */

7480:      int iTermOffset;          /* Used internally - xBestIndex should ignore */

7481:   } *aConstraint;            /* Table of WHERE clause constraints */

This pattern creates a potential edge-case that needs to be handled by all code that uses the field, since the expectation would be that a valid column index is non-negative.

The function seriesBestIndex failed to correctly handle this edge-case, resulting in a write into a stack buffer with a negative index when handling a query with a constraint on the rowid column. In the build that we provided to our agent, debug assertions were enabled, and this condition was checked by the assertion at line 706:

619 static int seriesBestIndex(

620   sqlite3_vtab *pVTab,

621   sqlite3_index_info *pIdxInfo

622 ){

...

630   int aIdx[7];           /* Constraints on start, stop, step, LIMIT, OFFSET,

631                          ** and value.  aIdx[5] covers value=, value>=, and

632                          ** value>,  aIdx[6] covers value<= and value< */

633   const struct sqlite3_index_constraint *pConstraint;

...

642   for(i=0; i<pIdxInfo->nConstraint; i++, pConstraint++){

643     int iCol;    /* 0 for start, 1 for stop, 2 for step */

644     int iMask;   /* bitmask for those column */

645     int op = pConstraint->op;

...

705     iCol = pConstraint->iColumn - SERIES_COLUMN_START;

706     assert( iCol>=0 && iCol<=2 );

707     iMask = 1 << iCol;

...

713     if( pConstraint->usable==0 ){

714       unusableMask |=  iMask;

715       continue;

716     }else if( op==SQLITE_INDEX_CONSTRAINT_EQ ){

717       idxNum |= iMask;

718       aIdx[iCol] = i;

719     }

720   }

In a release build, however, this assertion is not present, and in our testing (this will vary depending on compiler and optimization level) the subsequent write at line 718 will write below the aIdx buffer, corrupting the least significant 32 bits of the pConstraint pointer, which will be dereferenced in the next iteration of the loop, leading to a likely exploitable condition.

However, given this explanation of the vulnerability – it's not trivial (at least for us) as human researchers to understand precisely how to trigger it – clearly a constraint on the ROWID column would be a good starting point, but more detailed reading of the code would certainly be required. The agent already seems to know a lot more about SQLite than we do, so it can cut some corners!

One common case of this is that the model would immediately use the generate_series virtual table when generating testcases. (We also saw cases where the model researched the available virtual tables first, but it's clearly able to apply pre-existing knowledge here).

Trajectory Highlights

In this successful run based on Gemini 1.5 Pro, the seed commit was [1976c3f7]; which is a fairly large and non-obvious change. The bug found by our agent is only loosely related to the changes in the seed commit - this is not uncommon in manual variant analysis, understanding one bug in a codebase often leads a researcher to other problems.

Selected highlights are below, with our commentary in
italics – all text in the ASSISTANT blocks comes directly from the agent.

The agent produces a reasonable assessment of the changes in the diff, and uses this to form a starting point for research:

ASSISTANT


The CL improves the handling of WHERE terms shared between OR branches in virtual table queries by introducing the termFromWhereClause function. This function retrieves a specific term from a WhereClause, considering nested clauses. Prior to this change, common terms were likely not handled correctly due to an incorrect mapping of term indices, potentially leading to inefficient query plans or incorrect results. The most likely vulnerability would involve incorrect handling of WHERE clause constraints within the allocateIndexInfo and whereLoopAddVirtualOne functions, where the CL has modified how constraint offsets are used with the new helper function. I'll first examine the allocateIndexInfo function.

. . .

After a several (omitted) steps exploring the relevant code, the agent has formed a more concrete hypothesis and attempts to check it by running a modified version of the testcase from the patch, but this doesn't work since the testcase included in the patch relies on an extension that is not available in our configuration:

TOOL(debugger_run)

The program exited without crashing or hitting the breakpoint. The exit code was 1.

The output on STDERR was:

Runtime error near line 2: no such module: tcl

Parse error near line 3: no such table: x1


ASSISTANT


The program terminated because the TCL module was not available. The tests rely on the TCL virtual table module, which needs to be enabled when compiling SQLite. The provided program input is based on the bestindexC.test file, which uses the TCL module to test virtual table best index selection.

To fix this, the program input needs to be changed to use a different virtual table that does not require the TCL module. However, the vulnerability is related to virtual table query planning. So we still need a virtual table to trigger the issue.

I will use the code_browser_source tool to search for examples of built-in virtual tables in SQLite. I'll then adapt the input to use one of these and set a breakpoint in whereLoopAddVirtualOne again.

. . .

The agent adapts to this setback gracefully, recognises the steps it needs to take to fix the testcase, and a few steps later has retrieved all of the information it needs to build a testcase that reaches whereLoopAddVirtualOne:

ASSISTANT

The sqlite3_create_module function registers virtual table modules. The shell.c file contains several instances of this, including the generate_series, fsdir, completion, and expert modules. The json table also looks interesting. I'll focus on generate_series for now as a starting point. I'll try to craft a query that uses this module and might trigger incorrect constraint handling.

. . .


Note above that during the original discovery and reproduction of the vulnerability, the agent was quite vague about the "incorrect constraint handling" that it's trying to trigger; and in this case the full trajectories read a little bit like a human researcher - starting out with one idea, thinking of a few other ones while investigating, and in the end producing a test case that reproduces a slightly different bug. In this case, when whereLoopAddVirtualOne is invoked on a generate_series table, we end  up in seriesBestIndex, where the constraint is indeed handled incorrectly.

However, it's particularly interesting to see that once the program has crashed the model does a much better job of explaining things - while producing a root-cause analysis and a full crash report we can see the agent clearly crafting a testcase to trigger the specific, identified edge-case:

ASSISTANT

The relevant field is iColumn inside the sqlite3_index_constraint struct. This field represents the column number of the constraint. To cause the assertion failure, we need a constraint on a column with index greater than 3 or smaller than 1.

Let's craft an SQL query that involves constraints on these columns. The generate_series table has columns named value, start, step and stop. The indexes are:

- value: 0

- start: 1

- step: 2

- stop: 3

To trigger the bug, we can include a constraint on the ROWID. Constraints on the ROWID use iColumn = -1.

Here is an example query:

SELECT * FROM generate_series(1,10,1) WHERE ROWID = 1;

This query should cause a crash in the seriesBestIndex function.

. . .

At the end we get a clear and accurate summary of the issue that's almost ready to report directly!

TOOL(debugger_run)

The program crashed: SIGABRT at :0.

The output on STDERR was:

sqlite3: src/shell.c:6816: int seriesBestIndex(sqlite3_vtab *, sqlite3_index_info *): Assertion `iCol>=0 &&  iCol<=2' failed.

ASSISTANT(report_success)


The program crashes with the input
SELECT * FROM generate_series(1,10,1) WHERE ROWID = 1; due to an assertion failure in function seriesBestIndex.

The assertion
assert(iCol>=0 && iCol<=2) fails because the iColumn field in the sqlite3_index_info struct can contain values outside of the valid range for column indexes, such as -1 when a constraint involves the ROWID.

This leads to iCol being calculated as -2, causing the assertion to fail.

What about Fuzzing?

Given the apparent simplicity of the reproduction case, an obvious question arises: why wasn’t this bug discovered earlier by traditional fuzzing? The “simple” answer lies in the configuration of the fuzzing harnesses. The harness used by OSS-Fuzz isn't built with the generate_series extension enabled, and the alternative fuzzingshell.c harness contained an older version of the seriesBestIndex function, unaffected by the bug. Although the SQLite AFL repo contains a configuration for fuzzing the same CLI binary that we provided to the Big Sleep agent, it appears not to be widely used.

To understand whether the bug is truly “shallow", we attempted to rediscover it through fuzzing. We followed the fuzzing instructions from the SQLite documentation and used the CLI target. We also verified that the fuzzing corpus contained the required generate_series and rowid keywords before launching an AFL run. However, the issue remained undiscovered after 150 CPU-hours of fuzzing.

We then tried to simplify the task for the fuzzer by, for example, adding the necessary keywords to AFL's SQL dictionary. However, it seems the bug can only be quickly found if the corpus contains an example very close to the crashing input, as code coverage doesn't appear to be a reliable guide for this particular issue.

Admittedly, AFL isn't the most suitable tool for a text-based format like SQL, where most inputs are syntactically invalid and will be rejected by the parser. Nevertheless, it's interesting to compare this result with Michal Zalewski’s blog post on fuzzing SQLite from 2015. Back then, AFL was quite effective at uncovering bugs in SQLite; after years of fuzzing, it seems the tool has reached a natural saturation point. While our results so far seem minor in comparison to the dramatic step-change in effectiveness that came with the release of AFL, it's interesting to see that it has its own strengths and might be able to effectively uncover a distinct set of vulnerabilities.

Conclusion

For the team this is a moment of validation and success - finding a vulnerability in a widely-used and well fuzzed open source project is an exciting result! When provided with the right tools, current LLMs can perform vulnerability research.

However, we want to reiterate that these are highly experimental results. The position of the Big Sleep team is that at present, it's likely that a target-specific fuzzer would be at least as effective (at finding vulnerabilities).

We hope that in the future this effort will lead to a significant advantage to defenders - with the potential not only to find crashing testcases, but also to provide high-quality root-cause analysis, triaging and fixing issues could be much cheaper and more effective in the future. We aim to continue sharing our research in this space, keeping the gap between the public state-of-the-art and private state-of-the-art as small as possible.

The Big Sleep team will continue to work in this space, advancing Project Zero's mission of making 0-day hard.

The Big Sleep Team

This isn't just a Project Zero effort any more, and everyone who has contributed to this effort is listed below (names in alphabetical order):

Miltos Allamanis, Martin Arjovsky, Charles Blundell, Lars Buesing, Mark Brand, Sergei Glazunov, Dominik Maier, Petros Maniatis, Guilherme Marinho, Henryk Michalewski, Koushik Sen, Charles Sutton, Vaibhav Tulsyan, Marco Vanotti, Theophane Weber, Dan Zheng