Thursday, December 13, 2018

Adventures in Video Conferencing Part 5: Where Do We Go from Here?

Posted by Natalie Silvanovich, Project Zero

Overall, our video conferencing research found a total of 11 bugs in WebRTC, FaceTime and WhatsApp. The majority of these were found through less than 15 minutes of mutation-based fuzzing of RTP. We were surprised to find remote bugs so easily in code that is so widely distributed. There are several properties of video conferencing that likely led to the frequency and shallowness of these issues.

WebRTC Bug Reporting

When we started looking at WebRTC, we were surprised to discover that their website did not describe how to report vulnerabilities to the project. They had an open bug tracker, but no specific guidance on how to flag or report vulnerabilities. They also provided no security guidance for integrators, and there was no clear way for integrators to determine when they needed to update their source for security fixes. Many integrators seem to have branched WebRTC without consideration for applying security updates. The combination of these factors makes it more likely that vulnerabilities were not reported, that vulnerabilities or fixes got ‘lost’ in the tracker, that fixes regressed, or that fixes were not applied to implementations that use the source only in part.

We worked with the WebRTC team to add this guidance to the site, and to clarify their vulnerability reporting process. Despite these changes, several large software vendors reached out to our team with questions about how to fix the vulnerabilities we reported, which shows there is still a lack of clarity about how to patch WebRTC.

Video Conferencing Test Tools

We also discovered that most video conferencing solutions lack adequate test tools. In most implementations, there is no way to collect data that allows problems with an RTP stream to be diagnosed. The vendors we asked did not have such a tool, even internally. WebRTC had a mostly complete tool that allows streams to be recorded in the browser and replayed, but it did not work with streams that used non-default settings. This tool has now been updated to collect enough data to be able to replay any stream. The lack of tooling available to test RTP implementations likely contributed to the ease of finding vulnerabilities, and certainly made reproducing and reporting vulnerabilities more difficult.

Video Conferencing Standards

The standards that comprise video conferencing, such as RTP, RTCP and FEC, introduce a lot of complexity in achieving their goal of enabling reliable audio and video streams across any type of connection. While the majority of this complexity provides value to the end user, it also means that video conferencing is inherently difficult to implement securely.

The Scope of Video Conferencing

WebRTC has billions of users. While it was originally created for use in the Chrome browser, it is now integrated by at least two Android applications that eclipse Chrome in terms of users: Facebook and WhatsApp (which only uses part of WebRTC). It is also used by Firefox and Safari. It is likely that most mobile devices run multiple copies of the WebRTC library. The ubiquity of WebRTC coupled with the lack of a clear patch strategy makes it an especially concerning target for attackers.

Recommendations for Developers

This section contains recommendations for developers who are implementing video conferencing based on our observations from this research.

First, it is a good idea to use an existing solution for video conferencing (either WebRTC or PJSIP) as opposed to implementing a new one. Video conferencing is very complex, and every implementation we looked at had vulnerabilities, so it is unlikely a new implementation would avoid these problems. Existing solutions have undergone at least some security testing and would likely have fewer problems.

It is also advisable to avoid branching existing video conferencing code. We have received questions from vendors who have branched WebRTC, and it is clear that this makes patching vulnerabilities more difficult. While branching can solve problems in the short term, integrators often regret it in the long term.

It is important to have a patch strategy when implementing video conferencing, as there will inevitably be vulnerabilities found in any implementation that is used. Developers should understand how security patches are distributed for any third-party library they integrate, and have a plan for applying them as soon as they are available.

It is also important to have adequate test tools for a video conferencing application, even if a third-party implementation is used. It is a good idea to have a way to reproduce a call from end to end. This is useful in diagnosing crashes, which could have a security impact, as well as functional problems.

Several mobile applications we looked at had unnecessary attack surface. Specifically, codecs and other features of the video conferencing implementation were enabled and accessible via RTP even though no legitimate call would ever use them. WebRTC and PJSIP support disabling specific features such as codecs and FEC, and it is a good idea to disable the features that are not being used.
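
For example, one common way to restrict codecs in a WebRTC integration is to strip unused codecs from the SDP before it is sent. The following is a rough Python sketch of that idea, not production code; a native integrator can also disable codecs at build or configuration time.

import re

def disable_codec(sdp: str, codec: str) -> str:
    # payload types mapped to this codec, e.g. "a=rtpmap:111 opus/48000/2"
    pts = re.findall(r"a=rtpmap:(\d+) " + re.escape(codec), sdp)
    out = []
    for line in sdp.split("\r\n"):
        # drop the codec's attribute lines (rtpmap, fmtp, rtcp-fb)
        if any(re.match(r"a=(rtpmap|fmtp|rtcp-fb):%s\b" % pt, line) for pt in pts):
            continue
        # remove its payload types from the m= line's format list
        if line.startswith("m=audio") or line.startswith("m=video"):
            parts = line.split()
            line = " ".join(parts[:3] + [p for p in parts[3:] if p not in pts])
        out.append(line)
    return "\r\n".join(out)
A sketch of disabling a codec by filtering SDP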

Finally, video conferencing vulnerabilities can generally be split into those that require the target to answer the incoming call, and those that do not. Vulnerabilities that do not require the call to be answered are more dangerous. We observed that some video conferencing applications perform much more parsing of untrusted data before a call is answered than others. We recommend that developers put as much functionality after the call is answered as possible.

Tools

In order to open up the most popular video conferencing implementations to more security research, we are releasing the tools we developed to do this research. Street Party is a suite of tools that allows the RTP streams of video conferencing implementations to be viewed and modified. It includes:

  • WebRTC: instructions for recording and replaying RTP packets using WebRTC’s existing tools
  • FaceTime: hooks for recording and replaying FaceTime calls
  • WhatsApp: hooks for recording and replaying WhatsApp calls on Android

We hope these tools encourage even more investigation into the security properties of video conferencing. Contributions are welcome.

Conclusion

We reviewed WebRTC, FaceTime and WhatsApp and found 11 serious vulnerabilities in their video conferencing implementations. Accessing and altering their encrypted content streams required substantial tooling. We are releasing this tooling to enable additional security research on these targets. There are many properties of video conferencing that make it susceptible to vulnerabilities. Adequate testing, conservative design and frequent patching can reduce the security risk of video conferencing implementations.

Wednesday, December 12, 2018

Adventures in Video Conferencing Part 4: What Didn't Work Out with WhatsApp

Posted by Natalie Silvanovich, Project Zero

Not every attempt to find bugs is successful. When looking at WhatsApp, we spent a lot of time reviewing call signalling hoping to find a remote, interaction-less vulnerability. No such bugs were found. We are sharing our work with the hopes of saving other researchers the time it took to go down this very long road. Or maybe it will give others ideas for vulnerabilities we didn’t find.

As discussed in Part 1, signalling is the process through which video conferencing peers initiate a call. Usually, at least part of signalling occurs before the receiving peer answers the call. This means that if there is a vulnerability in the code that processes incoming signals before the call is answered, it does not require any user interaction.

WhatsApp implements signalling using a series of WhatsApp messages. Opening libwhatsapp.so in IDA, I found several native calls that handle incoming signalling messages.

Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOffer
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOfferAck
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallGroupInfo
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallRekeyRequest
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallFlowControl
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOfferReceipt
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallAcceptReceipt
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOfferAccept
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOfferPreAccept
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallVideoChanged
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallVideoChangedAck
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOfferReject
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallTerminate
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallTransport
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallRelayLatency
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallRelayElection
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallInterrupted
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallMuted
Java_com_whatsapp_voipcalling_Voip_nativeHandleWebClientMessage

Using apktool to extract the WhatsApp APK, it appears these native methods are called from a loop in the com.whatsapp.voipcalling.Voip class. Looking at the smali, signalling messages appear to be sent as WhatsApp messages via the WhatsApp server, and this loop handles the incoming messages.

Immediately, I noticed that there was a peer-to-peer encrypted portion of the message (the rest of the message is only encrypted peer-to-server). I thought this had the highest potential for reaching bugs, as the server would not be able to sanitize the data. To be able to read and alter encrypted packets, I set up a remote server with a python script that opens a socket. Whenever this socket receives data, the data is displayed on the screen, and I have the option of either sending the unaltered packet or altering the packet before it is sent. I then looked for the point in the WhatsApp smali where messages are peer-to-peer encrypted.
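
A minimal sketch of such a script might look like this (the port and the hex-based console workflow are illustrative):

import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 8000))   # port is arbitrary
server.listen(1)

while True:
    conn, addr = server.accept()
    data = conn.recv(4096)
    print("received %d bytes: %s" % (len(data), data.hex()))
    replacement = input("hex to send back (blank to echo unchanged): ").strip()
    conn.sendall(bytes.fromhex(replacement) if replacement else data)
    conn.close()
A sketch of the interception server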

Since WhatsApp uses libsignal for peer-to-peer encryption, I was able to find where messages are encrypted by matching log entries. I then added smali code that sends a packet with the bytes of the message to the server I set up, and then replaces it with the bytes the server returns (changing the size of the byte array if necessary). This allowed me to view and alter the peer-to-peer encrypted message. Making a call using this modified APK, I discovered that the peer-to-peer message was always exactly 24 bytes long, and appeared to be random. I suspected that this was the encryption key used by the call, and confirmed this by looking at the smali.

A single encryption key doesn’t have a lot of potential for malformed data to lead to bugs (I tried lengthening and shortening it to be safe, but got nothing but unexploitable null pointer issues), so I moved on to looking at the peer-to-server encrypted messages. Looking at the Voip loop in smali, the general flow is that the device receives an incoming message, deserializes it and, if it is of the right type, forwards it to the messaging loop. Certain properties are then read from the message, and it is forwarded to a processing function based on its type. The processing function reads even more properties, and calls one of the above native methods with the properties as its parameters. Most of these functions have more than 20 parameters.

Many of these functions perform logging when they are called, so by making a test call, I could figure out which functions get called before a call is picked up. It turns out that during a normal incoming call, the device only receives an offer and calls Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOffer, and then spawns the incoming call screen in WhatsApp. None of the other signal types are used until the call is picked up.

An immediate question I had was whether other signal types are processed if they are received before a call is picked up. Just because the initiating device never sends these signal types before the call is picked up doesn’t mean the receiving device wouldn’t process them if it received them.

Looking through the APK smali, I found the class com.whatsapp.voipcalling.VoiceService$DefaultSignalingCallback, which has several methods like sendOffer and sendAccept that appeared to send the messages that are processed by these native calls. I changed sendOffer to call other send methods, like sendAccept, instead of performing its normal messaging functionality. Trying this, I discovered that the Voip loop will process any signal type regardless of whether the call has been answered. The native methods will then parse the parameters, process them and put the results in a buffer, and then call a single method to process the buffer. It is only at that point that processing stops if the message is of the wrong type.

I then reviewed all of the above methods in IDA. The code was very conservatively written, and most needed checks were performed. However, there were a few areas that potentially had bugs that I wanted to investigate more. I decided that changing the parameters to calls in the com.whatsapp.voipcalling.VoiceService$DefaultSignalingCallback class was too slow to test the number of cases I wanted to test, and went looking for another way to alter the messages.

Ideally, I wanted a way to pass peer-to-server encrypted messages to my server before they were sent, so I could view and alter them. I went through the WhatsApp APK smali looking for a point after serialization but before encryption where I could add my smali function that sends and alters the packets. This was fairly difficult and time consuming; I eventually put my smali in every method that wrote to a non-file ByteArrayOutputStream in the com.whatsapp.protocol and com.whatsapp.messaging packages (about 10 total) and looked for where it got called. Once I figured out where it was called, I fixed that class so that anywhere a byte array was written out from a stream, it got sent to my server, and removed the other calls. (If you’re following along at home, the smali file I changed included the string “Double byte dictionary token out of range”, and the two methods I changed contained calls to toByteArray, and ended with invoking a protocol interface.) Looking at what got sent to my server, it seemed like a reasonably comprehensive collection of WhatsApp messages, and the signalling messages contained what I thought they would.

WhatsApp messages are in a compressed XMPP format. A lot of parsers have been written for reverse engineering this protocol, but I found the whatsapp-reveng parser worked the best. I did have to replace the tokens in whatsapp_defines.py with a list extracted from the APK for it to work correctly though. This made it easier to figure out what was in each packet sent to the server.

Playing with this a bit, I discovered that there are three types of checks on WhatsApp signalling messages. First, the server validates and modifies incoming signalling messages. Second, the messages are deserialized, which can cause errors if the format is incorrect, and generally limits the contents of the Java message object that is passed on. Finally, the native methods perform checks on their parameters.

These additional checks prevented several of the areas I thought were problems from actually being problems. For example, there is a function called by Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOffer that takes in an array of byte arrays, an array of integers and an array of booleans. It uses these values to construct candidates for the call. It checks that the array of byte arrays and the array of integers are of the same length before it loops through them, using values from each, but it does not perform the same check on the boolean array. I thought that this could go out of bounds, but it turns out that the integers and booleans are serialized as a vector of <int,bool> pairs, and the arrays are then copied from the vector, so it is not actually possible to send arrays with different lengths.

One area of the signalling messages that looked especially concerning was the voip_options field of the message. This field is never sent from the sending device, but is added to the message by the server before it is forwarded to the receiving device. It is a buffer in JSON format that is processed by the receiving device and contains dozens of configuration parameters.

{"aec":{"offset":"0","mode":"2","echo_detector_mode":"4","echo_detector_impl":"2","ec_threshold":"50","ec_off_threshold":"40","disable_agc":"1","algorithm":{"use_audio_packet_rate":"1","delay_based_bwe_trendline_filter_enabled":"1","delay_based_bwe_bitrate_estimator_enabled":"1","bwe_impl":"5"},"aecm_adapt_step_size":"2"},"agc":{"mode":"0","limiterenable":"1","compressiongain":"9","targetlevel":"1"},"bwe":{"use_audio_packet_rate":"1","delay_based_bwe_trendline_filter_enabled":"1","delay_based_bwe_bitrate_estimator_enabled":"1","bwe_impl":"5"},"encode":{"complexity":"5","cbr":"0"},"init_bwe":{"use_local_probing_rx_bitrate":"1","test_flags":"982188032","max_tx_rott_based_bitrate":"128000","max_bytes":"8000","max_bitrate":"350000"},"ns":{"mode":"1"},"options":{"connecting_tone_desc": "test","video_codec_priority":"2","transport_stats_p2p_threshold":"0.5","spam_call_threshold_seconds":"55","mtu_size":"1200","media_pipeline_setup_wait_threshold_in_msec":"1500","low_battery_notify_threshold":"5","ip_config":"1","enc_fps_over_capture_fps_threshold":"1","enable_ssrc_demux":"1","enable_preaccept_received_update":"1","enable_periodical_aud_rr_processing":"1","enable_new_transport_stats":"1","enable_group_call":"1","enable_camera_abtest_texture_preview":"1","enable_audio_video_switch":"1","caller_end_call_threshold":"1500","call_start_delay":"1200","audio_encode_offload":"1","android_call_connected_toast":"1"}
Sample voip_options (truncated)

If a peer could send a voip_options parameter to another peer, it would open up a lot of attack surface, including a JSON parser and the processing of these parameters. Since this parameter almost always appears in an offer, I tried modifying an offer to contain one, but the offer was rejected by the WhatsApp server with error 403. Looking at the binary, there were three other signal types in the incoming call flow that could accept a voip_options parameter. Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOfferAccept and Java_com_whatsapp_voipcalling_Voip_nativeHandleCallVideoChanged were accepted by the server if a voip_options parameter was included, but it was stripped before the message was sent to the peer. However, if a voip_options parameter was attached to a Java_com_whatsapp_voipcalling_Voip_nativeHandleCallGroupInfo message, it would be forwarded to the peer device. I confirmed this by sending malformed JSON and looking at the log of the receiving device for an error.

The voip_options parameter is processed by WhatsApp in three stages. First, the JSON is parsed into a tree. Then the tree is transformed to a map, so JSON object properties can be looked up efficiently even though there are dozens of them. Finally, WhatsApp goes through the map, looking for specific parameters and processes them, usually copying them to an area in memory where they will set a value relevant to the call being made.
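
WhatsApp’s actual map is a custom FarmHash-based slab (described below), but a loose Python analogue of the flattening step looks like this:

import json

def flatten(tree, prefix="", out=None):
    # collapse nested JSON objects into one flat map so parameters
    # like "aec.mode" can be looked up directly
    if out is None:
        out = {}
    for key, value in tree.items():
        path = prefix + "." + key if prefix else key
        if isinstance(value, dict):
            flatten(value, path, out)
        else:
            out[path] = value
    return out

params = flatten(json.loads('{"aec": {"mode": "2"}, "ns": {"mode": "1"}}'))
assert params["aec.mode"] == "2"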

Starting off with the JSON parser, it was clearly the PJSIP JSON parser. I compiled that code and fuzzed it, and found only one minor out-of-bounds read issue.

I then looked at the conversion of the JSON tree output from the parser into the map. The map is a very efficient structure. It is a hash map that uses FarmHash as its hashing algorithm, and it is designed so that the entire map is stored in a single slab of memory, even if the JSON objects are deeply nested. I looked at many open source projects containing similar structures, but could not find one that appeared to be the source of this implementation. I looked through the creation of this structure in great detail, looking especially for type confusion bugs as well as errors when the memory slab is expanded, but did not find any issues.

I also looked at the functions that go through the map and handle specific parameters. These functions are extremely long, and I suspect they are generated using a code generation tool such as bison. They mostly copy parameters into static areas of memory, at which point they become difficult to trace. I did not find any bugs in this area either. Other than going through parameter names and looking for values that seemed likely to cause problems, I did not do any analysis of how the values fetched from JSON are actually used. One parameter that seemed especially promising was an A/B test parameter called setup_video_stream_before_accept. I hoped that setting this would allow the device to accept RTP before the call is answered, which would make RTP bugs interaction-less, but I was unable to get this to work.

In the process of looking at this code, it became difficult to verify its functionality without the ability to debug it. Since WhatsApp ships an x86 library for Android, I wondered if it would be possible to run the JSON parser on Linux.

Tavis Ormandy created a tool that can load the libwhatsapp.so library on Linux and run native functions, so long as they do not have a dependency on the JVM. It works by patching the .dynamic ELF section to remove unnecessary dependencies, replacing DT_NEEDED tags with DT_DEBUG tags. We also needed to prevent constructors and destructors from running by changing the DT_INIT_ARRAYSZ and DT_FINI_ARRAYSZ entries to zero. With these changes in place, we could load the library using dlopen() and use dlsym() and dlclose() as normal.
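
As a rough Python sketch of that patching approach (using pyelftools; Tavis’s actual tool is a separate implementation):

import struct
from elftools.elf.elffile import ELFFile  # pip install pyelftools

DT_NEEDED, DT_DEBUG = 1, 21
DT_INIT_ARRAYSZ, DT_FINI_ARRAYSZ = 27, 28

def patch_dynamic(path):
    with open(path, 'rb') as f:
        elf = ELFFile(f)
        dyn = elf.get_section_by_name('.dynamic')
        offset, size = dyn['sh_offset'], dyn['sh_size']
        # Elf64_Dyn is a signed 8-byte tag plus an 8-byte value;
        # Elf32_Dyn uses 4-byte fields
        fmt = '<qQ' if elf.elfclass == 64 else '<iI'
    data = bytearray(open(path, 'rb').read())
    for off in range(offset, offset + size, struct.calcsize(fmt)):
        tag, val = struct.unpack_from(fmt, data, off)
        if tag == DT_NEEDED:
            # neutralize the dependency by rewriting the tag
            struct.pack_into(fmt, data, off, DT_DEBUG, val)
        elif tag in (DT_INIT_ARRAYSZ, DT_FINI_ARRAYSZ):
            # zero the array sizes so constructors/destructors never run
            struct.pack_into(fmt, data, off, tag, 0)
    with open(path + '.patched', 'wb') as f:
        f.write(data)
A sketch of the .dynamic patching approach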

Using this tool, I was able to look at the JSON parsing in more detail. I also set up distributed fuzzing of the JSON binary. Unfortunately, it did not uncover any bugs either.

Overall, WhatsApp signalling seemed like a promising attack surface, but we did not find any vulnerabilities in it. There were two areas where we were able to extend the attack surface beyond what is used in the basic call flow. First, it was possible to send signalling messages that should only be sent after a call is answered before the call is answered, and they were processed by the receiving device. Second, it was possible for a peer to send voip_options JSON to another device. WhatsApp could reduce the attack surface of signalling by removing these capabilities.

I made these suggestions to WhatsApp, and they responded that they were already aware of the first issue as well as variants of the second issue. They said they were in the process of limiting what signalling messages can be processed by the device before a call is answered. They had already fixed other issues where a peer could send voip_options JSON to another peer, and fixed the method I reported as well. They said they were also considering adding cryptographic signing to the voip_options parameter so a device can verify it came from the server, to further avoid issues like this. We appreciate their quick resolution of the voip_options issue and strong interest in implementing defense-in-depth measures.

In Part 5, we will discuss the conclusions of our research and make recommendations for better securing video conferencing.

Tuesday, December 11, 2018

Adventures in Video Conferencing Part 3: The Even Wilder World of WhatsApp

Posted by Natalie Silvanovich, Project Zero

WhatsApp is another video conferencing application that does not use WebRTC as its core implementation. Instead, it uses PJSIP, which contains some WebRTC code, but also contains a substantial amount of other code, and predates the WebRTC project. I fuzzed this implementation to see if it had similar results to WebRTC and FaceTime.

Fuzzing Set-up

PJSIP is open source, so it was easy to identify the PJSIP code in the Android WhatsApp binary (libwhatsapp.so). Since PJSIP uses the open source library libsrtp, I started off by opening the binary in IDA and searching for the string srtp_protect, the name of the function libsrtp uses for encryption. This led to a log entry emitted by a function that looked like srtp_protect. There was only one function in the binary that called this function, and it called memcpy shortly before the call. Some log entries before the call contained the file name srtp_transport.c, which exists in the PJSIP repository. The log entries in the WhatsApp binary say that the function being called is transport_send_rtp2. The PJSIP source only has a function called transport_send_rtp, but it looks similar to the function calling srtp_protect in WhatsApp, in that it has the same number of calls before and after the memcpy. Assuming that the code in WhatsApp is some variation of that code, the memcpy copies the entire unencrypted packet right before it is encrypted.

Hooking this memcpy seemed like a possible way to fuzz WhatsApp video calling. I started off by hooking memcpy for the entire app using a tool called Frida. This tool can easily hook native functions in Android applications, and I was able to see calls to memcpy from WhatsApp within minutes. Unfortunately though, video conferencing is very performance sensitive, and a delay sending video packets actually influences the contents of the next packet, so hooking every memcpy call didn’t seem practical. Instead, I decided to change the single memcpy to point to a function I wrote.
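
As an illustration, a minimal Frida script that observes every memcpy in WhatsApp might look like this (the process name and the simple size logging are illustrative):

import frida

# attach to the running WhatsApp process on a USB-connected device
session = frida.get_usb_device().attach("com.whatsapp")
script = session.create_script("""
Interceptor.attach(Module.findExportByName('libc.so', 'memcpy'), {
    onEnter: function (args) {
        send(args[2].toInt32());  // length of this copy
    }
});
""")
script.on('message', lambda message, data: print(message))
script.load()
input()  # keep the hooks alive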

I started off by writing a function in assembly that loaded a library from the filesystem using dlopen, retrieved a symbol by calling dlsym and then called into the library. Frida was very useful in debugging this, as it could hook calls to dlopen and dlsym to make sure they were being called correctly. I overwrote a function in the WhatsApp GIF transcoder with this function, as it is only used in sending text messages, which I didn’t plan to do with this altered version. I then set the memcpy call to point to this function instead of memcpy, using an online ARM branch finder.

sub_2F8CC
MOV             X21, X30 ; save the return address
MOV             X22, X0  ; save the memcpy arguments (dest, src, n)
MOV             X23, X1
MOV             X20, X2
MOV             X1, #1   ; mode for dlopen (RTLD_LAZY)
ADRP            X0, #aDataDataCom_wh@PAGE ; "/data/data/com.whatsapp/libn.so"
ADD             X0, X0, #aDataDataCom_wh@PAGEOFF ; "/data/data/com.whatsapp/libn.so"
BL              .dlopen
ADRP            X1, #aApthread@PAGE ; "apthread"
ADD             X1, X1, #aApthread@PAGEOFF ; "apthread"
BL              .dlsym   ; look up "apthread" in the loaded library
MOV             X8, X0
MOV             X0, X22  ; restore the memcpy arguments
MOV             X1, X23
MOV             X2, X20
NOP
BLR             X8       ; call the replacement function
MOV             X30, X21 ; restore the return address
RET
The library loading function

I then wrote a library for Android which had the same parameters as memcpy, but fuzzed and copied the buffer instead of just copying it, and put it on the filesystem where it would be loaded by dlopen. I then tried making a WhatsApp call with this setup. The video call looked like it was being fuzzed and crashed in roughly fifteen minutes.
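
The library itself is native code, but the mutation strategy can be sketched in Python (the bit-flip rate is an arbitrary choice):

import random

def fuzz_copy(buf: bytes, rate: float = 0.01) -> bytes:
    # flip a few random bits so most of each packet stays close to valid
    out = bytearray(buf)
    if not out:
        return buf
    for _ in range(max(1, int(len(out) * rate))):
        i = random.randrange(len(out))
        out[i] ^= 1 << random.randrange(8)
    return bytes(out)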

Replay Set-up

To replay the packets I added logging to the library, so that each buffer that was altered would also be saved to a file. Then I created a second library that copied the logged packets into the buffer being copied instead of altering it. This required modifying the WhatsApp binary slightly, because the logged packet will usually not be the same size as the packet currently being sent. I changed the length of the hooked memcpy to be passed by reference instead of by value, and then had the library change the length to the length of the logged packet, so that the value would be correct for the call to srtp_protect. Luckily, the buffer that the packet is copied into is a fixed length, so there was no concern that a valid packet would overflow it. A similar fixed-size buffer was also helpful in modifying FaceTime to replay packets of varying length, as described in the previous post.

This initial replay setup did not work, and looking at the logged packets, it turned out that WhatsApp uses four streams with different SSRCs for video conferencing (possibly one for video, one for audio, one for synchronization and one for good luck). The streams each had only one payload type, and they were all different, so it was fairly easy to map each SSRC to its stream. So I modified the replay library to determine the current SSRC for each stream based on the payload types of incoming packets, and then to replace the SSRC of the replayed packets with the correct one based on their payload type. This reliably replayed a WhatsApp call. I was then able to fuzz and reproduce crashes on WhatsApp.
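
A sketch of that SSRC fix-up in Python (the offsets are standard RTP: the 7-bit payload type is in byte 1, and the SSRC in bytes 8 through 11):

import struct

ssrc_by_pt = {}  # live SSRC for each payload type, learned from incoming packets

def learn_ssrc(incoming: bytes):
    ssrc_by_pt[incoming[1] & 0x7F] = struct.unpack_from(">I", incoming, 8)[0]

def fix_ssrc(logged: bytes) -> bytes:
    out = bytearray(logged)
    pt = out[1] & 0x7F
    if pt in ssrc_by_pt:
        struct.pack_into(">I", out, 8, ssrc_by_pt[pt])
    return bytes(out)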

Results

Using this setup, I reported one heap corruption issue on WhatsApp, CVE-2018-6344. This issue has since been fixed. After this issue was resolved, fuzzing did not yield any additional crashes with security impact, and we moved on to other methodologies. Part 4 will describe our other (unsuccessful) attempts to find vulnerabilities in WhatsApp.

Wednesday, December 5, 2018

Adventures in Video Conferencing Part 2: Fun with FaceTime

Posted by Natalie Silvanovich, Project Zero

FaceTime is Apple’s video conferencing application for iOS and Mac. It is closed source, and does not appear to use any third-party libraries for its core functionality. I wondered whether fuzzing the contents of FaceTime’s audio and video streams would lead to similar results as WebRTC.

Fuzzing Set-up

Philipp Hancke performed an excellent analysis of FaceTime’s architecture in 2015. It is similar to WebRTC, in that it exchanges signalling information in SDP format and then uses RTP for audio and video streams. Looking at the FaceTime implementation on a Mac, it seemed the bulk of the calling functionality of FaceTime is in a daemon called avconferenced. Opening the binary that supports its functionality, AVConference, in IDA, I found that it contains a function called SRTPEncryptData. This function calls CCCryptorUpdate, which appeared to encrypt RTP packets below the header.

To do a quick test of whether fuzzing was likely to be effective, I hooked this function and altered the underlying encrypted data. Normally, this can be done by setting the DYLD_INSERT_LIBRARIES environment variable, but since avconferenced is a daemon that restarts automatically when it dies, there wasn’t an easy way to set an environment variable. I eventually used insert_dylib to alter the AVConference binary to load a library on startup, and restarted the process. The loaded library used DYLD_INTERPOSE to replace CCCryptorUpdate with a version that fuzzed every input buffer (using fuzzer q from Part 1) before it was processed. This implementation had a lot of problems: it fuzzed both encryption and decryption, it affected every call to CCCryptorUpdate from avconferenced, not just the ones involved in SRTP, and there was no way to reproduce a crash. But using the modified FaceTime to call an iPhone led to video output that looked corrupted, and the phone crashed in a few minutes. This confirmed that this function was indeed where FaceTime calls are encrypted, and that fuzzing was likely to find bugs.

I made a few changes to the function that hooked CCCryptorUpdate to attempt to solve these problems. I limited fuzzing the input buffer to the two threads that write audio and video output to RTP, which also solved the problem of decrypted packets being fuzzed, as these threads only ever encrypt. I then added functionality that wrote the encrypted, fuzzed contents of each packet to a series of log files, so that test cases could be replayed. This required altering the sandbox of avconferenced so that it could write files to the log location, and adding spinlocks to the hook, as calling CCCryptorUpdate is thread safe, but logging packets isn’t.

Call Replay

I then wrote a second library that hooks CCCryptorUpdate and replays packets logged by the first library by copying the logged packets in sequence into the packet buffers passed into the function. Unfortunately, this required a small modification to the AVConference binary, as the SRTPEncryptData function does not respect the length returned by CCCryptorUpdate; instead, it assumes that the length of the encrypted data is the same as the length of the plaintext data, which is reasonable when CCCryptorUpdate isn’t being hooked. Since SRTPEncryptData always uses a large fixed-size buffer for encryption, and encryption is in-place, I changed the function to retrieve the length of the encrypted buffer from the very end of the buffer, which was set in the hooked CCCryptorUpdate call. Since RTP packets are typically much shorter than this buffer, the memory at the end is unlikely to be used for other purposes. Unfortunately though, even though the same encrypted data was being replayed to the target, it wasn’t being processed correctly by the receiving device.

To understand why requires an explanation of how RTP works. An RTP packet has the following format.


It contains several fields that affect how its payload is interpreted. The SSRC is a random identifier that identifies a stream. For example, in FaceTime the audio and video streams have different SSRCs. SSRCs can also help differentiate between streams in a situation where a user could potentially have an unlimited number of streams, for example, multiple participants in a video call. RTP packets also have a payload type (PT in the diagram) which is used to differentiate different types of data in the payload. The payload type for a certain data type is consistent across calls. In FaceTime, the video stream has a single payload type for video data, but the audio stream has two payload types, likely one for audio data and the other for synchronization. The marker (M in the diagram) field is also used by FaceTime to indicate when a packet is fragmented and needs to be reassembled.

From this it is clear that simply copying logged data into the current encrypted packet won’t work: the data needs to have the correct SSRC, payload type and marker, or it won’t be interpreted correctly. This wasn’t necessary in WebRTC, because I had enough control over WebRTC that I could create a connection with a single SSRC and payload type for fuzzing purposes. But there is no way to do this in FaceTime; even muting a video call leads to silent audio packets being sent, as opposed to the audio stream shutting down. So these values needed to be manually corrected.

An RTP feature called extensions made correcting these fields difficult. An extension is an optional header that can be added to an RTP packet. Extensions are not supposed to depend on the RTP payload to be interpreted, and they are often used to transmit network or display features. Some examples of supported extensions include the orientation extension, which tells the endpoint the orientation of the sending device, and the mute extension, which tells the endpoint whether the sending device is muted.

Extensions mean that even if it is possible to determine the payload type, marker and SSRC of data, this is not sufficient to replay the exact packet that was sent. Moreover, FaceTime creates extensions after the packet is encrypted, so it is not possible to create the complete RTP packet by hooking CCCryptorUpdate, because extensions could be added later.

At this point, it seemed necessary to hook sendmsg as well as CCCryptorUpdate. This would allow the outgoing RTP header to be modified once it is complete. There were a few challenges in doing this. To start, audio and video packets are sent by different threads in FaceTime, and can be reordered between the time they are encrypted and the time they are sent by sendmsg. So I couldn’t assume that an RTP packet received by sendmsg was necessarily the last one that was encrypted. There was also the problem that SSRCs are dynamic, so replaying an RTP packet with the same SSRC it was recorded with won’t work; it needs to have the new SSRC for the audio or video stream.

Note that in macOS Mojave, FaceTime can call sendmsg via either the AVConference binary or the IDSFoundation binary, depending on the network configuration. So to capture and replay unencrypted RTP traffic on newer systems, it is necessary to hook CCCryptorUpdate in AVConference and sendmsg in IDSFoundation (AVConference calls into IDSFoundation when it calls sendmsg). Otherwise, the process is the same as on older systems.

I ended up implementing a solution that records each packet’s unencrypted payload and its RTP header separately, using a snippet of the encrypted payload to pair each header with the correct unencrypted payload. Then, to replay packets, the packets encrypted in CCCryptorUpdate were replaced with the logged packets, and once the encrypted payload came through to sendmsg, the header was replaced with the logged one for that payload. Fortunately, the two streams with unique SSRCs used by FaceTime do not share any payload types, so it was possible to determine the new SSRC for each stream by waiting for an incoming packet with the correct payload type. Then in each subsequent packet, the SSRC was replaced with the correct one.
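
A sketch of that pairing logic in Python, assuming a short ciphertext prefix is unique enough to act as a key:

SNIPPET = 8   # assumed prefix length; must be long enough to be unique
logged = {}   # ciphertext snippet -> (RTP header, unencrypted payload)

def record(header: bytes, plaintext: bytes, ciphertext: bytes):
    logged[ciphertext[:SNIPPET]] = (header, plaintext)

def lookup(outgoing_ciphertext: bytes):
    # find the logged header that belongs to this outgoing payload
    return logged.get(outgoing_ciphertext[:SNIPPET])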

Unfortunately, this still did not replay a FaceTime call correctly, and calls often experienced decryption failures. I eventually determined that audio and video on FaceTime are encrypted with different keys, and updated the replay script to queue the CCCryptor used by the CCCryptorUpdate function based on whether it was handling audio or video content. Then in sendmsg, the entire logged RTP packet, including the unencrypted payload, was copied into the outgoing packet, the SSRC was fixed, and the payload was encrypted with the next CCCryptor out of the appropriate queue. If a CCCryptor wasn’t available, outgoing packets were dropped until a new one was created. At this point, it was possible to stop using the modified AVConference binary, as all the packet modification was now happening in sendmsg. This implementation still had reliability problems.

Digging more deeply into how FaceTime encryption works, packets are encrypted in CTR mode, which requires a counter. The counter is initialized to a unique value for each packet that is sent. During the initialization of the RTP stream, the peers exchange two 16-byte random tokens, one for audio and one for video. The counter value for each packet is then calculated by exclusive or-ing the token with several values found in the packet, including the SSRC and the sequence number. Only one value in this calculation, the sequence number, changes between each packet. So it is possible to calculate the counter value for each packet by knowing the initial counter value and sequence number, which can be retrieved by hooking CCCryptorCreateWithMode. The sequence number is xor-ed with the random token at index 0x12 when FaceTime constructs a counter, so by xor-ing this location with the initial sequence number and then a packet’s sequence number, the counter value for that packet can be calculated. The key can also be retrieved by hooking CCCryptorCreateWithMode. This allowed me to dispense with queuing cryptors, as I now had all the information I needed to construct a cryptor for any packet. This allowed packets to be encrypted faster and more accurately.

Sequence numbers still posed a problem though, as the initial sequence number of an RTP stream is randomly generated at the beginning of the call, and is different between subsequent calls. Also, sequence numbers are used to reconstruct video streams in order, so they need to be correct. I altered the replay tool to determine the starting sequence number of each stream, and to rebase each logged packet by adding its offset from the logged stream’s starting sequence number to the live stream’s starting sequence number. These two changes finally made the replay tool work, though replay gets slower and slower as a stream is replayed due to dropped packets.
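
The rebasing itself is simple; a sketch in Python:

def rebase_seq(logged_seq: int, logged_start: int, live_start: int) -> int:
    # keep the packet at the same offset from the live stream's starting
    # sequence number; RTP sequence numbers are 16 bits and wrap around
    return (live_start + (logged_seq - logged_start)) & 0xFFFF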

Results

Using this setup, I was able to fuzz FaceTime calls and reproduce the resulting crashes. I reported three bugs in FaceTime based on this work. All of these issues have been fixed in recent updates.

CVE-2018-4366 is an out-of-bounds read in video processing that occurs on Macs only.

CVE-2018-4367 is a stack corruption vulnerability that affects iOS and Mac. There are a fair number of variables on the stack of the affected function before the stack cookie, and several fuzz crashes due to this issue caused segmentation faults as opposed to stack_chk crashes, so it is likely exploitable.

CVE-2018-4384 is a kernel heap corruption issue in video processing that affects iOS. It is likely similar to this vulnerability found by Adam Donenfeld of Zimperium.

All of these issues took less than 15 minutes of fuzzing to find on a live device. Unfortunately, this was the limit of the fuzzing that could be performed on FaceTime: because it is closed source, it would be difficult to create a command line fuzzing tool with coverage, like we did for WebRTC.

In Part 3, we will look at video calling in WhatsApp.