Wednesday, December 12, 2018

Adventures in Video Conferencing Part 4: What Didn't Work Out with WhatsApp

Posted by Natalie Silvanovich, Project Zero

Not every attempt to find bugs is successful. When looking at WhatsApp, we spent a lot of time reviewing call signalling hoping to find a remote, interaction-less vulnerability. No such bugs were found. We are sharing our work with the hopes of saving other researchers the time it took to go down this very long road. Or maybe it will give others ideas for vulnerabilities we didn’t find.

As discussed in Part 1, signalling is the process through which video conferencing peers initiate a call. Usually, at least part of signalling occurs before the receiving peer answers the call. This means that if there is a vulnerability in the code that processes incoming signals before the call is answered, it does not require any user interaction.

WhatsApp implements signalling using a series of WhatsApp messages. Opening libwhatsapp.so in IDA, there are several native calls that handle incoming signalling messages.

Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOffer
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOfferAck
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallGroupInfo
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallRekeyRequest
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallFlowControl
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOfferReceipt
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallAcceptReceipt
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOfferAccept
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOfferPreAccept
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallVideoChanged
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallVideoChangedAck
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOfferReject
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallTerminate
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallTransport
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallRelayLatency
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallRelayElection
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallInterrupted
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallMuted
Java_com_whatsapp_voipcalling_Voip_nativeHandleWebClientMessage

Using apktool to extract the WhatsApp APK, it appears these natives are called from a loop in the com.whatsapp.voipcalling.Voip class. Looking at the smali, signalling messages appear to be sent as WhatsApp messages via the WhatsApp server, and this loop handles the incoming messages.

Immediately, I noticed that there was a peer-to-peer encrypted portion of the message (the rest of the message is only encrypted peer-to-server). I thought this had the highest potential for bugs, as the server cannot sanitize data it cannot read. In order to be able to read and alter encrypted packets, I set up a remote server with a Python script that opens a socket. Whenever this socket receives data, the data is displayed on the screen, and I have the option of either sending the unaltered packet or altering the packet before it is sent (a rough sketch of this intercept server follows). I then looked for the point in the WhatsApp smali where messages are peer-to-peer encrypted.
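
A minimal sketch of such an intercept server, shown here in C++ for consistency with the other examples (the port and framing are hypothetical, and the interactive alter step is left as a placeholder comment):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int srv = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr = {};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = INADDR_ANY;
  addr.sin_port = htons(8888);  // hypothetical port, matching the patched client
  bind(srv, (sockaddr*)&addr, sizeof(addr));
  listen(srv, 1);
  int cli = accept(srv, NULL, NULL);
  unsigned char buf[4096];
  ssize_t n;
  while ((n = recv(cli, buf, sizeof(buf), 0)) > 0) {
    for (ssize_t i = 0; i < n; i++) printf("%02x ", buf[i]);  // display the packet
    printf("\n");
    // ... optionally patch buf here before returning it to the client ...
    send(cli, buf, n, 0);  // echo the (possibly altered) bytes back
  }
  close(cli);
  close(srv);
  return 0;
}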

Since WhatsApp uses libsignal for peer-to-peer encryption, I was able to find where messages are encrypted by matching log entries. I then added smali code that sends a packet with the bytes of the message to the server I set up, and then replaces it with the bytes the server returns (changing the size of the byte array if necessary). This allowed me to view and alter the peer-to-peer encrypted message. Making a call using this modified APK, I discovered that the peer-to-peer message was always exactly 24 bytes long, and appeared to be random. I suspected that this was the encryption key used by the call, and confirmed this by looking at the smali.

A single encryption key doesn’t have a lot of potential for malformed data to lead to bugs (I tried lengthening and shortening it to be safe, but got nothing but unexploitable null pointer issues), so I moved on to the peer-to-server encrypted messages. Looking at the Voip loop in smali, the general flow is that the device receives an incoming message, deserializes it and, if it is of the right type, forwards it to the messaging loop. Certain properties are then read from the message, and it is forwarded to a processing function based on its type. The processing function reads even more properties and calls one of the above native methods with the properties as its parameters. Most of these functions have more than 20 parameters.

Many of these functions perform logging when they are called, so by making a test call, I could figure out which functions get called before a call is picked up. It turns out that during a normal incoming call, the device only receives an offer and calls Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOffer, and then spawns the incoming call screen in WhatsApp. All the other signal types are not used until the call is picked up.

An immediate question I had was whether other signal types are processed if they are received before a call is picked up. Just because the initiating device never sends these signal types before the call is picked up doesn’t mean the receiving device wouldn’t process them if it received them.

Looking through the APK smali, I found the class com.whatsapp.voipcalling.VoiceService$DefaultSignalingCallback, which has several methods like sendOffer and sendAccept that appeared to send the messages processed by these native calls. I changed sendOffer to call other send methods, like sendAccept, instead of its normal messaging functionality. Trying this, I discovered that the Voip loop will process any signal type regardless of whether the call has been answered. The native methods will then parse the parameters, process them, put the results in a buffer, and call a single method to process the buffer. Only at that point does processing stop if the message is of the wrong type.

I then reviewed all of the above methods in IDA. The code was very conservatively written, and most of the necessary checks were performed. However, there were a few areas that potentially had bugs that I wanted to investigate more. Changing the parameters to calls in com.whatsapp.voipcalling.VoiceService$DefaultSignalingCallback was too slow for the number of cases I wanted to test, so I went looking for another way to alter the messages.

Ideally, I wanted a way to pass peer-to-server encrypted messages to my server before they were sent, so I could view and alter them. I went through the WhatsApp APK smali looking for a point after serialization but before encryption where I could add my smali function that sends and alters the packets. This was fairly difficult and time consuming, so I eventually put my smali in every method that wrote to a non-file ByteArrayOutputStream in the com.whatsapp.protocol and com.whatsapp.messaging packages (about 10 total) and looked for where it got called. Once I figured that out, I fixed the class so that anywhere a byte array was written out from a stream, it got sent to my server, and removed the other calls. (If you’re following along at home, the smali file I changed included the string “Double byte dictionary token out of range”, and the two methods I changed contained calls to toByteArray and ended by invoking a protocol interface.) Looking at what got sent to my server, it seemed like a reasonably comprehensive collection of WhatsApp messages, and the signalling messages contained what I expected.

WhatsApp messages are in a compressed XMPP format. A lot of parsers have been written for reverse engineering this protocol, but I found the whatsapp-reveng parser worked the best. I did have to replace the tokens in whatsapp_defines.py with a list extracted from the APK for it to work correctly though. This made it easier to figure out what was in each packet sent to the server.

Playing with this a bit, I discovered that there are three types of checks on WhatsApp signalling messages. First, the server validates and modifies incoming signalling messages. Second, the messages are deserialized, which can cause errors if the format is incorrect and generally limits the contents of the Java message object that is passed on. Finally, the native methods perform checks on their parameters.

These additional checks prevented several of the areas I thought were problems from actually being problems. For example, there is a function called by Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOffer that takes in an array of byte arrays, an array of integers and an array of booleans. It uses these values to construct candidates for the call. It checks that the array of byte arrays and the array of integers are of the same length before it loops through them, using values from each, but it does not perform the same check on the boolean array. I thought that this could go out of bounds, but it turns out that the integer and booleans are serialized as a vector of <int,bool> pairs, and the arrays are then copied from the vector, so it is not actually possible to send arrays with different lengths.
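
In code, the reasoning looks something like the following sketch (the types are invented for illustration):

#include <utility>
#include <vector>

// Illustration of why the suspected out-of-bounds was unreachable: the
// integers and booleans arrive serialized as a vector of <int,bool> pairs,
// so the arrays copied out of it always have equal lengths.
void SplitCandidateFields(const std::vector<std::pair<int, bool>>& pairs,
                          std::vector<int>* ints, std::vector<bool>* bools) {
  for (const auto& p : pairs) {
    ints->push_back(p.first);
    bools->push_back(p.second);  // lengths stay equal by construction
  }
}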

One area of the signalling messages that looked especially concerning was the voip_options field of the message. This field is never sent from the sending device, but is added to the message by the server before it is forwarded to the receiving device. It is a buffer in JSON format that is processed by the receiving device and contains dozens of configuration parameters.

{"aec":{"offset":"0","mode":"2","echo_detector_mode":"4","echo_detector_impl":"2","ec_threshold":"50","ec_off_threshold":"40","disable_agc":"1","algorithm":{"use_audio_packet_rate":"1","delay_based_bwe_trendline_filter_enabled":"1","delay_based_bwe_bitrate_estimator_enabled":"1","bwe_impl":"5"},"aecm_adapt_step_size":"2"},"agc":{"mode":"0","limiterenable":"1","compressiongain":"9","targetlevel":"1"},"bwe":{"use_audio_packet_rate":"1","delay_based_bwe_trendline_filter_enabled":"1","delay_based_bwe_bitrate_estimator_enabled":"1","bwe_impl":"5"},"encode":{"complexity":"5","cbr":"0"},"init_bwe":{"use_local_probing_rx_bitrate":"1","test_flags":"982188032","max_tx_rott_based_bitrate":"128000","max_bytes":"8000","max_bitrate":"350000"},"ns":{"mode":"1"},"options":{"connecting_tone_desc": "test","video_codec_priority":"2","transport_stats_p2p_threshold":"0.5","spam_call_threshold_seconds":"55","mtu_size":"1200","media_pipeline_setup_wait_threshold_in_msec":"1500","low_battery_notify_threshold":"5","ip_config":"1","enc_fps_over_capture_fps_threshold":"1","enable_ssrc_demux":"1","enable_preaccept_received_update":"1","enable_periodical_aud_rr_processing":"1","enable_new_transport_stats":"1","enable_group_call":"1","enable_camera_abtest_texture_preview":"1","enable_audio_video_switch":"1","caller_end_call_threshold":"1500","call_start_delay":"1200","audio_encode_offload":"1","android_call_connected_toast":"1"}
Sample voip_options (truncated)

If a peer could send a voip_options parameter to another peer, it would open up a lot of attack surface, including a JSON parser and the processing of these parameters. Since this parameter almost always appears in an offer, I tried modifying an offer to contain one, but the offer was rejected by the WhatsApp server with error 403. Looking at the binary, there were three other signal types in the incoming call flow that could accept a voip_options parameter. Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOfferAccept and Java_com_whatsapp_voipcalling_Voip_nativeHandleCallVideoChanged were accepted by the server if a voip_options parameter was included, but it was stripped before the message was sent to the peer. However, if a voip_options parameter was attached to a Java_com_whatsapp_voipcalling_Voip_nativeHandleCallGroupInfo message, it would be forwarded to the peer device. I confirmed this by sending malformed JSON looking at the log of the receiving device for an error.

The voip_options parameter is processed by WhatsApp in three stages. First, the JSON is parsed into a tree. Then the tree is transformed to a map, so JSON object properties can be looked up efficiently even though there are dozens of them. Finally, WhatsApp goes through the map, looking for specific parameters and processes them, usually copying them to an area in memory where they will set a value relevant to the call being made.
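
As a rough illustration of the second stage, here is a sketch that flattens a parse tree into a path-keyed lookup map. The JsonNode type is invented for illustration, and WhatsApp’s real structure is the custom slab-allocated map described below, not std::map:

#include <map>
#include <string>
#include <vector>

// Simplified parse-tree node, invented for illustration.
struct JsonNode {
  std::string name;
  std::string value;               // set for leaf values
  std::vector<JsonNode> children;  // set for objects
};

// Flatten the tree so a parameter like "aec.mode" can be looked up directly.
void Flatten(const JsonNode& node, const std::string& prefix,
             std::map<std::string, std::string>* out) {
  std::string path = prefix.empty() ? node.name : prefix + "." + node.name;
  if (node.children.empty()) {
    (*out)[path] = node.value;
  } else {
    for (const JsonNode& child : node.children) Flatten(child, path, out);
  }
}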

Starting off with the JSON parser, it was clearly the one from PJSIP. I compiled that code and fuzzed it, and found only one minor out-of-bounds read issue.
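
A libFuzzer-style harness along these lines can drive pj_json_parse from pjlib-util. This is a sketch based on the public PJSIP API; the pool sizes are arbitrary and initialization is minimal:

#include <pjlib.h>
#include <pjlib-util.h>
#include <cstdint>
#include <cstring>

extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t len) {
  static pj_caching_pool cp;
  static bool initialized = false;
  if (!initialized) {
    pj_init();
    pj_caching_pool_init(&cp, NULL, 0);
    initialized = true;
  }
  pj_pool_t* pool = pj_pool_create(&cp.factory, "fuzz", 4096, 4096, NULL);
  // pj_json_parse parses in place, so hand it a writable copy.
  char* buf = (char*)pj_pool_alloc(pool, len + 1);
  memcpy(buf, data, len);
  unsigned size = (unsigned)len;
  pj_json_err_info err;
  pj_json_parse(pool, buf, &size, &err);
  pj_pool_release(pool);
  return 0;
}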

I then looked at the conversion of the JSON tree output from the parser into the map. The map is a very efficient structure. It is a hash map that uses FarmHash as its hashing algorithm, and it is designed so that the entire map is stored in a single slab of memory, even if the JSON objects are deeply nested. I looked at many open source projects that contained similar structures, but could not find one with a matching design. I looked through the creation of this structure in great detail, looking especially for type confusion bugs as well as errors when the memory slab is expanded, but did not find any issues.

I also looked at the functions that go through the map and handle specific parameters. These functions are extremely long, and I suspect they are generated using a code generation tool such as bison. They mostly copy parameters into static areas of memory, at which point they become difficult to trace. I did not find any bugs in this area either. Other than going through parameter names and looking for values that seemed likely to cause problems, I did not do any analysis of how the values fetched from the JSON are actually used. One parameter that seemed especially promising was an A/B test parameter called setup_video_stream_before_accept. I hoped that setting this would allow the device to accept RTP before the call is answered, which would make RTP bugs interaction-less, but I was unable to get this to work.

In the process of looking at this code, it became difficult to verify its functionality without the ability to debug it. Since WhatsApp ships an x86 library for Android, I wondered if it would be possible to run the JSON parser on Linux.

Tavis Ormandy created a tool that can load the libwhatsapp.so library on Linux and run native functions, so long as they do not have a dependency on the JVM. It works by patching the .dynamic ELF section to remove unnecessary dependencies, replacing DT_NEEDED tags with DT_DEBUG tags. We also needed to remove constructors and destructors by changing the DT_FINI_ARRAYSZ and DT_INIT_ARRAYSZ entries to zero. With these changes in place, we could load the library using dlopen() and use dlsym() and dlclose() as normal.
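
A minimal sketch of that patching approach for a 64-bit library (error handling omitted):

#include <elf.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char** argv) {
  if (argc != 2) { fprintf(stderr, "usage: %s lib.so\n", argv[0]); return 1; }
  int fd = open(argv[1], O_RDWR);
  struct stat st;
  fstat(fd, &st);
  unsigned char* base = (unsigned char*)mmap(NULL, st.st_size,
      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  Elf64_Ehdr* ehdr = (Elf64_Ehdr*)base;
  Elf64_Phdr* phdr = (Elf64_Phdr*)(base + ehdr->e_phoff);
  for (int i = 0; i < ehdr->e_phnum; i++) {
    if (phdr[i].p_type != PT_DYNAMIC) continue;
    for (Elf64_Dyn* dyn = (Elf64_Dyn*)(base + phdr[i].p_offset);
         dyn->d_tag != DT_NULL; dyn++) {
      if (dyn->d_tag == DT_NEEDED) dyn->d_tag = DT_DEBUG;  // drop the dependency
      if (dyn->d_tag == DT_INIT_ARRAYSZ || dyn->d_tag == DT_FINI_ARRAYSZ)
        dyn->d_un.d_val = 0;  // skip constructors and destructors
    }
  }
  munmap(base, st.st_size);
  close(fd);
  return 0;
}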

Using this tool, I was able to look at the JSON parsing in more detail. I also set up distributed fuzzing of the JSON binary. Unfortunately, it did not uncover any bugs either.

Overall, WhatsApp signalling seemed like a promising attack surface, but we did not find any vulnerabilities in it. There were two areas where we were able to extend the attack surface beyond what is used in the basic call flow. First, it was possible to send signalling messages that should only be sent after a call is answered before the call is answered, and they were processed by the receiving device. Second, it was possible for a peer to send voip_options JSON to another device. WhatsApp could reduce the attack surface of signalling by removing these capabilities.

I made these suggestions to WhatsApp, and they responded that they were already aware of the first issue as well as variants of the second issue. They said they were in the process of limiting what signalling messages can be processed by the device before a call is answered. They had already fixed other issues where a peer can send voip_options JSON to another peer, and fixed the method I reported as well. They said they are also considering adding cryptographic signing to the voip_options parameter so a device can verify it came from the server to further avoid issues like this. We appreciate their quick resolution of the voip_options issue and strong interest in implementing defense-in-depth measures.

In Part 5, we will discuss the conclusions of our research and make recommendations for better securing video conferencing.

Tuesday, December 11, 2018

Adventures in Video Conferencing Part 3: The Even Wilder World of WhatsApp

Posted by Natalie Silvanovich, Project Zero

WhatsApp is another application that supports video conferencing that does not use WebRTC as its core implementation. Instead, it uses PJSIP, which contains some WebRTC code, but also contains a substantial amount of other code, and predates the WebRTC project. I fuzzed this implementation to see if it had similar results to WebRTC and FaceTime.

Fuzzing Set-up

PJSIP is open source, so it was easy to identify the PJSIP code in the Android WhatsApp binary (libwhatsapp.so). Since PJSIP uses the open source library libsrtp, I started off by opening the binary in IDA and searching for the string srtp_protect, the name of the function libsrtp uses for encryption. This led to a log entry emitted by a function that looked like srtp_protect. There was only one function in the binary that called this function, and it called memcpy shortly before that call. Some log entries before the call contained the file name srtp_transport.c, which exists in the PJSIP repository. The log entries in the WhatsApp binary say that the function being called is transport_send_rtp2; the PJSIP source only has a function called transport_send_rtp, but it looks similar to the function calling srtp_protect in WhatsApp, in that it has the same number of calls before and after the memcpy. Assuming that the code in WhatsApp is some variation of that code, the memcpy copies the entire unencrypted packet right before it is encrypted.

Hooking this memcpy seemed like a possible way to fuzz WhatsApp video calling. I started off by hooking memcpy for the entire app using a tool called Frida. This tool can easily hook native functions in Android applications, and I was able to see calls to memcpy from WhatsApp within minutes. Unfortunately though, video conferencing is very performance sensitive, and a delay sending video packets actually influences the contents of the next packet, so hooking every memcpy call didn’t seem practical. Instead, I decided to change the single memcpy call to point to a function I wrote.

I started off by writing a function in assembly that loaded a library from the filesystem using dlopen, retrieved a symbol by calling dlsym and then called into the library. Frida was very useful in debugging this, as it could hook calls to dlopen and dlsym to make sure they were being called correctly. I overwrote a function in the WhatsApp GIF transcoder with this function, as it is only used in sending text messages, which I didn’t plan to do with this altered version. I then patched the memcpy call to branch to this function instead of memcpy, computing the branch instruction with an online ARM branch finder.

sub_2F8CC
MOV             X21, X30        ; save return address
MOV             X22, X0         ; save the memcpy arguments: dst
MOV             X23, X1         ; src
MOV             X20, X2         ; len
MOV             X1, #1          ; mode = RTLD_LAZY
ADRP            X0, #aDataDataCom_wh@PAGE ; "/data/data/com.whatsapp/libn.so"
ADD             X0, X0, #aDataDataCom_wh@PAGEOFF ; "/data/data/com.whatsapp/libn.so"
BL              .dlopen         ; X0 = library handle
ADRP            X1, #aApthread@PAGE ; "apthread"
ADD             X1, X1, #aApthread@PAGEOFF ; "apthread"
BL              .dlsym          ; X0 = address of "apthread" in the library
MOV             X8, X0
MOV             X0, X22         ; restore the memcpy arguments
MOV             X1, X23
MOV             X2, X20
NOP
BLR             X8              ; call into the loaded library
MOV             X30, X21        ; restore return address
RET
The library loading function

I then wrote a library for Android which had the same parameters as memcpy, but fuzzed and copied the buffer instead of just copying it, and put it on the filesystem where it would be loaded by dlopen. I then tried making a WhatsApp call with this setup. The video call looked like it was being fuzzed and crashed in roughly fifteen minutes.

Replay Set-up


To replay the packets, I added logging to the library so that each buffer that was altered would also be saved to a file. Then I created a second library that copied the logged packets into the buffer being copied, instead of altering it. This required modifying the WhatsApp binary slightly, because a logged packet will usually not be the same size as the packet currently being sent. I changed the length of the hooked memcpy to be passed by reference instead of by value, and had the library change the length to that of the logged packet, so the length would be correct for the subsequent call to srtp_protect (a sketch of this hook follows). Luckily, the buffer that the packet is copied into has a fixed length, as is common in RTP processing since packet sizes are bounded by the MTU, so there is no concern that a valid packet will overflow it. A similar trick was helpful in modifying FaceTime to replay packets of varying length, as described in the previous post.
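
A minimal sketch of what that replacement might look like, with the length passed by reference so the replay library can substitute both the logged bytes and their length (next_logged_packet is a hypothetical helper that returns the next saved packet, or NULL when the log is exhausted):

#include <cstring>

// Hypothetical helper provided by the replay library.
const unsigned char* next_logged_packet(size_t* len_out);

// Stands in for the hooked memcpy at the call site before srtp_protect.
// The length parameter is a pointer, so the caller sees the logged length.
void* replay_copy(void* dst, const void* src, size_t* len) {
  size_t logged_len = 0;
  const unsigned char* logged = next_logged_packet(&logged_len);
  if (logged != NULL) {
    memcpy(dst, logged, logged_len);  // substitute the logged packet
    *len = logged_len;                // srtp_protect encrypts this length
  } else {
    memcpy(dst, src, *len);           // fall back to normal behavior
  }
  return dst;
}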

This initial replay setup did not work. Looking at the logged packets, it turned out that WhatsApp uses four streams with different SSRCs for video conferencing (possibly one for video, one for audio, one for synchronization and one for good luck). Each stream had only one payload type, and they were all different, so it was fairly easy to map each SSRC to its stream. I modified the replay library to determine the current SSRC for each stream based on the payload types of incoming packets, and then to replace the SSRC of each replayed packet with the correct one based on its payload type, as sketched below. This reliably replayed a WhatsApp call, and I was then able to fuzz and reproduce crashes on WhatsApp.
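
A sketch of that SSRC fix-up (in the 12-byte RTP fixed header, the payload type is the low 7 bits of byte 1 and the SSRC is bytes 8-11):

#include <cstdint>
#include <cstring>
#include <map>

// Learn the live SSRC for each payload type from incoming packets, then
// stamp replayed packets with the matching live SSRC.
static std::map<int, uint32_t> live_ssrc_by_pt;

void LearnSsrc(const uint8_t* pkt, size_t len) {
  if (len < 12) return;
  uint32_t ssrc;
  memcpy(&ssrc, pkt + 8, 4);  // kept in network byte order; only used for matching
  live_ssrc_by_pt[pkt[1] & 0x7f] = ssrc;
}

void FixReplayedSsrc(uint8_t* pkt, size_t len) {
  if (len < 12) return;
  auto it = live_ssrc_by_pt.find(pkt[1] & 0x7f);
  if (it != live_ssrc_by_pt.end())
    memcpy(pkt + 8, &it->second, 4);  // substitute the live SSRC
}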

Results

Using this setup, I reported one heap corruption issue on WhatsApp, CVE-2018-6344. This issue has since been fixed. After this issue was resolved, fuzzing did not yield any additional crashes with security impact, and we moved on to other methodologies. Part 4 will describe our other (unsuccessful) attempts to find vulnerabilities in WhatsApp.

Wednesday, December 5, 2018

Adventures in Video Conferencing Part 2: Fun with FaceTime

Posted by Natalie Silvanovich, Project Zero

FaceTime is Apple’s video conferencing application for iOS and Mac. It is closed source, and does not appear to use any third-party libraries for its core functionality. I wondered whether fuzzing the contents of FaceTime’s audio and video streams would lead to similar results as WebRTC.

Fuzzing Set-up


Philipp Hancke performed an excellent analysis of FaceTime’s architecture in 2015. FaceTime is similar to WebRTC in that it exchanges signalling information in SDP format and then uses RTP for audio and video streams. Looking at the FaceTime implementation on a Mac, the bulk of the calling functionality is in a daemon called avconferenced. Opening the binary that supports its functionality, AVConference, in IDA, I found a function called SRTPEncryptData. This function calls CCCryptorUpdate, which appeared to encrypt RTP packets below the header.

To do a quick test of whether fuzzing was likely to be effective, I hooked this function and altered the underlying encrypted data. Normally, this can be done by setting the DYLD_INSERT_LIBRARIES environment variable, but since avconferenced is a daemon that restarts automatically when it dies, there wasn’t an easy way to set an environment variable. I eventually used insert_dylib to alter the AVConference binary to load a library on startup, and restarted the process. The loaded library used DYLD_INTERPOSE to replace CCCryptorUpdate with a version that fuzzed every input buffer (using fuzzer q from Part 1) before it was processed. This implementation had a lot of problems: it fuzzed both encryption and decryption, it affected every call to CCCryptorUpdate from avconferenced, not just the ones involved in SRTP, and there was no way to reproduce a crash. But using the modified FaceTime to call an iPhone led to video output that looked corrupted, and the phone crashed in a few minutes. This confirmed that this function was indeed where FaceTime calls are encrypted, and that fuzzing was likely to find bugs.
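
The interposing follows the standard dyld pattern; a minimal sketch of the hook (fuzz is the fuzzer q function shown in Part 1, and the real version also gated on thread and logged packets):

#include <CommonCrypto/CommonCryptor.h>
#include <cstddef>

extern void fuzz(char* buf, int len);  // fuzzer q from Part 1

// Standard dyld interposing macro (see dyld's dyld-interposing.h).
#define DYLD_INTERPOSE(_replacement, _replacee)                            \
  __attribute__((used)) static struct {                                    \
    const void* replacement;                                               \
    const void* replacee;                                                  \
  } _interpose_##_replacee                                                 \
      __attribute__((section("__DATA,__interpose"))) = {                   \
          (const void*)(unsigned long)&_replacement,                       \
          (const void*)(unsigned long)&_replacee};

static CCCryptorStatus fuzzed_CCCryptorUpdate(CCCryptorRef c, const void* dataIn,
                                              size_t dataInLen, void* dataOut,
                                              size_t dataOutAvail,
                                              size_t* dataOutMoved) {
  fuzz((char*)dataIn, (int)dataInLen);  // mutate the input buffer first
  return CCCryptorUpdate(c, dataIn, dataInLen, dataOut, dataOutAvail,
                         dataOutMoved);
}
DYLD_INTERPOSE(fuzzed_CCCryptorUpdate, CCCryptorUpdate)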

I made a few changes to the function that hooked CCCryptorUpdate to attempt to solve these problems. I limited fuzzing the input buffer to the two threads that write audio and video output to RTP, which also solved the problem of decrypted packets being fuzzed, as these threads only ever encrypt. I then added functionality that wrote the encrypted, fuzzed contents of each packet to a series of log files, so that test cases could be replayed. This required altering the sandbox of avconferenced so that it could write files to the log location, and adding spinlocks to the hook, as calling CCCryptorUpdate is thread safe, but logging packets isn’t.

Call Replay


I then wrote a second library that hooks CCCryptorUpdate and replays packets logged by the first library by copying the logged packets in sequence into the packet buffers passed into the function. Unfortunately, this required a small modification to the AVConference binary, as the SRTPEncryptData function does not respect the length returned by CCCryptorUpdate; instead, it assumes that the length of the encrypted data is the same as the length of the plaintext data, which is reasonable when CCCryptorUpdate isn’t being hooked. Since SRTPEncryptData always uses a large fixed-size buffer for encryption, and encryption is in-place, I changed the function to retrieve the length of the encrypted buffer from the very end of the buffer, which was set in the hooked CCCryptorUpdate call. Since RTP packets are typically much shorter than this buffer, the end of the buffer is unlikely to be used for anything else. Unfortunately though, even though the same encrypted data was being replayed to the target, it wasn’t being processed correctly by the receiving device.

To understand why requires an explanation of how RTP works. An RTP packet has the following format.
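
The fixed header can be sketched as a struct (RFC 3550):

#include <cstdint>

// RTP fixed header (RFC 3550). Multi-byte fields are in network byte order,
// and an optional CSRC list and extension header follow.
struct RtpHeader {
  uint8_t vpxcc;       // version (2 bits), padding, extension, CSRC count
  uint8_t mpt;         // marker (1 bit), payload type (7 bits)
  uint16_t seq;        // sequence number
  uint32_t timestamp;  // media timestamp
  uint32_t ssrc;       // stream identifier
};

inline int PayloadType(const RtpHeader& h) { return h.mpt & 0x7f; }
inline bool Marker(const RtpHeader& h) { return (h.mpt & 0x80) != 0; }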


It contains several fields that impact how its payload is interpreted. The SSRC is a random identifier that identifies a stream. For example, in FaceTime the audio and video streams have different SSRCs. SSRCs also help differentiate between streams in a situation where a user could potentially have an unlimited number of streams, for example, multiple participants in a video call. RTP packets also have a payload type (PT above) which is used to differentiate different types of data in the payload. The payload type for a certain data type is consistent across calls. In FaceTime, the video stream has a single payload type for video data, but the audio stream has two payload types, likely one for audio data and the other for synchronization. FaceTime also uses the marker field (M above) to indicate that a packet is fragmented and needs to be reassembled.

From this it is clear that simply copying logged data into the current encrypted packet won’t function correctly, because the data needs to have the correct SSRC, payload type and marker, or it won’t be interpreted correctly. This wasn’t necessary in WebRTC, because I had enough control over WebRTC that I could create a connection with a single SSRC and payload type for fuzzing purposes. But there is no way to do this in FaceTime; even muting a video call leads to silent audio packets being sent, as opposed to the audio stream shutting down. So these values needed to be manually corrected.

An RTP feature called extensions made correcting these fields difficult. An extension is an optional header that can be added to an RTP packet. Extensions are supposed to be interpretable without reference to the RTP payload, and they are often used to transmit network or display features. Some examples of supported extensions include the orientation extension, which tells the endpoint the orientation of the sending device, and the mute extension, which tells the endpoint whether the sending device is muted.

Extensions mean that even if it is possible to determine the payload type, marker and SSRC of data, this is not sufficient to replay the exact packet that was sent. Moreover, FaceTime creates extensions after the packet is encrypted, so it is not possible to create the complete RTP packet by hooking CCCryptorUpdate, because extensions could be added later.

At this point, it seemed necessary to hook sendmsg as well as CCCryptorUpdate. This would allow the outgoing RTP header to be modified once it is complete. There were a few challenges in doing this. To start, audio and video packets are sent by different threads in FaceTime, and can be reordered between the time they are encrypted and the time they are sent by sendmsg. So I couldn’t assume that if sendmsg received an RTP packet, it was necessarily the last one that was encrypted. There was also the problem that SSRCs are dynamic, so replaying an RTP packet with the same SSRC it was recorded with won’t work; it needs to have the new SSRC for the audio or video stream.

Note that in macOS Mojave, FaceTime can call sendmsg via either the AVConference binary or the IDSFoundation binary, depending on the network configuration. So to capture and replay unencrypted RTP traffic on newer systems, it is necessary to hook CCCryptorUpdate in AVConference and sendmsg in IDSFoundation (AVConference calls into IDSFoundation when it calls sendmsg). Otherwise, the process is the same as on older systems.

I ended up implementing a solution that logged each packet’s unencrypted payload and its RTP header separately, using a snippet of the encrypted payload to pair each header with the correct unencrypted payload. To replay packets, the packets encrypted in CCCryptorUpdate were replaced with the logged packets, and once the encrypted payload came through to sendmsg, the header was replaced with the logged one for that payload. Fortunately, the two streams with unique SSRCs used by FaceTime do not share any payload types, so it was possible to determine the new SSRC for each stream by waiting for an incoming packet with the correct payload type. Then in each subsequent packet, the SSRC was replaced with the correct one.

Unfortunately, this still did not replay a FaceTime call correctly, and calls often experienced decryption failures. I eventually determined that audio and video on FaceTime are encrypted with different keys, and updated the replay tool to queue the CCCryptors used by CCCryptorUpdate based on whether they handled audio or video content. Then in sendmsg, the entire logged RTP packet, including the unencrypted payload, was copied into the outgoing packet, the SSRC was fixed up, and the payload was encrypted with the next CCCryptor out of the appropriate queue. If a CCCryptor wasn’t available, outgoing packets were dropped until a new one was created. At this point, it was possible to stop using the modified AVConference binary, as all the packet modification was now happening in sendmsg. This implementation still had reliability problems.

Digging more deeply into how FaceTime encryption works, packets are encrypted in CTR mode, which requires a counter. The counter is initialized to a unique value for each packet that is sent. During the initialization of the RTP stream, the peers exchange two 16-byte random tokens, one for audio and one for video. The counter value for each packet is then calculated by exclusive or-ing the token with several values found in the packet, including the SSRC and the sequence number. Only one value in this calculation, the sequence number, changes between packets. So it is possible to calculate the counter value for any packet by knowing the initial counter value and sequence number, both of which can be retrieved by hooking CCCryptorCreateWithMode. The sequence number is xor-ed into the token at a fixed index when FaceTime constructs a counter, so by xor-ing that location with the initial sequence number and then with a packet’s sequence number, the counter value for that packet can be calculated. The key can also be retrieved by hooking CCCryptorCreateWithMode. This allowed me to dispense with queuing cryptors, as I now had all the information needed to construct a cryptor for any packet, and packets could be encrypted faster and more accurately.
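
A sketch of the counter derivation, where kSeqOffset stands in for the fixed index; the exact offset is an assumption here, and the initial counter and sequence number come from hooking CCCryptorCreateWithMode:

#include <cstdint>
#include <cstring>

constexpr int kSeqOffset = 2;  // hypothetical offset within the counter block

void DeriveCounter(const uint8_t initial_counter[16], uint16_t initial_seq,
                   uint16_t packet_seq, uint8_t out[16]) {
  memcpy(out, initial_counter, 16);
  uint16_t field = (uint16_t)((out[kSeqOffset] << 8) | out[kSeqOffset + 1]);
  field = (uint16_t)(field ^ initial_seq ^ packet_seq);  // swap old seq for new
  out[kSeqOffset] = (uint8_t)(field >> 8);
  out[kSeqOffset + 1] = (uint8_t)(field & 0xff);
}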

Sequence numbers still posed a problem, as the initial sequence number of an RTP stream is randomly generated at the beginning of the call and differs between subsequent calls. Sequence numbers are also used to reconstruct video streams in order, so they need to be correct. I altered the replay tool to determine the starting sequence number of each live stream, and then to rewrite each replayed packet’s sequence number by adding its offset from the logged stream’s starting sequence number to the live stream’s starting sequence number (see the sketch below). These two changes finally made the replay tool work, though replay gets slower and slower as a stream is replayed due to dropped packets.
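
The rebasing itself is a small offset calculation modulo 2^16; a sketch:

#include <cstdint>

// Map a logged packet's sequence number onto the live stream: keep its
// offset from the logged stream's start, applied to the live stream's start.
inline uint16_t RebaseSeq(uint16_t logged_seq, uint16_t logged_start,
                          uint16_t live_start) {
  return (uint16_t)(live_start + (uint16_t)(logged_seq - logged_start));
}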

Results


Using this setup, I was able to fuzz FaceTime calls and reproduce the crashes. I reported three bugs in FaceTime based on this work. All these issues have been fixed in recent updates.

CVE-2018-4366 is an out-of-bounds read in video processing that occurs on Macs only.

CVE-2018-4367 is a stack corruption vulnerability that affects iOS and Mac. There are a fair number of variables on the stack of the affected function before the stack cookie, and several fuzz crashes due to this issue caused segmentation faults as opposed to stack_chk crashes, so it is likely exploitable.

CVE-2018-4384 is a kernel heap corruption issue in video processing that affects iOS. It is likely similar to a vulnerability found by Adam Donenfeld of Zimperium.

All of these issues took less than 15 minutes of fuzzing to find on a live device. Unfortunately, this was the limit of the fuzzing that could be performed on FaceTime: since it is closed source, it would be difficult to create a coverage-guided command-line fuzzing tool like we did for WebRTC.

In Part 3, we will look at video calling in WhatsApp.

Tuesday, December 4, 2018

Adventures in Video Conferencing Part 1: The Wild World of WebRTC

Posted by Natalie Silvanovich, Project Zero

Over the past five years, video conferencing support in websites and applications has exploded. Facebook, WhatsApp, FaceTime and Signal are just a few of the many ways that users can make audio and video calls across networks. While a lot of research has been done into the cryptographic and privacy properties of video conferencing, there is limited information available about the attack surface of these platforms and their susceptibility to vulnerabilities. We reviewed the three most widely-used video conferencing implementations. In this series of blog posts, we describe what we found.

This part will discuss our analysis of WebRTC. Part 2 will cover our analysis of FaceTime. Part 3 will discuss how we fuzzed WhatsApp. Part 4 will describe some attacks against WhatsApp that didn’t work out. And finally, Part 5 will discuss the future of video conferencing and steps that developers can take to improve the security of their implementation.

Typical Video Conferencing Architecture


All the video conferencing implementations we investigated allow at least two peers anywhere on the Internet to communicate through audiovisual streams. Implementing this capability so that it is reliable and has good audio and video quality presents several challenges. First, the peers need to be able to find each other and establish a connection regardless of NATs or other network infrastructure. Then they need to be able to communicate, even though they could be on different platforms, application versions or browsers. Finally, they need to maintain audio and video quality, even if the connection is low-bandwidth or noisy.

Almost all video conferencing solutions have converged on a single architecture. It assumes that two peers can communicate via a secure, integrity checked channel which may have low bandwidth or involve an intermediary server, and it allows them to create a faster, higher-bandwidth peer-to-peer channel.

The first stage in creating a connection is called signalling. It is the process through which the two peers exchange the information they will need to create a connection, including network addresses, supported codecs and cryptographic keys. Usually, the calling peer sends a call request including information about itself to the receiving peer, and then the receiving peer responds with similar information. SDP is a common protocol for exchanging this information, but it is not always used, and most implementations do not conform to the specification. It is common for mobile messaging apps to send this information in a specially formatted message, sent through the same channel as text messages. Websites that support video conferencing often use WebSockets to exchange information, or exchange it via HTTPS using the webserver as an intermediary.

Once signalling is complete, the peers find a way to route traffic to each other using the STUN, TURN and ICE protocols. Based on what these protocols determine, the peers can create UDP, UDP-over-STUN and occasionally TCP connections, depending on what is favorable for the network conditions.

Once the connection has been made, the peers communicate using the Real-time Transport Protocol (RTP). Though this protocol is standardized, most implementations deviate somewhat from the standard. RTP can be encrypted using a protocol called Secure RTP (SRTP), and some implementations also encrypt streams using DTLS. Under the encryption envelope, RTP supports features that allow multiple streams and formats of data to be exchanged simultaneously. Then, based on how RTP classifies the data, it is passed on to other processing, such as video codecs. Stream Control Transmission Protocol (SCTP) is also sometimes used to exchange small amounts of data (for example, a text message on top of a call) during video conferencing, but it is less commonly used than RTP.

Even when it is encrypted, RTP often doesn’t include integrity protection, and if it does, it usually doesn’t discard malformed packets. Instead, it attempts to recover them using strategies such as Forward Error Correction (FEC). Most video conferencing solutions also detect when a channel is noisy or low-bandwidth and attempt to handle the situation in a way that leads to the best audio and video quality, for example, sending fewer frames or changing codecs. Real Time Control Protocol (RTCP) is used to exchange statistics on network quality and coordinate adjusting properties of RTP streams to adapt to network conditions.

WebRTC


WebRTC is an open source project that enables video conferencing. It is by far the most commonly used implementation. Chrome, Safari, Firefox, Facebook Messenger, Signal and many other mobile applications use WebRTC. WebRTC seemed like a good starting point for looking at video conferencing as it is heavily used, open source and reasonably well-documented.

WebRTC Signalling


I started by looking at WebRTC signalling, because it is an attack surface that does not require any user interaction. Protocols like RTP usually start being processed after a user has picked up the video call, but signalling is performed before the user is notified of the call. WebRTC uses SDP for signalling.

I reviewed the WebRTC SDP parser code, but did not find any bugs. I also compiled it so it would accept an SDP file on the command line and fuzzed it, but I did not find any bugs through fuzzing either. I later discovered that WebRTC signalling is not implemented consistently across browsers anyhow. Chrome uses the main WebRTC implementation, Safari has branched slightly, and Firefox uses its own implementation. Most mobile applications that use WebRTC also implement their own signalling, in a protocol other than SDP. So it is not likely that a bug in WebRTC signalling would affect a wide variety of targets.

RTP Fuzzing


I then decided to look at how RTP is processed in WebRTC. While RTP is not an interaction-less attack surface because the user usually has to answer the call before RTP traffic is processed, picking up a call is a reasonable action to expect a user to take. I started by looking at the WebRTC source, but it is very large and complex, so I decided fuzzing would be a better approach.

The WebRTC repository contains fuzzers written for OSS-Fuzz for every protocol and codec supported by WebRTC, but they do not simulate the interactions between the various parsers, and do not maintain state between test cases, so it seemed likely that end-to-end fuzzing would provide additional coverage.

Setting up end-to-end fuzzing was fairly time intensive, so to see if it was likely to find many bugs, I altered Chrome to send malformed RTP packets. I changed the srtp_protect function in libsrtp so that it ran the following fuzzer on every packet:

#include <stdlib.h>

void fuzz(char* buf, int len){
  int q = rand()%10;
  if (q == 7){            // 1 in 10 packets: corrupt one random byte
    int ind = rand()%len;
    buf[ind] = rand();
  }
  if (q == 5){            // 1 in 10 packets: replace every byte with a random one
    for(int i = 0; i < len; i++)
      buf[i] = rand();
  }
}
RTP fuzzer (fuzzer q)

When this version was used to make a WebRTC call to an unmodified instance of Chrome, it crashed roughly every 30 seconds.

Most of the crashes were due to divide-by-zero exceptions, which I submitted patches for, but there were three interesting crashes. I reproduced them by altering the WebRTC source in Chrome so that it would generate the packets that caused the same crashes, and then set up a standalone build of WebRTC to reproduce them, so that it was not necessary to rebuild Chrome to reproduce the issues.

The first issue, CVE-2018-6130, is an out-of-bounds memory issue related to the use of std::map::find in processing VP9 (a video codec). In the following code, the value tl0_pic_idx is pulled out of an RTP packet unverified (GOF stands for group of frames).

if (frame->frame_type() == kVideoFrameKey) {
   ...
   GofInfo info = gof_info_.find(codec_header.tl0_pic_idx)->second;
   FrameReceivedVp9(frame->id.picture_id, &info);
   UnwrapPictureIds(frame);
   return kHandOff;
 }


If this value does not exist in the gof_info_ map, std::map::find returns the map’s end iterator, which does not point to a valid entry. Depending on memory layout, dereferencing this iterator will either crash or return the contents of unallocated memory.
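
The usual defensive pattern is to compare the returned iterator against end() before dereferencing it; reusing the names from the snippet above, with kDrop as a hypothetical error return:

auto it = gof_info_.find(codec_header.tl0_pic_idx);
if (it == gof_info_.end())
  return kDrop;  // hypothetical: reject packets with an unknown tl0_pic_idx
GofInfo info = it->second;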

The second issue, CVE-2018-6129 is a more typical out-of-bounds read issue, where the index of a field is read out of an RTP packet, and not verified before it is used to index a vector.

The third issue, CVE-2018-6157 is a type confusion issue that occurs when a packet that looks like a VP8 packet is sent to the H264 parser. The packet will eventually be treated like an H264 packet even though it hasn’t gone through the necessary checks for H264. The impact of this issue is also limited to reading out of bounds.

There are a lot of limitations to the approach of fuzzing in a browser. It is very slow, the issues are difficult to reproduce, and it is difficult to fuzz a variety of test cases, because each call needs to be started manually, and certain properties, such as the default codec, can’t change in the middle of a call. After I reported these issues, the WebRTC team suggested that I use the video_replay tool, which can be used to replay RTP streams recorded in a patched browser. The tool was not able to reproduce many of my issues because they used non-default WebRTC settings configured through signalling, so I added the ability to load a configuration file alongside the RTP dump. This made it possible to quickly reproduce vulnerabilities in WebRTC.

This tool also had the benefit of enabling much faster fuzzing, as it was possible to fuzz RTP by fuzzing the RTP dump file and loading it into video_replay. There were some false positives, as it was also possible that fuzzing caused bugs in parsing the RTP dump file format, but most of the bugs were actually in RTP processing.

Fuzzing with the video_replay tool with code coverage and ASAN enabled led to four more bugs. We ran the fuzzer on 50 cores for about two weeks to find these issues.

CVE-2018-6156 is probably the most exploitable bug uncovered. It is a large overflow in FEC. The buffer WebRTC uses to process FEC packets is 1500 bytes, but it does no size checking of these packets once they are extracted from RTP. Practically, they can be up to about 2000 bytes long.
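
Schematically, the bug pattern was roughly the following (not WebRTC’s actual code; the names are illustrative and the sizes are from the description above):

#include <cstdint>
#include <cstring>

constexpr size_t kFecBufferSize = 1500;  // fixed processing buffer

void ProcessFecPacket(uint8_t* fec_buffer, const uint8_t* payload,
                      size_t payload_len) {
  // Bug pattern: payload_len comes from the RTP packet and can approach
  // 2000 bytes, but it is never checked against kFecBufferSize.
  memcpy(fec_buffer, payload, payload_len);
}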

CVE-2018-6155 is a use-after-free in a video codec called VP8. It is interesting because it affects the VP8 library, libvpx, as opposed to code in WebRTC, so it has the potential to affect software that uses this library other than WebRTC. A generic fix for libvpx was released as a result of this bug.

CVE-2018-16071 is a use-after-free in VP9 processing that is somewhat similar to CVE-2018-6130. Once again, an untrusted index is pulled out of a packet, but this time it is used as the upper bounds of a vector erase operation, so it is possible to delete all the elements of the vector before it is used.

CVE-2018-16083 is an out-of-bounds read in FEC that occurs due to a lack of bounds checking.

Overall, end-to-end fuzzing found a lot of bugs in WebRTC, and a few were fairly serious. They have all now been fixed. This shows that end-to-end fuzzing is an effective approach for finding vulnerabilities in this type of video conferencing solution. In Part 2, we will try a similar approach on FaceTime. Stay tuned!