The Real-time Transport Protocol (RTP) is an IETF standard, documented in RFC 3550. While different vendor systems use various signaling protocols, virtually every vendor uses RTP to carry the audio. This is important because RTP will be present as a target protocol in any VoIP environment. RTP is a simple protocol, generally riding on top of UDP. RTP provides payload type identification, sequence numbering, timestamping, and delivery monitoring. RTP does not provide mechanisms for timely delivery or other QoS capabilities; it depends on lower-layer protocols for these. RTP also does not guarantee delivery or ordering of packets. However, RTP's sequence numbers allow applications, such as an IP phone, to check for lost or out-of-order packets.
RTP includes the RTP Control Protocol (RTCP), which is used to monitor the quality of service and to convey information about the participants in an ongoing session. VoIP endpoints should send RTCP reports, but not all do.
RTP is a binary protocol, which adds the following header to each UDP packet:
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |            contributing source (CSRC) identifiers             |
   |                             ....                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The first twelve bytes are present in every RTP packet, while the list of CSRC identifiers is present only when inserted by a mixer. The fields are defined as follows:
Version (V): 2 bits. This field identifies the version of RTP. The version defined by RFC 3550 is two (2).
Padding (P): 1 bit. If the padding bit is set, the packet contains one or more additional padding bytes at the end that are not part of the payload.
Extension (X): 1 bit. If the extension bit is set, the fixed header must be followed by exactly one header extension, with a format defined in RFC 3550.
CSRC count (CC): 4 bits. The CSRC count contains the number of CSRC identifiers that follow the fixed header.
Marker (M): 1 bit. The interpretation of the marker is defined by a profile. It is intended to allow significant events such as frame boundaries to be marked in the packet stream.
Payload type (PT): 7 bits. This field identifies the format of the RTP payload and determines its interpretation by the application.
Sequence number: 16 bits. The sequence number increments by one for each RTP data packet sent and may be used by the receiver to detect packet loss and to restore packet order. The initial value of the sequence number should be random and unpredictable (not a fixed value such as 0) to make known-plaintext attacks on encryption more difficult. This applies even if the source itself does not encrypt, because the packets may flow through a translator that does.
Timestamp: 32 bits. The timestamp reflects the sampling time of the first byte in the RTP payload. The clock used to calculate the timestamp must have sufficient resolution to allow endpoints to perform synchronization and jitter calculations.
SSRC: 32 bits. The SSRC field identifies the synchronization source. This identifier should be chosen randomly, so that no two synchronization sources within the same RTP session have the same SSRC. Although the probability of multiple sources choosing the same identifier is low, all RTP implementations must be prepared to detect and resolve duplicates.
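The field layout above maps directly onto a few lines of parsing code. The following is a minimal sketch, assuming the packet has already been read from a UDP socket; the names RtpHeader and parse_rtp_header and the sample bytes are illustrative and not part of any library. It does not handle the optional header extension or padding.

```python
import struct
from typing import NamedTuple, Tuple


class RtpHeader(NamedTuple):
    """Decoded RTP fixed header plus any CSRC identifiers (illustrative type)."""
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int
    csrcs: Tuple[int, ...]


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the 12-byte fixed header and the CSRC list of one RTP packet."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Network byte order: two single bytes, 16-bit sequence number,
    # 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F
    end = 12 + 4 * cc
    if len(packet) < end:
        raise ValueError("packet truncated inside the CSRC list")
    csrcs = struct.unpack("!%dI" % cc, packet[12:end])
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=cc,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
        csrcs=csrcs,
    )


# Hand-built sample packet (made-up values): V=2, PT=0, seq=0x1234,
# timestamp=0x00000a00, SSRC=0xdeadbeef, followed by 160 payload bytes.
sample = bytes.fromhex("8000" "1234" "00000a00" "deadbeef") + b"\xff" * 160
header = parse_rtp_header(sample)
print(header)
```

A receiver or a monitoring tool would typically check that the version is 2 before trusting the remaining fields.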
The presence of the sequence number, timestamp, and SSRC makes it difficult for an attacker to inject malicious RTP packets into a stream. The attacker needs to be performing a man-in-the-middle (MITM) attack or at least be able to monitor the packets, so that the malicious packets include the necessary SSRC, sequence number, and timestamp. If these values are not correct, the target endpoint will ignore the malicious packets.
Audio carried by RTP is sampled at the transmitting endpoint over a given time period. A number of samples are collected and then typically compressed by a compressor/decompressor (codec). The ITU has created and published specifications for several popular audio codecs, such as G.711, G.723, G.726, and G.729.
G.711 is the most commonly used codec, particularly for LAN-based VoIP calls. G.711 uses Pulse Code Modulation (PCM) and requires 64 kbps. Other codecs, such as G.729, which uses Conjugate-Structure Algebraic Code-Excited Linear Prediction (CS-ACELP), only require 8 kbps. These codecs are often used over lower-bandwidth links.
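The codec bit rates quoted above cover the audio payload only. As a rough worked example, the sketch below adds the per-packet header overhead, assuming IPv4 without options (20 bytes), UDP (8 bytes), the 12-byte RTP fixed header, and one packet every 20 ms. The packet interval, the function name on_wire_kbps, and the omission of link-layer framing are assumptions made for illustration.

```python
IP_HEADER = 20    # bytes, IPv4 without options (assumed)
UDP_HEADER = 8    # bytes
RTP_HEADER = 12   # bytes, fixed header with no CSRC list


def on_wire_kbps(codec_kbps: float, packet_ms: int = 20) -> float:
    """Approximate IP-layer bandwidth for one direction of a call."""
    # Payload bytes produced by the codec during one packet interval.
    payload_bytes = codec_kbps * 1000 / 8 * packet_ms / 1000
    packet_bytes = payload_bytes + IP_HEADER + UDP_HEADER + RTP_HEADER
    packets_per_second = 1000 / packet_ms
    return packet_bytes * 8 * packets_per_second / 1000


g711 = on_wire_kbps(64)   # 160-byte payload per packet
g729 = on_wire_kbps(8)    # 20-byte payload per packet
print(g711, g729)
```

Under these assumptions G.711 needs about 80 kbps and G.729 about 24 kbps per direction, so header overhead matters proportionally more for low-rate codecs.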
Codecs such as G.711 are waveform coders, which means they are not aware of human speech characteristics. They sample the audio at a specific frequency, such as 8 kHz (8000 times a second). Codecs such as G.728 are vocoders and are aware of human speech characteristics. Only a limited number of sounds are uttered during human speech. A vocoder identifies the phonemes uttered and looks up codes in a table corresponding to that sound. Codes are then transmitted instead of the sampled audio itself. Vocoders are significantly more computationally intensive than waveform coders, but generally require less bandwidth. The trade-off is that they generally produce lower-quality audio.
G.711 audio is typically carried as a 160-byte payload within an RTP message: 8000 one-byte samples per second over a 20-millisecond (ms) interval yields 160 bytes. RTP messages are transmitted within UDP packets at a rate of 50 Hz (that is, one every 20 ms). The sequence number field within the RTP header begins at some random number and increases by 1 with each RTP packet transmitted. For G.711, the timestamp begins at some random number and increases by 160 with each RTP packet transmitted.
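The arithmetic behind these numbers can be sketched as follows. The constants come from the text (8 kHz sampling, one byte per sample, 20 ms per packet); the generator name rtp_counters is hypothetical. Because the header fields are fixed-width, the sequence number wraps modulo 2^16 and the timestamp modulo 2^32, which the sketch makes explicit.

```python
import random

SAMPLE_RATE_HZ = 8000    # G.711 sampling rate
PACKET_MS = 20           # audio carried per packet
SAMPLES_PER_PACKET = SAMPLE_RATE_HZ * PACKET_MS // 1000   # 160 samples
PAYLOAD_BYTES = SAMPLES_PER_PACKET * 1                    # one byte per sample
PACKETS_PER_SECOND = 1000 // PACKET_MS                    # 50 packets


def rtp_counters(count, first_seq=None, first_ts=None):
    """Yield (sequence_number, timestamp) for `count` consecutive packets."""
    # Random starting points unless the caller pins them (useful for testing).
    seq = random.getrandbits(16) if first_seq is None else first_seq
    ts = random.getrandbits(32) if first_ts is None else first_ts
    for _ in range(count):
        yield seq, ts
        seq = (seq + 1) & 0xFFFF                      # 16-bit wraparound
        ts = (ts + SAMPLES_PER_PACKET) & 0xFFFFFFFF   # 32-bit wraparound


# Starting just below the wrap point shows the sequence number rolling over.
counters = list(rtp_counters(3, first_seq=65535, first_ts=0))
print(counters)
```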
When an audio session is being set up between two VoIP endpoints, SIP signaling messages (for example, INVITE and 200 OK) typically carry a Session Description Protocol (SDP) message within their payload. This SDP exchange is the mechanism by which endpoints negotiate, or state, which codec or codecs they are willing to support for encoding or decoding audio during the session. One codec may be used to compress transmitted audio, and a different codec may be used to decompress received audio.
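To make the codec negotiation concrete, here is a minimal sketch that lists the audio codecs offered in an SDP body. The SDP text is a made-up example using documentation addresses, and offered_audio_codecs is a hypothetical helper, not a library function. It assumes a single audio media line and ignores everything else a full SDP parser would need to handle.

```python
# Illustrative SDP body (invented values); payload types 0 and 18 are the
# static assignments for G.711 mu-law (PCMU) and G.729.
EXAMPLE_SDP = """\
v=0
o=alice 2890844526 2890844526 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=audio 49170 RTP/AVP 0 18
a=rtpmap:0 PCMU/8000
a=rtpmap:18 G729/8000
"""


def offered_audio_codecs(sdp: str):
    """Return [(payload_type, encoding)] in the order the offer lists them."""
    payload_types = []
    names = {}
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("m=audio "):
            # m=audio <port> <transport> <payload type> [<payload type> ...]
            payload_types = [int(pt) for pt in line.split()[3:]]
        elif line.startswith("a=rtpmap:"):
            # a=rtpmap:<payload type> <encoding name>/<clock rate>
            pt, encoding = line[len("a=rtpmap:"):].split(None, 1)
            names[int(pt)] = encoding
    # Payload types without an rtpmap line are reported as unlabeled.
    return [(pt, names.get(pt, "no rtpmap")) for pt in payload_types]


codecs = offered_audio_codecs(EXAMPLE_SDP)
print(codecs)
```

The payload type numbers listed here are the same values that later appear in the PT field of each RTP header.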
The codec determines the time quantum over which audio is sampled and the rate at which RTP-bearing packets are transmitted. The selected transmission rate is fixed. Whether or not the packets arrive at the fixed rate depends on the underlying network. Packets may be lost, arrive out of order, or be duplicated. Receiving endpoints must take this into account. Endpoints use an audio jitter buffer that collects, resequences, fills in gaps, and if necessary, deletes samples in order to produce the highest quality audio playback. The sequence number and timestamp in the RTP header are used for this purpose.
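The resequencing role of a jitter buffer can be illustrated with a toy example. This is a sketch under simplifying assumptions, not how any particular endpoint implements it: it orders packets by sequence number only, ignores timestamps and playout timing, and fills a missing packet with a frame of 0xFF bytes as a stand-in for silence. The class name JitterBuffer and its methods are hypothetical.

```python
class JitterBuffer:
    """Toy resequencing buffer keyed by RTP sequence number."""

    def __init__(self, first_seq: int):
        self.next_seq = first_seq   # next sequence number due for playout
        self.pending = {}           # sequence number -> payload

    def push(self, seq: int, payload: bytes) -> bool:
        """Store a packet; return False if it is a duplicate or arrived too late."""
        # Distance ahead of the playout point, modulo 2**16 so that the
        # comparison still works when the sequence number wraps.
        ahead = (seq - self.next_seq) & 0xFFFF
        if ahead >= 0x8000 or seq in self.pending:
            return False
        self.pending[seq] = payload
        return True

    def pop(self, frame_bytes: int = 160) -> bytes:
        """Return the next frame in order, or filler if it never arrived."""
        payload = self.pending.pop(self.next_seq, None)
        self.next_seq = (self.next_seq + 1) & 0xFFFF
        if payload is None:
            return b"\xff" * frame_bytes   # gap fill (assumed silence pattern)
        return payload


# Packets arrive out of order, one is duplicated, and sequence number 1 is lost.
jb = JitterBuffer(first_seq=65535)
results = [
    jb.push(0, b"B"),
    jb.push(65535, b"A"),
    jb.push(65535, b"A"),   # duplicate, rejected
    jb.push(2, b"D"),
]
played = [jb.pop(frame_bytes=1) for _ in range(4)]
print(results, played)
```

A real jitter buffer also decides how long to wait before declaring a packet lost, which is where the timestamp and the measured arrival jitter come in.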