34.4 VOICE AND VIDEO OVER IP


Using the Internet, intranets, and virtual private networks for voice communication offers great potential because corporations can drastically reduce their communication costs. The initial euphoria created by VoIP was so great that some journalists even started writing the obituaries of the major telecom service providers. VoIP certainly has the potential, and these service providers, following the dictum "if you cannot beat them, follow them," are getting into VoIP themselves. It will still be a long time before the traditional telecom operators close shop, simply because the Internet infrastructure has not penetrated the market the way the traditional telephone infrastructure has. Even so, VoIP is the technology of the future.

Organizations that have their own IP-based intranets or virtual private networks will benefit greatly from providing voice/video services over these networks.

34.4.1 Applications of Voice over IP

Compared to transmitting voice over a circuit-switched network, transmitting voice over a packet-switched network has the main advantage of cost savings, because in packet-switched networks the billing is based not on the distance but on the data sent. For example, on the Internet, irrespective of the location of the Web server we access, we pay only for the local call to the ISP, based on the duration of the connection. The same approach to billing can be followed for voice communication over IP networks, including the Internet.

An organization can use the intranet infrastructure to provide voice over IP service. However, to interact with the normal PSTN subscribers, a gateway needs to be installed.

Consider an organization with multiple branches in different parts of a country or in different countries. These branches can be connected through a virtual private network (VPN) using the Internet infrastructure. The gateway, or interworking function (IWF), provides the interface between the subscriber equipment (such as a PBX or a LAN) and the packet network. People at different branches can carry out voice or data communication through the packet network without paying for long-distance calls.

Through this technology, the Internet can be used for voice communication between any two persons located anywhere on Earth. Two people can communicate directly using multimedia PCs, or one person can use a multimedia PC and the other a normal telephone connected to the PSTN. In the second case, a VoIP gateway is required to do the protocol conversion between the packet network and the PSTN.

VoIP is the first and foremost step in achieving convergence wherein the Internet infrastructure can be used for both voice and data communication. Also, the PSTN infrastructure and the Internet infrastructure will converge to provide a single interface to the subscriber for providing voice communication.

34.4.2 Issues in Voice and Video over IP

When we make voice calls over the PSTN, we get good quality speech. This is because the PSTN provides low transmission delay (except when the call uses a satellite channel) and the network is generally reliable. So, the quality of service is guaranteed.

When voice is transmitted over an IP network, the quality of service is a major issue mainly because of the delays and unreliable service provided by the IP network. The major quality of service issues in IP networks for voice communication are delay, jitter, and packet loss.

Delay: Delay causes two problems in voice communication: echo and talker overlap. Echo is caused by reflections of the speaker's voice from the far-end telephone equipment back to the speaker's ear. If the round-trip delay (sender to receiver and back) is more than 50 msec, the echo is heard, and echo cancellation equipment has to be incorporated. ITU standard G.165 specifies the performance requirements of the echo cancellation equipment. Talker overlap occurs when the one-way delay is greater than 250 msec. The speaker talks into the telephone and, because of the delay, assumes that the other person is not responding and says hello again, which makes the conversation very difficult to continue. This must be avoided by reducing the delay as much as possible.

Delay in IP networks is an accumulation of algorithmic delay, processing delay, and network delay.

Algorithmic delay: To reduce the bandwidth requirements, voice is coded at low bit rates for transmission over the IP networks. The coding algorithm introduces algorithmic delay. Typical algorithmic delays for various coders are given below. It needs to be mentioned that the coding delays are decreasing because of the availability of digital signal processors with higher processing power.

Coding technique	ITU Standard	Coding Rate (Kbps)	Delay
ADPCM	G.726	16, 24, 32, 40	125 microseconds
CELP (Code Excited Linear Prediction)	G.728	16	2.5 milliseconds
Multirate coder	G.723.1	5.3, 6.3	30 milliseconds

Processing delay: After coding the voice signals, packetization has to be done, which again contributes to delay. This delay is dependent on the performance of the processor used and also the algorithm.
Network delay: Depending on the traffic on the network, there may be congestion in the network, and hence there will be delay in the arrival of the packets at the receiver. In IP networks, a network delay up to 100 milliseconds is not uncommon.

Note

The delay in IP networks is the sum of algorithmic delay, processing delay, and network delay. Algorithmic delay is caused by the low bit rate coding algorithm, processing delay by packetization of the voice data, and network delay due to the congestion in the network.

In VoIP systems, the effect of the delay has to be compensated for using strategies such as fast packet switching and the use of low-delay coders.
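The delay budget described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not a measurement tool: the 50 msec and 250 msec thresholds, the 30 msec G.723.1 delay, and the 100 msec network delay are taken from the text, while the 10 msec processing delay, the function names, and the assumption that the round-trip delay is twice the one-way delay are illustrative choices.

```python
# Thresholds quoted in the text (milliseconds).
ECHO_ROUND_TRIP_MS = 50.0
TALKER_OVERLAP_ONE_WAY_MS = 250.0


def one_way_delay_ms(algorithmic_ms, processing_ms, network_ms):
    """Total one-way delay as the sum of the three components."""
    return algorithmic_ms + processing_ms + network_ms


def assess_delay(one_way_ms):
    """List the problems the text associates with a given one-way delay.

    Assumption: the path is symmetric, so the round-trip delay is taken
    to be twice the one-way delay.
    """
    problems = []
    if 2 * one_way_ms > ECHO_ROUND_TRIP_MS:
        problems.append("echo cancellation needed")
    if one_way_ms > TALKER_OVERLAP_ONE_WAY_MS:
        problems.append("talker overlap")
    return problems


# 30 ms (G.723.1) and 100 ms (network) come from the text;
# the 10 ms processing delay is an assumed value for illustration.
total = one_way_delay_ms(algorithmic_ms=30.0, processing_ms=10.0, network_ms=100.0)
print(total, assess_delay(total))
```

With these figures the one-way delay stays below the talker-overlap threshold, but the round trip is long enough that echo cancellation would be required.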

The three important factors that need special attention for real-time communication over IP networks are delay, jitter, and packet loss.

Jitter: In IP networks, the delay is variable because it depends on the traffic. If there is congestion in the network, the delay varies from packet to packet. For voice, this is a major problem: if the packets are not received with constant delay, the voice cannot be replayed properly, and there will be gaps during replay.

In real-time voice communication, even if a voice packet is lost, the replay at the receiving end should be continuous. To achieve this, one approach is to replay the previous packet. The other approach is to send redundant information in each packet so that the lost packet can be reconstructed.

One way of solving the jitter problem is to hold the packets in a buffer long enough to absorb the highest delay anticipated for a packet. This calls for a large buffer at the receiver.
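The buffering approach to jitter can be illustrated with a small simulation. This is a sketch under stated assumptions: packets are generated every 20 msec, the per-packet network delays are made-up values, and the function name `playout_schedule` is hypothetical. Each packet is replayed a fixed time after it was sent; a packet that arrives after its replay time produces a gap.

```python
def playout_schedule(send_times_ms, network_delays_ms, buffer_delay_ms):
    """Return (playout_time_ms, arrived_in_time) for every packet.

    Each packet is played out buffer_delay_ms after it was sent, so
    packets that arrive in time are replayed at a constant spacing.
    """
    schedule = []
    for sent, delay in zip(send_times_ms, network_delays_ms):
        arrival = sent + delay
        playout = sent + buffer_delay_ms
        schedule.append((playout, arrival <= playout))
    return schedule


send_times = [0, 20, 40, 60, 80]   # one packet every 20 ms (assumed)
delays = [30, 45, 25, 90, 40]      # variable network delay (assumed)

# Buffer sized for the highest anticipated delay: no packet is late.
print(playout_schedule(send_times, delays, buffer_delay_ms=max(delays)))

# A smaller buffer lowers the delay, but the fourth packet misses its slot.
print(playout_schedule(send_times, delays, buffer_delay_ms=50))
```

The two runs show the trade-off: a buffer sized for the worst case removes the gaps at the cost of extra delay and storage, while a smaller buffer keeps the delay low but lets late packets through as gaps.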

Lost packets: When congestion develops in the IP networks, some packets may be dropped. Loss of packets means loss of information. IP networks do not guarantee quality of service, and the problem is overcome using one of the following methods:

Interpolation for lost packets: When a packet is lost, the previous packet is replayed in place of the lost packet to fill the time. This approach is acceptable if the number of packets lost is small and if consecutive packets are not lost.
Send redundant information: The information about the nth packet is sent along with the (n+1)th packet. When one packet is lost, the next packet contains information about it, and this information is used to reconstruct the lost packet. This approach improves speech quality at the expense of bandwidth utilization.

Use a hybrid approach: A combination of the previous two approaches can be used as a trade-off between bandwidth utilization and quality of speech.
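The three methods can be sketched together in a few lines. The example below is an illustration only, with hypothetical names and string placeholders instead of real speech frames: every packet is assumed to carry its own payload plus a copy of the previous payload. A lost frame is rebuilt from the redundant copy in the next packet when that packet arrived, and otherwise the previous frame is replayed.

```python
def reconstruct(packets, count):
    """Rebuild `count` frames from the packets that actually arrived.

    `packets` maps sequence number -> (payload, copy_of_previous_payload).
    A lost packet is simply absent from the mapping.
    """
    frames = []
    for seq in range(count):
        if seq in packets:
            frames.append(packets[seq][0])        # packet arrived
        elif seq + 1 in packets and packets[seq + 1][1] is not None:
            frames.append(packets[seq + 1][1])    # redundant copy in next packet
        elif frames:
            frames.append(frames[-1])             # replay the previous frame
        else:
            frames.append(None)                   # nothing available to replay
    return frames


# Packets 1 and 3 lost: both are recovered from the redundant copies.
arrived = {0: ("f0", None), 2: ("f2", "f1"), 4: ("f4", "f3")}
print(reconstruct(arrived, 5))

# Packets 2 and 3 lost in a row: frame 2 has to be a replay of frame 1.
arrived = {0: ("f0", None), 1: ("f1", "f0"), 4: ("f4", "f3")}
print(reconstruct(arrived, 5))
```

Note that using the redundant copy means waiting for the next packet, so in a real receiver this recovery also adds to the delay.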

34.4.3 Video over IP

Video communication over IP networks follows the same principles as voice over IP networks. However, video requires a much higher data rate than voice. If we transmit 30 video frames per second, the video appears normal; at lower frame rates, the motion appears jerky. Because of the lack of high-speed Internet access and good streaming technologies, we encounter problems in video communication.

Realtime Transport Protocol (RTP) identifies the type of data (voice/video) being transmitted, determines the order of the packets to be presented at the receiving end, and synchronizes the media streams for different sources.

With higher access speeds to access the Internet and better video streaming technologies, in the future, video conferencing and video messaging will become popular.

Because the present IP networks do not support high bandwidths (particularly the Internet and the corporate LANs operating at 10 Mbps), low bit rate video codecs are used to provide two-way and multiparty video conferencing. Video coding at a 64 kbps data rate would provide reasonably good quality for business applications.
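A quick calculation shows why low bit rate codecs are needed. The sketch below assumes a QCIF picture (176 x 144 pixels) with 12 bits per pixel, a format commonly used for video conferencing; these figures are assumptions for illustration and do not come from the text. Only the 30 frames per second and the 64 kbps target are taken from the text.

```python
def raw_video_bps(width, height, bits_per_pixel, frames_per_second):
    """Bit rate of uncompressed video."""
    return width * height * bits_per_pixel * frames_per_second


# Assumed picture format: QCIF, 12 bits per pixel, 30 frames per second.
raw = raw_video_bps(width=176, height=144, bits_per_pixel=12, frames_per_second=30)
target_bps = 64_000  # 64 kbps, as quoted in the text

print(f"uncompressed: {raw / 1e6:.2f} Mbps")
print(f"compression needed: about {raw / target_bps:.0f} to 1")
```

Even this small picture needs roughly 9 Mbps uncompressed, so the codec has to compress by a factor well over 100 to fit into 64 kbps.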

Note

For video communication over IP, the video signal is encoded at 64 kbps, which provides reasonably good quality video for business meetings.

To overcome the problems of real-time transmission over the TCP/IP networks, special protocols RTP and RTCP are defined for voice and video communication over IP networks. The protocol suite for voice/video communication in IP networks is shown in Figure 34.6. The RTP runs above the UDP, which runs above the IP (Version 4 or Version 6 or mobile IP).

Multimedia Applications

RTP, RTCP, RSVP

UDP/TCP

Network layer (IPv4, IPv6, IPM)

Datalink layer

Physical layer

Figure 34.6: Protocol suite for voice and video over IP.

34.4.4 Realtime Transport Protocol (RTP)

RTP provides end-to-end delivery service for real-time data transmission. RTP runs over the UDP. RTP supports both unicast and multicast service. In unicast, separate copies of data are sent from source to each destination. In multicast service, the data is sent from the source only once, and the network is responsible for transmitting the data to multiple applications (such as in video conferencing over IP networks).
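The difference between unicast and multicast shows up directly in the load on the sender. The following sketch compares the two for a 64 kbps stream sent to ten receivers; the function name and the figures are illustrative assumptions.

```python
def sender_rate_kbps(stream_kbps, receivers, multicast):
    """Rate the source has to transmit for a given number of receivers."""
    copies = 1 if multicast else receivers   # multicast: one copy, the network replicates it
    return stream_kbps * copies


print(sender_rate_kbps(64, receivers=10, multicast=False))  # unicast: ten copies leave the source
print(sender_rate_kbps(64, receivers=10, multicast=True))   # multicast: a single copy
```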

RTP provides the following services:

Identify the type of data being transmitted (voice or video).
Determine the order of the packets to be presented at the receiving end.
Synchronize the media streams for different sources.

RTP does not guarantee delivery of packets. Based on the information in the packet header, the receiver has to find out whether packets have been lost or have arrived out of sequence. Since RTP does not guarantee quality of service (QoS), it is augmented by the Real Time Control Protocol (RTCP), which enables monitoring of the quality. RTCP also provides control and identification mechanisms for RTP transmission. If QoS is a must for an application, RTP can be used together with the Resource Reservation Protocol (RSVP). Using RSVP, resources can be reserved along the path so that QoS can be assured.

RTP uses two ports: one is used for the media data and the other for RTCP. Sessions are established between hosts (called participants) to exchange data. One session is used for each media type. For example, in a video conferencing application, one session is used for audio and one for video, so a participant can choose to receive only the audio or only the video. The data for a session is sent as a stream of packets called an RTP stream. In a session, there can be multiple sources (audio source, video source). In a conference call, each source of RTP packets has to be uniquely identified, and this is achieved through a unique ID known as the synchronization source (SSRC) identifier, which is 32 bits in length.

The RTP packets from a number of sources can be combined by a mixer, and the combined output RTP stream is sent over the network. The sources whose packets are combined are then called contributing sources (CSRCs), and the mixer output carries the list of their SSRC identifiers, known as the CSRC list. At the receiving host, the RTP packets are sent to different media players based on the identifiers: audio packets are sent to the sound card, video packets to the display unit, and so on. These output devices are called the renderers.

Each RTP packet consists of a header and data (payload). The RTP data packet header format is shown in Figure 34.7.

Figure 34.7: RTP data packet header format.

Version number (2 bits): The current version number is 2.

Padding (1 bit): If the bit is set, it indicates that there are more bytes at the end of the packet that are not part of the payload. This bit is used by some encryption algorithms.

Extension (1 bit): If the bit is set, the fixed header is followed by one header extension. This mechanism allows additional information to be sent in the header.

CSRC count (4 bits): Indicates the number of CSRC identifiers that follow the fixed header. If the count is zero, the synchronization source is the source of the payload.

Marker (1 bit): A marker bit defined by the particular media profile.

Payload type (7 bits): An index to a media profile table that describes the payload format. RFC 1890 specifies the payload mappings for audio and video.

Sequence number (16 bits): The packet sequence number. This value is incremented by one for each packet.

Timestamp (32 bits): The sampling instant of the first byte of the payload. Several consecutive packets can have the same timestamp, for example, when all the packets belong to the same video frame.

SSRC (32 bits): This field identifies the synchronization source. If the CSRC count is zero, the payload source is the synchronization source. If the CSRC count is not zero, the SSRC identifies the mixer.

CSRC (32 bits each): Identifies the contributing sources for the payload. Because the CSRC count field is 4 bits long, a maximum of 15 contributing sources can be listed.

The RTP header consists of the following fields: version number, padding, extension, CSRC count, marker, payload type, sequence number, timestamp, SSRC, and CSRC.
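The field layout above maps directly onto a 12-byte fixed header followed by the CSRC list. The following is a minimal sketch of packing and parsing that header with Python's standard `struct` module. The function names and the example values are illustrative assumptions, header extensions are not handled, and the code is not a complete RTP implementation.

```python
import struct

RTP_VERSION = 2


def pack_rtp_header(payload_type, sequence, timestamp, ssrc,
                    csrcs=(), marker=False, padding=False, extension=False):
    """Build the fixed RTP header followed by the CSRC list."""
    if len(csrcs) > 15:
        raise ValueError("a 4-bit CSRC count cannot exceed 15")
    # First byte: version (2 bits), padding, extension, CSRC count (4 bits).
    byte0 = (RTP_VERSION << 6) | (int(padding) << 5) | (int(extension) << 4) | len(csrcs)
    # Second byte: marker (1 bit), payload type (7 bits).
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1, sequence & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + b"".join(struct.pack("!I", c) for c in csrcs)


def parse_rtp_header(data):
    """Decode the fixed header and the CSRC list into a dictionary."""
    byte0, byte1, sequence, timestamp, ssrc = struct.unpack("!BBHII", data[:12])
    count = byte0 & 0x0F
    csrcs = struct.unpack("!%dI" % count, data[12:12 + 4 * count])
    return {
        "version": byte0 >> 6,
        "padding": bool(byte0 & 0x20),
        "extension": bool(byte0 & 0x10),
        "csrc_count": count,
        "marker": bool(byte1 & 0x80),
        "payload_type": byte1 & 0x7F,
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "csrcs": list(csrcs),
    }


# Example values are arbitrary: a mixer (SSRC 0x11223344) combining two sources.
header = pack_rtp_header(payload_type=0, sequence=1000, timestamp=160000,
                         ssrc=0x11223344, csrcs=(0xAAAA0001, 0xAAAA0002))
print(len(header), parse_rtp_header(header))
```

Network byte order (`!`) is used throughout, so the 12 fixed bytes are followed by 4 bytes for each CSRC identifier.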

34.4.5 Real Time Control Protocol (RTCP)

In a session, in addition to the media data, control packets are sent. These RTCP packets are sent periodically to all the participants. These packets contain information on the quality of service, source of media being transmitted on the data port, and the statistics of the data so far transmitted. The types of packets are:

Sender report (SR) packets: A participant that recently sent data packets sends an SR packet that contains the number of packets and bytes sent and information that can be used to synchronize media streams from different sessions.

Receiver report (RR) packets: Session participants periodically send RR packets. An RR packet contains information about the number of packets lost, highest sequence number received, and the timestamp that can be used to estimate the round-trip delay between the sender and the receiver.

Source description packets: A source description packet carries source description (SDES) elements, including the canonical name (CNAME) that identifies the source.

BYE packets: A BYE packet is sent by a source when it is leaving the session. The packet may include a reason as to why it is leaving.

Application-specific packets: An APP packet is sent to define and send any application-specific information.

Real Time Control Protocol is used to send control packets to all participants. These control packets contain information about the quality of service and the statistics of the data so far transmitted.
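The round-trip estimate mentioned for receiver reports can be sketched as follows. The field names LSR (timestamp of the last sender report received) and DLSR (delay between receiving that report and sending the RR) are taken from the RTP specification, RFC 3550, and do not appear in the text above; the function name is hypothetical. All three values are 32-bit numbers in units of 1/65536 second, and the sample values are the worked example given in RFC 3550.

```python
def round_trip_seconds(arrival, lsr, dlsr):
    """Estimate the round-trip delay when an RR packet arrives at the sender.

    arrival: time the RR was received, in 1/65536 s units
    lsr:     'last SR' timestamp echoed back in the RR
    dlsr:    'delay since last SR' reported by the receiver
    """
    # Mask to 32 bits so the subtraction still works if the counter wraps.
    return ((arrival - lsr - dlsr) & 0xFFFFFFFF) / 65536.0


# Sample values from the example in RFC 3550.
print(round_trip_seconds(arrival=0xB7108000, lsr=0xB7052000, dlsr=0x00054000))
```

The sender subtracts the time the receiver held the report (DLSR) from the total elapsed time, so that only the time spent in the network counts toward the estimate.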

Since the PSTN and the Internet use different protocols for voice communication, if we need to make calls from an H.323 network to the PSTN or vice versa, we need a gateway. The block diagram of the gateway is shown in Figure 34.8.

Figure 34.8: Interconnecting Internet and PSTN through gateway.

The gateway has two modules. One module does the conversion of the signaling protocols between the PSTN and the Internet. The second module does the necessary transformation of the media: since the PSTN uses 64 kbps PCM coding whereas the VoIP network uses low bit rate coders, this module performs the coding transformation (voice transcoding).

The gateway between the IP network and the PSTN handles two functions: (a) conversion of the signaling protocols between the PSTN and the IP network and (b) media transformation such as conversion of the PCM-coded speech to low bit rate coded speech.
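The effect of the media transformation on packet size can be worked out from the coding rates. The sketch below compares the payload needed for 30 msec of speech in 64 kbps PCM with the two G.723.1 rates listed earlier. It assumes that one packet carries exactly one 30 msec frame, and the function name is illustrative.

```python
def payload_bytes(bit_rate_bps, frame_ms):
    """Bytes needed to carry frame_ms of speech, rounded up to whole bytes."""
    # bits = bit_rate_bps * frame_ms / 1000, and bytes = bits / 8;
    # integer ceiling division avoids floating-point rounding.
    return -(-(bit_rate_bps * frame_ms) // 8000)


for name, rate in [("PCM", 64_000), ("G.723.1 at 6.3 kbps", 6_300),
                   ("G.723.1 at 5.3 kbps", 5_300)]:
    print(name, payload_bytes(rate, frame_ms=30), "bytes per 30 ms")
```

Under these assumptions, the transcoding module reduces each 240-byte block of PCM speech to 24 or 20 bytes before it is sent into the IP network, and expands it again in the other direction.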
