21.5 Performance Issues in Mobile Video Telephony

Having given an overview of the standards for mobile video telephony, including CS/PS terminals and call control issues in 3GPP networks, we now turn to performance. In particular, this section analyzes error resilience, QoS profiles for conversational services, QoS metrics for video, video quality results for 3G-324M terminals, SIP signaling delay, and RTCP reporting capability.

21.5.1 Error Resilience and QoS

In mobile video telephony, special attention must be paid to error resilience. Because an efficient system must operate with minimal end-to-end delay, there is often not enough time to repair media that has been hit by errors due to the lossy nature of the air interface. In most cases, forward error correction (or redundancy coding) is the only means of providing error resilience within the imposed delay bounds. In addition, some special precautions can be taken to reduce the impact of data corruption and packet losses on the received media. Here we focus on PS video telephony systems.

When a video signal is encoded using H.263 Profile 3, the error resilience achieved is higher than with baseline H.263. MPEG-4 Visual also offers advanced tools for error resilience, such as data partitioning, RVLC (Reversible Variable Length Codes), and resynchronization markers. To guarantee low delay, the specification ^[56] recommends that video packets be no larger than 512 bytes. In general, the smaller the packets, the smaller the amount of video data lost (and the smaller the loss in visual quality) when packets are lost. On the other hand, packets that are too small produce excessive RTP/UDP/IP header overhead. The choice of packet size is therefore a trade-off between error resilience, delay, and bandwidth occupancy. The packet size can also be changed dynamically, based on the condition of the network link.
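The header side of this trade-off can be illustrated with a short calculation. The following is a minimal sketch, assuming uncompressed IPv4, UDP, and fixed RTP headers (40 bytes in total, no header compression and no RTP extensions); the payload sizes are illustrative values, not figures from the specification.

```python
# Assumed header sizes in bytes: uncompressed IPv4, UDP, and the fixed RTP header.
IPV4_HEADER = 20
UDP_HEADER = 8
RTP_HEADER = 12
HEADERS = IPV4_HEADER + UDP_HEADER + RTP_HEADER  # 40 bytes per packet


def header_overhead(payload_bytes: int) -> float:
    """Return the fraction of a transmitted packet taken by RTP/UDP/IP headers."""
    if payload_bytes <= 0:
        raise ValueError("payload size must be positive")
    return HEADERS / (HEADERS + payload_bytes)


if __name__ == "__main__":
    # Illustrative payload sizes; smaller payloads lose less data per lost
    # packet but spend a larger share of the bandwidth on headers.
    for payload in (64, 128, 256, 512):
        share = header_overhead(payload)
        print(f"{payload:4d}-byte payload: {share:.1%} of each packet is header")
```

Under these assumptions the header share falls steadily as the payload grows, which is why very small packets are penalized in bandwidth even though they limit the damage of a single loss.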

When encoding and packetizing speech data with AMR or AMR-WB, the specification ^[57] mandates (or forbids) the use of certain codec options:

Speech data must be packetized using bandwidth-efficient operation. The encapsulation algorithm ^[58] offers both bit and byte alignment of data; the former is more efficient in terms of bandwidth usage.
No more than one speech frame shall be encapsulated into an RTP packet, to keep the delay at a minimum. One AMR speech frame has a duration of 20 ms. This implies that the packet rate at the videophone terminal is 50 packets per second for both incoming and outgoing RTP flows.
The multichannel session shall not be used.
Interleaving shall not be used, because it increases delay.
Internal CRC shall not be used, because data correction is performed in the lower layers of the protocol stack. This saves bandwidth.
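The effect of these options on payload size and bit rate can be sketched as follows. This is an illustration only: the field widths (a 4-bit CMR and a 6-bit table-of-contents entry in bandwidth-efficient mode, one octet each in octet-aligned mode) are our reading of the payload format in ^[58], the 40-byte header figure assumes uncompressed RTP/UDP/IPv4, and the function names are hypothetical.

```python
import math

# Speech bits per 20-ms frame for two AMR modes.
AMR_FRAME_BITS = {"4.75": 95, "12.2": 244}
PACKETS_PER_SECOND = 50   # one 20-ms frame per RTP packet
HEADERS_BYTES = 40        # assumed uncompressed RTP (12) + UDP (8) + IPv4 (20)


def payload_bytes(speech_bits: int, octet_aligned: bool) -> int:
    """RTP payload size for one frame per packet, without CRC or interleaving."""
    if octet_aligned:
        # One octet of payload header, one octet of table-of-contents entry,
        # and the speech bits padded to a whole number of octets.
        return 1 + 1 + math.ceil(speech_bits / 8)
    # Bandwidth-efficient mode: 4-bit CMR + 6-bit ToC entry + speech bits,
    # padded only once at the end of the payload.
    return math.ceil((4 + 6 + speech_bits) / 8)


def ip_bit_rate_kbps(speech_bits: int, octet_aligned: bool) -> float:
    """Bit rate at the IP level, including the assumed 40 header bytes per packet."""
    packet = payload_bytes(speech_bits, octet_aligned) + HEADERS_BYTES
    return packet * 8 * PACKETS_PER_SECOND / 1000


if __name__ == "__main__":
    for mode, bits in AMR_FRAME_BITS.items():
        for aligned in (False, True):
            label = "octet-aligned" if aligned else "bandwidth-efficient"
            print(f"AMR {mode} {label}: {payload_bytes(bits, aligned)} bytes, "
                  f"{ip_bit_rate_kbps(bits, aligned):.1f} kbps at IP level")
```

With one frame per packet the saving from bit alignment is at most one byte per packet, while the uncompressed headers dominate the IP-level bit rate. This is the price paid for keeping the packetization delay at 20 ms.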

For the transmission of real-time text using T.140, the use of redundancy coding is recommended to provide a better error resilience. ^[59]

At the network level, 3GPP specifications offer the possibility to configure the QoS profile for a conversational multimedia application running over a conversational PDP context. The specification ^[60] defines the recommended target figures for error rates and delays:

SDU error ratio (or packet loss rate): 0.7 percent or less for speech and 0.01 percent for video.
Transfer delay: 100 ms for speech and 150 ms for video.

In CS networks, errors in the air interface produce single bit errors in the video packet payload. A video decoder is generally resilient to bit error rates up to 10^-3. In PS networks, errors in the air interface produce erroneous packets that generally are not forwarded to the protocol layers above IP, so they are regarded as lost packets. In this case, the SDU error ratios indicated previously can be used to make the media sufficiently resilient to packet losses.
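The difference between the two cases can be made concrete with an idealized model. The sketch below assumes independent, uniformly distributed bit errors reaching the packet layer, which real radio links with channel coding and link-layer retransmission do not strictly obey; under that assumption a packet is lost as soon as any one of its bits is in error.

```python
def packet_error_probability(ber: float, packet_bytes: int) -> float:
    """Probability that a packet contains at least one bit error.

    Idealized model: bit errors are independent with probability `ber`.
    """
    if not 0.0 <= ber <= 1.0:
        raise ValueError("BER must be between 0 and 1")
    return 1.0 - (1.0 - ber) ** (8 * packet_bytes)


if __name__ == "__main__":
    # Illustrative packet sizes and bit error rates (not measured values).
    for ber in (1e-5, 1e-4, 1e-3):
        for size in (100, 512):
            p = packet_error_probability(ber, size)
            print(f"BER {ber:.0e}, {size}-byte packet: loss probability {p:.3f}")
```

In this model a bit error rate that a CS video decoder could tolerate translates into a much higher packet loss rate once whole packets are discarded, which is consistent with the stricter SDU error ratio targets for PS bearers.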

21.5.2 Video QoS Metrics

Adequate techniques for objective and subjective speech and video quality assessment must be adopted to guarantee that a given mobile videophone implementation fulfills a minimum set of QoS requirements. This section focuses on video quality metrics.

When developing mobile video telephony applications, one must decide which fundamental quality parameters to select as key parameters in the QoS assessment of video. For this purpose both subjective and objective metrics must be used, because they are complementary. The reader is referred to Curcio ^[61] for details about subjective metrics. Regarding objective quality metrics, standardization bodies have defined some methods. For example, the ANSI ^[62] and ITU ^[63] standards describe some metrics. However, for some of the metrics described in these documents, the implementation is not straightforward. Despite the effort of standardization bodies to define common video quality metrics, the most-used objective method for video quality assessment is often the PSNR (Peak Signal-to-Noise Ratio), because it is the easiest of the available metrics to apply. However, other useful quality metrics can be put to use when developing mobile videophone terminals. For further details on the metrics computation methods, please refer to Curcio. ^[64], ^[65]

The quality metrics are categorized into six classes, depending on the type of information they can provide:

Frame-based: This set of metrics gives information about the number of frames that have been processed end-to-end. The metrics are
- Number of encoded frames
- Number of decoded frames
- Number of dropped frames
- Drop frame rate
- Encoding frame rate
- Decoding frame rate
- Display frame rate
- Size of the first INTRA-coded frame
Bit rate-based: The objective of these metrics is to show how the channel bandwidth is divided among the different types of data. This information is valuable for optimizing system performance. The metrics are
- Audio bit rate (obtained by the audio codec)
- Video bit rate
- Packetization overhead
- Application total bit rate (computed as a sum of the above values)
Packet-based: These metrics give information about the packets that are generated by the RTP packetizer or the H.223 multiplexer:
- Number of packets per frame
- Size of the packets
Loss- or corruption-based: These metrics provide information about the amount of packets lost, or the amount of correctly/incorrectly delivered data:
- Packet loss rate
- Correctly delivered data rate
- Misdelivered data rate
- Bit error rate computation
PSNR-based: PSNR is a measure of the difference between the original frame and the corresponding encoded (or decoded) frame. PSNR-based metrics are
- PSNR of the video sequence
- Standard deviation of PSNR
- PDF (Probability Density Function) and CDF (Cumulative Distribution Function) of PSNR
- Representative run for subjective evaluation (when multiple simulation runs are considered)
Delay-based: Delay is a critical issue in mobile video telephony. Because end-to-end delay is made up of different components, one approach is to measure the different delays and try to optimize them separately. A set of measurable delays is
- Capturing delay
- Initial video encoding delay (time required to encode the first INTRA frame)
- Encoding delay for video frames (minimum, average, and maximum)
- Packetization delay
- Transmission delay (related to the network)
- Depacketization delay
- Decoding delay for video frames (minimum, average, and maximum)
- Display delay
- End-to-end delay
- PDF and CDF for any of the delay components above
- Out of delay constraints rate (to measure the percentage of delay violation over a fixed threshold T of time)
- Delay jitter computed for different delays above (a particularly interesting value is the frame rate jitter)
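A few of the metrics listed above can be sketched in code. The following is a minimal illustration, assuming 8-bit samples (peak value 255), frames given as flat sequences of sample values, and the population standard deviation; the function names are hypothetical and this is not the computation method of ^[64] or ^[65].

```python
import math
from statistics import mean, pstdev


def psnr(original, degraded, peak=255):
    """PSNR in dB between two equally sized sequences of sample values."""
    if len(original) != len(degraded) or len(original) == 0:
        raise ValueError("frames must be non-empty and of equal size")
    mse = sum((a - b) ** 2 for a, b in zip(original, degraded)) / len(original)
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(peak ** 2 / mse)


def sequence_psnr_stats(per_frame_psnr):
    """Average and (population) standard deviation of per-frame PSNR values."""
    return mean(per_frame_psnr), pstdev(per_frame_psnr)


def drop_frame_rate(encoded_frames: int, dropped_frames: int) -> float:
    """Fraction of encoded frames that were dropped."""
    return dropped_frames / encoded_frames


def out_of_delay_rate(delays_ms, threshold_ms):
    """Fraction of delay samples that exceed the fixed threshold T."""
    return sum(d > threshold_ms for d in delays_ms) / len(delays_ms)


if __name__ == "__main__":
    # Toy 4-sample "frames" and delay samples, for illustration only.
    print(psnr([16, 32, 64, 128], [17, 30, 64, 131]))
    print(sequence_psnr_stats([30.0, 32.0]))
    print(out_of_delay_rate([100, 200, 300, 400], threshold_ms=250))
```

In practice the PSNR would be computed per frame over the luminance component and then summarized over the sequence; the PDF and CDF metrics follow from a histogram of the same per-frame values.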

21.5.3 Video Quality Results for 3G-324M

To give an idea of the performance of a videophone in a mobile environment, we have implemented a PC version of a 3G-324M terminal and made mobile-to-mobile calls between two 3G-324M PC terminals through a simulated circuit-switched WCDMA network at 64 kbps. Table 21.2 summarizes the main simulation parameters used in our tests.

Table 21.2: 3G-324M Simulation Parameters
Speech	Preencoded AMR speech stream with average bit rate of 4.9 kbps (silence suppression is used)
Video codec	H.263+ with Annex F, I, J, T
Input frame rate	30 fps
Frame size	QCIF (176 x 144 pixels)
Original video sequence	Carphone concatenated three times; original: 382 frames, 12.7 seconds; concatenated: 1146 frames, 38.1 seconds
WCDMA channel bit rate	64 kbps
Mobile speeds	3 and 50 kmph
Bit error rates (BERs)	64 kbps, 3 kmph: 2*7E-05 and 2*2E-04; 64 kbps, 50 kmph: 2*6E-05 and 2*2E-04
Frequency	1920 MHz
Chip rate	4.096 Mcps
Transmission direction	Uplink
Interleaving depth	40 ms
Coding	1/3-rate turbo code, 4 states
Duration of each error pattern	180 seconds
Multiplexing	H.223 Level 2
Number of simulations	10 for each error pattern file (each time starting from a different random position of the file)

The error patterns were injected two times into the bit stream to simulate the case of mobile-to-mobile connection, where two radio links are involved (this is the reason the bit error rates are doubled in Table 21.2).
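The double injection can be sketched as two XOR passes over the bit stream. This is an illustration under assumptions of ours: the error pattern is taken to be a byte string in which a set bit marks a bit error, it is read cyclically from a random starting position, and the two passes use independent starting positions. The actual file format and tooling used in the tests are not described here.

```python
import random


def inject_errors(bitstream: bytes, pattern: bytes, offset: int) -> bytes:
    """XOR the bit stream with the error pattern, read cyclically from `offset`."""
    n = len(pattern)
    return bytes(b ^ pattern[(offset + i) % n] for i, b in enumerate(bitstream))


def simulate_two_radio_links(bitstream: bytes, pattern: bytes,
                             rng: random.Random) -> bytes:
    """Apply the error pattern twice, once per radio link of a mobile-to-mobile call."""
    after_first_link = inject_errors(bitstream, pattern, rng.randrange(len(pattern)))
    return inject_errors(after_first_link, pattern, rng.randrange(len(pattern)))


if __name__ == "__main__":
    rng = random.Random(1)                        # fixed seed for repeatability
    pattern = bytes([0x00] * 99 + [0x01])         # toy pattern: 1 bit error per 100 bytes
    received = simulate_two_radio_links(bytes(1000), pattern, rng)
    flipped = sum(bin(b).count("1") for b in received)
    print(f"bits in error after two links: {flipped}")
```

Because the two passes are independent, the overall bit error rate is close to twice that of a single link as long as the error rate is low, which matches the 2* notation used for the error rates.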

The results obtained were measured in terms of average PSNR over 10 runs, standard deviation of PSNR, frame rate, delay, bandwidth usage, and visual quality. The reader interested in details about performance of 3G-324M terminals at different bit rates with service flexibility for WCDMA and HSCSD networks may refer to Curcio and coworkers, ^[66] Hourunranta and Curcio, ^[67] and Curcio and Hourunranta. ^[68] Average PSNR results are shown in Table 21.3.

Table 21.3: PSNR for Video over 3G-324M
Speed/BER	Average PSNR (dB)	Standard Deviation of PSNR (dB)
Error free	32.12	0.02
3 kmph 2*7E-05	32.01	0.07
3 kmph 2*2E-04	31.65	0.19
50 kmph 2*6E-05	31.89	0.14
50 kmph 2*2E-04	31.64	0.19

The maximum quality loss achieved at the higher BER is below 0.5 dB, with a maximum standard deviation below 0.2 dB. The average encoding frame rate was 10.2 frames per second. The end-to-end delay from encoding to display (excluding capturing and network delay) was 140 ms, of which 98 ms was for shaping delay ^[69] and 42 ms was the processing delay. The bandwidth usage is reported in Table 21.4.

Table 21.4: Bandwidth Repartition for 3G-324M
Type of Data	Percentage of Occupancy on the Total Bandwidth
Video data	84
Audio data	8
H.223 multiplexer overhead	8

Finally, Figure 21.9 shows the average visual quality under the worst of the conditions tested (BER = 2*2E-04 at 3 kmph). The sample picture has been selected so that its PSNR is as close as possible to the average PSNR (31.65 dB). As can be seen, the picture does not show critical degradations, and its quality is fairly good.

Figure 21.9: Carphone 64 kbps BER = 2*2E-04 3 kmph.

21.5.4 SIP Signaling Delay

One factor that influences the overall QoS perceived by the user, in addition to media quality, is the call setup delay. It matters when user satisfaction with a service is evaluated as a whole. In this section we consider this issue by evaluating the signaling performance of a SIP user agent (UA) with video telephony capabilities that we have implemented.

When SIP is used over UDP on a mobile network, the call setup time between two terminals can vary because of the following factors:

Lossy nature of the channel: If SIP packets are lost during call establishment, these are retransmitted.
Size of the channel: Smaller network bandwidths yield higher call setup delays than larger bandwidths.
Processing delays in the network: Each network element takes some time to process the requests made by the endpoints.
Congestion in the network path between the two endpoints.

The message reliability mechanism defined in SIP ^[70] is designed to cope with packet losses and unexpected delays within the network. The basic idea is that if a SIP message is not received within a specified time, it is retransmitted by the protocol itself. The retransmission rules for the different SIP messages exchanged in a session between two SIP UAs, such as the one in Figure 21.6, are explained below:

INVITE method. A SIP UA should retransmit an INVITE request with an interval that starts at T1 seconds, and doubles after each packet transmission. T1 is an estimate of the round-trip time (RTT). The client stops retransmissions if it receives a provisional (1xx) or definitive (2xx) response, or once it has sent a total of seven request packets. A UA client may send a BYE or CANCEL request after the seventh retransmission (i.e., after 64*T1 seconds). In our implementation the value of T1 is set at 0.5 seconds.
BYE method. In this case, a SIP client should retransmit requests with an exponential backoff for congestion control reasons. For example, if the first packet sent is lost, the second packet is sent T1 seconds later, the next one 2*T1 seconds after that, then 4*T1 seconds, and so on, until the interval reaches a value T2. Subsequent retransmissions are spaced by T2 seconds. T2 represents the amount of time a BYE server transaction will take to respond to a request, if it does not respond immediately. If the client receives a provisional response, it continues to retransmit the request, but with an interval of T2 seconds (this is done to ensure reliable delivery of the final response). Retransmissions cease when the client has sent a total of 11 packets (i.e., after T1*64 seconds), or it has received a definitive response. Responses to BYE are not acknowledged via ACK. In our implementation the values of T1 and T2 are set to 0.5 and 4 seconds, respectively.
ACK method. ACK is not retransmitted, but in case of loss the UA server retransmits the 200/OK.
Informational (provisional) responses (1xx). UA servers do not transmit informational responses reliably. For instance, our implementation does not retransmit informational responses (100/TRYING, 180/RINGING) on a timer. However, a UA server that has transmitted a provisional response will retransmit it upon reception of a duplicate request.
Successful responses (2xx). A UA server does not retransmit responses to BYE. In all other cases, a UA server that transmits a final response should retransmit it with the same spacing as the BYE. Response retransmissions cease when an ACK request is received or the response has been transmitted 11 times (i.e., after 64*T1 seconds). The value of a final response is not changed by the arrival of a BYE or CANCEL request.
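The request retransmission rules above can be turned into explicit send times. The sketch below is a simplified reading of those rules: it assumes that no response of any kind arrives, and it ignores the switch to T2 spacing after a provisional response. The function names are hypothetical; the default timer values are the ones used in our implementation.

```python
def invite_send_times(t1: float = 0.5, max_packets: int = 7) -> list:
    """Send times (seconds) of an unanswered INVITE: the interval starts at T1 and doubles."""
    times = [0.0]
    interval = t1
    for _ in range(max_packets - 1):
        times.append(times[-1] + interval)
        interval *= 2
    return times


def bye_send_times(t1: float = 0.5, t2: float = 4.0, max_packets: int = 11) -> list:
    """Send times (seconds) of an unanswered BYE: exponential backoff capped at T2."""
    times = [0.0]
    interval = t1
    for _ in range(max_packets - 1):
        times.append(times[-1] + interval)
        interval = min(2 * interval, t2)
    return times


if __name__ == "__main__":
    print("INVITE:", invite_send_times())
    print("BYE:   ", bye_send_times())
```

With T1 = 0.5 and T2 = 4 seconds, both schedules send their last packet 31.5 seconds after the first one under this reading, that is, just before the 64*T1 = 32 seconds mentioned above elapse.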

In 3GPP Release 5 networks, the timers T1 and T2 are set to different and more-conservative values.

Our tests were performed over a 3GPP Release '99 network emulator. The results are expressed in terms of the following metrics:

Postdialing delay (PDD). Also called postselection delay or dial-to-ring delay, this is the time elapsed between the caller pressing the call button on the terminal and hearing the terminal ring. In our case the PDD corresponds to the time T1 in Figure 21.6 (not to be confused with the retransmission timer T1).
Answer-signal delay (ASD). This is the time elapsed between the phone being picked up and the caller receiving indication of this. In our case the ASD corresponds to the time T2 in Figure 21.6 (not to be confused with the retransmission timer T2). It must be noted that the caller is notified that the callee has picked up the phone when the caller receives the 200/OK. However, the call-signaling handshake is completed only when the callee receives the ACK from the caller. This is the reason we have considered the ASD in this way.
Call-release delay (CRD). This is the time elapsed between the releasing party (the caller in our example in Figure 21.6) hanging up and the moment a new call can be initiated or received by the same party. In our tests the CRD corresponds to the time T3.
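These three metrics can be derived from message timestamps logged at the terminals. The sketch below reflects our reading of the definitions above and is an assumption, since it maps each metric to a pair of SIP messages without reference to the figure: PDD from sending the INVITE to receiving 180/RINGING, ASD from sending the 200/OK to receiving the ACK, and CRD from sending the BYE to receiving its 200/OK. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CallTimestamps:
    """Timestamps in milliseconds. Each metric uses two events from one clock."""
    # Logged at the caller
    invite_sent: float
    ringing_received: float
    bye_sent: float
    bye_ok_received: float
    # Logged at the callee
    ok_sent: float
    ack_received: float


def call_setup_metrics(ts: CallTimestamps) -> dict:
    """Return PDD, ASD, and CRD in milliseconds under the assumed mapping."""
    return {
        "pdd_ms": ts.ringing_received - ts.invite_sent,
        "asd_ms": ts.ack_received - ts.ok_sent,
        "crd_ms": ts.bye_ok_received - ts.bye_sent,
    }


if __name__ == "__main__":
    # Made-up timestamps, for illustration only.
    sample = CallTimestamps(invite_sent=0.0, ringing_received=70.0,
                            bye_sent=9000.0, bye_ok_received=9055.0,
                            ok_sent=3000.0, ack_received=3040.0)
    print(call_setup_metrics(sample))
```

Because each metric is the difference of two timestamps taken on the same terminal, no clock synchronization between caller and callee is needed.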

Results of simulations are shown in Table 21.5. No signaling compression algorithms were used. For comparisons between calls over 3GPP Release '99 networks and calls in Intranet or WLAN environment, and for further details about SIP signaling delays, the reader can refer to Curcio and Lundan. ^[71], ^[72]

Table 21.5: Call Setup Times for SIP Signaling
Call Setup Metric	Delay (ms)
Postdialing delay	62
Answer-signal delay	45
Call-release delay	50

Table 21.6 contains results for SIP call setup times in the case of restricted bandwidths. Here too, no signaling compression was used. The results in Table 21.5 assumed a bandwidth of 384 kbps; however, in many cases it is better to assume, as we did, that the bearer reserved for SIP signaling is a dedicated one, and of smaller size. In addition, when running the tests with restricted bandwidth we injected 2 percent packet loss using the NISTNet ^[73] emulator. The figures show that the postdialing delay increases to almost one second for network bandwidths as narrow as 2 kbps.

Table 21.6: Postdialing Delay for SIP Signaling with Limited Bandwidth
Network Bandwidth (kbps)	Delay (ms)
2	981
5	427
9.2	287
16	164
32	119
64	78

The results presented in this section do not apply to SIP signaling within Release 5 of the 3GPP specifications, where SIP is part of the call control in the IMS. In that case, the SIP signaling delays are estimated to be larger than those shown, due to the increased complexity of the whole network system.

21.5.5 RTCP Performance

The basics of RTCP were introduced in Chapter 4, Section 4.5. We recall that RTCP is used by a receiver to provide QoS information to the transmitting party, so that the latter can repair the transmitted media or, in general, take some (possibly prompt) action to adjust or improve the QoS toward the receiver.

RTCP packets are normally sent with a minimum interval of 5 seconds. However, some applications may benefit from more-frequent feedback. Video telephony certainly can, because faster feedback allows the sender terminal to react sooner and provide a better QoS to the receiving terminal. One possible action is to change the encoding parameters on the fly when the packet loss rate increases. This action should be taken as early as possible in the transmitting terminal, and a 5-second interval could be too long a time window to allow a fast reaction, especially if the media to repair is a speech stream.

The new RTP specifications ^[74], ^[75] define a more-flexible use of the RTCP data flow, allowing more-frequent feedback by reducing the transmission interval to a value lower than 5 seconds or by fixing the percentage of the RTP session bandwidth reserved for RTCP traffic.

We have run some tests for 1-minute speech and video streams. The former was encoded using the AMR codec at 12.2 kbps with silence suppression. The latter was encoded using the H.263+ video codec at 64 kbps. For the speech session the maximum RTCP packet length was 168 bytes (including UDP/IP headers, a sender report and full SDES), while for the video session the RTCP packet length was 88 bytes (including UDP/IP headers, a receiver report and SDES). Results for different RTCP bandwidth percentages are shown in Tables 21.7 and 21.8.

Table 21.7: Results for Different RTCP Bandwidth Percentages (AMR Speech)
RTCP Bandwidth (%)	RTCP Bandwidth (kbps)	Average RTCP Interval (ms)	Number of RTCP Packets
1.0	0.19	5000	12
1.6	0.28	3158	19
1.9	0.35	2609	23
2.2	0.39	2308	26

Table 21.8: Results for Different RTCP Bandwidth Percentages (H.263+ Video)
RTCP Bandwidth (%)	RTCP Bandwidth (kbps)	Average RTCP Interval (ms)	Number of RTCP Packets
0.2	0.14	5000	12
0.3	0.18	4000	15
0.5	0.34	2069	29
1.0	0.67	1053	57
1.2	0.82	857	70
1.9	1.31	536	112
2.4	1.63	432	139
2.8	1.88	375	160

The leftmost column in Tables 21.7 and 21.8 contains the RTCP bandwidth as a percentage of the RTP session bandwidth, which includes media and header overhead (RTP/UDP/IP headers). The second column contains the RTCP bandwidth in kilobits per second. The third column is the computed average RTCP interval between two QoS reports. The last column contains the number of RTCP packets sent by a receiver during 1 minute of data reception.

The reader can see that the minimum bandwidth occupied by RTCP is below 0.2 kbps when a 5-second transmission interval is used. In this case the receiver sends only 12 QoS reports. When increasing the bandwidth reserved for RTCP, more-frequent feedback can be sent. For example, for speech traffic, a feedback message every 2.3 seconds would allow 26 QoS reports in one minute. In the same way for video traffic, a feedback message every 375 ms would allow 160 QoS reports in one minute. This would let the transmitting terminal have more possibilities to adjust the error resilience properties of the video stream. Theoretically, it could be possible to have 160 QoS reports for the speech stream as well. However, this would imply an RTCP bandwidth of about 2.4 kbps, i.e., over 13 percent of the RTP session bandwidth for speech. The reader interested in details about RTCP traffic can refer to Curcio and Lundan. ^[76], ^[77]
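The relationship between the columns of Table 21.8 can be checked with a short calculation. The sketch below assumes that every RTCP packet of the video session has the 88-byte size quoted above and that the session lasts exactly 60 seconds; the function names are hypothetical. The speech figures in Table 21.7 cannot be reproduced this way, because 168 bytes is only the maximum RTCP packet length for that session.

```python
SESSION_SECONDS = 60
VIDEO_RTCP_PACKET_BYTES = 88   # includes UDP/IP headers, receiver report, and SDES


def rtcp_interval_ms(num_packets: int) -> float:
    """Average interval between RTCP packets over the session, in milliseconds."""
    return SESSION_SECONDS * 1000 / num_packets


def rtcp_bandwidth_kbps(num_packets: int,
                        packet_bytes: int = VIDEO_RTCP_PACKET_BYTES) -> float:
    """Average RTCP bandwidth over the session, in kilobits per second."""
    return num_packets * packet_bytes * 8 / SESSION_SECONDS / 1000


if __name__ == "__main__":
    # Packet counts taken from the last column of Table 21.8.
    for n in (12, 15, 29, 57, 70, 112, 139, 160):
        print(f"{n:3d} packets: interval {rtcp_interval_ms(n):6.0f} ms, "
              f"bandwidth {rtcp_bandwidth_kbps(n):.2f} kbps")
```

Under these assumptions the computed intervals and bandwidths agree with the values in Table 21.8 after rounding, for example 375 ms and 1.88 kbps for 160 packets.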

^[56]3GPP TSGS-SA, Packet switched conversational multimedia applications, Transport protocols (Release 5), TS 26.236, v.5.7.0 (2002-12).

^[57]3GPP TSGS-SA, Packet switched conversational multimedia applications, Transport protocols (Release 5), TS 26.236, v.5.7.0 (2002-12).

^[58]Sjoberg, J. et al., RTP payload format and file storage format for the Adaptive Multi-Rate (AMR) and Adaptive Multi-Rate Wideband (AMR-WB) audio codecs, IETF RFC 3267, March 2002.

^[59]3GPP TSGS-SA, Packet switched conversational multimedia applications, Default codecs (Release 5), TS 26.235, v.5.1.0 (2002-03).

^[60]3GPP TSGS-SA, Packet switched conversational multimedia applications, Transport protocols (Release 5), TS 26.236, v.5.7.0 (2002-12).

^[61]Curcio, I.D.D., Mobile video QoS metrics, Int. J. Comput. Appl., 24 (2), 41–51, 2002.

^[62]ANSI, Digital Transport of One-Way Video Signals - Parameters for Objective Performance Assessment, T1.801.03, 1996.

^[63]ITU-T, Multimedia communications delay, synchronization and frame rate measurement, Recommendation P.931, December 1998.

^[64]Curcio, I.D.D., Mobile video QoS metrics, Int. J. Comput. Appl., 24 (2), 41–51, 2002.

^[65]Curcio, I.D.D., Practical Metrics for QoS Evaluation of Mobile Video, Internet and Multimedia Systems and Applications Conference (IMSA 2000), Las Vegas, 9–23 November 2000, pp. 199–208.

^[66]Curcio, I.D.D., Lappalainen, V., and Mostafa, M.-E., QoS evaluation of 3G-324M mobile videophones over WCDMA networks, Comput. Networks, 37 (3–4), 425–445, 2001.

^[67]Hourunranta, A. and Curcio, I.D.D., Delay in Mobile Videophones, IEEE 7th Mobile Multimedia Communications Workshop (MoMuC 2000), Tokyo, 23–26 October 2000, pp. 1-B-3-1/1-B-3-7.

^[68]Curcio, I.D.D. and Hourunranta, A., QoS of Mobile Videophones in HSCSD Networks, IEEE 8th International Conference on Computer Communications and Networks (ICCCN '99), Boston, 11–13 October 1999, pp. 447–451.

^[69]Hourunranta, A. and Curcio, I.D.D., Delay in Mobile Videophones, IEEE 7th Mobile Multimedia Communications Workshop (MoMuC 2000), Tokyo, 23–26 October 2000, pp. 1-B-3-1/1-B-3-7.

^[70]Rosenberg, J. et al., SIP: Session Initiation Protocol, IETF RFC 3261, March 2002.

^[71]Curcio, I.D.D. and Lundan, M., SIP Call Setup Delay in 3G Networks, IEEE 7th Symposium on Computers and Communication (ISCC '02), Taormina, Italy, 1–4 July 2002, pp. 835–840.

^[72]Curcio, I.D.D. and Lundan, M., Study of Call Setup in SIP-Based Videotelephony, 5th World Multi-Conference on Systemics, Cybernetics and Informatics (SCI 2001), Orlando, 22–25 July 2001, Vol. IV, pp. 1–6.

^[73]NIST, NISTNet, http://www.antd.nist.gov/nistnet/.

^[74]Schulzrinne, H. et al., RTP: A Transport Protocol for Real-Time Applications, IETF draft, Work in progress, November 2001.

^[75]Casner, S., SDP Bandwidth Modifiers for RTCP Bandwidth, IETF draft, Work in progress, November 2001.

^[76]Curcio, I.D.D. and Lundan, M., Event-Driven RTCP Feedback for Mobile Multimedia Applications, IEEE 3rd Finnish Wireless Communications Workshop (FWCW '02), Helsinki, 29 May 2002.

^[77]Curcio, I.D.D. and Lundan, M., On RTCP Feedback for Mobile Multimedia Applications, IEEE International Conference on Networking (ICN '02), Atlanta, 26–29 August 2002.