A Brief History of Audio/Video Networking | RTP: Audio and Video for the Internet

A Brief History of Audio/Video Networking

The idea of using packet networks, such as the Internet, to transport voice and video is not new. Experiments with voice over packet networks stretch back to the early 1970s. The first RFC on this subject, the Network Voice Protocol (NVP),¹ dates from 1977. Video came later, but there is still more than ten years of experience with audio/video conferencing and streaming on the Internet.

Early Packet Voice and Video Experiments

The initial developers of NVP were researchers transmitting packet voice over the ARPANET, the predecessor to the Internet. The ARPANET provided a reliable-stream service (analogous to TCP/IP), but this introduced too much delay, so an "uncontrolled packet" service was developed, akin to the modern UDP/IP datagrams used with RTP. The NVP was layered directly over this uncontrolled packet service. Later the experiments were extended beyond the ARPANET to interoperate with the Packet Radio Network and the Atlantic Satellite Network (SATNET), running NVP over those networks.

All of these early experiments were limited to one or two voice channels at a time by the low bandwidth of the early networks. In the 1980s, the creation of the 3-Mbps Wideband Satellite Network enabled not only a larger number of voice channels but also the development of packet video. To access the one-hop, reserved-bandwidth, multicast service of the satellite network, a connection-oriented inter-network protocol called the Stream Protocol (ST) was developed. Both a second version of NVP, called NVP-II, and a companion Packet Video Protocol were transported over ST to provide a prototype packet-switched video teleconferencing service.

In 1989–1990, the satellite network was replaced with the Terrestrial Wideband Network and a research network called DARTnet, while ST evolved into ST-II. The packet video conferencing system was put into scheduled production to support geographically distributed meetings of network researchers and others at up to five sites simultaneously.

ST and ST-II were operated in parallel with IP at the inter-network layer but achieved only limited deployment on government and research networks. As an alternative, initial deployment of conferencing using IP began on DARTnet, enabling multiparty conferences with NVP-II transported over multicast UDP/IP. At the March 1992 meeting of the IETF, audio was transmitted across the Internet to 20 sites on three continents over multicast "tunnels" (the Mbone, which stands for "multicast backbone") extended from DARTnet. At that same meeting, development of RTP was begun.

Audio and Video on the Internet

Following from these early experiments, interest in video conferencing within the Internet community took hold in the early 1990s. At about this time, the processing power and multimedia capabilities of workstations and PCs became sufficient to enable the simultaneous capture, compression, and playback of audio and video streams. In parallel, development of IP multicast allowed the transmission of real-time data to any number of recipients connected to the Internet.

Video conferencing and multimedia streaming were obvious and well-executed multicast applications. Research groups took to developing tools such as vic and vat from the Lawrence Berkeley Laboratory,⁸⁷ nevot from the University of Massachusetts, the INRIA video conferencing system, nv from Xerox PARC, and rat from University College London.⁷⁷ These tools followed a new approach to conferencing, based on connectionless protocols, the end-to-end argument, and application-level framing.⁶⁵,⁷⁰,⁷⁶ Conferences were minimally managed, with no admission or floor control, and the transport layer was thin and adaptive. Multicast was used both for wide-area data transmission and as an interprocess communication mechanism between applications on the same machine (to exchange synchronization information between audio and video tools). The resulting collaborative environment consisted of lightly coupled applications and highly distributed participants.

The multicast conferencing (Mbone) tools had a significant impact: They led to widespread understanding of the problems inherent in delivering real-time media over IP networks, the need for scalable solutions, and error and congestion control. They also directly influenced the development of several key protocols and standards.

RTP was developed by the IETF in the period 1992–1996, building on NVP-II and the protocol used in the original vat tool. The multicast conferencing tools used RTP as their sole data transfer and control protocol; accordingly, RTP not only includes facilities for media delivery, but also supports membership management, lip synchronization, and reception quality reporting.

In addition to RTP for transporting real-time media, other protocols had to be developed to coordinate and control the media streams. The Session Announcement Protocol (SAP) ³⁵ was developed to advertise the existence of multicast data streams. Announcements of sessions were themselves multicast, and any multicast-capable host could receive SAP announcements and learn what meetings and transmissions were happening. Within announcements, the Session Description Protocol (SDP) ¹⁵ described the transport addresses, compression, and packetization schemes to be used by senders and receivers in multicast sessions. Lack of multicast deployment, and the rise of the World Wide Web, have largely superseded the concept of a distributed multicast directory, but SDP is still used widely today.
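To make the role of SDP more concrete, the following is a minimal sketch of how a receiver might pull the transport address and packetization details out of a session description. The session description itself is invented for illustration (the user name, addresses, port, and session name are all hypothetical), and `parse_sdp` is an assumed helper name, not part of any standard library or of the tools discussed above. It handles only the connection, media, and rtpmap lines, which is far less than a real SDP parser would need.

```python
# Hypothetical session description: every address, port, and name is invented.
EXAMPLE_SDP = """\
v=0
o=alice 2890844526 2890842807 IN IP4 192.0.2.10
s=Example seminar
c=IN IP4 224.2.17.12/127
t=0 0
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
"""


def parse_sdp(text):
    """Summarize the connection address and media streams in an SDP body.

    Illustrative only: ignores most SDP fields and does no validation.
    """
    summary = {"connection": None, "media": []}
    for line in text.splitlines():
        # Each SDP line has the form <type>=<value>, with a one-letter type.
        if len(line) < 2 or line[1] != "=":
            continue
        kind, value = line[0], line[2:]
        if kind == "c":
            # c=<network type> <address type> <connection address>
            _nettype, _addrtype, address = value.split()
            summary["connection"] = address
        elif kind == "m":
            # m=<media> <port> <transport> <format list>
            media, port, proto, *formats = value.split()
            summary["media"].append({
                "media": media,
                "port": int(port),
                "proto": proto,
                "formats": formats,
                "rtpmap": {},
            })
        elif kind == "a" and value.startswith("rtpmap:") and summary["media"]:
            # a=rtpmap:<payload type> <encoding name>/<clock rate>
            payload_type, encoding = value[len("rtpmap:"):].split(None, 1)
            summary["media"][-1]["rtpmap"][payload_type] = encoding
    return summary


session = parse_sdp(EXAMPLE_SDP)
```

Here `session["connection"]` holds the multicast address on which to listen, and each entry in `session["media"]` records the port, the transport (RTP/AVP in this example), and the mapping from RTP payload type to compression scheme.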

Finally, the Mbone conferencing community led development of the Session Initiation Protocol (SIP). ²⁸ SIP was intended as a lightweight means of finding participants and initiating a multicast session with a specific set of participants. In its early incarnation, SIP included little in the way of call control and negotiation support because such aspects were not used with the Mbone conferencing environment. It has since become a more comprehensive protocol, including extensive negotiation and control features.

ITU Standards

In parallel with the early packet voice work was the development of the Integrated Services Digital Network (ISDN), the digital version of the plain old telephone system, and an associated set of video conferencing standards. These standards, based around ITU recommendation H.320, used circuit-switched links and so are not directly relevant to our discussion of packet audio and video. However, they did pioneer many of the compression algorithms used today (for example, H.261 video).

The growth of the Internet and the widespread deployment of local area networking equipment in the commercial world led the ITU to extend the H.320 series of protocols. Specifically, they sought to make the protocols suitable for "local area networks which provide a non-guaranteed quality of service," IP being a classic protocol suite fitting the description. The result was the H.323 series of recommendations.

H.323 was first published in 1997 ⁶² and has undergone several revisions since. It provides a framework consisting of media transport, call signaling, and conference control. The signaling and control functions are defined in ITU recommendations H.225.0 and H.245. Initially the signaling protocols focused principally on interoperating with ISDN conferencing using H.320, and as a result suffered from a cumbersome session setup process that was simplified in later versions of the standard. For media transport, the ITU working group adopted RTP. However, H.323 uses only the media transport functionality of RTP and makes little use of the control and reporting elements.

H.323 met with reasonable success in the marketplace, with several hardware and software products built to support the suite of H.323 technologies. Development experience led to complaints about its complexity, in particular the complex setup procedure of H.323 version 1 and the use of binary message formats for the signaling. Some of these issues were addressed in later versions of H.323, but in the intervening period interest in alternatives grew.

One of those alternatives, which we have already touched on, was SIP. The initial SIP specification was published by the IETF in 1999,²⁸ as the outcome of an academic research project with virtually no commercial interest. It has since come to be seen as a replacement for H.323 in many quarters, and it is being applied to more varied applications, such as text messaging systems and voice-over-IP. In addition, it is under consideration for use in third-generation cellular telephony systems,¹¹⁵ and it has gathered considerable industry backing.

The ITU has more recently produced recommendation H.332, which combines a tightly coupled H.323 conference with a lightweight multicast conference. The result is useful for scenarios such as an online seminar, in which the H.323 part of the conference allows close interaction among a panel of speakers while a passive audience watches via multicast.

Audio/Video Streaming

In parallel with the development of multicast conferencing and H.323, the World Wide Web revolution took place, bringing glossy content and public acceptance to the Internet. Advances in network bandwidth and end-system capacity made possible the inclusion of streaming audio and video along with Web pages, with systems such as RealAudio and QuickTime leading the way. The growing market in such systems fostered a desire to devise a standard control mechanism for streaming content. The result was the Real-Time Streaming Protocol (RTSP), ¹⁴ providing initiation and VCR-like control of streaming presentations; RTSP was standardized in 1998. RTSP builds on existing standards: It closely resembles HTTP in operation, and it can use SDP for session description and RTP for media transport.
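The resemblance between RTSP and HTTP is easiest to see in the text of a request. The sketch below formats an RTSP/1.0 request: a request line, a CSeq sequence-number header, any further headers, and a terminating blank line. The function name `build_rtsp_request` and the URL are assumptions made for illustration; nothing is sent over the network, and a real client would also need to parse responses and track session state.

```python
def build_rtsp_request(method, url, cseq, headers=None):
    """Format an RTSP/1.0 request as text (illustrative sketch only).

    Like HTTP, the request is a request line followed by header lines,
    each ended by CRLF, with an empty line marking the end of the headers.
    """
    lines = [f"{method} {url} RTSP/1.0", f"CSeq: {cseq}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    return "\r\n".join(lines) + "\r\n\r\n"


# Hypothetical URL: ask the server to describe a presentation using SDP.
request = build_rtsp_request(
    "DESCRIBE",
    "rtsp://example.com/demo",
    1,
    {"Accept": "application/sdp"},
)
```

A server answering such a DESCRIBE request could return an SDP session description, after which the client would issue SETUP and PLAY requests and receive the media itself over RTP.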