Accessible Video Content Requires Closed Captioning


Maximum Accessibility: Making Your Web Site More Usable for Everyone
By John M. Slatin, Sharron Rush
Chapter 13.  Enhancing Accessibility through Multimedia

Video media elements typically rely on the user's ability to see and hear content in order to fully experience it. Users who do not hear or have turned off their speakers rely on a text caption track of the dialogue and other sounds that accompany the video. Besides making the soundtrack accessible to some 20 million Americans who are deaf or hard of hearing, these closed captions are useful for people with cognitive disabilities; captions are also helpful to individuals who are learning the language spoken in the video. WCAG 1.0 Checkpoint 1.4 and Section 508 paragraph (b) specify that the captions must be synchronized with the delivery of the video and audio presentation.

You've probably seen closed captioning many times, for example on televisions located in crowded or noisy public spaces such as hotel and airport lobbies, bars, restaurants, and so forth. All U.S. televisions manufactured since 1992 can display closed captions, so if you've never really thought about how they work, we recommend that you try this experiment at home. Turn on the closed captions on your television set. Turn off the sound. Watch the news or several of your favorite shows using only the captions. Spend enough time doing this so that the novelty and initial frustration wear off, and note how different types of elements and events are represented in the captions: dialog, music, background noises, and other sound effects.

Guidelines for Closed Captioning

Closed captioning has been around since the 1970s, and as a result there are excellent resources you can consult for information about how to handle the many different possibilities that might come up in the video material you'll need to caption. Closed captioning was pioneered by the Caption Center at WGBH, the Boston, Massachusetts PBS affiliate that has also pioneered descriptive video. The Caption Center publishes an excellent guide called "How It Works," which offers some useful tips on captioning style. Some of the details pertain specifically to television, and you'll need to adapt the tips for presentation on the Web or other media. The Caption Center guidelines address a wide variety of topics. We mention just a few here and refer you to the guidelines for additional information. [3]

[3] All material in this section is based on information presented in "How It Works," a guide to closed captioning produced by WGBH in Boston, MA. Accessed January 21, 2002. Quotations used with permission.

In the early days of captioning, captions often represented heavily edited versions of what was actually said in the dialog. According to the Caption Center:

The rationale for this was that this type of editing would make it easier for the deaf and hard-of-hearing audience to understand the program. Experience has shown, however, that much of the caption-viewing audience prefers to have a verbatim or near-verbatim rendering of the audio; therefore, any editing that occurs nowadays is usually for reading speed only. Strive for a reading speed that allows the viewer enough time to read the captions yet still keep an eye on the program video. Once you reach a decision on caption reading speed, use that speed consistently in your work.

There are occasions when limited reading time makes it necessary to edit: viewers couldn't possibly keep up with verbatim captions that change at the speed of rapid-fire dialog. When this happens, the Caption Center advises, "try to maintain precisely the original meaning and flavor of the language as well as the personality of the speaker." When captioning "classic" movies, literary works, or the speech of important personalities, it's best to avoid editing if at all possible.
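Once each caption has a start and an end time, reading speed can be checked mechanically. The Python sketch below computes the words-per-minute rate a caption demands and lists the captions that exceed a chosen limit. The function names and the default limit of 160 words per minute are illustrative assumptions, not figures from the Caption Center guidelines.

```python
def words_per_minute(text, start_s, end_s):
    """Return the reading rate (words per minute) that a caption demands."""
    duration = end_s - start_s
    if duration <= 0:
        raise ValueError("a caption must end after it starts")
    return len(text.split()) * 60.0 / duration


def too_fast(cues, max_wpm=160.0):
    """Return the cues whose reading rate exceeds max_wpm.

    cues is an iterable of (start_s, end_s, text) tuples.
    max_wpm is a hypothetical default; choose your own house limit.
    """
    return [cue for cue in cues
            if words_per_minute(cue[2], cue[0], cue[1]) > max_wpm]
```

Captions returned by too_fast are candidates for light editing or for a longer display time; the decision about which to apply remains an editorial one.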

It's sometimes possible to place captions on the screen to indicate who is speaking at a given point (this is one reason why, as we discuss below, captioning formats for the Web include features that let you control the visual presentation of the captions). This technique doesn't always work, though, so it may be necessary to include the speaker's name or other identifier in square brackets. The experts at the Caption Center advise:

When considering the placement of captions, keep in mind what action is occurring in a given scene. If only one person talks throughout a scene, captions are generally placed at bottom center. If there are multiple characters in a scene, caption placement on or near individual speakers is used to indicate who is saying what.

As we've noted, both WCAG 1.0 and Section 508 require that captions be synchronized with the soundtrack. SMIL and other media languages control the timing of caption changes to synchronize them to the video elements. The Caption Center's guidelines help us understand why this is so important:

To convey pacing appropriate to humor, suspense and drama, as well as to indicate who is speaking, captions may be timed to appear and disappear precisely when the words are spoken. The text may be timed to change with shot changes for readability and aesthetic purposes. In applying timing conventions, consider that logical caption division should not be sacrificed for exactitude in timing. Readability should always be the first priority.

The guidelines also cover sound effects. It may not be possible (or desirable) to provide a caption for every sound, no matter how insignificant it may be; but all sounds that contribute to understanding should be represented in the captions. Since these are not actually spoken, references to sound effects should be distinguished typographically from the dialog. The Caption Center uses italics in parentheses to identify such effects, including such things as indicating when a speaker is whispering, shouting, and so on.

To indicate that music is playing or that someone is singing, place a graphic of a musical note at the beginning and end of the caption. Captions for songs, advertising jingles, and so on should be verbatim. Include song titles when possible; if they're not part of the dialogue, make sure they're typographically different than the lyrics or spoken dialog.

For numbers, the Caption Center suggests following widely used conventions. For example, spell out numbers from one to ten, and use numerals for larger numbers (21, for instance, or 252,971). This convention holds unless a large number is the only text in the caption; then the number should be spelled out (for example, two million two hundred fifty thousand).

Spelling and punctuation should be consistent and again should follow widely used conventions. The Caption Center recommends equipping yourself with good reference materials such as The Chicago Manual of Style as well as a dictionary, a thesaurus, and an encyclopedia.

When Captioning Is Not Enough: Providing Signed Interpretation

It's important to realize, however, that captioning is not a universal solvent for making audio material accessible to people who cannot hear it. Captions work well for "late-deafened" individuals such as the many senior citizens who suffer hearing loss as they age. Such individuals have spent most of their lives as hearing people, and for them, spoken and written language are the most natural means of communication. But that's not true for everyone. Linguist Leland McCleary of the University of São Paulo explains:

Only a very few of those individuals who are born deaf or who become deaf at an early age achieve fluency in the spoken language of their community, and only then with great effort. For the deaf, the only route to full language mastery is through a sign language; but access to sign language is not always guaranteed, since the majority of those born deaf are born into hearing families. [McCleary 2001]

For millions of deaf persons in the United States, the native language is American Sign Language (ASL); millions of other deaf individuals around the world use the signed languages of their native countries. Sign is the first language for these individuals, just as English is the first language for most people in the United States. This has important consequences. As the European author of "Guidelines for Signed Books" wrote, "Deaf people have their own language and their own culture. It is difficult for a French producer to make a typically English production. It is as difficult for a hearing producer to make a Deaf production" [Pyfers 1999]. Many people for whom Sign is the first language and the natural medium of expression find written language difficult, especially when it appears and disappears quickly, as synchronized captions do. This is not a trivial issue: to quote the "Guidelines" again, "Deaf adults need video for information on Deaf issues and Deaf culture, but also for easy access to information, education, politics, culture, etc. of the larger, hearing community" [Pyfers 1999]. To ensure maximum accessibility for the deaf audience, you may create a Sign version in addition to the caption track. Applications that automatically convert text or speech to Sign are starting to come onto the market, but there is no substitute for experienced and fluent signers and interpreters. Signed interpretation services are also available through commercial providers.

Captioning on the Web

On the Web, a typical video element consists of a video file, which usually contains the video imagery and the accompanying audio. The dominant browser-based video players (RealNetworks' RealPlayer, now RealOne; Apple's QuickTime player; and Microsoft's Windows Media Player) allow for the inclusion of other tracks, such as text captions, additional audio narration, and graphics. Each of these technologies uses a proprietary video and text file format and synchronization method.

In addition to these proprietary methods of synchronization, most of these players include some level of support for the Synchronized Multimedia Integration Language (SMIL). Developed by the W3C, SMIL enables Web developers to divide multimedia content into separate files and streams (audio, video, text, and images) and send them to a user's computer individually, greatly reducing the size of the multimedia file. Separate files and streams are synchronized to display together as if they were a single multimedia stream. The ability to separate out the static text and images and the resulting reduction in file size minimize the time it takes to transfer the files over the Internet. SMIL is based on the Extensible Markup Language (XML). Rather than defining the actual formats used to represent multimedia data, it defines the commands that specify whether the various multimedia components should be played together or in sequence. SMIL version 2.0, published in February 2001, is similar in simplicity to HTML and can be written using a simple text editor. SMIL forms the basis for many of the captioning solutions we examine in this chapter and includes the ability for the user to turn captions on or off.

The primary benefit to using SMIL for synchronization is that it allows the author to program switch conditions to provide captions, images, or alternative language formats. A player that fully supports SMIL displays tracks according to how the user configures the player settings. If QuickTime 5 worked in this way, there would have been no need for the ATSTAR team to supply separate versions of the same media or for the user to make a choice each time video material is presented, as in Figure 13-2. In addition to providing a way to synchronize media elements and provide conditions for their display, SMIL also allows the author to configure the space in which these elements are displayed.
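To make the switch mechanism and the layout control concrete, here is a minimal Python sketch that writes a small SMIL 2.0 presentation. It plays a video and a caption stream in parallel and wraps the captions in a switch whose child carries the SMIL 2.0 test attribute systemCaptions="on" (SMIL 1.0 spelled it system-captions). The function name, the region names and sizes, and the file names in the usage note are hypothetical, and players of the period differed in which SMIL version they accepted.

```python
import xml.etree.ElementTree as ET

SMIL20_NS = "http://www.w3.org/2001/SMIL20/Language"


def build_smil(video_src, captions_src):
    """Return SMIL 2.0 markup that shows captions only when the user asks for them."""
    smil = ET.Element("smil", xmlns=SMIL20_NS)

    # Layout: one region for the video and one below it for the captions.
    # All dimensions here are illustrative assumptions.
    head = ET.SubElement(smil, "head")
    layout = ET.SubElement(head, "layout")
    ET.SubElement(layout, "root-layout", width="320", height="300")
    ET.SubElement(layout, "region", id="video_region",
                  top="0", left="0", width="320", height="240")
    ET.SubElement(layout, "region", id="caption_region",
                  top="240", left="0", width="320", height="60")

    # <par> plays its children at the same time.
    body = ET.SubElement(smil, "body")
    par = ET.SubElement(body, "par")
    ET.SubElement(par, "video", src=video_src, region="video_region")

    # <switch> renders its first acceptable child; with captions turned off
    # in the player preferences, no child qualifies and nothing is shown.
    switch = ET.SubElement(par, "switch")
    ET.SubElement(switch, "textstream", src=captions_src,
                  region="caption_region", systemCaptions="on")

    return ET.tostring(smil, encoding="unicode")
```

Calling build_smil("lesson.rm", "lesson.rt") returns the markup as a string that can be saved with a .smil extension. A single file of this kind serves both audiences, which is the advantage over supplying separate captioned and uncaptioned versions.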

In an ideal world, all of the third-party plug-in developers would fully implement the W3C recommendations that affect their products. However, this rarely happens, and it takes some time when it does. The W3C has recently introduced some changes in the process for developing guidelines: a draft cannot be published as a formal Recommendation until at least two implementations of each checkpoint have been located. These changes are designed to reduce the lag time between publication of W3C recommendations and the availability of applications that support them. Even so, the Web author who is concerned about accessibility should understand the differences between the product choices available and what measures must be taken to deliver an accessible presentation. Although SMIL may still be considered an emerging method until the user agents fully accommodate it, we stress its importance here because in the long run, a growing compliance with a standards-based mechanism for delivering multimedia will benefit everyone involved in the process. The code mechanism for presenting multimedia would be consistent across browsers, players, and operating systems, thereby easing the burden on Web authors. The user experience would be more consistent and controllable, and upgrades to tools and methods would be based on a wider range of user and author experiences.

Before we describe the major differences between the popular media players available today, let's explore a little further the caption component of the multimedia presentation. As we mentioned above, the caption element of a video or audio segment is presented to the user as a text equivalent of the dialog synchronized with the multimedia event. To produce this caption track, we must first transcribe the media event into a script, identify the speakers for each segment, and place a timecode on each segment to synchronize it with the media. Transcribing the media is a manual process and can be quite demanding if the soundtrack is complex; fortunately, you can use professional captioning services if you decide to outsource this part of the work. After you have the transcription of the media, you can generate the caption file with a text editor by adding the timecodes and other necessary formatting codes in the format appropriate to the media player you are using. Each of the media players has a proprietary markup for their caption files, though all of them start with ASCII text (as does HTML, of course). The most difficult task in converting transcript files to caption files is the timecode stamping for synchronization. Fortunately there are applications available to assist with this process. In the next section, we introduce a free tool that supports all three of the major media players.
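The timecode arithmetic itself is simple, as the following Python sketch suggests. The caption samples in this chapter write times as hh:mm:ss.xx (hundredths of a second) for RealText and QuickText, while SAMI counts milliseconds. The helper names below are hypothetical; times are held as whole hundredths of a second to avoid rounding surprises.

```python
def to_timecode(hundredths):
    """Format a time given in hundredths of a second as hh:mm:ss.xx."""
    if hundredths < 0:
        raise ValueError("time cannot be negative")
    seconds, fraction = divmod(hundredths, 100)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{fraction:02d}"


def parse_timecode(timecode):
    """Convert an hh:mm:ss.xx timecode to hundredths of a second."""
    clock, fraction = timecode.split(".")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 100 + int(fraction)


def to_milliseconds(timecode):
    """Convert an hh:mm:ss.xx timecode to the millisecond count SAMI uses."""
    return parse_timecode(timecode) * 10
```

For example, the timecode 00:00:06.07 corresponds to 6070 milliseconds, and 00:05:29.08 corresponds to 329080 milliseconds.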

MAGpie: The Media Access Generator from NCAM

MAGpie is the acronym for the Media Access Generator developed by the National Center for Accessible Media (NCAM) at WGBH, the PBS affiliate in Boston, Massachusetts, that has pioneered many important accessibility advances for broadcast and Internet media, including closed captioning and descriptive video. You can download MAGpie for free from the NCAM site.

You can use MAGpie 1.0 to add captions to three multimedia formats: Apple's QuickTime, the W3C's SMIL (via RealOne's RealText), and Microsoft's Synchronized Accessible Media Interchange (SAMI) format for Windows Media Player. MAGpie 1.0 can also integrate audio descriptions into SMIL presentations. A beta version of MAGpie was released in fall 2001. The final release will include these additional features:

  • Java-based application for Windows and Mac OS X.

  • Improved editing behavior.

  • XML output.

  • Output support for Web-embedded media (RealText, SAMI, SMIL, and QT text).

  • Full audio description support, including audio recording and playback.

  • Karaoke-style highlighting.

  • Improved support for media types: all types supported by Windows Media Player, QuickTime, and RealOne.

With MAGpie, you can easily add caption lines, speaker identification, and time stamps to accompany a video or audio segment. MAGpie 1.0 uses Windows Media Player to display video and audio tracks of resources to be captioned.

When you create a project and identify the media element you wish to caption, MAGpie creates a new caption stream and opens the media resource in Windows Media Player. MAGpie controls Windows Media Player and includes functionality to play, pause, stop, and jump forward or backward. The most important feature, however, is the ability to "grab" the timecode with the press of a function key (F9). Then, as the media segment plays, MAGpie captures and inserts the appropriate segment of caption text. MAGpie allows you to edit the caption text and the timecodes, delete and add events, and split or combine events. When you're finished, you can export the project in SAMI format, QuickText format, or SMIL, which will produce a RealText (.rt) file and SMIL code to synchronize it with.

Figure 13-3 shows a screen shot of our example ATSTAR video being edited with MAGpie 1.0 and Windows Media Player. This illustration shows the MAGpie application on the left and the video playing in Windows Media Player on the right. The MAGpie window is in table format with header information at the top and several rows of caption entries indicating the timecode, the speaker, and the caption text. The initial row has only a timecode of 00:00:00 and no text to initiate the file.

Figure 13-3. Screen shot of MAGpie editing session with Windows Media Player. Used with permission.


MAGpie is an indispensable tool for timecoding complex caption data.

Comparing the Popular Media Players

RealPlayer


At the time of this writing, RealOne, the newest player from RealNetworks, offers the most support for SMIL among the major video players. RealOne is available only for Windows at this point, so we will center our discussion on RealPlayer 8, which is available for both Windows and Macintosh. RealPlayer uses a caption format known as RealText to display and synchronize text captions. RealText markup contains the following elements and attributes.

  • <window bgcolor="#RRGGBB" wordwrap="true | false" duration=""> sets the window characteristics for displaying the captions.

    - bgcolor sets the background color of the caption window. RRGGBB represents the red, green, and blue color reference in hex format.

    - wordwrap determines whether text should be wrapped to window size if a line of text is too wide for the window.

    - duration sets the length of time the text presentation will last in hours, minutes, and seconds.

  • <font size="size" face="font-family" color="#RRGGBB"> sets font characteristics.

    - size values can be -2, -1, +0, +1, +2, +3, or +4.

    - face sets the font face used to render the captions. Values are font families like Arial and Helvetica (the default is Times New Roman).

    - color sets the font color. RRGGBB represents the red, green, and blue color reference in hex format.

  • <time begin=""/> sets the start time for a caption in hours, minutes, and seconds.

  • <center> centers a caption in a presentation window.

  • <br> creates a line break.

Using these elements, the author can control the background and font color, font size and face, and time stamp for each caption line; she or he can also center the caption in the space. We suggest following the guidelines published by the Caption Center at WGBH when deciding what to include in your captions and how to format them. (See the Guidelines for Closed Captioning section earlier in this chapter for selections from the Caption Center guidelines.)

Below is an example of RealText markup and the description of the code function from one of the ATSTAR videos.

<window bgcolor="000000" wordwrap="true"
     duration="00:05:29.08">
<font size="1" face="Arial" color="#FFFFFF">
<center>
<time begin="00:00:00.00"/>
<clear/>
<time begin="00:00:01.20"/><clear/>
</center>GE History Teacher<br>
<center>Well, Mrs. Allen thank you for coming in today, I
     know you've been busy, it's been a tough week for you,
     ... you made time for us.<br>
<time begin="00:00:06.07"/><clear/>
</center>GE History Teacher<br>
<center>I think we can make some progress on the questions
     we've been working on, so...<br>
. . .
<time begin="00:05:29.08"/><clear/>&nbsp;
</center>
</font>
</window>
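Markup of this kind is repetitive enough to generate from a list of timed captions. The Python sketch below is one way to do so, using only the elements described in this section plus the <clear/> element that appears in the ATSTAR sample. The function name and its defaults are hypothetical, the nesting is simplified relative to the sample, and escaping &, <, and > as character entities is an assumption about what the player expects.

```python
from html import escape


def realtext(cues, duration, bgcolor="#000000", color="#FFFFFF",
             face="Arial", size="+0"):
    """Return RealText markup for (begin, speaker, text) caption cues.

    begin and duration are hh:mm:ss.xx strings; speaker may be empty.
    """
    lines = [
        f'<window bgcolor="{bgcolor}" wordwrap="true" duration="{duration}">',
        f'<font size="{size}" face="{face}" color="{color}">',
    ]
    for begin, speaker, text in cues:
        # Each cue clears the window at its start time, then writes new text.
        lines.append(f'<time begin="{begin}"/><clear/>')
        if speaker:
            lines.append(f"{escape(speaker, quote=False)}<br>")
        lines.append(f"<center>{escape(text, quote=False)}</center>")
    # Blank the window when the presentation ends.
    lines.append(f'<time begin="{duration}"/><clear/>')
    lines += ["</font>", "</window>"]
    return "\n".join(lines) + "\n"
```

The returned string can be saved with an .rt extension and referenced from a SMIL file.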

RealPlayer's user preferences include Accessibility Settings that allow users to elect to show captions. RealPlayer uses the SMIL <switch> element to determine whether or not to display the RealText file, depending on how the user has set this preference. Figure 13-4 shows the RealPlayer Accessibility Settings dialog box with the Show Captions radio button selected.

Figure 13-4. Screen shot of the RealPlayer Accessibility Settings dialog box. Used with permission.

QuickTime



The QuickTime player allows caption tracks as well as other media tracks to be synchronized with the video. QuickTime 5 does support SMIL but does not yet support the SMIL switch to activate the captioning according to the user settings. Therefore, to give the user a choice of captioned or noncaptioned video using QuickTime, you must create two separate movies to be played and provide links to each for the user, as in the ATSTAR example above.

To create captioned video using QuickTime, you must create a caption file using QuickText with time stamps that synchronize with the video dialog or action. QuickTime Pro allows you to then "compile" the QuickTime movie including a video file (.mov), a QuickText caption file, and graphic files for background images, together with a synchronization file. When the user activates the video, it is the synchronization file that runs. (If you choose, you can use MAGpie to add the new tracks and create the synchronization file.)

QuickText markup contains the following elements.

  • {QTtext} starts the QuickText document.

  • {font: font-family} sets the current font family.

  • {justify: left, center or right} sets the alignment of text.

  • {backcolor: red, green, blue} specifies the background color of the text area based on red, green, and blue values (0-255 for each color).

  • {timescale: number} sets the time scale used for determining synchronization cues.

  • {width: pixels} specifies the width of the window used by the QuickTime player to render the text.

  • {height: pixels} specifies the height of the window used by the QuickTime player to render the text.

  • [hh:mm:ss.xx] sets the synchronization time for rendering the next QuickText text element, as in [00:00:01.20].

You can also control font family, text justification, and background color for the caption space. You can adjust the timescale as well. A time stamp synchronizes the rendering of the caption.

Important note: You must create a space for the captioning to be rendered by using a graphic spacer that is played throughout the duration of a video rendered in QuickTime; otherwise, the caption text will be overlaid on the video itself and may become illegible.

Figure 13-5 shows a video frame from the ATSTAR curriculum with captioning as rendered by the QuickTime player. The ATSTAR video frame pictures a large green puzzle piece, a video frame in the top center, and caption text below the video frame all within the green graphic. The QuickTime control slider rests on the bottom edge. Each of these components is a separate element of the QuickTime media group. The media group also includes a green graphic the size of the caption block (not distinguishable from the background image) and a synchronization file. When ATSTAR users choose to view video without captions, another synchronization file activates a visually similar frame that references this same movie and background images but not the caption file. Thus, offering viewers a choice between video with and without captions does not require duplicating the actual video files themselves.

Figure 13-5. ATSTAR video frame showing captioning rendered by QuickTime. Used with permission.


A portion of the QuickText caption file for this ATSTAR video appears below.

{QTtext}{font:Arial}{justify:center}{size:12}
{backcolor:0,0,0}
{timescale:100}{width:400}{height:0}
[00:00:00.00]
[00:00:01.20]
{justify:left}GE History Teacher
{justify:center}Well, Mrs. Allen thank you for coming
   in today, I know you've been busy, it's been a tough
   week for you, ... you made time for us.
[00:00:06.07]
{justify:left}GE History Teacher
{justify:center}I think we can make some progress on uh,
   the questions we've been working on, so...
   . . .
[00:05:29.07]
[00:05:29.08]
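A file in this format can also be produced from a list of timed captions. The Python sketch below mirrors the header of the ATSTAR sample; the function name and its defaults are hypothetical, and it writes one closing time stamp where the sample writes two.

```python
def quicktext(cues, end, font="Arial", size=12, width=400):
    """Return QuickText markup for (begin, speaker, text) caption cues.

    begin and end are hh:mm:ss.xx strings, matching {timescale:100}.
    """
    header = (
        f"{{QTtext}}{{font:{font}}}{{justify:center}}{{size:{size}}}\n"
        "{backcolor:0,0,0}\n"
        f"{{timescale:100}}{{width:{width}}}{{height:0}}"
    )
    # The opening time stamp starts the text track with an empty caption.
    lines = [header, "[00:00:00.00]"]
    for begin, speaker, text in cues:
        lines.append(f"[{begin}]")
        lines.append(f"{{justify:left}}{speaker}")
        lines.append(f"{{justify:center}}{text}")
    # The closing time stamp ends the last caption.
    lines.append(f"[{end}]")
    return "\n".join(lines) + "\n"
```

The result still has to be imported into QuickTime Pro and positioned in the video frame, as the steps that follow describe.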

There are many steps to creating this group of elements in QuickTime, including the following.

  • Creating the QuickText file.

  • Importing and configuring this file in QuickTime Pro.

  • Creating a text track.

  • Adding the text track to the video track.

  • Positioning the text track in the video frame.

  • Saving the result as a separate file, with the appropriate suffix.

QuickTime does allow the inclusion of various media elements, but the methods are somewhat cumbersome.

Windows Media Player

Windows Media Player does not support SMIL at this time, and the Microsoft SAMI caption format can be used only for captions and not for the other media elements that QuickTime and RealOne allow. That said, some people find Windows Media Player easier to use. Windows Media Player does allow users to choose whether to view captions, so you need to create only one version of the video module. However, you must create different versions for disk-based and Web-based presentations. For disk-based resources, you create a caption file with the same name as the video file (.asf) but with a .smi extension. The caption files play automatically if the user preferences are set to display them. For Web-based presentations, you must bind the caption file to the media file with a .asx file; Windows Media Player will then run the .asx file. The captions are displayed in a separate, fixed window, so you do not have to worry about background color or interference with the video.

SAMI markup is styled after HTML, and you can use Cascading Style Sheets for text styling. Below is a list of the elements of SAMI markup.

  • <sami> indicates the file is a SAMI-based caption file.

  • <head> defines the head block, which contains title and styling information.

  • <title> is used for informational purposes; it is optional.

  • <style> defines styles for caption elements; uses CSS conventions.

  • <body> contains the synchronization cues.

  • <sync start=milliseconds> sets the time synchronization cue for a text element, in milliseconds.

  • <p class=style ref> text specifies the text element(s) for the current synchronization element.

The code below presents an example of a SAMI file for the same ATSTAR video we discussed previously.

<sami>
<head>
    <copyright="">
    <title></title>
    <style type="text/css">
    <!--
        p {
            font-size:12pt;
            font-family: Arial;
            font-weight: normal;
            color: #FFFFFF;
            background-color: #000000;
            text-align: center;
        }
        .ENUSCC { Name: English;
            lang: EN-US-CC; }
        #Source {
            font-size:12pt;
            font-family: Arial;
            font-weight: normal;
            color: #FFFFFF;
            background-color: #000000;
            text-align: left;
            margin-bottom: -12pt;
        }
    -->
    </style>
</head>
<body>
<sync start=0>
    <p class=ENUSCC ID=Source>&nbsp;</p>
    <p class=ENUSCC></p>
</sync>
<sync start=1200>
    <p class=ENUSCC id=Source>GE History Teacher</p>
    <p class=ENUSCC><table align=center>
    Well, Mrs. Allen thank you for coming in today, I know
       you've been busy, it's been a tough week for you,
       ... you made time for us.
    </table></p>
</sync>
<sync start=6070>
    <p class=ENUSCC id=Source>GE History Teacher</p>
    <p class=ENUSCC><table align=center>
    I think we can make some progress on uh, the questions
       we've been working on, so...
    </table></p>
. . .
<sync start=329080>
    <p class=ENUSCC>&nbsp;</p>
</sync>
</body>
</sami>
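The same caption data can be written out as SAMI. The Python sketch below follows the structure of the ATSTAR sample, with start times in milliseconds and the ENUSCC class and Source ID taken from that sample. The function name is hypothetical, the style block is shortened, and the table wrapper around the caption text is omitted for brevity.

```python
from html import escape

# Head block modeled on the ATSTAR sample; braces are doubled for str.format.
SAMI_HEAD = """<sami>
<head>
<title>{title}</title>
<style type="text/css">
<!--
p {{ font-size: 12pt; font-family: Arial; color: #FFFFFF;
    background-color: #000000; text-align: center; }}
.ENUSCC {{ Name: English; lang: EN-US-CC; }}
#Source {{ text-align: left; margin-bottom: -12pt; }}
-->
</style>
</head>
<body>"""


def sami(cues, end_ms, title=""):
    """Return SAMI markup for (start_ms, speaker, text) caption cues."""
    parts = [SAMI_HEAD.format(title=escape(title, quote=False))]
    for start_ms, speaker, text in cues:
        parts += [
            f"<sync start={int(start_ms)}>",
            f"    <p class=ENUSCC id=Source>{escape(speaker, quote=False)}</p>",
            f"    <p class=ENUSCC>{escape(text, quote=False)}</p>",
            "</sync>",
        ]
    # A final cue containing only a non-breaking space clears the captions.
    parts += [
        f"<sync start={int(end_ms)}>",
        "    <p class=ENUSCC>&nbsp;</p>",
        "</sync>",
        "</body>",
        "</sami>",
    ]
    return "\n".join(parts) + "\n"
```

Saving the result with a .smi extension and the same base name as the video file matches the disk-based arrangement described above.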

Windows Media metafile extensions are used to identify the format of the file that a metafile references. Windows Media metafiles with .wax, .wvx, or .asx extensions reference files with .wma (Windows Media Audio), .wmv (Windows Media Video), and .asf (Windows Media file) extensions, respectively. All metafiles, regardless of the file name extension used, have the <ASX> element tag at the beginning of the file with the version attribute specified. The ASX code below gives an example of how the video file and caption file are bound together for Web-based presentation. The code is similar to XML, and at this writing it is better supported by Microsoft Internet Explorer than by Netscape Navigator or other user agents.

<ASX version = "3.0">
<title>SAMI Captioning Demo</title>
<entry> <title>Student Scenario</title>
<author>ATSTAR</author> <copyright>2001</copyright>
<ref href = "at_vid_414.asf" />
</entry>
<entry>
<ref href="at_vid_414.smi" />
</entry>
</ASX>
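When many video modules need metafiles, the ASX wrapper can be generated as well. The Python sketch below reproduces the element structure of the sample above, with one entry for the media file and one for the caption file. The function name and parameter names are hypothetical, and the two-entry layout simply mirrors the sample rather than asserting the only way a player binds captions.

```python
import xml.etree.ElementTree as ET


def asx(title, entry_title, author, copyright_text, media_href, captions_href):
    """Return ASX markup that pairs a media file with its SAMI caption file."""
    root = ET.Element("ASX", version="3.0")
    ET.SubElement(root, "title").text = title

    # First entry: the video and its descriptive metadata.
    media = ET.SubElement(root, "entry")
    ET.SubElement(media, "title").text = entry_title
    ET.SubElement(media, "author").text = author
    ET.SubElement(media, "copyright").text = copyright_text
    ET.SubElement(media, "ref", href=media_href)

    # Second entry: the caption file, as in the sample above.
    captions = ET.SubElement(root, "entry")
    ET.SubElement(captions, "ref", href=captions_href)

    return ET.tostring(root, encoding="unicode")
```

Calling asx("SAMI Captioning Demo", "Student Scenario", "ATSTAR", "2001", "at_vid_414.asf", "at_vid_414.smi") yields markup equivalent to the sample, serialized on a single line.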

More about SMIL Attributes

There are excellent tutorials that will help you learn more about how to use SMIL to effectively and accessibly include multimedia elements in your work. Like HTML, SMIL includes useful attributes that will support the accessibility of your pages. We list a few of them below; you'll notice some familiar terms, along with some we haven't encountered before.

  • alt: When used as an attribute of a media object, alt specifies a brief text message about the function of that object. Media players may render alternative text in place of or in addition to media content, for instance when images or sound are turned off or not supported.

  • longdesc: As in HTML, the longdesc attribute links to a more complete description of media content. Authors should provide long descriptions of complex content, such as charts and graphs or works of art. The longdesc attribute is also useful to designate a text transcript of audio and video information.

  • title: This one can be used as an attribute of most SMIL elements to provide advisory information about the nature of the element. The SMIL specification explains how to use the title attribute for a given element type. For example, for links, use it to describe the target of the link.

  • author: Use this attribute to specify text metadata about document elements. Metadata generally promote accessibility by providing more context and orientation.

  • abstract: Optional metadata about document elements. The abstract attribute, like the author attribute, increases context and orientation information for the user.

We encourage you to use these attributes to provide more equivalent user experiences in your multimedia presentations. Text metadata provide a number of access options since they may be rendered in a variety of ways: on the screen, as speech, or on a refreshable Braille display.

The overview of captioning considerations and options presented here is only an introduction. Multimedia applications are changing rapidly as the trend toward greater accessibility increases. We encourage you to stay abreast of changes and provide the highest degree of user choice in your multimedia presentations. Among other emerging practices is the field of audio description.


    ISBN: 0201774224
    Year: 2002
    Pages: 128
