Audio for Video | Apple Pro Training Series. Optimizing Your Final Cut Pro System. A Technical Guide to Real-World Post-Production

Ah, audio... always second fiddle when video is being discussed, yet sound is arguably more important than picture in the overall impact of a program. Fortunately, audio is fairly straightforward when compared to video's intricacies, at least from a technical standpoint, but there are a couple of details worth noting to avoid common pitfalls.

Single- and Double-System Recording

Sound recorded on videotape (or on film, for film cameras) is called single-system sound. It's convenient, since it's married to the image in perfect sync, and can be captured in sync at the same time as the picture.

Double-system sound is recorded separately, on DAT, minidisc, quarter-inch tape, or portable disk recorders. With double-system sound, a sync mark of some sort is needed to align sound to picture: a clapstick or slate is normally used, but anything that produces simultaneous visual and audible cues can be used. Sometimes a timecode slate is used; the camera shoots the image of the timecode being used on the sound recorder, and the timecodes are matched up in post.

Audio on Videotape

Audio is stored on videotape in a number of ways. Linear analog tracks are used in most analog videotape formats; the audio is laid down in its own track using a stationary head, just like reel-to-reel or cassette audio. Linear tracks can be overdubbed or recorded separately from the video track. The frequency response of a linear track is largely dependent on the "pull speed" of the tape format: how rapidly the tape moves past the head. By modern standards, linear audio tracks aren't especially high-fidelity.

AFM (Audio Frequency Modulation) tracks are analog audio tracks recorded by rotating heads in the same area of the tape as the video. Quality is usually very good, better than linear audio, but the audio cannot be recorded separately from the video.

PCM (Pulse Code Modulation) tracks are digital audio tracks recorded in their own segments of the video track by the video heads. PCM tracks are usually recordable separately from video and are of high quality.

Mono, Stereo, Multichannel

Audio is recorded to one or more tracks, each track containing a channel of sound. A single channel is mono, short for monophonic, but beyond that things get more complex.

Two tracks may contain two separate mono feeds, say, a shotgun mike and a lavaliere. The two channels are independent of each other, with their own separate levels and settings, and should be kept as separate channels within FCP: you would capture these unlinked (in FCP 5) or as "Ch 1 + Ch 2" (in earlier versions).

Alternatively, two tracks might form a stereo pair, where the two tracks contain the left and right channel sounds of a single, stereophonic feed. These two channels should be panned left and right, respectively, and their levels should be ganged together so they fade up and down in sync. If you capture them linked (FCP 5) or as "Stereo" (earlier versions), FCP will perform the panning and ganging automatically.

Tip

Although you can convert a clip's audio from "Stereo" to "Ch 1 + Ch 2" and vice versa after the fact, life will be much easier if you select the proper format before capturing dozens of clips.

Sometimes a stereo pair arrives as a mid+side recording. MS recording is the audio equivalent of the RGB-to-YUV transform: the mid track contains the sum of the left and right channels, and the side track contains their differences. Unfortunately FCP can't handle mid+side stereo internally; you'll have to use a separate program to reformat the channels as left+right, or use an external processor to convert mid+side to left+right as you capture it. Fortunately mid+side recording is less common than left+right recording.
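If you end up doing the conversion yourself, the arithmetic is simple. Here is a minimal sketch, assuming the convention described above (mid = left + right, side = left - right) and samples held as plain lists of floats. The function names are made up for illustration, and real mid+side encoders may scale the two tracks differently, so check levels by ear.

```python
def ms_to_lr(mid, side):
    """Convert mid/side samples to left/right.

    Assumes mid = left + right and side = left - right.
    Other scaling conventions exist; this is only one of them.
    """
    if len(mid) != len(side):
        raise ValueError("mid and side must have the same number of samples")
    left = [(m + s) / 2 for m, s in zip(mid, side)]
    right = [(m - s) / 2 for m, s in zip(mid, side)]
    return left, right


def lr_to_ms(left, right):
    """Inverse of ms_to_lr under the same assumed convention."""
    if len(left) != len(right):
        raise ValueError("left and right must have the same number of samples")
    mid = [l + r for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side


# Round trip on a few made-up sample values.
left_in = [0.5, -0.25, 0.0]
right_in = [0.25, 0.25, -0.5]
mid, side = lr_to_ms(left_in, right_in)
left_out, right_out = ms_to_lr(mid, side)
print(left_out, right_out)
```

Note that if the side track is all zeros, left and right come out identical: the recording collapses to mono.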

Tip

Mid+side played back as left+right will have one channel that sounds like a normal mono mix (the sum of left and right). The other channel may sound weak and indistinct, with some sounds much louder and others softer.

In a pinch, if you do not need the stereo image, you can discard the side track and just use the mid track as a mono mix.

Multichannel audio is common on many modern videotape formats as well as tape- or disk-based digital audio recorders. With a multichannel recording, you really need to have the sound recorder's notes to determine how each track is being used and how to capture it.

Although FCP can capture up to 24 channels with the appropriate hardware, FireWire DV captures are limited to two channels at a time. If you need to capture more than two channels, you can either capture two channels at a time, making multiple passes through the media, or you can use an appropriate multichannel audio interface to capture the audio into FCP, bypassing FireWire.

Sample Rates

Digital audio may be sampled at various rates; the higher the rate, the better the high-frequency response will be, but the higher the data rate and storage requirements of the audio track will be. Sampling theory states that the highest frequency you can faithfully record and reproduce is half the sampling frequency (the Nyquist limit). In practice, the high-frequency response for audio sampled at a given rate will be slightly less than half the sample rate.

Human hearing is usually said to range from 20 Hz to 20 kHz, but NTSC broadcast audio doesn't extend past 15 kHz, nor does the response of many older people's ears (or younger people's ears, if they've listened to a lot of heavy metal with the volume turned up to 11). At a minimum you'd want audio sampling to occur at 30 kHz for NTSC transmission, or 40 kHz to cover the range of hearing. On the other hand, many film sound designers now prefer to record and mix at 96 kHz or 192 kHz, saying that the overtones otherwise lost have a perceptible impact on the audible frequencies.
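The numbers above follow directly from the Nyquist relationship. A quick sketch, with illustrative function names, that treats the limit as exactly half the sample rate (real converters fall a little short of this, as noted earlier):

```python
def nyquist_limit_hz(sample_rate_hz):
    """Theoretical highest frequency a given sample rate can capture."""
    return sample_rate_hz / 2


def min_sample_rate_hz(highest_frequency_hz):
    """Theoretical minimum sample rate needed to capture a given frequency."""
    return 2 * highest_frequency_hz


# Minimum rates for NTSC broadcast audio (15 kHz) and full-range hearing (20 kHz).
for top_hz in (15_000, 20_000):
    print(f"{top_hz} Hz content needs at least {min_sample_rate_hz(top_hz)} Hz sampling")

# Theoretical limits for the high rates some sound designers prefer.
for rate_hz in (96_000, 192_000):
    print(f"{rate_hz} Hz sampling reaches up to {nyquist_limit_hz(rate_hz):.0f} Hz")
```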

You'll most commonly encounter sound recorded at 32 kHz, 44.1 kHz, and 48 kHz. 32 kHz is used by DV, DVCAM, and Digital8 camcorders in four-channel mode. (Some older DV camcorders can only record at 32 kHz.) 44.1 kHz is the sample rate used on CDs and other recorded media, and occasionally appears on DV tapes created on early DV NLEs. 48 kHz is the most common rate for professional audio and audio for video recording. Final Cut Pro happily handles these rates natively.

96 kHz and 192 kHz are available on some disk-based audio recorders. Capturing audio at this rate requires compatible third-party audio interfaces.

Rates below 32 kHz aren't normally seen in source media, but are often used for Internet or multimedia delivery. If you import such a clip into FCP, FCP can resample to a higher rate, but you won't have much in the way of high frequencies, and the sound is likely to be dull and "muddy." (Of course, if you're looking for precisely that sort of sound, constrained by limited-bandwidth transmission media, then go for it.)

Bit Depth

Bit depth in sound, as in video, affects the smoothness, signal-to-noise ratio, and dynamic range of the captured signal. The standard bit depth for professional audio is 16 bits per channel, although some high-end equipment now records 24 or even 48 bits per channel. FCP 5 supports both 16-bit and 24-bit audio.
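To put rough numbers on that, each added bit of linear PCM buys about 6 dB of theoretical dynamic range. The sketch below assumes an ideal linear quantizer and simply compares the largest representable value to one quantization step; real converters deliver somewhat less, and the function name is illustrative.

```python
import math


def dynamic_range_db(bit_depth):
    """Theoretical dynamic range of ideal linear PCM at a given bit depth."""
    return 20 * math.log10(2 ** bit_depth)


for bits in (16, 24):
    print(f"{bits}-bit: about {dynamic_range_db(bits):.1f} dB")
```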

DV, in four-channel mode, records 32 kHz 12-bit audio using level-adaptive compression: the signal-to-noise ratio degrades slightly when the level is very high (when you can't hear the difference anyway), but retains the fineness and discrimination of 16-bit audio for soft sounds. Once inside the Mac, the DV codec's audio section converts it to 16 bits, so the 12-bitness of DV's four-channel mode is mostly a curiosity, not an operational issue.

Lower bit depths are often found, again, in media intended for Internet or multimedia delivery. If you use these media in FCP, they'll exhibit a higher noise floor: soft sounds will tend to be lost in a sort of hiss or hash of noise.

File Formats

Most production audio, if saved to disk, will be stored in AIFF or WAV (wave) files. AIFFs typically originate on Macs, WAVs on PCs, but both can contain 48 kHz 16-bit uncompressed audio, and both import perfectly well into FCP. AIFFs and WAVs also support sample rates as low as 8 kHz and bit depths of 8 bits per channel, and AIFFs may contain compressed audio using a variety of codecs, but these are not commonly found in production applications.

Tip

It doesn't hurt to look at an imported clip's properties when you import it to make sure the bit depth and sample rate are sufficient to give you high-quality audio.

Film audio is sometimes recorded using other formats, such as BWF (Broadcast Wave Format: WAV with added timecode and metadata). You may need to convert these files to AIFFs or WAVs using a third-party program before bringing them into FCP.

Compressed Audio

Most production audio is recorded uncompressed (at 1.5 Megabits/second for two channels of 16-bit, 48 kHz sound), but there are a few exceptions.
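The 1.5 Megabits/second figure is just sample rate times bit depth times channel count. A small sketch of the arithmetic, with an illustrative function name and decimal megabytes (1 MB = 1,000,000 bytes) assumed for the storage estimate:

```python
def pcm_bit_rate(sample_rate_hz, bit_depth, channels):
    """Bits per second for uncompressed PCM audio."""
    return sample_rate_hz * bit_depth * channels


bits_per_second = pcm_bit_rate(48_000, 16, 2)
print(f"{bits_per_second / 1_000_000:.3f} Mbit/s")

# Storage needed for one minute of the same audio.
megabytes_per_minute = bits_per_second / 8 * 60 / 1_000_000
print(f"{megabytes_per_minute:.2f} MB per minute")
```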

DV's 32 kHz 12-bit adaptive compression was described earlier; for all intents and purposes it can be treated as 32 kHz 16-bit uncompressed.

Audio recorded on minidisc uses Sony's ATRAC compression. It's a "perceptual codec" designed to throw away certain frequencies masked by louder sounds, so that the resulting playback is perceptually indistinguishable from the original sound. Whether it really is indistinguishable or not is a controversy best left unaddressed; regardless, people are using minidisc for location audio recording. Minidisc audio is decompressed for playback and captured as 16-bit uncompressed, so its compression is not something to worry about inside FCP.

MPEG-1 Layer 2 audio is used in HDV cameras, compressing the audio 4:1 to 384 kbits/second. FCP 5 decompresses HDV audio during capture, and renders back to MPEG-1 Layer 2 when outputting to an HDV device.

MP3 (MPEG-1 Layer 3) audio is used in iTunes, the iPod, and a variety of Internet and computer-related applications. Sometimes you'll get audio tracks (sound effects, second language tracks, and so on) as MP3 files; you can use iTunes or other third-party audio applications to decompress them to AIFFs for use in FCP. The quality of MP3s depends on their bitrate; 128 to 160 kbits/second gives quite usable audio, and lower bitrates give flatter and muddier sound.
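To get a feel for how hard these codecs are squeezing the audio, compare the compressed bitrate to the uncompressed rate of the source. This sketch assumes the HDV source is 48 kHz 16-bit stereo, and, purely for illustration, that the MP3 was made from CD-quality audio (44.1 kHz 16-bit stereo); the function name is hypothetical.

```python
def compression_ratio(sample_rate_hz, bit_depth, channels, compressed_bits_per_second):
    """Uncompressed PCM bit rate divided by the compressed bit rate."""
    uncompressed = sample_rate_hz * bit_depth * channels
    return uncompressed / compressed_bits_per_second


# HDV audio: assumed 48 kHz, 16-bit, two channels, compressed to 384 kbits/second.
hdv_ratio = compression_ratio(48_000, 16, 2, 384_000)
print(f"HDV MPEG-1 Layer 2: {hdv_ratio:.1f}:1")

# MP3 at the bitrates mentioned above, assuming a CD-quality source.
for kbits in (128, 160):
    ratio = compression_ratio(44_100, 16, 2, kbits * 1000)
    print(f"MP3 at {kbits} kbits/second: {ratio:.1f}:1")
```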

Dolby E is a way to compress six channels of audio for studio transmission over a single AES/EBU two-channel connection. FCP can't handle Dolby E directly; you'll need a third-party product, typically a Dolby E decoder box, to convert the audio back into uncompressed form.

Dolby AC-3 is a low-bitrate format for delivering 5.1 audio (five channels plus a low-frequency subwoofer channel) on DVD or in digital television transmission. It's not intended for further production purposes; it's a delivery format only.

Locked and Unlocked DV Audio

FCP's capability to capture raw DV data across FireWire leads to an interesting audio synchronization issue.

DV cameras can record audio as "locked" or "unlocked." Locked audio, used in more professional equipment, uses an audio sample clock precisely tied to the video sample clock, so that audio and video sampling march in lockstep. Cameras recording unlocked audio, as the name implies, use an audio sample clock that is not precisely locked to the video clock. For example, audio that should be sampled at 48 kHz gets sampled slightly faster or slower: 48.009 kHz, or 47.998 kHz. The sample rate information stored with the DV data, however, still says 48 kHz; the DV specification allows only 32 kHz, 44.1 kHz, and 48 kHz timebases.

DV interleaves the audio and video on a frame basis, so when a tape is played back, the audio and video play in sync regardless of the audio clock used. And when iMovie imports a DV stream from tape, things likewise stay in sync.

When DV is stored in a QuickTime file, the audio is read out of each frame and resaved as a parallel track of AIFF-format audio. Each track of a QuickTime file has its own timebase determining the playback rate of the track.

With locked audio, the timebase stored in the DV data matches the sample rate of the audio, and the audio plays back at the proper rate, staying in sync with the picture. With unlocked audio, if the nominal and actual timebases differ, the audio will slowly drift out of sync with picture over the duration of the clip. For short captures, the drift is not noticeable, but captures over 4 or 5 minutes can show perceptible sync errors.
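You can estimate the size of the drift from the two rates. A minimal sketch, using the 48.009 kHz and 47.998 kHz examples from above and assuming NTSC's 30000/1001 frames per second for the conversion to frames; the function name is illustrative.

```python
def sync_drift_seconds(duration_s, nominal_rate_hz, actual_rate_hz):
    """How far audio ends up late (+) or early (-) relative to picture.

    The camera records duration_s * actual_rate_hz samples, but playback
    consumes them at nominal_rate_hz, so the audio runs long or short.
    """
    samples_recorded = duration_s * actual_rate_hz
    playback_duration_s = samples_recorded / nominal_rate_hz
    return playback_duration_s - duration_s


NTSC_FPS = 30000 / 1001  # assumed frame rate for the frame conversion

for actual_hz in (48_009, 47_998):
    drift = sync_drift_seconds(5 * 60, 48_000, actual_hz)
    print(f"{actual_hz} Hz over 5 minutes: {drift * 1000:+.2f} ms "
          f"({drift * NTSC_FPS:+.2f} frames)")
```

With the 48.009 kHz clock, a five-minute clip drifts by a bit more than one and a half frames, which is right around the point where sync errors become visible.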

Final Cut Pro HD compensates for this automatically. When a DV clip is captured over FireWire, FCP counts the audio samples as it goes, so it can record the actual timebase in the resulting QuickTime file. (With previous versions, you had to turn on "auto sync compensation," but now FCP has this feature always turned on.)
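The idea of deriving the true timebase from a sample count can be sketched in a few lines. This is an illustration of the principle only, not FCP's actual implementation; the function name is made up, and NTSC's 30000/1001 frame rate is assumed as the default.

```python
from fractions import Fraction

NTSC_FPS = Fraction(30000, 1001)  # assumed default frame rate


def measured_sample_rate(samples_counted, frames_captured, fps=NTSC_FPS):
    """Actual audio sample rate implied by a sample count over a frame count."""
    duration_s = Fraction(frames_captured) / fps
    return float(samples_counted / duration_s)


# 30,000 NTSC frames last exactly 1001 seconds. If 48,057,009 samples were
# counted over that span, the camera's clock was really running at 48.009 kHz.
print(measured_sample_rate(48_057_009, 30_000))
```

Storing the measured rate, rather than the nominal 48 kHz, as the audio track's timebase is what keeps the track the same length as the picture.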

Clips captured in FCP should always be in sync, as long as the clips are well-formed on tape. If a clip capture crosses a blank spot on the tape, or a scrambled frame, FCP can get confused and lose count, resulting in a bad audio timebase setting and resultant sync drift. If this happens, and you can't locate the bad frame on tape to avoid capturing across it, try recapturing, or break the clip into smaller clips and capture them individually.

By the same token, DV clips imported from other sources, such as programs saved to disk or tape with older versions of FCP or other NLEs, may show sync drift or slippage. In these cases, the sync drift may have been embedded in the program by the program writing the file; you may have to unlink audio and video and speed-change the audio to get it back in sync by trial and error.

More Info

The Final Cut Pro manual, also available in FCP by choosing Help > Final Cut Pro User Manual, is chock-full of really useful background information about audio. Have you looked at it recently?