Types of Digital Audio

There are two major methods for storing digital audio: amplitude based and frequency based. The amplitude-based method takes samples of the sound waveform at very small time intervals, converts the analog voltages into integers, and stores the resulting values sequentially. Usually the sample rate will be between 8,000 and 48,000 samples per second. The quality of the recording goes up with an increase in sample rate and down with a decrease; CD-quality sound has a standard rate of 44,100 samples per second. Sample rates higher than this add little audible quality, because 44,100 samples per second can already represent frequencies up to about 22 kHz, which is beyond the range of most human hearing.

The quality of the sound is also affected by the resolution of the analog-to-digital (A/D) conversion. If each sample is stored in fewer than 8 bits, the sound quality degrades rapidly. CD-quality sound is recorded with 16 bits of accuracy, and many sound cards advertise the ability to record incoming signals to 20 or more bits of accuracy. The number of bits used for each sample is called the bit depth.
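To make sampling and bit depth concrete, here is a minimal sketch that samples a pure tone and quantizes each sample to a signed integer. The helper name sample_sine and the 441 Hz test tone are my own choices for illustration, and the sine function merely stands in for the analog voltage that a real A/D converter would measure.

```python
import math

def sample_sine(freq_hz, sample_rate, n_samples, bit_depth=16):
    """Sample a pure sine tone and quantize each sample to a signed integer.

    Hypothetical helper for illustration; a real recording gets its
    values from an A/D converter rather than from math.sin.
    """
    max_amp = 2 ** (bit_depth - 1) - 1        # 32767 for 16 bits, 127 for 8 bits
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                   # time of the n-th sample, in seconds
        value = math.sin(2 * math.pi * freq_hz * t)   # "analog" value in [-1, 1]
        samples.append(round(value * max_amp))        # quantize to an integer
    return samples

# One full cycle of a 441 Hz tone takes exactly 100 samples at 44,100 Hz.
cd_quality = sample_sine(441.0, 44100, 100, bit_depth=16)
low_res = sample_sine(441.0, 44100, 100, bit_depth=8)

print(cd_quality[:5])
print(low_res[:5])
```

The 8-bit version has only 255 distinct levels to choose from, so neighboring samples are rounded much more coarsely than in the 16-bit version. That rounding error is what you hear as noise at low bit depths.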

The amplitude-based method is more commonly known as PCM, short for "pulse code modulation." This terminology comes from the early days of digital audio, when the results of the A/D conversion were re-encoded into trains of pulses for transmission over a serial line. These days, of course, those results simply go straight to a hard drive or some other process.

PCM storage is simple and intuitive. If a stream of PCM samples stops (for example, due to a network problem), the sound simply stops. If a PCM file is played at a sample rate different from the one at which it was recorded, it sounds sped up or slowed down, like a record player at the wrong setting. In fact, records and cassette tapes both store a replica of the analog waveform on physical media, and thus could be called analog PCM.

The frequency-based method of storing audio is much more recent and has no physical analog. In the early 19th century, Fourier showed that any repeating waveform, no matter how complex, can be represented as a sum of pure tones of different volumes. Sustained notes from a flute, a bassoon, or an electric guitar approximate periodic waveforms and can all be represented this way. The flute has more of its volume in the high-pitched pure tones, which is what gives it its "brightness." Similarly, the bassoon has more of its volume in the low-pitched pure tones, which gives it its "throatiness."

The basic idea of frequency encoding is first to chop the audio waveform into frames that last only a fraction of a second. This frame rate is much slower than PCM sample rates but still too fast for the human ear to perceive the frames as separate sounds. Then a mathematical transform is performed on the frame to obtain the pure tones and their corresponding volumes, and those values are stored. The operation is then repeated for the next frame. Often the values for one frame won't be that different from the values for the next frame, and this makes it easier to compress the resulting lists of numbers.
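The transform step can be sketched with a plain discrete Fourier transform applied to a single frame. This is only a stand-in: real codecs use more refined transforms, and the helper name dft_magnitudes, the 8,000 Hz sample rate, and the 200-sample frame are assumptions picked so that the arithmetic comes out evenly.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Naive discrete Fourier transform of one frame.

    Returns the volume (amplitude) of each pure tone from 0 Hz up to
    half the sample rate. Hypothetical helper for illustration only.
    """
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        # The lowest and highest bins appear once in the spectrum, the rest twice.
        scale = 1 / n if k == 0 or 2 * k == n else 2 / n
        mags.append(abs(total) * scale)
    return mags

SAMPLE_RATE = 8000                       # assumed for this sketch
FRAME_LEN = 200                          # 25 ms frame, so bins are 40 Hz apart
BIN_HZ = SAMPLE_RATE / FRAME_LEN

# A frame containing a loud 400 Hz tone and a quieter 1,200 Hz tone.
frame = [1.0 * math.sin(2 * math.pi * 400 * t / SAMPLE_RATE)
         + 0.5 * math.sin(2 * math.pi * 1200 * t / SAMPLE_RATE)
         for t in range(FRAME_LEN)]

mags = dft_magnitudes(frame)
# Keep only the tones that contribute noticeably to the sound.
tones = [(k * BIN_HZ, round(m, 3)) for k, m in enumerate(mags) if m > 0.01]
print(tones)
```

Two hundred PCM samples collapse into two (frequency, volume) pairs here, which hints at why the frequency representation compresses so well for sustained, tonal sounds.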

This basic concept can be enhanced in many ways. For example, some of the values that don't contribute much to the sound of the audio clip can be discarded. In this respect, frequency encoding of audio is similar to the JPEG encoding of image data (see Chapter 22, Image Processing). How much information is retained is determined by a parameter called the bit rate. Low-bit-rate files have poor quality; the quality gets better and better as the bit rate increases.

In contrast to PCM encoding, if a frequency encoding stream stops or skips, the sound does not simply stop. Instead, it "freezes" at a particular tone until more data is received. If the stream is delayed, as in a slow network connection, the sound appears to "drag," or slow down, without altering the pitches used in the clip. This interesting effect isn't possible with a PCM file (in that case, the audio would slow down, but the pitch would be changed also).

Finally, many audio clips support more than one channel of sound. One channel means monaural, and two channels mean stereo. High-end audio applications can handle audio clips with many more channels (for example, 5.1 surround sound). Most of the clips that you'll work with, though, will involve two channels.

So which method is right for your recordings? The answer depends on what you want to do with the files. If you want to trim or add dead time, add digital effects, or reduce noise, it's best to use an uncompressed PCM format. If you have limited storage space, or when you actually want to distribute your files, frequency-based encoding is often a good bet due to its considerably smaller size. For each of these two methods, there are several commonly used file formats.

RAW Format

RAW format is basically raw PCM data in a file, with nothing else. Its major drawback is that the file contains no information about the size of the samples, the sample rate, the number of channels, or even the endianness^[*] of the data. On the other hand, if you pick a standard such as CD quality (44.1 kHz, 16 bits, 2 channels, little-endian), you can use Unix utilities like cat and dd to cut and paste audio clips. What could be simpler? I'll show you an example in the upcoming section "Cut and Paste."
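Because a RAW file is nothing but samples, cutting a clip is just byte arithmetic. The following sketch assumes the CD-quality layout mentioned above and does the slicing in memory; cut_raw is a hypothetical helper of my own, not part of any utility discussed in this chapter.

```python
import struct

SAMPLE_RATE = 44100                          # samples per second, per channel
CHANNELS = 2
BYTES_PER_SAMPLE = 2                         # 16 bits
FRAME_BYTES = CHANNELS * BYTES_PER_SAMPLE    # one sample for every channel

def cut_raw(data, start_sec, end_sec):
    """Return the part of a RAW clip between two times (hypothetical helper).

    Offsets are rounded down to whole frames so that the left and right
    channels never get swapped or split.
    """
    start = int(start_sec * SAMPLE_RATE) * FRAME_BYTES
    end = int(end_sec * SAMPLE_RATE) * FRAME_BYTES
    return data[start:end]

# "<h" packs a 16-bit integer little-end-first, ">h" big-end-first.
little = struct.pack("<h", 1)                # b"\x01\x00"
big = struct.pack(">h", 1)                   # b"\x00\x01"

# Two seconds of made-up stereo data: the left channel counts upward,
# the right channel is silent.
raw = b"".join(struct.pack("<hh", n % 1000, 0) for n in range(2 * SAMPLE_RATE))

clip = cut_raw(raw, 0.5, 1.5)                # the middle second
print(len(raw), len(clip))
```

The same offsets, expressed in bytes, are what you would hand to a byte-oriented tool. If a cut does not fall on a multiple of four bytes in this layout, every sample after it is misaligned and the result is loud noise.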

^[*] Endianness refers to the order in which the bytes of a multibyte value are stored: least significant byte first, or "little-end-first" (little-endian), versus most significant byte first, or "big-end-first" (big-endian).

WAV Format

Another major type of PCM file is the WAV, which is familiar to Windows users. WAVs contain multiple chunks of data, each tagged with a type code and a size. The two required chunks are a format header, tagged "fmt " (it immediately follows the WAVE file-type code, so the bytes in the file read WAVEfmt), and the PCM audio data, tagged "data" (surprising choice of name, isn't it?). The header contains the information lacking in a RAW file: sample rate, number of channels, and bit depth, as well as other optional fields. WAV files are found on many different platforms, and practically all Linux sound-processing utilities support them to one degree or another.
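The chunk layout is easy to inspect with the wave module from the Python standard library. This sketch writes one second of made-up CD-quality stereo data into an in-memory buffer, looks at the tags in the resulting bytes, and reads the header fields back. The sample values are arbitrary, and the byte offsets shown apply to the minimal header that this module writes; other WAV files may carry extra chunks.

```python
import io
import struct
import wave

# Write a one-second WAV into memory instead of onto disk.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(2)            # stereo
    w.setsampwidth(2)            # 2 bytes = 16 bits per sample
    w.setframerate(44100)        # samples per second
    w.writeframes(struct.pack("<hh", 1000, -1000) * 44100)

wav_bytes = buf.getvalue()

# The chunk tags are plain ASCII inside the file.
print(wav_bytes[0:4], wav_bytes[8:16], wav_bytes[36:40])

# Read the header back: everything a RAW file would leave you guessing.
with wave.open(io.BytesIO(wav_bytes), "rb") as r:
    params = r.getparams()

print(params.nchannels, params.sampwidth * 8, params.framerate, params.nframes)
```

Everything after the 44-byte header in this file is exactly the data a RAW file would contain, which is why converting between the two formats is mostly a matter of adding or stripping the header.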

MP3 (MPEG Layer 3) Format

In recent years, the MP3 format has become extremely popular on the Internet. An MP3 file consists of header information, followed by a number of encoded frames. The frames have a very short duration (less than 1/20 of a second), so individual frames blend together in a continuous perception. The data in each frame is frequency encoded, and certain heuristics are used to preserve only the elements of a waveform that are important in human perception. These tricks allow a very high compression ratio over a comparable WAV file, easily as high as 10:1. Even the highest-quality MP3 encoding can still produce a 4:1 compression.

The quality of an MP3, and in fact of most frequency encodings, is determined by a parameter called the bit rate. The de facto standard bit rate is 128,000 bits per second, or 128 kbps. Compare this to the number of bits of a CD-quality PCM file that must be sent to a sound card each second:

44,100 samples per second × 16 bits per sample × 2 channels = 1,411,200 bits per second

This is roughly a factor of 10 more than the bit rate of an MP3. Small wonder, then, that frequency-encoded formats are the best for distributing audio files.
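To see what that difference means on disk, here is a small sketch that turns bit rates into file sizes. It assumes CD-quality PCM (44,100 samples per second, 16 bits, 2 channels), a constant 128 kbps MP3, and a three-minute track chosen purely for illustration; headers and other overhead are ignored.

```python
def pcm_bit_rate(sample_rate=44100, bit_depth=16, channels=2):
    """Bits per second needed for uncompressed PCM audio."""
    return sample_rate * bit_depth * channels

def size_in_bytes(bit_rate, seconds):
    """Approximate size of the audio data, ignoring headers."""
    return bit_rate * seconds // 8

PCM_RATE = pcm_bit_rate()          # CD-quality PCM
MP3_RATE = 128_000                 # the de facto standard MP3 bit rate
TRACK_SECONDS = 180                # a three-minute track, for illustration

ratio = PCM_RATE / MP3_RATE
pcm_size = size_in_bytes(PCM_RATE, TRACK_SECONDS)
mp3_size = size_in_bytes(MP3_RATE, TRACK_SECONDS)

print(f"PCM: {PCM_RATE:,} bits/s, MP3: {MP3_RATE:,} bits/s, ratio {ratio:.1f}")
print(f"PCM: {pcm_size:,} bytes, MP3: {mp3_size:,} bytes")
```

Under these assumptions the PCM version of the track needs about 32 million bytes, while the MP3 needs under 3 million.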

Since MP3s are frequency encoded, the data in the file bears little resemblance to the actual waveform. It must be decoded before any but the simplest operations can be performed on it. Consequently, playing or modifying MP3 files takes a good deal of processing power. Luckily, most modern machines have little difficulty keeping up. However, RAW or WAV files are still far better for editing and filtering audio clips.

Depending on your software philosophy, there may also be another problem with MP3s. The MP3 encoding algorithm is patented by the Fraunhofer Group, which requires anyone who builds an MP3 encoder (but not a player) to pay a license fee. Despite this situation, when MP3s became the "next big thing," many encoders appeared, some under dubious circumstances. The resultant wide access to free encoders helped propel MP3s to the popularity they enjoy today.

In addition, in their latest release of Windows, XP, Microsoft has de-emphasized the ability to create MP3s in favor of Windows Media Format. The default recorder settings for MP3s make them sound considerably worse than the default setting for WMA. It would be a sad situation indeed if this trickery causes the majority of music lovers to switch to a proprietary format simply because it "sounds better," not because of any real quality differences. In the long run, the popularity of MP3s may already have peaked.

Ogg Vorbis Format

Luckily, the open source movement comes thundering back! The Ogg Project is an ambitious and worthy attempt to build a set of fully open source, nonproprietary, non-patent-encumbered audio and video storage algorithms (referred to as codecs). Although development is happening on many fronts, the audio codec, called Vorbis, is mostly completed. It is technically similar to MP3 but sufficiently different that it does not violate any patents held by Fraunhofer or other industry groups. All of the Ogg Vorbis source is freely available under the BSD license. Support for this format has already found its way into XMMS, and players are available for Windows. So perhaps there is some hope after all.