Introduction of Digital Audio
Name: Yao-Cheng Chuang  Phone: 0919005578  Email: ...
TRANSCRIPT
History and Comparison
Speech and audio history.
Speech and Audio
Speech is the set of sounds humans can utter, while audio is everything humans can hear.
The basic bandwidth of speech is 4 kHz. On the other hand, the basic bandwidth of audio is 22.05 kHz.
The research of speech coding started earlier than audio coding.
SPL: Sound Pressure Level
Speech Codec
The first speech codec standard is PCM (Pulse Code Modulation). It uses simple sampling and quantization to represent speech digitally. PCM runs at 64 Kbps. It is also called CCITT G.711.
(International Telephone and Telegraph Consultative Committee)
The goal of speech codec is low bit-rate.
ADPCM (Adaptive Differential PCM), also called CCITT G.721, is the representative 32 Kbps codec.
Because neighboring speech samples are usually similar, we encode their differences to compress the original data.
Later, CCITT G.723 and G.726 appeared. They are also ADPCM but support several bit-rates, such as 40, 32, 24, and 16 Kbps.
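The difference-coding idea behind ADPCM can be sketched in a few lines. This is a toy DPCM round trip, not the actual G.721 algorithm (which also adapts its quantizer step size); it only shows why coding differences helps.

```python
# Toy DPCM (difference coding) sketch -- illustrative only, not G.721/G.726.

def dpcm_encode(samples):
    """Encode each sample as its difference from the previous one."""
    prev = 0
    diffs = []
    for s in samples:
        diffs.append(s - prev)
        prev = s
    return diffs

def dpcm_decode(diffs):
    """Rebuild the original samples by accumulating the differences."""
    prev = 0
    samples = []
    for d in diffs:
        prev += d
        samples.append(prev)
    return samples

speech = [100, 102, 105, 104, 101, 99, 98, 100]  # neighboring samples are similar
diffs = dpcm_encode(speech)
print(diffs)                       # mostly small numbers -> fewer bits needed
assert dpcm_decode(diffs) == speech
```

Because the differences are mostly small, they can be quantized with fewer bits than the raw samples, which is where the bit-rate saving comes from.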
CCITT G.727 and G.728 run at 16 Kbps; they are representative middle bit-rate codecs.
They use the backward-CELP technique, which emphasizes short delay.
CELP (Code Excited Linear Prediction) runs at 8 Kbps.
MOS: Mean Opinion Score
Audio Codec
After speech codecs matured, many companies and committees invested in audio codecs.
ISO formulated a suite of video and audio standards called MPEG.
Dolby developed AC-1, AC-2, and AC-3.
ISO (International Organization for Standardization)
MPEG (Moving Pictures Experts Group)
AC-3 (Audio Codec 3)
DAB: Digital Audio Broadcasting
DCC: Digital Compact Cassette
ISDN: Integrated Services Digital Network
MD: MiniDisc
Why Transform?
Two main reasons.
Benefit of Transformation
There are two main reasons to transform one kind of information or data from one domain to another domain.
1. Data compression.
2. Some operations can only be done in a particular domain.
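Reason 1 can be illustrated with a hand-rolled orthonormal DCT-II (a relative of the MDCT used in MP3): for a smooth signal, the transform packs almost all the energy into a few coefficients, so the rest can be stored coarsely or dropped. The ramp signal here is just an assumed example input.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a list of samples."""
    N = len(x)
    out = []
    for k in range(N):
        s = sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out.append(scale * s)
    return out

signal = [float(n) for n in range(32)]   # a smooth ramp
coeffs = dct2(signal)
total = sum(c * c for c in coeffs)       # equals the signal energy (orthonormal)
first4 = sum(c * c for c in coeffs[:4])
print(first4 / total)   # close to 1: a few coefficients carry nearly all the energy
```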
Data Compression
Data Compression (cont.)
Disadvantage
Transformation alone is not a good method for audio data compression.
Our ears are more sensitive in some frequency ranges (e.g., 1 kHz to 5 kHz).
Plain transform-based compression does not consider these psychoacoustic factors.
Frequency Domain
Human ears perceive sounds according to their frequency.
Some operations must be in frequency domain.
Many psychoacoustic studies are based on frequency domain.
Pulse-Code Modulation
Raw data of sound.
Modulation
Modulation is a means of encoding information for the purpose of transmission or storage.
Techniques such as amplitude modulation (AM) and frequency modulation (FM) have long been used to modulate carrier frequencies with analog audio information for radio broadcast.
Amplitude Modulation
Frequency Modulation
PWM / PPM
pulse-width modulation
pulse-position modulation
PAM / PNM
pulse-amplitude modulation
pulse-number modulation
PCM
pulse-code modulation
It is the most commonly used modulation method.
Lossless and Lossy Compression
Two main models of compression.
Terminology
E( ): encoding algorithm
D( ): decoding algorithm
M: original data
m = E(M) → encoded M
M' = D(m) → decoded m
If M = M', we call the algorithm lossless compression; otherwise, lossy compression.
Compression Ratio
Compression ratio
p = (M – m) / M * 100%, where M and m here denote the sizes of the original and compressed data.
Generally, lossy compression is better than lossless compression in compression ratio.
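The terminology and the ratio can be demonstrated with Python's zlib as the lossless E/D pair (the sample data is an assumed repetitive byte string, chosen so it compresses well):

```python
import zlib

M = b"digital audio " * 200     # original data (highly repetitive)
m = zlib.compress(M)            # m = E(M)
M2 = zlib.decompress(m)         # M' = D(m)

assert M2 == M                  # M = M': zlib is a lossless codec
ratio = (len(M) - len(m)) / len(M) * 100
print(f"compression ratio p = {ratio:.1f}%")
```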
Psychoacoustics and Human Ear
Sounds of Human feeling.
Terminology
Loudness: Sound loudness is a subjective term describing the strength of the ear's perception of a sound.
Intensity: Sound intensity is defined as the sound power per unit area. The basic units are watts/m² or watts/cm².
Threshold of Hearing
This is the audibility curve; below the curve, we cannot hear anything.
Human ears can hear sounds in the range 20 Hz to 20,000 Hz.
Many sound intensity measurements are made relative to a standard threshold-of-hearing intensity:
I₀ = 10⁻¹² watts/m² = 10⁻¹⁶ watts/cm²
Intensity Level
Decibel (dB): the sound intensity I₁ may be expressed in decibels above the standard threshold of hearing I₀.
Intensity level = 10 log₁₀(I₁ / I₀) (dB)
I₀: threshold of hearing, 10⁻¹² watts/m²
I₁: the intensity we want to measure
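The formula in code, with the threshold of hearing as the reference (the 60 dB case uses the commonly quoted intensity of normal conversation as an example input):

```python
import math

I0 = 1e-12   # threshold of hearing, watts/m^2

def intensity_level_db(I1):
    """Intensity level = 10 * log10(I1 / I0), in dB."""
    return 10 * math.log10(I1 / I0)

print(intensity_level_db(1e-12))  # 0 dB: the threshold of hearing itself
print(intensity_level_db(1e-6))   # about 60 dB: roughly normal conversation
```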
Threshold of Feeling
It is the upper-bound curve of what humans can bear; above this curve, human ears can be hurt.
It is not a horizontal line either. At lower frequencies, human ears are more sensitive, so the curve has a trough there.
Equal-Loudness Curve
Along any single equal-loudness curve, humans hear the same loudness.
Equal-loudness curves are not horizontal lines.
Between the threshold of hearing and the threshold of feeling, there are infinitely many equal-loudness curves.
Human Hearing
Sound Masking
Time / frequency sound masking.
Frequency Masking
If many tones play simultaneously, some tones are masked by others.
We can draw a frequency-masking curve; sounds under the curve cannot be heard.
The curve's slope is steep at low frequencies but gentle at high frequencies.
Frequency Masking (cont.)
The louder the masking sound, the larger the masked area.
By exploiting frequency masking, we can reduce the number of coding bits.
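A toy sketch of how a masking threshold prunes inaudible tones. The triangular shape and all numbers (slopes, the 15 dB offset) are illustrative assumptions, not the MPEG psychoacoustic model, which uses critical bands and measured spreading functions; only the asymmetry (steep below the masker, shallow above) mirrors the curve described above.

```python
# Toy frequency-masking sketch -- all constants are assumed for illustration.

def masking_threshold(masker_freq, masker_level_db, freq):
    """Toy masking curve: steep below the masker, shallow above it."""
    if freq <= masker_freq:
        slope = 27.0    # dB per kHz below the masker (steep side)
    else:
        slope = 10.0    # dB per kHz above the masker (shallow side)
    drop = slope * abs(freq - masker_freq) / 1000.0
    return masker_level_db - 15.0 - drop   # assumed 15 dB offset below the masker

def is_audible(tone_freq, tone_level_db, masker_freq, masker_level_db):
    return tone_level_db > masking_threshold(masker_freq, masker_level_db, tone_freq)

# A loud 1 kHz tone at 80 dB masks a quiet tone close to it...
print(is_audible(1100, 40, 1000, 80))   # False: masked, no bits needed
# ...but not a quiet tone far away in frequency.
print(is_audible(8000, 40, 1000, 80))   # True: still audible, must be coded
```

An encoder spends bits only on tones above the threshold, which is exactly how masking reduces the coding bits.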
Time Masking
When a sound is played, it generates pre-masking and post-masking.
Post-masking lasts longer than pre-masking.
The louder the sound, the longer the masking.
MP3
MPEG1 Layer3
Introduction
MPEG: Moving Pictures Experts Group
MP3: MPEG-1 Layer-3
Why is MP3 so popular?
Open standard
Availability of hardware and software
Near CD (Compact Disk) quality
Fast Internet access for universities and businesses
MP3 Format
An MPEG audio file is divided into smaller parts called frames. Each frame is independent, with its own header and audio information, and there is no file header. Therefore, you can cut any part of an MPEG audio file and still play it correctly.
The frame header consists of the first four bytes (32 bits) of a frame:
aaaaaaaa aaabbccd eeeeffgh iijjklmm
From the frame header we can learn, for example: the version and layer, whether the frame is protected by CRC (Cyclic Redundancy Check), and the bit-rate and frequency.
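As a sketch, those fields can be pulled out with bit shifts. The tables below cover only the MPEG-1 Layer-3 case, and the example header bytes are an assumed typical value (128 Kbps, 44.1 kHz, no CRC):

```python
# Hedged sketch of parsing an MPEG-1 Layer-3 frame header.

BITRATES_KBPS = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]
SAMPLE_RATES = [44100, 48000, 32000]   # MPEG-1 sample-rate table

def parse_header(b):
    h = int.from_bytes(b[:4], "big")
    assert h >> 21 == 0x7FF, "no frame sync (11 set bits)"  # aaaaaaaa aaa
    version = (h >> 19) & 0x3               # bb: 3 means MPEG-1
    layer = (h >> 17) & 0x3                 # cc: 1 means Layer 3
    crc_protected = ((h >> 16) & 1) == 0    # d
    bitrate = BITRATES_KBPS[(h >> 12) & 0xF] * 1000   # eeee: bitrate index
    sample_rate = SAMPLE_RATES[(h >> 10) & 0x3]       # ff: sample-rate index
    return version, layer, crc_protected, bitrate, sample_rate

# A common header: MPEG-1, Layer 3, no CRC, 128 Kbps, 44.1 kHz
print(parse_header(b"\xff\xfb\x90\x00"))
```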
The tag is used to describe the MPEG audio file. It contains information about artist, title, album, publishing year, genre, and comments.
It is exactly 128 bytes long and is located at the end of the audio data.
A: "TAG" marker (3 bytes)
B: title (30 bytes)
C: artist (30 bytes)
D: album (30 bytes)
E: year (4 bytes)
F: comment (30 bytes)
G: genre (1 byte)
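A sketch of reading such a tag, assuming the fixed-width field layout of the 128-byte ID3v1 tag described above (the example tag bytes here are fabricated for illustration; a real tag sits at the very end of the file):

```python
# Sketch of parsing an ID3v1-style tag: the last 128 bytes of the file.

def parse_id3v1(tag):
    assert len(tag) == 128 and tag[:3] == b"TAG"
    def field(start, length):
        return tag[start:start + length].rstrip(b"\x00 ").decode("latin-1")
    return {
        "title":   field(3, 30),
        "artist":  field(33, 30),
        "album":   field(63, 30),
        "year":    field(93, 4),
        "comment": field(97, 30),
        "genre":   tag[127],      # one-byte genre code
    }

# Fabricated example tag (3 + 30 + 30 + 30 + 4 + 30 + 1 = 128 bytes).
tag = (b"TAG" + b"My Song".ljust(30, b"\x00") + b"Some Artist".ljust(30, b"\x00")
       + b"Some Album".ljust(30, b"\x00") + b"2004" + b"\x00" * 30 + bytes([17]))
print(parse_id3v1(tag))
```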
MP3 Encoder
MDCT: Modified Discrete Cosine Transform
FFT: Fast Fourier Transform
768Kbps = 32K samples/second * 24 bits/sample
MP3 Decoder
iMDCT: inverse Modified Discrete Cosine Transform
Psychoacoustic Principles
Critical band
Sound masking:
Time masking
Frequency masking
Filter Bank
Hybrid filter bank
Polyphase and MDCT (Modified Discrete Cosine Transform)
32 channels of polyphase sub-band
MDCT transforms each sub-band into 18 smaller channels.
MDCT
DFT
DCT
CELP
Code Excited Linear Prediction
Background
Over the years, many speech coding techniques have been developed, starting from PCM and ADPCM (Adaptive Differential Pulse Code Modulation) in the 1960s, to linear prediction in the 1970s, and CELP in the late 1980s and 1990s.
Because nearby speech samples have similar spectra, prediction methods work well.
Person Model
For voiced sounds, your vocal cords vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice.
For fricative and plosive (unvoiced) sounds, your vocal cords do not vibrate but remain open.
The shape of your vocal tract determines the sound that you make.
The shape of the vocal tract changes relatively slowly (on the scale of 10 ms to 100 ms).
The amount of air coming from your lungs determines the loudness of your voice.
Math Model
Vocal tract → H(z) (LPC (Linear Predictive Coding) filter)
Air → u(n) (innovations)
Vocal cord vibration → V (voiced)
Vocal cord vibration period → T (pitch period)
Fricatives and plosives → UV (unvoiced)
Air volume → G (gain)
LPC
It stands for Linear Prediction Coefficients.
X_n = sum_{i=1}^{L} a_i · X_{n−i} + e_n
{X_1, X_2, ..., X_n}: speech samples (spectra)
e_n: prediction error
LPC is the basic technique of CELP. Because CELP uses the prediction method, its bit-rate can be lower.
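A minimal sketch of LPC analysis using the textbook Levinson-Durbin recursion on the signal's autocorrelation (a CELP encoder adds pitch prediction and codebook search on top of this). The sine test signal is an assumed example, chosen because an order-2 predictor models it almost perfectly.

```python
import math

def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of the samples x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations; returns a with A(z) = 1 + sum a[j] z^-j."""
    a = [1.0]
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                     # reflection coefficient
        a = [1.0] + [a[j] + k * a[i - j] for j in range(1, i)] + [k]
        err *= 1.0 - k * k
    return a

# A sine is almost perfectly predictable by an order-2 LPC filter.
x = [math.sin(0.3 * n) for n in range(1000)]
a = levinson_durbin(autocorr(x, 2), 2)
residual = sum((x[n] + a[1] * x[n - 1] + a[2] * x[n - 2]) ** 2 for n in range(2, len(x)))
energy = sum(s * s for s in x)
print(residual / energy)   # tiny: prediction removes almost all the signal energy
```

Because the residual carries so little energy, it needs far fewer bits than the raw samples, which is why prediction lowers the bit-rate.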
CELP Encoder
AC-3
Audio Codec 3
What Is AC-3?
AC-3 refers to a multichannel music compression technology that has been developed by Dolby Laboratories.
Dolby Laboratories has used the term Dolby Digital to refer to this digital system in the film and theater industries, and has used the term Dolby Surround AC-3 to refer to the system in the home theater market.
AC-3 can carry from 1 to 5.1 channels. It provides five full-range channels (3 Hz to 20,000 Hz): three front channels (left, center, and right) plus two surround channels. A sixth bass-only effects channel (3 Hz to 120 Hz) is sometimes called the Low-Frequency Effects (LFE) channel.
How Does AC-3 Work?
It uses lossy compression. Like MP3 or AAC, AC-3 exploits properties of sound to achieve its compression.
Input uncompressed PCM samples must be at 32, 44.1, or 48 kHz, with up to 20 bits per sample.
AC-3 Encoder
AC-3 Decoder
AAC
MPEG-2 Advanced Audio Coding
Advertisement
Because of its exceptional performance and quality, Advanced Audio Coding (AAC) is at the core of the MPEG-4 and 3GPP (3rd Generation Partnership Project) specifications and is the new audio codec of choice for Internet, wireless, and digital broadcast arenas.
AAC provides audio encoding that compresses much more efficiently than older formats such as MP3, yet delivers quality rivaling that of uncompressed CD (Compact Disk) audio.
Why AAC?
The driving force to develop AAC was the quest for an efficient coding method for surround signals, like 5-channel signals (left, right, center, left-surround, right-surround) as being used in cinemas today.
One aim of AAC was a considerable decrease of necessary bit-rate.
Low Delay
Low Delay audio coding is needed whenever some sort of communication is transmitted over low-bandwidth channels in both directions, e.g., live broadcasts on TV (Television) or radio stations, or in mobile phone networks (3G: 3rd Generation).
Both AAC (in its Low Delay variant) and CELP have the low-delay property.
AAC vs. MP3
MPEG-2 AAC is the direct continuation of the highly successful MPEG1 Layer-3 coding method.
The crucial differences between MPEG-2 AAC and its predecessor ISO/MPEG Audio Layer-3 are shown as follows:
Quantization: by allowing finer control of quantization resolution, the given bit rate can be used more efficiently.
Prediction: a technique commonly established in speech coding systems; it benefits from the fact that certain types of audio signals are easy to predict.
Bit-stream format: the information to be transmitted undergoes entropy coding to keep redundancy as low as possible. The optimization of these coding methods, together with a flexible bit-stream structure, has made further improvement of the coding efficiency possible.
WMA
Windows Media Audio
What Is WMA?
It is an audio format by Microsoft.
Its files are roughly half the size of MP3 files, with sound quality similar to MP3.
Because it is proprietary, little is publicly known about the details of its codec.
The Difference between ASF and WMA/WMV
The only difference between ASF files and WMA or WMV files is the file extension and the MIME type.
The MIME type for a WMV file is video/x-ms-wmv, and for WMA it is audio/x-ms-wma. The MIME type for ASF is video/x-ms-asf. The basic internal structure of the files is identical.
MIME: Multipurpose Internet Mail Extensions
WMV: Windows Media Video
ASF: Advanced Systems Format
MIDI
Musical Instrument Digital Interface
What Is MIDI?
MIDI is a method of communication between digital instruments.
It was created in 1982.
Unlike speech or audio coding, MIDI is essentially a kind of musical score; it is unrelated to codecs.
We write musical notes into a MIDI file; the computer then looks up a table for each note's corresponding sound.
Therefore, just by changing the table, we can render the notes as violin, piano, or other instruments.
A MIDI file is much smaller than a general audio file.
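The note-to-sound lookup can be illustrated with the standard equal-temperament mapping from MIDI note numbers to frequencies, where note 69 is A4 at 440 Hz:

```python
# MIDI stores note numbers, not waveforms; the synthesizer maps each
# number to a pitch.  In equal temperament: f = 440 * 2^((note - 69) / 12).

def note_to_freq(note):
    return 440.0 * 2 ** ((note - 69) / 12)

print(note_to_freq(69))             # 440.0 (A4)
print(round(note_to_freq(60), 2))   # 261.63 (middle C)
```

Storing one byte per note instead of thousands of samples per second is why a MIDI file is so much smaller than an audio file.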
Use What Is Fitting
MPEG1 Layer2 at 192 Kbps was significantly better than AAC at 96 Kbps in 7 of 8 cases, and better than AAC at 128 Kbps in 6 of 8 cases. Under twice-cascaded coding, the quality of AAC was much inferior to Layer2. It should also be noted that there is a significant difference in processing delay between Layer2 (approximately 70 ms) and AAC (about 300 ms).