Introduction of Digital Audio
Name: Yao-Cheng Chuang  Phone: 0919005578  Email: ...
TRANSCRIPT
History and Comparison
Speech and audio history.
Speech and Audio
Speech is the set of sounds humans can utter, while audio is everything humans can hear.
The basic bandwidth of speech is 4 kHz. On the other hand, the basic bandwidth of audio is 22.05 kHz.
The research of speech coding started earlier than audio coding.
SPL: Sound Pressure Level
Speech Codec
The first speech codec standard is PCM (Pulse Code Modulation). It uses simple sampling and quantization to represent speech digitally. PCM runs at 64 Kbps. It is also called CCITT G.711.
(International Telephone and Telegraph Consultative Committee)
The goal of speech codec is low bit-rate.
ADPCM (Adaptive Differential PCM), also called CCITT G.721, is the representative 32 Kbps codec.
Because neighboring speech samples are usually similar, we encode their differences to compress the original data.
Later, CCITT G.723 and G.726 appeared. They are also ADPCM but support several bit-rates, such as 40, 32, 24, and 16 Kbps.
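The difference-coding idea behind ADPCM can be sketched in a few lines. This is a toy DPCM round trip, not the actual G.721 algorithm (which also adapts its quantizer step size); it only shows why coding differences helps.

```python
# Toy DPCM (difference coding) sketch -- illustrative only, not G.721/G.726.

def dpcm_encode(samples):
    """Encode each sample as its difference from the previous one."""
    prev = 0
    diffs = []
    for s in samples:
        diffs.append(s - prev)
        prev = s
    return diffs

def dpcm_decode(diffs):
    """Rebuild the original samples by accumulating the differences."""
    prev = 0
    samples = []
    for d in diffs:
        prev += d
        samples.append(prev)
    return samples

speech = [100, 102, 105, 104, 101, 99, 98, 100]  # neighboring samples are similar
diffs = dpcm_encode(speech)
print(diffs)                       # mostly small numbers -> fewer bits needed
assert dpcm_decode(diffs) == speech
```

Because the differences are mostly small, they can be quantized with fewer bits than the raw samples, which is where the bit-rate saving comes from.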
CCITT G.727 and G.728 run at 16 Kbps; they are representative middle bit-rate codecs.
They use the backward-CELP technique, which emphasizes short delay.
CELP (Code Excited Linear Prediction) runs at 8 Kbps.
MOS: Mean Opinion Score
Audio Codec
After speech codecs matured, many companies and committees invested in audio codecs.
ISO formulated a suite of video and audio standards called MPEG.
Dolby developed AC-1, AC-2, and AC-3.
ISO (International Organization for Standardization)
MPEG (Moving Pictures Experts Group)
AC-3 (Audio Codec 3)
DAB: Digital Audio Broadcasting
DCC: Digital Compact Cassette
ISDN: Integrated Services Digital Network
MD: MiniDisc
Why Transform?
Two main reasons.
Benefit of Transformation
There are two main reasons to transform one kind of information or data from one domain to another domain.
1. Data compression.
2. Some operations can only be done in a particular domain.
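Reason 1 can be illustrated with a hand-rolled orthonormal DCT-II (a relative of the MDCT used in MP3): for a smooth signal, the transform packs almost all the energy into a few coefficients, so the rest can be stored coarsely or dropped. The ramp signal here is just an assumed example input.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a list of samples."""
    N = len(x)
    out = []
    for k in range(N):
        s = sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out.append(scale * s)
    return out

signal = [float(n) for n in range(32)]   # a smooth ramp
coeffs = dct2(signal)
total = sum(c * c for c in coeffs)       # equals the signal energy (orthonormal)
first4 = sum(c * c for c in coeffs[:4])
print(first4 / total)   # close to 1: a few coefficients carry nearly all the energy
```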
Data Compression
Data Compression (cont.)
Disadvantage
Transformation alone is not a good method for audio data compression.
Our ears are more sensitive in some frequency ranges (e.g., 1 kHz to 5 kHz).
Plain transform-based compression does not consider these psychoacoustic factors.
Frequency Domain
Human ears perceive sounds according to their frequency.
Some operations must be in frequency domain.
Many psychoacoustic studies are based on frequency domain.
Pulse-Code Modulation
Raw data of sound.
Modulation
Modulation is a means of encoding information for the purpose of transmission or storage.
Techniques such as amplitude modulation (AM) and frequency modulation (FM) have long been used to modulate carrier frequencies with analog audio information for radio broadcast.
Amplitude Modulation
Frequency Modulation
PWM / PPM
pulse-width modulation
pulse-position modulation
PAM / PNM
pulse-amplitude modulation
pulse-number modulation
PCM
pulse-code modulation
It is the most commonly used modulation method.
Lossless and Lossy Compression
Two main models of compression.
Terminology
E( ): encoding algorithm
D( ): decoding algorithm
M: original data
m = E(M) → encoded M
M' = D(m) → decoded m
If M = M', we call the algorithm lossless compression; otherwise, lossy compression.
Compression Ratio
Compression ratio
p = (M – m) / M * 100%, where M and m here denote the sizes of the original and compressed data.
Generally, lossy compression is better than lossless compression in compression ratio.
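The terminology and the ratio can be demonstrated with Python's zlib as the lossless E/D pair (the sample data is an assumed repetitive byte string, chosen so it compresses well):

```python
import zlib

M = b"digital audio " * 200     # original data (highly repetitive)
m = zlib.compress(M)            # m = E(M)
M2 = zlib.decompress(m)         # M' = D(m)

assert M2 == M                  # M = M': zlib is a lossless codec
ratio = (len(M) - len(m)) / len(M) * 100
print(f"compression ratio p = {ratio:.1f}%")
```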
Psychoacoustics and Human Ear
Sounds of Human feeling.
Terminology
Loudness: Sound loudness is a subjective term describing the strength of the ear's perception of a sound.
Intensity: Sound intensity is defined as the sound power per unit area. The basic units are watts/m² or watts/cm².
Threshold of Hearing
This is the audibility curve; below the curve, we cannot hear anything.
Human ears can hear sounds in the range 20 Hz to 20,000 Hz.
Many sound intensity measurements are made relative to a standard threshold-of-hearing intensity:
I₀ = 10⁻¹² watts/m² = 10⁻¹⁶ watts/cm²
Intensity Level
Decibel (dB): the sound intensity I₁ may be expressed in decibels above the standard threshold of hearing I₀.
Intensity level = 10 log₁₀(I₁ / I₀) (dB)
I₀: threshold of hearing, 10⁻¹² watts/m²
I₁: the intensity we want to measure
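The formula in code, with the threshold of hearing as the reference (the 60 dB case uses the commonly quoted intensity of normal conversation as an example input):

```python
import math

I0 = 1e-12   # threshold of hearing, watts/m^2

def intensity_level_db(I1):
    """Intensity level = 10 * log10(I1 / I0), in dB."""
    return 10 * math.log10(I1 / I0)

print(intensity_level_db(1e-12))  # 0 dB: the threshold of hearing itself
print(intensity_level_db(1e-6))   # about 60 dB: roughly normal conversation
```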
Threshold of Feeling
It is the upper-bound curve of what humans can bear; above this curve, human ears can be hurt.
It is not a horizontal line either. At lower frequencies, human ears are more sensitive, so the curve has a trough there.
Equal-Loudness Curve
Along any single equal-loudness curve, humans hear the same loudness.
Equal-loudness curves are not horizontal lines.
Between the threshold of hearing and the threshold of feeling, there are infinitely many equal-loudness curves.
Human Hearing
Sound Masking
Time / frequency sound masking.
Frequency Masking
If many tones play simultaneously, some tones are masked by others.
We can draw a frequency-masking curve; sounds under the curve cannot be heard.
The curve's slope is steep at low frequencies but gentle at high frequencies.
Frequency Masking (cont.)
The louder the masking sound, the larger the masked area.
By exploiting frequency masking, we can reduce the number of coding bits.
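A toy sketch of how a masking threshold prunes inaudible tones. The triangular shape and all numbers (slopes, the 15 dB offset) are illustrative assumptions, not the MPEG psychoacoustic model, which uses critical bands and measured spreading functions; only the asymmetry (steep below the masker, shallow above) mirrors the curve described above.

```python
# Toy frequency-masking sketch -- all constants are assumed for illustration.

def masking_threshold(masker_freq, masker_level_db, freq):
    """Toy masking curve: steep below the masker, shallow above it."""
    if freq <= masker_freq:
        slope = 27.0    # dB per kHz below the masker (steep side)
    else:
        slope = 10.0    # dB per kHz above the masker (shallow side)
    drop = slope * abs(freq - masker_freq) / 1000.0
    return masker_level_db - 15.0 - drop   # assumed 15 dB offset below the masker

def is_audible(tone_freq, tone_level_db, masker_freq, masker_level_db):
    return tone_level_db > masking_threshold(masker_freq, masker_level_db, tone_freq)

# A loud 1 kHz tone at 80 dB masks a quiet tone close to it...
print(is_audible(1100, 40, 1000, 80))   # False: masked, no bits needed
# ...but not a quiet tone far away in frequency.
print(is_audible(8000, 40, 1000, 80))   # True: still audible, must be coded
```

An encoder spends bits only on tones above the threshold, which is exactly how masking reduces the coding bits.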
Time Masking
When a sound is played, it generates pre-masking and post-masking.
Post-masking lasts longer than pre-masking.
The louder the sound, the longer the masking.
MP3
MPEG1 Layer3
Introduction
MPEG: Moving Pictures Experts Group
MP3: MPEG-1 Layer-3
Why is MP3 so popular?
Open standard
Availability of hardware and software
Near CD (Compact Disk) quality
Fast Internet access for universities and businesses
MP3 Format
An MPEG audio file is divided into smaller parts called frames. Each frame is independent, with its own header and audio information, and there is no file header. Therefore, you can cut any part of an MPEG audio file and still play it correctly.
The frame header consists of the first four bytes (32 bits) of a frame:
aaaaaaaa aaabbccd eeeeffgh iijjklmm
From the frame header we can learn, for example: the version and layer, whether the frame is protected by CRC (Cyclic Redundancy Check), and the bit-rate and frequency.
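As a sketch, those fields can be pulled out with bit shifts. The tables below cover only the MPEG-1 Layer-3 case, and the example header bytes are an assumed typical value (128 Kbps, 44.1 kHz, no CRC):

```python
# Hedged sketch of parsing an MPEG-1 Layer-3 frame header.

BITRATES_KBPS = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]
SAMPLE_RATES = [44100, 48000, 32000]   # MPEG-1 sample-rate table

def parse_header(b):
    h = int.from_bytes(b[:4], "big")
    assert h >> 21 == 0x7FF, "no frame sync (11 set bits)"  # aaaaaaaa aaa
    version = (h >> 19) & 0x3               # bb: 3 means MPEG-1
    layer = (h >> 17) & 0x3                 # cc: 1 means Layer 3
    crc_protected = ((h >> 16) & 1) == 0    # d
    bitrate = BITRATES_KBPS[(h >> 12) & 0xF] * 1000   # eeee: bitrate index
    sample_rate = SAMPLE_RATES[(h >> 10) & 0x3]       # ff: sample-rate index
    return version, layer, crc_protected, bitrate, sample_rate

# A common header: MPEG-1, Layer 3, no CRC, 128 Kbps, 44.1 kHz
print(parse_header(b"\xff\xfb\x90\x00"))
```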
The tag is used to describe the MPEG audio file. It contains information about artist, title, album, publishing year, genre, and comments.
It is exactly 128 bytes long and is located at the end of the audio data.
A: "TAG" marker (3 bytes)
B: title (30 bytes)
C: artist (30 bytes)
D: album (30 bytes)
E: year (4 bytes)
F: comment (30 bytes)
G: genre (1 byte)
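A sketch of reading such a tag, assuming the fixed-width field layout of the 128-byte ID3v1 tag described above (the example tag bytes here are fabricated for illustration; a real tag sits at the very end of the file):

```python
# Sketch of parsing an ID3v1-style tag: the last 128 bytes of the file.

def parse_id3v1(tag):
    assert len(tag) == 128 and tag[:3] == b"TAG"
    def field(start, length):
        return tag[start:start + length].rstrip(b"\x00 ").decode("latin-1")
    return {
        "title":   field(3, 30),
        "artist":  field(33, 30),
        "album":   field(63, 30),
        "year":    field(93, 4),
        "comment": field(97, 30),
        "genre":   tag[127],      # one-byte genre code
    }

# Fabricated example tag (3 + 30 + 30 + 30 + 4 + 30 + 1 = 128 bytes).
tag = (b"TAG" + b"My Song".ljust(30, b"\x00") + b"Some Artist".ljust(30, b"\x00")
       + b"Some Album".ljust(30, b"\x00") + b"2004" + b"\x00" * 30 + bytes([17]))
print(parse_id3v1(tag))
```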
MP3 Encoder
MDCT: Modified Discrete Cosine Transform
FFT: Fast Fourier Transform
768Kbps = 32K samples/second * 24 bits/sample
MP3 Decoder
iMDCT: inverse Modified Discrete Cosine Transform
Psychoacoustic Principles
Critical band
Sound masking:
Time masking
Frequency masking
Filter Bank
Hybrid filter bank
Polyphase and MDCT (Modified Discrete Cosine Transform)
32 channels of polyphase sub-band
MDCT transforms each sub-band into 18 smaller channels.
MDCT
DFT
DCT
CELP
Code Excited Linear Prediction
Background
Over the years, many speech coding techniques have been developed, starting from PCM and ADPCM (Adaptive Differential Pulse Code Modulation) in the 1960s, to linear prediction in the 1970s, and CELP in the late 1980s and 1990s.
Because nearby speech samples have similar spectra, prediction methods work well.
Person Model
For voiced sounds, your vocal cords vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice.
For fricative and plosive (unvoiced) sounds, your vocal cords do not vibrate but remain open.
The shape of your vocal tract determines the sound that you make.
The shape of the vocal tract changes relatively slowly (on the scale of 10 ms to 100 ms).
The amount of air coming from your lungs determines the loudness of your voice.
Math Model
Vocal tract → H(z) (LPC (Linear Predictive Coding) filter)
Air → u(n) (innovations)
Vocal cord vibration → V (voiced)
Vocal cord vibration period → T (pitch period)
Fricatives and plosives → UV (unvoiced)
Air volume → G (gain)
LPC
It stands for Linear Prediction Coefficients.
X_n = sum_{i=1}^{L} a_i · X_{n−i} + e_n
{X_1, X_2, ..., X_n}: speech samples (spectra)
e_n: prediction error
LPC is the basic technique of CELP. Because CELP uses the prediction method, its bit-rate can be lower.
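A minimal sketch of LPC analysis using the textbook Levinson-Durbin recursion on the signal's autocorrelation (a CELP encoder adds pitch prediction and codebook search on top of this). The sine test signal is an assumed example, chosen because an order-2 predictor models it almost perfectly.

```python
import math

def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of the samples x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations; returns a with A(z) = 1 + sum a[j] z^-j."""
    a = [1.0]
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                     # reflection coefficient
        a = [1.0] + [a[j] + k * a[i - j] for j in range(1, i)] + [k]
        err *= 1.0 - k * k
    return a

# A sine is almost perfectly predictable by an order-2 LPC filter.
x = [math.sin(0.3 * n) for n in range(1000)]
a = levinson_durbin(autocorr(x, 2), 2)
residual = sum((x[n] + a[1] * x[n - 1] + a[2] * x[n - 2]) ** 2 for n in range(2, len(x)))
energy = sum(s * s for s in x)
print(residual / energy)   # tiny: prediction removes almost all the signal energy
```

Because the residual carries so little energy, it needs far fewer bits than the raw samples, which is why prediction lowers the bit-rate.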
CELP Encoder
AC-3
Audio Codec 3
What Is AC-3?
AC-3 refers to a multichannel music compression technology that has been developed by Dolby Laboratories.
Dolby Laboratories has used the term Dolby Digital to refer to this digital system in the film and theater industries, and has used the term Dolby Surround AC-3 to refer to the system in the home theater market.
AC-3 can carry from 1 to 5.1 channels. It provides five full-range channels (3 Hz to 20,000 Hz): three front channels (left, center, and right) plus two surround channels. A sixth bass-only effects channel (3 Hz to 120 Hz) is sometimes called the Low-Frequency Effects (LFE) channel.
How Does AC-3 Work?
It uses lossy compression. Like MP3 or AAC, AC-3 exploits properties of sound to achieve its compression.
Input uncompressed PCM samples must be at 32, 44.1, or 48 kHz, with up to 20 bits per sample.
AC-3 Encoder
AC-3 Decoder
AAC
MPEG-2 Advanced Audio Coding
Advertisement
Because of its exceptional performance and quality, Advanced Audio Coding (AAC) is at the core of the MPEG-4 and 3GPP (3rd Generation Partnership Project) specifications and is the new audio codec of choice for Internet, wireless, and digital broadcast arenas.
AAC provides audio encoding that compresses much more efficiently than older formats such as MP3, yet delivers quality rivaling that of uncompressed CD (Compact Disk) audio.
Why AAC?
The driving force to develop AAC was the quest for an efficient coding method for surround signals, like 5-channel signals (left, right, center, left-surround, right-surround) as being used in cinemas today.
One aim of AAC was a considerable decrease of necessary bit-rate.
Low Delay
Low Delay audio coding is needed whenever some sort of communication is transmitted over low-bandwidth channels in both directions, e.g., live broadcasts on TV (Television) or radio stations, or in mobile phone networks (3G: 3rd Generation).
Both AAC (in its Low Delay variant) and CELP have the low-delay property.
AAC vs. MP3
MPEG-2 AAC is the direct continuation of the highly successful MPEG1 Layer-3 coding method.
The crucial differences between MPEG-2 AAC and its predecessor ISO/MPEG Audio Layer-3 are shown as follows:
Quantization: by allowing finer control of quantization resolution, the given bit rate can be used more efficiently.
Prediction: a technique commonly established in speech coding systems; it benefits from the fact that certain types of audio signals are easy to predict.
Bit-stream format: the information to be transmitted undergoes entropy coding to keep redundancy as low as possible. The optimization of these coding methods, together with a flexible bit-stream structure, has made further improvement of the coding efficiency possible.
WMA
Windows Media Audio
What Is WMA?
It is an audio format by Microsoft.
Its files are roughly half the size of MP3 files, with sound quality similar to MP3.
Because it is proprietary, little is publicly known about the details of its codec.
The Difference between ASF and WMA/WMV
The only difference between ASF files and WMA or WMV files is the file extension and the MIME type.
The MIME type for a WMV file is video/x-ms-wmv, and for WMA it is audio/x-ms-wma. The MIME type for ASF is video/x-ms-asf. The basic internal structure of the files is identical.
MIME: Multipurpose Internet Mail Extensions
WMV: Windows Media Video
ASF: Advanced Systems Format
MIDI
Musical Instrument Digital Interface
What Is MIDI?
MIDI is a method of communication between digital instruments.
It was created in 1982.
Unlike speech or audio coding, MIDI is essentially a kind of musical score; it is unrelated to codecs.
We write musical notes into a MIDI file; the computer then looks up a table for each note's corresponding sound.
Therefore, just by changing the table, we can render the notes as violin, piano, or other instruments.
A MIDI file is much smaller than a general audio file.
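The note-to-sound lookup can be illustrated with the standard equal-temperament mapping from MIDI note numbers to frequencies, where note 69 is A4 at 440 Hz:

```python
# MIDI stores note numbers, not waveforms; the synthesizer maps each
# number to a pitch.  In equal temperament: f = 440 * 2^((note - 69) / 12).

def note_to_freq(note):
    return 440.0 * 2 ** ((note - 69) / 12)

print(note_to_freq(69))             # 440.0 (A4)
print(round(note_to_freq(60), 2))   # 261.63 (middle C)
```

Storing one byte per note instead of thousands of samples per second is why a MIDI file is so much smaller than an audio file.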
Use What Is Fitting
MPEG1 Layer2 at 192 Kbps was significantly better than AAC at 96 Kbps in 7 of 8 cases, and better than AAC at 128 Kbps in 6 of 8 cases. Under twice-cascaded coding, the quality of AAC was much inferior to Layer2. It should also be noted that there is a significant difference in processing delay between Layer2 (approximately 70 ms) and AAC (about 300 ms).