SGN-14006 Audio and Speech Processing: Introduction


Post on 29-Jun-2020






  • Introduction 1 SGN-14006 / A.K.

    SGN–14006 Audio and Speech Processing

    Lectures, Fall 2015
    Pasi Pertilä, Tampere University of Technology (slides by Anssi Klapuri)

    Course goals

    - Learn basics of audio signal processing
      – Basic operations and their underlying ideas and principles
      – Give basic skills, although not all of the latest cutting-edge algorithms can be covered
    - Learn fundamentals of speech processing
      – Speech production and its computational modeling
      – Acoustic features to represent speech signals
      – Some applications: speech coding, synthesis
    - Learn the basics of acoustics and human hearing
      – These form the foundation for technical applications

    Lecture timeline (some changes may still take place)

    - Sound, audio signals, acoustics
    - Hearing
    - Basic audio signal processing operations
      – AD/DA conversion, filters and filter banks, dynamic control, etc.
    - Sound synthesis
    - Audio coding
    - Speech production anatomy, phonetics
    - Linear prediction, MFCCs, and cepstrum
    - Speech coding
    - Speech synthesis

    What is not covered by this course

    - Speech recognition, audio content analysis, and acoustic pattern recognition
      → Course SGN-24006 "Analysis of Audio, Speech and Music Signals" (period 4)
    - Analog audio
      – Electroacoustics, microphone and loudspeaker design
      → See the course "Akustiikan mittaukset" (Acoustic Measurements)
    - Hardware implementations

    Practical arrangements

    - Course homepage:
    - Lectures
      – Mondays 12-14 in TB219
      – Thursdays 14-16 in TB222
      – Pasi Pertilä, pasi.pertila @
    - Lecture slides will be available as pdf on the course page
      – Course is not based on any individual textbook. Lectures, lecture notes and exercises will be sufficient to take the exam.
      – Some recommended textbooks are mentioned at the end of this introduction
    - Requirements: exam and project work
    - 5 cr

    Exercises

    - Exercises start one week after the lectures (2.9.2015)
    - Assistants: Shriram Nandakumar, Emre Cakir
    - Contents: math and Matlab exercises related to the lectures
    - Two alternative groups
      – Tuesday 10-12 in TC303 (updated!)
      – Friday 12-14 in TC303
      – Register for either group online at 14:00 today
    - Math problems are to be solved in advance; Matlab exercises are done during the sessions
    - Active completion of and participation in the exercises is credited with up to 3 points in the exam (equivalent to one mark)
    - Project work will be discussed at the exercises too

    Project work

    - Implementing an audio signal processing algorithm in Matlab
      – In two-person groups
    - Topic(s) will be introduced later during the lectures
    - Requirements:
      – Choosing the topic
      – Implementing the algorithm
      – Final report by 28.10.
    - More detailed instructions will appear on the course home page

    Reference material

    - Gold, Morgan, Ellis. "Speech and Audio Signal Processing," Wiley, 2011.
    - Zölzer. "Digital Audio Signal Processing," Wiley & Sons, 2nd ed., 2008.
      – Including AD/DA conversion, dynamic control, equalization, filter banks
    - T.F. Quatieri. "Discrete-Time Speech Signal Processing: Principles and Practice," Prentice Hall PTR, 2002.
    - Rossing. "The Science of Sound," Addison-Wesley, 1990.
      – Acoustics, hearing
    - Brandenburg, Kahrs. "Applications of Digital Signal Processing to Audio and Acoustics," Kluwer Academic Publishers, 1998.
      – Chapter on perceptual audio coding
    - Pulkki, Karjalainen. "Communication Acoustics," Wiley, 2015.


    Introduction to audio signals and their representation

    Audio signals

    - Audio = related to sound or hearing
    - The word sound may mean
      1. a sensation perceived by the auditory system, or
      2. longitudinal pressure waves in a material medium (such as air) that may cause a hearing sensation
      – Due to human hearing, we usually consider the frequency range 20 Hz – 20 kHz and air as the medium (although hearing also works underwater, for example)
    - Sound signal – audio signal
      – Numerical representation of sound
      – Sound pressure as a function of time, measured using a microphone for example
    - Note: audio signal is often understood as non-speech audio signal, although speech signals are audio too

    Audio and speech processing

    - Where is audio and speech processing needed?
    - Examples:
      – Convert a musical piece into compressed mp3 format and store it on a hard disk for playback later (audio coding)
      – Encode a speech signal on a mobile phone before transmission
      – Add reverberation to a sound, correct the pitch of a singer (studio technology)
      – Enhance the quality of a speech signal (denoising, echo cancellation)
      – Compensate for loudspeaker non-idealities by digital equalization
    - Typical digital signal processing system:
      1. Digitize a signal (sampling, quantization)
      2. Process in digital form (store, manipulate, etc.)
         – digital representation enables a variety of algorithms
      3. Convert back to an analog signal
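
    The three steps of the typical system can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a model of a real converter: the "analog" input is a hypothetical 1 kHz sine evaluated by a function, and the 8 kHz sampling rate, 16-bit word length and gain of 0.5 are values chosen for the example.

```python
import math

FS = 8000                          # sampling rate in Hz (assumed for this sketch)
BITS = 16                          # word length of the quantizer (assumed)
FULL_SCALE = 2 ** (BITS - 1) - 1   # largest positive integer code, 32767


def analog(t):
    """Stand-in for an analog input: a 1 kHz sine with amplitude 0.8."""
    return 0.8 * math.sin(2 * math.pi * 1000 * t)


def digitize(num_samples):
    """Step 1: sample at FS and quantize each sample to an integer code."""
    return [round(analog(n / FS) * FULL_SCALE) for n in range(num_samples)]


def process(codes, gain=0.5):
    """Step 2: any digital operation; here a simple gain change."""
    return [round(gain * c) for c in codes]


def to_amplitudes(codes):
    """Step 3 (numerical part only): map integer codes back to [-1, 1].
    A real D/A converter would also interpolate between the samples."""
    return [c / FULL_SCALE for c in codes]


codes = digitize(16)
output = to_amplitudes(process(codes))
```

    Rounding to integer codes is the quantization step; everything between `digitize` and `to_amplitudes` operates on plain numbers, which is why such a wide range of algorithms can be applied in the middle stage.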

    Audio signal representations

    - Different applications employ different representations
      – Time domain representation
      – Frequency domain representation
      – Time-frequency domain representation
    - On this course we consider mainly music and speech
      – Music signals involve a wide variety of sounds, billions of people listen to music worldwide
      – Speech signals are an important special category of sound signals due to their importance for communication

    Time domain signal

    - Air pressure level as a function of time (zero level = normal air pressure) is a natural representation for audio
      – An analog signal is easy to record using a microphone and play back using a loudspeaker
    - For music, typical sampling rates are 44.1 or 48 kHz
      – Allows for representing the frequency range of human hearing (approximately 20 Hz – 20 kHz)
    - For speech
      – 8 kHz: Narrowband
        • the conventional telephone rate (fricatives such as /s/ and /f/ are distorted)
      – 16 kHz: Wideband
        • voice over IP, bandwidth extension
    - Other rates are also widely used: 96, 32, 22.05 kHz, etc.
    - Most of the energy (and information) of natural sounds is at low frequencies (around 200 Hz – 5 kHz)
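
    As a quick check of the sampling rates listed on this slide, the sketch below computes the highest representable frequency (half the sampling rate) for each of them and compares it with the 20 kHz hearing limit. The dictionary labels and the helper name `nyquist_hz` are illustrative choices, not part of the course material.

```python
def nyquist_hz(sampling_rate_hz):
    """Highest frequency representable without aliasing: half the sampling rate."""
    return sampling_rate_hz / 2


# Sampling rates quoted on the slide; the labels are only for readability.
RATES_HZ = {
    "speech, narrowband": 8000,
    "speech, wideband": 16000,
    "music (44.1 kHz)": 44100,
    "music (48 kHz)": 48000,
}

HEARING_LIMIT_HZ = 20000  # approximate upper limit of human hearing

for name, fs in RATES_HZ.items():
    covers = nyquist_hz(fs) >= HEARING_LIMIT_HZ
    print(f"{name}: fs = {fs} Hz, Nyquist = {nyquist_hz(fs):.0f} Hz, "
          f"covers 20 kHz: {covers}")
```

    At 8 kHz only content below 4 kHz survives, which is consistent with the remark that some speech sounds are distorted at the telephone rate.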

    Time domain signal (1)

    - An analog signal (solid line) can be represented with discrete samples (dots) without loss of information if the sampling frequency is more than 2 × the highest frequency component in the signal
      – Remember from introductory signal processing courses
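
    What goes wrong when the sampling condition is violated can be seen numerically. In the sketch below, which assumes an 8 kHz sampling rate and two cosine test tones chosen for the example, a 5 kHz tone (above half the sampling rate) produces the same samples as a 3 kHz tone, so the two cannot be told apart after sampling.

```python
import math


def sample_cosine(freq_hz, fs_hz, num_samples):
    """Return num_samples samples of cos(2*pi*f*t) taken at rate fs_hz."""
    return [math.cos(2 * math.pi * freq_hz * n / fs_hz)
            for n in range(num_samples)]


FS = 8000                                   # sampling rate in Hz (assumed)
ok_tone = sample_cosine(3000, FS, 32)       # below FS/2 = 4 kHz
aliased_tone = sample_cosine(5000, FS, 32)  # above FS/2

# Largest sample-by-sample difference between the two sequences.
max_diff = max(abs(a - b) for a, b in zip(ok_tone, aliased_tone))
print(f"max difference: {max_diff:.2e}")
```

    The difference is at the level of floating-point rounding error: the 5 kHz component has aliased to 8 kHz − 5 kHz = 3 kHz.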

    Time domain signal (2)

    - Large time scale illustrates the sound amplitude envelope
    - Example signal: one note from the oboe
      – Amplitude is zero before the sound starts
      – The oboe has continuous excitation, therefore the sound's amplitude envelope remains nearly constant throughout its duration
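
    One simple way to obtain an amplitude envelope is to compute the RMS level of consecutive short frames. The sketch below does this for a synthetic stand-in signal (silence followed by a steady 440 Hz tone), since the oboe recording itself is not available here; the sampling rate, frame length and test signal are all assumptions made for the example.

```python
import math

FS = 8000  # sampling rate in Hz (assumed)


def synthetic_note():
    """0.1 s of silence followed by 0.4 s of a 440 Hz tone with amplitude 0.5."""
    silence = [0.0] * int(0.1 * FS)
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / FS)
            for n in range(int(0.4 * FS))]
    return silence + tone


def rms_envelope(signal, frame_len):
    """RMS level of consecutive non-overlapping frames."""
    envelope = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        envelope.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return envelope


envelope = rms_envelope(synthetic_note(), frame_len=400)  # 50 ms frames
```

    The envelope is zero for the frames before the onset and then stays at about 0.5/√2 ≈ 0.354, mirroring the "zero, then nearly constant" shape described for a continuously excited instrument.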

    Time domain signal (3)

    - Zoom-in of the same oboe signal at time t = 0.45 s
    - 90 ms frame illustrates the periodic waveform
      – Many sounds are periodic, for example most musical instrument sounds and vowels in speech
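
    Periodicity can also be checked numerically rather than by eye. The sketch below uses autocorrelation, a technique the slides do not prescribe at this point, to find the period of a synthetic 90 ms frame. The test signal (three harmonics of 250 Hz), the 8 kHz sampling rate and the lag search range are assumptions made for the example.

```python
import math

FS = 8000  # sampling rate in Hz (assumed)


def harmonic_tone(f0_hz, num_samples):
    """Periodic test signal: three harmonics of f0 with decreasing amplitude."""
    return [sum(math.sin(2 * math.pi * k * f0_hz * n / FS) / k
                for k in (1, 2, 3))
            for n in range(num_samples)]


def estimate_period(signal, min_lag, max_lag):
    """Return the lag (in samples) that maximizes the autocorrelation."""
    num_terms = len(signal) - max_lag  # same number of products for every lag

    def autocorr(lag):
        return sum(signal[n] * signal[n + lag] for n in range(num_terms))

    return max(range(min_lag, max_lag + 1), key=autocorr)


frame = harmonic_tone(250, 720)                       # 90 ms at 8 kHz
period = estimate_period(frame, min_lag=16, max_lag=48)
f0_estimate = FS / period                             # in Hz
```

    The search range is limited to less than two periods so that the lag of one full period is the only maximum; a wider range would also find multiples of the period.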

    Frequency domain representation – spectrum

    - Obtained by computing the discrete Fourier transform (for example) of the time-domain signal, usually in a short frame
    - Many perceptually important properties are more clearly visible in the frequency domain
    - Decibel scale for amplitude is useful from the viewpoint of human hearing and the dynamics of natural sounds
      – Due to Fechner's law (subjective sensation is proportional to the logarithm of the stimulus intensity)
    - Phases are perceptually less important – often omitted
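
    A minimal sketch of the spectrum computation described on this slide: a direct DFT of one short frame, followed by conversion of the magnitudes to decibels, with the phase discarded. The frame length, sampling rate and 1 kHz test tone are assumptions chosen so that the tone falls exactly on a DFT bin; practical analysis would normally use an FFT routine and a window function.

```python
import cmath
import math


def dft(frame):
    """Direct O(N^2) discrete Fourier transform of a real or complex frame."""
    n_points = len(frame)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * n / n_points)
                for n, x in enumerate(frame))
            for k in range(n_points)]


def magnitude_db(spectrum, floor=1e-12):
    """Magnitude in decibels; the floor avoids log10(0). Phase is discarded."""
    return [20 * math.log10(max(abs(c), floor)) for c in spectrum]


FS = 8000   # sampling rate in Hz (assumed)
N = 64      # frame length in samples (8 ms)
frame = [math.sin(2 * math.pi * 1000 * n / FS) for n in range(N)]

spectrum_db = magnitude_db(dft(frame))

# Search only bins 0..N/2, since the spectrum of a real signal is symmetric.
peak_bin = max(range(N // 2 + 1), key=lambda k: spectrum_db[k])
peak_hz = peak_bin * FS / N
```

    Bin k corresponds to frequency k · FS / N, so the 1 kHz tone shows up as a single peak in bin 8.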

    Consider log-frequency and dB-magnitude

    - Linear scale
      – usually hard to "see" anything
    - Log-frequency
      – each octave is approximately equally important perceptually
    - Log-magnitude
      – perceived change from 50 dB to 60 dB is about the same as from 60 dB to 70 dB
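
    The two logarithmic scales can be made concrete with a few helper functions. The function names are illustrative; the formulas are the standard definitions of an octave (a factor of 2 in frequency) and of the decibel for amplitude ratios (20·log10).

```python
import math


def octaves_between(f_low_hz, f_high_hz):
    """Distance between two frequencies on a log2 (octave) scale."""
    return math.log2(f_high_hz / f_low_hz)


def level_change_db(ratio):
    """Level difference in dB corresponding to an amplitude ratio."""
    return 20 * math.log10(ratio)


def amplitude_ratio(db_change):
    """Amplitude ratio corresponding to a level difference in dB."""
    return 10 ** (db_change / 20)


# 100 -> 200 Hz and 5 -> 10 kHz are both one octave, although the
# linear distances (100 Hz vs. 5000 Hz) differ by a factor of 50.
low_octave = octaves_between(100, 200)
high_octave = octaves_between(5000, 10000)

# Every 10 dB step multiplies the amplitude by the same factor (about 3.16),
# whether it goes from 50 to 60 dB or from 60 to 70 dB.
step_ratio = amplitude_ratio(10)

# The hearing range 20 Hz - 20 kHz spans just under ten octaves.
hearing_range_octaves = octaves_between(20, 20000)
```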

    Time-frequency representation – spectrogram

    - Shows sound intensity as a function of time and frequency
    - Obtained by blocking the signal into short analysis frames and by computing their spectra
    - For audio, the frame size is typically 10–100 ms: sound spectra are often nearl
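
    The frame-and-transform procedure for a spectrogram can be sketched as follows. This is a bare-bones illustration under stated assumptions: 20 ms non-overlapping rectangular frames, a direct DFT, and a synthetic test signal that switches from 500 Hz to 1500 Hz halfway through. A practical implementation would typically use overlapping windowed frames and an FFT.

```python
import cmath
import math

FS = 8000      # sampling rate in Hz (assumed)
FRAME = 160    # 20 ms frames, inside the 10-100 ms range quoted on the slide
HOP = 160      # non-overlapping frames to keep the sketch short


def frame_spectrum_db(frame):
    """dB magnitudes of DFT bins 0..N/2 for one frame (direct O(N^2) DFT)."""
    n_points = len(frame)
    mags = []
    for k in range(n_points // 2 + 1):
        c = sum(x * cmath.exp(-2j * cmath.pi * k * n / n_points)
                for n, x in enumerate(frame))
        mags.append(20 * math.log10(max(abs(c), 1e-12)))
    return mags


def spectrogram(signal):
    """Per-frame spectra: rows are time frames, columns are frequency bins."""
    return [frame_spectrum_db(signal[s:s + FRAME])
            for s in range(0, len(signal) - FRAME + 1, HOP)]


# Test signal: 0.1 s at 500 Hz followed by 0.1 s at 1500 Hz.
signal = ([math.sin(2 * math.pi * 500 * n / FS) for n in range(800)] +
          [math.sin(2 * math.pi * 1500 * n / FS) for n in range(800)])

spec = spectrogram(signal)

# Frequency of the strongest bin in each frame.
peak_hz = [max(range(len(row)), key=row.__getitem__) * FS / FRAME
           for row in spec]
```

    Each row of `spec` is the spectrum of one frame, so reading down the rows shows how the spectrum evolves over time; here the strongest bin moves from 500 Hz to 1500 Hz at the midpoint.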

