TRANSCRIPT
Speech Processing
Simon King, University of Edinburgh
additional lecture slides for 2018-19
Two courses in one
• Undergraduate course code: LASC10061
• Postgraduate course code: LASC11065
• Shared:
  • course materials, lectures, labs
  • exam/coursework weightings
• Differences:
  • undergraduate course: 20 credits
  • postgraduate course: 10 credits
  • coursework expectations
Assessment
• Three items of assessment: two practical lab-based assignments and an exam
  1. First assignment: speech synthesis practical write-up (20%)
  2. Second assignment: speech recognition practical write-up (30%)
  3. Multiple choice exam in December (50%)
• Assignment due dates are given in the weekly schedule
• Exam date is set by the University and will be published in due course
Introduction to the course
• learning outcomes
• delivery
• timetable
• background required
• course coverage
• assignments
Learning outcomes
• Overview of the components of speech recognition and speech synthesis systems
• Understand the main concepts and what each component does
• Describe a simple version of each component
• See what the difficult problems are in recognition and synthesis
• Use tools for visualising and manipulating speech waveforms
• Experiment with two speech technology systems (Festival & HTK)
• See knowledge and skills from different areas come together in an interdisciplinary field
Delivery - speech.zone website & Learn
• The website speech.zone contains almost everything you will need
  • video material, reading lists, forums, calendar, coursework instructions
• You must have an account on this site, so that you can post on the forums - make sure you can log in, and email me if you have any trouble
• You only need to use Learn to:
  • sign up for a lab group
  • access the optional Introduction to Unix material
  • submit your coursework
Delivery - what I provide
• The material on speech.zone is divided into modules
• The video content provides only the bare bones and there are gaps in coverage
• You need to flesh out the details by taking full advantage of other modes of learning:
  • readings
  • lectures, including activities, quizzes, etc.
  • labs
  • forums
Delivery - what you need to do
• Every week, you will watch all the videos and do the essential readings before the lecture
• You can then post questions on the forums
  • feel free to ask questions at any level - I’ll tell you if you go “off topic”
  • try to choose an appropriate forum / topic / thread (I will re-organise as required)
  • often, the most basic questions are the most helpful
• Lectures will help you synthesise all the different sources of information and different ways of learning. No single source will be enough on its own.
• To do well in this course, you need to be an active learner, not a passive observer
  • don’t worry - this course does not require much “audience participation”
Laptops are the single most reported distracter to both users and fellow students. (Fried 2008, Zhu et al 2011)
Laptop & device use is negatively correlated with understanding of course material and overall course performance. (Hembrooke & Gay 2003, Fried 2008, Patterson & Patterson 2017)
The negative effects of device use are greatest among males and low-performing students. (Patterson & Patterson 2017)
References
Carrie B. Fried (2008). In-class laptop use and its effects on student learning. Computers & Education 50(3): 906-914.
Helene Hembrooke and Geri Gay (2003). The Laptop and the Lecture: The Effects of Multitasking in Learning Environments. Journal of Computing in Higher Education 15(1): 46-64.
Richard W. Patterson, Robert M. Patterson (2017). Computers and productivity: Evidence from laptop use in the college classroom. Economics of Education Review 57: 66-79.
Erping Zhu, Matthew Kaplan, R. Charles Dershimer, Inger Bergom (2011). Use of Laptops in the Classroom: Research and Best Practices. University of Michigan CRLT Occasional Paper 30.
Students who use digital devices in class 'perform worse in exams' by Richard Adams, 11 May 2016.
Students are Better Off without a Laptop in the Classroom by Cindi May, 11 July 2017.
Why smart kids shouldn’t use laptops in class by Jeff Guo, 16 May 2016.
Image credits: CC BY Simon Fokt; Washington Post illustration; iStock
Delivery - lecture recording
• The video clips on speech.zone are extracted from previous lectures
• Lectures are also recorded, but…
  • sometimes the technology fails
  • recordings will not make any sense for interactive parts: activities, quizzes, group exercises, etc.
• PDF slides on speech.zone approximately match the video clips
• Slides used in lectures are improved every year, and will be made available as we proceed
  • these slides are dynamically updated in response to your questions
  • usually, they become available shortly before each lecture
Delivery - keep making it better
• Please give me feedback about course structure and delivery mode throughout the course (email, forum posts, verbally, class reps, PPLS teaching offices, notes slipped under my office door, …)
• I also want feedback on speech.zone:
  • is it clearly organised?
  • is the website reliable and fast enough?
  • is it obvious what relates to this course, and what does not?
  • does everything work correctly on your device?
Delivery - keep making it better
tour of the website
Timetable
• Lecture - in two parts
  • Thursday 09:00 – 09:50
    • type of content varies - see the course calendar
    • for foundation material, you decide whether to attend
  • Thursday 10:00 – 10:50 - always attend
• Each part of the lecture will start and finish at the above times (except today)
• Labs - you need to attend one session every week (sign up for a group on Learn)
  • group 1: Tuesday 14:10 – 16:00
  • group 2: Wednesday 11:10 – 13:00
What background do you need?
• This course involves:
  • Linguistics: phonetics, phonology, intonation, (perhaps) syntax
  • Mathematics: statistics and probability, optimisation
  • Engineering: practical implementations, empirical findings
  • Computer science: algorithms, efficient implementation
• So, there will be something new for everyone
What background do you need?
• No single background is assumed, but…
• If everything on that list is new to you, then you will probably find the course too hard
• You’ll get the most out of this course if you know a little bit about:
  • speech (e.g., phonetics)
  • sound in general (including music and audio engineering)
  • engineering
  • computer science
Roadmap
• Modules 1-2: The basics
• Modules 3-5: Speech synthesis
• Modules 6-9: Speech recognition
• What is examinable?
  • everything in the videos
  • all “Essential” readings
Image source: https://www.amazon.co.uk
Image source: https://developer.amazon.com
Image source: https://developer.amazon.com
Automatic Speech Recognition (ASR)
Text-to-Speech (TTS), also called “Speech Synthesis”
Modules 1-2: The basics
• Waveform, spectrum, spectrogram
• Speech production, speech perception
• Acoustic phonetics
A speech waveform
[figure: a speech waveform (amplitude vs. time) and its spectrum (magnitude on a log scale vs. frequency, 0 – 8 kHz)]
Key concepts we will understand
• Sound as a pressure wave
• Speech waveforms
• Short-term analysis, converting from the time domain to the frequency domain
• Speech production
  • possible sources of sound
  • different ways in which that sound is shaped into different vowels and consonants
• Speech perception
  • hearing
  • forming categories
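Short-term analysis - converting one frame of a waveform from the time domain to the frequency domain - can be sketched in a few lines. This is only an illustrative sketch, assuming NumPy: a pure sine wave stands in for a speech signal, and the sample rate, frame length, and frequency values are made up for the example.

```python
import numpy as np

# One second of a 200 Hz sine wave at a 16 kHz sample rate
# (a stand-in for a voiced speech signal; values are illustrative)
fs = 16000
t = np.arange(fs) / fs
waveform = np.sin(2 * np.pi * 200 * t)

# Short-term analysis: take one 25 ms frame, apply a tapered (Hamming) window,
# then move to the frequency domain with the DFT
frame_len = int(0.025 * fs)  # 400 samples
frame = waveform[:frame_len] * np.hamming(frame_len)
spectrum = np.fft.rfft(frame)
magnitude_db = 20 * np.log10(np.abs(spectrum) + 1e-10)  # log scale, as in a spectrogram

# Frequency axis in Hz; the peak should sit near the sine's 200 Hz
freqs = np.fft.rfftfreq(frame_len, d=1 / fs)
peak_hz = freqs[np.argmax(magnitude_db)]
```

Repeating this for every (overlapping) frame, and stacking the log-magnitude spectra side by side, is exactly how a spectrogram is built.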
Modules 3-5: Speech synthesis
• Text processing - normalisation of non-standard words
• Pronunciation - lexicons and letter-to-sound modelling
• Prosody - phrasing and pitch accents
• Waveform generation - concatenation of pre-stored units
• Prosodic manipulation - time and pitch modification in the time domain
The classic two-stage pipeline of text-to-speech synthesis
[figure: text → front end → linguistic specification → waveform generator → waveform. Example input “Author of the …” with its linguistic specification: phones (sil ao th er ah f dh ax …), syllables, POS tags (NN of DT), and stress values (1 0 0 0)]
The linguistic specification
[figure: feature extraction by the front end - the text “Author of the …” is converted into a linguistic specification of phones (sil ao th er ah f dh ax …), syllables, POS tags (NN of DT), and stress values (1 0 0 0)]
Text processing pipeline
[figure: the front end is a pipeline - tokenize → POS tag → LTS → phrase breaks → intonation - mapping text to the linguistic specification; each component is individually learned from labelled data]
Tokenize & Normalize
• Step 1: divide the input stream into tokens, which are potential words
• For English and many other languages:
  • rule-based
  • whitespace and punctuation are good features
• For some other languages, especially those that don’t use whitespace:
  • this may be more difficult
  • other techniques are required (out of scope here)
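A minimal rule-based tokenizer of the kind sketched above, using whitespace and punctuation as the features, might look like this. The function name and regular expression are purely illustrative (this is not Festival's tokenizer):

```python
import re

def tokenize(text):
    """Minimal rule-based tokenizer: split on whitespace, then detach
    punctuation so that each remaining token is a potential word."""
    tokens = []
    for chunk in text.split():
        # words (allowing internal apostrophes/hyphens) or single punctuation marks
        tokens.extend(re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", chunk))
    return tokens

print(tokenize("In 2011, I spent £100 at IKEA."))
```

Note that even this toy example must decide questions a real front end faces: is “£100” one token or two, and does the comma belong to “2011”?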
Tokenize & Normalize
• Step 2: classify every token, finding Non-Standard Words (NSWs) that need further processing

In 2011, I spent £100 at IKEA on 100 DVD holders.
(2011 → NYER, £100 → MONEY, IKEA → ASWD, 100 → NUM, DVD → LSEQ)
Tokenize & Normalize
• Step 3: a set of specialised modules processes NSWs of each type

2011 → NYER → twenty eleven
£100 → MONEY → one hundred pounds
IKEA → ASWD → apply letter-to-sound
100 → NUM → one hundred
DVD → LSEQ → D. V. D. → dee vee dee
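A toy expansion module for the NUM and NYER types, in the spirit of the table above, can be written as simple rules. This is only a sketch - real normalisation systems cover many more cases (ordinals, decimals, years like 2000), and the function names are invented for the example:

```python
def expand_num(n):
    """Expand an integer 0-999 into words - a tiny NUM-type NSW module."""
    units = ["zero", "one", "two", "three", "four", "five", "six", "seven",
             "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
             "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
    tens = ["", "", "twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety"]
    if n < 20:
        return units[n]
    if n < 100:
        return tens[n // 10] + ("" if n % 10 == 0 else " " + units[n % 10])
    rest = "" if n % 100 == 0 else " and " + expand_num(n % 100)
    return units[n // 100] + " hundred" + rest

def expand_nyer(year):
    """Expand a four-digit year as two pairs, e.g. 2011 -> 'twenty eleven'.
    (Simplified: years such as 2000 would need extra rules.)"""
    return expand_num(year // 100) + " " + expand_num(year % 100)
```

The point of the example: the same digit string (“100” vs. “2011”) expands differently depending on the NSW class assigned in Step 2.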
POS tagging
• Part-of-speech tagger
  • accuracy can be very high
  • trained on annotated text data
  • categories are designed for text, not speech

Tagged example: Director/NN of/IN the/DT McCormick/NP Public/NP Affairs/NPS Institute/NP at/IN U-Mass/NP Boston/NP, Doctor/NP Ed/NP Beard/NP, says/VBZ the/DT push/NN for/IN do/VBP it/PP yourself/PP lawmaking/NN
Pronunciation / LTS
• Pronunciation model:
  • dictionary look-up, plus
  • letter-to-sound model
• But:
  • need deep knowledge of the language to design the phoneme set
  • a human expert must write the dictionary

[dictionary excerpt, CMUdict-style, from ADVOCATING (AE1 D V AH0 K EY2 T IH0 NG) through AERIAL (EH1 R IY0 AH0 L) to AEROBATIC (EH2 R AH0 B AE1 T IH0 K)]
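The look-up-then-fall-back behaviour described above can be sketched in a few lines. The mini-dictionary uses ARPAbet-style entries, but the letter-to-sound rules here are deliberately naive, invented for illustration - real LTS models are learned from the dictionary itself:

```python
# Toy pronunciation model: dictionary look-up with a letter-to-sound fallback.
LEXICON = {
    "ADZ": ["AE1", "D", "Z"],
    "AERO": ["EH1", "R", "OW0"],
}

# Naive one-letter-to-one-phone fallback rules (an assumption for illustration;
# real letter-to-sound models are far more sophisticated)
LTS_RULES = {"A": "AE", "B": "B", "C": "K", "D": "D", "E": "EH",
             "I": "IH", "K": "K", "S": "S", "Z": "Z"}

def pronounce(word):
    word = word.upper()
    if word in LEXICON:          # dictionary look-up first
        return LEXICON[word]
    # letter-to-sound fallback for out-of-vocabulary words
    return [LTS_RULES[ch] for ch in word if ch in LTS_RULES]
```

The dictionary gives high-quality pronunciations for known words; the LTS model exists only to generalise to previously-unseen words.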
Key concepts we will understand
• Breaking a complex problem down into simpler steps
• Combining many components into a single architecture
  • representing information in data structures
• The pros and cons of rules vs. learning from data
• Generalising to previously-unseen words or sentences
• Creating new utterances from fragments of pre-recorded speech
• Manipulating the pitch and duration of speech
Modules 6-9: Speech recognition
• Dynamic Time Warping
• Probability distributions
• Hidden Markov Models
• Bayes’ Rule
• Viterbi algorithm
• Training HMMs
• Simple language models
Key concepts we will understand
• Simple pattern recognition by comparing to stored examples
• The concept of generative models
  • and how they can be used to make a classifier
• The importance of choosing the right representation of the speech signal
  • feature extraction
  • feature engineering
• Combining prior knowledge and new evidence in a principled probabilistic framework
  • the Bayesian approach
  • modelling variability and uncertainty using probability distributions
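Turning a generative model into a classifier via Bayes’ rule can be sketched with two one-dimensional Gaussian class models. Every distribution and number here is invented for illustration (real recognisers model sequences of feature vectors with HMMs, not single values):

```python
import math

# Two classes, each described by a prior and a 1-D Gaussian generative model
# of the observation. All numbers are made up for illustration.
classes = {
    "yes": {"prior": 0.5, "mean": 1.0, "var": 0.5},
    "no":  {"prior": 0.5, "mean": -1.0, "var": 0.5},
}

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify(x):
    # Bayes' rule: posterior is proportional to likelihood x prior
    # (the evidence term is the same for all classes, so it cancels)
    scores = {c: p["prior"] * gaussian_pdf(x, p["mean"], p["var"])
              for c, p in classes.items()}
    return max(scores, key=scores.get)
```

The generative model answers “how likely is this observation given the class?”; Bayes’ rule combines that with the prior to answer the question we actually care about: “which class, given the observation?”.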
Roadmap
• along the way, we will see some important techniques, models, and algorithms — including machine learning — that are useful in other applications
Some of the things we will learn
• Basic digital signal processing
• Text processing - rules vs. learning from data
• Classification and Regression Trees (CARTs)
• Dynamic programming - Dynamic Time Warping (DTW); the Viterbi algorithm
• Generative models - Hidden Markov Models (HMMs)
• Probability - Bayes’ Rule
• Expectation-Maximisation - the Baum-Welch algorithm
• Finite state networks - N-gram language models
• Feature extraction and engineering - Mel Frequency Cepstral Coefficients (MFCCs)
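Of the techniques listed above, Dynamic Time Warping is small enough to sketch in full: it is a dynamic-programming recursion over an alignment grid. The sketch below works on 1-D sequences with an absolute-difference local cost; a real system would compare frames of MFCC vectors instead:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping between two 1-D sequences.
    Returns the minimum total alignment cost, computed by dynamic programming
    with local cost = absolute difference."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # allowed moves into (i, j): diagonal match, or a step in either sequence
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]
```

For example, `[1, 2, 3]` aligns to `[1, 2, 2, 3]` with zero cost, because warping lets the single `2` stretch across both `2`s - exactly how DTW absorbs differences in speaking rate.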
Lab facilities
• All practicals are done on Apple computers, running OS X
• Appleton Tower 4.02: fully-supported computers, file backup, 24/7 access
  • we’ll automatically create accounts for everyone - look out for announcements
  • we have no resources to support use of your own Mac (so, for experts only)
• Unix/Linux/command line
  • basics: Terminal, mv, cp, cd, mkdir, starting programs - intro sessions available
• Never switch off machines, just log out
• Lab access
  • via matriculation card at any time (PIN required out of hours)
  • follow instructions on Learn to get your card activated
assignments are described on the website
end of first half of lecture 01
The basics
• How speech is produced and perceived
• Waveform, spectrum, spectrogram
How speech is produced and perceived
• Speech production
  • sound source, vocal tract, source-filter model
• Speech perception
  • the auditory system
What you should already know
• Vocal tract is a tube of variable shape
• A tube (containing air) acts as a resonator for sound waves
• The frequencies of the resonances depend on the shape of the tube
Orientation
• Our goal is now:
  • to build on our basic understanding of the acoustics of the vocal tract
  • to understand speech production
  • to explain what we observe in speech signals
• See other material on:
  • digital signals
  • phonetics
Vocal tract anatomy
• Vocal tract is a tube
• Shape can be changed by moving the tongue, jaw and lips
• The nasal branch can be connected by lowering the velum
• The tongue is larger than you might have thought - a complex set of muscles
• The nasal cavity is surprisingly large
Image credit: Department of Mathematics and Systems Analysis, Aalto University, School of Science
The vocal tract is a resonator. A resonator can act as a filter. So, what is a filter?
[figure: input signal → filter → output signal]
Filtering in the time domain
[figure: input signal e[t] → filter → output signal y[t]]

y[t] = e[t] - Σ_{k=1}^{K} b_k y[t-k]
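The difference equation above can be implemented directly as a loop over time. This is a naive sketch assuming NumPy; in a real source-filter model the coefficients would come from analysis of speech, whereas here they are arbitrary:

```python
import numpy as np

def allpole_filter(e, b):
    """Apply the all-pole (recursive) filter
        y[t] = e[t] - sum_{k=1}^{K} b[k-1] * y[t-k]
    to an input signal e. The coefficients b are illustrative, not derived
    from a real vocal tract."""
    y = np.zeros(len(e))
    for t in range(len(e)):
        acc = e[t]
        # each output sample depends on the K previous *output* samples
        for k in range(1, len(b) + 1):
            if t - k >= 0:
                acc -= b[k - 1] * y[t - k]
        y[t] = acc
    return y
```

Because the output feeds back into itself, even a single impulse at the input produces a decaying response - this "ringing" is what makes the filter a resonator.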
Two possible sources: periodic and aperiodic (non-periodic, random)
[figure: source e[t] → filter y[t] = e[t] - Σ_{k=1}^{K} b_k y[t-k] → output speech y[t]]
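The two source types above can be generated in a few lines - a sketch assuming NumPy, with an illustrative sample rate and F0:

```python
import numpy as np

fs = 16000  # sample rate in Hz (illustrative)

# Periodic source: an impulse train at F0 = 100 Hz, the excitation for voiced speech
f0 = 100
periodic = np.zeros(fs)            # one second of signal
periodic[::fs // f0] = 1.0         # one impulse per pitch period (every 160 samples)

# Aperiodic source: white noise, the excitation for unvoiced speech
rng = np.random.default_rng(0)
aperiodic = rng.standard_normal(fs)
```

Passed through the same vocal-tract filter, the periodic source yields voiced sounds (with a pitch determined by F0) and the aperiodic source yields unvoiced, noise-like sounds.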
Describing the source-filter model in the frequency domain
[figure: source → filter → output speech, viewed in the frequency domain: the output spectrum is the source spectrum shaped by the filter’s frequency response]
Image credit: Rice University (CC-BY-4.0)
What next?
• We have now learned about:
  • speech signals
    • speech production
    • the source-filter model
  • speech perception
    • the filterbank
    • the Mel scale
Speech synthesis
Automatic speech recognition