TRANSCRIPT
Speech Processing
Simon King, University of Edinburgh
additional lecture slides for 2018-19
Two courses in one
• Undergraduate course code: LASC10061
• Postgraduate course code: LASC11065
• Shared:
  • course materials, lectures, labs
  • exam/coursework weightings
• Differences:
  • undergraduate course: 20 credits
  • postgraduate course: 10 credits
  • coursework expectations
Assessment
• Three items of assessment: two practical lab-based assignments and an exam
  1. First assignment: speech synthesis practical write-up (20%)
  2. Second assignment: speech recognition practical write-up (30%)
  3. Multiple choice exam in December (50%)
• Assignment due dates are given in the weekly schedule
• Exam date is set by the University and will be published in due course
Introduction to the course
• learning outcomes
• delivery
• timetable
• background required
• course coverage
• assignments
Learning outcomes
• Overview of the components of speech recognition and speech synthesis systems
• Understand the main concepts and what each component does
• Describe a simple version of each component
• See what the difficult problems are in recognition and synthesis
• Use tools for visualising and manipulating speech waveforms
• Experiment with two speech technology systems (Festival & HTK)
• See knowledge and skills from different areas come together in an interdisciplinary field
Delivery - speech.zone website & Learn
• The website speech.zone contains almost everything you will need
  • video material, reading lists, forums, calendar, coursework instructions
• You must have an account on this site, so that you can post on the forums - make sure you can log in, and email me if you have any trouble
• You only need to use Learn to:
  • sign up for a lab group
  • access the optional Introduction to Unix material
  • submit your coursework
Delivery - what I provide
• The material on speech.zone is divided into modules
• The video content provides only the bare bones and there are gaps in coverage
• You need to flesh out the details by taking full advantage of other modes of learning:
  • readings
  • lectures, including activities, quizzes, etc.
  • labs
  • forums
Delivery - what you need to do
• Every week, you will watch all the videos and do the essential readings before the lecture
• You can then post questions on the forums
  • feel free to ask questions at any level - I’ll tell you if you go “off topic”
  • try to choose an appropriate forum / topic / thread (I will re-organise as required)
  • often, the most basic questions are the most helpful
• Lectures will help you synthesise all the different sources of information and different ways of learning. No single source will be enough on its own.
• To do well in this course, you need to be an active learner, not a passive observer
  • don’t worry - this course does not require much “audience participation”
Laptops are the single most reported distracter to both users and fellow students. (Fried 2008, Zhu et al 2011)
Laptop & device use is negatively correlated with understanding of course material and overall course performance. (Hembrooke & Gay 2003, Fried 2008, Patterson & Patterson 2017)
The negative effects of device use are greatest among males and low-performing students. (Patterson & Patterson 2017)
References
Carrie B. Fried (2008). In-class laptop use and its effects on student learning. Computers & Education 50(3): 906-914.
Helene Hembrooke and Geri Gay (2003). The Laptop and the Lecture: The Effects of Multitasking in Learning Environments. Journal of Computing in Higher Education 15(1): 46-64.
Richard W. Patterson, Robert M. Patterson (2017). Computers and productivity: Evidence from laptop use in the college classroom. Economics of Education Review 57: 66-79.
Erping Zhu, Matthew Kaplan, R. Charles Dershimer, Inger Bergom (2011). Use of Laptops in the Classroom: Research and Best Practices. University of Michigan CRLT Occasional Paper 30.
Students who use digital devices in class 'perform worse in exams' by Richard Adams, 11 May 2016.
Students are Better Off without a Laptop in the Classroom by Cindi May, 11 July 2017.
Why smart kids shouldn’t use laptops in class by Jeff Guo, 16 May 2016.
Image credits: CC BY Simon Fokt; Washington Post illustration; iStock
Delivery - lecture recording
• The video clips on speech.zone are extracted from previous lectures
• Lectures are also recorded, but…
  • sometimes the technology fails
  • recordings will not make any sense for interactive parts: activities, quizzes, group exercises, etc.
• PDF slides on speech.zone approximately match the video clips
• Slides used in lectures are improved every year, and will be made available as we proceed
  • these slides are dynamically updated in response to your questions
  • usually, they become available shortly before each lecture
Delivery - keep making it better
• Please give me feedback about course structure and delivery mode throughout the course (email, forum posts, verbally, class reps, PPLS teaching offices, notes slipped under my office door, …)
• I also want feedback on speech.zone:
  • is it clearly organised?
  • is the website reliable and fast enough?
  • is it obvious what relates to this course, and what does not?
  • does everything work correctly on your device?
Delivery - keep making it better
tour of the website
Timetable
• Lecture - in two parts
  • Thursday 09:00 – 09:50
    • type of content varies - see the course calendar
    • for foundation material, you decide whether to attend
  • Thursday 10:00 – 10:50 - always attend
• Each part of the lecture will start and finish at the above times (except today)
• Labs - you need to attend one session every week (sign up for a group on Learn)
  • group 1: Tuesday 14:10 – 16:00
  • group 2: Wednesday 11:10 – 13:00
What background do you need?
• This course involves:
  • Linguistics: phonetics, phonology, intonation, (perhaps) syntax
  • Mathematics: statistics and probability, optimisation
  • Engineering: practical implementations, empirical findings
  • Computer science: algorithms, efficient implementation
• So, there will be something new for everyone
What background do you need?
• No single background is assumed, but…
• If everything on that list is new to you, then you will probably find the course too hard
• You’ll get the most out of this course if you know a little bit about:
  • speech (e.g., phonetics)
  • sound in general (including music and audio engineering)
  • engineering
  • computer science
Roadmap
• Modules 1-2: The basics
• Modules 3-5: Speech synthesis
• Modules 6-9: Speech recognition
• What is examinable?
  • everything in the videos
  • all “Essential” readings
Image source: https://www.amazon.co.uk
Image source: https://developer.amazon.com
Image source: https://developer.amazon.com
Automatic Speech Recognition (ASR)
Text-to-Speech (TTS), also called “Speech Synthesis”
Modules 1-2: The basics
• Waveform, spectrum, spectrogram
• Speech production, speech perception
• Acoustic phonetics
A speech waveform
[figure: a speech waveform (amplitude vs. time) and its spectrum (magnitude on a log scale vs. frequency, 0 – 8 kHz)]
Key concepts we will understand
• Sound as a pressure wave
• Speech waveforms
• Short-term analysis, converting from the time domain to the frequency domain
• Speech production
  • possible sources of sound
  • different ways in which that sound is shaped into different vowels and consonants
• Speech perception
  • hearing
  • forming categories
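Short-term analysis - converting one frame of a waveform from the time domain to the frequency domain - can be sketched in a few lines. This is only an illustrative sketch, assuming NumPy: a pure sine wave stands in for a speech signal, and the sample rate, frame length, and frequency values are made up for the example.

```python
import numpy as np

# One second of a 200 Hz sine wave at a 16 kHz sample rate
# (a stand-in for a voiced speech signal; values are illustrative)
fs = 16000
t = np.arange(fs) / fs
waveform = np.sin(2 * np.pi * 200 * t)

# Short-term analysis: take one 25 ms frame, apply a tapered (Hamming) window,
# then move to the frequency domain with the DFT
frame_len = int(0.025 * fs)  # 400 samples
frame = waveform[:frame_len] * np.hamming(frame_len)
spectrum = np.fft.rfft(frame)
magnitude_db = 20 * np.log10(np.abs(spectrum) + 1e-10)  # log scale, as in a spectrogram

# Frequency axis in Hz; the peak should sit near the sine's 200 Hz
freqs = np.fft.rfftfreq(frame_len, d=1 / fs)
peak_hz = freqs[np.argmax(magnitude_db)]
```

Repeating this for every (overlapping) frame, and stacking the log-magnitude spectra side by side, is exactly how a spectrogram is built.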
Modules 3-5: Speech synthesis
• Text processing - normalisation of non-standard words
• Pronunciation - lexicons and letter-to-sound modelling
• Prosody - phrasing and pitch accents
• Waveform generation - concatenation of pre-stored units
• Prosodic manipulation - time and pitch modification in the time domain
The classic two-stage pipeline of text-to-speech synthesis
[figure: text → front end → linguistic specification → waveform generator → waveform. Example input “Author of the …” with its linguistic specification: phones (sil ao th er ah f dh ax …), syllables, POS tags (NN of DT), and stress values (1 0 0 0)]
The linguistic specification
[figure: feature extraction by the front end - the text “Author of the …” is converted into a linguistic specification of phones (sil ao th er ah f dh ax …), syllables, POS tags (NN of DT), and stress values (1 0 0 0)]
Text processing pipeline
[figure: the front end is a pipeline - tokenize → POS tag → LTS → phrase breaks → intonation - mapping text to the linguistic specification; each component is individually learned from labelled data]
Tokenize & Normalize
• Step 1: divide the input stream into tokens, which are potential words
• For English and many other languages:
  • rule-based
  • whitespace and punctuation are good features
• For some other languages, especially those that don’t use whitespace:
  • this may be more difficult
  • other techniques are required (out of scope here)
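A minimal rule-based tokenizer of the kind sketched above, using whitespace and punctuation as the features, might look like this. The function name and regular expression are purely illustrative (this is not Festival's tokenizer):

```python
import re

def tokenize(text):
    """Minimal rule-based tokenizer: split on whitespace, then detach
    punctuation so that each remaining token is a potential word."""
    tokens = []
    for chunk in text.split():
        # words (allowing internal apostrophes/hyphens) or single punctuation marks
        tokens.extend(re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", chunk))
    return tokens

print(tokenize("In 2011, I spent £100 at IKEA."))
```

Note that even this toy example must decide questions a real front end faces: is “£100” one token or two, and does the comma belong to “2011”?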
Tokenize & Normalize
• Step 2: classify every token, finding Non-Standard Words (NSWs) that need further processing

In 2011, I spent £100 at IKEA on 100 DVD holders.
(2011 → NYER, £100 → MONEY, IKEA → ASWD, 100 → NUM, DVD → LSEQ)
Tokenize & Normalize
• Step 3: a set of specialised modules processes NSWs of each type

2011 → NYER → twenty eleven
£100 → MONEY → one hundred pounds
IKEA → ASWD → apply letter-to-sound
100 → NUM → one hundred
DVD → LSEQ → D. V. D. → dee vee dee
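A toy expansion module for the NUM and NYER types, in the spirit of the table above, can be written as simple rules. This is only a sketch - real normalisation systems cover many more cases (ordinals, decimals, years like 2000), and the function names are invented for the example:

```python
def expand_num(n):
    """Expand an integer 0-999 into words - a tiny NUM-type NSW module."""
    units = ["zero", "one", "two", "three", "four", "five", "six", "seven",
             "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
             "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
    tens = ["", "", "twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety"]
    if n < 20:
        return units[n]
    if n < 100:
        return tens[n // 10] + ("" if n % 10 == 0 else " " + units[n % 10])
    rest = "" if n % 100 == 0 else " and " + expand_num(n % 100)
    return units[n // 100] + " hundred" + rest

def expand_nyer(year):
    """Expand a four-digit year as two pairs, e.g. 2011 -> 'twenty eleven'.
    (Simplified: years such as 2000 would need extra rules.)"""
    return expand_num(year // 100) + " " + expand_num(year % 100)
```

The point of the example: the same digit string (“100” vs. “2011”) expands differently depending on the NSW class assigned in Step 2.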
POS tagging
• Part-of-speech tagger
  • accuracy can be very high
  • trained on annotated text data
  • categories are designed for text, not speech

Tagged example: Director/NN of/IN the/DT McCormick/NP Public/NP Affairs/NPS Institute/NP at/IN U-Mass/NP Boston/NP, Doctor/NP Ed/NP Beard/NP, says/VBZ the/DT push/NN for/IN do/VBP it/PP yourself/PP lawmaking/NN
Pronunciation / LTS
• Pronunciation model:
  • dictionary look-up, plus
  • letter-to-sound model
• But:
  • need deep knowledge of the language to design the phoneme set
  • a human expert must write the dictionary

[dictionary excerpt, CMUdict-style, from ADVOCATING (AE1 D V AH0 K EY2 T IH0 NG) through AERIAL (EH1 R IY0 AH0 L) to AEROBATIC (EH2 R AH0 B AE1 T IH0 K)]
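The look-up-then-fall-back behaviour described above can be sketched in a few lines. The mini-dictionary uses ARPAbet-style entries, but the letter-to-sound rules here are deliberately naive, invented for illustration - real LTS models are learned from the dictionary itself:

```python
# Toy pronunciation model: dictionary look-up with a letter-to-sound fallback.
LEXICON = {
    "ADZ": ["AE1", "D", "Z"],
    "AERO": ["EH1", "R", "OW0"],
}

# Naive one-letter-to-one-phone fallback rules (an assumption for illustration;
# real letter-to-sound models are far more sophisticated)
LTS_RULES = {"A": "AE", "B": "B", "C": "K", "D": "D", "E": "EH",
             "I": "IH", "K": "K", "S": "S", "Z": "Z"}

def pronounce(word):
    word = word.upper()
    if word in LEXICON:          # dictionary look-up first
        return LEXICON[word]
    # letter-to-sound fallback for out-of-vocabulary words
    return [LTS_RULES[ch] for ch in word if ch in LTS_RULES]
```

The dictionary gives high-quality pronunciations for known words; the LTS model exists only to generalise to previously-unseen words.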
Key concepts we will understand
• Breaking a complex problem down into simpler steps
• Combining many components into a single architecture
  • representing information in data structures
• The pros and cons of rules vs. learning from data
• Generalising to previously-unseen words or sentences
• Creating new utterances from fragments of pre-recorded speech
• Manipulating the pitch and duration of speech
Modules 6-9: Speech recognition
• Dynamic Time Warping
• Probability distributions
• Hidden Markov Models
• Bayes’ Rule
• Viterbi algorithm
• Training HMMs
• Simple language models
Key concepts we will understand
• Simple pattern recognition by comparing to stored examples
• The concept of generative models
  • and how they can be used to make a classifier
• The importance of choosing the right representation of the speech signal
  • feature extraction
  • feature engineering
• Combining prior knowledge and new evidence in a principled probabilistic framework
  • the Bayesian approach
  • modelling variability and uncertainty using probability distributions
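Turning a generative model into a classifier via Bayes’ rule can be sketched with two one-dimensional Gaussian class models. Every distribution and number here is invented for illustration (real recognisers model sequences of feature vectors with HMMs, not single values):

```python
import math

# Two classes, each described by a prior and a 1-D Gaussian generative model
# of the observation. All numbers are made up for illustration.
classes = {
    "yes": {"prior": 0.5, "mean": 1.0, "var": 0.5},
    "no":  {"prior": 0.5, "mean": -1.0, "var": 0.5},
}

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify(x):
    # Bayes' rule: posterior is proportional to likelihood x prior
    # (the evidence term is the same for all classes, so it cancels)
    scores = {c: p["prior"] * gaussian_pdf(x, p["mean"], p["var"])
              for c, p in classes.items()}
    return max(scores, key=scores.get)
```

The generative model answers “how likely is this observation given the class?”; Bayes’ rule combines that with the prior to answer the question we actually care about: “which class, given the observation?”.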
Roadmap
• along the way, we will see some important techniques, models, and algorithms — including machine learning — that are useful in other applications
Some of the things we will learn
• Basic digital signal processing
• Text processing - rules vs. learning from data
• Classification and Regression Trees (CARTs)
• Dynamic programming - Dynamic Time Warping (DTW); the Viterbi algorithm
• Generative models - Hidden Markov Models (HMMs)
• Probability - Bayes’ Rule
• Expectation-Maximisation - the Baum-Welch algorithm
• Finite state networks - N-gram language models
• Feature extraction and engineering - Mel Frequency Cepstral Coefficients (MFCCs)
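Of the techniques listed above, Dynamic Time Warping is small enough to sketch in full: it is a dynamic-programming recursion over an alignment grid. The sketch below works on 1-D sequences with an absolute-difference local cost; a real system would compare frames of MFCC vectors instead:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping between two 1-D sequences.
    Returns the minimum total alignment cost, computed by dynamic programming
    with local cost = absolute difference."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # allowed moves into (i, j): diagonal match, or a step in either sequence
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]
```

For example, `[1, 2, 3]` aligns to `[1, 2, 2, 3]` with zero cost, because warping lets the single `2` stretch across both `2`s - exactly how DTW absorbs differences in speaking rate.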
Lab facilities
• All practicals are done on Apple computers, running OS X
• Appleton Tower 4.02: fully-supported computers, file backup, 24/7 access
  • we’ll automatically create accounts for everyone - look out for announcements
  • we have no resources to support use of your own Mac (so, for experts only)
• Unix/Linux/command line
  • basics: Terminal, mv, cp, cd, mkdir, starting programs - intro sessions available
• Never switch off machines, just log out
• Lab access
  • via matriculation card at any time (PIN required out of hours)
  • follow instructions on Learn to get your card activated
assignments are described on the website
end of first half of lecture 01
The basics
• How speech is produced and perceived
• Waveform, spectrum, spectrogram
How speech is produced and perceived
• Speech production
  • sound source, vocal tract, source-filter model
• Speech perception
  • the auditory system
What you should already know
• Vocal tract is a tube of variable shape
• A tube (containing air) acts as a resonator for sound waves
• The frequencies of the resonances depend on the shape of the tube
Orientation
• Our goal is now:
  • to build on our basic understanding of the acoustics of the vocal tract
  • to understand speech production
  • to explain what we observe in speech signals
• See other material on:
  • digital signals
  • phonetics
Vocal tract anatomy
• Vocal tract is a tube
• Shape can be changed by moving the tongue, jaw and lips
• The nasal branch can be connected by lowering the velum
• The tongue is larger than you might have thought - a complex set of muscles
• The nasal cavity is surprisingly large
Image credit: Department of Mathematics and Systems Analysis, Aalto University, School of Science
The vocal tract is a resonator. A resonator can act as a filter. So, what is a filter?
[figure: input signal → filter → output signal]
Filtering in the time domain
[figure: input signal e[t] → filter → output signal y[t]]

y[t] = e[t] - Σ_{k=1}^{K} b_k y[t-k]
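The difference equation above can be implemented directly as a loop over time. This is a naive sketch assuming NumPy; in a real source-filter model the coefficients would come from analysis of speech, whereas here they are arbitrary:

```python
import numpy as np

def allpole_filter(e, b):
    """Apply the all-pole (recursive) filter
        y[t] = e[t] - sum_{k=1}^{K} b[k-1] * y[t-k]
    to an input signal e. The coefficients b are illustrative, not derived
    from a real vocal tract."""
    y = np.zeros(len(e))
    for t in range(len(e)):
        acc = e[t]
        # each output sample depends on the K previous *output* samples
        for k in range(1, len(b) + 1):
            if t - k >= 0:
                acc -= b[k - 1] * y[t - k]
        y[t] = acc
    return y
```

Because the output feeds back into itself, even a single impulse at the input produces a decaying response - this "ringing" is what makes the filter a resonator.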
Two possible sources: periodic and aperiodic (non-periodic, random)
[figure: source e[t] → filter y[t] = e[t] - Σ_{k=1}^{K} b_k y[t-k] → output speech y[t]]
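The two source types above can be generated in a few lines - a sketch assuming NumPy, with an illustrative sample rate and F0:

```python
import numpy as np

fs = 16000  # sample rate in Hz (illustrative)

# Periodic source: an impulse train at F0 = 100 Hz, the excitation for voiced speech
f0 = 100
periodic = np.zeros(fs)            # one second of signal
periodic[::fs // f0] = 1.0         # one impulse per pitch period (every 160 samples)

# Aperiodic source: white noise, the excitation for unvoiced speech
rng = np.random.default_rng(0)
aperiodic = rng.standard_normal(fs)
```

Passed through the same vocal-tract filter, the periodic source yields voiced sounds (with a pitch determined by F0) and the aperiodic source yields unvoiced, noise-like sounds.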
Describing the source-filter model in the frequency domain
[figure: source → filter → output speech, viewed in the frequency domain: the output spectrum is the source spectrum shaped by the filter’s frequency response]
Image credit: Rice University (CC-BY-4.0)
What next?
• We have now learned about:
  • speech signals
    • speech production
    • the source-filter model
  • speech perception
    • the filterbank
    • the Mel scale
Speech synthesis
Automatic speech recognition