ctc with application to ocr - university of california...
![Page 1: CTC with application to OCR - University of California ...cseweb.ucsd.edu/.../student_presentations/CTC_OCR.pdf · •However, more work has shown since CTC paper(2006): ... •Transducer](https://reader033.vdocuments.mx/reader033/viewer/2022050223/5f6875a376606a75e715248e/html5/thumbnails/1.jpg)
Connectionist Temporal Classification (CTC) with application to
Optical Character Recognition (OCR)
Siyang Wang
Outline
• Two long-standing tasks
  • Speech recognition and OCR
• Motivation: Pre-CTC methods
  • HMM
  • HMM-RNN hybrid
• Connectionist Temporal Classification (CTC)
• Applying CTC to OCR
• Disadvantages of CTC
Two long-standing tasks
• Speech recognition: sound signal → “Hello world”
• Optical character recognition (OCR): image of text → “Hello world”
A major difficulty
• No temporal correspondence (discussion question posted earlier)
  • Example: which segment of a sound signal sequence corresponds to a phoneme?
  • Ordering as a limited prior: not enough to easily establish correspondence
• Segmentation and alignment problems
  • Ambiguity: two connected phonemes
• Lack of per-frame labeling (such labels are difficult to obtain, and labeling every frame does not make much sense anyway)
Pre-CTC: Hidden Markov Models (HMM)
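$x_t$ = observed state at $t$ (sound signal)
$a_t$ = hidden state at $t$ (phoneme)
• Conditional independence assumptions:
  • $P(x_t \mid a_1, \dots, a_T, x_1, \dots, x_{t-1}) = P(x_t \mid a_t)$
  • $P(a_t \mid a_1, \dots, a_{t-1}, x_1, \dots, x_{t-1}) = P(a_t \mid a_{t-1}) = P(a_{t'} \mid a_{t'-1})$ (the same transition probabilities at every step)
• Inference: forward-backward algorithm (state posteriors) or Viterbi algorithm (most likely state sequence)
• Training: EM algorithm
• Simple segmentation strategy: merge consecutive identical hidden states to output the predicted sequence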
https://distill.pub/2017/ctc/
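Taken together, the two independence assumptions mean the joint probability of the observations and hidden states factorizes into transition and emission terms. A sketch in the slide's notation, where the initial-state term $P(a_1)$ is an assumption added here for completeness:

```latex
P(x_1, \dots, x_T, a_1, \dots, a_T)
  = P(a_1) \prod_{t=2}^{T} P(a_t \mid a_{t-1}) \prod_{t=1}^{T} P(x_t \mid a_t)
```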
HMM Disadvantages (Graves, 2006)
• Inherently generative (limits classification ability)
• Only limited RNN incorporation (identify local phonemes)
  • HMM-RNN hybrids
  • Does not allow applying RNN end-to-end
• However, work since the CTC paper (2006) has shown:
  • Combining a deep neural network (not necessarily an RNN) with an HMM performs well
  • Transducer in speech recognition (next lecture’s presentation!)
Connectionist Temporal Classification (CTC)
• Alignment-free transformation
  • Add a “blank” token to the pool of output classes/tokens
  • Consecutive identical tokens between “blank” tokens are taken as one token; the “blank” tokens are then removed
• Example:
https://distill.pub/2017/ctc/
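The collapsing rule can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the `-` character standing in for the “blank” token and the function name `ctc_collapse` are assumptions.

```python
from itertools import groupby

BLANK = "-"  # assumed symbol for the CTC "blank" token


def ctc_collapse(alignment, blank=BLANK):
    """Merge consecutive repeated tokens, then drop the blank tokens."""
    merged = [token for token, _ in groupby(alignment)]
    return "".join(token for token in merged if token != blank)


print(ctc_collapse("hh-e-ll-lo--"))  # hello
print(ctc_collapse("aa-a"))          # aa
```

The blank between the two `l` runs is what lets a doubled letter survive the merge; without it, `ll` would collapse to a single `l`.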
How does this framework help classification?
• Define the classification problem: 𝑋 → 𝑌
• But both 𝑋 and 𝑌 can vary in length within the same problem
• We want a differentiable 𝑃(𝑌|𝑋) so that the model can be trained by maximum likelihood (MLE) with back-propagation
https://distill.pub/2017/ctc/
CTC P(Y|X) example
https://distill.pub/2017/ctc/
Time step t     1     2     3     4
P(“a”|X)        0.9   0.7   0.2   0.0
P(“m”|X)        0.1   0.2   0.0   0.9
P(“blank”|X)    0.0   0.1   0.8   0.1
What is 𝑃(𝑌 = "𝑎𝑚"|𝑋)?
Efficient loss calculation: forward and backward algorithm (dynamic programming)
ftp://ftp.idsia.ch/pub/juergen/icml2006.pdf
Forward pass case 1:
Forward pass case 1A:
Forward pass case 1B:
Forward pass case 2:
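A sketch of the recurrence these cases refer to, assuming the case split follows the Distill article cited earlier. Here $Z = [\epsilon, y_1, \epsilon, \dots, y_U, \epsilon]$ is the target with blanks inserted, and $\alpha_{s,t}$ is the total probability of all partial alignments that reach $z_s$ at frame $t$. Reading case 1A as the blank subcase and case 1B as the repeated-character subcase is a guess; the slides do not say.

```latex
\alpha_{s,t} =
\begin{cases}
\bigl(\alpha_{s-1,t-1} + \alpha_{s,t-1}\bigr)\, p_t(z_s \mid X),
  & \text{if } z_s = \epsilon \text{ or } z_s = z_{s-2} \quad \text{(case 1)} \\[4pt]
\bigl(\alpha_{s-2,t-1} + \alpha_{s-1,t-1} + \alpha_{s,t-1}\bigr)\, p_t(z_s \mid X),
  & \text{otherwise} \quad \text{(case 2)}
\end{cases}
```

In case 1 the previous token cannot be skipped: either $z_s$ is a blank (skipping $z_{s-1}$ would drop a label), or $z_s$ repeats $z_{s-2}$ (the blank between them is required to keep the two characters separate).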
Training time: Forward and backward
• Forward (calculate α):
• Backward (calculate 𝛽) and combine with forward:
The combined likelihood is the MLE objective and the starting point for back-propagation through the model
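A sketch of how the forward and backward tables combine, reusing the "am" example. One deliberate deviation from Graves (2006): β here excludes the emission at frame t, so that P(Y|X) = Σ_s α_t(s)·β_t(s) holds at every frame without the division used in the paper. The `-` blank symbol and all function names are assumptions.

```python
import math

BLANK = "-"  # assumed symbol for the "blank" token
PROBS = {    # per-frame probabilities from the "am" example
    "a":   [0.9, 0.7, 0.2, 0.0],
    "m":   [0.1, 0.2, 0.0, 0.9],
    BLANK: [0.0, 0.1, 0.8, 0.1],
}


def extend(target):
    """Interleave blanks: 'am' -> ['-', 'a', '-', 'm', '-']."""
    ext = [BLANK]
    for ch in target:
        ext += [ch, BLANK]
    return ext


def can_skip(ext, s):
    """State s may be entered from s-2 only if it is a label that differs from ext[s-2]."""
    return s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]


def forward(ext, probs):
    """alpha[t][s]: probability of all valid prefixes that are in state s at frame t."""
    T, S = len(probs[BLANK]), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[ext[0]][0]
    if S > 1:
        alpha[0][1] = probs[ext[1]][0]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1][s]
            if s >= 1:
                total += alpha[t - 1][s - 1]
            if can_skip(ext, s):
                total += alpha[t - 1][s - 2]
            alpha[t][s] = total * probs[ext[s]][t]
    return alpha


def backward(ext, probs):
    """beta[t][s]: probability of all valid suffixes over the frames after t, given state s at t."""
    T, S = len(probs[BLANK]), len(ext)
    beta = [[0.0] * S for _ in range(T)]
    beta[T - 1][S - 1] = 1.0
    if S > 1:
        beta[T - 1][S - 2] = 1.0
    for t in range(T - 2, -1, -1):
        for s in range(S):
            total = beta[t + 1][s] * probs[ext[s]][t + 1]
            if s + 1 < S:
                total += beta[t + 1][s + 1] * probs[ext[s + 1]][t + 1]
            if s + 2 < S and can_skip(ext, s + 2):
                total += beta[t + 1][s + 2] * probs[ext[s + 2]][t + 1]
            beta[t][s] = total
    return beta


def likelihood_at_each_frame(target, probs=PROBS):
    """Combine alpha and beta; every frame should give the same P(target | X)."""
    ext = extend(target)
    alpha, beta = forward(ext, probs), backward(ext, probs)
    return [sum(a * b for a, b in zip(alpha[t], beta[t])) for t in range(len(alpha))]


per_frame = likelihood_at_each_frame("am")
loss = -math.log(per_frame[0])  # negative log-likelihood, the quantity minimized in training
print([round(p, 4) for p in per_frame])  # [0.6462, 0.6462, 0.6462, 0.6462]
```

Getting the same value at every frame is a useful sanity check on both tables. The per-state products α·β are also the quantities from which the gradient with respect to the frame-level outputs is built (Graves, 2006).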
Inference strategies at test time
• Most likely alignment heuristic:
• Collapsing alignments by using “blank” token as divider (Graves, 2006)
• Modified beam search and incorporating a language model (https://distill.pub/2017/ctc/)
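The most-likely-alignment heuristic can be sketched as follows: pick the most probable token at each frame, then collapse. The per-frame values are reused from the "am" example; the `-` blank symbol and the name `greedy_decode` are assumptions.

```python
from itertools import groupby

BLANK = "-"  # assumed symbol for the "blank" token
PROBS = {    # per-frame probabilities from the "am" example
    "a":   [0.9, 0.7, 0.2, 0.0],
    "m":   [0.1, 0.2, 0.0, 0.9],
    BLANK: [0.0, 0.1, 0.8, 0.1],
}


def greedy_decode(probs, blank=BLANK):
    """Take the argmax token at every frame, merge repeats, drop blanks."""
    num_frames = len(probs[blank])
    best_path = [
        max(probs, key=lambda tok: probs[tok][t]) for t in range(num_frames)
    ]
    merged = [tok for tok, _ in groupby(best_path)]
    return "".join(tok for tok in merged if tok != blank)


print(greedy_decode(PROBS))  # am
```

This is a heuristic because it scores a single alignment: an output whose probability is spread over many alignments can have a higher total P(Y|X) than the output of the single best alignment, which is the motivation for the beam search variant.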
OCR w/ CTC System: Layered components
• Step 1: Visual feature extraction (CNN)
• Step 2: Sequential modeling based on visual feature sequence (RNN)
• Step 3: CTC layer to map input sequence (visual feature sequence) to output sequence (character sequence)
OCR w/ CTC Step 1: Visual Feature Extraction
Sliding window CNN
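A rough sketch of how a sliding window turns an image into a left-to-right sequence of patches, each of which would then be passed through the CNN to produce one feature vector. The window width, the stride and the function name are hypothetical; the slides do not specify them, and the CNN itself is omitted.

```python
def sliding_windows(image, width, stride):
    """Split an H x W image (a list of rows) into overlapping H x width patches."""
    total_width = len(image[0])
    return [
        [row[start:start + width] for row in image]
        for start in range(0, total_width - width + 1, stride)
    ]


# A toy 2 x 6 "image": window width 4 and stride 2 give two patches.
image = [
    [0, 1, 2, 3, 4, 5],
    [10, 11, 12, 13, 14, 15],
]
patches = sliding_windows(image, width=4, stride=2)
print(len(patches))  # 2
print(patches[1])    # [[2, 3, 4, 5], [12, 13, 14, 15]]
```

The number of patches determines the length of the input sequence seen by the RNN and the CTC layer.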
OCR w/ CTC Step 2: RNN
Sliding window CNN
RNN: GRU, LSTM
OCR w/ CTC Step 3: CTC Mapping
RNN: GRU, LSTM
CTC
Output Character Sequence
OCR w/ CTC System Overview
CTC
RNN
CNN
𝑓1 𝑓2 𝑓3 𝑓4 𝑓5 𝑓6 (𝑓𝑖 = visual feature vector 𝑖)
Output Character Sequence
End-to-end trainable
Differentiable model: CNN + RNN + CTC
Input: image
Output: character sequence, e.g. “okay”
Train: $\arg\max_{\theta} P(Y \mid X, \theta) = \arg\max_{\theta_{CNN},\, \theta_{RNN}} P(\text{character sequence} \mid \text{image}, \theta_{CNN}, \theta_{RNN})$
https://arxiv.org/pdf/1507.05717.pdf
Disadvantages of CTC
• Built-in conditional independence: unable to learn a language model
https://distill.pub/2017/ctc/
  • Example input sound: “triple-A”
  • Dependencies between output tokens are not explicitly expressed in CTC
  • Experiments show that adding a language model boosts performance in specific settings (https://distill.pub/2017/ctc/)
  • Does not learn a language model well (https://arxiv.org/pdf/1707.07413.pdf)
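The independence assumption is visible in how CTC scores an alignment. A sketch in the Distill article's notation, where $A = (a_1, \dots, a_T)$ is an alignment and $\mathcal{A}_{X,Y}$ (a symbol introduced here) is the set of alignments that collapse to $Y$:

```latex
P(A \mid X) = \prod_{t=1}^{T} p_t(a_t \mid X),
\qquad
P(Y \mid X) = \sum_{A \in \mathcal{A}_{X,Y}} \prod_{t=1}^{T} p_t(a_t \mid X)
```

No factor couples $a_t$ with $a_{t-1}$: given $X$, each frame's output is scored independently of the tokens emitted before it, so the model cannot directly express that one output token makes another more or less likely.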
Disadvantages of CTC
• Many-to-one mapping (discussion question): CTC facilitates collapsing
https://distill.pub/2017/ctc/
CTC is good at many-to-one tasks: speech recognition, OCR
$x_1\ x_2\ x_3\ x_4\ x_5\ x_6 \rightarrow c_1\ c_2$
CTC is not so good at many-to-many tasks (potentially expanding the length of the input sequence or changing the order): machine translation, other examples?
$x_1\ x_2\ x_3 \rightarrow x_1^1\ x_2^1\ x_1^2\ x_2^2\ x_3^2\ x_1^3$
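The length restriction behind this slide can be made concrete: because every output token must occupy at least one frame, and two identical neighbouring characters need a blank frame between them, CTC can only emit a target that fits into the available frames. A small sketch, with hypothetical function names:

```python
def min_frames_needed(target):
    """Fewest frames that any valid CTC alignment of target can use."""
    repeats = sum(1 for prev, cur in zip(target, target[1:]) if prev == cur)
    return len(target) + repeats


def ctc_can_emit(target, num_frames):
    """True if at least one alignment of length num_frames collapses to target."""
    return num_frames >= min_frames_needed(target)


print(min_frames_needed("hello"))  # 6 (five letters plus one blank between the two l's)
print(ctc_can_emit("am", 4))       # True
print(ctc_can_emit("hello", 5))    # False
```

An output longer than the input, as in the expanding many-to-many case, therefore has probability zero under CTC.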