A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text


Upload: randa-elanwar

Post on 15-Jul-2015


TRANSCRIPT

Page 1: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Prepared by:

Eng. Randa Ibrahim M. Elanwar

In the name of God, the Most Gracious, the Most Merciful

"My success comes only through God; in Him I put my trust, and to Him I turn."

Eng. Randa Ibrahim M. Elanwar (M.D.)

Assistant Researcher

Electronic Research Institute

Under the supervision of:

Prof. Dr. Mohsen A. A. Rashwan

Professor of Digital Signal Processing

Faculty of Engineering, Cairo University

Prof. Dr. Samia A. A. Mashaly

Professor of Digital Signal Processing

Computers & Systems Dept., Electronic Research Institute

Page 2: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Presentation Organization

1. Introduction, Thesis Goals & Contributions

2. Text Lines Extraction

3. Words Extraction

4. Words Segmentation

5. User Interfaces

6. Annotation performance evaluation

7. Conclusions & Future Work

Page 3: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text


Page 4: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Introduction

• What is 'Annotation'?
• What is 'Document Annotation'?
• Why Document Annotation?
• How to annotate a document?

Page 5: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Introduction

• Annotation: identifying data of a particular type using additional data of a different type that precisely describes its entities.
• Document annotation: associating the corresponding ASCII/Unicode text with the document image (offline) or ink information (online).

Page 6: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Introduction

[Diagram labels: Image/ink; Transcription/Ground truth; Annotated Document; Key words; Search Engine; Info. Retrieval; Digital Library and Web search]

Page 7: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Introduction

[Diagram labels: Annotated Documents; Building Recognizers; Training Data; Train Models; Test Data; Performance Evaluation; Result Analysis]

Page 8: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Introduction

Document annotation:

• Accelerates and enhances recognizers.
• Accelerates digital library construction.

Annotation maximizes efficiency, productivity & profitability.

Page 9: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Introduction

• Region of interest: line, word, character.
• Annotation: identify boundaries and associate ASCII/Unicode.

Page 10: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Introduction

Annotation schemes:

• Manual: manual annotation and validation; laborious, time-consuming and error-prone.
• Semi-automatic: manual truth entry, automatic annotation, manual validation.
• Automatic: automatic recognition/truth alignment, manual validation.

Page 11: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Thesis Goals

1. Contribute to the Arabic LR problem solution.
2. Save researchers' efforts spent on data manipulation.
3. Pave the path to generic toolkit construction.

We provide the 1st online Arabic sentence dataset (OHASD) and the 1st Arabic semi-automatic annotation tool for online handwriting (ATAOH).

Page 12: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Contribution: OHASD Dataset

• Unconstrained and natural.
• Texts sampled from daily newspapers.
• Texts are dictated to writers.
• 154 paragraphs by 48 writers.
• More than 3,800 words and more than 19,400 characters.

Page 13: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Contribution: ATAOH Tool

1. Easy document browsing and display.
2. Automatic text-line/word extraction and segmentation.
3. Manual options for segmentation validation & annotation correction.

Page 14: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Contribution: ATAOH Tool

4. Designed and evaluated using OHASD.
5. Composed of a guiding set of interactive user interfaces.
6. Reduces human effort through high-performance automation.

Page 15: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text


Page 16: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Text Line Extraction

Text line extraction techniques (diagram): smearing, Hough-based, grouping (bottom-up), graph-based, cut text minimization, projection-based (top-down).

Page 17: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Text Line Extraction

Extraction errors are due to:

1. Fluctuating lines
2. Skew variability
3. Touching text lines
4. Fragments due to the massive presence of diacritics

Page 18: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Text Line Extraction

ATAOH provides an automatic text line extraction utility based on dynamic programming (DP):

1. Read the input document (stroke data).
2. Preprocessing: remove dots.
3. Shred the document into strips.
4. For each strip, build CCs/units.
5. Extend units horizontally to build segments.
6. Start DP to merge segment pairs.
7. Proceed with DP as long as valid merges exist.
8. When DP stops, the final paths are the text lines.
9. Restore dots.
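The merge stage of this pipeline can be illustrated with a minimal sketch. It is not ATAOH's implementation: it swaps the DP search for a plain greedy pairwise merge, and the `Segment` type, the `merge_cost` function and the `stop_threshold` parameter are hypothetical names introduced only for illustration. The idea it shows is the stopping rule: keep merging the cheapest pair of segments until no merge is cheap enough to be valid.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical segment: horizontal extent plus vertical center."""
    x_min: float
    x_max: float
    y_center: float


def merge_cost(a: Segment, b: Segment) -> float:
    """Illustrative cost (an assumption): vertical offset plus a small gap term."""
    gap = max(0.0, max(a.x_min, b.x_min) - min(a.x_max, b.x_max))
    return abs(a.y_center - b.y_center) + 0.1 * gap


def merge(a: Segment, b: Segment) -> Segment:
    """Merge two segments, averaging the vertical center by segment width."""
    width_a = a.x_max - a.x_min
    width_b = b.x_max - b.x_min
    total = (width_a + width_b) or 1.0
    y_center = (a.y_center * width_a + b.y_center * width_b) / total
    return Segment(min(a.x_min, b.x_min), max(a.x_max, b.x_max), y_center)


def extract_lines(segments, stop_threshold):
    """Greedily merge the cheapest pair until every remaining merge is too costly."""
    lines = list(segments)
    while len(lines) > 1:
        best = None
        for i in range(len(lines)):
            for j in range(i + 1, len(lines)):
                cost = merge_cost(lines[i], lines[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        if cost > stop_threshold:
            break  # no valid merge left: the remaining segments are the text lines
        merged = merge(lines[i], lines[j])
        lines = [s for k, s in enumerate(lines) if k not in (i, j)] + [merged]
    return sorted(lines, key=lambda s: s.y_center)
```

A real system would tune the stopping threshold on training data, as the experiments on DP stopping thresholds in this section do.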

Page 19: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Text Line Extraction

Page 20: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Text Line Extraction

DP cost function design:

MergingCost = (DistancePenalty * log[WidthPenalty]) + DirectionPenalty + log[CrossOverPenalty]

• Direction penalty: ensures merging from right to left.
• Cross-over penalty: ensures merging adjacent segments.
• Distance penalty: ensures merging close segments.
• Width penalty: avoids having left-alone segments.

Page 21: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Text Line Extraction

• Stuck text lines are fixed using a post-processing step of sticking detection and correction. Gap-separated text line segments (bridges) are also fixed.
• The OHASD dataset is divided into 124 documents for training (558 text lines) and 30 documents for testing (112 text lines).
• Experiments are conducted to optimize the system parameters (DP stopping thresholds).

Page 22: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Text Lines Extraction

The results obtained for both the training and test sets are:

Set | Stage | Document accuracy (%) | Text line accuracy (%)
Training set | Exp 1 results | 95.16 | 84.36
Training set | Stick resolve | 97.58 | 85.1
Training set | Bridge concatenation | 100 | 100
Test set | Exp 1 results | 93.33 | 87.22
Test set | Stick resolve & bridge concatenation | 96.67 | 98.5

Page 23: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Text Line Extraction

Performance comparison:

System | Data | Result
Liwicki (2006) | 100 docs from IAMonDB | 98% Doc. Acc., 99.94% Stroke Acc.
Zahour (2001) | 100 offline Arabic docs | 97%
Li (2006) | 100 offline Arabic docs | 92% text line Acc.
Our system | 154 online Arabic docs | 98% text line Acc., 96.7% Doc. Acc.

Page 24: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Text Line Extraction

Conclusions:

1. Our method gives promising results.
2. Applicable to off-line documents (with minor changes).
3. Overcomes writing on multiple text lines.
4. Applicable to English, French & Greek (overcomes diacritics).

Page 25: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text


Page 26: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Extraction

Word extraction techniques: break the text line into CCs, then use one of:

• Threshold-based: compare the gap width to a fixed threshold.
• Classifier-based: classify each gap as an inter-word or intra-word gap.
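The threshold-based scheme can be sketched in a few lines. This is only an illustration under stated assumptions: each CC is reduced to a hypothetical `(x_min, x_max)` horizontal extent, the text is read right to left, and `gap_threshold` is a fixed value chosen by the caller. It is not the ATAOH word extractor, which relies on classifiers.

```python
def split_words(components, gap_threshold):
    """Group CCs of one text line into words using a fixed gap threshold.

    components: list of (x_min, x_max) extents (hypothetical representation).
    Returns a list of words, each a list of components, ordered right to left.
    """
    if not components:
        return []
    # Arabic is written right to left, so start from the rightmost component.
    ordered = sorted(components, key=lambda c: c[1], reverse=True)
    words = [[ordered[0]]]
    left_edge = ordered[0][0]  # leftmost x reached by the current word
    for comp in ordered[1:]:
        gap = left_edge - comp[1]
        if gap > gap_threshold:
            words.append([comp])  # wide gap: treat it as an inter-word gap
            left_edge = comp[0]
        else:
            words[-1].append(comp)  # narrow gap or overlap: same word
            left_edge = min(left_edge, comp[0])
    return words
```

A single fixed threshold cannot handle wide intra-word gaps or narrow inter-word gaps, which is the motivation for classifying each gap instead.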

Page 27: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Extraction

Extraction errors are due to:

1. Wide intra-word gaps (word split)
2. Narrow inter-word gaps (word stick)
3. Total overlap: no inter-word gap (word stick)

Page 28: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Extraction

ATAOH provides an automatic word extraction utility based on the fusion of classifier decisions:

1. Read the input text line (stroke data).
2. Preprocessing: remove dots, build OCs.
3. Feature extraction (global/local).
4. Initial word extraction (best classifier).
5. Word stick detection.
6. Decision fusion and stick correction.
7. Restore dots.

Page 29: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Extraction

• Experiments were conducted to identify the best-performing classifiers.

Page 30: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Extraction

• Feature vectors are fed to an SVM (polynomial kernel) for initial word extraction.
• The extracted words undergo stick detection tests.

Page 31: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Extraction

The stick detection tests are based on the likelihood of the output word's parameter values:

1. Number of OCs
2. Word width
3. Number of strokes
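One way to picture such a likelihood test is the sketch below. It assumes, purely for illustration, that each of the three parameters follows an independent Gaussian estimated from correctly extracted training words, and that a word is flagged as stuck when its log-likelihood falls under a caller-chosen threshold. The slides do not state the distribution or the threshold, so `fit_word_model`, `log_likelihood` and `looks_stuck` are hypothetical names.

```python
import math
from statistics import mean, stdev


def fit_word_model(words):
    """Estimate (mean, std) per parameter from (n_ocs, width, n_strokes) tuples."""
    columns = list(zip(*words))
    return [(mean(col), stdev(col)) for col in columns]


def log_likelihood(word, model):
    """Sum of independent Gaussian log-densities (an assumed model)."""
    total = 0.0
    for value, (mu, sigma) in zip(word, model):
        sigma = max(sigma, 1e-6)  # guard against zero spread
        total += -0.5 * math.log(2 * math.pi * sigma ** 2)
        total -= (value - mu) ** 2 / (2 * sigma ** 2)
    return total


def looks_stuck(word, model, threshold):
    """Flag a word whose parameters are unlikely for a single word."""
    return log_likelihood(word, model) < threshold
```

Words that are far wider, or contain far more OCs and strokes, than typical single words score a low likelihood and become candidates for the stick correction step.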

Page 32: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Extraction

• If a word is stuck, all gap decisions from 5 classifiers are fused. A separate pre-trained SVM gives the final decision on whether or not to break the word up.

Page 33: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Extraction

OHASD dataset split:

Set | Documents | Words | Inter-word gaps | Intra-word gaps
Training | 110 | 2802 | 2264 | 3988
Validation | 14 | 334 | 277 | 437
Test | 30 | 688 | 616 | 1117

Page 34: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Extraction

Validation set results:

• Split words represent about 2% of the validation dataset words.
• 96% of stuck words are detected, 62% are resolved correctly, 8% are wrongly resolved, and 16% of the lengthy correct words are damaged.
• Stick resolution led to a 31.19% error reduction in GCR and a 43.89% error reduction in WER.

Page 35: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Extraction

Test set results:

• GCR of 88.4% and WER of 71.5%.
• Word-split and total-overlap errors occur much more often than in the validation set.

Page 36: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Extraction

Performance comparison:

System | Method | Result
Liwicki (2006) | Threshold based | 86% WER, 95% GCR
Quiniou (2009) | RBF NN classification | 96% WER
Sun (2004) | LDA/KNN/GMM/MLP/SVM | 89.5%, 90%, 89.8%, 93.2%, 93.7% GCR
Our system | 4 SVM + RBF NN | 71.5% WER, 88.4% GCR

Page 37: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Extraction

Conclusions:

1. No publications on the Arabic online word extraction problem so far.
2. Results are promising given the difficulty of Arabic.
3. Odd writers' habits add more challenge.
4. Limitations arise from not using context to help stick/split detection.

Page 38: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text


Page 39: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

1. Essential for analytic word recognition approaches.
2. Touching/overlapping characters and word ambiguity make it difficult.
3. Impossible to segment a given word without knowing its identity.

Page 40: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

Segmentation techniques:

• Rules-based: propose many SPs & validate using rules; human experts perform classification; the result is measured by WSR, SPRR or CSR.
• Classifiers-based: propose many SPs & validate by recognition; classifiers (e.g. NN) perform classification; the result is measured by WRR.

Page 41: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

Segmentation errors can be:

1. Over-segmentation: too many PSPs
2. Under-segmentation: too few PSPs
3. Bad-segmentation: correct number of PSPs, but mis-located
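The three error types can be made concrete with a small sketch. It assumes segmentation points are given as sample indices along the word and that a point counts as correctly located when it lies within a caller-chosen `tolerance` of the true point; both the representation and the tolerance are assumptions made for illustration.

```python
def classify_segmentation(proposed, truth, tolerance):
    """Label one word's segmentation as 'correct', 'over', 'under' or 'bad'.

    proposed, truth: lists of segmentation point positions (sample indices).
    tolerance: maximum allowed distance between matching points (assumed).
    """
    if len(proposed) > len(truth):
        return "over"   # too many segmentation points
    if len(proposed) < len(truth):
        return "under"  # too few segmentation points
    pairs = zip(sorted(proposed), sorted(truth))
    if all(abs(p - t) <= tolerance for p, t in pairs):
        return "correct"
    return "bad"        # right count, wrong locations
```

Counting the words in each class over a dataset gives per-class rates of the kind reported in the result tables of this section.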

Page 42: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

ATAOH provides an automatic word segmentation-annotation utility using HMM:

1. Word preprocessing: re-sampling, smoothing, removing secondary strokes.
2. Feature extraction (local/vicinity).
3. HMM (recognizer/aligner).
4. PSP rule-based validation.
5. Restoring secondary strokes.

Page 43: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

Features (local and vicinity): delta x-y, aspect, writing direction, chain code, eye (word-PAW), curliness, slope, chords (angles, length ratio).

Page 44: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

HMM parameters:

• Models: 28 characters (reduced to 19) in all positions; 6 ligatures (لم،�،لح،بح،مح،بم).
• Window: number of samples per window; window overlap ratio.
• States: number of states per model; number of Gaussian mixtures per state.
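The two window parameters can be illustrated with a short framing sketch. It assumes the resampled pen trajectory is a flat list of samples and that incomplete trailing windows are dropped; the function name `frame_samples` and these choices are assumptions, not details taken from ATAOH.

```python
def frame_samples(samples, window_size, overlap_ratio):
    """Cut a sample sequence into fixed-size windows with a given overlap.

    window_size: number of samples per window.
    overlap_ratio: fraction of a window shared with the next one, in [0, 1).
    Trailing samples that do not fill a whole window are dropped (assumed).
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    step = max(1, round(window_size * (1 - overlap_ratio)))
    last_start = len(samples) - window_size
    return [samples[i:i + window_size] for i in range(0, last_start + 1, step)]
```

For example, `frame_samples(list(range(20)), 9, 0.0)` yields two non-overlapping windows of 9 samples each; one feature vector would then be computed per window and fed to the HMM.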

Page 45: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

• The first HMM is used as a recognizer (automatic segmentation-annotation).
• Experiments are conducted for:
1. Feature set selection
2. System parameters optimization
• Best feature set: eye, chord angles, aspect and curliness.

Page 46: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

• Best window parameters: 9 samples/window, no overlap.
• Best HMM parameters: 20 states, 16 mixtures/state.
• Varying the number of HMM Gaussian mixtures didn't affect the results remarkably.
• We define a new HMM with a variable number of Gaussian mixtures per state.

Page 47: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

• Varying the number of HMM states, keeping 16 mixtures for the first 8 states only and a single Gaussian elsewhere: the best HMM has 36 states.

Page 48: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

• Varying the location of the multi-mixture states along the HMM: the best location is the first 8 states.

Page 49: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

• HMM average result on the validation dataset using the best HMM design: 46.23% WSR, 80.87% CSR.
• The same writer may have a significantly different WSR per document.
• The same writer may also have almost the same WSR but a significantly different WRR.

Page 50: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

• Segmentation accuracy is related not only to writer habits but also to the character position within the word PAW.
• Segmentation succeeds when a PAW has a reasonable number of obvious valleys.
• Therefore, HMMs need to be trained on a huge open-vocabulary dataset composed of a huge variety of words written by multiple writers.

Page 51: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

• The HMM proposes SPs, which are validated using 8 rules.

[Figures: Rule 1, Rule 5]

Page 52: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

• The HMM proposes SPs, which are validated using 8 rules.

[Figures: Rule 4, Rule 3]

Page 53: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

• The HMM proposes SPs, which are validated using 8 rules.

[Figures: Rule 2, Rule 8]

Page 54: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

• The HMM proposes SPs, which are validated using 8 rules.

[Figures: Rule 7, Rule 6]

Page 55: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

• Applying the PSP validation rules:

Page 56: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

• The improvement is limited, as we have no solution for most of the under-segmentation caused by the HMM.

Page 57: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

• Spatial information is used to assign the secondary strokes to the nearest/overlapping main character.
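A minimal sketch of this assignment step is given below. It assumes each main character and each secondary stroke is summarized by a hypothetical `(x_min, x_max)` horizontal extent, prefers the character at the smallest horizontal distance, and breaks ties between overlapping characters by the larger overlap. The actual spatial criteria used by ATAOH are not detailed in the slides.

```python
def horizontal_distance(a, b):
    """Horizontal gap between two (x_min, x_max) extents; 0 when they overlap."""
    return max(0.0, max(a[0], b[0]) - min(a[1], b[1]))


def assign_secondary(main_chars, secondary_strokes):
    """Return, for each secondary stroke, the index of its main character.

    The nearest character wins; among equally near (e.g. overlapping)
    characters, the one with the larger horizontal overlap wins.
    """
    assignments = []
    for stroke in secondary_strokes:
        def key(i, stroke=stroke):
            char = main_chars[i]
            overlap = min(stroke[1], char[1]) - max(stroke[0], char[0])
            return (horizontal_distance(stroke, char), -overlap)
        assignments.append(min(range(len(main_chars)), key=key))
    return assignments
```

Irregular writing habits, such as dots placed far from their letter, are exactly the cases where a purely spatial rule like this one assigns a stroke to the wrong character.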

Page 58: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

• Applying our system to the test dataset:

Page 59: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

• The second HMM is used as an aligner (semi-automatic segmentation-annotation).
• Experiments are done using the best features & window parameters.
• HMM parameters optimization.
• Best HMM parameters: 24 states, 16 mixtures/state.
• We tried the new HMM design with a variable number of Gaussian mixtures per state (1st octave).

Page 60: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

• Again we notice a rapid enhancement in WSR and CSR compared to the common HMM design.
• We notice two peaks, at 34 states and 45 states. The 45-state HMM design is better on the writer/document level.

Page 61: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

• The PSP validation rules are modified to benefit from knowing the word truth.

[Figure: Rule 9]

Page 62: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

System results on the validation dataset:

Stage | WSR | WUSR | WOSR | WBSR | CSR
Reference | 73.35 | 0.00 | 0.00 | 26.65 | 88.85
R1-R5 | 81.74 | 0.00 | 2.10 | 16.17 | 93.44
R1-R5-R4 | 82.93 | 0.00 | 2.10 | 14.97 | 93.88
R1-R5-R4-R2 | 84.13 | 0.00 | 0.90 | 14.97 | 94.33
R1-R5-R4-R2-R8 | 85.03 | 0.00 | 0.90 | 14.07 | 94.46
R1-R5-R4-R2-R8-R7 | 92.81 | 0.00 | 0.90 | 6.29 | 96.75
R1-R5-R4-R2-R8-R7-R9 | 94.91 | 0.00 | 0.60 | 4.49 | 97.83
Secondary stroke restoration | 94.61 | 0.00 | 0.60 | 4.49 | 97.10

Page 63: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

System results on the test dataset:

Stage | WSR | WUSR | WOSR | WBSR | CSR
HMM output | 52.23 | 3.38 | 4.06 | 40.32 | 74.47
After PSP validation | 75.64 | 4.19 | 8.12 | 12.04 | 89.42
After dot restoration | 74.42 | 3.52 | 8.25 | 13.80 | 87.04

Page 64: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

• SPRR computed for the test dataset: 93.74%.
• Of the total number of test set words, 13.93% have a single SPR error, 5.55% have a double SPR error, 2.3% have a triple SPR error and 3.52% have dot restoration errors.

Page 65: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

Comparing system results in both HMM modes for the validation and test datasets, we conclude:

1. The PSP validation stage recovers 15-20% of mis-segmented words.
2. Dot restoration may cause a loss of 0-3% due to irregular writing habits.
3. Validation set results are higher than test set results, as the validation set's HMM output is higher.

Page 66: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

Conclusions:

1. Poor HMM results are due to the limited variability of the training PAWs and to odd writing styles.
2. Segmentation succeeds when a PAW has a reasonable number of obvious valleys.
3. No single classifier can achieve good results for all writers.
4. A classifier ensemble for writer clusters may accomplish the mission successfully.

Page 67: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

Performance comparison:

System | Segmentation points | SPRR
Kurniawan (2011) | 1902 SPs | 82.63%
Rehman Khan (2008) | 2936 SPs | 91.21%
Our system | 2859 SPs | 93.74%

Page 68: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Words Segmentation

Performance comparison:

System | Data | Result
Kavallieratou (2000) | 500 English and Greek words | 77.8% WSR
De Stefano (2002) | 1600 English words | 68% CSR
Abdulla (2008) | IFN/INIT and AHD/AUST | 90.58%, 95.66% WSR
Our system (recognition) | OHASD | 36.64% WSR, 71.36% CSR
Our system (aligner) | OHASD | 74.42% WSR, 87% CSR

Page 69: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text


Page 70: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

User Interfaces

• The Main GUI opens at start-up, showing the user all operations that can be done.

Page 71: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

User Interfaces

• The Word Extraction GUI appears when pressing the “Word Extraction” pushbutton on the Main GUI and specifying the document path.

Page 72: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

User Interfaces

• The Add Transcription GUI appears when pressing the “Transcript Data File” pushbutton on the Main GUI and specifying the document path.

Page 73: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

User Interfaces

• Annotation is done by entering the word truth in the ground truth text area.

Page 74: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

User Interfaces

• Automatic segmentation is performed by pressing the “Auto Segment” pushbutton.

Page 75: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

User Interfaces

• Manual segmentation correction is done by drawing lines with mouse clicks, using the “Manual Segment” pushbutton.

Page 76: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

User Interfaces

• The stroke data of each character model are calculated and displayed by pressing the 'Insert data' pushbutton.

Page 77: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

User Interfaces

• The 'CHECK' pushbutton plots each character model in a separate figure.

Page 78: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

User Interfaces

• In the output text file format, each word is indexed. The character names are listed in order (from right to left).
• Beside each character name, the stroke information is listed: prototype, number of stroke parts, stroke number(s), and start and end indices.

Page 79: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text


Page 80: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Annotation Performance Evaluation

• We used samples from the test dataset.
• AWAT: average word annotation time.
• ADAT: average document annotation time.

Performance comparison:

Annotator | Mode | AWAT | ADAT
Our test | Automation | 15.09 sec | 5.42 min
Our test | Manual | 26 sec | 9.26 min
Volunteers | Automation | 16.18 sec | 9.89 min
Volunteers | Manual | 32.75 sec | 16.20 min

Average time saved: 43% (our test); 51.5% per word and 40% per document (volunteers).

Page 81: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Annotation Performance Evaluation

The time saved depends on:

1. The number of characters per word
2. The number of words per document (database size)
3. The character overlapping (decorative writing styles)
4. The GUI compiler
5. The type of SPR error to be corrected
6. The automatic segmentation result

Page 82: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Annotation Performance Evaluation

• Model reliability test: compute the WSR and CSR variances among the validation dataset writers.
• The most robust model turned out to be the 44-state HMM.
• Although robust, this doesn't guarantee a higher result for the test dataset.
• Robust model (69.16% WSR, 85.72% CSR) compared to the best model (73.35% WSR, 88.85% CSR).
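The reliability test can be sketched as a comparison of per-writer variances. The snippet below is illustrative only: the function names are hypothetical, population variance is an assumed choice, and the rates in the usage note are made-up numbers rather than results from this work.

```python
from statistics import mean, pvariance


def most_robust_model(rates_by_model):
    """Pick the model whose per-writer rates vary the least.

    rates_by_model: {model name: [rate per writer]} (e.g. WSR per writer).
    """
    return min(rates_by_model, key=lambda name: pvariance(rates_by_model[name]))


def best_average_model(rates_by_model):
    """Pick the model with the highest average rate over writers."""
    return max(rates_by_model, key=lambda name: mean(rates_by_model[name]))
```

With made-up rates such as `{"hmm_a": [70.0, 72.0, 68.0, 71.0], "hmm_b": [95.0, 50.0, 90.0, 60.0]}`, the first function selects `hmm_a` and the second selects `hmm_b`, which mirrors the observation that the most robust model is not necessarily the best-scoring one.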

Page 83: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text


Page 84: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Conclusions and Future Work

By our work we aimed at:

1. Facilitating the development of annotated online datasets for Arabic recognizers.
2. Providing robust implementations of tools and algorithms.
3. Providing and using OHASD, the first sentence dataset of its type.

As future work we want to:

1. Extend and cluster the variability of writer samples.
2. Extend the dataset vocabulary to all words in Arabic lexica.
3. Collaborate with research groups to enhance the ATAOH tool.

Page 85: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Conclusions and Future Work

Our text line extraction utility:

1. Gives promising results.
2. Can be applied to off-line documents with minor changes.
3. Can be appropriate for use with English, French and Greek.

As future work we want to:

1. Propose solutions to open issues such as skew and touching lines.

Page 86: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Conclusions and Future Work

Our word extraction utility:

1. Achieves promising results for the validation dataset.
2. Obtains lower rates for the test dataset due to the excessive occurrence of overlapping and split-word problems.

As future work we want to:

1. Use natural language resources for context-based stick/split detection.

Page 87: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Conclusions and Future Work

Our word segmentation-annotation utility:

1. Achieves promising results when employing the HMM as an aligner (semi-automated annotation).
2. Shows remarkable performance with the new HMM design compared to the common HMM.
3. Benefits from the powerful rule-based PSP validation stage, which enhanced the HMM output results remarkably.

As future work we want to:

1. Use a large open-vocabulary database with a huge variety of words and writing styles.
2. Integrate different classifiers covering different divisions of the feature space.

Page 88: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

Conclusions and Future Work

Ultimately, we aim at:

1. Upgrading the tool to a generic toolkit for building online handwriting recognizer engines that are simple to integrate.
2. Adding plug-in tools for handwriting data collection and standard algorithms for preprocessing, feature extraction, pattern classification, and error analysis.

Page 89: A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text

"And the close of their prayer is: Praise be to God, Lord of the worlds."

"Praise be to God, who guided us to this; we would never have been guided had God not guided us."

Thank You