Document Analysis Techniques for Automatic Electoral Document Processing: A Survey
J. Ignacio Toledo, Jordi Cucurull, Jordi Puiggalí, Alicia Fornés and Josep Lladós
VoteID 2015
4 September 2015
Introduction Preprocessing OMR ICR HWR Conclusions
Contents
1 Introduction
2 Preprocessing
   Binarization
   Skew Correction
3 Optical Mark Recognition
4 Intelligent Character Recognition
5 Handwriting Recognition
6 Conclusions
Document Analysis Techniques for Automatic Electoral Document Processing: A Survey J. Ignacio Toledo et al.
Introduction
Why paper voting?
Legal reasons: In some countries, introducing electronic voting would require legal modifications
Tradition: People are used to paper voting
User interface: The average citizen is an expert in using pen and paper
A first step: An automated process can be a first step towards improving voter privacy and the verifiability of paper-based elections by adapting techniques from electronic voting
Global Threshold Binarization
Otsu’s Method
Exhaustive search for a global threshold value that minimizes the intra-class variance of the background and foreground pixels
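The exhaustive search can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the 256 grey levels and the tiny test image are assumptions. It maximizes the between-class variance, which is equivalent to minimizing the intra-class variance.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that best separates the two classes.

    Maximising between-class variance is equivalent to minimising
    intra-class variance. `pixels` is a flat list of ints in [0, levels).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of the grey levels of those pixels
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, the split is meaningless
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        score = w0 * w1 * (mu0 - mu1) ** 2  # proportional to between-class variance
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(image, t):
    """Mark pixels as foreground (1) when they are at least as dark as t."""
    return [[1 if p <= t else 0 for p in row] for row in image]


# Hypothetical 3x3 grey-level patch: dark ink on light paper.
image = [[200, 210, 30], [220, 25, 35], [205, 215, 20]]
t = otsu_threshold([p for row in image for p in row])
binary = binarize(image, t)
```

Because a single threshold is used for the whole page, this approach assumes reasonably uniform illumination.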
Local Threshold Binarization
Sauvola’s Method
Determine an optimal threshold value for each pixel, depending on its neighborhood
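A minimal sketch of a per-pixel threshold, using the commonly cited Sauvola formula T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard deviation of the local window. The window size, k = 0.2 and R = 128 are assumed values, not taken from the slides, and the brute-force window scan is chosen for clarity rather than speed.

```python
import math


def sauvola_binarize(image, window=15, k=0.2, R=128.0):
    """Binarize with a per-pixel threshold T = m * (1 + k * (s / R - 1)).

    m and s are the mean and standard deviation of the grey levels in a
    window centred on the pixel (clipped at the image border).
    """
    h, w = len(image), len(image[0])
    half = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [image[j][i]
                    for j in range(max(0, y - half), min(h, y + half + 1))
                    for i in range(max(0, x - half), min(w, x + half + 1))]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            threshold = mean * (1 + k * (math.sqrt(var) / R - 1))
            out[y][x] = 1 if image[y][x] < threshold else 0  # 1 = ink
    return out


# Hypothetical 5x5 patch: light paper with a single dark pixel in the centre.
patch = [[200] * 5 for _ in range(5)]
patch[2][2] = 20
result = sauvola_binarize(patch, window=3)
```

In flat regions the standard deviation is close to zero, so the threshold drops well below the local mean and plain paper stays background.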
Vertical Projection Based Skew Correction
Compute the vertical projection histogram of the image at different rotation angles (e.g. from -5 to 5 degrees, at a resolution of 0.25 degrees)
The image with the highest standard deviation of the vertical projection histogram has the right orientation
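The search over candidate angles can be sketched as follows. This is an illustration under stated assumptions: the histogram is taken to count foreground pixels per image row, the page is represented by a list of foreground (x, y) coordinates instead of a rotated bitmap, and the function names and synthetic data are hypothetical.

```python
import math
from statistics import pstdev


def row_projection_std(points, skew_deg, height):
    """Std. deviation of the per-row pixel counts after undoing `skew_deg`."""
    a = math.radians(skew_deg)
    sin_a, cos_a = math.sin(a), math.cos(a)
    hist = [0] * height
    for x, y in points:
        row = int(round(y * cos_a - x * sin_a))  # row index after rotating by -skew
        if 0 <= row < height:
            hist[row] += 1
    return pstdev(hist)


def estimate_skew(points, height, max_deg=5.0, step=0.25):
    """Try every candidate angle and keep the one with the peakiest histogram."""
    n = int(round(2 * max_deg / step)) + 1
    candidates = [-max_deg + i * step for i in range(n)]
    return max(candidates, key=lambda a: row_projection_std(points, a, height))


# Synthetic page: three "text lines" drawn with a 2-degree skew
# (image coordinates, y grows downwards).
slope = math.tan(math.radians(2.0))
points = [(x, int(round(y0 + x * slope))) for y0 in (20, 40, 60) for x in range(400)]
detected = estimate_skew(points, height=100)
```

When the candidate matches the true skew, each text line collapses into a single histogram row, which is what makes the standard deviation peak.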
Hough Transform Based Skew Correction
The Hough Transform can detect lines in an image
We want to detect the skew angle of this ballot so we can correct it
The most common angle of the near-horizontal lines will be the skew angle
As a preprocessing step, remove all foreground pixels that do not have horizontal neighbours
(mathematical morphology erosion with a horizontal rectangle as the structuring element)
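A minimal sketch of this erosion on a binary image stored as nested lists. The structuring element is assumed to be one pixel high and `width` pixels wide, and pixels outside the image are treated as background; both choices are assumptions for illustration.

```python
def erode_horizontal(binary, width=3):
    """Erode with a 1 x `width` rectangle.

    A pixel survives only if it and its width // 2 neighbours on each
    side are all foreground, so thin vertical strokes disappear while
    horizontal lines are kept (shortened at both ends).
    """
    half = width // 2
    h, w = len(binary), len(binary[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(half, w - half):
            if all(binary[y][x + d] for d in range(-half, half + 1)):
                out[y][x] = 1
    return out


# Hypothetical example: a short vertical stroke and a horizontal line.
binary = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
]
eroded = erode_horizontal(binary, width=3)
```

A wider rectangle removes more text and keeps only longer horizontal segments, which helps the Hough step focus on printed rules and box edges.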
Compute the Hough Transform
Each line is a point in parameter space, described by its distance to the origin ρ and its angle θ
Threshold and Weighted Average
Discard the lines with a low number of pixels and perform a weighted average of the angles over the desired interval
In our example the detected skew angle is 1.5370 degrees
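The voting, thresholding and weighted average can be sketched as below. The normal parameterization ρ = x cos θ + y sin θ is standard, but the restriction to angles within ±5 degrees of horizontal, the 0.25-degree step, the `keep_ratio` threshold and the synthetic data are assumptions made for this illustration.

```python
import math
from collections import Counter


def hough_skew(points, max_skew=5.0, step=0.25, keep_ratio=0.8):
    """Estimate skew (degrees) from near-horizontal lines.

    Each foreground point votes for every (rho, skew) cell it could lie
    on, with theta = 90 + skew degrees. Weak cells are discarded and the
    skew is the vote-weighted average of the remaining cells.
    """
    votes = Counter()
    n_steps = int(round(2 * max_skew / step)) + 1
    for i in range(n_steps):
        skew = -max_skew + i * step
        theta = math.radians(90.0 + skew)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        for x, y in points:
            rho = int(round(x * cos_t + y * sin_t))
            votes[(rho, skew)] += 1
    strongest = max(votes.values())
    kept = [(skew, n) for (_, skew), n in votes.items()
            if n >= keep_ratio * strongest]
    return sum(s * n for s, n in kept) / sum(n for _, n in kept)


# Synthetic ballot rules: three lines drawn with a 1.5-degree skew.
slope = math.tan(math.radians(1.5))
points = [(x, int(round(y0 + x * slope))) for y0 in (20, 40, 60) for x in range(400)]
angle = hough_skew(points)
```

A finer angular step, or a weighted average over neighbouring cells, gives sub-step estimates such as the 1.5370 degrees reported on the slide.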
Ballots
Ballots with voting targets that voters should fill in
We have a database with the expected voting target coordinates
We want to decide if a voting target has been marked by the voter
Optical Mark Recognition
Image Difference Based OMR
We have an empty ballot as our model.
It is unskewed and correctly thresholded
We have voted ballot images.
They are unskewed, correctly thresholded and aligned with the ballot model
Compute the difference between the ballot we are examining and the model.
The output is noisy due to small misalignments.
Remove foreground pixels not surrounded (4-connectivity) by neighbours
(mathematical morphology erosion with a ’+’ kernel)
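The three steps (difference, erosion, decision per voting target) can be sketched as below. The difference is assumed to keep pixels that are ink in the voted ballot but not in the model, the target box format `(top, left, bottom, right)` and the `min_pixels` threshold are hypothetical, and the tiny images are made up for illustration.

```python
def image_difference(ballot, model):
    """Keep pixels that are foreground in the ballot but not in the model."""
    return [[1 if b and not m else 0 for b, m in zip(brow, mrow)]
            for brow, mrow in zip(ballot, model)]


def erode_plus(binary):
    """Keep a pixel only if it and its four direct neighbours are foreground."""
    h, w = len(binary), len(binary[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if (binary[y][x] and binary[y - 1][x] and binary[y + 1][x]
                    and binary[y][x - 1] and binary[y][x + 1]):
                out[y][x] = 1
    return out


def is_marked(cleaned, box, min_pixels=1):
    """Count surviving pixels inside a target box (top, left, bottom, right)."""
    top, left, bottom, right = box
    count = sum(cleaned[y][x] for y in range(top, bottom) for x in range(left, right))
    return count >= min_pixels


# Model: one printed line on row 0. Ballot: the same line shifted down by
# one row (a small misalignment) plus a 3x3 filled mark made by the voter.
model = [[0] * 7 for _ in range(7)]
model[0] = [1] * 7
ballot = [[0] * 7 for _ in range(7)]
ballot[1] = [1] * 7
for y in range(3, 6):
    for x in range(2, 5):
        ballot[y][x] = 1

cleaned = erode_plus(image_difference(ballot, model))
marked = is_marked(cleaned, (3, 2, 6, 5))
empty = is_marked(cleaned, (3, 5, 6, 7))
```

The misaligned printed line survives the difference but is one pixel thick, so the erosion removes it, while the core of the voter's mark remains.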
Style Based OMR
The most common OMR approaches rely on the number of foreground pixels to detect a mark
However, some commonly accepted marks have a recognizable shape
To detect those specific shapes, we can train specialized classifiers
Ballot Statements
Intelligent Character Recognition
A particular case of image classification
We want to find out the class a character image belongs to
(model matching techniques do not work due to high intra-class variability)
We need:
1 Features to describe the image that are robust to intra-class variability but discriminative
2 Classifiers that can deal with high-dimensional "noisy" data and are robust to outliers
Feature Representation
Histogram of Oriented Gradients (HOG)
1 Compute the gradients of the image
2 Divide it into small spatial regions, called "cells"
3 For each cell: accumulate a histogram of gradient magnitudes, using a fixed number of predefined bins for the gradient angle
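The three steps above can be sketched as follows. This is a simplified illustration: it uses central differences, unsigned angles in [0, 180) degrees, hard assignment to a single bin, and it omits the block normalization that full HOG descriptors usually apply. The cell size, bin count and test image are assumptions.

```python
import math


def hog_cells(image, cell=4, bins=9):
    """Return per-cell histograms of gradient magnitude, binned by angle."""
    h, w = len(image), len(image[0])
    rows, cols = h // cell, w // cell
    hist = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(rows * cell):
        for x in range(cols * cell):
            # Step 1: gradients by central differences (clamped at borders).
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned angle
            # Steps 2 and 3: find the cell, then vote into the angle bin.
            b = min(int(angle / (180.0 / bins)), bins - 1)
            hist[y // cell][x // cell][b] += magnitude
    return hist


# Hypothetical 8x8 image with a vertical edge between columns 3 and 4.
image = [[0, 0, 0, 0, 1, 1, 1, 1] for _ in range(8)]
hist = hog_cells(image, cell=4, bins=9)
features = [v for row in hist for cell_hist in row for v in cell_hist]
```

Concatenating the cell histograms, as in `features`, gives the fixed-length vector that is passed to the classifier.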
Classifier: Support Vector Machines
Support Vector Machines (SVM)
The best decision boundary canbe found by:
Minimizing the classification error
Maximizing the distance to the closest points of each class (margin)
To be able to separate n different classes, you must learn n one-versus-all classifiers
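A minimal sketch of the one-versus-all scheme with linear classifiers. Training here uses plain subgradient descent on the regularized hinge loss, which is a swapped-in simplification and not necessarily the solver used by the authors; the learning rate, regularization weight, epoch count and the 2-D toy data are all assumptions.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def train_linear_svm(X, y, lam=0.001, lr=0.01, epochs=500):
    """Fit w, b for labels in {-1, +1} by subgradient descent on
    lam / 2 * ||w||^2 + hinge loss (deterministic sample order)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (dot(w, xi) + b)
            w = [wj * (1 - lr * lam) for wj in w]   # regularisation: widen the margin
            if margin < 1:                           # hinge loss: penalise violations
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def train_one_vs_all(X, labels):
    """Learn one binary classifier per class: that class against the rest."""
    return {c: train_linear_svm(X, [1 if l == c else -1 for l in labels])
            for c in sorted(set(labels))}


def predict(models, x):
    """Pick the class whose classifier gives the highest score."""
    return max(models, key=lambda c: dot(models[c][0], x) + models[c][1])


# Toy 2-D features for three well-separated classes.
X = [(0, 0), (0.5, 0.5), (1, 0), (0, 1),
     (4, 0), (5, 0.5), (4, 1), (4.5, 0),
     (0, 4), (0.5, 5), (1, 4), (0, 4.5)]
labels = [0] * 4 + [1] * 4 + [2] * 4
models = train_one_vs_all(X, labels)
predictions = [predict(models, x) for x in X]
```

For digit recognition this would mean ten classifiers, each trained on the feature vectors of one digit against all the others.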
Deep Convolutional Neural Networks
Biologically inspired
Learn robust and discriminative features
Perform a non-linear classification
Prone to overfitting
Best error rate (0.23% on MNIST)
Connected Handwriting in electoral documents
Some different examples:
Amounts in text fields in ballot statements
Write-in fields in ballots
Observations fields in ballot statements
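Why is cursive handwriting recognition hard?
Sayre’s Paradox: a cursive word cannot be recognized without segmenting it into characters, and it cannot be segmented without recognizing it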
Feature extraction
Sliding Window
Features: Marti Features
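A minimal sketch of sliding-window feature extraction with a window one pixel wide. The transcript does not list the features, so the five computed here (ink count, centre of gravity, upper and lower contour, number of black/white transitions) are only a subset of the geometric features usually attributed to Marti and Bunke, and the handling of empty columns is an assumed convention.

```python
def column_features(binary):
    """Return one feature tuple per image column, left to right.

    Each tuple is (ink count, centre of gravity, upper contour row,
    lower contour row, number of black/white transitions).
    """
    h, w = len(binary), len(binary[0])
    mid = (h - 1) / 2  # assumed placeholder for columns without ink
    features = []
    for x in range(w):
        col = [binary[y][x] for y in range(h)]
        ink_rows = [y for y, v in enumerate(col) if v]
        if ink_rows:
            gravity = sum(ink_rows) / len(ink_rows)
            upper, lower = ink_rows[0], ink_rows[-1]
        else:
            gravity = upper = lower = mid
        transitions = sum(1 for a, b in zip(col, col[1:]) if a != b)
        features.append((len(ink_rows), gravity, upper, lower, transitions))
    return features


# Hypothetical 4x3 binary word image (1 = ink).
word = [
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
    [0, 1, 0],
]
sequence = column_features(word)
```

The result is a sequence of feature vectors, one per horizontal position, which is the input format expected by sequence models such as HMMs or BLSTM networks.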
Hidden Markov Models
At each timestep, an observation is generated by an unknown (hidden) state
State transition matrix
Emission probability associated with each state
Trained using the Baum-Welch algorithm
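How the transition matrix and the emission probabilities combine can be illustrated with the forward recursion, which computes the likelihood of an observation sequence and is one of the building blocks of Baum-Welch training. The sketch assumes discrete observations, and the two-state model with its probabilities is invented for the example.

```python
def forward_likelihood(observations, start, trans, emit):
    """Probability of an observation sequence under a discrete HMM.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability that state s generates observation o
    """
    n = len(start)
    alpha = [start[s] * emit[s][observations[0]] for s in range(n)]
    for obs in observations[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][obs]
                 for s in range(n)]
    return sum(alpha)


# Invented two-state model with two possible observation symbols.
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
likelihood = forward_likelihood([0, 1], start, trans, emit)
```

In handwriting recognition the observations would be the sliding-window feature vectors, typically modelled with continuous emission densities instead of the discrete table used here.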
Bidirectional Long Short Term Memory Network
Sequence Classifier: BLSTM+CTC
Open Vocabulary Handwriting Recognition
Character Error Rate: 18%
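Two pieces of this pipeline are easy to illustrate without a trained network: how a CTC output is collapsed into text, and how a character error rate is computed. The sketch assumes greedy (best-path) decoding over per-frame labels and a CER defined as edit distance divided by reference length; the blank symbol and example strings are made up.

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    """Collapse repeated labels, then drop blanks (best-path CTC decoding)."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)


def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def character_error_rate(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)


# One label per frame, as a network might output for the word "vote".
decoded = ctc_greedy_decode("--vv-oo-t-ee--")
cer = character_error_rate("vote", "vot")
```

The blank is what allows doubled letters: "l-l" decodes to "ll", while "ll" alone collapses to a single "l".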
Conclusions and Future Work
Preprocessing is important
OMR can do much more than just counting dark pixels
ICR error rates are at human level
Connected handwriting recognition can currently only be used in constrained scenarios
Test in realistic election scenarios
Questions?