on combining multiple segmentations in scene text recognition

1

On Combining Multiple Segmentations in Scene Text Recognition

Lukáš Neumann and Jiří MatasCentre for Machine Perception, Department of Cybernetics

Czech Technical University, Prague

Neumann, Matas, ICDAR 2013

1. End-to-End Scene Text Recognition - Problem Introduction

2. The TextSpotter System 3. Character Detection as Extremal Region (ERs)

Selection4. Line formation & Character Recognition5. Character Ordering6. Optimal Sequence Selection7. Experiments

Talk Overview

2/21


Input: Digital image (BMP, JPG, PNG) / video (AVI)Lexicon-free methodOutput: Set of words in the imageword = (horizontal) rectangular bounding box, text content

End-to-End Scene Text Recognition

Bounding Box=[240;1428;391;1770]Content="TESCO"

3/21


1. Multi-scale Character Detection [1] with Gaussian Pyramid (new)

2. Text Line Formation [2]3. Character Recognition [3]4. Optimal Sequence Selection (new)

4/21

System Overview

[1] L. Neumann, J. Matas, “Real-time scene text localization and recognition”, CVPR 2012[2] L. Neumann, J. Matas, “Text localization in real-world images using efficiently pruned exhaustive search”, ICDAR 2011[3] L. Neumann, J. Matas, “A method for text localization and recognition in real-world images”, ACCV 2010


Character Detection - Thresholding

Input image(PNG, JPEG,

BMP)

1D projection<0;255>

(grey scale, hue,…)

Extremal regions with threshold

( =50, 100, 150, 200)

5/21


Let image I be a mapping I: Z2 S Let S be a totally ordered set, e.g. <0, 255>

Let A be an adjacency relation (e.g. 4-neigbourhood)

Region Q is a contiguous subset w.r.t. A (Outer) Region Boundary δQ is set of

pixels adjacent but not belonging to Q Extremal Region is a region where

there exists a threshold that separates the region and its boundary

: pQ,qQ : I(p) < I(q)

6/21

Extremal Regions (ER)

= 32

Assuming character is an ER, 3 parameters still have to be determined:

1. Threshold2. Mapping to a totally order set (colour space

projection)3. Adjacency relation


Character boundaries are often fuzzy It is very difficult to locally determine the threshold value, typical

document processing pipeline (image binarization OCR) leads to inferior results

Thresholds that most probably correspond to a character segmentation are selected using a CSER classifier [1], multiple hypotheses for each character are generated

ER Detection - Threshold Selection

[1] L. Neumann and J. Matas, “Real-time scene text localization and recognition”, CVPR 2012

7/21


p(r|character) estimated at each threshold for each region

Only regions corresponding to local maxima selected by the detector

Incrementally computed descriptors used for classification [1]◦ Aspect ratio◦ Compactness◦ Number of holes◦ Horizontal crossings

Trained AdaBoost classifier with decision trees calibrated to output probabilities

Linear complexity, real-time performance (300ms on an 800x600px image)

8/21

ER Detection – Threshold Selection



Color space projection maps a color image into a totally ordered set Trade-off between recall and speed (although can be easily

parallelized) Standard channels (R, G, B, H, S, I) of RGB / HSI color space 85.6% characters detected in the Intensity channel, combining all

channels increases the recall to 94.8%

ER Detection - Color Space Projection

Source Image Intensity Channel

(no threshold exists for the

letter “A”)

Red Channel

9/21


Pre-processing with a Gaussian pyramid alters the adjacency relation

At each level of the pyramid only a certain interval of character stroke widths is amplified

Not a major overhead as each level is 4 times faster than the previous one, total processing takes ~ 4/3 of the first level (1 + ¼ + ¼2 …)

10/21

ER Detection - Gaussian Pyramid

Characters formed of multiple

small regions

Multiple characters joint

together


Regions agglomerated into text lines hypotheses by exhaustive search [1]

Each segmentation (region) labeled by a FLANN classifier trained on synthetic data [2]

Multiple mutually exclusive segmentations with different label(s) present in each text line hypothesis

11/21

Character Recognition

P

A m

nilI f f

n

[1] Neumann, Matas, Text localization in real-world images using efficiently pruned exhaustive search, ICDAR 2011[2] Neumann, Matas, A method for text localization and recognition in real-world images”, ACCV 2010


Region A is a predecessor of a region B if A immediately precedes B in a text line

Approximated by a heuristic function based on text direction and mutual overlap

The relation induces a directed graph for each text line

Character Ordering

12/21


The final region sequence of each text line is selected as an optimal path in the graph, maximizing the total score

Unary terms ◦ Text line positioning (prefers regions which “sit nicely” in the text line)◦ Character recognition confidence

Binary terms (regions pair compatibility score)◦ Threshold interval overlap (prefers that neighboring regions have similar threshold)◦ Language model transition probability (2nd order character model)

Optimal Sequence Selection

Accommodation

13/21


Experiments

pipeline recall precision f time / image

SM+SS 45.9 69.8 55.4 1.87sSM+MS 55.5 75.2 63.8 2.35sSWT+SS 38.0 66.0 48.0 0.60sSWT+MS 41.0 80.0 54.0 0.84sMLM+SS 62.1 85.9 72.0 2.52sMLM+MS 67.5 85.4 75.4 3.10s

Single Maximum (SM) Segmentation with the highest CSER score

Multiple Local Maxima (MLM)

Segmentations which correspond to local maxima of the CSER score

Stroke Width Transform (SWT)

Reimplementation of character detector based on Epshtein et al. [1]

SS = Single Scale MS = Multiple Scales (Gaussian Pyramid)

[1] B. Epshtein, E. Ofek, and Y. Wexler, “Detecting text in natural scenes with stroke width transform”, CVPR 2010

ICDAR 2011 Dataset – Text Localization

14/21

Neumann, Matas, ICDAR 2013 15/21

Experiments

pipeline recall precision fProposed method 67.5 85.4 75.4Shi’s method [1] 63.1 83.3 71.8Kim’s method [2] (ICDAR 2011 winner) 62.5 83.0 71.3Neumann & Matas [3] 64.7 73.1 68.7Yi’s Method [4] 58.1 67.2 62.3TH-TextLoc System [5] 57.7 67.0 62.0

ICDAR 2011 Dataset – Text Localization

[1] C. Shi, C. Wang, B. Xiao, Y. Zhang, and S. Gao, “Scene text detectionusing graph model built upon maximally stable extremal regions”, Pattern Recognition Letters, 2013[2] A. Shahab, F. Shafait, and A. Dengel, “ICDAR 2011 robust readingcompetition challenge 2: Reading text in scene images”, ICDAR 2011[3] L. Neumann and J. Matas, “Real-time scene text localization and recognition”, CVPR 2012[4] C. Yi and Y. Tian, “Text string detection from natural scenes by structure-based partition and grouping”, Image Processing, 2011 [5] S. M. Hanif and L. Prevost, “Text detection and localization in complex scene images using constrained adaboost algorithm”, ICDAR 2009


Experiments

pipeline recall precision fProposed method 37.8 39.4 38.5Neumann & Matas (CVPR 2012) [1] 37.2 37.1 36.5

ICDAR 2011 Dataset – End-to-End Text Recognition

Percentage of words correctly recognized without any error – case-sensitive comparison (ICDAR 2003 protocol)


16/21


Sample Results on the ICDAR 2011 Dataset

chipscut

CABOTPLACF

FREEDON

17/21


Multi-scale processing / Gaussian Pyramid improves text localization results without a significant impact on speed

Combining several channels and postponing the decision about character detection parameters (e.g. binarization threshold) to a later stage improves localization and OCR accuracy

Method current state◦ The method placed second in ICDAR 2013 Text Localization

competition, 1.4% worse than the winner (f-measure)(unfortunately, end-to-end text recognition is not part of the competition)

◦ Online demo available at http://www.textspotter.org/◦ OpenCV implementation of the character detector in progress by the

open source community Future work

◦ OCR accuracy improvement◦ Overcoming limitations of CC-based methods (e.g. non-linearity non-

robustness caused by a single pixel)

Conclusions

18/21

http://www.textspotter.org/

on combining multiple segmentations in scene text recognition

Documents

character boundaries

character segmentation

color image

threshold value

800x600px image

extremal regions er

qextremal region

aouter region boundary