ExpressReader Pro adopted to retrodigitization of mathematical documents Kazuaki Yokota

Post on 20-Dec-2015


ExpressReader Pro adopted to retrodigitization of mathematical documents

Kazuaki Yokota

ExpressReader Pro

■ Printed Text OCR

■ Japanese / English

■ Recognition Rate

99.7% for Japanese

99.8% for English

■ Powerful Layout Analysis

■ for x86 based Windows PC

Features

Layout analysis 1

Layout analysis 2

Adoption for mathematical documents

■ Application framework

■ Detection and recognition of mathematical formula

■ Output format

Problems

Flow diagram

Image scanning

Skew correction

Layout analysis

Character recognition

User modification

Output conversion

Formula recognition

Formula detection

Component relation

Scanning

Graphical User Interface

INFTY formula recognition

Layout analysis

Character recognition

Formula detection

Formula detection 1

■ Score each word obtained by character recognition twice: once as a mathematical formula (M) and once as a text word (T).

M 0 90 100 100 0 90 70 90

T 100 40 20 20 100 40 70 90
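The scoring step can be sketched as follows. The M and T values are copied from the slide; the decision rule (compare the two scores, call a tie ambiguous) and the `margin` parameter are assumptions for illustration, not the product's documented method.

```python
# Scores copied from the slide: one column per word.
M_SCORES = [0, 90, 100, 100, 0, 90, 70, 90]    # score as mathematical formula
T_SCORES = [100, 40, 20, 20, 100, 40, 70, 90]  # score as text word


def classify(m_score, t_score, margin=0):
    """Label a word by comparing its two scores.

    The comparison rule and `margin` are assumptions for illustration.
    """
    if m_score - t_score > margin:
        return "math"
    if t_score - m_score > margin:
        return "text"
    return "ambiguous"


labels = [classify(m, t) for m, t in zip(M_SCORES, T_SCORES)]
print(labels)
# ['text', 'math', 'math', 'math', 'text', 'math', 'ambiguous', 'ambiguous']
```

The last two words have equal scores, so the scores alone cannot decide them; such cases need context from the neighbouring words.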

Formula detection 2

■ Parse with a context-free grammar (CFG). Formula is also a non-terminal symbol of this CFG.

XML based processing

■ Input: recognition parameters, image

■ While processing: layout information, etc.

■ Output: result

OCR needs various data while processing.

To integrate OCR into an application system, the user must write code to handle these data. Solution: unify the data as XML.

XML Based Processing

Layout analysis

Character recognition

Formula detection

Graphical User Interface

XML

XML

XML

Advantage of XML

■ Easy to convert to other formats (XSLT)

■ Easy to treat (DOM/SAX)

■ Extensible / Flexible

■ MathML

■ Platform independent
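As a rough illustration of the "easy to treat (DOM/SAX)" point, the sketch below streams an OCR result with Python's standard `xml.sax` module and rebuilds the recognized text from the `code` attributes. The element and attribute names follow the XML format shown on the following slides; the second `Character` element (its `rect` and `code`) is made up for the example, and `CharacterCollector` is a hypothetical helper.

```python
import xml.sax

# Fragment in the slide's format; the second Character is invented for the demo.
SAMPLE = b"""<OCR>
  <Document>
    <Sheet>
      <Area>
        <Text tag="paragraph" language="English">
          <Field>
            <Line rect="56,308,3257,392">
              <Character rect="56,332,96,392" code="0x67">g</Character>
              <Character rect="100,332,140,392" code="0x6f">o</Character>
            </Line>
          </Field>
        </Text>
      </Area>
    </Sheet>
  </Document>
</OCR>"""


class CharacterCollector(xml.sax.ContentHandler):
    """Collect the character code of every Character element."""

    def __init__(self):
        super().__init__()
        self.codes = []

    def startElement(self, name, attrs):
        if name == "Character":
            # Codes are written as hexadecimal strings such as "0x67".
            self.codes.append(int(attrs["code"], 16))


handler = CharacterCollector()
xml.sax.parseString(SAMPLE, handler)
text = "".join(chr(c) for c in handler.codes)
print(text)  # go
```

A streaming handler like this never holds the whole document in memory, which may matter for large scanned volumes.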

XML format 1

<OCR>
  <Parameter>
    ……Recognition Parameters
  </Parameter>
  <Document>
    <Sheet>
      <Area>
        <Text>
          ….. Recognized Results (After Recognition)
        </Text>
      </Area>
    </Sheet>
  </Document>
</OCR>

XML format 2

<Text tag="paragraph" language="English" line_direction="horz" rect="56,308,3258,714">
  <ExpText tag_id="0"/>
  <Field>
    <Line rect="56,308,3257,392">
      <Character rect="56,332,96,392" code="0x67">g
        <ExpCharacter original_code="0x67" offset="0" size="40"/>
        <Candidate id="1" code="0x67" sim="867"/>
      </Character>
      ……
    </Line>
  </Field>
</Text>
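A minimal sketch of reading this structure with Python's standard `xml.etree.ElementTree` is shown here. The XML is the slide's fragment with the elided part left out; reading `sim` as an integer similarity score for each candidate is an assumption, and `parse_rect` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<Text tag="paragraph" language="English" line_direction="horz" rect="56,308,3258,714">
  <ExpText tag_id="0"/>
  <Field>
    <Line rect="56,308,3257,392">
      <Character rect="56,332,96,392" code="0x67">g
        <ExpCharacter original_code="0x67" offset="0" size="40"/>
        <Candidate id="1" code="0x67" sim="867"/>
      </Character>
    </Line>
  </Field>
</Text>"""


def parse_rect(value):
    """Turn a string such as "56,332,96,392" into a tuple of four integers."""
    return tuple(int(v) for v in value.split(","))


root = ET.fromstring(FRAGMENT)
characters = []
for ch in root.iter("Character"):
    characters.append({
        "char": (ch.text or "").strip(),
        "rect": parse_rect(ch.get("rect")),
        # Each candidate: (character, sim); sim is assumed to be a score.
        "candidates": [
            (chr(int(c.get("code"), 16)), int(c.get("sim")))
            for c in ch.findall("Candidate")
        ],
    })

print(characters)
# [{'char': 'g', 'rect': (56, 332, 96, 392), 'candidates': [('g', 867)]}]
```

Keeping the candidate list next to each character is what would let a correction GUI offer alternatives for a misrecognized glyph.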

XML format 3

<Character rect="56,332,96,392" code="0x67">g
  <ExpCharacter original_code="0x67" offset="0" size="40"/>
  <Candidate id="1" code="0x67" sim="867"/>
</Character>
<Formula rect="98,332,205,392">
  <MathML>
    ….Mathematical formulae
  </MathML>
</Formula>
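Because formulae appear as `Formula` elements next to `Character` elements, an output converter can walk both in document order. The sketch below assumes the two are siblings inside a `Line` element and uses a placeholder MathML body (`<math><mi>x</mi></math>`), since the slide elides the actual formula content; `render` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# The Line wrapper and the MathML body are placeholders for illustration.
LINE = """<Line>
  <Character rect="56,332,96,392" code="0x67">g</Character>
  <Formula rect="98,332,205,392">
    <MathML><math><mi>x</mi></math></MathML>
  </Formula>
</Line>"""


def render(line):
    """Join recognized characters and embedded MathML in document order."""
    parts = []
    for node in line:
        if node.tag == "Character":
            parts.append((node.text or "").strip())
        elif node.tag == "Formula":
            math = node.find("MathML/math")
            # tostring() also emits trailing whitespace, so strip it.
            parts.append(ET.tostring(math, encoding="unicode").strip())
    return " ".join(parts)


output = render(ET.fromstring(LINE))
print(output)  # g <math><mi>x</mi></math>
```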

Demonstration

■ ….

Product form

■ Software Development Kit

■ Simple OCR Software

For x86 based Windows PC

Summary

■ A more convenient GUI is needed

■ We hope our product will make your business more efficient.