Projects CS 661

Upload: buddy-beasley

Post on 18-Dec-2015

Page 1: Projects CS 661. DAS 02, Princeton, NJ OCR Features and Systems –Degradation models, script ID, Bilingual OCR, Kannada OCR, Tamil OCR, mp versus hw checks,

Projects

CS 661

DAS 02, Princeton, NJ

• OCR Features and Systems
– Degradation models, script ID, bilingual OCR, Kannada OCR, Tamil OCR, machine-printed (mp) versus handwritten (hw) checks, traffic ticket reading
• Handwriting Recognition
– Stochastic models, holistic methods, Japanese OCR
• Classifiers and Learning
– Multi-classifier systems
• Layout Analysis
– Skew correction, geometric methods, text/graphics separation, logical labeling
• Tables and Forms
– Detecting tables in HTML documents, use of graph grammars, semantics
• Text Extraction
• Indexing and Retrieval
• Document Engineering
• New Applications
– CAPTCHA, Tachograph chart system, accessing driving directions

ICDAR 03, Edinburgh, UK

• Multiple Classifiers
• Postal Automation and Check Processing
• Document Understanding
• HMM Classifiers
• Segmentation
• Character Recognition
• Graphics Recognition
• Non-Latin Alphabets
– Kanji/Chinese, Korean/Hangul, Arabic/Indian
• Web Documents, Video
• Word Recognition
• Image Processing
• Writer Identification
• Forms and Tables

Project Assignments

Faisal Farooq: Multilingual Digital Library (indexing, retrieval, script discrimination)
Swapnil Khedekar: Multilingual document layout analysis, OCR
Kompalli Surya: Multilingual OCR using HMMs
Lei Hansheng: Off-line and on-line handwriting integration and matching
Sumit Manocha: Fingerprint image enhancement and minutiae extraction
Lin Yu-Hsuan **: Multiple classifier combination (multiple modalities)
Praveer Mansukhani: Interactive Handwriting Recognition Model
Amalia Rusu: Handwritten CAPTCHAs
Sutanto Adi **: Indirect biometric data extraction from medical forms

Multilingual Digital Library

Figure labels: Query Input, Control Panel, Query Result

Telugu and Arabic modules under development

Multilingual DIA and OCR

Text/Image Separation

Intervals between peaks

Line Separation

• Ascenders & descenders interfering with lines
• Region-growing approach
• In Devanagari, a single word is a single connected component
• Grow regions using horizontally adjacent components
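The region-growing idea above can be sketched roughly as follows. This is only an illustration, assuming each connected component (a word, in Devanagari) is already available as a bounding box; the `Box` type, the `max_gap` threshold, and the vertical-overlap test are assumptions made for the example, not details given on the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Bounding box of one connected component (hypothetical representation)."""
    left: int
    top: int
    right: int
    bottom: int


def vertical_overlap(a: Box, b: Box) -> int:
    """Number of pixel rows shared by the two boxes (0 if none)."""
    return max(0, min(a.bottom, b.bottom) - max(a.top, b.top))


def grow_lines(boxes: List[Box], max_gap: int = 40) -> List[List[Box]]:
    """Group components into text lines by growing regions left to right.

    A component joins the line whose rightmost component it overlaps most
    vertically, provided it starts within max_gap pixels of that component.
    Otherwise it starts a new region.
    """
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b.left):
        candidates = [
            line for line in lines
            if vertical_overlap(line[-1], box) > 0
            and box.left - line[-1].right <= max_gap
        ]
        if candidates:
            best = max(candidates, key=lambda line: vertical_overlap(line[-1], box))
            best.append(box)
        else:
            lines.append([box])
    # Report lines from top to bottom.
    return sorted(lines, key=lambda line: line[0].top)
```

Choosing the line with the largest vertical overlap is one simple way to limit the effect of ascenders and descenders that reach into a neighbouring line; a real system would likely need a more careful rule.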

Word Separation

• In Devanagari, all characters in a word are glued together by the Shirorekha (headline)

• The vertical projection profile easily separates words
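A minimal sketch of word separation with a vertical projection profile is shown below. It assumes the text line has already been binarized into rows of 0/1 values (1 = ink); the function names and the `min_gap` parameter are hypothetical. Because the Shirorekha joins the characters of a word, columns inside a word are rarely empty, so empty column runs mark word gaps.

```python
from typing import List, Optional, Tuple


def vertical_projection(line_img: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in every column of a binary text-line image."""
    return [sum(column) for column in zip(*line_img)]


def split_words(line_img: List[List[int]], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges of words, with end exclusive.

    Runs of at least min_gap empty columns are treated as word boundaries;
    shorter empty runs stay inside the current word.
    """
    profile = vertical_projection(line_img)
    words: List[Tuple[int, int]] = []
    start: Optional[int] = None  # first column of the word being built
    last_ink = -1                # most recent column that contained ink
    for x, count in enumerate(profile):
        if count == 0:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            # The empty run before this column is wide enough: close the word.
            words.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        words.append((start, last_ink + 1))
    return words
```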

Multilingual OCR using HMMs

Continuous Attributes

grapheme pos orientation angle

Down cusp

3.0 -90°

Up loop

Down arc

Stochastic Model

Observations

Integrating Online and Offline Handwriting Recognition

Structural Features

BAG

Figure labels: Junction, Loops, Loop, Turns, End, End

Feature Extraction and Ordering

Critical node: removal disconnects a connected component.

2-degree critical nodes keep feature ordering from left to right.

Figure labels: Left Component, Right Component, Loop, End, Turns, Junction, Loops, End, Turns
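A critical node as defined on this slide matches what graph theory calls an articulation point (cut vertex). The sketch below finds such nodes with the standard depth-first low-link method and then keeps those of degree 2, which split the graph into a left and a right component. It assumes the stroke structure is given as a simple undirected graph in an adjacency dictionary; the node names and function names are hypothetical.

```python
from typing import Dict, List, Optional, Set


def critical_nodes(graph: Dict[str, Set[str]]) -> Set[str]:
    """Return nodes whose removal disconnects their connected component."""
    disc: Dict[str, int] = {}  # discovery order of each visited node
    low: Dict[str, int] = {}   # earliest discovered node reachable from the subtree
    found: Set[str] = set()

    def dfs(node: str, parent: Optional[str]) -> None:
        disc[node] = low[node] = len(disc)
        children = 0
        for nbr in graph[node]:
            if nbr == parent:
                continue
            if nbr in disc:
                # Back edge: the subtree can reach an earlier node.
                low[node] = min(low[node], disc[nbr])
            else:
                children += 1
                dfs(nbr, node)
                low[node] = min(low[node], low[nbr])
                # A non-root node is critical if some child subtree
                # cannot reach above it.
                if parent is not None and low[nbr] >= disc[node]:
                    found.add(node)
        # The DFS root is critical only if it has several DFS children.
        if parent is None and children > 1:
            found.add(node)

    for node in graph:
        if node not in disc:
            dfs(node, None)
    return found


def two_degree_critical_nodes(graph: Dict[str, Set[str]]) -> List[str]:
    """Critical nodes with exactly two neighbours, sorted by name."""
    return sorted(n for n in critical_nodes(graph) if len(graph[n]) == 2)
```

The recursive search is fine for graphs the size of a single word; very large graphs would need an iterative version.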

Fingerprint Enhancement and Feature Extraction

Fingerprint Recognition

Orientation maps and minutiae detection

Preprocessing Operations

Filtering

• Image Enhancement
• Image Segmentation
• Correlation among fingers

Multiple Classifier Systems

Combination and Dynamic Selection [Govindaraju and Ianakiev, MCS 2000]

Figure labels: image, Lexicon, WR 1, WR 2, WR 3, Top 50, <55, Top 5, 1

• Optimization problem
• Combinatorial explosion in
– arrangement of recognizers
– lexicon reduction levels
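One way to picture the optimization problem is a serial cascade in which each word recognizer (WR) ranks the surviving lexicon entries and passes only the top few to the next stage. The sketch below is an illustration under that assumption, not the method of the cited paper; `cascade`, `count_serial_configurations`, and the recognizer signature are hypothetical names.

```python
from math import factorial
from typing import Callable, List, Sequence

# A recognizer scores how well a lexicon entry matches the image (higher is better).
Recognizer = Callable[[str, str], float]


def cascade(image: str, lexicon: Sequence[str],
            recognizers: Sequence[Recognizer],
            keep: Sequence[int]) -> List[str]:
    """Run recognizers in series, shrinking the lexicon after each stage.

    Stage i ranks the surviving entries with recognizers[i] and keeps the
    top keep[i] of them, for example keep=(50, 5, 1).
    """
    if len(recognizers) != len(keep):
        raise ValueError("one lexicon-reduction level is needed per recognizer")
    survivors = list(lexicon)
    for recognize, k in zip(recognizers, keep):
        survivors.sort(key=lambda word: recognize(image, word), reverse=True)
        survivors = survivors[:k]
    return survivors


def count_serial_configurations(n_recognizers: int, n_levels: int) -> int:
    """Count serial orderings times per-stage reduction-level choices.

    This ignores parallel and hybrid arrangements, so it is a lower bound
    on the size of the search space.
    """
    return factorial(n_recognizers) * n_levels ** n_recognizers
```

Even this restricted count grows quickly: the number of orderings grows factorially with the number of recognizers, and the number of level choices grows exponentially.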

Lexicon Density [Govindaraju, Slavik, and Xue, IEEE PAMI 2002]

Lexicon 1    Lexicon 2
Me           Me
He           Memo
So           Memory
To           Memoirs
In           Mellon

Interactive Handwriting Recognition

Handwriting Recognition

Context Ranked Lexicon

Multiple Choice Question

Context Ranked Lexicon

Interactive Models [McClelland and Rumelhart, Psychological Review, 1981]

Words: ABLE, TRIP, TRAP
Letters: A, T, N
Features

Handwritten CAPTCHAs

“CAPTCHAs”: Completely Automated Public Turing Tests to Tell Computers & Humans Apart

• challenges can be generated & graded automatically (i.e. the judge is a machine)
• accepts virtually all humans, quickly & easily
• rejects virtually all machines
• resists automatic attack for many years (even assuming that its algorithms are known?)

NOTE: the machine administers, but cannot pass the test!

L. von Ahn, M. Blum, N.J. Hopper, J. Langford, “CAPTCHA: Using Hard AI Problems For Security,” Proc., EuroCrypt 2003, Warsaw, Poland, May 4-8, 2003 [to appear].

Yahoo!’s present CAPTCHA: “EZ-Gimpy”

• Randomly pick: one English word, deformations, degradations, occlusions, colored backgrounds, etc.
• Better tolerated by users
• Now used on a large scale to protect various services
• Weaknesses: a single typeface, English lexicon
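The generate-and-grade loop behind a word-based CAPTCHA of this kind can be sketched as follows. This is not Yahoo!'s implementation: image rendering is left out, and the `Challenge` fields, parameter ranges, and background colours are invented for illustration. The point is only that both picking the challenge and judging the answer are fully automatic.

```python
import random
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Challenge:
    word: str            # the answer the user must type
    rotation_deg: float  # hypothetical deformation parameter
    noise_level: float   # hypothetical degradation parameter
    background: str      # hypothetical background colour


def make_challenge(lexicon: Sequence[str], rng: random.Random) -> Challenge:
    """Randomly pick one word plus rendering parameters for the image."""
    return Challenge(
        word=rng.choice(lexicon),
        rotation_deg=rng.uniform(-15.0, 15.0),
        noise_level=rng.uniform(0.0, 0.3),
        background=rng.choice(["white", "yellow", "lightblue"]),
    )


def grade(challenge: Challenge, response: str) -> bool:
    """Automatic grading: the judge only compares strings."""
    return response.strip().lower() == challenge.word.lower()
```

A renderer would turn the `Challenge` into a distorted image. The grader never needs to read that image, which is why the machine can administer the test without being able to pass it.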

Indirect Biometrics from Medical Forms Images

The Biometrics Spectrum

Hard biometrics: Face; Eye (Retina & Iris); Fingerprint; Hand Geometry; Handwriting; Speech; DNA
Soft biometrics: Age; Ethnicity; Nationality; Build; Gait; Mannerisms; Writing style (Semantic)
Derived biometrics: Text/News; WWW
Indirect biometrics: Driver’s License; Medical Records; INS Forms

• Biometric Consortium (www.biometrics.org) lists several products:
– Faces (30); Fingerprints (50); Hand geometry (30); Handwriting (5); Iris (5); Multimodal (6); Retinal (2); Vein (3); Voice (22); Other (20)
– NONE on soft biometrics
– NONE on the fusion of indirect and derived biometrics

NYS EMS PCR Form

NYS PCR Example

Thousands are filed a day. Passed from EMS to Hospital.

PCR Purpose:
– Medical care/diagnosis
– Legal Documentation
– Quality Assurance

EMS Abbreviations
COPD: Chronic Obstructive Pulmonary Disease
CHF: Congestive Heart Failure
D/S: Dextrose in Saline
PID: Pelvic Inflammatory Disease
GSW: Gunshot Wound
NKA: No known allergies
KVO: Keep vein open
NaCl: Sodium Chloride
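Once text has been recognized from a PCR form, an abbreviation table like the one above can be applied as a simple lookup. The sketch below is a minimal illustration: the dictionary holds only the abbreviations listed on the slide, and the function name, token pattern, and case-insensitive matching are assumptions, not part of the project as described.

```python
import re
from typing import Dict

# Abbreviations listed on the slide; keys are upper-cased for matching.
EMS_ABBREVIATIONS: Dict[str, str] = {
    "COPD": "Chronic Obstructive Pulmonary Disease",
    "CHF": "Congestive Heart Failure",
    "D/S": "Dextrose in Saline",
    "PID": "Pelvic Inflammatory Disease",
    "GSW": "Gunshot Wound",
    "NKA": "No known allergies",
    "KVO": "Keep vein open",
    "NACL": "Sodium Chloride",
}

# A token is a run of letters, optionally followed by "/" and more letters (e.g. D/S).
_TOKEN = re.compile(r"[A-Za-z]+(?:/[A-Za-z]+)?")


def expand_abbreviations(text: str) -> str:
    """Replace known EMS abbreviations with their expansions.

    Tokens that are not in the table are left unchanged.
    """
    def replace(match: "re.Match[str]") -> str:
        token = match.group(0)
        return EMS_ABBREVIATIONS.get(token.upper(), token)

    return _TOKEN.sub(replace, text)
```

Matching case-insensitively tolerates recognizer output such as "NaCL" or "Nacl", at the cost of possibly expanding an ordinary word that happens to collide with an abbreviation.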

Medical Text Recognition and Data Mining