OCR at INIS


Post on 25-Feb-2016


DESCRIPTION

OCR at INIS. INIS Training Seminar 7-11 October 2013, Vienna, Austria. Branko Krznari, INIS Unit (based on the presentation by Yves Reynaud). Outline: What is OCR? OCR Objectives. Principles. Techniques. Software.

TRANSCRIPT

OCR at INIS Database Production & Imaging Group Yves Reynaud Y.Reynaud-Pulido@iaea.org

OCR at INIS
INIS Training Seminar 7-11 October 2013, Vienna, Austria
Branko Krznari (based on the presentation by Yves Reynaud)
INIS Unit
IAEA, International Atomic Energy Agency
International Nuclear Information System (INIS)

Outline
What is OCR?
OCR Objectives
Principles
Techniques
Software

What is OCR?

(source: pcmag.com)

Optical Character Recognition (OCR)
OCR is the conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. [1]
Make digitized images of printed documents searchable.
Font encoding issues.

Notes: OCR definition can be

OCR Objectives
We can find the needle in the haystack

OCR offers a basic search from an unstructured document.
OCR adds extra value to your image.
OCR brings your digitized collection to life.

OCR Techniques
Pre-processing
De-skew
Despeckle
Binarization (optional)
Line removal
Layout analysis (zoning)
Post-processing (dictionary)
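Three of the steps listed under OCR Techniques can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not how any particular OCR product works: the global threshold of 128, the "no inked neighbour" despeckle rule, the tiny sample patch, the vocabulary, and the 0.7 similarity cutoff are all hypothetical choices made for the example.

```python
import difflib
from typing import List

Image = List[List[int]]  # grayscale rows, 0 = black ... 255 = white


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Global threshold (assumed value): 1 marks ink, 0 marks background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def despeckle(binary: Image) -> Image:
    """Clear ink pixels that have no ink among their 8 neighbours."""
    height, width = len(binary), len(binary[0])
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if not binary[y][x]:
                continue
            neighbours = sum(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                cleaned[y][x] = 0
    return cleaned


def correct_word(word: str, vocabulary: List[str]) -> str:
    """Dictionary post-processing: swap in the closest known word, if any."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.7)
    return matches[0] if matches else word


# Hypothetical 5x6 grayscale patch: a vertical stroke plus one stray dark pixel.
patch = [
    [250, 20, 240, 250, 250, 250],
    [250, 30, 250, 250, 250, 250],
    [250, 25, 250, 250, 40, 250],
    [250, 20, 250, 250, 250, 250],
    [250, 35, 250, 250, 250, 250],
]

ink = despeckle(binarize(patch))
for row in ink:
    print("".join("#" if px else "." for px in row))

# Hypothetical vocabulary; "rn" misread for "m" is a typical OCR confusion.
VOCABULARY = ["document", "digital", "archive"]
print(correct_word("docurnent", VOCABULARY))
```

The stray pixel disappears because it has no inked neighbour, while the stroke survives because each of its pixels touches another one. Real pre-processing usually relies on adaptive thresholds and more careful noise filters.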

Scanned vs. Vector Image
Notes: What we want by OCRing is to give the raster documents some of the characteristics of the vector image.

Do not look at the trees (letters), try to see the forest (sentences)
F0R 488UR1N6 7H3 L0N63V17Y 0F 1NF0RM4710N, P3RH4P8 7H3 M087 1MP0R74N7 R0L3 1N 7H3 0P3R4710N 0F 4 D16174L 4RCH1V3 18 M4N461N6 7H3 1D3N717Y, 1N736R17Y 4ND QU4L17Y 0F 7H3 4RCH1V38 1783LF 48 4 7RU873D 80URC3 0F 7H3 CUL7UR4L R3C0RD.
Notes: Try to read the paragraph as if it were written by an illiterate person. The paragraph is composed of 190 characters; 105 have been changed, thus 55% are not the correct characters. Nevertheless, without special training we can read them and understand the content. We can do this because we can easily adapt the forms of the characters to a certain context.

Verdana Font
FOR ASSURING THE LONGEVITY OF INFORMATION, PERHAPS THE MOST IMPORTANT ROLE IN THE OPERATION OF A DIGITAL ARCHIVE IS MANAGING THE IDENTITY, INTEGRITY AND QUALITY OF THE ARCHIVES ITSELF AS A TRUSTED SOURCE OF THE CULTURAL RECORD.
Notes: This is the same text in the well-known Verdana font.

Brush Script MT (Windows Font)
FOR ASSURING THE LONGEVITY OF INFORMATION, PERHAPS THE MOST IMPORTANT ROLE IN THE OPERATION OF A DIGITAL ARCHIVE IS MANAGING THE IDENTITY, INTEGRITY AND QUALITY OF THE ARCHIVES ITSELF AS A TRUSTED SOURCE OF THE CULTURAL RECORD.
Notes: Windows fonts do not always make it easier...

PCs: OCR compares patterns and selects the closest match. It can be forced to a specific context, but requires customization.

Humans: People adapt to circumstances and can circumvent misspellings if context is clear.
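The digit-for-letter paragraph shown earlier gives a concrete picture of what forcing a context can mean. The sketch below assumes the context "this is prose, so every digit is really a look-alike letter" and applies a fixed substitution table. The table and the function name are assumptions made for this example, not taken from the slides, and the counts it prints depend on what is counted as a character, so they need not match the figures quoted in the notes.

```python
# Assumed look-alike table (digit glyph -> letter it resembles).
LOOKALIKES = {"0": "O", "1": "I", "3": "E", "4": "A", "6": "G", "7": "T", "8": "S"}

GARBLED = (
    "F0R 488UR1N6 7H3 L0N63V17Y 0F 1NF0RM4710N, P3RH4P8 7H3 M087 "
    "1MP0R74N7 R0L3 1N 7H3 0P3R4710N 0F 4 D16174L 4RCH1V3 18 M4N461N6 "
    "7H3 1D3N717Y, 1N736R17Y 4ND QU4L17Y 0F 7H3 4RCH1V38 1783LF 48 4 "
    "7RU873D 80URC3 0F 7H3 CUL7UR4L R3C0RD."
)


def read_as_words(text: str) -> str:
    """Force a 'letters only' context: map each digit to its look-alike letter."""
    return text.translate(str.maketrans(LOOKALIKES))


decoded = read_as_words(GARBLED)

# Count substituted characters, ignoring spaces.
changed = sum(1 for a, b in zip(GARBLED, decoded) if a != b)
visible = sum(1 for ch in GARBLED if not ch.isspace())

print(decoded)
print(f"{changed} of {visible} non-space characters were substituted")
```

Without the forced context the same glyphs would be ambiguous: "18" could be the number eighteen or the word "IS". This is why the customization mentioned in the comparison matters for OCR software, while a human reader resolves it without effort.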

Notes: We knew we had words to be read and not formulas to decipher.

True or false
Usually, printed text is adequately sampled if each line is at least two pixels in thickness:

Notes: Let's OCR this paragraph to see the results.

Zoom in

Notes: Selected words are not complete in shape.

Zoom in
Notes: We can recognize the single pixels and which parts are missing...

Results from OCR
It is in this context that I

and an additional protocol on the basis
Notes: Most OCR programs can too; the text has been properly recognized.
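The two-pixel rule of thumb posed in the "True or false" slide can be explored with simple arithmetic: a stroke of a given physical width covers width / 25.4 × dpi pixels when scanned. The sketch below is a worked example only. The 0.2 mm stroke width and the three resolutions are hypothetical values chosen for illustration, not measurements from the scanned paragraph.

```python
MM_PER_INCH = 25.4


def stroke_pixels(stroke_mm: float, dpi: int) -> float:
    """Number of pixels covering a stroke of the given width at a scan resolution."""
    return stroke_mm / MM_PER_INCH * dpi


def min_dpi(stroke_mm: float, min_pixels: float = 2.0) -> float:
    """Lowest resolution at which the stroke is at least min_pixels thick."""
    return min_pixels * MM_PER_INCH / stroke_mm


# Hypothetical thin stroke of 0.2 mm, checked at three common scan resolutions.
STROKE_MM = 0.2
for dpi in (150, 300, 600):
    px = stroke_pixels(STROKE_MM, dpi)
    verdict = "meets the two-pixel rule" if px >= 2 else "undersampled"
    print(f"{dpi} dpi: {px:.2f} px ({verdict})")

print(f"minimum resolution for {STROKE_MM} mm strokes: {min_dpi(STROKE_MM):.0f} dpi")
```

As the zoomed-in slides suggest, the rule is a guideline, not a guarantee: glyphs with missing pixels can still be recognized when enough of the shape and context survives.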

Chinese Raster Image (scanned)

Notes: OCR works not only for Latin and Cyrillic.

Chinese Vector Image (OCR)
Notes: Patterns can be recognized if we use the proper context.

Arabic Raster Image (scanned)

Notes: OCR programs are now affordable for everyone.

Arabic Vector Image (OCR)
Notes: As OCR technology develops and PC power increases.

Japanese Raster Image (scanned)

Japanese Vector Image (OCR)

Font Encoding

Font Encoding (cont.)

OCR Software
Abbyy FineReader (multilingual OCR)
Adobe Acrobat
InftyReader

Abbyy FineReader (interface)


InftyReader - an OCR System for Math Documents
Notes: We will be able to reach new standards.

Reference
[1] Optical character recognition. http://en.wikipedia.org/wiki/Optical_character_recognition. Retrieved 2013-09-23.

Thank you!