converting printed documents into digital formats with ocrfeeder (linuxtag 2011)

30
static void _f_do_barnacle_install_properties(GObjectClass *gobject_class) { GParamSpec *pspec; /* Party code attribute */ pspec = g_param_spec_uint64 (F_DO_BARNACLE_CODE, "Barnacle code.", "Barnacle code", 0, G_MAXUINT64, G_MAXUINT64 /* default value */, G_PARAM_READABLE | G_PARAM_WRITABLE | G_PARAM_PRIVATE); g_object_class_install_property (gobject_class, F_DO_BARNACLE_PROP_CODE, Joaquim Rocha [email protected] OCRFeeder Converting printed documents into digital formats Berlin, May 2011

Upload: igalia

Post on 29-Nov-2014

634 views

Category:

Technology


1 download

DESCRIPTION

By Joaquim Rocha. Even with all the existing alternatives, nowadays a lot of information is still printed on paper. From historical documents to places that only recently started to use digital documents in their usual workflow, this information is conditioned by the limitations of paper: hard to index and search, risk of deterioration, ecological implications, etc. Fortunately the state of the art Optical Character Recognition engines, including Free Software solutions, have high success rates in converting printed text into a digital format. However, these command-line tools only convert graphics in text, usually not taking into consideration the layout of the documents analyzed. This means that a regular page with a mixture of text in columns and eventual graphics will be read as if it was only text. There are some solutions that do take into account the structure of the documents besides performing OCR but these are proprietary and commercial and usually do not a version for Linux. OCRFeeder attempts to solve this problem. It automatically tries to outline the structure of the document (using its own algorithm), detect between graphics and text and performs OCR. Its main exportation format is ODT but it can also export to HTML and save or load projects. It is also possible to use different system-wide OCR engines in the same document and manually override any automatic action (for example, to correct its results, etc.). OCRFeeder is published under GPL v3 and was thought to be used mainly in the GNOME desktop environment and it's developed in its infrastructure. It stands as the only Free Software solution to provide a complete and easy to use graphical user interface application to convert printed documents. When used with the Orca screen reader, OCRFeeder is also a useful application for the visually impaired since it enables a printed document to be converted and read by Orca. Thus, during 2010, the main focus of OCRFeeder's development was the improvement of its accessibility Currently, the main features of OCRFeeder are: - Detection of the contents in a document's page; - Classification of those contents (graphics or text); - Deskew of images; - Importation from a scanner device or PDF; - Exportation to ODT or HTML; - Manual edition of any results. - Save and load projects. In this presentation I will explain how OCRFeeder's contents detection algorithm works, how a usual workflow to convert a document should be and give an overview of the main features of OCRFeeder with a demo.

TRANSCRIPT

Page 1: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

static void_f_do_barnacle_install_properties(GObjectClass

*gobject_class){

GParamSpec *pspec;

/* Party code attribute */ pspec = g_param_spec_uint64

(F_DO_BARNACLE_CODE, "Barnacle code.", "Barnacle code",

0, G_MAXUINT64,

G_MAXUINT64 /* default value */,

G_PARAM_READABLE | G_PARAM_WRITABLE |

G_PARAM_PRIVATE);

g_object_class_install_property (gobject_class,

F_DO_BARNACLE_PROP_CODE,

Joaquim [email protected]

OCRFeeder

Converting printed documents into digital formats

Berlin, May 2011

Page 2: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

What is it?

Document Analysis and Optical Character Recognition

for GNOME

Page 3: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Why?

Paper has a number of problems

No applications for GNU/Linux to do a fair job

Page 4: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Paper problems:Security

CC Photo by: http://www.flickr.com/photos/badwsky/

Page 5: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Paper problems:Preservation

CC Photo by: http://www.flickr.com/photos/98469445@N00/

Page 6: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Paper problems:Data processing

CC Photo by: http://www.flickr.com/photos/hugovk/

Page 7: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Paper problems:Ecology

CC Photo by: http://www.flickr.com/photos/pranavsingh/

Page 8: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

No fair conversion apps for GNU/Linux

apart from OCR engines, but...

Page 9: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

OCR != Document Conversion

(it only deals with chars)(does not consider the layout)(does not distinguish contents)

Page 10: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

What's needed is

Document Analysis and Recognition

(conversion of documents to an electronic format)

(first projects in the 80s)

Page 11: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Where are were we at?

* Some closed solutions* Only for proprietary systems

* Various prices* still... arguable results

Page 12: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

How

Page 13: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

So many layouts...

CC Photo by: http://www.flickr.com/photos/uber-tuber/

Page 14: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Layouts vary with the type of document

What works on detecting one, won't work on others

Page 15: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

OCRFeeder focuses on contents, not on layouts!

Page 16: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Key concept:

If a document image can be divided in windows of 1 (content)

or 0 (not content), then it is possible to group all the

1s and outline the contents

Page 17: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Page 18: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Recognition:

System-wide OCR engines are used

Engines are configured from the GUI or XML files

Page 19: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Page 20: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Most known free OCR engines are detected and configured

automatically:

* Tesseract* GOCR

* OCRAD* Cuneiform

Page 21: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Exportation formats:

ODTHTML

Plain text

Page 22: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

User interaction:

Users can edit everythingand review the algorithm's results

So, UI can work in attended and unattended ways

CLI only works in an unattended mode

Page 23: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Page 24: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Demo time!

Page 25: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Other features:

* PDF importation* Unpaper preprocessor

* Font style edition* Image deskewing

* OCR results cleaning* Project saving/loading

Page 26: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

A11y:

* OCRFeeder is a very useful tool for visually impaired users

* Last year, the main target of its development was to improve a11y

Page 27: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Future:

* Integrate Ocropus as an alternative analysis backend

* More exportation formats: HOCR, PDF, etc.

* Make OCR engines' management easier

Page 28: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Webpage:http://live.gnome.org/OCRFeeder

git:http://git.gnome.org/ocrfeeder

Bugzilla:http://bugzilla.gnome.orgproduct: OCRFeeder

Page 29: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Manual in German:

http://wiki.ubuntuusers.de/OCRFeeder

Page 30: Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Thank you!