converting printed documents into digital formats with ocrfeeder (linuxtag 2011)

static void_f_do_barnacle_install_properties(GObjectClass

*gobject_class){

GParamSpec *pspec;

/* Party code attribute */ pspec = g_param_spec_uint64

(F_DO_BARNACLE_CODE, "Barnacle code.", "Barnacle code",

0, G_MAXUINT64,

G_MAXUINT64 /* default value */,

G_PARAM_READABLE | G_PARAM_WRITABLE |

G_PARAM_PRIVATE);

g_object_class_install_property (gobject_class,

F_DO_BARNACLE_PROP_CODE,

Joaquim [email protected]

OCRFeeder

Converting printed documents into digital formats

Berlin, May 2011

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

What is it?

Document Analysis and Optical Character Recognition

for GNOME


Why?

Paper has a number of problems

No applications for GNU/Linux to do a fair job


Paper problems:Security

CC Photo by: http://www.flickr.com/photos/badwsky/

http://www.flickr.com/photos/badwsky/


Paper problems:Preservation

CC Photo by: http://www.flickr.com/photos/98469445@N00/

http://www.flickr.com/photos/98469445@N00/


Paper problems:Data processing

CC Photo by: http://www.flickr.com/photos/hugovk/

http://www.flickr.com/photos/hugovk/


Paper problems:Ecology

CC Photo by: http://www.flickr.com/photos/pranavsingh/

http://www.flickr.com/photos/pranavsingh/


No fair conversion apps for GNU/Linux

apart from OCR engines, but...


OCR != Document Conversion

(it only deals with chars)(does not consider the layout)(does not distinguish contents)


What's needed is

Document Analysis and Recognition

(conversion of documents to an electronic format)

(first projects in the 80s)


Where are were we at?

* Some closed solutions* Only for proprietary systems

* Various prices* still... arguable results


How


So many layouts...

CC Photo by: http://www.flickr.com/photos/uber-tuber/

http://www.flickr.com/photos/uber-tuber/


Layouts vary with the type of document

What works on detecting one, won't work on others


OCRFeeder focuses on contents, not on layouts!


Key concept:

If a document image can be divided in windows of 1 (content)

or 0 (not content), then it is possible to group all the

1s and outline the contents


Recognition:

System-wide OCR engines are used

Engines are configured from the GUI or XML files


Most known free OCR engines are detected and configured

automatically:

* Tesseract* GOCR

* OCRAD* Cuneiform


Exportation formats:

ODTHTML

Plain text


User interaction:

Users can edit everythingand review the algorithm's results

So, UI can work in attended and unattended ways

CLI only works in an unattended mode


Demo time!


Other features:

* PDF importation* Unpaper preprocessor

* Font style edition* Image deskewing

* OCR results cleaning* Project saving/loading


A11y:

* OCRFeeder is a very useful tool for visually impaired users

* Last year, the main target of its development was to improve a11y


Future:

* Integrate Ocropus as an alternative analysis backend

* More exportation formats: HOCR, PDF, etc.

* Make OCR engines' management easier


Webpage:http://live.gnome.org/OCRFeeder

git:http://git.gnome.org/ocrfeeder

Bugzilla:http://bugzilla.gnome.orgproduct: OCRFeeder

http://live.gnome.org/OCRFeeder

http://git.gnome.org/ocrfeeder


Manual in German:

http://wiki.ubuntuusers.de/OCRFeeder


Thank you!

converting printed documents into digital formats with ocrfeeder (linuxtag 2011)

Technology