OCRFeeder LinuxTag 2011

Upload: joaquim-rocha

Post on 18-Jan-2015

DESCRIPTION

The slides for the presentation about OCRFeeder given at LinuxTag 2011.

TRANSCRIPT

Page 1: OCRFeeder LinuxTag 2011

static void
_f_do_barnacle_install_properties (GObjectClass *gobject_class)
{
  GParamSpec *pspec;

  /* Party code attribute */
  pspec = g_param_spec_uint64 (F_DO_BARNACLE_CODE,
                               "Barnacle code.",
                               "Barnacle code",
                               0, G_MAXUINT64,
                               G_MAXUINT64 /* default value */,
                               G_PARAM_READABLE | G_PARAM_WRITABLE |
                               G_PARAM_PRIVATE);
  g_object_class_install_property (gobject_class,
                                   F_DO_BARNACLE_PROP_CODE,

Joaquim Rocha
[email protected]

OCRFeeder

Converting printed documents into digital formats

Berlin, May 2011

Page 2: OCRFeeder LinuxTag 2011

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

What is it?

Document Analysis and Optical Character Recognition

for GNOME

Page 3: OCRFeeder LinuxTag 2011

Why?

Paper has a number of problems

No applications for GNU/Linux to do a fair job

Page 4: OCRFeeder LinuxTag 2011

Paper problems: Security

CC Photo by: http://www.flickr.com/photos/badwsky/

Page 5: OCRFeeder LinuxTag 2011

Paper problems: Preservation

CC Photo by: http://www.flickr.com/photos/98469445@N00/

Page 6: OCRFeeder LinuxTag 2011

Paper problems: Data processing

CC Photo by: http://www.flickr.com/photos/hugovk/

Page 7: OCRFeeder LinuxTag 2011

Paper problems: Ecology

CC Photo by: http://www.flickr.com/photos/pranavsingh/

Page 8: OCRFeeder LinuxTag 2011

No fair conversion apps for GNU/Linux

apart from OCR engines, but...

Page 9: OCRFeeder LinuxTag 2011

OCR != Document Conversion

(it only deals with chars)
(does not consider the layout)
(does not distinguish contents)

Page 10: OCRFeeder LinuxTag 2011

What's needed is

Document Analysis and Recognition

(conversion of documents to an electronic format)

(first projects in the 80s)

Page 11: OCRFeeder LinuxTag 2011

Where were we at?

* Some closed solutions
* Only for proprietary systems
* Various prices
* Still... arguable results

Page 12: OCRFeeder LinuxTag 2011

How

Page 13: OCRFeeder LinuxTag 2011

So many layouts...

CC Photo by: http://www.flickr.com/photos/uber-tuber/

Page 14: OCRFeeder LinuxTag 2011

Layouts vary with the type of document

What works for detecting one layout won't work for others

Page 15: OCRFeeder LinuxTag 2011

OCRFeeder focuses on contents, not on layouts!

Page 16: OCRFeeder LinuxTag 2011

Key concept:

If a document image can be divided into windows of 1 (content) or 0 (not content), then it is possible to group all the 1s and outline the contents
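
The key concept can be sketched in a few lines of Python. This is only an illustration, not OCRFeeder's actual code: the function names, the fixed window size, the grayscale image given as a list of rows, and the rule that a window counts as content when any pixel differs from an assumed white background are all assumptions made for the example.

```python
from collections import deque


def classify_windows(image, window_size, background=255, tolerance=10):
    """Split the image into square windows and mark each one as
    1 (content) or 0 (not content).

    Assumption: `image` is a list of rows of grayscale values and a
    window holds content if any pixel differs from `background` by
    more than `tolerance`.
    """
    rows, cols = len(image), len(image[0])
    grid = []
    for top in range(0, rows, window_size):
        grid_row = []
        for left in range(0, cols, window_size):
            has_content = any(
                abs(image[y][x] - background) > tolerance
                for y in range(top, min(top + window_size, rows))
                for x in range(left, min(left + window_size, cols))
            )
            grid_row.append(1 if has_content else 0)
        grid.append(grid_row)
    return grid


def group_content(grid):
    """Group neighbouring 1-windows (4-connectivity) and return one
    bounding box (top, left, bottom, right) per group, in window units."""
    seen = set()
    boxes = []
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value != 1 or (r, c) in seen:
                continue
            # Breadth-first search over the connected 1-windows.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < len(grid) and 0 <= nx < len(grid[0])
                    if inside and grid[ny][nx] == 1 and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```

The boxes are expressed in window units; multiplying them by the window size gives the outline of each content area in pixels.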

Page 17: OCRFeeder LinuxTag 2011

Page 18: OCRFeeder LinuxTag 2011

Recognition:

System-wide OCR engines are used

Engines are configured from the GUI or XML files
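
As a rough idea of what an XML-based engine configuration could look like, here is a small Python sketch. The XML element names, the `$IMAGE` placeholder and the command line are hypothetical and chosen only for this example; they are not taken from OCRFeeder's real configuration files.

```python
import shlex
import xml.etree.ElementTree as ET

# Hypothetical engine description; the schema is an assumption.
ENGINE_XML = """
<engine>
    <name>Tesseract</name>
    <engine_path>/usr/bin/tesseract</engine_path>
    <arguments>$IMAGE stdout</arguments>
</engine>
"""


def load_engine(xml_text):
    """Read the engine name, executable path and argument template."""
    root = ET.fromstring(xml_text.strip())
    return {
        "name": root.findtext("name", default="").strip(),
        "path": root.findtext("engine_path", default="").strip(),
        "arguments": root.findtext("arguments", default="").strip(),
    }


def build_command(engine, image_path):
    """Replace the $IMAGE placeholder and return an argument list
    that could be passed to subprocess.run()."""
    arguments = [
        image_path if token == "$IMAGE" else token
        for token in shlex.split(engine["arguments"])
    ]
    return [engine["path"]] + arguments
```

Keeping the engine description in data rather than in code means that a new system-wide engine can be added without touching the application itself.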

Page 19: OCRFeeder LinuxTag 2011

Page 20: OCRFeeder LinuxTag 2011

The best-known free OCR engines are detected and configured automatically:

* Tesseract
* GOCR
* OCRAD
* Cuneiform
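
Automatic detection could be approximated by looking for each engine's executable on the PATH. The sketch below is an assumption about how such a check might be written, not a description of OCRFeeder's implementation; the `KNOWN_ENGINES` table and the `detect_engines` function are names invented for the example.

```python
import shutil

# Display name -> executable name usually installed by each engine.
KNOWN_ENGINES = {
    "Tesseract": "tesseract",
    "GOCR": "gocr",
    "OCRAD": "ocrad",
    "Cuneiform": "cuneiform",
}


def detect_engines(which=shutil.which):
    """Return {engine name: executable path} for every engine found.

    `which` is injectable so the lookup can be replaced in tests.
    """
    found = {}
    for name, executable in KNOWN_ENGINES.items():
        path = which(executable)
        if path is not None:
            found[name] = path
    return found
```

Calling `detect_engines()` with no arguments checks the current system; an empty dictionary means none of the four engines is installed.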

Page 21: OCRFeeder LinuxTag 2011

Export formats:

* ODT
* HTML
* Plain text

Page 22: OCRFeeder LinuxTag 2011

User interaction:

Users can edit everything and review the algorithm's results

So the UI can work in both attended and unattended modes

The CLI only works in unattended mode

Page 23: OCRFeeder LinuxTag 2011

Page 24: OCRFeeder LinuxTag 2011

Demo time!

Page 25: OCRFeeder LinuxTag 2011

Other features:

* PDF import
* Unpaper preprocessor
* Font style editing
* Image deskewing
* OCR results cleaning
* Project saving/loading

Page 26: OCRFeeder LinuxTag 2011

A11y:

* OCRFeeder is a very useful tool for visually impaired users

* Last year, the main target of its development was to improve a11y

Page 27: OCRFeeder LinuxTag 2011

Future:

* Integrate Ocropus as an alternative analysis backend

* More export formats: hOCR, PDF, etc.

* Make OCR engines' management easier

Page 28: OCRFeeder LinuxTag 2011

Webpage: http://live.gnome.org/OCRFeeder

git: http://git.gnome.org/ocrfeeder

Bugzilla: http://bugzilla.gnome.org (product: OCRFeeder)

Page 29: OCRFeeder LinuxTag 2011

Manual in German:

http://wiki.ubuntuusers.de/OCRFeeder

Page 30: OCRFeeder LinuxTag 2011

Thank you!