OCRFeeder - OCR made easy on GNOME (GUADEC 2012)
DESCRIPTION
By Joaquim Rocha. Many documents are still stored only on paper, which causes problems of preservation, flexibility and even ecology. Current Free Software OCR engines achieve good accuracy when converting printed text to digital form, but they perform only that basic conversion and know nothing about a document's structure and elements. OCRFeeder presents itself as an easy-to-use solution for GNOME that automatically detects the contents of a page, allows manual correction, and uses the system-wide OCR engines to convert the text. It can export documents in various formats such as ODT, HTML or PDF. This project stands as the most complete Free Software solution for converting printed documents to digital formats and competes with the proprietary alternatives.
TRANSCRIPT
![Page 1: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/1.jpg)
Joaquim Rocha
OCRFeeder
OCR Made Easy on GNOME
July 27 2012
![Page 2: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/2.jpg)
Joaquim Rocha (Igalia) · OCRFeeder · GUADEC 2012
What is it?
Document Analysis and Optical Character Recognition
for GNOME
![Page 3: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/3.jpg)
Why?
Paper has a number of problems
No GNU/Linux applications do a fair job
![Page 4: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/4.jpg)
Paper problems: Security
CC Photo by: http://www.flickr.com/photos/badwsky/
![Page 5: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/5.jpg)
Paper problems: Preservation
CC Photo by: http://www.flickr.com/photos/98469445@N00/
![Page 6: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/6.jpg)
Paper problems: Data processing
CC Photo by: http://www.flickr.com/photos/hugovk/
![Page 7: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/7.jpg)
Paper problems: Ecology
CC Photo by: http://www.flickr.com/photos/pranavsingh/
![Page 8: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/8.jpg)
Paper problems: Accessibility
CC Photo by: http://www.flickr.com/photos/illustrator/
![Page 9: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/9.jpg)
No fair conversion apps for GNU/Linux
apart from OCR engines, but...
![Page 10: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/10.jpg)
OCR != Document Conversion
(it only deals with chars)
(does not consider the layout)
(does not distinguish contents)
![Page 11: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/11.jpg)
What's needed is
Document Analysis and Recognition
(conversion of documents to an electronic format)
(first projects in the 80s)
![Page 12: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/12.jpg)
![Page 13: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/13.jpg)
![Page 14: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/14.jpg)
How it works
![Page 15: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/15.jpg)
So many layouts...
CC Photo by: http://www.flickr.com/photos/uber-tuber/
![Page 16: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/16.jpg)
Layouts vary with the type of document
What works for detecting one won't work for the others
![Page 17: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/17.jpg)
OCRFeeder focuses on contents, not on layouts!
![Page 18: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/18.jpg)
Key concept:
If a document image can be divided into windows marked 1 (content)
or 0 (not content), then it is possible to group all the
1s and outline the contents
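The key concept above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OCRFeeder's actual implementation: the image is a plain list of rows of grayscale values, a window counts as content when it holds at least one dark pixel, and the 1s are grouped with a simple 4-connected flood fill. The function names, the `threshold` and `min_dark` parameters, and the grouping strategy are all hypothetical.

```python
from collections import deque


def classify_windows(image, window, threshold=128, min_dark=1):
    """Split the image into square windows and mark each as 1 or 0.

    `image` is a list of rows of grayscale values (0 = black, 255 = white).
    A window is marked 1 (content) when it holds at least `min_dark` pixels
    darker than `threshold`; both parameters are illustrative assumptions.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    grid = []
    for top in range(0, rows, window):
        grid_row = []
        for left in range(0, cols, window):
            dark = sum(
                1
                for y in range(top, min(top + window, rows))
                for x in range(left, min(left + window, cols))
                if image[y][x] < threshold
            )
            grid_row.append(1 if dark >= min_dark else 0)
        grid.append(grid_row)
    return grid


def outline_contents(grid):
    """Group 4-connected 1-cells and return one bounding box per group.

    Boxes are (top, left, bottom, right) tuples in window units, inclusive.
    """
    seen = set()
    boxes = []
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell != 1 or (r, c) in seen:
                continue
            # Breadth-first flood fill starting from this unvisited 1-cell.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < len(grid) and 0 <= nx < len(grid[ny])
                    if inside and grid[ny][nx] == 1 and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# A tiny 4x6 "page": one dark pixel top-left, a dark blob bottom-right.
W, B = 255, 0
page = [
    [B, W, W, W, W, W],
    [W, W, W, W, W, W],
    [W, W, W, W, W, B],
    [W, W, W, W, B, B],
]
grid = classify_windows(page, window=2)   # [[1, 0, 0], [0, 0, 1]]
boxes = outline_contents(grid)            # [(0, 0, 0, 0), (1, 2, 1, 2)]
```

Each resulting box outlines one block of content, which could then be cropped from the page and handed to an OCR engine or kept as an image.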
![Page 19: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/19.jpg)
![Page 20: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/20.jpg)
Recognition:
System-wide OCR engines are used
Engines are configured from the GUI or XML files
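To give an idea of what configuring an engine from an XML file could look like, here is a minimal Python sketch. The XML element names (`name`, `path`, `arguments`), the `$IMAGE` placeholder, and the `example-ocr` executable are assumptions made for illustration only; they are not taken from OCRFeeder's real configuration schema.

```python
import shlex
import xml.etree.ElementTree as ET
from string import Template

# Hypothetical engine description; the schema is an assumption for this sketch.
EXAMPLE_XML = """
<engine>
    <name>ExampleOCR</name>
    <path>/usr/bin/example-ocr</path>
    <arguments>--input $IMAGE</arguments>
</engine>
"""


def load_engine(xml_text):
    """Read the engine name, executable path and argument template."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("name"),
        "path": root.findtext("path"),
        "arguments": root.findtext("arguments", default=""),
    }


def build_command(engine, image_path):
    """Fill in the image path and return an argv list for the engine."""
    # Quote the path so that spaces survive the shlex.split below.
    arguments = Template(engine["arguments"]).substitute(
        IMAGE=shlex.quote(image_path)
    )
    return [engine["path"], *shlex.split(arguments)]


engine = load_engine(EXAMPLE_XML)
command = build_command(engine, "page 1.png")
```

The resulting list can be passed to `subprocess.run(command, capture_output=True, text=True)` to collect whatever text the engine prints.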
![Page 21: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/21.jpg)
![Page 22: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/22.jpg)
The best-known free OCR engines are detected and configured
automatically:
* Tesseract
* GOCR
* OCRAD
* Cuneiform
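One simple way to detect installed engines is to look for their executables on the `PATH`. The sketch below does that with Python's standard `shutil.which`. It is an illustration, not OCRFeeder's own detection code, and it assumes the engines are installed under their usual executable names (`tesseract`, `gocr`, `ocrad`, `cuneiform`).

```python
import shutil

# Engine name -> executable name it is commonly installed under
# (an assumption about the local system, not taken from OCRFeeder).
KNOWN_ENGINES = {
    "Tesseract": "tesseract",
    "GOCR": "gocr",
    "OCRAD": "ocrad",
    "Cuneiform": "cuneiform",
}


def detect_engines(which=shutil.which):
    """Return {engine name: executable path} for every engine found.

    `which` is injectable so the lookup can be replaced in tests.
    """
    found = {}
    for name, executable in KNOWN_ENGINES.items():
        path = which(executable)
        if path is not None:
            found[name] = path
    return found


if __name__ == "__main__":
    for name, path in detect_engines().items():
        print(f"{name}: {path}")
```

Engines that are not installed are simply left out of the result, so the caller can offer only the ones that are actually available.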
![Page 23: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/23.jpg)
Export formats:
* ODT
* HTML
* Plain text
* PDF
![Page 24: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/24.jpg)
User interaction:
Users can edit everything and review the algorithm's results
So the UI can work in attended or unattended mode
The CLI works only in unattended mode
![Page 25: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/25.jpg)
![Page 26: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/26.jpg)
Demo time!
![Page 27: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/27.jpg)
Other features:
* PDF import
* Unpaper preprocessor
* Font style editing
* Image deskewing
* OCR results cleaning
* Project saving/loading
![Page 28: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/28.jpg)
Future:
* More export formats: hOCR, etc.
* Make OCR engine management easier
![Page 29: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/29.jpg)
Webpage: http://live.gnome.org/OCRFeeder
git: http://git.gnome.org/ocrfeeder
Bugzilla: http://bugzilla.gnome.org (product: OCRFeeder)
![Page 30: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)](https://reader036.vdocuments.mx/reader036/viewer/2022081401/5595096a1a28ab57068b468f/html5/thumbnails/30.jpg)
Thank you!