DCO-DS Boundary Activity: Data Extraction from Tables and Plots in Scanned PDF Publications

Congrui Li¹ (lic10@rpi.edu), John Erickson¹ (erickj4@rpi.edu), Xiaogang Ma¹ (max7@rpi.edu), Patrick West¹ ([email protected]), Mark Ghiorso² ([email protected]), and Peter Fox¹ (pfox@cs.rpi.edu)

¹ Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY 12180, United States

² OFM Research Inc, Seattle, WA 98115, United States

Abstract

Data reusability is of major importance in scientific research. Researchers often want to reuse data from older publications, dating from the 1960s or even earlier. However, those data are normally not ready for direct reuse because they are not yet in machine-readable formats, and the document formats in which they were published are commonly not geared toward reusability. The Portable Document Format (PDF) is particularly difficult to reuse, as it was never designed for this purpose. This DCO boundary activity focused on retrieving data from tables and plots in scanned PDF publications as efficiently and accurately as possible. Optical character recognition (OCR), the process of extracting machine-encoded characters from input images (usually scanned documents), is the key technique for this task. A variety of open source programs were tested for different use cases. Some issues remain to be solved; they are listed below.


Use Case of Data Extraction from Tables:

• original scanned PDF file
• image of Table I (.png)
• machine-readable .txt file after OCR (using PyTesser)
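After OCR, the plain-text output still has to be split back into rows and columns. The following is a minimal sketch of that step, not the method used in this activity. It assumes the OCR text keeps columns separated by a tab or by two or more spaces, which real OCR output may not do. The helper name `ocr_text_to_rows` and the sample text are hypothetical and for illustration only.

```python
import re

def ocr_text_to_rows(text):
    """Split OCR'd table text into rows of cells.

    Assumption: columns are separated by a tab or by two or more spaces.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(re.split(r"\s{2,}|\t", line))
    return rows

# Made-up sample standing in for an OCR'd .txt file (not data from the poster).
sample = "Sample   SiO2   MgO\nA-1      49.2   7.8\n\nA-2      51.0   6.4\n"
rows = ocr_text_to_rows(sample)
for row in rows:
    print(row)
```

Splitting on wide gaps is fragile when OCR collapses the spacing between columns, which is one reason preserving table structure remains an open problem.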

Available Tools for Data Extraction from Tables:
• PyTesser (https://code.google.com/p/pytesser/)
• OCRopus (https://code.google.com/p/ocropus/)
• TableSeer (http://tableseer.sourceforge.net/)
• ChemXSeerTableExtractor (http://chemxseer.ist.psu.edu/ChemXSeerTableExtractor/TableExtractorServlet)
• Apache Tika (http://tika.apache.org/)
• Google Docs (http://docs.google.com/)
• FreeOCR (http://www.paperfile.net/) (not open source)

Problems Left to Be Solved:
• precision of OCR, especially for irregular characters (superscripts, subscripts, Greek letters, math symbols, etc.)
• preservation of table structure after OCR
• automatic table detection
• manually double-checking the OCR results is very time consuming
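One way to reduce the cost of manual double-checking is to review only the cells that look suspicious. The sketch below is illustrative and not part of any tool listed above. It assumes numeric cells should contain only digits, an optional sign, and a decimal point, and it flags cells that mix digits with letters OCR often confuses with digits. The function name, the look-alike set, and the sample rows are all hypothetical.

```python
import re

NUMERIC = re.compile(r"[+-]?\d+(\.\d+)?")
# Letters that OCR commonly confuses with digits (illustrative, not exhaustive).
LOOKALIKES = set("OolISsBZ")

def cells_to_review(rows):
    """Return (row, col, text) for cells that mix digits with look-alike letters."""
    flagged = []
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            if NUMERIC.fullmatch(cell):
                continue  # a clean number needs no review under this rule
            has_digit = any(ch.isdigit() for ch in cell)
            has_lookalike = any(ch in LOOKALIKES for ch in cell)
            if has_digit and has_lookalike:
                flagged.append((r, c, cell))
    return flagged

# Made-up rows with two deliberately corrupted numbers.
rows = [["Sample", "X", "Y"], ["A", "6.96537", "4.OO383"], ["B", "l4.3473", "7.34124"]]
print(cells_to_review(rows))
```

This rule would also flag legitimate labels such as chemical formulas, so header rows and label columns should be excluded before applying it.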

Available Tools for Data Extraction from Plots:
• Plot Digitizer (http://plotdigitizer.sourceforge.net/)
  o use autotrace (http://autotrace.sourceforge.net/) to make it semi-automatic
• WebPlotDigitizer (http://arohatgi.info/WebPlotDigitizer/)
• Plot Digitizer (http://www.southalabama.edu/physics/software/plotdigitizer.htm)

Use Case of Data Extraction from Plots:
• original scanned PDF file
• image of Fig. 1 (.png)
• in Plot Digitizer, simply indicate where the line is on the plot with a thick paint brush; the program then attempts to automatically separate the data from the grid lines. This auto-digitizing feature depends on an image vectorization program called "autotrace".
• machine-readable .csv file output from Plot Digitizer with the autotrace feature (278 data points in total):

X          Y
6.96537    4.00383
9.41672    6.33851
14.3473    7.34124
19.2881    7.01065
26.7057    5.68145
31.5093    23.3506
31.5195    22.0173
33.9962    21.0187
35.2028    24.686
40.1766    20.0221
…          …
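The digitized points can then be loaded for reuse. The following is a minimal sketch using Python's standard `csv` module. It assumes the .csv output has an `X,Y` header row followed by comma-separated values; the exact layout of Plot Digitizer's export is an assumption here. The function name `load_points` is hypothetical, and the embedded rows are the first three points of the sample output.

```python
import csv
import io

def load_points(csv_file):
    """Read (x, y) pairs from a CSV file object with a header row, sorted by x."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the "X,Y" header (assumed layout)
    # Ignore empty rows; convert the first two fields of each row to floats.
    points = [(float(row[0]), float(row[1])) for row in reader if row]
    return sorted(points)

# First rows of the sample output, written here as comma-separated text.
sample = io.StringIO("X,Y\n6.96537,4.00383\n9.41672,6.33851\n14.3473,7.34124\n")
points = load_points(sample)
print(points)
```

For a real export, the same function can be given a file opened with `open(path, newline="")`, where `path` is whatever file name was chosen when saving from Plot Digitizer.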