DCO-DS Boundary Activity: Data Extraction from Tables and Plots in Scanned PDF Publications

Congrui Li1, John Erickson1, Xiaogang Ma1, Patrick West1, Mark Ghiorso2, and Peter Fox1

1 Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY 12180, United States

2 OFM Research Inc, Seattle, WA 98115, United States

Abstract: Data reusability is of major importance in scientific research. We often want to reuse data from older publications, from the 1960s or earlier. However, the data in those publications are usually not ready for direct reuse because they are not in machine-readable form, and the document formats used are rarely geared toward reusability. The Portable Document Format (PDF) is particularly difficult to reuse, as it was never designed for this purpose. This DCO boundary activity focused on retrieving data from tables and plots in scanned PDF publications as efficiently and accurately as possible. Optical character recognition (OCR), the process of extracting machine-encoded characters from input images (usually scanned documents), is the key technique for this task. A variety of open source programs were tested for different use cases. Some issues remain to be solved; they are listed below.

Use Case of Data Extraction from Tables:

original scanned PDF file

image of Table I (.png)

machine readable .txt file after OCR (using PyTesser)

Available Tools for Data Extraction from Tables:
• PyTesser (https://code.google.com/p/pytesser/)
• OCRopus (https://code.google.com/p/ocropus/)
• TableSeer (http://tableseer.sourceforge.net/)
• ChemXSeerTableExtractor (http://chemxseer.ist.psu.edu/ChemXSeerTableExtractor/TableExtractorServlet)
• Apache Tika (http://tika.apache.org/)
• Google Docs (http://docs.google.com/)
• FreeOCR (http://www.paperfile.net/) (not open source)

Problems Left to Be Solved:
• precision of OCR, especially for irregular characters (superscripts, subscripts, Greek letters, math symbols, etc.)
• preservation of table structure after OCR
• automatic table detection
• manually double-checking the OCR results is very time consuming
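The table-structure and manual-checking problems can be illustrated with a minimal Python sketch. It is not part of the tested tools. It assumes that the OCR output keeps column gaps as runs of two or more spaces and that the first column holds row labels; the function names and the sample text (with a letter "O" and a letter "l" misread in place of digits) are hypothetical.

```python
import re

def split_row(line):
    """Split one OCR'd text line into cells on runs of two or more whitespace characters."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]

def is_number(cell):
    """Return True if the cell parses as a floating-point number."""
    try:
        float(cell)
        return True
    except ValueError:
        return False

def flag_suspect_cells(text, header_rows=1):
    """Return (rows, suspects).

    rows is a list of cell lists; suspects lists (row, column, cell) for every
    data cell outside the first column that does not parse as a number.
    """
    rows = [split_row(line) for line in text.splitlines() if line.strip()]
    suspects = []
    for r, row in enumerate(rows[header_rows:], start=header_rows):
        for c, cell in enumerate(row):
            if c == 0:
                continue  # assumption: the first column is a text label
            if not is_number(cell):
                suspects.append((r, c, cell))
    return rows, suspects

# Hypothetical OCR output, invented for illustration only.
sample = """Sample    T (C)    P (kbar)
A1        1200     10.5
A2        12O0     1l.0
"""

rows, suspects = flag_suspect_cells(sample)
for r, c, cell in suspects:
    print(f"row {r}, column {c}: check {cell!r}")
```

Only the flagged cells would then need to be compared against the scanned image, which may reduce the manual checking effort. Misreads that still parse as numbers are not caught by this check.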

Available Tools for Data Extraction from Plots:
• Plot Digitizer (http://plotdigitizer.sourceforge.net/)
  o use autotrace (http://autotrace.sourceforge.net/) to make it semi-automatic
• WebPlotDigitizer (http://arohatgi.info/WebPlotDigitizer/)
• Plot Digitizer (http://www.southalabama.edu/physics/software/plotdigitizer.htm)

Use Case of Data Extraction from Plots:

original scanned PDF file

image of Fig. 1 (.png)

In Plot Digitizer, simply indicate where the line is on the plot with a thick paint brush; the program then attempts to automatically separate the data from the grid lines. This auto-digitizing feature depends on an image vectorization program called "autotrace".

machine readable .csv file output from Plot Digitizer with the autotrace feature (278 data points in total):

X         Y
6.96537   4.00383
9.41672   6.33851
14.3473   7.34124
19.2881   7.01065
26.7057   5.68145
31.5093   23.3506
31.5195   22.0173
33.9962   21.0187
35.2028   24.686
40.1766   20.0221
…         …
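Once the digitizer output exists as text, it can be loaded and sanity-checked programmatically. The following is a minimal Python sketch, not part of Plot Digitizer itself. It assumes two numeric fields per line, separated by commas or whitespace; the function name `parse_points` is hypothetical, and the sample holds only the first ten of the 278 points from the use case.

```python
import re

def parse_points(text):
    """Parse lines of 'X Y' (comma- or whitespace-separated) into (x, y) float pairs.

    Lines that do not hold exactly two numeric fields, such as the header,
    are skipped.
    """
    points = []
    for line in text.splitlines():
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if len(fields) != 2:
            continue
        try:
            points.append((float(fields[0]), float(fields[1])))
        except ValueError:
            continue  # non-numeric line, e.g. the "X Y" header
    return points

# First ten rows of the digitized output from the use case.
sample = """X Y
6.96537 4.00383
9.41672 6.33851
14.3473 7.34124
19.2881 7.01065
26.7057 5.68145
31.5093 23.3506
31.5195 22.0173
33.9962 21.0187
35.2028 24.686
40.1766 20.0221
"""

points = parse_points(sample)
xs = [x for x, _ in points]
# A traced curve is expected to come out ordered along the X axis.
x_sorted = all(a <= b for a, b in zip(xs, xs[1:]))
print(len(points), "points; X non-decreasing:", x_sorted)
```

A check like this only confirms that the file parses and that the points are ordered along X; it does not verify that the traced values match the original figure.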
