GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles
DESCRIPTION
An article "GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles" presented during WOSP 2014
TRANSCRIPT
GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles
Dominika Tkaczyk, Paweł Szostek and Łukasz Bolikowski
Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw
3rd International Workshop on Mining Scientific Publications, 12 September 2014
D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 1 / 23
Background
CERMINE extracts:
document's metadata,
bibliographic references,
structured full text.
CERMINE needs a training set for its zone classifiers!
[Figure: CERMINE workflow diagram with the steps basic structure extraction, metadata extraction, references extraction and text extraction, turning a PDF into XML (JATS) output.]
Requirements
A good dataset for document region classification should be:
large,
diverse,
preserving document text,
and the way text is displayed,
with fine-grained labels,
open.
GROTOAP
GROTOAP dataset:
113 documents
1,031 pages
20,121 zones
20 zone labels
12 publishers
created by automatic tools + manual correction of every document = non-scalable
∼100% accurate
GROTOAP vs. GROTOAP2
GROTOAP dataset:
113 documents
1,031 pages
20,121 zones
20 zone labels
12 publishers
created by automatic tools + manual correction of every document = non-scalable
∼100% accurate
GROTOAP2 dataset:
13,210 documents
119,334 pages
1,640,973 zones
22 zone labels
208 publishers
created by automatic tools + manually developed correction rules = scalable
∼93% accurate
The content
GROTOAP2 is composed of:
13,210 ground-truth files in XML format storing the content of scientific publications from PubMed Central,
a list of URLs to the corresponding PDF files,
a bash script for downloading the PDF files from the PMC repository.
The model
The document's model in GROTOAP2 contains:
geometric hierarchical structure: pages, zones, lines, words and characters,
the text content of all the objects,
the dimensions and positions,
the reading order,
zone labels.
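The model above can be pictured as nested containers. The following is a minimal Python sketch, not the dataset's or CERMINE's own classes: all class and field names (`BBox`, `Zone.label`, and so on) are assumptions chosen for illustration, and reading order is represented simply by list order.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BBox:
    """Position and dimensions of an object (hypothetical representation)."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Character:
    bbox: BBox
    text: str


@dataclass
class Word:
    bbox: BBox
    characters: List[Character] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "".join(c.text for c in self.characters)


@dataclass
class Line:
    bbox: BBox
    words: List[Word] = field(default_factory=list)

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)


@dataclass
class Zone:
    bbox: BBox
    label: str  # e.g. "TITLE" or "BODY_CONTENT"
    lines: List[Line] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(line.text for line in self.lines)


@dataclass
class Page:
    # Zones are kept in reading order; list order stands in for it here.
    zones: List[Zone] = field(default_factory=list)
```

Every level carries its own bounding box, so both the text and the way it is laid out on the page survive.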
Zone labels
front: type, title, author, title author, editor, affiliation, abstract, keywords, bib info, dates, correspondence, glossary, copyright
body: body content, figure, table, equation
back: references, acknowledgment, conflict statement
other: page number, unknown
[Figure: bar chart with one bar per zone label; axis: % of documents (0 to 100). Labels: BIB_INFO, BODY_CONTENT, REFERENCES, AFFILIATION, PAGE_NUMBER, ABSTRACT, AUTHOR, DATES, TITLE, COPYRIGHT, ACKNOWLEDGMENT, UNKNOWN, FIGURE, CORRESPONDENCE, CONFLICT_STATEMENT, TABLE, TYPE, KEYWORDS, EDITOR, TITLE_AUTHOR, GLOSSARY, EQUATION.]
TrueViz format
<Document>
  <Page> <PageID Value="0"/> <PageNext Value="1"/>
    <Zone> <ZoneID Value="0"/> <ZoneNext Value="1"/>
      <ZoneCorners> <Vertex x="55.4" y="34.3"/> <Vertex x="250.5" y="58.3"/> </ZoneCorners>
      <Classification> <Category Value="TITLE"/> <Type Value=""/> </Classification>
      <Line> <LineID Value="0"/> <LineNext Value="1"/>
        <LineCorners> <Vertex x="55.4" y="34.3"/> <Vertex x="250.5" y="58.3"/> </LineCorners>
        <Word> <WordID Value="0"/> <WordNext Value="1"/>
          <WordCorners> <Vertex x="55.4" y="34.3"/> <Vertex x="115.3" y="58.3"/> </WordCorners>
          <Character> <CharacterID Value="0"/> <CharacterNext Value="1"/>
            <CharacterCorners> <Vertex x="55.4" y="34.3"/> <Vertex x="74.1" y="58.3"/> </CharacterCorners>
            <GT_Text Value="B"/>
          </Character> [...]
        </Word> [...]
      </Line> [...]
    </Zone> [...]
  </Page> [...]
</Document>
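To illustrate how such a file can be consumed, here is a hedged sketch that reads zone labels, zone corners and text with Python's standard `xml.etree.ElementTree`. The element names come from the slide. The function name `read_zones` and the `SAMPLE` string (a shortened copy of the slide example, with the empty `Type` attribute written as `Value=""`) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed variant of the slide example (illustrative only).
SAMPLE = """<Document><Page><PageID Value="0"/><Zone><ZoneID Value="0"/>
<ZoneCorners><Vertex x="55.4" y="34.3"/><Vertex x="250.5" y="58.3"/></ZoneCorners>
<Classification><Category Value="TITLE"/><Type Value=""/></Classification>
<Line><Word>
<Character><GT_Text Value="B"/></Character>
<Character><GT_Text Value="y"/></Character>
</Word></Line>
</Zone></Page></Document>"""


def read_zones(xml_text):
    """Return a (label, corners, text) tuple for every zone in the document."""
    root = ET.fromstring(xml_text)
    zones = []
    for zone in root.iter("Zone"):
        label = zone.find("Classification/Category").get("Value")
        corners = [(float(v.get("x")), float(v.get("y")))
                   for v in zone.findall("ZoneCorners/Vertex")]
        lines = []
        for line in zone.iter("Line"):
            # Characters are concatenated into words, words joined by spaces.
            words = ["".join(ch.find("GT_Text").get("Value")
                             for ch in word.iter("Character"))
                     for word in line.iter("Word")]
            lines.append(" ".join(words))
        zones.append((label, corners, "\n".join(lines)))
    return zones


if __name__ == "__main__":
    for label, corners, text in read_zones(SAMPLE):
        print(label, corners, repr(text))
```

For full-size ground-truth files a streaming parser such as `ET.iterparse` may be preferable, since a single document holds every character of every page.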
The method
[Figure: pipeline diagram with the elements PubMed Central (PDF and NLM files), CERMINE tools, zone text matching, and rules.]
Structure extraction
CERMINE tools were used to:
extract individual characters and their bounding boxes from PDF files,
group individual characters into words, lines and zones,
compute the reading order of all the elements.
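As a rough illustration of the grouping step, the sketch below merges the characters of one text line into words whenever the horizontal gap between neighbours is small. This is a deliberately simplified stand-in and not CERMINE's actual segmentation algorithm: the function name, the tuple layout and the `max_gap` value are all assumptions.

```python
def group_into_words(chars, max_gap=2.0):
    """Group characters of a single line into words.

    chars: list of (text, x0, x1) tuples, where x0 and x1 are the left and
    right edges of the character's bounding box (hypothetical layout).
    A new word starts when the gap to the previous character exceeds max_gap.
    """
    words = []
    current = ""
    prev_x1 = None
    for text, x0, x1 in sorted(chars, key=lambda c: c[1]):
        if prev_x1 is not None and x0 - prev_x1 > max_gap:
            words.append(current)
            current = ""
        current += text
        prev_x1 = x1
    if current:
        words.append(current)
    return words
```

The same idea applies one level up: words are merged into lines and lines into zones using vertical distances. A fixed threshold is the weakest part of this sketch, because real documents mix font sizes.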
Zone text matching
Labels were assigned to zones:
the text content of zones was matched with the corresponding NLM files,
the Smith-Waterman sequence alignment algorithm was used to measure string similarity,
the label was chosen by selecting the string with the highest similarity score above a threshold,
an additional attempt was made to assign a label to every "unknown" zone based on the labels of the neighbouring zones.
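The matching step can be sketched as follows. The slide names the Smith-Waterman algorithm but gives no parameters, so the scoring values, the whitespace tokenisation, the normalisation and the 0.7 threshold below are assumptions, as are the function names.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b (linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(zone_text, candidate, match=2):
    """Alignment score over word tokens, scaled to the range [0, 1]."""
    a, b = zone_text.split(), candidate.split()
    if not a or not b:
        return 0.0
    return smith_waterman(a, b, match=match) / (match * min(len(a), len(b)))


def choose_label(zone_text, candidates, threshold=0.7):
    """candidates maps a label to the text taken from the NLM file.

    Returns the label with the highest similarity above the threshold,
    or "unknown" when no candidate is similar enough.
    """
    best_label, best_score = "unknown", threshold
    for label, text in candidates.items():
        score = similarity(zone_text, text)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Local alignment suits this task because a zone often covers only part of an NLM field (or the other way round), and a global edit distance would penalise the unmatched remainder.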
Document filtering
43% of all processed documents have at least 90% of zones labelled.
[Figure: histogram of documents by percentage of labelled zones; x-axis: percentage of labelled zones (0 to 100), y-axis: fraction of documents in bin (0.00 to 0.07).]
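A filter of this kind can be sketched in a few lines. The sketch assumes that documents are kept when at least 90% of their zones carry a label other than "unknown". The slide reports the 90% statistic, while the exact selection code, the function names and the input layout are assumptions.

```python
def labelled_fraction(zone_labels):
    """Fraction of zones whose label is not "unknown"."""
    if not zone_labels:
        return 0.0
    labelled = sum(label != "unknown" for label in zone_labels)
    return labelled / len(zone_labels)


def select_documents(documents, min_fraction=0.9):
    """documents maps a document id to the list of its zone labels."""
    return [doc_id for doc_id, labels in documents.items()
            if labelled_fraction(labels) >= min_fraction]
```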
Distribution similarity
Publisher distribution similarity of two datasets A and B can be calculated as:
$$\mathrm{sim}(A, B) = \sum_{p \in P} \min(d_A(p), d_B(p))$$
where $P$ is the set of all publishers in $A \cup B$, and $d_A(p)$ and $d_B(p)$ are the percentage shares of a given publisher in sets A and B, respectively.
Some examples:
sim({60% X, 40% Y}, {60% X, 40% Y}) = 1.0
sim({60% X, 40% Y}, {40% X, 60% Y}) = 0.8
sim(entire processed set, selected set) = 0.78
sim({30% X, 70% Y}, {100% Z}) = 0.0
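The similarity measure translates directly into code. In this sketch each dataset is a dictionary from publisher name to its share expressed as a fraction (0.6 for 60%). The function name and the dictionary representation are assumptions.

```python
def publisher_similarity(d_a, d_b):
    """Sum over all publishers of the smaller of the two shares.

    d_a, d_b: dicts mapping publisher -> share of documents (fractions
    that sum to 1.0 within each dataset). A publisher missing from one
    dataset contributes a share of 0 there.
    """
    publishers = set(d_a) | set(d_b)
    return sum(min(d_a.get(p, 0.0), d_b.get(p, 0.0)) for p in publishers)


if __name__ == "__main__":
    print(publisher_similarity({"X": 0.6, "Y": 0.4}, {"X": 0.4, "Y": 0.6}))
    print(publisher_similarity({"X": 0.3, "Y": 0.7}, {"Z": 1.0}))
```

The measure is symmetric, equals 1 only for identical distributions, and drops to 0 when the two datasets share no publisher.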
Rules
a zone containing both title and authors → title author
page numbers from range 1–n → page number
figure captions → figure
table captions → table
small zones lying in the close neighbourhood of table zones → table
zones that occur on every page or every odd/even page and are placed close to the top or bottom of the page → bib info
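To give a feel for what one such rule might look like, here is a hedged sketch of the last one (running headers and footers → bib info). It is not the authors' rule implementation. The exact-text comparison, the 10% margin, the coordinate convention (y grows downwards from the top of the page) and all names are assumptions.

```python
from collections import defaultdict


def find_running_zones(pages, page_height, margin=0.1):
    """Find zone texts that recur near the page edge.

    pages: one list per page of (text, y_top, y_bottom) zone tuples.
    Returns the set of texts found close to the top or bottom of every
    page, every even-indexed page or every odd-indexed page.
    """
    seen_on = defaultdict(set)
    for page_no, zones in enumerate(pages):
        for text, y_top, y_bottom in zones:
            near_top = y_bottom <= margin * page_height
            near_bottom = y_top >= (1 - margin) * page_height
            if near_top or near_bottom:
                seen_on[text].add(page_no)

    all_pages = set(range(len(pages)))
    evens = {p for p in all_pages if p % 2 == 0}
    odds = all_pages - evens
    # Ignore page groups with fewer than two pages: nothing can "recur" there.
    groups = [g for g in (all_pages, evens, odds) if len(g) >= 2]
    return {text for text, where in seen_on.items()
            if any(g <= where for g in groups)}
```

Real running headers often differ slightly from page to page (for example, an embedded page number), so a practical rule would compare texts approximately instead of exactly.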
The evaluation
manual evaluation: using a small random sample of documents
indirect evaluation: evaluating the performance of CERMINE trained on GROTOAP2
Manual evaluation
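| zone label | precision (without rules) | recall (without rules) | F-score (without rules) | precision (with rules) | recall (with rules) | F-score (with rules) |
|---|---|---|---|---|---|---|
| abstract | 0.93 | 0.96 | 0.94 | 0.98 | 0.98 | 0.98 |
| acknowledgement | 0.98 | 0.67 | 0.80 | 1.0 | 0.90 | 0.95 |
| affiliation | 0.77 | 0.90 | 0.83 | 0.95 | 0.95 | 0.95 |
| author | 0.85 | 0.95 | 0.90 | 1.0 | 0.98 | 0.99 |
| bib info | 0.95 | 0.45 | 0.62 | 0.96 | 0.94 | 0.95 |
| body content | 0.65 | 0.98 | 0.79 | 0.88 | 0.99 | 0.93 |
| conflict statement | 0.63 | 0.24 | 0.35 | 0.82 | 0.89 | 0.85 |
| copyright | 0.71 | 0.94 | 0.81 | 0.93 | 0.78 | 0.85 |
| correspondence | 1.0 | 0.72 | 0.84 | 1.0 | 0.97 | 0.99 |
| dates | 0.28 | 1.0 | 0.44 | 0.94 | 1.0 | 0.97 |
| editor | - | 0 | - | 1.0 | 1.0 | 1.0 |
| equation | - | - | - | - | - | - |
| figure | 0.99 | 0.36 | 0.53 | 0.99 | 0.46 | 0.63 |
| glossary | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| keywords | 0.94 | 0.94 | 0.94 | 1.0 | 0.94 | 0.97 |
| page number | 0.99 | 0.53 | 0.69 | 0.98 | 0.97 | 0.98 |
| references | 0.91 | 0.95 | 0.93 | 0.99 | 0.95 | 0.97 |
| table | 0.98 | 0.83 | 0.90 | 0.98 | 0.96 | 0.97 |
| title | 0.51 | 1.0 | 0.67 | 1.0 | 1.0 | 1.0 |
| title author | - | 0 | - | 1.0 | 1.0 | 1.0 |
| type | 0.76 | 0.46 | 0.57 | 0.89 | 0.47 | 0.62 |
| unknown | 0.22 | 0.46 | 0.30 | 0.62 | 0.94 | 0.75 |
| average | 0.79 | 0.68 | 0.73 | 0.95 | 0.91 | 0.92 |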
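Per-label scores of this kind can be computed as in the sketch below. It assumes the F-score is the harmonic mean of precision and recall (F1), which is consistent with the reported values: for abstract without rules, 2 × 0.93 × 0.96 / 1.89 ≈ 0.94. The function name and the use of `None` for undefined values (shown as "-" in the results) are assumptions.

```python
def precision_recall_f(true_labels, assigned_labels, label):
    """Precision, recall and F-score of one zone label.

    true_labels and assigned_labels are parallel lists with one entry per
    zone. Undefined values (division by zero) are returned as None.
    """
    pairs = list(zip(true_labels, assigned_labels))
    tp = sum(t == label and a == label for t, a in pairs)
    fp = sum(t != label and a == label for t, a in pairs)
    fn = sum(t == label and a != label for t, a in pairs)

    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    if precision is None or recall is None or precision + recall == 0:
        return precision, recall, None
    return precision, recall, 2 * precision * recall / (precision + recall)
```

With this convention, a label that is never assigned but does occur in the ground truth yields an undefined precision, a recall of 0 and an undefined F-score, which is the pattern seen for editor and title author without rules.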
CERMINE-based evaluation
| metadata field | precision | recall | F-score |
|---|---|---|---|
| title | 93.05% | 88.40% | 90.67% |
| author | 94.38% | 90.01% | 92.14% |
| affiliation | 84.20% | 78.03% | 81.00% |
| abstract | 85.24% | 83.67% | 84.45% |
| keywords | 87.98% | 65.30% | 74.96% |
| journal name | 71.88% | 63.40% | 67.38% |
| volume | 96.28% | 93.20% | 94.72% |
| issue | 49.12% | 55.67% | 52.19% |
| pages | 47.41% | 45.79% | 46.59% |
| year | 99.79% | 97.80% | 98.29% |
| DOI | 96.12% | 85.34% | 90.41% |
| average | 82.22% | 76.96% | 79.34% |
CERMINE-based evaluation
| | GROTOAP | GROTOAP2 without rules | GROTOAP2 with rules |
|---|---|---|---|
| Precision | 77.13% | 81.88% | 82.22% |
| Recall | 55.99% | 70.94% | 76.96% |
| F-score | 62.41% | 75.38% | 79.34% |
Future work
enriching the ground truth files with the names of the fonts,
assigning more specific body labels, e.g. section titles,
generating a dataset of parsed bibliographic references in a similar way.
Links
GROTOAP2: http://cermine.ceon.pl/grotoap2/
CERMINE web service: http://cermine.ceon.pl
Thank you
Thank you! Questions?
Dominika Tkaczyk [email protected]
© 2014 Dominika Tkaczyk. This document is distributed under the Creative Commons Attribution 3.0 license.
The complete text of the license can be seen here: http://creativecommons.org/licenses/by/3.0/