
Incorporating Visual Layout Structures for Scientific Text Classification

Zejiang Shen1 Kyle Lo1 Lucy Lu Wang1 Bailey Kuehl1 Daniel S. Weld1,2 Doug Downey1

1Allen Institute for AI  2University of Washington
{shannons, kylel, lucyw, baileyk, danw, dougd}@allenai.org

Abstract

Classifying the core textual components of a scientific paper—title, author, body text, etc.—is a critical first step in automated scientific document understanding. Previous work has shown how using elementary layout information, i.e., each token's 2D position on the page, leads to more accurate classification. We introduce new methods for incorporating VIsual LAyout (VILA) structures, e.g., the grouping of page texts into text lines or text blocks, into language models to further improve performance. We show that the I-VILA approach, which simply adds special tokens denoting the boundaries of layout structures into model inputs, can lead to 1.9% Macro F1 improvements for token classification. Moreover, we design a hierarchical model, H-VILA, that encodes the text based on layout structures and records an up to 47% inference time reduction with less than 1.5% Macro F1 loss for the text classification models. Experiments are conducted on a newly curated evaluation suite, S2-VLUE, with a novel metric measuring classification uniformity within visual groups and a new dataset of gold annotations covering papers from 19 scientific disciplines. Pre-trained weights, benchmark datasets, and source code will be available at https://github.com/allenai/VILA.

1 Introduction

Scientific papers are usually stored in the Portable Document Format (PDF) without extensive semantic markup. It is critical to extract structured document representations from these PDF files for downstream NLP tasks (Beltagy et al., 2019; Wang et al., 2020) and to improve PDF accessibility (Wang et al., 2021). Unlike other real-world documents, scholarly documents are usually typeset with rectilinear templates of layout structures with interleaving textual and image objects. As shown in Figure 1 (a), while this rich layout information can signal semantics, existing methods (Devlin et al., 2019; Beltagy et al., 2019) resort to analyzing the flattened text obtained by PDF-to-text converters, ignoring the style and layout information stored in the original file.

Recent methods demonstrate that token-level position information, i.e., tokens' 2D spatial location on the page, can be incorporated to improve language models by introducing position-awareness (Xu et al., 2020b) and enhance scientific document parsing (Li et al., 2020). However, humans reading these documents also make use of high-level layout structures like the grouping of text blocks¹ and lines (Todorovic, 2008) to infer the semantic contents accordingly, which we refer to as structure-awareness. Existing models do not explicitly incorporate structure-awareness, resulting in two major drawbacks: 1) deprived of structure-awareness, token-level predictions can be inconsistent within a group (Li et al., 2020); 2) the inference process is less efficient, as redundant predictions need to be made for each token within a group.

In this paper, we investigate to what extent such VIsual LAyout (VILA) structures can be used to improve NLP models for parsing scholarly documents. Following the work of Zhong et al. (2019) and Tkaczyk et al. (2015), our key assumption is that a document page can be segmented into visual groups of tokens (either lines or blocks), and the tokens in a group typically have the same semantic category, as shown in Figure 1 (b), which we refer to as the group uniformity assumption. We first show that VILA can be used as an additional information source to improve existing BERT-based language models (Devlin et al., 2019). Moreover, VILA can also be used as a strong prior that guides the design of model architecture.

¹ In this paper, we consider a text block to be a group of tokens that appear adjacent to each other on a page, are visually set off from other tokens, and belong to the same semantic category. Section headers, lists, equation regions, etc. are valid instances of text blocks.



Figure 1: (a) Real-world scientific documents usually have intricate layout structures, and analyzing the flattened raw text could be sub-optimal. (b) The complex structures could be broken down into groups like text blocks and lines that are composed of tokens with the same semantic category. (c) We design the I-VILA model that injects a special layout indicator token [BLK] into the text inputs to enable structure-awareness. (d) We also develop a hierarchical model, H-VILA, that models each group to reduce the redundant token-level predictions and significantly improve efficiency.

By injecting special layout indicator tokens that delimit boundaries between blocks or lines into existing textual inputs, the I-VILA models are informed of the document structure and yield token predictions with better accuracy and consistency. Further, the group uniformity assumption suggests that semantic category prediction can be performed at the group level rather than the token level. We therefore introduce a hierarchical model, H-VILA, which uses the layout structure to represent a page as a two-level hierarchy of groups and tokens. Because the individual groups can be modeled independently, this reduces the number of tokens that must be fed into the transformer at once, improving efficiency.

The proposed methods are evaluated on a newly designed benchmark suite, the Semantic Scholar Visual Layout-enhanced Scientific Text Understanding Evaluation (S2-VLUE). The benchmark consists of two datasets built on existing resources (Tkaczyk et al., 2015; Li et al., 2020) and a newly curated dataset, S2-VL. With high-quality human annotations for papers from 19 disciplines, S2-VL fills important gaps in existing datasets, which only contain machine-generated labels for papers from certain domains. Besides typical measurements of prediction accuracy, we also introduce a novel metric, group category inconsistency, that measures the homogeneity of token classes within each group in terms of entropy. This metric indicates how well the token categories accord with VILA structures.

Our study is well-aligned with recent efforts for incorporating structural information into language models (Lee et al., 2020; Bai et al., 2020; Yang et al., 2020; Zhang et al., 2019). However, one key difference is that we obtain the structure from layout rather than from language structures like sentences or paragraphs. Evaluated on S2-VLUE, we show that when available, VILA structures lead to better prediction accuracy compared to using language structures.

Our main contributions are as follows:

1. We explore new ways of injecting layout information in terms of VILA structures into language models, and find that they can improve text classification for scientific text.

2. We design two models that incorporate VILA features differently. The I-VILA model injects special layout indicator tokens into the text inputs and improves prediction accuracy (up to +1.9% Macro F1) and consistency compared to previous layout-augmented language models. The H-VILA model performs group-level predictions and can reduce the inference time by 47% with less than 1.5% Macro F1 loss.

3. We construct a unified benchmark suite, S2-VLUE, to systematically evaluate performance on the scientific document text classification task with layout features. We enhance existing datasets with VILA structures and develop a novel dataset, S2-VL, that addresses gaps in existing resources, with gold labels for 16 token categories and VILA structures for papers from 19 disciplines.


As suggested by improvements on our comprehensive benchmark, our proposed methods also have the potential to benefit PDF extraction in the general domain. The benchmark datasets, modeling code, and trained weights will be available at https://github.com/allenai/VILA.

2 Related Work

The complex organization of scientific paper elements poses a challenge for identifying the key semantic components and converting them to a structured format. Previous work tackles this challenge by utilizing visual or textual features from the documents.

Vision-based Approaches (Zhong et al., 2019; He et al., 2017; Siegel et al., 2018) treat this task as an image object detection problem: given an image of a document page, the models predict a series of rectangular bounding boxes segmenting the page into individual components of different categories. These models excel at capturing complex visual layout structures like figures or tables, but cannot accurately generate fine-grained semantic categories like title, author, or abstract, which are of central importance for parsing scientific documents.

Text-based Methods, on the other hand, formulate this task as a text classification problem. Methods like ScienceParse (Ammar et al., 2018), GROBID (GRO, 2008–2021), or Corpus Conversion Service (Staar et al., 2018) first convert a source PDF document to a sequence of tokens via PDF-to-text parsing engines like CERMINE (Tkaczyk et al., 2015) or pdfalto.² Machine learning models like RNNs (Hochreiter and Schmidhuber, 1997), CRFs (Lafferty et al., 2001), or Random Forests (Breiman, 2001), trained to model the input texts, are then used to classify the token categories. However, without considering document visual structures, these trained models fall short in prediction accuracy or generalize poorly to out-of-domain documents.

Recently, Joint Approaches have been explored that combine visual and textual features to boost model performance. The LayoutLM model (Xu et al., 2020b) combines token textual and 2D position information, and records a 6% F1 improvement over the baseline BERT model (Devlin et al., 2019) on a scientific text classification task (Li et al., 2020).

² https://github.com/kermitt2/pdfalto

Livathinos et al. (2021) built a seq2seq model (Sutskever et al., 2014) incorporating both layout and text features that significantly improves model robustness on an evaluation set of diverse paper layouts. Very recent work like LayoutLMv2 (Xu et al., 2020a) and SelfDoc (Li et al., 2021) proposes to incorporate documents' image features when modeling the text, yet this is shown to be less helpful for text-dense scholarly articles (Li et al., 2021).

The training and evaluation datasets of these models are in many cases automatically generated using paper XML source from PubMed Central (Ammar et al., 2018; GRO, 2008–2021; Tkaczyk et al., 2014) or LaTeX source from arXiv (Li et al., 2020). Despite having large sample sizes, these datasets do not contain significant layout variation, leading to poor generalization to papers from other disciplines with distinct layouts. Also, due to the heuristic way in which these datasets are constructed, they contain systematic classification errors that can affect downstream modeling performance. We refer readers to a more detailed description of the limitations of the GROTOAP2 (Tkaczyk et al., 2014) and DocBank (Li et al., 2020) datasets in Section 4.1. The PubLayNet dataset (Zhong et al., 2019) provides high-quality text block annotations on 330k document pages; however, its annotations only include five distinct categories, which is insufficient for fully representing the semantic elements found in papers. Livathinos et al. (2021) and Staar et al. (2018) curated a dataset with manual annotations on 2,940 paper pages from various publishers using diverse layouts, but only the processed page features are publicly available, not the raw paper texts or the source PDFs needed for experiments with new layout-aware methods.

3 Method

3.1 Problem Formulation

Following the previous literature (Tkaczyk et al., 2015; Li et al., 2020), our task is to map each token ti from an input document to a semantic category ci such as title, body text, or reference. For simplicity, we consider only a page of a paper as input; different pages are modeled separately. Input tokens are extracted via PDF-to-text tools, which output both the word wi and its 2D position, i.e., the rectangular bounding box ai = (x0, y0, x1, y1) denoting the left, top, right, and bottom positions of the word boundary.


Figure 2: Visualization of the token-level prediction results on the original page. From left to right, we present the ground-truth token categories and text block bounding boxes (highlighted in red rectangles), and model predictions from the baseline, I-VILA, and H-VILA models. We use G(B) for the VILA-based methods. When VILA is injected, the model achieves more consistent predictions, as indicated by arrows (1) and (2) in the figure.

Formally, the input sequence is T = (t1, . . . , tn), where ti = ⟨wi, ai⟩, and the output sequence is (c1, . . . , cn). It is worth noting that the order of tokens in sequence T might not reflect the actual reading order of the text due to incorrect PDF-to-text conversion (e.g., in the DocBank dataset (Li et al., 2020)), which is an additional challenge for language models pre-trained on regular texts.

Besides the token sequence T, additional visual structures G can also be retrieved from the source document. Scientific documents are organized into groups of tokens, such as text lines or blocks, which consist of consecutive pieces of text and can be segmented from other pieces based on spatial gaps. The group information can be extracted via visual layout detection models (Zhong et al., 2019; He et al., 2017) or PDF parsing (Tkaczyk et al., 2015).

Formally, given an input page, our system detects a series of m rectangular boxes B = {b1, . . . , bm}, one for each group in the input document page. It further allocates the page tokens to the group regions and generates the visual groups gi = (bi, ti), where ti = {tj ∈ T | the center point of tj's bounding box aj lies strictly within the group box bi} is the set of all tokens in the i-th group. When two group regions overlap and share some common tokens, the system only assigns each token to the earlier (in terms of estimated reading order) of the two groups. The token order in each group is consistent with the token order extracted from the page.
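To make the allocation step concrete, here is a minimal sketch in Python, assuming simple dataclasses for tokens and groups; the names Token, Group, and assign_tokens_to_groups are illustrative and not from the released VILA code:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

@dataclass
class Token:
    word: str
    box: Box  # word bounding box a_i

@dataclass
class Group:
    box: Box                                # group bounding box b_i
    tokens: List[Token] = field(default_factory=list)

def center(box: Box) -> Tuple[float, float]:
    x0, y0, x1, y1 = box
    return (x0 + x1) / 2, (y0 + y1) / 2

def contains(outer: Box, point: Tuple[float, float]) -> bool:
    x, y = point
    return outer[0] <= x <= outer[2] and outer[1] <= y <= outer[3]

def assign_tokens_to_groups(tokens: List[Token], group_boxes: List[Box]) -> List[Group]:
    """Allocate each token to the first group whose box contains the token's
    center point; group_boxes are assumed to be in estimated reading order, so
    the earlier group wins when two group regions overlap."""
    groups = [Group(box=b) for b in group_boxes]
    for tok in tokens:  # tokens keep their page order within each group
        for g in groups:
            if contains(g.box, center(tok.box)):
                g.tokens.append(tok)
                break
    return groups
```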

We refer to the text block groups of a page as G(B) and the text line groups as G(L). In our case, we define text lines as consecutive tokens appearing at nearly the same vertical position.³ Text blocks are adjacent text lines with gaps smaller than a certain threshold and with the same semantic category. That is, even two close lines of different semantic categories should be allocated to separate blocks. However, in practice, block or line detectors may generate incorrect predictions.

In the following sections, we show how language models can benefit from these different types of document structures.

3.2 I-VILA: Injecting Visual Layout Indicators

According to the group uniformity assumption, token categories are homogeneous within a group, and categorical changes should happen at group boundaries. This suggests that layout information should be incorporated in a way that informs token category consistency intra-group and signals possible token category changes inter-group.

Our first method supplies VILA structures by inserting a special layout indicator token at each group boundary in the input text, which we refer to as the I-VILA method. Compared to Xu et al. (2020b), our method provides explicit document structure signals. As shown in Figure 1(c), the inserted tokens highlight individual text segments, resulting in a more structured input for the language models that hints at possible category changes.

³ Or horizontal position, when the text is written vertically.


Additionally, in I-VILA, the special token is seen at all layers of the model, providing VILA signals at different stages of modeling, rather than only providing positional information at the initial embedding layers (Xu et al., 2020b). We empirically show that BERT-based models can learn to leverage such special tokens to improve both the accuracy and the consistency of category predictions, even without an additional loss penalizing inconsistent intra-group predictions.

In practice, given G, we linearize the tokens ti from each group and flatten them into a 1D sequence TG. To avoid capturing confounding information from existing pre-training tasks, we insert a new special token, [BLK], in between the ti texts. The resulting input sequence is of the form {[CLS], t1,1, . . . , ti,ni, [BLK], ti+1,1, . . . , tm,nm, [SEP]}, where ti,j and ni indicate the j-th token and the total number of tokens, respectively, in the i-th group, and [CLS] and [SEP] are the special tokens used by the BERT model, inserted to preserve a similar input structure. The BERT-based models are fine-tuned on the token classification objective with a cross-entropy loss. Token positions can also be injected in a similar way as in LayoutLM (Xu et al., 2020b), and the positional embedding for a newly injected [BLK] token is derived from the corresponding group's bounding box bi.
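The input construction can be sketched as follows with a Hugging Face fast tokenizer; it reuses the Group structure from the earlier sketch, and build_ivila_input is an illustrative name rather than the released API (2D coordinates and label alignment are omitted):

```python
from transformers import BertTokenizerFast

def build_ivila_input(groups, tokenizer, blk_token="[BLK]", max_len=512):
    """Flatten grouped page tokens into one word sequence, inserting a layout
    indicator token between consecutive groups; [CLS]/[SEP] are added by the
    tokenizer itself."""
    words = []
    for i, group in enumerate(groups):
        if i > 0:
            words.append(blk_token)
        words.extend(tok.word for tok in group.tokens)
    return tokenizer(words, is_split_into_words=True,
                     truncation=True, max_length=max_len,
                     return_tensors="pt")

# The indicator must be registered as a special token so it is never split:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": ["[BLK]"]})
# model.resize_token_embeddings(len(tokenizer))  # after loading the model
```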

3.3 H-VILA: Visual Layout-guided Hierarchical Model

The uniformity of group token categories also suggests the possibility of building a group-level classifier. Inspired by recent advances in modeling long documents, hierarchical structures (Yang et al., 2020; Zhang et al., 2019) provide an ideal architecture for the end task while optimizing for computational cost. As illustrated in Figure 1(d), two transformer-based models are used to encode each group in terms of its words and to model the whole document in terms of the groups, respectively; we provide the modeling details as follows.

The Group Encoder is an l1-layer transformer that converts each group gi into a hidden vector hi. Following the typical transformer model setting (Vaswani et al., 2017), the model takes the collection of tokens within a group, ti, as input, and maps each token ti,j into a dense vector ei,j of dimension d. Subsequently, a group vector aggregation function f : R^(ni×d) → R^d is applied that projects the token representations (ei,1, . . . , ei,ni) to a single vector h̃i that represents the group's textual information. A group's 2D spatial information is incorporated in the form of positional embeddings, and the final group representation hi is calculated as

hi = h̃i + p(bi),   (1)

where p is the 2D positional embedding similar to the one used in LayoutLM:

p(b) = Ex(x0) + Ex(x1) + Ew(x1 − x0) + Ey(y0) + Ey(y1) + Eh(y1 − y0),   (2)

where Ex, Ey, Ew, Eh are the embedding matrices for the x and y coordinates and the width and height. We enable position-awareness at the group level rather than the token level to avoid capturing noisy positional signals from individual tokens. In practice, we find that injecting positional information using the bounding box of the first token within the group leads to better results, and we choose the group vector aggregation function f to be the average over all token representations.
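The positional embedding of Eq. (2) can be sketched as a small PyTorch module; the class name, embedding sizes, and coordinate normalization are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class Group2DPositionEmbedding(nn.Module):
    """p(b) = Ex(x0) + Ex(x1) + Ew(x1-x0) + Ey(y0) + Ey(y1) + Eh(y1-y0), as in
    Eq. (2). Coordinates are assumed to be integers already normalized so that
    all values (including widths and heights) fall in [0, max_pos)."""

    def __init__(self, hidden_size: int = 768, max_pos: int = 1024):
        super().__init__()
        self.x_emb = nn.Embedding(max_pos, hidden_size)  # shared for x0 and x1
        self.y_emb = nn.Embedding(max_pos, hidden_size)  # shared for y0 and y1
        self.w_emb = nn.Embedding(max_pos, hidden_size)
        self.h_emb = nn.Embedding(max_pos, hidden_size)

    def forward(self, boxes: torch.LongTensor) -> torch.Tensor:
        # boxes: (num_groups, 4) with columns (x0, y0, x1, y1)
        x0, y0, x1, y1 = boxes.unbind(-1)
        return (self.x_emb(x0) + self.x_emb(x1) + self.w_emb(x1 - x0)
                + self.y_emb(y0) + self.y_emb(y1) + self.h_emb(y1 - y0))

# Eq. (1): group_repr = text_repr + Group2DPositionEmbedding()(group_boxes)
```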

The Page Encoder is another stacked transformer model of l2 layers that operates on the group representations hi generated by the group encoder. Modeling text representations with both structure- and position-awareness, the page encoder efficiently captures layout-contextualized group representations and generates a final group representation si optimized for downstream textual analysis. An MLP-based linear classifier is attached thereafter and is trained to generate the group-level category probabilities.

Different from previous work (Yang et al., 2020), we restrict the variation in l1 and l2 such that we can load pre-trained weights. Therefore, no additional pre-training is required, and the H-VILA model can be fine-tuned directly for the downstream classification task. Specifically, we set l1 = 1 and initialize the group encoder from the first-layer transformer weights of BERT. The page encoder is configured as either a one-layer transformer or a 12-layer transformer that resembles a full LayoutLM model. Weights are initialized from the first layer or the full 12 layers of the LayoutLM model, which is trained to model texts in conjunction with their positions.
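One possible way to realize this initialization scheme with the transformers library is sketched below; the model identifiers and the exact layer slicing are assumptions, not the released configuration:

```python
from transformers import AutoModel

def build_hvila_encoders(l2: int = 12):
    """Sketch: take the group encoder from BERT's first layer (l1 = 1) and the
    page encoder from the first l2 layers of a pre-trained LayoutLM model."""
    bert = AutoModel.from_pretrained("bert-base-uncased")
    layoutlm = AutoModel.from_pretrained("microsoft/layoutlm-base-uncased")

    group_encoder = bert.encoder.layer[0]          # single transformer layer
    page_encoder_layers = layoutlm.encoder.layer[:l2]  # l2 in {1, 12}
    return group_encoder, page_encoder_layers
```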

Group Token Truncation As suggested in Yang et al. (2020)'s work, when an input document of length N is evenly split into segments of length Ls, the memory footprint of the hierarchical model is O(l1 N Ls + l2 (N/Ls)²), and for long documents with N ≫ Ls, it approximates O(N²/Ls²).


                           GROTOAP2             DocBank              S2-VL
Train / Dev / Test Pages   834k / 18k / 18k     398k / 50k / 50k     1.3k¹
Annotation Method          Semi-Automatic       Automatic            Human Annotation
Paper Domain               Life Science         Math / Physics / CS  19 Domains
VILA Structure             PDF parsing          Vision model         Gold Label / Detection methods
# of Categories            22                   12                   16
# Tokens Per Page          1203 / 591 / 2307²   838 / 503 / 1553     790 / 453 / 1591
# Tokens Per Text Block    90 / 184 / 431       57 / 138 / 210       48 / 121 / 249
# Tokens Per Text Line     17 / 12 / 38         16 / 43 / 38         14 / 10 / 30
# Text Lines Per Page      90 / 51 / 171        60 / 34 / 125        64 / 54 / 154
# Text Blocks Per Page     12 / 16 / 37         15 / 8 / 30          22 / 36 / 68

¹ This is the total number of pages in the S2-VL dataset; we use 5-fold cross-validation for training and testing.
² For this and all following cells, we report the average / standard deviation / 95th percentile value for this item.

Table 1: Distinct features for the three datasets in the S2-VLUE benchmark.

However, in our case, it is infeasible to adopt the Greedy Sentence Filling technique (Yang et al., 2020), as it would mingle signals from different groups and obfuscate the group structures. It is also less desirable to simply use the maximum token count per group, max1≤i≤m ni, to batch the contents, due to the high variance of group token lengths (see Table 1). Instead, we choose a group token truncation count n empirically, based on key statistics of the group token length distribution, such that N ≈ nm, and use the first n tokens to aggregate the group hidden vector hi for all groups. Therefore, H-VILA models can achieve efficiency gains similar to those of the previous method (Yang et al., 2020).
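The sketch below illustrates one simple variant of this truncation; it derives the per-group count n from a fixed page-token budget rather than from dataset-level statistics as done in the paper, and the function name is illustrative:

```python
import math
from typing import List

def truncate_groups(groups: List[List[str]], max_page_tokens: int = 512) -> List[List[str]]:
    """Keep only the first n tokens of every group, choosing n so that the
    total page length N is roughly n * m (m = number of groups)."""
    m = max(len(groups), 1)
    n = max(1, math.floor(max_page_tokens / m))
    return [tokens[:n] for tokens in groups]

# Example: 3 groups and a budget of 8 tokens -> at most 2 tokens per group.
print(truncate_groups([["a", "b", "c"], ["d"], ["e", "f", "g", "h"]], max_page_tokens=8))
```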

4 Benchmark Suite

To systematically evaluate the proposed methods, we develop the Semantic Scholar Visual Layout-enhanced Scientific Text Understanding Evaluation (S2-VLUE) benchmark suite. S2-VLUE consists of three datasets—two previously released resources augmented with VILA information and a newly curated dataset S2-VL—as well as evaluation metrics for measuring prediction quality.

4.1 Datasets

Key statistics for S2-VLUE are provided in Table 1. Notably, the three constituent datasets differ in regards to: 1) annotation method, 2) VILA generation method, and 3) paper domain coverage. We provide details below.

GROTOAP2 The GROTOAP2 dataset (Tkaczyk et al., 2014) is semi-automatically annotated. It first creates text block and line groupings using the CERMINE PDF parsing tool (Tkaczyk et al., 2015); text block category labels are then obtained by pairing block texts with structured data from document source files obtained from PubMed Central. A small subset of the data is inspected by experts, and a set of post-processing heuristics is developed to further improve annotation quality. Since token categories are annotated by group, the dataset achieves perfect accordance between token labels and VILA structure. However, the rule-based PDF parsing method employed by the authors introduces labeling inaccuracies due to imperfect VILA detection: the authors find that block-level annotation accuracy reaches only 92 Macro F1 on a small gold evaluation set. Additionally, all samples are extracted from the PMC Open Access Subset,⁴ which includes only life sciences publications; these papers have less representation of classification types like "equation", which are common in other scientific disciplines.

DocBank The DocBank dataset (Li et al., 2020) is fully machine-labeled, without any post-processing heuristics or human assessment. The authors first identify token categories by automatically parsing the source TEX files available from arXiv. Text block annotations are then generated by grouping together tokens of the same category using connected component analysis. However, only a specific set of token tags was extracted from the main TEX file for each paper, leading to inaccurate and incomplete token labels, especially for papers employing LaTeX macro commands,⁵ and thus incorrect visual groupings.

⁴ https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
⁵ For example, in DocBank, "Figure 1" in a figure caption block is usually labeled as "paragraph" rather than "caption". DocBank labels all tokens that are not explicitly contained in the set of processed LaTeX tags as "paragraph."


Hence, we develop a CNN-based vision layout detection model based on a collection of existing resources (Zhong et al., 2019; MFD, 2021; He et al., 2017; Shen et al., 2021) to fix these inaccuracies and generate trustworthy VILA annotations at both the text block and line level.⁶ As a result, this dataset can be used to evaluate VILA models under a different setting, since the VILA structures are generated independently from the token annotations. Yet as the papers are extracted from arXiv, they primarily represent domains like Computer Science, Physics, and Mathematics, limiting the amount of layout variation; the automatic token classification process also produces labels for only a small number of semantic categories, which may be insufficient for some paper modeling needs.

S2-VL S2-VL is created to address the three major drawbacks of existing work: 1) annotation quality, 2) VILA creation, and 3) domain coverage. S2-VL is developed with help from graduate students who frequently read scientific papers. Using the PAWLS annotation tool (Neumann et al., 2021), annotators draw rectangular text blocks directly on each PDF page and specify the block-level semantic categories from 16 possible candidates. Tokens within a group can therefore inherit the category from the parent text block. Inter-annotator agreement, in terms of token-level accuracy measured on a 12-paper subset, is high at 0.95. The ground-truth VILA labels in S2-VL can be used to fine-tune visual layout detection models, and paper PDFs are also included, making PDF-based structure parsing feasible: this enables VILA annotations to be created by different means, which is helpful for benchmarking VILA-based models in different scenarios. Moreover, S2-VL currently contains 1337 pages from 87 papers spanning 19 different disciplines, including Philosophy and Sociology, which are not present in previous datasets.

Overall, the datasets in S2-VLUE cover a wide range of scientific disciplines with different layouts. The VILA structures are curated differently, and can be used to evaluate a variety of VILA-based methods and assess their generalizability.

⁶ The original generation method for DocBank requires rendering the LaTeX source, which results in layouts different from the publicly available versions of these documents on arXiv. However, because the authors of the dataset only provide document page images, rather than the rendered PDFs, we can only use image-based approaches for layout detection. We refer readers to the supplementary materials for details.

4.2 Metrics

Prediction Accuracy The token label distribution is heavily skewed towards categories indicating papers' body texts (e.g., the "BODY_CONTENT" category in GROTOAP2 or the "paragraph" category in S2-VL and DocBank). Therefore, we choose Macro F1 as the main evaluation metric for prediction accuracy.

Group Category Inconsistency We also develop a metric that measures the uniformity of the token categories within a group. Hypothetically, the tokens in a group g share the same category c, and naturally the group inherits the semantic label c. We use the group token category entropy to measure the inconsistency of the (predicted) token categories within a group:

H(g) = −∑c pc log pc,   (3)

where pc denotes the probability of a token in group g being classified as category c. When all tokens in a group have the same category, the group token category inconsistency is zero. H(g) reaches its maximum when pc is a uniform distribution across all possible categories. The inconsistency for G is the arithmetic mean over all individual groups gi:

H(G) = (1/m) ∑i H(gi).   (4)

H(G) acts as an auxiliary metric for evaluating prediction quality with respect to the provided VILA structures. In the remainder of this paper, we report the inconsistency metric for text blocks G(B) and scale the values up by a factor of 100.
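A direct implementation of Eqs. (3)-(4) over predicted token categories might look as follows; the natural logarithm and the ×100 scaling are assumptions about the reported values, and the function name is illustrative:

```python
import math
from collections import Counter
from typing import List, Sequence

def group_inconsistency(group_predictions: List[Sequence[str]]) -> float:
    """Group category inconsistency H(G): the mean token-category entropy over
    groups, scaled by 100 as in the paper's tables (natural log assumed)."""
    entropies = []
    for preds in group_predictions:
        counts = Counter(preds)
        total = sum(counts.values())
        h = -sum((c / total) * math.log(c / total) for c in counts.values())
        entropies.append(h)
    return 100.0 * sum(entropies) / max(len(entropies), 1)

# A perfectly uniform group contributes 0; mixed groups raise the score.
print(group_inconsistency([["title"] * 5, ["body", "body", "caption"]]))
```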

5 Experiments

5.1 Experimental Setup

Implementation Details Our models are implemented using PyTorch (Paszke et al., 2019) and the transformers library (Wolf et al., 2020). A series of baseline and VILA models are fine-tuned on 4-GPU RTX8000 or A100 machines. The AdamW optimizer (Kingma and Ba, 2014; Loshchilov and Hutter, 2019) is adopted with a 5 × 10−5 learning rate and (β1, β2) = (0.9, 0.999). The learning rate is linearly warmed up over the first 5% of steps and then linearly decayed.


                           GROTOAP2             DocBank              S2-VL
                           F1-macro↑  H(G)↓     F1-macro↑  H(G)↓     F1-macro↑     H(G)↓        Inference Time (ms)
Baseline  LayoutLM         92.34      0.78      91.06      3.04      81.65(5.05)¹  3.56(0.62)   52.56(0.25)
I-VILA    Sentence         91.83      0.78      91.44      3.03      82.03(4.20)   3.50(0.38)   54.09(0.37)
          Text Line G(L)   92.37      0.73      92.79      2.59      82.87(3.95)²  3.22(0.53)   56.31(0.40)
          Text Block G(B)  93.38      0.53      92.00      2.51      82.65(4.90)   2.51(0.44)   53.27(0.13)
H-VILA    Naïve            92.65      0.00      87.01      0.00      75.60(4.72)   0.38(0.17)   82.57(0.30)
          Text Line G(L)   91.65      0.32      91.27      1.07      80.47(2.03)   2.62(0.70)   28.07(0.37)³
          Text Block G(B)  92.37      0.00      87.78      0.00      76.06(4.66)   0.38(0.22)   16.37(0.15)

¹ For S2-VL, we show the averaged scores and their standard deviation (in parentheses) across the 5-fold cross-validation subsets.
² In this table, we report S2-VL results using VILA structures detected by visual layout models. When ground-truth VILA structures are available, both I-VILA and H-VILA models can achieve better accuracy, as shown in Tables 5 and 6.
³ When reporting efficiency in other parts of the paper, we use this result because of its optimal combination of accuracy and efficiency.

Table 2: Model performance on the three benchmark datasets. Models with layout indicators achieve better accuracy, while hierarchical models achieve significant efficiency gains compared to the baseline methods.

For the different datasets (GROTOAP2, DocBank, S2-VL), unless otherwise specified, we select the best fine-tuning batch sizes (40, 40, and 12) and training epochs (24, 6,⁷ and 10) for all models. As for S2-VL, given its smaller size, we use 5-fold cross validation, report averaged scores, and use a 2 × 10−5 learning rate with 20 epochs. Since papers may have variable numbers of pages, we split S2-VL based on papers rather than pages to avoid exposing the paper templates of test samples in the training data. Mixed precision training (Micikevicius et al., 2018) is used to speed up the training process.
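The optimization setup described above corresponds roughly to the following sketch; the function name is illustrative and the released training scripts may differ in detail:

```python
import torch
from transformers import get_linear_schedule_with_warmup

def configure_optimization(model, num_training_steps: int, lr: float = 5e-5):
    """AdamW with betas (0.9, 0.999), a linear warmup over 5% of the training
    steps, and linear decay afterwards, matching the setup described above."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, betas=(0.9, 0.999))
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.05 * num_training_steps),
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler
```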

For I-VILA models, we fine-tune several BERT variants with VILA-enhanced text inputs, and the models are initialized from pre-trained weights available in the transformers library. The H-VILA models are initialized as mentioned in Section 3.3, and by default, positional information is injected for each group.

Competing Methods We consider three approaches that compete with the proposed methods from different perspectives: 1) The LayoutLM (Xu et al., 2020b) model is the baseline method. It is the closest model counterpart to our VILA-augmented models, as it also injects layout information, and it achieves the previous SOTA performance on the scientific PDF parsing task (Li et al., 2020). 2) For indicator-based methods, besides using VILA-based indicators, we also compare with indicators generated from sentence breaks detected by PySBD (Sadvilkar and Neumann, 2020). 3) For hierarchical models, we consider a naïve approach where the group texts are separately fed into a LayoutLM-based group token classifier. However, despite the group texts being relatively short, this method incurs extra computational overhead, as the full LayoutLM model needs to be run m ≫ 3 times for all groups.⁸ As such, we only consider text block groupings in this case for efficiency, and the models are only trained for 5, 2, and 5 epochs for GROTOAP2, DocBank, and S2-VL, respectively.

⁷ We try to keep the gradient update steps the same for the GROTOAP2 and DocBank datasets. As DocBank contains 4× the examples, the number of DocBank models' training epochs is reduced by 75%.

Measuring Efficiency We report the inference time per sample as a measure of model efficiency. We select 1,000 pages from the GROTOAP2 test set and report the average model runtime over 3 runs on this subset. All models are tested on an isolated machine with a single V100 GPU. We report the time incurred for text classification; additional time costs associated with PDF-to-text conversion or VILA structure detection are not included.

5.2 Evaluation Results

Summarized in Table 2, VILA-based approaches achieve better accuracy or efficiency under different scenarios.

I-VILA models lead to consistent accuracy improvements when layout information is injected. Compared to the baseline LayoutLM model, which uses regular textual input, inserting layout indicators results in +1.1%, +1.9%, and +1.5% Macro F1 improvements across the three benchmark datasets.

⁸ Using GROTOAP2 as an example (see Table 1), there is an average of 12 text blocks per page, so m ≈ 12. With an average of 1203 tokens per page, the inputs must be split into three 512-length segments that are separately provided as inputs to the LayoutLM model.


Base Model   I-VILA       F1-macro↑  H(G)↓
DistilBERT   None         90.52      1.95
             Text Line    91.14      1.33
             Text Block   92.12      0.73
BERT         None         90.78      1.58
             Text Line    91.65      1.13
             Text Block   92.31      0.63
LayoutLM     None         92.34      0.78
             Text Line    92.37      0.73
             Text Block   93.38      0.53

Table 3: Inserting VILA indicator tokens leads to consistent improvements on different BERT-based models.

I-VILA models also achieve better token prediction consistency; the corresponding group category inconsistency is reduced by 32.1%, 14.8%, and 9.6% compared to the baseline. Moreover, VILA information is also more helpful than language structures: I-VILA models based on text blocks and lines all outperform the sentence boundary-based method by a similar margin. Figure 2 shows an example of the VILA model predictions.

H-VILA models, on the other hand, lead to significant efficiency improvements. In Table 2, we report results for H-VILA models with l1 = 1 and l2 = 12. As block-level models perform prediction directly at the text block level, their group category inconsistency is naturally zero. Compared to LayoutLM, H-VILA models with text lines bring a 46.59% reduction in inference time, while the final prediction accuracy is not heavily penalized (−0.74%, +0.23%, −1.45% Macro F1). When text blocks are used, H-VILA models are even more efficient (68.85% and 80.17% inference time reductions compared to the LayoutLM and naïve baselines), and they also achieve better accuracy than their naïve counterparts (−0.30%, 0.88%, 0.61% Macro F1).

However, in H-VILA models, the inductive bias from the group uniformity assumption is a double-edged sword: the models are often less accurate than their I-VILA counterparts, and performing block-level classification may sometimes lead to worse results (−3.6% and −6.8% Macro F1 on the DocBank and S2-VL datasets compared to LayoutLM). As shown in Figure 3, when text block detection is incorrect, the H-VILA method lacks the flexibility to assign different token categories within a group, which can lead to lower accuracy.

Figure 3: Different from Figure 2, here we show models trained and evaluated with "inconsistent" text block detections. The blocks are created by the CERMINE PDF parsing program (Tkaczyk et al., 2015), which (1) fails to capture the correct table structure and (2) does not separate body text contents into different blocks. Though VILA-based models utilize the group structure to increase block uniformity, overall prediction accuracy is hurt due to the inconsistent block signal.

Text lines contain shorter token sequences than blocks, and the empirical risk of erroneous predictions is consequently lower. Additional analysis of the two different H-VILA models and VILA signals is detailed in Section 6.

5.3 I-VILA on different BERT variants

We also apply I-VILA to different BERT variants on the GROTOAP2 dataset, and show in Table 3 that the method leads to consistent improvements for all model variants. For BERT and DistilBERT (Sanh et al., 2019), 2D coordinates for the layout indicator tokens are not injected. Yet we still observe consistent improvements (+1.68% and +1.76% Macro F1) compared to the non-VILA counterparts, showing that VILA structural information can be used separately from positional information. Moreover, I-VILA + BERT achieves nearly identical performance to LayoutLM. This further verifies that injecting layout indicator tokens is a novel and effective way of incorporating layout information into language models.

5.4 Ablating the Optimal Configurations for H-VILA

Next, we analyze different architectures of the H-VILA models using the GROTOAP2 dataset.


                          Text Line                            Text Block
l2   Use 2D Position      F1-macro  H(G)  Inference Time      F1-macro  H(G)  Inference Time
1    ✗                    89.77     1.73  24.23(0.06)         91.93     0.00  15.90(0.19)
1    ✓                    91.30     1.17  24.47(0.45)         91.89     0.00  16.05(0.28)
12   ✗                    91.07     0.63  27.60(0.11)         92.14     0.00  16.65(0.09)
12   ✓                    91.65     0.32  28.07(0.37)         92.37★    0.00  16.37(0.15)

Table 4: Model performances for different H-VILA architectures.

By varying the number of transformer layers l2 in the page encoder, we aim to find the best configuration with an optimal balance between accuracy and efficiency. We also investigate the importance of 2D group positions by experimenting with using only the textual representation h̃i in the page encoder. Results are presented in Table 4.

As mentioned previously, we experiment with l2 ∈ {1, 12} in order to take advantage of existing pre-trained BERT weights. Interestingly, even with only a two-layer structure (when l2 = 1), the model can capture sufficient semantic information and generate predictions with good accuracy. This formulation brings considerable efficiency improvements, with the most accurate model (marked with ★ in the table) beating the baseline LayoutLM by 68% in terms of efficiency, let alone the naïve baseline (80%). This model also outperforms the baseline DistilBERT model, which reaches 90.52 Macro F1 with a 26.48 ms inference time.

We observe consistent improvements when 2D positional information is injected into the group representations. Models are able to leverage both structure and positional information to generate predictions of better accuracy and consistency: when text lines are used, modeling with the groups' 2D positions leads to lower group category inconsistency.

6 Discussion

Though we treat text lines and blocks (almost) interchangeably in the previous sections, using G(B) and G(L) may lead to different modeling outcomes. In this section, we try to answer questions about 1) which grouping level is preferred and 2) what is the best way to obtain these groupings. We start with a careful analysis of the definitions of text "blocks" and "lines", and identify possible issues with text blocks that might violate the group token uniformity assumption. Additional experiments are implemented where we compare different group detection methods on the S2-VL dataset, and we draw conclusions that might be helpful for practical use.

                  I-VILA                      H-VILA
G(B) Source       F1-macro      H(G)          F1-macro      H(G)
Ground-Truth      86.09(4.76)   1.90(0.46)    79.32(3.65)   0.45(0.28)
PDF Parsing       81.88(5.18)   3.67(0.87)    75.17(4.10)   0.02(0.01)
Vision Model      82.65(4.90)   2.51(0.44)    76.06(4.66)   0.38(0.22)

Table 5: Model performances when using different G(B) for training and evaluation on the S2-VL dataset.

                  I-VILA                      H-VILA
G(L) Source       F1-macro      H(G)          F1-macro      H(G)
PDF Parsing       82.40(4.58)   3.38(0.77)    78.88(3.89)   2.52(0.62)
Vision Model      82.87(3.95)   3.22(0.53)    80.47(2.03)   2.62(0.70)

Table 6: Model performances when using different G(L) for training and evaluation on the S2-VL dataset.

6.1 Text Blocks vs. Text Lines

Text lines are well-studied in layout analysis, partly because of their relatively unambiguous definition: consecutive texts appearing at the same vertical position⁹ are considered a text line. Text blocks, however, are less well-defined and may vary in different contexts. For example, section headers and paragraphs are labeled as separate text blocks in some cases (Zhong et al., 2019), while they are merged into a single text block in other cases due to their spatial proximity on the page (Tkaczyk et al., 2015). If block detection is inconsistent with the token semantic schema (e.g., when the model is required to differentiate section headers and paragraph texts yet they are included in the same text block), the block signal might be less helpful or even hurt VILA-based models.

6.2 S2-VL with Different Group Detectors

To verify this idea, we investigate different G(B) derived via PDF parsing (Tkaczyk et al., 2015) or vision model detection (He et al., 2017) for the S2-VL dataset.

⁹ Or horizontal position when the text is written vertically.


Figure 4: Few-shot learning experiments on the S2-VL dataset with the ground-truth text block annotations. (The plot shows F1 score versus the number of training papers, from 5 to 70, for LayoutLM and the block- and line-level I-VILA and H-VILA variants.)

We train and evaluate our models on these data variants, and report the performance in Table 5. As shown in Figure 3, the PDF parsing engine's block detection is inconsistent with our token labeling schema, and the detected blocks usually contain tokens of different categories. I-VILA models show smaller accuracy improvements in this case, and H-VILA methods are penalized heavily as they, by design, can only generate the same category prediction for all tokens in the same group.

We also vary the text line detection methods, retrain the models, and report performance in Table 6. Similarly, the vision model yields better line detection results, and thus leads to better performance in the VILA-based models.

We conclude with some practical suggestions for using VILA-based methods.

1. When consistent text blocks are available, G(B) is preferred as it can lead to better accuracy and efficiency gains. Equivalently, when creating new datasets, it is ideal to design the token labeling schema to align with the block detection method.

2. Otherwise, using G(L) or I-VILA is a viable option that can lead to consistent modeling improvements.

3. Despite being more computationally expensive, using a vision detection model can lead to more robust accuracy improvements.

6.3 Few-shot Experiments on S2-VL

Finally, we show that I-VILA methods can help with model training and improve sample efficiency when perfect text block information is given. We conduct few-shot learning experiments on the S2-VL dataset, training models using samples from 5, 10, 15, 25, and 45 papers. Similar to the previous settings, 5-fold cross validation is applied to each few-shot experiment, and the train-test splits are held constant while varying the training sample size. Equipped with text block indicators, I-VILA models trained on a 15-paper subset can outperform the baseline LayoutLM model trained on the full dataset of 70 papers. However, the text line-based I-VILA models lead to relatively smaller improvements than their text block-based counterparts. This difference could be explained by the distinct text block and line counts per page (Table 1). With 50% more text lines on a page, the special token is injected into the inputs more frequently. This can cause a bigger difference in the text structure than what the models are pre-trained on, and the frequent appearance of such tokens may diminish their relative importance.

7 Conclusion

In this paper, we introduce two new ways to integrate Visual Layout (VILA) structures into the NLP pipeline for analyzing scientific documents. We show that inserting special indicator tokens based on VILA (I-VILA) can lead to robust improvements in token classification accuracy (up to +1.9% Macro F1) and consistency (up to −32% group category inconsistency). In addition, we design a hierarchical transformer model based on VILA (H-VILA), which can reduce inference time by 46% with less than 1.5% Macro F1 reduction compared to previous SOTA methods. We release a benchmark suite, along with a newly curated dataset S2-VL, to systematically evaluate the proposed methods. We ablate the influence of different visual layout detectors on VILA-based models, and provide suggestions for practical use. Our study is well-aligned with the recent exploration of injecting structure into language models, and provides new perspectives on how to incorporate visual structures.

References

2008–2021. Grobid. https://github.com/kermitt2/grobid.

2021. ICDAR2021 competition on mathematical formula detection. http://transcriptorium.eu/~htrcontest/MathsICDAR2021/. Accessed: 2021-04-30.

Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler Murray, Hsu-Han Ooi, Matthew Peters, Joanna Power, Sam Skjonsberg, Lucy Wang, Chris Wilhelm, Zheng Yuan, Madeleine van Zuylen, and Oren Etzioni. 2018. Construction of the literature graph in semantic scholar. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 84–91, New Orleans, Louisiana. Association for Computational Linguistics.

He Bai, Peng Shi, Jimmy Lin, Yuqing Xie, Luchen Tan, Kun Xiong, Wen Gao, and Ming Li. 2020. Segatron: Segment-aware transformer for language modeling and understanding. arXiv preprint arXiv:2004.14996.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.

Leo Breiman. 2001. Random forests. Machine Learning, 45(1):5–32.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.

Haejun Lee, Drew A. Hudson, Kangwook Lee, and Christopher D. Manning. 2020. SLM: Learning a discourse language representation with sentence unshuffling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1551–1562, Online. Association for Computational Linguistics.

Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. 2020. DocBank: A benchmark dataset for document layout analysis.

Peizhao Li, Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao, Rajiv Jain, Varun Manjunatha, and Hongfu Liu. 2021. SelfDoc: Self-supervised document representation learning. arXiv preprint arXiv:2106.03331.

Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, and Peter Staar. 2021. Robust PDF document conversion using recurrent neural networks. arXiv preprint arXiv:2102.09395.

I. Loshchilov and F. Hutter. 2019. Decoupled weight decay regularization. In ICLR.

P. Micikevicius, Sharan Narang, Jonah Alben, G. Diamos, Erich Elsen, David García, Boris Ginsburg, Michael Houston, O. Kuchaiev, Ganesh Venkatesh, and H. Wu. 2018. Mixed precision training. ArXiv, abs/1710.03740.

Mark Neumann, Zejiang Shen, and Sam Skjonsberg. 2021. PAWLS: PDF annotation with labels and structure. arXiv preprint arXiv:2101.10281.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Nipun Sadvilkar and Mark Neumann. 2020. PySBD: Pragmatic sentence boundary disambiguation. In Proceedings of the Second Workshop for NLP Open Source Software (NLP-OSS), pages 110–114, Online. Association for Computational Linguistics.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, and Weining Li. 2021. LayoutParser: A unified toolkit for deep learning based document image analysis. arXiv preprint arXiv:2103.15348.

Noah Siegel, Nicholas Lourie, Russell Power, and Waleed Ammar. 2018. Extracting scientific figures with distantly supervised neural networks. In Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries, pages 223–232.


Peter W. J. Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. 2018. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 774–782.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc.

Dominika Tkaczyk, Pawel Szostek, and Lukasz Bolikowski. 2014. GROTOAP2—the methodology of creating a large ground truth dataset of scientific articles. D-Lib Magazine, 20(11/12).

Dominika Tkaczyk, Paweł Szostek, Mateusz Fedoryszak, Piotr Jan Dendek, and Łukasz Bolikowski. 2015. CERMINE: automatic extraction of structured metadata from scientific literature. International Journal on Document Analysis and Recognition (IJDAR), 18(4):317–335.

Dejan Todorovic. 2008. Gestalt principles. Scholarpedia, 3(12):5345.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Lucy Lu Wang, Isabel Cachola, Jonathan Bragg, Evie Yu-Yen Cheng, Chelsea Haupt, Matt Latzke, Bailey Kuehl, Madeleine van Zuylen, Linda Wagner, and Daniel S. Weld. 2021. Improving the accessibility of scientific documents: Current state, user needs, and a system solution to enhance scientific PDF accessibility for blind and low vision users. arXiv e-prints, pages arXiv–2105.

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, et al. 2020. CORD-19: The COVID-19 open research dataset. ArXiv.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, et al. 2020a. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. arXiv preprint arXiv:2012.14740.

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020b. LayoutLM: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1192–1200.

Liu Yang, Mingyang Zhang, Cheng Li, Michael Bendersky, and Marc Najork. 2020. Beyond 512 tokens: Siamese multi-depth transformer-based hierarchical encoder for long-form document matching. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 1725–1734.

Xingxing Zhang, Furu Wei, and Ming Zhou. 2019. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5059–5069, Florence, Italy. Association for Computational Linguistics.

Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. 2019. PubLayNet: largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1015–1022. IEEE.