
Program Enhanced Fact Verification with Verbalization and Graph Attention Network

Xiaoyu Yang†∗, Feng Nie§∗, Yufei Feng†, Quan Liu‡, Zhigang Chen‡, Xiaodan Zhu†
† ECE & Ingenuity Labs Research Institute, Queen's University
§ Sun Yat-sen University
‡ State Key Laboratory of Cognitive Intelligence, iFLYTEK Research
{xiaoyu.yang, xiaodan.zhu, yufei.feng}@queensu.ca
[email protected]
{quanliu, zgchen}@iflytek.com

Abstract

Performing fact verification based on structured data is important for many real-life applications and is a challenging research problem, particularly when it involves both symbolic operations and informal inference based on language understanding. In this paper, we present a Program-enhanced Verbalization and Graph ATtention Network (ProgVGAT) to integrate programs and execution into textual inference models. Specifically, a verbalization with program execution model is proposed to accumulate evidences that are embedded in operations over the tables. Built on that, we construct the graph attention verification networks, which are designed to fuse different sources of evidences from verbalized program execution, program structures, and the original statements and tables, to make the final verification decision. To support the above framework, we propose a program selection module optimized with a new training strategy based on margin loss, to produce more accurate programs, which is shown to be effective in enhancing the final verification results. Experimental results show that the proposed framework achieves the new state-of-the-art performance, a 74.4% accuracy, on the benchmark dataset TABFACT. Our code is available at https://github.com/arielsho/Program-Enhanced-Table-Fact-Checking.

1 Introduction

With the overwhelming information available on the Internet, fact verification has become crucial for many applications such as detecting fake news, rumors, and political deception (Rashkin et al., 2017; Thorne et al., 2018; Goodrich et al., 2019; Vaibhav et al., 2019; Kryscinski et al., 2019), among others. Existing research has mainly focused on collecting and processing evidences from unstructured text data (Liu et al., 2020; Nie et al., 2019; Hanselowski et al., 2018; Yoneda et al., 2018), which is only one type of data where important facts exist.

∗ Equal contribution to this work. The work was done during the second author's visit to Queen's University.

Year | Tournaments Played | Avg. Score | Scoring Rank
2007 | 22 | 72.46 | 81
2008 | 29 | 71.65 | 22
2009 | 25 | 71.90 | 34
2010 | 18 | 73.42 | 92
2011 | 11 | 74.42 | 125

Table with title 'Ji-young Oh'

Statement: Ji-young Oh played more tournament in 2008 than any other year.
Label: ENTAILED
Program: eq { max { all_rows ; tournaments played } ; hop { filter_eq { all_rows ; year ; 2008 } ; tournaments played } } = True

Figure 1: An example of fact verification over tables.

Structured and semi-structured data, e.g., tables in relational databases or in HTML format, are also ubiquitous. Performing fact verification based on structured data is important yet challenging, and further study is highly desirable. Fig. 1 depicts a simplified example in which systems are expected to decide whether the facts in the table support the natural language statement.

In addition to its importance in applications, the task presents research challenges of fundamental interest: the problem inherently involves both informal inference based on language understanding (Dagan et al., 2005; MacCartney and Manning, 2009, 2008; Bowman et al., 2015, 2016) and symbolic operations such as mathematical operations (e.g., count and max). Recently, pre-trained language models such as BERT (Devlin et al., 2019) have shown superior performance in natural language inference by leveraging knowledge from large text datasets, and can capture complicated semantic and syntactic information among premises and hypotheses (Radford, 2018; Radford et al., 2019; Liu et al., 2019; Dong et al., 2019b).



However, such methods tend to fail when verification requires the joint modelling of both symbolic operations and language inference (Gupta et al., 2020), such as the case depicted in Fig. 1.

To effectively enable symbolic operations and integrate them into language-based inference models, we propose a framework centered around programs, i.e., logical forms that can be executed to find evidences from structured data. Our model starts with a program selection module, for which we propose a new training strategy based on margin loss to find programs that can accurately extract supporting facts from tables. To bridge the semantic gap between structured programs and tables, as well as to leverage the structures of programs, we propose a novel model based on verbalization with program execution. The verbalization algorithm interweaves with program execution in order to accumulate evidences inherently embedded in operations, recursively converting executed operations in programs into natural language sentences. Built on that, we propose a graph-based verification network to fuse different sources of evidences from verbalized program execution, together with the original statements and tables, to support the final verification decision.

We conduct experiments on the recently proposed large-scale benchmark dataset TABFACT (Chen et al., 2020). Experimental results show that our proposed framework achieves new state-of-the-art performance, an accuracy of 74.4%, substantially improving over the previously reported best accuracy of 71.7%. Our detailed analysis shows the effectiveness of the verbalization and graph-based verification network in utilizing programs to achieve the improvement. The analysis also demonstrates that program selection optimized with the proposed margin-loss training strategy effectively improves the final verification results.

2 Related Work

Fact Verification. Existing work on fact verification is mainly based on collecting and using evidences from unstructured text data (Liu et al., 2020; Nie et al., 2019; Hanselowski et al., 2018; Yoneda et al., 2018). FEVER (Thorne et al., 2018) is one of the most influential benchmark datasets built to evaluate systems in checking claims by retrieving Wikipedia articles and extracting evidence sentences. The recently proposed FEVER 2.0 (Thorne et al., 2019) adds a more challenging dataset for verifying factoid claims, along with an adversarial attack task. Some previous models are developed on the official baseline (Thorne et al., 2018) with a three-step pipeline (Chen et al., 2017a) for fact verification (Hanselowski et al., 2018; Yoneda et al., 2018; Yin and Roth, 2018; Nie et al., 2019); others formulate fact verification as graph reasoning (Zhou et al., 2019; Liu et al., 2020). The natural language inference (NLI) task is also a verification problem, fully based on unstructured text data (Dagan et al., 2005, 2010; Bowman et al., 2015; Parikh et al., 2016; Chen et al., 2017c; Ghaeini et al., 2018; Peters et al., 2018). Neural models proposed for NLI have been shown to be effective (Parikh et al., 2016; Chen et al., 2017d,e; Ghaeini et al., 2018; Peters et al., 2018), including models incorporating external knowledge (Chen et al., 2017b; Yang et al., 2019). Our work focuses on fact verification based on structured tables (Chen et al., 2020).

For verification performed on structured data, Chen et al. (2020) propose a typical baseline (Table-BERT), a semantic matching model that takes a linearized table T and a statement S as input and employs BERT for verification. The other model (LPA) proposed in (Chen et al., 2020) uses Transformer blocks to compute the semantic similarity between a statement and a program. A contemporaneous work (Zhong et al., 2020) proposes LogicalFactChecker, which aims to leverage programs for fact verification; it utilizes the inherent structures of programs to prune irrelevant information in evidence tables and modularizes symbolic operations with module networks. Different from theirs, our proposed framework verbalizes the accumulated evidences from program execution to support the final verification decision with graph attention networks.

Semantic Parsing. A line of work uses program synthesis or logic forms to address different natural language processing problems, such as question answering (Berant et al., 2013; Berant and Liang, 2014), code generation (Yin and Neubig, 2017), SQL synthesis (Zhong et al., 2017; Yu et al., 2018), and mathematical problem solving (Kushman et al., 2014; Shi et al., 2015). Traditional semantic parsing methods rely heavily on rules and lexicons to parse texts into structured representations (Zettlemoyer and Collins, 2005; Berant et al., 2013; Artzi and Zettlemoyer, 2013). Recent semantic parsing methods strive to leverage the power of neural networks (Neelakantan et al., 2016; Jia and Liang, 2016; Liang et al., 2017; Yu et al., 2018; Dong et al., 2019a). Our work leverages the symbolic operations inherent in programs produced by neural semantic parsers to enhance fact verification over structured data.

Selected Program: eq { max { all_rows ; tournaments played } ; hop { filter_eq { all_rows ; year ; 2008 } ; tournaments played } }

Verbalized Evidence:
V1: The max value of column tournaments played is 29.
V2: The table where column year equal to 2008 is row 2.
V3: The value of column tournaments played in the table where column year with value 2008 is 29.
V4: 29 is equal to 29.

Figure 2: An overview of the proposed framework. Candidate programs are first generated for the statement-table pair; the program selection module picks the best program; verbalization with program execution converts the executed operations into evidence sentences (V1-V4 above); and the graph-based verification module fuses Prog-Exec nodes, entity nodes, and a Table-BERT node with graph attention and gated attention to produce the final prediction.

3 Model

We present the proposed framework (ProgVGAT), which centers around programs and execution to integrate symbolic operations for fact verification. Fig. 2 depicts the overall architecture. This section is organized as follows. We first introduce the task formulation along with program representations in Sec. 3.1. Then, we describe in Sec. 3.2 the program selection module that aims to obtain a semantically relevant program for verification. Built on that, we present our proposed verbalization algorithm and graph attention network, which dynamically accumulate evidences embedded in the symbolic operations of programs for final verification, in Sec. 3.3 and Sec. 3.4.

3.1 Task Formulation and Notations

Formally, given a structured evidence table T and a statement S, the fact verification task aims to predict whether T entails S or refutes it. The evidence table $T = \{T_{i,j} \mid i \le R, j \le C\}$ has R rows and C columns, and $T_{i,j}$ is the value in the (i, j)-th cell. $T_{i,j}$ can be of different data types, e.g., a word, number, phrase, or even a natural language sentence.

Program representation. Given a statement S, a semantically consistent program $z = \{op_i\}_{i=1}^{M}$ is a tree consisting of multiple executable symbolic operations $op_i$. An example program is shown in the center of Fig. 2. An operation $op_i = (op_i.t, op_i.arg)$ contains an operator $op_i.t$ (e.g., max in the figure) and arguments $op_i.arg$ relevant to table T (e.g., all_rows and tournaments played), and the execution of an operation yields an output/answer ans (e.g., 29). Before building the model, we follow previous work (Chen et al., 2020) and perform rule-based entity linking and latent program search to obtain a set of candidate programs $Z = \{z_i\}_{i=1}^{N}$. Specifically, entity linking (Nie et al., 2018) detects relevant entities (i.e., cells in the evidence table T) in statement S using a set of string matching rules, and the latent program search algorithm finds all valid combinations of pre-defined operations and detected entities by traversing and executing them recursively over the evidence table T.
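To make the program representation concrete, the following is a minimal Python sketch of such a tree of operations; the class and field names are illustrative assumptions on our part, not the authors' released code.

from dataclasses import dataclass
from typing import List, Union

# An argument is either a nested operation, a table-linked entity such as
# a column name or cell value, or the special token "all_rows".
Arg = Union["Operation", str, float]

@dataclass
class Operation:
    t: str           # operator, e.g., "eq", "max", "hop", "filter_eq"
    args: List[Arg]  # arguments relevant to the evidence table T

# The program from Fig. 1 as a tree:
# eq { max { all_rows ; tournaments played } ;
#      hop { filter_eq { all_rows ; year ; 2008 } ; tournaments played } }
program = Operation("eq", [
    Operation("max", ["all_rows", "tournaments played"]),
    Operation("hop", [
        Operation("filter_eq", ["all_rows", "year", "2008"]),
        "tournaments played",
    ]),
])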

3.2 Program Selection

Given a statement S, program selection aims to obtain a high-quality program z∗ from the set of candidate programs $Z = \{z_i\}_{i=1}^{N}$. Previous work (Chen et al., 2020) optimizes the model via a cross-entropy loss:

$$J(\theta) = -\sum_{z \in Z} \mathbb{1}[z]_y \log p_\theta(z \mid S, T) \tag{1}$$

where $\mathbb{1}[z]_y$ is an indicator function which takes the value 1 if the execution result of a program z (i.e., the output of the root; "True" in Fig. 2) is consistent with the ground-truth verification label y, and 0 otherwise. The former type of programs are called label-consistent programs and the latter label-inconsistent programs. Despite being simple, this loss ignores the fact that only one of the label-consistent programs is correct, and it can potentially assign too much credit to spurious programs that reach the correct verification label through incorrect operations during training. Meanwhile, the loss function considers every program in Z during training, while only the single most relevant program z∗ is selected in the testing phase, creating a discrepancy between training and testing.

To remedy these issues, we introduce a margin loss which encourages selecting the most positive program (i.e., the label-consistent program with the maximum semantic similarity score) while maintaining a margin from the label-inconsistent programs:

$$J(\theta) = \max\big(p_\theta(z'_{neg} \mid S, T) - p_\theta(z'_{pos} \mid S, T) + \gamma,\; 0\big) \tag{2}$$

where $z'_{neg}$ and $z'_{pos}$ refer to the label-inconsistent program and the label-consistent program with the highest probability in their respective categories, and γ is a hyperparameter controlling the margin between the positive and negative instances. To measure the semantic similarity between a candidate program z and the corresponding statement S, we leverage the pre-trained language model BERT (Devlin et al., 2019) instead of simply training transformer layers as proposed in (Chen et al., 2020). Specifically, a (S, z) pair is prefixed with the [CLS] token and suffixed with the [SEP] token to form the input of BERT. A 1-layer MLP with a sigmoid is then applied on top of the BERT model to produce the final probability $p_\theta(z \mid S, T) = \sigma(W_r h)$, where h is the top-layer [CLS] representation.

Instead of only selecting the top program based on Eq. 2, we also tried the exploration strategy proposed in (Guu et al., 2017) to sample a non-top label-consistent program with a small probability. However, this did not further improve the verification performance on the development set, so we use the first-ranked program in the remainder of this paper. Our proposed method relies on the program produced by this module; we further examine the importance of program quality in the experiments (Sec. 5).

3.3 Verbalization with Program Execution

With the derived program z∗ for each (S, T) pair, we propose to verbalize program execution, with the operations in a program being recursively executed over the evidence table in a post-order traversal along the program tree. The verbalization algorithm converts the execution, including operators, arguments, and execution output, into natural language sentences, accumulating the evidences that are inherently embedded in the operations.

Formally, an operation $op_i = (op_i.t, op_i.arg)$ contains an operator $op_i.t$ and arguments $op_i.arg$, and its execution yields an output/answer $ans_i$.

Algorithm 1 Verbalization
Require: a statement and evidence table pair (S, T) and parsed program $z^* = \{op_i\}_{i=1}^{M}$; pre-defined operators $P = \{p_i\}_{i=1}^{R}$; a template function F(·) that maps an operation and its result into sentences.

1:  function VERBALIZATION(op, ret)
2:      args = {}, verb_args = {}
3:      for a_j in arguments of operation op do
4:          if a_j is an operator in P then
5:              arg_ans, verb_arg = VERBALIZATION(a_j, ret)
6:              args ← args ∪ arg_ans
7:              verb_args ← verb_args ∪ verb_arg
8:          else
9:              args ← args ∪ a_j
10:             verb_args ← verb_args ∪ str(a_j)
11:         end if
12:     end for
13:     Apply operation (op.t, args) over evidence table T, obtaining operation result ans
14:     Apply F(op.t, verb_args, ans), obtaining verbalized operation result verb_ans and verbalized operation verb_op
15:     Update ret ← ret ∪ verb_ans
16:     return ans, verb_op
17: end function

Set verbalized program execution ret = {}
VERBALIZATION(op_1, ret)
return ret


Operation | Operation template | Operation result template
count, [verb_arg1], ans: string/number | the number of [verb_arg1] | the number of [verb_arg1] is [to_string(ans)]
greater, [verb_arg1, verb_arg2], ans: true/false | [verb_arg1] greater than [verb_arg2] | [arg1's template] [true: is / false: is not] greater than [arg2's template]
filter_eq, [verb_arg1, verb_arg2, verb_arg3], ans: rows | [verb_arg1] where column [verb_arg2] equal to [verb_arg3] | [verb_arg1] where column [verb_arg2] equal to [verb_arg3] is row [indices of ans]

Table 1: Examples of generation templates for different operations.

Algorithm 1 describes the verbalization procedure. The post-order traversal over program z∗ and the execution of operations appear in lines 3 to 13. The template-based generation that converts an executed operation (its operator, arguments, and output) into natural language sentences appears in line 14. As such, the execution of each operation $\{op_i\}_{i=1}^{M}$ in the program z∗ is converted into an evidence sentence $V = \{v_i\}_{i=1}^{M}$. Table 1 lists a few operation templates (the full templates are listed in Appendix A.3). Note that the proposed verbalization can be easily generalized to other domains by extending the templates. We leave exploring different generation methods as future work; for structured programs with fixed operations, template-based methods are often very effective already. Fig. 2 gives an example produced by our verbalization algorithm, and a small runnable sketch of the procedure follows below.
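For concreteness, here is a minimal, runnable Python rendering of Algorithm 1, re-defining the Operation class from the earlier sketch for self-containment. The executor and template function below are toy stand-ins (our assumptions) for the paper's 50 pre-defined operations and the full templates of Appendix A.3.

from dataclasses import dataclass
from typing import List, Union

@dataclass
class Operation:
    t: str                               # operator name, e.g., "count"
    args: List[Union["Operation", str]]  # sub-operations or entities

# Toy executor: operator name -> function over the table and evaluated args.
EXEC = {
    "filter_eq": lambda table, view, col, val:
        [r for r in (view if isinstance(view, list) else table)
         if str(r[col]) == str(val)],
    "count": lambda table, rows: len(rows),
}

def TEMPLATE(op_t, verb_args, ans):
    """Toy template function F: returns (verbalized operation, sentence)."""
    if op_t == "filter_eq":
        verb_op = (f"{verb_args[0]} where column {verb_args[1]} "
                   f"equal to {verb_args[2]}")
        return verb_op, f"{verb_op} is row [...]"  # row indices omitted here
    if op_t == "count":
        verb_op = f"the number of {verb_args[0]}"
        return verb_op, f"{verb_op} is {ans}"
    raise KeyError(op_t)

def verbalize(op, ret, table):
    """Post-order traversal of the program tree (Algorithm 1, lines 1-17)."""
    args, verb_args = [], []
    for a in op.args:
        if isinstance(a, Operation):     # nested operation: recurse first
            ans, verb = verbalize(a, ret, table)
            args.append(ans)
            verb_args.append(verb)
        else:                            # plain entity or constant
            args.append(a)
            verb_args.append(str(a))
    ans = EXEC[op.t](table, *args)       # line 13: execute over table T
    verb_op, verb_ans = TEMPLATE(op.t, verb_args, ans)  # line 14: apply F
    ret.append(verb_ans)                 # line 15: accumulate evidence
    return ans, verb_op

# Usage on a tiny table:
table = [{"bronze": "tatiana ryabkina"}, {"bronze": "lena eliasson"}]
prog = Operation("count", [
    Operation("filter_eq", ["all_rows", "bronze", "tatiana ryabkina"])])
ret = []
verbalize(prog, ret, table)
# ret now holds the verbalized evidence sentences, in execution order.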

3.4 Graph-based Verification Network

We propose a graph attention verification network, designed to fuse different sources of evidences from the verbalized program execution and the program structures, together with the original statement S and table T, to make the final verification decision, as shown in the right subfigure of Fig. 2.

3.4.1 Graph Definition

Nodes. The graph G = (V, E) contains three types of nodes V and three types of edges E. The first type of nodes, $(n_0, \ldots, n_{M-1})$, encode the verbalized program executions obtained above; these are called Prog-Exec nodes and are shown as green nodes on the right of Fig. 2, where M is the number of operations in a program. The second type encodes program entities; these entity nodes are shown as grey nodes. As each operation execution $op_i$ consists of arguments and an execution output, we construct nodes $(n_M, \ldots, n_{K-2})$ for these entities. The third type of node utilizes the information in the original table and statement: we design a Table-BERT node, $n_{K-1}$, initialized with the output of the Table-BERT model proposed in (Chen et al., 2020), denoted as the orange node in Fig. 2. In total, we have K nodes, where K varies across different (S, T, z∗) triples.

Edges. For a graph G with K nodes, the adjacency matrix $A \in \{0, 1\}^{K \times K}$ reflects their connectivity, where $A_{i,j}$ is set to 1 if nodes $n_i$ and $n_j$ are connected with an edge. Similar to graph nodes, we have three types of edges, and we design different attention heads to handle the different types of edges/connections, as detailed in Section 3.4.3.

The first type of edges connects the verbalized-program-execution nodes based on the program tree structure: we connect node $n_i$ and node $n_j$ if the corresponding operation $op_i$ is a parent or child of operation $op_j$ in the program. (We define both directed and undirected graphs; in the directed graph, the Prog-Exec part of the adjacency matrix is no longer symmetric. The two versions performed similarly on the development set, so we use the undirected version in the remainder of the paper.) The second type of edges connects the Prog-Exec nodes $\{n_i\}_{i=0}^{M-1}$ with their corresponding entity nodes $(n_M, \ldots, n_{K-2})$. The third type connects the Prog-Exec nodes, i.e., the verbalized symbolic operations and executions, with the Table-BERT node, which performs NLI-based verification directly on statement S and table T. A sketch of assembling these typed adjacency matrices follows below.
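For illustration, the three typed adjacency matrices could be assembled as in the sketch below; the function and argument names are our assumptions.

import torch

def build_adjacency(num_nodes, tree_edges, entity_edges,
                    prog_exec_idx, table_bert_idx):
    """Builds one K x K {0,1} adjacency matrix per edge type (undirected).
    tree_edges:   (i, j) pairs of parent/child operations   -> type 0
    entity_edges: (prog_node, entity_node) pairs            -> type 1
    type 2 links every Prog-Exec node to the Table-BERT node."""
    A = torch.zeros(3, num_nodes, num_nodes)
    for i, j in tree_edges:
        A[0, i, j] = A[0, j, i] = 1.0
    for i, j in entity_edges:
        A[1, i, j] = A[1, j, i] = 1.0
    for i in prog_exec_idx:
        A[2, i, table_bert_idx] = A[2, table_bert_idx, i] = 1.0
    return A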

3.4.2 Graph Construction and Initialization

For the Table-BERT node, we utilize the Table-BERT model proposed in (Chen et al., 2020) to obtain the representation:

$$h_{K-1} = f_{BERT}([T; S]) \tag{3}$$

where the table T is linearized into sentences before being concatenated with S; $h_{K-1} \in \mathbb{R}^{F \times 1}$, and F is the number of features in a node. We refer readers to (Chen et al., 2020) for details.

For Prog-Exec nodes, instead of directly using the verbalized program executions, the nodes are constructed and initialized to take the surrounding context into account, as follows.


Given a program z∗, the verbalization proposed in Sec. 3.3 generates M sentences $\{v_i\}_{i=1}^{M}$. We use the document-level BERT of Liu and Lapata (2019) to encode these sentences: a [CLS] token is inserted at the beginning of each $v_i$ and a [SEP] token at its end, and the segment embedding is set to $E_A$ when i is odd and $E_B$ when i is even. After applying BERT, we take the corresponding top-layer [CLS] vector (e.g., the [CLS] inserted before $v_2$) as the representation of the corresponding Prog-Exec node (e.g., $n_2$).

For entity nodes, we take the contextualized embeddings at the positions corresponding to the entities in the top layer of the BERT model as the node representations. For entities with multiple words, average pooling is applied to produce the final entity representation.
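As an illustration, a minimal sketch of this node initialization with the HuggingFace transformers library might look as follows; the checkpoint name and the toy sentences are our assumptions.

import torch
from transformers import BertModel, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Verbalized evidence sentences v_1..v_M from Sec. 3.3 (toy examples).
sentences = ["the max value of column tournaments played is 29.",
             "the table where column year equal to 2008 is row 2."]

# Document-level BERT input: [CLS] v_i [SEP] for each sentence, with
# alternating segment ids (E_A for odd i, E_B for even i).
ids, seg = [], []
for i, s in enumerate(sentences, start=1):
    piece = tok.encode(s, add_special_tokens=True)  # [CLS] ... [SEP]
    ids.extend(piece)
    seg.extend([0 if i % 2 == 1 else 1] * len(piece))

with torch.no_grad():
    out = bert(input_ids=torch.tensor([ids]),
               token_type_ids=torch.tensor([seg]))

# Each Prog-Exec node n_i takes the top-layer vector at its [CLS] position;
# entity nodes would instead average the vectors at their entity positions.
cls_pos = [j for j, t in enumerate(ids) if t == tok.cls_token_id]
prog_exec_nodes = out.last_hidden_state[0, cls_pos]  # shape: (M, 768)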

3.4.3 Reasoning with Graph Attentions

Graph Propagation. Unlike standard graph propagation (Velickovic et al., 2018), we model the different types of edges in the propagation process. Specifically, we use the following formula to update each node representation $h_i$ in graph G:

$$h_i^{new} = f\Big( \big\Vert_{d=1}^{D} \, \sigma\big( \sum_{j \in \mathcal{N}_i^d} \alpha_{ij}^d W h_j \big) \Big) \tag{4}$$

where

$$e_{ij} = a(U h_i, U h_j) \tag{5}$$

$$\alpha_{ij}^d = \frac{\exp(e_{ij})}{\sum_{k=1}^{K} A_{i,k}^d \exp(e_{ik})} \tag{6}$$

Here $U \in \mathbb{R}^{F \times L}$ and $W \in \mathbb{R}^{F \times F}$ are trainable parameters, a(·) denotes a shared attention mechanism $\mathbb{R}^L \times \mathbb{R}^L \to \mathbb{R}$, $\Vert$ denotes the concatenation operation, f is a feed-forward layer, and D is the number of edge types (D is set to 3 in this paper).

To propagate along the different types of edges, we extend masked attention with a multi-head mechanism that encodes each edge type in a separate head. In particular, masked self-attention is performed for each edge type d: the masked attention computes normalized attention coefficients $\alpha_{i,j}^d$ between node $n_i$ and its neighbor $n_j$ under edge type d ($A_{i,j}^d = 1$ means that nodes $n_i$ and $n_j$ are connected by an edge of type d, where $A^d$ is the adjacency matrix constructed above). To aggregate the node representations from all heads, we concatenate the D updated vectors and apply the feed-forward layer f in Eq. 4, yielding the final node representation $h_i^{new}$.
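A compact PyTorch sketch of Eqs. 4-6 is given below. It is our reading of the propagation rule (one masked attention head per edge type), not the authors' released implementation; details such as the choice of nonlinearity are assumptions.

import torch
import torch.nn as nn

class TypedGraphAttention(nn.Module):
    """Sketch of Eqs. 4-6: one masked attention head per edge type,
    concatenated and passed through a feed-forward layer f."""

    def __init__(self, feat_dim: int, attn_dim: int, num_types: int = 3):
        super().__init__()
        self.U = nn.Linear(feat_dim, attn_dim, bias=False)  # U in Eq. 5
        self.W = nn.Linear(feat_dim, feat_dim, bias=False)  # W in Eq. 4
        self.a = nn.Linear(2 * attn_dim, 1, bias=False)     # shared a(.)
        self.f = nn.Linear(num_types * feat_dim, feat_dim)  # aggregation f
        self.num_types = num_types

    def forward(self, h: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # h: (K, F) node features; A: (D, K, K) typed adjacency matrices
        K = h.size(0)
        u = self.U(h)
        pair = torch.cat([u.unsqueeze(1).expand(K, K, -1),
                          u.unsqueeze(0).expand(K, K, -1)], dim=-1)
        e = self.a(pair).squeeze(-1)                   # e_ij, Eq. 5
        heads = []
        for d in range(self.num_types):
            e_d = e.masked_fill(A[d] == 0, float("-inf"))
            alpha = torch.softmax(e_d, dim=-1)         # Eq. 6, per edge type
            alpha = torch.nan_to_num(alpha)            # no type-d neighbors
            heads.append(torch.sigmoid(alpha @ self.W(h)))
        return self.f(torch.cat(heads, dim=-1))        # Eq. 4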

Gated Attention. To aggregate the information in the graph, we employ a gated attention mechanism to obtain the final aggregated representation $h^{final}$ and predict the final verification label y as follows:

$$h^{final} = \sum_{i=0}^{M-1} p_i h_i^{new}, \qquad p_i = \sigma\big(h_{K-1}^{\top} h_i^{new}\big) \tag{7}$$

$$y = \sigma\big(W_f [h^{final} \Vert h_{K-1}]\big) \tag{8}$$

where $W_f$ are trainable parameters, σ is the sigmoid function, and ∥ denotes the concatenation operation.
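The readout in Eqs. 7-8 amounts to gating each Prog-Exec node by its similarity to the Table-BERT node; a minimal sketch follows, where tensor shapes and names are our assumptions.

import torch

def gated_readout(h_new: torch.Tensor, h_table_bert: torch.Tensor,
                  W_f: torch.Tensor) -> torch.Tensor:
    """Sketch of Eqs. 7-8. h_new: (M, F) updated Prog-Exec nodes;
    h_table_bert: (F,) Table-BERT node; W_f: (1, 2F) classifier weights."""
    p = torch.sigmoid(h_new @ h_table_bert)          # gates p_i, Eq. 7
    h_final = (p.unsqueeze(-1) * h_new).sum(dim=0)   # weighted sum, (F,)
    logits = W_f @ torch.cat([h_final, h_table_bert])
    return torch.sigmoid(logits)                     # Eq. 8: P(ENTAILED)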

4 Experiment Setup

Data and Evaluation Metric. As discussed above, although (semi-)structured and unstructured text data are ubiquitous in our daily life, performing fact verification across these different formats is relatively new. We conduct our experiments on the recently released large-scale dataset TABFACT (Chen et al., 2020).

TABFACT contains 92,283, 12,792, and 12,779 table-statement pairs for training, validation, and testing, respectively. Since verification of some statements requires higher-order semantics such as argmax, the test set is further split into simple and complex subsets according to verification difficulty. A small test subset is also provided for which a human upper-bound performance is given. Following existing work, we use accuracy as the evaluation metric. Detailed data statistics are listed in Appendix A.1.

Training Details. For the parameters in all BERT models, the hidden size is set to 768; we use the Adam optimizer (Kingma and Ba, 2014) with learning rate 2e-5, 3,000 warmup steps, and dropout 0.1. For the graph attention network, the hidden feature dimensions F and L are both set to 768. All code is implemented in PyTorch (Paszke et al., 2019). All hyper-parameters are chosen according to validation performance.

Compared Systems. We compare our models with the typical baselines proposed in (Chen et al., 2020). We also present results for the contemporaneous LogicalFactChecker (Zhong et al., 2020), which reports the best previous performance in the literature. Details of the baselines are discussed in the related work (Section 2).

5 Results

Overall Performance. Table 2 presents the results of different verification models.


Model | Val | Test | Test (simple) | Test (complex) | Small Test
Human Performance | - | - | - | - | 92.1
Table-BERT-Horizontal-S+T-Concatenate | 50.7 | 50.4 | 50.8 | 50.0 | 50.3
Table-BERT-Vertical-S+T-Template | 56.7 | 56.2 | 59.8 | 55.0 | 56.2
Table-BERT-Vertical-T+S-Template | 56.7 | 57.0 | 60.6 | 54.3 | 55.5
Table-BERT-Horizontal-S+T-Template | 66.0 | 65.1 | 79.0 | 58.1 | 67.9
Table-BERT-Horizontal-T+S-Template | 66.1 | 65.1 | 79.1 | 58.2 | 68.1
LPA-Voting w/o Discriminator | 57.7 | 58.2 | 68.5 | 53.2 | 61.5
LPA-Weighted-Voting | 62.5 | 63.1 | 74.6 | 57.3 | 66.8
LPA-Ranking w/ Discriminator | 65.2 | 65.0 | 78.4 | 58.5 | 68.6
LogicalFactChecker (Zhong et al., 2020) | 71.8 | 71.7 | 85.4 | 65.1 | 74.3
ProgVGAT | 74.9 | 74.4 | 88.3 | 67.6 | 76.2

Table 2: Performance (accuracy) of different models on TABFACT. For the Table-BERT baselines, different strategies are used to linearize tables and bridge the semantic gap with statements: Horizontal and Vertical refer to traversing the items in tables horizontally or vertically; S denotes statements, T denotes tables, and + indicates the concatenation order between S and T; Concatenate refers to directly concatenating the items in tables, while Template converts the items in tables into sentences with pre-defined templates. For the LPA baselines, to select one program among all candidates for each statement, either a (weighted) voting strategy or a discriminator is used.

Our proposed method obtains an accuracy of 74.4% on the test set, achieving a new state of the art on this dataset.

The Table-BERT baseline leverages pre-trained language models to measure the semantic similarity of table-statement pairs. LPA derives a synthesized program that best describes the statement-table pair and executes the derived program against the semi-structured table for verification. Our proposed method introduces verbalization and a graph attention network for fact verification, integrating the execution of programs into pre-trained language models, and outperforms Table-BERT and LPA by a large margin.

Compared with LogicalFactChecker, our proposed method is built to leverage operation execution evidences and the inherent structure information through verbalization and graph attention, while LogicalFactChecker focuses on pruning irrelevant rows and columns in the evidence table with programs and on modularizing symbolic operations with module networks. Our proposed method achieves better results (74.4%) than LogicalFactChecker (71.7%). This result suggests the effectiveness of introducing executed operation results: symbolic operations performed on evidence tables provide useful information for verification.

Although the proposed model improves the state-of-the-art performance on the entire dataset as well as on all subsets, the complex subset remains hard to solve. On the small test set, where the human upper-bound performance is provided (92.1%), there is still a large gap between system and human performance. We hope our work will serve as a basis for further work on this problem.

Model | Val | Test
Table-BERT w/ prog | 70.3 | 70.0
LogicalFactChecker | 71.8 | 71.7
Table-BERT w/ verb. prog | 71.8 | 71.6
Table-BERT w/ verb. prog exec | 72.4 | 72.2
ProgVGAT | 74.9 | 74.4

Table 3: Results of different ways of using operations.

Model | Val | Test
ProgVGAT w/o graph attention | 73.6 | 73.4
ProgVGAT | 74.9 | 74.4

Table 4: Ablation results (accuracy) showing the effectiveness of our graph attention component.

Effect of Program Operations. A key component of our proposed framework is utilizing the executed operation results for verification, for which we introduce a verbalization algorithm that transforms the recursive execution of symbolic operations into evidence sentences. We further investigate different ways of leveraging operations: (1) Table-BERT w/ prog directly concatenates the derived program z∗ with the original table as new evidence and employs Table-BERT on the new evidence-statement pair; (2) Table-BERT w/ verb. prog differs from Table-BERT w/ prog in that the derived program z is converted into sentences with the templates proposed in the verbalization algorithm of Sec. 3.3; (3) Table-BERT w/ verb. prog exec verbalizes the program along with its execution results using the algorithm proposed in Sec. 3.3.

Model | Program Selection Val | Program Selection Test | Final Verification Val | Final Verification Test | ∆Test
LPA w/ CE | 65.2 | 65.0 | 73.3 | 72.8 | -
LPA + BERT w/ CE | 67.7 | 67.3 | 73.9 | 73.4 | +0.6
LPA + BERT w/ Margin loss | 69.4 | 68.5 | 74.9 | 74.4 | +1.6

Table 5: Accuracy of different program selection models and the corresponding final verification performance based on the verbalized evidence derived from each program selection model.

Table 3 shows the results. Compared to directly concatenating the structured program with the original Table-BERT input, Table-BERT w/ verb. prog converts the structured program into natural sentences and achieves better results. This demonstrates the importance of eliminating the semantic discrepancy between structured data and natural language in a BERT-based verification model. Table-BERT w/ verb. prog exec leverages executed operation results with the verbalization algorithm and outperforms both Table-BERT w/ verb. prog and LogicalFactChecker; the execution output provides complementary clues from evidence tables and improves the verification results. Our proposed ProgVGAT further leverages the structures in program execution and boosts the performance. The results confirm the effectiveness of leveraging symbolic operations with our method.

Effect of Graph Attention. We investigate the necessity of leveraging the structure information in program execution for verification. We present a simpler version of our model by removing the graph attention module and integrating the verbalized program execution with the gated attention mechanism in Eq. 7. Table 4 presents the results. By removing the graph attention network, the verification accuracy drops by 1.3% on the validation set and 1.0% on the test set. The results show that integrating the structures of programs in the graph attention network is important for verification.

Effect of Derived Programs. We investigate the effect of more accurate program selection models on the final verification. Table 5 presents the results of our model when leveraging programs produced by different program selection models.

We first introduce the different program selection models in Table 5. LPA w/ CE (Chen et al., 2020) applies Transformer encoders without a pre-training stage (i.e., without BERT) to compute the semantic similarity between candidate programs and statements, and optimizes via the cross-entropy loss in Eq. 1; it achieves a 65.0% accuracy on the test set. Our proposed program selection module, denoted LPA+BERT w/ Margin loss, replaces the Transformer encoders with BERT and optimizes the model with our proposed margin loss, effectively improving the program accuracy to 68.5%. Compared with the accuracy of 67.3% obtained by LPA+BERT w/ CE, which is optimized with the cross-entropy loss instead of the margin loss, we conclude that the proposed margin loss plays a positive role in program selection.

Accordingly, we compare the final verification results based on the different programs selected by the above models. Our proposed method, leveraging programs produced by LPA+BERT w/ Margin loss, obtains better verification results (e.g., 74.4% on the test set) than when using programs derived by LPA (e.g., 72.8% on the test set). The results indicate that more accurate programs provide more useful operation information, benefiting the final verification.

Qualitative Analysis. We provide an example where integrating program execution information with our proposed verbalization and graph attention network yields the correct label.


Year | Gold | Silver | Bronze
2002 | Gunilla Svärd | Brigitte Wolf | Birgitte Husebye
2004 | Hanne Staff | Dainora Alšauskaitė | Tatiana Ryabkina
2006 | Minna Kauppi | Marianne Andersen | Heli Jukkola
2008 | Heli Jukkola | Merja Rantanen | Minna Kauppi
2010 | Simone Niggli | Signe Soes | Lena Eliasson
2012 | Simone Niggli | Minna Kauppi | Tatiana Ryabkina

Statement: Tatiana Ryabkina has won the bronze medal more times than Lena Eliasson has.
Label: ENTAILED
Program: greater { count { filter_eq { all_rows ; bronze ; tatiana ryabkina } } ; count { filter_eq { all_rows ; bronze ; lena eliasson } } }

Verbalized Evidence:
V1: The table where column bronze equal to Tatiana Ryabkina are row 2, row 6.
V2: The number of the column bronze equal to Tatiana Ryabkina is 2.
V3: The table where column bronze equal to Lena Eliasson is row 5.
V4: The number of the column bronze equal to Lena Eliasson is 1.
V5: 2 is greater than 1.

Figure 3: An example in the qualitative analysis.

In Fig. 3, the statement requires symbolic manipulation: counting the bronze medals of two players and comparing the two counts. First, program selection produces a semantically consistent program for the statement, correctly capturing the semantic correlation between the phrase "more times than" and the operations greater and count. With the derived program, the verbalization algorithm executes the operations over the table and produces sentences describing useful operation results, for example, "the number of the column bronze equal to Tatiana Ryabkina is 2", "the number of the column bronze equal to Lena Eliasson is 1", and "2 is greater than 1". These sentences are then integrated into the verification model with the graph attention network to perform the final verification.

6 Conclusions

In this paper, we propose a framework centered around programs and execution to provide symbolic manipulations for table fact verification. We propose a verbalization technique together with a graph-based verification network to aggregate and fuse evidences inherently embedded in programs and the original tables for fact verification. Moreover, we design a new training strategy adapting margin loss for the program selection module to derive more accurate programs. The experiments show that the proposed model improves the state-of-the-art performance to a 74.4% accuracy on the benchmark dataset TABFACT. Our studies also reveal the importance of accurate program acquisition for improving the performance of table fact verification. In the future, we will investigate the properties of our proposed method on verifying statements with more complicated operations and explore the explainability of the model.

Acknowledgments

The first, third, and last authors' research is supported by NSERC Discovery Grants and Discovery Accelerator Supplement (DAS) Grants. We thank Zhan Shi, Hui Liu, Jiachen Gu, and Jinpeng Wang for insightful discussions.

References

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1:49–62.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1415–1425, Baltimore, Maryland. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 632–642. The Association for Computational Linguistics.

Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. 2016. A fast unified model for parsing and sentence understanding. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017a. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Diana Inkpen, and Si Wei. 2017b. Neural natural language inference models enhanced with external knowledge. arXiv preprint arXiv:1711.04289.


Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017c. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668, Vancouver, Canada. Association for Computational Linguistics.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017d. Enhanced LSTM for natural language inference. Pages 1657–1668.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017e. Recurrent neural network-based sentence encoder with gated attention for natural language inference. arXiv preprint arXiv:1708.01353.

Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2020. TabFact: A large-scale dataset for table-based fact verification. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.

Ido Dagan, Bill Dolan, Bernardo Magnini, and Dan Roth. 2010. Recognizing textual entailment: Rational, evaluation and approaches. Journal of Natural Language Engineering, 4.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Honghua Dong, Jiayuan Mao, Tian Lin, Chong Wang, Lihong Li, and Denny Zhou. 2019a. Neural logic machines. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019b. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042–13054.

Reza Ghaeini, Sadid A. Hasan, Vivek Datla, Joey Liu, Kathy Lee, Ashequl Qadir, Yuan Ling, Aaditya Prakash, Xiaoli Fern, and Oladimeji Farri. 2018. DR-BiLSTM: Dependent reading bidirectional LSTM for natural language inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1460–1469, New Orleans, Louisiana. Association for Computational Linguistics.

Ben Goodrich, Vinay Rao, Peter J Liu, and Mohammad Saleh. 2019. Assessing the factual accuracy of generated text. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 166–175.

Nitish Gupta, Kevin Lin, Dan Roth, Sameer Singh, and Matt Gardner. 2020. Neural module networks for reasoning over text. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.

Kelvin Guu, Panupong Pasupat, Evan Liu, and Percy Liang. 2017. From language to programs: Bridging reinforcement learning and maximum marginal likelihood. Pages 1051–1062.

Andreas Hanselowski, Hao Zhang, Zile Li, Daniil Sorokin, Benjamin Schiller, Claudia Schulz, and Iryna Gurevych. 2018. UKP-Athene: Multi-sentence textual entailment for claim verification. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 103–108, Brussels, Belgium. Association for Computational Linguistics.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22, Berlin, Germany. Association for Computational Linguistics.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Evaluating the factual consistency of abstractive text summarization. arXiv preprint arXiv:1910.12840.

Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. 2014. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 271–281, Baltimore, Maryland. Association for Computational Linguistics.

Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao. 2017. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23–33, Vancouver, Canada. Association for Computational Linguistics.


Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 3728–3738. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. 2020. Fine-grained fact verification with kernel graph attention network. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7342–7351, Online. Association for Computational Linguistics.

Bill MacCartney and Christopher D Manning. 2008. Modeling semantic containment and exclusion in natural language inference. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 521–528.

Bill MacCartney and Christopher D Manning. 2009. Natural language inference. Citeseer.

Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. 2016. Neural programmer: Inducing latent programs with gradient descent. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

Feng Nie, Yunbo Cao, Jinpeng Wang, Chin-Yew Lin, and Rong Pan. 2018. Mention and entity description co-attention for entity disambiguation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5908–5915. AAAI Press.

Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6859–6866.

Ankur Parikh, Oscar Tackstrom, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255, Austin, Texas. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

A. Radford. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi. 2017. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2931–2937.

Shuming Shi, Yuehui Wang, Chin-Yew Lin, Xiaojiang Liu, and Yong Rui. 2015. Automatically solving number word problems by semantic parsing and reasoning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1132–1142, Lisbon, Portugal.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 809–819. Association for Computational Linguistics.

James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. 2019. The FEVER2.0 shared task. In Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), pages 1–6.

Vaibhav Vaibhav, Raghuram Mandyam Annasamy, and Eduard H. Hovy. 2019. Do sentence interactions matter? Leveraging sentence level representations for fake news classification. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing, TextGraphs@EMNLP 2019, Hong Kong, November 4, 2019, pages 134–139. Association for Computational Linguistics.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks.

Xiaoyu Yang, Xiaodan Zhu, Huasha Zhao, Qiong Zhang, and Yufei Feng. 2019. Enhancing unsupervised pretraining with external knowledge for natural language inference. In Canadian Conference on Artificial Intelligence, pages 413–419. Springer.

Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 440–450, Vancouver, Canada. Association for Computational Linguistics.

Wenpeng Yin and Dan Roth. 2018. TwoWingOS: A two-wing optimization strategy for evidential claim verification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 105–114, Brussels, Belgium. Association for Computational Linguistics.

Takuma Yoneda, Jeff Mitchell, Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. UCL machine reading group: Four factor framework for fact finding (HexaF). In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 97–102.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. arXiv preprint arXiv:1809.08887.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI '05, Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence, Edinburgh, Scotland, July 26-29, 2005, pages 658–666. AUAI Press.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.

Wanjun Zhong, Duyu Tang, Zhangyin Feng, Nan Duan, Ming Zhou, Ming Gong, Linjun Shou, Daxin Jiang, Jiahai Wang, and Jian Yin. 2020. LogicalFactChecker: Leveraging logical operations for fact checking with graph module network. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6053–6065, Online. Association for Computational Linguistics.

Jie Zhou, Xu Han, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2019. GEAR: Graph-based evidence aggregating and reasoning for fact verification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 892–901, Florence, Italy. Association for Computational Linguistics.


A Appendices

A.1 Statistics of the TABFACT Dataset

Table 6 provides the statistics of TABFACT (Chen et al., 2020), the recent large-scale table-based fact verification dataset on which we evaluate our method. Each evidence table comes with 2 to 20 statements and consists of about 14 rows and 5 to 6 columns on average.

A.2 Pre-defined Operations in Program Selection

Programs consist of operations; the definitions of the operations are listed in Table 7, mainly following Chen et al. (2020). A minimal execution sketch is given after the table.

A.3 Pre-defined Templates for Verbalization

Our proposed framework uses 50 pre-defined operations, detailed in Table 7. For each operation, we define templates for the operation itself and for its executed result. There are three types of executed results: (1) string or number type; (2) boolean type; (3) view or row type, where the result is a sub-table or a set of rows extracted from the evidence table.

We define templates for each type of operation accordingly. The detailed templates are listed in Table 8, Table 9, and Table 10, with minimal verbalization sketches after Table 8 and Table 9.

Split   Sentence   Table    Row    Col
Train   92,283     13,182   14.1   5.5
Val     12,792     1,696    14.0   5.4
Test    12,779     1,695    14.2   5.4

Table 6: Statistics of TABFACT and the split of Train/Val/Test. Row and Col denote the average number of rows and columns per table.


Operation | Arguments | Output | Function
count | View | Number | Return the number of rows in the view
with | View, Header String, Cell String/Number | Bool | Return whether the cell string/number exists under the Header column of the given view
without | View, Header String, Cell String/Number | Bool | Return whether the cell string/number does not exist under the Header column of the given view
none | String | Bool | Return whether the string represents None, like "None", "No", "-", "No information provided"
before/after | Row, Row | Row | Return whether row1 is before/after row2
first/second/third/fourth | View, Row | Bool | Return whether the row is in the first/second/third/fourth position of the view
avg/sum/max/min | View, Header String | Number | Return the average/sum/maximum/minimum value under the Header column of the given view
argmin/argmax | View, Header String | Row | Return the row with the minimum/maximum value under the Header column of the given view
hop | Row, Header String | Number/String | Return the cell value under the Header column of the given row
diff/add | Number, Number | Number | Perform arithmetic operations on two numbers
greater/less | Number, Number | Bool | Return whether the first number is greater/less than the second number
equal/unequal | String, String / Number, Number | Bool | Compare two numbers or strings to see whether they are the same
filter_eq/filter_greater/filter_less/filter_greater_or_equal/filter_less_or_equal | View, Header String, Number | View | Return the subview of the given view whose cell values under the Header column are equal/greater/less/... against the given number
all_eq/all_greater/all_less/all_greater_or_equal/all_less_or_equal | View, Header String, Number | Bool | Return whether all of the cell values under the Header column are equal/greater/less/... against the given number
and/or | Bool, Bool | Bool | Return the Boolean operation result of two inputs

Table 7: Details of pre-defined operations.
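To illustrate how these operations compose, the following minimal Python sketch executes a few of the Table 7 operations over a toy table. The list-of-dicts view representation, the sample data, and the function bodies are our own assumptions for illustration, not the released implementation.

```python
# Toy execution of a few pre-defined operations from Table 7.
# A "view" is modeled as a list of row dicts; this representation and
# the sample table are hypothetical, chosen only for illustration.

table = [
    {"season": 2008, "wins": 29},
    {"season": 2009, "wins": 25},
    {"season": 2010, "wins": 18},
]

def count(view):
    # count: View -> Number; the number of rows in the view.
    return len(view)

def filter_eq(view, header, value):
    # filter_eq: View, Header, Value -> View; rows whose cell equals value.
    return [row for row in view if row[header] == value]

def hop(row, header):
    # hop: Row, Header -> Number/String; the cell value under the column.
    return row[header]

def max_col(view, header):
    # max: View, Header -> Number; the maximum value under the column.
    return max(row[header] for row in view)

# Operations nest bottom-up, mirroring the execution of a program tree:
season_2008 = filter_eq(table, "season", 2008)
print(count(season_2008))                                      # 1
print(hop(season_2008[0], "wins") == max_col(table, "wins"))   # True
```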


Operator, [arguments], answer | Operation template | Operation result template
count, [verb_arg1], ans | the number of [verb_arg1] | the number of [verb_arg1] is [string(ans)]
avg/sum/max/min, [verb_arg1, verb_arg2], ans | average/sum/maximum/minimum [verb_arg1] where column [verb_arg2] | average/sum/maximum/minimum [verb_arg1] where column [verb_arg2] is [string(ans)]
add/diff, [verb_arg1, verb_arg2], ans | sum/difference of [verb_arg1] and [verb_arg2] | sum/difference of [verb_arg1] and [verb_arg2] is [string(ans)]
uniq_num/uniq_string, [verb_arg1, verb_arg2], ans | the unique value of [verb_arg1] in column [verb_arg2] | the unique value of [verb_arg1] in column [verb_arg2] is [string(ans)]
most_freq, [verb_arg1, verb_arg2], ans | the most frequent value of [verb_arg1] in column [verb_arg2] | the most frequent value of [verb_arg1] in column [verb_arg2] is [string(ans)]
half/one_third, [verb_arg1], ans | half/one third of value in [verb_arg1] | half/one third of value in [verb_arg1] is [string(ans)]
num_hop, [verb_arg1, verb_arg2], ans | the first value of [verb_arg1] where column [verb_arg2] | the first value of [verb_arg1] where column [verb_arg2] is [string(ans)]

Table 8: Templates for operations with string or number type executed results.
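To make the template mechanics concrete, here is a minimal sketch of how the Table 8 templates could be filled. The two template strings follow the count and avg rows of Table 8; the helper name verbalize_step and the slot-substitution plumbing are hypothetical.

```python
# Minimal template filling for string/number type results (Table 8).
# TEMPLATES maps an operator to its (operation, operation-result)
# template pair; only two rows are shown, and the helper is hypothetical.

TEMPLATES = {
    "count": ("the number of [arg1]",
              "the number of [arg1] is [ans]"),
    "avg":   ("average [arg1] where column [arg2]",
              "average [arg1] where column [arg2] is [ans]"),
}

def verbalize_step(operator, verb_args, ans):
    # Substitute the verbalized arguments and the stringified answer
    # into both templates for one executed operation.
    op_tpl, res_tpl = TEMPLATES[operator]
    for i, arg in enumerate(verb_args, start=1):
        op_tpl = op_tpl.replace(f"[arg{i}]", arg)
        res_tpl = res_tpl.replace(f"[arg{i}]", arg)
    return op_tpl, res_tpl.replace("[ans]", str(ans))

op_sent, res_sent = verbalize_step("count", ["rows where wins is 29"], 1)
print(res_sent)  # the number of rows where wins is 29 is 1
```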

Operator, [arguments], answer | Operation template | Operation result template
only, [verb_arg1], ans:true/false | number of rows in [verb_arg1] | number of rows in [verb_arg1] [true: is / false: is not] one
several, [verb_arg1], ans:true/false | number of rows in [verb_arg1] | number of rows in [verb_arg1] [true: is / false: is not] more than one
zero/none, [verb_arg1], ans:true/false | the [verb_arg1] | the [verb_arg1] [true: is / false: is not] zero/none
first/second/third/fourth, [verb_arg1, verb_arg2], ans:true/false | the first/second/third/fourth row [verb_arg2] in [verb_arg1] | the first/second/third/fourth row in [verb_arg1] [true: is / false: is not] row [verb_arg2]
and/or, [verb_arg1, verb_arg2], ans:true/false | [verb_arg1] and/or [verb_arg2] | [verb_arg1] and/or [verb_arg2] [true: is / false: is not] true
greater/less, [verb_arg1, verb_arg2], ans:true/false | [verb_arg1] greater/less than [verb_arg2] | [verb_arg1] [true: is / false: is not] greater/less than [verb_arg2]
equal, [verb_arg1, verb_arg2], ans:true/false | [verb_arg1] equal to [verb_arg2] | [verb_arg1] [true: is / false: is not] equal to [verb_arg2]
unequal, [verb_arg1, verb_arg2], ans:true/false | [verb_arg1] not equal to [verb_arg2] | [verb_arg1] [true: is / false: is not] not equal to [verb_arg2]
with/without, [verb_arg1, verb_arg2, verb_arg3], ans:true/false | [verb_arg1] where column [verb_arg2] with/without value [verb_arg3] | [verb_arg1] where column [verb_arg2] with/without value [verb_arg3] [true: is / false: is not] true
all_equal, [verb_arg1, verb_arg2, verb_arg3], ans:true/false | [verb_arg1] where column [verb_arg2] all equal to value [verb_arg3] | [verb_arg1] where column [verb_arg2] all equal to value [verb_arg3] [true: is / false: is not] true
all_less/all_greater, [verb_arg1, verb_arg2, verb_arg3], ans:true/false | [verb_arg1] where column [verb_arg2] all less/greater than value [verb_arg3] | [verb_arg1] where column [verb_arg2] all less/greater than value [verb_arg3] [true: is / false: is not] true
all_less_or_equal/all_greater_or_equal, [verb_arg1, verb_arg2, verb_arg3], ans:true/false | [verb_arg1] where column [verb_arg2] all less/greater than or equal to value [verb_arg3] | [verb_arg1] where column [verb_arg2] all less/greater than or equal to value [verb_arg3] [true: is / false: is not] true

Table 9: Templates for operations with boolean type executed results.
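The boolean-type templates differ from Table 8 only in that the executed answer selects the copula rather than filling a value slot; a minimal sketch follows, with a hypothetical helper and template strings taken from two rows of Table 9. The view/row templates in Table 10 below work analogously, filling [indices of ans] with the matched row numbers.

```python
# Minimal template filling for boolean type results (Table 9): the
# executed answer (True/False) only flips "is" vs. "is not" in the
# result sentence. Helper name and the two sampled rows are hypothetical.

BOOL_TEMPLATES = {
    "greater": ("[arg1] greater than [arg2]",
                "[arg1] [cop] greater than [arg2]"),
    "equal":   ("[arg1] equal to [arg2]",
                "[arg1] [cop] equal to [arg2]"),
}

def verbalize_bool(operator, verb_args, ans):
    op_tpl, res_tpl = BOOL_TEMPLATES[operator]
    for i, arg in enumerate(verb_args, start=1):
        op_tpl = op_tpl.replace(f"[arg{i}]", arg)
        res_tpl = res_tpl.replace(f"[arg{i}]", arg)
    # True -> "is", False -> "is not", per the [true:is/false:is not] switch.
    return op_tpl, res_tpl.replace("[cop]", "is" if ans else "is not")

_, res = verbalize_bool("greater", ["29", "25"], True)
print(res)  # 29 is greater than 25
```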


Operator, [arguments], answer | Operation template | Operation result template
before/after, [verb_arg1, verb_arg2], ans | [verb_arg1] before/after [verb_arg2] | [verb_arg1] before/after [verb_arg2] is row [indices of ans]
argmax/argmin, [verb_arg1, verb_arg2], ans | row where column [verb_arg2] with maximum/minimum value in [verb_arg1] | row where column [verb_arg2] with maximum/minimum value in [verb_arg1] is row [indices of ans]
filter_eq, [verb_arg1, verb_arg2, verb_arg3], ans | [verb_arg1] where column [verb_arg2] equal to value [verb_arg3] | [verb_arg1] where column [verb_arg2] equal to value [verb_arg3] is row [indices of ans]
filter_less/filter_greater, [verb_arg1, verb_arg2, verb_arg3], ans | [verb_arg1] where column [verb_arg2] less/greater than value [verb_arg3] | [verb_arg1] where column [verb_arg2] less/greater than value [verb_arg3] is row [indices of ans]
filter_less_or_equal/filter_greater_or_equal, [verb_arg1, verb_arg2, verb_arg3], ans | [verb_arg1] where column [verb_arg2] less/greater than or equal to value [verb_arg3] | [verb_arg1] where column [verb_arg2] less/greater than or equal to value [verb_arg3] is row [indices of ans]

Table 10: Templates for operations with view or row type executed results.