Computing
https://doi.org/10.1007/s00607-019-00703-w

Indexing temporal RDF graph

Li Yan1 · Ping Zhao1 · Zongmin Ma1

Received: 26 May 2018 / Accepted: 12 January 2019
© Springer-Verlag GmbH Austria, part of Springer Nature 2019

Abstract Time data can be found in various real-world applications, and different models have been proposed to model temporal information. With the wide utilization of the Web and the availability of massive Web resources, the temporal Resource Description Framework (RDF) model has attracted more and more attention. In this paper, we propose an index approach for temporal RDF graphs to effectively query massive temporal RDF data. We build the prefix path index for querying subjects of temporal RDF triples and the suffix index for querying objects of temporal RDF triples, respectively. Meanwhile, we use frequent elements to improve the efficiency of the proposed index. We also adopt a B-tree index to manage all elements of triples. Our index approach supports inserting and deleting temporal triples in temporal RDF graphs. Experimental results show that our index approach can efficiently process queries in temporal RDF graphs.

Keywords RDF · Temporal RDF graph · Suffix path · Prefix path · B-tree · Frequent elements

1 Introduction

Resource Description Framework (RDF) is a framework for representing various information about resources on the Web, which provides a data model in a flexible, schema-free and graph-structured form [1]. Currently, the RDF data model plays an increasingly important role in Web data management. Many applications have gradually started to use the RDF data model to represent and process semantic data and, as a result, some RDF data management systems have been developed. Let us look at some examples. Bio2RDF [3] uses the Semantic Web to build the largest network of Linked Data for the Life Sciences. Lexvo [4] brings the entity information about languages, words, characters, and so on to the Semantic Web. LinkedMDB [5] publishes the first open semantic database for movies. GovTrack [6] publishes the status of federal legislation information about representatives and senators in Congress using the RDF data model. With the wide application of the RDF data model, RDF data management becomes very important [29], where RDF data querying is one of the core tasks [30]. Given an RDF data model that contains massive RDF data, it is necessary to index the RDF data. Efficient querying of RDF data is supported by their indexing structures [2].

Zongmin Ma [email protected]

1 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China

http://orcid.org/0000-0001-7780-6473

Things in the real world may change as time goes on, so it is important to capture the entire histories of things. Temporal information can be found in many application areas such as biomedicine [7], machine learning [8], communication [9] and so on. To represent and manage temporal information, temporal databases have been proposed and several temporal database systems (e.g., TDBMS [12]) have been developed. Recently, XML (eXtensible Markup Language) has been applied to represent and exchange temporal information on the Web, and temporal XML models have been proposed (e.g., [11, 13, 14]). In the context of knowledge management, temporal knowledge is represented and reasoned about with temporal description logics (e.g., [15]) and temporal OWL ontologies of the Semantic Web (e.g., [10]).

A major feature of the Web is its dynamics. The present RDF cannot directly represent Web resources that change dynamically over time. To capture and represent temporal information with RDF, a temporal RDF model is needed. The temporal RDF model was first proposed in [16, 17], in which the syntax and semantics of temporal RDF are provided. After that, several different temporal RDF models have been proposed [18, 19]. Note that, for this extended RDF model, only a few issues of reasoning [19] and querying [18] have been investigated. Given that RDF data in real-world applications are generally large-scale and even massive, efficient querying of massive RDF data is essential [29], and an RDF data index can speed up the evaluation of RDF data queries. We argue that many RDF data are temporally relevant, and in order to efficiently query and use temporal RDF data, it is crucial to build an indexing structure for temporal RDF data. A temporal RDF index can play an important role in querying massive temporal RDF data. Unfortunately, to the best of our knowledge, there is no report on temporal RDF indexing, although more attention has been paid to indexing classical RDF data. The present work tries to fill this gap.

In this paper, we view a temporal RDF model as a temporal graph instead of a set of temporal triples. Based on the temporal RDF graph, we propose a temporal RDF index. For this purpose, we store the temporal RDF graph in the form of an adjacency table and adopt breadth-first traversal to find the paths of the graph. Then we obtain the prefix paths and suffix paths in the graph and establish the prefix path index and suffix path index, respectively. Finally, we build the corresponding B-tree indexes, aiming at accelerating the search for elements in the temporal RDF graph.

The main contributions of this paper are summarized as follows:

1. We propose an index schema for temporal RDF datasets based on graphs.
2. We develop the corresponding algorithms for generating prefix paths and suffix paths and for building the B-tree indexes.
3. To improve the efficiency of the proposed index, we adopt the method of frequent elements to reduce the number of elements included in the index.
4. We evaluate the performance by comparing our index approach with other index approaches over temporal RDF datasets.

The rest of this paper is organized as follows. Section 2 presents a brief overview of related work on temporal RDF models and on indexes based on graphs. Section 3 proposes our temporal RDF index scheme and the index algorithms. Section 4 presents the experimental evaluation, comparing our index approach with other index approaches. Section 5 concludes the paper and sketches our future work.

2 Related work

The present work is closely related to three issues: indexes based on RDF triples, indexes based on graphs, and temporal RDF modeling.

2.1 Temporal RDF model

Temporal data has been extensively investigated in the context of databases (e.g., [12]). Recently, temporal information has been applied in XML (e.g., [11]) and OWL (e.g., [10]). In addition, temporal RDF data models have been proposed more recently. The basic idea of temporal RDF is to annotate triples and/or triple elements (i.e., subject (s), property (p) and object (o)) with temporal labels. Normally, there are two methods to represent temporal information. First, temporal information is represented in the form of a time interval [ts, te], where ts is the start time of the event and te is the end time of the event. Second, temporal information is represented in the form of {t1, t2, …, tn}, where t1, t2, …, tn are time points. These two forms are used to represent the valid time of events.

A temporal RDF model is proposed in [16, 17], which contains temporal RDF triples. A temporal triple has the form (s, p, o):[ts, te], where (s, p, o) is an RDF triple and [ts, te] is a time interval. Here, (s, p, o):[ts, te] denotes the fact that the triple (s, p, o) is valid during the time interval [ts, te]. An example of a temporal RDF triple is (USA, AsPresident, Ronald Wilson Reagan):[1981, 1989]. Different from the temporal RDF model in [16, 17], the concept of a normalized tRDF database is proposed in [19]. A temporal RDF database consists of a set of temporally annotated RDF triples of the form (subject, property:annotation, object), in which three forms of temporal RDF triples are applied:

1. (s, p:{T}, o), where p:{T} means that a tRDF triple is valid at the time points in T;
2. (s, p:<n:T>, o), where p:<n:T> means that a tRDF triple is valid at least at n distinct time points in T;
3. (s, p:[n:T], o), where p:[n:T] means that a tRDF triple is valid at most at n distinct time points in T.

A temporal RDF graph can be defined as a set of temporal RDF triples. A tRDF model can be regarded as a temporal RDF graph, whose nodes represent subjects and objects and whose arcs represent property:annotation pairs. The arcs in the temporal RDF graph, which are associated with annotations, are different from the arcs in the classical RDF graph.

2.2 Indexes based on graph

To query data efficiently, many index structures have been proposed for different data models. Here we concentrate on indexes based on graphs, in which the tree model can be regarded as a special graph model. We first discuss temporal XML indexes and then classical RDF indexes.

In [21], an XML tree is transformed into a summary graph and then an index named TMIX is proposed. TMIX utilizes a summary graph to preserve the hierarchical relationships between nodes of a temporal XML document and uses a matrix to store the summary graph, in which a temporal equivalence class table is used to store the edges of summary graphs. Finally, a B+ tree index is built on top of the matrix entries to facilitate fast probing of temporal elements. On the basis of TMIX, an index named TOIX is proposed in [22]. To simplify the representation, distinct labels (tag names of nodes) are each encoded by a unique code in TOIX. The index defines the following index structures to preserve the structural relationships between objects: Temporal Object Index, Path Object Index, Index for Path Class with Subtrees and STPC Bitmap Index. TOIX relies on mapping the twigs into temporal objects and path objects, and then uses a matrix to store path classes and sub-tree path classes. Finally, a B-tree index is established.

A temporal XML index TXSIM based on the suffix tree is proposed in [20], which uses the Ukkonen algorithm and traverses the semi-structured data model (OEM) only once. The construction of TXSIM needs to convert XML to an OEM tree and then encode the OEM tree nodes to generate a temporal code table. The OEM tree is a type of self-describing object model, which is designed especially for semi-structured data. Finally, the suffix-tree-based index is constructed following the idea of the Ukkonen algorithm. TXSIM mainly contains two parts: a suffix indexing tree and a temporal code table. An index structure ViST (Virtual Suffix Tree) is proposed in [31], which uses a dynamic labeling method to assign labels to suffix tree nodes. The construction of the index is divided into three stages. The naive algorithm, based entirely on suffix trees, requires traversal of a large portion of the tree structure for non-contiguous subsequence matching. Then the index RIST (Relationships Indexed Suffix Tree) is presented, which improves the naive algorithm by using B+ trees to index suffix tree nodes. Finally, the index ViST is presented, which is an index structure having the same functionality but relying exclusively on B+ trees.

In the context of RDF, a framework for designing and using RDF structural indexes based on formal structural characterizations is proposed in [23]. The structural index is an edge-labeled graph (V, E), where the node set V is a partition of the RDF dataset, and the edges in the edge set E are labeled by the equality types between triples in nodes. The structural index contains equality types and the edges between equality types. It uses pure acyclic basic graph patterns for pruning in order to accelerate searching. In [24], a height- and label-parameterized structure index is proposed for RDF. The index adopts the idea of structure-oriented partitioning (SP), which applies the grouping of elements captured by the structure index to the physical organization of data. To capture the structure of an RDF dataset, the vertices of the graph are used to represent groups of data elements that are similar in structure. In this approach, the parameterized structure index can control its size to fit in main memory. In [2], the RG-index is proposed to index graph patterns in the RDF graph, in which vertex lists are provided to match specific graph patterns. To accelerate the indexing of graph patterns, the gSpan algorithm is applied to mine frequent graph patterns. To improve the efficiency of the index, the authors avoid redundant patterns and cache intermediate results.

In [32], an indexing schema for RDF and RDF Schema is proposed. In this approach, four kinds of DAGs (Directed Acyclic Graphs) are extracted from the dataset and then all path expressions are extracted from the DAGs. Finally, a suffix array is used to store all path expressions and the index is built on it. In [33], an RDF index scheme is proposed, which focuses on path queries over acyclic RDF data. To take advantage of the semantic information provided by RDF, the original RDF graph is divided into several sub-graphs and each sub-graph is indexed separately. A sub-graph mechanism is used to manage RDF data and an extended structure based on the suffix array is used to support path expressions on RDF DAGs, which contain no cycle.

3 Temporal RDF index

Let U be a set of URI references, L a set of literals and P a set of properties. An RDF triple is a triple (s, p, o) in U × P × (U ∪ L). Here, s, p, and o are the subject, property (predicate) and object of a triple, respectively. A finite set of RDF triples forms an RDF graph, where each triple (s, p, o) describes a directed edge labeled with p from the vertex labeled with s to the vertex labeled with o.

3.1 Temporal RDF graph

Two types of time information are usually identified: valid time and transaction time. Valid time is the time interval during which an event occurs in real life. Transaction time refers to the specific time point at which the fact was stored [25]. In this paper, we consider valid times as the temporal information of RDF triples. A collection of temporal RDF triples forms a temporal RDF graph.

Definition 1 A temporal RDF triple is an RDF triple with a time interval, which is denoted (s, p:[ts, te], o). Here, s, p and o represent the subject, predicate and object, respectively, and [ts, te] represents a time interval starting at ts and ending at te.
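Definition 1 can be mirrored directly in code. The following is a minimal sketch rather than the authors' implementation: the class name `TemporalTriple`, its field names and the helper `valid_at` are hypothetical, and treating [ts, te] as a closed interval over integer time stamps is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalTriple:
    """A temporal RDF triple (s, p:[ts, te], o) in the sense of Definition 1."""
    subject: str
    predicate: str
    obj: str
    ts: int  # start of the valid-time interval
    te: int  # end of the valid-time interval

    def __post_init__(self):
        # Assumption: an interval must not end before it starts.
        if self.ts > self.te:
            raise ValueError("ts must not exceed te")

    def label(self) -> str:
        # Edge label in the p:[ts,te] notation used by the paper.
        return f"{self.predicate}:[{self.ts},{self.te}]"

    def valid_at(self, t: int) -> bool:
        # Assumption: the interval is closed on both ends.
        return self.ts <= t <= self.te


triple = TemporalTriple("mark", "study", "Maths", 2, 5)
print(triple.label())
print(triple.valid_at(3), triple.valid_at(6))
```

Keeping the interval as two separate fields, rather than as part of the predicate string, makes interval checks cheap; the string form is only produced when an edge label is needed.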

Definition 2 A temporal RDF graph is a labeled directed graph. Formally, a temporal RDF graph is represented as G = (V, L, E), in which

1. V is a finite set of vertices. An element v of V is the s or o of a temporal RDF triple.
2. L is a finite set of edge labels. An element l of L has the form p:[ts, te].
3. E is a finite set of edges. An element e of E has the form e(v1, v2), where v1, v2 ∈ V.

A temporal RDF graph contains two parts: vertices and edges. The vertices are the elements of V and the edges are the elements of E. An edge starts and ends at vertices. Figure 1 presents an example of a temporal RDF graph with some temporal RDF triples. In the temporal RDF graph of Fig. 1, for example, (joe, teach:[2, 8], Chinese) is a temporal triple. Here the triple (joe, teach:[2, 8], Chinese) is associated with the time interval [2, 8].

[Figure: a directed graph over the vertices school, teacher, student, jane, joe, mark, lucy, Maths and Chinese, with edges labeled member:[0,20], name:[0,20], friend:[3,20], teach:[4,9], teach:[2,8], study:[2,5] and study:[3,5]]

Fig. 1 An example of temporal RDF graph
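As a rough illustration of Definition 2, the sets V, L and E can be derived from a list of temporal triples. This is a sketch under stated assumptions: the function name `build_graph` is hypothetical, each edge is stored together with its label (Definition 2 writes an edge only as e(v1, v2)), and the sample triples are a hand-picked subset of Fig. 1 read off the paths that the paper lists for that graph.

```python
def build_graph(triples):
    """Return (V, L, E) for triples of the form (s, "p:[ts,te]", o)."""
    vertices = set()
    labels = set()
    edges = []
    for s, label, o in triples:
        vertices.update((s, o))      # subjects and objects become vertices
        labels.add(label)            # p:[ts,te] becomes an edge label
        edges.append((s, label, o))  # assumption: keep the label with the edge
    return vertices, labels, edges


# Student-side subset of the graph in Fig. 1 (assumed from the listed paths).
TRIPLES = [
    ("school", "member:[0,20]", "student"),
    ("student", "name:[0,20]", "mark"),
    ("student", "name:[0,20]", "lucy"),
    ("mark", "study:[2,5]", "Maths"),
    ("mark", "friend:[3,20]", "lucy"),
    ("lucy", "study:[3,5]", "Chinese"),
    ("lucy", "friend:[3,20]", "mark"),
]

V, L, E = build_graph(TRIPLES)
print(sorted(V))
print(sorted(L))
```

Note that two different edges may share one label, for example name:[0,20], so L is generally smaller than E.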

To index the temporal RDF graph, we first introduce several definitions for the temporal RDF graph.

Definition 3 A path between two vertices is a sequence of vertices and labels alternating with each other from one vertex to another vertex, that is, $V_0 \cdot L_0 \cdot V_1 \cdot L_1 \cdot V_2 \cdots V_{n-1} \cdot L_{n-1} \cdot V_n$. Here $V_i \cdot L_i \cdot V_{i+1}$ means that there is an edge between the vertices $V_i$ and $V_{i+1}$, which has the label $L_i$.

It is possible in a temporal RDF graph that the first and last vertices of a path are the same. In this case, we have a special type of path called a circle (i.e., a cycle).

Definition 4 If there exists a path of the form $V_0 \cdot L_0 \cdot V_1 \cdot L_1 \cdot V_2 \cdots V_0$, this path is called a circle.

Definition 5 Let X be a vertex or a label of the path $V_0 \cdot L_0 \cdot V_1 \cdot L_1 \cdot V_2 \cdots V_n$. Then we call $V_0 \cdot L_0 \cdot V_1 \cdot L_1 \cdot V_2 \cdots X$ the prefix path of X.

Definition 6 Let X be a vertex or a label of the path $V_0 \cdot L_0 \cdot V_1 \cdot L_1 \cdot V_2 \cdots V_n$. We call $X \cdots L_{n-1} \cdot V_n$ the suffix path of X.


Fig. 2 The adjacency table of the graph shown in Fig. 1

Example 1 Let us look at the temporal RDF graph in Fig. 1. A path between the two vertices school and mark is represented by school.member:[0,20].student.name:[0,20].mark. In addition, mark.friend:[3,20].lucy.friend:[3,20].mark is a circle. For the path school.member:[0,20].student.name:[0,20].mark, the prefix path of student is school.member:[0,20].student and the suffix path of student is student.name:[0,20].mark.
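The notions of prefix path, suffix path and circle in Definitions 4 to 6 reduce to simple list slicing once a path is stored as an alternating list of vertices and labels. The sketch below assumes exactly that representation; the function names are hypothetical, and when X occurs more than once in a path the first occurrence is used, which is a simplification.

```python
def prefix_path(path, x):
    """Prefix path of x: everything from the start of the path up to x."""
    i = path.index(x)  # first occurrence of x (assumption)
    return path[: i + 1]


def suffix_path(path, x):
    """Suffix path of x: everything from x to the end of the path."""
    i = path.index(x)
    return path[i:]


def is_circle(path):
    """A circle starts and ends at the same vertex (Definition 4)."""
    return len(path) > 1 and path[0] == path[-1]


# The path and the circle from Example 1.
path = ["school", "member:[0,20]", "student", "name:[0,20]", "mark"]
circle = ["mark", "friend:[3,20]", "lucy", "friend:[3,20]", "mark"]

print(".".join(prefix_path(path, "student")))
print(".".join(suffix_path(path, "student")))
print(is_circle(circle), is_circle(path))
```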

For a temporal RDF graph, we need to find all of its paths. We adopt an adjacency table to store the temporal RDF graph. Based on the adjacency table, we can find the paths with a breadth-first algorithm. The adjacency table is defined as follows.

Definition 7 An adjacency table is defined as T = (HN, EN), where HN is a finite set of head nodes and EN is a finite set of edge nodes that represent the edges starting from the head nodes. Each head node has a value that is an element of V and points to an edge node firstEdge, that is, HN = {value, firstEdge}. Each edge node has a value adjvex, which is an element of V and is the ending vertex of an edge starting from the value of the head node; weight, which is the label of that edge; and nextEdge, which is the next edge starting from the same head node, that is, EN = {adjvex, weight, nextEdge}.

Example 2 Figure 2 shows an adjacency table that is created from Fig. 1.

As shown in Fig. 2, the first node of each record is the head node and the following nodes are the edge nodes, which represent the edges starting from the vertex stored in the head node together with their ending vertices. The head nodes of all records form the set of head nodes HN and the edge nodes of all records form the set of edge nodes EN.

To index the temporal RDF graph, we store the temporal RDF dataset in the adjacency table. First, we generate all head nodes, which point to edge nodes. Then, we generate all edge nodes for the temporal RDF graph. After generating the adjacency table of the temporal RDF graph, we generate all the paths of the temporal RDF graph based on the adjacency table. We propose Algorithm 1 to build the adjacency table for a temporal RDF graph.

Algorithm 1. Adjacency table
Input: temporal RDF graph
Output: adjacency table of the temporal RDF graph

1. for (vex : vexs) // vex: an element of the vertex set V; vexs: all elements of V
2.   create a new headnode; // headnode: a head node of the adjacency table
3.   headnode.value = vex;
4.   headnode.firstEdge = null; // firstEdge: the first edge node of a head node, i.e., its first adjacent node
5.   Headnodearray.add(headnode); // Headnodearray: the array containing all head nodes
6. for (edge : edges) // edge: an element of the edge set E; edges: all elements of E
7.   start = getPosition(svex); // start: a position in Headnodearray; svex: the start vertex of edge
8.   create a new edgenode; // edgenode: an edge node of the adjacency table
9.   edgenode.adjvex = evex; // evex: the end vertex of edge
10.  edgenode.weight = edgelabel; // edgelabel: the label of edge
11.  if (Headnodearray[start].firstEdge == null)
12.    Headnodearray[start].firstEdge = edgenode;
13.  else
14.    linkLast(Headnodearray[start].firstEdge, edgenode);
15. getPosition(svex) {
16.   for (0 <= i < Headnodearray.length)
17.     if (svex == Headnodearray[i].value)
18.       return i;
19.   return -1;
20. }
21. linkLast(first, edgenode) { // linkLast: append the new edge node to the list that starts at first
22.   while (first.nextEdge != null) // nextEdge: the next edge node in the list
23.     first = first.nextEdge; // advance a local cursor only; the head node keeps its firstEdge
24.   first.nextEdge = edgenode;
25. }

In Algorithm 1, we construct the adjacency table of the temporal RDF graph to prepare for the later generation of paths. We first create all head nodes for the temporal RDF graph (lines 1–5). The value of each head node is a vertex of the temporal RDF graph, and firstEdge of the head node is set to null. Then, we create all edge nodes of the adjacency table (lines 6–14). For each edge, we need to find the position of the head node whose value is svex in the head node array. This is done by calling the function getPosition (lines 15–20), in which the array of head nodes is traversed until the specific head node is found (lines 16–18). We then create a new edge node and set its adjvex and weight (lines 8–10). Here we need to check whether firstEdge of the head node is null. If firstEdge is null, we set the edge node as firstEdge (line 12). Otherwise, the function linkLast is called to append the edge node to the list (line 14). The function linkLast follows nextEdge from edge node to edge node until nextEdge is null and finally sets the new edge node as nextEdge (lines 21–25).
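Algorithm 1 translates fairly directly into runnable code. The sketch below is one possible Python rendering, not the authors' implementation: the names `HeadNode`, `EdgeNode`, `build_adjacency_table` and `out_edges` are hypothetical, a dictionary lookup stands in for the linear scan of getPosition, and the sample edges are a hand-picked subset of Fig. 1 read off the paths the paper lists for that graph.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EdgeNode:
    adjvex: str                            # ending vertex of the edge
    weight: str                            # edge label p:[ts,te]
    next_edge: Optional["EdgeNode"] = None


@dataclass
class HeadNode:
    value: str                             # a vertex of the graph
    first_edge: Optional[EdgeNode] = None


def build_adjacency_table(vertices, edges):
    heads = [HeadNode(v) for v in vertices]
    # Assumption: a dict replaces the linear search done by getPosition.
    position = {head.value: i for i, head in enumerate(heads)}
    for s, label, o in edges:
        node = EdgeNode(adjvex=o, weight=label)
        head = heads[position[s]]
        if head.first_edge is None:
            head.first_edge = node
        else:
            # linkLast: walk with a local cursor and append at the end.
            cursor = head.first_edge
            while cursor.next_edge is not None:
                cursor = cursor.next_edge
            cursor.next_edge = node
    return heads


def out_edges(head):
    """Collect (label, target) pairs by following the next_edge links."""
    result = []
    cursor = head.first_edge
    while cursor is not None:
        result.append((cursor.weight, cursor.adjvex))
        cursor = cursor.next_edge
    return result


VERTICES = ["school", "student", "mark", "lucy", "Maths", "Chinese"]
EDGES = [
    ("school", "member:[0,20]", "student"),
    ("student", "name:[0,20]", "mark"),
    ("student", "name:[0,20]", "lucy"),
    ("mark", "study:[2,5]", "Maths"),
    ("mark", "friend:[3,20]", "lucy"),
    ("lucy", "study:[3,5]", "Chinese"),
    ("lucy", "friend:[3,20]", "mark"),
]

table = {head.value: out_edges(head)
         for head in build_adjacency_table(VERTICES, EDGES)}
print(table["student"])
```

With the linear getPosition, building the table costs time proportional to |V| per edge; the dictionary brings each lookup down to expected constant time, which matters only for large graphs.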

3.2 Path index

Our index approach is to find all prefix paths and suffix paths. For this purpose, we first find all paths of the graph and then derive all prefix paths and suffix paths to build the prefix path index and suffix path index, respectively. We find all paths based on the adjacency table generated in the previous section.

3.2.1 Finding all paths

For the adjacency table of a given temporal RDF graph, we adopt a breadth-first traversal method to obtain all its paths. We first find the starting nodes and then all paths, in preparation for finding the prefix paths and suffix paths of all paths. The starting nodes are the vertices with an indegree of zero. We define a set named Zero for the starting nodes. So, the first step is to obtain the set Zero. Then, we take each element v of the set Zero and find all paths starting from v. Formally, we define a starting node as follows.

Definition 8 A starting node is a node that has an indegree of zero. The set of starting nodes is formally represented as {n | n ∈ V and indegree(n) = 0}.

Example 3 Let us look at the temporal RDF graph in Fig. 1. In Fig. 1, only the node school is a starting node, that is, a node with an indegree of zero. Let Zero be the set of starting nodes. Then Zero = {school} in Fig. 1.
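The set Zero of Definition 8 can be computed in one pass over the edges. The following sketch assumes edges are given as (source, label, target) tuples; the function name `starting_nodes` is hypothetical and the edge list is again a hand-picked subset of Fig. 1.

```python
def starting_nodes(edges):
    """Return all vertices whose indegree is zero, in first-seen order."""
    indegree = {}
    for s, _label, o in edges:
        indegree.setdefault(s, 0)            # a source may have no incoming edge
        indegree[o] = indegree.get(o, 0) + 1
    return [v for v, d in indegree.items() if d == 0]


EDGES = [
    ("school", "member:[0,20]", "student"),
    ("student", "name:[0,20]", "mark"),
    ("student", "name:[0,20]", "lucy"),
    ("mark", "study:[2,5]", "Maths"),
    ("mark", "friend:[3,20]", "lucy"),
    ("lucy", "study:[3,5]", "Chinese"),
    ("lucy", "friend:[3,20]", "mark"),
]

zero = starting_nodes(EDGES)
print(zero)
```

One consequence of the definition is worth keeping in mind: a graph in which every vertex has an incoming edge, for example a single circle, has an empty Zero set, so a traversal seeded only from Zero would produce no paths for it.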

Example 4 Let us look at the temporal RDF graph in Fig. 1 again. In Fig. 1, the collection of all paths starting from school is listed as follows.

{school.member:[0,20].teacher.name:[0,20].jane.teach:[2,8].Maths,
school.member:[0,20].teacher.name:[0,20].joe.teach:[4,9].Chinese,
school.member:[0,20].student.name:[0,20].mark.study:[2,5].Maths,
school.member:[0,20].student.name:[0,20].mark.friend:[3,20].lucy.study:[3,5].Chinese,
school.member:[0,20].student.name:[0,20].mark.friend:[3,20].lucy.friend:[3,20].mark,
school.member:[0,20].student.name:[0,20].lucy.study:[3,5].Chinese,
school.member:[0,20].student.name:[0,20].lucy.friend:[3,20].mark.study:[2,5].Maths,
school.member:[0,20].student.name:[0,20].lucy.friend:[3,20].mark.friend:[3,20].lucy}

To find all paths of a given temporal RDF graph, we propose Algorithm 2 below, which adopts a breadth-first approach to find all paths in the adjacency table of the given temporal RDF graph.

Algorithm 2. Finding all paths
Input: adjacency table of temporal RDF graph, a set of the nodes whose indegree is zero
Output: the set pathset contains all paths of the temporal RDF graph

1. for (zvex : zvexs) // zvex: a vertex whose indegree is zero; zvexs: all vertices whose indegree is zero
2.   for (headnode : headnodearray) // headnode: a head node
3.     if (zvex == headnode.value and headnode.firstEdge != null)
4.       create a new path p;
5.       edge = headnode.firstEdge;
6.       add headnode.value, edge.weight, edge.adjvex in path p;
7.       pathset.add(p);
8.       breadth(edge.nextEdge, pathset, p)
9. for (p : pathset)
10.  if the elements in the path p have duplication
11.    add '-1' to the path p
12.  lastelement = getLastElement(p)
13.  if (lastelement != '-1')
14.    for (headnode : headnodearray)
15.      if (headnode.value == lastelement and headnode.firstEdge != null)
16.        pathset.remove(p)
17.        edge = headnode.firstEdge
18.        add edge.weight, edge.adjvex in path p;
19.        pathset.add(p);
20.        breadth(edge.nextEdge, pathset, p)
21. breadth(edge, pathset, p)
22.   while (edge != null)
23.     create a new path pl;
24.     pl = p
25.     add edge.weight and edge.adjvex in path pl;
26.     pathset.add(pl);
27.     edge = edge.nextEdge;

In Algorithm 2, we find all the paths of the temporal RDF graph in the adjacency table and collect them in the path set pathset. Since we use breadth-first traversal, we first need the starting vertices of all paths so that all paths can be generated together (lines 1–8). We take each element zvex of the set zvexs in turn and find all edges leaving zvex by traversing the head node set. If the value of a head node is equal to zvex and firstEdge of the head node is not null, we create a new path p, which forms the initial part of a path. It contains the value of the head node and the weight and adjvex of firstEdge of the head node. For simplicity, we denote firstEdge of the head node as edge, and we add the path to the path set so that it can be extended later (lines 4–7). At this point we have the first edge that starts from the starting vertex, and we still need all other edges that start from it. They are found by the function breadth (lines 21–27). The function breadth processes edge nodes until the edge node edge is null. While edge is not null, we create a new path pl whose value is equal to the path p, add the weight and adjvex of the edge node to pl, add pl to the set pathset, and set edge to the nextEdge of the edge node. Next, we extend each path in the set pathset to generate all paths of the temporal RDF graph (lines 9–20). For each path p in pathset, we check whether p contains a circle (lines 10–11). If the elements of the path p contain a duplicate, we conclude that p contains a circle and add the element '-1' to p as a marker. We then get the last element lastelement of the path p (line 12). If lastelement is '-1', we no longer extend the path (line 13), because continuing to extend p could send the program into an infinite loop. Otherwise, we extend the path from lastelement (lines 14–20). We traverse the head node array headnodearray to find the head node headnode whose value is equal to lastelement. If such a head node exists and its firstEdge is not null, we remove the path p from pathset, which removes paths that are not fully expanded from the path set. For simplicity, we denote firstEdge of the head node as edge and extend p by adding its weight and adjvex (lines 17–18). We add the extended path p to pathset (line 19). Finally, the function breadth finds all other paths that extend the path p (line 20).
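The breadth-first expansion described above can be sketched compactly with a queue of partial paths. This is an illustrative variant rather than a line-by-line port of Algorithm 2, and it rests on several assumptions: the function name `all_paths` is hypothetical; a dictionary of lists replaces the linked adjacency table; sibling edges branch from a copy of the shared prefix; a circle is detected by a repeated vertex only (labels are not compared); and finished paths are collected separately instead of being marked with '-1'. The edge list is reconstructed from the paths given in Example 4.

```python
from collections import deque


def all_paths(edges):
    adjacency = {}
    indegree = {}
    for s, label, o in edges:
        adjacency.setdefault(s, []).append((label, o))
        adjacency.setdefault(o, [])
        indegree.setdefault(s, 0)
        indegree[o] = indegree.get(o, 0) + 1

    # Seed the queue with the starting nodes (indegree zero).
    queue = deque([v] for v, d in indegree.items() if d == 0)
    finished = []
    while queue:
        path = queue.popleft()
        vertices = path[0::2]  # vertices sit at the even positions
        last = path[-1]
        closes_circle = last in vertices[:-1]
        if closes_circle or not adjacency[last]:
            finished.append(path)  # stop at a circle or at a sink vertex
            continue
        for label, target in adjacency[last]:
            queue.append(path + [label, target])  # copy, then extend
    return finished


EDGES = [
    ("school", "member:[0,20]", "teacher"),
    ("school", "member:[0,20]", "student"),
    ("teacher", "name:[0,20]", "jane"),
    ("teacher", "name:[0,20]", "joe"),
    ("student", "name:[0,20]", "mark"),
    ("student", "name:[0,20]", "lucy"),
    ("jane", "teach:[2,8]", "Maths"),
    ("joe", "teach:[4,9]", "Chinese"),
    ("mark", "study:[2,5]", "Maths"),
    ("mark", "friend:[3,20]", "lucy"),
    ("lucy", "study:[3,5]", "Chinese"),
    ("lucy", "friend:[3,20]", "mark"),
]

paths = all_paths(EDGES)
for p in paths:
    print(".".join(p))
```

On this edge list the sketch yields the eight paths of Example 4, including the two that end by closing the circle between mark and lucy. Because each path is cut as soon as a vertex repeats, the loop terminates even on cyclic graphs.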

3.2.2 Prefix path index

To establish a prefix path index, we need to get all prefix paths of the given paths. For a given path $V_0 \cdot L_0 \cdot V_1 \cdot L_1 \cdot V_2 \cdots L_{n-1} \cdot V_n$, the prefix path of X (X is a vertex or label) is $V_0 \cdot L_0 \cdots X$.

Example 5 For the path "school.member:[0,20].teacher.name:[0,20].jane.teach:[2,8].Maths" in Fig. 1, all its prefix paths are listed as follows.

{school,
school.member:[0,20],
school.member:[0,20].teacher,
school.member:[0,20].teacher.name:[0,20],
school.member:[0,20].teacher.name:[0,20].jane,
school.member:[0,20].teacher.name:[0,20].jane.teach:[2,8],
school.member:[0,20].teacher.name:[0,20].jane.teach:[2,8].Maths}

The prefix path index contains two types of data: one is the elements X in V or L, and the other is the sets of prefix paths. We formally define the structure of the prefix path index as follows.

Definition 9 The prefix path index has the structure {pE, P, PR}, where pE is a set of elements in V or L; P is a set of elements p, where each p is a set of prefix paths whose last elements are the same; and PR: pE → P is a set of relationships between pE and P. That is, an element pe of pE is related to the set p in which the last element of each prefix path is pe.

Example 6 In Fig. 1, the set pE of the temporal RDF graph is {school, member:[0,20], teacher, student, name:[0,20], jane, joe, mark, lucy, friend:[3,20], teach:[2,8], teach:[4,9], study:[2,5], study:[3,5], Maths, Chinese}.

With the paths created by Algorithm 2, we can obtain all prefix paths and establish a prefix path index. The prefix path index is stored in a map, whose key is an element of pE and whose value is the corresponding element of P. If we generated the prefix paths for all elements, more storage space and time would be needed to build the index. To improve the efficiency of the prefix path index, we need to reduce both the storage space of the index and the time to build it. Generally, we assume that elements that occur with high frequency in the dataset are also queried frequently. Therefore, we generate prefix paths only for elements with high frequency. According to Definition 9, the prefix path index has the structure {pE, P, PR}, and all elements in the set pE are high-frequency elements of the dataset. High-frequency elements are determined as follows. In the prefix path index, the frequency of a vertex is determined by its in-degree. If the element is an edge label, the number of edges carrying that label is taken as its in-degree. We generate the prefix paths for all elements in pE to build the prefix path index. Algorithm 3 is used to build the prefix path index.

Algorithm 3. Prefix path index
Input: path set pathset, element set pE
Output: prefix path index

1. for (p : pathset) // p: an element of the path set pathset
2.   for (e : p) // e: an element of the path p
3.     prefixpath = getPrefix(e, p) // getPrefix: get the prefix path of e in the path p
4.     prefixset.add(prefixpath) // prefixset: the set containing all prefix paths
5. for (pe : pE) // pe: an element of the set pE
6.   for (prepath : prefixset) // prepath: an element of prefixset
7.     lastElement = getLastElement(prepath) // getLastElement: get the last element of the prefix path prepath
8.     if (pe.equals(lastElement))
9.       pre.add(prepath) // pre: the set of all prefix paths whose last element is pe; it starts empty for each pe
10.  preindex.put(pe, pre) // preindex: the map whose key is pe and whose value is pre

Example 7 We consider elements whose in-degree is larger than zero as high-frequency elements. Then we have the set of frequent elements of the prefix path index in Fig. 1. The frequent elements are {member:[0,20], teacher, student, name:[0,20], jane, joe, mark, lucy, friend:[3,20], teach:[2,8], teach:[4,9], study:[2,5], study:[3,5], Maths, Chinese}.
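The frequency rule behind Example 7 can be sketched as a single counting pass. The sketch assumes our reading of the text: the frequency of a vertex is its in-degree, and the frequency of an edge label is the number of edges that carry it. The function names and the `threshold` parameter are hypothetical, and the edge list is reconstructed from the paths in Example 4.

```python
from collections import Counter


def element_frequencies(edges):
    freq = Counter()
    for s, label, o in edges:
        freq[s] += 0      # make sure sources appear, even with in-degree 0
        freq[label] += 1  # label frequency: number of edges with this label
        freq[o] += 1      # vertex frequency: in-degree
    return freq


def frequent_elements(edges, threshold=0):
    """Elements with frequency above threshold, most frequent first."""
    freq = element_frequencies(edges)
    return [e for e, n in freq.most_common() if n > threshold]


EDGES = [
    ("school", "member:[0,20]", "teacher"),
    ("school", "member:[0,20]", "student"),
    ("teacher", "name:[0,20]", "jane"),
    ("teacher", "name:[0,20]", "joe"),
    ("student", "name:[0,20]", "mark"),
    ("student", "name:[0,20]", "lucy"),
    ("jane", "teach:[2,8]", "Maths"),
    ("joe", "teach:[4,9]", "Chinese"),
    ("mark", "study:[2,5]", "Maths"),
    ("mark", "friend:[3,20]", "lucy"),
    ("lucy", "study:[3,5]", "Chinese"),
    ("lucy", "friend:[3,20]", "mark"),
]

pE = frequent_elements(EDGES)
print(pE)
```

With the threshold of zero used in Example 7, only school is excluded; raising the threshold shrinks pE and therefore the index, at the cost of leaving rare elements unindexed.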

The input of Algorithm 3 is the path set pathset and the element set pE. The path set pathset contains all the paths of the graph and the element set pE contains the high frequency elements of the graph. Generally, we sort all elements in descending order of frequency and put the elements of relatively high frequency into pE. We process each path p in the path set pathset to obtain the prefix path set prefixset, which contains all the prefix paths of the paths in pathset (lines 1–4). Then, we process the element set pE and the prefix path set prefixset to obtain the map preindex. For each element pe in the set pE, we get the prefix path set pre that is related to pe. Whether a prefix path prepath belongs to pre is determined by its last element: if the last element of prepath is equal to pe, we add prepath to the set pre. Finally, we put the key pe and the value pre into the map preindex (lines 5–10).
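The two phases of Algorithm 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' code: a path is modelled as a tuple of elements, the prefix path of an element is assumed to run from the start of the path up to and including that element, and the helper names and the sample path are hypothetical.

```python
def prefixes_of(path):
    """Return every prefix of `path`, from its first element up to the whole path."""
    return [tuple(path[: i + 1]) for i in range(len(path))]


def build_prefix_index(pathset, pE):
    # Lines 1-4 of Algorithm 3: collect the prefix paths of every path.
    prefixset = set()
    for p in pathset:
        prefixset.update(prefixes_of(p))
    # Lines 5-10: for each frequent element, keep the prefix paths that end with it.
    preindex = {}
    for pe in pE:
        preindex[pe] = {prepath for prepath in prefixset if prepath[-1] == pe}
    return preindex


# Hypothetical path in the style of Fig. 1.
pathset = [("school", "member:[0,20]", "teacher", "name:[0,20]", "jane")]
preindex = build_prefix_index(pathset, pE={"teacher", "jane"})
```

Building a fresh set for every pe makes explicit that the set pre of the pseudocode has to be emptied between two elements of pE.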

With the prefix path index, we can find not only the majority of elements that we would search for directly but also some elements that are related to the given elements. So, the prefix path index is suitable for queries in which the ending vertices of edges are given. Note that the prefix path index cannot be applied to queries in which the ending vertices of edges are unknown. For this case, we propose a second index, named the suffix path index.

3.2.3 Suffix path index

Formally, for a given path V0 · L0 · V1 · L1 · V2 . . . Ln · Vn, the suffix path of X (X is a vertex or label) in the path is X . . . Ln · Vn. We present an example to show all suffix paths of a given path.

Example 8 For the path “school.member:[0,20].teacher.name:[0,20].jane.teach:[2,8].Maths” in Fig. 1, its suffix paths are listed as follows.

{school.member:[0,20].teacher.name:[0,20].jane.teach:[2,8].Maths,
member:[0,20].teacher.name:[0,20].jane.teach:[2,8].Maths,
teacher.name:[0,20].jane.teach:[2,8].Maths,
name:[0,20].jane.teach:[2,8].Maths,
jane.teach:[2,8].Maths,
teach:[2,8].Maths,
Maths}
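The enumeration in Example 8 amounts to slicing the path at every position. The Python sketch below shows this, assuming that a path is represented as a tuple of elements; the function name suffixes_of is hypothetical.

```python
def suffixes_of(path):
    """Return every suffix of `path`, from the whole path down to its last element."""
    return [tuple(path[i:]) for i in range(len(path))]


path = ("school", "member:[0,20]", "teacher", "name:[0,20]",
        "jane", "teach:[2,8]", "Maths")

# Print one suffix path per line, joined with "." as in the text.
for suffix in suffixes_of(path):
    print(".".join(suffix))
```

A path with k elements has exactly k suffix paths, so the seven-element path above yields seven lines.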

The suffix path index contains two types of data: one is the elements X in V or L and the other is the set that contains the suffix path sets. We formally define the structure of the suffix path index as follows.

Definition 10 The suffix path index has the structure {sE, Q, QR}, where sE is a set of elements in V or L, Q is the set of the elements q, where each q is a set of suffix paths whose first elements are the same, and QR: sE → Q is a set of relationships between sE and Q. That is, an element se of sE is related to the set q in which the first element of each suffix path is se.

Example 9 In Fig. 1, the set sE of the temporal RDF graph is {school, member:[0,20], teacher, student, name:[0,20], jane, joe, mark, lucy, friend:[3,20], teach:[2,8], teach:[4,9], study:[2,5], study:[3,5], Math, Chinese}.

Similar to the prefix path index, we take measures to reduce the storage space and the build time of the index. Therefore, the elements in the set sE are high frequency elements. In the suffix path index, the frequency of an element is determined by its out-degree. If the element is an edge label, the number of occurrences of the edge label is taken as its out-degree. We generate the suffix paths for all the elements in sE to build the suffix path index. Algorithm 4 is used to build the suffix path index.

Algorithm 4. suffix path index
Input: the path set pathset, the element set sE
Output: suffix path index

1. for (p : pathset)                            //p: an element of the path set pathset
2.   for (e : p)                                //e: an element of the path p
3.     suffixpath ← getSuffix(e, p)             //getSuffix: get the suffix path of e for the path p
4.     suffixset.add(suffixpath)                //suffixset: the set containing all suffix paths
5. for (se : sE)                                //se: an element of the set sE
6.   for (sufpath : suffixset)                  //sufpath: an element of suffixset
7.     firstElement ← getFirstElement(sufpath)  //getFirstElement: get the first element of the suffix path sufpath
8.     if (se.equals(firstElement))
9.       suf.add(sufpath)                       //suf: the set containing all the suffix paths whose first element is se
10.  sufindex.put(se, suf)                      //sufindex: the map whose key is se and value is suf

Example 10 We consider the elements whose out-degree is larger than zero as the high frequency elements. The set of frequent elements of the suffix index in Fig. 1 is as follows.

The frequent elements are {school, member:[0,20], teacher, student, name:[0,20], jane, joe, mark, lucy, friend:[3,20], teach:[2,8], teach:[4,9], study:[2,5], study:[3,5]}.

The input of Algorithm 4 is the path set pathset and the element set sE. The path set pathset contains all the paths of the graph and the element set sE contains the high frequency elements of the graph. Similar to the prefix path index, we sort all elements in descending order of frequency and put the elements of relatively high frequency into sE. We process each path p in the path set pathset to obtain the suffix path set suffixset, which contains all the suffix paths of the paths in pathset (lines 1–4). Then, we process the element set sE and the suffix path set suffixset to obtain the map sufindex. For each element se in the set sE, we get the suffix path set suf that is related to se. Whether a suffix path sufpath belongs to suf is determined by its first element: if the first element of sufpath is equal to se, we add sufpath to the set suf. Finally, we put the key se and the value suf into the map sufindex (lines 5–10).
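As an illustration, the suffix path index can also be built in a single pass that groups each suffix path under its first element, which yields the same map as the two loops of Algorithm 4 without materialising suffixset. The Python sketch below is a variant under stated assumptions and not the authors' implementation: paths are tuples, and the function name and the sample path are hypothetical.

```python
def build_suffix_index(pathset, sE):
    """Map every element of sE to the set of suffix paths that start with it."""
    sufindex = {se: set() for se in sE}
    for p in pathset:
        for i in range(len(p)):
            first = p[i]
            # Only frequent elements get an entry; other suffix paths are skipped.
            if first in sufindex:
                sufindex[first].add(tuple(p[i:]))
    return sufindex


# Hypothetical path in the style of Fig. 1.
pathset = [("teacher", "name:[0,20]", "jane", "teach:[2,8]", "Maths")]
sufindex = build_suffix_index(pathset, sE={"jane", "teach:[2,8]"})
```

The dictionary lookup replaces the comparison of every se with every suffix path, so the cost no longer grows with the product of the sizes of sE and suffixset.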

Note, however, that to find a suitable element in sE, we need to search sE. In the worst case, we have to iterate through all the elements in sE to determine whether the element is included. To speed up finding specific elements, we use B-tree indexes.

3.3 B-tree index

B-tree is a common data structure that can reduce the time of finding records. B-trees have several notable characteristics. Each internal node of a B-tree of order m contains a number of keys, which act as separation values that divide its subtrees. The root node of a B-tree has at least two child nodes unless it is a leaf, and a leaf node does not have any child nodes. Generally, the number of keys in the root node is between 1 and m − 1. The number of keys in a non-root node is between ceil(m/2) − 1 and m − 1, where ceil(m/2) − 1 is the minimum number of keys. The number of children of a non-root internal node is between ceil(m/2) and m, where ceil(m/2) is the minimum degree or branching factor of the tree. Each key occurs only once and the key set is distributed over the whole tree. A search for a specific value may end in a non-leaf node, and the search performance is equivalent to doing a binary search within the complete set of keys. So, B-tree indexes have been used for a wide variety of operations required in information retrieval and database management, even if some other index structures are faster for some individual index operations [26]. In our index approach, we need to build two B-tree indexes: one for the prefix path index and the other for the suffix path index. For simplicity, we discuss the B-tree for the prefix path index in the following; the B-tree for the suffix path index is similar.
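The occupancy bounds stated above can be checked with a few lines of Python. The function name btree_bounds is hypothetical; the formulas are the ones given in the text for a B-tree of order m.

```python
import math


def btree_bounds(m):
    """Return (minimum, maximum) occupancy figures for a B-tree of order m."""
    return {
        "root_keys": (1, m - 1),
        "node_keys": (math.ceil(m / 2) - 1, m - 1),
        "node_children": (math.ceil(m / 2), m),
    }


bounds_4 = btree_bounds(4)
bounds_5 = btree_bounds(5)
```

For m = 4, a non-root node holds between 1 and 3 keys and has between 2 and 4 children; for m = 5, it holds between 2 and 4 keys and has between 3 and 5 children.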

In the context of the temporal RDF index, the nodes of the B-tree contain keys and values, where the keys are the elements of the set pE and the values are the elements of the set P. When we search for a prefix path in the B-tree of the prefix path index, we obtain the last element pe of the prefix path and search the B-tree for the node n whose key is pe. Then, we traverse the set p, which is the value of the node n, to get the prefix path that we are searching for. Finding a prefix path with the B-tree is faster than checking the paths one by one. In the following, we present the B-tree index for the prefix path index.

Definition 11 The B-tree index has the form (M, EN, IN, ER, IR, K, V), where

1. M is the maximum number of children of an external node.
2. EN is a set of external nodes of the B-tree, and each external node has at most M children.
3. IN is a set of internal nodes of the B-tree, and each internal node has a key and a value.
4. ER: EN → IN is a set of relationships between external nodes and internal nodes. That is, a child of an external node is an internal node.
5. IR: IN → EN is a set of relationships between internal nodes and external nodes. That is, an internal node points to an external node.
6. K: IN → key is a function that assigns a key to an internal node.
7. V: IN → value is a function that assigns a set value to an internal node.

Fig. 3 B-tree of the frequent elements in prefix paths

Example 11 Let us look at the frequent elements of the prefix paths in Example 7. A B-tree of the set pE is shown in Fig. 3, where M of the B-tree is 4.

For the B-tree index based on the prefix path index, the keys of the B-tree are the elements of the set pE and the values of the B-tree are the elements of the set P. With Algorithm 3, we can obtain the keys and values that are used for establishing the B-tree index. We propose Algorithm 5 to build the B-tree index, which is divided into two parts: inserting nodes and splitting nodes.

Algorithm 5. B-tree index
Input: the map preindex
Output: B-tree

1. put (preindex.key, preindex.value)
2.   Node u = insert (root, preindex.key, preindex.value, ht);   //root: the root node of B-tree index; ht: the height of B-tree
3.   if (u != null)
4.     create a new node q as Node (2);
5.     q.children[0] = Entry(root.children[0].key, null, root);
6.     q.children[1] = Entry(u.children[0].key, null, u);
7.     root = q;
8.     ht++;
9. Insert (h, key, value, ht)
10.  Entry t = new Entry (key, value, null);
11.  if ht = 0
12.    for j = 0, ..., h.m-1
13.      if key < h.children[j].key
14.        break;
15.  else
16.    for j = 0, ..., h.m-1
17.      if j+1 = h.m or key < h.children[j+1].key
18.        Node u = insert (h.children[j++].next, key, value, ht-1);
19.        if u = null
20.          return null;
21.        t.key = u.children[0].key;
22.        t.next = u;
23.        break;
24.  for i = h.m, ..., j+1
25.    h.children[i] = h.children[i-1];
26.  h.children[j] = t;
27.  h.m++;
28.  if (h.m < M) return null;   //M: the maximum number of children of external nodes
29.  else return split(h);
30. Split (Node h)
31.   Node p = new Node(M/2);
32.   h.m = M/2;
33.   for j = 0, ..., M/2-1
34.     p.children[j] = h.children[M/2+j];
35.   return p;



In Algorithm 5, we use Node to represent external nodes and Entry to represent internal nodes. The children of external nodes are internal nodes, and the element next of an internal node is an external node. We first get the key and value of the prefix index preindex to be added to the B-tree index, and then call the function Insert, which returns a node u (lines 1–2). If u is not null, we create a new root node q whose two children point to the old root and to u (lines 3–8). In Insert, we create an internal node t that holds key and value (line 10) and find a suitable position for Entry t (lines 11–23). In this step, if ht is larger than 0, we find the right position by comparing key with h.children[j+1].key. If key is not smaller than h.children[j+1].key, we continue with the next child of the external node h. Otherwise, we call Insert to add a new node into the external node pointed to by the (j + 1)th child of h and update the internal node t (lines 16–23). If ht is equal to 0, we compare key with h.children[j].key until key is smaller than h.children[j].key (lines 12–14). Then, we add the internal node t to the external node h. If the number of children of h reaches M, we call the function Split to split h into two external nodes (lines 26–29). In Split, we create a new external node p and move the second half of the children of h into p (lines 31–35). We do not present further details of the functions Insert and Split in this paper.
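For readers who want to experiment with Algorithm 5, the following Python sketch transcribes its three functions nearly line by line. It is an illustration, not the authors' code: the class BTree and the lookup method get are additions that the algorithm does not list, M = 4 is taken from Example 11, keys are compared as plain strings, and all keys are assumed to be distinct.

```python
M = 4  # maximum number of children per node, as in Example 11


class Entry:
    def __init__(self, key, value, next_node):
        self.key, self.value, self.next = key, value, next_node


class Node:
    def __init__(self, m):
        self.m = m                  # number of children currently in use
        self.children = [None] * M  # fixed-size array of Entry slots


class BTree:
    def __init__(self):
        self.root, self.ht = Node(0), 0

    def get(self, key):  # lookup helper, not part of Algorithm 5
        h, ht = self.root, self.ht
        while ht > 0:    # descend to the subtree whose key range covers `key`
            j = 0
            while j + 1 < h.m and not key < h.children[j + 1].key:
                j += 1
            h, ht = h.children[j].next, ht - 1
        for j in range(h.m):
            if h.children[j].key == key:
                return h.children[j].value
        return None

    def put(self, key, value):  # lines 1-8
        u = self._insert(self.root, key, value, self.ht)
        if u is None:
            return
        q = Node(2)  # the root was split, so the tree grows by one level
        q.children[0] = Entry(self.root.children[0].key, None, self.root)
        q.children[1] = Entry(u.children[0].key, None, u)
        self.root, self.ht = q, self.ht + 1

    def _insert(self, h, key, value, ht):  # lines 9-29
        t = Entry(key, value, None)
        j = 0
        if ht == 0:
            while j < h.m and not key < h.children[j].key:
                j += 1
        else:
            while j < h.m:
                if j + 1 == h.m or key < h.children[j + 1].key:
                    u = self._insert(h.children[j].next, key, value, ht - 1)
                    j += 1
                    if u is None:
                        return None
                    t.key, t.value, t.next = u.children[0].key, None, u
                    break
                j += 1
        for i in range(h.m, j, -1):  # shift entries right to open slot j
            h.children[i] = h.children[i - 1]
        h.children[j] = t
        h.m += 1
        return None if h.m < M else self._split(h)

    def _split(self, h):  # lines 30-35
        p = Node(M // 2)
        h.m = M // 2
        for j in range(M // 2):
            p.children[j] = h.children[M // 2 + j]
        return p


keys = ["member:[0,20]", "teacher", "student", "name:[0,20]", "jane", "joe",
        "mark", "lucy", "friend:[3,20]", "teach:[2,8]", "teach:[4,9]",
        "study:[2,5]", "study:[3,5]", "Math", "Chinese"]
tree = BTree()
for position, key in enumerate(keys):
    tree.put(key, position)
```

The keys are the frequent elements of Example 7; each value here is only the insertion position, whereas the index proper would store the set of prefix paths for that key.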

3.4 Inserting and deleting triples

Temporal RDF data varies with time, which means that temporal RDF triples often change. As a result, the temporal RDF index must be maintained accordingly. Basically, we can identify three types of changes for temporal RDF triples: inserting, deleting and updating temporal RDF triples. Updating temporal RDF triples can be carried out by deleting these triples first and then inserting the corresponding new triples. So, in the following, we only concentrate on inserting and deleting temporal RDF triples. In addition, the insertion and deletion of temporal triples involve the modification of B-tree nodes. We only discuss the B-tree constructed for the prefix path index; the B-tree of the suffix path index is handled similarly.

Generally speaking, when some temporal RDF triples are inserted into the temporal RDF graph, we need to add some elements to the created B-tree. However, the key set of the B-tree only contains the high frequency elements. There are two main types of insertion: in one, the key set of the B-tree contains the elements of the temporal triple, and in the other, it does not. If the key set of the B-tree contains the elements of the temporal triple, the value of the corresponding B-tree node must be updated. Otherwise, we must determine whether an existing node needs to be deleted in favor of the new element. We propose Algorithm 6 to maintain the B-tree index based on the prefix path index while temporal RDF triples are inserted into the temporal RDF graph. Following the same processing, we can handle the B-tree index based on the suffix path index.


Algorithm 6. index for inserting temporal RDF triples
Input: a temporal RDF graph and its B-tree; a new temporal RDF triple (s, p:[T], o); the path set pathset; the map indegree, which contains the elements in the set pE and their in-degrees; the map in, which contains all the elements in the dataset and their in-degrees
Output: a new B-tree

1. opath ← getPath(o, pathset)                            //getPath: get all the paths which start from the element o; opath: a path set
2. if the element temp is s or p:[T]
3.   if pE.contains(temp)
4.     newpath ← createNewPath((s, p:[T], o), opath)      //createNewPath: add the element temp in each path of the path set opath; newpath: a path set
5.     addpath(temp, newpath)
6.     indegree.update(temp)                              //update: update the in-degree of the element temp
7.   else
8.     tempindegree ← getInDegree(temp, in)               //getInDegree: get the in-degree of the element temp
9.     (minelement, minindegree) ← getMin(indegree)       //getMin: get the min in-degree minindegree of the map indegree and its corresponding element minelement
10.    if tempindegree+1 > minindegree
11.      delete(minelement)                               //delete: delete the node whose key is minelement
12.      insert(root, temp, newpath, ht)
13.      indegree.update(temp)                            //update: update the in-degree of the element temp
14.    else
15.      pathset.add(newpath)
16.  in.update(temp)                                      //update: update the in-degree of the element temp
17. if the element temp is o
18.   indegree.update(temp)
19.   in.update(temp)
20. addpath(temp, newpath)                                //addpath: update the value of the B-tree node whose key is temp
21.   node ← searchNode(temp)                             //searchNode: search the node whose key is temp
22.   node.value ← addValue(node.value, newpath)          //addValue: add path set newpath to the node.value
23.   updateNode(node.value)                              //updateNode: update the B-tree node

In Algorithm 6, we discuss the approach of adding a new temporal RDF triple to the temporal RDF graph. First, we get the set opath, which contains all the paths starting from the element o (line 1). Then, we carry out the insertion. If the element temp is s or p:[T], we check whether the set pE contains temp. If pE contains temp, we create the set newpath, which contains all the new paths, and add newpath to the index (lines 3–5). Meanwhile, we update the in-degree of temp in the map indegree (line 6). To add newpath to the index, we search for the node whose key is temp and modify the value of that node (lines 20–23). If pE does not contain temp, we consider whether the index nodes need to be modified (lines 7–15). We get the in-degree tempindegree of temp and obtain the minimum in-degree minindegree and its corresponding element minelement (lines 8–9). If tempindegree + 1 is larger than minindegree, we delete the node whose key is minelement, insert a node whose key is temp, and update the in-degree of temp (lines 10–13). Because deleting nodes works as in an ordinary B-tree, it is not explained in detail in this paper. Otherwise, we do not need to modify the index of the temporal RDF graph, but we need to add the new path set newpath to the path set pathset (line 15). The last step is to update the in-degree of temp in the map in (line 16). If the element temp is o, we update the in-degree of temp in the maps indegree and in (lines 18–19).
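The replacement policy of lines 2–16 of Algorithm 6 can be sketched in Python as follows. Several simplifications are assumptions of this sketch and not part of the paper: a dictionary stands in for the B-tree, the new paths are passed in ready-made instead of being derived by createNewPath, the evicted element is also removed from the map indegree, and the map indegree is assumed to be non-empty. All function and variable names are hypothetical.

```python
def insert_element(index, indegree, all_indegree, pathset, temp, newpaths):
    """Maintain the index when the element `temp` gains the paths `newpaths`.

    index        -- dict standing in for the B-tree: key -> set of paths
    indegree     -- in-degrees of the indexed (frequent) elements
    all_indegree -- in-degrees of all elements of the dataset
    """
    if temp in index:
        # The key is already indexed: extend its value (lines 3-6).
        index[temp] |= set(newpaths)
        indegree[temp] += 1
    else:
        temp_indegree = all_indegree.get(temp, 0)
        min_element = min(indegree, key=indegree.get)
        if temp_indegree + 1 > indegree[min_element]:
            # `temp` is now more frequent than the rarest key: swap them (lines 10-13).
            del index[min_element]
            del indegree[min_element]  # assumption: the evicted key leaves the map too
            index[temp] = set(newpaths)
            indegree[temp] = temp_indegree + 1
        else:
            # The index stays as it is; only the path set grows (line 15).
            pathset.update(newpaths)
    all_indegree[temp] = all_indegree.get(temp, 0) + 1  # line 16


index = {"jane": {("school", "jane")}, "joe": {("school", "joe")}}
indegree = {"jane": 3, "joe": 1}
all_indegree = {"jane": 3, "joe": 1, "mark": 1}
pathset = set()
insert_element(index, indegree, all_indegree, pathset, "mark", [("school", "mark")])
```

In this run, mark reaches an in-degree of 2, which exceeds the in-degree 1 of joe, so joe is evicted from the index and mark takes its place.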

Generally, deleting a triple from the index of a regular RDF dataset requires deleting the corresponding elements from the index. In contrast, the index approach proposed in this paper only needs to modify the related elements in the index. This is because we discuss the index for the temporal RDF graph with valid time: invalid elements are retained rather than deleted. Algorithm 7 is used to maintain the B-tree index when temporal RDF triples are deleted from the temporal RDF graph. In the algorithm, deleting triples from the temporal RDF graph means that these triples are no longer valid, so we need to modify them in the temporal RDF graph. To do so, we modify the temporal information of the related paths in the path set pathset and the temporal information of the keys and values in the B-tree. If the key set of the B-tree does not contain the elements of the temporal RDF triple, we only modify the temporal information of the paths in the path set pathset.

Algorithm 7. index for deleting temporal RDF triples
Input: a temporal RDF graph and its B-tree; a temporal RDF triple (s, p:[T], o); the path set pathset; the map indegree, which contains the elements in the set pE and their in-degrees; the map in, which contains all the elements in the dataset and their in-degrees
Output: a new B-tree

1. opath ← getPath(o, pathset)                            //getPath: get all the paths which start from the element o; opath: a path set
2. if the element temp is s or p:[T]
3.   oldpath ← searchpath(pathset, temp)                  //searchpath: search the path starting from the element temp in the path set pathset
4.   newpath ← createNewPath((s, p:[T], o), opath)        //newpath: a path set; createNewPath: create new path according to the temporal RDF triple and the path set opath
5.   if (temp.equals(s))
6.     if (pE.contains(temp))
7.       node ← searchNode(temp)
8.       node.value ← modifyValue(node.value, newpath)    //modifyValue: modify the value of the node with the path set newpath
9.     else
10.      pathset.modify(oldpath, newpath)                 //use the new path to replace the old path
11.  if (pE.contains(temp))
12.    node ← searchNode(temp)
13.    node.key ← modifyKey(node.key, temp)               //modifyKey: modify the key of the node with the element temp
14.    node.value ← modifyValue(node.value, newpath)
15.  else
16.    pathset.modify(oldpath, newpath)


In Algorithm 7, we discuss the approach of deleting temporal RDF triples. First, we get the path set that contains the paths starting from the element o (line 1). If the element is o, the paths that start with o do not contain any temporal information about the triple (s, p:[T], o). So, the key o and its related value in the B-tree need no modification, and only the elements s and p:[T] are processed (line 2). If the element temp is s or p:[T], we get the old path set oldpath, which contains the paths that originally exist in the index, and then create the new path set newpath with the triple (s, p:[T], o) (lines 3–4). If temp is s, we check whether pE contains temp. If pE contains temp, we find the related node in the B-tree and modify the value of the related node. Because the key of the related node does not contain any temporal information, the key does not need to be changed (lines 6–8). If pE does not contain temp, we modify the paths in the path set pathset by replacing the old path with the new path (line 10). If temp is equal to p:[T], we also check whether pE contains temp. If pE contains temp, we get the node related to temp and modify both the key and the value of the related node, because the key contains the temporal information of the triple (lines 11–14). Otherwise, we modify the paths in the path set pathset (line 16).

4 Experimental evaluations

To evaluate the performance of our index approach for the temporal RDF graph, weconduct some experiments with datasets. We first present some datasets that are usedin our experiments and then present the experimental results of the proposed indexesover the datasets.

4.1 Data sets

The datasets we use in the experiments originate from LUBM (Lehigh University Benchmark) [27] and DBpedia [28]. LUBM, which is a university domain ontology, is developed to evaluate Semantic Web repositories in a standard and systematic way. It consists of customizable and repeatable synthetic data, a set of test queries, and several performance metrics. With LUBM, we can obtain datasets that contain 5 million to 28 million triples. DBpedia is a project that aims to extract structured content available on the World Wide Web. With DBpedia, we can get about 2.6 million concepts described by 247 million triples, including abstracts in 14 different languages. Note that the triples of LUBM and DBpedia are not temporal RDF triples. For our experimental purpose, we create temporal RDF datasets based on the LUBM and DBpedia datasets by randomly generating temporal intervals and adding them to the classical triples.
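The construction of temporal triples from classical ones can be sketched in Python as follows. The paper does not state how the intervals are drawn, so the integer time points, the upper bound of 20 and the uniform sampling are assumptions of this sketch, and the function name and the sample triples are hypothetical.

```python
import random


def add_random_intervals(triples, max_time=20, seed=0):
    """Attach a random interval [start,end] with start < end to each predicate."""
    rng = random.Random(seed)  # fixed seed so that the dataset is repeatable
    temporal = []
    for s, p, o in triples:
        start = rng.randint(0, max_time - 1)
        end = rng.randint(start + 1, max_time)
        temporal.append((s, f"{p}:[{start},{end}]", o))
    return temporal


classical = [("jane", "teach", "Maths"), ("joe", "study", "Maths")]
temporal = add_random_intervals(classical)
```

Each result has the form (s, p:[T], o) used throughout the paper, for example a predicate such as teach:[2,8].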

We evaluate the performance of our approach by using several datasets with different sizes of interval lists and different numbers of temporal triples. From each of LUBM and DBpedia, we create three datasets named dataset1 (1000 triples), dataset2 (10,000 triples) and dataset3 (100,000 triples) by randomly selecting the fixed number of temporal RDF triples. In addition, we obtain the sizes of storage spaces of these datasets. The characteristics of the six datasets are summarized in Table 1.

Table 1 Characteristics of datasets

Benchmark   Name of datasets   Number of triples   Size of storage (KB)
LUBM        Dataset1           1000                126
LUBM        Dataset2           10,000              1265
LUBM        Dataset3           100,000             12,912
DBpedia     Dataset1           1000                166
DBpedia     Dataset2           10,000              1658
DBpedia     Dataset3           100,000             13,781

To evaluate our index approach for temporal RDF graphs, we first introduce an index approach that is based on temporal RDF triples instead of the temporal RDF graph. The RDF index based on triples has been extensively discussed in the context of classical RDF data.

4.2 Temporal RDF index based on triples

For temporal RDF triples of the form (s, p, o):[T], the index based on temporal RDF triples is a two-level index. The first level, named the global index of temporal RDF data, is a K-D tree, and the second level, named the local index of temporal RDF data, is a bitmap index. The global index contains elements that represent time intervals. The temporal information of all RDF triples is collected to build the K-D tree. The elements of the local index are the subjects, predicates, and objects of triples with the same intervals. The local index consists of three composite bitmap indexes whose keys are s-p, p-o, and o-s, respectively. For the composite bitmap with the key s-p, for example, the elements of its first column are subjects of the temporal triples, the elements of its first row are predicates of the temporal triples, and its other elements are objects of the temporal triples. Assume that the first element of the ith column is s and the first element of the jth row is p. Then, when the jth element of the ith column is null, no triple (s, p, o) exists. Otherwise, there exist some triples that have subject s and predicate p.

With the index based on temporal RDF triples, a search first gets the intervals of triples in the global index and then gets the matching RDF data in the corresponding local index. In this way, the triples whose intervals fall outside the given interval are pruned. Note that the index based on temporal RDF triples is not suitable for situations in which temporal RDF triples change frequently (e.g., inserting, deleting and updating temporal triples). Therefore, we propose the index based on the temporal RDF graph in this paper.
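The two-level lookup can be illustrated with a deliberately simplified Python sketch. It does not implement a K-D tree or a bitmap: a dictionary keyed by interval stands in for the global index, a dictionary keyed by (s, p) stands in for the composite s-p index, and the intervals are scanned linearly. Treating "within the given interval" as containment is also an assumption, and all names are hypothetical.

```python
from collections import defaultdict


def build_triple_index(temporal_triples):
    """Group triples first by interval (global level), then by (s, p) (local level)."""
    index = defaultdict(lambda: defaultdict(set))
    for s, p, o, interval in temporal_triples:
        index[interval][(s, p)].add(o)
    return index


def query_sp(index, s, p, start, end):
    """Return the objects of (s, p, ?) whose interval lies inside [start, end]."""
    result = set()
    for (lo, hi), local in index.items():
        if start <= lo and hi <= end:  # prune intervals outside the query
            result |= local.get((s, p), set())
    return result


triples = [
    ("jane", "teach", "Maths", (2, 8)),
    ("jane", "teach", "Chinese", (4, 9)),
    ("joe", "study", "Maths", (2, 5)),
]
index = build_triple_index(triples)
objects = query_sp(index, "jane", "teach", 0, 8)
```

Only the interval (2, 8) lies inside the query range [0, 8], so the local index of the interval (4, 9) is never consulted for this query.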

Table 2 The out-degree of the dataset1 in LUBM

Number      251   126   78    26    14    18    21    19    17    6     5
Out-degree  1     2     3     5     6     7     8     9     10    11    12

Table 3 The out-degree of the dataset2 in LUBM

Number      814   513   474   425   265   177   302   347   327
Out-degree  1     2     3     4     5     6     7     8     9
Number      182   96    72    36    15    12    5     4     4
Out-degree  10    11    12    13    14    15    16    17    18

4.3 Experimental results

Our experiments are implemented in JDK 1.8.0 with Eclipse and MySQL and performed on a system with an Intel i5-2450M 2.5 GHz processor, 2 GB RAM, and the Windows 7 operating system. We conduct our experiments using the datasets in Table 1. We evaluate the index approach proposed in the paper in two aspects. First, we evaluate the time of establishing indexes and the size of index storage space. We mainly show the time of inserting and deleting elements in our index approach. Second, we evaluate the performance of queries with our index approach. We compare our index approach with the index approach based on triples in both aspects.

4.3.1 The definition of frequency

To improve the performance of the index for temporal RDF graphs, we use frequent elements to reduce the number of prefix paths and suffix paths as well as the time of building the index. On the one hand, the size of the storage space is decreased because the number of prefix paths and suffix paths is reduced. On the other hand, if an element is not a frequent element, there is no need to expand its corresponding prefix paths and suffix paths, and as a result, the time of building the index is shortened. According to Tables 2, 3 and 4, we analyze the frequencies of the datasets and then determine which elements are frequent elements.

Table 4 The out-degree of the dataset3 in LUBM

Number      1087  770   623   447   387   282   201   136   98    91    100
Out-degree  1     2     3     4     5     6     7     8     9     10    11
Number      114   124   137   148   156   170   157   141   134   116   99
Out-degree  12    13    14    15    16    17    18    19    20    21    22
Number      84    84    54    61    61    62    48    54    56    56    55
Out-degree  23    24    25    26    27    28    29    30    31    32    33
Number      59    76    69    61    79    63    76    60    60    67    62
Out-degree  34    35    36    37    38    39    40    41    42    43    44
Number      48    48    39    38    38    20    24    19    14    10    6
Out-degree  45    46    47    48    49    50    51    52    53    54    55
Number      11    4     3     1     2     2     1     1     1     2     1
Out-degree  56    57    58    60    61    63    64    71    94    96    98
Number      1     1     3     2     3     1     3     1     1     2     1
Out-degree  100   101   102   103   106   108   110   114   115   117   120
Number      1     1     2     1     1     1     1     1     1     1     1
Out-degree  128   129   131   137   140   142   187   296   319   327   330
Number      1     1     1     1     1     1     1     1     1     1     1
Out-degree  343   376   378   381   395   404   411   416   424   425   466
Number      1     1     1     1     1     1     1     1     1     1     1
Out-degree  532   541   548   566   573   576   599   625   640   641   673
Number      1     1     1     1     1     1     2     2     1     1     1
Out-degree  703   704   709   721   733   774   775   793   795   796   797
Number      1     1     1     1     1     1     1     1     1     1     1
Out-degree  803   832   851   855   907   975   1173  1177  1204  1208  1215
Number      1     1     2     1     1     1     1     1     1     1     1
Out-degree  1235  1264  1338  1343  1358  1473  1476  1516  1590  2258  2348
Number      1     1     1     1     1     1     1     1     1     1     1
Out-degree  2369  2615  2641  2726  2872  2984  3222  3333  3334  3424  3489
Number      1     1
Out-degree  3744  3984

Fig. 4 Time of building indexes (x-axis: dataset1, dataset2, dataset3; y-axis: time of building index (ms); series: db-B, lubm-B, db-KD, lubm-KD)

In the above-mentioned three tables, the row named out-degree gives the out-degree of the elements and the row named number gives the number of elements that have the same out-degree. The numbers of elements in dataset1, dataset2, and dataset3 are 3000, 30,000 and 300,000, respectively. However, the numbers of elements whose out-degree is larger than 0 in dataset1, dataset2, and dataset3 are 2000, 20,000 and 200,000, respectively. For an element whose out-degree is 0, the suffix path contains only the element itself. Therefore, we can ignore the suffix paths of these elements, which reduces the construction time of the index. This means that these elements need not be frequent elements. In addition, the numbers of elements whose out-degree is 1 in dataset1, dataset2, and dataset3 are 251, 814 and 1087, respectively. If these elements are included in the set of frequent elements, the size of the set that we need to check for each created suffix path increases. The index storage space of the suffix paths and the time to generate the suffix paths increase as well. After the suffix paths are generated, the amount of data for the B-tree index constructed for the suffix paths also increases. In contrast, we believe that elements with higher frequency in the dataset are more likely to be queried, so elements with a frequency of 1 are less likely to be queried. For example, we could build suffix paths for the element with a degree of 1177 or for the 1087 elements with a degree of 1. In these two situations, the likelihood of being queried is similar, but the space and time costs of the index are very different. So, we sort the elements in descending order of out-degree and take the top 60% of the elements as high frequency elements. This is because, according to the above data, the elements with out-degree 0 make up one-third of all elements. In addition, we need to cut out some elements with small out-degree. Therefore, we define 60% of the total data as frequent elements. Similarly, the frequent elements of the prefix paths are also defined as the top 60% of all elements.
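The 60% cut-off described above can be sketched in Python as follows. The function name and the toy degrees are hypothetical, and breaking ties by element name is an assumption that only serves to make the result deterministic.

```python
def top_percent_by_degree(degree, percent=60):
    """Return the `percent`% of elements with the highest degree."""
    # Sort by descending degree; ties are broken by name for determinism.
    ranked = sorted(degree, key=lambda element: (-degree[element], element))
    keep = len(ranked) * percent // 100  # integer arithmetic avoids rounding surprises
    return set(ranked[:keep])


# Hypothetical out-degrees of five elements.
out_degree = {"a": 5, "b": 0, "c": 2, "d": 1, "e": 0}
sE = top_percent_by_degree(out_degree)
```

With five elements, the top 60% are the three elements a, c and d, and the two elements with out-degree 0 are dropped.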

4.3.2 Index performance

First, we evaluate the time of building indexes by comparing our index with the index based on triples. Our index approach builds two B-trees: one for the prefix paths and the other for the suffix paths. For this purpose, we build the adjacency table of the temporal RDF graph and find all paths. On the basis of the adjacency table, we get the suffix paths and prefix paths and then build two B-trees. The time complexity of building our index thus consists of these three parts. The time complexity of building the adjacency table is O(n), where n is the number of triples. The time complexity of finding all paths is O(n^2). The time complexity of building the two B-tree indexes is O(log n). The time complexity of building the index based on temporal triples is O(d × n), where d is the number of time intervals. However, by restricting our index to frequent elements, we improve its relative efficiency and shorten the time of building it.

We show the time of building the two types of indexes over the three datasets of LUBM and DBpedia in Fig. 4. We use lubm-B and db-B to represent the time of building our index over LUBM and DBpedia, respectively. We use lubm-KD and db-KD to represent the time of building the index based on temporal RDF triples over LUBM and DBpedia, respectively. In Fig. 5, we show the number of paths in building our index over DBpedia (db for short) and LUBM (lubm for short), respectively.

Second, we evaluate the space of storing the two types of indexes. For our index, we store the keys and values of internal nodes, in which the keys are subjects, predicates and objects of triples and the values are all paths that are associated with the keys. The space complexity of our index is O(n^3). The space complexity of the index based on triples is O(n^2). Restricting the index to frequent elements decreases the storage space of the index.

Fig. 5 Number of paths in graph (y-axis: number of paths in graph; series: db, lubm)

Fig. 6 The size of storage space (y-axis: size of storage space (K); series: db-B, lubm-B, db-KD, lubm-KD)

Figure 6 shows the space sizes of storing our index and the index based on temporal RDF triples for the datasets of LUBM and DBpedia.

Fig. 7 Time of inserting triples (y-axis: time of inserting triples (ms); series: db, lubm)

It can be seen from Figs. 4 and 6 that, compared with the index based on temporal RDF triples, our index has an advantage in the time of building the index and the space of storing the index when the experimental datasets become larger in triple number and space. This is because our index does not build an index on all the data. In addition, our index supports temporal RDF triple changes (e.g., triple insertion, deletion, and update) well. We finally evaluate how inserting or deleting a temporal RDF triple (s, p:[T], o) impacts our index. The insertion operation on the index is mainly divided into two steps: the relevant elements and paths are inserted into the B-tree, and the path data is traversed and updated. Similarly, when deleting a temporal triple from the temporal RDF graph, we modify the node of the B-tree and modify the paths in the path set. It is shown in Figs. 7 and 8 that more time is needed to modify the existing index when a temporal RDF triple is inserted into or deleted from a larger temporal RDF dataset, respectively.
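The two maintenance operations can be sketched on a plain set of paths. This is a minimal sketch under assumptions: the path set is a Python set of tuples (vertex, label, vertex, ...), only simple paths are stored, and the function names are hypothetical. The B-tree update itself is not modelled.

```python
def insert_triple(paths, s, label, o):
    """Add the edge s -label-> o and every new simple path that runs through it."""
    paths.update({(s,), (o,)})
    heads = [p for p in paths if p[-1] == s]  # existing paths ending at s
    tails = [p for p in paths if p[0] == o]   # existing paths starting at o
    for head in heads:
        for tail in tails:
            # Vertices sit at even positions; skip pairs that would repeat one.
            if set(head[0::2]).isdisjoint(tail[0::2]):
                paths.add(head + (label,) + tail)


def delete_triple(paths, s, label, o):
    """Drop every stored path that traverses the edge s -label-> o."""
    def uses_edge(path):
        return any(path[i:i + 3] == (s, label, o)
                   for i in range(0, len(path) - 2, 2))

    paths.difference_update([p for p in paths if uses_edge(p)])


# Hypothetical usage with temporal labels written as "p:[start, end]".
PATHS = set()
insert_triple(PATHS, "a", "p:[2001, 2005]", "b")
insert_triple(PATHS, "b", "q:[2003, 2008]", "c")
```

Both operations scan the stored paths, which is consistent with the observation that modifying the index takes longer on a larger dataset.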

4.3.3 Querying performance

The aim of establishing indexes for temporal RDF is to accelerate temporal triple searching. To evaluate the performance of querying with an index, we identify and use four types of representative queries for temporal RDF as follows.

(a) Single triple-pattern queries (Query1). A single triple-pattern query consists of exactly one triple pattern.

(b) Path queries (Query2). A path query consists of several connected triple patterns that form a path.

(c) Star queries (Query3). A star query consists of more than two path queries, and these paths share exactly one common center node.

(d) Time queries (Query4). A time query searches for the intervals of triples.
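The four query types can be made concrete with a naive evaluation that scans the triple list, which corresponds to the case without any index. The data, the function names, and the tuple layout (subject, predicate, (start, end), object) are assumptions made for this sketch only.

```python
# Hypothetical toy data: (subject, predicate, (start, end), object).
TRIPLES = [
    ("alice", "worksFor", (2010, 2015), "acme"),
    ("acme", "locatedIn", (2000, 2020), "berlin"),
    ("alice", "livesIn", (2012, 2018), "berlin"),
]


def match(triples, s=None, p=None, o=None):
    """Query1: one triple pattern; None marks an unbound position."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[3] == o)]


def path_query(triples, start, predicates):
    """Query2: follow a chain of predicates from a start node."""
    frontier = {start}
    for p in predicates:
        frontier = {t[3] for node in frontier for t in match(triples, s=node, p=p)}
    return frontier


def star_query(triples, center, branches):
    """Query3: several path queries that share the same center node."""
    return [path_query(triples, center, branch) for branch in branches]


def time_query(triples, s=None, p=None, o=None):
    """Query4: return the time intervals of the matching triples."""
    return [t[2] for t in match(triples, s, p, o)]
```

For example, `path_query(TRIPLES, "alice", ["worksFor", "locatedIn"])` evaluates two connected triple patterns, and each step triggers a further scan of the triples.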

We run the queries Query1, Query2, Query3, and Query4 over the datasets in Table 1. Figures 9, 10, 11, and 12 show the results of their querying times, respectively.

Fig. 8 Time of deleting triples (ms) (series: db, lubm)

Fig. 9 Querying time of Query1 (10^-3 ms) (series: db-B, lubm-B, db-KD, lubm-KD, db-no, lubm-no)

For each type of query over a dataset, we consider three different cases: the first is the queries with our index over DBpedia and LUBM (identified by db-B and lubm-B); the second is the queries with the index based on triples over DBpedia and LUBM (identified by db-KD and lubm-KD); the third is the queries without any index over DBpedia and LUBM (identified by db-no and lubm-no). It is clearly shown in Figs. 9, 10, 11 and 12 that there is a performance problem in querying the temporal RDF datasets without an index. Such a performance problem becomes more serious when the size of the datasets becomes larger. At this point, optimization is necessary for a user of the system. It can be seen from Figs. 9, 10, 11 and 12 that the use of an index for the temporal RDF datasets can improve the querying performance greatly.


In the single triple-pattern queries (Query1), we need to query the elements in a triple. In the B-tree index, querying different elements of a triple requires different indexes. If we query one element of a triple, we can query either the B-tree index based on the suffix paths or the B-tree index based on the prefix paths. If we query two elements of a triple, the queries are divided into three types as follows. If we query the subject and predicate of a triple, we can query the B-tree index based on the prefix paths. If we query the predicate and object of a triple, we can query the B-tree index based on the suffix paths. In addition, if we query the subject and object of a triple, we can query the B-tree index based on the suffix paths and the B-tree index based on the prefix paths simultaneously. This is because the representation of the prefix path is V_0 · L_0 · V_1 · … · X, which contains the elements in the paths before the element X. So, if we query the subject and predicate of a triple, we can find the object of the triple in the B-tree and get its prefix path set. We can then traverse the prefix path set to find the subject and predicate. For example, if the prefix path is V_0 · L_0 · V_1 · … · V_J · L_J · X and the object is X, the subject is V_J and the predicate is L_J.

However, if the given element is not a frequent element, we cannot find the answer in the B-tree index. Then we traverse the path set to find the answers. As analyzed above, elements with a degree of 0 probably account for about one-third of all elements. In addition, the suffix index of an element with out-degree zero only contains the element itself, which has no effect on the query. We set the proportion of frequent elements to 60%, which means that only 6.66 percent of the elements are not included in the index. Therefore, the possibility of not finding an element in the B-tree is relatively low. Since frequent elements account for 60% of the total elements, the amount of data in the B-tree index is reduced and the query time decreases significantly. But if the element is an infrequent element, its query efficiency is greatly reduced. In this paper, we compare only queries on frequent elements.

The single triple-pattern queries with the index based on triples search the time interval first and then find the right elements. It is shown in Fig. 9 that there are no obvious differences in querying time between the queries of Query1 with our index and the queries of Query1 with the index based on triples.
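The prefix-path example, together with the fallback for infrequent elements, can be sketched as below. The sketch assumes that a prefix path is stored as a tuple (V_0, L_0, V_1, ..., V_J, L_J, X), that a dictionary stands in for the B-tree, and that the names `ALL_PATHS`, `PREFIX_INDEX` and `subjects_and_predicates` are hypothetical.

```python
# Hypothetical path set; labels are kept short for readability.
ALL_PATHS = [
    ("a",), ("b",), ("c",),
    ("a", "p", "b"),
    ("b", "q", "c"),
    ("a", "p", "b", "q", "c"),
    ("a", "r", "c"),
]

# Assumption: only the frequent element "c" has an entry in the prefix index.
PREFIX_INDEX = {"c": [path for path in ALL_PATHS if path[-1] == "c"]}


def subjects_and_predicates(obj):
    """Return the (subject, predicate) pairs of triples whose object is obj."""
    paths = PREFIX_INDEX.get(obj)
    if paths is None:
        # Infrequent element: no index entry, so traverse the whole path set.
        paths = [path for path in ALL_PATHS if path[-1] == obj]
    pairs = set()
    for path in paths:
        if len(path) >= 3:
            # For V_0 ... V_J, L_J, X the last three items are subject,
            # predicate and object of the final triple.
            subject, predicate, _x = path[-3:]
            pairs.add((subject, predicate))
    return pairs
```

The lookup for c is answered from its index entry alone, whereas the lookup for b, which has no entry in this toy index, scans every stored path.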

Similar situations also appear in the path queries (Query2) and the star queries (Query3) in Figs. 10 and 11, respectively. Note that the queries of Query2 with the index based on triples need to carry out Query1-like queries several times (depending on the length of the path), and the queries of Query3 need to conduct several path queries. The queries of Query2 and Query3 with our index approach conduct the searching operation only once no matter how long the path is. We just need to find the start or end element of the path and then obtain the related prefix path set or the related suffix path set. So, with the increase of path length, the number of searching operations in Query2 and Query3 with the index based on triples increases noticeably. As a result, the querying time performance of Query2 and Query3 with our index approach becomes better than with the index based on triples.
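The difference in the number of searching operations can be illustrated by counting index lookups. Both index layouts below are assumptions made for the sketch: the triple index maps (subject, predicate) to objects, the suffix index maps a start element to all paths that begin there, and time intervals are omitted from the labels.

```python
# Hypothetical indexes over the chain a -p-> b -q-> c.
TRIPLE_INDEX = {("a", "p"): ["b"], ("b", "q"): ["c"]}
SUFFIX_INDEX = {"a": [("a",), ("a", "p", "b"), ("a", "p", "b", "q", "c")]}


def path_query_triple_index(start, predicates):
    """One lookup per frontier node and per step of the path."""
    lookups = 0
    frontier = {start}
    for p in predicates:
        reached = set()
        for node in frontier:
            lookups += 1
            reached.update(TRIPLE_INDEX.get((node, p), []))
        frontier = reached
    return frontier, lookups


def path_query_suffix_index(start, predicates):
    """A single lookup of the start element, then a filter over its paths."""
    wanted = tuple(predicates)
    ends = {path[-1] for path in SUFFIX_INDEX.get(start, [])
            if path[1::2] == wanted}  # labels sit at the odd positions
    return ends, 1
```

On this chain, a query of length two costs two lookups in the triple index and one lookup in the suffix index; the filtering work after the lookup still grows with the number of stored paths.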

In the time queries (Query4), the queries with the index based on triples need to traverse the local index until finding the right elements. Our index conducts a single lookup only once. But after finding the corresponding paths, we need to check the predicate elements p:[T] of the paths and keep those whose p is the same as the p of the triple. It is shown in Fig. 12 that the queries of Query4 with our index approach need less querying time than with the index based on triples. It is especially true when the number of temporal triples becomes bigger.
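The comparison step of a time query can be sketched as follows, assuming that each label in a stored path is a pair (p, (start, end)) and that the suffix paths of the subject have already been fetched with one lookup. The index content and the name `intervals_for` are hypothetical.

```python
# Hypothetical suffix index; each label is (predicate, (start, end)).
SUFFIX_INDEX = {
    "a": [
        ("a",),
        ("a", ("p", (2001, 2005)), "b"),
        ("a", ("p", (2001, 2005)), "b", ("q", (2003, 2008)), "c"),
        ("a", ("p", (2007, 2009)), "b"),
        ("a", ("r", (2002, 2004)), "c"),
    ],
}


def intervals_for(s, p, o):
    """Return the intervals T of the triples (s, p:[T], o)."""
    found = set()
    for path in SUFFIX_INDEX.get(s, []):  # the single index lookup
        if len(path) >= 3:
            (predicate, interval), obj = path[1], path[2]
            # Keep the interval only when p and o match the queried triple.
            if predicate == p and obj == o:
                found.add(interval)
    return sorted(found)
```

Only the first edge of each suffix path is inspected, because that edge is the triple whose subject is the looked-up element.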

5 Conclusions

In this paper, we propose a novel index approach for temporal RDF graphs, which can improve temporal RDF queries. The proposed indexes contain two B-tree indexes that are based on the suffix path and the prefix path, respectively. To improve the efficiency of the proposed index, we adopt the method of frequent elements to reduce the elements included in the index. With the proposed index, we compare several types of queries over different datasets. Our approach allows the operations of inserting triples and deleting triples. The experimental results show that our index approach has a good querying time performance for various types of queries. It is especially true for the time queries when massive temporal triples are searched.

Compared with the index based on triples, our index approach needs less time to establish the indexes and also needs less storage space to store the indexes. Note that our index contains partial elements (i.e., frequent elements) instead of all elements of RDF datasets. When the elements that do not appear in the index are queried, the evaluation of their queries needs more time because the index is not available for these elements and the RDF datasets need to be fully traversed. In our future work, we will continue to improve our index approach for temporal RDF graphs and verify our approach with massive RDF datasets.

Acknowledgements The authors wish to thank the anonymous referees for their valuable comments and suggestions, which improved the technical content and the presentation of the paper. This work was supported in part by National Natural Science Foundation of China (61772269).

References

1. Manola F, Miller E (2004) RDF primer. W3C recommendation. https://www.w3.org/TR/rdf-primer/. Accessed 5 May 2018

2. Kim K, Moon B, Kim HJ (2014) RG-index: an RDF graph index for efficient SPARQL query processing. Expert Syst Appl 41(10):4596–4607

3. Bio2RDF. https://github.com/bio2rdf/bio2rdf-scripts/wiki. Accessed 5 May 2018

4. Lexvo. http://www.lexvo.org/linkeddata/tutorial.html. Accessed 5 May 2018

5. LinkedMDB. http://linkedmdb.org/. Accessed 5 May 2018

6. GovTrack. https://www.govtrack.us/about. Accessed 5 May 2018

7. O’Connor MJ, Das AK (2010) A lightweight model for representing and reasoning with temporal information in biomedical ontologies. In: Proceedings of the third international conference on health informatics, DBLP, 2010, pp 90–97

8. Ahmed F, Hameed H, Shafiq MZ (2009) Using spatio-temporal information in API calls with machine learning algorithms for malware detection. In: Proceedings of the 2009 ACM workshop on security and artificial intelligence. ACM, pp 55–62

9. Raleigh GC, Cioffi JM (1998) Spatio-temporal coding for wireless communication. IEEE Trans Commun 46(3):357–366

10. O’Connor MJ, Das AK (2011) A method for representing and querying temporal information in OWL. In: Proceedings of the 2011 international joint conference on biomedical engineering systems and technologies, 2011, pp 97–110

11. Amagasa T, Yoshikawa M, Uemura S (2000) A data model for temporal XML documents. In: Proceedings of the 11th international conference on database and expert systems applications. Springer, pp 334–344

12. Edelweiss N, Hubler PN, Moro MM (2000) A temporal database management system implemented on top of a conventional database. In: Proceedings of the XX international conference of the Chilean. IEEE, pp 58–67

13. Wang F, Zaniolo C, Zhou X (2008) ArchIS: an XML-based approach to transaction-time temporal database systems. VLDB J 17(6):1445–1463

14. Nørvåg K, Limstrand M, Myklebust L (2003) TeXOR: temporal XML database on an object-relational database system. In: Proceedings of the 5th international Andrei Ershov Memorial conference on perspectives of system informatics. Springer, pp 520–530

15. Lutz C, Wolter F, Zakharyashev M (2008) Temporal description logics: a survey. In: Proceedings of the 15th international symposium on temporal representation and reasoning. IEEE Computer Society, pp 3–14

16. Gutierrez C, Hurtado CA, Vaisman A (2007) Introducing time into RDF. IEEE Trans Knowl Data Eng 19(2):207–218

17. Hurtado C, Vaisman A (2006) Reasoning with temporal constraints in RDF. In: Proceedings of the 2006 international conference on principles and practice of semantic web reasoning. Springer, pp 164–178

18. Tappolet J, Bernstein A (2009) Applied temporal RDF: efficient temporal querying of RDF data with SPARQL. In: Proceedings of the 6th European semantic web conference. Springer, pp 308–322

19. Pugliese A, Udrea O, Subrahmanian VS (2008) Scaling RDF with time. In: Proceedings of the 2008 international conference on world wide web. ACM, pp 605–614

20. Zhang F, Wang X, Ma S (2009) Temporal XML indexing based on suffix tree. In: Proceedings of the 2009 ACIS international conference on software engineering research, management and applications. IEEE, pp 140–144

21. Binthalab R, Eltazi N, Elsharkawi ME (2013) TMIX: temporal model for indexing XML documents. In: Proceedings of the 2013 ACS international conference on computer systems and applications. IEEE Computer Society, pp 1–8

22. Binthalab R, Eltazi N (2015) TOIX: temporal object indexing for XML documents. In: Proceedings of the 2015 international conference on database and expert systems applications. Springer, pp 235–249

23. Picalausa F, Luo Y, Fletcher GHL (2012) A structural approach to indexing triples. In: Proceedings of the 2012 international conference on the semantic web. Springer, pp 406–421

24. Tran T, Ladwig G, Rudolph S (2013) Managing structured and semistructured RDF data using structure indexes. IEEE Trans Knowl Data Eng 25(9):2076–2089

25. Campos R, Jatowt A (2015) Survey of temporal information retrieval and related applications. ACM Comput Surv 47(2):15

26. Graefe G, Kuno H (2011) Modern B-tree techniques. In: Proceedings of the 2011 international conference on data engineering. IEEE Computer Society, pp 1370–1373

27. LUBM. http://swat.cse.lehigh.edu/projects/lubm. Accessed 5 May 2018

28. DBpedia. http://wiki.dbpedia.org/about. Accessed 5 May 2018

29. Ma Z, Capretz MAM, Yan L (2016) Storing massive Resource Description Framework (RDF) data: a survey. Knowledge Engineering Review 31(4):391–413


30. Ma RZ, Jia XY, Cheng JW, Angryk RA (2016) SPARQL queries on RDF with fuzzy constraints and preferences. J Intell Fuzzy Syst 30(1):183–195

31. Wang H, Wang H, Park S (2003) ViST: a dynamic index method for querying XML data by tree structures. ACM Sigmod International Conference on Management of Data. ACM, 2003

32. Matono A, Amagasa T, Yoshikawa M (2003) An indexing scheme for RDF and RDF schema based on suffix arrays, international conference on semantic web & databases. CEUR-WS.org, 2003

33. Liu B, Hu B (2005) Path queries based RDF index. In: International conference on semantics. IEEE Computer Society, 2005
