
CroRDF: Optimization for RDF Query on Monetary Cost via Crowdsourcing

Depeng Dang, Member, IEEE

Abstract—The proliferation of structured data and advances in knowledge graphs have enabled the construction of knowledge bases that use the RDF data model to represent various resources and their relationships. However, some RDF queries cannot be answered completely from the existing data. In this paper, we present CroRDF, an inquiry system that provides users with low-cost query services based on existing and crowdsourced RDF data. We propose crowdsourcing query plan (CQP) enumeration optimization algorithms that enumerate the CQPs in the search space based on the selection of high-scoring acquisition rules for each triple pattern in the basic graph pattern (BGP); these algorithms reduce the total time required to traverse the search space and improve the optimization efficiency. To find the optimal CQP, we present a monetary cost estimation algorithm in detail; it considers the relationships between triple patterns and combines them with the multiple choices of crowdsourcing direction to estimate the monetary cost of a query. To evaluate CroRDF, we create different queries on the DBpedia dataset, and the crowd contributes its knowledge through Amazon Mechanical Turk. Experimental results clearly show that our solution accurately estimates and lowers the monetary cost by combining crowdsourcing platforms with the existing data.

Index Terms—Crowdsourcing, RDF, Monetary cost estimation, Crowdsourcing cost optimization

—————————— ◆ ——————————

1 INTRODUCTION

Since Google optimized its search services with knowledge graphs, knowledge graphs have grown rapidly, and a variety of semantic knowledge bases have emerged in both industry and academia, such as DBpedia1, YAGO-NAGA2, Freebase3 and GeoNames4. The Resource Description Framework5 (RDF) is a W3C standard for describing network resources. It is widely used to represent various resources and their relationships in knowledge graphs. RDF is a semi-structured data model in which entities are represented as resources; connections between resources are described as triples composed of subjects, predicates and objects [1]. Many semantic knowledge bases use the RDF semantic model to express millions of fact entities and their relations. Rich and substantial knowledge bases provide not only a backend basis for a variety of applications but also intelligent services.

1. https://wiki.dbpedia.org
2. https://datahub.io/collections/yago
3. https://developers.google.com/freebase
4. http://www.geonames.org
5. https://www.w3.org/RDF
_________________________
• D. Dang is with the College of Information Science and Technology, Beijing Normal University, Beijing 100875, China. E-mail: [email protected].
• W. Yu is with the College of Information Science and Technology, Beijing Normal University, Beijing 100875, China. E-mail: [email protected].
• S. Wang is with the College of Information Science and Technology, Beijing Normal University, Beijing 100875, China. E-mail: [email protected].
• N. Wang is with the College of Information Science and Technology, Beijing Normal University, Beijing 100875, China. E-mail: [email protected].

RDF data model technologies could be especially serviceable for expressing knowledge on the Internet: RDF clearly represents the subject, predicate and object of a sentence in the form of a triple. Moreover, SPARQL allows applications to perform complex queries on distributed RDF databases and is supported by various competing frameworks. Yet SPARQL can only query data that are already in the knowledge base; the quality of that data determines the quality of the query results, and data missing from the knowledge base simply cannot be queried.

Existing methods and technologies for acquiring RDF data cannot guarantee data integrity. Translating from text data or XML documents and mining from the semantic web [2] are both inadequate because data resources are limited, and both are offline approaches: neither can provide a complete answer to a query immediately. Therefore, acquiring RDF data that meet the completeness requirement in real time remains challenging.

Recently, with the rapid development of the Internet, people are scrambling to make full use of network resources, and many projects are attracted by the power of the people on the Internet. Over the past decade, multinational corporations in developed countries turned their attention to the low-cost labor markets of China and India. Now, however, it no longer matters where the labor force comes from: workers can live next door or as far away as Indonesia, as long as they have access to the Internet. Crowdsourcing integrates the advantages of machines and manpower to effectively solve complex problems [6][7][8][9], such as the evaluation of search results [10], tagging of pictures [11], and



filtering [12]. Data can be acquired in real time and on demand through crowdsourcing systems. Collecting data through crowdsourcing alleviates these problems while guaranteeing RDF semantic integrity and enabling flexible queries in real time. Thus, many researchers combine human and machine effort to achieve the desired result.

Previous research in this area has stressed the importance of collecting knowledge from the crowd to complete missing values; examples include CrowdQ [3], HARE [4] and [5]. However, these existing hybrid human-machine approaches have so far focused on the data itself; they did not minimize the monetary cost.

The monetary cost problem has not been solved by previous research. In this paper, we acquire RDF data using a crowdsourcing approach. However, collecting data through crowdsourcing is not free and incurs monetary costs. Therefore, the goal of the present work is to obtain RDF data through crowdsourcing at the least monetary cost. The goal of traditional query optimization methods for databases is to reduce CPU, I/O, and communication costs rather than monetary cost. A few studies have examined answering queries on structured data through crowdsourcing with the goal of reducing monetary cost, but these works are not applicable to RDF data, which is semi-structured in nature. In addition, no study has sought to obtain RDF data by crowdsourcing while minimizing the monetary cost.

In this paper, we will describe CroRDF, a crowdsourcing

knowledge system, to provide a query service by combining

existing data and crowdsourcing data. We will be focusing on

the crowdsourcing query optimization of the CroRDF system

to minimize the monetary cost.

Collecting the query answers from the existing data is

“free”. Thus, given a query from the end users, CroRDF first

searches the answers from the existing data, which we call

the search phase. When the answers do not meet the

requirements, the system collects the remaining answers

through crowdsourcing. We denote this step the collect phase.

In this phase, the system first generates a search space that

contains all possible CQPs. Then, we will detail a cost

estimation algorithm to evaluate each CQP and obtain its monetary cost. Finally, we choose the CQP with the least monetary cost as the final query.

In CroRDF, an ordered BGP graph can be assigned to

different sets of acquisition rules, resulting in different

crowdsourcing plans and different monetary costs. Therefore,

in the optimization process, we define two evaluation scores that are used to determine the candidate set of optimal acquisition rules. Then, we detail a native algorithm and an improved, more efficient enumeration algorithm to enumerate all CQPs in the search space. During this process, we save the calculation relationships between the triple patterns used to derive the number of possible result tuples needed (PossiNum), which improves efficiency.

We reduce the cost estimation problem to the PossiNum estimation problem. The PossiNum estimation is holistic: the PossiNum of a sub-plan depends on the entire

plan. For a triple-pattern sequence, we consider the

association types between the triple patterns and the search

direction of the basic graph pattern (BGP) to estimate the

PossiNum of each triple pattern and the cost of the CQP.

Finally, we select the optimized CQP with the lowest

monetary cost. We conduct an experimental evaluation of

our system in terms of the accuracy of the cost estimation and

the effectiveness of the acquisition rule scores and the two

plan enumeration algorithms.

Our main contributions are as follows:

1. We propose the design of CroRDF including a

two-phase execution strategy for RDF crowdsourcing, i.e., a search phase and a collect phase.

2. We describe the evaluation scores of acquisition

rules and the enumerating algorithms.

3. We traverse the candidate crowdsourcing query

plans to find the one with the lowest cost.

4. We demonstrate that our approach has lower cost

through experiments on the real dataset.

The remainder of this paper is structured as follows: Section 2 reviews the related work. Section 3 introduces the basic architecture of the CroRDF system. Section 4 elaborates the Search phase and explains how a query is first evaluated against the RDF database. Section 5 defines the search space of the CQPs in CroRDF and proposes a plan enumeration optimization process that enumerates all the CQPs in the search space based on the selection of acquisition rules. Section 6 expands on the cost estimation algorithm and discusses how CroRDF estimates the execution cost of a CQP. Section 7 presents the experimental evaluation of our system. Finally, Section 8 summarizes the paper and discusses areas we have identified for future improvement.

2 RELATED WORKS

In recent years, crowdsourcing has been widely used in

various fields as an efficient and cheap problem-solving

model, with demonstrated advantages in human

resources [16][17]. The survey in [18] offers an overall picture of the current state-of-the-art techniques in general-purpose crowdsourcing; in this paper, we deal specifically with answering RDF queries via crowdsourcing, which is the most promising way to solve the integrity issues researchers face in RDF querying. Some data-oriented processing systems have incorporated crowdsourced data, integrating crowdsourcing process control into the data collection process. Approaches such as CrowdDB [19], Deco [21][22][23][24], HARE [20], CoEx Deco [26] and CrowdOp [25] target scenarios in which existing microtask platforms are directly embedded in query processing systems.

There are three important problems in this field: quality

control, cost control and latency control [27]. Here, we

Page 3: 1 roRDF: Optimization for RDF Query on monetary cost via …static.tongtianta.site/paper_pdf/f71978ec-8834-11e9-93bb-00163e08… · roRDF: Optimization for RDF Query on monetary cost

3

briefly review the work on approaches mentioned above.

CrowdDB [19] uses human input via crowdsourcing to process queries that neither database systems nor search engines can adequately answer. It exploits the extensibility of the iterator-based query processing paradigm to add crowd functionality to a DBMS. CrowdDB supports two types of user interfaces that allow the user to input the primary key of the search. It highlights two cases where human input is needed: (a) unknown or incomplete data, and (b) subjective comparisons, and it extends SQL to address these cases.

Similarly, Deco [21][22][23][24] is a database system for declarative crowdsourcing. Syntactically, Deco's query language is a simple extension to SQL. Building on CrowdDB [19], Deco proposes the notions of fetch rules and resolution rules, which provide powerful mechanisms for describing crowd access methods. Fetch rules specify how data in the conceptual relations can be obtained from external sources (including humans). Resolution rules are used to reconcile inconsistent or uncertain values obtained from external sources.

CrowdOp [25] supports cost-based query optimization. It is capable of finding a query plan with low latency given a user-defined budget constraint, which nicely balances the cost and time requirements of users. CrowdOp provides efficient optimization algorithms for three types of queries: selection queries, join queries, and complex selection-join queries. CoEx Deco [26] is a system that lets the user submit queries in the form of comments. It leaves the user free to comment on anything about a specific noun in the form of a triple, and it seamlessly integrates user-entered data with data collected from the crowd. When a query is evaluated, the input RDF graph is matched against the variables inside the triple patterns.

HARE [20] identifies the parts of SPARQL queries that are affected by incomplete portions of RDF data sets, crowdsources potential missing values, and then efficiently combines the crowd answers with the results from the data set during query execution. It uses a crowd knowledge base that captures crowd answers about missing values in the RDF dataset, and a microtask manager that exploits the semantics encoded in the RDF properties of the dataset to crowdsource SPARQL sub-queries as microtasks and update the crowd knowledge base with the results from the crowd.

CrowdDB [19] only proposes using microtask-based crowdsourcing to answer queries that cannot otherwise be answered; no cost control is involved. The Deco [21][22][23][24] prototype does not yet perform sophisticated query optimization. Although HARE can enhance the answers of a SPARQL query evaluation, it concentrates on automatically identifying the completeness of a query against RDF data and does not consider the optimization of the crowdsourcing cost of a query, which is our specific target. The CoEx Deco system [52] answers user queries by evaluating a SPARQL query over RDF while obtaining data from the crowd in the form of triples, but it mainly aims to make SPARQL queries more expressive and does not optimize the crowdsourcing process.

expressive and does not optimize the crowdsourcing process.

Our work focuses on semi-structured RDF data, which involves the associated relationships between the triple patterns in a SPARQL query.

In that respect, our work addresses cost control for crowdsourcing the missing data of a SPARQL query on an RDF database, and it extends the SPARQL language. Our system builds upon widely used crowdsourcing platforms, such as AMT. We consider the mutual influence between existing data and crowdsourced data and embed crowd computing features in the query execution. In summary, CroRDF is a crowdsourcing query system that considers the characteristics of RDF data and the crowdsourcing mode, demonstrates the universality and flexibility of crowdsourcing in data collection, optimizes the crowdsourcing query process, effectively replenishes missing RDF data, and provides a crowdsourcing query service.

3 SYSTEM ARCHITECTURE

The architecture of query processing in CroRDF is illustrated in Fig. 1. An application issues requests using CroSparql, a moderate extension of standard SPARQL. Users call the CroRDF query API through CroSparql and obtain the answers from CroRDF. CroRDF consists of two components, the Search phase and the Collect phase.

Fig. 1. Architecture of CroRDF

In the Search phase, we present the flexible and

extensible data model and predicate index to store the RDF

graph data. Then, we search for the results of a query over the existing RDF graph data by graph exploration. If the results

of the Search phase do not satisfy the query target, the results

are sent to the Collect phase.

After the Search phase, we enter the Collect phase.

CroRDF generates crowdsourcing questions according to the specific acquisition rules in each crowdsourcing plan. Then, it loads the answers from the crowdsourcing platform and uses resolution rules to filter them. Finally, the crowdsourcing results are converted to RDF format and returned to the knowledge base. CroRDF combines the crowdsourcing results with the results of the Search phase and returns them to the user. The overall framework of the crowdsourcing query optimization and module functions is presented below, and the two-phase query execution process of the CroRDF system is briefly described.

3.1 Data Model

RDF is a graph-based data model, which uses directed

edges to connect different nodes. An RDF tuple is

composed of three parts: the subject, predicate, and object.

Each tuple represents a fact. The subject generally

represents an information entity (or concept) on the Web

by a Universal Resource Identifier (URI). The predicate

describes the relevant properties of the entity, and the

object represents the attribute value corresponding to the

subject. The formal representation is as follows [28]:

Given a URI set I, a blank node set B, a literal set L, and an RDF tuple (s, p, o), the information represented by the tuple satisfies
$(s, p, o) \in (I \cup B) \times I \times (I \cup B \cup L)$

A group of RDF triple data can be regarded as a directed

graph G = (V, E, L) [29], where V is the node set representing

the subject or object. E is the directed edge set representing

the predicate. L is the label set. L = 𝐿𝑣 ∪ 𝐿𝑝, where 𝐿𝑣 is the

label set of the nodes and 𝐿𝑝 is the label set of the edges. We

construct the RDF graph based on the key-value storage of

the data structure of the node (id, value), where each node

represents an RDF entity and is stored as a key-value pair.

The specific form is as follows:

(id, (in-adjacency-list, out-adjacency-list))

The node id is regarded as the key, and the value holds the node's adjacency lists. The lists are divided into two categories according to the direction of the connected edges, with (predicate, id) pairs as the basic elements. Starting from any node, we can therefore look up its adjacent nodes directly.

An example is shown in Fig. 2. Fig. 2(a) shows the RDF

graph data, where 𝑛𝑖 is the node id and 𝑙𝑖 is the predicate.

Figure 2(b) shows the key-value storage of node 𝑛0.
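A minimal Python sketch of this key-value node storage follows; the class and method names (RDFGraphStore, add_triple, neighbors) are illustrative and not part of CroRDF.

from collections import defaultdict

class RDFGraphStore:
    """Key-value storage: node id -> (in-adjacency-list, out-adjacency-list),
    where each adjacency list holds (predicate, neighbor_id) pairs."""

    def __init__(self):
        # id -> {"in": [(predicate, id), ...], "out": [(predicate, id), ...]}
        self.nodes = defaultdict(lambda: {"in": [], "out": []})

    def add_triple(self, s, p, o):
        # A triple (s, p, o) is an outgoing edge of s and an incoming edge of o.
        self.nodes[s]["out"].append((p, o))
        self.nodes[o]["in"].append((p, s))

    def neighbors(self, node_id, direction="out"):
        # Return the (predicate, id) pairs adjacent to node_id in the given direction.
        return self.nodes[node_id][direction]

# Example mirroring Fig. 2: node n0 with outgoing predicates l1..l5.
store = RDFGraphStore()
for p, o in [("l1", "n1"), ("l2", "n2"), ("l3", "n3"), ("l4", "n4"), ("l5", "n5")]:
    store.add_triple("n0", p, o)
print(store.neighbors("n0"))  # [('l1', 'n1'), ..., ('l5', 'n5')]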

There are several other components in CroRDF's data model:
• BGP (Basic Graph Pattern): a sequence of triple patterns <subject, predicate, object>.
• Solutions: the results of an extended SPARQL query.
• Acquisition rules: rules specifying how data in the knowledge base can be obtained from external sources (including humans).
• Resolution rules: rules used to reconcile inconsistent or uncertain values obtained from external sources.
• Crowdsourcing Query Plan (CQP): determined by the ordered BGP graph, the acquisition rules and the enumerated plans; it includes the process order and the crowdsourcing direction.
We illustrate each of these data model components informally in later sections.

3.2 Query Extension

We use SPARQL to express select queries over the RDF graph data in CroRDF. In contrast to the commonly used join graph representation of BGPs, in which each triple pattern is an ordinary directed edge from a subject node to an object node, each triple pattern in CroRDF is additionally assigned a crowdsourcing direction in the Collect phase. The formal syntax of a BGP query is expressed as Q:

SELECT ?V1 ... ?Vm WHERE {Q1 ... Qn}, where {Q1 ... Qn}

represents a set of triple patterns and ?V1 ...? Vm represents a

set of variables that appear in {Q1 ... Qn} and defines the

format of the query output.

To meet the needs of queries answered by data collection, we extend the SPARQL query language into 'CroSparql', which can express two types of query targets for the crowdsourcing platform:

a. Given a threshold n on the number of result tuples (MinTuples n), CroRDF returns n results at the least cost.
b. Given a fixed cost budget (MaxCost c), CroRDF returns the maximum number of results within that budget.

For example, if the threshold n is set to 5, CroRDF first returns the β exact solutions found in the knowledge base by the query parser. If β is less than 5, CroRDF collects the remaining solutions from the crowd in the Collect phase. Consider the following example.
Example 1. A user wants to find a doctor who is a professor, together with the doctor's rating score, the hospital the doctor works in, and the hospital's field of focus, where the level of the hospital is 3. The answer can be obtained by the following SPARQL query, namely QF; the query graph is shown in Fig. 3.

Fig. 3. Query QF and its BGP graph

The query first enters the Search phase of CroRDF, where the query requests are initialized according to the query target. The graph exploration module explores the existing knowledge according to one query plan and returns the partial results.

(Fig. 2. Example of the key-value storage structure: (a) RDF graph data with nodes n0...n5 connected by predicates l1...l5; (b) key-value storage of node n0, i.e., (n0, (in-adjacency-list, out-adjacency-list)) with entries (l1,n1) (l2,n2) (l3,n3) (l4,n4) (l5,n5).)

QF: SELECT ?doctor ?hospital ?field ?score
WHERE { ?doctor Has_rate ?score,
        ?doctor PositionTitle PROFESSOR,
        ?doctor WorkIn ?hospital,
        ?hospital MajorIn ?field,
        ?hospital Has_level 3
        MinTuples 5 }

(Fig. 3 depicts query QF as a BGP graph with triple patterns q1 (Has_rate), q2 (PositionTitle), q3 (WorkIn), q4 (MajorIn) and q5 (Has_level).)


The existing data in the knowledge base are shown in Fig. 4. If the number of solutions β is less than the target of 5, the query process switches to the Collect phase. In this phase, based on the partial results

obtained in the Search phase, the TPGenerate processor

generates ordered BGP graphs according to certain rules, i.e.,

different execution sequences of triples. The Acquire

processor determines the crowdsourcing direction and

acquisition rule set of each triple pattern according to the

acquisition rule scores to generate candidate optimal CQPs in

the effective search space. Then, the CostEst module is

utilized to estimate the crowdsourcing cost and help to find

the optimal plan. Finally, the CreateQuestions and

LoadAnswer processors in the crowdsourcing module

publish the crowdsourcing questions and collect the results.

This paper focuses on crowdsourcing query optimization.

Therefore, the specific query optimization process of the

Search phase is not discussed further. In Sections 5 and 6, we explain the acquisition rule evaluation and plan enumeration algorithms used in the Collect phase and the cost estimation for a CQP.

4 SEARCH PHASE

In this phase, the SPARQL query process is transformed

into a sub-graph matching problem using graph exploration

[31]. The process order of the triple patterns in the SPARQL

query is sorted with {q1, ... , qn}, and the matching set of the

i-th triple qi is calculated through the whole graph.

According to the matching set of qi, qi+1 is then matched by graph exploration. In an ordered set of triple patterns, the triples interact with one another, and each matching step builds on the previous results, which reduces the intermediate result sets and improves the query performance.

Algorithm 1 illustrates the main process of the Search phase, where q→ denotes a triple pattern with a crowdsourcing direction from the subject to the object, i.e., the node it shares with another triple pattern is the subject, and q← denotes the crowdsourcing direction from the object to the subject. We call the source of q→ and q← "src" and their target "tgt"; "p" denotes the predicate, and "dir" records the correspondence between src and tgt. When src is a variable, LoadNodes initializes the candidate set from the predicate indexes; when src is a constant, B(src) is initialized to that constant. Then, for each candidate item in B(src), SelectByPredicate searches for the suitable candidate set of tgt. A result is added to R only when the tgt matches B(tgt).

Example 2. Assume that the existing data in the knowledge base are as shown in Fig. 4. For q→ = {?doctor WorkIn ?hospital}, Algorithm 1 yields 4 matching results in R: {(wang3, WorkIn, Chinese Medicine Hospital), (wang1, WorkIn, Jishuitan Hospital), (wang1, WorkIn, Beiyi Hospital), (wang2, WorkIn, Beijing Hospital)}.
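To make this concrete, here is a minimal Python sketch of the matching step of Algorithm 1, assuming the dict-based key-value store of Section 3.1; the function match_pattern and the toy graph are illustrative.

# Key-value store: node id -> {"out": [(predicate, id)], "in": [(predicate, id)]}
graph = {
    "wang1": {"out": [("WorkIn", "Jishuitan Hospital"), ("WorkIn", "Beiyi Hospital")], "in": []},
    "wang2": {"out": [("WorkIn", "Beijing Hospital")], "in": []},
    "wang3": {"out": [("WorkIn", "Chinese Medicine Hospital")], "in": []},
}

def match_pattern(graph, src_candidates, predicate, tgt_candidates=None, direction="out"):
    """Explore from each src candidate along `predicate` and keep (src, predicate, tgt)
    only when tgt is allowed by tgt_candidates (the B(tgt) check)."""
    results = []
    for s in src_candidates:
        for p, o in graph.get(s, {}).get(direction, []):  # LoadNeighbors + SelectByPredicate
            if p != predicate:
                continue
            if tgt_candidates is None or o in tgt_candidates:
                results.append((s, predicate, o))
    return results

# For the pattern {?doctor WorkIn ?hospital}, B(src) holds all subjects of WorkIn.
print(match_pattern(graph, ["wang1", "wang2", "wang3"], "WorkIn"))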

5 COLLECT PHASE

In this phase, based on the result set and the query target in

the Search phase, the query engine triggers the optimal

acquisition rules, generates candidate crowdsourcing plans

and questions dynamically. Then, the crowdsourcing

platform can handle the crowdsourcing questions and

collect new data later.

5.1 Generate Ordered BGP Graphs

For a SPARQL query Q, we first construct a BGP graph

to describe the structural relationship between the triple

patterns. Then, all possible ordered BGP graphs of the triple

patterns that describe the process orders are determined.

Based on the BGP graphs, we construct all possible logical

plans.

Definition 1. A Logical Plan is a sequence of triple

patterns corresponding to an ordered BGP graph.

Assume the triple pattern set TP1 = {q1, q2,..., qn} as the

initial ordered BGP graph, i.e., the order in which the triple patterns appear in the query. Based on TP1, the positions of pairs of triple patterns are exchanged according to the rule position(qi) ↔ position(qj) (i ≠ j)

to form different triple pattern sequences corresponding to

different ordered BGP graphs. When there are n triples, n!

triple pattern sequences are generated. The generation

process of triple pattern sequences is shown in Algorithm 2.

(Fig. 4. Existing data in the knowledge base: an RDF graph over the doctors wang1, wang2 and wang3 and the hospitals they work in, with PositionTitle, WorkIn, MajorIn, Has_level and Has_rate edges.)

Algorithm 1: MatchPattern
Input: triple pattern e (e = q← or e = q→)
Output: the matching set R
1: initialize src, tgt, p and dir from e
2: if src is a variable then
3:   B(src) <- LoadNodes(p, dir)
4: else if src is a constant then
5:   B(src) <- {src}
6: for each s in B(src) do
7:   Id_ListSet <- LoadNeighbors(s, dir)  // adjacency list of s
8:   N <- SelectByPredicate(Id_ListSet, p)
9:   for each o in N ∩ B(tgt) do
10:    R <- R ∪ {(s, p, o)}
11: return R

Different ordered BGP graphs may have different crowdsourcing costs. When the number of triple patterns in the TP set is large, a large number of TP sequences will be generated, which affects the cost estimation and the efficiency of the crowdsourcing optimization process. Therefore, a pruning step is necessary. Because the triple patterns are crowdsourced in a certain order, there exists a binding set of associated values among them that limits and reduces unnecessary acquisition rule generation. Therefore, when generating TP sequences, we keep only the sequences (line 6 of Algorithm 2) in which every two consecutive triple patterns are associated, which effectively reduces the number of candidate ordered BGP graphs.
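The following is a minimal Python sketch of this enumeration with pruning (the idea of Algorithm 2); for clarity it enumerates permutations directly rather than by pairwise swaps, and it assumes each triple pattern is a (subject, predicate, object) tuple whose variables start with '?'.

from itertools import permutations

def shares_variable(tp_a, tp_b):
    """Two triple patterns are associated if they share at least one variable."""
    vars_a = {t for t in tp_a if t.startswith("?")}
    vars_b = {t for t in tp_b if t.startswith("?")}
    return bool(vars_a & vars_b)

def enumerate_bgp(triple_patterns):
    """Enumerate ordered BGP graphs, keeping only the orders in which every
    consecutive pair of triple patterns is associated (the pruning filter)."""
    kept = []
    for order in permutations(triple_patterns):
        if all(shares_variable(order[i], order[i + 1]) for i in range(len(order) - 1)):
            kept.append(order)
    return kept

# The BGP of query QF from Example 1.
qf = [
    ("?doctor", "Has_rate", "?score"),          # q1
    ("?doctor", "PositionTitle", "PROFESSOR"),  # q2
    ("?doctor", "WorkIn", "?hospital"),         # q3
    ("?hospital", "MajorIn", "?field"),         # q4
    ("?hospital", "Has_level", "3"),            # q5
]
print(len(enumerate_bgp(qf)))  # number of candidate ordered BGP graphs after pruning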

5.2 Evaluate Acquisition Rules

Based on the ordered BGP graphs, there are different acquisition rules for each triple pattern that generate different crowdsourcing questions.

5.2.1 Acquisition Rules

Definition 2. An Acquisition Rule is the rule extracted from a triple pattern in a BGP graph that defines how to generate crowdsourcing questions and acquire data from crowdsourcing platforms.

The general form of the acquisition rule is Predicate(subject, object). There are two specific forms when generating acquisition rules: one is Predicate(?, object), with a known object and an unknown subject; the other is Predicate(subject, ?), with a known subject and an unknown object. The acquisition process obtains an unknown value according to a known value. We can set a certain reward for each acquisition rule based on the predicate and pay workers when they complete the crowdsourcing question generated by the acquisition rule later. We take the hospital system as an example. Some acquisition rules are as follows:

Is(?, doctor): Ask a doctor's name.

WorkTime(NAME, ?): Ask the working time

according to the name of the doctor. A triple pattern in the WHERE clause of a

SPARQL query can generate a specific set of acquisition rules. The triple pattern is formally

expressed as ? _var1 <P>? _var2 / CONST, where ?_ var1 and ?_var2 represent variables (subject or object). The object may also be a constant. According to the definition of the acquisition rules, we can generate the following three types of acquisition rules: I: P(?_var1, CONST); II: Is(?_var1, var1), Is(?_var2, var2); III: P(VAR1, ?_var2), P(?_var1, VAR2). var1 and var2, respectively, represent the category where the subject and the object node belong. VAR1 and VAR2, respectively, denote the corresponding values of the subject and the object. Different acquisition rules can be selected under different conditions, and the data for the corresponding triple pattern can be acquired.

5.2.2 Acquisition Rule Selection

Definition 3. A Physical Plan is a sequence of

acquisition rules. It is converted from a logical plan by

choosing the crowdsourcing direction for each triple

pattern in the logical plan and determining the

acquisition rule for the corresponding triple pattern.

We first compute a set of candidate acquisition

rules for each triple pattern and select the

possible-complete acquisition rules (possible-complete

means that data crowdsourced according to the

acquisition rules match the triple pattern completely).

The complete acquisition rules of all triple patterns are

combined to produce a physical plan.

We consider a triple pattern q: ? _var1 <P>? _var2 /

CONST, whose candidate acquisition rule set is

fets={ P(?_var1, CONST); Is(?_var1, var1); Is(?_var2,

var2); P(VAR1, ?_var2); P(?_var1, VAR2) }. A set of

minimum complete acquisition rules includes three

types of rule sets:

A: P(?_var1, CONST)

B: { Is(?_var2, var2); P(?_var1, VAR2)}

C: { Is(?_var1, var1); P(VAR1, ?_var2)}

Sets A and B determine that the crowdsourcing direction of a triple pattern q is from the object to the subject, denoted q←. Set C determines that the crowdsourcing direction of q is from the subject to the object, denoted q→. When the object is a constant, the result of ?_var2 obtained according to the acquisition rules in set C must be filtered by that constant.

Assuming that the knowledge base contains the data shown in Fig. 4, we take the query QF of Example 1 and consider the following two CQPs as an example: A: {q2←, q1→, q3→, q4→, q5→} and B: {q2→, q1→, q3→, q4→, q5→}; the order of the BGP graph and

Algorithm 2: EnumerateBGP
Input: initial ordered BGP graph TP1
Output: TP sequence set TPset
1: TPset <- {TP1}
2: for i ∈ [1, n] do
3:   for j ∈ [i, n] do
4:     for TPi ∈ TPset do
5:       TPnew <- TPi with position(qi) ↔ position(qj) (i ≠ j)
6:       if filter(TPnew) then
7:         TPset <- TPset ∪ {TPnew}
8: return TPset


the acquisition rules are shown in Fig. 5. All acquisition

rules conform to one of the three sets described above.

Different physical plans in the search space are formed

by the combination of different complete acquisition

rules of triple patterns.

Enumerating and combining all possible-complete rules in the candidate acquisition rule sets of each triple pattern would result in a huge number, i.e., $O(2^n \cdot n!)$, of physical plans, which degrades the optimization efficiency. We therefore define two types of evaluation scores for the acquisition rules to calculate their respective contributions to the whole result and select rules with high contribution scores, thereby reducing the number of physical plans.

The first score of the acquisition rule $f_{hk}$ is calculated as follows:

$$\mathrm{score}_1(f_{hk}) = \sum_{i=1}^{n} \left( [\exists j : c(i,j) = h] \times \frac{1}{p_i} \right)$$

where n denotes the number of triple patterns, $p_i$ denotes the number of variables in the i-th triple pattern, $f_{hk}$ denotes the k-th acquisition rule in the complete set of the h-th triple pattern, and c(i, j) indicates whether or not $f_{hk}$ contributes to the i-th triple pattern.

The second score considers the number of acquisition rules $\sum_{j=1}^{p_i} q(i,j)$, where q(i, j) denotes the number of acquisition rules required for the j-th variable. The score of the acquisition rule $f_{hk}$ is calculated as follows:

$$\mathrm{score}_2(f_{hk}) = \sum_{i=1}^{n} \left( [\exists j : c(i,j) = h] \times \frac{1}{\sum_{j=1}^{p_i} q(i,j)} \right)$$

With the calculation and comparison of the scores

of different acquisition rules, rules with high scores are

selected for each triple pattern, which will be

combined to generate the possible optimal physical

plans. Cost estimation can be conducted on these plans

to determine the optimal CQP.
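A minimal Python sketch of the two scores follows, assuming that for each acquisition rule we already know the set of triple patterns it contributes to; all names and the toy numbers are illustrative.

def score1(contributed_patterns, p):
    """score1(f) = sum over the triple patterns i that rule f contributes to of 1/p_i,
    where p[i] is the number of variables in pattern i."""
    return sum(1.0 / p[i] for i in contributed_patterns)

def score2(contributed_patterns, q):
    """score2(f) = sum over the contributed patterns i of 1 / sum_j q(i, j),
    where q[i][j] is the number of acquisition rules required for variable j of pattern i."""
    return sum(1.0 / sum(q[i]) for i in contributed_patterns)

# Toy example: a rule contributing to triple patterns 0 and 2.
p = {0: 2, 1: 1, 2: 2}                 # number of variables per triple pattern
q = {0: [1, 2], 1: [1], 2: [2, 1]}     # rules required per variable of each pattern
print(score1({0, 2}, p))   # 0.5 + 0.5 = 1.0
print(score2({0, 2}, q))   # 1/3 + 1/3 ≈ 0.667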

5.3 Candidate Crowdsourcing Plan

5.3.1 Search Space

Definition 4. A Crowdsourcing Query Plan (CQP)

consists of an ordered BGP graph and the

corresponding physical plan.

This section mainly explains the search space of

the possible CQPs considered by the crowdsourcing

query optimizer of the CroRDF system. An ordered

BGP graph specifies the process order of the triple

patterns that form the BGP graph. Different orders

affect the crowdsourcing process. Section 5.1 discussed in detail how to generate an ordered BGP graph. For each triple pattern, the CQP requires a set of

acquisition rules to collect data that match the triple

pattern. An acquisition rule corresponds to a

crowdsourcing direction of the triple pattern. Section 5.2 discussed the set of candidate acquisition rules for an ordered BGP graph in detail

and the need to select better acquisition rules with

higher scores, i.e., that may contribute more to the

result. By using an ordered BGP graph, we can

construct a logical plan and extend it to different

Fig. 5. CQPs and acquisition rules for plans A and B.
Plan A acquisition rules — q2: PositionTitle(?doctor, PROFESSOR); q1: Has_rate(doctor, ?score); q3: WorkIn(doctor, ?hospital); q4: MajorIn(hospital, ?field); q5: Has_level(hospital, ?level).
Plan B acquisition rules — q2: Is(?, doctor), PositionTitle(doctor, ?positionTitle); q1: Has_rate(doctor, ?score); q3: WorkIn(doctor, ?hospital); q4: MajorIn(hospital, ?field); q5: Has_level(hospital, ?level).

Algorithm 3: SearchBestPlanOriginal
1: bestPlan <- NULL
2: minCost <- ∞
3: for each seqBGP do
4:   for each fetchRuleSet do
5:     plan <- GeneratePlan(seqBGP, fetchRuleSet)
6:     plan.TriplePossEst()
7:     cost <- plan.CostEst(plan.poss)
8:     if cost < minCost then
9:       minCost <- cost
10:      bestPlan <- plan
11: return bestPlan


executable physical plans by selecting different

acquisition rules.

5.3.2 Enumeration Algorithms

Definition 5. PossiNum is the number of possible result tuples needed for each candidate acquisition rule of a triple pattern, which is related to the cost of the corresponding crowdsourcing plan. The details of how to estimate the PossiNum are discussed in Section 6.

Note: Different physical plans have different

acquisition rules, and different acquisition rules have

different turns ratios, which indicates that the

generated one-to-one crowdsourcing questions need

different numbers of result tuples (PossiNum) to find

the right answer. The number of result tuples needed is

directly related to the monetary cost of crowdsourcing.

We now consider the problem of efficiently

enumerating all CQPs in the search space. In CroRDF,

the same logical plan may correspond to different

physical plans, resulting in different crowdsourcing

costs. Thus, the PossiNum estimation is applied at the

physical plan level to help select the optimal CQP.

Moreover, the CroRDF PossiNum estimation is holistic

and is based on an ordered triple pattern sequence in

which the PossiNum of each triple pattern partly

depends on the other parts of the CQP and affects the

other triple patterns. Therefore, the goal of the

enumeration algorithm is to generate a complete CQP

in the search space while maximally reusing the

common triple pattern subsequence. First, we propose

a native enumeration algorithm. Then, we propose an

improved efficient enumeration algorithm based on

reuse. The performance of the two enumeration

algorithms is compared in the experiment.

5.3.2.1 Native Algorithm

The native enumeration algorithm iteratively

generates all valid CQPs in the search space.

Algorithm 3 illustrates the whole process. First, all

ordered BGP graphs (line 3) are enumerated using the EnumerateBGP algorithm of Section 5.1. For each ordered BGP graph, sets of complete acquisition rules are generated and combined according to the evaluation scores proposed in Section 5.2, which constructs a

candidate CQP (lines 4 and 5). The optimal CQP is then

selected by using the PossiNum estimation and cost

model (lines 6-9).
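Here is a minimal Python sketch of this native search loop (the idea of Algorithm 3), assuming an estimate_possinum function and a fixed per-question cost are provided; all names are illustrative.

from itertools import product

def search_best_plan(ordered_bgps, candidate_rule_sets, estimate_possinum, rule_cost=0.05):
    """Native enumeration: try every ordered BGP graph with every combination of
    per-pattern acquisition rule sets and keep the plan with the lowest estimated cost."""
    best_plan, min_cost = None, float("inf")
    for seq_bgp in ordered_bgps:
        # candidate_rule_sets[q] lists the complete acquisition rule sets for pattern q.
        for rule_choice in product(*(candidate_rule_sets[q] for q in seq_bgp)):
            plan = list(zip(seq_bgp, rule_choice))
            possinums = estimate_possinum(plan)   # PossiNum per acquisition rule in the plan
            cost = sum(possinums) * rule_cost     # estimated monetary cost
            if cost < min_cost:
                min_cost, best_plan = cost, plan
    return best_plan, min_cost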

5.3.2.2 Improved Algorithm

The native enumeration algorithm processes each

CQP independently. Since different CQPs may have

common triple pattern subsequences, it is possible to

generate a duplicate estimation for the same

subsequence. To improve the enumeration efficiency,

we can record the estimated results of these common

triple subsequences. Note that there are associated

values between the triple patterns, but we cannot

directly save the estimated PossiNum, although saving

the PossiNum calculation relationship between the

triples is feasible. Therefore, the algorithm does not have to repeatedly determine the relationship between two triple patterns and can perform the calculation directly based on the input parameters.

For a SPARQL query with n triple patterns,

although the computational complexity increases with

the value of n, the computational time is reduced

compared to repeatedly calculating the triple patterns

of all CQPs. Therefore, we can enumerate the physical plans by reusing the recorded relationships of every pair of triple patterns together with their acquisition rules. As in the native enumeration algorithm, an ordered BGP graph is first selected and the physical plans are then enumerated by selecting the rules for the triple patterns; all possible CQPs are thus enumerated.
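The following is a minimal Python sketch of the reuse idea, assuming that what can be cached is the relationship between a pair of directed triple patterns (R1-R4 in Table 1) rather than the PossiNum itself; the representation and names are illustrative.

from functools import lru_cache

# A directed triple pattern: (subject, predicate, object, direction).
Q1 = ("?doctor", "Has_rate", "?score", "->")
Q3 = ("?doctor", "WorkIn", "?hospital", "->")

@lru_cache(maxsize=None)
def relationship(tp_prev, tp_curr):
    """Classify the association between two directed triple patterns.  Caching means that
    common triple-pattern subsequences shared by many CQPs are classified only once."""
    def src(tp): return tp[0] if tp[3] == "->" else tp[2]
    def tgt(tp): return tp[2] if tp[3] == "->" else tp[0]
    if src(tp_prev) == src(tp_curr): return "R1"   # src-src
    if tgt(tp_prev) == src(tp_curr): return "R2"   # tgt-src
    if src(tp_prev) == tgt(tp_curr): return "R3"   # src-tgt
    if tgt(tp_prev) == tgt(tp_curr): return "R4"   # tgt-tgt
    return None                                     # no association

print(relationship(Q1, Q3))  # 'R1': both patterns start from ?doctor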

6 MONETARY COST ESTIMATION

This section describes how the CroRDF system

estimates the cost of a CQP. Assume that each

acquisition rule has a fixed cost that can be set by the

CroRDF system. Although the cost may vary across acquisition rules, we adopt the simplifying assumption that the cost of an acquisition rule does not depend on the specific predicate. Therefore, we convert the

cost estimation into a PossiNum estimation, which is

the number of possible result tuples needed for the

acquisition rule that each triple pattern in the SPARQL

query needs to generate to satisfy the overall query

target. Therefore, the cost estimation formula is as

follows:

$$\text{estimated cost} = \sum_{q_i \in TP} \; \sum_{f_{ij} \in F_i} c_{ij} \times f_{ij}, \qquad 1 \le i \le n,\ 1 \le j \le m_i,$$

where TP is the set of triple patterns in the SPARQL query, $q_i$ is a triple pattern, $F_i$ is the set of candidate acquisition rules generated by $q_i$, $f_{ij}$ is the PossiNum of the j-th acquisition rule in $F_i$, and $c_{ij}$ is the cost of the acquisition rule corresponding to $f_{ij}$. To estimate the PossiNum of a triple pattern, we must fully consider the associations and restrictions among the triple patterns.
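A minimal Python sketch of this cost formula follows, assuming the per-rule costs c_ij and estimated PossiNum values f_ij are already known; the representation is illustrative.

def estimated_cost(plan):
    """plan: list of triple patterns, each given as a list of (rule_cost, possinum) pairs
    for its candidate acquisition rules.  Implements sum_i sum_j c_ij * f_ij."""
    return sum(c * f for rules in plan for (c, f) in rules)

# Toy example: two triple patterns with one acquisition rule each, at $0.05 per question.
print(estimated_cost([[(0.05, 3)], [(0.05, 10)]]))  # 0.65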

6.1 PossiNum Estimation

When executing a SPARQL query, CroRDF generates a BGP graph composed of triple patterns. A CQP corresponds to an ordered BGP graph that indicates the order in which the triple pattern is executed. Therefore, the PossiNum estimation algorithm can be regarded as a graph exploration and traversal process that considers the association among triple patterns. Based on the resolution rule turns ratio and predicate density, the whole process starts from the extended query target, estimates the result tuples that each triple pattern needs to deliver to the next triple pattern, and computes the PossiNum of each triple pattern until the entire BGP graph traversal is complete and returns the calculation result.

6.2 Important Parameters

In the PossiNum estimation, the resolution rule turns ratio and predicate density can be applied to estimate the PossiNum.

6.2.1 Resolution Rules

Resolution rules are applied to eliminate the ambiguity and inconsistency of crowdsourcing result triples, and the results are returned to the knowledge base. The form of the resolution rule is Rule(S->O, predicate), where S and O represent the subject and the object (S can be empty), respectively, and predicate is the predicate involved in the rule. The specific process groups all crowdsourcing result tuples by S, whereas for each group it regards the set of values in O as the input and outputs a result according to a specific resolution rule. Each resolution rule limits the number of inputs as a minimum or average number, and more inputs are needed if they are insufficient for the limitation. The number of inputs can be used for the query cost estimation. The resolution rules involved in the query process include distinct, majority, average, etc. In the example of the hospital system, there may be some resolution rules as follows:

Distinct(∅->hospital, Is): Return the distinct hospital values.

Average-3(doctor->score, Has_rate):

Calculate the average of three scores.

Majority-3(doctor->hospital, WorkIn): Take the majority value among the three results.

6.2.2 Resolution Rule Turns Ratio

The resolution rule turns ratio can estimate the average number of output tuples for each input tuple. For example, the resolution rule Average-n represents the average value of n values and the turns ratio is 1/n;

Majority-n represents the majority of n results and the turn ratio is between 1/3 and 1/2 when n = 3 (when the two results are consistent, 1/2; when inconsistent, 1/3).
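A minimal Python sketch of these turns ratios follows, using the values assumed in the worked example of Section 6.4 (Distinct 1.0, Average-3 0.3, Majority-3 0.4); the table and function names are illustrative.

# Turns ratio: average number of output tuples produced per input tuple.
TURNS_RATIO = {
    "Distinct": 1.0,
    "Average-3": 0.3,
    "Majority-3": 0.4,
}

def inputs_needed(target_outputs, rule_name):
    """Number of crowdsourced input tuples needed to deliver `target_outputs` tuples
    after the resolution rule is applied."""
    return target_outputs / TURNS_RATIO[rule_name]

print(inputs_needed(3, "Average-3"))   # 10.0 (matches q1.poss in Plan A of Section 6.4)
print(inputs_needed(3, "Majority-3"))  # 7.5  (matches q3.poss in Plan A of Section 6.4)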

6.2.3 Predicate Density

The predicate density of an acquisition rule is the probability that a possible RDF resource satisfies the predicate. It depends on the predicate category; for example, for the acquisition rules Is(?, doctor) and WorkIn(?doctor, "Beiyi Hospital"), the possible predicate densities are 1 and 0.1, respectively.

6.3 Calculate the PossiNum

First, we define four types of relationships

between triple patterns, as shown in Table 1. The

crowdsourcing process for each triple pattern has a

direction, which refers to the direction between the

source and target, represented by src and tgt,

respectively. The source and target differ from the

subject and object. The right arrow ‘→’ represents the

matching direction from subject to object, whereas the

left arrow '←' indicates the direction from object to subject. For example, for q2←, src (the source) is the object of the triple, whereas for q2→, src is the subject of the triple.

Table 1. Relationships between the triple patterns (examples refer to Fig. 3)
R1: src-src    q1→ and q3→
R2: tgt-src    q3→ and q4→
R3: src-tgt    q1→ and q2←
R4: tgt-tgt    q1← and q3←

Now, we explain the TriplePossEst PossiNum

estimation algorithm in terms of the four types of

relationships between triple patterns. The basic

process unit of the algorithm is a single triple pattern.

In the implementation process, two input parameters

are involved:

target: The number of target tuples to be

output for one triple pattern.

binding: The candidate set of association

values between the triples.

According to the input parameters and a CQP,

the TriplePossEst algorithm estimates the PossiNum

for a specific triple pattern, and the output is passed

as the target input of the next triple pattern. Then, the

total estimated cost of all tuples is calculated

cumulatively. Four local variables are referenced in

each triple pattern estimation:

fets: The acquisition rule set of a triple

pattern.

preds: The predicate set with the density of

the involved triple pattern.

res_sel: The resolution rule set and its

turns-ratio.

poss: The PossiNum of the current triple

pattern.

Algorithm 4 illustrates the basic process of the

TriplePossEst algorithm. The input is the CQP,

including the process order and crowdsourcing

direction of TP. The output is the estimated PossiNum

of CQP, which is the number of possible result tuples

Algorithm 4: TriplePossEst
Input: Crowdsourcing Query Plan (CQP)
Output: estimated PossiNum (EstPoss)
1: target <- n - N or 1
2: binding <- GraphExplore(DataBase)
3: for tpi in TP do
4:   poss <- TriplePossEstCore(target, binding, tpi)
5:   target <- poss
6:   associate_attribute <- Relation(tpi, tpi+1)
7:   binding <- binding(associate_attribute) ∪ GraphExplore(DataBase)
8:   EstPoss <- EstPoss + poss
9: return EstPoss

Algorithm 5: TriplePossEstCore
Input: target, binding, tp
Output: poss
1: fets, preds, res_sel <- Initialize(tp)
2: r_type <- Relationship(tp, previous tp)
3: poss <- target - |binding.existingpartialdata(tp)|
4: posss <- {poss, poss, ..., poss}
5: {f1, ..., fn} <- sort(fets)
6: if r_type = R1 or R2 then
7:   tp.src <- binding
8:   for fi in fets do
9:     if Mapping(fi.src, tp.src) then
10:      for pred in {preds ∪ res_sel(fi)} do
11:        posss[i] <- posss[i] / pred.density
12:    else if Mapping(fi.tgt, tp.src) then
13:      c <- (1 - |tp.src| / posss[i]) × tp(associate_attribute as src).preds.density
14:      posss[i] <- posss[i] / c
15: else if r_type = R3 or R4 then
16:   tp.tgt <- binding
17:   for fi in fets do
18:     if Mapping(fi.tgt, tp.tgt) then
19:       c <- (1 - |tp.tgt| / posss[i]) × tp(associate_attribute as tgt).preds.density
20:       posss[i] <- posss[i] / c
21:     else if NoMapping(fi, tp.tgt) then
22:       for pred in res_sel(fi) do
23:         posss[i] <- posss[i] / pred.density
24: else if r_type = NULL then
25:   for fi in fets do
26:     for pred in {preds ∪ res_sel(fi)} do
27:       posss[i] <- posss[i] / pred.density
28: return poss <- sum(posss)


needed for all acquisition rules in the CQP. First,

according to the query result in the Search phase, the

algorithm initializes the parameters target and binding

(lines 1 and 2). For MinTuples n, the parameter target is

initialized as the number of result tuples required to

satisfy the query target n; for MaxCost c, target is

initialized to 1.

Then, the algorithm calls the TriplePossEstCore

algorithm to calculate the PossiNum of each triple

pattern, updates the input parameters of the next triple

pattern, and sums the PossiNum estimation

cumulatively (lines 4-7). Algorithm 5 illustrates the

TriplePossEstCore algorithm procedure, which aims to

estimate the PossiNum of the current triple pattern.

The inputs are the current triple pattern tp, the number

of results to be output, and the value set associated

with the previous tp. If the current tp is the first one in

the CQP, the binding set is initialized by the

TriplePossEst algorithm. The output is the PossiNum

of the current tp. The initialization is processed in lines

1-5 to obtain the following information about tp: fets,

preds, and res_sel; it also determines the association type between the current tp and the previous tp and initializes the PossiNum. Then,

according to the association types, three cases are

handled separately. The first case (lines 6-14) is applied

to the R1 and R2 association types. In this case, the

binding set limits the range of the src of tp. Therefore,

when the acquisition rules in the fets set are

crowdsourced in a certain order, it is unnecessary to

crowdsource the variable node values that match the

binding set to collect new data. In terms of the matching

type between the acquisition rules and the candidate

set of the association values of tp, the algorithm

estimates the number of other possible crowdsourcing

questions. The second case (lines 15-23) aims to handle

the R3 and R4 association types. In this case, the

binding set limits the range of the tgt of tp. Similarly, the

PossiNum of each acquisition rule is calculated in

terms of different matching types. When the

acquisition rule obtains the unassociated values before

knowing the associated values, the algorithm must

re-calculate the number of acquisition rules required to

obtain the associated values based on the estimated

cost and the binding set (lines 13 and 14 and lines 19

and 20). The third case occurs when tp is the first triple

pattern in the CQP or when there is no association

between the two tps. The PossiNum of acquisition

rules can be estimated directly based on the density of

predicates and the turns-ratio of resolution rules (lines

24-27). Finally, the PossiNum values of all acquisition

rules are combined as the output result.

For the target MinTuples n, target is set to the number of results still to be crowdsourced, taking into account the partial query results generated from the existing knowledge. For the target MaxCost c, the principle of the algorithm is to return as many results as possible within the budget c while returning at least n query results (n is the system default). Therefore, the PossiNum estimation proceeds in three steps. First, it sets target to the number of partial result tuples from the Search phase and calculates the cost of returning the missing values in those partial result tuples; if the number limit is satisfied or the budget has been exceeded, the process returns the results directly and ends, otherwise it proceeds to the next step. Second, it sets target to 1 to calculate the cost of returning one new result tuple. Third, according to the budget c, it repeats the calculation until the budget is exhausted and then returns the number of tuples in the result.
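The three-step logic for MaxCost c can be summarized as a small budgeting loop. In the sketch below, estimate_cost is a hypothetical stand-in for "run TriplePossEst for a given target and multiply by the per-question price", and the handling of partial tuples is deliberately simplified.

```python
def max_cost_tuples(c, n_default, num_partial, estimate_cost):
    """Return the number of result tuples expected within budget c.
    estimate_cost(target) -> cost of crowdsourcing 'target' result tuples."""
    # Step 1: cost of completing the missing values of the partial tuples
    tuples = num_partial
    spent = estimate_cost(num_partial)
    if tuples >= n_default or spent >= c:
        return tuples
    # Step 2: cost of producing one entirely new result tuple
    per_tuple = estimate_cost(1)
    # Step 3: add tuples until the budget c is exhausted
    while spent + per_tuple <= c:
        spent += per_tuple
        tuples += 1
    return tuples
```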

5.4 Example of Cost Estimation

We illustrate the PossiNum estimation algorithm of Section 5.3 with two simple examples. Taking the query in Section 4.2.2 as an example, for simplicity we assume that the predicates PositionalTitle = 'PROFESSOR' and Has_level = 3 have a density of 0.2 and that the other predicates have a density of 1. The turns ratios of the resolution rules Distinct, Average-3, and Majority-3 are 1.0, 0.3, and 0.4, respectively. The cost of each acquisition rule is assumed to be $0.05. The resolution rules involved in CQPs A and B are Distinct(∅->doctor, Is), Average-3(doctor->score, Has_rate), Majority-3(doctor->hospital, WorkIn), Majority-3(hospital->field, MajorIn), and Majority-3(hospital->level, Has_level).

TriplePossEst($\{\overleftarrow{q_2}, \overrightarrow{q_1}, \overrightarrow{q_3}, \overrightarrow{q_4}, \overrightarrow{q_5}\}$)
q2.TriplePossEstCore(4, {wang1}) → q2.poss = 3
q1.TriplePossEstCore(3, {wang1, 9} ∪ binding(doctor)) → q1.poss = 10
q3.TriplePossEstCore(3, {wang1, 9, Beiyi Hospital} ∪ binding(doctor)) → q3.poss = 7.5
q4.TriplePossEstCore(8.5, {wang1, 9, Beiyi Hospital} ∪ binding(hospital)) → q4.poss = 21.25
q5.TriplePossEstCore(7.5, {wang1, 9, Beiyi Hospital, 3} ∪ binding(hospital)) → q5.poss = 18.75
Fig. 6(a). PossiNum estimation process of Plan A

Plan A: Fig. 6(a) shows the PossiNum estimation process of Plan A. First, we consider the impact of the existing data. In our case, the partial query results are the tuples {wang1, Jishuitan Hospital, orthopedics, 8} and {wang1, Beiyi Hospital, ?, 9}. Therefore, the target parameter is initialized to 4, and the binding set is {wang1} (the first triple pattern processed is q2). Then, TriplePossEstCore(4, {wang1}) is called to process q2. For the acquisition rule PositionalTitle(?doctor, PROFESSOR), all results satisfy the predicate, so q2.poss = 4 - 1 = 3. Then, TriplePossEstCore(3, {wang1, 9} ∪ binding(doctor)) is called to process q1. Since the predicate density is 1, the turns ratio of the resolution rule Average-3 is 0.3, and the acquisition rule is Has_rate(doctor, ?score), q1.poss = 3 / 0.3 = 10. Similarly, q3.poss = 3 / 0.4 = 7.5. Because the 'field' value is missing from one of the partial results, q4.binding = {wang1, 9, Beiyi Hospital} ∪ binding(hospital), and the PossiNum calculation of q4 must also cover supplying the missing value; therefore, q4.poss = (7.5 + 1) / 0.4 = 21.25. Similarly, q5.poss = 7.5 / 0.4 = 18.75. The final estimated PossiNum is 3 + 10 + 7.5 + 21.25 + 18.75 = 60.5, and the estimated cost is $0.05 × 60.5 = $3.025.
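As a sanity check, the Plan A numbers can be reproduced with a few lines of Python; this only restates the arithmetic of the paragraph (turns ratios 0.3 and 0.4, the extra question for the missing 'field' value), it is not the estimator itself.

```python
# Reproduce the Plan A PossiNum estimate from the worked example.
q2 = 4 - 1                   # wang1 is already known -> 3
q1 = q2 / 0.3                # Average-3 turns ratio -> 10
q3 = q2 / 0.4                # Majority-3 turns ratio -> 7.5
q4 = (q3 + 1) / 0.4          # +1 supplies the missing 'field' value -> 21.25
q5 = q3 / 0.4                # -> 18.75
total = q2 + q1 + q3 + q4 + q5
print(round(total, 2), round(0.05 * total, 3))   # 60.5 3.025
```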

Plan B: Fig. 6(b) shows the PossiNum estimation process of Plan B. The difference from Plan A is the crowdsourcing direction of q2. The initialization is the same as in Plan A. When TriplePossEstCore(4, {wang1}) is called to process q2, q2.poss = 2 × (4 - 1) / 0.2 = 30 owing to the density of the predicate PositionalTitle = 'PROFESSOR'. Then, TriplePossEstCore(30, {wang1, 9} ∪ binding(doctor)) is called to process q1, and q1.poss = 30 / 0.3 = 100. Similarly, q3.poss = 30 / 0.4 = 75, q4.poss = (75 + 1) / 0.4 = 190, and q5.poss = 75 / 0.4 = 187.5. The final estimated PossiNum is 30 + 100 + 75 + 190 + 187.5 = 582.5, and the estimated cost is $0.05 × 582.5 = $29.125.

As shown above, Plan A costs far less than Plan B. Therefore, to optimize the query cost, Plan A is the better choice.
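The same back-of-the-envelope check works for Plan B, and comparing the two totals at $0.05 per question reproduces the ranking above; again, the snippet only re-runs the example's arithmetic.

```python
# Plan B: the changed direction of q2 brings in the predicate density 0.2
# and an extra acquisition rule, hence the factor 2 / 0.2.
q2 = 2 * (4 - 1) / 0.2                 # 30
q1, q3 = q2 / 0.3, q2 / 0.4            # 100, 75
q4, q5 = (q3 + 1) / 0.4, q3 / 0.4      # 190, 187.5
plan_b = q2 + q1 + q3 + q4 + q5        # 582.5
plan_a = 60.5                          # from the Plan A check above
print(round(0.05 * plan_a, 3), round(0.05 * plan_b, 3))   # 3.025 29.125
```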

6 EXPERIMENTAL EVALUATION

In this section, we experimentally evaluate the performance of the CroRDF system's crowdsourcing query optimizer, focusing on the accuracy of the cost estimation algorithm. We consider only the query target MinTuples n (because the target MaxCost c is also based on the TriplePossEst algorithm). First, we evaluate the performance of the cost estimation algorithm under different settings. Then, we validate the effectiveness of the acquisition rule scores and the two plan enumeration algorithms.

6.1 Accuracy of the Cost Estimation

To evaluate the accuracy of the CroRDF cost model, we designed three experiments that compare the actual cost with the estimated cost: no data in the knowledge base (Experiment 1), partial data (Experiment 2), and partial data with different logical query plans (Experiment 3). For Experiment 1, we adopted a real crowdsourcing platform (Amazon Mechanical Turk) to execute different CQPs and acquire the actual crowdsourcing cost for comparison with the estimated cost. To perform repeated experiments without incurring actual costs, we also built a crowdsourcing simulator that returns results by selecting from a predefined set of values. The simulator can be set either to always return correct answers or to return wrong answers with a certain probability.

Experiment 1: No data. For the SPARQL query in Section 4.2.2, we considered the query target MinTuples 5 and adopted the following two CQPs: Plan A $\{\overleftarrow{q_2}, \overrightarrow{q_1}, \overrightarrow{q_3}, \overrightarrow{q_4}, \overrightarrow{q_5}\}$ and Plan B $\{\overrightarrow{q_2}, \overrightarrow{q_1}, \overrightarrow{q_3}, \overrightarrow{q_4}, \overrightarrow{q_5}\}$. The acquisition rules of Plans A and B are shown in Fig. 5. Assume that the cost of each acquisition rule is $0.05 and that crowdsourcing starts with no existing data. The actual costs of the two crowdsourcing plans are $4.5 and $45.25, respectively.

TriplePossEst($\{\overrightarrow{q_2}, \overrightarrow{q_1}, \overrightarrow{q_3}, \overrightarrow{q_4}, \overrightarrow{q_5}\}$)
q2.TriplePossEstCore(4, {wang1}) → q2.poss = 30
q1.TriplePossEstCore(30, {wang1, 9} ∪ binding(doctor)) → q1.poss = 100
q3.TriplePossEstCore(30, {wang1, 9, Beiyi Hospital} ∪ binding(doctor)) → q3.poss = 75
q4.TriplePossEstCore(76, {wang1, 9, Beiyi Hospital} ∪ binding(hospital)) → q4.poss = 190
q5.TriplePossEstCore(75, {wang1, 9, Beiyi Hospital, 3} ∪ binding(hospital)) → q5.poss = 187.5
Fig. 6(b). PossiNum estimation process of Plan B


The experimental parameter settings were the same as those in the example in Section 5.4. The estimated costs were $4.835 and $48.33, respectively. Fig. 7 compares the estimated and actual costs of the two plans. As shown in the figure, the overall estimated costs were very close to the actual costs, although there were still minor errors (7.4% and 6.8%, respectively), for two main reasons. First, our turns-ratio and density settings were not sufficiently accurate. For example, the resolution rule Majority-3 did not necessarily require three inputs as expected; from the experimental result, the actual turns ratio was estimated as 0.48. Furthermore, for Plan B, three doctor values were finally obtained from 33 crowdsourcing results, and therefore the turns ratios of the resolution rule Distinct and the predicate PositionalTitle = 'PROFESSOR' were 0.85 and 0.3, respectively. Second, our PossiNum estimation algorithm uses some simple assumptions. For the acquisition rule P(?var, CONST), it is assumed that the results always satisfy the constant restriction, but this is often not the case. For example, we assumed that the crowdsourcing results of the rule PositionalTitle(?doctor, PROFESSOR) always satisfied the predicate PositionalTitle = 'PROFESSOR', but in fact, the crowdsourcing workers were likely to return unmatched answers. To mitigate this problem, we can adjust the turns ratios of the resolution rules associated with these acquisition rules to accommodate real-world uncertainties.
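For reference, the reported relative errors follow directly from the estimated and actual costs above; a two-line check of the plain arithmetic confirms them.

```python
# Relative errors of the Experiment 1 cost estimates (estimated vs. actual).
for est, act in [(4.835, 4.5), (48.33, 45.25)]:
    print(f"{(est - act) / act:.1%}")   # 7.4%, 6.8%
```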

Experiment 2: Partial data. Because of the crowdsourcing cost and latency of repeated experiments on a real crowdsourcing platform, we adopted the crowdsourcing simulator to collect data for the following experiments. We mainly considered the target MinTuples n and observed the estimated and actual results under different distributions of existing data. Consider two different types of SPARQL queries (a star structure and a chain structure):

Query I: select ?doctor, ?hospital, ?position, ?score where {q1: ?doctor WorkIn ?hospital, q2: ?doctor PositionalTitle ?position, q3: ?doctor Has_rate ?score}. The query plan is $\{\overrightarrow{q_1}, \overrightarrow{q_2}, \overrightarrow{q_3}\}$, and the acquisition rules are {q1: Is(?, doctor), WorkIn(doctor, ?hospital); q2: PositionalTitle(doctor, ?position); q3: Has_rate(doctor, ?score)}.

Query II: select ?doctor, ?hospital, ?field where {q1: ?doctor Has_rate 9, q2: ?doctor WorkIn ?hospital, q3: ?hospital MajorIn ?field}. The query plan is $\{\overleftarrow{q_1}, \overrightarrow{q_2}, \overrightarrow{q_3}\}$, and the acquisition rules are {q1: Has_rate(?doctor, 9); q2: WorkIn(doctor, ?hospital); q3: MajorIn(hospital, ?field)}.

Suppose that the resolution rule for hospital, position, and field is Majority-3, the resolution rule for doctor is Distinct, and the resolution rule for score is Average-3, with turns ratios of 0.4, 1, and 0.3, respectively. Fig. 8 compares the estimated and actual results when N results were obtained. In the experiment, we set three different initial states of the existing data by randomly selecting 0, 100, and 200 distinct values. The query results over the existing data were obtained through graph exploration in the Search phase, and the crowdsourcing query was then performed based on the partial result tuples. The results show that the estimated costs were very close to the actual costs. Under the three data distributions, the average relative errors were 3.75%, 10%, and 34.95% for Query I and 9.31%, 14.18%, and 13.17% for Query II, respectively. The estimation algorithm could distinguish between the existing data and the crowdsourced data, and the different initial states were reflected in the cost estimation.

Fig. 7. MinTuples: Accuracy of the cost estimation without data (Plan A and Plan B)


Experiment 3: Partial data with different logical query plans. For the query in Section 4.2.2, we considered the following two logical query plans and the corresponding acquisition rules:

Plan A $\{\overleftarrow{q_2}, \overleftarrow{q_5}, \overrightarrow{q_3}, \overrightarrow{q_1}, \overrightarrow{q_4}\}$:
q2: PositionalTitle(?doctor, 'PROFESSOR');
q5: Has_level(?hospital, 3);
q3: WorkIn(doctor, ?hospital);
q1: Has_rate(doctor, ?score);
q4: MajorIn(hospital, ?field).

Plan B $\{\overleftarrow{q_5}, \overrightarrow{q_4}, \overleftarrow{q_3}, \overrightarrow{q_1}, \overrightarrow{q_2}\}$:
q5: Has_level(?hospital, 3);
q4: MajorIn(hospital, ?field);
q3: WorkIn(doctor, ?hospital);
q1: Has_rate(doctor, ?score);
q2: PositionalTitle(doctor, ?position).

The resolution rules and turns ratios were the same as those in Experiment 1. Considering three different initial states of the existing data, Fig. 9 compares the estimated and actual results of the two query plans for the target MinTuples. In Fig. 9(a) and (b), the estimated cost is close to the actual cost; the average relative errors of Plans A and B were 14.37% and 11.9%, respectively. The result of Plan B in Fig. 9(c) illustrates a poor situation in which the cost model could not predict the best execution plan. This failure occurred because we controlled the generation of the doctor and hospital data to meet the query predicate requirements, which led to a difference between the default predicate density and the actual one and thus to inaccurate estimates.

Fig. 8. MinTuples: Accuracy of the cost estimation with partial data ((a) initial state: 0 values; (b) 100 values; (c) 200 values)
Fig. 9. MinTuples: Accuracy of the cost estimation with different logical plans ((a) initial state 1; (b) initial state 2; (c) initial state 3)
Fig. 10. Cost comparison of different evaluation scores ((a) initial state 1; (b) initial state 2; (c) initial state 3)
Fig. 11. Performance comparison between the two enumeration algorithms


6.2 Validity of the Enumeration Algorithms

Experiment 4: Evaluation scores of the acquisition rules. Selecting different acquisition rules for the same logical plan results in different physical plans. This experiment validated the effectiveness of the two evaluation scores, proposed in Section 4.2 for optimizing the selection of acquisition rules, and evaluated their effect on enumerating CQPs. Considering Query II from Experiment 2 and the target MinTuples, Fig. 10 shows the actual costs of score1, score2, and random selection for three different initial states of the existing data (the number of existing values was set to 100, 200, and 300). The experiment assumed that the crowdsourcing simulator returns only correct results. From the experimental results, we conclude that optimizing the selection of acquisition rules according to score1 and score2 reduced the cost by an average of 28.2% and 33.9%, respectively, compared with random selection. Therefore, choosing rules according to the scores helps locate promising candidate physical plans faster, reduces the enumeration space, and finds the best crowdsourcing physical plan more quickly. Moreover, as the existing data increased, the optimization gap between score1 and score2 decreased gradually, because an acquisition rule can fill more values in the partial results when more existing data are available.

Experiment 5: Enumeration algorithms. To evaluate the effectiveness of the enumeration process, we compared the two enumeration algorithms using the overall optimization time, i.e., the time from issuing the crowdsourcing query to finding the optimal crowdsourcing plan, as the evaluation criterion. Since the search space depends to some extent on the number of triple patterns in the SPARQL query, we generated a series of queries with varying numbers of triple patterns t and measured the optimization time of each query execution. Fig. 11 shows the comparison of the two enumeration algorithms. The performance of the improved enumeration algorithm was much better than that of the naive enumeration algorithm, and the optimization effect became more obvious as t increased. When t = 9, the improved enumeration algorithm was 2.3 times faster than the naive algorithm.

8 CONCLUSION

This paper presented CroRDF, a system that completes RDF queries via crowdsourcing with a crowdsourcing query plan optimizer that finds the optimal CQP based on the estimated monetary cost. According to the characteristics of RDF data and the query requirements, we defined the data model and extended the SPARQL query statement. We proposed a plan enumeration algorithm based on triple pattern sequences and acquisition rule selection, as well as a monetary cost estimation algorithm. Through comparisons on real and simulated data, we verified the accuracy of our cost estimation algorithm and the validity of the plan enumeration algorithm.

In future work, we will study how to optimize multiple SPARQL crowdsourcing queries by integrating a reasoning module and extracting common query substructures, so that multiple queries can be merged into one crowdsourcing query to effectively reduce the crowdsourcing cost.

ACKNOWLEDGMENTS

This research is supported by the National Natural Science Foundation of China under Grants No. 61672102, No. 61073034, No. 61370064, and No. 60940032; the Program for New Century Excellent Talents in University of the Ministry of Education of China under Grant No. NCET-10-0239; the Science Foundation of the Ministry of Education of China and China Mobile Communications Corporation under Grant No. MCM20130371; and the Open Project Sponsor of the Beijing Key Laboratory of Intelligent Communication Software and Multimedia under Grant No. ITSM201493.



Depeng Dang received his PhD degree in Computer Science and Technology from Huazhong University of Science and Technology, China, in 2003. From July 2003 to June 2005, he was a postdoctoral researcher in the Department of Computer Science and Technology, Tsinghua University, China. He is now a full professor and Ph.D. supervisor in Computer Science and Technology at Beijing Normal University, China. He has chaired four NSFC projects. His research interests include crowdsourcing computing and RDF data management.

Wenhui Yu received her Bachelor's degree in Computer Science and Technology from Beijing Normal University. She is currently studying at the College of Information Science and Technology, Beijing Normal University, China. Her research interests include RDF data management and crowdsourcing computing.

Shaofei Wang received her Master's degree in Computer Software and Theory from Northwestern Polytechnical University. She is currently studying at the College of Information Science and Technology, Beijing Normal University, China. Her research interests include crowdsourcing computing and RDF data management.

Nan Wang received her Bachelor's degree in Computer Science and Technology from Beijing Normal University. She is currently studying at the College of Information Science and Technology, Beijing Normal University, China. Her research interests include crowdsourcing computing and RDF data management.