[IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Privacy Preserving Query Processing Using Third Parties
Post on 19-Dec-2016
Embed Size (px)
Privacy Preserving Query Processing usingThird Parties
Fatih Emekci Divyakant Agrawal Amr El Abbadi Aziz GulbedenDepartment of Computer Science
University of California Santa BarbaraSanta Barbara, CA 93106, USA
Data integration from multiple autonomous data sourceshas emerged as an important practical problem. The key re-quirement for such data integration is that owners of suchdata need to cooperate in a competitive landscape in mostof the cases. The research challenge in developing a queryprocessing solution is that the answers to the queries needto be provided while preserving the privacy of the datasources. In general, allowing unrestricted read access tothe whole data may give rise to potential vulnerabilities aswell as may have legal implications. Therefore, there is aneed for privacy preserving database operations for query-ing data residing at different parties. In this paper, we pro-pose a new query processing technique using third parties ina peer-to-peer system. We propose and evaluate two differ-ent protocols for various database operations. Our schemeis able to answer queries without revealing any useful infor-mation to the data sources or to the third parties. Analyticalcomparison of the proposed approach with other recent pro-posals for privacy-preserving data integration establishesthe superiority of the proposed approach in terms of queryresponse times.
1 IntroductionWith the ubiquitous availability of the Internet and all
the resources it provides, data integration has emerged asan important practical problem especially from a data man-agement point of view [11, 14]. Techniques used for thispurpose commonly assume that the data sources are willingto allow access to all their data without privacy concernsduring query processing. In real life, however, owners donot want to share all their data due to various reasons suchas competition among data owners or possible legal impli-cations requiring confidentiality of the data. For example,
This research was funded in parts by NSF grants IIS IIS 02-23022,and CNF-04-23336.
consider a community of medical professionals that needto share data without violating the privacy of either the pa-tients or the organizations. One possible collaboration insuch an environment is that a medical professionals officemight be interested in all test results of its patients. The re-sults might be in some other medical facilities. For such aquery, the involved parties are only interested in finding thetest results of patients in the intersection of the different of-fices and nothing else. In fact, offices might not be allowedto share information about patients that are not in the in-tersection. Similarly, consider another type of data sharingwhere medical professionals want to classify the patientsaccording to their total number of visits to all the medicalprofessionals but do not want to violate patients or medi-cal offices privacy requirements (i.e., parties do not wantto reveal themselves or the total number of visits a patientmade to their office). As these examples illustrate, there isan increasing need for privacy preserving database opera-tions for these kinds of queries. Privacy preservation meansthat the data owner wants to restrict the extra information itprovides to other parties. In other words, data owners areonly willing to share the minimum information required forprocessing queries correctly.
The goal of privacy preserving query processing is to ex-ecute queries over multiple data sources without revealingany extra information to any of the involved data sources.At the end of the query processing, they will only acquirethe query result. For example, if the query is asking for theintersection of lists residing at different data sources, afterquery processing, the data owners will only know the queryresult (eg., the intersection of the lists but no informationabout the elements that are not in the intersection). In orderto prevent such privacy violations, any privacy preservingtechnique should provide a method to execute several cru-cial operations, in particular intersection, join and aggrega-tion which are all fundamental for general purpose queryprocessing.
Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE
Several techniques have been proposed to preserve theprivacy of data sources in the areas of databases and cryp-tography. In the trusted third parties approach, the dataowners hand over their data to a trusted third party andthe third party computes the results of the queries [1, 22].This level of trust is unacceptable for privacy preservingquery processing. Secure multi-party computation is an-other technique, where, given parties and their respectiveinputs , a function is com-puted such that all parties can only learnbut nothing else [17, 18, 23]. The computation and the com-munication complexity of this method makes it impracticalfor database operations working over a large number of ele-ments. For example, the communication needed to computethe intersection of two lists of size million takes around
days with this method . Due to the extreme costof computation and communication, Agrawal et al.  pro-posed a minimal information sharing paradigm to performintersection and join operations. Their proposed solutionis based on using encryption/decryption without the needfor third parties. However, this technique has two short-comings: (1) Since encryption/decryption are costly oper-ations in terms of computation, the query response time isvery high (in fact, for some real world applications, this ap-proach may require as much as hours of computation);(2) It cannot be used to aggregate data residing at differentsites (i.e., it does not support aggregation queries).
Due to the high overhead of cryptographic operations,and in general, the large number of elements that need to beaccessed in database operations, cryptography-based tech-niques are impractical for query processing in a real envi-ronment. Therefore, we propose a new computationally ef-ficient approach that uses third parties to solve this problem.Our scheme does not reveal any extra useful information toany of the third parties used in the computation, nor do thedata sources get extra information regarding the other par-ties data.
In this paper, we use a hash based P2P system to selectthird parties that perform the computations required for pro-cessing privacy preserving queries. P2P systems are partic-ularly suitable for this task due to their self organizing andscalable structure. Furthermore, we exploit the fact that itis possible to perform anonymous communication in a P2Psystem using the overlay network. Combining these proper-ties, we propose a query processing technique over peer-to-peer systems to speed up query response time while preserv-ing the privacy of the data owners. We use P2P systems notas a data store, which has been the focus of most of the P2Psystems. Instead, we employ a P2P system for anonymouscommunication and computation. Analytical comparisonof the proposed method with existing methods shows thatour method is able to answer queries with a much lower re-sponse time, which in turn makes privacy preserving query
processing practical for real life applications.The rest of the paper is organized as follows. In Sec-
tion 2 we provide some background information. Section 3formulates the problem. Section 4 describes query process-ing in an honest-but-curious environment. The analysis ispresented in Section 5. The paper concludes with a discus-sion of future work.2 Background
In this section we first present some of the related workin privacy. Then we introduce Shamirs secret sharingmethod  that will be used for sharing a secret with thirdparties securely.
2.1 Related Privacy Work
Naor et al.  proposed a scheme to process intersec-tion queries using oblivious transfer and polynomial evalu-ation based on cryptographic techniques. The scheme en-sures that data sources will only know the intersection butnothing else. Although this scheme can answer intersectionqueries involving two data sources the communication costrequired by this technique is where is the size ofthe databases. Therefore, it is impractical to apply in largedatabases. Hacigumus et al. , Hore et al.  and Ag-garwal et al.  propose using third parties as database ser-vice providers without the need for expensive cryptographicoperations. However the proposed schemes do not allowqueries to execute over the data of multiple providers, whichis the main focus of our work. Aggarwal et al.  show thechallenges in finding the th element in the union of morethan two databases while preserving privacy and propose anapproximate solution. In addition, several high level designefforts and requirement specifications have been made tosupport the privacy of individual information while still sup-porting some degree of sharing [1, 5, 6, 7, 10, 27, 28, 12].
Although related, our work is orthogonal or complemen-tary to privacy preserving data management in data miningand information retrieval. In data mining, several effortshave been made to either preserve the privacy of individu-als using randomized techniques [8, 9, 16, 25] or to preservethe privacy of the database while running data mining algo-rithms over multiple databases [13, 21] using cryptographictechniques such as secure multi-party computation and en-cryption.
2.2 Shamirs Secret Sharing
Shamirs secret sharing method  allows a dealerto distribute a secret value among peers
, such that knowledge of any ( )peers is required to reconstruct the secret. Since, even com-plete knowledge of peers cannot reveal any informa-tion about the secret, Shamirs method is information the-oretically secure. Dealer chooses a random polynomial
Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE
of degree where the constant term is the secretvalue, , and a publicly known set of random points. Thedealer computes the share of each peer as and sends itto peer . The method is summarized in Algorithm 1.
Algorithm 1 Shamirs Secret Sharing Algorithm1: Input:2: : Secret value;3: : Dealer of secret ;4: : set of peers to distribute secret;5: Output:6: : Shares of secret, , for each peer ;7: Procedure:8: creates a random polynomial with
degree and a constant term .9: chooses publicly known random points, , such that .
10: computes share, , of each peer, , where .
In order to construct the secret value , any set ofpeers will need to share the information they have re-
ceived. After finding the polynomial , the secret valuecan be reconstructed. can be found us-
ing Lagrange interpolation such that where. The key observation is that at least points and
shares are required in order to determine a unique polyno-mial of degree .3 Problem Formulation
The problem of query processing across multiple privatedatabases is defined as follows:
Let , , ..., be the databases of a set ofdata source peers , and be aquery spanning through . The problem isto compute the answer of without revealing anyadditional information to any of the data sourcepeers.
Intersection, join and aggregation operations are particu-larly challenging from a privacy preserving query process-ing point of view since to perform these operations, all ofthe data residing at different sites may need to be accessedto compute an answer. Therefore, we focus on these opera-tions.
Agrawal et al.  solved the intersection and equijoinproblem by restricting them to two data sources with somerelaxation in an honest-but-curious  environment. Therelaxation reveals the sizes of the tables or lists in thedatabases to the other party. The method is based on an en-cryption/decryption protocol, which requires an extensiveamount of computation, and cannot be used for more thantwo data sources.
To reduce the high computational cost due to encryption/decryption, and to speed up the query processing time, wepropose a solution in an honest-but-curious environment forintersection, equijoin and aggregation queries. Our solu-tion is similar to  which solves some of these problems(i.e., only sum and average query processing) over onlydatawarehouses and reveals more information about the un-derlying datawarehouse. In our scheme, we use third parties
located in a peer-to-peer system, namely Chord . Al-though we use Chord in our system for the selection of thethird parties, our method can be adapted to any structuredpeer-to-peer system.
Similar to , we do not aim to solve the problem ofrevealing additional information to a peer which poses mul-tiple queries and combines their results in order to obtaininformation about the data. In addition, we do not solve theproblem of data discovery and schema mediation. Solutionsto these problems are discussed briefly in .
4 Query Processing in an Honest-but-curiousEnvironment
In this section, we introduce our query processing tech-nique and explain how it works in an honest-but-curiousenvironment. An honest-but-curious environment is onewhere the parties involved in a query follow the given pro-tocol; additionally they may keep any result or informationthey obtain during the course of the protocol. This settingis also known as semi-honest environment in the literature.In our scheme, queries are answered in a privacy preservingmanner using third parties. At the end of query processing,no additional useful information is revealed to any partiesinvolved in query processing but only the query answer isrevealed to the query posers.
One possible approach towards solving this problem isto use a one-way cryptographic hash function such as SHAor MD to find the intersection of two lists. In this scheme,data sources hash the secret values in their respective listsand send the hashed lists to a third party. Then, the thirdparty compares these hashed lists and sends the intersectionof two hashed lists to each of the data sources so that theycan figure out the intersection. Since the hash function isone-way and if the third party is not curious, the third partycannot learn the content of the lists from the hashed list andthe data sources cannot learn anything other than the inter-section. However, in practice, the number of one-way hashfunctions are limited, a curious third party may try all ofthem to extract the content of the hashed lists. Basically,the third party can check whether an element exists in theincoming hashed list or not, and thus, it might exhaustivelytry the entire domain to find out the data of the data source(Dictionary attack) if the domain size is small. Even if thethird party is not curious, only the equality check could bedone by the third party using this approach. Therefore, thissolution is inadequate to support aggregation queries (e.g.sum of values of specific keys without revealing values).
4.1 Intersection Que...