Data Markets in the Cloud: An Opportunity for the Database Community

Post on 25-Feb-2016

DESCRIPTION

Data Markets in the Cloud: An Opportunity for the Database Community. Magdalena Balazinska, Bill Howe, and Dan Suciu, University of Washington. Project supported in part by NSF and Microsoft. The Value ($$$) of Data. Buying and selling data are common operations.

TRANSCRIPT

Relational Data Markets in the Cloud: Challenges and Opportunities

Data Markets in the Cloud: An Opportunity for the Database Community
Magdalena Balazinska, Bill Howe, and Dan Suciu
University of Washington
Project supported in part by NSF and Microsoft

The Value ($$$) of Data
- Buying and selling data are common operations
- Real-time stock prices + trade data: $35,000/year (https://www.xignite.com)
- Land parcel information: $60,000/year (https://datamarket.azure.com)

Organized Data Market
- Logically centralized point for buying and selling data
- Facilitates data discovery
- Facilitates logistics of buying and selling data
- Public clouds are well-suited to support data markets
- Cloud data markets are indeed emerging!
We argue that organized data markets raise important challenges for our community.
Note: Facilitates logistics of buying and selling data across many buyers and many sellers.

Example 1: Azure DataMarket

Example 2: Infochimps

Technical Challenges (1): Challenges for economists
- Study the behavior of agents in a data market
- Study how data should be priced
  - E.g., pointless to price data based on production costs
  - E.g., useful to create versions for different market segments
- Inform public policy regulating the data market
Note: Economists will derive and express declarative pricing constraints; the DB community will develop the system to enforce these constraints and study their consequences on query processing.

Technical Challenges (2): Challenges for database community
- Develop and study pricing models for data
  - How should sellers specify pricing parameters?
  - How should the system compute prices based on seller input?
  - What are the properties of various pricing models?
- Develop supporting tools and services
  - Tools for expressing and computing prices
  - Tools for processing priced data
Note: Can we help sellers figure out prices?

Novel Problem
- Prior work at the intersection of DB and economics focused on resource management [Dash, Kantere, Ailamaki 2009] [Kantere et al. 2011] [Stonebraker et al. 1996]
- We are talking about putting a price on data

Example Scenario
- Seller has a database of business contact information
- Economist: Supply and demand dictate that
  - businesses in entire country: $600
  - businesses in one province or state: $300
  - one type of business: $50
- Buyer:
  - Q1: Businesses with more than 200 employees (selection)
  - Q2: Businesses in same city as Home Depot (self-join)
  - Q3: Businesses in cities with high yearly precipitation (join)
- How to satisfy buyer?

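
To make the three buyer queries in the Example Scenario concrete, the sketch below writes them as SQL over a toy database, run through Python's sqlite3 module. The table names, column names, sample rows, and the 800 mm precipitation threshold are all assumptions made for illustration; the slides do not specify a schema.

```python
import sqlite3

# Hypothetical schema and made-up rows; the slides do not give either.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE business(name TEXT, city TEXT, employees INTEGER);
    CREATE TABLE precipitation(city TEXT, yearly_mm INTEGER);
    INSERT INTO business VALUES
        ('Home Depot', 'Seattle', 300),
        ('Acme Tools', 'Seattle', 50),
        ('Big Freight', 'Phoenix', 500),
        ('Corner Cafe', 'Phoenix', 8);
    INSERT INTO precipitation VALUES ('Seattle', 950), ('Phoenix', 200);
""")


def names(sql, params=()):
    """Run a query and return the first column of every row, sorted."""
    return sorted(row[0] for row in conn.execute(sql, params))


# Q1 (selection): businesses with more than 200 employees.
q1 = names("SELECT name FROM business WHERE employees > 200")

# Q2 (self-join): businesses in the same city as Home Depot.
q2 = names("""
    SELECT DISTINCT b.name
    FROM business AS b JOIN business AS h ON b.city = h.city
    WHERE h.name = 'Home Depot' AND b.name <> 'Home Depot'
""")

# Q3 (join with a second dataset): businesses in cities with high
# yearly precipitation. The threshold is an arbitrary illustrative value.
q3 = names("""
    SELECT b.name
    FROM business AS b JOIN precipitation AS p ON b.city = p.city
    WHERE p.yearly_mm > ?
""", (800,))

print(q1, q2, q3)
```

Q1 touches a single table, Q2 joins the seller's table with itself, and Q3 needs a second dataset that may belong to a different owner, which is why the three queries stress a pricing scheme in different ways.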

Current Pricing: Fixed Prices
- Fixed price for entire dataset (CustomLists, Infochimps)
- Must create and price views specific to queries Q1, Q2, Q3
- OR user must buy entire dataset if view not available
- AND user must perform joins by herself
  - Certainly the case if datasets have different owners
Note: Emphasize that this is what is going on right now without the DB community.

Current Pricing: Subscriptions
- Subscriptions (Azure DataMarket, Infochimps API)
  - Fixed number of transactions per month
- Must create and price appropriate parameterized queries
  - Currently these queries are dataset specific (i.e., no joins!)
- Can satisfy Q1: Businesses with more than 200 employees
- Harder Q2: Businesses in same city as Home Depot
- Cannot Q3: Businesses in cities with high yearly precipitation

Other Data Pricing Issues
- Today's data pricing can also have bad properties
- Example: Weather Imagery on Azure DataMarket
    Transactions    Price
    1,000,000       $2,400
    100,000         $600
    10,000          $120
    2,500           $0
- Arbitrage opportunity: Emulate many users
  - Get as much data as you want for free!
Challenge 1: Develop pricing models that are flexible yet have provable, good properties (e.g., no arbitrage)

Potential Approach: View-Based Pricing
- Seller specifies a set of queries Q1, ..., Qn
- And their prices: price(Q1), ..., price(Qn)
  - D = all businesses in North America
  - V1 (businesses in Canada) = $600
  - V2 (businesses in Alberta) = $300
  - V3 (all Shell businesses) = $50
  - Etc.
Note: Economist puts constraints on prices and we build systems that fill in the gaps.

Potential Approach: View-Based Pricing (continued)
- System computes other query prices
  - Q2: Businesses in same city as Home Depot, etc.
- Price computation is automated
  - Solved as a constrained optimization problem
- System guarantees price properties
  - For example, ensures that no arbitrage is possible

Data Pricing Challenges
- Understand properties of pricing schemes
  - When can we guarantee that no arbitrage is possible?
- How to handle data updates?
  - Will updates require changes to prices?
- How to handle price updates?
  - Will one price change affect all others?
- How to price value-added of data transformations?
  - Should a self-join query be more expensive than a selection?
  - Should queries with empty results be free?
- How to price data properties (e.g., cleanliness)?
Note: Easy to use?

Data Market Tools
- Efficient query-price computer
  - Data pricing should not add much overhead to query processing
  - But some techniques (e.g., provenance-pricing) are expensive
- Pricing updates
  - Given an earlier user-query with a price
  - Compute price of incremental query output after updates
Challenge 2: Build systems that compute query prices with minimum overhead

Data Market Tools (continued)
- Price-aware query optimizer
  - Answer query over multiple datasets as cheaply as possible
  - Predict the price of a query result (quantify uncertainty)
  - Study potential benefits of incremental query processing
Challenge 3: Build price-aware query optimizers

Data Market Tools (continued)
- Pricing Advisor
  - Checks properties of a pricing scheme
  - Helps set and tune prices based on data provider goals
  - Computes prices of new views
  - Explains income or bill
  - Compares data providers with different pricing schemes
Challenge 4: Build support tools for buyers and sellers

Conclusion
- Data helps drive businesses and applications
- Organized data markets emerging, facilitated by clouds
- But need the right tools to maximize success
  - Theory of data pricing
  - Systems for computing prices, checking properties, etc.
http://data-pricing.cs.washington.edu

[Backup slide titles: Fixed Price; Cheaper by province]

Strawman 3: View-Based Pricing
- This is a constrained optimization problem
  - Each query price is a constraint
  - Can add other constraints: e.g., total price of DB
- Two methods to derive prices of new queries
  - Reverse-engineer price of base tuples subject to constraints
    - Assume a function that converts base tuple prices into query prices
    - Compute base tuple prices in a way that maximizes entropy, user utility, or other function subject to constraints
  - Compute new query prices directly

Strawman 1: PRICE-Semiring
Approach
- Assign a price to individual base tuples
- Automatically compute price of query result, using the semiring (R+, min, +, ∞, 0)

  R:  A  B  C   price        S:  B  D   price
      a  b  e   p                b  x   q
      d  b  g   r                c  x   t
      a  c  e   s                b  y   u

  A pricing function on tuples:
    p = $0.1, r = $0.01, s = $0.5, q = $0.02, t = $0.03, u = $0.04

  Q = SELECT DISTINCT A, D FROM R, S WHERE R.B = S.B AND S.D = 'x'

  A pricing calculation:
    A  D   price
    a  x   min(p + q, s + t) = $0.12
    d  x   r + q = $0.03
    price(Q) = $0.15

Strawman 1: PRICE-Semiring
Benefits
- Support datasets where different tuples have different values
- Allow users to ask arbitrary queries
- Can compute prices across datasets and even data owners
- Avoids some bad properties such as arbitrage
But
- Limited flexibility: e.g., submodular pricing impossible
- Odd prices? Self-join can be more expensive than dataset

Strawman 2: Provenance Expressions
Approach: Same as above BUT
- Derive provenance information for each result tuple
- Price is a function of provenance expressions

  Same R, S, Q, and tuple prices as in Strawman 1.

    A  D   provenance
    a  x   p, q, s, and t
    d  x   r and q

  A pricing calculation (applying a 25% discount):
    price(Q) = f(price(p), price(q), price(r), price(s), price(t))
             = 0.75 × (0.1 + 0.02 + 0.01 + 0.5 + 0.03)
             = $0.495, or about $0.50

Strawman 2: Provenance Expressions
Benefits
- More powerful pricing functions become possible
  - E.g., submodular pricing
But
- Properties with complex pricing need studying
- Naive implementation could be highly inefficient

Data Pricing Issues (continued)
- Lump sum or subscription pricing is also inflexible
  - For lump sum, can only buy pre-defined views
  - For subscription, can only ask pre-defined queries
- Would like arbitrary queries over multiple datasets
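
The price calculations on the Strawman 1 and Strawman 2 slides can be reproduced with a short sketch. It assumes, as the slides' numbers suggest, that a join adds the prices of the joined tuples, that duplicate elimination keeps the cheapest derivation, and that the query price is the sum over result tuples. For the provenance variant it assumes each contributing base tuple is charged once before the 25% discount. The function names and the dictionary encoding are hypothetical and are not taken from the authors' system.

```python
# Tuple prices from the slides, keyed by the annotation of each base tuple.
PRICES = {"p": 0.10, "r": 0.01, "s": 0.50, "q": 0.02, "t": 0.03, "u": 0.04}

# Relations R(A, B, C) and S(B, D); each row is paired with its annotation.
R = [(("a", "b", "e"), "p"), (("d", "b", "g"), "r"), (("a", "c", "e"), "s")]
S = [(("b", "x"), "q"), (("c", "x"), "t"), (("b", "y"), "u")]


def derivations():
    """Map each result tuple (A, D) of Q to the list of its derivations.

    Q = SELECT DISTINCT A, D FROM R, S WHERE R.B = S.B AND S.D = 'x'
    Each derivation is the pair of base-tuple annotations that were joined.
    """
    out = {}
    for (a, b, _c), r_label in R:
        for (b2, d), s_label in S:
            if b == b2 and d == "x":
                out.setdefault((a, d), []).append((r_label, s_label))
    return out


def semiring_price():
    """Price under (R+, min, +, inf, 0): join is +, DISTINCT is min."""
    per_tuple = {
        row: min(PRICES[l1] + PRICES[l2] for l1, l2 in derivs)
        for row, derivs in derivations().items()
    }
    return per_tuple, sum(per_tuple.values())


def provenance_price(discount=0.25):
    """Charge once per base tuple in the provenance, then apply a discount."""
    used = {
        label
        for derivs in derivations().values()
        for pair in derivs
        for label in pair
    }
    return (1 - discount) * sum(PRICES[label] for label in used), used


per_tuple, total = semiring_price()
prov_total, used = provenance_price()
print(per_tuple, round(total, 2), sorted(used), round(prov_total, 3))
```

Under these assumptions the semiring total comes to $0.15, matching the slide, and the provenance total comes to $0.495, which the slide rounds to $0.50. Tuple u never contributes because its D value is y, so it is charged in neither scheme.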