web 2.0 data analysis daniel deutch. data management “data management is the development,...

50
Web 2.0 Data Analysis DANIEL DEUTCH

Upload: alexina-thompson

Post on 12-Jan-2016

235 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Web 2.0 Data AnalysisDANIEL DEUTCH

Page 2: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Data Management

“Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets”

(DAMA Data Management Body of Knowledge)

A major success:

the relational model of databases

Page 3: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Relational Databases

• Developed by Codd (1970), who won the Turing award for the model

• Huge success and impact: ‒ The vast majority of organizational data today is stored in relational

databases‒ Implementations include MS SQL Server, MS excel, Oracle DB,

mySQL,… ‒ 2 Turing award winners (Edgar F. Codd and Jim Gray)

• Basic idea: data is organized in tables (=relations)

• Relations can be derived from other relations using a set of operations called the relational algebra ‒ On which SQL is largely based

Page 4: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Research in Data(base) Management• 1970- : Relational Databases (tables).

‒ Indexing, Tuning, Query Languages, Optimizations, Expressive Power,….

• ~20 years ago: Emergence of the Web and research on Web data‒ XML, text database, web graph….‒ Google is a product of this research

(by Stanford’s PhD students Brin and Page)

• Recent years: hot topics include distributed databases, data privacy, data integration, social networks, web applications, crowdsourcing, trust,…‒ Foundations taken from “classical” database research

• Theoretical foundations with practical impact

Page 5: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Web 2.0• “Old” web (“Web 1.0”): static pages

– News, encyclopedic knowledge...

– No, or very little, interactive process between the web-page and the user.

• Web 2.0: A term very broadly used for web-sites that use new technologies (Ajax, JS..), allowing interaction with the user.

– “Network as platform" computing

– The “participatory Web”

Page 6: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Web 2.0• “Old” web (“Web 1.0”): static pages

– News, encyclopedic knowledge...

– No, or very little, interactive process between the web-page and the user.

• Web 2.0: A term very broadly used for web-sites that use new technologies (Ajax, JS..), allowing interaction with the user.

– “Network as platform" computing

– The “participatory Web”

Page 7: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Online shopping

Page 8: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Advertisements

Page 9: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Social Networks

Page 10: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Collaborative Applications

Page 11: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Data is all around

• Web graph

• “Social graph”

• Pictures, Videos, notifications, messages..

• Data that the application processes

• Advertisments

• Even the application structure itself

Page 12: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

(A small portion of) the web graph

Page 13: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices
Page 14: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

News Feed

Page 15: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Need to Analyze

• Huge amount of data out there

– Est. 13.68 billion web-pages and counting

– Half a billion tweets per day and counting

• An average user “sees” about 600 tweets per day

• Most of it is irrelevant for you, some is incorrect

Page 16: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Maybe we shouldn’t

Page 17: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Haaretz, 7/3/13

Page 18: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Static vs. Dynamic data

• For “static” web data: problem addressed by search engines

– Allow users to specify criteria for search (e.g. keywords), and rank the results.

• For “dynamic” web 2.0 data (tweets, notifications, web applications):

– No satisfying solution yet.

– Data volume is constantly increasing.

Page 19: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Filter, Rank, Explain

• Filter– Select the portion of data that is relevant– Group similar results

• Rank– Rank data by trustworthiness, relevance, recency...– Present highest-rank first

• Explain– An explanation of why is the data considered

relevant/highly-ranked– An explanation of how has the data propagated

• “Why do I see this?”

Page 20: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Approach

• Leverage knowledge from “classic” database research

• Account for the new challenges

• Do so in a generic manner

• Leverage unique features such as collaborative contribution, distribution, etc.

Page 21: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

21

SSN Name Category123-45-6789 Charles undergrad234-56-7890 Dan grad

… …

Students

Physical Storage

Indexing

Distribution

...

Data model Query language

Select…

From …

Where…

Students Takes

sid=sid

sname

name=“Mary ”

cid=cid

Courses

Page 22: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Foundations

• Model

• Query Language

• Query evaluation algorithms

• Prototype implementation and optimizations

• Getting Data and Testing

Page 23: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Two concrete examples

• Analysis of Web Applications

• Analysis of Social Networks

Page 24: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Analysis of Web Applications

Page 25: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Analysis Questions

• For the application owner:

– Can the user make a reservation without giving credit card details?

– What is the probability that a user would choose a category for which there are no available products?

• For the application user:

– How can I get the best deals (by navigating)?

– How can I navigate in an optimal way (minimal number of clicks)

Page 26: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

ShopIT-Shopping Assistant

26

Page 27: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

ShopIT-Shopping Assistant

27

Page 28: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

ShopIT-Shopping Assistant

28

Page 29: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Modeling

• Model should be declarative, intuitive

• Abstract representation• Allows readability, portability, applicability,

optimizations…

• Tradeoff between expressivity and complexity

Page 30: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Application Model

Page 31: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Effects guiding the execution

• Dynamic effects such as the choice of which link to follow

– Probability/cost value can be associated with every such choice

– Cumulative: if the same choice is made twice, double the cost

• Static effects such as available products in the database

– Static in the sense that it is constant in a single execution

– Thus non-cumulative

Page 32: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Filter, Rank, Explain

• Filter– Choose “interesting” navigation sequences

• Rank– Rank different sequences by length, purchase

cost,…• Explain

– Why are these the best sequences? What do they entail in terms of purchase? How can they be refined?

Page 33: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Query Language

• Desired executions/navigation sequences are typically expressed in temporal logic

• Database analysis is typically done through a First-order-based query language (e.g. SQL)

• A query language for data-dependent application should combine both.

• While allowing for efficient evaluation.

• LTL-FO

Page 34: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Query Evaluation

• Problem: Given a process specification and a query, evaluate the query

– To find all executions that satisfy the criteria

– To find the probability that the criteria are satisfied

– To find the minimal cost execution realizing the criteria

– …

Page 35: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Interesting problems are hard (sometimes)

• The (corresponding boolean) problems are NP-hard

‒ Very unlikely that they can be solved efficiently for worst case input

• This is often the case when modeling real-life problems

• To be considered, but not to be feared!

Page 36: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

What do you do with a theoretically hard problem?

• Find restricted, yet realistic, cases that are “easy”, i.e. admit efficient worst case algorithms

‒ This also allows to have a better understanding of “where is the hardness”

‒ Example: If there are only “few” static choices, the problem becomes easy

• Design optimization techniques for practically efficient implementations

Page 37: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Optimization

• One idea is based on the notion of data provenance– Originally for “classic” database queries, provenance is a

“concise” representation of the query computation

• Here we generalize the approach for the temporal queries– Allows for a generic approach and many optimizations

• A system prototype for provenance-aware analysis of processes is under development

Page 38: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

How do we get data?

• Some applications (e.g. Yahoo! Shopping) allows a (limited) interface to their database.

• Application structure can be obtained by crawling.

• Allows to prove that the model is realistic, and to test feasibility of algorithms on real-life data scale

Page 39: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Open problems

• Uncertainty and partial knowledge

• Data updates by the application

• Incremental maintenance

‒ How do modifications to the database can be accounted for without restarting the analysis?

Page 40: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Analysis of Social Networks

• Social network common model of disseminating data is in a “push” mode

• Facebook’s news feed, Twitter’s tweets…

• Huge load of information, and it continues to grow

• No standard way of searching and ranking such posts

Page 41: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

A Machine Learning Approach

• “Learn” users preferences

– Ask them whether they like a particular post/tweet, or use existing mechanisms

• Try to generalize by extracting characteristic features (e.g. topics)

• Use the generalized function to predict future preferences

Page 42: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

A data management approach

• Allow users to tell the system their preferences, through a “query” language.

• One family of such languages are rule-based languages

– Prolog, Datalog,...

• Simpler than standard programming languages

• Allows for easier analysis

Page 43: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Rule based language(Logic Programming)

descendent(X,Y):- parent(X,Y)

descendent(X,Y):- parent(X,Z),descendent(Z,Y)

Page 44: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

For tweeter

See(tweet):-Tweets(Tova,tweet)See(tweet):-Tweets(x,tweet), related(x,"basketball")Attending(party):- count[friend(x),like(x,party)]>10

Page 45: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Learning

• Unlike standard database querying, some of the predicates may be difficult to evaluate

• How do we know that a post is related to basketball?

• What can we do?

Page 46: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

What can we do

• Text analysis

– Problem: An ”I love basketball” tweet is not really a good match

• Use tagging if exists

• Ask other peers in the network!

– Can also be used to propose predicates to users

Page 47: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Influence

• Predict effects and allow users to influence friends

• "How do I get at least 5 of my friends to come with the party with me?"

• Need them to (1) see the party, (2) click "attending"

• "How do I get at least 1000 people to go to a political rally?"

• Can be answered by an analysis of the rules!

Page 48: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Interface for Specifying Rules

• Cannot expect Facebook/Twitter users to be able to program

• Need to design an interface

• The interface components can be dynamically designed according to answers by other peers

• An ongoing research

Page 49: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Conclusions

• Web 2.0 data poses new challenges in research

• Data is dynamic, access mode is unique, analysis goals different than in classic databases

• Lots of research on theoretical questions with strong practical applications

Page 50: Web 2.0 Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices

Questions?