Mirrors and Crystal Balls: A Personal Perspective on Data Mining

Post on 30-Dec-2015

Page 1: Mirrors and Crystal Balls A Personal Perspective on Data Mining

ACM SIGKDD Innovation Award

Raghu Ramakrishnan

Mirrors and Crystal Balls: A Personal Perspective on Data Mining

Page 2: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Outline

• This award recognizes the work of many people, and I represent the many
– A warp-speed tour of some earlier work
• What’s a data mining talk without predictions?
– Some exciting directions for data mining that we’re working on at Yahoo!

Page 3: Mirrors and Crystal Balls A Personal Perspective on Data Mining

A Look in the Mirror …

(and the faces I found there: unfortunately, couldn’t find photos for some people)

(and apologies in advance for not discussing the related work that provided context and, often, tools and motivation)

Page 4: Mirrors and Crystal Balls A Personal Perspective on Data Mining

1987 2007

Page 5: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Sequences, Streams

• SEQ
– Sequence Data Processing. P. Seshadri, M. Livny, and R. Ramakrishnan. SIGMOD 1994
– SEQ: A Model for Sequence Databases. P. Seshadri, M. Livny, and R. Ramakrishnan. ICDE 1995
– The Design and Implementation of a Sequence Database System. P. Seshadri, M. Livny, and R. Ramakrishnan. VLDB 1996
• SRQL
– SRQL: Sorted Relational Query Language. R. Ramakrishnan, D. Donjerkovic, A. Ranganathan, K. Beyer, and M. Krishnaprasad. SSDBM 1998

Page 6: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Scalable Clustering

• BIRCH
– BIRCH: A Clustering Algorithm for Large Multidimensional Datasets. T. Zhang, R. Ramakrishnan, and M. Livny. SIGMOD 1996
– Fast Density Estimation Using CF-Kernels. T. Zhang, R. Ramakrishnan, and M. Livny. KDD 1999
– Clustering Large Databases in Arbitrary Metric Spaces. V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, and J. French. ICDE 1999
• Clustering Categorical Data
– CACTUS: A Scalable Clustering Algorithm for Categorical Data. V. Ganti, J. Gehrke, and R. Ramakrishnan. KDD 1999

Page 7: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Scalable Decision Trees

• RainForest
– RainForest: A Framework for Fast Decision Tree Construction of Large Datasets. J. Gehrke, R. Ramakrishnan, and V. Ganti. VLDB 1998
• BOAT
– BOAT: Optimistic Decision Tree Construction. J. Gehrke, V. Ganti, R. Ramakrishnan, and W-Y. Loh. SIGMOD 1999

Page 8: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Streaming and Evolving Data, Incremental Mining

• FOCUS
– FOCUS: A Framework for Measuring Changes in Data Characteristics. V. Ganti, J. Gehrke, R. Ramakrishnan, and W-Y. Loh. PODS 1999
• DEMON
– DEMON: Mining and Monitoring Evolving Data. V. Ganti, J. Gehrke, and R. Ramakrishnan. ICDE 1999

Page 9: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Mass Collaboration

• The QUIQ Engine: A Hybrid IR-DB System. N. Kabra, R. Ramakrishnan, and V. Ercegovac. ICDE 2003

• Mass Collaboration: A Case Study. R. Ramakrishnan, A. Baptist, V. Ercegovac, M. Hanselman, N. Kabra, A. Marathe, U. Shaft. IDEAS 2004

[Figure: mass collaboration workflow. Labels: Customer, Question, Self Service, Knowledge Base, Answer, Support Agent, Partner Experts, Customer Champions, Employees. Caption: “Answer added to power self service”.]

Page 10: Mirrors and Crystal Balls A Personal Perspective on Data Mining

OLAP, Hierarchies, and Exploratory Mining

• Prediction Cubes. B-C. Chen, L. Chen, Y. Lin, R. Ramakrishnan. VLDB 2005

• Bellwether Analysis: Predicting Global Aggregates from Local Regions. B-C. Chen, R. Ramakrishnan, J.W. Shavlik, P. Tamma. VLDB 2006

Page 11: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Hierarchies Redux

• OLAP Over Uncertain and Imprecise Data. D. Burdick, P. Deshpande, T.S. Jayram, R. Ramakrishnan, S. Vaithyanathan. VLDB 2005

• Efficient Allocation Algorithms for OLAP Over Imprecise Data. D. Burdick, P.M. Deshpande, T. S. Jayram, R. Ramakrishnan, S. Vaithyanathan.

• Learning from Aggregate Views. B-C. Chen, L. Chen, D. Musicant, and R. Ramakrishnan. ICDE 2006

• Mondrian: Multidimensional K-Anonymity. K. LeFevre, D.J. DeWitt, R. Ramakrishnan. ICDE 2006

• Workload-Aware Anonymization. K. LeFevre, D.J. DeWitt, R. Ramakrishnan. KDD 2006

• Privacy Skyline: Privacy with Multidimensional Adversarial Knowledge. B-C. Chen, R. Ramakrishnan, K. LeFevre. VLDB 2007

• Composite Subset Measures. L. Chen, R. Ramakrishnan, P. Barford, B-C. Chen, V. Yegneswaran. VLDB 2006

Page 12: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Many Other Connections …

• Scalable Inference
– Optimizing MPF Queries: Decision Support and Probabilistic Inference. H. Corrada Bravo, R. Ramakrishnan. SIGMOD 2007
• Relational Learning
– View Learning for Statistical Relational Learning, with an Application to Mammography. J. Davis, E.S. Burnside, I. Dutra, David Page, R. Ramakrishnan, V. Santos Costa, J.W. Shavlik.

Page 13: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Community Information Management

• Efficient Information Extraction over Evolving Text Data. F. Chen, A. Doan, J. Yang, R. Ramakrishnan. ICDE 2008

• Toward Best-Effort Information Extraction. W. Shen, P. DeRose, R. McCann, A. Doan, R. Ramakrishnan. SIGMOD 2008

• Declarative Information Extraction Using Datalog with Embedded Extraction Predicates. W. Shen, A. Doan, J.F. Naughton, R. Ramakrishnan. VLDB 2007

• Source-aware Entity Matching: A Compositional Approach. W. Shen, P. DeRose, L. Vu, A. Doan, R. Ramakrishnan. ICDE 2007

Page 14: Mirrors and Crystal Balls A Personal Perspective on Data Mining

… Through the Looking Glass

Prediction is very hard, especially about the future. Yogi Berra

Page 15: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Information Extraction

… and the challenge of managing it

Page 16: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Page 17: Mirrors and Crystal Balls A Personal Perspective on Data Mining

DBLife

Integrated information about a (focused) real-world community

Collaboratively built and maintained by the community

CIMple software package

Page 18: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Search Results of the Future

[Figure: example sites shown on the slide: babycenter, epicurious, yelp.com, answers.com, LinkedIn, webmd, Gawker, New York Times]

(Slide courtesy Andrew Tomkins)

Page 19: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Opening Up Yahoo! Search

Phase 1: Giving site owners and developers control over the appearance of Yahoo! Search results.

Phase 2: BOSS takes Yahoo!’s open strategy to the next level by providing Yahoo! Search infrastructure and technology to developers and companies to help them build their own search experiences.

(Slide courtesy Prabhakar Raghavan)

Page 20: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Custom Search Experiences

Social Search

Vertical Search

Visual Search

(Slide courtesy Prabhakar Raghavan)

Page 21: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Economics of IE

• Data $, Supervision $
– The cost of supervision, especially large, high-quality training sets, is high
– By comparison, the cost of data is low
• Therefore
– Rapid training set construction/active learning techniques
– Tolerance for low- (or low-quality) supervision
– Take feedback and iterate rapidly

Page 22: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Example: Accepted Papers

• Every conference comes with a slightly different format for accepted papers
– We want to extract accepted papers directly (before they make their way into DBLP etc.)
• Assume
– Lots of background knowledge (e.g., DBLP from last year)
– No supervision on the target page
• What can you do?

Page 23: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Page 24: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Down the Page a Bit

Page 25: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Record Identification

• To get started, we need to identify records
– Hey, we could write an XPath, no?
– So, what if no supervision is allowed?
• Given a crude classifier for paper records, can we recursively split up this page?
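The recursive-splitting idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the method used in the talk: `Node` is a hypothetical, simplified stand-in for a DOM node, and `looks_like_record` is a hand-written heuristic standing in for the "crude classifier". The page is split top-down, and a subtree is emitted as a record as soon as the classifier accepts it.

```python
import re
from dataclasses import dataclass, field
from typing import List

MAX_WORDS = 20  # assumption: a single paper record is a short span


@dataclass
class Node:
    """Hypothetical simplified DOM node: own text plus child nodes."""
    text: str = ""
    children: List["Node"] = field(default_factory=list)

    def full_text(self) -> str:
        parts = [self.text] + [c.full_text() for c in self.children]
        return " ".join(p for p in parts if p)


def looks_like_record(text: str) -> bool:
    """Crude stand-in classifier: a short span with a comma and at least
    two capitalized word pairs (title words or person names)."""
    words = text.split()
    if not 4 <= len(words) <= MAX_WORDS:
        return False
    pairs = re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+", text)
    return len(pairs) >= 2 and "," in text


def split_records(node: Node) -> List[str]:
    """Emit a subtree if it looks like one record; otherwise recurse."""
    text = node.full_text()
    if looks_like_record(text):
        return [text]
    records: List[str] = []
    for child in node.children:
        records.extend(split_records(child))
    return records


# Made-up page: a heading followed by a list of three papers.
page = Node(children=[
    Node(text="Accepted Papers"),
    Node(children=[
        Node(text="Scalable Clustering of Very Large Graphs. Alice Smith, Bob Jones"),
        Node(text="Mining Evolving Data Streams at Web Scale. Carol White, Dan Brown, Eve Black"),
        Node(text="Declarative Extraction over Changing Text Collections. Frank Green, Grace Hall"),
    ]),
])
records = split_records(page)
```

The heading is dropped because no subtree containing it passes the classifier, and the list node is split further because it is too long to be a single record.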

Page 26: Mirrors and Crystal Balls A Personal Perspective on Data Mining

First Level Splits

Page 27: Mirrors and Crystal Balls A Personal Perspective on Data Mining

After More Splits …

Page 28: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Now Get the Records

• Goal: To extract fields of individual records

• We need training examples, right?
– But these papers are new
• The best we can do without supervision is noisy labels.
– From having seen other such pages
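One way to picture partial, noisy labels is dictionary matching against background knowledge. The sketch below is an assumption for illustration, not the technique from the talk: `known_authors` is a made-up stand-in for names harvested from last year's bibliography, and segments that match no known name simply stay unlabeled.

```python
import re
from typing import List, Optional, Set, Tuple


def noisy_labels(record: str,
                 known_authors: Set[str]) -> List[Tuple[str, Optional[str]]]:
    """Split a record on periods and commas, then label each segment
    'AUTHOR' if it matches a known name; otherwise leave it as None."""
    segments = [s.strip() for s in re.split(r"[.,]", record) if s.strip()]
    return [(s, "AUTHOR" if s.lower() in known_authors else None)
            for s in segments]


# Made-up background knowledge and a made-up new record.
known_authors = {"alice smith", "bob jones"}
record = ("Scalable Clustering of Very Large Graphs. "
          "Alice Smith, Bob Jones, Carol White")
labels = noisy_labels(record, known_authors)
```

Here "Carol White" is a new author and stays unlabeled, which is what makes the labels partial; a title segment that happened to equal a known name would be mislabeled, which is what makes them noisy.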

Page 29: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Partial, Noisy Labels

Page 30: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Extracted Records

Page 31: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Refining Results via Feedback

• Now let’s shift slightly to consider extraction of publications from academic home pages
– Must identify publication sections of faculty home pages, and extract paper citations from them
• Underlying data model for extracted data is
– A flexible graph-based model (similar to RDF or ER conceptual model)
– “Confidence” scores per-attribute or relationship

Page 32: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Extracted Publication Titles

Page 33: Mirrors and Crystal Balls A Personal Perspective on Data Mining

A Dubious Extracted Publication…

PSOX provides declarative lineage tracking over operator executions

Page 34: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Where’s the Problem?

Use lineage to find the source of the problem.
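To make the idea of tracing an extracted value back through operator executions concrete, here is a toy sketch. It is not PSOX's interface: the `Item` class, the operator names, and the example URL are all hypothetical, chosen only to show how each derived item can remember which operator produced it and from which inputs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Item:
    """A value plus the (hypothetical) operator execution that produced it."""
    value: str
    operator: str
    inputs: Tuple["Item", ...] = ()


def trace(item: Item) -> List[str]:
    """Walk lineage from an extracted item back to its sources."""
    steps = [f"{item.operator}: {item.value}"]
    for parent in item.inputs:
        steps.extend(trace(parent))
    return steps


# Made-up pipeline: crawl a page, classify a section, extract a title.
page = Item("http://example.org/~prof/talks.html", "crawl")
section = Item("section 'Invited Talks'", "classify_publication_section", (page,))
title = Item("Keynote at Some Workshop", "extract_title", (section,))

chain = trace(title)
for step in chain:
    print(step)
```

Reading the chain from top to bottom leads from the dubious title to the section classifier and then to the source page, which is where a user would look for the misclassification.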

Page 35: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Source Page

Hmm, not a publication page … (but may have looked like one to a classifier)

Page 36: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Feedback

User corrects the classification of that section.

Page 37: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Faculty or Student?

• NLP
• Build a Classifier
• Or…

Page 38: Mirrors and Crystal Balls A Personal Perspective on Data Mining

…Another Clue…

Page 39: Mirrors and Crystal Balls A Personal Perspective on Data Mining

…Stepping Back…

[Diagram: entity graph with Student and Prof nodes, Student-List and Prof-List nodes, and AdvisorOf links]

• Leads to large-scale, partially-labeled relational learning

• Involving different types of entities and links

Page 40: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Maximizing the Value of What You Select to Show Users

p1 p2 p3

Page 41: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Content Optimization

• PROBLEM: Match-making between content, user, context
– Content
    • Programmed (e.g., editors); Acquired (e.g., RSS feeds, UGC)
– User
    • Individual (e.g., b-cookie), or user segment
– Context
    • E.g., Y! or non-Y! property; device; time period
• APPROACH: Scalable algorithms that select content to show, using editorially determined content mixes, and respecting editorially set constraints and policies.

Page 42: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Team from Y! Research

Raghu Ramakrishnan

Deepak Agarwal

Pradheep Elango

Seung-Taek Park

Wei Chu

Bee-Chung Chen

Page 43: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Team from Y! Engineering

Scott Roy

Nitin Motgi

Joe Zachariah

Kenneth Fox

Todd Beaupre

Page 44: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Yahoo! Home Page Featured Box

• It is the top-center part of the Y! Front Page

• It has four tabs: Featured, Entertainment, Sports, and Video

Page 45: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Traditional Role of Editors

• Strict quality control
– Preserve “Yahoo! Voice”
    • E.g., typical mix of content
– Community standards
– Quality guidelines
    • E.g., Topical articles shown for limited time
• Program articles periodically
– New ones pushed, old ones taken out
• Few tens of unique articles per day
– 16 articles at any given time; editors keep up with novel articles and remove fading ones
– Choose which articles appear in which tabs

Page 46: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Content Optimization Approach

• Editors continue to determine content sources, program some content, determine policies to ensure quality, and specify business constraints
– But we use a statistically based machine learning algorithm to determine what articles to show where when a user visits the FP

Page 47: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Modeling Approach

• Pure feature based (did not work well):
– Article: URL, keywords, categories
– Build offline models to predict CTR when article shown to users
– Models considered
    • Logistic Regression with feature selection
    • Decision Trees, Feature segments through clustering
• Track CTR per article in user segments through online models
– This worked well; the approach we took eventually

Page 48: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Challenges

• Non-stationary CTR
• To ensure webpage stability, we show the same article until we find a better one
– CTR decays over time; sharply at F1
– Time-of-day; day-of-week effect in CTR

Page 49: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Modeling Approach

• Track item scores through dynamic linear models (fast Kalman filter algorithms)

• We model decay explicitly in our models

• We have a global time-of-day curve explicitly in our online models
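As a rough picture of what tracking an item score with a dynamic linear model looks like, here is a minimal scalar Kalman filter for a local-level model (the true CTR drifts as a random walk and each interval gives a noisy observation of it). This is a textbook simplification, not the model from the talk: it leaves out the explicit decay and time-of-day components, and the variances and observations below are made-up numbers.

```python
from dataclasses import dataclass


@dataclass
class LocalLevelFilter:
    """Scalar Kalman filter for x_t = x_{t-1} + w_t, y_t = x_t + v_t."""
    mean: float         # current CTR estimate
    var: float          # uncertainty of the estimate
    process_var: float  # variance of the random-walk drift w_t (assumed)
    obs_var: float      # variance of the observation noise v_t (assumed)

    def update(self, observed_ctr: float) -> float:
        # Predict: the state may have drifted since the last interval.
        prior_var = self.var + self.process_var
        # Correct: weight the new observation by the Kalman gain.
        gain = prior_var / (prior_var + self.obs_var)
        self.mean += gain * (observed_ctr - self.mean)
        self.var = (1.0 - gain) * prior_var
        return self.mean


# Made-up per-interval CTR observations for one article that is fading.
tracker = LocalLevelFilter(mean=0.05, var=0.01, process_var=1e-4, obs_var=1e-3)
estimates = [tracker.update(y) for y in [0.08, 0.07, 0.03, 0.02]]
```

A larger `process_var` makes the filter forget old data faster, which is the knob that matters when CTRs are non-stationary.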

Page 50: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Explore/Exploit

• What is the best strategy for new articles?
– If we show it and it’s bad: lose clicks
– If we delay and it’s good: lose clicks
• Solution: Show it while we don’t have much data if it looks promising
– Classical multi-armed bandit type problem
– Our setup is different from the ones studied in the literature; new ML problem
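For readers unfamiliar with the classical problem, the sketch below shows Thompson sampling on a Beta-Bernoulli bandit, one standard explore/exploit technique. It illustrates the classical setting only; the slides state that the actual setup differs from it, and the class name, article names, and click-through rates here are invented for the simulation.

```python
import random
from typing import Dict, List


class BetaBandit:
    """Thompson sampling with a Beta(1, 1) prior on each article's CTR."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.stats: Dict[str, List[int]] = {}  # article -> [clicks, non-clicks]

    def add_article(self, article: str) -> None:
        self.stats.setdefault(article, [0, 0])

    def choose(self) -> str:
        # Draw a plausible CTR for each article and show the best draw.
        draws = {a: self.rng.betavariate(c + 1, n + 1)
                 for a, (c, n) in self.stats.items()}
        return max(draws, key=draws.get)

    def record(self, article: str, clicked: bool) -> None:
        self.stats[article][0 if clicked else 1] += 1


# Simulation with made-up true CTRs.
true_ctr = {"old_story": 0.02, "new_story": 0.10}
bandit = BetaBandit(seed=1)
for name in true_ctr:
    bandit.add_article(name)

user_rng = random.Random(2)
shows = {name: 0 for name in true_ctr}
for _ in range(3000):
    article = bandit.choose()
    shows[article] += 1
    bandit.record(article, user_rng.random() < true_ctr[article])
```

Early on both articles are shown because their posteriors are wide; as evidence accumulates, traffic shifts toward the article with the higher click-through rate.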

Page 51: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Novel Aspects

• Classical: Arms assumed fixed over time
– We gain and lose arms over time
    • Some theoretical work by Whittle in 80’s; operations research
• Classical: Serving rule updated after each pull
– We compute optimal design in batch mode
• Classical: Generally, CTR assumed stationary
– We have highly dynamic, non-stationary CTRs

Page 52: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Some Other Complications

• We run multiple experiments (possibly correlated) simultaneously; effective sample size calculation a challenge

• Serving Bias: Incorrect to learn from data for serving scheme A and apply to serving scheme B
– Need unbiased quality score
– Bias sources: positional effects, time effect, set of articles shown together
• Incorporating feature-based techniques
– Regression style, e.g., logistic regression
– Tree-based (hierarchical bandit)
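The serving-bias point can be illustrated with inverse propensity scoring, a standard way to reweight logs collected under one serving scheme so they estimate performance under another. The slides do not say which correction was used, so treat this as a swapped-in textbook technique; the article names, probabilities, and log are made up, and the sketch ignores the positional and co-display effects the slide lists.

```python
from typing import Dict, List, Tuple


def ips_ctr(log: List[Tuple[str, bool]],
            logging_prob: Dict[str, float],
            target_prob: Dict[str, float]) -> float:
    """Estimate the CTR of a target serving scheme from impressions
    logged under a different scheme, via inverse propensity weights."""
    total = 0.0
    for article, clicked in log:
        weight = target_prob[article] / logging_prob[article]
        total += weight * (1.0 if clicked else 0.0)
    return total / len(log)


# Scheme A showed article x 80% of the time and article y 20% of the time.
scheme_a = {"x": 0.8, "y": 0.2}
# Scheme B would always show article y.
scheme_b = {"x": 0.0, "y": 1.0}
# Made-up log under scheme A: 8 views of x (1 click), 2 views of y (1 click).
log = [("x", True)] + [("x", False)] * 7 + [("y", True), ("y", False)]

naive = sum(clicked for _, clicked in log) / len(log)  # mixes both articles
corrected = ips_ctr(log, scheme_a, scheme_b)           # reweighted for scheme B
```

The naive figure describes scheme A's overall traffic, while the reweighted figure matches the observed click rate of article y, which is what scheme B would serve.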

Page 53: Mirrors and Crystal Balls A Personal Perspective on Data Mining

System Challenges

• Highly dynamic system characteristics:
– Short article lifetimes, pool constantly changing, user population is dynamic, CTRs non-stationary
– Quick adaptation is key to success
• Scalability:
– 1000’s of page views/sec; data collection, model training, article scoring done under tight latency constraints

Page 54: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Results

• We built an experimental infrastructure to test new content serving schemes
– Ran side-by-side experiments on live traffic
– Experiments performed for several months; we consistently outperformed the old system
– Results showed we get more clicks by engaging more users
– Editorial overrides
    • Did not reduce lift numbers substantially

Page 55: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Comparing buckets

Page 56: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Experiments

• Daily CTR Lift relative to editorial serving
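For reference, lift relative to a baseline is conventionally the relative change in CTR between the experimental bucket and the baseline bucket. The snippet below spells that out; the function name and the click and view counts are made up for illustration and are not results from the talk.

```python
def ctr_lift(exp_clicks: int, exp_views: int,
             base_clicks: int, base_views: int) -> float:
    """Relative CTR improvement of the experimental bucket over the baseline."""
    exp_ctr = exp_clicks / exp_views
    base_ctr = base_clicks / base_views
    return (exp_ctr - base_ctr) / base_ctr


# Made-up daily counts for two equally sized buckets.
daily_lift = ctr_lift(exp_clicks=1200, exp_views=20000,
                      base_clicks=1000, base_views=20000)
print(f"lift: {daily_lift:.1%}")
```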

Page 57: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Lift is Due to Increased Reach

• Lift in fraction of clicking users

Page 58: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Related Work

• Amazon, Netflix, Y! Music, etc.:
– Collaborative filtering with large content pool
– Achieve lift by eliminating bad articles
– We have a small number of high quality articles
• Search, Advertising
– Matching problem with large content pool
– Match through feature based models

Page 59: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Summary of Approach

• Offline models to initialize online models

• Online models to track performance

• Explore/exploit to converge fast

• Study user visit patterns and behavior; program content accordingly

Page 60: Mirrors and Crystal Balls A Personal Perspective on Data Mining

Summary

• There are some exciting “grand challenge” problems that will require us to bring to bear ideas from data management, statistics, learning, and optimization
– i.e., data mining problems!

• Our field is too young to think about growing old, but the best is yet to be …