Kansas State University
Department of Computing and Information Sciences

Kansas State University KDD Lab (www.kddresearch.org)

Collaborative Filtering
Intelligent Information Retrieval and the Grid

Friday 11 October 2002

William H. Hsu

Laboratory for Knowledge Discovery in Databases

Department of Computing and Information Sciences

Kansas State University

http://www.kddresearch.org

This presentation is:

http://www.kddresearch.org/KSU/CIS/KU-20021010.ppt

Acknowledgements

• Kansas State University Lab for Knowledge Discovery in Databases
– Graduate research assistants: Haipeng Guo, Roby Joehanes
– Other grad students: Prashanth Boddhireddy, Siddharth Chandak, Ben B. Perry, Rengakrishnan Subramanian
– Undergraduate programmers: James W. Plummer, Julie A. Thornton
• Joint Work with
– KSU Bioinformatics and Medical Informatics (BMI) group: Sanjoy Das (EECE), Judith L. Roe (Biology), Stephen M. Welch (Agronomy)
– KSU Microarray group: Scot Hulbert (Plant Pathology), J. Clare Nelson (Plant Pathology), Jan Leach (Plant Pathology)
– Kansas Geological Survey, Kansas Biological Survey, KU EECS
• Other Research Partners
– NCSA Automated Learning Group (Michael Welge, Tom Redman)
– University of Manchester (Carole Goble, Robert Stevens)
– The Institute for Genomic Research (John Quackenbush, Alex Saeed)
– International Rice Research Institute (Richard Bruskiewich)

Overview

• Filtering
– Collaborative filtering (CF) and relatives
– Application to intelligent information retrieval (IR)
• Computational Grids
– High-Performance Computing (HPC) services
    • Scientific data, metadata (ontologies, specifications), documentation
    • Software tools (source codes, application servers)
    • Experimental results
– Grid initiatives: TeraGrid (USA), eScience (UK, EBI)
• Challenge: Personalization of Services
• Application: Bioinformatics
• Methodology: Learning Relational Probabilistic Models
– User modeling and collaborative filtering (CF)
– DESCRIBER system: integrative CF for computational genomics
• Current Research and Open Problems

Collaborative Filtering in Action: Amazon.com [1]

[Screenshot: Amazon.com page, © 2002 Amazon.com, Inc. Callouts: Cross-Selling (based upon Market Basket Analysis); Collaborative Recommendation]

Collaborative Filtering in Action: Amazon.com [2]

[Screenshot: Amazon.com page, © 2002 Amazon.com, Inc. Callouts: Classification and Regression based upon Historical Customer Data; Explanation from Recommender (Decision Support) System]

Filtering and Recommendation Approaches

• Collaborative

– Collect: recorded decisions (actions) of user(s)

– Infer: preferences of user(s)

– Model: associational relationships among entities (e.g., purchases)

– Use to: recommend similar decisions to users in similar context

• Structural

– Collect: recorded decisions (actions) of user(s)

– Infer: preferences of user(s)

– Model: causal relationships among entities (e.g., use cases)

– Use to: make recommendation and explain

• Content-Based: Driven by Key Word / Phrase

• Collective: Driven by Consensus, Stochastic Mixture Model

(e.g., “Swarm Intelligence”, Ant Colony Optimization)
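The collaborative approach (collect decisions, infer preferences, recommend to users in a similar context) can be illustrated with a minimal sketch of user-based filtering. It is not taken from the talk: the function names `cosine` and `recommend`, the toy `ratings` data, and the choice of cosine similarity are all assumptions made for illustration.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse rating dicts (item -> score)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(target, ratings, top_n=3):
    """Score items the target has not chosen, weighted by user similarity."""
    scores = {}
    for other, their in ratings.items():
        if other == target:
            continue
        weight = cosine(ratings[target], their)
        if weight <= 0:
            continue  # ignore users with no overlapping decisions
        for item, score in their.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + weight * score
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

# Hypothetical recorded decisions (1 = purchased).
ratings = {
    "ann": {"book_a": 1, "book_b": 1},
    "bob": {"book_a": 1, "book_b": 1, "book_c": 1},
    "cat": {"book_d": 1},
}
recommendations = recommend("ann", ratings)
```

Here "bob" overlaps with "ann" and contributes his extra item, while "cat" shares nothing with "ann" and is ignored.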

A Filtering Problem: Text Mining for Information Retrieval (IR)

[Figure: ThemeScapes map of 6500 news stories from the WWW in 1997. ThemeScapes © 1999 SPIRIX software, http://www.cartia.com]

Another Filtering Application: Commercial Fraud Monitoring

Stages of Data Mining and Knowledge Discovery in Databases

Adapted from Fayyad, Piatetsky-Shapiro, and Smyth (1996)

NCSA D2K: Visual Programming System for Rapid Application Development in KDD

Data to Knowledge (D2K) © 2002 NCSA http://archive.ncsa.uiuc.edu/STI/ALG/d2k/

NCSA D2K Workflow: Decision Support in Insurance Pricing

Hsu, Welge, Redman, Clutter (2002) Data Mining and Knowledge Discovery, 6(4):361-391

Computational Grids [1]: High-Performance Distributed Computing

• What is The Grid?
– Infrastructure: Distributed Processing, Networks, Software
– Paradigm for Very Large-Scale Scientific Computing
• End Users of The Grid – Adapted from Goble (2002)
– Providers
    • Tool builders
    • Systems/network administrators, service providers, etc.
– Researchers
    • Scientific discipline – e.g., Biology
    • Computational Science and Engineering (CSE) – e.g., Bioinformatics
    • Patent Intelligence!
– “End users”
    • Developers: e.g., pharmaceutical
    • Medical doctors, patients

Computational Grids [2]: Personalization of Services

• What Services?
– High-Performance Computing (HPC) facilities
    • Compute clusters (Beowulf, NT, etc.)
    • Massively distributed networks
– Software
– Scientific data servers
• Metadata
– Ontologies: Definitional Data Models (cf. Semantic Web)
– Service Type Directory
• Dynamic Design of Workflows – myGrid, Goble et al. (2002) http://www.ebi.ac.uk/mygrid
• Challenge: Personalization
– Intelligent Filtering Approach: User Modeling
– “Users Who Used (Your) Specified Resources Also Used…”
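The "Users Who Used (Your) Specified Resources Also Used…" idea can be sketched as a simple co-occurrence count over past usage records. This is only an illustrative baseline, not the user-modeling method of the talk; the function `also_used`, the session format, and the resource names are hypothetical.

```python
from collections import Counter

def also_used(sessions, resource, top_n=3):
    """Rank other resources by how often they co-occur with `resource`."""
    counts = Counter()
    for used in sessions:
        if resource in used:
            counts.update(r for r in used if r != resource)
    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]

# Hypothetical usage log: each set is the resources one user touched.
sessions = [
    {"cluster", "microarray_db", "bn_learner"},
    {"cluster", "bn_learner"},
    {"microarray_db", "ontology"},
    {"cluster", "ontology", "bn_learner"},
]
suggestions = also_used(sessions, "cluster")
```

A probabilistic user model would replace the raw counts with estimated conditional probabilities, but the interface (specified resource in, ranked suggestions out) stays the same.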

DESCRIBER: An Experimental Intelligent Filter

[Figure: DESCRIBER architecture. Components shown: Users of Scientific Document Repository; Personalized Interface; New Queries; Domain-Specific Collaborative Filtering; Learning and Inference Components; Historical Use Case & Query Data; Decision Support Models; Interface(s) to Distributed Repository; Data Entity and Source Code Repository Index for Bioinformatics Experimental Research; Domain-Specific Repositories (Experimental Data; Source Codes and Specifications; Data Models; Ontologies; Models)]

Example Queries:
• What experiments have found cell cycle-regulated metabolic pathways in Saccharomyces?
• What codes and microarray data were used, and why?

DESCRIBER [1]: Overview

[Figure: DESCRIBER module diagram. Modules: Module 1 (Intelligent Collaborative Filtering Front-End); Module 2 (Learning & Validation of Bayesian Network Models for Use Cases); Module 3; Module 4 (Learning & Validation of Bayesian Network Models for MAGE Data & Codes); Module 5. Other labels: User; Personalized Interface; New Queries; Historical Use Case & Query Data; Graphical Models of Use Cases; Estimation of Constraint Parameters; Constrained Models of Use Cases; Relational Models of MAGE Data; MAGE Data Model; Data]

DESCRIBER [2]: Collaborative Filtering Module

[Figure: Module 1, Intelligent Collaborative Filtering Front-End. Labels: Personalized Interface; New Query from User; Response to User; Relational Models of (Domain-Specific) Data; Constrained Models of Use Cases; Relational Probabilistic Model Constraint Selector; Integrated Reasoning Component: XML Validator and Constraint Checker; Constraints on Repository Content]

Computational Genomics and Microarray Data Mining

[Figure: microarray experiment. Labels: Treatment 1 (Control), Messenger RNA (mRNA) Extract 1, cDNA; Treatment 2 (Pathogen), Messenger RNA (mRNA) Extract 2, cDNA; DNA Hybridization Microarray (under LASER)]

Components of A Microarray Experiment: Hybridization

[Figure labels: Publication (e.g., PubMed); Source (e.g., Taxonomy); Gene (e.g., GenBank); Experiment; Sample; Hybridization; Array; Normalization/Discretization; Data]

Components of A Microarray Experiment: Computational Gene Expression Modeling

[Figure labels: Computational Workflows (e.g., myGrid); Experimental Services & Metadata (Mage-ML XML); Gene Expression Model; Pathway & Network Learning Specification; Data Preprocessing Specification; Parameter Learning Specification; Model Analysis Specification; Feature Selection Specification; Discretization Use Case; Data Mining Use Case; Validation (e.g., Bootstrap) Use Case]

Graphical Models of Probability for Collaborative Filtering (CF)

• Goal: Estimate $P(X_t^i \mid y_{1:r})$ – Murphy (2002)
• Filtering: r = t
– Intuition: infer current state from observations
– Applications: signal identification
– Variation: Viterbi algorithm
• Prediction: r < t
– Intuition: infer future state
– Applications: prognostics
• Smoothing: r > t
– Intuition: infer past hidden state
– Applications: signal enhancement
• CF Tasks
– Plan recognition by smoothing
– Prediction cf. WebCANVAS – Cadez et al. (2000)
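The filtering case (r = t) and the prediction case (r < t) can be made concrete with a minimal sketch for a discrete hidden Markov model. It assumes a single hidden variable with a known transition matrix `trans`, emission matrix `emit`, and a `prior` over the first state; these names and the two-state numbers are illustrative assumptions, not values from the talk.

```python
def forward_filter(prior, trans, emit, observations):
    """Return P(X_t | y_1..y_t) for the last step t (filtering, r = t).

    prior[j]    : P(X_1 = j)
    trans[i][j] : P(X_{k+1} = j | X_k = i)
    emit[j][y]  : P(Y_k = y | X_k = j)
    """
    n = len(prior)
    # Condition the prior on the first observation.
    unnorm = [prior[j] * emit[j][observations[0]] for j in range(n)]
    total = sum(unnorm)
    belief = [u / total for u in unnorm]
    for y in observations[1:]:
        # Time update: push the belief through the transition model.
        predicted = [sum(belief[i] * trans[i][j] for i in range(n))
                     for j in range(n)]
        # Measurement update: weight by the likelihood of y, then normalize.
        unnorm = [predicted[j] * emit[j][y] for j in range(n)]
        total = sum(unnorm)
        belief = [u / total for u in unnorm]
    return belief

def predict(belief, trans, steps=1):
    """Return P(X_{r+steps} | y_1..y_r) from a filtered belief (prediction)."""
    n = len(belief)
    for _ in range(steps):
        belief = [sum(belief[i] * trans[i][j] for i in range(n))
                  for j in range(n)]
    return belief

# Hypothetical two-state model.
prior = [0.5, 0.5]
trans = [[0.7, 0.3], [0.3, 0.7]]
emit = [[0.9, 0.1], [0.2, 0.8]]
filtered = forward_filter(prior, trans, emit, [0, 0])
one_step_ahead = predict(filtered, trans, steps=1)
```

Smoothing (r > t) would add a backward pass over later observations and is omitted here to keep the sketch short.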

Tools for Building Graphical Models

• Commercial Tools: Ergo, Netica, TETRAD, Hugin
• Bayes Net Toolbox (BNT) – Murphy (1997-present)
– Distribution page http://http.cs.berkeley.edu/~murphyk/Bayes/bnt.html
– Development group http://groups.yahoo.com/group/BayesNetToolbox
• Bayesian Network tools in Java (BNJ) – Hsu et al. (1999-present)
– Distribution page http://bndev.sourceforge.net
– Development group http://groups.yahoo.com/group/bndev
– Current (re)implementation projects for KSU KDD Lab
    • Continuous state: Minka (2002) – Hsu, Guo, Perry, Boddhireddy
    • Formats: XML BNIF (MSBN), Netica – Guo, Hsu
    • Space-efficient DBN inference – Joehanes
    • Bounded cutset conditioning – Chandak

Experimenters’ Workbench

[Figure: learning workflow. Labels: Learning Environment; D: Data (User, Microarray); [A] Structure Learning; G = (V, E), Graph Component of BN; [B] Parameter Estimation; B = (V, E, Θ), BN with Probabilities; Dval (Model Validation by Inference); Specification Fitness (Inferential Loss); candidate graphs G1, G2, G3, G4, G5]
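Step [B], parameter estimation for a fixed graph G = (V, E), can be sketched as counting with additive smoothing to fill one conditional probability table. This is a generic illustration under stated assumptions, not the workbench's actual procedure: the function `estimate_cpt`, the smoothing constant `alpha`, and the toy variables `pathogen` and `gene_up` are hypothetical. Structure learning (step [A]) is not shown.

```python
from collections import Counter
from itertools import product

def estimate_cpt(data, child, parents, values, alpha=1.0):
    """Estimate P(child | parents) from records, with Laplace smoothing.

    data    : list of dicts mapping variable name -> observed value
    values  : dict mapping variable name -> list of possible values
    alpha   : pseudo-count added to every cell (assumed smoothing choice)
    """
    counts = Counter()
    for row in data:
        key = tuple(row[p] for p in parents)
        counts[(key, row[child])] += 1
    cpt = {}
    # Enumerate every parent configuration, including unseen ones.
    for key in product(*(values[p] for p in parents)):
        total = sum(counts[(key, v)] for v in values[child])
        total += alpha * len(values[child])
        cpt[key] = {v: (counts[(key, v)] + alpha) / total
                    for v in values[child]}
    return cpt

# Hypothetical discretized data: treatment flag and one gene's expression.
data = (
    [{"pathogen": 1, "gene_up": 1}] * 3
    + [{"pathogen": 1, "gene_up": 0}]
    + [{"pathogen": 0, "gene_up": 0}] * 2
)
values = {"pathogen": [0, 1], "gene_up": [0, 1]}
cpt = estimate_cpt(data, "gene_up", ["pathogen"], values)
```

Repeating this for every node in V, with its parents read from E, yields the parameter set of B; held-out data (Dval in the figure) can then be used to score the result.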

References [1]: Intelligent Filtering, IR, and KDD

• Intelligent Filtering
– Taxonomy of Filtering Approaches: Rocha (2001) http://www.c3.lanl.gov/~rocha/GB0/adapweb_GB0.html
– Microsoft Research: Cadez et al. (1999), Heckerman and Meek (2002), Kadie (2002)
– Technical report: survey, Hsu (2002) http://www.kddresearch.org/Publications/Techreports/BMI-2001.pdf
– NCSA Automated Learning Group http://www.ncsa.uiuc.edu/STI/ALG
• Machine Learning, Data Mining, and Knowledge Discovery
– K-State KDD Lab: literature survey and resource catalog (2002) http://www.kddresearch.org/Resources
– Bayesian Network tools in Java (BNJ): Hsu, Guo, Joehanes, Perry, Thornton (2002) http://bndev.sourceforge.net
– Machine Learning in Java (MLJ): Hsu, Louis, Plummer (2002) http://mldev.sourceforge.net

References [2]: The Grid and Bioinformatics

• The Grid
– United Kingdom eScience Initiative: Taylor et al. (2002) http://www.research-councils.ac.uk/escience
– Access Grid: Foster and Kesselman (1999), Foster (2002) http://www-fp.mcs.anl.gov/fl/accessgrid
– NSF NPACI lecture: Reed (10 Apr 2002) http://www.interact.nsf.gov/cise/conferences.nsf/cise_lectures
• Bioinformatics
– European Bioinformatics Institute Tutorial: Brazma et al. (2001) http://www.ebi.ac.uk/microarray/biology_intro.htm
– Hebrew University: Friedman, Pe’er, et al. (1999, 2000, 2002) http://www.cs.huji.ac.il/labs/compbio/
– K-State BMI Group: literature survey and resource catalog (2002) http://www.kddresearch.org/Groups/Bioinformatics

Kohavi (1998): “Crossing the Chasm” http://robotics.stanford.edu/~ronnyk/chasm.pdf