Learning Relational Probability Trees
Jennifer Neville
David Jensen
Lisa Friedland
Michael Hay
Presented by Andrew Tjang
About the authors
David Jensen – Director of KDL, research asst. prof @ UMass
– Research focuses on the statistical aspects and architecture of data mining systems and the evaluation of those systems for government and business applications
Authors, cont’d
Lisa Friedland – graduate student, KDL, interest in DB/data mining
Michael Hay – graduate student, KDL
Jennifer Neville – master's student @ UMass in the Knowledge Discovery Lab (KDL investigates how to find useful patterns in large and complex databases)
Learning Relational Probability Trees (RPTs)
Classification trees are widely used in data mining: easily interpretable
– Conventional: homogeneous data, statistically independent
– Much research has gone into classification trees -> extend them to relational data, so variances can be safely “ignored” and relational effects can be on the front burner
Why is relational data so complicated?
Traditional assumption of instance independence is violated.
3 characteristics of relational data
– Concentrated linkage
– Degree disparity
– Relational autocorrelation
All 3 potentially produce overly complex models with excess structure
– This leads to factually incorrect relations and more space/computation requirements.
What did they do?
Conventional techniques/modeling assume independence
– RPT and a class of algorithms to adjust for relational characteristics.
– RPTs are the only statistical model to adjust for these relational characteristics
Applied these algorithms to a set of databases (IMDb, Cora, WebKB, human genome)
– Then evaluated accuracies of their RPT applications
Examples of relational data
Social networks, genomic data, data on interrelated people, places, things/events from text documents.
First part was to apply RPT to WebKB
– A collection of websites classified into categories (course, faculty, staff, student, research project)
– Approx. 4000 pages, 8000 hyperlinks
– DB contains attributes associated with each page/hyperlink (path/domain, direction of hyperlink)
Probability (sub) tree of WebKB
RPTs
Estimate probability distributions over attribute values
These algorithms differ from regular ones: they consider the effect of related objects
– Movie example
Aggregation is used to “proportionalize” relational data.
– E.g. some movies have 10 actors, some have 1000
– Aggregate as avg age (either as preprocessing and/or dynamically during the learning process)
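A minimal sketch of the aggregation idea, assuming toy data: the movie names, actor records, and the `aggregate` helper are hypothetical and not taken from the paper. It shows how casts of very different sizes collapse into feature rows of the same length.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy data: each movie links to a different number of actors.
movies = {
    "small_cast": [{"age": 30, "gender": "female"}, {"age": 50, "gender": "male"}],
    "large_cast": [{"age": 20 + (i % 50), "gender": "female" if i % 3 else "male"}
                   for i in range(1000)],
}

def aggregate(actors):
    """Collapse a variable-size set of actors into a fixed-length feature row."""
    ages = [a["age"] for a in actors]
    genders = Counter(a["gender"] for a in actors)
    return {
        "degree": len(actors),                        # how many actors are linked
        "avg_age": mean(ages),                        # AVG(actor.age)
        "max_age": max(ages),                         # MAX(actor.age)
        "mode_gender": genders.most_common(1)[0][0],  # MODE(actor.gender)
    }

# Every movie now has the same four features, whatever its cast size.
features = {title: aggregate(cast) for title, cast in movies.items()}
```

The same helper could run once as preprocessing or be called on demand while the tree is being grown.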
The Algorithm
RPT algorithm takes a collection of subgraphs as input (a single target object to be classified + other objects/links in its relational neighborhood)
Constructs a probability estimation tree (like previous fig) to predict the target class label given:
– The attributes of the target object
– The attributes of other objects in the neighborhood
– The degree attributes counting objects and links in the neighborhood
Algorithm (cont’d)
Searches over the space of binary relational features to “split data”
– E.g. MODE(actor.gender)=female
Constructs features based on attributes of different objects in subgraphs and aggregation of values.
Feature scores are calculated from the class distribution after the split using chi-squared analysis
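A rough sketch of how one binary feature might be scored, assuming a binary class label and the plain Pearson statistic on a 2x2 table. The function name and the toy vectors are hypothetical; the slides do not say which variant of the test the authors used.

```python
import math

def chi_squared_2x2(feature_values, labels):
    """Score a binary feature against a binary class label.

    Returns (chi2, p_value) for the 2x2 contingency table, using the plain
    Pearson statistic with 1 degree of freedom (no continuity correction).
    """
    # counts[f][c] = number of examples with feature value f and class c
    counts = [[0, 0], [0, 0]]
    for f, c in zip(feature_values, labels):
        counts[int(f)][int(c)] += 1
    n = len(labels)
    row = [sum(counts[0]), sum(counts[1])]
    col = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            if expected > 0:
                chi2 += (counts[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the chi-squared tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical data: does MODE(actor.gender)=female separate the class labels?
mode_gender_is_female = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
label = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
chi2, p = chi_squared_2x2(mode_gender_is_female, label)
```

Every candidate feature would be scored this way at a node, and the scores compared.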
Algorithm Part trois
The p-value from the chi-squared test reveals which features are significantly correlated with the class. The feature with the max score is included in the tree
Runs the algorithm recursively and greedily until class distributions don’t change significantly
– “Bonferroni adjustment” of the significance threshold + Laplace correction to improve probability estimates
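A small sketch of the two corrections, under stated assumptions: the 0.05 significance level, the function names, and the candidate p-values are illustrative and not taken from the slides. Bonferroni divides the threshold by the number of features tested; Laplace smooths leaf probabilities away from 0 and 1.

```python
def bonferroni_threshold(alpha, n_features):
    """Per-feature significance level when n_features are tested at once."""
    return alpha / n_features

def laplace_estimate(class_count, total_count, n_classes=2):
    """Laplace-corrected class probability at a leaf: (k + 1) / (n + C)."""
    return (class_count + 1) / (total_count + n_classes)

def pick_split(p_values, alpha=0.05):
    """Return the most significant feature, or None to stop growing the tree."""
    if not p_values:
        return None
    best = min(p_values, key=p_values.get)  # smallest p-value = strongest score
    if p_values[best] < bonferroni_threshold(alpha, len(p_values)):
        return best
    return None

# Hypothetical p-values for three candidate features at one tree node.
candidates = {
    "MODE(actor.gender)=female": 0.001,
    "Genre=comedy": 0.03,
    "COUNT(actor)>10": 0.4,
}
chosen = pick_split(candidates)

# A leaf holding 3 positives out of 3 examples is reported as 0.8, not 1.0.
leaf_probability = laplace_estimate(class_count=3, total_count=3)
```

Note that `Genre=comedy` at p = 0.03 would pass an unadjusted 0.05 cutoff but fails the adjusted one, which is the point of the adjustment.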
Relational features
Traditional features
– Genre=comedy
– Identify an attribute, operator, value
Relational features
– Movie(x),Y = {y|ActedIn(x,y)}:Max(Age(Y))>65
– Identify a relation that links an object x to a set of other objects Y.
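One way to picture the difference in code, as a sketch only: the `RelationalFeature` class, the dictionary layout of a movie, and the choice to return False for an empty set Y are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List
import operator

@dataclass
class RelationalFeature:
    relation: str                                # link from x to the set Y, e.g. "ActedIn"
    attribute: str                               # attribute read from each y in Y
    aggregator: Callable[[List[float]], float]   # e.g. max
    op: Callable[[float, float], bool]           # e.g. operator.gt
    threshold: float

    def evaluate(self, x: Dict) -> bool:
        values = [y[self.attribute] for y in x.get(self.relation, [])]
        if not values:  # assumption: no linked objects means the feature is False
            return False
        return self.op(self.aggregator(values), self.threshold)

# Movie(x), Y = {y | ActedIn(x, y)} : Max(Age(Y)) > 65
max_age_over_65 = RelationalFeature("ActedIn", "age", max, operator.gt, 65)

movie = {"genre": "comedy", "ActedIn": [{"age": 34}, {"age": 71}]}
traditional = movie["genre"] == "comedy"        # attribute, operator, value
relational = max_age_over_65.evaluate(movie)    # relation, aggregate, operator, value
```

The traditional feature reads one value off the target object; the relational one first follows a link to a set and aggregates it.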
Feature Selection Biases
Concentrated linkage and autocorrelation combine to reduce effective sample size and increase variance.
Degree disparity can cause spurious features to be selected and useful features to be missed.
Randomization Test
Used on relational data sets where conventional hypothesis testing is not as useful.
– Used to account for biases
Computationally intensive
– Pseudo samples generated by permuting variables
– In this case permute attribute values before aggregation
This new permuted vector preserves intrinsic links/relations but removes correlations btw atts and class label
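A sketch of a randomization test in this spirit, with loud assumptions: each subgraph is reduced to a list of actor ages, the `score` function (difference in positive rate across an avg-age split) stands in for whatever score the tree uses, and the data are made up. Values are shuffled across linked objects before aggregating, so each subgraph keeps its degree while any tie to the class label is broken.

```python
import random
from statistics import mean

def score(subgraphs, labels, threshold=40):
    """Hypothetical score: |difference in positive rate| across avg(age) > threshold."""
    above = [l for g, l in zip(subgraphs, labels) if mean(g) > threshold]
    below = [l for g, l in zip(subgraphs, labels) if mean(g) <= threshold]
    if not above or not below:
        return 0.0
    return abs(mean(above) - mean(below))

def permute_before_aggregation(subgraphs, rng):
    """Shuffle attribute values across all linked objects, keeping each degree."""
    pooled = [v for g in subgraphs for v in g]
    rng.shuffle(pooled)
    out, i = [], 0
    for g in subgraphs:
        out.append(pooled[i:i + len(g)])
        i += len(g)
    return out

def randomization_p_value(subgraphs, labels, n_permutations=500, seed=0):
    """Fraction of pseudo-samples scoring at least as high as the real data."""
    rng = random.Random(seed)
    observed = score(subgraphs, labels)
    hits = sum(
        score(permute_before_aggregation(subgraphs, rng), labels) >= observed
        for _ in range(n_permutations)
    )
    return (hits + 1) / (n_permutations + 1)

# Made-up data: positive movies have older casts, and cast sizes vary.
subgraphs = ([[60, 62, 65]] * 5 + [[58, 61]] * 5
             + [[20, 25, 22, 24]] * 5 + [[23, 21]] * 5)
labels = [1] * 10 + [0] * 10
p_value = randomization_p_value(subgraphs, labels)
```

The cost is visible here: the score is recomputed once per pseudo-sample, for every feature under consideration.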
Tests
Use IMDb to predict whether a movie's opening-weekend box office receipts exceed $2 million
1) IMDb with class labels only correlated with degree features.
2) Use structure of IMDb as well as attributes
3) From Cora: if paper is of topic “neural networks”
4) Genome project: if gene is in nucleus
5) WebKB: is student web page
Tests (cont’d)
For each of these tasks a set of classification techniques was applied:
– RPT w/ chi-squared (CT)
– RPT with randomized testing (RT)
– RPT (pseudo) w/o randomized testing (C4.5)
– Baseline Relational Bayesian Classifier (RBC)
* entries mark statistically significant differences
Results
Results explained
On none of the tasks except for random did their algorithm perform any better than the other algorithms
For random they show:
– RPTs can adjust for linkage and autocorrelation.
– RTs perform better than the other 3 models
– Biased models select random attributes that hinder performance.
Conclusions (theirs and mine)
Possible to extend conventional probability estimation tree algorithms to relational data
RPT w/randomization and RPT w/o randomization perform about the same (smaller trees in the former)
Randomization adjusts for biases (but is this so important if they perform the same?)
Non-selective models don’t have these biases, but lose interpretability.
Future work: enrichments to RPT
– Better ways of calculating ranking scores