![Page 1: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/1.jpg)
New Models for Relational Classification
Ricardo Silva (Statslab)
Joint work with Wei Chu and Zoubin Ghahramani
![Page 2: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/2.jpg)
The talk
Classification with non-iid data
A source of non-iidness: relational information
A new family of models, and what is new
Applications to classification of text documents
![Page 3: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/3.jpg)
The prediction problem
[Figure: graphical model over input X and output Y]
![Page 4: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/4.jpg)
Standard setup
[Figure: N training pairs (X, Y) and a new pair (Xnew, Ynew)]
![Page 5: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/5.jpg)
Prediction with non-iid data
[Figure: graphical model over (X1, Y1), (X2, Y2) and (Xnew, Ynew)]
![Page 6: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/6.jpg)
Where does the non-iid information come from?
Relations
  Links between data points
    Webpage A links to Webpage B
    Movie A and Movie B are often rented together
Relations as data
  “Linked webpages are likely to present similar content”
  “Movies that are rented together often have correlated personal ratings”
![Page 7: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/7.jpg)
The vanilla relational domain: time-series
Relations: “$Y_i$ precedes $Y_{i+k}$”, $k > 0$
Dependencies: “Markov structure G”
[Figure: chain …, Y1, Y2, Y3, …]
![Page 8: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/8.jpg)
A model for integrating link data
How to model the dependencies among class labels?
Movies that are often rented together might share all sorts of common, unmeasured factors
These hidden common causes affect the ratings
![Page 9: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/9.jpg)
Example
MovieFeatures(M1)
Rating(M1)
MovieFeatures(M2)
Rating(M2)
Same genre?
Both released in same year?
Same director?
Target same age groups?
![Page 10: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/10.jpg)
Integrating link data
Of course, many of these common causes will be measured
Many will not
Idea:
  Postulate a hidden common cause structure, based on relations
  Define a model Markov to this structure
  Design an adequate inference algorithm
![Page 11: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/11.jpg)
Example: Political Books database
A network of books about recent US politics sold by the online bookseller Amazon.com
  Valdis Krebs, http://www.orgnet.com/
Relations: frequent co-purchasing of books by the same buyers
Political inclination factors as the hidden common causes
![Page 12: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/12.jpg)
Political Books relations
![Page 13: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/13.jpg)
Political Books database
Features:
  I collected the Amazon.com front page for each of the books
  Bag-of-words, tf-idf features, normalized to unity
Task:
  Binary classification: “liberal” or “not-liberal” books
  43 liberal books out of 105
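The feature construction can be sketched as follows. This is only an illustration under assumptions the slide does not state: whitespace tokenization, idf computed as log(N / document frequency), and "normalized to unity" read as unit Euclidean length. The function name `tfidf_unit_vectors` and the three toy page strings are hypothetical.

```python
import math
from collections import Counter

def tfidf_unit_vectors(documents):
    """Bag-of-words tf-idf vectors, each scaled to unit Euclidean length."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        vectors.append({w: v / norm for w, v in weights.items()} if norm > 0 else weights)
    return vectors

# Hypothetical stand-ins for the text of three book pages.
pages = ["tax policy reform", "war policy debate", "tax debate history"]
vectors = tfidf_unit_vectors(pages)
print(vectors[0])
```

With unit-length vectors, a linear kernel between two books is simply their cosine similarity.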
![Page 14: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/14.jpg)
Contribution
We will
  show a classical multiple linear regression model
  build a relational variation
  generalize it with a more complex set of independence constraints
  generalize it using Gaussian processes
![Page 15: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/15.jpg)
Seemingly unrelated regression (Zellner, 1962)
Y = (Y1, Y2), X = (X1, X2)
Suppose you regress Y1 ~ X1, X2 and X2 turns out to be useless
  Analogously for Y2 ~ X1, X2 (X1 vanishes)
Suppose you regress Y1 ~ X1, X2, Y2
  And now every variable is a relevant predictor
[Figure: graphs over X1, X2, Y1 and over X1, X2, Y1, Y2]
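The effect can be reproduced with a small simulation. This is a sketch under assumed values that are not in the slides: unit regression coefficients, an error correlation of 0.8 between the two outputs, and 20,000 samples. `ols` is a hypothetical helper that solves the normal equations.

```python
import random

def ols(columns, y):
    """Least-squares coefficients (no intercept) via the normal equations."""
    k = len(columns)
    # Build the augmented system [X'X | X'y].
    a = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(k)]
         + [sum(u * v for u, v in zip(columns[i], y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                m = a[r][c] / a[c][c]
                a[r] = [x - m * z for x, z in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

rng = random.Random(0)
n, rho = 20000, 0.8          # assumed sample size and error correlation
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
e1 = [rng.gauss(0, 1) for _ in range(n)]
# e2 is correlated with e1: the hidden common cause between the two outputs.
e2 = [rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for a in e1]
y1 = [a + e for a, e in zip(x1, e1)]   # Y1 depends only on X1
y2 = [b + e for b, e in zip(x2, e2)]   # Y2 depends only on X2

coef_without_y2 = ols([x1, x2], y1)    # the X2 coefficient is close to 0
coef_with_y2 = ols([x1, x2, y2], y1)   # X2 and Y2 both become relevant
print(coef_without_y2, coef_with_y2)
```

Once Y2 is included, the X2 coefficient moves to about -rho and the Y2 coefficient to about +rho, because Y2 - X2 reveals the error term that Y1 shares with Y2.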
![Page 16: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/16.jpg)
Graphically, with latents
X: Capital(GE), Capital(Westinghouse)
Y: Stock price(GE), Stock price(Westinghouse)
Latents: Industry factor 1, Industry factor 2, …, Industry factor k?
![Page 17: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/17.jpg)
The Directed Mixed Graph (DMG)
X: Capital(GE), Capital(Westinghouse)
Y: Stock price(GE), Stock price(Westinghouse)
Richardson (2003), Richardson and Spirtes (2002)
![Page 18: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/18.jpg)
A new family of relational models
Inspired by SUR
Structure: DMG graphs
  Edges postulated from given relations
[Figure: DMG over X1, X2, X3, X4, X5 and Y1, Y2, Y3, Y4, Y5]
![Page 19: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/19.jpg)
Model for binary classification
Nonparametric probit regression
Zero-mean Gaussian process prior over f(·)
$P(y_i = 1 \mid x_i) = P(y^*(x_i) > 0)$
$y^*(x_i) = f(x_i) + \epsilon_i$, $\epsilon_i \sim N(0, 1)$
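A minimal sketch of this likelihood, for a fixed value of f(x_i): thresholding the latent variable at zero with standard normal noise gives the normal CDF evaluated at f(x_i). The function names and the value 0.5 are hypothetical, chosen only for illustration.

```python
import math
import random

def probit(f):
    """P(y = 1 | f) = P(f + eps > 0) with eps ~ N(0, 1), i.e. the normal CDF at f."""
    return 0.5 * (1.0 + math.erf(f / math.sqrt(2.0)))

def sample_label(f, rng):
    """Draw y by thresholding the latent y* = f + eps at zero."""
    return 1 if f + rng.gauss(0.0, 1.0) > 0 else 0

rng = random.Random(1)
f_value = 0.5                      # hypothetical value of f(x_i)
draws = [sample_label(f_value, rng) for _ in range(50000)]
empirical = sum(draws) / len(draws)
print(probit(f_value), empirical)
```

The empirical frequency of y = 1 should agree with the closed-form probability up to Monte Carlo error.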
![Page 20: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/20.jpg)
Relational dependency model
Make $\{\epsilon_i\}$ dependent: multivariate Gaussian
For convenience, decouple it into two error terms
$\epsilon = \epsilon^* + \zeta$
![Page 21: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/21.jpg)
Dependency model: the decomposition
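$\epsilon = \epsilon^* + \zeta$
  $\epsilon^*$ and $\zeta$: independent from each other
  $\epsilon^*$: marginally independent
  $\zeta$: dependent according to relations
$\Sigma = \Sigma^* + \Sigma_\zeta$
  $\Sigma^*$: diagonal
  $\Sigma_\zeta$: not diagonal, with 0s only on unrelated pairs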
![Page 22: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/22.jpg)
Dependency model: the decomposition
$y^*(x_i) = f(x_i) + \epsilon_i = f(x_i) + \zeta_i + \epsilon^*_i = g(x_i) + \epsilon^*_i$
If K was the original kernel matrix for f(·), the covariance of g(·) is simply
$\Sigma_{g(\cdot)} = K + \Sigma_\zeta$
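A short derivation of why the covariance is additive. It assumes, which the slide does not state explicitly, that f is a priori independent of both error terms, and it writes $\zeta$ for the relational error term with covariance $\Sigma_\zeta$ and $\epsilon^*$ for the independent error term with diagonal covariance $\Sigma^*$:

```latex
\begin{aligned}
g(x_i) &= f(x_i) + \zeta_i, \\
\mathrm{Cov}\big(g(x_i), g(x_j)\big)
  &= \mathrm{Cov}\big(f(x_i), f(x_j)\big) + \mathrm{Cov}(\zeta_i, \zeta_j)
   = K_{ij} + (\Sigma_\zeta)_{ij}, \\
y^*(x_i) &= g(x_i) + \epsilon^*_i, \qquad
  \epsilon^* \sim N(0, \Sigma^*) \text{ with } \Sigma^* \text{ diagonal}.
\end{aligned}
```

So g(·) is again a zero-mean Gaussian process over the observed points, and the relations enter only through its covariance.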
![Page 23: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/23.jpg)
Approximation
Posterior for f(·), g(·) is a truncated Gaussian, hard to integrate
Approximate posterior with a Gaussian
  Expectation-Propagation (Minka, 2001)
The reason for $\epsilon^*$ becomes apparent in the EP approximation
![Page 24: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/24.jpg)
Approximation
Likelihood does not factorize over f(·), but factorizes over g(·)
Approximate each factor $p(y_i \mid g(x_i))$ with a Gaussian
  if $\epsilon^*$ were 0, $y_i$ would be a deterministic function of $g(x_i)$
$p(g \mid x, y) \propto p(g \mid x) \prod_i p(y_i \mid g(x_i))$
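The core step of approximating one factor with a Gaussian is moment matching against a Gaussian "cavity" distribution over g(x_i). The sketch below shows only that single step, not a full EP loop, and it is not taken from the slides. It assumes the independent error term has variance `noise_var` (1.0 by default, an assumption), so that each factor is a probit term. All function names are hypothetical.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def match_moments(y, cavity_mean, cavity_var, noise_var=1.0):
    """Mean and variance of the tilted distribution
    q(g) proportional to Phi(s * g / sqrt(noise_var)) * N(g; cavity_mean, cavity_var),
    where s = +1 for y = 1 and s = -1 for y = 0."""
    s = 1.0 if y == 1 else -1.0
    total = noise_var + cavity_var
    z = s * cavity_mean / math.sqrt(total)
    ratio = norm_pdf(z) / norm_cdf(z)
    mean = cavity_mean + s * cavity_var * ratio / math.sqrt(total)
    var = cavity_var - cavity_var ** 2 * ratio * (z + ratio) / total
    return mean, var

print(match_moments(y=1, cavity_mean=0.3, cavity_var=2.0))
```

If `noise_var` were 0 the factor would become a hard step function of g(x_i), which is the degenerate case the slide warns about.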
![Page 25: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/25.jpg)
Generalizations
This can be generalized for any number of relations
[Figure: graph over Y1, Y2, Y3, Y4, Y5]
$\epsilon = \epsilon^* + \zeta_1 + \zeta_2 + \zeta_3$
![Page 26: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/26.jpg)
But how to parameterize $\Sigma_\zeta$?
Non-trivial
Desiderata:
  Positive definite
  Zeroes in the right places
  Few parameters, but broad family
  Easy to compute
![Page 27: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/27.jpg)
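But how to parameterize $\Sigma_\zeta$?
“Poking zeroes” on a positive definite matrix doesn’t work
Positive definite (Y1, Y2, Y3):
$$\begin{bmatrix} 1 & 0.8 & 0.8 \\ 0.8 & 1 & 0.8 \\ 0.8 & 0.8 & 1 \end{bmatrix}$$
Not positive definite (zero poked into the Y1, Y3 entry):
$$\begin{bmatrix} 1 & 0.8 & 0 \\ 0.8 & 1 & 0.8 \\ 0 & 0.8 & 1 \end{bmatrix}$$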
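The two matrices on this slide can be checked directly. The sketch below uses a hand-written Cholesky factorization as the test (the factorization succeeds only for positive definite input); `is_positive_definite` is a hypothetical helper name.

```python
import math

def is_positive_definite(matrix):
    """Attempt a Cholesky factorization; it succeeds only for positive definite input."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True

full = [[1.0, 0.8, 0.8], [0.8, 1.0, 0.8], [0.8, 0.8, 1.0]]
poked = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.8], [0.0, 0.8, 1.0]]
print(is_positive_definite(full), is_positive_definite(poked))  # prints: True False
```

The poked matrix has eigenvalues 1 and 1 ± 0.8√2, so its smallest eigenvalue is about -0.13.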
![Page 28: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/28.jpg)
Approach #1
Assume we can find all cliques for the bi-directed subgraph of relations
Create a “factor analysis model”, where
  for each clique Ci there is a latent variable Li
  members of each clique are the only children of Li
  Set of latents {L} is a set of N(0, 1) variables
  coefficients in the model are equal to 1
![Page 29: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/29.jpg)
Approach #1
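$Y_1 = L_1 + \epsilon_1$
$Y_2 = L_1 + L_2 + \epsilon_2$
[Figure: bi-directed graph over Y1, Y2, Y3, Y4 and the corresponding factor model with latents L1, L2]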
![Page 30: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/30.jpg)
Approach #1
In practice, we set the variance of each $\epsilon_i$ to a small constant ($10^{-4}$)
Covariance between any two Ys is
  proportional to the number of cliques they belong to together
  inversely proportional to the number of cliques they belong to individually
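A sketch of the clique-based construction, with unit coefficients, N(0, 1) latents and a small noise variance as described on the slides. The clique list used here is a hypothetical example (it is consistent with Y1 having parent L1 and Y2 having parents L1 and L2, but the actual graph in the figure is not recoverable), and the function names are illustrative.

```python
import math

def clique_covariance(n_vars, cliques, noise_var=1e-4):
    """Covariance of Y_i = sum of the N(0, 1) latents of its cliques + independent noise."""
    member = [[1.0 if i in c else 0.0 for c in cliques] for i in range(n_vars)]
    # Cov(Y_i, Y_j) counts the cliques that contain both i and j.
    cov = [[sum(a * b for a, b in zip(member[i], member[j])) for j in range(n_vars)]
           for i in range(n_vars)]
    for i in range(n_vars):
        cov[i][i] += noise_var
    return cov

def to_correlation(cov):
    n = len(cov)
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(n)]
            for i in range(n)]

# Hypothetical cliques over Y1..Y4 (0-based indices): {Y1, Y2} and {Y2, Y3, Y4}.
cliques = [{0, 1}, {1, 2, 3}]
cov = clique_covariance(4, cliques)
corr = to_correlation(cov)
for row in corr:
    print([round(v, 3) for v in row])
```

Pairs that share no clique get an exact zero, and the matrix is positive definite by construction because it is a Gram matrix plus a positive diagonal.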
![Page 31: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/31.jpg)
Approach #1
Let U be the correlation matrix obtained from the proposed procedure
To define the error covariance, use a single hyperparameter $\gamma \in [0, 1]$
$\Sigma_\zeta = (I - \gamma U_{diag}) + \gamma U$
![Page 32: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/32.jpg)
Approach #1
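Notice: if everybody is connected, model is exchangeable and simple
[Figure: fully connected graph over Y1, Y2, Y3, Y4 and the equivalent factor model with a single latent L1 as parent of Y1, Y2, Y3, Y4; the resulting 4 × 4 matrix has 1s on the diagonal]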
![Page 33: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/33.jpg)
Approach #1
Finding all cliques is “impossible”, what to do?
Triangulate and then extract cliques
  Can be done in polynomial time
This is a relaxation of the problem, since constraints are thrown away
Can have bad side effects: the “Blow-Up” effect
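A sketch of why triangulation relaxes the problem. The slides do not say which triangulation procedure was used; the greedy min-degree elimination below is an assumed stand-in, and the 4-cycle is a hypothetical example. Every fill-in edge links a pair that was originally unrelated, so its zero constraint is lost.

```python
def triangulate(n_nodes, edges):
    """Greedy min-degree elimination; returns the fill-in edges and the elimination cliques."""
    adj = {v: set() for v in range(n_nodes)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    fill, cliques = [], []
    remaining = set(adj)
    while remaining:
        # Eliminate the node with the fewest remaining neighbours (ties: lowest index).
        v = min(remaining, key=lambda u: (len(adj[u] & remaining), u))
        nbrs = sorted((adj[v] & remaining) - {v})
        # Connect all remaining neighbours of v; any new edge is a fill-in.
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill.append((a, b))
        candidate = frozenset([v] + nbrs)
        # Keep only maximal cliques.
        if not any(candidate <= c for c in cliques):
            cliques.append(candidate)
        remaining.remove(v)
    return fill, cliques

# Hypothetical 4-cycle 0 - 1 - 2 - 3 - 0: opposite nodes are unrelated.
fill, cliques = triangulate(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
print(fill, [sorted(c) for c in cliques])
```

Here nodes 1 and 3 end up in the same cliques although they were unrelated in the original graph; on dense graphs such fill-ins can accumulate into very large cliques.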
![Page 34: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/34.jpg)
Political Books dataset
![Page 35: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/35.jpg)
Political Books dataset: the “Blow-up” effect
![Page 36: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/36.jpg)
Approach #2
Don’t look for cliques: create a latent for each pair of variables
Very fast to compute, zeroes respected
[Figure: graph over Y1, Y2, Y3, Y4 and the corresponding factor model with one latent per related pair, e.g. L13]
![Page 37: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/37.jpg)
Approach #2
Correlations, however, are given by
$\mathrm{Corr}(i, j) \propto \dfrac{1}{\sqrt{\#\mathrm{neigh}(i) \cdot \#\mathrm{neigh}(j)}}$
Penalizes nodes with many neighbors, even if Yi and Yj have many neighbors in common
We call this the “pulverization” effect
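A sketch of the pairwise construction, assuming (as in Approach #1) N(0, 1) latents, unit coefficients and a small independent noise variance. The fully connected four-node graph is a hypothetical example, and `pairwise_correlation` is an illustrative name.

```python
import math

def pairwise_correlation(n_vars, edges, noise_var=1e-4):
    """Correlation when each related pair (i, j) shares its own N(0, 1) latent L_ij."""
    degree = [0] * n_vars
    related = set()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
        related.add((min(a, b), max(a, b)))
    corr = [[0.0] * n_vars for _ in range(n_vars)]
    for i in range(n_vars):
        for j in range(n_vars):
            if i == j:
                corr[i][j] = 1.0
            elif (min(i, j), max(i, j)) in related:
                # Cov = 1 (one shared latent); Var(Y_i) = #neigh(i) + noise_var.
                corr[i][j] = 1.0 / math.sqrt(
                    (degree[i] + noise_var) * (degree[j] + noise_var))
    return corr

# Hypothetical fully connected graph over four nodes: every pair is related.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
corr = pairwise_correlation(4, edges)
print(round(corr[0][1], 3))
```

Even though all four nodes are mutually related, each pairwise correlation is only about 1/3, whereas the single-latent clique model for the same graph gives correlations close to 1.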
![Page 38: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/38.jpg)
Political Books dataset
![Page 39: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/39.jpg)
Political Books dataset: the “pulverization” effect
![Page 40: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/40.jpg)
WebKB dataset: links of pages in University of Washington
![Page 41: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/41.jpg)
Approach #1
![Page 42: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/42.jpg)
Approach #2
![Page 43: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/43.jpg)
Comparison: undirected models
Generative stories
Conditional random fields (Lafferty, McCallum, Pereira, 2001)
Wei et al., 2006 / Richardson and Spirtes, 2002
[Figure: graph over X1, X2, X3 and Y1, Y2, Y3]
![Page 44: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/44.jpg)
Chu Wei’s model
[Figure: graph over X1, X2, X3, Y1*, Y2*, Y3*, Y1, Y2, Y3 with relation indicators R12 = 1, R23 = 1]
Dependency family equivalent to a pairwise Markov random field
[Figure: Markov random field over Y1, Y2, Y3]
![Page 45: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/45.jpg)
Properties of undirected models
MRFs propagate information among “test” points
[Figure: graph over Y1, …, Y12]
![Page 46: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/46.jpg)
Properties of DMG models
DMGs propagate information among “training” points
[Figure: graph over Y1, …, Y12]
![Page 47: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/47.jpg)
Properties of DMG models
In a DMG, each “test” point will have in the Markov blanket a whole “training component”
[Figure: graph over Y1, …, Y12]
![Page 48: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/48.jpg)
Properties of DMG models
It seems acceptable that a typical relational domain will not have an “extrapolation” pattern
  Like typical “structured output” problems, e.g., NLP domains
Ultimately, the choice of model concerns the question:
  “Hidden common causes” or “relational indicators”?
![Page 49: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/49.jpg)
Experiment #1
A subset of the CORA database
  4,285 machine learning papers, 7 classes
  Links: citations between papers
    “hidden common cause” interpretation: particular ML subtopic being treated
Experiment: 7 binary classification problems, Class 5 vs. others
Criterion: AUC
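For reference, AUC can be computed as the fraction of positive/negative pairs that the classifier ranks correctly. A minimal sketch follows; the scores and labels are made up for illustration and are not results from the talk.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores for a "class vs. others" problem.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 0, 1, 0, 0]
print(auc(scores, labels))
```

In this toy example 5 of the 6 positive/negative pairs are ordered correctly, so the AUC is 5/6.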
![Page 50: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/50.jpg)
Experiment #1
Comparisons:
  Regular GP
  Regular GP + citation adjacency matrix
  Chu Wei’s Relational GP (RGP)
  Our method, miXed graph GP (XGP)
Fairly easy task
Analysis of low-sample tasks
  Uses 1% of the data (roughly 10 data points for training)
  Not that useful for XGP, but more useful for RGP
![Page 51: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/51.jpg)
Experiment #1
Chu Wei’s method gets up to 0.99 in several of those…
![Page 52: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/52.jpg)
Experiment #2
Political Books database
  105 datapoints, 100 runs using 50% for training
Comparison with standard Gaussian processes
  Linear kernels
Results
  0.92 for regular GP
  0.98 for XGP (using pairwise kernel generator)
  Hyperparameters optimized by grid search
  Difference: 0.06 with std 0.02
  Chu Wei’s method does the same…
![Page 53: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/53.jpg)
Experiment #3
WebKB
  Collections of webpages from 4 different universities
Task: “outlier classification”
  Identify which pages are not student, course, project or faculty pages
10% for training data (still not that hard)
  However, an order of magnitude more data than in Cora
![Page 54: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/54.jpg)
Experiment #3
As far as I know, XGP easily gets the best results on this task
![Page 55: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/55.jpg)
Future work
Tons of possibilities on how to parameterize the output covariance matrix
  Incorporating relation attributes too
Heteroscedastic relational noise
Mixtures of relations
New approximation algorithms
Clustering problems
On-line learning
![Page 56: New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56649f115503460f94c24aa5/html5/thumbnails/56.jpg)
Thank You