Learning Hierarchical Riffle Independent Groupings from Rankings

Jonathan Huang

Carlos Guestrin

Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh PA, 15213

Abstract

Riffled independence is a generalized notion of probabilistic independence that has been shown to be naturally applicable to ranked data. In the riffled independence model, one assigns rankings to two disjoint sets of items independently, then in a second stage, interleaves (or riffles) the two rankings together to form a full ranking, as if by shuffling a deck of cards. Because of this interleaving stage, it is much more difficult to detect riffled independence than ordinary independence. In this paper, we provide the first automated method for discovering sets of items which are riffle independent from a training set of rankings. We show that our clustering-like algorithms can be used to discover meaningful latent coalitions from real preference ranking datasets and to learn the structure of hierarchically decomposable models based on riffled independence.

1. Introduction

Ranked data appears ubiquitously in various machine learning application domains. Rankings are useful, for example, in reasoning about preference lists in surveys, search results in information retrieval applications, and ballots in certain elections. The problem of building statistical models on rankings has thus been an important research topic in the learning community. As with many challenging learning problems, one must contend with an intractably large state space when dealing with rankings since there are n! ways to rank n objects. In building a statistical model over rankings, simple (yet flexible) models are therefore preferable because they are typically more computationally tractable and less prone to overfitting.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

A popular and highly successful approach for achieving such simplicity for distributions involving large collections of interdependent variables has been to exploit conditional independence structures (e.g., naive Bayes, tree, Markov models). With ranking problems, however, independence-based relations are typically trickier to exploit due to the so-called mutual exclusivity constraints, which constrain any two items to map to different ranks in any given ranking. In a recent paper, Huang and Guestrin (2009) proposed an alternative notion of independence, called riffled independence, which they have shown to be more natural for rankings. In addition to being a natural way to represent distributions in a factored form, the concept of riffled independence leads to a natural way of grouping items together: if we find two subsets of items, A and B, to be riffle independent of each other, then it is often the case that items in A (and items in B) are associated with each other in some way. Experimentally, for example, Huang and Guestrin (2009) were able to demonstrate on a particular election dataset (of the American Psychological Association (Diaconis, 1989)) that the political coalitions formed by candidates in the election were in fact approximately riffle independent of each other. However, it is not always obvious what kind of groupings riffled independence will lead to. Should fruits really be riffle independent of vegetables? Or are green foods riffle independent of red foods?

In this paper, we address the problem of automatically discovering optimal riffle independent groupings of items from training data. Key among our observations is the fact that while item ranks cannot be independent due to mutual exclusivity, relative ranks between sets of items are not subject to the same constraints. More than simply being a `clustering' algorithm, however, our procedure can be thought of as a structure learning algorithm, like those from the graphical models literature, which find the optimal (riffled) independence decomposition of a distribution. Structure learning algorithms are useful for efficient probabilistic representations and inference, and in the spirit of graphical models, we explore a family of probabilistic models for rankings based on hierarchies of groupings using riffled independence. We show that our hierarchical models are simple and interpretable, and present a method for performing model selection based on our algorithm. Specifically, our contributions are as follows:

• We propose a method for finding the partitioning of the item set such that the subsets of the partition are as close to riffle independent as possible. In particular, we propose a novel objective for quantifying the degree to which two subsets are riffle independent of each other.

• We define a family of simple and interpretable distributions over rankings, called hierarchical riffle independent models, in which subsets of items are interleaved into larger subsets in a recursive stagewise fashion. We apply our partitioning algorithm to perform model selection from training data in polynomial time, without having to exhaustively search over the exponentially large space of hierarchical structures.

2. Riffled independence for rankings

In this paper we will be concerned with distributions over rankings. A ranking σ = (σ(1), . . . , σ(n)) is a one-to-one association between n items and ranks, where σ(j) = i means that the jth item is assigned rank i under σ. We will also refer to a ranking σ by its inverse, ⟦σ⁻¹(1), . . . , σ⁻¹(n)⟧ (called an ordering and denoted with double brackets instead of parentheses), where σ⁻¹(i) = j also means that the jth item is assigned rank i under σ. As a running example in this paper, we will consider ranking a small list of 6 items consisting of fruits and vegetables: (C) Corn, (P) Peas, (L) Lemons, (O) Oranges, (F) Figs, and (G) Grapes. The ordering σ = ⟦P, F, C, . . .⟧, for example, ranks Peas first, Figs second, Corn third and so on. A distribution h(σ), defined over the set of rankings, Sn, can be viewed as a joint distribution over the n variables (σ(1), . . . , σ(n)) (where σ(j) ∈ {1, . . . , n}), subject to mutual exclusivity constraints which stipulate that two objects cannot simultaneously map to the same rank, or alternatively, that two ranks cannot simultaneously be occupied by the same object (h(σ(i) = σ(j)) = 0 whenever i ≠ j).

Since there are n! rankings for n items, representing arbitrary distributions h is not viable for large n, and it is typical to restrict oneself to simplified classes of distributions. It might be tempting to think that probabilistic independence assertions would be a useful simplification for ranking problems (as naive Bayes models have been useful in other areas of machine learning). However, mutual exclusivity makes probabilistic independence a tricky notion for rankings, since each pair of items is constrained to map to different ranks. Another way to see this is to imagine constructing a graphical model representing h in which each node corresponds to the rank of an item. For each pair of nodes, mutual exclusivity induces an edge between those two nodes, leading to a fully connected graph (Huang et al., 2009).

Riffled independence, introduced recently in (Huang & Guestrin, 2009), turns out to be a more natural notion of independence in the ranking setting. Consider our example of jointly ranking a set containing both vegetables (denoted by set A) and fruits (denoted by set B) in order of preference. The idea of riffled independence is to first draw two independent rankings (one for vegetables, one for fruits), then to interleave the two sets to form a full ranking for the entire collection. The word riffle comes from the popular shuffling technique called the riffle shuffle, in which one cuts a deck of cards and interleaves the piles. For a formal definition, we establish the following notation about relative rankings and interleavings. Given a ranking σ ∈ Sn and a subset A ⊂ {1, . . . , n}, let φA(σ) denote the ranks of items in A relative to the set A. For example, in the ranking σ = ⟦P, L, F, G, C, O⟧, the relative ranking of the vegetables is φA(σ) = ⟦P, C⟧ = ⟦Peas, Corn⟧. Thus, while Corn is ranked fifth in σ, it is ranked second in φA(σ). Similarly, the relative ranking of the fruits is φB(σ) = ⟦L, F, G, O⟧ = ⟦Lemons, Figs, Grapes, Oranges⟧. Likewise, let τA,B(σ) denote the way in which the sets A and B are interleaved by σ. For example, using the same σ as above, the interleaving of vegetables and fruits is τA,B(σ) = ⟦Veg, Fruit, Fruit, Fruit, Veg, Fruit⟧. We will use ΩA,B to denote the collection of all distinct ways of interleaving the sets A and B. Note that if |A| = k, then |ΩA,B| = (n choose k).
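To make this notation concrete, the following minimal Python sketch (an illustration only; the representation of orderings as lists of item labels and the function names are our own choices for exposition) computes φA(σ) and τA,B(σ) from an ordering.

    # Illustrative helpers (for exposition only): compute the relative ranking
    # phi_A(sigma) and the interleaving tau_{A,B}(sigma) from an ordering given
    # as a list of item labels, e.g. ["P", "L", "F", "G", "C", "O"].

    def relative_ranking(ordering, A):
        """Return phi_A(sigma): the items of A in the order they appear in sigma."""
        A = set(A)
        return [item for item in ordering if item in A]

    def interleaving(ordering, A):
        """Return tau_{A,B}(sigma): for each rank, whether it is occupied by A or B."""
        A = set(A)
        return ["A" if item in A else "B" for item in ordering]

    sigma = ["P", "L", "F", "G", "C", "O"]        # the ordering from the text
    vegetables = {"C", "P"}
    print(relative_ranking(sigma, vegetables))    # ['P', 'C']
    print(interleaving(sigma, vegetables))        # ['A', 'B', 'B', 'B', 'A', 'B']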

Definition 1 (Huang and Guestrin (2009)). Let h be a distribution over Sn and consider a subset A ⊂ {1, . . . , n} (with |A| = k) and its complement B. The sets A and B are said to be riffle independent if h factors as:

h(σ) = m(τA,B(σ)) · fA(φA(σ)) · gB(φB(σ)), (2.1)

where m, fA, gB are distributions over ΩA,B, Sk, and Sn−k, respectively. We will refer to m as the interleaving distribution, and to fA and gB as the relative ranking distributions of A and B. We will also notate the relation as A ⊥m B.
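To make the generative reading of Equation 2.1 concrete, here is a minimal sampling sketch (ours, for illustration; the distributions m, fA, gB are represented naively as dictionaries mapping outcomes to probabilities, which is feasible only for very small sets, and the toy probabilities are made up).

    import random

    # Sketch of drawing a full ordering from a riffle independent model as in
    # Equation 2.1: draw relative orderings of A and B from f_A and g_B, draw
    # an interleaving from m, then merge.

    def sample_from(dist):
        """Draw a key from a dict mapping outcomes to probabilities."""
        outcomes = list(dist)
        weights = [dist[o] for o in outcomes]
        return random.choices(outcomes, weights=weights, k=1)[0]

    def riffle(slots, order_A, order_B):
        """Merge two relative orderings according to an interleaving pattern."""
        it_A, it_B = iter(order_A), iter(order_B)
        return tuple(next(it_A) if s == "A" else next(it_B) for s in slots)

    # Toy distributions over orderings of A = {C, P}, orderings of B = {F, G},
    # and interleavings of the two sets (probabilities chosen arbitrarily).
    f_A = {("P", "C"): 0.7, ("C", "P"): 0.3}
    g_B = {("F", "G"): 0.6, ("G", "F"): 0.4}
    m   = {("A", "B", "A", "B"): 0.5, ("B", "A", "A", "B"): 0.3, ("A", "A", "B", "B"): 0.2}

    print(riffle(sample_from(m), sample_from(f_A), sample_from(g_B)))   # e.g. ('P', 'F', 'C', 'G')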

Riffled independence can be seen as a generalization of full independence (when the interleaving distribution is deterministic). In addition to exploring theoretical properties, Huang and Guestrin showed that the riffled independence relation in fact holds (approximately) in certain datasets. In particular, they showed on a particular election dataset that the political coalitions in the candidate set were nearly riffle independent of each other. In other nonpolitical ranking datasets, however, it is not always obvious how one should partition the item set so that the riffle independent approximation is as close as possible to the true joint distribution. Is there any way to discover latent coalitions of items arising from riffled independence from ranked data? Before presenting our solution to this question, we discuss a more complicated but realistic scenario in which coalitions are further partitioned into smaller subgroups.

Figure 1. Examples of distinct hierarchical riffle independent structures: (a) a hierarchical riffled independence structure on S6, in which {C, P} (Vegetables) is riffle independent of {L, O, F, G} (Fruits), which further splits into {L, O} (Citrus) and {F, G} (Mediterranean); (b) another example with the same leaf sets, not equivalent to (a), in which {C, P, L, O} splits into {C, P} and {L, O} and is interleaved with {F, G}; (c) a hierarchical decomposition of {A, B, C, D} into singleton subsets (each leaf set consists of a single item).

3. Hierarchical decompositions

For a fixed subset A (and its complement B), there are O(k! + (n − k)! + (n choose k)) model parameters (specifying fA, gB, and m). Despite the fact that riffle independent models are much more compact compared to full models over rankings, it is still intractable to represent arbitrary riffle independent models for large n, and one must make further simplifications. For example, Huang and Guestrin discuss methods for representing useful parametric families of interleaving distributions with just a handful of parameters.

We will explore the other natural model simplification, which comes from the simple observation that the relative ranking distributions fA and gB are again distributions over rankings, and so the sets A and B can further be decomposed into two riffle independent subsets. We call such models hierarchical riffle independent decompositions. Continuing with our running example, one can imagine that the fruits are further partitioned into two sets, a set consisting of citrus fruits ((L) Lemons and (O) Oranges) and a set consisting of mediterranean fruits ((F) Figs and (G) Grapes). To generate a full ranking, one first draws rankings of the citrus and mediterranean fruits independently (⟦L, O⟧ and ⟦G, F⟧, for example). Secondly, the two sets are interleaved to form a ranking of all fruits (⟦G, L, O, F⟧). Finally, a ranking of the vegetables is drawn (⟦P, C⟧) and interleaved with the fruit rankings to form a full joint ranking: ⟦P, G, L, O, F, C⟧. Notationally, we can express the hierarchical decomposition as {P, C} ⊥m ({L, O} ⊥m {F, G}). We can also visualize hierarchies using trees (see Figure 1(a) for our example). The subsets of items which appear as leaves in the tree will be referred to as leaf sets.
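The same generative process extends recursively to a full hierarchy. The sketch below (again ours, for illustration; the nested-tuple representation of a hierarchy is an arbitrary choice) draws leaf orderings and interleaves them bottom-up. For brevity it uses uniformly random relative orderings and interleavings; a learned model would substitute its estimated distributions.

    import random

    # Sketch of sampling from a hierarchical riffle independent model. A
    # hierarchy is either a leaf (a list of items) or a pair (left, right) of
    # sub-hierarchies.

    def sample_hierarchy(node):
        if isinstance(node, list):                 # leaf set: random relative order
            order = node[:]
            random.shuffle(order)
            return order
        left, right = node
        order_l, order_r = sample_hierarchy(left), sample_hierarchy(right)
        slots = ["L"] * len(order_l) + ["R"] * len(order_r)
        random.shuffle(slots)                      # uniformly random interleaving
        it_l, it_r = iter(order_l), iter(order_r)
        return [next(it_l) if s == "L" else next(it_r) for s in slots]

    # The structure of Figure 1(a): vegetables vs. (citrus, mediterranean) fruits.
    hierarchy = (["C", "P"], (["L", "O"], ["F", "G"]))
    print(sample_hierarchy(hierarchy))             # e.g. ['P', 'G', 'L', 'O', 'F', 'C']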

A natural question to ask is: if we used a different hierarchy with the same leaf sets, would we capture the same distributions? For example, does a distribution which decomposes according to the tree in Figure 1(b) also decompose according to the tree in Figure 1(a)? The answer, in general, is no, due to the fact that distinct hierarchies impose different sets of independence assumptions, and as a result, different structures can be well or badly suited for modeling a given dataset. Consequently, it is important to use the correct structure if possible. We remark that while the two structures 1(a) and 1(b) capture distinct families of distributions, it is possible to identify a set of independence assumptions common to both structures.

4. Objective functions

Since different hierarchies impose different independence assumptions, we would like to find the structure that is best suited for modeling a given ranking dataset. The base problem that we now address is how to find the best structure if there is only one level of partitioning and two leaf sets, A, B. Alternatively, we want to find the topmost partitioning of the tree. In Section 5, we use this base case as part of a top-down approach for learning a full hierarchy.

Problem statement. Given a training set of rankings, σ^(1), . . . , σ^(m) ∼ h, in which a subset of items, A ⊂ {1, . . . , n}, is riffle independent of its complement, we would like to automatically determine the sets A and B ⊂ {1, . . . , n}. If h does not exactly factor riffle independently, then we would like to find the riffle independent approximation which is closest to h in some sense. Formally, we would like to solve the problem:

arg min_A min_{m,f,g} DKL( h(σ) || m(τA,B(σ)) f(φA(σ)) g(φB(σ)) ),   (4.1)

where h is the empirical distribution of training examples and DKL is the Kullback-Leibler divergence measure. Equation 4.1 is a seemingly reasonable objective since it can also be interpreted as maximizing the likelihood of the training data. If A and B are truly riffle independent, then 4.1 can be shown via the Gibbs inequality to attain its minimum, zero, at (and only at) the subsets A and B.

For small problems, one can actually solve Problem 4.1 using a single computer by evaluating the approximation quality of each subset A and taking the minimum, which was the approach taken in (Huang & Guestrin, 2009). However, for larger problems, one runs into time and sample complexity problems since optimizing the globally defined objective function (4.1) requires relearning all model parameters (m, fA, and gB) for each of the exponentially many subsets of {1, . . . , n}. In the remainder, we propose a more locally defined objective function, reminiscent of clustering, which we will use instead of Equation 4.1. As we show, our new objective will be more tractable to compute and have lower sample complexity for estimation.

Proposed objective function. The approach we take is to minimize a different measure that exploits the observation that absolute ranks of items in A are fully independent of relative ranks of items in B, and vice versa. With our vegetables and fruits, for example, knowing that Figs is ranked first among all six items (the absolute rank of a fruit) should give no information about whether Corn is preferred to Peas (the relative rank of vegetables). More formally, given a subset A = {a1, . . . , aℓ}, let σ(A) denote the vector of (absolute) ranks assigned to items in A by σ (thus, σ(A) = (σ(a1), σ(a2), . . . , σ(aℓ))). We propose to minimize an alternative objective function:

ℱ(A) ≡ I(σ(A) ; φB(σ)) + I(σ(B) ; φA(σ)),   (4.2)

where I denotes the mutual information (defined between two variables X1 and X2 by I(X1; X2) ≡ DKL(P(X1, X2) || P(X1)P(X2))). We can show that ℱ is guaranteed to detect riffled independence:

Proposition 2. ℱ(A) = 0 is a necessary and sufficient criterion for a subset A ⊂ {1, . . . , n} to be riffle independent of its complement, B.

Figure 2. Graphical depiction of the problem of finding riffle independent subsets. Note the similarities to clustering. Double bars represent the directed nature of the triangles.

As with Equation 4.1, optimizing ℱ is still intractable for large n. However, it motivates a natural proxy, in which we replace the mutual informations defined over all n variables by a sum of mutual informations defined over just three variables at a time. Given any triplet of distinct items, (i, j, k), let Ii;j,k ≡ I(σ(i) ; σ(j) < σ(k)). We define Ω^cross_A,B to be the set of triplets which cross from set A to set B:

Ω^cross_A,B ≡ {(i; j, k) : i ∈ A, j, k ∈ B}.

Ω^cross_B,A is similarly defined. We also define Ω^int_A to be the set of triplets that are internal to A:

Ω^int_A ≡ {(i; j, k) : i, j, k ∈ A},

and again, Ω^int_B is similarly defined. Our proxy objective function can be written as the sum of the mutual information evaluated over all of the crossing edges:

F(A) ≡ Σ_{(i,j,k)∈Ω^cross_A,B} Ii;j,k + Σ_{(i,j,k)∈Ω^cross_B,A} Ii;j,k.   (4.3)

F can be viewed as a low order version of ℱ, involving mutual information computations over triplets of variables at a time instead of n-tuples. The mutual information Ia;b,b′, for example, reflects how much the rank of a vegetable tells us about how two fruits compare. If A and B are riffle independent, then we know that Ia;b,b′ = 0 and Ib;a,a′ = 0 for any items a, a′ ∈ A and b, b′ ∈ B. The objective F is somewhat reminiscent of typical graph cut and clustering objectives. Instead of partitioning a set of nodes based on sums of pairwise similarities, we partition based on sums of tripletwise affinities. We show a graphical depiction of the problem in Figure 2, where cross triplets (in Ω^cross_A,B, Ω^cross_B,A) have low weight and internal triplets (in Ω^int_A, Ω^int_B) have high weight. The objective is to find a partition such that the sum over cross triplets is low. In fact, the problem of optimizing F can be seen as an instance of the weighted, directed hypergraph cut problem (Gallo et al., 1993). Note that the word directed is significant for us because, unlike typical clustering problems, our triplets are not symmetric (for example, Ii;j,k ≠ Ij;i,k), resulting in a nonstandard and poorly understood optimization problem.
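As a concrete reading of Equation 4.3, the following sketch (ours, for illustration) estimates each triplet mutual information Ii;j,k with a simple plug-in estimator from sample rankings and sums the estimates over crossing triplets; the regularized estimator we actually use is discussed below.

    from collections import Counter
    from itertools import combinations
    from math import log

    # Sketch of the proxy objective of Equation 4.3 computed from samples.
    # Each ranking is a dict mapping item -> rank. triplet_mi(i, j, k) is the
    # plug-in mutual information between the rank of item i and the event
    # sigma(j) < sigma(k).

    def triplet_mi(rankings, i, j, k):
        n = len(rankings)
        joint = Counter((s[i], s[j] < s[k]) for s in rankings)
        p_rank = Counter(s[i] for s in rankings)
        p_event = Counter(s[j] < s[k] for s in rankings)
        mi = 0.0
        for (r, e), c in joint.items():
            p = c / n
            mi += p * log(p / ((p_rank[r] / n) * (p_event[e] / n)))
        return mi

    def proxy_objective(rankings, A, B):
        """Sum of estimated I(i; j, k) over triplets crossing A -> B and B -> A."""
        return sum(triplet_mi(rankings, i, j, k)
                   for X, Y in ((A, B), (B, A))
                   for i in X
                   for j, k in combinations(Y, 2))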

Third-order detectability assumptions. When does F detect riffled independence? It is not difficult to see, for example, that F = 0 is a necessary condition for riffled independence, since A ⊥m B implies Ia;b,b′ = 0. We have:

Proposition 3. If A and B are riffle independent sets, then F(A) = 0.

However, the converse of Prop. 3 is not true in full generality without accounting for dependencies that involve larger subsets of variables. Just as the pairwise independence assumptions (that are commonly used for randomized algorithms (Motwani & Raghavan, 1996))¹ do not imply full independence between two sets of variables, there exist distributions which look riffle independent from tripletwise marginals but do not factor upon examining higher-order terms. Nonetheless, in most practical scenarios, we expect F = 0 to imply riffled independence.

Estimating F from samples. We have so far argued that F is a reasonable function for finding riffle independent subsets. However, since we only have access to samples rather than the true distribution h itself, it will only be possible to compute an approximation to the objective F. In particular, for every triplet of items, (i, j, k), we must compute an estimate of the mutual information Ii;j,k from i.i.d. samples drawn from h, and the main question is: how many samples will we need in order for the approximate version of F to remain a reasonable objective function?

In the following, we denote the estimated value of Ii;j,k by Îi;j,k. For each triplet, we use a regularized procedure due to Höffgen (1993) to estimate mutual information. We adapt his sample complexity bound to our problem below.

Lemma 4. For any fixed triplet (i, j, k), the mutual information Ii;j,k can be estimated to within an accuracy of ∆ with probability at least 1 − γ using S(∆, γ) ≡ O((n^2/∆^2) log^2(n/∆) log(n^4/γ)) i.i.d. samples and the same amount of time.

The approximate objective function is therefore:

F̂(A) ≡ Σ_{(i,j,k)∈Ω^cross_A,B} Îi;j,k + Σ_{(i,j,k)∈Ω^cross_B,A} Îi;j,k.

What we want to now show is that, if there exists a unique way to partition {1, . . . , n} into riffle independent sets, then given enough training examples, our approximation F̂ uniquely singles out the correct partition as its minimum with high probability. A class of riffle independent distributions for which the uniqueness requirement is satisfied consists of the distributions for which A and B are strongly connected according to the following definition.

¹A pairwise independent family of random variables is one in which any two members are marginally independent. Subsets with larger than two members may not necessarily factor independently, however.

Definition 5. A subset A ⊂ {1, . . . , n} is called ε-third-order strongly connected if, for every triplet i, j, k ∈ A with i, j, k distinct, we have Ii;j,k > ε.

If a set A is riffle independent of B and both sets are third-order strongly connected, then we can ensure that riffled independence is detectable from third-order terms and that the partition is unique. We have the following probabilistic guarantee.

Theorem 6. Let A and B be ε-third-order strongly connected riffle independent sets, and suppose |A| = k. Given S(∆, ε) ≡ O((n^4/ε^2) log^2(n^2/ε) log(n^4/γ)) i.i.d. samples, the minimum of F̂ is achieved at exactly the subsets A and B with probability at least 1 − γ.

If k is small compared to n, then one can actually derive a better bound: S(∆, ε) ≡ O((n^2/ε^2) log^2(n^2/ε) log(n^4/γ)).

Finding balanced partitions. We conclude this section with a practical extension of the basic objective function F. In practice, like the minimum cut objective for graphs, the tripletwise objective of Equation 4.3 has a tendency to prefer small partitions (either |A| or |B| very small) to more balanced partitions (|A|, |B| ≈ n/2) due to the fact that unbalanced partitions have fewer triplets that cross between A and B. The simplest way to avoid this bias is to optimize the objective function over subsets of a fixed size k. As we discuss in the next section, optimizing with a fixed k can be useful for building thin hierarchical riffle independent models. Alternatively, one can use a modified objective function that encourages more balanced partitions. For example, we use a variation of our objective function (inspired by the normalized cut criterion (Shi & Malik, 2000)) which has proven to be useful for detecting riffled independence when the size k is unknown.
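As an illustration of such a normalization (this particular form is one natural choice presented for exposition, not necessarily the exact variant used in our experiments), one can divide the total crossing-triplet mutual information by the number of crossing triplets, so that a partition is no longer preferred merely because it has fewer crossing terms.

    from itertools import combinations

    # Illustrative balanced variant: normalize the crossing-triplet mass by the
    # number of crossing triplets. `mi` maps an ordered triplet (i, j, k) to its
    # estimated mutual information I(sigma(i); sigma(j) < sigma(k)).

    def balanced_proxy_objective(mi, A, B):
        cross = [(i, j, k)
                 for X, Y in ((A, B), (B, A))
                 for i in X
                 for j, k in combinations(sorted(Y), 2)]
        return sum(mi[t] for t in cross) / max(len(cross), 1)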

5. Structure discovery algorithms.

Having now designed a function that is tractable to estimate from the perspectives of both computational and sample complexity, we turn to the problem of learning the hierarchical riffle independence structure from training examples. In this paper, we take a simple top-down approach in which the item sets are recursively partitioned by optimizing F until some stopping criterion is met. The question is: can F be optimized efficiently? We begin by discussing a restricted class of thin models for which we can tractably optimize F.


Learning thin chain models. By a k-thin chain model, we refer to a hierarchical structure in which the size of the smaller set at each split in the hierarchy is fixed to be a constant k ∼ O(1) and can therefore be expressed as:

(A1 ⊥m (A2 ⊥m (A3 ⊥m . . . ))), |Ai| = k, for all i.

We view thin chains as being somewhat analogous to thin junction tree models (Chechetka & Guestrin, 2007), in which cliques are never allowed to have more than k variables. To draw rankings from a thin chain model, one sequentially inserts items independently, one group of size k at a time, into the full ranking. The structure learning problem is therefore to discover how the items are partitioned into groups, which group is inserted first, which group is inserted second, and so on.

When the size k is known, the optimal partitioning of an item set can be found by exhaustively evaluating F over all k-subsets. Finding the global optimum of F is therefore guaranteed at each stage of the recursion. The time complexity for finding the optimal partition from m samples is O(kn^(k+2) + mn^3), accounting for precomputing the necessary mutual informations as well as optimization time, and is tractable if k is small.
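A direct implementation of this exhaustive step is short. The sketch below (ours, for illustration) assumes a scoring function such as the estimated proxy objective, enumerates all k-subsets, splits off the best one, and recurses on the remainder; the stopping criterion is omitted.

    from itertools import combinations

    # Sketch of thin chain structure learning for known k: exhaustively evaluate
    # objective(A, B) over all k-subsets of `items`, split off the minimizer,
    # and recurse on the remaining items. Returns the list of leaf sets in order.

    def learn_thin_chain(items, k, objective):
        items = list(items)
        if len(items) <= k:
            return [items]
        best_A = min((set(c) for c in combinations(items, k)),
                     key=lambda A: objective(A, set(items) - A))
        rest = [x for x in items if x not in best_A]
        return [sorted(best_A)] + learn_thin_chain(rest, k, objective)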

Handling arbitrary partitions using anchors.

When k is large, or even unknown, F cannot be optimized using exhaustive methods. Instead, we propose a simple algorithm for finding A and B based on the following observation. If an oracle could identify any two elements of the set A, say, a1, a2, in advance, then the quantity Ix;a1,a2 = I(x; a1 < a2) indicates whether the item x belongs to A or B, since Ix;a1,a2 is nonzero in the first case and zero in the second case.

For finite training sets, when I is only known approximately, one can sort the set {Ix;a1,a2 : x ≠ a1, a2} and, if k is known, take the k items closest to zero to be the set B. Since we compare all items against a1, a2, we refer to these two fixed items as anchors. Of course a1, a2 are not known in advance, but the above method can be repeated for every pair of items as anchors to produce a collection of O(n^2) candidate partitions. Each partition can then be scored using the approximate objective F̂, and a final optimal partition can be selected as the minimum over the candidates. See Algorithm 1. In cases when k is not known a priori, we evaluate partitions for all possible settings of k using F̂.

Algorithm 1 The Anchors method
  Input: training set σ^(1), . . . , σ^(m), k ≡ |A|.
  Estimate and cache Îi;j,k for all i, j, k;
  for all a1, a2 ∈ {1, . . . , n}, a1 ≠ a2 do
    Îk ← kth smallest value in {Îx;a1,a2 : x ≠ a1, a2};
    Aa1,a2 ← {x : Îx;a1,a2 ≤ Îk};
  end for
  Abest ← arg min_{a1,a2} F̂(Aa1,a2);
  output Abest;

Since the anchors method does not require searching over subsets, it can be significantly faster (with O(n^3 m) time complexity) than an exhaustive optimization of F. Moreover, by assuming ε-third-order strong connectivity as in the previous section, one can use similar arguments to derive sample complexity bounds. We remark that there are practical differences that can at times make the anchors method somewhat less robust than an exhaustive search. Conceptually, anchoring works well when there exist two elements that are strongly connected with all of the other elements in their set, whereas an exhaustive search can work well in weaker conditions, such as when items are strongly connected through longer paths. We show in our experiments that the anchors method can nonetheless be quite effective for learning hierarchies.
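For completeness, here is a direct Python rendering of Algorithm 1 (ours, for illustration; it takes the cached estimates Îi;j,k as a dictionary and follows the pseudocode above, scoring each candidate partition with the approximate objective F̂).

    from itertools import combinations, permutations

    # Sketch of the Anchors method (Algorithm 1). `mi` maps an ordered triplet
    # (i, j, k) to the cached estimate of I(sigma(i); sigma(j) < sigma(k)); `k`
    # is the desired size of A.

    def f_hat(mi, A, B):
        """Approximate objective: sum of cached estimates over crossing triplets."""
        return sum(mi[(i, j, kk)]
                   for X, Y in ((A, B), (B, A))
                   for i in X
                   for j, kk in combinations(sorted(Y), 2))

    def anchors(items, k, mi):
        best_A, best_score = None, float("inf")
        for a1, a2 in permutations(items, 2):
            rest = sorted((x for x in items if x not in (a1, a2)),
                          key=lambda x: mi[(x, a1, a2)])
            candidate_A = set(rest[:k])                  # k items closest to zero
            score = f_hat(mi, candidate_A, set(items) - candidate_A)
            if score < best_score:
                best_A, best_score = candidate_A, score
        return best_A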

6. Experiments

Synthetic data. We first applied our methods to synthetic data to show that, given enough samples, our algorithms do effectively recover the optimal hierarchical structures. For various settings of n, we simulated data drawn jointly from a k-thin chain model (for k = 4) with a random parameter setting for each structure and applied our exact method for learning thin chains to each sampled dataset. First, we investigated the effect of varying sample size on the proportion of trials (out of fifty) for which our algorithms were able to (a) recover the full underlying tree structure exactly, (b) recover the topmost partition correctly, or (c) recover all leaf sets correctly (but possibly out of order). Figure 3(a) shows the result for an item set of size n = 16. Figure 3(b) shows, as a function of n, the number of samples that were required in the same experiments to (a) exactly recover the full underlying structure or (b) recover the correct leaf sets, for at least 90% of the trials. What we can observe from the plots is that, given enough samples, reliable structure recovery is indeed possible. It is also interesting to note that recovery of the correct leaf sets can be done with far fewer samples than are required for recovering the full hierarchical structure of the model.

Figure 3. Simulated data experiments: (a) success rate for structure recovery vs. sample size (n = 16, k = 4); (b) number of samples required for structure recovery vs. number of items n; (c) test set log-likelihood comparison; (d) Anchors algorithm success rate (n = 16, unknown k).

After learning a structure for each dataset, we learned model parameters and evaluated the log-likelihood of each model on 200 test examples drawn from the true distributions. In Figure 3(c), we compare log-likelihood performance when (a) the true structure is given (but not parameters), (b) a k-thin chain is learned with known k, and (c) when we use a randomly generated 1-chain structure. As expected, knowing the true structure results in the best performance, and the 1-chain is overconstrained. However, our structure learning algorithm is eventually able to catch up to the performance of the true structure given enough samples. It is also interesting to note that the jump in performance at the halfway point in the plot coincides with the jump in the success rate of discovering all leaf sets correctly; we conjecture that performance is sometimes less sensitive to the actual hierarchy used, as long as the leaf sets have been correctly discovered.

To test the Anchors algorithm, we ran the same simulation using Algorithm 1 on data drawn from hierarchical models with no fixed k. We generated roughly balanced structures, meaning that item sets were recursively partitioned into (almost) equally sized subsets at each level of the hierarchy. From Figure 3(d), we see that the Anchors algorithm can also discover the true structure given enough samples. Interestingly, the difference in sample complexity for discovering leaf sets versus discovering the full tree is not nearly as pronounced as in Figure 3(a). We believe that this is due to the fact that the balanced trees have less depth than the thin chains, leading to fewer opportunities for our greedy top-down approach to commit errors.

Election data. We now demonstrate our methods with real datasets. A number of electoral systems around the world require voters to provide a ranking of a set of candidates in order of preference. Diaconis (1989), for example, analyzed a 1980 presidential election of the American Psychological Association (APA) consisting of 5738 rankings of five candidates (W. Bevan, I. Iscoe, C. Kiesler, M. Siegle, and L. Wright). We applied our methods to learning a hierarchy from the APA data. Since there are only five candidates, we tried both the exhaustive optimization algorithm as well as the anchors algorithm. Both methods recovered the same structure in roughly the same time (shown in Figure 5), which is also the structure that Huang and Guestrin (2009) obtained by performing a brute-force search over partitions using the KL-divergence objective (Equation 4.1). The leaf sets of the learned hierarchy in fact reflect three coalitions in the APA, made up of research, clinical and community psychologists respectively.

Figure 4. Irish election data experiments: (a) stable features from recovered hierarchies vs. sample size; (b) test log-likelihood vs. sample size.

Next, we applied our algorithms to a larger Irish House of Parliament (Dáil Éireann) election dataset from the Meath constituency in Ireland. See Gormley and Murphy (2006) for more election details (including candidate names) as well as an alternative analysis. There were 14 candidates in the 2002 election belonging to the two major rival political parties, Fianna Fáil and Fine Gael, as well as a number of smaller parties. We used a subset of 2500 fully ranked ballots from the election. As with the APA data, both the exhaustive optimization of F and the anchors algorithm returned the same tree, with running times of 69.7 seconds and 2.1 seconds respectively (not including the 3.1 seconds required for precomputing mutual informations). The resulting tree, with candidates enumerated alphabetically from 1 through 14, is shown (only up to depth 4) in Figure 6. As expected, the candidates belonging to the two major parties, Fianna Fáil and Fine Gael, are neatly partitioned into their own leaf sets. The topmost leaf is the Sinn Fein candidate, indicating that voters tended to insert him into the ranking independently of all of the other 13 candidates.

Figure 5. Hierarchy learned from APA election data. The three leaf sets correspond to the research, clinical, and community coalitions.

Figure 6. Hierarchy learned from Irish election data, with leaf sets corresponding to Sinn Fein, Christian Solidarity, Fianna Fáil, Fine Gael, Independent, Labour, and Green candidates. See (Gormley & Murphy, 2006) for a list of candidates. The candidates are ordered alphabetically in this tree.

To understand the behavior of our algorithm with smaller sample sizes, we looked for features of the tree from Figure 6 which remained stable even when learning with smaller sample sizes. In Figure 4(a), we subsample from the original training set 100 times at different sample sizes and plot the proportion of learned hierarchies which (a) recover the Sinn Fein candidate as the topmost leaf, (b) partition the two major parties into leaf sets, and (c) agree with the original tree on all leaf sets. Note that even with few training examples, candidates belonging to the major parties are correctly grouped together, indicating strong party influence in voting behavior.

Finally, we compared the results between learning a general hierarchy (without fixed k) and learning a 1-thin chain model on the Irish data. Figure 4(b) shows the log-likelihoods achieved by both models on a held-out test set as the training set size increases. For each training set size, we subsampled the Irish dataset 100 times to produce confidence intervals. Again, even with small sample sizes, the hierarchy outperforms the 1-chain and continually improves with more and more training data. One might think that the hierarchical models, which use more parameters, are prone to overfitting, but in practice, the models learned by our algorithm devote most of the extra parameters towards modeling the correlations among the two major parties. As our results suggest, such intraparty ranking correlations are crucial for achieving good modeling performance.

7. Conclusion

We have investigated an intuitive ranking model based on hierarchical riffled independence decompositions and have shown that the structure of such hierarchies can be efficiently learned from data using our proposed algorithm, which can automatically find riffle independent partitions within item sets.

Like Bayesian networks, cliques of variables for hierarchical riffled independence models can often form intuitive groupings, such as in the election datasets that we analyzed in this paper. However, for problems in which the item groupings are nonobvious, structure learning is of fundamental importance, and we believe that our contributions in this paper open the door to further research into the problem of decomposing huge distributions into tractable pieces, much like Bayesian networks have done for other distributions.

Acknowledgements

This work is supported in part by the ONR under MURI N000140710747, and the Young Investigator Program grant N00014-08-1-0752. We thank K. El-Arini for feedback, and B. Murphy and C. Gormley for datasets. Discussions with M. Meila provided valuable ideas upon which this work is based.

References

Chechetka, A. and Guestrin, C. Efficient principled learning of thin junction trees. In Advances in Neural Information Processing Systems 21, 2007.

Diaconis, P. A generalization of spectral analysis with application to ranked data. The Annals of Statistics, 17(3):949–979, 1989.

Gallo, G., Longo, G., Pallottino, S., and Nguyen, S. Directed hypergraphs and applications. Discrete Applied Mathematics, 42(2-3), 1993.

Gormley, C. and Murphy, B. A latent space model for rank data. In 23rd International Conference on Machine Learning, 2006.

Höffgen, K. U. Learning and robust learning of product distributions. In Sixth Annual Conference on Learning Theory, 1993.

Huang, J. and Guestrin, C. Riffled independence for ranked data. In Advances in Neural Information Processing Systems 23, 2009.

Huang, J., Guestrin, C., Jiang, X., and Guibas, L. Exploiting probabilistic independence for permutations. In Artificial Intelligence and Statistics, JMLR W&CP 5, 2009.

Motwani, R. and Raghavan, P. Randomized algorithms. ACM Computing Surveys, 28(1), 1996.

Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 2000.