Page 1:

Ranking, Aggregation, and You

Lester Mackey†

Collaborators: John C. Duchi† and Michael I. Jordan∗

†Stanford University ∗UC Berkeley

October 5, 2014

Page 2:

A simple question

- On a scale of 1 (very white) to 10 (very black), how black is this box?

- Which box is blacker?

Page 5:

Another question

On a scale of 1 to 10, how relevant is this result for the query flowers?

Page 8:

What have we learned?

1. We are good at pairwise comparisons

- Much worse at absolute relevance judgments [Miller, 1956, Shiffrin and Nosofsky, 1994, Stewart, Brown, and Chater, 2005]

2. We are good at expressing sparse, partial preferences

- Much worse at expressing complete preferences

Complete preferences:

ftd.com

en.wikipedia.org/...

1800flowers.com

What you express:

ftd.com

en.wikipedia.org/...

1800flowers.com

Page 11:

Ranking

Goal: Order a set of items/results to best match your preferences

- Web search: Return the most relevant URLs for user queries

- Recommendation systems:
  - Movies to watch based on a user's past ratings
  - News articles to read based on past browsing history
  - Items to buy based on a patron's or other patrons' purchases

Page 14:

Ranking procedures

Goal: Order a set of items/results to best match your preferences

1. Tractable: Run in polynomial time

2. Consistent: Recover true preferences given sufficient data

3. Realistic: Make use of ubiquitous partial preference data

Past work: 1+2 are possible given complete preference data [Ravikumar, Tewari, and Yang, 2011, Buffoni, Calauzenes, Gallinari, and Usunier, 2011]

This work [Duchi, Mackey, and Jordan, 2013]:

- Standard (tractable) procedures for ranking with partial preferences are inconsistent

- Aggregating partial preferences into more complete preferences can restore consistency

- New estimators based on U-statistics achieve 1+2+3

Page 22:

Outline

Supervised Ranking
  Formal definition
  Tractable surrogates
  Pairwise inconsistency

Aggregation
  Restoring consistency
  Estimating complete preferences

U-statistics
  Practical procedures
  Experimental results

Page 24:

Supervised ranking

Observe: Sequence of training examples

- Query Q: e.g., search term "flowers"

- Set of m items IQ to rank
  - e.g., websites {1, 2, 3, 4}

- Label Y representing some preference structure over the items
  - Item 1 preferred to {2, 3} and item 3 to 4

[Figure: example label Y as a preference graph on items {1, 2, 3, 4}, with edges y12, y13, y34]

Page 29:

Supervised ranking

Observe: (Q1, Y1), . . . , (Qn, Yn)

Learn: Scoring function f to induce item rankings for each query

- Real-valued score for each item i in the item set IQ:

αi := fi(Q)

- Vector of scores f(Q) induces a ranking over IQ:

i ranked above j ⇐⇒ αi > αj
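To make the score-to-ranking step concrete, here is a minimal Python sketch (not from the slides; the items and score values are illustrative):

# A scoring function's output induces a ranking by sorting.
# The items and scores below are made up for illustration.
scores = {"ftd.com": 2.3, "en.wikipedia.org/...": 1.7, "1800flowers.com": 0.9}

# Item i is ranked above j exactly when alpha_i > alpha_j.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['ftd.com', 'en.wikipedia.org/...', '1800flowers.com']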

Page 31:

Supervised ranking

Example: Scoring function f with scores

f1(Q) > f2(Q) > f3(Q)

induces the same ranking as the preference graph Y

[Figure: preference graph Y with edges 1 → 2 → 3, matched by the score inequalities f1(Q) > f2(Q) and f2(Q) > f3(Q)]

Page 32:

Supervised ranking

Observe: (Q1, Y1), . . . , (Qn, Yn)

Learn: Scoring function f to predict item ranking

Suffer loss: L(f(Q), Y)

- Encodes discord between the observed label Y and the prediction f(Q)

- Depends on the specific ranking task and available data

Page 33:

Supervised ranking

Example: Pairwise loss

- Let Y = the (weighted) adjacency matrix for a preference graph
  - Yij = the preference weight on edge (i, j)

- Let α = f(Q) be the predicted scores for query Q

- Then L(α, Y) = ∑_{i≠j} Yij · 1(αi ≤ αj)
  - Imposes a penalty for each misordered edge

[Figure: preference graph on items {1, 2, 3, 4} with edges y12, y13, y34]

L(α, Y) = Y12 · 1(α1 ≤ α2) + Y13 · 1(α1 ≤ α3) + Y34 · 1(α3 ≤ α4)
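A minimal Python sketch of this pairwise loss, using the slide's example graph (the score values are illustrative, and Y is stored as a dense weight matrix):

import numpy as np

def pairwise_loss(alpha, Y):
    # L(alpha, Y) = sum over i != j of Y[i, j] * 1(alpha[i] <= alpha[j])
    m = len(alpha)
    return sum(Y[i, j] for i in range(m) for j in range(m)
               if i != j and alpha[i] <= alpha[j])

# The slide's edges y12, y13, y34, written 0-indexed.
Y = np.zeros((4, 4))
Y[0, 1] = Y[0, 2] = Y[2, 3] = 1.0
alpha = np.array([3.0, 2.0, 1.0, 2.5])  # misorders only the edge 3 -> 4
print(pairwise_loss(alpha, Y))          # 1.0: only Y[2, 3] is penalized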

Page 37:

Supervised ranking

Observe: (Q1, Y1), . . . , (Qn, Yn)

Learn: Scoring function f to rank items

Suffer loss: L(f(Q), Y)

Goal: Minimize the risk R(f) := E[L(f(Q), Y)]

Main Question: Are there tractable ranking procedures that minimize R as n → ∞?

Page 39:

Tractable ranking

First try: Empirical risk minimization ← Intractable!

min_f Rn(f) := En[L(f(Q), Y)] = (1/n) ∑_{k=1}^n L(f(Qk), Yk)

Idea: Replace the loss L(α, Y) with a convex surrogate ϕ(α, Y)

L(α, Y) = ∑_{i≠j} Yij · 1(αi ≤ αj)          (Hard)

ϕ(α, Y) = ∑_{i≠j} Yij · φ(αi − αj)          (Tractable)
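As an illustration, a sketch of the surrogate with the logistic choice φ(t) = log(1 + e^(−t)), which is convex, non-increasing, and has φ′(0) = −1/2 < 0; this particular φ is my example, not one prescribed by the slides:

import numpy as np

def phi(t):
    # Logistic surrogate: convex, non-increasing, phi'(0) = -1/2 < 0.
    return np.log1p(np.exp(-t))

def surrogate_loss(alpha, Y):
    # phi(alpha, Y) = sum over i != j of Y[i, j] * phi(alpha[i] - alpha[j])
    m = len(alpha)
    return sum(Y[i, j] * phi(alpha[i] - alpha[j])
               for i in range(m) for j in range(m) if i != j)

Y = np.zeros((4, 4))
Y[0, 1] = Y[0, 2] = Y[2, 3] = 1.0
alpha = np.array([3.0, 2.0, 1.0, 2.5])
print(surrogate_loss(alpha, Y))  # a smooth, convex penalty in place of the 0/1 indicator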

Page 44:

Surrogate ranking

Idea: Empirical surrogate risk minimization

min_f Rϕ,n(f) := En[ϕ(f(Q), Y)] = (1/n) ∑_{k=1}^n ϕ(f(Qk), Yk)

- If ϕ is convex, then the minimization is tractable

- argmin_f Rϕ,n(f) → argmin_f Rϕ(f) := E[ϕ(f(Q), Y)] as n → ∞

Main Question: Are these tractable ranking procedures consistent?
⇐⇒ Does argmin_f Rϕ(f) also minimize the true risk R(f)?

Page 49:

Classification consistency

Consider the special case of classification

- Observe: query X, items {0, 1}, label Y01 = 1 or Y10 = 1

- Pairwise loss: L(α, Y) = Y01 · 1(α0 ≤ α1) + Y10 · 1(α1 ≤ α0)

- Surrogate loss: ϕ(α, Y) = Y01 · φ(α0 − α1) + Y10 · φ(α1 − α0)

Theorem: If φ is convex, the procedure based on minimizing φ is consistent if and only if φ′(0) < 0. [Bartlett, Jordan, and McAuliffe, 2006]

⇒ Tractable consistency for boosting, SVMs, logistic regression

Page 54:

Ranking consistency?

Good news: Can characterize surrogate ranking consistency

Theorem:¹ The procedure based on minimizing ϕ is consistent ⇐⇒

min_α { E[ϕ(α, Y) | q] : α ∉ argmin_{α′} E[L(α′, Y) | q] } > min_α E[ϕ(α, Y) | q]

- Translation: ϕ is consistent if and only if minimizing the conditional surrogate risk gives the correct ranking for every query

¹ [Duchi, Mackey, and Jordan, 2013]

Page 56:

Ranking consistency?

Bad news: The consequences are dire...

Consider the pairwise loss:

L(α, Y) = ∑_{i≠j} Yij · 1(αi ≤ αj)

Task: Find argmin_α E[L(α, Y) | q]

- Classification (two node) case: Easy
  - Choose α0 > α1 ⇐⇒ P[Class 0 | q] > P[Class 1 | q]

- General case: NP-hard
  - Unless P = NP, must restrict the problem for tractable consistency

Page 61:

Low noise distribution

Define: Average preference for item i over item j:

sij = E[Yij | q]

- We say i ≻ j on average if sij > sji

Definition (Low noise distribution): If i ≻ j on average and j ≻ k on average, then i ≻ k on average.

[Figure: items 1, 2, 3 with average preferences s12, s23, s13, s31; low noise ⇒ s13 > s31]

- No cyclic preferences on average

- Find argmin_α E[L(α, Y) | q]: Very easy
  - Choose αi > αj ⇐⇒ sij > sji
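Under the low-noise condition, one way to realize "choose αi > αj ⇐⇒ sij > sji" is to score each item by how many others it beats on average; a sketch, assuming every pair has a strict average preference (the matrix S below is illustrative):

import numpy as np

def scores_from_average_prefs(S):
    # Given S with S[i, j] = s_ij = E[Y_ij | q], return scores alpha with
    # alpha_i > alpha_j whenever s_ij > s_ji. Valid when the average
    # preferences are acyclic and strict for every pair.
    wins = (S > S.T).sum(axis=1)   # number of items that i beats on average
    return wins.astype(float)      # use the win counts as the scores alpha

# Acyclic example: item 1 beats 2 and 3, item 2 beats 3 (0-indexed).
S = np.array([[0.0, 0.7, 0.6],
              [0.3, 0.0, 0.8],
              [0.4, 0.2, 0.0]])
print(scores_from_average_prefs(S))  # [2. 1. 0.] -> ranking 1, 2, 3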

Page 64:

Ranking consistency?

Pairwise ranking surrogate: [Herbrich, Graepel, and Obermayer, 2000, Freund, Iyer, Schapire, and Singer, 2003, Dekel, Manning, and Singer, 2004]

ϕ(α, Y) = ∑_{ij} Yij · φ(αi − αj)

for φ convex with φ′(0) < 0. Common in the ranking literature.

Theorem: ϕ is not consistent, even in low noise settings. [Duchi, Mackey, and Jordan, 2013]

⇒ Inconsistency for RankBoost, RankSVM, Logistic Ranking...

Page 74:

Ranking with pairwise data is challenging

- Inconsistent in general (unless P = NP)

- Low noise distributions:
  - Inconsistent for standard convex losses

    ϕ(α, Y) = ∑_{ij} Yij · φ(αi − αj)

  - Inconsistent for margin-based convex losses

    ϕ(α, Y) = ∑_{ij} φ(αi − αj − Yij)

Question: Do tractable consistent losses exist for partial preference data?

Yes, if we aggregate!

Page 75:

Outline

Supervised Ranking
  Formal definition
  Tractable surrogates
  Pairwise inconsistency

Aggregation
  Restoring consistency
  Estimating complete preferences

U-statistics
  Practical procedures
  Experimental results

Page 76:

An observation

Can rewrite the risk of the pairwise loss:

E[L(α, Y) | q] = ∑_{i≠j} sij · 1(αi ≤ αj) = ∑_{i≠j} max{sij − sji, 0} · 1(αi ≤ αj)

where sij = E[Yij | q].

- Only depends on the net expected preferences: sij − sji

Consider the surrogate

ϕ(α, s) := ∑_{i≠j} max{sij − sji, 0} · φ(αi − αj)  ≠  ∑_{i≠j} sij · φ(αi − αj)

for φ non-increasing and convex, with φ′(0) < 0.

- Either i → j is penalized or j → i, but not both

- Consistent whenever the average preferences are acyclic
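A sketch of this net-preference surrogate in Python, again with a logistic φ as an illustrative choice (the matrix S of average preferences is made up):

import numpy as np

def phi(t):
    return np.log1p(np.exp(-t))  # convex, non-increasing, phi'(0) < 0

def aggregate_surrogate(alpha, S):
    # phi(alpha, s) = sum over i != j of max(s_ij - s_ji, 0) * phi(alpha_i - alpha_j).
    # Only the net preference between i and j is penalized, never both directions.
    m = len(alpha)
    return sum(max(S[i, j] - S[j, i], 0.0) * phi(alpha[i] - alpha[j])
               for i in range(m) for j in range(m) if i != j)

S = np.array([[0.0, 0.7, 0.6],
              [0.3, 0.0, 0.8],
              [0.4, 0.2, 0.0]])     # illustrative average preferences s_ij
alpha = np.array([2.0, 1.0, 0.0])
print(aggregate_surrogate(alpha, S))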

Page 81:

What happened?

Old surrogates: E[ϕ(α, Y) | q] = lim_{k→∞} (1/k) ∑_k ϕ(α, Yk)

- Loss ϕ(α, Y) applied to a single datapoint

New surrogates: ϕ(α, E[Y | q]) = lim_{k→∞} ϕ(α, (1/k) ∑_k Yk)

- Loss applied to an aggregation of many datapoints

New framework: Ranking with aggregate losses

L(α, sk(Y1, . . . , Yk))  and  ϕ(α, sk(Y1, . . . , Yk))

where sk is a structure function that aggregates the first k datapoints

- sk combines partial preferences into more complete estimates

- The consistency characterization extends to this setting
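One concrete structure function matching the limit above is the running average of the observed preference matrices, sk(Y1, . . . , Yk) = (1/k) ∑ Yi; a sketch, with illustrative sparse labels:

import numpy as np

def average_structure(Ys):
    # s_k(Y_1, ..., Y_k) = (1/k) * sum of the Y_i: aggregate the partial
    # preference matrices for one query into an estimate of E[Y | q].
    return sum(Ys) / len(Ys)

# Three sparse partial-preference matrices for the same query (illustrative).
Y1 = np.zeros((3, 3)); Y1[0, 1] = 1.0             # one user: 1 preferred to 2
Y2 = np.zeros((3, 3)); Y2[1, 2] = 1.0             # another: 2 preferred to 3
Y3 = np.zeros((3, 3)); Y3[0, 1] = Y3[0, 2] = 1.0  # a third: 1 preferred to 2 and 3
print(average_structure([Y1, Y2, Y3]))            # entrywise averages estimate s_ij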

Page 86:

Aggregation via structure function

[Figure: a sequence of partial preference graphs Y1, Y2, . . . , Yk over items {1, 2, 3, 4} is combined by the structure function into sk(Y1, . . . , Yk), a more complete preference graph]

Question: When does aggregation help?

Page 88:

Complete data losses

- Normalized Discounted Cumulative Gain (NDCG)
- Precision, Precision@k
- Expected reciprocal rank (ERR)

Pros: Popular, well-motivated, admit tractable consistent surrogates
- e.g., penalize mistakes at the top of the ranked list more heavily

Cons: Require complete preference data

Idea:
- Use aggregation to estimate complete preferences from partial preferences
- Plug the estimates into consistent surrogates
- Check that aggregation + surrogacy retains consistency
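For reference, a sketch of one common form of NDCG (gain 2^rel − 1, log2 position discount); note that it needs a full vector of relevances, which is exactly the complete-data requirement above:

import numpy as np

def dcg(relevances):
    rel = np.asarray(relevances, dtype=float)
    positions = np.arange(1, len(rel) + 1)
    return np.sum((2.0 ** rel - 1.0) / np.log2(positions + 1))

def ndcg(relevances_in_predicted_order):
    # NDCG = DCG of the predicted ordering / DCG of the ideal ordering.
    ideal = sorted(relevances_in_predicted_order, reverse=True)
    denom = dcg(ideal)
    return dcg(relevances_in_predicted_order) / denom if denom > 0 else 0.0

# Relevance of each item, listed in the predicted rank order (illustrative).
print(ndcg([3, 1, 2, 0]))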

Page 93:

Cascade model for click data [Craswell, Zoeter, Taylor, and Ramsey, 2008, Chapelle, Metzler, Zhang, and Grinspan, 2009]

- Person i clicks on the first relevant result, k(i)

- Relevance probability of item k is pk

- Probability of a click on item k is

  pk · ∏_{j=1}^{k−1} (1 − pj)

- The ERR loss assumes p is known

Estimate p via maximum likelihood on n clicks:

s = argmax_{p ∈ [0,1]^m} ∑_{i=1}^n [ log p_{k(i)} + ∑_{j=1}^{k(i)−1} log(1 − pj) ]

⇒ Consistent ERR minimization under our framework
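Under this cascade model the maximum-likelihood estimate has a simple closed form: p̂_j = (sessions that clicked position j) / (sessions that examined position j, i.e., clicked at j or later). A sketch, assuming items are shown in a fixed order and every session ends in a click:

import numpy as np

def cascade_mle(click_positions, m):
    # Closed-form MLE for the cascade model: p_hat[j] = clicks at position j
    # divided by sessions that reached position j (clicked at j or later).
    # Positions are 1-indexed, as in k(i); m is the number of items shown.
    clicks = np.asarray(click_positions)
    p_hat = np.zeros(m)
    for j in range(1, m + 1):
        reached = np.sum(clicks >= j)        # sessions that examined position j
        if reached > 0:
            p_hat[j - 1] = np.sum(clicks == j) / reached
    return p_hat

# Illustrative click data: first-click positions k(i) for n = 6 sessions.
print(cascade_mle([1, 2, 1, 3, 1, 2], m=5))  # e.g., p_hat[0] = 3/6 = 0.5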

Page 98:

Benefits of aggregation

- Tractable consistency for partial preference losses:

  argmin_f lim_{k→∞} E[ϕ(f(Q), sk(Y1, . . . , Yk))]
  ⇒ argmin_f lim_{k→∞} E[L(f(Q), sk(Y1, . . . , Yk))]

- Use complete data losses with realistic partial preference data

- Models the process of generating relevance scores from clicks/comparisons

Page 99: Ranking, Aggregation, and Youlmackey/papers/rank... · Ranking, Aggregation, and You Lester Mackeyy Collaborators: John C. Duchiyand Michael I. Jordan yStanford University UC Berkeley

What remains?Before aggregation, we had

argminf

1

n

∑n

k=1ϕ(f(Qk), Yk)

︸ ︷︷ ︸empirical

→ argminf

E[ϕ(f(Q), Y )]︸ ︷︷ ︸population

What’s a suitable empirical analogue Rϕ,n(f) with aggregation?

⇐⇒When does

argminf

Rϕ,n(f)︸ ︷︷ ︸empirical

→ argminf

limk→∞

E[ϕ(f(Q), sk(Y1, . . . , Yk))]︸ ︷︷ ︸

population

?

Page 102: Ranking, Aggregation, and Youlmackey/papers/rank... · Ranking, Aggregation, and You Lester Mackeyy Collaborators: John C. Duchiyand Michael I. Jordan yStanford University UC Berkeley

Outline

I Supervised Ranking: Formal definition · Tractable surrogates · Pairwise inconsistency

I Aggregation: Restoring consistency · Estimating complete preferences

I U-statistics: Practical procedures · Experimental results


Data with aggregation

[Diagram: queries q1, . . . , q5, each paired with its preference judgments Y1, Y2, Y3, . . . ; query q contributes n_q judgments]

I Datapoint consists of query q and preference judgment Y
I n_q datapoints for query q (see the sketch below)
I Structure functions for aggregation:

    s(Y_1, Y_2, . . . , Y_k)
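A minimal sketch of this data layout (names are illustrative, not from the slides): group raw (query, judgment) records by query, so that data[q] holds the n_q partial judgments for q.

```python
from collections import defaultdict

def group_by_query(records):
    """records: iterable of (q, y) pairs, where y is a partial judgment
    such as a pairwise comparison or a clicked position."""
    data = defaultdict(list)
    for q, y in records:
        data[q].append(y)
    return data  # data[q] = [Y_1, ..., Y_{n_q}]
```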


Data with aggregation

I Simple idea: for query q, aggregate all Y_1, Y_2, . . . , Y_{n_q}
I Loss ϕ for query q is

    n_q · ϕ(α, s(Y_1, . . . , Y_{n_q}))

Cons:
I Requires detailed knowledge of ϕ and s_k(Y_1, . . . , Y_k) as k → ∞

Ideal procedure:
I Agnostic to form of aggregation
I Take advantage of independence of Y_1, Y_2, . . .


Digression: U-statistics

[Diagram: the n_q judgments for a query q, with a size-k subset singled out]

I U-statistic: classical tool in statistics
I Given X_1, . . . , X_n, estimate E[g(X_1, . . . , X_k)] for g symmetric
I Idea: average the estimates from all size-k subsets of the data (see the sketch below)

    U_n = (n choose k)^{−1} ∑_{i_1<···<i_k} g(X_{i_1}, X_{i_2}, . . . , X_{i_k})

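A minimal sketch of the complete U-statistic (assuming NumPy; the kernel g and data are placeholders). With k = 2 and g(x, y) = (x − y)²/2, for example, U_n is the usual unbiased sample variance.

```python
import itertools
import numpy as np

def u_statistic(X, g, k):
    """Average the symmetric kernel g over all size-k subsets of X."""
    idx_sets = itertools.combinations(range(len(X)), k)
    return float(np.mean([g(*(X[i] for i in idx)) for idx in idx_sets]))

# Example: kernel (x - y)^2 / 2 recovers the unbiased sample variance.
X = np.random.default_rng(0).normal(size=50)
var_hat = u_statistic(X, lambda x, y: 0.5 * (x - y) ** 2, k=2)
```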

Data with aggregation: U-statistic in the loss

[Diagram: the n_q judgments for a query q, with a size-k subset singled out]

I Target: E[ϕ(α, s(Y_1, . . . , Y_k)) | q]
I Idea: estimate it with a U-statistic (sketched below):

    (n_q choose k)^{−1} ∑_{i_1<···<i_k} ϕ(α, s(Y_{i_1}, . . . , Y_{i_k}))

I Empirical risk for scoring function f:

    R_{ϕ,n}(f) = (1/n) ∑_q n_q (n_q choose k)^{−1} ∑_{i_1<···<i_k} ϕ(f(q), s(Y_{i_1}, . . . , Y_{i_k}))
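A minimal sketch of the per-query term (ϕ and s are placeholders for whatever surrogate and structure function are in use). Enumerating all (n_q choose k) subsets gives the U-statistic on the slide; passing n_subsets subsamples them instead, an "incomplete" U-statistic used here only as a speed shortcut, not something the slide prescribes.

```python
import itertools
import numpy as np

def per_query_risk(alpha, Y, s, phi, k, n_subsets=None, seed=0):
    """Estimate E[phi(alpha, s(Y_1, ..., Y_k)) | q] from the n_q judgments Y."""
    n_q = len(Y)
    if n_subsets is None:
        subsets = itertools.combinations(range(n_q), k)   # complete U-statistic
    else:
        rng = np.random.default_rng(seed)
        subsets = (rng.choice(n_q, size=k, replace=False) for _ in range(n_subsets))
    return float(np.mean([phi(alpha, s([Y[i] for i in idx])) for idx in subsets]))
```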


Convergence of U-statistic procedures

Empirical risk for scoring function f:

    R_{ϕ,n}(f) = (1/n) ∑_q n_q (n_q choose k)^{−1} ∑_{i_1<···<i_k} ϕ(f(q), s(Y_{i_1}, . . . , Y_{i_k}))

Theorem: If we choose k_n = o(n) but k_n → ∞, then uniformly in f

    R_{ϕ,n}(f) → lim_{k→∞} E[ϕ(f(Q), s(Y_1, . . . , Y_k))]  [limiting aggregated loss]
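Any schedule with k_n → ∞ and k_n/n → 0 meets the theorem's condition; √n is one convenient illustrative choice (the slide does not single out a particular rate):

```python
import math

def choose_k(n):
    # sqrt(n) grows without bound yet is o(n), satisfying both requirements.
    return max(2, math.isqrt(n))
```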


New procedure for learning to rank


I Use a loss function that aggregates per query:

    R_{ϕ,n}(f) = (1/n) ∑_q n_q (n_q choose k)^{−1} ∑_{i_1<···<i_k} ϕ(f(q), s(Y_{i_1}, . . . , Y_{i_k}))

I Learn the ranking function by taking

    f̂ ∈ argmin_{f∈F} R_{ϕ,n}(f)

I Can optimize by stochastic gradient descent over queries q and subsets (i_1, . . . , i_k) (see the sketch below)
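A minimal SGD sketch under illustrative assumptions: a linear scorer f(q) = X_q w, pairwise-comparison judgments aggregated by a simple win-count structure function, and a pairwise logistic surrogate ϕ(α, S) = ∑_{ij} S_ij log(1 + exp(−(α_i − α_j))). None of these choices is prescribed by the slide; the loop structure (sample a query, sample a size-k subset, take a gradient step) is the point.

```python
import numpy as np

def aggregate_pairwise(prefs, m):
    """Illustrative structure function s: pairwise judgments (i, j), meaning
    'item i preferred to item j', become a win-count matrix S."""
    S = np.zeros((m, m))
    for i, j in prefs:
        S[i, j] += 1.0
    return S

def surrogate_grad(alpha, S):
    """Gradient in alpha of sum_ij S_ij * log(1 + exp(-(alpha_i - alpha_j)))."""
    d = alpha[:, None] - alpha[None, :]
    sig = 1.0 / (1.0 + np.exp(d))               # sigma(alpha_j - alpha_i)
    return -np.sum(S * sig, axis=1) + np.sum(S * sig, axis=0)

def sgd_rank(queries, k=5, steps=10_000, lr=0.01, seed=0):
    """queries: list of (X, prefs), X an (m, p) feature matrix for a query's
    m items, prefs a list of that query's pairwise judgments."""
    rng = np.random.default_rng(seed)
    w = np.zeros(queries[0][0].shape[1])
    for _ in range(steps):
        X, prefs = queries[rng.integers(len(queries))]            # random query q
        idx = rng.choice(len(prefs), size=min(k, len(prefs)), replace=False)
        S = aggregate_pairwise([prefs[i] for i in idx], X.shape[0])
        g_alpha = surrogate_grad(X @ w, S)                        # alpha = f(q) = X w
        w -= lr * (X.T @ g_alpha)                                 # chain rule to w
    return w
```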


Experiments

I Web search

I Image ranking


Web search

I Microsoft Learning to Rank Web10K dataset
I 10,000 queries issued
I 100 items per query
I Estimated relevance score r ∈ R for each query/result pair

I Generating pairwise preferences (sketched below):
I Choose query q uniformly at random
I Choose a pair (i, j) of items, and set i ≻ j with probability

    p_ij = 1 / (1 + exp(r_j − r_i))

I Aggregate scores by setting

    s_i = ∑_{j≠i} log [ P(j ≺ i) / P(i ≺ j) ]
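A minimal sketch of this simulation for one query (assuming NumPy). The slide's aggregate uses the model probabilities P; the sketch plugs in smoothed empirical win frequencies, an assumption made only so the snippet runs on finitely many sampled comparisons. Under the logistic model the population log-odds reduce to s_i = ∑_{j≠i} (r_i − r_j).

```python
import numpy as np

def sample_preferences(r, n_pairs, seed=0):
    """Sample pairwise preferences: P(i preferred to j) = 1 / (1 + exp(r_j - r_i))."""
    rng = np.random.default_rng(seed)
    m = len(r)
    wins = np.zeros((m, m))                      # wins[i, j]: times i beat j
    for _ in range(n_pairs):
        i, j = rng.choice(m, size=2, replace=False)
        if rng.random() < 1.0 / (1.0 + np.exp(r[j] - r[i])):
            wins[i, j] += 1.0
        else:
            wins[j, i] += 1.0
    return wins

def log_odds_scores(wins, eps=0.5):
    """s_i = sum_{j != i} log P_hat(j below i) / P_hat(i below j), eps-smoothed."""
    p = (wins + eps) / (wins + wins.T + 2.0 * eps)
    return np.sum(np.log(p / p.T), axis=1)       # diagonal terms are log(1) = 0
```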


Benefits of aggregation

NDCG risk as a function of aggregation level k for n = 10^6 samples

[Figure: NDCG@10 versus aggregation order k (10^0 to 10^4, log scale) for the Aggregate, Pairwise, and Score-based methods; the vertical axis runs from roughly 0.65 to 0.85]


Image ranking

I Setup [Grangier and Bengio 2008]
I Take the most common image search queries on google.com
I Train an independent ranker based on aggregated preference statistics for each query
I Compare with standard, disaggregated image-ranking approaches


Image ranking experiments

Highly ranked items from the Corel Image Database for the query tree car:

[Image grid: top-ranked results returned by the Aggregated, SVM, and PLSA methods]


Conclusions

1. Partial preference data is abundant and (more) reliable

2. General theory of ranking consistency: When is

    argmin_f E[ϕ(f(Q), s)] ⊆ argmin_f E[L(f(Q), s)] ?

   I Tractable consistency difficult with partial preference data
   I Possible with complete preference data

3. Aggregation can bridge the gap
   I Can transform pairwise preferences/click data into scores s

4. Practical consistent procedures via U-statistic aggregation
   I Allows for arbitrary aggregation s
   I High-probability convergence of the learned ranking function


Future work

I Empirical directions
   I Apply to more ranking problems!
   I Which aggregation procedures perform best?
   I How much aggregation is enough?

I Statistical questions: beyond consistency
   I How does aggregation impact the rate of convergence?
   I Can we design statistically efficient ranking procedures?

I Other ways of dealing with realistic partial preference data?


References I

P. L. Bartlett, M. I. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.

D. Buffoni, C. Calauzenes, P. Gallinari, and N. Usunier. Learning scoring functions with order-preserving losses and standardized supervision. In Proceedings of the 28th International Conference on Machine Learning, 2011.

O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In Conference on Information and Knowledge Management, 2009.

N. Craswell, O. Zoeter, M. J. Taylor, and B. Ramsey. An experimental comparison of click position-bias models. In Web Search and Data Mining (WSDM), pages 87–94, 2008.

O. Dekel, C. Manning, and Y. Singer. Log-linear models for label ranking. In Advances in Neural Information Processing Systems 16, 2004.

J. C. Duchi, L. Mackey, and M. I. Jordan. The asymptotics of ranking algorithms. Annals of Statistics, 2013.

Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. Efficient boosting algorithms for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.

R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers. MIT Press, 2000.

G. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63:81–97, 1956.

P. Ravikumar, A. Tewari, and E. Yang. On NDCG consistency of listwise ranking methods. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.

R. Shiffrin and R. Nosofsky. Seven plus or minus two: A commentary on capacity limitations. Psychological Review, 101(2):357–361, 1994.

N. Stewart, G. Brown, and N. Chater. Absolute identification by relative judgment. Psychological Review, 112(4):881–911, 2005.


What is the problem?

Surrogate loss ϕ(α, s) = ∑_{ij} s_ij φ(α_i − α_j)

[Diagram: two preference graphs on items 1, 2, 3 — one with edge weights s12, s23, s13, one with edge weight s31 — each occurring with probability p(s) = p(s′) = .5, and their aggregate containing all four weighted edges]

    ∑_s p(s) ϕ(α, s) = (1/2) ϕ(α, s) + (1/2) ϕ(α, s′)
                     ∝ s12 φ(α1 − α2) + s13 φ(α1 − α3) + s23 φ(α2 − α3) + s31 φ(α3 − α1)


What is the problem?

s12 φ(α1 − α2) + s13 φ(α1 − α3) + s23 φ(α2 − α3) + s31 φ(α3 − α1)

[Figure: the curves s31 φ(α3 − α1) and s13 φ(α1 − α3) as α1 varies, alongside the aggregated preference graph on items 1, 2, 3]

A convex, non-increasing φ is steepest to the left of 0, so there is more bang for your $$ in pushing a negative argument up toward 0: the optimizer pulls α1 down, shrinking α3 − α1 toward 0. Result:

    α∗ = argmin_α ∑_{ij} s_ij φ(α_i − α_j)

can have α∗2 > α∗1, even if s13 − s31 > s12 + s23.
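One way to explore this reversal numerically (a sketch, not the paper's construction): fix a convex non-increasing surrogate, here the exponential φ(t) = exp(−t), minimize the weighted pairwise loss by gradient descent, and inspect the ordering of α∗ for weight patterns of interest. The weights below are arbitrary placeholders satisfying s13 − s31 > s12 + s23; whether α∗2 > α∗1 actually occurs depends on the choice of φ and s.

```python
import numpy as np

def minimize_pairwise_surrogate(S, steps=50_000, lr=0.01):
    """Gradient descent on sum_ij S[i, j] * exp(-(alpha_i - alpha_j)),
    i.e. the exponential surrogate phi(t) = exp(-t)."""
    alpha = np.zeros(S.shape[0])
    for _ in range(steps):
        E = S * np.exp(-(alpha[:, None] - alpha[None, :]))
        alpha -= lr * (-E.sum(axis=1) + E.sum(axis=0))
    return alpha

# Arbitrary weights with s13 - s31 > s12 + s23; vary them to probe the claim.
S = np.zeros((3, 3))
S[0, 1], S[1, 2], S[0, 2], S[2, 0] = 0.1, 0.1, 0.6, 0.3
alpha_star = minimize_pairwise_surrogate(S)
print(alpha_star, "alpha_2* > alpha_1*:", alpha_star[1] > alpha_star[0])
```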
