How Significant is Statistically Significant? The Case of Audio Music Similarity and Retrieval


Page 1: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

ISMIR 2012

Porto, Portugal · October 9th

Picture by Humberto Santos

How Significant is Statistically Significant?

The Case of Audio Music Similarity and Retrieval

@julian_urbano University Carlos III of Madrid

J. Stephen Downie University of Illinois at Urbana-Champaign

Brian McFee University of California at San Diego

Markus Schedl Johannes Kepler University Linz

Page 2: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

let’s review two papers

Page 3: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

…which one should get published?

a.k.a. which research line should we follow?

paper A: +0.14* (statistically significant)

paper B: +0.21

Page 5: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

…which one should get published?

a.k.a. which research line should we follow?

paper A: +0.14*

paper B: +0.14*

Page 7: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Goal of Comparing Systems…

Find out the effectiveness difference 𝒅 (arbitrary query and arbitrary user)

[Axis: Δeffectiveness, from -1 to 1, with 0 and 𝑑 marked]

Impossible!

requires running the systems for the universe of all queries

Page 8: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

…what Evaluations can do

Estimate 𝒅 with the average 𝑑 over a sample of queries 𝓠

[Axis: Δeffectiveness, from -1 to 1, with 0 and 𝑑 marked]

Page 12: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

…what Evaluations can do

Estimate 𝒅 with the average 𝑑 over a sample of queries 𝓠

There is always random error

…so we need a measure of confidence

Page 14: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

The Significance Drill

Test these hypotheses

H0: 𝑑 = 0

H1: 𝑑 ≠ 0

Result of the test…

p-value = P( 𝒅 | H0 ), the probability of observing a difference at least this large if H0 were true

…interpretation of the test

p-value is very small: reject H0

otherwise: accept H0

Page 15: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

The Significance Drill

Test these hypotheses

H0: 𝑑 = 0

H1: 𝑑 ≠ 0

We accept/reject H0… (based on the p-value and α)

…not the test!

Page 16: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Usual (wrong) conclusions

A is substantially better than B

A is much better than B

The difference is important

The difference is significant

Page 18: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

What does it mean?

That there is a difference (unlikely due to chance/random error)

We don’t need fancy statistics…

…we already know they are different!

Page 19: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

H0: 𝒅 = 0 is false by definition

because systems A and B are different to begin with

Page 21: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

What is really important?

The effect-size: magnitude of 𝑑

𝒅 = +0.6 is a huge improvement

𝒅 = +0.0001 is irrelevant…

…and yet, it can easily be statistically significant

This is what predicts user satisfaction, not p-values

Page 25: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Example: t-test

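The larger the statistic 𝑡, the smaller the p-value

$t = \frac{\bar{d} \cdot \sqrt{|\mathcal{Q}|}}{s_d}$

where $\bar{d}$ is the average difference over the $|\mathcal{Q}|$ queries and $s_d$ is the standard deviation of the per-query differences

How to achieve statistical significance?

a) Reduce variance

b) Further improve the system

c) Evaluate with more queries!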
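
A minimal sketch of option c), assuming the statistic is the usual paired form (average difference times the square root of the number of queries, divided by the standard deviation of the differences). The numbers are hypothetical, and the p-value uses a large-sample normal approximation rather than the exact t distribution, so it is only indicative.

```python
import math
from statistics import NormalDist

def t_statistic(mean_diff, sd_diff, n_queries):
    """Paired t statistic: average difference scaled by its standard error."""
    return mean_diff * math.sqrt(n_queries) / sd_diff

def approx_p_value(t):
    """Two-sided p-value under a normal approximation (assumption: the
    number of queries is large enough for t to be close to standard normal)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

# Hypothetical values: a small improvement whose spread stays the same.
mean_diff, sd_diff = 0.01, 0.2
for n_queries in (50, 500, 5000):
    t = t_statistic(mean_diff, sd_diff, n_queries)
    print(f"queries={n_queries:5d}  t={t:.2f}  p~{approx_p_value(t):.4f}")
```

The effect size never changes in this loop; only the number of queries does, and the p-value shrinks with it.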

Page 26: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Statistical Significance is eventually meaningless…

…all you have to do is use enough queries
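
To make "enough queries" concrete, one can solve the paired t statistic for the number of queries at which a given average difference crosses the significance threshold. The sketch below assumes a normal approximation to the critical value and a standard deviation that does not change as queries are added; `queries_needed` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def queries_needed(mean_diff, sd_diff, alpha=0.05):
    """Smallest number of queries for which a two-sided test at level alpha
    rejects H0, given a fixed average difference and standard deviation.
    Assumption: normal approximation to the critical value."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z_crit * sd_diff / abs(mean_diff)) ** 2)

# Hypothetical spread of 0.2 in the per-query differences.
for mean_diff in (0.1, 0.01, 0.0001):
    print(mean_diff, queries_needed(mean_diff, 0.2))
```

Under these assumptions even a difference of 0.0001 becomes significant, at the price of millions of queries.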

Page 27: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Practical Significance: Effect-Size 𝑑 (Effectiveness / Satisfaction)

Statistical Significance: p-value (Confidence)

An improvement may be statistically significant, but that doesn’t mean it’s important!

Page 28: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

the real importance of an improvement

Page 29: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Purpose of Evaluation

We measure system effectiveness

How good is my system?

[Axis: effectiveness, from 0 to 1]

Is system A better than system B?

[Axis: Δeffectiveness, from -1 to 1]

Page 30: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Assumption

System Effectiveness corresponds to

User Satisfaction

[Plot: user satisfaction (vertical axis) against system effectiveness (horizontal axis)]

Page 35: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Assumption

System Effectiveness corresponds to

User Satisfaction

Does it? How well?

this is our ultimate goal!

Page 36: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

How we measure System Effectiveness

Similarity scale

Broad: 0, 1 or 2

Fine: 0, 1, 2, ..., 100

Effectiveness measure

AG@5: ignore the ranking

nDCG@5: discount by rank

What correlates better with user satisfaction?

we normalize to [0, 1]
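
A minimal sketch of the two measures on the Fine scale. The slides only say that AG@5 ignores the ranking, that nDCG@5 discounts by rank, and that scores are normalized to [0, 1]; the log2 discount and the choice of ideal list (five results at the top of the scale) are assumptions made here for illustration.

```python
import math

def ag_at_k(gains, k=5, max_gain=100):
    """Average Gain: mean similarity of the top k results, ignoring order.
    Dividing by max_gain normalizes the score to [0, 1]."""
    return sum(gains[:k]) / (k * max_gain)

def ndcg_at_k(gains, k=5, max_gain=100):
    """nDCG: gains discounted by log2(rank + 1), divided by the DCG of an
    ideal list. Assumption: the ideal list has k results at max_gain."""
    discounts = [1 / math.log2(rank + 1) for rank in range(1, k + 1)]
    dcg = sum(g * d for g, d in zip(gains[:k], discounts))
    return dcg / (max_gain * sum(discounts))

# Same judgments, opposite order (hypothetical Fine-scale scores).
early = [100, 50, 0, 0, 0]
late = [0, 0, 0, 50, 100]
print(ag_at_k(early), ag_at_k(late))      # identical: order is ignored
print(ndcg_at_k(early), ndcg_at_k(late))  # early ranks higher
```

For the Broad scale the same functions would be called with `max_gain=2`.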

Page 37: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Experiment

Page 39: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Experiment

known effectiveness

Page 40: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Experiment

user preference

Page 41: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Experiment

non-preference

Page 42: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

What can we infer?

Preference (difference noticed by user)

Positive: user agrees with evaluation

Negative: user disagrees with evaluation

Non-preference (difference not noticed by user)

Good: both systems are satisfying

Bad: both systems are unsatisfying
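
The four outcomes above can be sketched as a small classification step. The `Answer` values and the `interpret` helper are hypothetical names; the sketch assumes each example comes with a nonzero effectiveness difference between the two lists, as computed by the evaluation.

```python
from enum import Enum

class Answer(Enum):
    PREFERS_A = "prefers A"
    PREFERS_B = "prefers B"
    BOTH_GOOD = "both good"
    BOTH_BAD = "both bad"

def interpret(answer, delta):
    """Map a user answer to one of the four outcomes.
    delta = effectiveness(A) - effectiveness(B), assumed nonzero."""
    if answer is Answer.BOTH_GOOD:
        return "good non-preference"
    if answer is Answer.BOTH_BAD:
        return "bad non-preference"
    evaluation_prefers_a = delta > 0
    user_prefers_a = answer is Answer.PREFERS_A
    if evaluation_prefers_a == user_prefers_a:
        return "positive preference"
    return "negative preference"

print(interpret(Answer.PREFERS_B, 0.2))  # user disagrees with the evaluation
```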

Page 43: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Data

Clips and Similarity Judgments from MIREX 2011 Audio Music Similarity

Random and Artificial examples

Query: selected randomly

System outputs: random lists of 5 documents

2200 examples for 73 unique queries

2869 unique lists with 3031 unique clips

balanced and complete design

Page 44: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Subjects

Crowdsourcing

Cheap, fast and… diverse pool of subjects

2200 examples, $0.03 per example

Quality control

Worker pool

Trap examples (known answers)
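
One way trap examples could drive quality control is to score each worker on the traps they answered and keep only those above a cutoff. This is a sketch, not the procedure used in the study: the data layout, the function names, and the 0.8 threshold are all assumptions.

```python
def trap_accuracy(answers, trap_key):
    """Fraction of trap examples a worker answered correctly.
    answers: example id -> given answer; trap_key: trap id -> known answer.
    Returns None if the worker saw no traps."""
    graded = [answers[ex] == known for ex, known in trap_key.items() if ex in answers]
    return sum(graded) / len(graded) if graded else None

def accepted_workers(answers_by_worker, trap_key, min_accuracy=0.8):
    """Keep workers whose trap accuracy reaches min_accuracy (hypothetical cutoff)."""
    accepted = {}
    for worker, answers in answers_by_worker.items():
        accuracy = trap_accuracy(answers, trap_key)
        if accuracy is not None and accuracy >= min_accuracy:
            accepted[worker] = accuracy
    return accepted

# Hypothetical data: two traps with known answers, two workers.
trap_key = {"t1": "A", "t2": "B"}
answers_by_worker = {
    "w1": {"t1": "A", "t2": "B", "e7": "A"},
    "w2": {"t1": "B", "t2": "B", "e9": "both good"},
}
print(accepted_workers(answers_by_worker, trap_key))
```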

Page 45: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Results

6895 total answers

881 workers from 62 countries

3393 accepted answers (41%)

100 workers (87% rejected!)

95% average quality when accepted

Page 46: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

How good is my system? 884 nonpreferences (40%)

What do we expect?

Page 47: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

How good is my system? 884 nonpreferences (40%)

Linear mapping

Page 48: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

How good is my system? 884 nonpreferences (40%)

What do we have?

Page 51: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

How good is my system? 884 nonpreferences (40%)

room for ~20% improvement with personalization

Page 52: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Is system A better than B? 1316 preferences (60%)

What do we expect?

Page 53: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Is system A better than B? 1316 preferences (60%)

Users always notice the difference…

…regardless of how large it is

Page 54: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Is system A better than B? 1316 preferences (60%)

What do we have?

Page 57: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Is system A better than B? 1316 preferences (60%)

Differences of more than 0.3 and 0.4 are needed for more than 50% of users to agree
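
A curve like this one can be obtained by grouping the preferences by the size of the effectiveness difference and computing, per group, the share of users who agreed with the evaluation. The sketch below assumes each preference is stored as a pair of absolute difference and agreement flag; the bin count and the sample data are hypothetical.

```python
from collections import defaultdict

def agreement_by_bin(preferences, n_bins=10):
    """Share of users agreeing with the evaluation, per difference bin.
    preferences: iterable of (abs_delta, agrees) with abs_delta in [0, 1].
    Returns bin index -> agreement rate; bin i covers deltas from i / n_bins."""
    counts = defaultdict(lambda: [0, 0])  # bin index -> [agreements, total]
    for abs_delta, agrees in preferences:
        index = min(int(abs_delta * n_bins), n_bins - 1)
        counts[index][0] += 1 if agrees else 0
        counts[index][1] += 1
    return {index: a / total for index, (a, total) in sorted(counts.items())}

# Hypothetical preferences: small differences split users, large ones do not.
sample = [(0.05, False), (0.05, True), (0.45, True), (0.45, True), (1.0, True)]
print(agreement_by_bin(sample))
```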

Page 58: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Is system A better than B? 1316 preferences (60%)

Fine scale is closer to the ideal 100%

Page 59: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Is system A better than B? 1316 preferences (60%)

Do users prefer the (supposedly) worse system?

Page 61: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Statistical Significance

has nothing to do with this

Page 62: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Picture by Ronny Welter

Page 64: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Reporting Results

Confidence intervals / Variance

Indicator of evaluation error

Better understanding of expected user satisfaction

0.584 ± .023
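
A minimal sketch of how a report such as "score ± interval" could be produced from per-query scores. It assumes a 95% interval based on a normal approximation; the slides do not say how the interval shown was computed, and the scores below are made up.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_with_ci(scores, confidence=0.95):
    """Average score and half-width of its confidence interval.
    Assumption: normal approximation, reasonable for many queries."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z_crit * stdev(scores) / math.sqrt(len(scores))
    return mean(scores), half_width

# Hypothetical per-query effectiveness scores.
scores = [0.5, 0.6, 0.7, 0.6]
average, half_width = mean_with_ci(scores)
print(f"{average:.3f} ± {half_width:.3f}")
```

The interval narrows as more queries are evaluated, which is why it doubles as an indicator of evaluation error.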

Page 65: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Reporting Results

Actual p-values

+0.037 ± .031 *

Page 66: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

Reporting Results

Actual p-values

+0.037 ± .031 (p=0.02)

Statistical Significance is relative

Depends on context, cost of Type I errors and implementation, etc.

α=0.05 and α=0.01 are completely arbitrary
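
A full report lets the reader recover what a bare asterisk hides. As a sketch, if the ± value is read as the half-width of a 95% confidence interval under a normal approximation (an assumption; the slides do not state it), the p-value follows from the estimate and the interval alone.

```python
from statistics import NormalDist

def implied_p_value(estimate, half_width, confidence=0.95):
    """Two-sided p-value implied by an estimate and its interval half-width.
    Assumption: the interval is estimate ± z * standard error."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    standard_error = half_width / z_crit
    z = abs(estimate) / standard_error
    return 2 * (1 - NormalDist().cdf(z))

# The example from the slide: +0.037 ± .031
print(round(implied_p_value(0.037, 0.031), 2))  # about 0.02, as reported
```

Whether 0.02 counts as significant then depends on the α the reader considers appropriate for the context, not on a fixed convention.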

Page 67: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

let’s review two papers

(again)

Page 68: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

…which one should get published?

a.k.a. which research line should we follow?

paper A: +0.14*

paper B: +0.21

Page 69: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

…which one should get published?

a.k.a. which research line should we follow?

paper A (500 queries): +0.14 ± 0.03 (p=0.048)

paper B (50 queries): +0.21 ± 0.02 (p=0.052)

Page 71: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

…which one should get published?

a.k.a. which research line should we follow?

paper A: +0.14*

paper B: +0.14*

Page 72: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

…which one should get published?

a.k.a. which research line should we follow?

paper A (cost=$500,000): +0.14 ± 0.01 (p=0.004)

paper B (cost=$50): +0.14 ± 0.03 (p=0.043)

Page 74: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

effect-sizes are indicators of user satisfaction

need to personalize results

small differences are not noticed

p-values are indicators of confidence

beware of collection size

need to provide full reports

Page 75: How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

The difference between “Significant” and “Not Significant” is not itself statistically significant

― A. Gelman & H. Stern