How Significant is Statistically Significant? The Case of Audio Music Similarity and Retrieval
TRANSCRIPT
ISMIR 2012, Porto, Portugal · October 9th (picture by Humberto Santos)
How Significant is Statistically Significant?
The Case of Audio Music Similarity and Retrieval
@julian_urbano University Carlos III of Madrid
J. Stephen Downie University of Illinois at Urbana-Champaign
Brian McFee University of California at San Diego
Markus Schedl Johannes Kepler University Linz
let’s review two papers
…which one should get published?
a.k.a. which research line should we follow?
paper A: +0.14* (statistically significant)
paper B: +0.21

…which one should get published?
a.k.a. which research line should we follow?

paper A: +0.14*
paper B: +0.14*
Goal of Comparing Systems…
Find out the true effectiveness difference 𝒅 (arbitrary query and arbitrary user)
[figure: Δeffectiveness axis from −1 to +1, with 𝒅 marked]
Impossible! It requires running the systems over the universe of all queries
…what Evaluations can do
Estimate 𝒅 with the average 𝑑 over a sample of queries 𝓠
[figure: Δeffectiveness axis from −1 to +1, sample averages 𝑑 scattered around 𝒅]
There is always random error
…so we need a measure of confidence
The Significance Drill
Test these hypotheses:
H0: 𝑑 = 0    H1: 𝑑 ≠ 0
Result of the test: p-value = P( observing a difference at least as large as 𝑑 | H0 )
Interpretation: if the p-value is very small, reject H0; otherwise, fail to reject H0
We accept/reject H0 (based on the p-value and α)…
…not the test!
Usual (wrong) conclusions
A is substantially better than B
A is much better than B
The difference is important
The difference is significant
What does it mean?
That there is a difference (unlikely due to chance/random error)
We don’t need fancy statistics…
…we already know they are different!
H0: 𝒅 = 0 is false by definition
because systems A and B
are different to begin with
What is really important?
The effect size: the magnitude of 𝑑
𝒅 = +0.6 is a huge improvement
𝒅 = +0.0001 is irrelevant… and yet, it can easily be statistically significant
This is what predicts user satisfaction, not p-values
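The talk does not prescribe a particular effect-size measure; as an illustrative sketch (not from the talk), a common choice for paired per-query differences is Cohen's d, the mean difference scaled by its standard deviation:

```python
import math

def cohens_d(diffs):
    """Cohen's d for paired data: mean difference over the SD of the differences."""
    n = len(diffs)
    m = sum(diffs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in diffs) / (n - 1))
    return m / sd

# per-query effectiveness differences between two systems (illustrative numbers)
print(round(cohens_d([0.10, -0.02, 0.08, 0.04, 0.05]), 2))  # → 1.09
```

Magnitudes like these only matter insofar as they map to differences users actually notice.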
Example: t-test
The larger the statistic 𝑡, the smaller the p-value:
𝒕 = 𝒅 · √|𝓠| / 𝒔𝒅
How to achieve statistical significance?
a) Reduce variance
b) Further improve the system
c) Evaluate with more queries!
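This can be illustrated numerically (a sketch, not part of the talk): holding the average difference 𝑑 and its spread 𝑠𝑑 fixed, the t statistic grows with the square root of the number of queries, so even a tiny improvement eventually reaches significance.

```python
import math

def t_statistic(diffs):
    """One-sample t statistic against 0: t = mean(d) * sqrt(|Q|) / s_d."""
    n = len(diffs)
    m = sum(diffs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in diffs) / (n - 1))
    return m * math.sqrt(n) / sd

# Fix a tiny average improvement (d = +0.0001) and a typical per-query
# spread (s_d = 0.02): t grows with the square root of the number of queries,
# crossing the usual |t| > 1.96 threshold once the query set is large enough.
for n in (50, 5000, 500_000):
    t = 0.0001 * math.sqrt(n) / 0.02
    print(n, round(t, 3))  # → 0.035, 0.354, 3.536
```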
Statistical Significance is eventually meaningless…
…all you have to do is use enough queries
Practical Significance: effect size 𝑑 → effectiveness / satisfaction
Statistical Significance: p-value → confidence
An improvement may be statistically significant, but that
doesn’t mean it’s important!
the real importance of an improvement
Purpose of Evaluation
We measure system effectiveness
How good is my system? [effectiveness axis: 0 to 1]
Is system A better than system B? [Δeffectiveness axis: −1 to +1]
Assumption
System Effectiveness corresponds to User Satisfaction
[figure: user satisfaction (y-axis) vs. system effectiveness (x-axis)]
Does it? How well?
this is our ultimate goal!
How we measure System Effectiveness
Similarity scale: Broad (0, 1 or 2) or Fine (0, 1, 2, …, 100)
Effectiveness measure: AG@5 (ignores the ranking) or nDCG@5 (discounts by rank)
(we normalize to [0, 1])
What correlates better with user satisfaction?
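As an illustration of the two measures (a sketch using a common log2-discount formulation; the exact MIREX definitions may differ slightly), computed over one list of five normalized judgments:

```python
import math

def ag_at_k(gains, k=5):
    """Average Gain at k: mean of the normalized gains, ignoring rank order."""
    return sum(gains[:k]) / k

def ndcg_at_k(gains, k=5):
    """nDCG at k with a log2 rank discount, normalized by the ideal ordering."""
    def dcg(gs):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gs[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Fine-scale judgments (0..100) for a ranked list of 5 clips, normalized to [0, 1]
judgments = [80, 20, 100, 0, 60]
gains = [j / 100 for j in judgments]
print(round(ag_at_k(gains), 3), round(ndcg_at_k(gains), 3))  # → 0.52 0.877
```

AG@5 rewards total relevance only, while nDCG@5 additionally penalizes good clips ranked low.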
Experiment
[slide sequence: system outputs with known effectiveness are shown to users, who state either a preference or a non-preference]
What can we infer?
Preference (difference noticed by the user)
Positive: user agrees with the evaluation
Negative: user disagrees with the evaluation
Non-preference (difference not noticed by the user)
Good: both systems are satisfying
Bad: both systems are unsatisfying
Data
Clips and Similarity Judgments from MIREX 2011 Audio Music Similarity
Random and Artificial examples
Query: selected randomly
System outputs: random lists of 5 documents
2200 examples for 73 unique queries
2869 unique lists with 3031 unique clips (balanced and complete design)
Subjects
Crowdsourcing: cheap, fast and… a diverse pool of subjects
2200 examples at $0.03 per example
Quality control: worker pool and trap examples (known answers)
Results
6895 total answers from 881 workers in 62 countries
3393 accepted answers (41%) from 100 workers (87% rejected!)
95% average quality when accepted
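One way such trap examples can drive acceptance (a hypothetical sketch with made-up names and threshold, not the authors' actual pipeline) is to accept a worker only when enough of their trap answers match the known answers:

```python
# Hypothetical sketch of trap-based quality control (not the authors' pipeline)
def accept_worker(answers, traps, min_trap_accuracy=0.7):
    """Accept a worker if enough of their trap-example answers are correct.

    answers: dict mapping example id -> the worker's answer
    traps:   dict mapping trap example id -> known correct answer
    """
    trap_ids = [eid for eid in traps if eid in answers]
    if not trap_ids:
        return False  # no evidence of quality: reject
    correct = sum(answers[eid] == traps[eid] for eid in trap_ids)
    return correct / len(trap_ids) >= min_trap_accuracy

traps = {"t1": "A", "t2": "B", "t3": "A"}
good = {"t1": "A", "t2": "B", "t3": "A", "q1": "B"}   # 3/3 traps correct
bad = {"t1": "B", "t2": "B", "t3": "B", "q1": "A"}    # 1/3 traps correct
print(accept_worker(good, traps), accept_worker(bad, traps))  # → True False
```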
How good is my system? (884 non-preferences, 40%)
What do we expect? A linear mapping between effectiveness and satisfaction
What do we have? [figure: observed satisfaction vs. system effectiveness]
Room for ~20% improvement with personalization
Is system A better than B? (1316 preferences, 60%)
What do we expect? Users always notice the difference… regardless of how large it is
What do we have? [figure: user agreement vs. effectiveness difference]
Differences of >.3 and >.4 are needed for >50% of users to agree
The Fine scale is closer to the ideal 100%
Do users prefer the (supposedly) worse system?
Statistical Significance has nothing to do with this
Picture by Ronny Welter
Reporting Results
Confidence intervals / variance: an indicator of evaluation error, and a better understanding of expected user satisfaction
e.g. 0.584 becomes 0.584 ± .023
Actual p-values: +0.037 ± .031 (p=0.02) instead of +0.037 ± .031 *
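Producing such a full report from per-query score differences can be sketched as follows (not from the talk; this uses a normal approximation for both the interval and the p-value, with illustrative data — for small query sets a t distribution is more appropriate):

```python
import math
from statistics import mean, stdev

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def full_report(diffs):
    """Mean difference, ~95% CI half-width and two-sided p-value (normal approx.)."""
    n = len(diffs)
    m = mean(diffs)
    se = stdev(diffs) / math.sqrt(n)   # standard error of the mean difference
    z = m / se
    p = 2 * (1 - phi(abs(z)))          # two-sided p-value against d = 0
    return f"{m:+.3f} ± {1.96 * se:.3f} (p={p:.3f})"

# illustrative per-query differences between systems A and B
diffs = [0.05, -0.02, 0.08, 0.01, 0.03, 0.06, -0.01, 0.04, 0.02, 0.07]
print(full_report(diffs))
```

The point of the report format is that the reader sees the effect size, the evaluation error and the confidence all at once, rather than a bare asterisk.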
Statistical Significance is relative
Depends on context, cost of Type I errors and implementation, etc.
α=0.05 and α=0.01 are completely arbitrary
let’s review two papers
(again)
…which one should get published?
a.k.a. which research line should we follow?

paper A (500 queries): +0.14 ± 0.03 (p=0.048)
paper B (50 queries): +0.21 ± 0.02 (p=0.052)

…which one should get published?
a.k.a. which research line should we follow?

paper A (cost=$500,000): +0.14 ± 0.01 (p=0.004)
paper B (cost=$50): +0.14 ± 0.03 (p=0.043)
Effect sizes are indicators of user satisfaction
need to personalize results
small differences are not noticed
p-values are indicators of confidence
beware of collection size
need to provide full reports
The difference between “Significant” and “Not Significant”
is not itself statistically significant
― A. Gelman & H. Stern