How Significant is Statistically Significant? The Case of Audio Music Similarity and Retrieval
TRANSCRIPT
ISMIR 2012, Porto, Portugal · October 9th (picture by Humberto Santos)
How Significant is Statistically Significant?
The Case of Audio Music Similarity and Retrieval
@julian_urbano University Carlos III of Madrid
J. Stephen Downie University of Illinois at Urbana-Champaign
Brian McFee University of California at San Diego
Markus Schedl Johannes Kepler University Linz
let’s review two papers
…which one should get published?
a.k.a. which research line should we follow?
paper A: +0.14* (statistically significant)
paper B: +0.21

…which one should get published?
a.k.a. which research line should we follow?

paper A: +0.14*
paper B: +0.14*
Goal of Comparing Systems…
Find out the true effectiveness difference 𝒅 (arbitrary query and arbitrary user)
[figure: Δeffectiveness axis from −1 to +1, with 𝒅 marked]
Impossible! It requires running the systems over the universe of all queries
…what Evaluations can do
Estimate 𝒅 with the average 𝑑 over a sample of queries 𝓠
[figure: Δeffectiveness axis from −1 to +1, sample averages 𝑑 scattered around 𝒅]
There is always random error
…so we need a measure of confidence
The Significance Drill
Test these hypotheses:
H0: 𝑑 = 0    H1: 𝑑 ≠ 0
Result of the test: p-value = P( observing a difference at least as large as 𝑑 | H0 )
Interpretation: if the p-value is very small, reject H0; otherwise, fail to reject H0
We accept/reject H0 (based on the p-value and α)…
…not the test!
Usual (wrong) conclusions
A is substantially better than B
A is much better than B
The difference is important
The difference is significant
What does it mean?
That there is a difference (unlikely due to chance/random error)
We don’t need fancy statistics…
…we already know they are different!
H0: 𝒅 = 0 is false by definition
because systems A and B
are different to begin with
What is really important?
The effect size: the magnitude of 𝑑
𝒅 = +0.6 is a huge improvement
𝒅 = +0.0001 is irrelevant… and yet, it can easily be statistically significant
This is what predicts user satisfaction, not p-values
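The talk does not prescribe a particular effect-size measure; as an illustrative sketch (not from the talk), a common choice for paired per-query differences is Cohen's d, the mean difference scaled by its standard deviation:

```python
import math

def cohens_d(diffs):
    """Cohen's d for paired data: mean difference over the SD of the differences."""
    n = len(diffs)
    m = sum(diffs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in diffs) / (n - 1))
    return m / sd

# per-query effectiveness differences between two systems (illustrative numbers)
print(round(cohens_d([0.10, -0.02, 0.08, 0.04, 0.05]), 2))  # → 1.09
```

Magnitudes like these only matter insofar as they map to differences users actually notice.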
Example: t-test
The larger the statistic 𝑡, the smaller the p-value:
𝒕 = 𝒅 · √|𝓠| / 𝒔𝒅
How to achieve statistical significance?
a) Reduce variance
b) Further improve the system
c) Evaluate with more queries!
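This can be illustrated numerically (a sketch, not part of the talk): holding the average difference 𝑑 and its spread 𝑠𝑑 fixed, the t statistic grows with the square root of the number of queries, so even a tiny improvement eventually reaches significance.

```python
import math

def t_statistic(diffs):
    """One-sample t statistic against 0: t = mean(d) * sqrt(|Q|) / s_d."""
    n = len(diffs)
    m = sum(diffs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in diffs) / (n - 1))
    return m * math.sqrt(n) / sd

# Fix a tiny average improvement (d = +0.0001) and a typical per-query
# spread (s_d = 0.02): t grows with the square root of the number of queries,
# crossing the usual |t| > 1.96 threshold once the query set is large enough.
for n in (50, 5000, 500_000):
    t = 0.0001 * math.sqrt(n) / 0.02
    print(n, round(t, 3))  # → 0.035, 0.354, 3.536
```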
Statistical Significance is eventually meaningless…
…all you have to do is use enough queries
Practical Significance: effect size 𝑑 → effectiveness / satisfaction
Statistical Significance: p-value → confidence
An improvement may be statistically significant, but that
doesn’t mean it’s important!
the real importance of an improvement
Purpose of Evaluation
We measure system effectiveness
How good is my system? [effectiveness axis: 0 to 1]
Is system A better than system B? [Δeffectiveness axis: −1 to +1]
Assumption
System Effectiveness corresponds to User Satisfaction
[figure: user satisfaction (y-axis) vs. system effectiveness (x-axis)]
Does it? How well?
this is our ultimate goal!
How we measure System Effectiveness
Similarity scale: Broad (0, 1 or 2) or Fine (0, 1, 2, …, 100)
Effectiveness measure: AG@5 (ignores the ranking) or nDCG@5 (discounts by rank)
(we normalize to [0, 1])
What correlates better with user satisfaction?
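As an illustration of the two measures (a sketch using a common log2-discount formulation; the exact MIREX definitions may differ slightly), computed over one list of five normalized judgments:

```python
import math

def ag_at_k(gains, k=5):
    """Average Gain at k: mean of the normalized gains, ignoring rank order."""
    return sum(gains[:k]) / k

def ndcg_at_k(gains, k=5):
    """nDCG at k with a log2 rank discount, normalized by the ideal ordering."""
    def dcg(gs):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gs[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Fine-scale judgments (0..100) for a ranked list of 5 clips, normalized to [0, 1]
judgments = [80, 20, 100, 0, 60]
gains = [j / 100 for j in judgments]
print(round(ag_at_k(gains), 3), round(ndcg_at_k(gains), 3))  # → 0.52 0.877
```

AG@5 rewards total relevance only, while nDCG@5 additionally penalizes good clips ranked low.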
Experiment
[slide sequence: system outputs with known effectiveness are shown to users, who state either a preference or a non-preference]
What can we infer?
Preference (difference noticed by the user)
Positive: user agrees with the evaluation
Negative: user disagrees with the evaluation
Non-preference (difference not noticed by the user)
Good: both systems are satisfying
Bad: both systems are unsatisfying
Data
Clips and Similarity Judgments from MIREX 2011 Audio Music Similarity
Random and Artificial examples
Query: selected randomly
System outputs: random lists of 5 documents
2200 examples for 73 unique queries
2869 unique lists with 3031 unique clips (balanced and complete design)
Subjects
Crowdsourcing: cheap, fast and… a diverse pool of subjects
2200 examples at $0.03 per example
Quality control: worker pool and trap examples (known answers)
Results
6895 total answers from 881 workers in 62 countries
3393 accepted answers (41%) from 100 workers (87% rejected!)
95% average quality when accepted
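One way such trap examples can drive acceptance (a hypothetical sketch with made-up names and threshold, not the authors' actual pipeline) is to accept a worker only when enough of their trap answers match the known answers:

```python
# Hypothetical sketch of trap-based quality control (not the authors' pipeline)
def accept_worker(answers, traps, min_trap_accuracy=0.7):
    """Accept a worker if enough of their trap-example answers are correct.

    answers: dict mapping example id -> the worker's answer
    traps:   dict mapping trap example id -> known correct answer
    """
    trap_ids = [eid for eid in traps if eid in answers]
    if not trap_ids:
        return False  # no evidence of quality: reject
    correct = sum(answers[eid] == traps[eid] for eid in trap_ids)
    return correct / len(trap_ids) >= min_trap_accuracy

traps = {"t1": "A", "t2": "B", "t3": "A"}
good = {"t1": "A", "t2": "B", "t3": "A", "q1": "B"}   # 3/3 traps correct
bad = {"t1": "B", "t2": "B", "t3": "B", "q1": "A"}    # 1/3 traps correct
print(accept_worker(good, traps), accept_worker(bad, traps))  # → True False
```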
How good is my system? (884 non-preferences, 40%)
What do we expect? A linear mapping between effectiveness and satisfaction
What do we have? [figure: observed satisfaction vs. system effectiveness]
Room for ~20% improvement with personalization
Is system A better than B? (1316 preferences, 60%)
What do we expect? Users always notice the difference… regardless of how large it is
What do we have? [figure: user agreement vs. effectiveness difference]
Differences of >.3 and >.4 are needed for >50% of users to agree
The Fine scale is closer to the ideal 100%
Do users prefer the (supposedly) worse system?
Statistical Significance has nothing to do with this
Picture by Ronny Welter
Reporting Results
Confidence intervals / variance: an indicator of evaluation error, and a better understanding of expected user satisfaction
e.g. 0.584 becomes 0.584 ± .023
Actual p-values: +0.037 ± .031 (p=0.02) instead of +0.037 ± .031 *
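Producing such a full report from per-query score differences can be sketched as follows (not from the talk; this uses a normal approximation for both the interval and the p-value, with illustrative data — for small query sets a t distribution is more appropriate):

```python
import math
from statistics import mean, stdev

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def full_report(diffs):
    """Mean difference, ~95% CI half-width and two-sided p-value (normal approx.)."""
    n = len(diffs)
    m = mean(diffs)
    se = stdev(diffs) / math.sqrt(n)   # standard error of the mean difference
    z = m / se
    p = 2 * (1 - phi(abs(z)))          # two-sided p-value against d = 0
    return f"{m:+.3f} ± {1.96 * se:.3f} (p={p:.3f})"

# illustrative per-query differences between systems A and B
diffs = [0.05, -0.02, 0.08, 0.01, 0.03, 0.06, -0.01, 0.04, 0.02, 0.07]
print(full_report(diffs))
```

The point of the report format is that the reader sees the effect size, the evaluation error and the confidence all at once, rather than a bare asterisk.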
Statistical Significance is relative
Depends on context, cost of Type I errors and implementation, etc.
α=0.05 and α=0.01 are completely arbitrary
let’s review two papers
(again)
…which one should get published?
a.k.a. which research line should we follow?

paper A (500 queries): +0.14 ± 0.03 (p=0.048)
paper B (50 queries): +0.21 ± 0.02 (p=0.052)

…which one should get published?
a.k.a. which research line should we follow?

paper A (cost=$500,000): +0.14 ± 0.01 (p=0.004)
paper B (cost=$50): +0.14 ± 0.03 (p=0.043)
Effect sizes are indicators of user satisfaction
need to personalize results
small differences are not noticed
p-values are indicators of confidence
beware of collection size
need to provide full reports
The difference between “Significant” and “Not Significant”
is not itself statistically significant
― A. Gelman & H. Stern