Statistical Methods for Health Intelligence Lecture 2: Perspectives, Data Types & Summaries Iain Buchan University of Manchester [email protected]

Post on 14-Dec-2015

Course Material 1: Basic Text

• Medical Statistics, 4th Ed; Campbell, Machin & Walters; Wiley 2007

• Statistical knowledge level: Public health practitioner

• How are you getting on?
• Are you using any other learning materials?

Your Participation

• Today: questions about your reading

• Take notes on my comments

• Prepare to reproduce exercises in R

Course Material 2: R

• Statistics: An Introduction Using R; Crawley, Wiley 2005

• cran.r-project.org

• Reproduce each example in course text
• Prepare to submit R scripts for assessment

Course Material: Optional

• Probability and Random Variables: a beginner’s guide; Stirzaker, Cambridge University Press 1999

• Bad Science; Goldacre, Fourth Estate Ltd, 2008

Define

• statistics
– quantitative information about a topic

• Statistics
– The measurement of uncertainty

The Statistical Movement

Circa 1900: Galton, Pearson, Edgeworth and Yule establish Statistics as a discipline

Early/mid 1900s: Fisher consolidates statistical methods and experimental philosophy

Think

• Whose perspective is Chapter 1?
– Medical Statistician

• Why must the Informatician look wider?
– May not have the luxury of study design
– Data- vs. hypothesis-driven research
– Maximise information validity & utility

Health Statistics 1600-1860

Observation

Knowledge

Reasoning

Summarisation

Health Statistics 1860-≈2000/now

Observation ± Experimentation

Knowledge

Reasoning

Summarisation & Statistical Modelling

Evidence Based Medicine

Early/mid 1900s: Greenwood, Bradford-Hill & Doll push Statistics into medical research

Mid-late 1900s: Cochrane pushes for the routine application of randomised clinical trials and leaves the evidence based medicine movement in his wake

Causality

Clinical Trials

Effectiveness & Efficiency

Hypothesis-driven Research

Problem

Question

Hypothesis

Design

Data collection

Data collation

Data analysis

Inference

Interpretation

Dissemination

Define

• Epidemiology
– the study of the distribution and determinants of disease and health-related states in populations

JM Last, 2000

Define

• Confounding factor
– A factor associated with both exposure and outcome but not on the causal pathway about which the inference is being made
– What confounded the cancer vs. water fluoridation example in the book?

Causal Inference

[Diagram: Exposure and Outcome joined by the Causal pathway; a Confounder linked to both; the observed relationship labelled ASSOCIATION]

Sieving Associations

Association    Bias               Type       Explanation
C → MI         Cause-effect       Real       Cause-effect
MI → C         Reverse            Real       Effect-cause
C ← ? → MI     Confounding        Real       Effect-effect
C </> MI       Random error       Spurious   Chance
C </> MI       Systematic error   Spurious   Bias

C = caffeine, MI = myocardial infarction (heart attack)

Disciplined approach to causal inference, Bradford-Hill: Criteria (temporality, strength, dose-response, consistency, plausibility, consideration of alternatives, open to experiment, specificity, coherence)

Hard to Make a Confident Causal Inference

• Plausible pathway to link outcome to exposure

• Same results if repeated in a different time, place or person

• Exposure precedes outcome

• Strong relationship ± dose effect

• Causal factor relates only to the outcome in question

• Outcome falls if risk factor removed...

Think

• What is the most important question a Statistician wants a medic to ask?
– How might I be wrong?
  • In designing my study
  • In making an inference about an association
  • In generalising my inference beyond the study population

• Statisticians are understandably conservative; Informaticians must be carefully informative

Exhausted Epidemiology Platform

The big public health problems, e.g. Type 2 Diabetes, have “complex webs of causes”

Problem 1: Dwindling hits from tools to detect independent “causes”

Problem 2: Knowledge can’t be managed by reading papers any more

The “data-set” and structure extend beyond the study’s observations

Evidence limits showing

• Epidemiology has exhausted the big simple causes of ill health

• Many trials have weak external validity

• Public health interventions are largely unstudied

Many patterns of ill health in society remain unexplained via conventional studies

Need Statistical Informatics

[Figure: labels Data, Necessary Complexity of Models, Human Resource]

Define

• Statistical Data-types & Measurement Scales
– Categorical: qualitative measuring
  • Binary/Dichotomous
  • Nominal: > 2 categories, without order
  • Ordinal (loose)
    – Nominal with order
    – Ordinal (ties = lack of measurement sensitivity)
– Numerical: quantitative measuring
  • Counts
  • Continuous (any value in a range)
    – Interval (fixed and defined, meaningful mean difference)
    – Ratio (zero means something)

Caution

• Don’t treat ordered nominal data as interval!
– Why?
– Give examples?
– Relate these to software requirements

Programming Note

• Which has the greater information utility?
  Sex = 1|2
  Sex = m|f
  Gender = m|f
  Male = 1|0
  Gender_Male = 1|0
– Maximum information, minimum ambiguity: Gender_Male = 1|0
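The naming point can be sketched in code. This is a minimal illustration in Python (the course itself works in R). The code book mapping 1 = male, 2 = female and the helper name `to_gender_male` are assumptions made up for the example, not something the slides define.

```python
# Hypothetical code book: a real data dictionary must confirm this mapping.
SEX_CODES = {1: "male", 2: "female"}


def to_gender_male(sex_code):
    """Recode an ambiguous Sex = 1|2 value into Gender_Male = 1|0.

    Codes outside the code book return None, so unknown values stay
    visibly missing instead of being silently counted as female.
    """
    label = SEX_CODES.get(sex_code)
    if label is None:
        return None
    return 1 if label == "male" else 0


# Made-up records; 9 stands in for an undocumented code.
records = [{"Sex": 1}, {"Sex": 2}, {"Sex": 9}]
recoded = [{"Gender_Male": to_gender_male(r["Sex"])} for r in records]
print(recoded)
```

The variable name now says what 1 means, so a reader does not need the code book to interpret the column.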

Discuss

• Why categorise continuous data?
– Meaningful thresholds (e.g. Hypertensive)
– Compact summary / easy presentation
– Easier analysis (good / bad?)
– Avoid regression to the mean (homework)
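A small Python sketch (the course itself works in R) of what thresholding does to the data. The 140 mmHg systolic cut-off and the readings are assumptions chosen for illustration only.

```python
def categorise_sbp(sbp_mmhg, threshold=140):
    """Dichotomise systolic blood pressure at an assumed threshold."""
    return "hypertensive" if sbp_mmhg >= threshold else "not hypertensive"


# Made-up readings in mmHg.
readings = [118, 139, 140, 162]
labels = [categorise_sbp(x) for x in readings]
print(list(zip(readings, labels)))

# 139 and 140 land in different categories although they differ by 1 mmHg,
# while 140 and 162 share a category: the summary is compact but loses
# information that the continuous values carried.
```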

Think

• What is audit?
– A quality improvement process that seeks to improve a service through systematic review against explicit criteria and the implementation of change

• How does this differ from research?
– Ethics
– Constrained design

• What is a natural experiment?
– Homework...

Summarise Binary Data: r/n

• Describe a proportion
– r = outcome or feature present (numerator)
– n = number of subjects observed (denominator)
– p = r/n; RR = p1/p2; (A)RD = |p2 - p1|

• Relative Risk (RR) abuse
– Pill ↑ risk DVT by (RR =) 2
  statistically significant
  clinically insignificant
  2 women in 10,000 pill-years
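The formulas above can be sketched in Python (the course itself works in R). The counts are hypothetical, chosen only so that RR = 2 with a very small absolute difference; they are not the data behind the pill example.

```python
def proportion(r, n):
    """p = r/n, with basic checks on the numerator and denominator."""
    if n <= 0 or not 0 <= r <= n:
        raise ValueError("need n > 0 and 0 <= r <= n")
    return r / n


def relative_risk(p1, p2):
    """RR = p1/p2."""
    return p1 / p2


def absolute_risk_difference(p1, p2):
    """(A)RD = |p2 - p1|."""
    return abs(p2 - p1)


# Hypothetical counts: 2 events per 10,000 exposed, 1 per 10,000 unexposed.
p_exposed = proportion(2, 10_000)
p_unexposed = proportion(1, 10_000)

rr = relative_risk(p_exposed, p_unexposed)
ard = absolute_risk_difference(p_exposed, p_unexposed)
print(f"RR = {rr:.1f}, ARD = {ard:.4f}")
```

A doubling of risk sounds alarming on its own; reported next to the absolute difference, the same result reads very differently.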

Summarise Binary Data: r/n~t

• Describe a rate
– r = outcome/success/failure (numerator)
– n = number of subjects observed (denominator)
– t = time over which subjects observed
– n*t = person time – why important?
  • Some may drop out or be lost to follow-up
– (incidence) rate IR = r/(n*t)
– Incidence rate ratio IRR = IR1/IR2; incidence rate difference IRD = |IR2 - IR1|
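A minimal Python sketch of a rate calculation (the course itself works in R), assuming the denominator is person-time summed over subjects rather than a simple head count. The follow-up times and event counts are made up.

```python
def incidence_rate(events, person_time):
    """IR = events / person-time."""
    if person_time <= 0:
        raise ValueError("person-time must be positive")
    return events / person_time


# Made-up follow-up in years. Subjects who drop out or are lost to
# follow-up contribute only the time they were actually observed.
followup_group1 = [5.0, 5.0, 2.5, 1.0, 5.0]   # 18.5 person-years
followup_group2 = [5.0, 4.0, 5.0, 5.0, 3.0]   # 22.0 person-years
events_group1 = 2
events_group2 = 1

ir1 = incidence_rate(events_group1, sum(followup_group1))
ir2 = incidence_rate(events_group2, sum(followup_group2))
irr = ir1 / ir2          # incidence rate ratio
ird = abs(ir2 - ir1)     # incidence rate difference
print(f"IR1 = {ir1:.3f}, IR2 = {ir2:.3f} per person-year")
print(f"IRR = {irr:.2f}, IRD = {ird:.3f}")
```

Dividing by 5 subjects in each group would treat the early drop-outs as if they had been watched for the full five years, understating the rate in group 1.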

[Figure: Percentage excess deaths in North vs. South England, by Year, for Males and Females; vertical axis 0% to 25%. Source: John Hacking & Iain Buchan, pre-publication 2009]

Summarise Binary Data: Crosstabs

• Variables C1-Ck – what is a crosstab?
– Cross-tabulate categorical variables, say disease registration by gender: 2 by 2, or more generally r by c, tables
– Usually two way or two dimensional
– Models may need higher dimensions, say disease registration by gender by speciality

• Is a data cube the same?
– Data Cube: A relational aggregation operator generalizing group-by, crosstab, and subtotals
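A crosstab is just a count of each combination of category values. The Python sketch below (the course itself works in R) builds a two-way table from made-up records using only the standard library; the field values are hypothetical.

```python
from collections import Counter

# Made-up records: (gender, disease registration).
records = [
    ("f", "yes"), ("f", "no"), ("m", "yes"),
    ("m", "yes"), ("f", "no"), ("m", "no"),
]

# Count every (row category, column category) pair.
counts = Counter(records)

row_levels = sorted({gender for gender, _ in records})
col_levels = sorted({status for _, status in records})

# Lay the counts out as a nested dict: table[row][column].
table = {
    g: {s: counts[(g, s)] for s in col_levels}
    for g in row_levels
}

for g in row_levels:
    print(g, table[g])
```

Adding a third field to each tuple and to the Counter key gives the higher-dimensional case, which is where the data cube idea starts.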

Contingency Table

                                       Dimension 1: Exposure/Treatment/Category 1
                                       Present        Absent
Dimension 2:                Present    a              b
Outcome/Status/Category 2   Absent     c              d

Summarise Binary Data: Odds

• How do odds differ from risk/proportion/probability?
– Ratio of occurrence to non occurrence
– Odds = p/(1-p)
– OR = (a/c)/(b/d) = ad/bc
– p = a/(a+c), so if a << c then a/(a+c) ≈ a/c and OR ≈ RR
– OR_success = 1/OR_failure, not so for RR
– Tractable computation with log odds
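A minimal Python sketch (the course itself works in R) of these quantities, using the a, b, c, d cells of the contingency table with exposure in the columns and outcome in the rows. The cell counts are made up and describe a rare outcome.

```python
import math


def odds(p):
    """Odds = p / (1 - p): occurrence relative to non-occurrence."""
    return p / (1 - p)


def odds_ratio(a, b, c, d):
    """OR = (a/c) / (b/d) = ad / bc."""
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """RR = [a/(a+c)] / [b/(b+d)]."""
    return (a / (a + c)) / (b / (b + d))


# Made-up 2 by 2 table: outcome present (a, b), outcome absent (c, d).
a, b, c, d = 10, 5, 990, 995

or_outcome = odds_ratio(a, b, c, d)
rr_outcome = relative_risk(a, b, c, d)

# Swapping the rows asks about non-occurrence instead of occurrence.
or_no_outcome = odds_ratio(c, d, a, b)
rr_no_outcome = relative_risk(c, d, a, b)

print(f"OR = {or_outcome:.3f}, RR = {rr_outcome:.3f}")
print(f"OR (non-occurrence) = {or_no_outcome:.3f}, 1/OR = {1 / or_outcome:.3f}")
print(f"RR (non-occurrence) = {rr_no_outcome:.3f}, 1/RR = {1 / rr_outcome:.3f}")
print(f"log OR = {math.log(or_outcome):.3f}")
```

Because the outcome is rare here, OR and RR are close. The OR for non-occurrence is the exact reciprocal of the OR for occurrence, whereas the two RRs are not reciprocals of each other.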

Caution

• If the odds ratio is interpreted as a relative risk it will always overstate any effect size: the odds ratio is smaller than the relative risk for odds ratios of less than one, and bigger than the relative risk for odds ratios of greater than one

• The extent of overstatement increases as both the initial risk increases and the odds ratio departs from unity

• However, serious divergence between the odds ratio and the relative risk occurs only with large effects on groups at high initial risk. Therefore qualitative judgments based on interpreting odds ratios as though they were relative risks are unlikely to be seriously in error

• In studies which show reductions in risk (odds ratios of less than one), the odds ratio will never underestimate the relative risk by a greater percentage than the level of initial risk

• In studies which show increases in risk (odds ratios of greater than one), the odds ratio will be no more than twice the relative risk so long as the odds ratio times the initial risk is less than 100%
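The size of the overstatement can be explored numerically. This Python sketch (the course itself works in R) uses the standard identity RR = OR / (1 - p0 + p0 × OR), where p0 is the initial risk in the comparison group. The function name and the grid of values are choices made for illustration.

```python
def rr_from_or(odds_ratio, initial_risk):
    """Relative risk implied by an odds ratio at a given initial risk.

    Derived from odds = p / (1 - p): the comparison group has risk p0,
    the other group has odds OR * p0 / (1 - p0), and RR is the ratio
    of the two risks.
    """
    return odds_ratio / (1 - initial_risk + initial_risk * odds_ratio)


# Illustrative grid of initial risks and odds ratios.
for p0 in (0.01, 0.10, 0.50):
    for or_value in (0.5, 2.0, 4.0):
        rr = rr_from_or(or_value, p0)
        print(f"initial risk {p0:.2f}  OR {or_value:.1f}  RR {rr:.3f}  OR/RR {or_value / rr:.3f}")
```

At low initial risk the two measures almost coincide; the gap widens as the initial risk rises and as the odds ratio moves away from one, in line with the cautions above.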

Visualise Categorical Data

• When is a pie chart useful?
– Seldom: arguably only in metaphor

• How do you add dimensions to a bar chart?
– Cluster

• When is a 3D effect useful?
– Not in 2D concepts!
– Showing additional dimensions e.g. 2nd level cluster

[Figure: Not fat, Overweight and Obese by Year, 1988 to 2003; vertical axis 0% to 100%]

[Figure: Percentage Overweight (0% to 25%) by Fifths of Height Distribution (Shortest, 2, 3, 4, Tallest), grouped by Time Period: 1988-1991, 1992-1994, 1995-1998, 1999-2003]

[Figure: SDS BMI increase between 1988 and 2003 (0.0000 to 0.0800) by Tenth of height (smallest to tallest), 1 to 10]

What is arguably wrong with this visualisation?

Preparation for 15 Feb

• Read chapters 4,5,6 to understand natural distributions and sampling

• Return to chapter 3, run the examples in R and generate some alternative examples

• Prepare to show ideal visualisations and summaries with your R scripts