Basic Concepts in Statistics
Daniel Y.T. Fong
NURS8222 – Statistical Practice in Health Sciences
SCHOOL OF NURSING The University of Hong Kong
Learning Objectives
1. To learn the phases of analysis
2. To recap the key elements of statistics
3. To learn how to present statistical data
Phases of Analysis
Data Preparation/Clinical Data Management
Descriptive Statistics
Inferential Statistics
Presentation of Numerical Data
Clinical Data Management
~ Fong DYT (2001). Drug Information Journal.
Clinical Data Management (CDM)
A vital vehicle to ensure the integrity and quality of data being transferred from the study subjects to a database system
Statisticians Investigators
Sponsor
Bad CDM Practices
Management − lack of training, competition, insufficient standards, etc.
Study Design − poor data structure, unnecessary study complexity, etc.
The Important Words:
The most irresponsible source of error is, however, the assumption that the data come automatically error-free.
Poor Data Structure - Example
Q1. Gender: Female / Male (If “Male”, please go to Q?.)
Q2. Are you pregnant? No / Yes (If “No”, please answer Q?.)
Q3. Is it your first pregnancy? No / Yes (If “No”, please answer Q?.)
Q4. Is it your second pregnancy? No / Yes (If “No”, please answer Q?.)
Missing!
Poor Data Structure - Remedy
Use of referential questions should be minimized, if feasible
Q1. Gender: Female / Male
Q2. Are you pregnant? No / Yes / NA
Q3. How many times have you been pregnant? _____ / NA
A Common Scenario
■ Need to take BP twice, 5 minutes apart.
■ Only the average value is used.
What data will you enter into the database/CRF?
A. both readings
B. only the average value
■ Manual calculations should be strictly avoided
■ Similar to Quality of Life (questionnaire) data
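One way to avoid manual calculation is to enter both readings and let software derive the average. A minimal sketch, assuming hypothetical field names (`sbp_reading_1`, `sbp_reading_2`) and made-up readings:

```python
from statistics import mean

# Hypothetical record: both raw readings are entered exactly as taken.
record = {"subject_id": 101, "sbp_reading_1": 128, "sbp_reading_2": 132}

def average_bp(rec):
    """Derive the analysis value from the two stored readings."""
    return mean([rec["sbp_reading_1"], rec["sbp_reading_2"]])

# The average is computed by the program, never typed in by hand.
record["sbp_mean"] = average_bp(record)
print(record["sbp_mean"])
```

Keeping the raw readings means the derived value can always be recomputed and checked.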
Data Validation
■ Ensure data quality of final database
■ Sources of error:
– Source data → CRF
– CRF → Database
Not an Uncommon Data Entry Error
[Figure: a Case Report Form (Subject ID: 5; Visit date: . . .; Group: 1; Score: 0.494) next to the computer database, where the entered values are shown as “?”]
Can we detect the error?
[Figure: dot plot of scores in Group 1, axis from -5 to 10; Subject ID = 5, true score = 0.494]
Mean difference (Group2 – Group1) = 5.4 (SE = 2.3); p = 0.025
Mean difference (Group2 – Group1) = 5.6 (SE = 2.3); p = 0.020
Erroneous database
Error-free database
Before the Data are analyzable
■ Missing values
■ Logic checks
■ Dates
– range check
– relational conflicts
– outliers
■ Duplications
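The checks listed above can be automated. The following is a minimal sketch only: the records, field names (`score`, `enrol`, `visit`) and the plausible score range are all made-up assumptions, not part of the lecture.

```python
from collections import Counter
from datetime import date

# Hypothetical records; values are invented for illustration.
records = [
    {"id": 1, "score": 0.52, "enrol": date(2001, 1, 10), "visit": date(2001, 2, 10)},
    {"id": 2, "score": None, "enrol": date(2001, 1, 12), "visit": date(2001, 2, 12)},
    {"id": 3, "score": 49.4, "enrol": date(2001, 1, 15), "visit": date(2001, 2, 15)},
    {"id": 3, "score": 0.61, "enrol": date(2001, 1, 15), "visit": date(2001, 1, 5)},
]

SCORE_RANGE = (0.0, 1.0)  # assumed plausible range for the score

def validate(rows):
    """Return a list of (id, problem) pairs found by simple checks."""
    problems = []
    id_counts = Counter(r["id"] for r in rows)
    for r in rows:
        if r["score"] is None:                        # missing value
            problems.append((r["id"], "missing score"))
        elif not SCORE_RANGE[0] <= r["score"] <= SCORE_RANGE[1]:
            problems.append((r["id"], "score out of range"))  # range check
        if r["visit"] < r["enrol"]:                   # relational conflict in dates
            problems.append((r["id"], "visit before enrolment"))
        if id_counts[r["id"]] > 1:                    # duplication
            problems.append((r["id"], "duplicate id"))
    return problems

for issue in validate(records):
    print(issue)
```

Each flagged record still needs to be resolved against the source data; the script only points to where to look.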
Catches
1. Use well-designed data forms
2. Use standardized data entry templates
3. Train data entry personnel
Descriptive Statistics
Data Types
Quantitative (takes numerical values)
• Discrete (whole numbers), e.g. number of accidents, household size
• Continuous (takes decimal places), e.g. height, weight
Qualitative/Categorical (takes coded numerical values)
• Ordinal (ranking order exists), e.g. Poor/Average/Good
• Nominal (no ranking order), e.g. gender, race
Measures of Location
Median: middle value
  Advantages: 1. “Robust”, i.e. not affected by aberrant values
  Disadvantages: 1. Does not use all the data 2. Not easy to manipulate mathematically
Mean: average value
  Advantages: 1. Is the “expected” value 2. Uses all the data 3. Easy to calculate
  Disadvantages: 1. Not robust to aberrant values 2. Can be difficult to interpret due to “skewness”
Mode: most popular value
  Advantages: 1. Can be useful for discrete and categorical measurements
  Disadvantages: 1. Not useful for continuous data 2. May not be unique 3. Does not use all the data
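The robustness of the three measures can be seen on a small example. This is a sketch with made-up scores, using Python's standard `statistics` module; one aberrant value is appended to show its effect:

```python
from statistics import mean, median, multimode

scores = [2, 3, 3, 4, 5]        # made-up data
with_outlier = scores + [50]    # same data plus one aberrant value

# The mean moves a lot; the median and mode barely change.
print("mean  :", mean(scores), "->", mean(with_outlier))
print("median:", median(scores), "->", median(with_outlier))
print("mode  :", multimode(scores), "->", multimode(with_outlier))
```

`multimode` returns a list because, as noted above, the mode may not be unique.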
Skewness
Simplest measure of skewness: mean – median
− Mean – median > 0: right skewed
− Mean – median < 0: left skewed
[Figure: three distribution curves. Right skewed: mode < median < mean. Left skewed: mean < median < mode. Normal: mean, median and mode coincide.]
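The mean – median rule can be written as a short function. A minimal sketch; the function name and the sample values are illustrative assumptions:

```python
from statistics import mean, median

def skew_direction(values):
    """Classify skewness by the sign of (mean - median)."""
    diff = mean(values) - median(values)
    if diff > 0:
        return "right skewed"
    if diff < 0:
        return "left skewed"
    return "no skew indicated"

print(skew_direction([1, 2, 2, 3, 12]))     # long tail to the right
print(skew_direction([1, 10, 11, 11, 12]))  # long tail to the left
```

This is only the simplest indicator; it says nothing about how strong the skewness is.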
Measures of Dispersion
Range: maximum – minimum
  Advantages: 1. The simplest measure of dispersion
  Disadvantages: 1. Not robust to aberrant values 2. Does not use all the data
Interquartile range: 3rd quartile – 1st quartile
  Advantages: 1. Robust to aberrant values 2. Includes 50% of the data
  Disadvantages: 1. Does not use all the data 2. Not easy to manipulate mathematically
SD: average distance of each score from the mean
  Advantages: 1. For normal distribution, mean and SD describe the entire distribution 2. Easy to manipulate mathematically
  Disadvantages: 1. Not robust to aberrant values
for measurements at least ordinal
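A sketch of the three dispersion measures on made-up data, using the standard `statistics` module. Note that quartiles can be defined in several ways; `quantiles` here uses its default ("exclusive") method, so other software may give slightly different values.

```python
from statistics import quantiles, stdev

def dispersion(values):
    """Return range, interquartile range and sample SD of the values."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        # Sample SD: square root of the sum of squared distances
        # from the mean divided by (n - 1).
        "sd": stdev(values),
    }

data = [2, 4, 4, 5, 7, 9, 11]   # made-up scores
print(dispersion(data))
print(dispersion(data + [100])) # one aberrant value added
```

Adding the aberrant value inflates the range and SD far more than the interquartile range.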
Deciding the Marriage
Location Measures
• Median: middle value
• Mean: average value
Dispersion Measures
• Range: maximum – minimum
• Interquartile range: 3rd quartile – 1st quartile
• SD: average distance of each score from the mean
Characterizing a Sample
Inclusion Criteria: age 18 yrs or above; HbA1c 0.7

        N    No. of missing   Mean   SD     Median   Min   Max
Age     45   1                50.2   7.2    53.4     38    65
HbA1c   39   7                1.2    0.09   0.9      0.7   1.2

          N     %
Gender    (2)   (4.3)
  Female  41    89.1
  Male    3     6.5
*Numbers in parentheses correspond to missing values.
Inferential Statistics
The Three Basic Elements
To estimate an unknown/population quantity
To determine the existence of a relationship
Estimation (using CI) or Significance Testing (using p-value)?
A. Is cancer associated with salt intake?
B. What is the prevalence of hypertension in Hong Kong?
C. How much longer can I live if I exercise regularly?
D. Can Vitamin C prolong my life span?
Deciding a Significance Test?
1. Specify your objective
2. Identify the outcome variable
3. Identify its measurement scale
4. Find out the study design
5. ….. more?? (experience)
Deciding a Significance Test – Example 1
1. Specify your objective
To examine the effect of brief problem solving treatment in primary care at 52 weeks
2. Identify the outcome variable
SF-36: PF, RP, BP, VT, GH, RE, MH, SF
3. Identify its measurement scale
Continuous
4. Find out the study design
RCT with two parallel arms
~ Modified from Lam et al. 2010 Arch Geron & Geri
Deciding a Significance Test – Example 2
1. Specify your objective
To examine the effect of brief problem solving treatment in primary care at 52 weeks
2. Identify the outcome variable
Consultation in the past month
3. Identify its measurement scale
Nominal/Categorical with two levels (Yes/No)
4. Find out the study design
RCT with two parallel arms
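For a Yes/No outcome compared between two parallel arms, as in Example 2, one commonly used choice is Pearson's chi-square test on the 2×2 table. The slides do not name a test, so this is an illustrative assumption, and the counts below are invented. With 1 degree of freedom the p-value can be computed from the standard library alone.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table
        [[a, b],
         [c, d]]
    Returns (statistic, p_value) on 1 degree of freedom."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 df, P(chi-square > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Invented counts: rows = treatment arms, columns = consultation Yes / No.
stat, p = chi_square_2x2(30, 70, 45, 55)
print(f"chi-square = {stat:.2f}, p = {p:.3f}")
```

With small expected counts an exact test would usually be preferred over this approximation.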
~ Modified from Lam et al. 2010 Arch Geron & Geri
Definition of p-value
P-value is a probability between 0 and 1
The larger the p-value, the less likely we are to reject the null hypothesis (i.e. conclude significance). Why?
It is the observed chance of committing a False Positive Error
Use of p-value
A small p-value implies a small chance of committing a false positive error (had we concluded a significant result)
A large p-value implies a high chance of committing a false positive error (had we concluded a significant result)
Hence, we should only conclude a significant result when p-value is small
We may commit a false positive error only when we conclude a result is statistically significant.
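The link between a small p-value cut-off and false positive errors can be illustrated by simulation. The sketch below is an assumption-laden illustration, not the lecture's method: it repeatedly draws data for which the null hypothesis is true (normal data with mean 0 and known SD 1), applies a one-sample z-test, and counts how often a "significant" result is declared at the 5% level.

```python
import math
import random

def z_test_p(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value of a one-sample z-test with known sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(alpha=0.05, n_trials=20000, n=20, seed=1):
    """Proportion of simulated studies declared significant when H0 is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # H0 holds by construction
        if z_test_p(sample) < alpha:
            hits += 1  # a false positive
    return hits / n_trials

print(false_positive_rate())  # expected to be close to alpha
```

Lowering `alpha` lowers the proportion of false positives in the same way.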
Suggested Interpretation of p-values
[Figure: p-value scale marked at 1.0, 0.1, 0.01, 0.001 and 0.0001, running from weak evidence against H0 (large p-values) to strong evidence against H0 (small p-values)]
Level of significance: 0.05, 0.01 or 0.10
Must be specified before data analysis in confirmatory studies
~ BMJ (2001)
It is the observed chance of committing a False Positive Error
p-value (H0: no difference) is 0.038
Catches …
1. Estimation or Significance test?
2. Be clear of your objective
3. Be clear of your outcome
4. Be clear of your design
5. Conclude statistical significance only when p-value is small (often smaller than 5%)
Presentation of Numerical Data
Altman & Bland (1996). BMJ.
Bar Chart
■ Nominal data
■ Ordinal data
[Figure: bar chart “Criteria of Picking Partners” ~ Apple Daily Feb 21, 2001. Axis: Percentage (0 to 60). Categories: Talents, Qualification, Personality, Outlook, Others, Occupation, None, No response, Life style, Income, Hobbies, Health, Feeling, Age]
[Figure: graphs of “Mortality due to Cancer of the Oesophagus, England and Wales”; axis: Deaths/100,000]
Which graph is more trustworthy?
Nasal symptom scores
[Figure: Sneezing and Itchiness, each rated from 0 (None) to 10 (Very severe), plotted against Weeks (-4 to 28); one panel has its score axis drawn from 0 to 5, the others from 0 to 10]
Which symptom improved more?
No difference!
Numerical Presentation
Summary statistics such as means should not be given more than one extra decimal place over the raw data (similarly for SD).
e.g. Raw: 3.45; mean = 5.1, 5.12, or 5.123 BUT NOT 5.1234
Correlation coefficients should be given no more than two decimal places.
e.g. r = 0.1 or 0.12 BUT NOT 0.123
For confidence intervals, “12.4 to 53.9” or “(12.4, 53.9)” are better than “12.4-53.9”
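These rounding rules are easy to apply consistently in code. A minimal sketch; the helper names (`format_mean`, `format_r`, `format_ci`) are hypothetical, and the raw value is passed as text so that its number of decimal places is known:

```python
def decimals_in(raw: str) -> int:
    """Number of decimal places in a raw value written as text."""
    return len(raw.split(".")[1]) if "." in raw else 0

def format_mean(value: float, raw_example: str, extra: int = 1) -> str:
    """Mean (or SD) with at most `extra` more decimals than the raw data."""
    return f"{value:.{decimals_in(raw_example) + extra}f}"

def format_r(r: float) -> str:
    """Correlation coefficient to two decimal places."""
    return f"{r:.2f}"

def format_ci(lower: float, upper: float, places: int = 1) -> str:
    """Confidence interval written with 'to' rather than a hyphen."""
    return f"{lower:.{places}f} to {upper:.{places}f}"

print(format_mean(5.12345, "3.45"))
print(format_r(0.1234))
print(format_ci(12.4, 53.9))
print(format_ci(-3.2, 5.1))  # a hyphen here would be confused with the minus sign
```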
Numerical Presentation (2)
• 3 decimal places for a p-value, if possible
It should be reported as p=0.024
It means p<0.0005 and should be reported as p<0.001
p=0.046 and p=0.053 both round to p=0.05: significant?
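The reporting rule for p-values can be sketched as a small helper. The function name `format_p` is hypothetical; the 0.0005 threshold follows the rule stated above, under which values that would round to 0.000 are reported as p<0.001.

```python
def format_p(p: float) -> str:
    """Report a p-value to 3 decimal places, or as p<0.001 when very small."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.0005:  # would otherwise be printed as 0.000
        return "p<0.001"
    return f"p={p:.3f}"

for value in (0.0241, 0.00002, 0.046, 0.053):
    print(format_p(value))
```

Reporting three decimals keeps 0.046 and 0.053 distinguishable, which two decimals would not.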
Numerical Presentation (3)
Avoid p<0.05 or p<0.01, in general
Even worse: p=NS
p=0.001 vs p=0.049: Is there a large difference in level of evidence?
p=0.048 vs p=0.051: Now, is there a large difference in level of evidence?
P<0.05 vs P<0.05
P<0.05 vs P=NS
Can you tell the difference?
A Painful Reality
Data Preparation/Clinical Data Management
Descriptive Statistics
Inferential Statistics
Presentation of Data
FAQs
1. Do we need the data to be 100% cleaned?
2. How aberrant must a value be before we can call it aberrant?
3. Can we conclude statistical significance when p-value < 0.1?