basic concepts in statistics - hku nursingnursing.hku.hk/biostats/tank/lecture1.pdf ·...

Basic Concepts in Statistics

Daniel Y.T. Fong

NURS8222 – Statistical Practice in Health Sciences

SCHOOL OF NURSING The University of Hong Kong

Learning Objectives

1. To learn the phases of analysis

2. To recap the key elements of statistics

3. To learn how to present statistical data

Phases of Analysis

Data Preparation/Clinical Data Management

Descriptive Statistics

Inferential Statistics

Presentation of Numerical Data

Clinical Data Management

~ Fong DYT (2001). Drug Information Journal.

Clinical Data Management (CDM)

A vital vehicle to ensure the integrity and quality of data being transferred from the study subjects to a database system

Statisticians Investigators

Sponsor

Bad CDM Practices

Management: lack of training, competition, insufficient standards, etc.

Study Design: poor data structure, unnecessary study complexity, etc.

The most irresponsible source of error is, however, the assumption that the data are automatically error-free.

Poor Data Structure - Example

The Important Words:

Q1. Gender: Female / Male (If “Male”, please go to Q?.)

Q2. Are you pregnant? No / Yes (If “No”, please answer Q?.)

Q3. Is it your first pregnancy? No / Yes (If “No”, please answer Q?.)

Q4. Is it your second pregnancy? No / Yes (If “No”, please answer Q?.)

Missing!

Poor Data Structure - Remedy

Use of referential questions should be minimized, if feasible

Q1. Gender: Female / Male

Q2. Are you pregnant? No / Yes / NA

Q3. How many times have you been pregnant? _____ / NA

A Common Scenario

￭ Need to take BP twice, 5 minutes apart.
￭ Only the average value is used.

What data will you enter into the database/CRF?

A. both readings
B. only the average value

￭ Manual calculations should be strictly avoided
￭ Similar to Quality of Life (questionnaire) data
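One way to act on the advice against manual calculations is to enter both readings and let the analysis program derive the average. The sketch below is only an illustration: the field names (`sbp_1`, `sbp_2`, `sbp_mean`) and the values are hypothetical, not taken from any real CRF.

```python
from statistics import mean

def average_reading(readings):
    """Return the mean of the non-missing readings, or None if all are missing."""
    present = [r for r in readings if r is not None]
    return mean(present) if present else None

# Hypothetical CRF record: both systolic BP readings are entered as recorded.
record = {"subject_id": 5, "sbp_1": 128, "sbp_2": 124}

# The average is derived at analysis time, never calculated and typed in by hand.
record["sbp_mean"] = average_reading([record["sbp_1"], record["sbp_2"]])
print(record["sbp_mean"])
```

Keeping both raw readings also means the average can be rechecked against the source data later.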

Data Validation

￭ Ensure data quality of final database
￭ Sources of error:
– Source data to CRF
– CRF to Database

Not an Uncommon Data Entry Error

Case Report Form: Subject ID: 5; Visit date: ...; Group: 1; Score: 0.494

The computer database: ?

Can we detect the error?

[Figure: dot plot of scores in Group 1, axis from -5 to 10; Subject ID = 5, true score = 0.494]

Mean difference (Group2 - Group1) = 5.4 (SE = 2.3); p = 0.025

Mean difference (Group2 - Group1) = 5.6 (SE = 2.3); p = 0.020

Erroneous database

Error-free database

Before the Data are Analyzable

￭ Missing values
￭ Logic checks
￭ Dates
– range check
– relational conflicts
– outliers
￭ Duplications

Catches

1. Use well designed data forms
2. Use standardized data entry template
3. Train data entry personnel
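The checks listed above can be automated before any analysis is run. The following is a minimal sketch under stated assumptions: the records, the field names and the valid score range of 0 to 1 are all hypothetical and chosen only to trigger each check.

```python
from collections import Counter

# Hypothetical records as they might be keyed in from CRFs.
records = [
    {"subject_id": 1, "group": 1, "score": 0.512},
    {"subject_id": 2, "group": 2, "score": None},   # missing value
    {"subject_id": 5, "group": 1, "score": 4.94},   # outside the assumed range
    {"subject_id": 5, "group": 1, "score": 0.494},  # duplicated subject ID
]

def validate(records, low=0.0, high=1.0):
    """Return a list of (subject_id, problem) pairs found in the records."""
    problems = []
    id_counts = Counter(r["subject_id"] for r in records)
    for r in records:
        sid = r["subject_id"]
        if r["score"] is None:
            problems.append((sid, "missing score"))
        elif not low <= r["score"] <= high:
            problems.append((sid, "score out of range"))
        if id_counts[sid] > 1:
            problems.append((sid, "duplicate subject ID"))
    return problems

for sid, problem in validate(records):
    print(sid, problem)
```

A report like this only flags candidates; each flagged value still has to be checked against the source document.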


Data Types

Quantitative (takes numerical values)

• Discrete (whole numbers), e.g. number of accidents, household size

• Continuous (takes decimal places), e.g. height, weight

Qualitative/Categorical (takes coded numerical values)

• Ordinal (ranking order exists), e.g. Poor/Average/Good

• Nominal (no ranking order), e.g. gender, race

Measures of Location

Median (middle value)
Advantages:
1. “Robust”, i.e. not affected by aberrant values
Disadvantages:
1. Does not use all the data
2. Not easy to manipulate mathematically

Mean (average value)
Advantages:
1. Is the “expected” value
2. Uses all the data
3. Easy to calculate
Disadvantages:
1. Not robust to aberrant values
2. Can be difficult to interpret due to “skewness”

Mode (most popular value)
Advantages:
1. Can be useful for discrete and categorical measurements
Disadvantages:
1. Not useful for continuous data
2. May not be unique
3. Does not use all the data
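The robustness point can be seen with a small numerical illustration. The household sizes below are hypothetical, and the value 40 stands in for an aberrant entry; the sketch uses Python's standard `statistics` module.

```python
from statistics import mean, median, multimode

# Hypothetical household sizes; 40 is an aberrant entry added for comparison.
sizes = [2, 3, 3, 4, 4, 4, 5]
with_outlier = sizes + [40]

# The mean moves noticeably with one aberrant value; median and mode do not.
print(mean(sizes), median(sizes), multimode(sizes))
print(mean(with_outlier), median(with_outlier), multimode(with_outlier))

# The mode may not be unique: multimode returns every most frequent value.
print(multimode([1, 1, 2, 2]))
```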

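Skewness

Simplest measure of skewness: mean - median
- Mean - median > 0: right skewed
- Mean - median < 0: left skewed

[Figure: three distribution curves. Right skewed: mode, median, mean (from left to right). Left skewed: mean, median, mode (from left to right). Normal: mean, median and mode coincide.]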
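The mean minus median rule is easy to apply in code. This is a rough sketch of that rule only, not a formal skewness coefficient; the function name `skew_direction` and the data are hypothetical.

```python
from statistics import mean, median

def skew_direction(values):
    """Classify skewness by the sign of mean minus median."""
    diff = mean(values) - median(values)
    if diff > 0:
        return "right skewed"
    if diff < 0:
        return "left skewed"
    return "no skew detected"

# Hypothetical lengths of hospital stay in days, with a long right tail.
stays = [1, 2, 2, 3, 3, 4, 15]
print(skew_direction(stays))
```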

Measures of Dispersion

Range (maximum - minimum)
Advantages:
1. The simplest measure of dispersion
Disadvantages:
1. Not robust to aberrant values
2. Does not use all the data

Interquartile range (3rd quartile - 1st quartile)
Advantages:
1. Robust to aberrant values
2. Includes 50% of the data
Disadvantages:
1. Does not use all the data
2. Not easy to manipulate mathematically

SD (average distance of each score from the mean)
Advantages:
1. For normal distribution, mean and SD describe the entire distribution
2. Easy to manipulate mathematically
Disadvantages:
1. Not robust to aberrant values

for measurements at least ordinal
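The three dispersion measures can be compared on the same data. The sketch below assumes hypothetical ages and uses the sample SD (`statistics.stdev`) and the default quartile method of `statistics.quantiles`; other software may compute quartiles slightly differently.

```python
from statistics import quantiles, stdev

def dispersion(values):
    """Return the range, interquartile range and sample SD of the values."""
    q1, _, q3 = quantiles(values, n=4)  # default "exclusive" method
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "sd": stdev(values),
    }

ages = [38, 42, 45, 47, 50, 52, 55, 58, 65]  # hypothetical ages
print(dispersion(ages))

# Replace the largest age with an aberrant value of 120.
print(dispersion(ages[:-1] + [120]))
```

In this example the range and SD grow with the aberrant value while the interquartile range stays the same.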

Deciding the Marriage

Location Measures
- Median: middle value
- Mean: average value

Dispersion Measures
- Range: maximum - minimum
- Interquartile range: 3rd quartile - 1st quartile
- SD: average distance of each score from the mean

Characterizing a Sample

Inclusion Criteria: age 18 yrs or above; HbA1c 0.7

         N    No. of missing   Mean   SD     Median   Min   Max
Age      45   1                50.2   7.2    53.4     38    65
HbA1c    39   7                1.2    0.09   0.9      0.7   1.2

Gender      N     %
(missing)   (2)   (4.3)
Female      41    89.1
Male        3     6.5

*Numbers in parentheses correspond to missing values.
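A summary row of this kind can be produced directly from the data, which keeps N and the number of missing values consistent with the other statistics. This is a minimal sketch: the function name `summarize` and the ages are hypothetical, and missing values are assumed to be coded as `None`.

```python
from statistics import mean, median, stdev

def summarize(values):
    """Summarize a variable, counting None entries as missing."""
    present = [v for v in values if v is not None]
    return {
        "N": len(present),
        "missing": len(values) - len(present),
        "mean": round(mean(present), 1),
        "sd": round(stdev(present), 1),
        "median": median(present),
        "min": min(present),
        "max": max(present),
    }

ages = [38, 45, None, 52, 65, 50]  # hypothetical ages with one missing value
print(summarize(ages))
```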


The Three Basic Elements

To estimate an unknown/population quantity

To determine the existence of a relationship

Estimation (using CI) or Significance Testing (using p-value)?

A. Is cancer associated with salt intake?
B. What is the prevalence of hypertension in Hong Kong?
C. How much longer can I live if I exercise regularly?
D. Can Vitamin C prolong my life span?
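A question such as B is an estimation problem, answered with a point estimate and a confidence interval. As a rough sketch, the code below computes a Wald (normal approximation) 95% CI for a proportion; the counts are hypothetical and the Wald method is one common choice among several, picked here only for simplicity.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(events, n, level=0.95):
    """Wald (normal approximation) confidence interval for a proportion."""
    p = events / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical survey: 270 of 1000 respondents have hypertension.
estimate, lower, upper = proportion_ci(270, 1000)
print(f"{estimate:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```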

Deciding a Significance Test?

1. Specify your objective

2. Identify the outcome variable

3. Identify its measurement scale

4. Find out the study design

5. ….. more?? (experience)

Deciding a Significance Test - Example 1

1. Specify your objective

To examine the effect of brief problem solving treatment in primary care at 52 weeks

2. Identify the outcome variable

SF-36: PF, RP, BP, VT, GH, RE, MH, SF

3. Identify its measurement scale

Continuous

4. Find out the study design

RCT with two parallel arms

~ Modified from Lam et al. 2010 Arch Geron & Geri
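The slide does not name the test for this example. For a continuous outcome compared between two parallel arms, a two-sample t-test is a conventional choice; the sketch below instead uses a permutation test on the difference in means, swapped in only because it needs nothing beyond the Python standard library. The scores are hypothetical and are not data from the cited study.

```python
import random

def permutation_test(group1, group2, n_perm=10000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    n1 = len(group1)
    pooled = list(group1) + list(group2)

    def mean_diff(values):
        first, second = values[:n1], values[n1:]
        return abs(sum(second) / len(second) - sum(first) / len(first))

    observed = mean_diff(pooled)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign subjects to arms at random
        if mean_diff(pooled) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (extreme + 1) / (n_perm + 1)

# Hypothetical scores at 52 weeks in the two arms.
control = [50, 52, 48, 51, 49, 53]
treated = [60, 62, 58, 61, 59, 63]
print(permutation_test(control, treated))
```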

Deciding a Significance Test - Example 2

1. Specify your objective

To examine the effect of brief problem solving treatment in primary care at 52 weeks

2. Identify the outcome variable

Consultation in the past month

3. Identify its measurement scale

Nominal/Categorical with two levels (Yes/No)

4. Find out the study design

RCT with two parallel arms
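The slide does not name the test here either. For a Yes/No outcome compared between two arms, a Pearson chi-square test on the 2x2 table is one common option, sketched below without continuity correction. The counts are hypothetical, and the p-value uses the identity that holds for 1 degree of freedom.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Rows are the two arms; columns are the Yes/No counts: [[a, b], [c, d]].
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: consultation in the past month (Yes, No) by arm.
stat, p = chi_square_2x2(30, 70, 45, 55)
print(f"chi-square = {stat:.2f}, p = {p:.3f}")
```

With small expected counts an exact test would usually be preferred over this approximation.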

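~ Modified from Lam et al. 2010 Arch Geron & Geri

Definition of p-value

P-value is a probability between 0 and 1

The larger the p-value, the less likely we are to reject the null hypothesis (i.e. to conclude a significant result). Why?

It is the observed chance of committing a False Positive Error, i.e. the probability, if the null hypothesis were true, of obtaining a result at least as extreme as the one observed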

Use of p-value

A small p-value implies a small chance of committing a false positive error (had we concluded a significant result)

A large p-value implies a high chance of committing a false positive error (had we concluded a significant result)

Hence, we should only conclude a significant result when p-value is small

We may commit a false positive error only when we conclude a result is statistically significant.

Suggested Interpretation of p-values

[Figure: p-value scale running from 1.0 through 0.1, 0.01 and 0.001 to 0.0001, from weak evidence against H0 (large p-values) to strong evidence against H0 (small p-values)]

Level of significance: 0.05, 0.01 or 0.10

Must be specified before data analysis in confirmatory studies

~ BMJ (2001)

It is the observed chance of committing a False Positive Error

p-value (H0: no difference) is 0.038

Catches …

1. Estimation or Significance test?
2. Be clear of your objective
3. Be clear of your outcome
4. Be clear of your design
5. Conclude statistical significance only when p-value is small (often smaller than 5%)

Presentation of Numerical Data

Altman & Bland (1996). BMJ.

Bar Chart

￭ Nominal data
￭ Ordinal data

[Figure: bar chart "Criteria of Picking Partners"; axis: Percentage (0 to 60); categories: Talents, Qualification, Personality, Outlook, Others, Occupation, None, No response, Life style, Income, Hobbies, Health, Feeling, Age]

~ Apple Daily Feb 21, 2001

Mortality due to Cancer of the Oesophagus, England and Wales

[Figure: graphs of mortality; axis: Deaths/100,000]

Which graph is more trustworthy?

Nasal symptom scores

Sneezing: None (0) to Very severe (10)

Itchiness: None (0) to Very severe (10)

[Figure: symptom scores plotted against Weeks (-4 to 28); one panel uses a score axis from 0 to 5, the others from 0 to 10]

Which symptom was improved more?

No difference!

Numerical Presentation

Summary statistics such as means should not be given more than one extra decimal place over the raw data (similarly for SD).
e.g. Raw: 3.45; mean = 5.1, 5.12, or 5.123 BUT NOT 5.1234

Correlation coefficients should be given no more than two decimal places.
e.g. r = 0.1 or 0.12 BUT NOT 0.123

For confidence intervals, “12.4 to 53.9” or “(12.4, 53.9)” are better than “12.4-53.9”
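These reporting rules can be built into small formatting helpers so that every table follows them. The function names `format_mean` and `format_ci` below are hypothetical, and the sketch simply applies the maximum precision the rules allow.

```python
def format_mean(value, raw_decimals):
    """Format a mean with one more decimal place than the raw data."""
    return f"{value:.{raw_decimals + 1}f}"

def format_ci(lower, upper, decimals=1):
    """Format a confidence interval as 'lower to upper'."""
    return f"{lower:.{decimals}f} to {upper:.{decimals}f}"

print(format_mean(5.1234, raw_decimals=2))
print(format_ci(12.4, 53.9))
print(format_ci(-2.1, 3.4))
```

Writing "to" instead of a dash also avoids confusion with a minus sign when a limit is negative.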

Numerical Presentation (2)

• 3 decimal places for a p-value, if possible

It should be reported as p=0.024

It means p<0.0005 and should be reported as p<0.001

With only 2 decimal places, 0.046 and 0.053 are both reported as p=0.05. Significant?
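The p-value reporting rule above can be written as a small helper. This is a minimal sketch; the function name `format_p` is hypothetical.

```python
def format_p(p):
    """Report a p-value to 3 decimal places, or as p<0.001 when smaller."""
    if not 0 <= p <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.001:
        return "p<0.001"
    return f"p={p:.3f}"

for value in (0.0243, 0.0004, 0.046, 0.053):
    print(format_p(value))
```

With 3 decimal places, 0.046 and 0.053 remain distinguishable in the report.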

Numerical Presentation (3)

Avoid p<0.05 or p<0.01, in general

Even worse p=NS

p=0.001 and p=0.049 (both reported as P<0.05): Is there a large difference in level of evidence?

p=0.048 and p=0.051 (reported as P<0.05 and P=NS): Now, is there a large difference in level of evidence?

Can you tell the difference?

A Painful Reality

Data Preparation/Clinical Data Management



Presentation of Data

FAQs

1. Do we need 100% data cleaned?
2. How aberrant is a value before we can call it aberrant?
3. Can we conclude statistical significance when p-value < 0.1?
