Prof. Dr. Mohamed Fahmy Amin

Professor of Community Medicine

Department

Dr. Dalia Gaber Sos

Assistant Professor of Community Medicine

Community Medicine Department

Faculty of Medicine

Modern University for Technology and

Information

Cairo – Egypt

Lecture Notes on:

Basics of Research

Methodology and

Biostatistics

For First Year

Medical Students

2021 - 2022

List of Contents

No. Subject

1. Chapter (1): Introduction to Research and Research Process

2. Chapter (2): Research Design I - Descriptive Studies

3. Chapter (3): Research Design II - Analytical Studies - Cohort and Case Control Studies

4. Chapter (4): Applied Intervention Studies (clinical trial)

5. Chapter (5): Protocol Writing

6. Chapter (6): Source of Data and Types of Variables

7. Chapter (7): Data Presentation

8. Chapter (8): Descriptive Statistics

9. Chapter (9): Applied Statistics (Normal Distribution Curve)

Chapter (1)

Introduction to Research and Research Process

Intended Learning Outcomes:

By the end of this chapter the student should be able to:

1. Define research.

2. Know the motives for conducting research.

3. Identify different types of research.

4. Understand the criteria of a good research.

5. Understand the importance of studying statistics and biostatistics.

6. Define how research problems and questions are formulated.

7. Outline the objective of research.

8. Identify the different steps of the research process.

Content:

I. Introduction to Research

1. Meaning of research.

2. Motivation in research.

3. Types of research.

4. Criteria of a good research

5. Statistics and biostatistics.

II. Research Process

1. Definition of research problem and research questions.

2. Reviewing literature

3. Objective of research.

4. Research hypothesis.

5. Research design.

6. Sampling design.

7. Data collection and analysis.

8. Interpretation and report writing.

I. Introduction to Research

1. Meaning of research

Research can be defined as follows:

A scientific and systematic way for collecting information on a specific topic.

An organized and systematic way of finding answers to a specific problem.

Systematized effort to gain new knowledge.

A movement from the known to the unknown.

An attempt to discover something.

2. Motivation in research

What makes people undertake research?

The possible motives for doing research may be one or more of the following:

Desire to get a research degree.

Desire to face the challenge in solving the unsolved problems.

Desire to get intellectual joy of doing some creative work.

Desire to serve society.

Desire to get respectability.

Desire of the government to understand some health problem and to find a solution for it.

3. Types of research

Fig. 1. Illustrating the different types of research

3.1. Fundamental vs Applied

Fundamental research (basic or pure) is mainly concerned with gathering information and

formulation of a theory.

e.g. A natural phenomenon.

Applied research (or action) aims at finding a solution for an immediate problem facing a

society or an industrial/business organization.

e.g. Treating or curing a specific disease.

Thus, applied research aims to discover a solution for a specific problem, whereas basic research is directed towards finding information that has a broad base of applications and thus adds new findings to existing scientific knowledge.

3.2. Descriptive vs. Analytical

Descriptive research: the major purpose is to describe the problem as it exists. The researcher can only report what has happened or what is happening, describing the problem through surveys and fact-finding. The methods used are survey methods, including comparative and correlational methods.

e.g. Describing obesity among young children.

Analytical research: the researcher uses facts or information already available and analyzes them to make a critical evaluation of the material.

e.g. The relation between smoking and lung cancer.

3.3. Quantitative vs. Qualitative

Quantitative research is based on the measurement of quantity or amount. It is applicable to

phenomena that can be expressed in terms of quantity.

e.g. Measuring the number of school students suffering from anemia and dental caries.

Qualitative research is concerned with qualitative phenomena, i.e. phenomena that are difficult or impossible to quantify, such as beliefs, meanings, feelings, and attitudes. Qualitative research is important in the behavioral sciences, where the aim is to discover the underlying motives of human behavior.

e.g. How people feel or what they think about a particular subject.

3.4. Conceptual vs. Empirical

Conceptual research is that related to some abstract idea(s) or theory. It is generally used by

philosophers and thinkers to develop new concepts or to reinterpret existing ones. It doesn't

involve any practical experiments.

Empirical research (experimental research) relies on experience or observation alone, often

without regard for system and theory. It is data-based research, coming up with conclusions

which are capable of being verified by observation or experiment. Empirical research is appropriate when proof is sought that certain variables affect other variables in some way.

e.g. Use of antihypertensive drugs to decrease blood pressure.

4. Criteria of a good research

Good research fulfils the following:

1. The purpose of the research should be clearly defined.

2. The research procedure used should be described in sufficient detail to permit another

researcher to repeat the research for further advancement.

3. The procedural design of the research should be carefully planned to yield results that are

as objective as possible.

4. The analysis of data should be sufficiently adequate to reveal its significance and the

methods of analysis used should be appropriate.

5. Conclusions should be confined to those justified by the data of the research.

5. Statistics/ Biostatistics

Statistics is a branch of mathematics dealing with the collection, analysis, interpretation,

and presentation of data.

Biostatistics is the application of statistical processes and methods to the collection, analysis, and interpretation of biological data, especially data related to human biology, health, and medicine.

Importance of biostatistics in research

Discover the causes and risks of diseases.

Reach conclusions within certain population groups about different diseases.

Determine how diseases develop, progress, and spread.

II. Research Process

1. Definition of research problem and research questions

Definition of research problem:

A research problem is a question that the researcher wants to answer or a problem that the researcher wants to solve. It is an area of concern where there is a gap in the knowledge needed for professional practice.

In general, a research problem refers to some difficulty that a researcher experiences in either a theoretical or a practical situation and for which he or she wants to obtain a solution.

Necessity of defining the problem

A proper definition of the research problem keeps the researcher on track, whereas an ill-defined problem may create difficulties. Defining a research problem properly is a prerequisite for any study and is a step of the highest importance. In fact, formulation of a problem is often more essential than its solution. Only by carefully detailing the research problem can we work out the research design and smoothly carry out all the subsequent steps of the research.

Research questions

The researcher poses a number of questions when studying the problem to help complete the research process, such as: What data are to be collected? What characteristics of the data are relevant and need to be studied? What relations are to be explored? What techniques are to be used for the purpose? The researcher can plan a strategy and find answers to such questions only when the research problem has been well defined.

Criteria of a good research question (FINER)

1. "F": Feasible: you and/or the research team have enough budget, time, participants, and appropriate expertise to manage the research.

2. "I": Interesting: at least to the investigator.

3. "N": Novel: confirms or refutes previous findings, extends previous findings, or provides new findings.

4. "E": Ethical: no harm inflicted and no benefit denied.

5. "R": Relevant: to scientific knowledge, to clinical and health policy, and/or to future research directions.

Identification of a research problem

A research problem can be identified by reviewing the literature, consulting a group of experts, drawing on personal research experience, and listening to patients.

2. Reviewing literature

When planning a research project, it is essential to know what the current state of knowledge is

in your chosen subject as it is obviously a waste of time to spend months producing knowledge

that is already available or not important to research. Therefore, one of the first steps in planning

a research project is to do a literature review: that is, to search through all the available

information sources in order to track down the latest knowledge, and to assess it for relevance,

quality, controversy and gaps.

Sources could be:

Previous research.

Data in various organizations.

Experts' opinions.

Journals and newspapers.

Electronic databases.

3. Objective of research

The aim of research is to find out the truth which is hidden and has not yet been discovered. Although each research study has its own specific purpose, research objectives fall into the following broad groupings:

1. To gain familiarity with a phenomenon or to achieve new insights into it.

2. To reveal accurately the characteristics of a particular individual, situation or a group.

3. To determine the frequency with which something occurs or with which it is associated

with something else.

4. To test a hypothesis and predict the relationship between variables.

Types of objectives

There are two types of research objectives, general and specific objectives.

Criteria of good objectives

Objectives should be clear, specific, focused, measurable, attainable, relevant, and refer to the time frame of the study.

Action verbs should be used when stating objectives.

4. Research hypothesis

Definition

A hypothesis is a proposition that is stated in a testable form and predicts a particular relationship

between two or more variables. In other words, if we think that a relationship exists, we first

state it as hypothesis and then test the hypothesis in the field. A hypothesis is written in such a

way that it can be proven or disproven by valid and reliable data.

e.g. Retinal detachment is more common in those who have a family history of diabetes.

Characteristics: a hypothesis must possess the following characteristics:

1. It should be clear and precise.

2. It should be capable of being tested.

3. Its validity should be unknown.

4. It should state relationship between variables.

5. It should be limited in scope and must be specific.

6. It should be stated in simple terms.

Importance

The role of the hypothesis is to guide the researcher by specifying the area of the research and to

keep him on the right track.

The hypothesis translates the research question into a prediction of expected outcomes. The

researcher starts with a hypothesis and conducts the study to prove or disprove this hypothesis.

5. Research design

The research design is an outline of what the researcher will do, from writing the hypothesis and its operational implications to the final analysis of data. Because several research designs exist, the researcher must decide, before collecting and analyzing data, which design is most appropriate for the research project. Different types of research design will be discussed in the next chapter.

6. Sampling design

A sample design is a definite plan for obtaining a sample from a given population. It refers to the

technique or the procedure the researcher would adopt in selecting items for the sample. Sample

design may as well lay down the number of items to be included in the sample i.e., the size of the

sample. Sample design is determined before data are collected. There are many sample designs

from which a researcher can choose. The researcher must select or prepare a sample design that is reliable and appropriate for the research study. The importance of sampling is that it decreases the cost, time, and effort the researcher spends on the study.
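As an illustration of one possible sample design, the following minimal Python sketch draws a simple random sample from a sampling frame. Simple random sampling is only one of the many designs a researcher can choose from, and the frame of 500 student ID numbers, the sample size of 50, and the function name simple_random_sample are all assumptions made for this example.

```python
import random


def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct units from population, without replacement."""
    if sample_size > len(population):
        raise ValueError("sample size cannot exceed population size")
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(population, sample_size)


# Hypothetical sampling frame: ID numbers of 500 students.
sampling_frame = list(range(1, 501))

# The sample size is decided before data collection, as the sample design requires.
sample = simple_random_sample(sampling_frame, sample_size=50, seed=2021)
print(len(sample), "students selected out of", len(sampling_frame))
```

Every unit in the frame has the same chance of selection, and only 50 of the 500 students need to be examined, which is where the saving in cost, time, and effort comes from.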

7. Data collection and analysis

Data collection

The task of data collection begins after the research problem and design have been defined.

Methods of data collection

Observation: It is the most commonly used method, especially in studies related to behavioral science.

Interview: Personal interviews, such as face-to-face interviews, are costly and time-consuming.

Questionnaires: Used by the researcher to collect data; they should be well formulated so that they yield accurate data.

Schedules: It is an interview without a questionnaire.

Data analysis

Data analysis is the most important part of any research. Data analysis summarizes collected

data. It involves the interpretation of data gathered through applying statistical and/or logical

techniques to describe and illustrate, condense and recap, and evaluate data.

8. Interpretation and report writing

Interpretation refers to the task of drawing conclusions from the collected facts after an

analytical and/or experimental study. In fact, it is a search for broader meaning of research

findings.

Interpretation has two major aspects

Establish continuity in research through linking the results of a given study with those of

another.

Establishment of some explanatory concepts.

The research report is a major component of the research study; the research task remains incomplete until the report has been presented and/or written. Even the most brilliant hypothesis, the best designed and conducted study, and the most important findings are of little value unless they are effectively communicated to others. This explains the significance of writing the research report.

Different steps in writing report

1. Logical analysis of the subject-matter.

2. Preparation of the final outline.

3. Preparation of the rough draft.

4. Rewriting and polishing.

5. Preparation of the final bibliography.

6. Writing the final draft.

A report is typically made up of three main divisions:

Fig. 2. Summarizing the research process

Chapter (2)

Research Design I - Descriptive Studies


Intended Learning Outcomes:

By the end of this lecture the student should be able to:

1. Define Epidemiology and recognize its major aims

2. Explain the role of descriptive studies in identifying problems and establishing

hypotheses.

3. Explain how the characteristics of person, place, & time are used to formulate hypotheses

in acute disease outbreaks and in studies of chronic diseases.

4. Identify case reports and case series and explain their uses and their limitations.

5. Describe the design features of an ecologic study and discuss their strengths and

weaknesses.

6. Describe the design features of a cross-sectional study and describe their uses, strengths,

and limitations

Content:

1. Importance of clinical epidemiology in research studies.

2. Descriptive epidemiology (Person-Place-Time) studies.

3. Types of descriptive studies.

3.1. Case report

3.2. Case series

3.3. Ecological study

3.4. Cross-sectional study

1. Importance of clinical epidemiology in research studies

Epidemiology

Definition: The study of the distribution and determinants of health-related states and events in specified populations, and the application of this study to the control of diseases and other health problems.

1. By distribution, we mean who gets the disease, when, and where, i.e. Person-Place-Time.

2. By determinants, we mean causes and factors that influence the disease frequency in a

population.

Clinical Epidemiology

Definition: The science dealing with the use of epidemiological data in clinical settings. It usually answers the following questions:

1. Is the patient sick or well? In other words, what is normal and what is

abnormal?

2. What is the cause of the disease? Etiological studies.

3. How to diagnose the disease? What tools can be used to differentiate between distinct

phases or stages of the disease? Diagnostic studies.

4. What is the course of the disease and its complications? Prognostic studies.

5. Is there an effective treatment for that disease? Therapeutic studies.

6. Is there a way to prevent the occurrence of disease in healthy individuals? Preventive

studies.

Epidemiologic studies

Why Conduct Studies?

To describe the burden of disease or the prevalence of risk factors, health behaviors, or other characteristics of a population that influence the risk of disease.

To determine causes or risk factors for illness.

To determine relative effectiveness of interventions.

Fig. 3. Illustrating different types of epidemiological studies

1. Descriptive studies:

Descriptive studies are usually the first step of an epidemiological investigation. They are conducted to describe a certain phenomenon and its relation to a certain exposure, i.e. to generate a hypothesis.

They answer what, who, where, and when.

They include case reports, case series, ecological studies, and cross-sectional studies. Cross-sectional studies are sometimes classified as analytical because they can suggest associations between exposure and outcome.

2. Analytical studies:

2.1. Observational

These studies are used to assess the association between factors of interest and disease in the population, i.e. to test a hypothesis.

They answer why and how.

They include case-control, cohort, and cross-sectional studies.

2.2. Interventional studies (Experimental)

Here the investigator intervenes actively to affect the outcome. The clinical trial is an example, in which the investigator tests a new drug for the treatment of a disease such as hypertension or diabetes. Interventional studies are classified into clinical trials and community studies.

2. Descriptive Epidemiology (Person-Place-Time) studies

These studies characterize the outcome (what) by who, where, and when. Compiling and analyzing data by time, place, and person is desirable for several reasons.

First, by looking at the data carefully, the epidemiologist becomes very familiar with

the data. He or she can see what the data can or cannot reveal based on the variables

available, its limitations (for example, the number of records with missing information

for each important variable), and its eccentricities (for example, all cases range in age

from 2 months to 6 years, plus one 17-year-old.).

Second, the epidemiologist learns the extent and pattern of the public health problem

being investigated — which months, which neighborhoods, and which groups of

people have the most and least cases.

Third, the epidemiologist creates a detailed description of the health of a population

that can be easily communicated with tables, graphs, and maps.

Fourth, the epidemiologist can identify areas or groups within the population that have

high rates of disease. This information in turn provides important clues to the causes of

the disease, and these clues can be turned into testable hypotheses.
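The fourth point can be sketched in a few lines of Python: relate the number of cases in each group to the size of that group, then look for the group with the highest rate. The district names, case counts, and population sizes below are hypothetical, and rate_per_1000 is an illustrative helper, not a function from any library.

```python
def rate_per_1000(cases, population):
    """Number of cases per 1,000 people in a group."""
    return 1000 * cases / population


# Hypothetical surveillance counts by place of residence.
districts = {
    "District A": {"cases": 12, "population": 8000},
    "District B": {"cases": 45, "population": 9000},
    "District C": {"cases": 9, "population": 12000},
}

rates = {
    name: rate_per_1000(d["cases"], d["population"])
    for name, d in districts.items()
}

# The group with the highest rate is a clue for a testable hypothesis.
highest = max(rates, key=rates.get)

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} cases per 1,000")
print("Highest rate:", highest)
```

Comparing rates rather than raw counts matters because the groups differ in size; the same tabulation could be repeated by age group or by month.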

Types of Descriptive Studies

Case report

A case report is a detailed description of the disease occurrence in a single person. Unusual or

newly observed manifestations may suggest a new hypothesis about the causes or mechanism

of disease.

Case series

A case series is a report on the characteristics of a group of patients who all have a particular

disease or condition. Common features among the group give more valid hypotheses about

disease causation. Note that the "series" may be small or large (hundreds or thousands of

cases). However, the chief limitation is that there is no comparison group.

Ecological study

This type of study is concerned with data on groups, not individuals. It is possible to measure

associations between exposures and outcomes in groups and hypotheses generated from such

observation are proposed for more elaborate analytical studies.

e.g. Cancer is more prevalent in high-income countries than in low-income countries.
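A minimal sketch of how a group-level association might be measured in an ecological study is given below. It uses the Pearson correlation coefficient as one possible measure of association; this choice, the function name pearson_r, and the five pairs of country-level values are assumptions made for illustration, not data from any real study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one value per country (group), not per individual.
income_index = [10, 20, 35, 50, 70]
cancer_prevalence_per_100k = [120, 150, 210, 260, 330]

r = pearson_r(income_index, cancer_prevalence_per_100k)
print(f"Group-level correlation: r = {r:.2f}")
```

Because each data point is a whole group, a strong correlation here only generates a hypothesis; it still has to be examined in an analytical study of individuals.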

Cross-sectional study

It assesses the prevalence of disease and the prevalence of risk factors at the same point in time and provides a "snapshot" of diseases and their potential risk factors simultaneously in a defined population.

Person: characteristics (age, sex, socio-economic status) of the affected individuals

Place: characteristics (residence, work, hospital) of the affected individuals

Time: characteristics (secular, seasonal, point, cyclic)

Person characteristics

Age:

Age is the most important factor. Some diseases occur exclusively in one age group, while others predominate in one age group but can occur at any age. Many chronic diseases show a progressive increase with age, due to aging itself or to cumulative exposure to harmful effects.

The causes of morbidity and mortality differ according to the stage of life: during childhood, infectious diseases predominate, especially in unvaccinated populations; teenagers are affected by unintentional injuries, violence, and substance abuse; in young adults, unintentional injuries are the leading cause; and chronic degenerative diseases predominate in the late stages of life.

Sex

In general, morbidities and mortalities from most diseases are higher in males than females.

Certain conditions are more common among males or females due to anatomical and

physiological differences. Variation in sex distribution could be due to:

A- Sex linked inheritance.

B- Hormonal or reproductive factors.

C- Habits, social factors or environmental exposure.

Race and ethnicity

Black Americans are more liable to develop hypertension and its complications than Black Africans. Closed groups (e.g. prisons, camps) may be susceptible to certain diseases. Variations in mortality and morbidity could be due to genetic differences or to differences in culture, socioeconomic status, and availability of medical care.

Marital Status

Married people have lower mortality than single people. Death rates, both from specific diseases and from all causes, rise from lowest to highest in the following order of marital status: married, single, widowed, and divorced.

Socioeconomic Status (SES)

The term usually describes the person's position in society and is often formulated as a

composite measure of three interrelated dimensions: Income, Education and Occupation. SES

affects perception of the disease and the healthcare seeking behavior of the individual.

Place characteristics

Describing the occurrence of disease by place provides insight into the geographic extent of the

problem and its geographic variation. Characterization by place refers not only to place of

residence but to any geographic location relevant to disease occurrence. Such locations include

place of diagnosis or report, birthplace, site of employment, school district, hospital unit, or

recent travel destinations. The unit may be as large as a continent or country or as small as a

street address, hospital wing, or operating room. Sometimes place refers not to a specific location at all but to a place category, such as:

a) International: morbidities and mortalities occur at different rates in different countries. Migrant studies can differentiate between genetic and environmental causes of these differences.

b) National (within country): differences between regions in the same country. Upper Egypt, for example, suffers from a lack of medical and health services compared to urban cities or Lower Egypt. Moreover, there are differences between urban and rural areas in the same region.

c) Local: areas within a city or a village may exhibit different patterns of disease. In a big village, regions close to a swamp (water collections) can be more affected by malaria and other mosquito-borne infections. Slum regions in big cities usually show a high prevalence of nutritional problems and infectious diseases.

John Snow's famous map shows the spread of cholera cases around the Broad Street water pump in London in 1854.

Time characteristics

Some diseases emerge at a certain period of time while others emerge at another time.

a) When does the disease occur commonly or rarely?

b) Does the frequency of the disease at present differ from its frequency in the past?

Time characteristics of a certain disease may range from hours to decades. Short-term changes in

disease incidence are used to study epidemics of infectious or non-infectious diseases.

1. Secular (long-term) pattern. The long-term trend of disease occurrence, usually by years.

2. Seasonal pattern: respiratory infections in winter compared to gastrointestinal infections

in summer.

3. Point (short-term) pattern: epidemics and outbreaks.

4. Cyclic trend: e.g. measles outbreaks occurred every third year before obligatory vaccination in Egypt, and every 7 years in the past two decades.

Fig. 4. Histogram (epidemic curve) in which each case is represented by a square stacked into columns: cases of Salmonella Enteritidis, Chicago, February 13–21, by date and time of symptom onset.
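A histogram of this kind can be built by counting cases per day of symptom onset. The sketch below is a minimal illustration using a hypothetical line listing of ten onset dates; the dates are invented and are not the data behind the figure, and epidemic_curve is an illustrative function name.

```python
from collections import Counter
from datetime import date


def epidemic_curve(onset_dates):
    """Count cases per day of symptom onset, ordered by date."""
    return dict(sorted(Counter(onset_dates).items()))


# Hypothetical line listing: one onset date per case.
onset_dates = [
    date(2022, 2, 13),
    date(2022, 2, 14), date(2022, 2, 14),
    date(2022, 2, 15), date(2022, 2, 15), date(2022, 2, 15), date(2022, 2, 15),
    date(2022, 2, 16), date(2022, 2, 16),
    date(2022, 2, 18),
]

curve = epidemic_curve(onset_dates)

# Text version of the histogram: one mark per case, stacked by day.
for day, cases in curve.items():
    print(day.isoformat(), "#" * cases)

peak_day = max(curve, key=curve.get)
print("Peak day of onset:", peak_day.isoformat())
```

A sharp rise to a single peak followed by a decline, as in this invented example, is the short-term (point) time pattern described above.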

Important facts about cross-sectional study

Cross-sectional studies measure simultaneously the exposure and health outcome in a

given population and in a given geographical area at a certain time.

A cross-sectional study is an observational study.

Often described as a "snapshot" of a population at a certain point in time, because exposure and outcome are determined simultaneously for each subject.

A cross-sectional study is also called a prevalence study.

The temporal relationship between exposure and disease cannot be determined.

Cross-sectional studies can be helpful in determining how many people are affected by a

condition and whether the frequency of the occurrence varies across groups or population

characteristics.

Cross-sectional studies are mostly carried out for public health planning. For example, "Knowledge, attitude and practice (KAP) of family planning methods among women attending the antenatal clinic in area X" is a cross-sectional study.

Cross-sectional Study Design

1. Define the population for study.

2. Determine the presence or absence of exposure and the presence or absence of disease for

each individual enrolled in the study.

For example, we survey a population and, for each study participant, determine at the same time the serum cholesterol (exposure) and evidence of cardiovascular disease (outcome). Each study participant will be in one of the following possible subgroups (a, b, c and d):

a. Persons who have been exposed and have the disease.

b. Persons who have been exposed but do not have the disease.

c. Persons who have the disease but have not been exposed.

d. Persons who have neither been exposed nor have the disease.

In a cross-sectional study we can calculate the prevalence of disease and the prevalence of exposure using the 2 x 2 table:

               Disease    No disease    Total
Exposed           a           b          a+b
Not exposed       c           d          c+d
Total            a+c         b+d

Prevalence of disease in exposed compared to non-exposed: a/(a+b) vs. c/(c+d)

Prevalence of exposure in diseased compared to non-diseased: a/(a+c) vs. b/(b+d)
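The four prevalence measures can be worked through with a short Python sketch. The cell labels a, b, c, and d follow the subgroups defined in the example above; the counts (40, 160, 20, 280) are hypothetical survey results, and cross_sectional_prevalences is an illustrative function name.

```python
def cross_sectional_prevalences(a, b, c, d):
    """Prevalence measures from a 2 x 2 table.

    a: exposed with disease        b: exposed without disease
    c: unexposed with disease      d: unexposed without disease
    """
    total = a + b + c + d
    return {
        "disease_in_exposed": a / (a + b),
        "disease_in_unexposed": c / (c + d),
        "exposure_in_diseased": a / (a + c),
        "exposure_in_nondiseased": b / (b + d),
        "overall_disease": (a + c) / total,
        "overall_exposure": (a + b) / total,
    }


# Hypothetical survey of 500 participants.
result = cross_sectional_prevalences(a=40, b=160, c=20, d=280)

for name, value in result.items():
    print(f"{name}: {value:.1%}")
```

With these invented counts, 40 of the 200 exposed participants have the disease (20%) compared with 20 of the 300 unexposed participants (about 6.7%). Because exposure and disease were measured at the same time, this difference suggests an association but cannot show which came first.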

Advantages of cross-sectional study

1. It is simple, inexpensive and done in a short time.

2. The prevalence rate of disease(s) and exposure(s) can be measured.

3. It is the first step to develop evidence for causal association (generate hypotheses).

4. It is often useful at the time of an epidemic as it helps to determine the extent of the

epidemic in the population.

Disadvantages of cross-sectional study

1. It is not appropriate for studying rare diseases or events of short duration.

2. It does not provide solid evidence for causal association, as the temporal relationship between exposure and disease cannot be confirmed objectively (the chicken-or-egg dilemma).

3. Use of prevalent cases to detect risk factors may result in wrong conclusions, as prevalent cases may differ from incident cases in terms of survival factors (will be discussed in cohort study).

Summary on cross-sectional study steps

1. Defining the population

The first step is, therefore, to define "the population base", not only in terms of total number but also in terms of its composition by age, sex, and other socio-demographic characteristics.

2. Defining the disease or characteristic under study

The epidemiologist must define precisely and accurately the condition being investigated, i.e. give an operational definition: a clear description of the disease or the phenomenon under study in terms of measurable variable(s) in the defined population.

3. Describing the disease or the characteristic and its associates

Person: Age, sex, occupation, education.

Place: Rural vs. urban, Upper vs. Lower Egypt, closeness to a factory or a

water canal.

Time: Year (secular changes over years), season, month, week, day or even

hour of the day.

4. Measurement of disease

In descriptive studies, the disease under study should be ascertained using the proper

diagnostic tools and techniques.

5. Comparing with known indices (Prevalence Rate)

To judge the rate of disease development, one must compare the calculated rates

with previously recorded or estimated ones. We can also identify groups who are at

higher risk of developing the disease.

6. Formulation of a hypothesis or hypotheses

The importance of descriptive studies lies in their use in generating hypotheses about the etiology of health-related conditions. These hypotheses should be subjected to further investigation using more elaborate methods.
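The steps above can be traced in a small Python sketch: apply an operational definition to each surveyed person, describe the condition by person and place, and compare the calculated prevalence with a known index. Everything in it is hypothetical: the eight survey records, the hemoglobin cutoff used as the operational definition, and the reference prevalence are illustrative values chosen for the example, not clinical or published figures.

```python
# Assumed operational definition for the example: the condition is recorded
# when hemoglobin is below this cutoff (illustrative value only).
HB_CUTOFF_G_DL = 11.5

# Hypothetical previously recorded prevalence used as the known index.
REFERENCE_PREVALENCE = 0.30


def has_condition(record):
    """Apply the operational definition to one surveyed person."""
    return record["hb"] < HB_CUTOFF_G_DL


def prevalence(records):
    """Proportion of surveyed persons who meet the operational definition."""
    return sum(has_condition(r) for r in records) / len(records)


def prevalence_by(records, key):
    """Prevalence within each level of a person or place characteristic."""
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r)
    return {level: prevalence(rs) for level, rs in groups.items()}


# Hypothetical survey records from the defined population.
survey = [
    {"sex": "F", "residence": "rural", "hb": 10.8},
    {"sex": "F", "residence": "urban", "hb": 12.4},
    {"sex": "M", "residence": "rural", "hb": 11.0},
    {"sex": "M", "residence": "urban", "hb": 13.1},
    {"sex": "F", "residence": "rural", "hb": 11.2},
    {"sex": "M", "residence": "urban", "hb": 12.0},
    {"sex": "F", "residence": "urban", "hb": 11.4},
    {"sex": "M", "residence": "rural", "hb": 12.6},
]

overall = prevalence(survey)
by_residence = prevalence_by(survey, "residence")
above_reference = overall > REFERENCE_PREVALENCE

print(f"Overall prevalence: {overall:.0%}")
print("By residence:", by_residence)
print("Higher than the reference prevalence:", above_reference)
```

In this invented data set the rural group has a higher prevalence than the urban group, which is the kind of observation that would then be stated as a hypothesis for an analytical study.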

Chapter (3)

Research Design II - Analytical Studies - Cohort and Case Control Studies

Intended Learning Outcomes:

By the end of this chapter the student should be able to:

1. Define and explain the distinguishing features of a cohort study.

2. Identify the risk factors.

3. Determine different types and measurements derived from cohort study.

4. Define and explain the distinguishing features of a case-control study

5. Describe and identify when case control studies are desirable.

6. Estimate and interpret the measures of risk in both designs.

7. Identify the potential strengths and limitations of both designs

Content:

1. Cohort study

(Definition-characteristics-design-types-measurement-advantages-disadvantages).

2. Case control study

(Definition-purpose-characteristics-steps-measurement-advantages-disadvantages)

Analytical epidemiology

Analytical epidemiology is concerned with the search for causes and effects, or the why and the how. Epidemiologists use analytic epidemiology to quantify (measure) the association between exposures and outcomes and to test hypotheses about causal relationships. It has been said that epidemiology by itself can never prove that a particular exposure caused a particular outcome. However, epidemiology often provides sufficient evidence to take appropriate control and prevention measures.

Analytic studies:

Test hypotheses about exposure-outcome relationships.

Measure the association between exposure and outcome.

Include a control group.

What is a risk factor? Does it differ from a disease cause?

A risk factor is an attribute, exposure (physical, chemical, or biological), or behavior that increases the probability that an individual will have a disease. When the risk factor is unpreventable/non-modifiable, such as age, sex, and race, some authors call it a risk attribute. No single risk factor alone is sufficient to cause a disease; it requires the presence of other risk factors.

Component cause and concept of risk factor

Factor (A): present in every sufficient cause, so it is called a necessary factor.

Disease occurs due to the combination of more than one risk factor.


Characteristics of disease etiology

The etiology of any disease is multi-factorial, i.e., the development of a disease needs

the contribution of more than one risk factor.

Each disease can be caused by a number of sufficient causes.

Each sufficient cause is sufficient to produce the disease.

Each sufficient cause consists of a combination of many Risk Factors (component

causes) that work in different combinations or sequence.

Component causes change over time and in different populations.

A risk factor that is present in every sufficient cause is a necessary factor.

1. Cohort study

Definition

A cohort is a well-defined group of individuals who share a common characteristic or experience.

Example: pregnant women with diabetes form a cohort; individuals born in a specific year form a birth

cohort. Other names for a cohort study: longitudinal study or follow-up study.

Characteristics of cohort study

Participants are classified according to exposure status and followed up over time to

ascertain outcome.

Can be used to find multiple outcomes from a single exposure.

Appropriate for rare exposures

Ensures temporality (exposure occurs before observed outcome)

Cohort study design

Etiologic studies (cohort) require at least two groups. One group, the index group, is exposed to the

factor thought to influence occurrence of the study outcome. The other group, the referent or control

group, remains unexposed to provide a reference for comparison.


Types of cohort studies

Prospective

Group participants according to current exposure and follow-up into the future to determine if

outcome occurs.

Retrospective cohort studies

At the time that the study is conducted, potential exposure and outcomes have already occurred

in the past

N.B: “Reconstructive Cohort Study”: (is a combination of both prospective and retrospective

studies) You may assemble a cohort that started at a point of time in the past and continue to

follow the cohort members for a period of time from now to a time-point in the future, e.g., a

cohort of doctors graduated in 1980-1990 is assembled from the medical school records and

followed till 2020 for causes of death.


Measurement in cohort study

A- Absolute risk (Incidence rate)

Table 1: Relation between Smoking and Hypertension

                          Smokers     Nonsmokers    Total population
Hypertension              80 (20%)    30 (5%)       110 (11%)
Free from hypertension    320         570           890
Total                     400         600           1000

Measures of incidence (measure of disease frequency) among exposed and among

non-exposed (cumulative incidence):

1. Incidence of hypertension in smokers =80/400 =20%

2. Incidence of hypertension in non-smokers =30/600 = 5%

3. Incidence of hypertension in the population = 110/1000=11%

B- Risk ratio or relative risk

Measures of association (The relative risk or risk ratio)

1. Relative risk in our example = 20% / 5% = 4.

2. This means that the risk of developing hypertension among smokers is 4 times the risk

among non-smokers.

If the relative risk = 1, the exposure is not associated with disease, in other words,

the exposure is not a risk factor for the disease.

If the relative risk is >1 then the incidence in exposed exceeds that in unexposed and

the exposure is a risk factor for the disease.

Lastly, if the relative risk is <1, this means that the incidence in exposed is less than

in unexposed and the exposure is a "protective factor" rather than a risk factor for the disease; in

other words, absence of this exposure is a risk factor for the disease.

Relative risk is a measure of the strength of the association between exposure and outcome

and indicates etiological relationship between exposure and outcome, i.e., the higher the

relative risk the stronger the etiological association.
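
The calculations above can be reproduced with a short Python sketch. It uses only the counts from Table 1; the function and variable names (`cumulative_incidence`, `relative_risk`, and so on) are chosen here for illustration and are not part of the original notes.

```python
# Counts taken from Table 1 (relation between smoking and hypertension).
exposed_cases, exposed_total = 80, 400        # smokers
unexposed_cases, unexposed_total = 30, 600    # non-smokers


def cumulative_incidence(cases, total):
    """Proportion of a group that developed the outcome during follow-up."""
    return cases / total


incidence_exposed = cumulative_incidence(exposed_cases, exposed_total)
incidence_unexposed = cumulative_incidence(unexposed_cases, unexposed_total)
incidence_population = cumulative_incidence(
    exposed_cases + unexposed_cases, exposed_total + unexposed_total
)

# Relative risk = incidence in exposed / incidence in unexposed.
relative_risk = incidence_exposed / incidence_unexposed

print(f"Incidence in smokers:        {incidence_exposed:.0%}")
print(f"Incidence in non-smokers:    {incidence_unexposed:.0%}")
print(f"Incidence in the population: {incidence_population:.0%}")
print(f"Relative risk:               {relative_risk:.1f}")
```

A relative risk above 1 (as here) points to a risk factor, a value of 1 to no association, and a value below 1 to a protective factor.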


Advantages of cohort studies

1. Incidence rate and Relative Risk can be calculated

2. Temporal relationship between exposure and outcome is preserved

3. Several possible outcomes related to a single exposure can be studied simultaneously

4. No recall bias (see case-control study)

5. Dose-response effect can be studied

6. Suitable for rare exposure

Disadvantages of cohort studies

1. Cohort studies involve a large number of people.

2. It takes a long time to complete the study.

3. Unsuitable for uncommon diseases or diseases with low incidence in the population.

4. Loss of individuals during follow-up may be due to travelling, migration, death or loss of

interest.

5. Expensive in terms of cost and effort.

6. Ethical problems.


2. Case-control study

Definition

A case-control study is an epidemiological study design in which individuals with an event or

condition/disease of interest, cases, are identified and then compared with individuals without the

event or condition of interest, controls, with regard to one or more exposures.

Case-control studies are the most common type of observational analytical studies constituting

about 90% of all epidemiological studies.

Purpose

To study rare diseases

To study multiple exposures that may be related to a single outcome

Study Subjects

Participants selected based on outcome status:

Case-subjects have outcome of interest (cancer).

Control-subjects do not have outcome of interest.

When to conduct a case-control study?

The outcome of interest is rare (e.g., cancer).

The disease or outcome has a long induction and latent period (i.e., a long time between

exposure and the eventual manifestation of disease).

Multiple exposures may be associated with a single outcome.

Funding or time is limited.


Characteristics of case control study

1. Both exposure (risk factor) and outcome (disease) have occurred before the start of the

study. (Exposures are assessed in a retrospective way and that is why case-control studies

are called "retrospective studies".)

2. Being relatively easy and inexpensive, it is commonly the first approach to test causal

relationship hypotheses

Steps to conduct case-control study design

1- Selection of cases (case definition)

It involves diagnostic criteria and eligibility criteria, such as requiring that the case be newly diagnosed

within a specific period of time (an "incident case"). The sources of cases may be hospitals or clinics,

or the population or community, where new cases are reported to health departments, registries,

hospital record departments, etc.

NB: The use of prevalent cases will examine the factors of survival of the disease and not the

risk factors of its etiology.

2- Selection of controls

The controls must be free from the disease under the study.

Control group with condition(s) related to the exposure under study may change the relationship

between that exposure and the disease under study.

If we choose a control group of patients from a chest clinic for cases of lung cancer, we may end up

with no association between smoking and lung cancer. A relatively high proportion of the

controls chosen from chest clinics are likely to be smokers, as they are suffering from other

diseases related to smoking, e.g., gastritis, peptic erosions or ulcer, etc.

On the other hand, if we choose athletes as controls for the same study, an overestimation of the

association between smoking and lung cancer will result, as athletes are more likely to be

non-smokers than the general population.

Selection of controls is one of the most difficult tasks in case-control studies and is a source of

many errors (bias).


3- Matching

Controls are similar to cases with regard to certain selected variables, e.g. age and sex which are

known to influence the outcome of the disease and which if not adequately matched could distort

the results.

The size of the control group should be at least equal to the size of the case group or more but

use of more than 3 controls for each case will not add to the efficiency of the study.

4- Measurement of exposure

Information about Exposure (examples: smoking, dietary intake of fat, exposure to

asbestos, hormonal contraceptive intake) should be obtained in precisely the same

manner both for cases and controls.

As the human memory is very selective, recall errors may occur. Women who had a child

with congenital malformation will have a very good recall of all events that occurred

during pregnancy and delivery compared with the almost complete forgetfulness of

women who had normal babies (recall bias).

Sometimes ascertainment of exposure may be affected by previous knowledge of the data

collector about the disease status of the individual: the interviewer may explore history of

smoking more deeply in cases of bronchogenic carcinoma than healthy controls

(interviewer bias).

It is better to measure exposure with an objective and validated method (biological

marker), but this may not be feasible in many situations.

Measurement in case control study

                Cases      Controls
Smokers         30 (a)     15 (c)
Non-smokers     10 (b)     45 (d)
Total           40         60

The following could be measured from the case control study

Proportion of exposure among cases (smokers among hypertensives) = 30/40 = 75%.

Proportion of exposure among controls (smokers among normotensives) = 15/60 = 25%.


How to measure the strength of the association between smoking and hypertension?

1. Apparently, we cannot use the relative risk as we cannot measure the incidence of

disease among exposed and unexposed.

2. A measure of association can be calculated from the case-control study, called the Odds

ratio (OR). It is the ratio between the Odds of exposure in cases and the Odds of

exposure in controls.

Odds = the ratio between probability of having a characteristic and the probability of not

having that characteristic.

Odds of exposure in cases = probability of exposure/probability of non-exposure

= (30/40) ÷ (10/40) = 30/10

Odds of exposure in controls = probability of exposure/probability of non-exposure

= (15/60) ÷ (45/60) = 15/45

Odds ratio = (30/10) ÷ (15/45) = (30 x 45) / (10 x 15) = 9

Odds ratio is a measure of the strength of the association between the risk factor and

outcome and is an approximation of the relative risk, when prevalence of the disease in

the general population is low and the risk ratio is low.

If the disease prevalence is high, odds ratio will overestimate the relative risk.

From the above example we can conclude that the odds of being a smoker among hypertensives

are 9 times the odds among normotensives.

A simple way to calculate the odds ratio is to arrange the 2 by 2 table so that the upper-left

corner contains the exposed cases, with the cells labeled a, b, c and d as in the table above.

Odds ratio = AD/BC. In the previous example, (30 x 45) / (10 x 15) = 9. This is why it is also called

the cross-product ratio.
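
As a check on the arithmetic, the following Python sketch computes the odds ratio in both ways described above: as a ratio of two odds and as the cross-product. The counts come from the case-control table, and the cell labels follow the notes' own labeling (a and b in the cases column, c and d in the controls column). The variable names are illustrative only.

```python
# Counts from the case-control table (smoking and hypertension),
# using the cell labels given in the notes.
a = 30  # smokers among cases (exposed cases)
b = 10  # non-smokers among cases (unexposed cases)
c = 15  # smokers among controls (exposed controls)
d = 45  # non-smokers among controls (unexposed controls)

# Proportion of exposure in each group.
exposure_in_cases = a / (a + b)
exposure_in_controls = c / (c + d)

# Odds of exposure = probability of exposure / probability of non-exposure,
# which simplifies to exposed / unexposed within each group.
odds_cases = a / b
odds_controls = c / d

odds_ratio = odds_cases / odds_controls
cross_product_ratio = (a * d) / (b * c)

print(f"Exposure among cases:    {exposure_in_cases:.0%}")
print(f"Exposure among controls: {exposure_in_controls:.0%}")
print(f"Odds ratio:              {odds_ratio:.1f}")
print(f"Cross-product ratio:     {cross_product_ratio:.1f}")
```

Both routes give the same value, which is why the shortcut AD/BC can be used directly on the 2 by 2 table.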


Advantages of case control study

1. Relatively easy to carry out

2. Rapid and inexpensive (compared with cohort study)

3. Particularly suitable to investigate rare diseases

4. No ethical problem

5. Allows the study of several etiological factors for a single disease, e.g., smoking, physical

activity in myocardial infarction

6. No attrition problems (Loss of individuals during follow-up) because case control studies

do not require follow-up of individuals into the future.

Disadvantages of case control study

1. Recall bias, because exposure assessment relies on memory or past records.

2. Selection of an appropriate control group may be difficult.

3. We cannot measure incidence rates and so relative risk cannot be calculated.

4. Odds ratio is an estimate of the relative risk only with diseases of low prevalence.

5. "Egg or chicken problem": sometimes it is difficult to ascertain which comes first, the etiologic

factor or the disease, especially for non-incident cases (e.g., physical activity and obesity: which

one comes first?).


Activity


Chapter (4)

Applied Intervention Studies (clinical trial)

Intended Learning Outcomes:



1. Understand different phases of a clinical trial.

2. Identify the concept of randomization, blinding and the different types of blinding

3. Identify some ethical considerations while conducting clinical trials

4. Calculate measures of treatment effects in clinical trials

Content:

1. Definition of clinical trial.

2. Objectives of clinical trials.

3. Phases of clinical trial.

4. Types of clinical trial

5. Steps of carrying out clinical trial.

6. Ethical issues


Clinical Trials

1. Definition

It is one of the interventional studies. It is a prospective study in human beings that assesses the effect

of one or more (therapeutic) interventions in a group of patients against a control.

A controlled clinical trial compares the outcomes of a treated group with a comparable group of

patients receiving the control treatment. The intervention being tested is often a drug treatment

but may also be a non-drug treatment such as surgery.

2. Objectives of clinical trial

Discovering new treatments for life threatening diseases.

Discovering new ways to detect, diagnose, and reduce the risk of disease.

Help researchers and physicians to decide if the benefits of the new treatments outweigh

the side effects.

To overcome the problem of drug resistance.

3. Phases of clinical trial

Phase I: Pharmacology and toxicology

First stage of testing in human beings.

Fewer than 30 healthy volunteers are involved in the clinical trial.

Duration of clinical trial: 6-12 months.

The researcher follows up the safety, tolerability, absorption, distribution, metabolism,

and excretion of the tested drug in the study group.

Aim of phase I

To determine the maximum tolerated dose (MTD) of the new drug.

Phase II: Initial investigation of treatment effect

It is a therapeutic exploratory Trial

It starts after the completion of phase I and detection of the MTD

Fewer than 100 patients are involved in the clinical trial.

Duration of clinical trial: 6 months to several years.


Aim of phase II

To determine efficacy and safety of tested drug.

To determine the optimum dose (dose-efficacy relationship, therapeutic dose regimen,

duration of therapy, frequency of administration, therapeutic window).

Phase III: Clinical evaluation of treatment

It is a therapeutic confirmatory trial.

It starts after the completion of phase I and phase II

From hundreds to 3000 patients are involved in the clinical trial.

Duration of clinical trial: Takes a long time, up to 5 years.

Aim of phase III

To compare the efficacy of the tested drug against existing therapy in larger number of

patients

To assess overall and relative therapeutic value of the new drug (Efficacy and Safety).

Phase IV: Post Marketing Surveillance (PMS)

Starts after the end of clinical trial activities (Phases I-II-III) and the approval of the drug

by the U.S. FDA.

No fixed duration / patient population.

Aim of phase IV

Detect rare and long-term adverse drug reactions and drug interactions during use by

patients.

Explore new uses of drugs.

4. Types of clinical trial

Preventive: look for better ways to prevent a disease in people who have never had the disease

or to prevent the disease from returning. Approaches may include medicines, vaccines, or

lifestyle changes.

Screening: test new ways for detecting diseases or health conditions.


Diagnostic: study or compare tests or procedures for diagnosing a particular disease or

condition.

Treatment: test new treatments, new combinations of drugs, or new approaches to surgery or

radiation therapy.

Behavioral: evaluate or compare ways to promote behavioral changes designed to improve

health.

Quality of life (or supportive care trials): explore and measure ways to improve the comfort

and quality of life of people with conditions or illnesses.

Types of clinical trials (in relation to comparison groups)

1. One Arm clinical trial: One group of patients will receive the treatment, without

control. We will assess the effect of the treatment by comparing the state of the

participants before and after the new treatment.

2. Two arms clinical trial: This is the classical clinical trial. It is also called controlled

clinical trial. One group will receive the new treatment; meanwhile the other group will

receive the old treatment or the placebo.

Placebo

Placebo is an inert compound randomly allocated to subjects in a clinical trial.

Placebo arm is a true control for an intervention. It is used to assess the relative effect of the

intervention (relative risk) and to assess the risk of adverse events.

Placebo arms are not ethical if there is an established standard treatment/management.

5. Steps of carrying out clinical trial

5.1. The protocol

Clinical trials follow a plan known as a protocol. The protocol is carefully designed to define the

benefits and risks to participants and answer specific research questions. A protocol describes the

following:

The goal of the study.

Who is eligible to take part in the trial.

Protections against risks to participants.


Information given to participants about tests, procedures, and treatments.

How long the trial is expected to last.

What information will be gathered.

5.2. Selection of study groups

Researchers follow clinical trial guidelines when deciding who can participate in a study.

Factors that allow a person to take part in a clinical trial are called "inclusion criteria." Those that

exclude or prevent participation are "exclusion criteria." These criteria are based on factors such

as age, gender, the type and stage of a disease, treatment history, and other medical conditions.

Randomization is a statistical procedure by which the participants are allocated at random into two similar

groups, usually called the "study" and "control" groups, to receive or not to receive a new preventive

or therapeutic intervention. It is done to allow comparability between both groups. Thus, any

observed differences in outcome are likely to result from differences in treatment effect.

Randomization is an attempt to eliminate "selection bias" and allow for fair comparison.
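
One simple way to carry out such an allocation is sketched below in Python. This is only an illustration of simple randomization into two equal-sized groups. Real trials often use other schemes (for example block or stratified randomization) that the notes do not cover. The function name `allocate_randomly`, the participant IDs and the seed value are assumptions made for this example.

```python
import random


def allocate_randomly(participant_ids, seed=None):
    """Shuffle the participants and split them into two groups.

    The first half of the shuffled list goes to the study group and
    the rest to the control group (simple randomization).
    """
    rng = random.Random(seed)          # a seed makes the allocation reproducible
    shuffled = list(participant_ids)   # copy, so the input is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"study": shuffled[:half], "control": shuffled[half:]}


# Hypothetical example: 20 participants numbered 1 to 20.
groups = allocate_randomly(range(1, 21), seed=2022)
print("Study group:  ", sorted(groups["study"]))
print("Control group:", sorted(groups["control"]))
```

Because chance alone decides each participant's group, the investigator cannot steer particular patients toward the new treatment, which is how randomization limits selection bias.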

5.3. Blinding

Importance

▪ Blinding is used to prevent conscious or unconscious bias in the design of a clinical trial

and how it is carried out.

▪ It is used to ensure the objectivity of trial results.

Types of blinding

a. Single blinded trial: The trial is planned so that the participant is not aware whether

he/she belongs to the study or control group.

b. Double blinded trial: The trial is planned so that neither the doctor nor the participant is

aware of the group allocation and the treatment received.

c. Triple blinded trial: This goes one step further. The participant, the investigator, and the

person judging the outcome or the person analyzing the data are all unaware ("blind").

NB: The two drugs should be identical in shape, color, taste and the container (if possible).


NB: Unblinded trials are only done under certain conditions, such as surgical procedures, where

blinding is impossible, or where blinding is not ethically permitted.

Types of bias

a. Participant bias: participants may subjectively feel better or report improvement if they know

that they are receiving a new form of treatment.

b. Observer bias: when measuring the outcome of a therapeutic trial, the investigator may be

influenced if he or she knows in advance the particular therapy that the patient has received.

c. Evaluation bias: when the data analyst subconsciously gives a report of the outcome of

the trial in favor of the new or old drug.

5.4. Assessment

The final step in a clinical trial is assessment in terms of positive results, such as reduction in incidence

rate or severity of the disease or increase in survival time, or negative results, such as adverse events,

among treated and control groups.

Relative risk (measures the reduced risk of developing the disease after receiving the

treatment) = Incidence rate in treatment group / Incidence rate in placebo or control group.

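Number needed to treat (NNT) (the number of patients who need to be treated with the

new treatment to obtain one favorable outcome) = 1 / Absolute Risk Reduction, where Absolute Risk

Reduction = Incidence rate in placebo or control group - Incidence rate in treatment group.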
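
The two measures can be illustrated with a small Python sketch. The trial counts below are hypothetical and were invented for this example, as were the function and variable names. The sketch assumes that the treatment lowers the incidence of the outcome, so that the absolute risk reduction is positive, and it rounds the NNT up to the next whole patient, which is a common convention not stated in the notes.

```python
import math
from fractions import Fraction


def treatment_effect(events_treated, n_treated, events_control, n_control):
    """Return (relative risk, absolute risk reduction, NNT) for a two-arm trial.

    Assumes the incidence is lower in the treated group (ARR > 0).
    Fractions are used to avoid floating-point rounding surprises.
    """
    risk_treated = Fraction(events_treated, n_treated)
    risk_control = Fraction(events_control, n_control)
    relative_risk = risk_treated / risk_control
    absolute_risk_reduction = risk_control - risk_treated
    nnt = math.ceil(1 / absolute_risk_reduction)  # round up to whole patients
    return relative_risk, absolute_risk_reduction, nnt


# Hypothetical trial: 10 of 100 treated patients and 20 of 100 control
# patients developed the disease.
rr, arr, nnt = treatment_effect(10, 100, 20, 100)
print(f"Relative risk:           {float(rr):.2f}")
print(f"Absolute risk reduction: {float(arr):.0%}")
print(f"Number needed to treat:  {nnt}")
```

In this hypothetical trial the treatment halves the risk, and ten patients would need to be treated to prevent one case.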

6. Ethical issues

Stopping rules: the trial is stopped if severe and unexpected side effects or complications occur, or when

the benefit from the intervention becomes evident and undeniable.

Standard care protocol: Should be applied to all participants in both groups.

Informed consent: Should be read, agreed upon and signed by each participant.


Fig. 5. Clinical trial flow chart (steps of a clinical trial)


Fig.6. Summary of chapter 4


Chapter (5)

Protocol Writing

After identifying and defining the research problem, the researcher must arrange his or her ideas in order

and write them in the form of an experimental plan, or what can be described as a "Research

Protocol". This is essential, especially for a new researcher, because of the following:

(a) It helps the researcher to organize his or her ideas in a form that makes it possible to look for flaws and

inadequacies, if any.

(b) It provides a list of what must be done and which materials have to be collected as

a preliminary step.

(c) It is a document that can be given to others for comment.

Research protocol must contain the following items

1. Research objective should be clearly stated in a line or two which tells exactly what the

researcher expects to do.

2. The problem to be studied by researcher must be clearly stated so that one may know

what information is to be obtained for solving the problem.

3. Each major concept which researcher wants to measure should be defined in operational

terms in context of the research project.

4. The protocol should contain the method to be used in solving the problem. An overall

description of the approach to be adopted is usually given, and assumptions, if any,

concerning the method to be used are clearly mentioned in the research protocol.

5. The protocol must also state the details of the techniques to be adopted. For instance, if

interview method is to be used, an account of the nature of the contemplated interview

procedure should be given. Similarly, if tests are to be given, the conditions under which

they are to be administered should be specified along with the nature of instruments to be

used. If public records are to be consulted as sources of data, the fact should be recorded

in the research protocol. Procedure for quantifying data should also be written out in all

details.


6. A clear mention of the population to be studied should be made. If the study happens to

be sample based, the research protocol should state the sampling plan i.e., how the

sample is to be identified. The method of identifying the sample should be such that

generalization from the sample to the original population is feasible.

7. The protocol must also contain the methods to be used in processing the data. Statistical

and other methods to be used must be indicated in the protocol. Such methods should not

be left until the data have been collected. This part of the protocol may be reviewed by

experts in the field, for they can often suggest changes that result in substantial saving of

time and effort.

8. Results of pilot test, if any, should be reported. Time and cost budgets for the research

project should also be prepared and laid down in the protocol itself.


Chapter (6)

Source of Data and Types of Variables

Intended Learning Outcomes:



1. Identify the sources of data.

2. Define variables.

3. Differentiate between a concept and a variable.

4. Identify different types of variables.

5. Recognize the differences between coding, scaling, and scoring.

Content:

I. Sources of data

1. Census.

Definition.

Importance.

2. Registration of births and deaths.

3. Notification.

4. Hospital records.

5. Other health records.

II. Variables

1. Definition of variables.

2. Difference between concept and variable.

3. Types of variables.

4. Coding, scaling, and scoring.


I. Sources of data

The main sources of data are:

1. Census.

2. Registration of births and deaths.

3. Notification of diseases.

4. Hospital records.

Inpatients and outpatients.

5. Other health records.

Mother and child health centers.

Records of school health services.

Records of occupational health units, hospitals, etc.

1. Census

Definition

Census is defined as the instantaneous enumeration or counting of a population at a specified time.

A census is taken in most of the world at a regular interval, usually every 10 years.

In Egypt, the last census was carried out in 2018 and the total population was 98.2 million.

Importance

1. Estimates the total size of the population.

2. Provides features of the population regarding age, sex distribution, occupation,

socioeconomic classes, etc.

3. Provides the necessary denominator for calculating vital statistics such as birth and

death rates.

4. Important in strategic planning.

2. Registration of births and deaths

Births

Registration of births is compulsory in most countries. In Egypt, births are to be notified within

10 days of occurrence. Further, before admission of a child to school, production of a birth

certificate is mandatory. In developed countries, birth certificates contain a lot of information

useful to the epidemiologist, such as birth weight, congenital malformations, complications during

the mother's pregnancy, and blood group. The more information recorded, the greater its usefulness.

Deaths

Deaths are to be notified in Egypt within 24 hours. The cause of each death has to be medically

certified. The death certificate is the foundation of modern epidemiology. Death

certificates also tell us about the frequency and distribution of many diseases.

The cause of death and the age at death are the most important items in this certificate; they have to be

correctly recorded for national and international comparison.

The internationally agreed form of death certificate, known as the "international death certificate",

is recommended by the WHO.

3. Notification

Notification was first introduced for the control of infectious diseases. It is a valuable source of

information regarding the incidence of certain specific diseases in the community. Lists of

notifiable diseases vary from country to country. Usually diseases which are considered to be

serious menaces to public health are included in the list of notifiable diseases; this list can be

found in the statistical report of the Ministry of Health.

Notification has the following limitations

1. It covers only a small part of the total sickness in the community.

2. Many cases (atypical cases, subclinical cases) escape from notification.

3. Not uniform throughout the world.

Despite the above limitations, notification provides valuable information about disease frequency

and distribution. It also provides early warning of epidemics.

4. Hospital records

They are a basic and primary source of information about diseases prevalent in the community.

The main disadvantages of hospital records

1. They are highly selective (i.e., mild cases may not go to the hospital).

2. The population served by the hospital (population at risk) cannot be defined. That is,

hospital statistics provide only the numerator, not the denominator.


5. Other health records

A lot of information is also found in the records of mother and child health centers, school health

services, occupational health services, etc. Certain diseases are recorded in many countries

where they are common (viz. leprosy, cancer, T.B).

II. Variables

1. Definition

A variable can be defined as qualities, properties, characteristics of persons, things, or situations

that change or vary, and that can be measured in a research study. A variable is a property that

takes on different values.

It is also defined as any characteristic, number, or quantity that can be measured or counted. A

variable may also be called a data item.

2. Difference between a concept and a variable

Data related to concepts are subjective, while data related to variables are objective.

Concepts cannot be measured, whereas variables can be measured (very

important difference).

The understanding of a concept differs among people, whereas a variable is specific.

Examples of concepts are effectiveness, satisfaction and sadness, while examples of variables are age,

height and weight.

N.B. Concepts can be converted to variables so that they can be analyzed.

Example illustrating the change of concept to variable:

1. Concept: Rich/poor.

2. Indicator: Income/value to assess.

3. Change to variable: Total income per year/ total values of cars and homes.

4. Measure: Consider a person rich when income is more than X per year and poor when it is less than X per

year.
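
The conversion of the concept "rich/poor" into a measurable variable can be sketched in Python as follows. The threshold value stands in for "X" and is purely hypothetical, as are the names `INCOME_THRESHOLD` and `classify_wealth`. The notes do not say how an income of exactly X is classified, so this sketch assumes it counts as "poor".

```python
# Hypothetical value standing in for "X" (income per year); illustration only.
INCOME_THRESHOLD = 100_000


def classify_wealth(annual_income, threshold=INCOME_THRESHOLD):
    """Turn the measured variable (annual income) into the categories rich/poor.

    Assumption: an income exactly equal to the threshold is classified as poor.
    """
    return "rich" if annual_income > threshold else "poor"


# Hypothetical incomes of three people.
for income in (40_000, 100_000, 250_000):
    print(income, "->", classify_wealth(income))
```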


3. Types of variables

Variables can be classified in different ways:

3.1. Qualitative and quantitative (according to measurement scale).

3.2. Dependent, independent, and extraneous (according to causal relationship).

Fig. 7. Illustrates different types of variable

3.1. Qualitative and quantitative

3.1.1. Qualitative or Categorical variables

Definition

It is a non-numerical value. It has values that describe a 'quality' or 'characteristic' of a data unit,

like 'what type' or 'which category'. When asked about the blood group, there are four possible

appropriate mutually exclusive answers A, B, AB and O and the individual will choose the one

that applies to him. Mutually exclusive = cannot occur together.

Categorical variables can be classified into:

A. Ordinal variable: Observations can take a value that can be logically ordered or ranked. The

categories associated with ordinal variables can be ranked higher or lower than another, but do

not necessarily establish a numeric difference between each category. Examples of ordinal

categorical variables include academic grades (e.g., A, B, C), clothing size (e.g., small, medium,

large, extra-large) and attitudes (e.g., strongly agree, agree, disagree, strongly disagree).

B. Nominal variable: Observations can take a value that cannot be organized in a logical

sequence. Examples of nominal categorical variables include sex, business type, eye color, and

religion.

3.1.2. Quantitative or numeric variables

Definition

They have values that describe a measurable quantity as a number, like 'how many' or 'how

much'. Therefore, numeric variables are quantitative variables. Examples include the number of

children per family, number of molar teeth in the mouth, number of beds in a hospital, number of

fingers per hand and RBCs count.

Numeric variables can be classified into:

A. Continuous variables: Observations can take any value within a range, including fractions (measurements). Examples include

temperature, systolic blood pressure, age, height, and fasting blood sugar.

B. Discrete variables: Observations can take a value based on a count of the values. A discrete

variable cannot take the value of a fraction between one value and the next closest

value. Examples of discrete variables include the number of registered cars, number of factories

in certain locations, and number of children in a family.

3.2. Independent, dependent, and extraneous

In research terminology, change variables are called independent variables, outcome/effect

variables are called dependent variables, the unmeasured variables are called extraneous

variables.

A. Independent (Cause): the cause/risk factor supposed to be responsible for bringing change(s)

in an outcome.

B. Dependent (Effect/ outcome): The effect brought by the independent variable.


C. Extraneous (Confounding): A variable that is associated with both the problem and the

possible cause of the problem. It may either strengthen or weaken the apparent relationship

between an outcome and possible cause.

Example: Age and height relationship. The independent variable is the age, while the dependent

variable is the height.

In a survey to study the relationship between mothers' cigarette smoking and the weight of their

newborns, the independent variable is the mother's smoking habit, while the dependent variable

is the newborn weight. Other extraneous variables may be the number of cigarettes smoked,

diet, age, exercise, etc. All the variables that might affect this relationship either positively or

negatively are extraneous variables.

4. Coding, scaling, and scoring

Coding process

Coding is important in statistical analysis because computer statistical programs deal better with numbers (quantitative data). For qualitative data, a process of coding should therefore be done. Coding = giving numeric codes to the different categories of the variable.

Example: Gender: Male = 1 and Female = 2.
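The coding step can be sketched in a few lines of Python. This is only an illustration: the Male = 1, Female = 2 scheme comes from the text, while the sample responses and the names `GENDER_CODES` and `encode` are hypothetical.

```python
# Numeric codes for the categories of a qualitative variable (scheme from the text).
GENDER_CODES = {"Male": 1, "Female": 2}


def encode(values, codebook):
    """Replace each category label with its numeric code."""
    return [codebook[v] for v in values]


# Hypothetical responses, for illustration only.
responses = ["Male", "Female", "Female", "Male"]
coded = encode(responses, GENDER_CODES)
print(coded)  # [1, 2, 2, 1]
```

The codes are labels only: for a nominal variable such as gender, arithmetic on the codes (for example, averaging them) has no meaning.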

Scaling

Likert Scale

A Likert Scale is a type of rating scale used to measure attitudes, preferences, and subjective

reactions.

Example: Family planning is a good practice: Strongly disagree = 1, Disagree = 2, Neither

agree nor disagree = 3, Agree = 4, Strongly agree = 5.

Four to seven items are usually used in the scale. Dozens of variations are possible on themes

like agreement, frequency, quality and importance for example:

- Agreement: Strongly agree to strongly disagree.

- Frequency: Often to never.

- Quality: Very good to very bad.

- Importance: Very important to unimportant.


A Visual Analogue Scale (VAS)

A measurement instrument that measures a characteristic or attitude that ranges across a continuum of values and cannot easily be measured directly. It is often used in epidemiologic and clinical research to measure the intensity or frequency of various symptoms. For example, the degree of pain that a patient feels ranges from none to severe pain.

The VAS can be dealt with as a continuous quantitative variable or it can be coded into no, mild,

moderate…etc., i.e., an ordinal qualitative variable.

Scoring

Item responses may be summed to create a score for a group of items.

Example

The Patient Satisfaction Questionnaire is usually filled in after the client has received the medical service and is used to evaluate the quality of the health services provided by the institution. The questionnaire includes a list of questions in the form of a Likert scale. The summed score of all the questions reflects the level of client satisfaction.
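As a rough sketch of scoring, the snippet below sums the Likert codes of one respondent. The code values (1 to 5) follow the agreement scale given earlier in this chapter; the four answers and the names `LIKERT_CODES` and `total_score` are hypothetical.

```python
# Likert codes as given in the text: 1 = strongly disagree ... 5 = strongly agree.
LIKERT_CODES = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neither agree nor disagree": 3,
    "Agree": 4,
    "Strongly agree": 5,
}


def total_score(answers):
    """Sum the numeric codes of one respondent's answers."""
    return sum(LIKERT_CODES[a] for a in answers)


# Hypothetical answers of one client to a four-item questionnaire.
answers = ["Agree", "Strongly agree", "Neither agree nor disagree", "Agree"]
score = total_score(answers)
print(score)  # 4 + 5 + 3 + 4 = 16
```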


Chapter (7)

Data Presentation



Intended Learning Outcomes:

By the end of this chapter student should be able to:

1. Determine the different means of data presentation.

2. Select the suitable means of data presentation for each variable type.

3. Construct proper tables and graphs.

Content:

1. Studying proper tables and graphs.

2. Creating intervals in the tables.

3. Calculating percentages from row and/or column totals.


Data Presentation

The large volume of data collected during research must be organized and presented in a suitable format to provide the information needed for decision-making. Three formats of data presentation are available: tables, graphs, and numbers.

1. Tables

It is a suitable method for data presentation.

Aim: to arrange the data in simple, concise and readable form.

Characteristics:

1. The table should be self-explanatory.

2. Can be used for quantitative and qualitative data.

3. The heading and the different columns should be clearly defined, with units of measurement.

4. The totals of the columns and/or rows should be calculated.

5. The length of the table should be suitable.

6. Any explanation of the table should be placed under it as a footnote.

7. Class intervals should, as far as possible, be of the same width, except for intervals containing zeros.

Types of tables

A. Frequency distribution table

A frequency distribution table consists of two columns: the first lists the classes of the classifying variable (which may be a categorical variable or the categories of a continuous variable) and the second gives the number of observations belonging to each class. A third column may contain the percentages.

Table 2: Frequency distribution of social class among women

Social class            Frequency (No.)      %
Low social class              10            40.0
Middle social class            6            24.0
High social class              9            36.0
Total                         25           100.0

It is clear from Table 2 that the largest group of these women (40%) belongs to the low social class.
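A minimal sketch of how the percentage column of a frequency distribution table is obtained, using the social class counts from Table 2. The variable names are illustrative only.

```python
# Counts from the frequency distribution of social class (25 women in total).
counts = {
    "Low social class": 10,
    "Middle social class": 6,
    "High social class": 9,
}

total = sum(counts.values())
# Percentage of each class = class count / total count x 100.
percentages = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label:<22}{n:>4}{percentages[label]:>8.1f}")
print(f"{'Total':<22}{total:>4}{100.0:>8.1f}")
```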


B. Contingency table

It is used to explain the relationship between two categorical variables. Table 3 presents a cross-tabulation of two categorical variables. Women are classified according to socioeconomic class into three categories and according to the presence of anemia into two categories. The results show that 8 women were anemic and belonged to the low social class; they represent 80% of the women of the low social class. Similarly, 33.3% of the women of the high social class were not anemic.

Table 3: Frequency distribution of anemia in women according to social classes

Social class            Anemia        No anemia      Total
                        no. (%)       no. (%)        no. (%)
Low social class         8 (80.0)      2 (20.0)      10 (40.0)
Middle social class      3 (50.0)      3 (50.0)       6 (24.0)
High social class        6 (66.7)      3 (33.3)       9 (36.0)
Total                   17 (68.0)      8 (32.0)      25 (100.0)

Cross-tabulation can be done for more than two variables. The relationship between social class and anemia can be examined separately in rural and urban regions, giving a so-called three-way contingency table. A four-way contingency table can be constructed between socioeconomic status, the presence of anemia, urban-rural residency, and Upper-Lower Egypt residency, and so on.

Table 4 presents a cross-tabulation between two variables: age, transformed into an ordinal qualitative variable by grouping into three groups, and the presence of anemia, a binary variable.

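Table 4: Age distribution of women with or without anemia

Age groups in years     Anemia           No anemia        Total
                        No.     %        No.     %        No.     %
25-                      2     11.8       6     75.0       8     32.0
30-                     10     58.8       1     12.5      11     44.0
35-49                    5     29.4       1     12.5       6     24.0
Total                   17     68.0       8     32.0      25    100.0

Percentages for the age groups are taken from the column totals; the percentages in the Total row are taken from the overall total (25 women).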

Creating the intervals

Uses: To change a quantitative variable into groups.

Intervals should, as far as possible, be of the same width, except for intervals containing zeros. 5-12 intervals should be enough in most cases.

Each interval may have an open end, e.g. in Table 4, "25-" means women whose age is from 25 years up to any age below 30, the beginning of the next interval. Alternatively, the interval may have an open beginning, i.e. "-29" = age from the end of the previous interval up to 29 years.
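As an illustration of creating intervals, the sketch below assigns an age to one of the age groups used in Table 4 (25-, 30-, 35-49). The function name `age_group` and the sample ages are hypothetical.

```python
def age_group(age):
    """Return the class interval that contains `age` (intervals as in Table 4)."""
    if 25 <= age < 30:      # "25-" means 25 up to any age below 30
        return "25-"
    if 30 <= age < 35:      # "30-" means 30 up to any age below 35
        return "30-"
    if 35 <= age <= 49:     # last interval is closed at both ends
        return "35-49"
    raise ValueError(f"age {age} is outside the tabulated range")


# Hypothetical ages, chosen to sit on the interval boundaries.
ages = [25, 29, 30, 34, 35, 49]
groups = [age_group(a) for a in ages]
print(groups)  # ['25-', '25-', '30-', '30-', '35-49', '35-49']
```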

Percentages

Percentages can be calculated from the row total or the column total, with different meanings. In Table 4, the percentages are taken from the column totals, so we can say that about 59% of women with anemia are aged between 30 and 34. In Table 3, 80% of women from the low social class had anemia; the percentage was taken from the row total. However, a percentage in a certain direction may have no meaning at all, depending on the design of the study.
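The difference between row and column percentages can be sketched with the counts of anemia by social class (8/2, 3/3 and 6/3). The variable names are illustrative, and the column percentages are computed here only for comparison.

```python
# Counts per social class: [anemia, no anemia].
table = {
    "Low": [8, 2],
    "Middle": [3, 3],
    "High": [6, 3],
}

# Row percentages: each cell divided by its row total.
row_pct = {
    label: [round(100 * c / sum(row), 1) for c in row]
    for label, row in table.items()
}

# Column percentages: each cell divided by its column total.
col_totals = [sum(row[j] for row in table.values()) for j in range(2)]
col_pct = {
    label: [round(100 * c / col_totals[j], 1) for j, c in enumerate(row)]
    for label, row in table.items()
}

print(row_pct["Low"])  # [80.0, 20.0]: 80% of low social class women are anemic
print(col_pct["Low"])  # [47.1, 25.0]: 47.1% of anemic women are of low social class
```

The same cell (8 women) gives 80% or 47.1% depending on the direction, which is why the direction must match the question being asked.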

2. Graphs

Aim: Graphs are more capable of gaining attention, stressing a certain phenomenon and

giving a quick idea about the general situation.

Characteristics: Graphs should be accurate, simple, clear and well designed.

Types of graphs: According to the type of variable, graphs are classified into:

A. Categorical data:

1. Bar chart: the frequencies or the relative frequencies of the different groups are

represented by rectangles of the same width and based on the x-axis. The heights of these

rectangles measured on the y-axis are proportional to the relative frequencies of the groups.

Types of bar charts:

Simple bar chart (relative frequency of one categorical variable).

Fig. 8. Simple bar chart illustrating the causes of renal failure in hemodialysis patients

[Figure: x-axis: causes of renal failure; y-axis: 0% to 50%. Bars: Hypertension 44%, DM 38%, Infection 14%, Others 4%.]

Composite or compound bar chart (cross-tabulation of two categorical variables).

Fig. 9. Composite bar chart illustrating the relation between education level and residence

[Figure: x-axis: level of education in a rural and urban sample of working women; y-axis: 0 to 70. Rural bars: Never at school 60, Primary 20, Preparatory 10, Secondary 8, University 2. Urban bars: Never at school 35, Primary 25, Preparatory 10, Secondary 15, University 15.]

Stacked bar chart (either a single variable or cross-tabulation of two or more variables).

Fig. 10. Stacked bar chart illustrating levels of education in urban and rural women

[Figure: one bar each for Rural and Urban; y-axis: 0% to 100%; level of education in a rural and urban sample of working women. Rural: Never at school 60, Primary 20, Preparatory 10, Secondary 8, University 2. Urban: Never at school 35, Primary 25, Preparatory 10, Secondary 15, University 15.]

In the stacked bar chart, we have one column for each category of one categorical variable, representing 100%, which is then divided into portions according to the categories of the other categorical variable.

2. Pie chart: a circle whose area represents the total frequency and which is subdivided into segments representing the different categories proportionally.

Fig. 11. Pie graph illustrating prevalence of different eye diseases

B. Quantitative data

1. Histogram: Area representation of the relative frequency of a variable using rectangles adjacent to each other; the width of each rectangle = the width of the corresponding interval (Fig. 12). The area of each rectangle represents the relative frequency in that interval.

2. Frequency polygon: a line connecting the midpoints of the tops of the rectangles of the histogram (Fig. 12).

Fig. 12. Histogram and frequency polygon for distribution of student

3. Other graphical presentations: such as the ogive and the scatter diagram (discussed later).

Fig. 13. Ogive chart. Fig. 14. Scatter diagram.

Fig. 15. Summary of the different types of data presentation

Presentation of data
    Tables
        Frequency distribution table
        Contingency table (cross-tabulation)
    Graphs
        Categorical data
            Bar chart: simple, composite, stacked
            Pie chart
        Quantitative data
            Histogram
            Frequency polygon
            Others: ogive, scatter diagram

Chapter (8)

Descriptive Statistics



Intended Learning Outcomes:

By the end of this chapter student should be able to:

1. Identify qualitative and quantitative variables.

2. Calculate and interpret measures of central tendency.

3. Calculate and interpret measures of dispersion.

4. Differentiate between measures of central tendency and measures of dispersion.

Content:

1. Study qualitative/categorical variables.

(Count data, proportion, ratio, rate).

2. Study quantitative variables.

2.1. Measurement of central tendency.

(Mid-range, mode, median, arithmetic mean).

2.2. Measurement of dispersion.

(Range, deviation from the mean, variance, standard deviation, coefficient of

variation).


1. Qualitative variables

1. Count data

They are counts of occurrences in terms of time or space. Examples:

- Number of males with HIV infection.

- Number of microscopes in a bacteriological lab.

- Number of patients cured.

2. Proportion

It is a fraction in which the numerator is included in the denominator.

Proportion of males in a class = number of males / (number of males + number of females) = A/(A+B).

Example: There are 50 anemic females in a sample of 150 females above 40 years of age. What is the proportion of anemia in this sample? Proportion = 50/150 = 33.3%.

3. Ratio

It is a fraction in which the numerator is not included in the denominator.

The numerator and denominator of a ratio can be related or unrelated. For example, a ratio can be used to compare the number of males in a population with the number of females.

In epidemiology, ratios are used as both descriptive measures and analytic tools. As a descriptive measure, a ratio can describe the male-to-female ratio of participants in a study. As an analytic tool, ratios can be calculated for the occurrence of illness, injury, or death between two groups. These ratio measures include the risk ratio (relative risk).
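The contrast between a proportion and a ratio can be sketched with the anemia example (50 anemic women in a sample of 150). The ratio of anemic to non-anemic women is not given in the text; it is derived here only to illustrate the definition.

```python
anemic = 50
sample_size = 150
non_anemic = sample_size - anemic   # 100 women without anemia

# Proportion: the numerator (anemic) is part of the denominator (whole sample).
proportion = anemic / sample_size

# Ratio: the numerator (anemic) is not part of the denominator (non-anemic).
ratio = anemic / non_anemic

print(f"{proportion:.1%}")  # 33.3%
print(f"{ratio:.2f}")       # 0.50, i.e. 1 anemic woman for every 2 non-anemic
```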

4. Rate

It is the change in one quantity per unit change in another quantity, usually time or space.

There is no upper limit to its value. Example: Attack rate of flu = 125 cases per week.

In epidemiology, rates are particularly useful for comparing disease frequency in different

locations, at different times, or among different groups of persons with potentially different

sized populations; that is, a rate is a measure of risk.


For epidemiologists, a rate describes how quickly disease occurs in a population, for

example, 70 new cases of breast cancer per 1,000 women per year. This measure conveys a

sense of the speed with which disease occurs in a population and seems to imply that this

pattern has occurred and will continue to occur for the foreseeable future. This rate is an

incidence rate.

2. Quantitative variables

2.1. Measures of central tendency

Definition: A measure of central tendency is a single value that represents the whole data set. These measures include the midrange, the mode, the median, and the mean.

1. Midrange

It is calculated by adding the smallest and largest observation together then divided by 2.

Interpretation: It represents the average of the two extreme observations.

Advantages: It is easy to calculate.

Disadvantages: Affected by the presence of extreme values.

Example: The following are scores of 11 students obtained in English class

10 8 14 15 7 3 3 8 12 10 9

From the given example: Midrange = (3 + 15) / 2 = 9.

If the largest value were 31 instead of 15, Midrange = (3 + 31)/2 = 17, which clearly does not accurately estimate the central tendency of the data.
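The midrange calculation above can be checked with a short Python sketch. The function name `midrange` is illustrative; the scores are those of the example.

```python
scores = [10, 8, 14, 15, 7, 3, 3, 8, 12, 10, 9]


def midrange(values):
    """Average of the smallest and largest observations."""
    return (min(values) + max(values)) / 2


original = midrange(scores)                       # (3 + 15) / 2
# Replace the largest score (15) by 31, as in the text.
with_extreme = midrange([31 if v == 15 else v for v in scores])  # (3 + 31) / 2

print(original, with_extreme)  # 9.0 17.0
```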

2. Mode

The mode is the value that occurs most frequently in a data set.

A data set may be bimodal (two modes) or multimodal (more than two modes); other data sets have no mode because each value occurs only once.

The mode is rarely used as a summary measure.

In the previous example, the values 3, 8, and 10 each occur twice, so the data set has three modes: 3, 8, and 10.
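A quick way to check the mode of the students' scores is Python's standard `statistics.multimode`, which returns every value tied for the highest frequency. In these scores, 3, 8 and 10 each occur twice. This is a sketch, not part of the original notes.

```python
from statistics import multimode

scores = [10, 8, 14, 15, 7, 3, 3, 8, 12, 10, 9]

# multimode returns all most-frequent values; sort them for readability.
modes = sorted(multimode(scores))
print(modes)  # [3, 8, 10]
```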


3. Median

The median is the middle value of an arranged/ordered distribution.

It divides the series into two halves; in one half all items are less than median, whereas in

the other half all items have values higher than median, after arranging the values in an

ascending or descending order.

Steps to calculate the median: we need to rank the observations either in an ascending or

descending order. Then we look at the number of observations (n).

1. If n is odd, there is one median whose rank is (n+1)/2 (note: this is the rank, not the value; the value at this rank is the median).

2. If n is even, there are two middle values whose ranks are n/2 and (n/2)+1, and the average of the values at these two ranks is taken as the median.

Example

What is the median of the following scores: 10 8 14 15 7 3 3 8 12 10 9

Arrange in ascending or descending order :

15 14 12 10 10 9 8 8 7 3 3

Calculate the rank of the median:

rank = (n + 1) / 2 = (11 + 1) / 2 = 6

The median is the value at the 6th rank = 9
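The steps for the median can be sketched as follows. The function follows the rank rules given above; the even-sized data set (the first ten scores) is a hypothetical variation added to show the second rule.

```python
def median(values):
    """Median by rank: middle value if n is odd, mean of the two middle values if even."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Rank (n + 1) / 2, converted to a 0-based index.
        return ordered[(n + 1) // 2 - 1]
    lower = ordered[n // 2 - 1]   # rank n / 2
    upper = ordered[n // 2]       # rank n / 2 + 1
    return (lower + upper) / 2


scores = [10, 8, 14, 15, 7, 3, 3, 8, 12, 10, 9]
print(median(scores))       # 9  (6th value of the 11 ordered scores)

# Hypothetical even-sized set: the first ten scores only.
print(median(scores[:10]))  # 9.0  (average of the 5th and 6th values, 8 and 10)
```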

Advantages

The median is a measure of location.

The median is not affected by extreme values; if the smallest value were wrongly recorded as smaller (e.g. 1 instead of 3) or the largest value wrongly recorded as larger (e.g. 31 instead of 15), the value of the median would not change.

Disadvantages

It may be difficult to order a large number of observations by hand; however, computer software has solved this problem.

The median does not use all the values of the data set, so it may not be representative as a summary measure.

4. Arithmetic mean

The arithmetic mean, often simply called the mean or average, of a set of values is calculated by adding up all the values and dividing this sum by the number of values in the set.

This is expressed by the following symbols:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

where $\bar{x}$ (pronounced "x bar") signifies the mean; $x_i$ is each value in the data set; $n$ is the number of these values; and $\Sigma$ (the Greek uppercase 'sigma') denotes "the sum of", and the sub- and superscripts on the $\Sigma$ indicate that we sum the values from i = 1 to i = n.

Example: Using the given example of students' scores: 10 8 14 15 7 3 3 8 12 10 9

Mean = (10 + 8 + 14 + … + 9)/11 = 99/11 = 9

Advantages

All the values of the data set are included in the calculation of the mean

The mean is the main measure used in inferential statistics.

Disadvantage

It is sensitive to extreme values. For example, replacing 15 by 31 in the above example will yield a mean of: (10 + 8 + 14 + … + 31)/11 = 115/11 = 10.45.
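The sensitivity of the mean to an extreme value can be verified with a short sketch using the same scores. The function name `mean` is illustrative.

```python
scores = [10, 8, 14, 15, 7, 3, 3, 8, 12, 10, 9]


def mean(values):
    """Arithmetic mean: sum of the values divided by their number."""
    return sum(values) / len(values)


mean_original = mean(scores)                      # 99 / 11
# Replace 15 by 31; the number of values stays 11.
extreme = [31 if v == 15 else v for v in scores]
mean_extreme = mean(extreme)                      # 115 / 11

print(mean_original)            # 9.0
print(round(mean_extreme, 2))   # 10.45
```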

2.2. Measures of scatter/ Dispersion

1. Range

A simple measure is the range, which is the difference between the largest and smallest observations. As with the midrange, the range is affected by extreme values.

Example: Using the previous example, Range = 15 – 3 = 12.

If, in the previous example, 15 is replaced by 31, Range = 31 – 3 = 28.

2. Deviation from the mean

A measure of scatter calculated by finding the differences between the individual observations and the mean, summing their absolute values, and dividing this sum by n, where n is the number of observations.

Table 5: Calculation of deviation from the mean

Xi        Xi - X̄      Abs(Xi - X̄)
0.2       -1.475       1.475
0.3       -1.375       1.375
0.6       -1.075       1.075
0.7       -0.975       0.975
0.8       -0.875       0.875
1.5       -0.175       0.175
1.7        0.025       0.025
1.8        0.125       0.125
1.9        0.225       0.225
1.9        0.225       0.225
2.0        0.325       0.325
2.0        0.325       0.325
2.1        0.425       0.425
2.8        1.125       1.125
3.1        1.425       1.425
3.4        1.725       1.725
Sum        0           11.9

The mean difference between the individual observations and the mean is calculated as follows:

$$d = \frac{\sum_{i=1}^{n} (x_i - \bar{x})}{n}$$

But if the differences were added up, the positive would exactly balance the negative and their sum would be zero, so we take the mean of the absolute deviations, $\sum |d_i| / n$, where $d_i = x_i - \bar{x}$.

The absolute value = the value ignoring the sign, so |1.725| = 1.725 and |-1.475| = 1.475.

Example:

Urinary concentration of lead in 16 rural children (µmol/24 h) as follows: 0.2, 0.3, 0.6, 0.7, 0.8, 1.5, 1.7, 1.8, 1.9, 1.9, 2.0, 2.0, 2.1, 2.8, 3.1, 3.4.

Mean ($\bar{x}$) = 1.675.

Mean absolute deviation = $\sum |d_i| / n$ = 11.9/16 = 0.744.
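The calculation can be reproduced with a few lines of Python, given here as a sketch with the 16 urinary lead values from the example. It also shows that the signed deviations cancel out, which is why absolute values are used.

```python
# Urinary lead concentrations (µmol/24 h) from the example in the text.
lead = [0.2, 0.3, 0.6, 0.7, 0.8, 1.5, 1.7, 1.8,
        1.9, 1.9, 2.0, 2.0, 2.1, 2.8, 3.1, 3.4]

n = len(lead)
mean = sum(lead) / n

# Signed deviations cancel out (zero apart from floating-point rounding).
sum_signed = sum(x - mean for x in lead)

# Absolute deviations ignore the sign, so they do not cancel.
sum_abs = sum(abs(x - mean) for x in lead)
mean_abs_dev = sum_abs / n

print(round(mean, 3))          # 1.675
print(round(sum_abs, 1))       # 11.9
print(round(mean_abs_dev, 3))  # 0.744
```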

3. The variance

The variance is based on the differences of each observation from the mean of all the observations.

Instead of summing the absolute differences, here we square the differences (to remove the negative sign) and then sum them. The sum of the squares is then divided by the number of observations minus one to give the mean of the squares.

Variance:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

where $\bar{x}$ = mean.

Example

Urinary concentration of lead in 16 rural children (µmol/24 h) as follows: 0.2, 0.3, 0.6, 0.7, 0.8,

1.5, 1.7, 1.8, 1.9, 1.9, 2.0, 2.0, 2.1, 2.8, 3.1, 3.4.

Table 6: Calculation of variance

Xi        Xi - X̄      (Xi - X̄)²
0.2       -1.475       2.176
0.3       -1.375       1.891
0.6       -1.075       1.156
0.7       -0.975       0.951
0.8       -0.875       0.766
1.5       -0.175       0.031
1.7        0.025       0.001
1.8        0.125       0.016
1.9        0.225       0.051
1.9        0.225       0.051
2.0        0.325       0.106
2.0        0.325       0.106
2.1        0.425       0.181
2.8        1.125       1.266
3.1        1.425       2.031
3.4        1.725       2.976
Sum        0.0         13.750

The calculation of the variance is illustrated in Table 6. The readings are set out in column (1). In column (2) the difference between each reading and the mean ($\bar{x}$) is recorded. The differences are squared (column 3) and summed. The sum of the squares of the differences (or deviations) from the mean, 13.75, is now divided by the total number of observations minus one, to give the variance.

$s^2$ = 13.75/15 = 0.917 (µmol/24 h)².

Why (n-1) as a divider in calculation of variance? The reason for this is that we usually

rely on sample data to estimate the variance of the population. It is shown theoretically

that we obtain a better sample estimate of the population variance if we divide by (n -1).

The units of the variance are the square of the units of the original observations, e.g. if the variable is weight measured in kg, the units of the variance are kg².
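A minimal sketch of the variance calculation for the urinary lead example. Dividing by n is shown only for comparison with the (n - 1) divisor used in the text; `statistics.variance` from the Python standard library also uses (n - 1).

```python
import statistics

# Urinary lead concentrations (µmol/24 h) from the example in the text.
lead = [0.2, 0.3, 0.6, 0.7, 0.8, 1.5, 1.7, 1.8,
        1.9, 1.9, 2.0, 2.0, 2.1, 2.8, 3.1, 3.4]

n = len(lead)
mean = sum(lead) / n
sum_sq = sum((x - mean) ** 2 for x in lead)   # sum of squared deviations

sample_variance = sum_sq / (n - 1)   # divisor n - 1, as in the text
divided_by_n = sum_sq / n            # divisor n, shown only for comparison

print(round(sum_sq, 2))            # 13.75
print(round(sample_variance, 3))   # 0.917
print(round(divided_by_n, 3))      # 0.859

# The standard library gives the same sample variance.
library_variance = statistics.variance(lead)
```

Dividing by n gives a smaller value, which tends to underestimate the population variance when it is estimated from a sample.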


4. Standard deviation

Standard deviation (s) is the square root of the variance. It brings the measurements back to the

units we started with.

In a sample of "n" observations, it is calculated as:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

It is expressed in the same units as the raw data.

Using the given example, the variance is calculated (see before) and the square root of the variance provides the standard deviation (SD): s = √0.917 = 0.957 µmol/24 h.
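The standard deviation of the same data can be checked with `statistics.stdev`, which computes the sample standard deviation (divisor n - 1). This is a sketch for verification only.

```python
import math
import statistics

# Urinary lead concentrations (µmol/24 h) from the example in the text.
lead = [0.2, 0.3, 0.6, 0.7, 0.8, 1.5, 1.7, 1.8,
        1.9, 1.9, 2.0, 2.0, 2.1, 2.8, 3.1, 3.4]

sd = statistics.stdev(lead)   # sample standard deviation, in µmol/24 h
print(round(sd, 3))           # 0.957

# The standard deviation is the square root of the sample variance.
same_as_root = math.isclose(sd, math.sqrt(statistics.variance(lead)))
print(same_as_root)           # True
```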

5. Coefficient of variation

If we divide the standard deviation by the mean and express this quotient as a percentage, we

obtain the coefficient of variation.

CV = (standard deviation (s) / mean ($\bar{x}$)) × 100%

It is a measure of the variability of the observations around their mean, relative to the mean. It is independent of the unit of measurement.

Example: A group of men 30-40 years of age has a mean weight of 80 kg with s of 20 kg, while their heights have a mean of 165 cm with s of 30 cm. Can the variation in weight and height be compared for this group?

Answer: CV weight = 20/80 × 100 = 25%, CV height = 30/165 × 100 = 18%.

We can conclude that the variation in weight is greater than the variation in height.
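The comparison can be sketched in Python. The function name `coefficient_of_variation` is illustrative; the means and standard deviations are those of the example.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100


cv_weight = coefficient_of_variation(sd=20, mean=80)    # kg cancels out
cv_height = coefficient_of_variation(sd=30, mean=165)   # cm cancels out

print(round(cv_weight, 1))  # 25.0
print(round(cv_height, 1))  # 18.2
```

Because the units cancel, the two percentages can be compared directly even though weight is measured in kg and height in cm.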

Fig. 16. Illustrating the different types of descriptive statistics

Activity

Chapter (9)

Applied Statistics (Normal Distribution Curve)



Intended Learning Outcomes:

By the end of this chapter student should be able to:

1. Understand the properties of a normal distribution curve.

2. Know the practical applications of the standard normal model.

Content:

1. Definition of normal distribution curve

2. Properties of a normal distribution curve.

3. Distribution of data in normal distribution:

4. Practical Applications of the Standard Normal Model.


Normal Distribution Curve

Definition: A normal distribution is an arrangement of a data set in which most values cluster in the middle of the range and the rest taper off symmetrically toward either extreme. A normal distribution is sometimes called the bell curve.

For example, the bell curve is seen in tests. The bulk of students will score the average (C), while

smaller numbers of students will score a B or D. An even smaller percentage of students score an

F or an A. This creates a distribution that resembles a bell (hence the nickname). The bell curve is

symmetrical. Half of the data will fall to the left of the mean; half will fall to the right.

The normal distribution is often used to describe:

Heights of people.

Measurement errors.

Blood pressure.

Points on a test.

IQ scores.

Salaries.

The empirical rule tells you what percentage of your data falls within a certain number

of standard deviations from the mean:

68% of the data falls within one standard deviation of the mean.

95% of the data falls within two standard deviations of the mean.

99.7% of the data falls within three standard deviations of the mean.

https://www.statisticshowto.datasciencecentral.com/wp-content/uploads/2013/02/standard-normal-distribution.jpg
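The percentages of the empirical rule can be checked against the standard normal distribution with `statistics.NormalDist` from the Python standard library. This is a sketch; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, f"{share_within(k):.1%}")
# 1 68.3%
# 2 95.4%
# 3 99.7%
```

The rule's 68%, 95% and 99.7% are rounded versions of these values.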


The standard deviation controls the spread of the distribution. A smaller standard deviation

indicates that the data is tightly clustered around the mean; the normal distribution will be taller.

A larger standard deviation indicates that the data is spread out around the mean; the normal

distribution will be flatter and wider.

Properties of a normal distribution

The mean, mode and median are all equal.

The curve is symmetric at the center (i.e. around the mean, μ).

Exactly half of the values are to the left of center and exactly half the values are to the

right.

The total area under the curve is 1.

Distribution of data in normal distribution:

One way of figuring out how data are distributed is to plot them in a graph. If the data are distributed symmetrically around the mean, with most values near the center, you may come up with a bell curve. A bell curve has a small percentage of the points in both tails and the bigger percentage in the inner part of the curve. In the standard normal model, about 5 percent of your data would fall into the "tails" (colored darker orange in the image below) and 95 percent will be in between. For example, for test scores of students, the normal distribution would show 2.5 percent of students getting very low scores and 2.5 percent getting very high scores. The rest will be in the middle; not too high or too low. The shape of the standard normal distribution looks like this:

https://www.statisticshowto.datasciencecentral.com/wp-content/uploads/2013/09/standard-normal-distribution.jpg


Practical Applications of the Standard Normal Model

The standard normal distribution can help you figure out in which subjects you are getting good grades and in which subjects you must exert more effort because of low scores. If you get a score in one subject that is higher than your score in another subject, you might think that you are better in the subject where you got the higher score. This is not always true.

You can only say that you are better in a particular subject if your score lies a greater number of standard deviations above the mean. The standard deviation tells you how tightly your data are clustered around the mean; it allows you to compare different distributions that have different types of data, including different means.

For example, if you get a score of 90 in math and 95 in English, you might think that you are better in English than in math. However, in math your score is 2 standard deviations above the mean, while in English it is only one standard deviation above the mean. This tells you that in math your score is far higher than those of most students (your score falls into the tail); based on these data, you actually performed better in math than in English.
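The comparison can be sketched by converting each score into the number of standard deviations above the mean (a z-score). The class means and standard deviations below are hypothetical, chosen only so that the scores of 90 and 95 lie 2 and 1 standard deviations above their means, as stated in the example; the text does not give them.

```python
from statistics import NormalDist


def z_score(score, mean, sd):
    """Number of standard deviations a score lies above (+) or below (-) the mean."""
    return (score - mean) / sd


# Hypothetical class means and standard deviations (not given in the text).
z_math = z_score(90, mean=70, sd=10)      # 2.0
z_english = z_score(95, mean=85, sd=10)   # 1.0

# Share of students expected to score below you, assuming normally distributed scores.
below_math = NormalDist().cdf(z_math)
below_english = NormalDist().cdf(z_english)

print(z_math, f"{below_math:.1%}")        # 2.0 97.7%
print(z_english, f"{below_english:.1%}")  # 1.0 84.1%
```

Although the raw English score is higher, the math score is better relative to the rest of the class.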
