Prof. Dr. Mohamed Fahmy Amin
Professor of Community Medicine
Department
Dr. Dalia Gaber Sos
Assistant Professor of Community Medicine
Community Medicine Department
Faculty of Medicine
Modern University for Technology and
Information
Cairo – Egypt
Lecture Notes on:
Basics of Research
Methodology and
Biostatistics
For First Year
Medical Students
2021 - 2022
List of Contents
No. Subject
1. Chapter (1): Introduction to Research and Research Process
2. Chapter (2): Research Design I - Descriptive Studies
3. Chapter (3): Research Design II - Analytical Studies-Cohort and Case Control Studies
4. Chapter (4): Applied Intervention Studies (clinical trial)
5. Chapter (5): Protocol Writing
6. Chapter (6): Source of Data and Types of Variables
7. Chapter (7): Data Presentation
8. Chapter (8): Descriptive Statistics
9. Chapter (9): Applied Statistics (Normal Distribution Curve)
Chapter (1)
Introduction to Research and Research Process
Intended Learning Outcomes:
By the end of this chapter the student should be able to:
1. Define research.
2. Know the motives for conducting research.
3. Identify different types of research.
4. Understand the criteria of good research.
5. Understand the importance of studying statistics and biostatistics.
6. Define how research problems and questions are formulated.
7. Outline the objective of research.
8. Identify the different items of research process.
Content:
I. Introduction to Research
1. Meaning of research.
2. Motivation in research.
3. Types of research.
4. Criteria of good research.
5. Statistics and biostatistics.
II. Research Process
1. Definition of research problem and research questions.
2. Reviewing literature
3. Objective of research.
4. Research hypothesis.
5. Research design.
6. Sampling design.
7. Data collection and analysis.
8. Interpretation and report writing.
I. Introduction to Research
1. Meaning of research
Research could be defined as follows:
A scientific and systematic way for collecting information on a specific topic.
An organized and systematic way of finding answers to a specific problem.
Systematized effort to gain new knowledge.
A movement from the known to the unknown.
An attempt to discover something.
2. Motivation in research
What makes people undertake research?
The possible motives for doing research may be either one or more of the following:
Desire to get a research degree.
Desire to face the challenge in solving the unsolved problems.
Desire to get intellectual joy of doing some creative work.
Desire to serve society.
Desire to get respectability.
Desire of the government to understand some health problem and to find a solution for it.
3. Types of research
Fig. 1. Illustrating the different types of research
3.1. Fundamental vs Applied
Fundamental research (basic or pure) is mainly concerned with gathering information and formulating a theory.
e.g. Natural phenomena.
Applied research (or action research) aims at finding a solution for an immediate problem facing a society or an industrial/business organization.
e.g. Treating or curing a specific disease.
Thus, the aim of applied research is directed to discover a solution for some problem, whereas
basic research is directed towards finding information that has a broad base of applications and
thus, adds new findings to the already existing scientific knowledge.
3.2. Descriptive vs. Analytical
Descriptive research: the major purpose is description of the problem as it exists. The researcher can only report what has happened or what is happening. He describes a problem using surveys and fact-finding. The methods of research utilized in descriptive research are survey methods, including comparative and correlational methods.
e.g. Describing obesity among young children.
Analytical research: the researcher uses facts or information already available and analyzes them to make a critical evaluation of the material.
e.g. Relation between smoking and lung cancer.
3.3. Quantitative vs. Qualitative
Quantitative research is based on the measurement of quantity or amount. It is applicable to
phenomena that can be expressed in terms of quantity.
e.g. Measuring the number of school students suffering from anemia and dental caries.
Qualitative research is concerned with qualitative phenomena that are difficult or impossible to quantify, i.e. phenomena relating to beliefs, meanings, feelings, and attitudes. Qualitative research is important in the behavioral sciences, where the aim is to discover the underlying motives of human behavior.
e.g. How people feel or what they think about a particular subject.
3.4. Conceptual vs. Empirical
Conceptual research is that related to some abstract idea(s) or theory. It is generally used by
philosophers and thinkers to develop new concepts or to reinterpret existing ones. It doesn't
involve any practical experiments.
Empirical research (experimental research) relies on experience or observation alone, often
without regard for system and theory. It is data-based research, coming up with conclusions
which are capable of being verified by observation or experiment. Empirical research is
appropriate when certain variables affect other variables in some way.
e.g. Use of antihypertensive drugs to decrease blood pressure.
4. Criteria of good research
Good research fulfils the following:
1. The purpose of the research should be clearly defined.
2. The research procedure used should be described in sufficient detail to permit another
researcher to repeat the research for further advancement.
3. The procedural design of the research should be carefully planned to yield results that are
as objective as possible.
4. The analysis of data should be sufficiently adequate to reveal its significance and the
methods of analysis used should be appropriate.
5. Conclusions should be confined to those justified by the data of the research.
5. Statistics/ Biostatistics
Statistics is a branch of mathematics dealing with the collection, analysis, interpretation,
and presentation of data.
Biostatistics is the application of statistical processes and methods to the collection, analysis, and interpretation of biological data, especially data related to human biology, health, and medicine.
Importance of biostatistics in research
Discover the causes and risks of diseases.
Reach conclusions within certain population groups about different diseases.
Determine how diseases develop, progress and spread.
II. Research Process
1. Definition of research problem and research questions
Definition of research problem:
A question that the researcher wants to answer or a problem that a researcher wants to solve. A
research problem is an area of concern where there is a gap in the knowledge needed for
professional practices.
A research problem, in general, refers to some difficulty that a researcher experiences in either a theoretical or a practical situation and for which he wants to obtain a solution.
Necessity of defining the problem
A proper definition of the research problem keeps the researcher on track, whereas an ill-defined problem may create difficulties. Defining a research problem properly is a prerequisite for any study and is a step of the highest importance. In fact, formulation of a problem is often more essential than its solution. It is only by carefully detailing the research problem that we can work out the research design and smoothly carry out all the subsequent steps involved in doing research.
Research questions
The researcher poses questions when studying the problem to help him complete the research process, such as: What data are to be collected? What characteristics of the data are relevant and need to be studied? What relations are to be explored? What techniques are to be used for the purpose? These and similar questions crop up in the mind of the researcher, who can plan his strategy and find answers to them only when the research problem has been well defined.
Criteria of a good research question (FINER)
1. "F": Feasible: you and/or the research team have enough budget, time, participants, and appropriate expertise to manage the research.
2. "I": Interesting, at least to the investigator.
3. "N": Novel: confirms or refutes previous findings, extends previous findings, or provides new findings.
4. "E": Ethical: no harm inflicted, and no benefit denied.
5. "R": Relevant: to scientific knowledge, to clinical and health policy, and/or to future research directions.
Identification of a research problem
A research problem can be identified by reviewing the literature, consulting a group of experts, drawing on personal research experience, and from patients.
2. Reviewing literature
When planning a research project, it is essential to know what the current state of knowledge is
in your chosen subject as it is obviously a waste of time to spend months producing knowledge
that is already available or not important to research. Therefore, one of the first steps in planning
a research project is to do a literature review: that is, to search through all the available
information sources in order to track down the latest knowledge, and to assess it for relevance,
quality, controversy and gaps.
Sources could be:
Previous research.
Data in various organizations.
Experts' opinions.
Journals and newspapers.
Electronic databases.
3. Objective of research
The aim of research is to find out the truth which is hidden and which has not yet been discovered. Although each research study has its own specific purpose, we may think of research objectives as falling into the following broad groupings:
1. To gain familiarity with a phenomenon or to achieve new insights into it.
2. To reveal accurately the characteristics of a particular individual, situation or a group.
3. To determine the frequency with which something occurs or with which it is associated
with something else.
4. To test a hypothesis and predict the relationship between variables.
Types of objectives
There are two types of research objectives, general and specific objectives.
Criteria of good objectives
Objectives should be clear, specific, focused, measurable, attainable, and relevant, and should refer to the time frame of the study.
Action verbs should be used when stating correct objectives.
4. Research hypothesis
Definition
A hypothesis is a proposition that is stated in a testable form and predicts a particular relationship
between two or more variables. In other words, if we think that a relationship exists, we first
state it as hypothesis and then test the hypothesis in the field. A hypothesis is written in such a
way that it can be proven or disproven by valid and reliable data.
e.g. Retinal detachment is more common in those who have a family history of diabetes.
Characteristics: Hypothesis must possess the following characteristics
1. It should be clear and precise.
2. It should be capable of being tested.
3. Validity of hypothesis should be unknown
4. It should state relationship between variables.
5. It should be limited in scope and must be specific.
6. It should be stated in simple terms.
Importance
The role of the hypothesis is to guide the researcher by specifying the area of the research and to
keep him on the right track.
The hypothesis translates the research question into a prediction of expected outcomes. The
researcher starts with a hypothesis and conducts the study to prove or disprove this hypothesis.
5. Research design
The research design is an outline of what the researcher will do, from writing the hypothesis and its operational implications to the final analysis of data. Because several research designs are available, the researcher must decide, before collecting and analyzing data, which design is most appropriate for his research project. Different types of research design will be discussed in the next chapter.
6. Sampling design
A sample design is a definite plan for obtaining a sample from a given population. It refers to the
technique or the procedure the researcher would adopt in selecting items for the sample. Sample
design may as well lay down the number of items to be included in the sample i.e., the size of the
sample. The sample design is determined before data are collected. There are many sample designs from which a researcher can choose. The researcher must select or prepare a sample design that is reliable and appropriate for his research study. The importance of sampling is that it decreases the cost, time, and effort of the researcher in the study.
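As a small illustration of what a sample design fixes in advance (the sampling frame, the sample size, and the selection procedure), the sketch below draws a simple random sample without replacement. Simple random sampling is used here only as one familiar example of a design; these notes do not prescribe it, and the sampling frame of 500 student ID numbers, the sample size, and the seed are all hypothetical.

```python
import random


def simple_random_sample(sampling_frame, sample_size, seed=None):
    """Draw sample_size distinct units from the sampling frame (no replacement)."""
    if sample_size > len(sampling_frame):
        raise ValueError("sample size cannot exceed the size of the sampling frame")
    rng = random.Random(seed)  # local generator so that the draw can be reproduced
    return rng.sample(sampling_frame, sample_size)


# Hypothetical sampling frame: ID numbers of 500 students (illustration only).
sampling_frame = list(range(1, 501))
sample = simple_random_sample(sampling_frame, sample_size=50, seed=2021)
print(len(sample), "students selected out of", len(sampling_frame))
```

Fixing the seed is optional; it only makes the same draw repeatable, so that another researcher could reproduce the selection.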
7. Data collection and analysis
Data collection
The task of data collection begins after the research problem and design have been defined.
Methods of data collection
Observation: It is the most commonly used method, especially in studies related to behavioral science.
Interview: Personal interviews, such as face-to-face interviews, are costly and time-consuming.
Questionnaires: Used by the researcher to collect data; they should be well formulated so that they yield accurate data.
Schedules: A schedule is a set of questions filled in by an interviewer (enumerator) during an interview, rather than by the respondent.
Data analysis
Data analysis is the most important part of any research. Data analysis summarizes collected
data. It involves the interpretation of data gathered through applying statistical and/or logical
techniques to describe and illustrate, condense and recap, and evaluate data.
8. Interpretation and report writing
Interpretation refers to the task of drawing conclusions from the collected facts after an
analytical and/or experimental study. In fact, it is a search for broader meaning of research
findings.
Interpretation has two major aspects
Establish continuity in research through linking the results of a given study with those of
another.
Establishment of some explanatory concepts.
The research report is a major component of the research study; the research task remains incomplete until the report has been presented and/or written. Even the most brilliant hypothesis, the best designed and conducted study, and the most important findings are of little value unless they are effectively communicated to others. This explains the significance of writing the research report.
Different steps in writing report
1. Logical analysis of the subject-matter.
2. Preparation of the final outline.
3. Preparation of the rough draft.
4. Rewriting and polishing.
5. Preparation of the final bibliography.
6. Writing the final draft.
A report is typically made up of three main divisions:
Fig. 2. Summarizing the research process
Activity
Chapter (2)
Research Design I - Descriptive Studies
Intended Learning Outcomes:
By the end of this lecture the student should be able to:
1. Define Epidemiology and recognize its major aims
2. Explain the role of descriptive studies in identifying problems and establishing
hypotheses.
3. Explain how the characteristics of person, place, & time are used to formulate hypotheses
in acute disease outbreaks and in studies of chronic diseases.
4. Identify case reports and case series and explain their uses and their limitations.
5. Describe the design features of an ecologic study and discuss their strengths and
weaknesses.
6. Describe the design features of a cross-sectional study and describe their uses, strengths,
and limitations
Content:
1. Importance of clinical epidemiology in research studies.
2. Descriptive epidemiology (Person-Place-Time) studies.
3. Types of descriptive studies.
3.1. Case report
3.2. Case series
3.3. Ecological study
3.4. Cross-sectional study
1. Importance of clinical epidemiology in research studies
Epidemiology
Definition: The study of the distribution and determinants of health-related states and events in specified populations, and the application of this study to the control of diseases and other health problems.
1. By distribution, we mean who gets the disease, when, and where, i.e. Person-Place-Time.
2. By determinants, we mean causes and factors that influence the disease frequency in a population.
Clinical Epidemiology
Definition: Science dealing with the use of epidemiological data in clinical settings. It is
usually answering the following questions:
1. Is the patient sick or well? In other words, what is normal and what is
abnormal?
2. What is the cause of the disease? Etiological studies.
3. How to diagnose the disease? What tools can be used to differentiate between distinct
phases or stages of the disease? Diagnostic studies.
4. What is the expected course of the disease and its complications? Prognostic studies.
5. Is there an effective treatment for that disease? Therapeutic studies.
6. Is there a way to prevent the occurrence of disease in healthy individuals? Preventive
studies.
Epidemiologic studies
Why Conduct Studies?
To describe burden of disease or prevalence of risk factors, health behaviors, or other
characteristics of a population that influences the risk of disease.
To determine causes or risk factors for illness.
To determine relative effectiveness of interventions.
Fig. 3. Illustrating different types of epidemiological studies
1. Descriptive studies:
Descriptive studies are usually the first step of an epidemiological investigation, conducted to describe a certain phenomenon and its relation to a certain exposure, i.e. to generate a hypothesis.
Answer what, who, where, and when.
They include case reports, case series, ecological studies, and cross-sectional studies. Cross-sectional studies are sometimes classified as analytical because they can reveal possible associations between exposure and outcome.
2. Analytical studies:
2.1. Observational
These studies are used to assess the association between factors of interest and disease in the population, i.e. to test a hypothesis.
Answer why and how.
They include case-control, cohort, and cross-sectional studies.
2.2. Interventional studies (Experimental)
Here the investigator intervenes actively to affect the outcome. The clinical trial is an example, in which the investigator tests a new drug for the treatment of a disease such as hypertension or diabetes. Interventional studies are classified into clinical trials and community studies.
2. Descriptive Epidemiology (Person-Place-Time) studies
Characterized by who, where, or when in relation to what (outcome). Compiling and analyzing
data by time, place, and person is desirable for several reasons.
First, by looking at the data carefully, the epidemiologist becomes very familiar with
the data. He or she can see what the data can or cannot reveal based on the variables
available, its limitations (for example, the number of records with missing information
for each important variable), and its eccentricities (for example, all cases range in age
from 2 months to 6 years, plus one 17-year-old.).
Second, the epidemiologist learns the extent and pattern of the public health problem
being investigated — which months, which neighborhoods, and which groups of
people have the most and least cases.
Third, the epidemiologist creates a detailed description of the health of a population
that can be easily communicated with tables, graphs, and maps.
Fourth, the epidemiologist can identify areas or groups within the population that have
high rates of disease. This information in turn provides important clues to the causes of
the disease, and these clues can be turned into testable hypotheses.
Types of Descriptive Studies
Case report
A case report is a detailed description of the disease occurrence in a single person. Unusual or
newly observed manifestations may suggest a new hypothesis about the causes or mechanism
of disease.
Case series
A case series is a report on the characteristics of a group of patients who all have a particular
disease or condition. Common features among the group give more valid hypotheses about
disease causation. Note that the "series" may be small or large (hundreds or thousands of
cases). However, the chief limitation is that there is no comparison group.
Ecological study
This type of study is concerned with data on groups, not individuals. It is possible to measure
associations between exposures and outcomes in groups and hypotheses generated from such
observation are proposed for more elaborate analytical studies.
e.g. Cancer is more prevalent in high-income countries than in low-income countries.
Cross-sectional study
It assesses the prevalence of disease and the prevalence of risk factors at the same point in time and provides a "snapshot" of diseases and their potential risk factors simultaneously in a defined population.
Person: characteristics (age, sex, socio-economic status) of the affected individuals
Place: characteristics (residence, work, hospital) of the affected individuals
Time: characteristics (secular, seasonal, point, cyclic)
Person characteristics
Age:
Age is the most important factor. Some diseases occur exclusively in one age group, while others predominate in one age group but can occur at any age. Many chronic diseases show a progressive increase with age due to aging itself or to cumulative exposure to harmful effects.
The causes of morbidity and mortality differ according to the stage of life: during childhood, infectious diseases predominate, especially in unvaccinated populations; teenagers are affected by unintentional injuries, violence, and substance abuse; in young adults, unintentional injuries are the leading cause; and chronic degenerative diseases predominate in the late stages of life.
Sex
In general, morbidities and mortalities from most diseases are higher in males than females.
Certain conditions are more common among males or females due to anatomical and
physiological differences. Variation in sex distribution could be due to:
A- Sex linked inheritance.
B- Hormonal or reproductive factors.
C- Habits, social factors or environmental exposure.
Race and ethnicity
Black Americans are more liable to develop hypertension and its complications than Black Africans. Closed groups (e.g. prisons, camps) may be susceptible to certain diseases. The variations in mortality and morbidity could be due to genetic differences or to differences in culture, socioeconomic status, and availability of medical care.
Marital Status
Married people have lower mortality than single people. Death rates from specific diseases and from all causes combined vary by marital status; from lowest to highest: married, single, widowed, and divorced.
Socioeconomic Status (SES)
The term usually describes the person's position in society and is often formulated as a
composite measure of three interrelated dimensions: Income, Education and Occupation. SES
affects perception of the disease and the healthcare seeking behavior of the individual.
Place characteristics
Describing the occurrence of disease by place provides insight into the geographic extent of the
problem and its geographic variation. Characterization by place refers not only to place of
residence but to any geographic location relevant to disease occurrence. Such locations include
place of diagnosis or report, birthplace, site of employment, school district, hospital unit, or
recent travel destinations. The unit may be as large as a continent or country or as small as a
street address, hospital wing, or operating room. Sometimes place refers not to a specific location at all but to a place category such as:
a) International (between countries): morbidity and mortality occur at different rates in different countries. Migrant studies can differentiate between genetic and environmental causes of these differences.
b) National (within a country): differences between regions in the same country. Upper Egypt, for example, suffers from a lack of medical and health services compared to urban cities or Lower Egypt. Moreover, there are differences between urban and rural areas in the same region.
c) Local (within a city or village): areas within a city or a village may exhibit different patterns of disease. In a big village, regions close to a swamp (water collections) can be more affected by malaria and other mosquito-borne infections. Slum regions in big cities usually show a high prevalence of nutritional problems and infectious diseases.
John Snow's famous map shows the distribution of cholera cases around the Broad Street water pump in London in 1854.
Time characteristics
Some diseases emerge at a certain period of time while others emerge at another time.
a) When does the disease occur commonly and when does it occur rarely?
b) Does the present frequency of the disease differ from the corresponding frequency in the past?
Time characteristics of a certain disease may range from hours to decades. Short-term changes in
disease incidence are used to study epidemics of infectious or non-infectious diseases.
1. Secular (long-term) pattern. The long-term trend of disease occurrence, usually by years.
2. Seasonal pattern: respiratory infections in winter compared to gastrointestinal infections
in summer.
3. Point (short term): epidemics and outbreaks.
4. Cyclic trend: occurrence of measles outbreaks every third year before obligatory vaccination in Egypt, and every 7 years in the past two decades.
Fig. 4. Histogram in which each case is represented by a square stacked into columns.
Cases of Salmonella Enteritidis, Chicago, February 13–21, by Date and Time of Symptom Onset
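A histogram of this kind (an epidemic curve) is built by counting cases for each date of symptom onset. The sketch below shows the tallying step and a plain-text rendering with one "#" per case. The onset dates are hypothetical and invented for illustration only; they are not the Chicago data, and the function names are assumptions.

```python
from collections import Counter
from datetime import date


def epidemic_curve(onset_dates):
    """Return (date, number of cases) pairs sorted by date of symptom onset."""
    counts = Counter(onset_dates)
    return sorted(counts.items())


def render_curve(curve):
    """Draw one '#' per case, one row per onset date."""
    return "\n".join(f"{day.isoformat()} {'#' * n}" for day, n in curve)


# Hypothetical onset dates, invented for illustration only.
onset_dates = [
    date(2021, 2, 13),
    date(2021, 2, 14), date(2021, 2, 14),
    date(2021, 2, 15), date(2021, 2, 15), date(2021, 2, 15),
    date(2021, 2, 16),
]
curve = epidemic_curve(onset_dates)
print(render_curve(curve))
```

The figure stacks squares into vertical columns; the text version lays the same counts out as horizontal rows, which is enough to see when cases rise and fall.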
Important facts about cross-sectional study
Cross-sectional studies measure simultaneously the exposure and health outcome in a
given population and in a given geographical area at a certain time.
A cross-sectional study is an observational study.
Often described as a "snapshot" of a population at a certain point in time because exposure and outcome are determined simultaneously for each subject.
A cross-sectional study is also called a prevalence study.
The temporal relationship between exposure and disease cannot be determined.
Cross-sectional studies can be helpful in determining how many people are affected by a
condition and whether the frequency of the occurrence varies across groups or population
characteristics.
Cross-sectional studies are mostly carried out for public health planning. For example, "Knowledge, attitude and practice (KAP) of family planning methods among women attending antenatal clinic in area 'x'" is a cross-sectional study.
Cross-sectional Study Design
1. Define the population for study.
2. Determine the presence or absence of exposure and the presence or absence of disease for
each individual enrolled in the study.
For example
We survey a population and, for each study participant, determine at the same time the serum cholesterol level (exposure) and evidence of cardiovascular disease (outcome). Each study participant will fall into one of the following four subgroups:
a. Persons who have been exposed and have the disease.
b. Persons who have been exposed but do not have the disease.
c. Persons who have the disease but have not been exposed.
d. Persons who have neither been exposed nor have the disease.
In a cross-sectional study we can calculate the prevalence of disease and the prevalence of exposure using the 2 x 2 table:
              Disease    No disease    Total
Exposed       a          b             a + b
Not exposed   c          d             c + d
Total         a + c      b + d         a + b + c + d
Prevalence of disease in exposed compared to non-exposed: a/(a+b) vs c/(c+d)
Prevalence of exposure in diseased compared to non-diseased: a/(a+c) vs b/(b+d)
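The prevalence comparisons above can be computed directly from the four cell counts a, b, c and d. The following is a minimal sketch; the function name and the example counts are hypothetical and do not come from a real survey.

```python
def prevalence_summary(a, b, c, d):
    """Prevalence measures from a 2 x 2 table.

    a: exposed, diseased          b: exposed, not diseased
    c: not exposed, diseased      d: not exposed, not diseased
    """
    if min(a, b, c, d) < 0:
        raise ValueError("cell counts cannot be negative")
    if 0 in (a + b, c + d, a + c, b + d):
        raise ValueError("every row and column total must be greater than zero")
    total = a + b + c + d
    return {
        "disease_in_exposed": a / (a + b),
        "disease_in_non_exposed": c / (c + d),
        "exposure_in_diseased": a / (a + c),
        "exposure_in_non_diseased": b / (b + d),
        "overall_disease_prevalence": (a + c) / total,
    }


# Hypothetical counts, invented for illustration only.
summary = prevalence_summary(a=30, b=70, c=20, d=180)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

With these made-up counts, disease is more prevalent among the exposed (30 of 100) than among the non-exposed (20 of 200). Because exposure and outcome are measured at the same time, this difference suggests an association but cannot show which came first.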
Advantages of cross-sectional study
1. It is simple, inexpensive and done in a short time.
2. The prevalence rate of disease(s) and exposure(s) can be measured.
3. It is the first step to develop evidence for causal association (generate hypotheses).
4. It is often useful at the time of an epidemic as it helps to determine the extent of the
epidemic in the population.
Disadvantages of cross-sectional study
1. It is not appropriate for studying rare diseases or events of short duration.
2. It does not provide solid evidence for causal association as the temporal relationship
between exposure and disease cannot be confirmed objectively (Egg or chicken
dilemma).
3. Use of prevalent cases to detect risk factors may lead to wrong conclusions, as prevalent cases may differ from incident cases in terms of survival factors (discussed under cohort studies).
Summary on cross-sectional study steps
1. Defining the population
The first step is, therefore, to define "the population base" not only in terms of total number, but also its composition in terms of age, sex, and other socio-demographic characteristics.
2. Defining the disease or characteristic under study
The epidemiologist must define precisely and accurately the condition being investigated, i.e. give an operational definition: a clear description of the disease or phenomenon under study in terms of measurable variable(s) in the defined population.
3. Describing the disease or the characteristic and its associates
Person: Age, sex, occupation, education.
Place: Rural vs. urban, Upper vs. Lower Egypt, closeness to a factory or a
water canal.
Time: Year (secular changes over years), season, month, week, day or even
hour of the day.
4. Measurement of disease
In descriptive studies, the disease under study should be ascertained using the proper
diagnostic tools and techniques.
5. Comparing with known indices (Prevalence Rate)
To judge the rate of disease development, one must compare the calculated rates
with previously recorded or estimated ones. We can also identify groups who are at
higher risk of developing the disease.
6. Formulation of a hypothesis or hypotheses
The importance of the descriptive studies is their use in generating hypotheses about
etiology of the health-related conditions. These hypotheses should be subjected to
further investigations using more elaborate methods.
Chapter (3)
Research Design II - Analytical Studies-Cohort and Case Control
Studies
Intended Learning Outcomes:
By the end of this lecture the student should be able to:
1. Define and explain the distinguishing features of a cohort study
2. Identify the risk factors.
3. Determine different types and measurements derived from cohort study.
4. Define and explain the distinguishing features of a case-control study
5. Describe and identify when case control studies are desirable.
6. Estimate and interpret the measures of risk in both designs.
7. Identify the potential strengths and limitations of both designs
Content:
1. Cohort study
(Definition-characteristics-design-types-measurement-advantages-disadvantages).
2. Case control study
(Definition-purpose-characteristics-steps-measurement-advantages-disadvantages)
Analytical epidemiology
Analytical epidemiology is concerned with the search for causes and effects, or the why and the how. Epidemiologists use
analytic epidemiology to quantify (measure) the association between exposures and outcomes
and to test hypotheses about causal relationships. It has been said that epidemiology by itself can
never prove that a particular exposure caused a particular outcome. However, epidemiology
provides sufficient evidence to take appropriate control and prevention measures.
Analytic studies test hypotheses about exposure-outcome relationships.
Measure the association between exposure and outcome.
Include a control group.
What is a risk factor, and does it differ from a disease cause?
A risk factor is an attribute, exposure (physical, chemical, or biological), or behavior that increases the probability that an individual will develop a disease. When the risk factor is unpreventable or non-modifiable, such as age, sex, and race, some authors call it a risk attribute. No single risk factor is sufficient on its own to cause a disease; the presence of other risk factors is required.
Component cause and concept of risk factor
Factor (A): Present in all sufficient causes, so it is called a necessary factor.
Disease occurs due to the combination of more than one risk factor.
Characteristics of disease etiology
The etiology of any disease is multi-factorial, i.e., the development of a disease needs
the contribution of more than one risk factor.
Each disease can be caused by a number of sufficient causes.
Each sufficient cause is, on its own, enough to produce the disease.
Each sufficient cause consists of a combination of many risk factors (component causes) that work in different combinations or sequences.
Component causes change over time and in different populations.
A risk factor that is present in all sufficient causes is a necessary factor.
1. Cohort study
Definition
A cohort is a well-defined group of individuals who share a common characteristic or experience. In a cohort study, such a group is followed over time.
Example: pregnant women with diabetes form a cohort; individuals born in a specific year form a birth cohort. Other names for a cohort study: longitudinal study or follow-up study.
Characteristics of cohort study
Participants are classified according to exposure status and followed-up over time to
ascertain outcome.
Can be used to find multiple outcomes from a single exposure.
Appropriate for rare exposures
Ensures temporality (exposure occurs before observed outcome)
Cohort study design
Etiologic (cohort) studies require at least two groups. One group, the index group, is exposed to the factor thought to influence the occurrence of the study outcome. The other group, the referent or control group, remains unexposed to provide a reference for comparison.
Types of cohort studies
Prospective cohort studies
Participants are grouped according to current exposure and followed up into the future to determine whether the outcome occurs.
Retrospective cohort studies
At the time the study is conducted, both the exposure and the outcomes have already occurred in the past.
N.B.: "Reconstructive cohort study" (a combination of the prospective and retrospective designs): you may assemble a cohort that started at a point in time in the past and continue to follow its members from now to a time point in the future. E.g., a cohort of doctors who graduated in 1980-1990 is assembled from the medical school records and followed until 2020 for causes of death.
Measurement in cohort study
A- Absolute risk (Incidence rate)
Table 1: Relation between Smoking and Hypertension
                          Smokers     Nonsmokers   Total population
Hypertension              80 (20%)    30 (5%)      110 (11%)
Free from hypertension    320         570          890
Total                     400         600          1000
Measures of incidence (measures of disease frequency) among exposed and among non-exposed (cumulative incidence):
1. Incidence of hypertension in smokers = 80/400 = 20%
2. Incidence of hypertension in non-smokers = 30/600 = 5%
3. Incidence of hypertension in the population = 110/1000 = 11%
B- Risk ratio or relative risk
Measure of association (the relative risk or risk ratio) = incidence in exposed / incidence in unexposed
Relative risk in our example = 20% / 5% = 4, which means that the risk of developing hypertension among smokers is 4 times the risk among non-smokers.
If the relative risk = 1, the exposure is not associated with the disease; in other words, the exposure is not a risk factor for the disease.
If the relative risk is >1, the incidence in the exposed exceeds that in the unexposed and the exposure is a risk factor for the disease.
Lastly, if the relative risk is <1, the incidence in the exposed is less than in the unexposed and the exposure is a "protective factor" rather than a risk factor for the disease; in other words, absence of this exposure is a risk factor for the disease.
Relative risk is a measure of the strength of the association between exposure and outcome: the higher the relative risk, the stronger the association and the more it supports an etiological relationship between exposure and outcome.
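As a small illustration, the cumulative incidence and relative risk calculations for Table 1 can be reproduced in a few lines of Python. This is only a sketch: the function name cumulative_incidence and the variable names are our own choices and are not part of the notes; the counts are those of Table 1.

```python
# Cumulative incidence and relative risk for Table 1 (smoking and hypertension).

def cumulative_incidence(new_cases, population_at_risk):
    """Proportion of the group at risk that developed the outcome."""
    return new_cases / population_at_risk

# Counts taken from Table 1
exposed_cases, exposed_total = 80, 400        # smokers
unexposed_cases, unexposed_total = 30, 600    # non-smokers

ci_exposed = cumulative_incidence(exposed_cases, exposed_total)
ci_unexposed = cumulative_incidence(unexposed_cases, unexposed_total)
ci_population = cumulative_incidence(
    exposed_cases + unexposed_cases, exposed_total + unexposed_total
)

# Relative risk = incidence in exposed / incidence in unexposed
relative_risk = ci_exposed / ci_unexposed

print(f"Incidence in smokers:     {ci_exposed:.0%}")
print(f"Incidence in non-smokers: {ci_unexposed:.0%}")
print(f"Incidence in population:  {ci_population:.0%}")
print(f"Relative risk:            {relative_risk:.1f}")
```

The printed values should agree with the hand calculation above (20%, 5%, 11% and a relative risk of 4).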
Advantages of cohort studies
1. Incidence rate and relative risk can be calculated.
2. The temporal relationship between exposure and outcome is preserved.
3. Several possible outcomes related to a single exposure can be studied simultaneously.
4. No recall bias (see case-control study).
5. Dose-response effect can be studied.
6. Suitable for rare exposures.
Disadvantages of cohort studies
1. Cohort studies involve a large number of people.
2. It takes a long time to complete the study.
3. Unsuitable for uncommon diseases or diseases with low incidence in the population.
4. Loss of individuals during follow-up, which may be due to travel, migration, death or loss of interest.
5. Expensive in terms of cost and effort.
6. Ethical problems.
2. Case-control study
Definition
A case-control study is an epidemiological study design in which individuals with an event or condition/disease of interest (cases) are identified and then compared with individuals without the event or condition of interest (controls) with regard to one or more exposures.
Case-control studies are the most common type of observational analytical study, constituting about 90% of all epidemiological studies.
Purpose
To study rare diseases
To study multiple exposures that may be related to a single outcome
Study Subjects
Participants are selected based on outcome status:
Case-subjects have the outcome of interest (e.g. cancer).
Control-subjects do not have the outcome of interest.
When to conduct a case-control study?
The outcome of interest is rare (e.g. cancer).
The disease or outcome has a long induction and latent period (i.e., a long time between exposure and the eventual manifestation of disease).
Multiple exposures may be associated with a single outcome.
Funding or time is limited.
Characteristics of case control study
1. Both exposure (risk factor) and outcome (disease) have occurred before the start of the study. (Exposures are assessed retrospectively, which is why case-control studies are called "retrospective studies".)
2. Being relatively easy and inexpensive, it is commonly the first approach used to test hypotheses about causal relationships.
Steps to conduct case-control study design
1- Selection of cases (case definition)
It involves diagnostic criteria and eligibility criteria, such as requiring that the case be newly diagnosed within a specific period of time ("incident case"). The sources of cases may be hospital or clinic based, or population (community) based, where new cases are reported to health departments, registries, hospital record departments, etc.
NB: The use of prevalent cases tends to reveal factors related to survival with the disease rather than the risk factors of its etiology.
2- Selection of controls
The controls must be free from the disease under the study.
A control group with condition(s) related to the exposure under study may distort the relationship between that exposure and the disease under study.
If we choose a control group of patients from a chest clinic for cases of lung cancer, we may end up with no association between smoking and lung cancer. A relatively high proportion of the controls chosen from chest clinics are likely to be smokers, as they are suffering from other diseases related to smoking (e.g. chronic bronchitis).
On the other hand, if we choose athletes as controls for the same study, the association between smoking and lung cancer will be overestimated, as athletes are more likely to be non-smokers than the general population.
Selection of controls is one of the most difficult tasks in case-control studies and is a source of many errors (bias).
3- Matching
Controls are made similar to cases with regard to certain selected variables, e.g. age and sex, which are known to influence the outcome of the disease and which, if not adequately matched, could distort the results.
The size of the control group should be at least equal to the size of the case group, but using more than 3 controls for each case adds little to the efficiency of the study.
Measurement of exposure
Information about Exposure (examples: smoking, dietary intake of fat, exposure to
asbestos, hormonal contraceptive intake) should be obtained in precisely the same
manner both for cases and controls.
As human memory is very selective, recall errors may occur. Women who had a child with a congenital malformation tend to recall events that occurred during pregnancy and delivery far better than women who had normal babies, who may have forgotten them almost completely (recall bias).
Sometimes ascertainment of exposure may be affected by previous knowledge of the data
collector about the disease status of the individual: the interviewer may explore history of
smoking more deeply in cases of bronchogenic carcinoma than healthy controls
(interviewer bias).
It is better to measure exposure with an objective and validated method (biological
marker), but this may not be feasible in many situations.
Measurement in case control study
                Cases     Controls
Smokers         30 (a)    15 (c)
Non-smokers     10 (b)    45 (d)
Total           40        60
The following can be measured from the case-control study:
Proportion of exposure among cases (smokers among hypertensives) = 30/40 = 75%.
Proportion of exposure among controls (smokers among normotensives) = 15/60 = 25%.
How to measure the strength of the association between smoking and hypertension?
1. We cannot use the relative risk, because a case-control study does not allow us to measure the incidence of disease among exposed and unexposed.
2. A measure of association can be calculated from the case-control study, called the odds ratio (OR). It is the ratio between the odds of exposure in cases and the odds of exposure in controls.
Odds = the ratio between the probability of having a characteristic and the probability of not having that characteristic.
Odds of exposure in cases = probability of exposure / probability of non-exposure = (30/40) ÷ (10/40) = 30/10
Odds of exposure in controls = probability of exposure / probability of non-exposure = (15/60) ÷ (45/60) = 15/45
Odds ratio = (30/10) ÷ (15/45) = (30 x 45) / (10 x 15) = 9
The odds ratio is a measure of the strength of the association between the risk factor and the outcome. It approximates the relative risk when the prevalence of the disease in the general population is low and the risk ratio is low.
If the disease prevalence is high, the odds ratio will overestimate the relative risk (for a relative risk above 1).
From the above example we can conclude that the odds of being a smoker among hypertensives are 9 times the odds among normotensives.
A simple way to calculate the odds ratio is to arrange the 2 by 2 table so that the upper-left cell contains the exposed cases (a), with the unexposed cases (b), exposed controls (c) and unexposed controls (d) labeled as in the table above.
Odds ratio = ad/bc. In the previous example = (30 x 45) / (10 x 15) = 9. This is why it is called the cross-product ratio.
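The cross-product calculation can be sketched in Python as follows. The function name odds_ratio is our own, and the cell labels follow the 2 by 2 table of this section (a = exposed cases, b = unexposed cases, c = exposed controls, d = unexposed controls). As an extra illustration that is not part of the original notes, the same function is applied to the cohort data of Table 1, where hypertension is common (11%), to show the odds ratio exceeding the relative risk.

```python
# Odds ratio as a cross-product ratio, with cells labeled as in this section:
# a = exposed cases, b = unexposed cases, c = exposed controls, d = unexposed controls.

def odds_ratio(a, b, c, d):
    """Odds of exposure in cases (a/b) divided by odds of exposure in controls (c/d)."""
    return (a * d) / (b * c)

# Case-control example: smokers and non-smokers among cases and controls
or_case_control = odds_ratio(a=30, b=10, c=15, d=45)

# Cohort data from Table 1, treated the same way for comparison only:
# a = smokers with hypertension, b = non-smokers with hypertension,
# c = smokers free from hypertension, d = non-smokers free from hypertension.
or_cohort = odds_ratio(a=80, b=30, c=320, d=570)
rr_cohort = (80 / 400) / (30 / 600)

print(f"Odds ratio (case-control example): {or_case_control:.2f}")
print(f"Odds ratio (Table 1 data):         {or_cohort:.2f}")
print(f"Relative risk (Table 1 data):      {rr_cohort:.2f}")
```

With the Table 1 counts the odds ratio comes out larger than the relative risk, which is the overestimation described above for diseases that are not rare.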
Advantages of case control study
1. Relatively easy to carry out
2. Rapid and inexpensive (compared with cohort study)
3. Particularly suitable to investigate rare diseases
4. No ethical problem
5. Allows the study of several etiological factors for a single disease, e.g., smoking, physical
activity in myocardial infarction
6. No attrition problems (Loss of individuals during follow-up) because case control studies
do not require follow-up of individuals into the future.
Disadvantages of case control study
1. Recall bias, because exposure information relies on memory or past records.
2. Selection of an appropriate control group may be difficult.
3. We cannot measure incidence rates, so relative risk cannot be calculated.
4. Odds ratio is an estimate of the relative risk only with diseases of low prevalence.
5. "Egg or chicken problem": sometimes it is difficult to ascertain which comes first, the etiologic factor or the disease, especially for non-incident cases (e.g. physical activity and obesity: which one comes first?).
Activity
Chapter (4)
Applied Intervention Studies (clinical trial)
Intended Learning Outcomes:
By the end of this lecture student should be able to:
1. Understand different phases of a clinical trial.
2. Identify the concept of randomization, blinding and the different types of blinding
3. Identify some ethical considerations while conducting clinical trials
4. Calculate measures of treatment effects in clinical trials
Content:
1. Definition of clinical trial.
2. Objectives of clinical trials.
3. Phases of clinical trial.
4. Types of clinical trial
5. Steps of carrying out clinical trial.
6. Ethical issues
Clinical Trials
1. Definition
It is one type of interventional study. It is a prospective study in human beings that assesses the effect of one or more (therapeutic) interventions in a group of patients against a control.
A controlled clinical trial compares the outcomes of a treated group with a comparable group of
patients receiving the control treatment. The intervention being tested is often a drug treatment
but may also be a non-drug treatment such as surgery.
2. Objectives of clinical trial
To discover new treatments for life-threatening diseases.
To discover new ways to detect, diagnose, and reduce the risk of disease.
To help researchers and physicians decide whether the benefits of new treatments outweigh the side effects.
To overcome the problem of drug resistance.
3. Phases of clinical trial
Phase I: Pharmacology and toxicology
First stage of testing in human beings.
Fewer than 30 healthy volunteers are involved in the clinical trial.
Duration of clinical trial: 6-12 months.
The researcher follows up the safety, tolerability, absorption, distribution, metabolism, and excretion of the tested drug in the study group.
Aim of phase I
To determine the maximum tolerated dose (MTD) of the new drug.
Phase II: Initial investigation of treatment effect
It is a therapeutic exploratory trial.
It starts after the completion of phase I and determination of the MTD.
Fewer than 100 patients are involved in the clinical trial.
Duration of clinical trial: 6 months to several years.
Aim of phase II
To determine efficacy and safety of tested drug.
To determine the optimum dose (dose-efficacy relationship, therapeutic dose regimen, duration of therapy, frequency of administration, therapeutic window).
Phase III: Clinical evaluation of treatment
It is a therapeutic confirmatory trial.
It starts after the completion of phase I and phase II.
From hundreds to 3,000 patients are involved in the clinical trial.
Duration of clinical trial: takes a long time, up to 5 years.
Aim of phase III
To compare the efficacy of the tested drug against existing therapy in a larger number of patients.
To assess the overall and relative therapeutic value of the new drug (efficacy and safety).
Phase IV: Post Marketing Surveillance (PMS)
Starts after the end of clinical trial activities (phases I, II and III) and the approval of the drug by the U.S. FDA.
No fixed duration or patient population.
Aim of phase IV
To detect rare and long-term adverse drug reactions and drug interactions during use by patients.
To explore new uses of drugs.
4. Types of clinical trial
Preventive: look for better ways to prevent a disease in people who have never had the disease
or to prevent the disease from returning. Approaches may include medicines, vaccines, or
lifestyle changes.
Screening: test new ways for detecting diseases or health conditions.
Diagnostic: study or compare tests or procedures for diagnosing a particular disease or
condition.
Treatment: test new treatments, new combinations of drugs, or new approaches to surgery or
radiation therapy.
Behavioral: evaluate or compare ways to promote behavioral changes designed to improve
health.
Quality of life (or supportive care trials): explore and measure ways to improve the comfort
and quality of life of people with conditions or illnesses.
Types of clinical trials (in relation to comparison groups)
1. One-arm clinical trial: One group of patients receives the treatment, without a control group. The effect of the treatment is assessed by comparing the state of the participants before and after the new treatment.
2. Two-arm clinical trial: This is the classical clinical trial. It is also called a controlled clinical trial. One group receives the new treatment, while the other group receives the old treatment or a placebo.
Placebo
A placebo is an inert compound randomly allocated to subjects in a clinical trial.
A placebo arm is a true control for an intervention. It allows us to:
Assess the relative effect of the intervention (relative risk).
Assess the risk of adverse events.
Placebo arms are not ethical if there is an established standard treatment/management.
5. Steps of carrying out clinical trial
5.1. The protocol
Clinical trials follow a plan known as a protocol. The protocol is carefully designed to define the
benefits and risks to participants and answer specific research questions. A protocol describes the
following:
The goal of the study.
Who is eligible to take part in the trial.
Protections against risks to participants.
Details about tests, procedures, and treatments.
How long the trial is expected to last.
What information will be gathered.
5.2. Selection of study groups
Researchers follow clinical trial guidelines when deciding who can participate in a study. Factors that allow a person to take part in a clinical trial are called "inclusion criteria". Those that exclude or prevent participation are "exclusion criteria". These criteria are based on factors such as age, gender, the type and stage of a disease, treatment history, and other medical conditions.
Randomization is a statistical procedure by which the participants are allocated into two similar groups, usually called the "study" and "control" groups, to receive or not to receive a new preventive or therapeutic intervention. It is done to make the two groups comparable. Thus, any observed differences in outcome are likely to result from differences in treatment effect. Randomization is an attempt to eliminate "selection bias" and allow for fair comparison.
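One simple way to carry out such an allocation is to shuffle the list of eligible participants and split it in two, so that chance alone decides who enters each group. The Python sketch below illustrates this idea under stated assumptions: the function name randomize, the participant IDs and the seed value are all hypothetical, and real trials often use more elaborate schemes (for example block or stratified randomization) that are not shown here.

```python
import random

def randomize(participants, seed=None):
    """Shuffle the participants and split them into a study and a control group."""
    rng = random.Random(seed)      # a fixed seed makes the allocation reproducible
    shuffled = list(participants)  # copy, so the original list is left unchanged
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"study": shuffled[:half], "control": shuffled[half:]}

# Hypothetical participant IDs P01 ... P20
participant_ids = [f"P{i:02d}" for i in range(1, 21)]
groups = randomize(participant_ids, seed=2022)

print("Study group:  ", sorted(groups["study"]))
print("Control group:", sorted(groups["control"]))
```

Because neither the investigator nor the participant chooses the group, known and unknown characteristics tend to be balanced between the two arms as the sample size grows.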
5.3. Blinding
Importance
▪ Blinding is used to prevent conscious or unconscious bias in the design of a clinical trial
and how it is carried out.
▪ It is used to ensure the objectivity of trial results.
Types of blinding
a. Single blinded trial: The trial is planned so that the participant is not aware whether
he/she belongs to the study or control group.
b. Double blinded trial: The trial is planned so that neither the doctor nor the participant is
aware of the group allocation and the treatment received.
c. Triple blinded trial: This goes one step further. The participant, the investigator, and the person judging the outcome or analyzing the data are all unaware ("blind").
NB: The two drugs should be identical in shape, color, taste and container (if possible).
NB: Unblinded trials are done only under certain conditions, such as surgical procedures where blinding is impossible, or where blinding is not ethically permitted.
Types of bias
a. Participant bias: participants may subjectively feel better or report improvement if they know that they are receiving a new form of treatment.
b. Observer bias: when measuring the outcome of a therapeutic trial, the investigator may be influenced if he knows beforehand the particular therapy that the patient has received.
c. Evaluation bias: the data analyst may subconsciously report the outcome of the trial in favor of the new or old drug.
5.4. Assessment
The final step in a clinical trial is assessment, in terms of positive results (such as reduction in incidence rate or severity of the disease, or increase in survival time) or negative results (such as adverse events) among treated and control groups.
Relative risk (measures the reduced risk of developing the disease after receiving the treatment) = incidence rate in treatment group / incidence rate in placebo or control group.
Number needed to treat (NNT) (the number of patients who need to be treated with the new treatment to obtain one favorable outcome) = 1 / absolute risk reduction, where absolute risk reduction = incidence rate in control group - incidence rate in treatment group.
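The following Python sketch works through these treatment-effect measures on made-up numbers. The trial data are hypothetical (200 patients per arm, with the disease developing in 20 treated and 40 control patients) and the function name trial_measures is our own. The absolute risk reduction (ARR) is taken as the incidence in the control group minus the incidence in the treatment group.

```python
def trial_measures(events_treatment, n_treatment, events_control, n_control):
    """Return relative risk, absolute risk reduction (ARR) and NNT for a two-arm trial."""
    risk_treatment = events_treatment / n_treatment
    risk_control = events_control / n_control
    relative_risk = risk_treatment / risk_control
    arr = risk_control - risk_treatment
    if arr <= 0:
        # NNT is only meaningful when the treatment lowers the risk
        raise ValueError("No risk reduction: NNT is not defined")
    nnt = 1 / arr
    return relative_risk, arr, nnt

# Hypothetical trial: 20/200 events on treatment, 40/200 events on control
rr, arr, nnt = trial_measures(20, 200, 40, 200)

print(f"Relative risk:           {rr:.2f}")
print(f"Absolute risk reduction: {arr:.0%}")
print(f"Number needed to treat:  {nnt:.0f}")
```

In this hypothetical trial the treatment halves the risk (relative risk 0.5), the ARR is 10%, and about 10 patients need to be treated to prevent one case. In practice the NNT is usually rounded up to the next whole patient.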
6. Ethical issues
Stopping rules: The trial should be stopped if severe and unexpected side effects or complications occur, or when the benefit from the intervention becomes evident and undeniable.
Standard care protocol: Should be applied to all participants in both groups
Informed consent: Should be read, agreed upon and signed by each participant.
Fig. 5. Clinical trial flow chart (steps of a clinical trial)
Fig. 6. Summary of Chapter 4
Chapter (5)
Protocol Writing
After identifying and defining the research problem, the researcher must arrange his ideas in order and write them in the form of an experimental plan, or what can be described as a 'Research Protocol'. This is essential, especially for a new researcher, because of the following:
(a) It helps the researcher to organize his ideas in a form that makes it possible to look for flaws and inadequacies, if any.
(b) It provides a list of what must be done and which materials have to be collected as
a preliminary step.
(c) It is a document that can be given to others for comment.
Research protocol must contain the following items
1. Research objective should be clearly stated in a line or two which tells exactly what the
researcher expects to do.
2. The problem to be studied by researcher must be clearly stated so that one may know
what information is to be obtained for solving the problem.
3. Each major concept which researcher wants to measure should be defined in operational
terms in context of the research project.
4. The protocol should contain the method to be used in solving the problem. An overall description of the approach to be adopted is usually given, and assumptions, if any, concerning the method to be used are clearly mentioned in the research protocol.
5. The protocol must also state the details of the techniques to be adopted. For instance, if
interview method is to be used, an account of the nature of the contemplated interview
procedure should be given. Similarly, if tests are to be given, the conditions under which
they are to be administered should be specified along with the nature of instruments to be
used. If public records are to be consulted as sources of data, the fact should be recorded
in the research protocol. Procedure for quantifying data should also be written out in all
details.
6. A clear mention of the population to be studied should be made. If the study happens to
be sample based, the research protocol should state the sampling plan i.e., how the
sample is to be identified. The method of identifying the sample should be such that
generalization from the sample to the original population is feasible.
7. The protocol must also contain the methods to be used in processing the data. Statistical
and other methods to be used must be indicated in the protocol. Such methods should not
be left until the data have been collected. This part of the protocol may be reviewed by
experts in the field, for they can often suggest changes that result in substantial saving of
time and effort.
8. Results of pilot test, if any, should be reported. Time and cost budgets for the research
project should also be prepared and laid down in the protocol itself.
Chapter (6)
Source of Data and Types of Variables
Intended Learning Outcomes:
By the end of this lecture student should be able to:
1. Identify the sources of data.
2. Define variables.
3. Differentiate between a concept and a variable.
4. Identify different types of variables.
5. Recognize the differences between coding, scaling, and scoring.
Content:
I. Sources of data
1. Census.
Definition.
Importance.
2. Registration of births and deaths.
3. Notification.
4. Hospital records.
5. Other health records.
II. Variables
1. Definition of variables.
2. Difference between concept and variable.
3. Types of variables.
4. Coding, scaling, and scoring.
I. Sources of data
The main sources of data are:
1. Census.
2. Registration of births and deaths.
3. Notification of diseases.
4. Hospital records.
Inpatients and outpatients.
5. Other health records.
Mother and child health centers.
Records of school health services.
Records of occupational health units, hospitals, etc.
1. Census
Definition
Census is defined as the instantaneous enumeration or counting of a population at a specified time. In most of the world, a census is taken at regular intervals, usually every 10 years.
In Egypt, the last census was carried out in 2017 and counted about 94.8 million residents.
Importance
1. Estimates the total population size.
2. Provides features of the population regarding age, sex distribution, occupation, socioeconomic classes, etc.
3. Provides the necessary denominator for calculating vital statistics such as birth and death rates.
4. Important in strategic planning.
2. Registration of births and deaths
Births
Registration of births is compulsory in most countries. In Egypt, births are to be notified within 10 days of occurrence. Further, before admission of a child to school, production of a birth certificate is mandatory. In developed countries, birth certificates contain a lot of information useful to the epidemiologist, such as birth weight, congenital malformations, complications during the mother's pregnancy, and blood group. The more information recorded, the greater its usefulness.
Deaths
Deaths are to be notified in Egypt within 24 hours. The cause of death must be medically certified. The death certificate is the foundation of modern epidemiology. Death certificates also tell us about the frequency and distribution of many diseases.
The cause of death and the age at death are the most important items in this certificate; they have to be correctly recorded for national and international comparison.
The internationally agreed form of death certificate is known as the 'international death certificate' and is recommended by the WHO.
3. Notification
Notification was first introduced for the control of infectious diseases. It is a valuable source of information regarding the incidence of certain specific diseases in the community. Lists of notifiable diseases vary from country to country. Usually, diseases that are considered to be serious menaces to public health are included in the list of notifiable diseases. This list can be found in the statistical report of the Ministry of Health.
Notification has the following limitations
1. It covers only a small part of the total sickness in the community.
2. Many cases (atypical cases, subclinical cases) escape notification.
3. It is not uniform throughout the world.
Despite the above limitations, notification provides valuable information about disease frequency and distribution. It also provides early warning of epidemics.
4. Hospital records
They are a basic and primary source of information about diseases prevalent in the community.
The main disadvantages of hospital records
1. They are highly selective (i.e. mild cases may not go to the hospital).
2. The population served by the hospital (population at risk) cannot be defined. That is, hospital statistics provide only the numerator, but not the denominator.
5. Other health records
A lot of information is also found in the records of mother and child health centers, school health services, occupational health services, etc. Certain diseases (e.g. leprosy, cancer, T.B.) are recorded in many countries where they are common.
II. Variables
1. Definition
A variable can be defined as a quality, property, or characteristic of persons, things, or situations that changes or varies, and that can be measured in a research study. A variable is a property that takes on different values.
It is also defined as any characteristic, number, or quantity that can be measured or counted. A variable may also be called a data item.
2. Difference between a concept and a variable
Concepts are subjective, while variables are objective.
Concepts can't be measured directly, whereas variables can be measured (very important difference).
The understanding of a concept differs from person to person, whereas a variable is specific.
Examples of concepts are effectiveness, satisfaction, and sadness, while examples of variables are age, height and weight.
N.B. Concepts can be converted to variables so that they can be analyzed.
Example illustrating the change of concept to variable:
1. Concept: Rich/poor.
2. Indicator: Income/value of assets.
3. Change to variable: Total income per year/total value of cars and homes.
4. Measure: Consider a person rich when income is more than X per year and poor when it is less than X per year.
3. Types of variables
Variables can be classified in different ways:
3.1. Qualitative and quantitative (according to measurement scale).
3.2. Dependent, independent, and extraneous (according to causal relationship).
Fig. 7. Different types of variables
3.1. Qualitative and quantitative
3.1.1. Qualitative or Categorical variables
Definition
A qualitative variable takes non-numerical values. Its values describe a 'quality' or 'characteristic' of a data unit, like 'what type' or 'which category'. For example, when asked about blood group, there are four possible mutually exclusive answers (A, B, AB and O) and the individual will choose the one that applies to him. Mutually exclusive = cannot occur together.
Categorical variables can be classified into:
A. Ordinal variable: Observations can take a value that can be logically ordered or ranked. The categories associated with ordinal variables can be ranked higher or lower than one another, but do not necessarily establish a numeric difference between categories. Examples of ordinal categorical variables include academic grades (e.g. A, B, C), clothing size (e.g. small, medium, large, extra-large) and attitudes (e.g. strongly agree, agree, disagree, strongly disagree).
B. Nominal variable: Observations can take a value that cannot be organized in a logical sequence. Examples of nominal categorical variables include sex, business type, eye color, and religion.
3.1.2. Quantitative or numeric variables
Definition
They have values that describe a measurable quantity as a number, like 'how many' or 'how much'. Numeric variables are therefore quantitative variables. Examples include the number of children per family, number of molar teeth in the mouth, number of beds in a hospital, number of fingers per hand and RBC count.
Numeric variables can be classified into:
A. Continuous variables: Observations are measurements and can take any value within a range, including fractions. Examples include temperature, systolic blood pressure, age, height, and fasting blood sugar.
B. Discrete variables: Observations are counts and take whole-number values. A discrete variable cannot take a fractional value between one value and the next closest value. Examples of discrete variables include the number of registered cars, number of factories in certain locations, and number of children in a family.
3.2. Independent, dependent, and extraneous
In research terminology, variables that bring about change are called independent variables, outcome/effect variables are called dependent variables, and unmeasured variables that may affect the relationship are called extraneous variables.
A. Independent (cause): the cause/risk factor supposed to be responsible for bringing about change(s) in an outcome.
B. Dependent (effect/outcome): the effect brought about by the independent variable.
C. Extraneous (confounding): A variable that is associated with both the problem and the possible cause of the problem. It may either strengthen or weaken the apparent relationship between an outcome and its possible cause.
Example: Age and height relationship. The independent variable is age, while the dependent variable is height.
In a survey to study the relationship between mothers' cigarette smoking and the weight of their newborns, the independent variable is the mother's smoking habit, while the dependent variable is the newborn's weight. Extraneous variables may include number of cigarettes smoked, diet, age, exercise, etc. All the variables that might affect this relationship either positively or negatively are extraneous variables.
4. Coding, scaling, and scoring
Coding process
It is important in statistical analysis. Computer statistical programs deal better with numbers (quantitative data), so qualitative data should first go through a process of coding. Coding = giving numeric codes to the different categories of the variable.
Example: Gender: Male = 1 and Female = 2.
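The coding step can be sketched in Python with a simple dictionary that maps each category to its code. The gender codes are those given above; the list of answers is hypothetical and the variable names are our own.

```python
from collections import Counter

# Codes from the example above: Male = 1, Female = 2
GENDER_CODES = {"Male": 1, "Female": 2}

# Hypothetical answers collected from five respondents
answers = ["Female", "Male", "Female", "Female", "Male"]

# Replace each category label with its numeric code
coded = [GENDER_CODES[answer] for answer in answers]

# The codes are labels, not quantities: count them, do not average them
frequencies = Counter(coded)

print("Coded data: ", coded)
print("Frequencies:", dict(frequencies))
```

Note that the numbers are only labels for a nominal variable. A "mean gender" has no meaning, so coded categories are summarized with counts and percentages.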
Scaling
Likert Scale
A Likert Scale is a type of rating scale used to measure attitudes, preferences, and subjective
reactions.
Example: Family planning is a good practice: Strongly disagree = 1, Disagree = 2, Neither
agree nor disagree = 3, Agree = 4, Strongly agree = 5.
Four to seven response options are usually used in the scale. Dozens of variations are possible on themes like agreement, frequency, quality and importance, for example:
- Agreement: Strongly agree to strongly disagree.
- Frequency: Often to never.
- Quality: Very good to very bad.
- Importance: Very important to unimportant.
Visual Analogue Scale (VAS)
A measurement instrument for a characteristic or attitude that ranges across a continuum of values and cannot easily be measured directly. It is often used in epidemiologic and clinical research to measure the intensity or frequency of various symptoms. For example, the degree of pain that a patient feels ranges from none to severe pain.
The VAS can be treated as a continuous quantitative variable, or it can be coded into categories (none, mild, moderate, etc.), i.e., an ordinal qualitative variable.
Scoring
Item responses may be summed to create a score for a group of items.
Example
A Patient Satisfaction Questionnaire is usually filled in after the client has received the medical service and is used to evaluate the quality of the health services provided by the institution. The questionnaire includes a list of questions in the form of a Likert scale. The summed score of all the questions reflects the level of client satisfaction.
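As a minimal sketch of scoring, the Python lines below sum Likert responses into a total score. The five response codes follow the Likert example given earlier; the four answers are hypothetical and do not come from any real questionnaire.

```python
# Likert codes as in the agreement example (1 = strongly disagree ... 5 = strongly agree).
LIKERT = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neither agree nor disagree": 3,
    "Agree": 4,
    "Strongly agree": 5,
}


def total_score(answers):
    """Sum the numeric codes of all answered items."""
    return sum(LIKERT[answer] for answer in answers)


# Hypothetical answers of one client to a four-item questionnaire.
answers = ["Agree", "Strongly agree", "Neither agree nor disagree", "Agree"]
score = total_score(answers)
max_score = 5 * len(answers)
print(score, "out of", max_score)  # 16 out of 20
```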
Chapter (7)
Data Presentation
Intended Learning Outcomes:
By the end of this lecture, the student should be able to:
1. Determine different means of data presentation.
2. Select the suitable means of data presentation for each type of variable.
3. Construct proper tables and graphs.
Content:
1. Studying proper tables and graphs.
2. Creating intervals in the tables.
3. Calculating percentages from the row and/or column totals.
Data Presentation
The large volume of data collected during research must be organized and presented in a suitable format so that it provides the information needed for decision making. Three formats of data presentation are available: tables, graphs and numbers (summary statistics).
1. Tables
Tables are a suitable method for data presentation.
Aim: to arrange the data in a simple, concise and readable form.
Characteristics:
1. The table should be self-explanatory.
2. Tables can be used for both quantitative and qualitative data.
3. The heading and the columns should be clearly defined, with units of measurement.
4. Column and/or row totals should be calculated.
5. The length of the table should be suitable.
6. Any explanation of the table should be placed under it as a footnote.
7. Class intervals should, as far as possible, be of the same width, except for intervals containing zeros.
Types of tables
A. Frequency distribution table
A frequency distribution table consists of two columns: the first lists the classes of the classifying variable (which may be categorical or categories of a continuous variable) and the second gives the number of observations belonging to each class. A third column may contain the percentages.
Table 2: Frequency distribution of social class among women
Social class Frequency (No.) %
Low social class 10 40.0
Middle social class 6 24.0
High social class 9 36.0
Total 25 100.0
Table 2 shows that the largest group of these women (40%) belongs to the low social class.
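The construction of a frequency distribution table can be sketched in Python. The counts are those of Table 2; the function name and the output layout are illustrative choices, not part of the original notes.

```python
# Counts taken from Table 2 (social class among 25 women).
social_class_counts = {
    "Low social class": 10,
    "Middle social class": 6,
    "High social class": 9,
}


def frequency_table(counts):
    """Return rows of (class, frequency, percentage) plus a total row."""
    total = sum(counts.values())
    rows = [(label, n, round(100 * n / total, 1)) for label, n in counts.items()]
    rows.append(("Total", total, 100.0))
    return rows


table = frequency_table(social_class_counts)
for label, n, pct in table:
    print(f"{label:<22}{n:>5}{pct:>8.1f}")
```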
B. Contingency table
It is used to examine the relationship between two categorical variables. Table 3 presents a cross-tabulation of two categorical variables: women are classified by socioeconomic class into three categories and by the presence of anemia into two categories. The result shows that 8 women were anemic and belonged to the low social class; they represent 80% of the women of the low social class. Similarly, 33.3% of the women of the high social class were not anemic.
Table 3: Frequency distribution of anemia in women according to social class
Social class           Anemia, no. (%)    No anemia, no. (%)    Total, no. (%)
Low social class       8 (80.0)           2 (20.0)              10 (40.0)
Middle social class    3 (50.0)           3 (50.0)              6 (24.0)
High social class      6 (66.7)           3 (33.3)              9 (36.0)
Total                  17 (68.0)          8 (32.0)              25 (100.0)
Percentages in the Anemia and No anemia columns are taken from the row totals; percentages in the Total column are taken from the grand total.
Cross-tabulation can be done for more than two variables. The relationship between social class and anemia can be examined separately in rural and urban regions, giving a so-called three-way contingency table. A four-way contingency table can be constructed from socioeconomic status, the presence of anemia, urban-rural residency and Upper-Lower Egypt residency, and so on.
Table 4 presents a cross-tabulation of two variables: age, transformed into an ordinal qualitative variable by grouping into three groups, and the presence of anemia, a binary variable.
Table 4: Age distribution of women with or without anemia
                       Anemia           No anemia        Total
Age group (years)      No.     %        No.     %        No.     %
25-                    2       11.8     6       75.0     8       32.0
30-                    10      58.8     1       12.5     11      44.0
35-49                  5       29.4     1       12.5     6       24.0
Total                  17      68.0     8       32.0     25      100.0
Creating the intervals
Uses: to change a quantitative variable into groups.
Intervals should, as far as possible, be of the same width, except for intervals containing zeros. Five to twelve intervals should be enough in most cases.
An interval may have an open end, e.g. in Table 4, "25-" means women aged from 25 years up to any age below 30, the beginning of the next interval. Alternatively, the interval may have an open beginning, i.e. "-29" means ages from the end of the previous interval up to 29 years.
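Grouping a quantitative variable into intervals can be sketched as follows. The interval limits are those of the age table (25-, 30-, 35-49); the list of ages is made up for illustration, and the use of `bisect_right` is just one convenient way to find the interval.

```python
from bisect import bisect_right
from collections import Counter

# Lower limits and labels of the age intervals used in the age table.
LOWER_BOUNDS = [25, 30, 35]
LABELS = ["25-", "30-", "35-49"]
UPPER_LIMIT = 49


def age_group(age):
    """Return the label of the interval that contains the given age."""
    if not LOWER_BOUNDS[0] <= age <= UPPER_LIMIT:
        raise ValueError(f"age {age} is outside 25 to 49 years")
    # bisect_right counts how many lower limits are <= age.
    return LABELS[bisect_right(LOWER_BOUNDS, age) - 1]


# Made-up ages, for illustration only.
ages = [25, 29, 30, 34, 35, 49]
groups = Counter(age_group(a) for a in ages)
print(dict(groups))  # {'25-': 2, '30-': 2, '35-49': 2}
```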
Percentages
Percentages can be calculated from the row total or the column total, with different meanings. In Table 4, the percentages are taken from the column totals, so we can say that about 59% of women with anemia are aged between 30 and 34. In Table 3, 80% of women from the low social class had anemia; the percentage was taken from the row total. However, depending on the design of the study, the percentage in a certain direction may have no meaning at all.
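The difference between row and column percentages can be illustrated with the counts of the social class and anemia cross-tabulation. The sketch below is only one way to do the arithmetic; the function and variable names are illustrative.

```python
# Counts of (anemia, no anemia) per social class, from the contingency table.
anemia_counts = {
    "Low": (8, 2),
    "Middle": (3, 3),
    "High": (6, 3),
}


def row_percent(counts):
    """Percentage of each cell out of its row (social class) total."""
    return {
        label: tuple(round(100 * v / sum(row), 1) for v in row)
        for label, row in counts.items()
    }


def column_percent(counts):
    """Percentage of each cell out of its column (anemia status) total."""
    col_totals = [sum(col) for col in zip(*counts.values())]
    return {
        label: tuple(round(100 * v / t, 1) for v, t in zip(row, col_totals))
        for label, row in counts.items()
    }


print(row_percent(anemia_counts)["Low"])     # (80.0, 20.0)
print(column_percent(anemia_counts)["Low"])  # (47.1, 25.0)
```

Read by row, 80% of low social class women are anemic; read by column, 47.1% of anemic women belong to the low social class. The same cell answers two different questions.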
2. Graphs
Aim: Graphs are more capable of gaining attention, stressing a certain phenomenon and
giving a quick idea about the general situation.
Characteristics: Graphs should be accurate, simple, clear and well designed.
Types of graphs: According to the type of variable, graphs are classified into:
A. Categorical data:
1. Bar chart: the frequencies or the relative frequencies of the different groups are represented by rectangles of the same width, based on the x-axis. The heights of these rectangles, measured on the y-axis, are proportional to the frequencies or relative frequencies of the groups.
Types of bar charts:
Simple bar chart (relative frequency of one categorical variable).
Fig. 8. Simple bar chart illustrating the causes of renal failure in hemodialysis patients
[Figure: x-axis, causes of renal failure; y-axis, percentage. Hypertension 44%, DM 38%, Infection 14%, Others 4%.]
Composite or compound bar chart (cross-tabulation of two categorical variables).
Fig. 9. Composite bar chart illustrating the relation between education level and residence
[Figure: x-axis, level of education in a rural and urban sample of working women, with one bar per residence group. Rural: Never at school 60, Primary 20, Preparatory 10, Secondary 8, University 2. Urban: Never at school 35, Primary 25, Preparatory 10, Secondary 15, University 15.]
Stacked bar chart (either a single variable or cross-tabulation of two or more variables).
In the stacked bar chart, there is one column for each category of one categorical variable, representing 100%, which is then divided into portions according to the categories of the other categorical variable.
Fig. 10. Stacked bar chart illustrating levels of education in urban and rural women
[Figure: one 100% column each for Rural and Urban, divided by level of education in a rural and urban sample of working women; y-axis, 0% to 100%. Segment values (Rural / Urban): Never at school 60 / 35, Primary 20 / 25, Preparatory 10 / 10, Secondary 8 / 15, University 2 / 15.]
2. Pie chart: a circle whose area represents the total frequency, subdivided into segments proportional to the different categories.
Fig. 11. Pie chart illustrating the prevalence of different eye diseases
B. Quantitative data
1. Histogram: an area representation of the relative frequency of a variable using rectangles adjacent to each other (Fig. 12). The width of each rectangle equals the width of the corresponding interval, and the area of each rectangle represents the relative frequency in that interval.
2. Frequency polygon: a line connecting the midpoints of the tops of the rectangles of the histogram (Fig. 12).
Fig. 12. Shows histogram and frequency polygon for distribution of student
3. Other graphical presentations: such as the ogive and the scatter diagram (discussed later).
Fig. 13. Shows ogive chart. Fig. 14. Shows scatter diagram.
Fig. 15. Summary of the different types of data presentation
Presentation of data
1. Tables
   - Frequency distribution table
   - Contingency table (cross-tabulation)
2. Graphs
   - Categorical data
       Bar chart: simple, composite, stacked
       Pie chart
   - Quantitative data
       Histogram, frequency polygon
       Others: ogive, scatter diagram
Chapter (8)
Descriptive Statistics
Intended Learning Outcomes:
By the end of this lecture, the student should be able to:
1. Identify qualitative and quantitative variables.
2. Calculate and interpret measures of central tendency.
3. Calculate and interpret measures of dispersion.
4. Differentiate between measures of central tendency and measures of dispersion.
Content:
1. Study qualitative/categorical variables.
(Count data, proportion, ratio, rate).
2. Study quantitative variables.
2.1. Measurement of central tendency.
(Mid-range, mode, median, arithmetic mean).
2.2. Measurement of dispersion.
(Range, deviation from the mean, variance, standard deviation, coefficient of
variation).
1. Qualitative variables
1. Count data
Count data are counts of occurrences in terms of time or space, for example:
- Number of males with HIV infection.
- Number of microscopes in a bacteriological lab.
- Number of patients cured.
2. Proportion
It is a fraction in which the numerator is included in the denominator.
Proportion of males in a class = number of males / (number of males + number of females) = A / (A + B).
Example: There are 50 anemic females in a sample of 150 females aged above 40 years. What is the proportion of anemia in this sample?
Proportion = 50/150 = 0.333 = 33.3%
3. Ratio
It is a fraction in which the numerator is not included in the denominator.
The numerator and denominator of a ratio can be related or unrelated. For example, a ratio can be used to compare the number of males in a population with the number of females.
In epidemiology, ratios are used both as descriptive measures and as analytic tools. As a descriptive measure, a ratio can describe the male-to-female ratio of participants in a study. As an analytic tool, ratios can be calculated for the occurrence of illness, injury, or death between two groups. These ratio measures include the risk ratio (relative risk).
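The distinction between a proportion and a ratio can be sketched with the anemia sample used above (50 anemic females out of 150). The count of non-anemic females (100) is derived from those two figures; the variable names are illustrative.

```python
from fractions import Fraction

anemic = 50
sample_size = 150
non_anemic = sample_size - anemic  # 100, derived from the two figures above

# Proportion: the numerator (anemic) is part of the denominator (whole sample).
proportion = Fraction(anemic, sample_size)

# Ratio: the numerator (anemic) is not part of the denominator (non-anemic).
ratio = Fraction(anemic, non_anemic)

print(f"Proportion of anemia: {float(proportion):.1%}")                     # 33.3%
print(f"Anemic to non-anemic ratio: {ratio.numerator}:{ratio.denominator}")  # 1:2
```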
4. Rate
It is the instantaneous change in one quantity per unit change in another quantity, usually time or space.
There is no upper limit to its value. Example: attack rate of flu = 125 cases per week.
In epidemiology, rates are particularly useful for comparing disease frequency in different
locations, at different times, or among different groups of persons with potentially different
sized populations; that is, a rate is a measure of risk.
For epidemiologists, a rate describes how quickly disease occurs in a population, for
example, 70 new cases of breast cancer per 1,000 women per year. This measure conveys a
sense of the speed with which disease occurs in a population and seems to imply that this
pattern has occurred and will continue to occur for the foreseeable future. This rate is an
incidence rate.
2. Quantitative variables
2.1. Measures of central tendency
Definition: A measure of central tendency is a single value that represents the whole data set. These measures include the midrange, the mode, the median and the mean.
1. Midrange
It is calculated by adding the smallest and largest observations together and dividing the sum by 2.
Interpretation: It represents the average of the two extreme observations.
Advantages: It is easy to calculate.
Disadvantages: It is affected by the presence of extreme values.
Example: The following are the scores of 11 students obtained in an English class:
10 8 14 15 7 3 3 8 12 10 9
From the given example: Midrange = (3 + 15) / 2 = 9.
If the largest value were 31 instead of 15, Midrange = (3 + 31) / 2 = 17, which clearly does not accurately estimate the central tendency of the data.
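A short Python sketch of the midrange, using the 11 scores above, shows how a single extreme value shifts it. The function name is illustrative.

```python
def midrange(values):
    """Average of the smallest and the largest observation."""
    return (min(values) + max(values)) / 2


scores = [10, 8, 14, 15, 7, 3, 3, 8, 12, 10, 9]
# Same scores with the largest value (15) replaced by 31.
scores_with_extreme = [31 if s == 15 else s for s in scores]

print(midrange(scores))               # 9.0
print(midrange(scores_with_extreme))  # 17.0
```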
2. Mode
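The mode is the value that occurs most frequently in a data set.
A data set may have more than one mode (for example, it is bimodal if it has two modes), and some data sets do not have a mode because each value occurs only once.
The mode is rarely used as a summary measure.
In the previous example, the values 3, 8 and 10 each occur twice, so the data set has three modes (3, 8 and 10) rather than a single mode.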
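The modes of the 11 scores can be checked with Python's standard `statistics.multimode` function (available from Python 3.8), which returns every value that shares the highest frequency. In these scores the values 3, 8 and 10 each occur twice.

```python
from collections import Counter
from statistics import multimode

scores = [10, 8, 14, 15, 7, 3, 3, 8, 12, 10, 9]

frequencies = Counter(scores)  # how many times each score occurs
modes = sorted(multimode(scores))

print(frequencies[3], frequencies[8], frequencies[10])  # 2 2 2
print(modes)  # [3, 8, 10]
```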
3. Median
The median is the middle value of an arranged/ordered distribution.
It divides the series into two halves; in one half all items are less than median, whereas in
the other half all items have values higher than median, after arranging the values in an
ascending or descending order.
Steps to calculate the median: rank the observations in either ascending or descending order, then look at the number of observations (n).
1. If n is odd, there is one median, whose rank is (n + 1)/2 (note: this is the rank, not the value; the value found at this rank is the median).
2. If n is even, there are two middle values, whose ranks are n/2 and (n/2 + 1); the average of the values at these two ranks is taken as the median.
Example
What is the median of the following scores: 10 8 14 15 7 3 3 8 12 10 9
Arrange in ascending or descending order:
15 14 12 10 10 9 8 8 7 3 3
Calculate the rank of the median:
rank = (n + 1) / 2 = (11 + 1) / 2 = 6
The value at rank 6 is 9, so the median = 9.
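The ranking steps above can be sketched in Python. The function follows the odd and even rules as stated; the four-value list used for the even case is made up for illustration.

```python
def median_by_rank(values):
    """Median found by ranking the observations, as in the steps above."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        rank = (n + 1) // 2          # one middle rank
        return ordered[rank - 1]     # ranks start at 1, list positions at 0
    lower_rank, upper_rank = n // 2, n // 2 + 1  # two middle ranks
    return (ordered[lower_rank - 1] + ordered[upper_rank - 1]) / 2


scores = [10, 8, 14, 15, 7, 3, 3, 8, 12, 10, 9]
print(median_by_rank(scores))          # 9   (n = 11, rank 6)
print(median_by_rank([3, 7, 8, 10]))   # 7.5 (n = 4, ranks 2 and 3)
```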
Advantages
The median is a measure of location.
The median is not affected by extreme values: if the smallest value is wrongly recorded as smaller, or the largest value as larger (e.g. 31 instead of 15), the value of the median does not change.
Disadvantages
It may be difficult to order a large number of observations by hand; however, computer software has solved this problem.
The median does not use all the values of the data set, so it may not be representative as a summary measure.
4. Arithmetic mean
The arithmetic mean, often simply called the mean or average, of a set of values is calculated by adding up all the values and dividing this sum by the number of values in the set.
This is expressed by the following formula:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
where $\bar{x}$ (pronounced "x bar") signifies the mean; $x_i$ is each value in the data set; $n$ is the number of these values; and $\Sigma$ (the Greek uppercase "sigma") denotes "the sum of". The subscript and superscript on the $\Sigma$ indicate that we sum the values from i = 1 to i = n.
Example: Using the students' scores given above: 10 8 14 15 7 3 3 8 12 10 9
Mean = (10 + 8 + 14 + … + 9) / 11 = 99 / 11 = 9
Advantages
All the values of the data set are included in the calculation of the mean
The mean is the main measure used in inferential statistics.
Disadvantage
It is sensitive to extreme values. For example, replacing 15 by 31 in the above example yields a mean of (10 + 8 + 14 + … + 31) / 11 = 115 / 11 = 10.45.
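The sensitivity of the mean to an extreme value, compared with the median, can be checked with a short Python sketch using the same 11 scores. The `fmean` and `median` functions are from Python's standard `statistics` module.

```python
from statistics import fmean, median

scores = [10, 8, 14, 15, 7, 3, 3, 8, 12, 10, 9]
# Same scores with the largest value (15) replaced by 31.
scores_with_extreme = [31 if s == 15 else s for s in scores]

mean_original = fmean(scores)              # 99 / 11
mean_extreme = fmean(scores_with_extreme)  # 115 / 11

print(mean_original)                # 9.0
print(round(mean_extreme, 2))       # 10.45
print(median(scores), median(scores_with_extreme))  # 9 9
```

The mean moves from 9 to 10.45, while the median stays at 9.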
2.2. Measures of scatter/ Dispersion
1. Range
A simple measure is the range, which is the difference between the largest and smallest observations. As with the midrange, the range is affected by extreme values.
Example: Using the previous example, Range = 15 – 3 = 12.
If 15 is replaced by 31 in the previous example, Range = 31 – 3 = 28.
2. Deviation from the mean
A measure of scatter calculated by finding the differences between each individual observation and the mean, and dividing the sum of the absolute differences by n, where n is the number of observations.
Table 5: Calculation of deviation from the mean
Xi      Xi - X̄      |Xi - X̄|
0.2     -1.475      1.475
0.3     -1.375      1.375
0.6     -1.075      1.075
0.7     -0.975      0.975
0.8     -0.875      0.875
1.5     -0.175      0.175
1.7     0.025       0.025
1.8     0.125       0.125
1.9     0.225       0.225
1.9     0.225       0.225
2.0     0.325       0.325
2.0     0.325       0.325
2.1     0.425       0.425
2.8     1.125       1.125
3.1     1.425       1.425
3.4     1.725       1.725
Sum     0           11.9
The mean of the differences between the individual observations and the mean would be calculated as follows:
$$d = \frac{\sum_{i=1}^{n} (x_i - \bar{x})}{n}$$
But if the differences were added up, the positive values would exactly balance the negative ones and their sum would be zero, so we take the mean of the absolute deviations, $\sum |d_i| / n$, where $d_i = x_i - \bar{x}$.
The absolute value is the value ignoring the sign, so |1.725| = 1.725 and |-1.475| = 1.475.
Example:
Urinary concentration of lead in 16 rural children (µmol/24 h): 0.2, 0.3, 0.6, 0.7, 0.8, 1.5, 1.7, 1.8, 1.9, 1.9, 2.0, 2.0, 2.1, 2.8, 3.1, 3.4.
Mean $\bar{x}$ = 1.675 µmol/24 h.
Mean absolute deviation = $\sum |d_i| / n$ = 11.9 / 16 = 0.744 µmol/24 h.
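The calculation for the lead data can be reproduced with a few lines of Python. This is a plain sketch of the arithmetic above; the variable names are illustrative.

```python
# Urinary lead concentrations of the 16 children (µmol/24 h).
lead = [0.2, 0.3, 0.6, 0.7, 0.8, 1.5, 1.7, 1.8,
        1.9, 1.9, 2.0, 2.0, 2.1, 2.8, 3.1, 3.4]

n = len(lead)
mean = sum(lead) / n
deviations = [x - mean for x in lead]

# Signed deviations cancel out; absolute deviations do not.
sum_signed = sum(deviations)
sum_absolute = sum(abs(d) for d in deviations)
mean_absolute_deviation = sum_absolute / n

print(round(mean, 3))                     # 1.675
print(round(sum_absolute, 1))             # 11.9
print(round(mean_absolute_deviation, 3))  # 0.744
```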
3. The variance
The variance is based on the differences of each observation from the mean of all the observations.
Instead of summing the absolute differences, here we square the differences (to remove the negative sign) and then sum them. The sum of the squares is then divided by the number of observations minus one to give the variance.
Variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
where $\bar{x}$ is the mean.
Example
Urinary concentration of lead in 16 rural children (µmol/24 h) as follows: 0.2, 0.3, 0.6, 0.7, 0.8,
1.5, 1.7, 1.8, 1.9, 1.9, 2.0, 2.0, 2.1, 2.8, 3.1, 3.4.
Table 6: Calculation of variance
Xi      Xi - X̄      (Xi - X̄)²
0.2     -1.475      2.176
0.3     -1.375      1.891
0.6     -1.075      1.156
0.7     -0.975      0.951
0.8     -0.875      0.766
1.5     -0.175      0.031
1.7     0.025       0.001
1.8     0.125       0.016
1.9     0.225       0.051
1.9     0.225       0.051
2.0     0.325       0.106
2.0     0.325       0.106
2.1     0.425       0.181
2.8     1.125       1.266
3.1     1.425       2.031
3.4     1.725       2.976
Sum     0.0         13.750
The calculation of the variance is illustrated in Table 6. The readings are set out in column 1. In column 2, the difference between each reading and the mean ($\bar{x}$) is recorded. In column 3, the differences are squared, and the squares are then summed. The sum of the squares of the differences (or deviations) from the mean, 13.75, is then divided by the total number of observations minus one to give the variance.
s² = 13.75 / 15 = 0.917 (µmol/24 h)².
Why (n-1) as a divider in calculation of variance? The reason for this is that we usually
rely on sample data to estimate the variance of the population. It is shown theoretically
that we obtain a better sample estimate of the population variance if we divide by (n -1).
The units of the variance are the square of the units of the original observations, e.g. if the variable is weight measured in kg, the units of the variance are kg².
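The variance of the lead data can be checked in Python. The sketch computes the sum of squared deviations by hand and compares it with `statistics.variance`, which divides by n - 1, and `statistics.pvariance`, which divides by n; the variable names are illustrative.

```python
from statistics import pvariance, variance

# Urinary lead concentrations of the 16 children (µmol/24 h).
lead = [0.2, 0.3, 0.6, 0.7, 0.8, 1.5, 1.7, 1.8,
        1.9, 1.9, 2.0, 2.0, 2.1, 2.8, 3.1, 3.4]

n = len(lead)
mean = sum(lead) / n
sum_of_squares = sum((x - mean) ** 2 for x in lead)

sample_variance = sum_of_squares / (n - 1)  # divider n - 1, as in the notes
population_variance = sum_of_squares / n    # divider n, shown for comparison

print(round(sum_of_squares, 2))        # 13.75
print(round(sample_variance, 3))       # 0.917
print(round(population_variance, 3))   # 0.859
```

Dividing by n - 1 gives a slightly larger value than dividing by n; the difference shrinks as the sample grows.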
4. Standard deviation
Standard deviation (s) is the square root of the variance. It brings the measurement back to the units we started with, so it is expressed in the same units as the raw data.
In a sample of n observations, it is calculated as:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Using the given example, the variance was calculated above, and its square root gives the standard deviation (SD): s = √0.917 = 0.957 µmol/24 h.
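As a quick check, the standard deviation of the lead data can be obtained in Python with `statistics.stdev`, which uses the n - 1 divider, and compared with the square root of the variance.

```python
from math import sqrt
from statistics import stdev, variance

# Urinary lead concentrations of the 16 children (µmol/24 h).
lead = [0.2, 0.3, 0.6, 0.7, 0.8, 1.5, 1.7, 1.8,
        1.9, 1.9, 2.0, 2.0, 2.1, 2.8, 3.1, 3.4]

sd = stdev(lead)                       # sample standard deviation
sd_from_variance = sqrt(variance(lead))

print(round(sd, 3))  # 0.957
```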
5. Coefficient of variation
If we divide the standard deviation by the mean and express this quotient as a percentage, we
obtain the coefficient of variation.
CV = (standard deviation (s) / mean ($\bar{x}$)) × 100%
It is a measure of variability of the observation around its mean. It is independent of the unit of
measurement.
Example: A group of men 30 to 40 years of age has a mean weight of 80 kg with s of 20 kg, while their heights have a mean of 165 cm with s of 30 cm. Can the variation in weight and height be compared for this group?
Answer: CV weight = 20/80 × 100 = 25%, CV height = 30/165 × 100 = 18%.
We can conclude that the variation in weight is greater than the variation in height.
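The comparison can be sketched in Python using the figures of the example (weight: mean 80 kg, s 20 kg; height: mean 165 cm, s 30 cm). The function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean (unit-free)."""
    return 100 * sd / mean


cv_weight = coefficient_of_variation(sd=20, mean=80)    # kg cancels out
cv_height = coefficient_of_variation(sd=30, mean=165)   # cm cancels out

print(f"CV weight = {cv_weight:.1f}%")  # CV weight = 25.0%
print(f"CV height = {cv_height:.1f}%")  # CV height = 18.2%
```

Because the units cancel, the two percentages can be compared directly even though weight and height are measured on different scales.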
Fig. 16. Illustrating the different types of descriptive statistics
Activity
Chapter (9)
Applied Statistics (Normal Distribution Curve)
Intended Learning Outcomes:
By the end of this lecture, the student should be able to:
1. Understand the properties of a normal distribution curve.
2. Know the practical applications of the standard normal model.
Content:
1. Definition of normal distribution curve
2. Properties of a normal distribution curve.
3. Distribution of data in normal distribution:
4. Practical Applications of the Standard Normal Model.
Normal Distribution Curve
Definition: A normal distribution is an arrangement of a data set in which most values cluster in the middle of the range and the rest taper off symmetrically toward either extreme. A normal distribution is sometimes called the bell curve.
For example, the bell curve is seen in tests. The bulk of students will score the average (C), while
smaller numbers of students will score a B or D. An even smaller percentage of students score an
F or an A. This creates a distribution that resembles a bell (hence the nickname). The bell curve is
symmetrical. Half of the data will fall to the left of the mean; half will fall to the right.
The normal distribution can be used to describe, for example:
Heights of people.
Measurement errors.
Blood pressure.
Points on a test.
IQ scores.
Salaries.
The empirical rule tells you what percentage of your data falls within a certain number
of standard deviations from the mean:
68% of the data falls within one standard deviation of the mean.
95% of the data falls within two standard deviations of the mean.
99.7% of the data falls within three standard deviations of the mean.
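The percentages of the empirical rule can be checked against the standard normal distribution with Python's standard `statistics.NormalDist` class (Python 3.8 or later). This is only a numerical check; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%

# Exact cut-off that leaves 2.5% in each tail (95% in the middle).
cutoff_95 = standard_normal.inv_cdf(0.975)
print(round(cutoff_95, 2))  # 1.96
```

The rule's round figures (68%, 95%, 99.7%) are approximations; the middle 95% lies exactly within about 1.96 standard deviations of the mean.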
The standard deviation controls the spread of the distribution. A smaller standard deviation indicates that the data are tightly clustered around the mean; the normal distribution will be taller and narrower. A larger standard deviation indicates that the data are spread out around the mean; the normal distribution will be flatter and wider.
Properties of a normal distribution
The mean, mode and median are all equal.
The curve is symmetric at the center (i.e. around the mean, μ).
Exactly half of the values are to the left of center and exactly half the values are to the
right.
The total area under the curve is 1.
Distribution of data in normal distribution:
One way of figuring out how data are distributed is to plot them in a graph. If the data are normally distributed, the graph will resemble a bell curve. A bell curve has a small percentage of the points in both tails and the bigger percentage in the inner part of the curve. In the standard normal model, about 5 percent of the data fall into the "tails" (colored darker orange in the original figure) and about 95 percent fall in between. For example, for the test scores of students, the normal distribution would show 2.5 percent of students getting very low scores and 2.5 percent getting very high scores. The rest will be in the middle, neither too high nor too low. The figure in the original notes shows the shape of the standard normal distribution.
Practical Applications of the Standard Normal Model
The standard normal distribution can help you figure out in which subjects you are getting good grades and in which subjects you must exert more effort because of low scores. If your score in one subject is higher than your score in another, you might think that you are better in the subject with the higher score. This is not always true. You can only say that you are better in a particular subject if your score lies a larger number of standard deviations above the mean. The standard deviation tells you how tightly the data are clustered around the mean; it allows you to compare different distributions that have different types of data, including different means.
For example, if you get a score of 90 in math and 95 in English, you might think that you are better in English than in math. However, suppose that in math your score is 2 standard deviations above the mean, while in English it is only one standard deviation above the mean. This tells you that in math your score is far higher than that of most of the students (your score falls into the tail). Based on these data, you actually performed better in math than in English.
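The comparison of the two subjects amounts to converting each score into a number of standard deviations above the mean (a z-score). In the sketch below, the class means and standard deviations are hypothetical values, chosen only so that the math score lies 2 standard deviations above its mean and the English score 1 standard deviation above its mean, as in the example.

```python
def z_score(score, mean, sd):
    """Number of standard deviations a score lies above (+) or below (-) the mean."""
    return (score - mean) / sd


# Hypothetical class statistics, for illustration only.
math_z = z_score(90, mean=70, sd=10)
english_z = z_score(95, mean=85, sd=10)

print(math_z, english_z)  # 2.0 1.0
print("Better relative performance:", "math" if math_z > english_z else "English")
```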