summary of data collection and analysis process
8/8/2019 Summary of Data Collection and Analysis Process
1/31
Data Collection and Analysis:
An Introduction
Presented to:
Prof. Jiri Militky & Prof. Lubos Hes
By
Muhammad Mushtaq Ahmed Mangat
Textile Faculty
Technical University Liberec
Sep 09, 2010
Table of Contents, Tables and Figures
Part One: Statistics Definition and Functions
Descriptive stat
Inferential stat
Populations and samples
Types of samples
Number systems
Data
Variable
Independent variable
Dependent variable
Univariate data
Bivariate Data
Multivariate data
Discrete quantitative data
Continuous Quantitative data
Ordinal
Nominal
Time Series Data
Cross-sectional data
Primary
Secondary
Data Arranging and Presentation
List of data
Data Frequency
Part Two
Frequency Table
Pie Chart
Bar Chart
Area Charts
Line Charts
Dot Plot
Histogram
Histogram with normal curve
Radar Charts
Map chart
Stem and Leaf Plot
Box and Whisker Plot (or Boxplot)
Polygon charts
Range
Arithmetic mean
Geometric mean
Trimmed Mean
Median
Mode
Percentiles
Extremes and quartiles
Variance
Standard deviation (σ)
Variance sum law
Percentile summary
Normal Distribution
Skewed Distribution
Kurtosis
Sampling distribution
Standard error (standard deviation of sampling)
Normal Distribution and central limit theorem
Characteristics of Normal distributions
Binomial distribution
Bivariate and Multivariate analysis
Correlation
Hypotheses Testing
Null hypotheses
The research Hypotheses (alternative Hypotheses)
Two Tail and One Tail Test
Hypotheses Testing Methods
Z Score Test
Types of Z test
Z value calculation
t Test (William Sealy Gosset, 1908)
One sample t test
P value
Correlation
Regression Analysis
Example of regression analysis
Explanation of model
Multinomial logistic regression
Results of Multinomial Logistic Regression
Chi-Square Test
Crosstabs
Part One: Statistics Definition and Functions
Statistics is the art and science of collecting and understanding data. Its main functions are:
1. Gathering
2. Arranging
3. Analyzing
4. Exploring the data
5. Estimating unknown quantities
6. Presenting results
7. Interpreting results
8. Making results available for decisions
9. Designing a plan for data collection
10. Hypothesis testing
Descriptive stat
Descriptive statistics are used to describe the main features of a collection of data in quantitative
terms (en.wikipedia.org/wiki/Descriptive_statistics)
Inferential stat
A statistical inference is a conclusion made on the basis of data which is subject to random variation
of some kind, possibly observation errors or sampling variation
(en.wikipedia.org/wiki/Inferential_statistics)
Populations and samples
The population is the complete set from which a sample is drawn; a sample is a small subset of that larger set.
Types of samples
Random sample, Stratified sample, Quota sample, Purposive sample, Convenience sample
Number systems
Natural: 0, 1, 2, 3, 4, 5, 6, 7, ..., n
Integers: -n, ..., -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, ..., n
Positive integers: 1, 2, 3, 4, 5, ..., n
Rational: a/b where a and b are integers and b is not zero (e.g. 3/4)
Real: the limit of a convergent sequence of rational numbers (e.g. -1.23, 1.234)
Complex: a + bi where a and b are real numbers and i is the square root of -1
Prime numbers: a natural number that has exactly two distinct natural number divisors: 1 and itself
(2, 3, 5, 7, 11)
Irrational numbers: the irrational numbers are precisely those infinite decimals which are not
repeating (e.g. π, which the rational number 22/7 only approximates)
Data
Data refers to any kind of recorded information
Variable
A piece of information recorded for every item is called a variable
Independent variable
A variable that can be manipulated during an experiment
Dependent variable
A variable affected by the manipulation of the independent variable
Univariate data
A data set in which one piece of information has been recorded for each item.
Bivariate Data
Such data sets have exactly two pieces of information recorded for each item
Multivariate data
Such data sets have three or more pieces of information recorded for each item
Discrete quantitative data
A discrete variable can assume values only from a list of specific numbers e.g. number of people,
number of class rooms.
Continuous Quantitative data
A continuous variable can take any value within a range, e.g. weight of students, air temperature
Ordinal
Categories with a meaningful order, e.g. a scale of 1 to 5 where 1 is dull and 5 is fully bright
Nominal
Where there is no meaningful order e.g. name of different departments
Time Series Data
Data recorded in a meaningful sequence e.g. daily report of stock exchange, weekly temperature of
a patient
Cross-sectional data
Data collected at a single point in time, e.g. grades of students in the first term
Primary
Data collected for a specific purpose
Secondary
Previously collected data for another use
Data Arranging and Presentation
List of data
It is the simplest kind of data. It represents some kind of information.
Data Frequency
The frequency of data shows how often the various values occur in the data set. It is normally
presented in the form of a histogram
(source: http://www.stats.gla.ac.uk/steps/glossary/presenting_data.html#freqtab and results of
a Google image search)
Part Two
Central Tendency and Data Spread
Frequency Table
Score   Frequency   Frequency (%)
0       4           13%
1       3           10%
2       5           17%
3       5           17%
4       6           20%
5       7           23%
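As an illustration, a frequency table like the one above can be built from a raw list of scores. The sketch below is only an assumption about how the raw data might look: the list `scores` and the helper `frequency_table` are hypothetical and are chosen to reproduce the counts in the table.

```python
from collections import Counter

# Hypothetical raw scores chosen to reproduce the table above (30 observations).
scores = [0] * 4 + [1] * 3 + [2] * 5 + [3] * 5 + [4] * 6 + [5] * 7


def frequency_table(values):
    """Return (value, count, percent) rows sorted by value."""
    counts = Counter(values)
    total = len(values)
    return [(v, c, 100 * c / total) for v, c in sorted(counts.items())]


table = frequency_table(scores)
print("Score  Frequency  Frequency (%)")
for value, count, percent in table:
    print(f"{value:<6} {count:<10} {percent:.0f}%")
```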
Pie Chart
Bar Chart
Area Charts
Line Charts
Dot Plot
A dot plot is useful for identifying outliers; a line of values is also useful for this purpose.
Histogram
Histogram with normal curve
Radar Charts
Map chart
Stem and Leaf Plot
Box and Whisker Plot (or Boxplot)
Polygon charts
Variability means the extent to which data values differ from each other.
Diversity, dispersion, spread and uncertainty have the same meanings
Quantity                   Population (parameter)    Sample (statistic)
size                       N                         n
mean                       μ (mu)                    x̄ (x bar)
median                     n/a                       M or x̃ (x tilde)
proportion                 π (pi; p in text)         p (p̂ in text)
spread:
variance                   σ² (sigma squared)        s² (s squared) = Σ(x - x̄)²/(n - 1)
standard deviation         σ (sigma)                 s
z score = (x - mean)/sd    Z                         z
correlation coefficient    ρ (rho)                   r
slope                      β1 (beta 1)               b1
intercept                  β0 (beta naught)          b0
For simplicity, this report uses only the population symbols.
Range
Highest value minus smallest value
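Arithmetic mean
$\mu = \frac{\sum x}{N}$
Geometric mean
$G = \left( \prod x \right)^{1/N}$
Trimmed Mean
In this case a fixed proportion of the most extreme values is removed from each end before the mean is calculated, so that outliers do not distort it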
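The three means can be compared on a small data set. The sketch below is a minimal illustration with hypothetical data; `trimmed_mean` is a helper written for this example (it is not a standard library function), and `statistics.geometric_mean` assumes Python 3.8 or later.

```python
import statistics


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each tail
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(kept)


data = [2, 4, 4, 5, 7, 9, 10, 95]  # hypothetical data with one extreme value

arithmetic = statistics.mean(data)           # pulled upward by the value 95
geometric = statistics.geometric_mean(data)  # less sensitive to the extreme value
trimmed = trimmed_mean(data, 0.125)          # drops one value from each end

print(arithmetic, round(geometric, 2), trimmed)
```

With one extreme value in the data, the trimmed mean stays close to the bulk of the observations while the arithmetic mean does not.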
Median
The halfway point of the ordered data set: the value at position (n+1)/2 when n is odd, or the mean
of the two middle values when n is even
Mode
The most common category
Percentiles
Percentiles are summary measures expressing ranks as percentage 0% to 100% rather than 1 to n.
These are used:
To indicate the data value at a given percentage
To indicate the percentage ranking of a given data value
Extremes and quartiles
Extremes: the smallest and largest values
Quartiles: the values at 25% and 75% of the ordered data
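A short sketch of these summary measures using Python's `statistics` module follows. The data are hypothetical, and the choice of `method="inclusive"` for the quartiles is an assumption; other interpolation conventions give slightly different quartile values.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 10, 12]  # hypothetical, already ordered

median = statistics.median(data)  # even n: mean of the two middle values
mode = statistics.mode(data)      # most frequent value
extremes = (min(data), max(data))

# Cut points at 25%, 50% and 75% (Python 3.8+).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print("median:", median, "mode:", mode, "extremes:", extremes)
print("quartiles:", q1, q2, q3)
```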
Variance
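For population
$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
For samples
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
Standard deviation (σ)
It is the square root of the variance and indicates the typical distance of the values from the mean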
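The difference between the population and sample forms is only the divisor (N versus n - 1). A minimal sketch with hypothetical data, using the standard library's `statistics` functions:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean = 5

pop_var = statistics.pvariance(data)    # divides the squared deviations by N
sample_var = statistics.variance(data)  # divides by n - 1
pop_sd = statistics.pstdev(data)        # square root of the population variance

print(pop_var, round(sample_var, 3), pop_sd)
```

The sample variance is always a little larger than the population variance computed on the same values, because it divides the same sum of squares by a smaller number.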
Variance sum law
For independent variables X and Y: $\sigma^2_{X \pm Y} = \sigma^2_X + \sigma^2_Y$
Percentile summary
The value below which a given percentage of the data falls once the values have been ordered from smallest to largest.
Standard Deviation
It indicates how much the numbers differ from one another.
Normal Distribution
It is an idealized, smooth, bell-shaped histogram with all of the randomness removed.
It represents an ideal set that has lots of numbers concentrated in the middle.
It is common for statistical procedures to assume that the data set is reasonably approximated by a
normal distribution. Example with 5 and standard deviation:
Skewed Distribution
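It is not symmetric, and therefore not normal, because data values trail off more sharply on one side than on the
other. Pearson suggested the following equation to measure skewness [1]:
$\text{skew} = \frac{3(\text{mean} - \text{median})}{\sigma}$
A more commonly used equation is now:
$\text{skew} = \frac{\sum (x - \mu)^3}{N \sigma^3}$
[Figure: negatively skewed and positively skewed distributions]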
1 Online Statistics: An Interactive Multimedia Course of Study
Kurtosis
Sampling distribution
It is the distribution of a statistic over all possible samples of a given size from a population. It
depends strongly on the distribution of the population.
Standard error (standard deviation of sampling)
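The mean of the sampling distribution of the mean is equal to the mean of the population:
$\mu_M = \mu$
The variance of the sampling distribution of the mean, for samples of size n, is:
$\sigma_M^2 = \frac{\sigma^2}{n}$
The standard deviation of a sampling distribution is referred to as the standard error of the quantity.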
Normal Distribution and central limit theorem
The means of repeated samples from a population are approximately normally distributed even when the
population itself is not. The larger the sample size, the closer the distribution of sample means is to normal [2].
2 Online Statistics: An Interactive Multimedia Course of Study
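The standard error and the central limit theorem can be checked with a small simulation. The sketch below is an illustration under stated assumptions: the population is taken to be uniform on [0, 1) (clearly not normal), and the sample size, number of repeats and random seed are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

n = 30          # size of each sample
repeats = 2000  # number of samples drawn

# Assumed population: uniform on [0, 1), with mean 0.5 and SD sqrt(1/12).
pop_mean = 0.5
pop_sd = math.sqrt(1 / 12)

sample_means = [
    statistics.mean([random.random() for _ in range(n)])
    for _ in range(repeats)
]

observed_se = statistics.stdev(sample_means)  # spread of the sample means
expected_se = pop_sd / math.sqrt(n)           # population SD / sqrt(n)

print("mean of sample means:", round(statistics.mean(sample_means), 3))
print("observed SE:", round(observed_se, 4), "expected SE:", round(expected_se, 4))
```

The mean of the sample means should sit close to the population mean, and their standard deviation close to the population SD divided by the square root of n.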
The following figures show normal distributions with different means and SDs.
Characteristics of Normal distributions
1. Symmetric around their mean.
2. Mean, median and mode at same point
3. Area under normal curve is 1.00
4. Dense in center and thin at tails
5. Fully described by its mean and SD
6. 68.27% of the data is within 1 SD of the mean
7. 95.45% of the data is within 2 SD of the mean
8. 99.73% of the data is within 3 SD of the mean
9. 95% of the area lies within ±1.96 Z
10. 90% of the area lies within ±1.645 Z
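The areas listed above can be verified numerically. This is a minimal sketch using `statistics.NormalDist` (available from Python 3.8); `central_area` is a helper name introduced only for this illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)


def central_area(z):
    """Area under the standard normal curve between -z and +z."""
    return std_normal.cdf(z) - std_normal.cdf(-z)


for z in (1, 2, 3, 1.645, 1.96):
    print(f"within ±{z} SD: {central_area(z):.4f}")
```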
Binomial distribution
Each trial has only two possible outcomes, and these outcomes are mutually exclusive, for example the head and
tail of a coin.
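For n independent trials with success probability p, the probability of exactly k successes is C(n, k) p^k (1 - p)^(n - k). The sketch below applies this to the coin example; `binomial_pmf` is a helper name introduced for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of exactly 5 heads in 10 tosses of a fair coin.
p_five_heads = binomial_pmf(5, 10, 0.5)
print(round(p_five_heads, 4))

# The probabilities of all possible outcomes add up to 1.
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
```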
Bivariate and Multivariate analysis
Bivariate analysis deals with the association or relationship between two sets of data for two different
variables, whereas multivariate analysis deals with data for more than two variables and their joint
effect. It is used to test hypotheses and to identify the strength of the correlation between variables, or simply
the dependency of one variable on another.
Correlation
1. A causal, complementary, parallel, or reciprocal relationship, especially a structural,
functional, or qualitative correspondence between two comparable entities: a correlation
between drug abuse and crime.
2. Statistics. The simultaneous change in value of two numerically valued random
variables: the positive correlation between cigarette smoking and the incidence of lung
cancer; the negative correlation between age and normal vision.
3. An act of correlating or the condition of being correlated3.
Hypotheses Testing
A statistical hypothesis test, or more briefly a hypothesis test, is an algorithm for choosing between
alternatives (for or against the hypothesis) in a way that minimizes certain risks
Null hypotheses
It is denoted by Ho and represents the default possibility about the population that you will accept
unless you have convincing evidence to the contrary.
The research Hypotheses (alternative Hypotheses)
It is denoted by Ha and will be accepted if there is convincing evidence that rules out the
null hypothesis as a reasonable possibility.
Example:
Ho: a = 0
Ha: a ≠ 0
3 http://www.answers.com/topic/correlation
Two Tail and One Tail Test
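One Tail Test
In this case the researcher claims that the sample mean differs from the population mean in one specified
direction (either greater or lesser).
Two Tail Test
In this case the researcher claims that the sample mean may differ from the population mean in either
direction (greater or lesser).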
Hypotheses Testing Methods
Z test
t Test
p value
Z Score Test
Because of the central limit theorem, many statistical analyses are possible, since the sampling distribution is
approximately normal. Z-tests are appropriate when the sample size is not too small. The Z value expresses the
distance of a value from the mean of a data set in units of standard deviation.
The Z-test is a statistical test where the normal distribution is applied, and it is basically used for dealing with
problems relating to large samples, when n ≥ 30 (http://www.experiment-resources.com/z-test.html#ixzz0zCnm9iX5)
Types of Z test
1. Z test for a single proportion, to test a hypothesis about a specific value of the proportion, Ho: P = Po.
2. Z test for two different groups of data, e.g. drinking habits of males and females.
3. Test of a specific value for a population. It is used when the sample size is > 30 and the standard
deviation is known.
4. Test of variance against a specific value of the population variance.
5. Test of equality of two sets of variables when the sample size is > 30 [4].
Z value calculation
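Formula of Z value:
For a single value: $Z = \frac{x - \mu}{\sigma}$
For the mean of a sample of size n: $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
The Z value is used to find the corresponding P value in the table, or is compared directly with the critical Z
value. If the P value is less than alpha, we reject the null hypothesis.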
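A one-sample Z test for a mean can be sketched as follows. All numbers here are hypothetical, the function name `one_sample_z_test` is introduced only for this illustration, and the population standard deviation is assumed to be known.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, pop_mean, pop_sd, n):
    """Return the Z value and two-tailed P value for a sample mean."""
    z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))
    p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tailed


# Hypothetical example: 36 observations with mean 52, tested against mu = 50, sigma = 6.
z, p = one_sample_z_test(sample_mean=52.0, pop_mean=50.0, pop_sd=6.0, n=36)

alpha = 0.05
print(f"Z = {z:.2f}, P = {p:.4f}, reject Ho: {p < alpha}")
```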
t Test (William Sealy Gosset, 1908)
For:
1. Single sample t test
2. Two independent samples t test
3. Paired groups t test (before treatment and after treatment)
4. Test of whether the slope of a regression line is equal to zero or not.
It is useful for small samples, with fewer than 30 observations.
Assumption:
The data should be normally distributed, which can be checked with a histogram, and the groups should have equal
variances, which can be checked with Levene's test.
One sample t test:
$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$, with n - 1 degrees of freedom
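The one-sample t statistic can be computed with the standard library alone. In this sketch the data and the hypothesized mean `mu0` are hypothetical, and `one_sample_t` is a helper name introduced for illustration. Python's standard library has no t distribution, so the P value would still be read from a t table (or obtained from a statistics package) using the returned degrees of freedom.

```python
import math
import statistics


def one_sample_t(values, mu0):
    """Return the t statistic and degrees of freedom for Ho: mean = mu0."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample SD, divides by n - 1
    t = (mean - mu0) / (s / math.sqrt(n))
    return t, n - 1


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical small sample
t, df = one_sample_t(data, mu0=4)
print(f"t = {t:.3f}, df = {df}")
```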
P value
The P value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming that
the null hypothesis is true. A smaller P value gives stronger grounds for rejecting the null hypothesis.
The most common significance level is 0.05 (95% confidence); however, 0.1 and 0.01 are also used.
4Choudhury, Amit (2009). Z-Test. Retrieved [Date of Retrieval] from Experiment Resources: http://www.experiment-resources.com/z-test.html
Correlation
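1. For parametric statistics (Pearson's product-moment correlation)
2. For nonparametric statistics (Spearman's rank correlation) [5]
The following equation is used to measure the coefficient of correlation [6]:
$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$
Another equation to calculate the correlation coefficient:
$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}$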
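Pearson's correlation coefficient can be computed directly from deviations about the means. The sketch below uses hypothetical paired data; `pearson_r` is a helper written for this illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]  # hypothetical paired observations
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 4))
```

A value near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or no linear relationship.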
Regression Analysis
Regression analysis is a process for finding the best-fit line that explains the relationship between the independent and
dependent variables. It is written as:
Simple regression:
Y = b0 + b1X + ε
Multiple regression:
Y = b0 + b1X1 + b2X2 + b3X3 + ... + bnXn + ε
5 http://www.answers.com/topic/correlation-coefficient
6 Online Statistics: An Interactive Multimedia Course of Study
Where:
b0 = intercept on the Y axis
Y = value of the dependent variable
b1...bn = coefficients of the independent variables
X = independent variable
ε = noise, or the effect of unknown variables (it may be ignored)
Assumption for regression analysis:
1. The sample is truly representative of the population
2. Linearity in the data
3. Existence of homoscedasticity
Equation for regression
Slope of the line:
$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$
Intercept:
$b_0 = \bar{y} - b_1 \bar{x}$
Example taken from http://faculty.uncfsu.edu/dwallace/lesson%2018.pdf
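The least-squares slope and intercept of a simple regression line can be sketched as follows. The data here are hypothetical and are not the data of the cited example; `fit_line` is a helper name introduced for illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for Y = b0 + b1*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


xs = [1, 2, 3, 4, 5]  # hypothetical independent variable
ys = [2, 4, 5, 4, 5]  # hypothetical dependent variable
b0, b1 = fit_line(xs, ys)
print(f"Y = {b0:.2f} + {b1:.2f} X")

predicted = b0 + b1 * 6  # predicted Y for a new X value
```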
Example of regression analysis
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .733a   .537       .512                12.92502                     .537              21.659     3     56    .000
a. Predictors: (Constant), Thermal Conductivity at Dry State (W.m^-1.K^-1), Sample Thickness at Dry State (mm),
Thermal Resistance at Dry State (K.m^2.W^-1)
ANOVA (b)
Model           Sum of Squares   df   Mean Square   F        Sig.
1 Regression    10854.967        3    3618.322      21.659   .000a
  Residual      9355.145         56   167.056
  Total         20210.112        59
a. Predictors: (Constant), Thermal Conductivity at Dry State (W.m^-1.K^-1), Sample Thickness at Dry State (mm),
Thermal Resistance at Dry State (K.m^2.W^-1)
b. Dependent Variable: Thermal Absorptivity at Dry State (W.m^-2.s^1/2.K^-1)
Coefficients (a)
Model                                             B           Std. Error   Beta     t        Sig.   Tolerance   VIF
1 (Constant)                                      -80.481     173.186               -.465    .644
  Thermal Resistance at Dry State (K.m^2.W^-1)    12696.547   10202.604    1.786    1.244    .219   .004        249.080
  Sample Thickness at Dry State (mm)              -334.270    199.663      -1.937   -1.674   .100   .006        161.903
  Thermal Conductivity at Dry State (W.m^-1.K^-1) 6192.067    3260.818     .774     1.899    .063   .050        20.091
B and Std. Error are the unstandardized coefficients, Beta is the standardized coefficient, and Tolerance and VIF are the collinearity statistics.
a. Dependent Variable: Thermal Absorptivity
Explanation of model:
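Adjusted R square = .512 (51.2%) means that about 51.2% of the variation in the dependent variable is explained by
these independent variables. The significant F change (Sig. F Change = .000) shows that the model as a whole is
significant. The unstandardized coefficients (B) are the coefficients of the independent variables in the regression
equation, and the standardized coefficients (Beta) express them on a common scale. Their significance values describe
the significance of each variable in the regression equation; a value less than 0.05 indicates that the variable is
significant. In this example no individual predictor reaches that threshold (Sig. = .219, .100 and .063), and the
very high VIF values indicate strong collinearity among the predictors.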
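The R square, adjusted R square and F values reported above can be recomputed from the sums of squares in the ANOVA table. The sketch below takes its input numbers from that table; the variable names are introduced only for this illustration.

```python
# Values taken from the ANOVA table above.
ss_regression = 10854.967
ss_residual = 9355.145
k = 3        # number of predictors (regression df)
n_obs = 60   # total df (59) + 1

ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total

# Adjusted R square penalizes the number of predictors.
adjusted_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - k - 1)

# F is the ratio of the regression mean square to the residual mean square.
f_value = (ss_regression / k) / (ss_residual / (n_obs - k - 1))

print(round(r_squared, 3), round(adjusted_r_squared, 3), round(f_value, 3))
```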
Multinomial logistic regression [7]
It is used to:
1. Analyze the relationship between a non-metric dependent variable and metric or dichotomous independent
variables
2. Compare multiple groups through a combination of binary logistic regressions
It is used to predict:
1. Coefficients for each of the two comparisons
2. Three equations, one for each group defined by the dependent variable
3. A comparison between predicted group membership and actual group membership, which gives a measure of
classification accuracy
7 Source: www.utexas.edu/.../MultinomialLogisticRegression_BasicRelationships.ppt, SW388R7 Data Analysis & Computers II
Requirements of Multinomial logistic regression analysis:
1. The dependent variable should be non-metric; the independent variables should be metric or
dichotomous
2. Dichotomous, nominal, and ordinal variables satisfy the non-metric requirement
Results of Multinomial Logistic Regression
1. Overall relationship between the independent variables and the groups defined by the
dependent variable
2. The difference in likelihood between the models with and without the independent variables follows a chi-square
distribution and is used for significance testing
Examples:
1. Influence of father's profession and education on occupation preference
2. Effect of food and exercise on a certain disease
3. Selection of brands based on gender and age
Chi-Square Test
The chi-square test is used to find the association between two sets of variables written in the form of a
matrix (a two-way table) [8]:
$\chi^2 = \sum \frac{(O - E)^2}{E}$
Where:
χ² = chi-square value
O = observed frequency
E = expected frequency
Example:
          Short   Tall   Total
Male      24      20     44
Female    36      5      41
Total     60      25     85
8 Source: http://science.jrank.org/pages/1401/Chi-Square-Test.html
Expected values are calculated by using probability rules:
Probability that a person is short: 60/85 = 0.706
Probability that a person is male: 44/85 = 0.518
Probability that a person is male and short: 0.706 * 0.518 = 0.366
Expected frequency of persons who are male and short: 0.366 * 85 = 31.1
(all other expected values can be calculated by the same method)
Observed values   Expected values   (O-E)²/E
24                31.1              1.62
20                12.9              3.91
36                28.9              1.74
5                 12.1              4.17
85                85                χ² = 11.4
Degrees of freedom = (rows - 1)(columns - 1) = (2 - 1)(2 - 1) = 1
Critical value from the chi-square distribution:
For p = 0.05 and 1 degree of freedom, the critical value of χ² is 3.84, whereas our value is 11.4, which is considerably higher.
It shows that we have to reject the null hypothesis of no association and conclude that there is an association
between gender and height.
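The same calculation can be sketched in code. The observed counts come from the two-way table above; the function name `chi_square` is introduced for illustration. Because the code keeps the expected counts unrounded, it gives a chi-square value of about 11.31, slightly below the 11.4 obtained from the rounded intermediate values, and the conclusion is unchanged.

```python
# Observed counts from the table above: rows = male, female; columns = short, tall.
observed = [[24, 20], [36, 5]]


def chi_square(table):
    """Return the chi-square statistic and degrees of freedom of a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected frequency
            chi2 += (o - e) ** 2 / e
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof


chi2, dof = chi_square(observed)
critical_value = 3.84  # chi-square critical value for p = 0.05, 1 degree of freedom
print(f"chi-square = {chi2:.2f}, df = {dof}, reject Ho: {chi2 > critical_value}")
```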
Crosstabs
It is a nonparametric test, used to measure the association between two categories while
controlling for other categories.
Example:
People with high salaries are more likely to go on vacation than people with low
salaries.
The Pearson chi-square and the likelihood-ratio chi-square are most commonly used as tests of significance.
Results from SPSS:
Interpretation of the results: the difference is due to chance, and there is no difference in the services offered by
different stores.