summary of data collection and analysis process


Data Collection and Analysis:

An Introduction

Presented to:

Prof. Jiri Militky & Prof. Lubos Hes

By

Muhammad Mushtaq Ahmed Mangat

Textile Faculty

Technical University Liberec

Sep 09, 2010



Table of Contents, Tables and Figures

Part One: Statistics Definition and Functions
Descriptive statistics
Inferential statistics
Populations and samples
Types of samples
Number systems
Data
Variable
Independent variable
Dependent variable
Univariate data
Bivariate Data
Multivariate data
Discrete quantitative data
Continuous Quantitative data
Ordinal
Nominal
Time Series Data
Cross-sectional data
Primary
Secondary
Data Arranging and Presentation
List of data
Data Frequency

Part Two
Frequency Table
Pie Chart
Bar Chart
Area Charts
Line Charts
Dot Plot
Histogram
Histogram with normal curve
Radar Charts
Map chart
Stem and Leaf Plot
Box and Whisker Plot (or Boxplot)
Polygon charts
Range
Arithmetic mean
Geometric mean
Trimmed Mean
Median
Mode
Percentiles
Extremes and quartiles
Variance
Standard deviation (σ)
Variance sum law
Percentile summary
Normal Distribution
Skewed Distribution
Kurtosis
Sampling distribution
Standard error (standard deviation of sampling)
Normal Distribution and central limit theorem
Characteristics of Normal distributions
Binomial distribution
Bivariate and Multivariate analysis
Correlation
Hypotheses Testing
Null hypotheses
The research Hypotheses (alternative Hypotheses)
Two Tail and One Tail Test
Hypotheses Testing Methods
Z Score Test
Types of Z test
Z value calculation
t Test (William Sealy Gosset, 1908)
One sample t test
P value
Correlation
Regression Analysis
Example of regression analysis
Explanation of model
Multinomial logistic regression
Results of Multinomial Logistic Regression
Chi-Square Test
Crosstabs

Part One: Statistics Definition and Functions

Statistics is the art and science of collecting and understanding data. Its main functions are:

1. Gathering data

2. Arranging data

3. Analyzing data

4. Exploring the data

5. Estimating unknown quantities

6. Presenting results

7. Interpreting results

8. Making results available for decisions

9. Designing a plan for data collection

10. Hypothesis testing

Descriptive statistics

Descriptive statistics are used to describe the main features of a collection of data in quantitative terms (en.wikipedia.org/wiki/Descriptive_statistics)

Inferential statistics

A statistical inference is a conclusion drawn from data that are subject to random variation of some kind, such as observation errors or sampling variation (en.wikipedia.org/wiki/Inferential_statistics)

Populations and samples

A population is the complete set from which a sample is drawn; a sample is a small subset of that larger set.

Types of samples

Random sample, Stratified sample, Quota sample, Purposive sample, Convenience sample
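
As an illustration of the first two sample types, the following minimal sketch draws a simple random sample and a stratified sample from a made-up population. The population, the function names and the 20% sampling fraction are assumptions chosen only for this example; quota, purposive and convenience samples depend on the researcher's judgment and are not shown.

```python
import random

def simple_random_sample(population, k, seed=0):
    """Random sample: every item has the same chance of being selected."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(population, k)

def stratified_sample(population, stratum_of, fraction, seed=0):
    """Stratified sample: draw the same fraction at random from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 30 textile and 10 mechanical students.
students = [(f"student{i}", "mechanical" if i % 4 == 0 else "textile")
            for i in range(40)]

random_pick = simple_random_sample(students, 8)
stratified_pick = stratified_sample(students, lambda s: s[1], 0.2)
print(len(random_pick), len(stratified_pick))
```

The stratified version keeps the 3:1 ratio of the two departments in the sample, which a simple random sample only achieves on average.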

Number systems

Natural: 0, 1, 2, 3, 4, 5, 6, 7, ..., n

Integers: -n, ..., -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, ..., n

Positive integers: 1, 2, 3, 4, 5, ..., n

Rational: a/b where a and b are integers and b is not zero (e.g. 3/4)

Real: the limit of a convergent sequence of rational numbers (e.g. -1.23, 1.234)

Complex: a + bi where a and b are real numbers and i is the square root of -1

Prime numbers: a natural number that has exactly two distinct natural number divisors: 1 and itself (e.g. 2, 3, 5, 7, 11)

Irrational numbers: precisely those numbers whose decimal expansion is infinite and non-repeating (e.g. π; 22/7 is only a rational approximation of π)

Data

Data refers to any kind of recorded information

Variable

A piece of information recorded for every item is called a variable

Independent variable

A variable that can be manipulated during an experiment

Dependent variable

A variable affected by the manipulation of the independent variable

Univariate data

A data set in which one piece of information has been recorded for each item.

Bivariate Data

Such data sets have exactly two pieces of information recorded for each item

Multivariate data

Such data sets have three or more pieces of information recorded for each item

Discrete quantitative data

A discrete variable can assume values only from a list of specific numbers e.g. number of people,

number of class rooms.

Continuous Quantitative data

A continuous variable can take any value within a range, e.g. weight of students, air temperature

Ordinal

The categories have a meaningful order, e.g. a scale from 1 to 5 where 1 is dull and 5 is fully bright

Nominal

The categories have no meaningful order, e.g. names of different departments

Time Series Data

Data recorded in a meaningful time sequence, e.g. daily reports of a stock exchange, weekly temperature of a patient

Cross-sectional data

Data collected at a single point in time, e.g. grades of students in the first term

Primary

Data collected for a specific purpose

Secondary

Previously collected data for another use

Data Arranging and Presentation

List of data

It is the simplest kind of data: a plain list of the recorded values.

Data Frequency

The frequency of data shows how often the various values occur in the data set. It is normally presented in the form of a histogram

(source: http://www.stats.gla.ac.uk/steps/glossary/presenting_data.html#freqtab and results of Google image search)
Part Two

Central Tendency and Data Spread

Frequency Table

Score   Frequency   Frequency (%)
0       4           13%
1       3           10%
2       5           17%
3       5           17%
4       6           20%
5       7           23%
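
The frequency table can be reproduced from a raw list of scores. The sketch below is a minimal illustration: the raw list is reconstructed from the frequencies in the table (30 scores in total), and the variable names are assumptions made for this example.

```python
from collections import Counter

# Raw scores reconstructed from the frequency table: 4 zeros, 3 ones, and so on.
scores = [0] * 4 + [1] * 3 + [2] * 5 + [3] * 5 + [4] * 6 + [5] * 7

counts = Counter(scores)   # how often each score occurs
total = len(scores)

# One row per score: (score, frequency, frequency in percent, rounded).
frequency_table = [
    (score, counts[score], round(100 * counts[score] / total))
    for score in sorted(counts)
]

print("Score Frequency Frequency (%)")
for score, freq, pct in frequency_table:
    print(f"{score:>5} {freq:>9} {pct:>12}%")
```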

Pie Chart

Bar Chart



Area Charts



Line Charts



Dot Plot

Useful for identifying outliers; a line of values is also useful for this purpose.

Histogram



Histogram with normal curve

Radar Charts

Map chart



Stem and Leaf Plot

Box and Whisker Plot (or Boxplot)

Polygon charts



Variability means the extent to which data values differ from each other.

Diversity, dispersion, spread and uncertainty are used with the same meaning.

Quantity                   Population (parameter)    Sample (statistic)
size                       N                         n
mean                       μ (mu)                    x̄ (x bar)
median                     n/a                       M or x̃ (x tilde)
proportion                 π (pi; p in text)         p (p in text)
spread:
variance                   σ² (sigma squared)        s² (s squared) = Σ(x - x̄)²/(n - 1)
standard deviation         σ (sigma)                 s
z-score = (x - mean)/sd    Z                         z
correlation coefficient    ρ (rho)                   r
slope                      β1 (beta 1)               b1
intercept                  β0 (beta naught)          b0

For simplicity, this report uses only the population symbols.

Range

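Range = largest value - smallest value

Arithmetic mean

The sum of all values divided by the number of values: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

Geometric mean

The N-th root of the product of all (positive) values: $\left(\prod_{i=1}^{N} x_i\right)^{1/N}$

Trimmed Mean

A fixed share of the smallest and largest values is removed before the mean is calculated, which reduces the influence of extreme values.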
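
The measures above can be compared on a small data set. This is a minimal sketch using Python's `statistics` module; the data values and the helper `trimmed_mean` (which drops a fixed count `k` from each end) are assumptions made for illustration, and other trimming rules, such as a percentage, are equally common.

```python
import statistics

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k removes every value")
    return statistics.mean(ordered[k:len(ordered) - k])

values = [2, 4, 4, 5, 6, 7, 100]   # hypothetical data with one extreme value

value_range = max(values) - min(values)         # largest minus smallest
arithmetic = statistics.mean(values)            # pulled upwards by the value 100
geometric = statistics.geometric_mean(values)   # requires Python 3.8+
trimmed = trimmed_mean(values, 1)               # ignores 2 and 100

print(value_range, round(arithmetic, 2), round(geometric, 2), trimmed)
```

The single extreme value dominates the arithmetic mean, while the trimmed mean stays close to the bulk of the data.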



Median

The halfway point of the ordered data set. For an odd number of values it is the value at position (n+1)/2; for an even number of values it is the mean of the two middle values.

Mode

The most frequently occurring value or category

Percentiles

Percentiles are summary measures expressing ranks as percentage 0% to 100% rather than 1 to n.

These are used:

To indicate the data value at a given percentage

To indicate the percentage ranking of a given data value

Extremes and quartiles

Extremes are the smallest and the largest values.

Quartiles mark the 25% and 75% points of the ordered data (the median is the 50% point).
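
A short sketch of these position-based summaries, using Python's `statistics` module on made-up data. Note that several conventions exist for interpolating quartiles; `statistics.quantiles` uses its default ("exclusive") method here, so other software may report slightly different quartile values for the same data.

```python
import statistics

data = sorted([9, 5, 3, 12, 7, 5, 10, 6, 8])    # hypothetical data, n = 9 (odd)

median_odd = statistics.median(data)            # value at position (n + 1) / 2
median_even = statistics.median([1, 2, 3, 4])   # mean of the two middle values
mode = statistics.mode(data)                    # most frequent value
extremes = (data[0], data[-1])                  # smallest and largest value

# Quartiles: cut points at 25%, 50% and 75% (Python 3.8+).
q1, q2, q3 = statistics.quantiles(data, n=4)

print(median_odd, median_even, mode, extremes, (q1, q2, q3))
```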

Variance

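For a population: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$

For a sample: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Standard deviation (σ)

It is the square root of the variance and indicates the typical distance of the values from the mean.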
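
The difference between the population and the sample versions of variance and standard deviation can be checked numerically. This is a minimal sketch with made-up data; the `statistics` functions prefixed with `p` divide by N, the others divide by n - 1.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]            # hypothetical data, mean = 5

pop_var = statistics.pvariance(data)       # sum of squared deviations / N
sample_var = statistics.variance(data)     # sum of squared deviations / (n - 1)
pop_sd = statistics.pstdev(data)           # square root of pop_var
sample_sd = statistics.stdev(data)         # square root of sample_var

print(pop_var, round(sample_var, 3), pop_sd, round(sample_sd, 3))
```

The sample version is always slightly larger, because dividing by n - 1 compensates for estimating the mean from the same data.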

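Variance sum law

For two independent variables X and Y, the variance of their sum or difference is the sum of their variances: $\sigma^2_{X \pm Y} = \sigma^2_X + \sigma^2_Y$

Percentile summary

The value below which a given percentage of the data falls once the data have been ordered from smallest to largest.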



Standard Deviation

It indicates how different the numbers are from one another.

Normal Distribution

It is an idealized, smooth, bell-shaped histogram with all of the randomness removed.

It represents an ideal set that has lots of numbers concentrated in the middle.

It is common for statistical procedures to assume that the data set is reasonably approximated by a

normal distribution. Example with 5 and standard deviation:

Skewed Distribution

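It is neither symmetric nor normal, because the data values trail off more sharply on one side than on the other. Pearson suggested the following equation to measure skewness [1]:

$\text{skew} = \frac{3(\text{mean} - \text{median})}{\sigma}$

A more commonly used equation is now:

$\text{skew} = \frac{\sum (X - \mu)^3}{N\sigma^3}$

(Figure: negatively and positively skewed distributions)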

1 Online Statistics: An Interactive Multimedia Course of Study



Kurtosis
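
Skewness and kurtosis can both be computed from the moments of the data. The sketch below is an illustration under stated assumptions: it uses the standard moment-based definitions (third and fourth standardized moments, with 3 subtracted from the kurtosis so that a normal distribution scores 0), which may differ in detail from the formulas in the original figures; the data and the function name are made up.

```python
import statistics

def skew_and_kurtosis(values):
    """Return (Pearson skew, moment-based skew, excess kurtosis)."""
    n = len(values)
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)   # population standard deviation
    pearson_skew = 3 * (mu - statistics.median(values)) / sigma
    moment_skew = sum((x - mu) ** 3 for x in values) / (n * sigma ** 3)
    excess_kurtosis = sum((x - mu) ** 4 for x in values) / (n * sigma ** 4) - 3
    return pearson_skew, moment_skew, excess_kurtosis

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data with a longer right tail
pearson_skew, moment_skew, excess_kurtosis = skew_and_kurtosis(data)
print(pearson_skew, moment_skew, excess_kurtosis)
```

Both skewness measures are positive for this data set, which matches its longer right tail.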

Sampling distribution

It is the distribution of a statistic over all possible samples of a given size from a population. It depends strongly on the distribution of the population.

Standard error (standard deviation of sampling)

The mean of the sampling distribution of the mean is equal to the mean of the population:

$\mu_M = \mu$

The variance of the sampling distribution of the mean, for samples of size n, is:

$\sigma_M^2 = \frac{\sigma^2}{n}$

The standard deviation of the sampling distribution is referred to as the standard error of the quantity: $\sigma_M = \frac{\sigma}{\sqrt{n}}$

Normal Distribution and central limit theorem

The means of repeated samples drawn from a population, which itself need not be normally distributed, are approximately normally distributed. The larger the sample size, the closer the distribution of the sample means is to normal [2].
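
The central limit theorem and the standard error can be illustrated with a small simulation. This is a sketch under assumptions chosen for the example: the population is exponential with mean 1 and standard deviation 1 (clearly not normal), the sample size is 30, and 2000 samples are drawn with a fixed seed so the run is repeatable.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed, only for repeatability
n = 30                    # size of each sample
num_samples = 2000        # number of repeated samples

# Mean of each sample drawn from a skewed (exponential) population.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)   # should be close to mu = 1
sd_of_means = statistics.pstdev(sample_means)    # observed standard error
expected_se = 1.0 / n ** 0.5                     # sigma / sqrt(n)

print(round(mean_of_means, 3), round(sd_of_means, 3), round(expected_se, 3))
```

A histogram of `sample_means` would look roughly bell-shaped even though the individual observations are strongly skewed.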

2 Online Statistics: An Interactive Multimedia Course of Study



The following figures show normal distributions with different means and standard deviations.

Characteristics of Normal distributions

1. Symmetric around their mean.

2. Mean, median and mode at same point

3. Area under normal curve is 1.00

4. Dense in center and thin at tails

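5. Fully described by the mean and SD

6. 68.27% of the data lie within 1 SD of the mean

7. 95.45% of the data lie within 2 SD of the mean

8. 99.73% of the data lie within 3 SD of the mean

9. 95% of the area lies within Z = ±1.96

10. 90% of the area lies within Z = ±1.645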
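
The areas listed for the normal curve can be verified with the error function from Python's `math` module. This is a minimal sketch; the function name `area_within` is an assumption made for this example.

```python
import math

def area_within(z):
    """Area under the standard normal curve between -z and +z."""
    return math.erf(z / math.sqrt(2))

for z in (1, 2, 3, 1.645, 1.96):
    print(f"within +/-{z} SD: {100 * area_within(z):.2f}%")
```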



Binomial distribution

Each trial has exactly two mutually exclusive outcomes, for example the head and tail of a coin, and the trials are independent with a constant probability of success.
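
For the coin example, the probability of a given number of heads follows the binomial formula P(k) = C(n, k) p^k (1 - p)^(n - k). The sketch below evaluates it for 10 flips of a fair coin; the function name and the chosen numbers are assumptions for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 5 heads in 10 flips of a fair coin.
p_five_heads = binomial_pmf(5, 10, 0.5)

# The probabilities over all possible outcomes (0 to 10 heads) add up to 1.
total_probability = sum(binomial_pmf(k, 10, 0.5) for k in range(11))

print(p_five_heads, total_probability)
```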

Bivariate and Multivariate analysis

Bivariate analysis deals with the association or relationship between two variables, whereas multivariate analysis deals with more than two variables and their joint effect. Both are used to test hypotheses and to identify the strength of the correlation between variables, or simply the dependency of one variable on another.

Correlation

1. A causal, complementary, parallel, or reciprocal relationship, especially a structural,

functional, or qualitative correspondence between two comparable entities: a correlation

between drug abuse and crime.

2. Statistics. The simultaneous change in value of two numerically valued random

variables: the positive correlation between cigarette smoking and the incidence of lung

cancer; the negative correlation between age and normal vision.

3. An act of correlating or the condition of being correlated [3].

Hypotheses Testing

A statistical hypothesis test, or more briefly a hypothesis test, is a procedure for deciding between alternatives (for or against the hypothesis) in a way that minimizes certain risks.

Null hypotheses

It is denoted by H0 and represents the default assumption about the population, which is retained unless there is convincing evidence to the contrary.

The research Hypotheses (alternative Hypotheses)

It is denoted by Ha and is accepted if there is convincing evidence that rules out the null hypothesis as a reasonable possibility.

Example:

H0: a = 0

Ha: a ≠ 0

3 http://www.answers.com/topic/correlation

Two Tail and One Tail Test

One-tail test: the researcher claims that the population mean differs from the reference value in one specified direction (greater or less).

Two Tail Test

In this case the researcher claims only that the population mean is different from the reference value (it may be greater or less).

Hypotheses Testing Methods

Z test

t Test

p value

Z Score Test

By the central limit theorem, sample means are approximately normally distributed, which makes many statistical analyses possible. Z-tests are appropriate when the sample size is not too small. The Z score expresses the distance of a value from the mean of a data set in units of standard deviation.

The Z-test is a statistical test based on the normal distribution and is mainly used for problems involving large samples, n ≥ 30 (http://www.experiment-resources.com/z-test.html#ixzz0zCnm9iX5)

Types of Z test

1. Z test for a single proportion, to test a hypothesis about a specific value of the proportion, H0: P = P0.

2. Z test for the difference between two groups of data, e.g. drinking habits of males and females.

3. Z test for a specific value of the population mean. It is used when the sample size is > 30 and the standard deviation is known.

4. Z test for a specific value of the population variance.

5. Z test for the equality of the variances of two populations when each sample size is > 30 [4].

Z value calculation

Formula of the Z value for a sample mean:

$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

The Z value is used to look up the corresponding p value in a table (or is compared directly with the critical Z value). If the p value is less than alpha, we reject the null hypothesis.
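
The Z value and its p value can be computed without a table. The following is a minimal sketch of a one-sample Z test for a mean with known population standard deviation; the function name, the two-sided alternative and all numbers are assumptions chosen for illustration.

```python
import math

def z_test(sample_mean, mu0, sigma, n):
    """One-sample Z test for a mean; returns (z, two-sided p value)."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Two-sided p value: area in both tails beyond |z| under the normal curve.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

# Hypothetical example: sample of 100 with mean 52, tested against mu0 = 50.
z, p = z_test(sample_mean=52.0, mu0=50.0, sigma=10.0, n=100)
alpha = 0.05
reject_null = p < alpha

print(round(z, 3), round(p, 4), reject_null)
```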

t Test (William Sealy Gosset, 1908)

For:

1. Single sample t test

2. Two independent samples t test

3. Compared groups t test (before treatment and after treatment)

4. Testing whether the slope of a regression line is equal to zero or not.

Useful for small samples (fewer than 30 observations).

Assumption:

The data should be normally distributed, which can be checked with a histogram, and the groups should have equal variances, which can be checked with Levene's test.

One sample t test:

$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$, with n - 1 degrees of freedom
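
A one-sample t statistic compares the sample mean with a hypothesized value, using the sample standard deviation. The sketch below computes it for made-up data and compares it with the tabulated two-tailed critical value for 7 degrees of freedom at the 0.05 level (2.365); the function name and the data are assumptions for illustration, and a statistics package would normally also report the exact p value.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for a one-sample t test."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)   # s / sqrt(n)
    t = (statistics.fmean(sample) - mu0) / standard_error
    return t, n - 1

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical measurements, mean = 5
t, df = one_sample_t(sample, mu0=4)

T_CRITICAL_DF7 = 2.365   # two-tailed, alpha = 0.05, df = 7, from a t table
reject_null = abs(t) > T_CRITICAL_DF7

print(round(t, 3), df, reject_null)
```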

P value

The p value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming that the null hypothesis is true. A small p value is evidence against the null hypothesis.

The most common significance level is 0.05 (95% confidence); however, 0.1 and 0.01 are also used.

4 Choudhury, Amit (2009). Z-Test. Retrieved [Date of Retrieval] from Experiment Resources: http://www.experiment-resources.com/z-test.html


Correlation

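1. For parametric statistics (Pearson's product-moment correlation)

2. For nonparametric statistics (Spearman's rank correlation) [5]

The following equation is used to measure the coefficient of correlation [6], where x and y are deviations from the respective means:

$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$

Another equation to calculate the correlation coefficient, from the raw scores X and Y:

$r = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{N}\right)}}$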
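
Pearson's correlation coefficient can be computed either from deviations from the means or directly from the raw scores; both routes give the same value. This is a minimal sketch with made-up data, and the two function names are assumptions for illustration.

```python
import math

def pearson_r_deviation(xs, ys):
    """r from deviation scores: sum(xy) / sqrt(sum(x^2) * sum(y^2))."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    return sum_xy / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def pearson_r_raw(xs, ys):
    """r from raw scores, without computing deviations first."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    numerator = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    var_x = sum(x * x for x in xs) - sum_x ** 2 / n
    var_y = sum(y * y for y in ys) - sum_y ** 2 / n
    return numerator / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]   # hypothetical paired observations
ys = [2, 4, 5, 4, 5]

r_dev = pearson_r_deviation(xs, ys)
r_raw = pearson_r_raw(xs, ys)
print(round(r_dev, 4), round(r_raw, 4))
```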

Regression Analysis

Regression analysis is a process to find the best-fit line that explains the relationship between the independent and dependent variables. It is written as:

Simple regression:

Y = b0 + b1X + ε

Multiple regression:

Y = b0 + b1X1 + b2X2 + b3X3 + ... + bnXn + ε

5 http://www.answers.com/topic/correlation-coefficient

6 Online Statistics: An Interactive Multimedia Course of Study

Where:

b0 = intercept on the Y axis

Y = value of the dependent variable

b1 ... bn = coefficients of the independent variables

X = independent variable

ε = noise or effect of unknown variables (it may be ignored)

Assumptions for regression analysis:

1. The sample is truly representative of the population

2. Linearity in the data

3. Existence of homoscedasticity

Equation for regression

Slope of the line:

$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$

For the intercept:

$b_0 = \bar{y} - b_1 \bar{x}$

Example taken from http://faculty.uncfsu.edu/dwallace/lesson%2018.pdf
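
Independently of the cited example, the slope and intercept of a simple regression line can be computed with the usual least-squares formulas (slope = sum of cross-products of deviations divided by the sum of squared deviations of X; intercept = mean of Y minus slope times mean of X). The sketch below uses made-up data, and the function name is an assumption for illustration.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) of the least-squares line Y = b0 + b1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation of X.
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1

xs = [1, 2, 3, 4, 5]   # hypothetical independent variable
ys = [2, 4, 5, 4, 5]   # hypothetical dependent variable

b0, b1 = least_squares_line(xs, ys)
predicted_at_6 = b0 + b1 * 6
print(round(b0, 3), round(b1, 3), round(predicted_at_6, 3))
```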



Example of regression analysis

Model Summary

(the last five columns are the Change Statistics)

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .733a   .537       .512                12.92502                     .537              21.659     3     56    .000

a. Predictors: (Constant), Thermal Conductivity at Dry State (W·m^-1·K^-1), Sample Thickness at Dry State (mm), Thermal Resistance at Dry State (K·m^2·W^-1)



ANOVA (b)

Model 1       Sum of Squares   df   Mean Square   F        Sig.
Regression    10854.967        3    3618.322      21.659   .000a
Residual      9355.145         56   167.056
Total         20210.112        59

a. Predictors: (Constant), Thermal Conductivity at Dry State (W·m^-1·K^-1), Sample Thickness at Dry State (mm), Thermal Resistance at Dry State (K·m^2·W^-1)

b. Dependent Variable: Thermal Absorptivity at Dry State (W·m^-2·s^1/2·K^-1)



Coefficients (a)

(B and Std. Error are the Unstandardized Coefficients, Beta is the Standardized Coefficient, Tolerance and VIF are the Collinearity Statistics)

Model 1                                            B           Std. Error   Beta     t        Sig.   Tolerance   VIF
(Constant)                                         -80.481     173.186               -.465    .644
Thermal Resistance at Dry State (K·m^2·W^-1)       12696.547   10202.604    1.786    1.244    .219   .004        249.080
Sample Thickness at Dry State (mm)                 -334.270    199.663      -1.937   -1.674   .100   .006        161.903
Thermal Conductivity at Dry State (W·m^-1·K^-1)    6192.067    3260.818     .774     1.899    .063   .050        20.091

a. Dependent Variable: Thermal Absorptivity

Explanation of model:

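Adjusted R square = .512 (51.2%) means that 51.2% of the variation in the dependent variable is explained by these independent variables. The significant F change (Sig. = .000) shows that the model as a whole is significant. The standardized coefficients (Beta) are the coefficients of the independent variables expressed in standard deviation units, so they can be compared with each other. Their significance values (Sig.) describe the significance of each variable in the regression equation: a value below 0.05 indicates that the variable is significant. In this example no single predictor reaches that level (.219, .100 and .063), and the very high VIF values point to strong collinearity among the predictors.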
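
The summary figures of the model can be recomputed from the sums of squares and degrees of freedom in the ANOVA table. The sketch below uses only the numbers reported above; the variable names are assumptions, and the formulas are the standard definitions of R square, adjusted R square, F and the standard error of the estimate.

```python
import math

# Values taken from the ANOVA table of the example.
ss_regression, df_regression = 10854.967, 3
ss_residual, df_residual = 9355.145, 56

ss_total = ss_regression + ss_residual
n = df_regression + df_residual + 1   # number of observations (60)

r_square = ss_regression / ss_total
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - df_regression - 1)
f_value = (ss_regression / df_regression) / (ss_residual / df_residual)
std_error_estimate = math.sqrt(ss_residual / df_residual)

print(round(r_square, 3), round(adjusted_r_square, 3),
      round(f_value, 3), round(std_error_estimate, 3))
```

The recomputed values agree with the Model Summary table (.537, .512, 21.659 and 12.925).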

Multinomial logistic regression [7]

It is used to:

1. Analyze the relationship between a non-metric dependent variable and metric or dichotomous independent variables

2. Compare multiple groups through a combination of binary logistic regressions

For a dependent variable with three groups, it provides:

1. A set of coefficients for each of the two comparisons

2. Three equations, one for each group defined by the dependent variable

7 Source: www.utexas.edu/.../MultinomialLogisticRegression_BasicRelationships.ppt, SW388R7 Data Analysis & Computers II



3. A comparison between predicted group membership and actual group membership, which gives a measure of classification accuracy

Requirements of multinomial logistic regression analysis:

1. The dependent variable should be non-metric; the independent variables should be metric or dichotomous

2. Dichotomous, nominal, and ordinal variables satisfy the requirement for the dependent variable

Results of Multinomial Logistic Regression

1. Overall relationship between the independent variables and the groups defined by the dependent variable

2. The difference in likelihood between the model without and the model with the independent variables follows a chi-square distribution and is used for significance testing

Examples:

1. Influence of the father's profession and education on occupational preference

2. Effect of food and exercise on a certain disease

3. Selection of brands based on gender and age

Chi-Square Test

The chi-square test is used to find an association between two categorical variables whose counts are written in the form of a matrix (two-way table) [8]:

$\chi^2 = \sum \frac{(O - E)^2}{E}$

Where:

χ² = chi-square value

O = observed frequency

E = expected frequency

Example:

          Short   Tall   Total
Male      24      20     44
Female    36      5      41
Total     60      25     85

8 Source: http://science.jrank.org/pages/1401/Chi-Square-Test.html

Expected values are calculated by using the probability rules, assuming that gender and height are independent:

Probability that a person is short: 60/85 = 0.706

Probability that a person is male: 44/85 = 0.518

Probability that a person is male and short: 0.706 × 0.518 = 0.366

Expected frequency of persons who are male and short: 0.366 × 85 = 31.1

(All other expected values can be calculated in the same way.)

Cell             Observed values   Expected values   (O-E)²/E
Male, short      24                31.1              1.62
Male, tall       20                12.9              3.91
Female, short    36                28.9              1.74
Female, tall     5                 12.1              4.17
Total            85                85                χ² = 11.4

Degrees of freedom = (rows - 1)(columns - 1) = (2 - 1)(2 - 1) = 1

Critical value from the chi-square distribution:

For p = 0.05 and 1 degree of freedom, the value of χ² is 3.84, whereas our value is 11.4, which is much higher.

This shows that we have to reject the null hypothesis of no association and conclude that there is an association between gender and height.
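
The whole calculation can be repeated in a few lines. This is a minimal sketch that uses the observed counts from the example and the tabulated critical value 3.84; the variable names are assumptions. Because it keeps the expected values unrounded, it gives a chi-square value of about 11.3 rather than the 11.4 obtained from expected values rounded to one decimal place.

```python
# Observed counts: rows = male, female; columns = short, tall.
observed = [[24, 20],
            [36, 5]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

CRITICAL_VALUE = 3.84   # chi-square table, p = 0.05, 1 degree of freedom
reject_null = chi_square > CRITICAL_VALUE

print(round(chi_square, 2), degrees_of_freedom, reject_null)
```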

Crosstabs

It is a non-parametric technique used to measure the association between two categorical variables, optionally while controlling for other categorical variables.



Example:

People with high salaries are more likely to go on vacation than people with low salaries.

The Pearson chi-square and the likelihood-ratio chi-square are most commonly used as tests of significance.

Results from SPSS:

Interpretation of the results: the difference is due to chance, and there is no difference in the services offered by different stores.
