Rodar
Dr. M.Subbiah M Sc, M Phil, Ph D, PGDOR, BPSM.
Purpose
The simpler explanation is the preferable one - Occam’s Razor
• To present some key ideas from statistics that have proven to be useful tools for data analysis
• To introduce specific statistical techniques and ideas
• To show these tools by demonstration rather than through rigorous mathematical proofs
D2D Quick Steps
Statistics
• Quite similar techniques
• Commonly use many of the same techniques
• Statistical software includes most of the tools
• In the past,
– Data collected by hand in notebooks
– Sometimes mistakenly recorded
– Not too much data
• But still,
– Worth applying even today (and in the future too)
– Applicable to virtually any area, from Agriculture to Psychology to Astronomy to Business
Computational Power
Allure of Statistics
• The advent of computing power has greatly simplified analysis
• Helps to make sense of large quantities of data that are beyond our ability to handle in raw form
• Statistical software provides
– Computationally feasible algorithms
– Little or no human intervention
– A platform to handle large volumes of data
• What is needed:
– Appropriate tools and methods of interpretation
– Passion to expand one's knowledge horizon
– An interactive learning experience
Computer Software
Inevitable today!
• Ample number of built-in functions
• Able to handle different data types
• Amenable to data transformation / recoding
• Achievable speed / accuracy
• Collaborative platforms
• Possible to write your own functions
Articulation
• Objective of the study
• Data availability
• Data upside / downside
• Usefulness of variables
• Exploring relations - Three eyes of analysis
– Within Data
– Through Data
– Beyond Data
Two Illustrations
3333 cases 21 variables
Variables’ Nature
Variable         Nature    Variable         Nature
State            Nominal   Eve Mins         Metric
Account Length   Nominal   Eve Calls        Count
Area Code        Nominal   Eve Charge       Metric
Phone            Nominal   Night Mins       Metric
Int'l Plan       Binary    Night Calls      Count
VMail Plan       Binary    Night Charge     Metric
VMail Message    Count     Intl Mins        Metric
Day Mins         Metric    Intl Calls       Metric
Day Calls        Count     Intl Charge      Metric
Day Charge       Metric    CustServ Calls   Polytomous
Churn            Binary
[Figure: plot of Day Mins, Day Calls, Eve Mins, Eve Calls and Churn]
Does this bring out every explicit relationship between the variables?
Our Experience
• Educational domain
• Customer behavior – Who is the customer?
• Parlance from marketing research
• Data collected over 3 months
• Cleaning / organizing: 1 month
• Setting objectives: 1 month
– Domain expert + data analyst
Description
Geographical location
Perceived risks
Ability
Parental, teacher and peer encouragement
Socioeconomic status
Significant others' involvement
Personal involvement
Optimum stimulation level
Perceived knowledge
Self-confidence
Motivation
Aspiration
Value for education
Need for cognition
Perceived benefits
College attributes
Perceived costs
Types of sources
Time pressure
External information search
Attitude towards search
Extent of search
Depth of search
Importance of search
[Figure: plot of Perceived knowledge, Self-confidence, Value for education, Perceived benefits and Extent of search]
Does this bring out every explicit relationship between the variables?
Recall - Articulation
• Objective of the study
• Data availability
• Data upside / downside
• Usefulness of variables
• Exploring relations - Three eyes of analysis
– Within Data
– Through Data
– Beyond Data
Data
• Information
– Any fact assumed to be a matter of direct observation
– A series of observations, measurements, or facts
• Numbers
Data - Example
Data Classification
• Continuous
– Metric
– Measurable
• Discrete
– Non-metric
– Count data
– Categorical (2 or more categories)
• Nominal
• Ordinal
• Interval
• Ratio
What is a Variable?
• Simply, something that varies.
• Specifically, variables represent persons or objects that can be manipulated, controlled, or merely measured for the sake of research.
• Variation: how much a variable varies. Those with little or no variation are called constants.
Independent Variables
• These variables are ones that are more or less controlled.
• Scientists manipulate these variables as they see fit.
• They still vary, but the variation is relatively known or taken into account.
• Often there are many in a given study.
Dependent Variables
• Not controlled or manipulated in any way, but simply measured or registered.
• Vary in relation to the independent variables.
• There can be any number of dependent variables.
• Usually there is only one, to isolate the reason for variation.
Independent V. Dependent
Independent variables:
• Intentionally manipulated
• Controlled
• Vary at known rate
• Cause
Dependent variables:
• Intentionally left alone
• Measured
• Vary at unknown rate
• Effect
Number and nature of IV and DV are matters of concern
Example: What affects a student’s arrival to class?
• Type of School – Arts v. Science v. Engineering
• Type of Student – Athlete? Gender? GPA?
• Time – Bedtime, Waking, Arrival
• Mode of Transportation
• Parents’ Education
• Parents’ Occupation
• Students’ Motivation
• ...
Within Data
WHY?
• Precise presentation
• Understand the features of the data
• Set the analysis plan
• Test assumptions for statistical inference
Within Data
• Numeric Form
– Summary Statistics
– Tables
• Pictorial Form
– Charts
– Figures
• Useful for quicker inference
• Tool for eyeball judgment
• Methods differ according to the type of data
Within Data
• For Metric Data
– Box Plot
– Normal Plot (Advanced Users)
– Histogram
Within Data
Experience                                       Statistic
Mean                                             15.5200
95% Confidence Interval for Mean   Lower Bound   14.6464
                                   Upper Bound   16.3936
Median                                           15.0000
Variance                                         39.256
Std. Deviation                                   6.26545
Minimum                                          5.00
Maximum                                          29.00
Range                                            24.00
Interquartile Range                              11.00
Skewness                                         .061
Kurtosis                                         -1.334
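The confidence interval in the summary above can be roughly reconstructed from the mean and the standard deviation. The following is only a sketch under two assumptions that are not stated on the slide: the sample size is taken as n = 200 (the total shown in the frequency tables of this deck), and a normal quantile is used in place of the t quantile that statistical software would normally apply, so the bounds agree only to about two decimals. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(mean, sd, n, level=0.95):
    """Approximate CI for a mean using the normal quantile (large-n shortcut)."""
    se = sd / math.sqrt(n)                      # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% level
    return mean - z * se, mean + z * se

# Mean and SD are read from the summary table; n = 200 is an assumption.
lower, upper = mean_confidence_interval(mean=15.52, sd=6.26545, n=200)
print(f"95% CI for mean: {lower:.2f} to {upper:.2f}")
```

The small gap to the reported bounds (14.6464 and 16.3936) comes from the normal approximation.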
Within Data
Compare with summary
Within Data
• For Non-Metric Data
– Bar
– Line
– Pie
– Area
[Figures: example bar, line, pie and area charts]
Summary for a categorical data
Frequency Percent
HINDU 145 72.5
CHRISTIAN 31 15.5
MUSLIM 24 12.0
Total 200 100.0
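A frequency summary like the one above can be produced with a few lines of code. This is a minimal sketch: the raw column is rebuilt from the published counts because the individual records are not part of the slides, and the function name is hypothetical.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (count, percent)} ordered by descending count."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.most_common()}

# Rebuild a raw column from the counts in the table (records are not given).
religion = ["HINDU"] * 145 + ["CHRISTIAN"] * 31 + ["MUSLIM"] * 24
table = frequency_table(religion)

for category, (count, percent) in table.items():
    print(f"{category:<10} {count:>5} {percent:>6.1f}")
```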
Bivariate Case
Y \ X         Metric                       Categorical
Metric        Scatter plots, Correlation
Categorical   Multiple box plots           Cross tabulation, Multiple bar charts
Scatter Plot
Scatter Matrix
Correlations
             Experience   Age    NT     AD
Experience   1            .926   .309   .337
Age          .926         1      .370   .429
NT           .309         .370   1      .665
AD           .337         .429   .665   1
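Each entry of a correlation matrix like the one above is a Pearson coefficient for one pair of variables. The sketch below computes a single coefficient from first principles. The two short lists are hypothetical values made up for illustration, not the data behind the matrix.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, NOT the data behind the matrix above.
experience = [2, 5, 7, 10, 15, 20]
age = [24, 27, 30, 31, 38, 44]

r = pearson_r(experience, age)
print(f"r = {r:.3f}")
```

The matrix is symmetric with 1 on the diagonal because a variable correlates perfectly with itself and the formula does not depend on the order of the two variables.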
GENDER * RELIGION CROSS TABULATION
Count
                  RELIGION
GENDER   HINDU   CHRISTIAN   MUSLIM   Total
Male     40      16          8        64
Female   105     15          16       136
Total    145     31          24       200
AGE by GENDER
Statistic                                        Male     Female
Mean                                             25.44    26.22
95% Confidence Interval for Mean   Lower Bound   24.10    25.14
                                   Upper Bound   26.77    27.30
5% Trimmed Mean                                  25.12    25.85
Median                                           24.00    24.00
Variance                                         28.567   40.662
Std. Deviation                                   5.345    6.377
Minimum                                          19       19
Maximum                                          39       42
Range                                            20       23
Interquartile Range                              9        10
Skewness                                         .735     .832
Kurtosis                                         -.424    -.482
GENDER * RELIGION * ECONOMIC STATUS CROSS TABULATION
Count
                                           RELIGION
ECONOMIC STATUS         GENDER   HINDU   CHRISTIAN   MUSLIM   Total
LESS THAN RS 10,000     Male     1       1           1        3
                        Female   2       1           1        4
                        Total    3       2           2        7
RS 10,000 – RS 20,000   Male     10      2           2        14
                        Female   15      3           2        20
                        Total    25      5           4        34
MORE THAN RS 20,000     Male     29      13          5        47
                        Female   88      11          13       112
                        Total    117     24          18       159
Total                   Male     40      16          8        64
                        Female   105     15          16       136
                        Total    145     31          24       200
Trivariate Case
?
Through Data
• WHAT ?
– Making generalization
– Inference about Population
• Beware of Uncertainty involved
– Testing Your Research Hypotheses
– Check Statistical Significance
Through Data
• Estimation
– Point
– Interval
• Hypotheses testing – Binary Decisions
• Tests for Significance – P Value
Through Data
• Point Estimation
– Any suitable summary statistic
– Support with a measure of dispersion
• Interval Estimation
– Elaborates the knowledge about uncertainty
– Mainly based on “long run” assumptions
– Recently finds extensive usage
– Alternate way to test hypotheses
Danger of Partial Point Estimation
N Mean
Student 1 10 50
Student 2 10 50
Student 3 10 50
Actual Scores
Student 1   Student 2   Student 3
50          46          126
50          54          0
50          49          0
50          51          12
50          38          7
50          62          0
50          44          180
50          60          15
50          55          150
50          41          10
Better Portrayal
N Mean SD
Student 1 10 50 .00000
Student 2 10 50 7.92
Student 3 10 50 71.72
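The "better portrayal" can be reproduced directly from the actual scores listed on the slide. A minimal sketch using only the Python standard library follows. The dictionary layout is just one convenient way to hold the three columns.

```python
from statistics import mean, stdev

# Actual scores of the three students, as listed on the slide.
scores = {
    "Student 1": [50] * 10,
    "Student 2": [46, 54, 49, 51, 38, 62, 44, 60, 55, 41],
    "Student 3": [126, 0, 0, 12, 7, 0, 180, 15, 150, 10],
}

# stdev() is the sample standard deviation (divisor n - 1).
summary = {
    name: (len(vals), mean(vals), round(stdev(vals), 2))
    for name, vals in scores.items()
}

for name, (n, avg, sd) in summary.items():
    print(f"{name}: N={n}, Mean={avg}, SD={sd}")
```

All three students share N = 10 and a mean of 50, and only the standard deviation separates a perfectly steady performer from a highly erratic one.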
Interval Estimation
• Mostly based on a point estimate
• Used together with the SD of the estimate (termed the SE – Standard Error)
• Provides better scope to convey uncertainty
• Helps to understand the extent of population values
• Indicates the direction of population values
Interval Estimation
Sig.   Mean Difference   Std. Error Difference   95% Confidence Interval of the Difference
                                                 Lower      Upper
.494   .62649            .91396                  -1.17587   2.42884
Sig.   Mean Difference   Std. Error Difference   95% Confidence Interval of the Difference
                                                 Lower      Upper
.004   .12346            .042802                 .039486    .207439
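The two outputs above illustrate why interval estimation is an alternate way to test hypotheses: a 95% interval for a difference that contains zero goes with a non-significant result at the 5% level, and an interval that excludes zero goes with a significant one. A small sketch of that check, using the figures from the two outputs (the function name is hypothetical):

```python
def interval_excludes_zero(lower, upper):
    """True when a confidence interval for a difference does not contain 0."""
    return lower > 0 or upper < 0

# (Sig., lower bound, upper bound) read from the two outputs above.
outputs = [(0.494, -1.17587, 2.42884), (0.004, 0.039486, 0.207439)]
alpha = 0.05  # matches the 95% confidence level

for sig, lower, upper in outputs:
    by_interval = interval_excludes_zero(lower, upper)
    by_p_value = sig <= alpha
    print(f"p = {sig}: CI excludes 0 -> {by_interval}, p <= alpha -> {by_p_value}")
```

Unlike the P value alone, the interval also shows the direction and the plausible size of the difference.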
NHST and more…
• T-tests
• Analysis of Variance (ANOVA)
• Chi Square tests
• Tests for proportions
• Regression – Beyond Data
• Data Reduction
– Principal Components
– Factor Analysis
• Segmentation
– Cluster
– Discriminant
• Hierarchical Linear Modeling (HLM)
Statistical Tests Overview
• Two hypotheses are evaluated: the null (H0) and the alternative (H1)
• The amount of evidence required to “prove” the alternative may be stated in terms of a confidence level
• We Don’t “Accept” the Null Hypothesis
Statistical Tests Overview
• Reject the null hypothesis (p-value <= α) and conclude that the alternative hypothesis is true at the pre-determined confidence level
• Fail to reject the null hypothesis (p-value > α) and conclude that there is not enough evidence to state that the alternative is true at the pre-determined confidence level
Summarizing the P value
• Once a threshold P value for statistical significance is set, every result is either statistically significant or not statistically significant
P value Wording Summary
>0.05 Not significant ns
0.01 to 0.05 Significant *
0.001 to 0.01 Very significant **
< 0.001 Extremely significant ***
Actual P value could be reported.
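The wording convention in the table above is easy to encode. In the sketch below, the handling of values that fall exactly on a boundary (0.05, 0.01, 0.001) is an assumption, since the table does not say which side they belong to. The function name is hypothetical.

```python
def summarize_p_value(p):
    """Map a P value to the (wording, symbol) convention in the table above."""
    if not 0 <= p <= 1:
        raise ValueError("a P value must lie between 0 and 1")
    if p > 0.05:
        return "Not significant", "ns"
    if p >= 0.01:          # boundary handling is an assumption
        return "Significant", "*"
    if p >= 0.001:
        return "Very significant", "**"
    return "Extremely significant", "***"

# Reporting the actual P value next to the label keeps the information intact.
for p in (0.494, 0.03, 0.004, 0.0005):
    wording, symbol = summarize_p_value(p)
    print(f"p = {p}: {wording} ({symbol})")
```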
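Choosing a statistical test
Data types: Metric (normal population); Rank, score, or measurement (non-normal); Binomial (two possible outcomes)
• Describe one group
– Metric: Mean, SD
– Rank, score, or measurement: Median, interquartile range
– Binomial: Proportion
• Compare one group to a hypothetical value
– Metric: One-sample t test
– Rank, score, or measurement: Wilcoxon test
– Binomial: Chi-square or Binomial test
• Compare two unpaired groups
– Metric: Unpaired t test
– Rank, score, or measurement: Mann-Whitney test
– Binomial: Fisher's test (chi-square for large samples)
• Compare two paired groups
– Metric: Paired t test
– Rank, score, or measurement: Wilcoxon test
– Binomial: McNemar's test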
Choosing a statistical test (continued)
Data types: Metric (normal population); Rank, score, or measurement (non-normal); Binomial (two possible outcomes)
• Compare three or more unmatched groups
– Metric: One-way ANOVA
– Rank, score, or measurement: Kruskal-Wallis test
– Binomial: Chi-square test
• Quantify association between two variables
– Metric: Pearson correlation
– Rank, score, or measurement: Spearman correlation
– Binomial: Contingency coefficients
• Predict value from another measured variable
– Metric: Simple linear regression
– Rank, score, or measurement: Nonparametric regression
– Binomial: Simple logistic regression
• Predict value from several measured or binomial variables
– Metric: Multiple linear regression
– Binomial: Multiple logistic regression
Quick Case Study
Data – Snap Shot
ID   X1    X2    X3    X4    X5    X6    X7    X8   X9   X10   X11   X12   X13   X14
1    4.1   0.6   6.9   4.7   2.4   2.3   5.2   0    32   4.2   1     0     1     1
2    1.8   3     6.3   6.6   2.5   4     8.4   1    43   4.3   0     1     0     1
3    3.4   5.2   5.7   6     4.3   2.7   8.2   1    48   5.2   0     1     1     2
4    2.7   1     7.1   5.9   1.8   2.3   7.8   1    32   3.9   0     1     1     1
5    6     0.9   9.6   7.8   3.4   4.6   4.5   0    58   6.8   1     0     1     3
6    1.9   3.3   7.9   4.8   2.6   1.9   9.7   1    45   4.4   0     1     1     2
7    4.6   2.4   9.5   6.6   3.5   4.5   7.6   0    46   5.8   1     0     1     1
8    1.3   4.2   6.2   5.1   2.8   2.2   6.9   1    44   4.3   0     1     0     2
9    5.5   1.6   9.4   4.7   3.5   3     7.6   0    63   5.4   1     0     1     3
10   4     3.5   6.5   6     3.7   3.2   8.7   1    54   5.4   0     1     0     2
11   2.4   1.6   8.8   4.8   2     2.8   5.8   0    32   4.3   1     0     0     1
12   3.9   2.2   9.1   4.6   3     2.5   8.3   0    47   5     1     0     1     2
13   2.8   1.4   8.1   3.8   2.1   1.4   6.6   1    39   4.4   0     1     0     1
14   3.7   1.5   8.6   5.7   2.7   3.7   6.7   0    38   5     1     0     1     1
15   4.7   1.3   9.9   6.7   3     2.6   6.8   0    54   5.9   1     0     0     3
16   3.4   2     9.7   4.7   2.7   1.7   4.8   0    49   4.7   1     0     0     3
17   3.2   4.1   5.7   5.1   3.6   2.9   6.2   0    38   4.4   1     1     1     2
18   4.9   1.8   7.7   4.3   3.4   1.5   5.9   0    40   5.6   1     0     0     2
Nature of variable
Variable   Name                   Nature                   Coding
ID         ID                     Respondents' Identity
X1         DeliverySpeed          Metric
X2         PriceLevel             Metric
X3         PriceFlexibility       Metric
X4         MfrImage               Metric
X5         OverallService         Metric
X6         SalesforceImage        Metric
X7         ProductQuality         Metric
X8         FirmSize               Nonmetric                1 = large, 0 = small
X9         UsageLevel             Metric
X10        SatisfactionLevel      Metric
X11        SpecBuying             Nonmetric                1 = total, 0 = specific
X12        ProcurementStructure   Nonmetric                1 = centre, 0 = decentr
X13        IndustryType           Nonmetric                1 = industry A, 0 = others
X14        BuyingSituationType    Nonmetric                1 = new task, 2 = modified rebuy, 3 = straight rebuy
Sample Plan
• Test a hypothetical value for X1 = 2.5 or 3.5
• Test for a group difference by firm size (X8) with respect to X1, X2 and X4
• ANOVA for group differences by buying situation type (X14) with respect to X2
• Regression with X1 to X4 as IVs and X5 as DV
• Test for associations
One Sample Test with X1
One-Sample Test (Test Value = 2.5)
                                                      95% Confidence Interval of the Difference
     t       df   Sig. (2-tailed)   Mean Difference   Lower    Upper
X1   7.685   99   .000              1.0150            .7529    1.2771
One-Sample Test (Test Value = 3.5)
                                                      95% Confidence Interval of the Difference
     t       df   Sig. (2-tailed)   Mean Difference   Lower    Upper
X1   .114    99   .910              .0150             -.2471   .2771
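The t statistic in the outputs above is the mean difference divided by its standard error. The sketch below applies that formula to the X1 values of the 18 respondents in the data snapshot only, so its numbers will not match the output for the full sample of 100. It is meant to show the mechanics, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev

def one_sample_t(data, test_value):
    """Return (t statistic, degrees of freedom, mean difference)."""
    n = len(data)
    diff = mean(data) - test_value
    se = stdev(data) / math.sqrt(n)  # standard error of the mean
    return diff / se, n - 1, diff

# X1 (delivery speed) for the 18 snapshot rows only, not the full data set.
x1 = [4.1, 1.8, 3.4, 2.7, 6, 1.9, 4.6, 1.3, 5.5, 4,
      2.4, 3.9, 2.8, 3.7, 4.7, 3.4, 3.2, 4.9]

results = {tv: one_sample_t(x1, tv) for tv in (2.5, 3.5)}
for tv, (t, df, diff) in results.items():
    print(f"Test value {tv}: mean difference = {diff:.4f}, t = {t:.3f}, df = {df}")
```

Even on this subset the pattern is the same as in the full output: a large t against 2.5 and a t close to zero against 3.5.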
Two Sample Test with X8
Independent Samples Test
Levene's Test for Equality of Variances: F, Sig. t-test for Equality of Means: t, df, Sig. (2-tailed), Mean Difference, Std. Error Difference, 95% Confidence Interval of the Difference (Lower, Upper)
                                     F       Sig.   t        df       Sig. (2-tailed)   Mean Difference   Std. Error Difference   Lower      Upper
X1   Equal variances assumed         .934    .336   -8.045   98       .000              -1.6917           .21029                  -2.10897   -1.27436
     Equal variances not assumed                    -8.074   84.766   .000              -1.6917           .20953                  -2.10828   -1.27506
X2   Equal variances assumed         1.582   .211   4.687    98       .000              1.0392            .22171                  .59919     1.47914
     Equal variances not assumed                    4.564    75.987   .000              1.0392            .22767                  .58571     1.49262
X4   Equal variances assumed         6.549   .012   .374     98       .709              .0867             .23196                  -.37365    .54698
     Equal variances not assumed                    .405     97.990   .686              .0867             .21406                  -.33814    .51147
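The "equal variances not assumed" rows correspond to Welch's version of the t test. The sketch below computes Welch's t and its approximate degrees of freedom for X1 split by firm size (X8), again using only the 18 snapshot rows, so the figures differ from the output above. The function name and the group order (large minus small) are choices made for this illustration.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic and approximate df (equal variances not assumed)."""
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a) / na, variance(group_b) / nb
    t = (mean(group_a) - mean(group_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# X1 split by firm size (X8) for the 18 snapshot rows only.
x1_small = [4.1, 6, 4.6, 5.5, 2.4, 3.9, 3.7, 4.7, 3.4, 3.2, 4.9]  # X8 = 0
x1_large = [1.8, 3.4, 2.7, 1.9, 1.3, 4, 2.8]                       # X8 = 1

mean_difference = mean(x1_large) - mean(x1_small)
t, df = welch_t(x1_large, x1_small)
print(f"mean difference = {mean_difference:.4f}, t = {t:.3f}, df = {df:.2f}")
```

Levene's test in the output guides which row to read: when its Sig. is small (as for X4), the "equal variances not assumed" row is the safer choice.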
ANOVA OUTPUTS
Descriptives: X2
                                                    95% Confidence Interval for Mean
        N     Mean     Std. Deviation   Std. Error   Lower Bound   Upper Bound   Minimum   Maximum
A       34    2.0941   .95122           .16313       1.7622        2.4260        .40       3.70
B       32    3.1813   1.36510          .24132       2.6891        3.6734        .70       5.40
C       34    1.8647   .80862           .13868       1.5826        2.1468        .20       4.00
Total   100   2.3640   1.19566          .11957       2.1268        2.6012        .20       5.40
ANOVA: X2
                 Sum of Squares   df   Mean Square   F        Sig.
Between Groups   32.325           2    16.163        14.356   .000
Within Groups    109.205          97   1.126
Total            141.530          99
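The ANOVA table can be rebuilt from the group sizes, means and standard deviations in the Descriptives table, which is a useful cross-check of the output. A minimal sketch follows. Small differences in the last digit arise because the published means and SDs are rounded, and the function name is hypothetical.

```python
# (n, mean, SD) per group, read from the Descriptives table for X2.
groups = {
    "A": (34, 2.0941, 0.95122),
    "B": (32, 3.1813, 1.36510),
    "C": (34, 1.8647, 0.80862),
}

def anova_from_summary(groups):
    """One-way ANOVA entries computed from group sizes, means and SDs."""
    n_total = sum(n for n, _, _ in groups.values())
    grand_mean = sum(n * m for n, m, _ in groups.values()) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups.values())
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())
    df_between, df_within = len(groups) - 1, n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat

ss_b, ss_w, df_b, df_w, f_stat = anova_from_summary(groups)
print(f"Between: SS = {ss_b:.3f}, df = {df_b}")
print(f"Within:  SS = {ss_w:.3f}, df = {df_w}")
print(f"F = {f_stat:.3f}")
```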
POST HOC ANALYSIS
Multiple Comparisons
Dependent Variable: X2
LSD
                                                               95% Confidence Interval
(I) X14   (J) X14   Mean Difference (I-J)   Std. Error   Sig.   Lower Bound   Upper Bound
A         B         -1.0871*                .26133       .000   -1.6058       -.5685
A         C         .2294                   .25734       .375   -.2813        .7402
B         A         1.0871*                 .26133       .000   .5685         1.6058
B         C         1.3165*                 .26133       .000   .7979         1.8352
C         A         -.2294                  .25734       .375   -.7402        .2813
C         B         -1.3165*                .26133       .000   -1.8352       -.7979
* The mean difference is significant at the .05 level.
Regression Output
Coefficients (Dependent Variable: X5)
               Unstandardized Coefficients   Standardized Coefficients                     95% Confidence Interval for B
Model 1        B       Std. Error            Beta                        t        Sig.   Lower Bound   Upper Bound
(Constant)     -.278   .105                                              -2.653   .009   -.486         -.070
X1             .505    .011                  .888                        47.904   .000   .484          .526
X2             .518    .012                  .824                        44.044   .000   .494          .541
X3             .011    .011                  .019                        .989     .325   -.011         .032
X4             .021    .011                  .032                        1.941    .055   .000          .043
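The unstandardized coefficients (B) define the fitted equation X5 = -0.278 + 0.505 X1 + 0.518 X2 + 0.011 X3 + 0.021 X4. As an illustration, the sketch below applies it to respondent 1 of the data snapshot and compares the prediction with the observed X5. The names `COEF` and `predict_x5` are hypothetical.

```python
# Unstandardized coefficients (B) read from the Coefficients table above.
COEF = {"const": -0.278, "X1": 0.505, "X2": 0.518, "X3": 0.011, "X4": 0.021}

def predict_x5(x1, x2, x3, x4):
    """Predicted overall service (X5) from the fitted linear equation."""
    return (COEF["const"] + COEF["X1"] * x1 + COEF["X2"] * x2
            + COEF["X3"] * x3 + COEF["X4"] * x4)

# Respondent 1 in the snapshot: X1 = 4.1, X2 = 0.6, X3 = 6.9, X4 = 4.7, observed X5 = 2.4
predicted = predict_x5(4.1, 0.6, 6.9, 4.7)
residual = 2.4 - predicted
print(f"predicted X5 = {predicted:.3f}, residual = {residual:.3f}")
```

X1 and X2 carry almost all of the prediction. The intervals for X3 and X4 reach zero (the X3 interval contains it and the X4 lower bound is .000), in line with their Sig. values above .05.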
Test for association – Contingency Table
X8 * X14 Crosstabulation
Count
             X14
X8      A    B    C    Total
Small   10   16   34   60
Large   24   16   0    40
Total   34   32   34   100
Chi Square Test for association
Chi-Square Tests
                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             37.255a   2    .000
Continuity Correction
Likelihood Ratio               49.047    2    .000
Linear-by-Linear Association   34.941    1    .000
N of Valid Cases               100
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 12.80.
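The Pearson chi-square value and the minimum expected count in the output above follow directly from the X8 * X14 counts. The sketch below recomputes them with the standard library only. For the P value it uses the fact that, for 2 degrees of freedom, the chi-square upper-tail probability has the closed form exp(-x/2), and other degrees of freedom would need a statistics library. The function name is hypothetical.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square statistic, df and expected counts for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

observed = [[10, 16, 34],   # Small firms: X14 = A, B, C
            [24, 16, 0]]    # Large firms: X14 = A, B, C

chi2, df, expected = chi_square_independence(observed)
min_expected = min(min(row) for row in expected)
# Closed form valid only for df = 2.
p_value = math.exp(-chi2 / 2) if df == 2 else None
print(f"chi-square = {chi2:.3f}, df = {df}, minimum expected count = {min_expected:.2f}")
print(f"p = {p_value:.2e}")
```

The footnote check on expected counts matters because the chi-square approximation becomes unreliable when many cells have expected counts below 5.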
?
My RODAR
• Conceptualize Your Research
• Know Your Objectives for the analysis
• Be familiar with all aspects of Your Data
• Develop Your Analysis plan suitable for the research questions
– Stay true to it
• Be aware of the statistical software that can best reflect Your Results
• Master your analyses
• Be open to additional investigations
– Caution: mind the limitations of the data and the programs you are using
THANK YOU !
9952726863
Happy Computing
Questions Comments…