rodar

Dr. M.Subbiah M Sc, M Phil, Ph D, PGDOR, BPSM.

Upload: subbiah-phd

Post on 22-May-2015


DESCRIPTION

My Research My Objectives My Data My Analyses My Result

TRANSCRIPT

Page 1: Rodar

Dr. M.Subbiah M Sc, M Phil, Ph D, PGDOR, BPSM.

Page 2: Rodar

Purpose

"The simpler explanation is the preferable one" (Occam's Razor)

• To present some key ideas from statistics that have proven to be useful tools for data analysis.

• To introduce specific statistical techniques and ideas.

• To show these tools by demonstration rather than through rigorous mathematical proofs.

Page 3: Rodar

D2D Quick Steps

Page 4: Rodar
Page 5: Rodar

Statistics

• Quite similar techniques
• Commonly use many of the same techniques
• Statistical software includes most of the tools
• In the past,
– Data were collected by hand in notebooks
– Sometimes mistakenly recorded
– Not too much data
• But still,
– Worth applying even today (and in future too)
– Applicable to virtually any area, from agriculture to psychology to astronomy to business

Page 6: Rodar

Computational Power: Allure of Statistics

• The advent of computing power has greatly simplified analysis

• Helps to make sense of quantities of data too large to handle in raw form

• Statistical software provides
– Computationally feasible algorithms
– Little or no human intervention
– A platform to handle large volumes of data

• What is needed:
– Appropriate tools and methods of interpretation
– Passion to expand the knowledge horizon
– Interactive learning experience

Page 7: Rodar

Computer Software: Inevitable today!

• Ample number of built-in functions
• Able to handle different data types
• Amenable to data transformation / recoding
• Achievable speed / accuracy
• Collaborative platforms
• Possible to write own functions

Page 8: Rodar

Articulation

• Objective of the study
• Data availability
• Data upside / downside
• Usefulness of variables
• Exploring relations: three eyes of analysis
– Within Data
– Through Data
– Beyond Data

Page 9: Rodar

Two Illustrations

Page 10: Rodar

3333 cases 21 variables

Page 11: Rodar

Variables' Nature

Variable         Type       Variable          Type
State            Nominal    Eve Mins          Metric
Account Length   Nominal    Eve Calls         Count
Area Code        Nominal    Eve Charge        Metric
Phone            Nominal    Night Mins        Metric
Int'l Plan       Binary     Night Calls       Count
VMail Plan       Binary     Night Charge      Metric
VMail Message    Count      Intl Mins         Metric
Day Mins         Metric     Intl Calls        Metric
Day Calls        Count      Intl Charge       Metric
Day Charge       Metric     CustServ Calls    Polytomous
Churn            Binary

Page 12: Rodar

Day Mins

Day Calls

Eve Mins

Eve Calls

Churn

Does this bring out all the explicit relationships between the variables?

Page 13: Rodar

Our Experience

• Educational domain
• Customer behavior: who is the customer?
• Parlance from marketing research
• Data collected over 3 months
• Cleaning / organizing: 1 month
• Setting objectives: 1 month
– Domain expert + data analyst

Page 14: Rodar

Description

• Geographical location
• Perceived risks
• Ability
• Parental, teacher and peer encouragement
• Socioeconomic status
• Significant others' involvement
• Personal involvement
• Optimum stimulation level
• Perceived knowledge
• Self-confidence
• Motivation
• Aspiration
• Value for education
• Need for cognition
• Perceived benefits
• College attributes
• Perceived costs
• Types of sources
• Time pressure

External information search

Attitude towards search

Extent of search

Depth of search

Importance of search

Page 15: Rodar

Perceived knowledge

Self-confidence

Value for education

Perceived benefits

Extent of search

Does this bring out all the explicit relationships between the variables?

Page 16: Rodar

Recall - Articulation

• Objective of the study
• Data availability
• Data upside / downside
• Usefulness of variables
• Exploring relations: three eyes of analysis
– Within Data
– Through Data
– Beyond Data

Page 17: Rodar

Data

• Information
– Any fact assumed to be a matter of direct observation
– A series of observations, measurements, or facts

• Numbers

Page 18: Rodar

Data - Example

Page 19: Rodar

Data Classification

• Continuous
– Metric
– Measurable

• Discrete
– Non-metric
– Count data
– Categorical (2 or more categories)

• Nominal
• Ordinal
• Interval
• Ratio

Page 20: Rodar

What is a Variable?

• Simply, something that varies.

• Specifically, variables represent characteristics of persons or objects that can be manipulated, controlled, or merely measured for the sake of research.

• Variation: how much a variable varies. Those with little or no variation are treated as constants.

Page 21: Rodar

Independent Variables

• These variables are ones that are more or less controlled.

• Scientists manipulate these variables as they see fit.

• They still vary, but the variation is relatively known or taken into account.

• Often there are many in a given study.

Page 22: Rodar

Dependent Variables

• Not controlled or manipulated in any way, but simply measured or registered.

• Vary in relation to the independent variables.

• There can be any number of dependent variables.

• Usually there is one, to isolate the reason for variation.

Page 23: Rodar

Independent v. Dependent

Independent                   Dependent
• Intentionally manipulated   • Intentionally left alone
• Controlled                  • Measured
• Vary at known rate          • Vary at unknown rate
• Cause                       • Effect

The number and nature of IVs and DVs are matters of concern.

Page 24: Rodar

Example: What affects a student's arrival to class?

• Type of school: Arts v. Science v. Engg
• Type of student: Athlete? Gender? GPA?
• Time: bedtime, waking, arrival
• Mode of transportation
• Parents' education
• Parents' occupation
• Students' motivation
…

Page 25: Rodar

Within Data

WHY?

• Precise presentation

• Understand the features of the data

• Set the analysis plan

• Test assumptions for statistical inference

Page 26: Rodar

Within Data

• Numeric form
– Summary statistics
– Tables

• Pictorial form
– Charts
– Figures

• Useful for quicker inference
• Tool for eyeball judgment
• Methods differ according to the type of data

Page 27: Rodar

Within Data

Page 28: Rodar

Within Data

• For metric data
– Box plot
– Normal plot (advanced users)
– Histogram

Page 29: Rodar

Within Data

Statistic for Experience

Mean                                 15.5200
95% Confidence Interval for Mean
  Lower Bound                        14.6464
  Upper Bound                        16.3936
Median                               15.0000
Variance                             39.256
Std. Deviation                       6.26545
Minimum                              5.00
Maximum                              29.00
Range                                24.00
Interquartile Range                  11.00
Skewness                             .061
Kurtosis                             -1.334

Page 30: Rodar

Within Data

Compare with summary

Page 31: Rodar

Within Data

Page 32: Rodar

Within Data

• For non-metric data
– Bar
– Line
– Pie
– Area

Page 33: Rodar

Bar

Page 34: Rodar

Line

Page 35: Rodar

Pie

Page 36: Rodar

Area

Page 37: Rodar

Summary for categorical data

            Frequency   Percent
HINDU       145         72.5
CHRISTIAN   31          15.5
MUSLIM      24          12.0
Total       200         100.0
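The percent column in the summary above follows directly from the counts. A minimal sketch, using only the three counts shown on the slide (the function name `percent_table` is illustrative, not from the slides):

```python
# Counts copied from the religion frequency table on the slide.
counts = {"HINDU": 145, "CHRISTIAN": 31, "MUSLIM": 24}


def percent_table(counts):
    """Return {category: percent of the grand total}, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


percents = percent_table(counts)
for name, n in counts.items():
    print(f"{name:<10} {n:>5} {percents[name]:>6.1f}")
print(f"{'Total':<10} {sum(counts.values()):>5} {sum(percents.values()):>6.1f}")
```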

Page 38: Rodar

Bivariate Case

X and Y                       Tools
Metric, Metric                Scatter plots, correlation
Metric, Categorical           Multiple box plots
Categorical, Categorical      Cross tabulation, multiple bar charts

Page 39: Rodar

Scatter Plot

Page 40: Rodar

Scatter Matrix

Page 41: Rodar

Correlations

             Experience   Age    NT     AD
Experience   1            .926   .309   .337
Age          .926         1      .370   .429
NT           .309         .370   1      .665
AD           .337         .429   .665   1

Page 42: Rodar

GENDER * RELIGION CROSS TABULATION

Count

                  RELIGION
                  HINDU   CHRISTIAN   MUSLIM   Total
GENDER   Male     40      16          8        64
         Female   105     15          16       136
Total             145     31          24       200

Page 43: Rodar
Page 44: Rodar

AGE by GENDER

Statistic                                       Male     Female
Mean                                            25.44    26.22
95% Confidence Interval for Mean, Lower Bound   24.10    25.14
95% Confidence Interval for Mean, Upper Bound   26.77    27.30
5% Trimmed Mean                                 25.12    25.85
Median                                          24.00    24.00
Variance                                        28.567   40.662
Std. Deviation                                  5.345    6.377
Minimum                                         19       19
Maximum                                         39       42
Range                                           20       23
Interquartile Range                             9        10
Skewness                                        .735     .832
Kurtosis                                        -.424    -.482

Page 45: Rodar
Page 46: Rodar
Page 47: Rodar

GENDER * RELIGION * ECONOMIC STATUS CROSS TABULATION

Count

ECONOMIC STATUS         GENDER   HINDU   CHRISTIAN   MUSLIM   Total
LESS THAN RS 10,000     Male     1       1           1        3
                        Female   2       1           1        4
                        Total    3       2           2        7
RS 10,000 – RS 20,000   Male     10      2           2        14
                        Female   15      3           2        20
                        Total    25      5           4        34
MORE THAN RS 20,000     Male     29      13          5        47
                        Female   88      11          13       112
                        Total    117     24          18       159
Total                   Male     40      16          8        64
                        Female   105     15          16       136
                        Total    145     31          24       200
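The "Total" block of the three-way table is the two-way GENDER * RELIGION table obtained by summing over the economic-status strata. A small sketch of that collapse, with the counts typed in from the slide (the dictionary layout and the name `collapse` are illustrative choices):

```python
# stratum -> gender -> (HINDU, CHRISTIAN, MUSLIM), copied from the slide.
three_way = {
    "LESS THAN RS 10,000": {"Male": (1, 1, 1), "Female": (2, 1, 1)},
    "RS 10,000 - RS 20,000": {"Male": (10, 2, 2), "Female": (15, 3, 2)},
    "MORE THAN RS 20,000": {"Male": (29, 13, 5), "Female": (88, 11, 13)},
}


def collapse(table):
    """Sum over strata to get the marginal gender-by-religion counts."""
    out = {}
    for stratum in table.values():
        for gender, row in stratum.items():
            previous = out.get(gender, (0,) * len(row))
            out[gender] = tuple(a + b for a, b in zip(previous, row))
    return out


two_way = collapse(three_way)
for gender, row in two_way.items():
    print(gender, row, sum(row))
```

The collapsed rows can be compared with the earlier two-way cross tabulation; a pattern visible in one stratum may disappear or reverse in the collapsed table, which is one reason to inspect the strata separately.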

Page 48: Rodar

Trivariate Case

?

Page 49: Rodar

Through Data

• WHAT ?

– Making generalization

– Inference about Population

• Beware of Uncertainty involved

– Testing Your Research Hypotheses

– Check Statistical Significance

Page 50: Rodar

Through Data

• Estimation

– Point

– Interval

• Hypotheses testing – Binary Decisions

• Tests for Significance – P Value

Page 51: Rodar

Through Data

• Point estimation
– Any suitable summary statistic
– Support with a measure of dispersion

• Interval estimation
– Elaborates the knowledge about uncertainty
– Mainly based on “long run” assumptions
– Recently finds extensive usage
– Alternate way to test hypotheses

Page 52: Rodar

Danger of Partial Point Estimation

N Mean

Student 1 10 50

Student 2 10 50

Student 3 10 50

Page 53: Rodar

Danger of Partial Point Estimation

Actual Scores

Student 1   Student 2   Student 3
50          46          126
50          54          0
50          49          0
50          51          12
50          38          7
50          62          0
50          44          180
50          60          15
50          55          150
50          41          10

Page 54: Rodar

Better Portrayal

N Mean SD

Student 1 10 50 .00000

Student 2 10 50 7.92

Student 3 10 50 71.72
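The three students share the same mean but differ sharply in spread. A short sketch that recomputes N, mean and sample SD from the actual scores on the previous slide, assuming the three score columns belong to Students 1, 2 and 3 in that order:

```python
import statistics

# Scores read column-wise from the "Actual Scores" slide (assumed column order).
scores = {
    "Student 1": [50] * 10,
    "Student 2": [46, 54, 49, 51, 38, 62, 44, 60, 55, 41],
    "Student 3": [126, 0, 0, 12, 7, 0, 180, 15, 150, 10],
}

# (N, mean, sample standard deviation) for each student.
summary = {
    name: (len(values), statistics.mean(values), statistics.stdev(values))
    for name, values in scores.items()
}

for name, (n, mean, sd) in summary.items():
    print(f"{name}: N={n} Mean={mean} SD={sd:.2f}")
```

`statistics.stdev` uses the n - 1 denominator, which matches the SD values reported in the table.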

Page 55: Rodar

Interval Estimation

• Mostly based on a point estimate

• Combined with the SD of that estimate (termed the SE, Standard Error)

• Provides better scope to convey uncertainty

• Helps to understand the extent of population values

• Indicates the direction of population values

Page 56: Rodar

Interval Estimation

Sig    Mean Difference   Std. Error Difference   95% Confidence Interval of the Difference
                                                 Lower       Upper
.494   .62649            .91396                  -1.17587    2.42884
.004   .12346            .042802                 .039486     .207439
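The two rows above show how an interval can stand in for a test: the first interval contains zero and its Sig value is large, the second excludes zero and its Sig value is small. The sketch below rebuilds approximate intervals from the mean difference and standard error. It assumes a normal quantile because the degrees of freedom are not shown on the slide, so the bounds differ slightly from the reported t-based ones; `approx_ci` and `excludes_zero` are illustrative names.

```python
from statistics import NormalDist


def approx_ci(estimate, std_error, level=0.95):
    """Normal-approximation interval: estimate +/- z * SE (assumption: large sample)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error


def excludes_zero(interval):
    """True when zero lies outside the interval, i.e. the difference is significant."""
    lower, upper = interval
    return lower > 0 or upper < 0


first = approx_ci(0.62649, 0.91396)    # reported Sig = .494
second = approx_ci(0.12346, 0.042802)  # reported Sig = .004

for label, interval in (("first", first), ("second", second)):
    print(label, tuple(round(b, 4) for b in interval), excludes_zero(interval))
```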

Page 57: Rodar

NHST and more…

• t-tests

• Analysis of Variance (ANOVA)

• Chi Square tests

• Tests for proportions

• Regression – Beyond Data

• Data Reduction

– Principal Components

– Factor Analysis

• Segmentation

– Cluster

– Discriminant

• Hierarchical Linear Modeling (HLM)

Page 58: Rodar

Statistical Tests Overview

• Two hypotheses are evaluated: the null (H0) and the alternative (H1)

• The amount of evidence required to “prove” the alternative may be stated in terms of a confidence level

• We Don’t “Accept” the Null Hypothesis

Page 59: Rodar

Statistical Tests Overview

• Reject the null hypothesis (p-value <= α) and conclude that the data support the alternative hypothesis at the pre-determined confidence level

• Fail to reject the null hypothesis (p-value > α) and conclude that there is not enough evidence to support the alternative at the pre-determined confidence level

Page 60: Rodar

Summarizing the P value

• Once a threshold P value for statistical significance is set, every result is either statistically significant or not statistically significant

P value         Wording                 Summary
> 0.05          Not significant         ns
0.01 to 0.05    Significant             *
0.001 to 0.01   Very significant        **
< 0.001         Extremely significant   ***

The actual P value could be reported.
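The wording table can be written as a small lookup. In this sketch the handling of the boundary values (0.05, 0.01, 0.001) is an assumption, since the ranges on the slide touch at their ends; `summarize_p` is an illustrative name.

```python
def summarize_p(p):
    """Map a P value to the (wording, summary) pair from the table.

    Boundary handling is an assumption: 0.05 counts as significant,
    0.01 as significant (*), 0.001 as very significant (**).
    """
    if not 0 <= p <= 1:
        raise ValueError("a P value must lie between 0 and 1")
    if p < 0.001:
        return "Extremely significant", "***"
    if p < 0.01:
        return "Very significant", "**"
    if p <= 0.05:
        return "Significant", "*"
    return "Not significant", "ns"


for p in (0.494, 0.03, 0.004, 0.0001):
    print(p, *summarize_p(p))
```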

Page 61: Rodar

Choosing a statistical test

Goal                                        Metric (Normal Population)   Rank, Score, or Measurement (Non-Normal)   Binomial (Two Possible Outcomes)
Describe one group                          Mean, SD                     Median, interquartile range                Proportion
Compare one group to a hypothetical value   One-sample t test            Wilcoxon test                              Chi-square or Binomial test
Compare two unpaired groups                 Unpaired t test              Mann-Whitney test                          Fisher's test (chi-square for large samples)
Compare two paired groups                   Paired t test                Wilcoxon test                              McNemar's test

Page 62: Rodar

Choosing a statistical test

Goal                                                        Metric (Normal Population)   Rank, Score, or Measurement (Non-Normal)   Binomial (Two Possible Outcomes)
Compare three or more unmatched groups                      One-way ANOVA                Kruskal-Wallis test                        Chi-square test
Quantify association between two variables                  Pearson correlation          Spearman correlation                       Contingency coefficients
Predict value from another measured variable                Simple linear regression     Nonparametric regression                   Simple logistic regression
Predict value from several measured or binomial variables   Multiple linear regression                                              Multiple logistic regression

Page 63: Rodar

Quick Case Study

Page 64: Rodar

Data – Snap Shot

ID   X1    X2    X3    X4    X5    X6    X7    X8   X9   X10   X11   X12   X13   X14
1    4.1   0.6   6.9   4.7   2.4   2.3   5.2   0    32   4.2   1     0     1     1
2    1.8   3     6.3   6.6   2.5   4     8.4   1    43   4.3   0     1     0     1
3    3.4   5.2   5.7   6     4.3   2.7   8.2   1    48   5.2   0     1     1     2
4    2.7   1     7.1   5.9   1.8   2.3   7.8   1    32   3.9   0     1     1     1
5    6     0.9   9.6   7.8   3.4   4.6   4.5   0    58   6.8   1     0     1     3
6    1.9   3.3   7.9   4.8   2.6   1.9   9.7   1    45   4.4   0     1     1     2
7    4.6   2.4   9.5   6.6   3.5   4.5   7.6   0    46   5.8   1     0     1     1
8    1.3   4.2   6.2   5.1   2.8   2.2   6.9   1    44   4.3   0     1     0     2
9    5.5   1.6   9.4   4.7   3.5   3     7.6   0    63   5.4   1     0     1     3
10   4     3.5   6.5   6     3.7   3.2   8.7   1    54   5.4   0     1     0     2
11   2.4   1.6   8.8   4.8   2     2.8   5.8   0    32   4.3   1     0     0     1
12   3.9   2.2   9.1   4.6   3     2.5   8.3   0    47   5     1     0     1     2
13   2.8   1.4   8.1   3.8   2.1   1.4   6.6   1    39   4.4   0     1     0     1
14   3.7   1.5   8.6   5.7   2.7   3.7   6.7   0    38   5     1     0     1     1
15   4.7   1.3   9.9   6.7   3     2.6   6.8   0    54   5.9   1     0     0     3
16   3.4   2     9.7   4.7   2.7   1.7   4.8   0    49   4.7   1     0     0     3
17   3.2   4.1   5.7   5.1   3.6   2.9   6.2   0    38   4.4   1     1     1     2
18   4.9   1.8   7.7   4.3   3.4   1.5   5.9   0    40   5.6   1     0     0     2

Page 65: Rodar

Nature of variable

ID    ID                     Respondents' Identity
X1    DeliverySpeed          Metric
X2    PriceLevel             Metric
X3    PriceFlexibility       Metric
X4    MfrImage               Metric
X5    OverallService         Metric
X6    SalesforceImage        Metric
X7    ProductQuality         Metric
X8    FirmSize               Nonmetric   1 = large, 0 = small
X9    UsageLevel             Metric
X10   SatisfactionLevel      Metric
X11   SpecBuying             Nonmetric   1 = total, 0 = specific
X12   ProcurementStructure   Nonmetric   1 = centre, 0 = decentr
X13   IndustryType           Nonmetric   1 = industry A, 0 = others
X14   BuyingSituationType    Nonmetric   1 = new task, 2 = modified rebuy, 3 = straight rebuy

Page 66: Rodar

Sample Plan

• Test a hypothetical value for X1 = 2.5 or 3.5
• Test for a group difference by firm size (X8) with respect to X1, X2, X4
• ANOVA for group differences by buying situation type (X14) with respect to X2
• Regression with X1 to X4 (IVs) and X5 (DV)
• Test for associations

Page 67: Rodar

One Sample Test with X1

One-Sample Test (Test Value = 2.5)

     t       df   Sig. (2-tailed)   Mean Difference   95% Confidence Interval of the Difference
                                                      Lower     Upper
X1   7.685   99   .000              1.0150            .7529     1.2771

One-Sample Test (Test Value = 3.5)

     t       df   Sig. (2-tailed)   Mean Difference   95% Confidence Interval of the Difference
                                                      Lower     Upper
X1   .114    99   .910              .0150             -.2471    .2771
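The two outputs are linked through t = (sample mean - test value) / standard error. The sketch below backs out the sample mean and standard error from the first output and uses them to reproduce the t value of the second. Only figures printed on the slide are used, so the results agree up to rounding; `one_sample_t` is an illustrative name.

```python
# Figures read from the first output (Test Value = 2.5).
t_25 = 7.685
mean_diff_25 = 1.0150
test_value_25 = 2.5

# Back out the sample mean of X1 and its standard error.
sample_mean = test_value_25 + mean_diff_25  # about 3.515
std_error = mean_diff_25 / t_25             # about 0.132


def one_sample_t(sample_mean, std_error, test_value):
    """t statistic for H0: population mean equals test_value."""
    return (sample_mean - test_value) / std_error


# Re-test against 3.5; the slide reports t = .114.
t_35 = one_sample_t(sample_mean, std_error, 3.5)
print(round(sample_mean, 3), round(std_error, 4), round(t_35, 3))
```

The same data reject 2.5 decisively yet give no evidence against 3.5, which is also visible from whether each interval of the difference contains zero.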

Page 68: Rodar

Two Sample Test with X8

Independent Samples Test

F and Sig. (Levene) come from Levene's Test for Equality of Variances; the remaining columns come from the t-test for Equality of Means. Lower and Upper give the 95% Confidence Interval of the Difference.

                                   F       Sig. (Levene)   t        df       Sig. (2-tailed)   Mean Difference   Std. Error Difference   Lower      Upper
X1   Equal variances assumed       .934    .336            -8.045   98       .000              -1.6917           .21029                  -2.10897   -1.27436
     Equal variances not assumed                           -8.074   84.766   .000              -1.6917           .20953                  -2.10828   -1.27506
X2   Equal variances assumed       1.582   .211            4.687    98       .000              1.0392            .22171                  .59919     1.47914
     Equal variances not assumed                           4.564    75.987   .000              1.0392            .22767                  .58571     1.49262
X4   Equal variances assumed       6.549   .012            .374     98       .709              .0867             .23196                  -.37365    .54698
     Equal variances not assumed                           .405     97.990   .686              .0867             .21406                  -.33814    .51147
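Each variable has two rows in this output, and a common convention (assumed here, not stated on the slides) is to read the "equal variances assumed" row when Levene's Sig. exceeds .05 and the "not assumed" row otherwise. The sketch applies that rule and recomputes t as mean difference divided by its standard error, using the figures on the slide; `pick_row` is an illustrative name.

```python
# Per variable: Levene Sig., then (mean difference, std. error difference)
# for the "assumed" and "not assumed" rows, copied from the slide.
output = {
    "X1": {"levene_sig": 0.336, "assumed": (-1.6917, 0.21029), "not_assumed": (-1.6917, 0.20953)},
    "X2": {"levene_sig": 0.211, "assumed": (1.0392, 0.22171), "not_assumed": (1.0392, 0.22767)},
    "X4": {"levene_sig": 0.012, "assumed": (0.0867, 0.23196), "not_assumed": (0.0867, 0.21406)},
}


def pick_row(entry, alpha=0.05):
    """Choose the row by Levene's test, then return (row name, t statistic)."""
    row = "assumed" if entry["levene_sig"] > alpha else "not_assumed"
    mean_diff, std_error = entry[row]
    return row, mean_diff / std_error


chosen = {name: pick_row(entry) for name, entry in output.items()}
for name, (row, t) in chosen.items():
    print(f"{name}: equal variances {row.replace('_', ' ')}, t = {t:.3f}")
```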

Page 69: Rodar

ANOVA OUTPUTS

Page 70: Rodar

Descriptives

X2

        N     Mean     Std. Deviation   Std. Error   95% Confidence Interval for Mean   Minimum   Maximum
                                                     Lower Bound   Upper Bound
A       34    2.0941   .95122           .16313       1.7622        2.4260               .40       3.70
B       32    3.1813   1.36510          .24132       2.6891        3.6734               .70       5.40
C       34    1.8647   .80862           .13868       1.5826        2.1468               .20       4.00
Total   100   2.3640   1.19566          .11957       2.1268        2.6012               .20       5.40

Page 71: Rodar

ANOVA

X2

                 Sum of Squares   df   Mean Square   F        Sig.
Between Groups   32.325           2    16.163        14.356   .000
Within Groups    109.205          97   1.126
Total            141.530          99
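The ANOVA table can be rebuilt, up to rounding, from the group sizes, means and standard deviations in the Descriptives output. A minimal sketch using only those summary figures (the function name `anova_from_summary` is illustrative):

```python
# group -> (N, mean, standard deviation) for X2, from the Descriptives table.
groups = {
    "A": (34, 2.0941, 0.95122),
    "B": (32, 3.1813, 1.36510),
    "C": (34, 1.8647, 0.80862),
}


def anova_from_summary(groups):
    """One-way ANOVA quantities computed from per-group summary statistics."""
    n_total = sum(n for n, _, _ in groups.values())
    grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total
    # Between groups: spread of the group means around the grand mean.
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())
    # Within groups: pooled spread inside each group.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat


ss_b, ss_w, df_b, df_w, f_stat = anova_from_summary(groups)
print(f"Between: SS={ss_b:.3f} df={df_b}")
print(f"Within:  SS={ss_w:.3f} df={df_w}")
print(f"F = {f_stat:.3f}")
```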

Page 72: Rodar

POST HOC ANALYSIS

Multiple Comparisons

Dependent Variable: X2

LSD

(I) X14   (J) X14   Mean Difference (I-J)   Std. Error   Sig.   95% Confidence Interval
                                                                Lower Bound   Upper Bound
A         B         -1.0871*                .26133       .000   -1.6058       -.5685
A         C         .2294                   .25734       .375   -.2813        .7402
B         A         1.0871*                 .26133       .000   .5685         1.6058
B         C         1.3165*                 .26133       .000   .7979         1.8352
C         A         -.2294                  .25734       .375   -.7402        .2813
C         B         -1.3165*                .26133       .000   -1.8352       -.7979

* The mean difference is significant at the .05 level.

Page 73: Rodar

Regression Output

Coefficients (Dependent Variable: X5)

Model 1      Unstandardized Coefficients   Standardized Coefficients   t        Sig.   95% Confidence Interval for B
             B       Std. Error            Beta                                        Lower Bound   Upper Bound
(Constant)   -.278   .105                                              -2.653   .009   -.486         -.070
X1           .505    .011                  .888                        47.904   .000   .484          .526
X2           .518    .012                  .824                        44.044   .000   .494          .541
X3           .011    .011                  .019                        .989     .325   -.011         .032
X4           .021    .011                  .032                        1.941    .055   .000          .043
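The unstandardized coefficients define the fitted equation X5 = -0.278 + 0.505 X1 + 0.518 X2 + 0.011 X3 + 0.021 X4. As an illustration, the sketch below applies it to the first three respondents of the data snapshot and compares the prediction with the observed X5 (`predict_x5` is an illustrative name; the coefficients are the rounded values from the output, so predictions are approximate):

```python
# Unstandardized coefficients (B column) from the regression output.
coef = {"const": -0.278, "X1": 0.505, "X2": 0.518, "X3": 0.011, "X4": 0.021}


def predict_x5(x1, x2, x3, x4):
    """Predicted OverallService (X5) from the fitted linear equation."""
    return (coef["const"] + coef["X1"] * x1 + coef["X2"] * x2
            + coef["X3"] * x3 + coef["X4"] * x4)


# (ID, X1, X2, X3, X4, observed X5) for the first three snapshot rows.
rows = [
    (1, 4.1, 0.6, 6.9, 4.7, 2.4),
    (2, 1.8, 3.0, 6.3, 6.6, 2.5),
    (3, 3.4, 5.2, 5.7, 6.0, 4.3),
]

for rid, x1, x2, x3, x4, observed in rows:
    predicted = predict_x5(x1, x2, x3, x4)
    print(f"ID {rid}: predicted {predicted:.2f}, observed {observed}, "
          f"residual {observed - predicted:+.2f}")
```

X1 and X2 carry nearly all of the prediction; the intervals for X3 and X4 reach zero, in line with their Sig. values above .05.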

Page 74: Rodar

Test for association – Contingency Table

X8 * X14 Crosstabulation

Count

             X14
             A    B    C    Total
X8   Small   10   16   34   60
     Large   24   16   0    40
Total        34   32   34   100

Page 75: Rodar

Chi Square Test for association

Chi-Square Tests

                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             37.255a   2    .000
Continuity Correction
Likelihood Ratio               49.047    2    .000
Linear-by-Linear Association   34.941    1    .000
N of Valid Cases               100

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 12.80.
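The Pearson chi-square value and the minimum expected count in the footnote can be recomputed from the X8 * X14 crosstabulation. A minimal sketch in plain Python, with expected counts taken as row total times column total divided by N (the name `chi_square` is illustrative):

```python
# Observed counts from the X8 * X14 crosstabulation (columns A, B, C).
observed = {"Small": [10, 16, 34], "Large": [24, 16, 0]}


def chi_square(table):
    """Pearson chi-square statistic, degrees of freedom and expected counts."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(row_totals)
    # Expected count under independence for every cell.
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(rows, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, df, expected


stat, df, expected = chi_square(observed)
min_expected = min(min(row) for row in expected)
print(f"chi-square = {stat:.3f}, df = {df}, minimum expected count = {min_expected:.2f}")
```

The empty cell (Large, C) contributes the largest single term, so the association is driven mainly by the absence of large firms in category C.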

?

Page 76: Rodar

My RODAR

• Conceptualize Your Research

• Know Your Objectives for the analysis

• Be familiar with all aspects of Your Data

• Develop Your Analysis plan suited to the research questions, and stay true to it

• Be aware of statistical software that can best reflect Your Results

• Master your analyses

• Be open to additional investigations
– Caution: mind the limitations of the data and the programs you are using

Page 77: Rodar

THANK YOU !

DR. [email protected]

9952726863

Happy Computing

Questions Comments…