Download - Rodar

Dr. M.Subbiah M Sc, M Phil, Ph D, PGDOR, BPSM.

Purpose

The simpler explanation is the preferable one (Occam’s Razor)

• To present some key ideas from statistics that have proven to be useful tools for data analysis.

• To introduce specific statistical techniques and ideas

• These tools are shown by demonstration, rather than through mathematical / rigorous proofs

D2D Quick Steps

Statistics

• Quite similar techniques
• Commonly use many of the same techniques
• Statistical software includes most of the tools
• In the past,
– Data collected by hand in notebooks
– Sometimes mistakenly recorded
– Not too much data
• But still,
– Worth applying even today (and in the future too)
– Applicable to virtually any area, from Agriculture to Psychology to Astronomy to Business

Computational Power: Allure of Statistics

• Advent of computing power simplified analysis to a greater extent

• Helps to make sense of large quantities of data that are beyond the ability to handle in raw format

• Statistical software provides
– Computationally feasible algorithms
– Little or no human intervention
– Platform to handle large volume of data
• Need is,
– Appropriate tools and method of interpretation
– Passion to expand knowledge horizon
– Interactive learning experience

Computer Software: Inevitable today!

• Ample number of built in functions
• Able to handle different data types
• Amenable to have data transformation / Recoding
• Achievable Speed / Accuracy
• Collaborative platforms
• Possible to write own functions

Articulation

• Objective of the study
• Data availability
• Data Upside / Downside
• Usefulness of variables
• Exploring Relations - Three eyes of Analysis
– Within Data
– Through Data
– Beyond Data

Two Illustrations

3333 cases 21 variables

Variables’ Nature

Variable         Nature     Variable         Nature
State            Nominal    Eve Mins         Metric
Account Length   Nominal    Eve Calls        Count
Area Code        Nominal    Eve Charge       Metric
Phone            Nominal    Night Mins       Metric
Int'l Plan       Binary     Night Calls      Count
VMail Plan       Binary     Night Charge     Metric
VMail Message    Count      Intl Mins        Metric
Day Mins         Metric     Intl Calls       Metric
Day Calls        Count      Intl Charge      Metric
Day Charge       Metric     CustServ Calls   Polytomous
Churn            Binary

[Figure: Day Mins, Day Calls, Eve Mins, Eve Calls, Churn]

Does this bring out all the explicit relationships between the variables?

Our Experience

• Educational Domain
• Customer Behavior – Who is customer
• Parlance from marketing research
• Collected over 3 months
• Cleaning / organizing 1 month
• Setting Objectives 1 month
– Domain Expert + Data Analyst

Description

• Geographical location
• Perceived risks
• Ability
• Parental, teacher and peer encouragement
• Socioeconomic status
• Significant others involvement
• Personal involvement
• Optimum stimulation level
• Perceived knowledge
• Self-confidence
• Motivation
• Aspiration
• Value for education
• Need for cognition
• Perceived benefits
• College attributes
• Perceived costs
• Types of sources
• Time pressure

External information search

Attitude towards search

Extent of search

Depth of search

Importance of search

[Figure: Perceived knowledge, Self-confidence, Value for education, Perceived benefits, Extent of search]

Does this bring out all the explicit relationships between the variables?

Recall - Articulation

• Objective of the study
• Data availability
• Data Upside / Downside
• Usefulness of variables
• Exploring Relations - Three eyes of Analysis
– Within Data
– Through Data
– Beyond Data

Data

• Information

– any fact assumed to be a matter of direct observation.

– a series of observations, measurements, or facts

• Numbers

Data - Example

Data Classification

• Continuous
– Metric
– Measurable
• Discrete
– Non Metric
– Count data
– Categorical (2 or more categories)
• Nominal
• Ordinal
• Interval
• Ratio

What is a Variable?

• Simply, something that varies.
• Specifically, variables represent persons or objects that can be manipulated, controlled, or merely measured for the sake of research.

• Variation: How much a variable varies. Those with little variation are called constants.

Independent Variables

• These variables are ones that are more or less controlled.

• Scientists manipulate these variables as they see fit.

• They still vary, but the variation is relatively known or taken into account.

• Often there are many in a given study.

Dependent Variables

• Not controlled or manipulated in any way, but are simply measured or registered.
• Vary in relation to the independent variables
• There can be any number of dependent variables, but usually there is one, to isolate the reason for variation.

Independent vs. Dependent

Independent                  Dependent
Intentionally manipulated    Intentionally left alone
Controlled                   Measured
Vary at known rate           Vary at unknown rate
Cause                        Effect

Number and nature of IV and DV are matters of concern

Example: What affects a student’s arrival to class?

• Type of School – Arts vs. Science vs. Engineering
• Type of Student – Athlete? Gender? GPA?
• Time – Bedtime, Waking, Arrival
• Mode of Transportation
• Parents’ Education
• Parents’ Occupation
• Students’ Motivation
• …

Within Data

WHY?

• Precise presentation
• Understand the features of the data
• Set the analysis plan
• Test assumptions for statistical inference

Within Data

• Numeric Form
– Summary Statistics
– Tables
• Pictorial Form
– Charts
– Figures
• Useful for Quicker Inference
• Tool for eyeball judgment
• Methods differ as per type of data

Within Data

• For Metric Data
– Box Plot
– Normal Plot (Advanced Users)
– Histogram

Within Data

Experience                                        Statistic
Mean                                              15.5200
95% Confidence Interval for Mean, Lower Bound     14.6464
95% Confidence Interval for Mean, Upper Bound     16.3936
Median                                            15.0000
Variance                                          39.256
Std. Deviation                                    6.26545
Minimum                                           5.00
Maximum                                           29.00
Range                                             24.00
Interquartile Range                               11.00
Skewness                                          .061
Kurtosis                                          -1.334

Within Data

Compare with summary

Within Data

• For Non Metric Data
– Bar
– Line
– Pie
– Area

Summary for categorical data

             Frequency   Percent
HINDU        145         72.5
CHRISTIAN    31          15.5
MUSLIM       24          12.0
Total        200         100.0
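The percentages in a frequency table are each count divided by the total. A minimal sketch in Python using the counts from the slide; the helper name `frequency_table` is an assumption made for illustration and is not part of any software shown in the talk.

```python
def frequency_table(counts):
    """Return ({category: (frequency, percent)}, total) for a dict of counts."""
    total = sum(counts.values())
    table = {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}
    return table, total


# Counts taken from the slide
counts = {"HINDU": 145, "CHRISTIAN": 31, "MUSLIM": 24}
table, total = frequency_table(counts)

for cat, (n, pct) in table.items():
    print(f"{cat:<10} {n:>5} {pct:>7.1f}")
print(f"{'Total':<10} {total:>5} {100.0:>7.1f}")
```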

Bivariate Case (X by Y)

               Metric                        Categorical
Metric         Scatter plots, Correlation
Categorical    Multiple box plots            Cross tabulation, Multiple bar

Scatter Plot

Scatter Matrix

Correlations

              Experience   Age     NT      AD
Experience    1            .926    .309    .337
Age           .926         1       .370    .429
NT            .309         .370    1       .665
AD            .337         .429    .665    1
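Each entry in a correlation matrix is a Pearson correlation coefficient between two variables. The sketch below computes it from first principles with the Python standard library. The `experience` and `age` values are made-up illustrative numbers, not the data behind the slide, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Illustrative values only (assumed for this example)
experience = [5, 8, 12, 15, 20, 29]
age = [27, 30, 36, 38, 45, 55]
r = pearson_r(experience, age)
print(round(r, 3))
```

A value near +1 or -1 indicates a strong linear relation; a value near 0 indicates little linear relation, which does not rule out a non-linear one.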

GENDER * RELIGION CROSS TABULATION

Count

                   RELIGION
GENDER    HINDU   CHRISTIAN   MUSLIM   Total
Male      40      16          8        64
Female    105     15          16       136
Total     145     31          24       200

AGE by GENDER

Statistic                                        Male      Female
Mean                                             25.44     26.22
95% Confidence Interval for Mean, Lower Bound    24.10     25.14
95% Confidence Interval for Mean, Upper Bound    26.77     27.30
5% Trimmed Mean                                  25.12     25.85
Median                                           24.00     24.00
Variance                                         28.567    40.662
Std. Deviation                                   5.345     6.377
Minimum                                          19        19
Maximum                                          39        42
Range                                            20        23
Interquartile Range                              9         10
Skewness                                         .735      .832
Kurtosis                                         -.424     -.482

GENDER * RELIGION * ECONOMIC STATUS CROSS TABULATION

Count

                                      RELIGION
ECONOMIC STATUS          GENDER   HINDU   CHRISTIAN   MUSLIM   Total
LESS THAN RS 10,000      Male     1       1           1        3
                         Female   2       1           1        4
                         Total    3       2           2        7
RS 10,000 – RS 20,000    Male     10      2           2        14
                         Female   15      3           2        20
                         Total    25      5           4        34
MORE THAN RS 20,000      Male     29      13          5        47
                         Female   88      11          13       112
                         Total    117     24          18       159
Total                    Male     40      16          8        64
                         Female   105     15          16       136
                         Total    145     31          24       200

Tri variable Case

?

Through Data

• WHAT ?

– Making generalization

– Inference about Population

• Beware of Uncertainty involved

– Testing Your Research Hypotheses

– Check Statistical Significance

Through Data

• Estimation

– Point

– Interval

• Hypotheses testing – Binary Decisions

• Tests for Significance – P Value

Through Data

• Point Estimation
– Any suitable summary statistic
– Support with measure of dispersion
• Interval Estimation
– Elaborates the knowledge about uncertainty
– Mainly based on “long run” assumptions
– Recently finds extensive usage
– Alternate way to test hypotheses

Danger of Partial Point Estimation

             N     Mean
Student 1    10    50
Student 2    10    50
Student 3    10    50

Danger of Partial Point Estimation: Actual Scores

Student 1   Student 2   Student 3
50          46          126
50          54          0
50          49          0
50          51          12
50          38          7
50          62          0
50          44          180
50          60          15
50          55          150
50          41          10

Better Portrayal

N Mean SD

Student 1 10 50 .00000

Student 2 10 50 7.92

Student 3 10 50 71.72
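The three students share the same mean but differ widely in spread. A minimal sketch with Python's standard `statistics` module, using the ten scores per student from the slide; `stdev` is the sample standard deviation (n - 1 in the denominator), which is what reproduces the SD column.

```python
from statistics import mean, stdev

# Actual scores from the slide, ten per student
scores = {
    "Student 1": [50] * 10,
    "Student 2": [46, 54, 49, 51, 38, 62, 44, 60, 55, 41],
    "Student 3": [126, 0, 0, 12, 7, 0, 180, 15, 150, 10],
}

# (N, mean, sample SD) per student
summary = {name: (len(v), mean(v), stdev(v)) for name, v in scores.items()}

for name, (n, m, sd) in summary.items():
    print(f"{name}  N={n}  Mean={m:.2f}  SD={sd:.2f}")
```

Reporting the mean alone would make the three students look identical; the SD is what separates them.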

Interval Estimation

• Mostly based on a point estimation

• Together with SD (termed as SE – Standard Error)

• Provides a better scope to project uncertainty

• Understand the extent of population values

• Direction of population values

Interval Estimation

                                                    95% Confidence Interval of the Difference
Sig     Mean Difference   Std. Error Difference     Lower        Upper
.494    .62649            .91396                    -1.17587     2.42884

                                                    95% Confidence Interval of the Difference
Sig     Mean Difference   Std. Error Difference     Lower        Upper
.004    .12346            .042802                   .039486      .207439
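The two outputs above illustrate interval estimation as an alternate way to test a hypothesis: a 95% interval for a difference that excludes zero corresponds to significance at the .05 level. A small sketch using the reported values; `ci_excludes_zero` is a hypothetical helper name and the .05 threshold is an assumption matching the 95% level.

```python
def ci_excludes_zero(lower, upper):
    """True when the whole interval lies on one side of zero."""
    return lower > 0 or upper < 0


# Values copied from the two outputs
results = [
    {"sig": 0.494, "mean_diff": 0.62649, "lower": -1.17587, "upper": 2.42884},
    {"sig": 0.004, "mean_diff": 0.12346, "lower": 0.039486, "upper": 0.207439},
]

ALPHA = 0.05  # assumed threshold, matching the 95% interval
for r in results:
    by_interval = ci_excludes_zero(r["lower"], r["upper"])
    by_p_value = r["sig"] <= ALPHA
    print(f"diff={r['mean_diff']}: interval says significant={by_interval}, "
          f"p-value says significant={by_p_value}")
```

Both routes agree in each case, and the interval additionally shows the direction and plausible size of the difference.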

NHST and more…

• T-tests

• Analysis of Variance (ANOVA)

• Chi Square tests

• Tests for proportions

• Regression – Beyond Data

• Data Reduction

– Principal Components

– Factor Analysis

• Segmentation

– Cluster

– Discriminant

• Hierarchical Linear Modeling (HLM)

Statistical Tests Overview

• Two hypotheses are evaluated: the null (H0) and the alternative (H1)

• The amount of evidence required to “prove” the alternative may be stated in terms of a confidence level

• We Don’t “Accept” the Null Hypothesis

Statistical Tests Overview

• Reject the null hypothesis (p-value <= α) and conclude that the alternative hypothesis is true at the pre-determined confidence level

• Fail to reject the null hypothesis (p-value > α) and conclude that there is not enough evidence to state that the alternative is true at the pre-determined confidence level

Summarizing the P value

• Once a threshold P value for statistical significance is set, every result is either statistically significant or is not statistically significant

P value          Wording                  Summary
> 0.05           Not significant          ns
0.01 to 0.05     Significant              *
0.001 to 0.01    Very significant         **
< 0.001          Extremely significant    ***

Actual P value could be reported.
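The wording table can be written as a small lookup. This is a sketch only: `summarize_p` is a hypothetical helper, and the handling of values that fall exactly on a boundary (0.05, 0.01, 0.001) is an assumption, since the table does not specify it.

```python
def summarize_p(p):
    """Map a P value to the (wording, summary symbol) used in the table."""
    if not 0 <= p <= 1:
        raise ValueError("a P value must lie between 0 and 1")
    if p > 0.05:
        return ("Not significant", "ns")
    if p >= 0.01:          # assumption: 0.01 itself is placed in this band
        return ("Significant", "*")
    if p >= 0.001:         # assumption: 0.001 itself is placed in this band
        return ("Very significant", "**")
    return ("Extremely significant", "***")


for p in (0.494, 0.03, 0.004, 0.0005):
    wording, symbol = summarize_p(p)
    print(f"P = {p}: {wording} ({symbol})")
```

As the slide notes, the actual P value can be reported alongside the symbol rather than replaced by it.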

Choosing a statistical test

Data types: Metric (Normal Population) / Rank, Score, or Measurement (Non-Normal) / Binomial (Two Possible Outcomes)

• Describe one group
– Metric: Mean, SD
– Rank, Score, or Measurement: Median, interquartile range
– Binomial: Proportion
• Compare one group to a hypothetical value
– Metric: One-sample t test
– Rank, Score, or Measurement: Wilcoxon test
– Binomial: Chi-square or Binomial test
• Compare two unpaired groups
– Metric: Unpaired t test
– Rank, Score, or Measurement: Mann-Whitney test
– Binomial: Fisher's test (chi-square for large samples)
• Compare two paired groups
– Metric: Paired t test
– Rank, Score, or Measurement: Wilcoxon test
– Binomial: McNemar's test
• Compare three or more unmatched groups
– Metric: One-way ANOVA
– Rank, Score, or Measurement: Kruskal-Wallis test
– Binomial: Chi-square test
• Quantify association between two variables
– Metric: Pearson correlation
– Rank, Score, or Measurement: Spearman correlation
– Binomial: Contingency coefficients
• Predict value from another measured variable
– Metric: Simple linear regression
– Rank, Score, or Measurement: Nonparametric regression
– Binomial: Simple logistic regression
• Predict value from several measured or binomial variables
– Metric: Multiple linear regression
– Binomial: Multiple logistic regression
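The selection guide can be held as a simple lookup keyed by goal and data type. The sketch below covers a few rows only; the key strings and the function name `suggest_test` are assumptions made for illustration, and the test names are those listed in the guide.

```python
# Column order: (metric / normal, rank or non-normal, binomial)
TEST_GUIDE = {
    "compare one group to a hypothetical value":
        ("One-sample t test", "Wilcoxon test", "Chi-square or Binomial test"),
    "compare two unpaired groups":
        ("Unpaired t test", "Mann-Whitney test", "Fisher's test"),
    "compare two paired groups":
        ("Paired t test", "Wilcoxon test", "McNemar's test"),
    "compare three or more unmatched groups":
        ("One-way ANOVA", "Kruskal-Wallis test", "Chi-square test"),
    "quantify association between two variables":
        ("Pearson correlation", "Spearman correlation", "Contingency coefficients"),
}

DATA_TYPES = {"metric": 0, "rank": 1, "binomial": 2}


def suggest_test(goal, data_type):
    """Look up the test listed in the guide for a goal and a data type."""
    return TEST_GUIDE[goal.lower()][DATA_TYPES[data_type.lower()]]


print(suggest_test("Compare two unpaired groups", "metric"))
print(suggest_test("Compare three or more unmatched groups", "rank"))
```

A lookup like this only encodes the guide; checking whether the data actually meet a test's assumptions remains a separate step.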

Quick Case Study

Data – Snap Shot

ID   X1    X2    X3    X4    X5    X6    X7    X8   X9   X10   X11   X12   X13   X14
1    4.1   0.6   6.9   4.7   2.4   2.3   5.2   0    32   4.2   1     0     1     1
2    1.8   3     6.3   6.6   2.5   4     8.4   1    43   4.3   0     1     0     1
3    3.4   5.2   5.7   6     4.3   2.7   8.2   1    48   5.2   0     1     1     2
4    2.7   1     7.1   5.9   1.8   2.3   7.8   1    32   3.9   0     1     1     1
5    6     0.9   9.6   7.8   3.4   4.6   4.5   0    58   6.8   1     0     1     3
6    1.9   3.3   7.9   4.8   2.6   1.9   9.7   1    45   4.4   0     1     1     2
7    4.6   2.4   9.5   6.6   3.5   4.5   7.6   0    46   5.8   1     0     1     1
8    1.3   4.2   6.2   5.1   2.8   2.2   6.9   1    44   4.3   0     1     0     2
9    5.5   1.6   9.4   4.7   3.5   3     7.6   0    63   5.4   1     0     1     3
10   4     3.5   6.5   6     3.7   3.2   8.7   1    54   5.4   0     1     0     2
11   2.4   1.6   8.8   4.8   2     2.8   5.8   0    32   4.3   1     0     0     1
12   3.9   2.2   9.1   4.6   3     2.5   8.3   0    47   5     1     0     1     2
13   2.8   1.4   8.1   3.8   2.1   1.4   6.6   1    39   4.4   0     1     0     1
14   3.7   1.5   8.6   5.7   2.7   3.7   6.7   0    38   5     1     0     1     1
15   4.7   1.3   9.9   6.7   3     2.6   6.8   0    54   5.9   1     0     0     3
16   3.4   2     9.7   4.7   2.7   1.7   4.8   0    49   4.7   1     0     0     3
17   3.2   4.1   5.7   5.1   3.6   2.9   6.2   0    38   4.4   1     1     1     2
18   4.9   1.8   7.7   4.3   3.4   1.5   5.9   0    40   5.6   1     0     0     2

Nature of variable

ID    ID                      Respondents' Identity
X1    DeliverySpeed           Metric
X2    PriceLevel              Metric
X3    PriceFlexibility        Metric
X4    MfrImage                Metric
X5    OverallService          Metric
X6    SalesforceImage         Metric
X7    ProductQuality          Metric
X8    FirmSize                Nonmetric   1 = large, 0 = small
X9    UsageLevel              Metric
X10   SatisfactionLevel       Metric
X11   SpecBuying              Nonmetric   1 = total, 0 = specific
X12   ProcurementStructure    Nonmetric   1 = centre, 0 = decentr
X13   IndustryType            Nonmetric   1 = industry A, 0 = others
X14   BuyingSituationType     Nonmetric   1 = new task, 2 = modified rebuy, 3 = straight rebuy

Sample Plan

• Test a hypothetical value for X1 = 2.5 or 3.5
• Test for a group difference for Firm size X8 with respect to X1 X2 X4
• ANOVA for group difference for Buying situation type X14 with respect to X2
• Regression X1 to X4 (IVs) and X5 (DV)
• Test for Associations

One Sample Test with X1

One-Sample Test (Test Value = 2.5)

                                                      95% Confidence Interval of the Difference
      t       df   Sig. (2-tailed)   Mean Difference   Lower     Upper
X1    7.685   99   .000              1.0150            .7529     1.2771

One-Sample Test (Test Value = 3.5)

                                                      95% Confidence Interval of the Difference
      t       df   Sig. (2-tailed)   Mean Difference   Lower     Upper
X1    .114    99   .910              .0150             -.2471    .2771

Two Sample Test with X8

Independent Samples Test

Levene's Test for Equality of Variances

      F       Sig.
X1    .934    .336
X2    1.582   .211
X4    6.549   .012

t-test for Equality of Means

                                                                                                   95% Confidence Interval of the Difference
                                     t        df       Sig. (2-tailed)   Mean Difference   Std. Error Difference   Lower       Upper
X1   Equal variances assumed         -8.045   98       .000              -1.6917           .21029                  -2.10897    -1.27436
     Equal variances not assumed     -8.074   84.766   .000              -1.6917           .20953                  -2.10828    -1.27506
X2   Equal variances assumed         4.687    98       .000              1.0392            .22171                  .59919      1.47914
     Equal variances not assumed     4.564    75.987   .000              1.0392            .22767                  .58571      1.49262
X4   Equal variances assumed         .374     98       .709              .0867             .23196                  -.37365     .54698
     Equal variances not assumed     .405     97.990   .686              .0867             .21406                  -.33814     .51147
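In the t-test for equality of means, the t statistic is the mean difference divided by its standard error. The sketch below recomputes t for the "equal variances assumed" rows from the reported mean difference and standard error; small discrepancies in the last digit are possible because the inputs are already rounded. `t_statistic` is a hypothetical helper name.

```python
def t_statistic(mean_difference, std_error):
    """t = estimated difference / standard error of the difference."""
    return mean_difference / std_error


# (variable, mean difference, std. error difference, reported t),
# taken from the "equal variances assumed" rows of the output
rows = [
    ("X1", -1.6917, 0.21029, -8.045),
    ("X2", 1.0392, 0.22171, 4.687),
    ("X4", 0.0867, 0.23196, 0.374),
]

recomputed = {name: t_statistic(diff, se) for name, diff, se, _ in rows}
for name, diff, se, reported in rows:
    print(f"{name}: recomputed t = {recomputed[name]:.3f}, reported t = {reported}")
```

X1 and X2 differ clearly between small and large firms, while the X4 difference is small relative to its standard error.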

ANOVA OUTPUTS

Descriptives: X2

                                                        95% Confidence Interval for Mean
         N     Mean     Std. Deviation   Std. Error    Lower Bound   Upper Bound   Minimum   Maximum
A        34    2.0941   .95122           .16313        1.7622        2.4260        .40       3.70
B        32    3.1813   1.36510          .24132        2.6891        3.6734        .70       5.40
C        34    1.8647   .80862           .13868        1.5826        2.1468        .20       4.00
Total    100   2.3640   1.19566          .11957        2.1268        2.6012        .20       5.40

ANOVA: X2

                  Sum of Squares   df   Mean Square   F        Sig.
Between Groups    32.325           2    16.163        14.356   .000
Within Groups     109.205          97   1.126
Total             141.530          99
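The F ratio in the ANOVA table is the between-groups mean square divided by the within-groups mean square, where each mean square is a sum of squares divided by its degrees of freedom. A minimal sketch that rebuilds these quantities from the reported sums of squares; `anova_from_sums` is a hypothetical helper name.

```python
def anova_from_sums(ss_between, df_between, ss_within, df_within):
    """Rebuild mean squares and the F ratio from sums of squares and df."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f": ms_between / ms_within,
        "ss_total": ss_between + ss_within,
        "df_total": df_between + df_within,
    }


# Sums of squares and degrees of freedom from the ANOVA output for X2
anova = anova_from_sums(ss_between=32.325, df_between=2,
                        ss_within=109.205, df_within=97)

print(f"MS between = {anova['ms_between']:.4f}")
print(f"MS within  = {anova['ms_within']:.4f}")
print(f"F          = {anova['f']:.3f}")
```

A large F means the group means differ by more than the within-group scatter would explain, which is why the post hoc comparisons follow.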

POST HOC ANALYSIS

Multiple Comparisons

Dependent Variable: X2

LSD

                                                                 95% Confidence Interval
(I) X14   (J) X14   Mean Difference (I-J)   Std. Error   Sig.    Lower Bound   Upper Bound
A         B         -1.0871*                .26133       .000    -1.6058       -.5685
A         C         .2294                   .25734       .375    -.2813        .7402
B         A         1.0871*                 .26133       .000    .5685         1.6058
B         C         1.3165*                 .26133       .000    .7979         1.8352
C         A         -.2294                  .25734       .375    -.7402        .2813
C         B         -1.3165*                .26133       .000    -1.8352       -.7979

* The mean difference is significant at the .05 level.

Regression Output

Coefficients (a)

              Unstandardized Coefficients   Standardized Coefficients                     95% Confidence Interval for B
Model 1       B        Std. Error           Beta                        t        Sig.    Lower Bound   Upper Bound
(Constant)    -.278    .105                                             -2.653   .009    -.486         -.070
X1            .505     .011                 .888                        47.904   .000    .484          .526
X2            .518     .012                 .824                        44.044   .000    .494          .541
X3            .011     .011                 .019                        .989     .325    -.011         .032
X4            .021     .011                 .032                        1.941    .055    .000          .043

a. Dependent Variable: X5
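The unstandardized coefficients define a prediction equation for X5. As a sketch, the code below applies the reported B values to respondent 1 from the data snapshot (X1 = 4.1, X2 = 0.6, X3 = 6.9, X4 = 4.7, observed X5 = 2.4). The function name `predict_x5` is an assumption, and the coefficients are used as rounded in the output.

```python
# Unstandardized coefficients (B) from the regression output
COEFFICIENTS = {"const": -0.278, "X1": 0.505, "X2": 0.518, "X3": 0.011, "X4": 0.021}


def predict_x5(x1, x2, x3, x4):
    """Predicted X5 = constant + sum of B times the predictor value."""
    return (COEFFICIENTS["const"]
            + COEFFICIENTS["X1"] * x1
            + COEFFICIENTS["X2"] * x2
            + COEFFICIENTS["X3"] * x3
            + COEFFICIENTS["X4"] * x4)


# Respondent 1 from the data snapshot
predicted = predict_x5(x1=4.1, x2=0.6, x3=6.9, x4=4.7)
observed = 2.4
print(f"predicted X5 = {predicted:.4f}, observed X5 = {observed}, "
      f"residual = {observed - predicted:.4f}")
```

X1 and X2 carry almost all of the prediction, consistent with their large t values, while X3 and X4 contribute little.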

Test for association – Contingency Table

X8 * X14 Crosstabulation

Count

                 X14
X8       A     B     C     Total
Small    10    16    34    60
Large    24    16    0     40
Total    34    32    34    100

Chi Square Test for association

Chi-Square Tests

                                Value       df   Asymp. Sig. (2-sided)
Pearson Chi-Square              37.255(a)   2    .000
Continuity Correction
Likelihood Ratio                49.047      2    .000
Linear-by-Linear Association    34.941      1    .000
N of Valid Cases                100

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 12.80.
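The Pearson chi-square compares each observed count with the count expected under independence (row total times column total divided by N). A minimal sketch with the standard library, applied to the X8 by X14 crosstabulation; `pearson_chi_square` is a hypothetical helper name, and the P value is not computed here.

```python
def pearson_chi_square(observed):
    """Return (chi-square, degrees of freedom, expected counts) for a table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / N
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


# X8 (rows: Small, Large) by X14 (columns: A, B, C), counts from the crosstab
observed = [[10, 16, 34],
            [24, 16, 0]]
chi2, df, expected = pearson_chi_square(observed)
min_expected = min(min(row) for row in expected)

print(f"chi-square = {chi2:.3f}, df = {df}, minimum expected count = {min_expected:.2f}")
```

The minimum expected count is worth checking because the chi-square approximation is usually considered unreliable when expected counts fall below 5.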

?

My RODAR

• Conceptualize Your Research
• Know Your Objectives for the analysis
• Be familiar with all aspects of Your Data
• Develop Your Analysis plan suitable for the research questions
– Stay true to that
• Be aware of statistical software that can best reflect Your Results
• Master your analyses
• Have openness to additional investigations
– Caution: Limitations given the data and the programs you are using

THANK YOU !

DR. [email protected]

9952726863

Happy Computing

Questions Comments…
