Correlation and Causality
Posted on 22-Feb-2016
This Week:
Testing relationships between two metric variables: Correlation
Testing relationships between two nominal variables: Chi-Squared

Correlation and Causality
Correlation is a precondition for causality, but by itself it is not sufficient to show causality.
Highly correlated variables help us predict what one variable will be based on the other, but this ability has nothing to do with establishing which one causes the other (if either does).

Conditions of Causality
Covariation/Correlation
No confounding covariates
Logical time ordering
Mechanism to explain how X causes Y

Spurious relationships: home power consumption appears to cause ice cream sales to go up, when in fact a confounder (hot weather) drives both.

Causation and Causal Paths
Direct causal paths (X → Y)
Indirect causation (X → Z → Y)
Reciprocal causation (X → Y and Y → X) is not especially common, but it shows up when the time-ordering issue (earlier slide) is not clear. For example, consider the effect of government foreign policy (X) on government popularity (Y): it is entirely reasonable to believe that X → Y as well as Y → X.

Just having more data does not address the problem of causality. There are pundits in the fields of big data and data science who argue that having enough data solves the problem of causality because we are collecting so much information.
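A minimal simulation of the spurious relationship described above (all variable names and numbers are illustrative, not real data): hot weather drives both home power consumption and ice cream sales, so the two come out strongly correlated even though neither causes the other.

```python
import random

random.seed(0)

# Illustrative confounder: daily temperature drives both variables.
temps = [random.uniform(10, 35) for _ in range(200)]
power = [2.0 * t + random.gauss(0, 5) for t in temps]   # home power use (toy units)
sales = [1.5 * t + random.gauss(0, 5) for t in temps]   # ice cream sales (toy units)

def z_scores(values):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5  # population SD
    return [(v - mean) / sd for v in values]

def pearson_r(x, y):
    return sum(a * b for a, b in zip(z_scores(x), z_scores(y))) / len(x)

# High correlation, but the causal path runs through temperature, not
# from power consumption to ice cream sales.
print(round(pearson_r(power, sales), 2))
```

No amount of additional data from the same two variables would reveal the confounder; only a model of the mechanism does.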
Correlations are a way of catching a scientist's attention. Models and mechanisms that explain relationships are how we make the predictions that not only advance science, but generate practical applications.
Three ways to summarize this data: the point of averages, the horizontal SD, and the vertical SD.

The z-score (review)
Standardizing a group of scores changes the scale to one of standard deviation units.
Allows for comparisons with scores that were originally on a different scale.
Tells us where a score is located within a distribution: specifically, how many standard deviation units the score is above or below the mean.
The z-score is important for thinking about individual scores on any distribution: z = (observed value - mean) / standard deviation.

Correlation: Checking for simple linear relationships using standard units (z-scores)
Pearson's correlation coefficient measures the extent to which two metric or interval-type variables are linearly related. The statistic is Pearson's r.
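The standardization described above can be sketched in a few lines (the `z_scores` helper name is mine; population SD is assumed, matching the averaging convention used for the correlation coefficient):

```python
def z_scores(values):
    """Standardize a list: (observed value - mean) / standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5  # population SD
    return [(v - mean) / sd for v in values]

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # mean = 5, SD = 2
print(z_scores(scores)[-1])         # 2.0: the score 9 sits 2 SDs above the mean
```

Because the result is in SD units, scores from tests with different scales become directly comparable.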
The correlation coefficient is the average of the products of the corresponding z-scores.

The correlation coefficient
r = average of (x in standard units) × (y in standard units)
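This definition translates directly into code (a sketch; `pearson_r` is an illustrative name, and population SD is assumed so that the average of products gives exactly r):

```python
def z_scores(values):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5  # population SD
    return [(v - mean) / sd for v in values]

def pearson_r(x, y):
    # r = average of (x in standard units) * (y in standard units)
    return sum(zx * zy for zx, zy in zip(z_scores(x), z_scores(y))) / len(x)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]               # y is a perfect linear function of x
print(round(pearson_r(x, y), 6))   # 1.0
```

When y is a perfectly linear, increasing function of x, the two sets of z-scores coincide, and the average of z² is 1 by construction.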
The correlation coefficient is a pure number without units. It ranges from -1 to 1; zero means no correlation. Thus, the correlation coefficient is not affected by:
A change in scale
Interchanging the variables
Adding the same number to one of the variables
Multiplying all values of one variable by a positive number

Correlations
Ranges from -1 to 1, where ±1 = a perfect linear relationship between the two variables and 0 = no linear relationship between the variables.
Positive relations
Negative relations
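These invariance properties can be checked directly. A sketch (the helper definitions and data are illustrative):

```python
def z_scores(values):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5  # population SD
    return [(v - mean) / sd for v in values]

def pearson_r(x, y):
    return sum(zx * zy for zx, zy in zip(z_scores(x), z_scores(y))) / len(x)

x = [1, 2, 3, 4, 5]
y = [3, 2, 5, 4, 6]
r = pearson_r(x, y)

# Shifting, positively rescaling, or interchanging the variables leaves r unchanged:
assert abs(pearson_r([v + 100 for v in x], y) - r) < 1e-9   # add a constant
assert abs(pearson_r([3 * v for v in x], y) - r) < 1e-9     # positive rescale
assert abs(pearson_r(y, x) - r) < 1e-9                      # interchange
print(round(r, 3))                                          # 0.8
```

The invariance follows because z-scores themselves are unchanged by shifting or positive rescaling.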
The correlation coefficient is a measure of clustering around a single line, relative to the standard deviations.
Remember: correlation ONLY measures linear relationships, not all relationships. This is why the scatterplot is essential; you can get a coefficient even when there is not a good linear fit.
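A small example of the linearity caveat (helper definitions are illustrative): y is completely determined by x, yet r is essentially zero because the relationship is quadratic, not linear.

```python
def z_scores(values):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5  # population SD
    return [(v - mean) / sd for v in values]

def pearson_r(x, y):
    return sum(zx * zy for zx, zy in zip(z_scores(x), z_scores(y))) / len(x)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]       # perfect quadratic dependence, zero linear trend
print(abs(pearson_r(x, y)) < 1e-9)   # True: r is ~0 despite total dependence
```

A scatterplot of these points would show the parabola immediately; the coefficient alone would suggest, wrongly, that the variables are unrelated.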
Interpretation
Correlation is a proportional measure; it does not depend on the specific units of measurement.
Correlation interpretation:
Direction (+/-)
Magnitude of effect (-1 to 1), shown as r; a horizontal line of points means no correlation.
Statistical significance (p