Business Statistics - QBM117
Scatter diagrams and measures of association
Objectives
To briefly introduce the topic of regression and correlation.
To explore relationships between two variables using the graphical technique of scatter diagrams.
To introduce two measures of association which can be used to measure the amount of association between two variables.
Regression and correlation: measuring and predicting relationships
Regression and correlation show us how to summarise the relationship between two factors, based on a bivariate (two-variable) set of data.
Correlation is a measure of the strength of the relationship between the two variables;
Regression helps us to predict one variable from the other.
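As a minimal sketch of what "predicting one variable from the other" can look like, the following fits a least-squares straight line to paired data and uses it to predict a new value. The function name fit_line and the advertising and sales figures are hypothetical, chosen only for illustration; they do not come from the lecture.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares of x about the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: advertising spend and sales, both in $000
advertising = [1, 2, 3, 4, 5]
sales = [12, 14, 16, 18, 20]  # constructed so that sales = 10 + 2 * advertising

intercept, slope = fit_line(advertising, sales)

# Predict sales for an advertising spend of 6 (in $000)
predicted = intercept + slope * 6
print(intercept, slope, predicted)
```

Because the hypothetical data lie exactly on a line, the fitted line reproduces it; with real data the points scatter around the line and the prediction is only an estimate.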
In earlier modules we learnt to look at data, compute and interpret probabilities, draw random samples and perform statistical inference. Now we apply these concepts to explore relationships between several variables.
In our earlier studies we learnt to summarise univariate (single-variable) data using statistical summaries such as the mean to describe the centre and the standard deviation to describe the variability.
With bivariate data we could use these same statistics to summarise each variable separately; however, the payoff comes from studying them both together, to explore the relationship between them.
Economists and business operators are often interested in relationships between two quantitative variables.
Exploring relationships using scatterplots
For example
How does advertising affect sales in my business?
If I increase the price on this product, what effect will this have on demand?
What effect are inflation rates having on unemployment rates, on the price of petrol, on the price of new homes, and so on?
Exploring relationships using scatterplots and correlations
Scatterplots provide useful insights into the structure of the data, such as:
• Is the relationship between the two variables linear or nonlinear?
• Are there any outliers in the data?
• What is the strength of the relationship between the two variables?
Correlation is a summary measure of the strength of the relationship. It is both helpful and limited.
If the scatterplot shows either a well behaved linear relationship or no relationship at all, then the correlation provides an excellent summary of the relationship;
If, however, there are problems with the data, such as a nonlinear relationship or outliers, the correlation can be misleading.
Therefore correlation on its own has limited use as its interpretation depends on the type of relationship in the data.
The Scatterplot
is simply a plot of all the data.
If one variable is seen as causing, affecting, or influencing the other, then it is plotted on the x (horizontal) axis. This variable is referred to as the independent variable. The variable that is affected or influenced by the other, is plotted on the y (vertical) axis. This variable is referred to as the dependent variable.
If neither causes, affects or influences the other, it does not matter which one is plotted where.
Correlation measures the strength of the relationship between the two variables
Correlation, denoted ρ (rho) for a population and r for a sample, varies from –1 to +1, summarising the strength of the relationship in the data.
A correlation of 1 indicates a perfect straight-line relationship, with higher values of one variable associated with perfectly predictable higher values of the other variable.
A correlation of –1 indicates a perfect inverse straight-line relationship, with one variable decreasing as the other increases.
For correlations between –1 and 1, the size of the correlation indicates the strength of the relationship while the sign (+ or -) indicates the direction (increasing or decreasing).
A correlation of 0 indicates no linear relationship. Often this means just randomness, but a strong nonlinear relationship can also give a correlation near 0.
Correlations must be interpreted with caution as nonlinear structures and outliers can distort the usual interpretation.
Correlation measures how close the data points are to being exactly on a tilted straight line. It has nothing to do with the steepness (slope) of the line.
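The point that correlation reflects closeness to a line, not its steepness, can be checked numerically. The sketch below computes the sample correlation r as the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. The helper name pearson_r and the data are hypothetical, chosen only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: three exact straight lines with very different slopes
x = [1, 2, 3, 4, 5]
steep = [10 * v for v in x]        # slope 10
shallow = [0.1 * v for v in x]     # slope 0.1
falling = [-3 * v + 7 for v in x]  # slope -3

r_steep = pearson_r(x, steep)
r_shallow = pearson_r(x, shallow)
r_falling = pearson_r(x, falling)
print(r_steep, r_shallow, r_falling)
```

Both rising lines give r of essentially 1 (up to floating-point rounding) despite a hundredfold difference in slope, and the falling line gives r of essentially –1: only the direction of the tilt and the tightness of the fit matter.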
Interpreting Correlation
• r = 1: a perfect straight line tilting up to the right
• r = 0: no overall tilt; no relationship?
• r = –1: a perfect straight line tilting down to the right
[Figure: scatterplot panels, each with axes X and Y]
Various types of relationships
A linear relationship is observed when
the scatterplot shows points bunched randomly around a straight line.
The points could be tightly bunched, falling almost exactly on a line, or more likely, they will be well scattered, forming a ‘cloud’ of points.
Example: Exploring TV Ratings
People Meters vs. Nielsen Index
• Two measures of the market share of 10 TV shows
• Correlation is r = 0.974
• Very strong positive association (since r is close to 1)
• Linear relationship: straight line with scatter
• Increasing relationship: tilts up and to the right
[Scatterplot: People Meters (vertical axis) vs. Nielsen Index (horizontal axis)]
Example: Merger Deals
Dollars vs. Deals
• For mergers and acquisitions by investment bankers
• 134 deals worth $63 billion by Goldman Sachs
• Correlation is r = 0.790
• Strong positive association
• Linear relationship: straight line with scatter
• Increasing relationship: tilts up and to the right
[Scatterplot: Dollars (Billions) (vertical axis) vs. Deals (horizontal axis)]
Example: Mortgage Rates & Fees
Interest Rate vs. Loan Fee
• For mortgages
• If the interest rate is lower, does the bank make it up with a higher loan fee?
• Correlation is r = –0.654
• Negative association
• Linear relationship: straight line with scatter
• Decreasing relationship: tilts down and to the right
[Scatterplot: Interest rate (vertical axis) vs. Loan fee (horizontal axis)]
Various types of relationships
No relationship is observed when
the scatterplot shows a random scatter of points with no tilt either upward or downward.
The points could look like a ‘cloud’ of points that is either circular or oval shaped.
The oval could be either up and down or left and right but it is not tilted (as you move from left to right).
Example: The Stock Market
Today’s vs. Yesterday’s Percent Change
• Is there momentum?
• If the market was up yesterday, is it more likely to be up today? Or is each day’s performance independent?
• Correlation is r = 0.11
• A weak relationship?
• No relationship? Tilt is neither up nor down
[Scatterplot: Today's change (vertical axis) vs. Yesterday's change (horizontal axis)]
Various types of relationships
A nonlinear relationship is observed when
the scatterplot shows points bunched around a curve, rather than a straight line.
Correlation and regression analysis must be used with care on nonlinear data sets.
For most problems we first transform one or both of the variables to obtain a linear relationship, and then fit a regression.
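A minimal sketch of the transform-then-fit idea: when one variable grows exponentially with the other, taking logarithms of that variable straightens the relationship, and the correlation moves towards 1. The helper pearson_r and the data are hypothetical, and the logarithm is only one possible choice of transformation.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical curved data: y doubles each time x increases by 1
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2 ** v for v in x]

r_raw = pearson_r(x, y)          # correlation on the original, curved scale

log_y = [math.log(v) for v in y]  # transform y so the relationship becomes linear
r_log = pearson_r(x, log_y)       # correlation after the transformation

print(r_raw, r_log)
```

On the raw scale r is below 1 because the points follow a curve; after the log transformation the points lie on a straight line, so a straight-line regression of log(y) on x is appropriate.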
Example: Stock Options
Call Price vs. Strike Price
• For stock options
• “Call Price” is the price of the option contract to buy stock at the “Strike Price”
• The right to buy at a lower strike price has more value
• A nonlinear relationship: not a straight line, a curved relationship
• Correlation r = –0.895
• A negative relationship: higher strike price goes with lower call price
[Scatterplot: Call Price (vertical axis) vs. Strike Price (horizontal axis)]
Example: Maximizing Yield
Output Yield vs. Temperature
• For an industrial process
• With a “best” optimal temperature setting
• A nonlinear relationship: not a straight line, a curved relationship
• Correlation r = –0.0155
• r suggests no relationship
• But relationship is strong
• It tilts neither up nor down
[Scatterplot: Yield of process (vertical axis) vs. Temperature (horizontal axis)]
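The yield example shows that a correlation near 0 does not rule out a strong relationship. The sketch below constructs a perfectly predictable curved relationship with a peak in the middle and shows that r is 0. The temperatures, yields and the helper pearson_r are hypothetical; they are not the data behind the slide.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical process: yield peaks at a temperature of 700 and falls off
# symmetrically on either side (an exact quadratic, with no random scatter)
temperature = [500, 600, 700, 800, 900]
process_yield = [160 - 10 * ((t - 700) / 100) ** 2 for t in temperature]

r = pearson_r(temperature, process_yield)
print(process_yield, r)
```

The rising left half and the falling right half of the curve cancel, so there is no overall tilt and r is 0 even though yield is completely determined by temperature. This is why the scatterplot should be inspected before the correlation is interpreted.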
Outliers
A data point is an outlier if it does not fit the relationship of the rest of the data.
It can distort statistical summaries and make them very misleading.
Watch out for outliers by looking at the scatterplot. If you can justify removing an outlier (by finding that it should not have been there), then do so.
If you have to leave it in, be aware of the problems it can cause and consider reporting statistical summaries (e.g. the correlation coefficient) both with and without it.
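Reporting the correlation both with and without a suspected outlier is straightforward to do. The sketch below uses hypothetical cost and production figures (not the data from the lecture example) in which a single high-cost, low-quantity point reverses the sign of r. The helper pearson_r is also illustrative.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical (number produced, cost) pairs; the first pair is the outlier:
# very high cost for very few units
data = [
    (10, 10000),
    (20, 3100), (25, 3250), (30, 3700), (35, 3850),
    (40, 4300), (45, 4450), (50, 4900),
]

number_all = [n for n, _ in data]
cost_all = [c for _, c in data]
r_with = pearson_r(number_all, cost_all)

# Same summary with the outlier (the first pair) left out
number_clean = number_all[1:]
cost_clean = cost_all[1:]
r_without = pearson_r(number_clean, cost_clean)

print(f"with outlier: r = {r_with:.3f}; without outlier: r = {r_without:.3f}")
```

With the outlier included the correlation is negative, which contradicts the pattern in the remaining points; without it the correlation is strongly positive.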
Example: Cost and Quantity
Cost vs. Number Produced
• For a production facility
• It usually costs more to produce more
• An outlier is visible
• A disaster (a fire at the factory)
• High cost, but few produced
• With the outlier: r = –0.623
• Outlier removed (more details visible): r = 0.869
[Two scatterplots: Cost (vertical axis) vs. Number produced (horizontal axis), with and without the outlier]
Reading for next lecture
Read Chapter 18 Sections 18.1 - 18.3
(Chapter 11 Sections 11.1 – 11.3 abridged)