DATA ANALYSIS - Group 5

Upload: metalkid132
Post on 03-Dec-2014

Page 1: Data analysis

DATA ANALYSIS
Group 5

Page 2: Data analysis

The mean, median and mode

The mean, median and mode are all valid measures of central tendency but, under different conditions, some measures of central tendency become more appropriate to use than others.

Presenter: Huu Loc

Page 3: Data analysis

Mean

The mean (or average) is the most popular and well-known measure of central tendency. It can be used with both discrete and continuous data, although it is used most often with continuous data.

The mean is equal to the sum of all the values in the data set divided by the number of values in the data set.

Page 4: Data analysis

So, if we have n values in a data set and they have values x1, x2, ..., xn, then the sample mean, usually denoted by x̄ (pronounced "x bar"), is:

x̄ = (x1 + x2 + ... + xn) / n

Page 5: Data analysis

The mean is essentially a model of your data set: the single value that best summarizes it, even though it is often not one of the values actually observed.

An important property of the mean is that it includes every value in your data set as part of the calculation. In addition, the mean is the only measure of central tendency where the sum of the deviations of each value from the mean is always zero.
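Both the definition and the zero-sum-of-deviations property can be checked with a short sketch (hypothetical numbers):

```python
def sample_mean(values):
    """The sample mean: sum of all values divided by the number of values."""
    return sum(values) / len(values)

data = [2, 4, 6, 8, 10]            # hypothetical data set
m = sample_mean(data)              # (2+4+6+8+10)/5 = 6.0
deviations = [x - m for x in data]
print(m, sum(deviations))          # the deviations from the mean sum to 0
```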

Page 6: Data analysis

Median

The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. In order to calculate the median, suppose we have the data below:

We first need to rearrange that data into order of magnitude (smallest first):

Page 7: Data analysis

Our median mark is the middle mark - in this case 56 (highlighted in bold). It is the middle mark because there are 5 scores before it and 5 scores after it.
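A minimal sketch of the median (hypothetical marks; the function also handles an even number of scores by averaging the middle two):

```python
def median(values):
    """Middle value of the sorted data; with an even count, average the two middle values."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]  # hypothetical set of 11 marks
print(median(marks))   # the 6th of the 11 sorted scores -> 56
```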

Page 8: Data analysis

Mode

The mode is the most frequent score in our data set. On a bar chart or histogram it is represented by the highest bar. You can, therefore, sometimes consider the mode as being the most popular option. An example of a mode is presented below:

Page 9: Data analysis
Page 10: Data analysis

Normally, the mode is used for categorical data where we wish to know which is the most common category as illustrated below:

Page 11: Data analysis

One of the problems with the mode is that it is not unique, so it leaves us with problems when we have two or more values that share the highest frequency, such as below:
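A tie for the highest frequency can be detected programmatically; a minimal sketch with hypothetical data:

```python
from collections import Counter

def modes(values):
    """Every value that shares the highest frequency; more than one result means multimodal."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes(["bus", "car", "car", "bike"]))   # unique mode -> ['car']
print(modes([1, 2, 2, 3, 3]))                 # bimodal -> [2, 3]
```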

Page 12: Data analysis

Summary of when to use the mean, median and mode

Use the following summary table to decide which measure of central tendency is most appropriate for each type of variable.

Page 13: Data analysis

MEASURES OF DISPERSION

Presenter: Nguyen Ngoc Cam

Page 14: Data analysis

Measures of Dispersion

Measures of central tendency give us good information about the scores in our distribution. However, we can have very different shapes to our distribution, yet have the same central tendency.

Measures of dispersion or variability give us information about the spread of the scores in our distribution: are the scores clustered close together over a small portion of the scale, or are they spread out over a large segment of the scale?

Page 15: Data analysis

Main points:

1. Range
2. Standard Deviation
3. Variance

Page 16: Data analysis

1. Range

The difference between the biggest and the smallest number in the data of the group.

The range tells you how spread out the data is.

Page 17: Data analysis

1. Range

 

Page 18: Data analysis

1. Range

Problems:

1. It changes drastically with the magnitude of the extreme scores.
2. It is an unstable measure, rarely used for statistical analyses.
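Both the definition and the first problem can be seen in a small sketch (hypothetical scores):

```python
def value_range(values):
    """Difference between the biggest and smallest number in the data."""
    return max(values) - min(values)

scores = [25, 27, 29, 30, 31]       # hypothetical, tightly clustered scores
print(value_range(scores))          # 31 - 25 = 6
print(value_range(scores + [95]))   # one extreme score changes the range drastically: 70
```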

Page 19: Data analysis

2. Standard Deviation

Standard deviation is the most frequently used measure of variability. It looks at the average variability of all the scores around the mean; all the scores are taken into account.

Page 20: Data analysis

2. Standard Deviation

The larger the standard deviation, the more variability from the central point in the distribution.

The smaller the standard deviation, the closer the distribution is to the central point.

Page 21: Data analysis

2. Standard Deviation

 

Page 22: Data analysis

2. Standard Deviation

 

Page 23: Data analysis

2. Standard Deviation

The SD tells us how far from the point of central tendency the individual scores are distributed.

It tells us information that the mean doesn't, and that information is as important as, or even more important than, the mean itself.
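The two cases can be illustrated with a small sketch (hypothetical data; the sample formula with an n - 1 denominator is assumed):

```python
from math import sqrt

def sample_sd(values):
    """Sample standard deviation: squared deviations from the mean,
    averaged with an n - 1 denominator, then square-rooted."""
    n = len(values)
    m = sum(values) / n
    return sqrt(sum((x - m) ** 2 for x in values) / (n - 1))

clustered = [48, 49, 50, 51, 52]    # hypothetical scores close to the mean
spread    = [10, 30, 50, 70, 90]    # same mean, much more variability
print(sample_sd(clustered), sample_sd(spread))   # ≈ 1.58 vs ≈ 31.62
```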

Page 24: Data analysis

3. Variance

 
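Variance is simply the square of the standard deviation; a minimal sketch with hypothetical data:

```python
def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)

data = [10, 30, 50, 70, 90]           # hypothetical scores
print(sample_variance(data))          # 1000.0
print(sample_variance(data) ** 0.5)   # its square root is the standard deviation, ≈ 31.62
```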

Page 25: Data analysis

PAIRED T-TEST

Presenter: Tran Thi Ngan Giang

Page 26: Data analysis

Introduction

• A paired t-test is used to compare two population means where you have two samples in which observations in one sample can be paired with observations in the other sample.

• For example: a diagnostic test was taken before studying a particular module and then again after completing the module. We want to find out if, in general, our teaching leads to improvements in students' knowledge/skills.

Page 27: Data analysis

First, we see the descriptive statistics for both variables.

The post-test mean scores are higher.

Page 28: Data analysis

Next, we see the correlation between the two variables.

There is a strong positive correlation. People who did well on the pre-test also did well on the post-test.

Page 29: Data analysis

Finally, we see the T, degrees of freedom, and significance.

• Our significance is .053.

• If the significance value is less than .05, there is a significant difference. If the significance value is greater than .05, there is no significant difference.

• Here, we see that the significance value is approaching significance, but it is not a significant difference. There is no significant difference between pre- and post-test scores. Our test preparation course did not help!
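As a sketch, the paired t statistic can be computed by hand from the pairwise differences (hypothetical pre/post scores; in practice a statistics package would also report the p-value):

```python
from math import sqrt

def paired_t(pre, post):
    """Paired t statistic: mean of the pairwise differences over its standard error."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    md = sum(diffs) / n                                      # mean difference
    sd = sqrt(sum((d - md) ** 2 for d in diffs) / (n - 1))   # SD of the differences
    return md / (sd / sqrt(n)), n - 1                        # (t, degrees of freedom)

pre  = [12, 15, 11, 14, 13]   # hypothetical pre-test scores
post = [14, 16, 13, 17, 14]   # hypothetical post-test scores for the same students
t, df = paired_t(pre, post)
print(t, df)                  # t ≈ 4.81 with df = 4
```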

Page 30: Data analysis

INDEPENDENT SAMPLES T-TESTS

Presenter: Dinh Quoc Minh Dang

Page 31: Data analysis

Outline

1. Introduction

2. Hypothesis for the independent t-test

3. What do you need to run an independent t-test?

4. Formula

5. Example (Calculating + Reporting)

Page 32: Data analysis

Introduction

The independent t-test, also called the two-sample t-test or Student's t-test, is an inferential statistical test that determines whether there is a statistically significant difference between the means of two unrelated groups.

Page 33: Data analysis

Hypothesis for the independent t-test

The null hypothesis for the independent t-test is that the population means of the two unrelated groups are equal:

H0: μ1 = μ2

In most cases, we are looking to see if we can reject the null hypothesis and accept the alternative hypothesis, which is that the population means are not equal:

HA: μ1 ≠ μ2

To do this we need to set a significance level (alpha) that allows us to either reject or retain the null hypothesis. Most commonly, this value is set at 0.05.

Page 34: Data analysis

What do you need to run an independent t-test?

In order to run an independent t-test you need the following:

1. One independent, categorical variable that has two levels.

2. One dependent variable.

Page 35: Data analysis

Formula

M: mean (the average score of the group)

SD: Standard Deviation

N: number of scores in each group

Exp: Experimental Group

Con: Control Group
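Using the terms above, one common form of the statistic (the pooled-variance Student's t, assumed here since the formula slide is an image) can be sketched as:

```python
from math import sqrt

def independent_t(exp, con):
    """Pooled-variance independent t: difference in group means over its standard error."""
    n1, n2 = len(exp), len(con)
    m1, m2 = sum(exp) / n1, sum(con) / n2
    v1 = sum((x - m1) ** 2 for x in exp) / (n1 - 1)          # variance, experimental group
    v2 = sum((x - m2) ** 2 for x in con) / (n2 - 1)          # variance, control group
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)    # pooled variance
    t = (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2                                    # (t, degrees of freedom)

experimental = [23, 25, 28, 30, 27]   # hypothetical scores, experimental group
control      = [20, 22, 19, 24, 21]   # hypothetical scores, control group
t, df = independent_t(experimental, control)
print(t, df)                          # t ≈ 3.64 with df = 8
```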

Page 36: Data analysis

Formula

Page 37: Data analysis

Example

Page 38: Data analysis

Example

Page 39: Data analysis

Effect Size

• Cohen’s d measures the difference in means in standard deviation units.

• Cohen’s d = (M1 - M2) / SDpooled

• Interpretation:
• Small effect: d = .20 to .50
• Medium effect: d = .50 to .80
• Large effect: d = .80 and higher
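A sketch of Cohen’s d using the pooled standard deviation (hypothetical groups):

```python
from math import sqrt

def cohens_d(g1, g2):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss1 = sum((x - m1) ** 2 for x in g1)
    ss2 = sum((x - m2) ** 2 for x in g2)
    sd_pooled = sqrt((ss1 + ss2) / (n1 + n2 - 2))
    return (m1 - m2) / sd_pooled

d = cohens_d([23, 25, 28, 30, 27], [20, 22, 19, 24, 21])  # hypothetical groups
print(d)   # around 2.3: a large effect by the rule of thumb above
```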

Page 40: Data analysis

Reporting the Result of an Independent T-Test

When reporting the result of an independent t-test, you need to include the t-statistic value, the degrees of freedom (df) and the significance value of the test (p-value). The format of the test result is: t(df) = t-statistic, p = significance value.

Page 41: Data analysis

Example result (APA Style)

• An independent samples t-test is presented the same as the one-sample t-test:

t(75) = 2.11, p = .02 (one-tailed), d = .48

• Example: Survey respondents who were employed by the federal, state, or local government had significantly higher socioeconomic indices (M = 55.42, SD = 19.25) than survey respondents who were employed by a private employer (M = 47.54, SD = 18.94), t(255) = 2.363, p = .01 (one-tailed).

The report thus includes the degrees of freedom, the value of the statistic, its significance, a note if the test is one-tailed, and the effect size if available.

Page 42: Data analysis

Analysis of Variance (ANOVA)

Presenter : Minh Sang

Page 43: Data analysis

Introduction

We already learned about the chi-square test for independence, which is useful for data measured at the nominal or ordinal level of analysis. If we have data measured at the interval level, we can compare two or more population groups in terms of their population means using a technique called analysis of variance, or ANOVA.

Page 44: Data analysis

Completely randomized design

Population 1, Population 2, ..., Population k

Mean = μ1, Mean = μ2, ..., Mean = μk
Variance = σ1², σ2², ..., σk²

We want to know something about how the populations compare. Do they have the same mean? We can collect random samples from each population, which gives us the following data.

Page 45: Data analysis

Completely randomized design

Mean = M1, Mean = M2, ..., Mean = Mk
Variance = s1², s2², ..., sk²
n1 cases, n2 cases, ..., nk cases

Suppose we want to compare 3 college majors in a business school by the average annual income people make 2 years after graduation. We collect the following data (in $1000s) based on random surveys.

Page 46: Data analysis

Completely randomized design

Accounting Marketing Finance

27 23 48

22 36 35

33 27 46

25 44 36

38 39 28

29 32 29

Page 47: Data analysis

Completely randomized design

Can the dean conclude that there are differences among the majors' incomes?

H0: μ1 = μ2 = μ3
HA: not all of μ1, μ2, μ3 are equal

In this problem we must take into account:

1) The variance between samples, or the actual differences by major. This is called the sum of squares for treatment (SST).

Page 48: Data analysis

Completely randomized design

2) The variance within samples, or the variance of incomes within a single major. This is called the sum of squares for error (SSE).

Recall that when we sample, there will always be a chance of getting something different than the population. We account for this through #2, or the SSE.

Page 49: Data analysis

F-Statistic

For this test, we will calculate a F statistic, which is used to compare variances.

F = [SST / (k-1)] / [SSE / (n-k)]

SST = sum of squares for treatment
SSE = sum of squares for error
k = the number of populations
n = total sample size

Page 50: Data analysis

F-statistic

Intuitively, the F statistic is:

F = explained variance / unexplained variance

Explained variance is the difference between majors. Unexplained variance is the difference based on random sampling for each group (see Figure 10-1, page 327).

Page 51: Data analysis

Calculating SST

SST = Σ ni(Mi - M̄)²

M̄ = grand mean = ΣMi / k when samples are of equal size, or in general the sum of all values for all groups divided by the total sample size
Mi = mean for each sample
k = the number of populations

Page 52: Data analysis

Calculating SST

By major

Accounting M1=29, n1=6

Marketing M2=33.5, n2=6

Finance M3=37, n3=6

M̄ = (29 + 33.5 + 37)/3 = 33.17

SST = 6(29 - 33.17)² + 6(33.5 - 33.17)² + 6(37 - 33.17)² = 193

Page 53: Data analysis

Calculating SST

Note that when M1 = M2 = M3, then SST=0 which would support the null hypothesis.

In this example the samples are of equal size, but we can also run this analysis with samples of varying sizes.

Page 54: Data analysis

Calculating SSE

SSE = ΣΣ(Xit - Mi)²

In other words, it is just the sum of squared deviations within each sample, added together:

SSE = Σ(X1t - M1)² + Σ(X2t - M2)² + Σ(X3t - M3)²

SSE = [(27 - 29)² + (22 - 29)² + ... + (29 - 29)²] + [(23 - 33.5)² + (36 - 33.5)² + ...] + [(48 - 37)² + (35 - 37)² + ... + (29 - 37)²]

SSE = 819.5
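The SST, SSE, and F computations above can be replicated for the three-major example (with equal group sizes, the grand mean over all values equals the mean of the group means):

```python
groups = {
    "Accounting": [27, 22, 33, 25, 38, 29],
    "Marketing":  [23, 36, 27, 44, 39, 32],
    "Finance":    [48, 35, 46, 36, 28, 29],
}
k = len(groups)                                           # number of populations
n = sum(len(g) for g in groups.values())                  # total sample size
means = {name: sum(g) / len(g) for name, g in groups.items()}
grand = sum(x for g in groups.values() for x in g) / n    # grand mean

# Between-group variability: sum of squares for treatment
sst = sum(len(g) * (means[name] - grand) ** 2 for name, g in groups.items())
# Within-group variability: sum of squares for error
sse = sum((x - means[name]) ** 2 for name, g in groups.items() for x in g)

f = (sst / (k - 1)) / (sse / (n - k))
print(round(sst, 1), round(sse, 1), round(f, 2))   # 193.0 819.5 1.77
```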

Page 55: Data analysis

Statistical Output

When you estimate this information in a computer program, it will typically be presented in a table as follows:

Source of Variation    df     Sum of squares    Mean squares       F-ratio
Treatment              k-1    SST               MST = SST/(k-1)    F = MST/MSE
Error                  n-k    SSE               MSE = SSE/(n-k)
Total                  n-1    SS = SST + SSE

Page 56: Data analysis

Calculating F for our example

F = (193/2) / (819.5/15) = 1.77

Our calculated F is compared to the critical value from the F-distribution with k-1 (numerator) and n-k (denominator) degrees of freedom.

Page 57: Data analysis

The Results

For 95% confidence (α = .05), our critical F is 3.68 (averaging across the tabled values at denominator df of 14 and 16).

In this case, 1.77 < 3.68, so we cannot reject the null hypothesis.

The dean is puzzled by these results because just by eyeballing the data, it looks like finance majors make more money.

Page 58: Data analysis

The Results

Many other factors may determine the salary level, such as GPA. The dean decides to collect new data selecting one student randomly from each major with the following average grades.

Page 59: Data analysis

New data

Average Accounting Marketing Finance M(b)

A+ 41 45 51 M(b1)=45.67

A 36 38 45 M(b2)=39.67

B+ 27 33 31 M(b3)=30.83

B 32 29 35 M(b4)=32

C+ 26 31 32 M(b5)=29.67

C 23 25 27 M(b6)=25

M(t)1=30.83 M(t)2=33.5 M(t)3=36.83

Grand mean M̄ = 33.72

Page 60: Data analysis

Randomized Block Design

Now the data in the 3 samples are not independent; they are matched by GPA levels. Just like before, matched samples are superior to unmatched samples because they provide more information. In this case, we have added a factor that may account for some of the SSE.

Page 61: Data analysis

Two way ANOVA

Now SS(total) = SST + SSB + SSE

Where SSB = the variability among blocks, where a block is a matched group of observations from each of the populations

We can calculate a two-way ANOVA to test our null hypothesis. We will talk about this next week.