Course in Statistics and Data Analysis

Course B, Day 2

September 2009, Stephan Frickenhaus

www.awi.de/en/go/bioinformatics






DAY2

How to import data from Excel

Multivariate data in plots, linear models

ANOVA

Ideas of Clustering and Modelling

Import of Data from Excel

• Copy a rectangular part (or all) of your table in Excel

• Paste it into a TEXT file in the Windows EDITOR

• Check the column names

• Save as „tab1.txt“

• In R, change to the directory where „tab1.txt“ is (File->Change Dir.)

• Load the file in R into a variable V:

V=read.table(file="tab1.txt",header=T)

• You may use the column named „day“ as row names:

V=read.table(file="tab1.txt",header=T,row.names="day")
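The import steps can be tried without Excel. The following is a minimal sketch: since „tab1.txt“ is not reproduced here, it writes a small made-up table (columns day, size, class are assumptions) to a temporary file and reads it both ways.

```r
# Hypothetical stand-in for "tab1.txt": a small tab-separated table
# with a header line, as it might look after pasting from Excel.
tab_file <- tempfile(fileext = ".txt")
writeLines(c("day\tsize\tclass",
             "1\t2.5\ta",
             "2\t3.1\tb",
             "3\t2.8\ta"), tab_file)

# Plain import: "day" becomes an ordinary column.
V <- read.table(file = tab_file, header = TRUE)
str(V)

# Import with the "day" column used as row names instead.
V2 <- read.table(file = tab_file, header = TRUE, row.names = "day")
print(V2)
```

With row.names="day" the table has one column less, and rows can be addressed by their day label.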

Problems

• In case of problems with the decimal `,` or `.`: tell R which character is the decimal point in read.table:

V=read.table(…,dec=".")

• If you get a text file with commas, not tabs, separating the columns:

V=read.table(…, sep=",")
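A minimal sketch of both cases, using two small made-up files (their contents are assumptions, written to temporary files so the example is self-contained):

```r
# Hypothetical tab-separated file that uses decimal commas.
f_dec <- tempfile(fileext = ".txt")
writeLines(c("day\tsize", "1\t2,5", "2\t3,1"), f_dec)
V_dec <- read.table(f_dec, header = TRUE, dec = ",")

# Hypothetical comma-separated file that uses decimal points.
f_csv <- tempfile(fileext = ".txt")
writeLines(c("day,size", "1,2.5", "2,3.1"), f_csv)
V_csv <- read.table(f_csv, header = TRUE, sep = ",", dec = ".")

# In both cases "size" should arrive as a numeric column.
str(V_dec)
str(V_csv)
```

If a numeric column shows up as text after import, a wrong dec or sep setting is a likely cause.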

Saving result tables

• Results of an R analysis, e.g., from filtering, are sometimes exported to text files. These can be imported into Excel or R later.

• Do this without quotes around each entry:

write.table(V, file="res.txt", quote=F)

• Save only 2 desired columns („size“ and „class“):

write.table(V[,c("size","class")], file="res2.txt")
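A small sketch of exporting and re-importing, assuming a made-up data frame V with columns day, size and class. Selecting the columns by name keeps them as columns in the output file.

```r
# Hypothetical result table.
V <- data.frame(day = 1:3,
                size = c(2.5, 3.1, 2.8),
                class = c("a", "b", "a"))

# Export the whole table without quotes around entries.
res_file <- tempfile(fileext = ".txt")
write.table(V, file = res_file, quote = FALSE)

# Export only the two desired columns.
res2_file <- tempfile(fileext = ".txt")
write.table(V[, c("size", "class")], file = res2_file, quote = FALSE)

# Read the reduced table back to check what was written.
back <- read.table(res2_file, header = TRUE)
print(back)
```

By default write.table also writes the row names; row.names=FALSE can be added if they are not wanted.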

Multivariate Data

• Suppose we have Diameter and Height of Diatoms measured

• Work with „diatoms.txt“

• What is the relation between these? Is one dependent on the other? What is the strategy of the organism?

Correlation test

• Is there a significant correlation?

cor.test(D,H)

cor.test checks whether the observed correlation is significantly different from zero.

We find a negative correlation near -1 (strong).

A small p-value (below the chosen alpha) shows a significant correlation.
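The test can be tried on simulated data. This is only a sketch: „diatoms.txt“ is not reproduced here, so D and H below are made-up values constructed so that height falls as diameter grows.

```r
set.seed(1)

# Hypothetical diameters and heights (arbitrary units), simulated so that
# the cell volume stays roughly constant.
D <- runif(30, min = 10, max = 30)
H <- 4000 / D^2 * exp(rnorm(30, sd = 0.05))

ct <- cor.test(D, H)

ct$estimate   # sample correlation, expected to be clearly negative here
ct$p.value    # small value: correlation significantly different from zero
```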


• We can conclude that these diatoms show a special trend: height increases as diameter decreases.

• What does this mean? Can we say that this has a compensating function?

• It could be that the cell maintains its volume (centric shape).

• Volume of a cylinder with radius R, diameter D and height H:

V = pi*R^2*H = 1/4*pi*D^2*H

• So, for constant volume, we expect a linear relation between H and 1/D^2.

• We need a regression. In R it is done with lm(Y~X).

Try ?lm to see how.

See „diatoms.R“
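A sketch of such a regression on simulated data (not the content of „diatoms.R“, which is not shown here). The volume vol and the noise level are assumptions. If the volume is constant, the slope of H against 1/D^2 should be close to 4*vol/pi.

```r
set.seed(1)

vol <- 1000                        # hypothetical constant cell volume
D <- runif(30, min = 10, max = 30) # hypothetical diameters
H <- vol / (pi / 4 * D^2) + rnorm(30, sd = 0.2)  # heights with small noise

x <- 1 / D^2                       # transformed predictor
fit <- lm(H ~ x)

coef(fit)      # intercept near 0, slope near 4 * vol / pi
4 * vol / pi   # value the slope is compared with
summary(fit)
```

Transforming the predictor first keeps the model linear in its coefficients, so lm can be used directly.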

Linear models

• To fit a model to data

• Suppose we have a sample of measured (y,x1,x2,x3)

• The simplest model showing the influence of all 3 x has the form y=a*x1+b*x2+c*x3+d

• Coefficients a,b,c,d are obtained from lm(y~x1+x2+x3)

• Each coefficient's value may be non-significant, so it could as well be set to zero.

• summary(lm()) shows these significances

Check „lm.R“

The data y were created with coefficients 1, 1, 0.5 and a random term runif/3.

We see estimates of these coefficients from the fit under „Estimate“.

Now we can write the fitted model as

y.fit(x1,x2,x3)=0.26826+1.00595*x1+1.01167*x2+0.47311*x3

Use this to draw a ± error bar around y.fit.

If you want no intercept, use

y~x1+x2+x3-1
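A sketch in the spirit of the example above (not the content of „lm.R“, which is not shown here): y is simulated with coefficients 1, 1, 0.5 and a random term runif/3. The sample size and the distribution of the x values are assumptions, so the estimates will differ from the numbers quoted above.

```r
set.seed(42)
n <- 200

# Hypothetical explanatory variables.
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)

# Response built with known coefficients plus a uniform random term.
y <- 1 * x1 + 1 * x2 + 0.5 * x3 + runif(n) / 3

fit <- lm(y ~ x1 + x2 + x3)
summary(fit)          # estimates and their significances

# Fitted values with a confidence band (columns fit, lwr, upr).
band <- predict(fit, interval = "confidence")
head(band)

# The same model without intercept.
fit0 <- lm(y ~ x1 + x2 + x3 - 1)
coef(fit0)
```

Because runif/3 has a non-zero mean, the intercept of the first fit absorbs that mean. Dropping the intercept forces the fitted line through the origin.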

Conclusions

• Variables x with significant coefficients, i.e., Pr(>|t|) < alpha, are said to have an effect on y.

• Sometimes there are relations between the explanatory variables, say x1 and x2 are correlated, like x2=2*x1.

• Then, y=c1*x1+c2*x2 can be reduced to

y=(c1+2*c2)*x1
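What lm does with such exactly related variables can be seen in a small simulation. The coefficients c1 = 1.5 and c2 = 0.5 and the noise level are assumptions chosen for illustration.

```r
set.seed(7)

x1 <- rnorm(50)
x2 <- 2 * x1                      # x2 is an exact multiple of x1

c1 <- 1.5                         # hypothetical true coefficients
c2 <- 0.5
y <- c1 * x1 + c2 * x2 + rnorm(50, sd = 0.1)

fit <- lm(y ~ x1 + x2)
coef(fit)
# x2 cannot be estimated separately and is reported as NA;
# the x1 estimate is close to c1 + 2 * c2 = 2.5.
```

The individual values of c1 and c2 cannot be recovered from such data, only their combination.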

ANOVA

• With two different treatments we use the t-test to compare means.

• The influence of a factor/treatment with more than 2 variants is commonly analysed by ANOVA, i.e., more than two means are compared at the same time.

• The null hypothesis is that all sample means are from the same population [the treatment has no effect].

ANOVA

• In R it is like linear models, but with factors that influence the means.

• See dataset ANOVA.txt

• Try aov(y~f.c)

A weak p-value: the effect may be unclear because of the other factors.
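A sketch of a one-way ANOVA on simulated data. ANOVA.txt is not reproduced here, so the factor f.c with levels 0, 1, 2 and the group means are assumptions.

```r
set.seed(3)

# Hypothetical factor with 3 levels, 20 observations each.
f.c <- factor(rep(c("0", "1", "2"), each = 20))

# Response with slightly different group means.
group_means <- c(10, 11, 10.2)
y <- rnorm(60, mean = group_means[as.integer(f.c)], sd = 1)

fit <- aov(y ~ f.c)
summary(fit)   # F test of the null: all group means are equal
```

The factor has 3 levels, so it uses 2 degrees of freedom; the remaining 57 belong to the residuals.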

But which means do differ?

• f.c has 3 levels.

• We are not allowed to simply look at the means of each level.

• We must make all pairwise comparisons for significance.

• This is known as a „post-hoc“ test.

• One is TukeyHSD.

• It gives a table of pairwise tests of means.

• Since the data are used more than once, we are more likely to discover some effect by chance.

• HSD corrects the p-values for multiple tests.

Post-hoc

Almost significant effect, comparing group 1 with 0

Adjusted p for 3 tests
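A sketch of the post-hoc step on simulated data (the factor f.c with levels 0, 1, 2 and the group means are assumptions, not the course dataset). TukeyHSD takes the aov fit and returns one row per pair of levels.

```r
set.seed(3)

# Hypothetical data: 3 groups with 20 observations each.
f.c <- factor(rep(c("0", "1", "2"), each = 20))
y <- rnorm(60, mean = c(10, 11, 10.2)[as.integer(f.c)], sd = 1)

fit <- aov(y ~ f.c)
hsd <- TukeyHSD(fit)

# One row per pairwise comparison ("1-0", "2-0", "2-1") with the
# difference of means, its confidence limits and the adjusted p-value.
hsd$f.c
```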

A graphical view

plot(y~f.c)

Compare with a T-test

So, the adjusted p-value 0.06 from HSD is greater than the p-value of the t-test.
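The comparison can be sketched on simulated data (groups and means are assumptions). The t-test uses only the two groups of interest and no correction, while TukeyHSD adjusts for all three pairwise tests; the two p-values are printed side by side.

```r
set.seed(3)

# Hypothetical data: 3 groups with 20 observations each.
f.c <- factor(rep(c("0", "1", "2"), each = 20))
y <- rnorm(60, mean = c(10, 11, 10.2)[as.integer(f.c)], sd = 1)
dat <- data.frame(y = y, f.c = f.c)

# Plain t-test using only groups 0 and 1.
d01 <- droplevels(subset(dat, f.c %in% c("0", "1")))
tt <- t.test(y ~ f.c, data = d01)

# Adjusted p-value for the same pair from the post-hoc test.
p_hsd <- TukeyHSD(aov(y ~ f.c, data = dat))$f.c["1-0", "p adj"]

c(t.test = tt$p.value, TukeyHSD = p_hsd)
```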

Ideas of clustering and modeling

• Clustering is a way to detect/display groups in data that might point to a factor which affects the sample.

• Different ways:

– Mapping: plot multivariate data in a special way to see groups

– Discriminant analysis: use a known factor (e.g., strain) to find a mapping that best separates the known groups

• Use the discriminant to classify new data!

PCA

• Download data PCA.txt

• See PCA.R to make a PCA for that multivariate data

• PC1 is an axis of the rotated data, with maximal variance

• PC2 has smaller variance

We could separate / discriminate with this line
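A sketch of a PCA with prcomp on simulated data (not the content of PCA.R or PCA.txt, which are not shown here). The two groups and the three variables a, b, c are assumptions.

```r
set.seed(5)

# Hypothetical data: two groups that differ in variables a and b.
g <- rep(c(1, 2), each = 25)
X <- cbind(a = rnorm(50, mean = 3 * g),
           b = rnorm(50, mean = 3 * g),
           c = rnorm(50))

pca <- prcomp(X)

summary(pca)          # variance explained by each component
pca$sdev              # standard deviations, largest first (PC1)
scores <- pca$x       # the rotated data (one column per component)
head(scores)

# plot(scores[, 1], scores[, 2], col = g)  # uncomment to view the groups
```

The columns of pca$rotation hold the directions of the new axes in terms of the original variables.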

Linear Discriminant

Check LDA.R and LDA.txt to see similar results.

The original 3-class 3D data in a 2D LDA

New data (squares) classified (predicted) according to the LDA
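A sketch of a linear discriminant analysis with lda from the MASS package, on simulated 3-class 3D data (not the content of LDA.R or LDA.txt, which are not shown here). Class centres, class labels and the new points are assumptions.

```r
library(MASS)   # provides lda()

set.seed(9)
n <- 30

# Hypothetical 3-class data in 3 dimensions.
cls <- factor(rep(c("A", "B", "C"), each = n))
centers <- rbind(c(0, 0, 0), c(4, 0, 2), c(0, 4, -2))
X <- centers[as.integer(cls), ] + matrix(rnorm(3 * n * 3), ncol = 3)
colnames(X) <- c("x1", "x2", "x3")
train <- data.frame(X, cls = cls)

fit <- lda(cls ~ x1 + x2 + x3, data = train)

# Projection of the training data onto the 2 discriminant axes.
proj <- predict(fit)$x
head(proj)

# Classify new, made-up observations with the fitted discriminant.
new <- data.frame(x1 = c(0.2, 3.8), x2 = c(0.1, 0.3), x3 = c(0, 2.1))
pred <- predict(fit, newdata = new)$class
pred
```

With 3 classes there are at most 2 discriminant axes, which is why the 3D data can be shown in a 2D plot.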
