Course in Statistics and Data Analysis
Course B, Day 2
September 2009, Stephan Frickenhaus
www.awi.de/en/go/bioinformatics
Day 2
• How to import data from Excel
• Multivariate data in plots, linear models
• ANOVA
• Ideas of clustering and modelling
Import of Data from Excel
• Copy a rectangular part (or all) of your table in Excel
• Paste it into a text file in the Windows Editor
• Check the column names
• Save as "tab1.txt"
• In R, change to the directory where "tab1.txt" is (File -> Change Dir.).
• Load the table in R into a variable V:
V=read.table(file="tab1.txt",header=T)
• You may use the column named "day" as row names:
V=read.table(file="tab1.txt",header=T,row.names="day")
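The import steps can be tried without Excel. This is a minimal sketch: the file contents and the columns "size" and "class" are made up for illustration, and the file is written to a temporary directory instead of the working directory.

```r
# Hypothetical stand-in for the table pasted from Excel (tab-separated text).
path <- file.path(tempdir(), "tab1.txt")
writeLines(c("day\tsize\tclass",
             "1\t2.5\tA",
             "2\t3.1\tB",
             "3\t2.9\tA"), path)

# Plain import: "day" is kept as an ordinary data column.
V <- read.table(file = path, header = TRUE)
str(V)

# Import with the "day" column used as row names.
V2 <- read.table(file = path, header = TRUE, row.names = "day")
print(V2)
```

Writing TRUE instead of T is a little safer, because T is an ordinary variable that can be overwritten.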
Problems
• In case of problems with the decimal separator `,` or `.`: tell R which character is the decimal point in read.table (the default is dec="."):
V=read.table(…,dec=",")
• If you get a text file with commas, not tabs, separating the columns:
V=read.table(…,sep=",")
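Both problems can be reproduced with small inline tables. This is a sketch with made-up data; the text= argument of read.table is used here only to avoid creating files.

```r
# Hypothetical contents of two exported files.
semi  <- "day;size\n1;2,5\n2;3,1"   # decimal commas, columns separated by ";"
comma <- "day,size\n1,2.5\n2,3.1"   # decimal points, columns separated by ","

# Tell read.table the column separator and, where needed, the decimal character.
A <- read.table(text = semi,  header = TRUE, sep = ";", dec = ",")
B <- read.table(text = comma, header = TRUE, sep = ",")

# In both cases "size" should now be numeric, not text.
print(is.numeric(A$size))
print(is.numeric(B$size))
```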
Saving result tables
• Results of an R analysis, e.g., from filtering, are sometimes exported to text files that can be imported into Excel or R later.
• Do this without quotes around each entry:
write.table(V, file="res.txt", quote=F)
• Save only 2 desired columns ("size" and "class"):
write.table(V[,c("size","class")], file="res2.txt")
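A short round trip shows what ends up in the file. This is a sketch with a made-up table V; sep="\t" and row.names=FALSE are choices made here so that the file opens cleanly in Excel, not something the slides prescribe.

```r
# Hypothetical result table.
V <- data.frame(day = 1:3,
                size = c(2.5, 3.1, 2.9),
                class = c("A", "B", "A"))

out <- file.path(tempdir(), "res2.txt")

# Select the two columns by name; rbind() would stack them as rows instead.
write.table(V[, c("size", "class")], file = out,
            quote = FALSE, sep = "\t", row.names = FALSE)

# Read the file back to check what was written.
back <- read.table(out, header = TRUE, sep = "\t")
print(back)
```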
Multivariate Data
• Suppose we have measured the diameter and height of diatoms
• Work with "diatoms.txt"
• What is the relation between these? Is one dependent on the other? What is the strategy of the organism?
Correlation test
• Is there a significant correlation?
cor.test(D,H)
Checks whether the observed correlation is significantly different from zero
We find a negative correlation, near -1 (strong)
A small p-value (below the chosen alpha) indicates a significant correlation.
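The test can be tried on simulated data when "diatoms.txt" is not at hand. This is a sketch under assumptions: D and H are generated here with a roughly constant cell volume, and the numbers are invented, not the course data.

```r
set.seed(1)

# Hypothetical diameters, and heights that keep the volume roughly constant.
D <- runif(30, min = 10, max = 30)
H <- 4 * 5000 / (pi * D^2) * exp(rnorm(30, sd = 0.05))

ct <- cor.test(D, H)

# Correlation estimate (expected to be negative) and its p-value.
print(ct$estimate)
print(ct$p.value)
```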
• We can conclude that these diatoms show a special trend: height increases as diameter decreases.
• What does this mean? Can we say that this has a compensating function?
• It could be that the cell maintains its volume (centric shape).
• Volume
V = R^2 * pi * H = 1/4 * D^2 * pi * H
• For constant V this gives H = (4*V/pi) * 1/D^2, so we expect a linear relation between H and 1/D^2
• We need a regression…
…it is found in R: lm(Y~X)
Try ?lm to see how.
See "diatoms.R"
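The regression of H on 1/D^2 might look as follows. This is only a sketch on simulated data; the volume vol, the noise level and the variable names are assumptions and are not taken from "diatoms.R".

```r
set.seed(1)

vol <- 5000                                    # assumed constant cell volume
D <- runif(30, min = 10, max = 30)             # hypothetical diameters
H <- 4 * vol / (pi * D^2) + rnorm(30, sd = 0.5)

# Regress H on the transformed predictor X = 1/D^2.
X <- 1 / D^2
fit <- lm(H ~ X)
print(coef(fit))

# Under constant volume the slope should be close to 4*V/pi.
print(4 * vol / pi)
```

The same model can be written without the helper variable as lm(H ~ I(1/D^2)); I() protects the arithmetic inside the formula.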
Linear models
• To fit a model to data
• Suppose we have a sample of measured (y,x1,x2,x3)
• The simplest model showing the influence of all 3 x has the form y=a*x1+b*x2+c*x3+d
• Coefficients a,b,c,d are obtained from lm(y~x1+x2+x3)
• Each coefficient may be non-significant, so it could as well be set to zero.
• summary(lm()) shows these significances
Check "lm.R"
The data y was created with coefficients 1, 1, 0.5 and a random term runif/3
We see estimates of these coefficients from the fit under "Estimate".
Now we could write the fitted model as
y.fit(x1,x2,x3)=0.26826+1.00595*x1+1.01167*x2+0.47311*x3
Use this to draw a ± error bar around y.fit
If you want no intercept, use
y~x1+x2+x3-1
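A fit of this kind can be reproduced roughly as below. This is a sketch and not the contents of "lm.R": the sample size, the uniform x values and the seed are assumptions, so the estimates will differ from the numbers quoted above. The prediction interval is one possible reading of the ± error bar.

```r
set.seed(42)
n <- 100

# Hypothetical predictors; y uses the coefficients 1, 1, 0.5 plus runif/3 noise.
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
y <- 1 * x1 + 1 * x2 + 0.5 * x3 + runif(n) / 3

# Fit with intercept; the "Estimate" column holds d, a, b, c.
fit <- lm(y ~ x1 + x2 + x3)
print(summary(fit)$coefficients)

# Fit without intercept.
fit0 <- lm(y ~ x1 + x2 + x3 - 1)
print(coef(fit0))

# Fitted value with lower and upper prediction limits for one new point.
new <- data.frame(x1 = 0.5, x2 = 0.5, x3 = 0.5)
band <- predict(fit, newdata = new, interval = "prediction")
print(band)
```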
Conclusions
• Variables x with significant coefficients, i.e., Pr(>|t|) < alpha, are said to have an effect on y.
• Sometimes there are relations between the explanatory variables, say x1 and x2 are correlated, like x2=2*x1.
• Then, y=c1*x1+c2*x2 can be reduced to
y=(c1+2*c2)*x1
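The effect of such a relation on a fit can be seen in a small simulation. This is a sketch with made-up coefficients c1 = 1 and c2 = 0.5; with an exact relation x2 = 2*x1, lm cannot estimate both coefficients and reports NA for the redundant one.

```r
set.seed(3)

x1 <- runif(50)
x2 <- 2 * x1                                   # exact linear relation
y <- 1 * x1 + 0.5 * x2 + rnorm(50, sd = 0.1)   # c1 = 1, c2 = 0.5

# Full model: the coefficient of x2 cannot be separated from that of x1.
fit <- lm(y ~ x1 + x2)
print(coef(fit))

# Reduced model: the slope estimates c1 + 2*c2 = 2.
fit1 <- lm(y ~ x1)
print(coef(fit1))
```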
ANOVA
• With two different treatments we use the t-test to compare means.
• The influence of a factor/treatment with more than 2 variants is commonly analysed by ANOVA, i.e., more than two means are compared at the same time.
• The null hypothesis is that all sample means come from the same population [the treatment has no effect].
ANOVA
• In R it is like linear models, but with factors that influence the means.
• See dataset ANOVA.txt
• Try aov(y~f.c)
A weak p-value: the effect may be unclear because of the other factors
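A one-factor ANOVA can be tried on simulated data. This is a sketch and not the dataset ANOVA.txt: the three levels 0, 1, 2 of f.c, the group means and the group size are assumptions.

```r
set.seed(7)

# Hypothetical factor with 3 levels and 10 observations per level.
f.c <- factor(rep(c(0, 1, 2), each = 10))
y <- c(rnorm(10, mean = 5), rnorm(10, mean = 6), rnorm(10, mean = 5))

fit <- aov(y ~ f.c)
print(summary(fit))

# The p-value for the factor can be extracted from the summary table.
p <- summary(fit)[[1]][["Pr(>F)"]][1]
print(p)
```

f.c must be a factor; with a numeric f.c the same call would fit a straight line instead of comparing group means.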
But which means do differ?
• f.c has 3 levels.
• We may not simply look at the means of each level.
• We must make all pairwise comparisons for significance
• This is known as a "post-hoc" test
• One is TukeyHSD
• It gives a table of pairwise tests of means
• Since the data are used more than once, we are more likely to discover some effect by chance.
• HSD corrects p-values for multiple tests
Post-hoc
Almost significant effect, comparing group 1 with 0 (adjusted p for 3 tests)
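The pairwise table can be produced as sketched below. The data are simulated (levels 0, 1, 2 and the group means are assumptions), so the adjusted p-values will not match the ones on the slide.

```r
set.seed(7)

# Hypothetical factor with 3 levels and 10 observations per level.
f.c <- factor(rep(c(0, 1, 2), each = 10))
y <- c(rnorm(10, mean = 5), rnorm(10, mean = 6), rnorm(10, mean = 5))

# Post-hoc test on the fitted ANOVA model.
hsd <- TukeyHSD(aov(y ~ f.c))

# One row per pair of levels: difference, confidence limits, adjusted p.
print(hsd$f.c)
```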
A graphical view
plot(y~f.c)
Compare with a t-test
So the adjusted p-value of 0.06 from HSD is greater than the p-value of the single t-test
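The comparison of a single t-test with the adjusted p-value can be sketched as follows, again on simulated data with assumed levels 0, 1, 2. The two p-values are only printed side by side; which one is larger depends on the data.

```r
set.seed(7)

# Hypothetical factor with 3 levels and 10 observations per level.
f.c <- factor(rep(c(0, 1, 2), each = 10))
y <- c(rnorm(10, mean = 5), rnorm(10, mean = 6), rnorm(10, mean = 5))

# Single t-test of group 1 against group 0, ignoring group 2.
tt <- t.test(y[f.c == "1"], y[f.c == "0"])

# Adjusted p-value for the same pair from the post-hoc test.
p.adj <- TukeyHSD(aov(y ~ f.c))$f.c["1-0", "p adj"]

print(c(t.test = tt$p.value, HSD = p.adj))
```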
Ideas of clustering and modelling
• Clustering is a way to detect/display groups in data that might point to a factor which affects the sample.
• Different ways:
– Mapping: plot multivariate data in a special way to see groups
– Discriminant analysis: use a known factor (e.g., strain) to find a mapping that best separates the known groups
• Use the discriminant to classify new data!
PCA
• Download data PCA.txt
• See PCA.R to make a PCA for that multivariate data
• PC1 is the axis of the rotated data with maximal variance
• PC2 has smaller variance
• We could separate / discriminate with this line
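A PCA of this kind can be sketched with prcomp. The data below are simulated (two groups in three variables a, b, c) and are not PCA.txt; PCA.R may use a different function or options.

```r
set.seed(11)

# Hypothetical data: two groups that differ in variables a and b.
g <- rep(c(1, 2), each = 25)
X <- cbind(a = rnorm(50) + 3 * g,
           b = rnorm(50) + 3 * g,
           c = rnorm(50))

pca <- prcomp(X)

# Variances of the principal components, largest first.
print(pca$sdev^2)

# Rotated data: the scores on PC1 and PC2 for the first samples.
print(head(pca$x[, 1:2]))
```

Plotting pca$x[,1] against pca$x[,2] with colours taken from g would show whether the groups separate along PC1.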
Linear Discriminant
Check LDA.R and LDA.txt to see similar results
Figure: the original 3-class 3D data in a 2D LDA
Figure: new data (squares) classified (predicted) according to the LDA
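A linear discriminant analysis with classification of new data might look like this. It is a sketch under assumptions: it uses lda from the MASS package (LDA.R may do this differently), and the three strains, their class centres and the new points are invented.

```r
library(MASS)   # assumed to be installed; provides lda()
set.seed(5)

# Hypothetical 3-class data in 3 dimensions, 20 samples per strain.
strain <- factor(rep(c("a", "b", "c"), each = 20))
train <- data.frame(
  strain = strain,
  x1 = rnorm(60) + rep(c(0, 4, 0), each = 20),
  x2 = rnorm(60) + rep(c(0, 0, 4), each = 20),
  x3 = rnorm(60) + rep(c(0, 2, -2), each = 20)
)

fit <- lda(strain ~ x1 + x2 + x3, data = train)

# 2D map of the training data (LD1, LD2), as in the first figure.
proj <- predict(fit)$x
print(head(proj))

# Classify new, unlabeled points with the fitted discriminant.
new <- data.frame(x1 = c(0.2, 3.8, -0.1),
                  x2 = c(-0.3, 0.1, 4.2),
                  x3 = c(0.1, 1.9, -2.2))
pred <- predict(fit, newdata = new)$class
print(pred)
```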