Introduction to Data Analysis with R
Dani Solà
What is R?
● “R is a language and environment for statistical computing and graphics”
● Paradigms: array, object-oriented, imperative,
functional, procedural, reflective
● Everything resides in memory (no big data)
● Easy to get started!
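A few lines typed at the R prompt give a feel for the array and functional styles listed above. This is only a minimal sketch: the vector `heights_cm` and its values are made up for illustration.

```r
# Hypothetical sample data: heights in centimetres (made-up values)
heights_cm <- c(150, 160, 170, 180)

# Array style: arithmetic applies to the whole vector at once, no loop needed
heights_m <- heights_cm / 100

# Functional style: apply an anonymous function to each element
squared <- sapply(heights_m, function(h) h^2)

# Built-in summary statistics work directly on vectors
mean_height <- mean(heights_cm)
print(mean_height)
```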
Why R?
● Free Software (GNU General Public License)
● Mature: v1.0 was released in 2000
● Widely used
● Good documentation and manuals
● Lots of freely available packages
● Excellent graphic capabilities
Getting the data (CSV)
● MySQL
● Hive + sed
● Consider sampling!
SELECT * INTO OUTFILE '/path/to/file.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
FROM table WHERE <condition>;
INSERT OVERWRITE LOCAL DIRECTORY '/tmp_path/'
SELECT * FROM table WHERE <condition>;
cat /tmp_path/* | sed 's/[Ctrl-V][Ctrl-A]/\t/g' > out.txt
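Once the file is exported, loading it into R might look like the sketch below. It writes a tiny stand-in CSV to a temporary file so that it runs on its own; the column names (`id`, `name`, `score`) and the rows are hypothetical. For the tab-separated `out.txt` produced by the Hive route, `read.delim("out.txt", header = FALSE)` would be the analogous call.

```r
# Write a tiny stand-in CSV so the sketch is self-contained (made-up rows)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("1,alice,3.5", "2,bob,4.0", "3,carol,2.5", "4,dave,5.0"), csv_path)

# INTO OUTFILE writes no header row, so supply column names (hypothetical here)
df <- read.csv(csv_path, header = FALSE,
               col.names = c("id", "name", "score"),
               stringsAsFactors = FALSE)

# Sampling rows in R speeds up exploratory work; to reduce memory use,
# sample in the SQL query itself so fewer rows are exported
set.seed(42)  # make the sample reproducible
sample_rows <- df[sample(nrow(df), size = 2), ]

# Inspect column types and the first values
str(df)
```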
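Linear Regression
$$y = \alpha + \beta x$$
$$\alpha = \bar{y} - \beta \bar{x}$$
$$\beta = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\mathrm{Cov}[x, y]}{\mathrm{Var}[x]}$$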
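The slope and intercept formulas for simple linear regression can be checked by hand in a few lines of R. The sketch below uses made-up `x` and `y` values and computes the slope twice, once from the sums and once as covariance over variance, to show that the two agree.

```r
# Made-up data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Slope from the sums of deviations around the means
beta_sums <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Same slope as Cov[x, y] / Var[x]; the 1/(n - 1) factors cancel
beta_cov <- cov(x, y) / var(x)

# Intercept: the fitted line passes through the point of means
alpha <- mean(y) - beta_sums * mean(x)

print(c(alpha = alpha, beta = beta_sums))
```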
Just use lm() in R!
(But check the assumptions)
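A minimal sketch of what that could look like, using simulated data with an assumed intercept of 2 and slope of 3. The diagnostics shown (a residuals-vs-fitted plot, a normal Q-Q plot, and a Shapiro-Wilk test) are common ways to check the assumptions, not a complete list.

```r
# Simulated data with a known intercept (2) and slope (3); values are assumed
set.seed(1)
x <- runif(100, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(100, sd = 1)

# Fit the model y = alpha + beta * x
fit <- lm(y ~ x)
coefs <- coef(fit)  # named vector: "(Intercept)" and "x"
print(coefs)

# Check the assumptions on the residuals
res <- residuals(fit)

# Residuals vs fitted values: look for curvature or a funnel shape
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points should fall close to the reference line
qqnorm(res)
qqline(res)

# Shapiro-Wilk test of residual normality (a small p-value suggests non-normality)
sw <- shapiro.test(res)
```

`summary(fit)` prints standard errors, p-values and R-squared for the same fit.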
Want more?
● Computing for Data Analysis – Roger D. Peng
www.coursera.org/course/compdata
● Statistics One – Andrew Conway
www.coursera.org/course/stats1
● An Introduction to R – The R Core Team
cran.r-project.org/doc/manuals/r-release/R-intro.pdf