Data Analysis with R and Julia
DESCRIPTION
R is a free, open-source environment for statistical analysis and graphing. In its almost 20 years of existence, R has remained popular in both academic and business environments. The newer Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. This session outlines functional and performance differences between these two software packages. You’ll see demonstrations of best tips for integrating this software with Windows and walk away with guidelines for working with commercial software. A version of this presentation had 100 attendees at the PASS Business Analytics Conference in Chicago (April 2013), and 40 attendees for the PASS Virtual Business Analytics meeting (May 2013).
TRANSCRIPT
Data Analysis with R and Julia
Advanced Analytics and Insights
Mark Tabladillo Ph.D., Data Mining Scientist, MarkTab Inc.
Networking
Interactive
About MarkTab
Training and Consulting with http://marktab.com
Data Mining Resources and Blog at http://marktab.net
Twitter @marktabnet
Outline
R Language
Market Analysis
Performance
Production Use
Julia Language
Performance
The R Language
http://cran.r-project.org
Major R Versions
Version (Year): Description
0 (1996): Initial release: University of Auckland, New Zealand
1 (2000): Completeness and stability high enough to characterize a full statistical system, which could be put to production use
2 (2004): Strong enhancements of the memory management subsystem as well as several major features, including Sweave (into LaTeX or LyX).
3 (2013): The inclusion of long vectors (containing more than 2^31-1 elements!). Also, we now have 64 bit support on all platforms, support for parallel processing, the Matrix package
http://www.r-project.org/
How R Works
As with an automobile, you can use R without worrying very much about how it works.
But computing with data is more complicated than driving a car (fortunately for highway safety).
John Chambers
Software for Data Analysis, page 453
R works in a shell
Cross-platform, including 32-bit and 64-bit Windows
Interactive graphical user interface (GUI) to interpret commands
Read – accept user input
Parse – interpret input using expected syntax
Evaluate – execute commands
Everything is an object
Data are stored in data frames, named lists
R implements S language grammar, with a few extensions
R GUI
Read-Parse-Evaluate Loop
Read → Parse → Evaluate
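The read-parse-evaluate loop can be sketched in a few lines. This is a minimal illustration in Python rather than R, and it is not how R itself is implemented; the function name `read_parse_evaluate` and the `env` dictionary (standing in for the workspace) are hypothetical names chosen for this example.

```python
import ast


def read_parse_evaluate(lines, env=None):
    """Feed each input line through the read, parse, and evaluate steps."""
    env = {} if env is None else env  # stands in for the interactive workspace
    results = []
    for source in lines:  # Read: accept one line of user input
        try:
            # Parse: check the input against expression syntax
            tree = ast.parse(source, mode="eval")
            kind = "eval"
        except SyntaxError:
            # Not an expression (for example an assignment): parse as a statement
            tree = ast.parse(source, mode="exec")
            kind = "exec"
        code = compile(tree, "<input>", kind)
        # Evaluate: run the parsed command; statements produce no value (None)
        value = eval(code, env) if kind == "eval" else exec(code, env)
        results.append(value)
    return results


print(read_parse_evaluate(["x = 2", "x * 21"]))  # [None, 42]
```

Input that fails both parse attempts raises `SyntaxError` before anything is evaluated, which mirrors the separation between the parse and evaluate steps.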
R and SQL Server
install.packages("RODBC")
library(RODBC)
MDAC Downloads
R Market Analysis
Listserv Discussion
http://r4stats.com/articles/popularity/
Estimated R Usage
An estimated 250,000 people use R regularly (as of 2009)
http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=2&_r=0
General Forum Postings
http://r4stats.com/articles/popularity/
Stack Overflow Alone
http://r4stats.com/articles/popularity/
Academic Publications
http://r4stats.com/articles/popularity/
Comparison of R, Matlab, SAS, Stata, SPSS
http://www.analyticbridge.com/group/productreviews2/forum/topics/product-reviews-comparing-r-matlab-sas-stata-spss
R Performance
R is Memory-Bound
$\text{Amount of R Data} = \dfrac{\text{Memory Size}}{4}$
Source: Joseph B. Rickert, February 14, 2013
64-bit: $\text{Memory Size} = \text{RAM}$
32-bit: $\text{Memory Size} = \text{User Virtual Memory} - 0.5\ \text{GB} \cong 2\ \text{GB}$
Source: http://cran.r-project.org/bin/windows/base/rw-FAQ.html retrieved March 1, 2013
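The sizing rule on this slide is simple arithmetic, and a short worked example may help. The sketch below is in Python; the function names and the sample inputs (16 GB of RAM, 2.5 GB of user virtual memory) are illustrative assumptions, not figures from the presentation.

```python
def memory_size_gb(bits, ram_gb=None, user_virtual_gb=None):
    """Memory size as defined on the slide, in GB."""
    if bits == 64:
        if ram_gb is None:
            raise ValueError("ram_gb is required for 64-bit")
        return ram_gb  # 64-bit: memory size = RAM
    if bits == 32:
        if user_virtual_gb is None:
            raise ValueError("user_virtual_gb is required for 32-bit")
        return user_virtual_gb - 0.5  # 32-bit: user virtual memory minus 0.5 GB
    raise ValueError("bits must be 32 or 64")


def max_r_data_gb(memory_gb):
    """Rule of thumb from the slide: amount of R data = memory size / 4."""
    return memory_gb / 4


# Hypothetical machines, chosen only to illustrate the arithmetic.
print(max_r_data_gb(memory_size_gb(64, ram_gb=16)))             # 4.0
print(max_r_data_gb(memory_size_gb(32, user_virtual_gb=2.5)))   # 0.5
```

Under this rule, a 64-bit machine with 16 GB of RAM would comfortably handle roughly 4 GB of data, while a 32-bit session would be limited to about half a gigabyte.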
R is Memory-Bound
All objects in an R session are stored in memory
R places a limit of 2^31 − 1 bytes on all object sizes, independent of RAM
The Art of R Programming, Norman Matloff
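To put the 2^31 − 1 byte limit in concrete terms, the following Python sketch converts it into element counts. It assumes the whole limit is available for data (object header overhead is ignored) and uses 8 bytes per double and 4 bytes per integer; the names `MAX_OBJECT_BYTES` and `max_elements` are hypothetical.

```python
MAX_OBJECT_BYTES = 2**31 - 1  # per-object limit quoted on the slide


def max_elements(bytes_per_element):
    """Whole elements that fit under the limit, ignoring object headers."""
    return MAX_OBJECT_BYTES // bytes_per_element


print(MAX_OBJECT_BYTES)   # 2147483647 bytes, just under 2 GiB
print(max_elements(8))    # 268435455 doubles (8 bytes each)
print(max_elements(4))    # 536870911 integers (4 bytes each)
```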
R Memory Management
Automatic, including garbage collection
rm() removes an object's name from the workspace, but does not immediately free its memory
gc() forces garbage collection, at substantial computational cost
Improving Performance
The Art of R Programming, Chapter 14, Norman Matloff
[Figure: Vectorization, Byte-Code Compilation, Parallel R, and C/C++ placed on axes of Power and Simplicity]
Improving Performance
Method: Description
C/C++: Call C programs from R
Vectorization: Recode for vectorization, replacing slower functions
Byte-code compilation: cmpfun()
Parallel R: parallel package
http://cran.r-project.org/web/views/HighPerformanceComputing.html
Improving Performance
Rprof() – measures speed of functions
ff – memory-efficient storage of large data on disk and fast access functions
bigmemory – manage massive matrices with shared memory and memory-mapped files
R for Production Use
Derivative Projects
RStudio – Integrated Development Environment (IDE)
Rattle – Data Mining Package
RExcel – (Statconn) Connection between R and Excel
Weka – Java-based data mining, statistical analysis by R
RapidMiner – Java-based Weka data mining, statistical analysis by R
Revolution Analytics – Scaling R for the Enterprise
Oracle R Enterprise – Integrated into Oracle
About Statconn (as of March 2013)
Produces RAndFriends under noncommercial and commercial licenses
All the statconn tools work ONLY with 32-bit R
statconnDCOM
rcom (GPL2, but requires statconnDCOM)
RExcel 3.2.9 (ONLY 32-bit Office: 2003, 2007, 2010)
http://rcom.univie.ac.at/
Sample Projects Using R
The Heritage Health Prize, Thomas Nguyen
A Direct Marketing In-flight Forecasting System, Shannon Terry & Ben Ogorek
Mining Twitter for Airline Consumer Sentiment, Jeffrey Breen
Alternative Data Sources for Measuring Market Sentiment and Events (Using R), Joe Rothermich
The Julia Language
http://julialang.org/
About Julia
High-level, high-performance dynamic open-source programming language for technical computing
Syntax similar to other technical computing environments
Features
Sophisticated compiler
Distributed parallel execution
Numerical accuracy
Extensive mathematical function library
Uses C, C++, Fortran libraries extensively
Why Julia: “Because we are greedy”
http://julialang.org/blog/2012/04/nyc-open-stats-meetup-announcement/
Julia Community
Hosted on GitHub
550 mailing list subscribers (Google Groups)
1,500 GitHub followers
190 forks
50 total contributors
As of September 2012, all contributors except the core developers had known of the language for six months or less
Julia: A Fast Dynamic Language for Technical Computing (2012), Bezanson, Karpinski, Shah, Edelman
The Julia Manual
http://docs.julialang.org/en/latest/manual/
Julia Mathematical Functions
http://docs.julialang.org/en/latest/manual/mathematical-operations/
Julia Standard Library
http://docs.julialang.org/en/latest/stdlib/
Julia Performance
Key Ingredients of Julia Performance
Rich type information, provided naturally by multiple dispatch
Aggressive code specialization against run-time types
Julia’s LLVM-based just-in-time (JIT) compiler
Julia: A Fast Dynamic Language for Technical Computing (2012), Bezanson, Karpinski, Shah, Edelman
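Multiple dispatch means the implementation of a function is chosen from the run-time types of all its arguments, not only the first. The Python sketch below imitates the idea with a lookup table; it is an illustration only, since Julia performs dispatch and type specialization inside its compiler. The `method` decorator, the `_methods` table, and the `combine` function are hypothetical names, and this toy version matches exact types only (no subtype matching).

```python
_methods = {}  # maps (function name, argument types) to an implementation


def method(*types):
    """Register an implementation for one exact combination of argument types."""
    def register(func):
        _methods[(func.__name__, types)] = func

        def dispatcher(*args):
            # Select the implementation from the run-time types of ALL arguments
            key = (func.__name__, tuple(type(a) for a in args))
            if key not in _methods:
                raise TypeError(f"no method {func.__name__}{key[1]}")
            return _methods[key](*args)

        return dispatcher
    return register


@method(int, int)
def combine(a, b):
    return a + b  # two integers: add


@method(str, int)
def combine(a, b):
    return a * b  # string and integer: repeat the string


print(combine(2, 3))     # 5
print(combine("ab", 2))  # abab
```

Because each registered method is tied to concrete argument types, the body of each one can be specialized for those types, which is the connection to the "rich type information" point on this slide.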
Julia Performance Comparison
http://julialang.org/
Julia Performance Comparison
Julia: A Fast Dynamic Language for Technical Computing (2012), Bezanson, Karpinski, Shah, Edelman
Julia Recommendations
The software is ready for people already using C or Fortran
The software will develop into a usable scripting language for R users
Wait until version one for production use
Send Me Your Questions
http://marktab.net
Conclusion
R provides production-ready software for statistical analysis
Julia merits personal investment and promises high performance