data analysis with r and julia

Data Analysis with R and Julia: Advanced Analytics and Insights. Mark Tabladillo, Ph.D., Data Mining Scientist, MarkTab Inc.

Upload: mark-tabladillo

Post on 27-Jan-2015


Category:

Business



DESCRIPTION

R is a free, open-source environment for statistical analysis and graphing. In its almost 20 years of existence, R has remained popular in both academic and business environments. The newer Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. This session outlines functional and performance differences between these two software packages. You’ll see demonstrations of best tips for integrating this software with Windows and walk away with guidelines for working with commercial software. A version of this presentation had 100 attendees at the PASS Business Analytics Conference in Chicago (April 2013), and 40 attendees for the PASS Virtual Business Analytics meeting (May 2013).

TRANSCRIPT

Page 1: Data analysis with R and Julia

Data Analysis with R and Julia Advanced Analytics and Insights

Mark Tabladillo Ph.D., Data Mining Scientist, MarkTab Inc.

Page 2: Data analysis with R and Julia

Networking

Interactive

Page 3: Data analysis with R and Julia

About MarkTab

Training and Consulting with http://marktab.com

Data Mining Resources and Blog at http://marktab.net

Twitter @marktabnet

Page 4: Data analysis with R and Julia

Outline

R Language: Market Analysis, Performance, Production Use

Julia Language: Performance

Page 5: Data analysis with R and Julia

The R Language

http://cran.r-project.org

Page 6: Data analysis with R and Julia

Major R Versions

Version (year): Description

0 (1996): Initial release: University of Auckland, New Zealand

1 (2000): Completeness and stability high enough to characterize a full statistical system, which could be put to production use

2 (2004): Strong enhancements of the memory management subsystem as well as several major features, including Sweave (into LaTeX or LyX)

3 (2013): The inclusion of long vectors (containing more than 2^31-1 elements). Also 64-bit support on all platforms, support for parallel processing, and the Matrix package

http://www.r-project.org/

Page 7: Data analysis with R and Julia

How R Works

"As with an automobile, you can use R without worrying very much about how it works. But computing with data is more complicated than driving a car (fortunately for highway safety)."

John Chambers, Software for Data Analysis, page 453

Page 8: Data analysis with R and Julia

R works in a shell

Cross-platform, including 32-bit and 64-bit Windows

Interactive graphical user interface (GUI) to interpret commands

Read – accept user input

Parse – interpret input using expected syntax

Evaluate – execute commands

Everything is an object

Data are stored in data frames, named lists

R implements S language grammar, with a few extensions
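The last few points can be checked directly at the R prompt. The following is a minimal sketch in base R; the data frame `scores` and its column names are made up for illustration.

```r
# Hypothetical example data; the names are illustrative only.
scores <- data.frame(id = 1:3, score = c(90.5, 82, 77.25))

# A data frame is a list of equal-length columns, addressed by name.
is.list(scores)        # TRUE
names(scores)          # "id" "score"
scores[["score"]]      # list-style access to one column
scores$score           # the same column, using the $ shortcut

# Functions are objects too, so they can be assigned and passed around.
average <- mean
class(average)         # "function"
average(scores$score)  # 83.25
```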

Page 9: Data analysis with R and Julia

R GUI

Page 10: Data analysis with R and Julia

Read-Parse-Evaluate Loop

[Diagram: Read, then Parse, then Evaluate, then back to Read]
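One pass through the loop can be imitated by hand with base R's `parse()` and `eval()`. This is only a sketch of the idea, not how the R console is implemented; the input string is an arbitrary example.

```r
# One pass through read, parse, and evaluate, done by hand.
input <- "x <- c(2, 4, 6); mean(x)"   # Read: text a user might type

parsed <- parse(text = input)          # Parse: text becomes unevaluated expressions
length(parsed)                         # 2, one expression per statement

result <- NULL
for (i in seq_along(parsed)) {
  result <- eval(parsed[[i]])          # Evaluate: run each expression in turn
}
result                                 # 4, the value of the last expression
```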

Page 11: Data analysis with R and Julia

R and SQL Server

install.packages("RODBC")

library(RODBC)

MDAC Downloads
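Once RODBC is installed, a query against SQL Server might look like the sketch below. The function name `query_sql_server`, the driver string, the server, the database, and the table are all placeholders chosen for illustration; the slides do not show a connection string, so adjust these to the local setup.

```r
# Sketch only: needs the RODBC package and a reachable SQL Server instance.
# Driver, server, and database values are placeholders, not real settings.
query_sql_server <- function(query,
                             server = "localhost",
                             database = "ExampleDb") {
  conn_string <- paste0(
    "driver={SQL Server};server=", server,
    ";database=", database, ";trusted_connection=yes"
  )
  channel <- RODBC::odbcDriverConnect(connection = conn_string)
  if (!inherits(channel, "RODBC")) {
    stop("could not connect to ", server)
  }
  on.exit(RODBC::odbcClose(channel))   # close the connection on any exit
  RODBC::sqlQuery(channel, query)      # a data frame on success
}

# Example call (not run here, since it needs a live server):
# top_rows <- query_sql_server("SELECT TOP 10 * FROM dbo.ExampleTable")
```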

Page 12: Data analysis with R and Julia

R Market Analysis

Page 13: Data analysis with R and Julia

Listserv Discussion

http://r4stats.com/articles/popularity/

Page 14: Data analysis with R and Julia

Estimated R Usage

An estimated 250,000 people use R regularly (as of 2009)

http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=2&_r=0

Page 15: Data analysis with R and Julia

General Forum Postings

http://r4stats.com/articles/popularity/

Page 16: Data analysis with R and Julia

Stack Overflow Alone

http://r4stats.com/articles/popularity/

Page 17: Data analysis with R and Julia

Academic Publications

http://r4stats.com/articles/popularity/

Page 18: Data analysis with R and Julia

Comparison of R, Matlab, SAS, Stata, SPSS

http://www.analyticbridge.com/group/productreviews2/forum/topics/product-reviews-comparing-r-matlab-sas-stata-spss

Page 19: Data analysis with R and Julia

R Performance

Page 20: Data analysis with R and Julia

R is Memory-Bound

$\dfrac{\text{Memory size}}{4} = \text{Amount of R data}$

Source: Joseph B. Rickert, February 14, 2013

$\text{Memory size (64-bit)} = \text{RAM}$

$\text{Memory size (32-bit)} = \text{User virtual memory} - 0.5\ \text{GB} \approx 2\ \text{GB}$

Source: http://cran.r-project.org/bin/windows/base/rw-FAQ.html retrieved March 1, 2013
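As a worked example of the one-quarter rule of thumb, the arithmetic below estimates how many rows fit in the data budget. The 8 GB of RAM and the 100-column numeric table are assumptions picked for illustration, not figures from the talk.

```r
# Rule of thumb from the slide: amount of R data is about memory size / 4.
memory_bytes <- 8 * 1024^3            # assumed: 8 GB of RAM on 64-bit R
data_budget  <- memory_bytes / 4      # bytes available for the data itself

# Assumed table shape: 100 numeric columns at 8 bytes per value.
bytes_per_row <- 100 * 8
max_rows <- floor(data_budget / bytes_per_row)
max_rows                              # 2684354
```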

Page 21: Data analysis with R and Julia

R is Memory-Bound

All objects in an R session are stored in memory

R places a limit of 2^31 − 1 bytes on all object sizes, independent of RAM

The Art of R Programming, Norman Matloff

Page 22: Data analysis with R and Julia

R Memory Management

Automatic, including garbage collection

rm() removes an object's name binding, but does not immediately free the memory

gc() forces garbage collection, at substantial computational cost
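The difference between removing a name and reclaiming memory can be seen in a short base R session. This is a sketch; the reported sizes and the figures returned by `gc()` will differ between machines and R versions.

```r
x <- numeric(1e6)                      # one million doubles
print(object.size(x), units = "MB")    # roughly 7.6 MB

rm(x)                                  # drops the name; memory is not freed yet
still_bound <- exists("x", inherits = FALSE)
still_bound                            # FALSE: the binding is gone

mem <- gc()                            # forces a collection, returns a usage matrix
mem[, "used"]                          # Ncells and Vcells currently in use
```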

Page 23: Data analysis with R and Julia

Improving Performance

The Art of R Programming, Chapter 14, Norman Matloff

[Figure: vectorization, byte-code compilation, parallel R, and C/C++ arranged along power and simplicity axes]

Page 24: Data analysis with R and Julia

Improving Performance

Method: Description

C/C++: Call C programs from R

Vectorization: Recode for vectorization, replacing slower functions

Byte-code compilation: cmpfun()

Parallel R: parallel package

http://cran.r-project.org/web/views/HighPerformanceComputing.html
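Two of these methods, vectorization and byte-code compilation, can be tried with base R alone. The sketch below uses made-up function names. Since R 3.4 functions are byte-compiled automatically by default, so on current versions `cmpfun()` may show little gain over the plain loop.

```r
library(compiler)   # ships with base R; provides cmpfun()

# Loop version: accumulates one element at a time.
sum_squares_loop <- function(x) {
  total <- 0
  for (value in x) total <- total + value^2
  total
}

# Vectorized version: one call over the whole vector.
sum_squares_vec <- function(x) sum(x^2)

# Byte-compiled copy of the loop version.
sum_squares_cmp <- cmpfun(sum_squares_loop)

x <- as.numeric(1:100000)
stopifnot(
  isTRUE(all.equal(sum_squares_loop(x), sum_squares_vec(x))),
  isTRUE(all.equal(sum_squares_cmp(x), sum_squares_vec(x)))
)

# Timings vary by machine and R version, so no figures are quoted here.
system.time(for (i in 1:20) sum_squares_loop(x))
system.time(for (i in 1:20) sum_squares_vec(x))
system.time(for (i in 1:20) sum_squares_cmp(x))
```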

Page 25: Data analysis with R and Julia

Improving Performance

Rprof() – profiles where running time is spent across functions

ff – memory-efficient storage of large data on disk and fast access functions

bigmemory – manage massive matrices with shared memory and memory-mapped files
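A minimal profiling session with `Rprof()` might look like the following. The workload `slow_task` is a hypothetical stand-in for real analysis code, and the functions that appear in the summary depend on the machine and R version.

```r
# Hypothetical workload: repeated sorting of random numbers.
slow_task <- function(reps = 40, n = 5e5) {
  for (i in seq_len(reps)) sort(runif(n))
  invisible(NULL)
}

profile_file <- tempfile(fileext = ".out")
Rprof(profile_file)        # start sampling the call stack
slow_task()
Rprof(NULL)                # stop profiling

profile <- summaryRprof(profile_file)
head(profile$by.self)      # functions ranked by time spent in their own code
```

Because `Rprof()` samples at fixed intervals, very short tasks may record no samples at all, so the profiled code should run for at least a noticeable fraction of a second.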

Page 26: Data analysis with R and Julia

R for Production Use

Page 27: Data analysis with R and Julia

Derivative Projects

RStudio – Integrated Development Environment (IDE)

Rattle – Data Mining Package

RExcel – (Statconn) Connection between R and Excel

Weka – Java-based data mining, statistical analysis by R

RapidMiner – Java-based Weka data mining, statistical analysis by R

Revolution Analytics – Scaling R for the Enterprise

Oracle R Enterprise – Integrated into Oracle

Page 28: Data analysis with R and Julia

About Statconn (as of March 2013)

Produces RAndFriends under noncommercial and commercial licenses

All the statconn tools work ONLY with 32-bit R

statconnDCOM

rcom (GPL2, but requires statconnDCOM)

RExcel 3.2.9 (ONLY 32-bit Office: 2003, 2007, 2010)

http://rcom.univie.ac.at/

Page 29: Data analysis with R and Julia

Sample Projects Using R

The Heritage Health Prize, Thomas Nguyen

A Direct Marketing In-flight Forecasting System, Shannon Terry & Ben Ogorek

Mining Twitter for Airline Consumer Sentiment, Jeffrey Breen

Alternative Data Sources for Measuring Market Sentiment and Events (Using R), Joe Rothermich

Page 30: Data analysis with R and Julia

The Julia Language

http://julialang.org/

Page 31: Data analysis with R and Julia

About Julia

High-level, high-performance dynamic open-source programming language for technical computing

Syntax similar to other technical computing environments

Features

Sophisticated compiler

Distributed parallel execution

Numerical accuracy

Extensive mathematical function library

Uses C, C++, Fortran libraries extensively

Page 32: Data analysis with R and Julia

Why Julia: “Because we are greedy”

http://julialang.org/blog/2012/04/nyc-open-stats-meetup-announcement/

Page 33: Data analysis with R and Julia

Julia Community

Hosted on GitHub

550 mailing list subscribers (Google Groups)

1,500 GitHub followers

190 forks

50 total contributors

As of September 2012, all contributors except the core developers had known of the language for six months or less

Julia: A Fast Dynamic Language for Technical Computing (2012), Bezanson, Karpinski, Shah, Edelman

Page 34: Data analysis with R and Julia

The Julia Manual

http://docs.julialang.org/en/latest/manual/

Page 35: Data analysis with R and Julia

Julia Mathematical Functions

http://docs.julialang.org/en/latest/manual/mathematical-operations/

Page 36: Data analysis with R and Julia

Julia Standard Library

http://docs.julialang.org/en/latest/stdlib/

Page 37: Data analysis with R and Julia

Julia Performance

Page 38: Data analysis with R and Julia

Key Ingredients of Julia Performance

Rich type information, provided naturally by multiple dispatch

Aggressive code specialization against run-time types

Julia’s LLVM-based just-in-time (JIT) compiler

Julia: A Fast Dynamic Language for Technical Computing (2012), Bezanson, Karpinski, Shah, Edelman
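For R users, the idea of multiple dispatch (choosing a method from the types of all arguments, not just the first) can be previewed with S4 generics. The sketch below is R, not Julia, and it shows only the dispatch rule. It does not reproduce Julia's code specialization or JIT compilation. The generic `describe_pair` is a made-up name.

```r
library(methods)   # S4 classes and generics

# Hypothetical generic; the name and methods are illustrative only.
setGeneric("describe_pair", function(a, b) standardGeneric("describe_pair"))

setMethod("describe_pair", signature("numeric", "numeric"),
          function(a, b) a + b)
setMethod("describe_pair", signature("character", "character"),
          function(a, b) paste0(a, b))
setMethod("describe_pair", signature("numeric", "character"),
          function(a, b) paste(a, b))

describe_pair(1, 2)        # 3: the numeric, numeric method
describe_pair("a", "b")    # "ab": the character, character method
describe_pair(1, "b")      # "1 b": the numeric, character method
```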

Page 39: Data analysis with R and Julia

Julia Performance Comparison

http://julialang.org/

Page 40: Data analysis with R and Julia

Julia Performance Comparison

Julia: A Fast Dynamic Language for Technical Computing (2012), Bezanson, Karpinski, Shah, Edelman

Page 41: Data analysis with R and Julia

Julia Recommendations

The software is ready for people already using C or Fortran

The software will develop into a usable scripting language for R users

Wait until version one for production use

Page 42: Data analysis with R and Julia

Send Me Your Questions

http://marktab.net

Page 43: Data analysis with R and Julia

Conclusion

R provides production-ready software for statistical analysis

Julia merits personal investment and promises high performance