A quick tour and overview of toolkits in R, Python and C++ for analytics applications.


Page 1: An Analytics Toolkit Tour

A Programming Language/Toolkit Tour

Rory Winston

Monday, 27 February 2012

Page 2: An Analytics Toolkit Tour

Agenda

• A quick overview and tour of:

• R

• Python

• Java/C++

• For data analysis/analytics applications

• Comparison


Page 3: An Analytics Toolkit Tour

Purpose

• To give a feeling for the relative advantages and disadvantages of each approach

• Understand the tradeoffs involved

• See some demos


Page 4: An Analytics Toolkit Tour

R

• R is a domain-specific language (DSL) for statistics and data analysis

• Functional-based language

• Based on an earlier language called S

• Core engine written in C

• Open-source

• Popularity has exploded in the last few years

• Some commercial support


Page 5: An Analytics Toolkit Tour

Pros

• R is the de facto standard in statistical analysis tooling

• Incredible range of functionality via contributed libraries

• Powerful interactive analysis environment and visualization tools

• Large number of built-in datasets

• Cross-platform

• Broad user community

• Wide range of resources (books, tutorials, papers) available


Page 6: An Analytics Toolkit Tour

Cons

• Performance limitations

• Single-threaded interpreter

• Language limitations and quirks

• Initial learning curve may be steep

• R gives you a lot of power, but assumes you know how to use it!


Page 7: An Analytics Toolkit Tour

Language Features

• R is vectorized:

• Loops are not required for many operations (and are actually discouraged)

• R is functional:

• Functions can be passed around like other variables

• R integrates with a BLAS:

• high-performance numerical operations


Page 8: An Analytics Toolkit Tour

Demo

• Console R

• R GUI

• RStudio


Page 9: An Analytics Toolkit Tour

Tips

• Learn how to use ggplot2 (http://had.co.nz/ggplot2/)

• Consider using RStudio (http://www.rstudio.org)


Page 10: An Analytics Toolkit Tour

Python

• Initially developed in the late 1980s

• Object-oriented / functional support

• Open-source

• Initially popular in web applications, now popular across a number of domains


Page 11: An Analytics Toolkit Tour

Pros

• Very readable, simple and clear syntax

• Well-supported (many libraries and extensions)

• Easy to integrate with other languages (e.g. C)

• Very efficient environment to develop in


Page 12: An Analytics Toolkit Tour

Cons

• Language syntax is not universally popular

• In terms of analytics, many libraries are still slightly immature

• Performance can be lacking (although there are many options to tune it)

• Interpreter is effectively single-threaded


Page 13: An Analytics Toolkit Tour

Python + Analytics

• There are a number of excellent libraries available for analytics applications:

• NumPy + SciPy

• matplotlib

• pandas

• scikits

• Some packages (e.g. pandas) are designed to replicate the ‘feel’ and functionality of analysis operations in R


Page 14: An Analytics Toolkit Tour

NumPy + SciPy

• Using NumPy + SciPy + matplotlib provides an experience similar to using an interactive R/Matlab environment

• Supports vectorization and BLAS integration

• Add ipython for more goodness
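As a rough illustration of the vectorization and BLAS points above, the sketch below computes the same expression with an explicit loop and with a whole-array operation, then forms a small matrix product. The data and variable names are made up for the example, and whether the product is actually dispatched to a BLAS depends on how NumPy was built.

```python
import numpy as np

# Illustrative input: 1,000 evenly spaced values (not taken from the slides).
x = np.linspace(0.0, 1.0, 1000)

# Loop version: one Python-level operation per element.
loop_result = np.empty_like(x)
for i in range(x.size):
    loop_result[i] = 3.0 * x[i] ** 2 + 1.0

# Vectorized version: the same arithmetic applied to the whole array at once.
vec_result = 3.0 * x ** 2 + 1.0

# Matrix product of two small float arrays; NumPy typically hands this
# to the BLAS library it was built against.
A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(12, dtype=float).reshape(3, 4)
C = A @ B
```

Both versions give the same values; the vectorized form is shorter and usually much faster because the loop runs in compiled code.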


Page 15: An Analytics Toolkit Tour

Tips

• Use ipython!

• Check out:

• http://pandas.pydata.org/

• http://statsmodels.sourceforge.net/

• http://scikit-learn.org


Page 16: An Analytics Toolkit Tour

Comparisons

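R:

x <- 1:10

x <- seq(1, 2, .2)

x <- seq(1, 2, length.out=15)

M <- matrix(1:100, 10, 10)

x[ x < 1.5 ]

X <- cbind(a,b)

Python (NumPy):

x = arange(1,11)

x = arange(1,2,.2)  # excludes the endpoint 2, unlike seq(1, 2, .2)

x = linspace(1,2,15)

M = arange(1,101).reshape(10,10)  # fills row by row; R's matrix() fills column by column

x[x < 1.5]

X = column_stack((a,b))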
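For readers who want to try the NumPy side of this comparison, here is a self-contained sketch with the import spelled out. The variable names and the vectors `a` and `b` are hypothetical (the slide does not define them). Two details differ from R: `arange` stops before its end value, and `order="F"` is needed to reproduce R's column-by-column matrix fill.

```python
import numpy as np

x_int = np.arange(1, 11)           # R: 1:10
x_step = np.arange(1, 2, 0.2)      # R: seq(1, 2, .2), but without the endpoint 2
x_lin = np.linspace(1, 2, 15)      # R: seq(1, 2, length.out=15)

# R fills matrices column by column; order="F" reproduces that layout.
M = np.arange(1, 101).reshape(10, 10, order="F")   # R: matrix(1:100, 10, 10)

subset = x_lin[x_lin < 1.5]        # R: x[x < 1.5]

# Hypothetical vectors for the cbind example.
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
X = np.column_stack((a, b))        # R: cbind(a, b)
```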


Page 17: An Analytics Toolkit Tour

Java/C++

• The ultimate in power/flexibility

• Also the ultimate in development time and effort

• Let's just look at C++ briefly


Page 18: An Analytics Toolkit Tour

C++

• Old but still very popular

• Just had a revamp (C++11, was C++0x)

• Mostly competes with Java on the server side

• Everything else (JVM, R, Python) is written in C/C++

• Both R and Python provide easy ways to interface with C/C++ code

• This is used a lot
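One way to see the Python-to-C bridge in action is the standard-library `ctypes` module, sketched below by calling `strlen` from the C standard library. This is only one of several interfacing routes, and the sketch assumes a POSIX system (Linux or macOS) where the C library can be located or is already loaded into the process.

```python
import ctypes
import ctypes.util

# Locate the C standard library. If the lookup fails, CDLL(None) falls back
# to the symbols already loaded in the current process (POSIX only).
libc_path = ctypes.util.find_library("c")
libc = ctypes.CDLL(libc_path)

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# Call the C function directly with a Python bytes object.
length = libc.strlen(b"analytics")
```

Declaring `argtypes` and `restype` up front lets `ctypes` check and convert arguments rather than guessing at the C types.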


Page 19: An Analytics Toolkit Tour

Pros

• Flexibility

• Lots of libraries available

• Control of resources for performance-critical apps (e.g. memory)

• C++11 adds a lot of nice stuff (finally)


Page 20: An Analytics Toolkit Tour

Cons

• Lots of effort

• Lots of hidden traps for the unwary

• Initial experience may be a large productivity hit

• Effort in porting between systems

• There is “modern” C++ (which is actually pretty nice) and everything else (which isn’t so nice)


Page 21: An Analytics Toolkit Tour

Examples

• Let's look at a sample library

• This one is called Armadillo (http://arma.sourceforge.net/)

• Developed in Australia (NICTA / Univ. Queensland)

• Contains functions for numerical applications and some statistical functions

• Modern, efficient use of C++


Page 22: An Analytics Toolkit Tour

Armadillo

• Armadillo supports vectorized operations

• Also integrates with a BLAS

• Example (see console)


Page 23: An Analytics Toolkit Tour

Simple Example

• Using the Box-Jenkins airline passenger data

• Classic dataset

• 12 years of monthly airline passenger observations (144 in all)


Page 24: An Analytics Toolkit Tour

Passenger Dataset


Page 25: An Analytics Toolkit Tour

Linear Model

• We will use a simple linear model (explains 85% of the variance of this data)

$$Ax = b$$

$$A = \begin{pmatrix} 1 & t_1 \\ 1 & t_2 \\ 1 & t_3 \\ \vdots & \vdots \end{pmatrix}$$
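A minimal NumPy sketch of this setup follows, assuming the system is solved in the least-squares sense. The data here is synthetic (a made-up trend plus noise over 144 time points) and stands in for the passenger series, which is not reproduced; the coefficients and noise level are arbitrary choices for illustration.

```python
import numpy as np

# Synthetic stand-in for 144 monthly observations; NOT the airline data.
rng = np.random.default_rng(0)
t = np.arange(1, 145, dtype=float)
b = 100.0 + 2.5 * t + rng.normal(scale=20.0, size=t.size)

# Design matrix A: a column of ones (intercept) next to the time index.
A = np.column_stack((np.ones_like(t), t))

# Least-squares solution of Ax = b; x holds (intercept, slope).
x, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)
intercept, slope = x

# Fraction of variance explained by the fitted line (R squared).
fitted = A @ x
r_squared = 1.0 - np.sum((b - fitted) ** 2) / np.sum((b - b.mean()) ** 2)
```

The column of ones is what gives the model an intercept; dropping it would force the fitted line through the origin.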


Page 26: An Analytics Toolkit Tour

Conclusion

• Use the toolkit that’s most appropriate for you

• A common approach is to use e.g. R for prototyping and model selection and (if required) switch to a higher-performance implementation for production

• If you have time, learn all of them!


Page 27: An Analytics Toolkit Tour

Language Map

[Figure: language map. Languages shown: R, Octave; Python, Ruby; Java, C/C++. Axis labels: "Performance, complexity" and "Interactivity". Regions: Dynamic Typing, Static Typing.]


Page 28: An Analytics Toolkit Tour

Resources
