R at Microsoft
• Introduction to R
• Applications of R at Microsoft
• R Products at Microsoft
• What’s coming for R at Microsoft
• Q&A
Agenda
April 6, 2015
“This acquisition will help customers use advanced analytics within Microsoft data platforms.”
• Most widely used data analysis software
• Used by 2M+ data scientists, statisticians and analysts
• Most powerful statistical programming language
• Flexible, extensible and comprehensive for productivity
• Create beautiful and unique data visualizations
• As seen in New York Times, The Economist and FlowingData
• Thriving open-source community
• Leading edge of analytics research
• Fills the talent gap
• New graduates prefer R
What is R?
www.revolutionanalytics.com/what-is-r
• 1993: Research project in Auckland, NZ
• Ross Ihaka and Robert Gentleman
• 1995: Released as open-source software
• Generally compatible with the “S” language
• 1997: R core group formed
• 2000: R 1.0.0 released
• 2003: R Foundation formed in Austria
• 2004: First international user conference
• 2007: Revolution Analytics founded
• 2009: New York Times article on R
• 2013: Revolution R Open released
• 2015: Microsoft acquires Revolution Analytics
A brief history of R
Photo credit: Robert Gentleman
R’s popularity is growing rapidly
More at blog.revolutionanalytics.com/popularity
• R Usage Growth: Rexer Data Miner Survey, 2007-2013
• Language Popularity: IEEE Spectrum Top Programming Languages, July 2014
• #9: R
Advanced Analytics with Data Science: Beyond business intelligence
[Figure: value (y-axis) against difficulty (x-axis), moving from information to optimization and from Traditional BI to Advanced Analytics; labels along the curve: hindsight, insight, foresight. Source: Gartner]
• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?
• System monitoring & alerting
• Understanding user behavior (how users configure monitoring platform)
• Visualizing infrastructure utilization data
• Abnormal login detection
• Custom R packages to analyze monitoring data (time series anomaly detection)
• Capacity Planning
• Forecasting hardware purchase requirements (forecast package)
• Also RAM requirements for Microsoft IT
Microsoft Azure uses R for Reliability
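The capacity-planning bullets describe forecasting future hardware needs from historical usage. A minimal sketch of that idea follows, assuming synthetic monthly usage figures and a plain linear trend fitted with base R's `lm()`. The slides name the forecast package; this sketch does not use it, and the data, column names and six-month horizon are all illustrative assumptions.

```r
# Hypothetical monthly storage usage in TB (synthetic data, not from the talk).
set.seed(42)
months <- 1:24
usage_tb <- 100 + 5 * months + rnorm(24, sd = 3)
history <- data.frame(month = months, usage_tb = usage_tb)

# Fit a linear trend; a simple stand-in for the models in the forecast package.
fit <- lm(usage_tb ~ month, data = history)

# Project six months ahead with 95% prediction intervals.
future <- data.frame(month = 25:30)
projection <- predict(fit, newdata = future, interval = "prediction", level = 0.95)
capacity_plan <- cbind(future, projection)
print(capacity_plan)
```

The upper bound of the interval (`upr`) is the more cautious number to plan purchases against, since it allows for noise around the trend.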
• TrueSkill Matchmaking System
• Player Churn
• Game design
• Difficulty curve
• Level trouble-spots
• In-game purchase optimization
• Fraud detection
• Player communities
Xbox uses R for a great gaming experience
• Enhanced Open Source R distribution
• Compatible with all R-related software
• Multi-threaded for performance
• Focus on reproducibility
• Open source (GPLv2 license)
• Available for Windows, Mac OS X, Ubuntu, Red Hat and OpenSUSE
• Download from mran.revolutionanalytics.com
Revolution R Open
• Built on latest R engine
• Currently R 3.2.0
• Updates released 3 weeks after R
• Drop-in replacement for R
• 100% compatible with
  • R scripts
  • R packages
  • Applications with R connections
• Designed to work with RStudio
• No configuration required
RRO: 100% Compatibility
• Multithreaded library replaces standard BLAS/LAPACK algorithms
• Intel MKL on Windows/Linux; Accelerate on Mac
• High-performance algorithms
• Sequential → Parallel
• Uses as many threads as there are available cores
• No need to change any R code
• Included with RRO binary distributions
Multi-threaded performance
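The point that no R code needs to change can be illustrated with ordinary linear algebra. The sketch below uses only base R; calls such as `crossprod()` and `solve()` are handed to whichever BLAS/LAPACK library the running R build is linked against, so the same script would simply run faster under a multithreaded library. The matrix size is an arbitrary assumption, and no timing result is claimed here.

```r
set.seed(1)
x <- matrix(rnorm(1000 * 200), nrow = 1000, ncol = 200)

# These calls go to the linked BLAS/LAPACK library; the R code is the same
# whether that library is single-threaded or multithreaded.
timing <- system.time({
  xtx <- crossprod(x)      # t(x) %*% x, computed by BLAS
  xtx_inv <- solve(xtx)    # matrix inverse, computed by LAPACK
})
print(timing[["elapsed"]])
```

Running the same script under standard R and under a build linked to a multithreaded library is a simple way to compare the two on a given machine.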
More at Revolutions blog
An R Reproducibility Problem
Adapted from http://xkcd.com/234/ CC BY-NC 2.5
• Static CRAN mirror
  • CRAN packages fixed with each Revolution R Open update
• Daily CRAN snapshots
  • Storing every package version since September 2014
  • Binaries and sources
  • At mran.revolutionanalytics.com/snapshot
• Easily write and share scripts synced to a specific snapshot
  • “checkpoint” package installed with RRO
Reproducible R Toolkit in RRO
[Diagram: CRAN, the CRAN mirror, the checkpoint server (Midnight UTC) and the daily snapshots used by the checkpoint package]
CRAN mirror: http://cran.revolutionanalytics.com/
Daily snapshots: http://mran.revolutionanalytics.com/snapshot/
checkpoint package:
    library(checkpoint)
    checkpoint("2014-09-17")
• Easy to use: add 2 lines to the top of each script
    library(checkpoint)
    checkpoint("2014-09-17")
• For the package author:
  • Use package versions available on the chosen date
  • Installs packages local to this project
  • Allows different package versions to be used simultaneously
• For a script collaborator:
  • Automatically installs required packages
  • Detects required packages (no need to manually install!)
  • Uses same package versions as script author to ensure reproducibility
Using checkpoint
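The idea behind a dated snapshot is that one date maps to one fixed package repository. The sketch below illustrates that mapping only; it is not how the checkpoint package is implemented. The base URL is the one shown in the slides, while the helper name `snapshot_url` and the assumption that the date is appended directly to that URL are illustrative.

```r
# Base URL taken from the slides; appending the date to it is an assumption.
snapshot_base <- "http://mran.revolutionanalytics.com/snapshot/"

# Hypothetical helper: build the repository URL for one snapshot date.
snapshot_url <- function(date) {
  parsed <- as.Date(date, format = "%Y-%m-%d")
  if (is.na(parsed)) stop("date must be in YYYY-MM-DD form: ", date)
  paste0(snapshot_base, format(parsed, "%Y-%m-%d"))
}

repo <- snapshot_url("2014-09-17")
print(repo)

# Pointing R at that snapshot would then look like this (left commented out,
# because it changes where install.packages() downloads from):
# options(repos = c(CRAN = repo))
```

Because every collaborator resolves the same date to the same repository, package versions stay fixed no matter when the script is run.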
• Download Revolution R Open
• Learn about R and RRO
• Daily CRAN snapshots
• Explore packages and dependencies
• Explore Task Views
MRAN: The Managed R Archive Network
http://mran.revolutionanalytics.com
Transformational Trends
• Data explosion: 90% of the data in the world today has been created in the last two years alone
• Cloud computing: 5x increase, 2011 to 2016
• Emerging data science talent: universities filling 300,000 US talent gap
• Open source: e.g. R and Python
• Toolkits for data scientists and numerical analysts to create custom parallel and distributed algorithms
  • ParallelR: parallel programming for multi-CPU servers and grids
  • RHadoop: map-reduce programming in R language
• Mainly useful for “embarrassingly parallel” problems, where parallel components work with small amounts of data
• Big Data Predictive Analytics mostly not embarrassingly parallel
  • 80+ pre-built “parallel external memory algorithms” included with Revolution R Enterprise
  • Azure ML Studio includes many ML algorithms
Details at projects.revolutionanalytics.com
R Packages: RHadoop and ParallelR
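The phrase “parallel external memory algorithm” describes computations that process data one chunk at a time and then combine small intermediate results, so the full data set never has to sit in memory. The sketch below shows that pattern for a mean and variance in base R. It is an illustration of the general idea, not the ScaleR implementation; the function names `chunk_stats` and `combine_stats` and the chunk size are assumptions.

```r
# Summarise one chunk into sufficient statistics: count, sum, sum of squares.
chunk_stats <- function(x) c(n = length(x), s = sum(x), ss = sum(x^2))

# Combine per-chunk statistics into an overall mean and sample variance.
combine_stats <- function(stats_list) {
  total <- Reduce(`+`, stats_list)
  n <- total[["n"]]
  mean_x <- total[["s"]] / n
  var_x <- (total[["ss"]] - n * mean_x^2) / (n - 1)
  c(mean = mean_x, var = var_x)
}

set.seed(7)
x <- rnorm(1e5, mean = 10, sd = 2)
chunks <- split(x, ceiling(seq_along(x) / 1e4))   # ten chunks of 10,000 values
result <- combine_stats(lapply(chunks, chunk_stats))
print(result)
```

Each chunk is summarised independently, so the `lapply()` step could be handed to parallel workers, and in a real external-memory setting each chunk would be read from disk rather than split from an in-memory vector.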
Revolution R Enterprise is the only big data, big analytics platform based on open source R, the de facto statistical computing language for modern analytics
• High Performance, Scalable Analytics
• Portable Across Enterprise Platforms
• Easier to Build & Deploy Analytics
ScaleR Functions & Algorithms
Legend on the slide: New in v7.3; Coming in v7.4

Data Step
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)

Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations

Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test

Sampling
• Subsample (observations & variables)
• Random Sampling

Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models

Variable Selection
• Stepwise Regression

Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation

Cluster Analysis
• K-Means

Classification
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes

Combination
• PEMA-R API
• rxDataStep
• rxExec
• ETL
  • Marketing channel data
  • Behavioral variables
  • Promotional data
  • Overlay data
• Exploratory data analysis
• Time-to-event models
• GAM survival models
• Scoring for inference
• Scoring for prediction
• 5 billion scores per day per retailer
CUSTOM DATA FORMAT
CUSTOM VARIABLES (PMML)
• Exposing the expertise of data scientists as APIs
• Bringing the utility of data science to applications
• Addressing the Data Science talent gap
The Opportunity: Data Science as a Service
Azure: Huge infrastructure scale
19 Regions ONLINE…huge datacenter capacity around the world…and we’re growing
• 100+ datacenters
• One of the top 3 networks in the world (coverage, speed, connections)
• 2x AWS and 6x Google number of offered regions
• G Series – Largest VM available in the market – 32 cores, 448GB RAM, SSD…
Regions (map legend: Operational, Announced):
• Central US: Iowa
• West US: California
• North Europe: Ireland
• East US: Virginia
• East US 2: Virginia
• US Gov: Virginia
• North Central US: Illinois
• US Gov: Iowa
• South Central US: Texas
• Brazil South: Sao Paulo
• West Europe: Netherlands
• China North *: Beijing
• China South *: Shanghai
• Japan East: Saitama
• Japan West: Osaka
• India West: TBD
• India East: TBD
• East Asia: Hong Kong
• SE Asia: Singapore
• Australia West: Melbourne
• Australia East: Sydney
* Operated by 21Vianet
Microsoft Azure Machine Learning – Custom Modules in R
Get started for free at gallery.azureml.net
SQL Server 2016: Built-in in-database analytics
• Data Scientist: Interact directly with data
• Data Developer/DBA: Manage data and analytics together
• Built-in to SQL Server
Example Solutions
• Fraud detection
• Sales forecasting
• Warehouse efficiency
• Predictive maintenance
[Diagram labels: Relational Data, Analytic Library, T-SQL Interface, Extensibility, R Integration, Microsoft Azure Machine Learning Marketplace, New R scripts]
In-Database Acceleration
5+ hours to 40 seconds: the recommendation is that this now become the de facto productionalization process
[Chart: minutes against rows. Labels: R on a server pulling data via SQL; R on a server; Invoking RRE ScaleR; Inside the EDW]
Wrap-up
R is strategic for Microsoft:
• Widespread internal use
• Enhanced open source R: Revolution R Open
• Big Data R: Revolution R Enterprise
• R in the Cloud: Azure ML Studio
• In-Database R: SQL Server 2016
… and more to come!
Thank you
Download Revolution R Open: mran.revolutionanalytics.com
More at: blog.revolutionanalytics.com
David Smith
R Community Lead
Revolution Analytics
@[email protected]
© 2015 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
DeployR
• Goal: embed results from R scripts into existing applications, in real time
• Problem:
  • Exposing arbitrary R functions is unwise
  • Need to handle concurrent R sessions
• Solution: DeployR
  • R, on a server, behind a firewall
  • Repository Manager defines entry points
  • Expose only authorized R functions
  • Automatically creates Web Services APIs
  • Manages and monitors pool of R sessions
  • Separates roles for R and app developer
• DeployR Open: for prototyping integrations
• Revolution R Enterprise adds grid-scaling and enterprise authentication
More at deployr.revolutionanalytics.com
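The principle of exposing only authorized R functions can be sketched as a small allow-list dispatcher. This is an illustration of the idea in base R, not the DeployR API: the registry `entry_points`, the function `call_entry_point` and the two example entry points are all hypothetical names.

```r
# Hypothetical registry of the only functions an application may call.
entry_points <- list(
  summarize = function(values) summary(as.numeric(values)),
  zscore    = function(values) as.numeric(scale(as.numeric(values)))
)

# Dispatch a request by name; anything outside the registry is rejected.
call_entry_point <- function(name, args = list()) {
  if (!is.character(name) || length(name) != 1 || !name %in% names(entry_points)) {
    stop("not an authorized entry point: ", name)
  }
  do.call(entry_points[[name]], args)
}

print(call_entry_point("zscore", list(values = c(1, 2, 3))))
```

A request for anything not in the registry, for example `call_entry_point("system", list("ls"))`, stops with an error instead of running arbitrary code, which is the separation between the R author (who defines entry points) and the application developer (who can only call them).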