![Page 1: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/1.jpg)
R, Scikit-Learn and Apache Spark ML - What difference does it make?
Villu Ruusmann
Openscoring OÜ
![Page 2: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/2.jpg)
Overview
● Identifying long-standing, high-value opportunities in the applied predictive analytics domain
● Thinking about problems in API terms
● Providing solutions in API terms
● Developing and applying custom tools
+ A couple of tips if you're looking to buy or sell a VW Golf
![Page 3: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/3.jpg)
The trade-off
![Page 4: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/4.jpg)
"More data beats better algorithms"
![Page 5: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/5.jpg)
The state of the art
![Page 6: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/6.jpg)
Scaling out horizontally
![Page 7: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/7.jpg)
Elements of reproducibility
Standardized, human- and machine-readable descriptions:
● Dataset
● Data pre- and post-processing steps:
  ○ From real-life input table (SQL, CSV) to model
  ○ From model to real-life output table
● Model
● Statistics
![Page 8: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/8.jpg)
Calling R from within Apache Spark
1. Create and initialize R runtime
2. Format and upload input RDD; upload and execute R model; download output and parse into result RDD
3. Destroy R runtime
![Page 9: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/9.jpg)
Calling Scikit-Learn from within Apache Spark
1. Format input RDD (e.g. using Java NIO) as numpy.array
2. Invoke Scikit-Learn via Python/C API
3. Parse output numpy.array into result RDD
![Page 10: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/10.jpg)
API prioritization
Training << Maintenance ~ Deployment
One-time activity << Repeated activities
Short-term << Long-term
![Page 11: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/11.jpg)
JPMML - Java PMML API
● Conversion API
● Maintenance API
● Execution API
  ○ Interpreted mode
  ○ Translated + compiled ("Transpiled") mode
● Serving API
  ○ Integrations with popular Big Data frameworks
  ○ REST web service
![Page 12: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/12.jpg)
Calling JPMML-Spark from within Apache Spark
org.jpmml.spark.TransformerBuilder pmmlTransformerBuilder = ..;
org.apache.spark.ml.Transformer pmmlTransformer = pmmlTransformerBuilder.build();
org.apache.spark.sql.Dataset<Row> input = ..;
org.apache.spark.sql.Dataset<Row> result = pmmlTransformer.transform(input);
![Page 13: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/13.jpg)
The case study
Predicting the price of VW Golf cars using GBT algorithms:
● 71 columns:
  ○ A continuous label: log(price)
  ○ Two string and four numeric categorical features
  ○ 64 binary-like (0/1) and numeric continuous features
● 270'458 rows:
  ○ 153'978 complete cases
  ○ 116'480 incomplete (i.e. with missing values) cases
![Page 14: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/14.jpg)
Gradient-Boosted Trees (GBTs)
![Page 15: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/15.jpg)
R training and conversion API
#library("caret")
library("gbm")
library("r2pmml")
cars = read.csv("cars.tsv", sep = "\t", na.strings = "N/A")
factor_cols = c("category", "colour", "ac", "fuel_type", "gearbox", "interior_color", "interior_type")
for(factor_col in factor_cols){
  cars[, factor_col] = as.factor(cars[, factor_col])
}
# Doesn't work with factors with missing values
#cars.gbm = train(price ~ ., data = cars, method = "gbm", na.action = na.pass, ..)
cars.gbm = gbm(price ~ ., data = cars, n.trees = 100, shrinkage = 0.1, interaction.depth = 6)
r2pmml(cars.gbm, "gbm.pmml")
![Page 16: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/16.jpg)
Scikit-Learn training and conversion API
import pandas
from sklearn_pandas import DataFrameMapper
from sklearn.model_selection import GridSearchCV
from sklearn2pmml import sklearn2pmml, PMMLPipeline
cars = pandas.read_csv("cars.tsv", sep = "\t", na_values = ["N/A", "NA"])
mapper = DataFrameMapper(..)
regressor = ..
tuner = GridSearchCV(regressor, param_grid = .., fit_params = ..)
tuner.fit(mapper.fit_transform(cars), cars["price"])
pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("regressor", tuner.best_estimator_)
])
sklearn2pmml(pipeline, "pipeline.pmml", with_repr = True)
![Page 17: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/17.jpg)
Dataset
| | R | LightGBM | XGBoost | Scikit-Learn | Apache Spark ML |
|---|---|---|---|---|---|
| Abstraction | data.frame | lgb.Dataset | xgb.DMatrix | numpy.array | RDD<Vector> |
| Memory layout | Contiguous, dense | Contiguous, dense(?) | Contiguous, dense/sparse | Contiguous, dense/sparse | Distributed, dense/sparse |
| Data type | Any | double | float | float or double | double |
| Categorical values | As-is (factor) | Encoded | Binarized | Binarized | Binarized |
| Missing values | Yes | Pseudo (NaN) | Pseudo (NaN) | No | No |
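The "Encoded" and "Binarized" entries in the categorical values row can be illustrated with a small plain-Python sketch. The helper names `label_encode` and `binarize` are hypothetical and only mimic the idea; they are not the sklearn2pmml or LightGBM implementations.

```python
# Illustrative sketch only: plain-Python versions of the two categorical encodings.

def label_encode(values):
    """'Encoded': map each distinct category to one integer index (sorted order)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return categories, [index[value] for value in values]

def binarize(values):
    """'Binarized': expand each value into one 0/1 indicator column per category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows

colours = ["Rot", "Blau", "Rot", "Gelb"]
categories, encoded = label_encode(colours)
_, binarized = binarize(colours)

print(categories)  # ['Blau', 'Gelb', 'Rot']
print(encoded)     # [2, 0, 2, 1]
print(binarized)   # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Encoding keeps one column per feature, whereas binarization produces one column per category level, which is why the original feature has to be reassembled when interpreting a model trained on binarized data.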
![Page 18: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/18.jpg)
LightGBM via Scikit-Learn
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import PMMLLabelEncoder
from lightgbm import LGBMRegressor
mapper = DataFrameMapper(
    [(factor_column, PMMLLabelEncoder()) for factor_column in factor_columns] +
    [(continuous_columns, None)]
)
transformed_cars = mapper.fit_transform(cars)
regressor = LGBMRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6, num_leaves = 64)
regressor.fit(transformed_cars, cars["price"],
    categorical_feature = list(range(0, len(factor_columns))))
![Page 19: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/19.jpg)
XGBoost via Scikit-Learn
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import PMMLLabelBinarizer
from xgboost.sklearn import XGBRegressor
mapper = DataFrameMapper(
    [(factor_column, PMMLLabelBinarizer()) for factor_column in factor_columns] +
    [(continuous_columns, None)]
)
transformed_cars = mapper.fit_transform(cars)
regressor = XGBRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6)
regressor.fit(transformed_cars, cars["price"])
![Page 20: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/20.jpg)
GBT algorithm (training)
| | R | LightGBM | XGBoost | Scikit-Learn | Apache Spark ML |
|---|---|---|---|---|---|
| Abstraction | gbm | LGBMRegressor | XGBRegressor | GradientBoostingRegressor | GBTRegressor |
| Parameterizability | Medium | High | High | Medium | Medium |
| Split type | Multi-way | Binary | Binary | Binary | Binary |
| Categorical values | "set contains" | "equals" | Pseudo ("equals") | Pseudo ("equals") | "equals" |
| Missing values | First-class | Pseudo | Pseudo | No | No |
![Page 21: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/21.jpg)
gbm-style splits
<Node id="9">
  <SimplePredicate field="interior_type" operator="isMissing"/>
  <Node id="12" score="3.0702062395803734E-4">
    <SimplePredicate field="colour" operator="isMissing"/>
  </Node>
  <Node id="10" score="-0.018950416258408962">
    <SimpleSetPredicate field="colour" booleanOperator="isIn">
      <Array type="string">Grün Rot Violett Weiß</Array>
    </SimpleSetPredicate>
  </Node>
  <Node id="11" score="-0.0017446280908351925">
    <SimpleSetPredicate field="colour" booleanOperator="isIn">
      <Array type="string">Beige Blau Braun Gelb Gold Grau Orange Schwarz Silber</Array>
    </SimpleSetPredicate>
  </Node>
</Node>
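As a reading aid, the multi-way split above can be paraphrased in Python. This is a minimal sketch that assumes children are tried in document order and the first matching one wins; the function name `score_node_9` is hypothetical, and `None` stands in for a missing value.

```python
# Illustrative paraphrase of the gbm-style Node 9 shown above; not generated by JPMML.

SET_NODE_10 = {"Grün", "Rot", "Violett", "Weiß"}
SET_NODE_11 = {"Beige", "Blau", "Braun", "Gelb", "Gold", "Grau", "Orange", "Schwarz", "Silber"}

def score_node_9(interior_type, colour):
    """Return the score below Node 9, or None if no node matches."""
    if interior_type is not None:
        return None                       # Node 9 requires interior_type to be missing
    if colour is None:
        return 3.0702062395803734e-4      # Node 12: missing value gets its own branch
    if colour in SET_NODE_10:
        return -0.018950416258408962      # Node 10: "set contains" test
    if colour in SET_NODE_11:
        return -0.0017446280908351925     # Node 11: "set contains" test
    return None                           # colour not covered by any child

print(score_node_9(None, None))
print(score_node_9(None, "Rot"))
print(score_node_9(None, "Blau"))
```

A single node therefore fans out three ways: one branch for the missing value and one branch per category subset.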
![Page 22: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/22.jpg)
LightGBM- and XGBoost-style splits (1/3)
<Node id="39" defaultChild="76">
  <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
  <Node id="76" score="0.0030283758">
    <SimplePredicate field="colour" operator="notEqual" value="Orange"/>
  </Node>
  <Node id="77" score="0.02483887">
    <SimplePredicate field="colour" operator="equal" value="Orange"/>
  </Node>
</Node>
![Page 23: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/23.jpg)
LightGBM- and XGBoost-style splits (2/3)
<Node id="39">
  <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
  <!-- if(colour == null || !"Orange".equals(colour)) return 0.0030283758 -->
  <Node id="76" score="0.0030283758">
    <CompoundPredicate booleanOperator="or">
      <SimplePredicate field="colour" operator="isMissing"/>
      <SimplePredicate field="colour" operator="notEqual" value="Orange"/>
    </CompoundPredicate>
  </Node>
  <!-- else if("Orange".equals(colour)) return 0.02483887 -->
  <Node id="77" score="0.02483887">
    <SimplePredicate field="colour" operator="equal" value="Orange"/>
  </Node>
  <!-- else return null -->
</Node>
![Page 24: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/24.jpg)
LightGBM- and XGBoost-style splits (3/3)
<Node id="39">
  <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
  <!-- if(colour != null && "Orange".equals(colour)) return 0.02483887 -->
  <Node id="77" score="0.02483887">
    <CompoundPredicate booleanOperator="and">
      <SimplePredicate field="colour" operator="isNotMissing"/>
      <SimplePredicate field="colour" operator="equal" value="Orange"/>
    </CompoundPredicate>
  </Node>
  <!-- else return 0.0030283758 -->
  <Node id="76" score="0.0030283758">
    <True/>
  </Node>
</Node>
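The two rewritten layouts above (the CompoundPredicate "or" variant, and the "and" variant with a <True/> fallback) are meant to score identically. A minimal Python sketch of that equivalence follows, with `None` standing in for a missing colour; the function names are hypothetical and the logic simply follows the XML comments on the slides.

```python
# Illustrative check that both rewritten split layouts return the same score.

LEFT_SCORE = 0.0030283758   # Node 76
RIGHT_SCORE = 0.02483887    # Node 77

def score_or_variant(colour):
    """Node 76 first: missing OR not 'Orange'; then Node 77: equals 'Orange'."""
    if colour is None or colour != "Orange":
        return LEFT_SCORE
    elif colour == "Orange":
        return RIGHT_SCORE
    return None  # "else return null" on the slide; unreachable here

def score_and_variant(colour):
    """Node 77 first: not missing AND equals 'Orange'; otherwise the <True/> fallback."""
    if colour is not None and colour == "Orange":
        return RIGHT_SCORE
    return LEFT_SCORE

samples = [None, "Orange", "Blau"]
results = [(colour, score_or_variant(colour), score_and_variant(colour)) for colour in samples]
for colour, first, second in results:
    print(colour, first, second)
```

In both layouts a missing colour ends up in Node 76, which matches the `defaultChild="76"` attribute of the first layout.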
![Page 25: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/25.jpg)
Model measurement using JPMML
org.dmg.pmml.tree.TreeModel treeModel = ..;
treeModel.accept(new org.jpmml.model.visitors.AbstractVisitor(){
    private int count = 0; // Number of Node elements
    private int maxDepth = 0; // Max "nesting depth" of Node elements
    @Override
    public VisitorAction visit(org.dmg.pmml.tree.Node node){
        this.count++;
        int depth = 0;
        for(org.dmg.pmml.PMMLObject parent : getParents()){
            if(!(parent instanceof org.dmg.pmml.tree.Node)) break;
            depth++;
        }
        this.maxDepth = Math.max(this.maxDepth, depth);
        return super.visit(node);
    }
});
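The same two measurements (number of Node elements and their maximum nesting depth) can be sketched without JPMML, using only the Python standard library. The snippet below is an assumption-laden illustration: it parses a namespace-free fragment copied from an earlier slide, whereas a real PMML file declares an XML namespace that the element lookups would have to account for. The helper name `measure` is hypothetical.

```python
# Illustrative sketch: count Node elements and their max nesting depth with ElementTree.
import xml.etree.ElementTree as ET

SNIPPET = """
<Node id="39">
  <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
  <Node id="77" score="0.02483887">
    <SimplePredicate field="colour" operator="equal" value="Orange"/>
  </Node>
  <Node id="76" score="0.0030283758">
    <True/>
  </Node>
</Node>
"""

def measure(node, depth=0):
    """Return (count, max_depth) for this Node and all Node descendants.

    As in the Java visitor, depth is the number of enclosing Node elements,
    so the root Node has depth 0.
    """
    count, max_depth = 1, depth
    for child in node.findall("Node"):   # direct Node children only
        child_count, child_depth = measure(child, depth + 1)
        count += child_count
        max_depth = max(max_depth, child_depth)
    return count, max_depth

root = ET.fromstring(SNIPPET.strip())
count, max_depth = measure(root)
print(count, max_depth)  # 3 1
```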
![Page 26: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/26.jpg)
![Page 27: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/27.jpg)
![Page 28: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/28.jpg)
![Page 29: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/29.jpg)
GBT algorithm (interpretation)
| | R | LightGBM | XGBoost | Scikit-Learn | Apache Spark ML |
|---|---|---|---|---|---|
| Feature importances | Direct | Direct | Transformed | Transformed | Transformed |
| Decision path | No | No(?) | No(?) | Transformed | Transformed |
| Model persistence | RDS (binary) | Proprietary (text) | Proprietary (binary, text) | Pickle (binary) | SER (binary) or JSON (text) |
| Model reusability | Good | Fair(?) | Good | Fair | Fair |
| Java API | No | No | Pseudo | No | Yes |
![Page 30: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/30.jpg)
LightGBM feature importances
Age 936
Mileage 887
Performance 738
[Category] 205
New? 179
[Type of fuel] 170
[Type of interior] 167
Airbags? 130
[Colour] 129
[Type of gearbox] 105
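To compare these numbers at a glance, they can be turned into relative shares. The sketch below assumes the values are raw importance counts and normalizes over the ten features listed on the slide only; any features not shown are not accounted for.

```python
# Illustrative sketch: relative shares among the ten listed features only.

importances = {
    "Age": 936, "Mileage": 887, "Performance": 738, "[Category]": 205,
    "New?": 179, "[Type of fuel]": 170, "[Type of interior]": 167,
    "Airbags?": 130, "[Colour]": 129, "[Type of gearbox]": 105,
}

total = sum(importances.values())
shares = {name: value / total for name, value in importances.items()}

# The three highest-ranked features and their combined share of the listed total
top3 = sorted(importances, key=importances.get, reverse=True)[:3]
top3_share = sum(shares[name] for name in top3)

for name, share in shares.items():
    print(f"{name:20s} {share:6.1%}")
print("Top 3:", top3, f"{top3_share:.1%}")
```

Under that assumption, the three continuous features (Age, Mileage, Performance) account for roughly 70% of the listed total.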
![Page 31: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/31.jpg)
Model execution using JPMML
org.dmg.pmml.PMML pmml;
try(InputStream is = ..){
    pmml = org.jpmml.model.PMMLUtil.unmarshal(is);
}
org.jpmml.evaluator.Evaluator evaluator =
    new org.jpmml.evaluator.mining.MiningModelEvaluator(pmml);
org.jpmml.evaluator.InputField inputField = selectField(evaluator.getInputFields(), ..);
org.jpmml.evaluator.TargetField targetField = selectField(evaluator.getTargetFields(), ..);
for(int value = min; value <= max; value += increment){
    Map<FieldName, FieldValue> arguments =
        Collections.singletonMap(inputField.getName(), inputField.prepare(value));
    Map<FieldName, ?> result = evaluator.evaluate(arguments);
    System.out.println(result.get(targetField.getName()));
}
![Page 32: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/32.jpg)
![Page 33: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/33.jpg)
![Page 34: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/34.jpg)
Lessons (to be) learned
● Limits and limitations of individual APIs
● Vertical integration vs. horizontal integration:
  ○ All capabilities on a single platform
  ○ Specialized capabilities on specialized platforms
● Ease-of-use and robustness beat raw performance in most application scenarios
● "Conventions over configuration"
![Page 35: R, Scikit-Learn and Apache Spark ML - What difference does it make?](https://reader034.vdocuments.mx/reader034/viewer/2022052418/58f9a8ea760da3da068b6911/html5/thumbnails/35.jpg)
https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml