michal malohlava, software engineer, h2o.ai at mlconf nyc
TRANSCRIPT
![Page 1: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/1.jpg)
Building Machine Learning Applications with Sparkling Water
MLConf 2015 NYC
Michal Malohlava and Alex Tellez and H2O.ai
![Page 3: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/3.jpg)
Scalable Machine Learning
For Smarter Applications
![Page 4: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/4.jpg)
Smarter Applications
![Page 5: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/5.jpg)
Scalable Applications
Distributed
Able to process huge amount of data from different sources
Easy to develop and experiment
Powerful machine learning engine inside
![Page 6: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/6.jpg)
BUT how to build
them?
![Page 7: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/7.jpg)
Build an application with …
![Page 8: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/8.jpg)
Build an application with …
?
![Page 9: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/9.jpg)
…with Spark and H2O
![Page 10: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/10.jpg)
Open-source distributed execution platform
User-friendly API for data transformation based on RDD
Platform components - SQL, MLLib, text mining
Multitenancy
Large and active community
![Page 11: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/11.jpg)
Open-source scalable machine learning platform
Tuned for efficient computation and memory use
Production ready machine learning algorithms
R, Python, Java, Scala APIs
Interactive UI, robust data parser
![Page 12: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/12.jpg)
![Page 13: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/13.jpg)
Sparkling WaterProvides
Transparent integration of H2O with Spark ecosystem
Transparent use of H2O data structures and algorithms with Spark API
Platform for building Smarter Applications
Excels in existing Spark workflows requiring advanced Machine Learning algorithms
![Page 14: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/14.jpg)
Lets build an application !
![Page 15: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/15.jpg)
![Page 16: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/16.jpg)
OR
![Page 17: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/17.jpg)
OR
![Page 18: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/18.jpg)
OR
Detect spam text messages
![Page 19: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/19.jpg)
Data example
![Page 20: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/20.jpg)
ML Workflow
1. Extract data
2. Transform, tokenize messages
3. Build Tf-IDF
4. Create and evaluate Deep Learning model
5. Use the model
Goal: For a given text message identify if it is spam or not
![Page 21: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/21.jpg)
Lego #1: Data load
// Data loaddef load(dataFile: String): RDD[Array[String]] = { sc.textFile(dataFile).map(l => l.split(“\t")) .filter(r => !r(0).isEmpty)}
![Page 22: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/22.jpg)
Lego #2: Ad-hoc Tokenization
def tokenize(data: RDD[String]): RDD[Seq[String]] = { val ignoredWords = Seq("the", “a", …) val ignoredChars = Seq(',', ‘:’, …) val texts = data.map( r => { var smsText = r.toLowerCase for( c <- ignoredChars) { smsText = smsText.replace(c, ' ') } val words =smsText.split(" ").filter(w => !ignoredWords.contains(w) && w.length>2).distinct words.toSeq }) texts}
![Page 23: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/23.jpg)
Lego #3: Tf-IDFdef buildIDFModel(tokens: RDD[Seq[String]], minDocFreq:Int = 4, hashSpaceSize:Int = 1 << 10): (HashingTF, IDFModel, RDD[Vector]) = { // Hash strings into the given space val hashingTF = new HashingTF(hashSpaceSize) val tf = hashingTF.transform(tokens) // Build term frequency-inverse document frequency val idfModel = new IDF(minDocFreq=minDocFreq).fit(tf) val expandedText = idfModel.transform(tf) (hashingTF, idfModel, expandedText)}
“Thank for the order…” […,0,3.5,0,1,0,0.3,0,1.3,0,0,…]
Thank Order
![Page 24: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/24.jpg)
Lego #3: Tf-IDFdef buildIDFModel(tokens: RDD[Seq[String]], minDocFreq:Int = 4, hashSpaceSize:Int = 1 << 10): (HashingTF, IDFModel, RDD[Vector]) = { // Hash strings into the given space val hashingTF = new HashingTF(hashSpaceSize) val tf = hashingTF.transform(tokens) // Build term frequency-inverse document frequency val idfModel = new IDF(minDocFreq=minDocFreq).fit(tf) val expandedText = idfModel.transform(tf) (hashingTF, idfModel, expandedText)}
Hash words into large
space
“Thank for the order…” […,0,3.5,0,1,0,0.3,0,1.3,0,0,…]
Thank Order
![Page 25: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/25.jpg)
Lego #3: Tf-IDFdef buildIDFModel(tokens: RDD[Seq[String]], minDocFreq:Int = 4, hashSpaceSize:Int = 1 << 10): (HashingTF, IDFModel, RDD[Vector]) = { // Hash strings into the given space val hashingTF = new HashingTF(hashSpaceSize) val tf = hashingTF.transform(tokens) // Build term frequency-inverse document frequency val idfModel = new IDF(minDocFreq=minDocFreq).fit(tf) val expandedText = idfModel.transform(tf) (hashingTF, idfModel, expandedText)}
Hash words into large
space
Term freq scale
“Thank for the order…” […,0,3.5,0,1,0,0.3,0,1.3,0,0,…]
Thank Order
![Page 26: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/26.jpg)
Lego #4: Build a modeldef buildDLModel(train: Frame, valid: Frame, epochs: Int = 10, l1: Double = 0.001, l2: Double = 0.0, hidden: Array[Int] = Array[Int](200, 200)) (implicit h2oContext: H2OContext): DeepLearningModel = { import h2oContext._ // Build a model val dlParams = new DeepLearningParameters() dlParams._destination_key = Key.make("dlModel.hex").asInstanceOf[Key[Frame]] dlParams._train = train dlParams._valid = valid dlParams._response_column = 'target dlParams._epochs = epochs dlParams._l1 = l1 dlParams._hidden = hidden // Create a job val dl = new DeepLearning(dlParams) val dlModel = dl.trainModel.get // Compute metrics on both datasets dlModel.score(train).delete() dlModel.score(valid).delete() dlModel}
Deep Learning: Create multi-layer feed forward neural networks starting w i t h an i npu t l a ye r fo l lowed by mul t ip le l a y e r s o f n o n l i n e a r transformations
![Page 27: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/27.jpg)
Assembly application// Data loadval data = load(DATAFILE)// Extract response spam or hamval hamSpam = data.map( r => r(0))val message = data.map( r => r(1))// Tokenize message contentval tokens = tokenize(message)// Build IDF modelvar (hashingTF, idfModel, tfidf) = buildIDFModel(tokens)// Merge response with extracted vectorsval resultRDD: SchemaRDD = hamSpam.zip(tfidf).map(v => SMS(v._1, v._2))val table:DataFrame = resultRDD// Split tableval keys = Array[String]("train.hex", "valid.hex") val ratios = Array[Double](0.8) val frs = split(table, keys, ratios)val (train, valid) = (frs(0), frs(1))table.delete()// Build a modelval dlModel = buildDLModel(train, valid)
![Page 28: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/28.jpg)
Assembly application// Data loadval data = load(DATAFILE)// Extract response spam or hamval hamSpam = data.map( r => r(0))val message = data.map( r => r(1))// Tokenize message contentval tokens = tokenize(message)// Build IDF modelvar (hashingTF, idfModel, tfidf) = buildIDFModel(tokens)// Merge response with extracted vectorsval resultRDD: SchemaRDD = hamSpam.zip(tfidf).map(v => SMS(v._1, v._2))val table:DataFrame = resultRDD// Split tableval keys = Array[String]("train.hex", "valid.hex") val ratios = Array[Double](0.8) val frs = split(table, keys, ratios)val (train, valid) = (frs(0), frs(1))table.delete()// Build a modelval dlModel = buildDLModel(train, valid)
Data munging
![Page 29: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/29.jpg)
Assembly application// Data loadval data = load(DATAFILE)// Extract response spam or hamval hamSpam = data.map( r => r(0))val message = data.map( r => r(1))// Tokenize message contentval tokens = tokenize(message)// Build IDF modelvar (hashingTF, idfModel, tfidf) = buildIDFModel(tokens)// Merge response with extracted vectorsval resultRDD: SchemaRDD = hamSpam.zip(tfidf).map(v => SMS(v._1, v._2))val table:DataFrame = resultRDD// Split tableval keys = Array[String]("train.hex", "valid.hex") val ratios = Array[Double](0.8) val frs = split(table, keys, ratios)val (train, valid) = (frs(0), frs(1))table.delete()// Build a modelval dlModel = buildDLModel(train, valid)
Split dataset
Data munging
![Page 30: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/30.jpg)
Assembly application// Data loadval data = load(DATAFILE)// Extract response spam or hamval hamSpam = data.map( r => r(0))val message = data.map( r => r(1))// Tokenize message contentval tokens = tokenize(message)// Build IDF modelvar (hashingTF, idfModel, tfidf) = buildIDFModel(tokens)// Merge response with extracted vectorsval resultRDD: SchemaRDD = hamSpam.zip(tfidf).map(v => SMS(v._1, v._2))val table:DataFrame = resultRDD// Split tableval keys = Array[String]("train.hex", "valid.hex") val ratios = Array[Double](0.8) val frs = split(table, keys, ratios)val (train, valid) = (frs(0), frs(1))table.delete()// Build a modelval dlModel = buildDLModel(train, valid)
Split dataset
Build model
Data munging
![Page 31: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/31.jpg)
Data exploration
![Page 32: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/32.jpg)
Model evaluationval trainMetrics = binomialMM(dlModel, train)val validMetrics = binomialMM(dlModel, valid)
![Page 33: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/33.jpg)
Model evaluationval trainMetrics = binomialMM(dlModel, train)val validMetrics = binomialMM(dlModel, valid)
Collect model metrics
![Page 34: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/34.jpg)
Spam predictordef isSpam(msg: String, dlModel: DeepLearningModel, hashingTF: HashingTF, idfModel: IDFModel, hamThreshold: Double = 0.5):Boolean = { val msgRdd = sc.parallelize(Seq(msg)) val msgVector: SchemaRDD = idfModel.transform( hashingTF.transform ( tokenize (msgRdd))) .map(v => SMS("?", v)) val msgTable: DataFrame = msgVector msgTable.remove(0) // remove first column val prediction = dlModel.score(msgTable) prediction.vecs()(1).at(0) < hamThreshold}
![Page 35: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/35.jpg)
Spam predictordef isSpam(msg: String, dlModel: DeepLearningModel, hashingTF: HashingTF, idfModel: IDFModel, hamThreshold: Double = 0.5):Boolean = { val msgRdd = sc.parallelize(Seq(msg)) val msgVector: SchemaRDD = idfModel.transform( hashingTF.transform ( tokenize (msgRdd))) .map(v => SMS("?", v)) val msgTable: DataFrame = msgVector msgTable.remove(0) // remove first column val prediction = dlModel.score(msgTable) prediction.vecs()(1).at(0) < hamThreshold}
Prepared models
![Page 36: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/36.jpg)
Spam predictordef isSpam(msg: String, dlModel: DeepLearningModel, hashingTF: HashingTF, idfModel: IDFModel, hamThreshold: Double = 0.5):Boolean = { val msgRdd = sc.parallelize(Seq(msg)) val msgVector: SchemaRDD = idfModel.transform( hashingTF.transform ( tokenize (msgRdd))) .map(v => SMS("?", v)) val msgTable: DataFrame = msgVector msgTable.remove(0) // remove first column val prediction = dlModel.score(msgTable) prediction.vecs()(1).at(0) < hamThreshold}
Prepared models
Default decision threshold
![Page 37: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/37.jpg)
Spam predictordef isSpam(msg: String, dlModel: DeepLearningModel, hashingTF: HashingTF, idfModel: IDFModel, hamThreshold: Double = 0.5):Boolean = { val msgRdd = sc.parallelize(Seq(msg)) val msgVector: SchemaRDD = idfModel.transform( hashingTF.transform ( tokenize (msgRdd))) .map(v => SMS("?", v)) val msgTable: DataFrame = msgVector msgTable.remove(0) // remove first column val prediction = dlModel.score(msgTable) prediction.vecs()(1).at(0) < hamThreshold}
Prepared models
Default decision threshold
Scoring
![Page 38: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/38.jpg)
Predict spam
![Page 39: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/39.jpg)
Predict spamisSpam("Michal, beer tonight in MV?")
![Page 40: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/40.jpg)
Predict spamisSpam("Michal, beer tonight in MV?")
![Page 41: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/41.jpg)
Predict spamisSpam("Michal, beer tonight in MV?")
isSpam("We tried to contact you re your reply to our offer of a Video Handset? 750 anytime any networks mins? UNLIMITED TEXT?")
![Page 42: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/42.jpg)
Predict spamisSpam("Michal, beer tonight in MV?")
isSpam("We tried to contact you re your reply to our offer of a Video Handset? 750 anytime any networks mins? UNLIMITED TEXT?")
![Page 43: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/43.jpg)
Checkout H2O.ai Training Books
http://learn.h2o.ai/
Checkout H2O.ai Blog
http://h2o.ai/blog/
Checkout H2O.ai Youtube Channel
https://www.youtube.com/user/0xdata
Checkout GitHub
https://github.com/h2oai/sparkling-water
Meetups
https://meetup.com/
More info
![Page 44: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC](https://reader031.vdocuments.mx/reader031/viewer/2022032421/55a687401a28ab451e8b45b9/html5/thumbnails/44.jpg)
Learn more at h2o.ai Follow us at @h2oai
Thank you!Sparkling Water is
open-source ML application platform
combining power of Spark and H2O