implementing bigpetstore with apache flink
TRANSCRIPT
![Page 1: Implementing BigPetStore with Apache Flink](https://reader034.vdocuments.mx/reader034/viewer/2022051101/589b9eb51a28abd63e8b5c25/html5/thumbnails/1.jpg)
Implementing BigPetStore A blueprint for Flink users
Márton [email protected] / @MartonBalassi
Hungarian Academy of Sciences
![Page 2: Implementing BigPetStore with Apache Flink](https://reader034.vdocuments.mx/reader034/viewer/2022051101/589b9eb51a28abd63e8b5c25/html5/thumbnails/2.jpg)
Outline
• BigPetStore model• Data generator with the DataSet API• ETL with the DataSet & Table API• Matrix factorization with FlinkML• Recommendation with the DataStream API• Summary
![Page 3: Implementing BigPetStore with Apache Flink](https://reader034.vdocuments.mx/reader034/viewer/2022051101/589b9eb51a28abd63e8b5c25/html5/thumbnails/3.jpg)
BigPetStore
Blueprints for Big Data applicationsConsists of:• Data Generators
• Examples using tools in Big Data ecosystem to process data
• Build system and tests for integrating tools and multiple JVM languages
Part of the Apache BigTop project
![Page 4: Implementing BigPetStore with Apache Flink](https://reader034.vdocuments.mx/reader034/viewer/2022051101/589b9eb51a28abd63e8b5c25/html5/thumbnails/4.jpg)
BigPetStore model
• Customers visiting pet stores generating transactions, location based
Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on , vol., no., pp.49-55, 3-5 Dec. 2014
![Page 5: Implementing BigPetStore with Apache Flink](https://reader034.vdocuments.mx/reader034/viewer/2022051101/589b9eb51a28abd63e8b5c25/html5/thumbnails/5.jpg)
Data generation
val env = ExecutionEnvironment.getExecutionEnvironmentval (stores, products, customers) = getData()val startTime = getCurrentMillis()
val transactions = env.fromCollection(customers).flatMap(new TransactionGenerator(products)).withBroadcastSet(stores, ”stores”).map{t => t.setDateTime(t.getDateTime + startTime); t}
transactions.writeAsText(output)
• Use RJ Nowling’s Java generator classes• Write transactions to JSON
![Page 6: Implementing BigPetStore with Apache Flink](https://reader034.vdocuments.mx/reader034/viewer/2022051101/589b9eb51a28abd63e8b5c25/html5/thumbnails/6.jpg)
ETL with the DataSet API
val env = ExecutionEnvironment.getExecutionEnvironmentval transactions = env.readTextFile(json).map(new FlinkTransaction(_))
val productsWithIndex = transactions.flatMap(_.getProducts).distinct.zipWithUniqueId
val customerAndProductPairs = transactions.flatMap(t => t.getProducts.map(p => (t.getCustomer.getId,
p))).join(productsWithIndex).where(_._2).equalTo(_._2).map(pair => (pair._1._1, pair._2._1)).distinct
customerAndProductPairs.writeAsCsv(output)
• Read the dirty JSON• Output (customer, product) pairs for the
recommender
![Page 7: Implementing BigPetStore with Apache Flink](https://reader034.vdocuments.mx/reader034/viewer/2022051101/589b9eb51a28abd63e8b5c25/html5/thumbnails/7.jpg)
ETL with the Table API
val env = ExecutionEnvironment.getExecutionEnvironmentval transactions = env.readTextFile(json).map(new FlinkTransaction(_))
val table = transactions.map(toCaseClass(_)).toTable
val storeTransactionCount = table.groupBy('storeId).select('storeId, 'storeName, 'storeId.count as 'count)
val bestStores = table.groupBy('storeId).select('storeId.max as 'max).join(storeTransactionCount).where(”count = max”).select('storeId, 'storeName, 'storeId.count as 'count).toDataSet[StoreCount]
• Read the dirty JSON• SQL style queries
![Page 8: Implementing BigPetStore with Apache Flink](https://reader034.vdocuments.mx/reader034/viewer/2022051101/589b9eb51a28abd63e8b5c25/html5/thumbnails/8.jpg)
A little Recommeder theory
Item factors
User side information User-Item matrixUser factors
Item side informatio
n
U
I
PQ
R
• R is potentially huge, approximate it with PQ• Prediction is TopK(user’s row Q)
![Page 9: Implementing BigPetStore with Apache Flink](https://reader034.vdocuments.mx/reader034/viewer/2022051101/589b9eb51a28abd63e8b5c25/html5/thumbnails/9.jpg)
Matrix factorization with FlinkML
val env = ExecutionEnvironment.getExecutionEnvironmentval input = env.readCsvFile[(Int,Int)](inputFile)
.map(pair => (pair._1, pair._2, 1.0))
val model = ALS().setNumfactors(numFactors).setIterations(iterations).setLambda(lambda)
model.fit(input)
val (p, q) = model.factorsOption.getp.writeAsText(pOut)q.writeAsText(qOut)
• Read the (customer, product) pairs• Write P and Q to file
![Page 10: Implementing BigPetStore with Apache Flink](https://reader034.vdocuments.mx/reader034/viewer/2022051101/589b9eb51a28abd63e8b5c25/html5/thumbnails/10.jpg)
Recommendation with the DataStream API
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.socketTextStream(”localhost”, 9999).map(new GetUserVector()).broadcast().map(new PartialTopK()).keyBy(0).flatMap(new GlobalTopK()).print();
• Get the user’s row for a userID• Compute the distributed TopK of the
user’s row Q
![Page 11: Implementing BigPetStore with Apache Flink](https://reader034.vdocuments.mx/reader034/viewer/2022051101/589b9eb51a28abd63e8b5c25/html5/thumbnails/11.jpg)
Summary
• Go beyond WordCount with BigPetStore• Feel free to mix the DataSet, DataStream,
FlinkML, Table APIs in your Flink workflows• Data generation, cleaning, ETL, Machine
learning, streaming prediction on top of one engine with under 500 lines of code
• Java and Scala APIs work well together• A Flink pet project is always fun. No pun intended.
![Page 12: Implementing BigPetStore with Apache Flink](https://reader034.vdocuments.mx/reader034/viewer/2022051101/589b9eb51a28abd63e8b5c25/html5/thumbnails/12.jpg)
Big thanks to
• The BigPetStore folks:Suneel MarthiRonald J. NowlingJay Vyas
• Squirrels helping with the code:Gyula FóraGábor GevayGábor HermannFabian HueskeAljoscha Krettek
• And to the whole Flink community
![Page 13: Implementing BigPetStore with Apache Flink](https://reader034.vdocuments.mx/reader034/viewer/2022051101/589b9eb51a28abd63e8b5c25/html5/thumbnails/13.jpg)
Check out the code
https://github.com/mbalassi/bigpetstore-flink
Márton [email protected] / @MartonBalassi
Hungarian Academy of Sciences