Why did DMM.com Labo adopt Spark?...
TRANSCRIPT
DMM.COM SPARK
2015/4 - DMM labo
API
AGENDA
DMM
Apache Spark
DMM
Tips
DMM
DMM
SPARK
UC Berkeley Apache
Scala, Python, Java, SQL, R API
(2014/09) Mahout
Spark: Java, Scala, Python
GraphLab
WHY SPARK
MLlib, GraphX
Hadoop
Hadoop
item to item
user to item
popular
1. (Tracking API)
2. (Hive on Spark)
3. (Spark)
4. (Sqoop)
5. API(Play)
(TRACKING API)
JavaScript
API
RDB Hadoop
(HIVE ON SPARK)
Spark
(SPARK - ITEM2ITEM)
// Self-join on user, then score each ordered pair of distinct items
val itemToItems = userProducts.join(userProducts)
  .filter { case (user, ((item1, keyword1, score1), (item2, keyword2, score2))) => item1 != item2 }
  .map { case (user, ((item1, keyword1, score1), (item2, keyword2, score2))) =>
    ((item1, keyword1, item2), score1 * score2)
  }
  // Sum the pair scores over all users, then take the square root
  .reduceByKey(_ + _)
  .mapValues(math.sqrt(_))
  .map { case ((item1, keyword1, item2), score) => ((item1, keyword1), (item2, score)) }
  // Keep the top numDisplayItems neighbours per (item, keyword), dropping short lists
  .groupByKey()
  .mapValues(_.toList.sortBy(_._2).reverse.take(config.numDisplayItems))
  .filter { case ((item1, keyword1), items) => items.size >= config.numDisplayItems }
  .cache()
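To make the data flow of the pipeline above easier to follow, here is a minimal sketch of the same steps on plain Scala collections instead of an RDD. The names `ItemToItemSketch`, `Row`, and `sampleRows` are hypothetical and only serve the illustration; the row shape `(user, (item, keyword, score))` is inferred from the pattern matches in the slide.

```scala
object ItemToItemSketch {
  // (user, (item, keyword, score)); assumed shape of userProducts in the slide
  type Row = (String, (String, String, Double))

  def itemToItems(rows: Seq[Row], numDisplayItems: Int): Map[(String, String), List[(String, Double)]] = {
    // Equivalent of the self-join on user: all ordered pairs of distinct items per user
    val pairs = for {
      (_, userRows) <- rows.groupBy(_._1).toSeq
      (_, (item1, keyword1, score1)) <- userRows
      (_, (item2, _, score2)) <- userRows
      if item1 != item2
    } yield ((item1, keyword1, item2), score1 * score2)

    pairs
      .groupBy(_._1)
      // reduceByKey(_ + _) followed by mapValues(math.sqrt(_))
      .map { case (key, vs) => key -> math.sqrt(vs.map(_._2).sum) }
      .toSeq
      .map { case ((item1, keyword1, item2), score) => ((item1, keyword1), (item2, score)) }
      .groupBy(_._1)
      // Highest score first, keep the top numDisplayItems
      .map { case (key, vs) =>
        key -> vs.map(_._2).toList.sortBy { case (_, score) => -score }.take(numDisplayItems)
      }
      // Drop items that cannot fill the whole display slot
      .filter { case (_, items) => items.size >= numDisplayItems }
  }

  // Hypothetical sample data: two users, three items, one keyword
  val sampleRows: Seq[Row] = Seq(
    ("u1", ("a", "k", 1.0)), ("u1", ("b", "k", 2.0)),
    ("u2", ("a", "k", 1.0)), ("u2", ("b", "k", 2.0)), ("u2", ("c", "k", 3.0))
  )
}
```

With `numDisplayItems = 1`, item `a` is paired with `b`, since both users touched that pair. With `numDisplayItems = 3` the result is empty, because no item has three neighbours and the final filter drops incomplete lists.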
(SPARK - USER2ITEM)
MLlib ALS( )
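import org.apache.spark.mllib.recommendation.ALS

// Train the ALS model on the ratings
val model = ALS.train(ratings.map(_._1), config.alsRank, config.alsNumIterations, config.alsLambda)
// Predict ratings for the candidate (user, item) pairs and keep the top items per user
val predictions = model.predict(candidates)
  .groupBy(_.user)
  .map { case (user, ratings) =>
    (user, ratings.toList.sortBy(_.rating).reverse.take(config.numDisplayItems))
  }
  .cache()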
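The post-processing applied to the ALS predictions above (group by user, sort by predicted rating, keep the top N) can be sketched without a cluster. This is only an illustration on plain Scala collections: `TopNPerUserSketch`, `topN`, and `samplePredictions` are hypothetical names, and the local `Rating` case class is a stand-in with the same three fields as MLlib's `Rating`.

```scala
object TopNPerUserSketch {
  // Stand-in for org.apache.spark.mllib.recommendation.Rating (user, product, rating)
  final case class Rating(user: Int, product: Int, rating: Double)

  // Group predictions by user and keep the n highest-rated products for each
  def topN(predictions: Seq[Rating], n: Int): Map[Int, List[Rating]] =
    predictions.groupBy(_.user).map { case (user, rs) =>
      user -> rs.toList.sortBy(r => -r.rating).take(n)
    }

  // Hypothetical predicted ratings
  val samplePredictions: Seq[Rating] = Seq(
    Rating(1, 10, 0.2), Rating(1, 11, 0.9), Rating(1, 12, 0.5),
    Rating(2, 10, 0.7)
  )
}
```

Unlike the item-to-item job, this step does not drop users with fewer than N predictions; user 2 in the sample simply gets a shorter list.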
(SPARK)
RDB Hadoop
Sqoop MariaDB
API
item2item(id: ItemId): List[ItemId]
user2item(id: UserId): List[ItemId]
popular: List[ItemId]
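The three API signatures above can be read as a small Scala interface. The sketch below is one possible shape, not the actual DMM implementation: `RecommendApi`, `InMemoryRecommendApi`, and the string-backed `ItemId` and `UserId` wrappers are assumptions made for the example, as is the choice to return an empty list for unknown ids.

```scala
object RecommendApiSketch {
  // Hypothetical id wrappers; the slide does not show how ItemId and UserId are defined
  final case class ItemId(value: String)
  final case class UserId(value: String)

  // The three endpoints listed on the slide
  trait RecommendApi {
    def item2item(id: ItemId): List[ItemId]
    def user2item(id: UserId): List[ItemId]
    def popular: List[ItemId]
  }

  // In-memory implementation backed by precomputed results.
  // Assumption: ids without precomputed results yield an empty list.
  final class InMemoryRecommendApi(
      itemToItems: Map[ItemId, List[ItemId]],
      userToItems: Map[UserId, List[ItemId]],
      val popular: List[ItemId]
  ) extends RecommendApi {
    def item2item(id: ItemId): List[ItemId] = itemToItems.getOrElse(id, Nil)
    def user2item(id: UserId): List[ItemId] = userToItems.getOrElse(id, Nil)
  }

  // Hypothetical sample instance
  val sampleApi: RecommendApi = new InMemoryRecommendApi(
    itemToItems = Map(ItemId("a") -> List(ItemId("b"), ItemId("c"))),
    userToItems = Map(UserId("u1") -> List(ItemId("c"))),
    popular = List(ItemId("b"))
  )
}
```

In the pipeline described in the slides, the lookup tables would be the batch results exported to the database rather than in-memory maps.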
DEPLOY AND EXECUTE
Jenkins + Build Pipeline + BuildFlow
(2015/09)
Jenkins + Build Pipeline + BuildFlow
Job Script + Git
Hive
Spark
Sqoop
Recommend API(Node.js)
MariaDB(Galera Cluster)
Jenkins + Build Pipeline + BuildFlow
Job Script + Management API
Hive on Spark
Spark
Sqoop
Recommend API(Play)
MariaDB(Galera Cluster)
Management API
File
Hive on Spark
Hive 3
Play
Spark, Hive UDF Util
AB PDCA
[ ]
701
75% ↑ 97% ↑
TIPS
use DataFrames or Datasets
Hive
executor
memoryOverhead
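As a rough worked example of why the executor memoryOverhead setting matters, the sketch below computes the container size requested per executor. It assumes Spark 1.x running on YARN, where the documented default for `spark.yarn.executor.memoryOverhead` is 10% of executor memory with a 384 MB minimum; the slide does not state the cluster manager, and `MemoryOverheadSketch` and its members are hypothetical names.

```scala
object MemoryOverheadSketch {
  // Assumed defaults from the Spark 1.x "Running on YARN" documentation
  val MinOverheadMb = 384
  val OverheadFactor = 0.10

  // Default off-heap overhead for a given executor heap size, in MB
  def defaultOverheadMb(executorMemoryMb: Int): Int =
    math.max((executorMemoryMb * OverheadFactor).toInt, MinOverheadMb)

  // Total memory requested per executor container: heap plus overhead
  def containerMb(executorMemoryMb: Int, overheadMb: Option[Int] = None): Int =
    executorMemoryMb + overheadMb.getOrElse(defaultOverheadMb(executorMemoryMb))
}
```

For an 8192 MB executor the default overhead is 819 MB, so the container request is 9011 MB. Setting the overhead explicitly to 2048 MB raises it to 10240 MB, which has to fit within the maximum container size the cluster allows.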
cheat sheet
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
HIVE
Spark
HiveContext
Hive on Spark
DATAFRAMES DATASETS
(1.3 - ) DataFrames
(1.6 - ) Datasets
Project Tungsten (1.5 - )
Realtime Recommend
DataFrames & Datasets
GraphFrames