computing recommendations at extreme scale with apache flink @buzzwords 2015
TRANSCRIPT
Till Rohrmann Flink committer / data Artisans
[email protected] @stsffap
Computing Recommendations at Extreme Scale with Apache Flink
Collaborative Filtering § Recommend items based on users with
similar preferences § Latent factor models capture underlying
characteristics of items and preferences of user
§ Predicted preference:
4
r̂u,i = xuT yi
Rating Matrix § Explicit or implicit ratings § Prediction goal: Rating for unseen items
5
Items
Users10 5 ?? 2 107 ? 5
!
"
###
$
%
&&& Princess
Matrix Factorization § Calculate low rank approximation to obtain latent factors
6
minX,Y ru,i − xuT yi( )
2+λ nu xu
2+ ni yi
2
i∑
u∑#
$%
&
'(
ru,i≠0∑
Items
Users10 5 ?? 2 107 ? 5
!
"
###
$
%
&&&≈ X
!
"
###
$
%
&&&• Y( )Princess
§ = rating of user u for item i § = latent factors of user u § = latent factors of item i
§ = regularization constant § = number of rated items for user u § = number of ratings for item i
xuyi
λnuni
ru,i
Alternating Least Squares § Hard to optimize since we have two
variables § Fixing one variables gives quadratic
problem
7
X,Y
xu = YSuY T +λnuΙ( )−1Yru
T
Siiu =
1 if ru,i ≠ 00 else
"#$
%$
We only need the item vectors rated by user u
ALS Algorithm § Update user matrix
8
Items
Users10 5 ?? 2 107 ? 5
!
"
###
$
%
&&&≈ X
!
"
###
$
%
&&&• Y( )
Keep fixed
Calculate update
ALS Algorithm contd. § Update item matrix
9
Items
Users10 5 ?? 2 107 ? 5
!
"
###
$
%
&&&≈ X
!
"
###
$
%
&&&• Y( )
Keep fixed
Calculate update
ALS Algorithm contd. § Repeat update step until convergence
10
Items
Users10 5 ?? 2 107 ? 5
!
"
###
$
%
&&&≈ X
!
"
###
$
%
&&&• Y( )
Keep fixed
Calculate update
What is Apache Flink? Apache Flink deep-‐dive by Stephan Ewen Tomorrow, 12:20 – 13:00 on Stage 3
12
Gel
ly
Tabl
e
Flin
kML
SAM
OA
DataSet (Java/Scala/Python) DataStream (Java/Scala)
Had
oop
M/R
Local Remote Yarn Tez Embedded
Dat
aflow
Dat
aflow
(WiP
)
MRQ
L
Tabl
e
Casc
adin
g (W
iP)
Streaming dataflow runtime
Why Using Flink for ALS? § Expressive API § Pipelined stream processor § Closed loop iterations § Operations on managed memory
13
Expressive APIs § DataSet: Abstraction for distributed data § Computation specified as sequence of lazily
evaluated transformations
14
case class Word(word: String, frequency: Int) val lines: DataSet[String] = env.readTextFile(…) lines.flatMap(line => line.split(“ “).map(word => Word(word, 1))
.groupBy(“word”).sum(“frequency”) .print()
Program Execution
15
case class Path (from: Long, to: Long) val tc = edges.iterate(10) { paths: DataSet[Path] => val next = paths .join(edges) .where("to") .equalTo("from") { (path, edge) => Path(path.from, edge.to) } .union(paths) .distinct() next }
Optimizer
Type extraction stack
Task scheduling
Dataflow metadata
Pre-flight (Client)
Master Workers
Data Source orders.tbl
Filter
Map DataSource
lineitem.tbl
Join Hybrid Hash
buildHT
probe
hash-part [0] hash-part [0]
GroupRed sort
forward
Program
Dataflow Graph
deploy operators
track intermediate
results
Iterate by looping
§ for/while loop in client submits one job per iteration step
§ Data reuse by caching in memory and/or disk
Step Step Step Step Step
Client
17
Naïve Implementation 1. Join item vectors
with ratings 2. Group on user ID 3. Compute new
user vectors
22
xu = YSuY T +λnuΙ( )−1Yru
T
Pros and Cons of Naïve ALS § Pros • Easy to implement
§ Cons • Item vectors are sent redundantly to network
nodes • Two shuffle steps make execution expensive
23
Blocked ALS Implementation 1. Create user and item
rating blocks 2. Cache them on
worker nodes 3. Send all item vectors
needed by user rating block bundled
4. Compute block of item vectors
24 Based on Spark’s MLlib implementaBon
Pros and Cons of Blocked ALS § Pros • Reduces network load by avoiding data
duplication • Caching ratings: Only one shuffle step needed
§ Cons • Duplicates the rating matrix (user block/item
block partitioning)
25
Performance Comparison
26
• 40 node GCE cluster, highmem-‐8
• 10 ALS iteraBons with 50 latent factors
• RaBng matrix has 28 billion non zero entries: Scale of NeAlix or SpoCfy
Machine Learning with FlinkML § FlinkML contains
blocked ALS § Support for many
other tasks • Clustering • Regression • Classification
§ scikit-learn like pipeline support
27
val als = ALS() val ratingDS = env.readCsvFile[(Int, Int, Double)](ratingData) val parameters = ParameterMap() .add(ALS.Iterations, 10) .add(ALS.NumFactors, 50) .add(ALS.Lambda, 1.5) als.fit(ratingDS, parameters) val testingDS = env.readCsvFile[(Int, Int)](testingData) val predictions = als.predict(testingDS)
What Have You Seen? § How to use collaborative filtering to make
recommendations § Apache Flink, a powerful parallel stream
processing engine § How to use Apache Flink and alternating least
squares to factorize really large matrices
29