Treasure Data Summer Internship Final Report
TRANSCRIPT
Summer Internship Final Report
Naoki Ishikawa (@NeokiStones)
2015/09/30 13:30-
Who am I
2
• Naoki Ishikawa
• Waseda University, Information Science M1
• Research: Evolutionary Computation / Reinforcement Learning
• Laboratory: Sugawara Lab
• Laboratory theme: Artificial Intelligence
• Implemented Algorithm
• Factorization Machine
• Latent Dirichlet Allocation
3
Table of contents
• Implemented Algorithm
• Factorization Machine
• Latent Dirichlet Allocation
4
Table of contents
Factorization Machine
5
• Algorithm for Recommendation
• Classification (Clustering)
• Regression
• Supervised Learning
• Needs Input/Output Data
• Suitable for Sparse Data
Application
Application
7
• Prediction of Movie Rating
• Task: predict a movie rating (a real number)
• Regression
  - Input: self-designed matrix
  - Output: rating vector
8
[Figure: input matrix → output rating vector]
Prediction of Movie Rating
Input Details
9
• Identifier
  - User Identifier: [0, 0, …, 0, 1, 0, …, 0]
  - Movie Identifier: [0, 0, …, 0, 0, 1, 0, …, 0]
• Designed Feature
  - Rating of other movies
  - Time
  - Last movie rated
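The identifier scheme above amounts to one-hot encoding concatenated with hand-designed features. A minimal sketch (function and feature names here are hypothetical, not from the Hivemall code):

```python
# Hypothetical sketch of one FM input row: one-hot user and movie identifiers
# concatenated with hand-designed features.
def encode_row(user_id, movie_id, n_users, n_movies, designed):
    """Build one sparse FM input row as a dense list (illustration only)."""
    user_vec = [1.0 if i == user_id else 0.0 for i in range(n_users)]
    movie_vec = [1.0 if i == movie_id else 0.0 for i in range(n_movies)]
    return user_vec + movie_vec + list(designed)

# Example: user 2 of 4, movie 1 of 3, plus two designed features.
row = encode_row(2, 1, 4, 3, [0.5, 3.0])
print(row)  # [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.5, 3.0]
```

In real data most entries are zero, which is why FM is described above as suitable for sparse data.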
10
Recommendation Algorithm
• Collaborative Filtering
• Association Analysis
• Bayesian Network
Prediction of Movie Rating
11
• Hivemall
• Matrix Factorization
• Recommendation
12
Difference from Matrix Factorization
• Data Structure
• Matrix Factorization
• User-Item Matrix
http://ampcamp.berkeley.edu/big-data-mini-course/img/matrix_factorization.png
[Figure: user-item matrix (input) and learned parameters]
13
Difference from Matrix Factorization
• Factorization Machine
[Figure: FM input matrix with learned parameters w_k and latent vectors V (k factors)]
14
• Factorization Machine
• Consider
• context data
• Interaction between variables
Advantage of Factorization Machine
15
Difference from Matrix Factorization
Prediction by Factorization Machine (d=2)
16
Difference from Matrix Factorization
Prediction by Factorization Machine (d = 2)

  ŷ(x) = w0 + Σ_k w_k x_k + Σ_k Σ_{j>k} ⟨v_k, v_j⟩ x_k x_j

• w0: global bias (mean)
• w_k: regression coefficient of the k-th variable
• ⟨v_k, v_j⟩: factorized interaction weight (W_kj ≈ ⟨v_k, v_j⟩)
17
Difference from Matrix Factorization
Prediction by Factorization Machine (d=2)
Learning Method: Stochastic Gradient Descent (SGD)
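As an illustration of this slide (a minimal sketch under squared loss, not the Hivemall implementation), the d = 2 prediction and one SGD step can look like:

```python
# Minimal FM (d = 2) sketch with squared-loss SGD. w0: global bias,
# w: per-variable weights, V[k]: K-dimensional latent vector of variable k.
def fm_predict(x, w0, w, V):
    n, K = len(x), len(V[0])
    linear = sum(w[k] * x[k] for k in range(n))
    # pairwise term via the O(n*K) identity:
    # sum_{j>k} <v_k,v_j> x_k x_j = 0.5 * sum_f ((sum_k v_kf x_k)^2 - sum_k (v_kf x_k)^2)
    pair = 0.0
    for f in range(K):
        s = sum(V[k][f] * x[k] for k in range(n))
        sq = sum((V[k][f] * x[k]) ** 2 for k in range(n))
        pair += 0.5 * (s * s - sq)
    return w0 + linear + pair

def sgd_step(x, y, w0, w, V, eta=0.01):
    n, K = len(x), len(V[0])
    err = fm_predict(x, w0, w, V) - y           # d(squared loss)/d(y_hat)
    w0 -= eta * err
    sums = [sum(V[k][f] * x[k] for k in range(n)) for f in range(K)]
    for k in range(n):
        if x[k] == 0.0:
            continue                            # sparse data: zero features skip
        w[k] -= eta * err * x[k]
        for f in range(K):
            grad = x[k] * sums[f] - V[k][f] * x[k] ** 2
            V[k][f] -= eta * err * grad
    return w0
```

Each step only touches the nonzero entries of x, which is what makes SGD on sparse FM data cheap.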
18
Local Implementation
19
Difference from Matrix Factorization
• d-way
• FM / MF
• assume K latent attributes
• Matrix Factorization: d = 2
• Factorization Machine: d ≧2
20
HyperParameter
• K: the number of hidden factors
• η: the regularization parameter
21
Implemented Model
• Implemented Model
• d = 2
• MapModel
• ArrayModel
22
Implemented Model
• MapModel
• For unknown data
• Flexible
• Suitable for Online Learning
23
Implemented Model
• ArrayModel
• For known data
• Less overhead
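The trade-off between the two models can be sketched as follows (class and field names are illustrative, not the Hivemall code): the MapModel keys parameters by feature, so unseen features can appear at any time, while the ArrayModel pre-allocates a fixed array for a known feature space.

```python
# Hypothetical illustration of the two parameter stores.
class MapModel:
    def __init__(self):
        self.w = {}                    # feature -> weight, grows as data arrives

    def get(self, feature):
        return self.w.setdefault(feature, 0.0)   # unseen features start at 0

class ArrayModel:
    def __init__(self, n_features):
        self.w = [0.0] * n_features    # fixed layout, less overhead per lookup

    def get(self, index):
        return self.w[index]

m = MapModel()
m.get("user:42")    # flexible: "user:42" was never declared up front
a = ArrayModel(1000)
a.get(42)           # fast: plain array indexing
```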
24
Other Use Case
• E-Commerce User-Item Recommendation
• Input Data
• Age
• Purchase timezone
• Past bought items
• Cluster ID
• Target Data
• Evaluation of an Item by User
• Implemented Algorithm
• Factorization Machine
• Latent Dirichlet Allocation
25
Table of contents
Latent Dirichlet Allocation
26
• The most popular topic-model algorithm
• Mostly applied to text data
• Finds hidden structure in data
• Unsupervised Learning
• Needs input data only
• Generative Model
Latent Dirichlet Allocation
27
• Generative Modelling in LDA
• Mimics how a document is generated
• 1. Choose what to write about (a topic)
• 2. Choose a word from that topic
• 3. Write
Latent Dirichlet Allocation
28
• Input
• Text data (Documents)
• Output
• Topic-word distribution
• Document-Topic distribution
Latent Dirichlet Allocation
29
https://www.vappingo.com/word-blog/wp-content/uploads/2011/01/paper2.jpg
https://wellecks.wordpress.com/2014/10/26/ldaoverflow-with-online-lda/
Learning Method
30
• Define Generative model
• For documents
• Learn parameters to reproduce the document
Learning Method
31
[Figure: K topics as word distributions]
Learning Method
32
http://heartruptcy.blog.fc2.com/blog-entry-124.html
Graphical Model (Code)
33
For Topic k = {1, …, K}:
    WordDistribution[k] ~ Dir(β)
For Document d = {1, …, D}:
    TopicDistribution[d] ~ Dir(α)
    For Word n = {1, …, numOfWord[d]}:
        WordTopic[d][n] ~ TopicDistribution[d]
        Word[d][n] ~ WordDistribution[WordTopic[d][n]]
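The generative story above can be run directly. A small sketch using `random.gammavariate` for the Dirichlet draws (the sizes K, D, V and the α, β values are arbitrary toy choices, not anything from the slides):

```python
import random

random.seed(0)

def dirichlet(alpha, dim):
    # a Dirichlet draw: normalized independent Gamma(alpha, 1) samples
    xs = [random.gammavariate(alpha, 1.0) for _ in range(dim)]
    s = sum(xs)
    return [x / s for x in xs]

def sample(dist):
    # draw an index from a discrete distribution
    r, acc = random.random(), 0.0
    for i, p in enumerate(dist):
        acc += p
        if r < acc:
            return i
    return len(dist) - 1

K, D, V = 3, 2, 10                                  # topics, docs, vocabulary
word_dist = [dirichlet(0.1, V) for _ in range(K)]   # WordDistribution[k] ~ Dir(beta)
docs = []
for d in range(D):
    topic_dist = dirichlet(0.5, K)                  # TopicDistribution[d] ~ Dir(alpha)
    words = []
    for n in range(5):                              # numOfWord[d] = 5 here
        z = sample(topic_dist)                      # WordTopic[d][n]
        words.append(sample(word_dist[z]))          # Word[d][n]
    docs.append(words)
```

Learning runs this story in reverse: given only `docs`, recover `word_dist` and each `topic_dist`.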
Learning Method
34
• Variational Bayes
• Gibbs Sampling (MCMC)
• Particle Filtering
Learning Method
35
• Variational Bayes (faster than Gibbs Sampling)
• Gibbs Sampling (MCMC)
• Particle Filtering
Mini-batch Online LDA
36
• Faster than Batch Algorithm
• Less noise than pure Online LDA
[Figure: batch-size spectrum from Pure Online to Mini-batch Online to Batch]
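The blend between the pure-online and batch extremes can be sketched as a step-size interpolation in the style of online variational LDA (names and default values here are illustrative assumptions, not Hivemall's code):

```python
# Sketch of a mini-batch online update: blend the estimate computed from a
# mini-batch (lam_hat) into the running topic-word parameter lam with a
# decreasing step size rho. kappa and tau0 are assumed hyperparameter names.
def minibatch_update(lam, lam_hat, t, kappa=0.7, tau0=1.0):
    rho = (tau0 + t) ** (-kappa)   # rho -> 0, so later batches move lam less
    return [(1.0 - rho) * l + rho * lh for l, lh in zip(lam, lam_hat)]
```

A batch size of 1 recovers the noisy pure-online behaviour; a batch the size of the whole corpus approaches the batch algorithm.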
37
Implemented Model
• Mini-Batch Map Model
• For unknown data
• Don’t assume Vocabulary List
• Mini-Batch Array Model (Other implementation)
• For known data
• Assume Vocabulary List
39
Faced Implementation Problem
40
Faced Implementation Problem
• Meaningless words
• LDA clusters words by co-occurrence
• “a”, “the”, “I”, “He”, “is”, “in”, “on”
• Stop words: ignore them
• TF-IDF: “how important a word is to a document in a collection or dataset”
• TF-IDF
• can be calculated by Hivemall
• Input Data: (DocId, Words)
• https://github.com/myui/hivemall/wiki/TFIDF-calculation
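The weighting itself is simple to sketch. Below is a common tf × log(N/df) variant for illustration; Hivemall's exact formula may differ:

```python
import math
from collections import Counter

# Toy TF-IDF: tf(w, d) * log(N / df(w)). A common variant for illustration;
# not necessarily the exact formula Hivemall computes.
def tfidf(docs):
    N = len(docs)
    df = Counter(w for d in docs for w in set(d))   # document frequency
    out = []
    for d in docs:
        tf = Counter(d)
        out.append({w: (c / len(d)) * math.log(N / df[w]) for w, c in tf.items()})
    return out

docs = [["justice", "law", "law"], ["cat", "law"], ["cat", "dog"]]
weights = tfidf(docs)
# "law" occurs in 2 of 3 docs (low idf); "justice" occurs in only 1 (high idf)
```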
41
Faced Implementation Problem
• 1 ["justice:0.1641245850805637","found:0.06564983513276658","discussion:0.06564983513276658","law:0.06564983513276658","based:0.06564983513276658","religion:0.06564983513276658","viewpoints:0.03282491756638329","rationality:0.03282491756638329","including:0.03282491756638329","context:0.03282491756638329","concept:0.03282491756638329","rightness:0.03282491756638329","general:0.03282491756638329","many:0.03282491756638329","differing:0.03282491756638329","fairness:0.03282491756638329","social:0.03282491756638329","broadest:0.03282491756638329","equity:0.03282491756638329","includes:0.03282491756638329","theology:0.03282491756638329"]
42
Faced Implementation Problem
• TF-IDF
• Vocabulary List Model
• Initializes lambda for all words at first
• If a word does not appear in the doc, its lambda decreases at the same rate
• → No initialization problem
43
Faced Implementation Problem
• Online Map Model
• Initializes lambda when a new word is fetched
• The final lambda depends on when the word first appeared
• → Initialization problem
44
Faced Implementation Problem
• Prepared dummy lambdas
• Initialize dummy lambdas at first
• Apply the lambda update rule to the dummy lambdas as well
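A hedged sketch of the dummy-lambda idea as I read it from the slides (class and method names are illustrative, not from the Hivemall code): a dummy row receives every decay update, and a newly fetched word is initialized from the dummy, so its lambda matches what a pre-initialized vocabulary-list entry would have decayed to.

```python
# Illustrative sketch only: the dummy stands in for every not-yet-seen word.
class OnlineLambdaStore:
    def __init__(self, init=1.0):
        self.lam = {}
        self.dummy = init            # dummy lambda for unseen words

    def decay(self, rho):
        # words absent from the mini-batch decay at the same rate
        for w in self.lam:
            self.lam[w] *= (1.0 - rho)
        self.dummy *= (1.0 - rho)    # the dummy decays with them

    def fetch(self, word):
        # new word: start from the dummy, not from the fresh init value
        return self.lam.setdefault(word, self.dummy)

store = OnlineLambdaStore()
store.decay(0.5)
store.decay(0.5)
late = store.fetch("new_word")   # 1.0 * 0.5 * 0.5 = 0.25, not 1.0
```

This removes the dependence of the final lambda on when a word first appeared.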
45
Faced Implementation Problem
• Implicit Φ Normalization
• Not written explicitly
48
Faced Implementation Problem
49
Faced Implementation Problem
• Difficult Debugging
• Circular reference
[Figure: dependence cycle among Φ, γ, and β]
• Data: 20News
• Topics: 6
• Iterations: 10
50
Result: Online LDA
• Topic:1
• No.0 writes[6]: 0.007909349
• No.1 article[7]: 0.006535292
• No.2 apr[3]: 0.0034389505
• No.3 team[4]: 0.00340712
• No.4 game[4]: 0.0033219245
• No.5 year[4]: 0.0032751847
• No.6 good[4]: 0.0032546786
• No.7 time[4]: 0.0030503264
• No.8 play[4]: 0.00262638
• No.9 games[5]: 0.002433915
• No.10 season[6]: 0.0022433712
• No.11 ll[2]: 0.0020719478
• No.12 players[7]: 0.0020332362
• No.13 win[3]: 0.0019284738
• No.14 hockey[6]: 0.0018870989
51
Result: Online LDA
• No.15 league[6]: 0.0018450991
• No.16 baseball[8]: 0.0018226414
• No.17 years[5]: 0.0017960512
• No.18 mail[4]: 0.0017936684
• No.19 people[6]: 0.0017642054
• No.20 teams[5]: 0.0016675185
• No.21 great[5]: 0.001642102
• No.22 ve[2]: 0.0015846819
• No.23 point[5]: 0.0015730233
• No.24 cs[2]: 0.0015609838
• No.25 didn[4]: 0.0015398773
• No.26 lot[3]: 0.0015123658
• No.27 mike[4]: 0.0014935194
• No.28 university[10]: 0.0014718652
• No.29 player[6]: 0.0014655796
Sports
• Topic:3
• No.0 writes[6]: 0.0065424195
• No.1 article[7]: 0.005621346
• No.2 apr[3]: 0.002746017
• No.3 work[4]: 0.002731466
• No.4 good[4]: 0.00266331
• No.5 ve[2]: 0.0025969497
• No.6 time[4]: 0.0025880735
• No.7 system[6]: 0.0024449623
• No.8 problem[7]: 0.002349667
• No.9 mail[4]: 0.0023234019
• No.10 windows[7]: 0.0021310966
• No.11 people[6]: 0.0018598152
• No.12 find[4]: 0.0018072439
• No.13 computer[8]: 0.0017470584
• No.14 email[5]: 0.0017204053
53
Result: Online LDA
• No.15 drive[5]: 0.0017121765
• No.16 bit[3]: 0.0016401116
• No.17 program[7]: 0.001636191
• No.18 software[8]: 0.0016341405
• No.19 university[10]: 0.0015907411
• No.20 ll[2]: 0.0015530549
• No.21 thing[5]: 0.0015159848
• No.22 card[4]: 0.0013826761
• No.23 doesn[5]: 0.0013809163
• No.24 phone[5]: 0.0013786326
• No.25 question[8]: 0.0013721529
• No.26 internet[8]: 0.001368883
• No.27 file[4]: 0.0013417117
• No.28 things[6]: 0.0013097903
• No.29 set[3]: 0.0013029057
Computer
Impression about Internship
55
• Machine Learning
• Implementing ML algorithms from scratch was fun
• Contributing to OSS was a precious experience for me
Unfinished Business
56
• Documentation
• Write entries for FM / Online LDA
• UDTF
• Build the functions into Hivemall
57
• Thank you for Listening