Treasure Data Summer Internship Final Report

Summer Internship Final Report Naoki Ishikawa (@NeokiStones) 2015/09/30 13:30-

Upload: naoki-ishikawa

Post on 15-Apr-2017


TRANSCRIPT

Page 1: Treasure Data Summer Internship Final Report

Summer Internship Final Report

Naoki Ishikawa (@NeokiStones)

2015/09/30 13:30-

Page 2: Treasure Data Summer Internship Final Report

Who am I

• Naoki Ishikawa

• Waseda University, Information Science, M1 (first-year master's student)

• Research: Evolutionary Computation / Reinforcement Learning

• Laboratory: Sugawara Lab

• Laboratory theme: Artificial Intelligence

Page 3: Treasure Data Summer Internship Final Report

Table of contents

• Implemented Algorithm

• Factorization Machine

• Latent Dirichlet Allocation


Page 5: Treasure Data Summer Internship Final Report

Factorization Machine

• Algorithm for Recommendation

  - Classification
  - Regression

• Supervised Learning

  - Needs both input and output data

• Suitable for Sparse Data

Page 6: Treasure Data Summer Internship Final Report

Application

Page 7: Treasure Data Summer Internship Final Report

Application

• Prediction of Movie Rating

  - Task: Predict a movie rating (real number)

• Regression

  - Input: Self-designed Matrix
  - Output: Rating Vector

Page 8: Treasure Data Summer Internship Final Report

Prediction of Movie Rating

[Figure: Input and Output]

Page 9: Treasure Data Summer Internship Final Report

INPUT Details

• Identifier

  - User Identifier: [0, 0, …, 0, 1, 0, …, 0]
  - Movie Identifier: [0, 0, …, 0, 0, 1, 0, …, 0]

• Designed Feature

  - Ratings of other movies
  - Time
  - Last movie rated
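As a rough illustration of the input layout above, one row can be built by concatenating a one-hot user identifier, a one-hot movie identifier, and the designed features. This is a minimal sketch: the function names `one_hot` and `build_input_row` and the example values are assumptions for illustration and do not come from the slides.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def build_input_row(user_idx, n_users, movie_idx, n_movies, designed_features):
    """Concatenate user one-hot, movie one-hot, and extra designed features."""
    return (
        one_hot(user_idx, n_users)
        + one_hot(movie_idx, n_movies)
        + list(designed_features)
    )


# Hypothetical example: user 1 of 3, movie 0 of 2, one designed feature (e.g. time).
row = build_input_row(1, 3, 0, 2, [0.5])
print(row)
```

Most entries of such a row are zero, which is why the sparse-data suitability of the Factorization Machine matters here.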

Page 10: Treasure Data Summer Internship Final Report

Recommendation Algorithm

• Collaborative Filtering

• Association Analysis

• Bayesian Network

Page 11: Treasure Data Summer Internship Final Report

Prediction of Movie Rating


• Hivemall

• Matrix Factorization

• Recommendation

Page 12: Treasure Data Summer Internship Final Report

Difference from Matrix Factorization

• Data Structure

• Matrix Factorization

  - User-Item Matrix

[Figure: Input and Learning Parameter for Matrix Factorization; source: http://ampcamp.berkeley.edu/big-data-mini-course/img/matrix_factorization.png]

Page 13: Treasure Data Summer Internship Final Report

Difference from Matrix Factorization

• Factorization Machine

[Figure: Input and Learning Parameters (V and W) for Factorization Machine]

Page 14: Treasure Data Summer Internship Final Report

Advantage of Factorization Machine

• Factorization Machine

• Considers

  - context data
  - interaction between variables


Page 16: Treasure Data Summer Internship Final Report

Difference from Matrix Factorization

Prediction by Factorization Machine (d=2)

• Global bias (mean)

• Interaction

• Factorization (Wkj)

• Regression coefficient of the k-th variable
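The equation image on this slide did not survive in the transcript. For reference, the degree-2 Factorization Machine model is usually written as below. This uses the standard notation from the FM literature, which may differ from the slide's own symbols: $w_0$ is the global bias, $w_i$ is the regression coefficient of the $i$-th variable, and the inner product of the $K$-dimensional factor vectors $\mathbf{v}_i$, $\mathbf{v}_j$ models the interaction between variables $i$ and $j$.

```latex
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
  + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j,
\qquad
\langle \mathbf{v}_i, \mathbf{v}_j \rangle = \sum_{f=1}^{K} v_{i,f} \, v_{j,f}
```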

Page 17: Treasure Data Summer Internship Final Report

Difference from Matrix Factorization

Prediction by Factorization Machine (d=2)

Learning Method: Stochastic Gradient Descent (SGD)
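To make the d=2 prediction and its SGD training concrete, here is a minimal pure-Python sketch. It assumes a squared-error loss and the standard degree-2 FM formulation; the function names, the sparse-dict input format, and the `lr` and `reg` parameters are illustrative assumptions, not the code written during the internship.

```python
def fm_predict(x, w0, w, v):
    """Degree-2 FM prediction.

    x: sparse input as {feature_index: value}
    w0: global bias, w: list of linear weights, v: list of K-dim factor lists
    """
    k = len(v[0])
    linear = sum(w[i] * xi for i, xi in x.items())
    interaction = 0.0
    for f in range(k):
        # Identity: sum_{i<j} v_if v_jf x_i x_j
        #         = 0.5 * ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)
        s = sum(v[i][f] * xi for i, xi in x.items())
        s2 = sum((v[i][f] * xi) ** 2 for i, xi in x.items())
        interaction += 0.5 * (s * s - s2)
    return w0 + linear + interaction


def fm_sgd_step(x, y, w0, w, v, lr=0.01, reg=0.0):
    """One SGD step on the loss 0.5 * (prediction - y)^2.

    Updates w and v in place and returns the updated w0.
    """
    k = len(v[0])
    err = fm_predict(x, w0, w, v) - y
    # Per-factor sums, computed once with the current parameters.
    sums = [sum(v[i][f] * xi for i, xi in x.items()) for f in range(k)]
    w0 -= lr * err
    for i, xi in x.items():
        w[i] -= lr * (err * xi + reg * w[i])
        for f in range(k):
            grad = xi * sums[f] - v[i][f] * xi * xi
            v[i][f] -= lr * (err * grad + reg * v[i][f])
    return w0
```

Only the features present in `x` are touched in each step, which keeps the update cheap for sparse inputs.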

Page 18: Treasure Data Summer Internship Final Report


Local Implementation

Page 19: Treasure Data Summer Internship Final Report

Difference from Matrix Factorization

• d-way interaction

• FM / MF

  - assume K latent attributes

• Matrix Factorization: d = 2

• Factorization Machine: d ≥ 2

Page 20: Treasure Data Summer Internship Final Report

Hyperparameters

• K: the number of hidden factors

• η: the regularization parameter

Page 21: Treasure Data Summer Internship Final Report


Implemented Model

• Implemented Model

• d = 2

• MapModel

• ArrayModel

Page 22: Treasure Data Summer Internship Final Report

Implemented Model

• MapModel

  - For unknown data (feature set not fixed in advance)
  - Flexible
  - Suitable for Online Learning

Page 23: Treasure Data Summer Internship Final Report

Implemented Model

• ArrayModel

  - For known data (feature set fixed in advance)
  - Less overhead
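The trade-off between the two models can be sketched as follows. The class names are borrowed from the slides, but the internals shown here are assumptions for illustration only; the slides do not show the actual implementation.

```python
class ArrayModel:
    """Weights in a fixed-size list; the number of features must be known up front."""

    def __init__(self, num_features):
        self.weights = [0.0] * num_features

    def get(self, index):
        return self.weights[index]

    def add(self, index, delta):
        # Direct indexing: little overhead, but an unseen index raises IndexError.
        self.weights[index] += delta


class MapModel:
    """Weights in a dict; unseen features are created lazily on first update."""

    def __init__(self):
        self.weights = {}

    def get(self, key):
        return self.weights.get(key, 0.0)

    def add(self, key, delta):
        # Hash lookup costs more than list indexing but accepts any new key.
        self.weights[key] = self.weights.get(key, 0.0) + delta
```

In an online setting new users or items can appear at any time, which is the case the map-backed variant handles without resizing.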

Page 24: Treasure Data Summer Internship Final Report

Other Use Case

• E-Commerce User-Item Recommendation

• Input Data

  - Age
  - Purchase timezone
  - Previously purchased items
  - Cluster ID

• Target Data

  - User's evaluation of an item

Page 25: Treasure Data Summer Internship Final Report

Table of contents

• Implemented Algorithm

• Factorization Machine

• Latent Dirichlet Allocation

Page 26: Treasure Data Summer Internship Final Report

Latent Dirichlet Allocation

• Most popular topic-model algorithm

  - Mostly applied to text data
  - Finds hidden structure in data

• Unsupervised Learning

  - Needs input data only

• Generative Model

Page 27: Treasure Data Summer Internship Final Report

Latent Dirichlet Allocation

• Generative Modelling in LDA

  - Mimics how a document is written

    1. Choose a topic to write about
    2. Choose a word from that topic
    3. Write the word

Page 28: Treasure Data Summer Internship Final Report

Latent Dirichlet Allocation

• Input

  - Text data (Documents)

• Output

  - Topic-word distribution
  - Document-topic distribution

Page 29: Treasure Data Summer Internship Final Report

Latent Dirichlet Allocation

[Figures; sources:]
https://www.vappingo.com/word-blog/wp-content/uploads/2011/01/paper2.jpg
https://wellecks.wordpress.com/2014/10/26/ldaoverflow-with-online-lda/

Page 30: Treasure Data Summer Internship Final Report

Learning Method

• Define a generative model

  - For documents

• Learn the parameters that reproduce the documents

Page 31: Treasure Data Summer Internship Final Report

Learning Method

[Figure: K topics]

Page 32: Treasure Data Summer Internship Final Report

Learning Method

[Figure; source: http://heartruptcy.blog.fc2.com/blog-entry-124.html]

Page 33: Treasure Data Summer Internship Final Report

Graphical Model(Code)

For Topic k = 1, …, K:
    WordDistribution[k] ~ Dir(β)

For Document d = 1, …, D:
    TopicDistribution[d] ~ Dir(α)

    For Word n = 1, …, numOfWord[d]:
        WordTopic[d][n] ~ Categorical(TopicDistribution[d])
        Word[d][n] ~ Categorical(WordDistribution[WordTopic[d][n]])
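The generative process above can be simulated directly. The following is a minimal standard-library sketch: the function names, the symmetric scalar priors `alpha` and `beta`, and the representation of words as integer ids are assumptions made for illustration. It only generates synthetic documents and does not perform inference.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_topics, vocab_size, doc_lengths, alpha, beta, seed=0):
    """Generate documents (lists of word ids) following the LDA generative model."""
    rng = random.Random(seed)
    # One word distribution per topic: WordDistribution[k] ~ Dir(beta)
    word_dist = [sample_dirichlet(beta, vocab_size, rng) for _ in range(num_topics)]
    docs = []
    for length in doc_lengths:
        # TopicDistribution[d] ~ Dir(alpha)
        topic_dist = sample_dirichlet(alpha, num_topics, rng)
        words = []
        for _ in range(length):
            topic = sample_categorical(topic_dist, rng)  # WordTopic[d][n]
            words.append(sample_categorical(word_dist[topic], rng))  # Word[d][n]
        docs.append(words)
    return docs


docs = generate_corpus(num_topics=3, vocab_size=10, doc_lengths=[5, 8],
                       alpha=0.5, beta=0.5, seed=1)
print(docs)
```

Learning LDA runs this process in reverse: given only the words, it estimates the topic and word distributions that most plausibly produced them.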


Page 35: Treasure Data Summer Internship Final Report

Learning Method

• Variational Bayes (faster than Gibbs Sampling)

• Gibbs Sampling (MCMC)

• Particle Filtering

Page 36: Treasure Data Summer Internship Final Report

Mini-batch Online LDA

• Faster than the batch algorithm

• Less noisy than pure online LDA

[Figure: Batch Size axis, from Pure Online through Mini-batch Online to Batch]
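One way to picture the mini-batch update is the blending rule used in online variational Bayes for LDA, where the topic-word parameter lambda is moved toward an estimate computed from the current mini-batch with a decaying step size. The slides do not state the exact rule used, so the sketch below is an assumption; the names `learning_rate` and `update_lambda` and the hyperparameters `tau0`, `kappa`, and `eta` are illustrative.

```python
def learning_rate(t, tau0=1.0, kappa=0.7):
    """Step size rho_t = (tau0 + t) ** (-kappa); it shrinks as more batches are seen."""
    return (tau0 + t) ** (-kappa)


def update_lambda(lam, batch_stats, t, num_docs, batch_size,
                  eta=0.01, tau0=1.0, kappa=0.7):
    """Blend the current lambda of one topic with a mini-batch estimate.

    lam: {word: lambda value} for one topic
    batch_stats: {word: sufficient statistic summed over the mini-batch}
    """
    rho = learning_rate(t, tau0, kappa)
    # Rescale the mini-batch as if it represented the whole corpus.
    scale = num_docs / batch_size
    return {
        word: (1.0 - rho) * value
        + rho * (eta + scale * batch_stats.get(word, 0.0))
        for word, value in lam.items()
    }
```

With a batch size of 1 this corresponds to the pure online case, and with the batch size equal to the number of documents it approaches the batch case; averaging over a mini-batch reduces the noise of each step.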

Page 37: Treasure Data Summer Internship Final Report

Implemented Model

• Mini-Batch Map Model

  - For unknown data
  - Does not assume a vocabulary list

• Mini-Batch Array Model (other implementation)

  - For known data
  - Assumes a vocabulary list



Page 40: Treasure Data Summer Internship Final Report

Faced Implementation Problem

• Meaningless words

  - LDA: clusters words by co-occurrence
  - “a”, “the”, “I”, “He”, “is”, “in”, “on”

• Stop words: ignore them

• TF-IDF: “how important a word is to a document in a collection or dataset”

Page 41: Treasure Data Summer Internship Final Report

Faced Implementation Problem

• TF-IDF

  - Can be calculated with Hivemall
  - Input Data: (DocId, Words)
  - https://github.com/myui/hivemall/wiki/TFIDF-calculation

Page 42: Treasure Data Summer Internship Final Report

Faced Implementation Problem

• TF-IDF

1 ["justice:0.1641245850805637","found:0.06564983513276658","discussion:0.06564983513276658","law:0.06564983513276658","based:0.06564983513276658","religion:0.06564983513276658","viewpoints:0.03282491756638329","rationality:0.03282491756638329","including:0.03282491756638329","context:0.03282491756638329","concept:0.03282491756638329","rightness:0.03282491756638329","general:0.03282491756638329","many:0.03282491756638329","differing:0.03282491756638329","fairness:0.03282491756638329","social:0.03282491756638329","broadest:0.03282491756638329","equity:0.03282491756638329","includes:0.03282491756638329","theology:0.03282491756638329"]
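For readers unfamiliar with TF-IDF, the following sketch shows one common variant computed from (DocId, Words) input. It is not the Hivemall query referenced in the slides; the function name `tf_idf`, the exact weighting formula (term frequency times `log10(N / df) + 1`), and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF scores.

    docs: {doc_id: list of word tokens}
    returns: {doc_id: {word: score}}
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for words in docs.values():
        df.update(set(words))

    scores = {}
    for doc_id, words in docs.items():
        counts = Counter(words)
        total = len(words)
        scores[doc_id] = {
            word: (count / total) * (math.log10(n_docs / df[word]) + 1.0)
            for word, count in counts.items()
        }
    return scores


# Hypothetical toy input.
example = {1: ["justice", "law", "justice"], 2: ["law", "game"]}
print(tf_idf(example))
```

Words that occur in nearly every document, such as stop words, get a low inverse-document-frequency factor, so filtering on the score removes much of the meaningless vocabulary before LDA sees it.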

Page 43: Treasure Data Summer Internship Final Report

Faced Implementation Problem

• Vocabulary List Model

  - Initializes lambda for all words at the start
  - If a word does not appear in the document, its lambda decreases at the same rate as the others
  - No initialization problem

Page 44: Treasure Data Summer Internship Final Report

Faced Implementation Problem

• Online Map Model

  - Initializes lambda when a new word is fetched
  - Final lambda depends on when the word first appeared
  - Initialization problem

Page 45: Treasure Data Summer Internship Final Report

Faced Implementation Problem

• Prepared Dummy Lambda

  - Initialize dummy lambdas at the start
  - Apply the lambda update rule to the dummy lambdas
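One possible reading of the dummy-lambda idea is sketched below: a single placeholder value is updated at every step as if it belonged to a word that never appears, and a newly fetched word starts from that placeholder instead of from a fresh initial value. This interpretation, the class name `MapLambda`, the deterministic `init_value`, and the simplified blending rule are all assumptions; the slides do not show the actual implementation.

```python
class MapLambda:
    """Hypothetical map-backed lambda store for a single topic."""

    def __init__(self, eta, init_value):
        self.eta = eta            # prior added to every update target
        self.values = {}          # word -> lambda, filled lazily
        self.dummy = init_value   # lambda of a word that has never appeared

    def update(self, rho, stats):
        """Apply one update step.

        rho: step size for this step
        stats: {word: scaled sufficient statistic} for words in the mini-batch
        """
        # A word seen for the first time inherits the dummy value, so it
        # behaves as if it had been initialized at step 0 and absent since.
        for word in stats:
            if word not in self.values:
                self.values[word] = self.dummy
        for word in self.values:
            target = self.eta + stats.get(word, 0.0)
            self.values[word] = (1.0 - rho) * self.values[word] + rho * target
        # The dummy follows the same rule as any absent word.
        self.dummy = (1.0 - rho) * self.dummy + rho * self.eta
```

Under these assumptions a word that first appears at step 3 ends up with the same lambda it would have had in a model that knew the vocabulary from the start, which removes the dependence on first-appearance time.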

Page 48: Treasure Data Summer Internship Final Report

Faced Implementation Problem

• Implicit Φ Normalization

  - Not written explicitly
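In variational inference for LDA the per-word topic responsibilities Φ are commonly specified only up to proportionality, so an implementation has to normalize them over topics itself. The sketch below shows that step under this assumption; the function name `normalized_phi` and its inputs (expected log topic weights of the document and expected log probabilities of the word under each topic) are illustrative and are not taken from the slides.

```python
import math


def normalized_phi(e_log_theta, e_log_beta_w):
    """Return phi over topics for one word in one document.

    e_log_theta: expected log topic weight of the document, one value per topic
    e_log_beta_w: expected log probability of the word under each topic
    """
    scores = [a + b for a, b in zip(e_log_theta, e_log_beta_w)]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scores)
    unnormalized = [math.exp(s - peak) for s in scores]
    total = sum(unnormalized)
    # The explicit normalization: phi must sum to 1 over topics.
    return [u / total for u in unnormalized]
```

Skipping the final division leaves values that still rank the topics correctly but no longer form a distribution, which distorts every statistic accumulated from them.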

Page 49: Treasure Data Summer Internship Final Report

Faced Implementation Problem

• Difficult Debugging

  - Circular reference

[Figure: dependence among Φ, γ, and β]

Page 50: Treasure Data Summer Internship Final Report

Result: Online LDA

• Data: 20News

• Topics: 6

• Iterations: 10


Page 52: Treasure Data Summer Internship Final Report

Result: Online LDA

• Topic: 1 (Sports)

• No.0 writes[6]: 0.007909349
• No.1 article[7]: 0.006535292
• No.2 apr[3]: 0.0034389505
• No.3 team[4]: 0.00340712
• No.4 game[4]: 0.0033219245
• No.5 year[4]: 0.0032751847
• No.6 good[4]: 0.0032546786
• No.7 time[4]: 0.0030503264
• No.8 play[4]: 0.00262638
• No.9 games[5]: 0.002433915
• No.10 season[6]: 0.0022433712
• No.11 ll[2]: 0.0020719478
• No.12 players[7]: 0.0020332362
• No.13 win[3]: 0.0019284738
• No.14 hockey[6]: 0.0018870989
• No.15 league[6]: 0.0018450991
• No.16 baseball[8]: 0.0018226414
• No.17 years[5]: 0.0017960512
• No.18 mail[4]: 0.0017936684
• No.19 people[6]: 0.0017642054
• No.20 teams[5]: 0.0016675185
• No.21 great[5]: 0.001642102
• No.22 ve[2]: 0.0015846819
• No.23 point[5]: 0.0015730233
• No.24 cs[2]: 0.0015609838
• No.25 didn[4]: 0.0015398773
• No.26 lot[3]: 0.0015123658
• No.27 mike[4]: 0.0014935194
• No.28 university[10]: 0.0014718652
• No.29 player[6]: 0.0014655796


Page 54: Treasure Data Summer Internship Final Report

Result: Online LDA

• Topic: 3 (Computer)

• No.0 writes[6]: 0.0065424195
• No.1 article[7]: 0.005621346
• No.2 apr[3]: 0.002746017
• No.3 work[4]: 0.002731466
• No.4 good[4]: 0.00266331
• No.5 ve[2]: 0.0025969497
• No.6 time[4]: 0.0025880735
• No.7 system[6]: 0.0024449623
• No.8 problem[7]: 0.002349667
• No.9 mail[4]: 0.0023234019
• No.10 windows[7]: 0.0021310966
• No.11 people[6]: 0.0018598152
• No.12 find[4]: 0.0018072439
• No.13 computer[8]: 0.0017470584
• No.14 email[5]: 0.0017204053
• No.15 drive[5]: 0.0017121765
• No.16 bit[3]: 0.0016401116
• No.17 program[7]: 0.001636191
• No.18 software[8]: 0.0016341405
• No.19 university[10]: 0.0015907411
• No.20 ll[2]: 0.0015530549
• No.21 thing[5]: 0.0015159848
• No.22 card[4]: 0.0013826761
• No.23 doesn[5]: 0.0013809163
• No.24 phone[5]: 0.0013786326
• No.25 question[8]: 0.0013721529
• No.26 internet[8]: 0.001368883
• No.27 file[4]: 0.0013417117
• No.28 things[6]: 0.0013097903
• No.29 set[3]: 0.0013029057

Page 55: Treasure Data Summer Internship Final Report

Impressions of the Internship

• Machine Learning

  - Implementing ML algorithms from scratch was fun

• Contributing to OSS was a valuable experience for me

Page 56: Treasure Data Summer Internship Final Report

Unfinished Business

• Documentation

  - Write entries for FM / Online LDA

• UDTF

  - Build the functions into Hivemall

Page 57: Treasure Data Summer Internship Final Report

• Thank you for listening