Jubatus real-time distributed recommendation @TokyoWebmining #17

Jubatus real-time distributed recommendation, 2012/05/20 @TokyoWebmining, Preferred Infrastructure, Inc., Yuya Unno (@unnonouno)

Upload: yuya-unno

Post on 15-Jan-2015


TRANSCRIPT

  • 1. Jubatus real-time distributed recommendation. 2012/05/20 @TokyoWebmining. Preferred Infrastructure, Inc. Yuya Unno (@unnonouno)

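  • 2. Yuya Unno (@unnonouno); unno/no/uno; Preferred Infrastructure; Sedue; Jubatus
  • 3. Jubatus; #TokyoNLP m(_ _)m
  • 4. Jubatus
  • 5. Big Data! PC, EC
  • 6. STEP 1, STEP 2, STEP 3; 30; 30
  • 7. Jubatus: developed by NTT PF and Preferred Infrastructure; released as OSS on 10/27; http://jubat.us/
  • 8. Jubatus
  • 9. TV
  • 10. Hadoop & Mahout; CEP
  • 11.
  • 12. Jubatus; RPC
  • 13.
  • 14. Three operations: UPDATE, ANALYZE, MIX (cf. MAP / REDUCE)
  • 15. Example: mean. State: (sum), (count). UPDATE: sum += x; count += 1. ANALYZE: return (sum / count). MIX: sum = sum1 + sum2; count = count1 + count2
  • 16. libsvm format: +1 1:1 3:1 8:1; C; C
  • 17. RDB, Hadoop; SQL; Map/Reduce
  • 18. Jubatus; MCMC
  • 19. Jubatus; JSON; twitter API
  • 20. Remote Procedure Call (RPC): mprpc-idl; IDL (Interface Definition Language); mprpc-idl generates clients from the IDL (Ruby, Python, Java). Diagram labels: IDL, RPC
  • 21.
  • 22.
  • 23. Data $D = \{d_1, d_2, \ldots, d_n\}$; query $q$; for a similarity function $f$, find the $k$ items $d$ with the largest $f(d, q)$; $f$ is, for example, the Jaccard coefficient
  • 24. The naive method is OK: input: x; for d in all data: score[d] = sim(x, d); sort score; return top-K elements of score
  • 25.
  • 26. Cosine similarity: $\cos\theta(x, y) = \frac{x^\top y}{|x|\,|y|}$; Jaccard coefficient: $\mathrm{Jacc}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$
  • 27. Locality Sensitive Hashing (simhash); minhash
  • 28.
  • 29. Locality Sensitive Hashing (LSH): take a random vector $r$; for $x, y$, whether $x^\top r$ and $y^\top r$ have the same sign depends on $\cos\theta(x, y)$; prepare $k$ random vectors $\{r_1, \ldots, r_k\}$ and compute $H(x) = \{\mathrm{sign}(x^\top r_1), \ldots, \mathrm{sign}(x^\top r_k)\}$; sign returns 1 or 0; $H(x)$ is a $k$-bit string
  • 30. LSH: each bit of $H(x)$ and $H(y)$ matches with probability $1 - \theta(x, y)/\pi$, so $\cos\theta(x, y)$ can be estimated from the fraction of matching bits
  • 31. Jaccard coefficient: similarity between sets (0/1 vectors are OK); $\mathrm{Jacc}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$; $X = \{1, 2, 4, 6, 7\}$; $Y = \{1, 3, 5, 6\}$; $X \cap Y = \{1, 6\}$; $X \cup Y = \{1, 2, 3, 4, 5, 6, 7\}$; $\mathrm{Jacc}(X, Y) = 2/7$
  • 32. minhash: $X = \{x_1, x_2, \ldots, x_n\}$; hash every element of $X$: $H(X) = \{h(x_1), \ldots, h(x_n)\}$; $m(X) = \mathrm{argmin}(H(X))$; the probability that $m(X) = m(Y)$ equals $\mathrm{Jacc}(X, Y)$; with several hash functions, the fraction of cases with $m(X) = m(Y)$ estimates $\mathrm{Jacc}(X, Y)$; only a few bits of $m(X)$ need to be kept [Li+10a, Li+10b]
  • 33. minhash: $m(X) = m(Y)$ exactly when the minimum-hash element of $X \cup Y$ lies in $X \cap Y$ (diagram: X, Y)
  • 34. Weighted Jaccard coefficient: weights such as idf; $\mathrm{wJacc}(X, Y) = \frac{\sum_{i \in X \cap Y} w_i}{\sum_{i \in X \cup Y} w_i}$; with all $w_i = 1$ this is the Jaccard coefficient; $X = \{1, 2, 4, 6, 7\}$; $Y = \{1, 3, 5, 6\}$; $w = (2, 3, 1, 4, 5, 2, 3)$; $X \cap Y = \{1, 6\}$; $X \cup Y = \{1, 2, 3, 4, 5, 6, 7\}$; $\mathrm{wJacc}(X, Y) = (2+2)/(2+3+1+4+5+2+3) = 4/20$
  • 35. minhash for the weighted Jaccard coefficient [Chum+08]: $X = \{x_1, x_2, \ldots, x_n\}$; $H(X) = \{h(x_1)/w_1, \ldots, h(x_n)/w_n\}$; the hash value is taken as $-\log(h(x))$; a larger $w_i$ makes element $i$ more likely to attain the minimum, a smaller $w_i$ less likely; $m(X) = \mathrm{argmin}(H(X))$; the probability that $m(X) = m(Y)$ equals $\mathrm{wJacc}(X, Y)$
  • 36. [Liu+11]
  • 37. Jubatus: $D$, $d$; $L(d, q)$
  • 38. Data is partitioned by ID (1~100, 101~200, 201~300) with CHT (Consistent Hashing); mix
  • 39. MIX (diagram: ID ranges 1~100, 101~200, 201~300 on servers 1, 2, 3; MIX!!)
  • 40. Bit strings of LSH and minhash (diagram: ID ranges 1~100, 101~200, 201~300 on servers 1, 2, 3; MIX!!)
  • 41.
  • 42. LSH; minhash
  • 43.
  • 44. Summary: Jubatus; MIX; IDL; Locality Sensitive Hashing (simhash); minhash
  • 45. References:
    [Chum+08] Ondrej Chum, James Philbin, Andrew Zisserman. Near Duplicate Image Detection: min-Hash and tf-idf Weighting. BMVC 2008.
    [Li+10a] Ping Li, Arnd Christian König. b-Bit Minwise Hashing. WWW 2010.
    [Li+10b] Ping Li, Arnd Christian König, Wenhao Gui. b-Bit Minwise Hashing for Estimating Three-Way Similarities. NIPS 2010.
    [Liu+11] Wei Liu, Jun Wang, Sanjiv Kumar, Shih-Fu Chang. Hashing with Graphs. ICML 2011.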
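
The mean example on slide 15 (UPDATE, ANALYZE, MIX over a sum and a count) can be sketched as follows. This is a minimal illustration and not the Jubatus API: the class name `MeanModel` and its method names are hypothetical, and the two "servers" are plain local objects.

```python
from dataclasses import dataclass


@dataclass
class MeanModel:
    """Hypothetical per-server state for the mean example: (sum, count)."""
    total: float = 0.0
    count: int = 0

    def update(self, x: float) -> None:
        # UPDATE: fold one observation into the local state.
        self.total += x
        self.count += 1

    def analyze(self) -> float:
        # ANALYZE: answer a query from the local state only.
        return self.total / self.count

    @staticmethod
    def mix(a: "MeanModel", b: "MeanModel") -> "MeanModel":
        # MIX: merge two local states; sums and counts simply add.
        return MeanModel(a.total + b.total, a.count + b.count)


# Two servers each see a different part of the stream.
server1, server2 = MeanModel(), MeanModel()
for x in [1.0, 2.0]:
    server1.update(x)
for x in [3.0, 4.0]:
    server2.update(x)

mixed = MeanModel.mix(server1, server2)
print(mixed.analyze())  # mean over all four observations: 2.5
```

Because the state is a pair of sums, the mixed result equals the mean that a single server would have computed over the whole stream.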
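
The LSH (simhash) construction on slides 29 and 30 can be sketched with the standard library alone. The function names, the Gaussian random vectors, and the fixed seed are assumptions made for this illustration; the slides only specify $H(x) = \{\mathrm{sign}(x^\top r_1), \ldots, \mathrm{sign}(x^\top r_k)\}$ and the match probability $1 - \theta(x, y)/\pi$.

```python
import math
import random


def random_hyperplanes(dim, k, seed=0):
    # k random vectors r_1..r_k with Gaussian components (an assumption).
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


def simhash(x, planes):
    # H(x): one bit per random vector, 1 if x.r >= 0 else 0.
    return [1 if sum(xi * ri for xi, ri in zip(x, r)) >= 0 else 0
            for r in planes]


def estimate_cosine(hx, hy):
    # Matching fraction estimates 1 - theta/pi; invert it to get cos(theta).
    match = sum(a == b for a, b in zip(hx, hy)) / len(hx)
    return math.cos(math.pi * (1.0 - match))


def exact_cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))


planes = random_hyperplanes(dim=2, k=4096)
x, y = [1.0, 0.0], [1.0, 1.0]
hx, hy = simhash(x, planes), simhash(y, planes)
print(exact_cosine(x, y), estimate_cosine(hx, hy))
```

Only the bit strings `hx` and `hy` are needed at query time, so the original vectors do not have to be kept to compare two items.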
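
The minhash estimate from slides 31 and 32 can be checked on the slide's own sets X and Y. In this sketch the seeded hash built on `hashlib.blake2b` and the helper names are assumptions chosen for reproducibility; any family of well-mixed hash functions would play the same role.

```python
import hashlib


def h(seed, item):
    # One hash function per seed; 64-bit value derived from blake2b.
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes):
    # m(X) = argmin over X of h(x), repeated for num_hashes hash functions.
    return [min(items, key=lambda it: h(seed, it)) for seed in range(num_hashes)]


def estimate_jaccard(sig_x, sig_y):
    # Fraction of hash functions for which m(X) == m(Y).
    return sum(a == b for a, b in zip(sig_x, sig_y)) / len(sig_x)


def jaccard(x, y):
    return len(x & y) / len(x | y)


X = {1, 2, 4, 6, 7}
Y = {1, 3, 5, 6}
sig_x = minhash_signature(X, 2000)
sig_y = minhash_signature(Y, 2000)
print(jaccard(X, Y), estimate_jaccard(sig_x, sig_y))
```

The exact value is 2/7, and the estimate approaches it as the number of hash functions grows.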
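
The weighted variant on slides 34 and 35 can be sketched the same way, using the slide's weights $w = (2, 3, 1, 4, 5, 2, 3)$ for elements 1 to 7. Mapping a `hashlib.blake2b` digest to a uniform value and the helper names are assumptions of this sketch; the value $-\log(h(x))/w_i$ follows the form shown on slide 35.

```python
import hashlib
import math


def uniform_hash(seed, item):
    # Map a 64-bit blake2b digest to a value in (0, 1] (assumed construction).
    data = f"{seed}:{item}".encode()
    v = int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")
    return (v + 1) / (2**64 + 1)


def weighted_minhash(items, weights, num_hashes):
    # m(X) = argmin over X of -log(h(x)) / w_x, once per hash function.
    return [
        min(items, key=lambda it: -math.log(uniform_hash(seed, it)) / weights[it])
        for seed in range(num_hashes)
    ]


def weighted_jaccard(x, y, weights):
    inter = sum(weights[i] for i in x & y)
    union = sum(weights[i] for i in x | y)
    return inter / union


weights = {1: 2, 2: 3, 3: 1, 4: 4, 5: 5, 6: 2, 7: 3}
X = {1, 2, 4, 6, 7}
Y = {1, 3, 5, 6}
wsig_x = weighted_minhash(X, weights, 2000)
wsig_y = weighted_minhash(Y, weights, 2000)
estimate = sum(a == b for a, b in zip(wsig_x, wsig_y)) / len(wsig_x)
print(weighted_jaccard(X, Y, weights), estimate)
```

Dividing by the weight shrinks the value for heavily weighted elements, so they attain the minimum more often; the exact weighted Jaccard here is 4/20.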