TRANSCRIPT
Making the Impossible Possible: Randomized Machine Learning Algorithms for Big Data
Rong Jin
Alibaba Group
Big Data Challenge
• Data in the digital universe
  • 2012: 2.7 zettabytes (10^21 bytes)
  • 2020: 40 zettabytes
• Huge amount of data is generated on the Internet every minute
  • YouTube users upload 300 hours of video
  • Facebook users share 4 million pieces of content
http://www.fiercebigdata.com/story/how-much-data-created-internet-every-minute/2015-08-14
Too much data to process
Big Data Challenge
High dimensional data
• E.g. millions of features have been used for image classification & online advertising
Why Data Size Matters?
Matrix completion
• Classification, clustering, recommender systems
• Performance is measured by recovery error
Why Data Size Matters?
• With O(rn log^2(n)) observed entries: PERFECT recovery
• With O(rn log(n)) observed entries: POOR recovery
[Figure: recovery error vs. number of observed entries; the error drops sharply somewhere between O(rn log(n)) and O(rn log^2(n)), and the behavior in between is unknown]
Why Learning from Big Data Is Hard?
Even computing a data average is non-trivial
• Each matrix Mi is sparse, with size 1B x 1M
• The average matrix Z is much denser and too expensive to store
• Can we compute an approximate average Z' without having to compute Z explicitly?
Why Learning from Big Data Is Hard?
Turn matrix averaging into an optimization problem
• Solved efficiently by stochastic gradient descent
• Intermediate solutions stay sparse, with a strong guarantee (a sketch of the idea follows)
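A concrete reading of these bullets (my own illustration, not necessarily the speaker's exact algorithm): the average matrix minimizes (1/N) * sum_i (1/2)||Z - M_i||_F^2, and a stochastic gradient step on this objective touches only the nonzero entries of one sampled M_i, so the iterate is always a combination of the few sparse matrices seen so far.

```python
# A minimal sketch (assumed setup, not the speaker's code): approximate the
# average of many sparse matrices by SGD on (1/N) * sum_i 0.5*||Z - M_i||_F^2.
import numpy as np
import scipy.sparse as sp

def approx_average(matrices, n_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    Z = sp.csr_matrix(matrices[0].shape)             # start from the all-zero sparse matrix
    for t in range(1, n_steps + 1):
        Mi = matrices[rng.integers(len(matrices))]   # sample one sparse matrix
        eta = 1.0 / t                                # 1/t step size => running average
        Z = (1.0 - eta) * Z + eta * Mi               # SGD step on the quadratic objective
    return Z
```

With the 1/t step size the iterate is exactly the running average of the sampled matrices, so Z approximates the full average after a modest number of samples without ever materializing the dense average over all N matrices.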
Why Learning from Big Data Is Hard?
• A set of n training examples
• A convex loss function
• A convex domain (a standard formulation of the resulting problem is written out below)
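A standard way to write the learning problem these bullets describe (the notation is my assumption, not taken from the slide):

```latex
\min_{\mathbf{w} \in \Omega} \;\; \frac{1}{n} \sum_{i=1}^{n} \ell\!\left(\mathbf{w}^{\top}\mathbf{x}_i,\; y_i\right)
```

where {(x_i, y_i)}, i = 1, ..., n, are the training examples, ℓ is a convex loss, and Ω is a convex domain.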
Why Learning from Big Data Is Hard?
Requires solving a large-scale optimization problem
• Too many data points (10^9)
• Very high dimensionality (10^8)
Randomized Algorithms for Big Data
Randomized algorithms are efficient
• for large-sized data sets
• only need one pass of the entire data set
• for high dimensional data
• reduce dimensionality by random projection
Randomized Algorithms for Big Data
Randomized algorithms are effective
• Minimize the generalization error
Randomized Algorithms for Big Data
Limitations of randomized algorithms
• Random decisions are suboptimal and can be very poor
We will focus our discussion on Random Projection
Random Projection
Random Projection
• Project data into a random low dimensional space
Gaussian Random Matrix S
Random Projection
• Recover the solution in the high-dimensional space (a sketch of both steps follows)
Gaussian Random Matrix S^T
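A minimal sketch of these two steps (my illustration, not the speaker's code): project the d-dimensional data into a random m-dimensional space with a Gaussian matrix S, and lift a solution found there back to the original space with S^T.

```python
# Gaussian random projection and the "naive" recovery discussed on the next
# slides (illustrative sketch; S, m and the scaling are my assumptions).
import numpy as np

def make_projection(d, m, seed=0):
    # m x d Gaussian matrix; the 1/sqrt(m) scaling keeps norms roughly preserved
    return np.random.default_rng(seed).normal(0.0, 1.0 / np.sqrt(m), size=(m, d))

def project(X, S):
    return X @ S.T          # each row x becomes S x, an m-dimensional point

def naive_recovery(w_low, S):
    return S.T @ w_low      # lift the low-dimensional solution back to R^d
```

A model is then trained on project(X, S); the next slides discuss when mapping its weight vector back with naive_recovery is not good enough.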
Random Projection
• Good news: a modest number of random projections is sufficient if the data is linearly separable with a margin
Random Projection
[Diagram: data in the high-dimensional space is projected into a low-dimensional space, where a solution is found and then recovered back to the high-dimensional space]
Random Projection
• Bad news: the recovered solution is a poor approximation of the optimal high-dimensional solution
Random Projection
• Impossibility theorem: for most random projection matrices S, the solution recovered from the low-dimensional space is far from the optimal solution
Is it possible to overcome this limitation of random projection while enjoying its simplicity?
Randomized Algorithms for Big Data
Limitations of randomized algorithms
• Random decisions are suboptimal and can be very poor
How can we overcome the fundamental limitations of randomized algorithms in ML?
Dual Random Projection
• Step 1: Random projection
• Step 2: Compute dual variables
• Step 3: Dual recovery
(A sketch of these three steps for one concrete loss follows.)
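A sketch of the three steps for one concrete case, ridge regression with squared loss (my illustrative choice; the slides state the method in general form):

```python
# Dual random projection sketched for ridge regression (squared loss).
# lam is the ridge parameter and m the projection dimension; both are assumptions.
import numpy as np

def dual_random_projection(X, y, m, lam=1.0, seed=0):
    n, d = X.shape
    S = np.random.default_rng(seed).normal(0.0, 1.0 / np.sqrt(m), size=(d, m))

    # Step 1: random projection -- solve the small m-dimensional problem
    Xp = X @ S                                                     # n x m
    w_low = np.linalg.solve(Xp.T @ Xp + lam * np.eye(m), Xp.T @ y)

    # Step 2: compute dual variables from the low-dimensional solution
    # (for squared loss, alpha_i is the scaled residual of example i)
    alpha = (y - Xp @ w_low) / lam

    # Step 3: dual recovery -- the high-dimensional solution is a weighted
    # combination of the original examples, with the dual variables as weights
    return X.T @ alpha
```

Step 2 relies on the dual variables of the projected problem being close to those of the original problem, so step 3 returns a high-dimensional solution close to the optimum.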
Dual Random Projection
Recovery property
• If the data matrix X can be well approximated by a rank-r matrix, then with high probability the recovered solution is close to the optimal high-dimensional solution
Why Does Dual Random Projection Work?
• Although the primal solution cannot be recovered accurately via random projection, the dual variables can be
• The recovery is closely related to gradient descent, where the primal solution is expressed as a combination of the training examples weighted by the dual variables (written out below)
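Written in l2-regularized empirical-risk notation (my choice of notation and regularizer, matching the ridge sketch above), the relation behind this slide is:

```latex
\mathbf{w}_* \;=\; -\frac{1}{\lambda}\sum_{i=1}^{n} \alpha_i\,\mathbf{x}_i ,
\qquad
\alpha_i \;=\; \ell'\!\left(\mathbf{w}_*^{\top}\mathbf{x}_i,\; y_i\right)
```

where ℓ' is the derivative of the loss in its first argument. The dual variables depend on the data only through inner products of the solution with the examples, which random projection approximately preserves; once the α_i are estimated from the projected problem, the high-dimensional solution is rebuilt exactly as this weighted combination of the original examples.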
Iterative Dual Random Projection
• Apply dual random projection repeatedly, refining the recovered solution at each round
• With high probability, the recovery error decreases geometrically with the number of rounds
Experiment with Synthetic Dataset
• N = 50,000, d = 20,000, r = 10
Experiment with RCV1 Dataset
• 800K documents, 40,000 features
Fine-Grained Visual Classification
• Fine-Grained Challenge 2013 (https://sites.google.com/site/fgcomp2013)
• Categories: aircraft, birds, dogs, shoes, cars
• Number of training images: 100K
Fine-Grained Visual Classification
• # Visual features: 134,016
• Our approach is based on metric learning
• Apply dual random projection to improve computational efficiency
Team                               Performance
Inria-Xerox                        77.1
CafeNet                            75.8
VisionMetric (our method)          71.7
Symbiotic (University of Oxford)   71.6
CognitiveVision (MSR)              70.0
DPD_Berkeley (Berkeley)            69.2
MPG (University of Tokyo)          52.9
Infor_FG (CMU)                     16.0
InterfAIce (UIUC)                   4.5
Online Display Ads
Advertiser
• Market its products
User
• Find products/services
Platform
• Attract enough traffic
Online Display Ads
Advertiser
• Choose a target audience by selecting appropriate tags
Platform
• Match users with ads through tags
Users
• Profiled by tag assignments
• Assigned the tags with the largest scores (greedy approach)
[Diagram: each user is assigned to tags Tag 1, Tag 2, ..., Tag n]
Supply & Demand Mismatch
Advertisers
• Limited budgets, hence a limited supply of tags
Platform
• Match users with ads through tags
Users
• Profiled by tag assignments
• Assigned the tags with the largest scores (greedy approach)
[Diagram: per-tag supply vs. demand, e.g. one tag with supply 5,000 but demand 1,000 and another with supply 1,000 but demand 5,000]
Supply and Demand Mismatch (I)
• Assume consumers a and b arrive in random order
• On average, 50% of the time b cannot find a matched ad
Advertiser   Budget   Score for a   Score for b
A            1        1.1           1
B            1        1             0
[Diagram: the two arrival orders (b then a; a then b) and the resulting greedy matches to ads A and B]
Supply and Demand Mismatch (II)
• Alternative solution: remove a from the list of target audience for ad A
• Both a and b will find their matched ad regardless of their arrival order
Advertiser   Budget   Score for a   Score for b
A            1        (removed)     1
B            1        1             0
[Diagram: with the modified targeting, b is matched to A and a to B in either arrival order]
A small simulation contrasting the greedy matching of the previous slide with this modified targeting is sketched below.
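The numbers below come from the tables above; the greedy rule and the "remove a from A's audience" fix are my reading of the example.

```python
# Greedy matching vs. modified targeting for the two-advertiser example.
# A score of 0 means the consumer is not in that ad's target audience.
import itertools

BUDGET = {"A": 1, "B": 1}

def match(order, scores):
    """Greedily match each arriving consumer to the eligible ad with the highest score."""
    budget = dict(BUDGET)
    matched = {}
    for consumer in order:
        candidates = [(s, ad) for ad, s in scores[consumer].items()
                      if s > 0 and budget[ad] > 0]
        if candidates:
            _, ad = max(candidates)
            budget[ad] -= 1
            matched[consumer] = ad
    return matched

# Original targeting: a is in both audiences and may use up A's budget before b arrives.
original = {"a": {"A": 1.1, "B": 1}, "b": {"A": 1, "B": 0}}
# Modified targeting: a removed from A's audience.
modified = {"a": {"A": 0, "B": 1}, "b": {"A": 1, "B": 0}}

for name, scores in [("original", original), ("modified", modified)]:
    for order in itertools.permutations(["a", "b"]):
        print(name, order, match(order, scores))
# With the original targeting, b stays unmatched when a arrives first;
# with the modified targeting, both consumers are matched in either order.
```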
Supply and Demand Mismatch in Alibaba
• Many targets with strong demand (i.e. consumers) but weak supply (i.e. advertisement budgets)
• Many targets with weak demand (i.e. consumers) but strong supply (i.e. advertisement budgets)
Minimize Mismatch: Global Optimization
• Find the best assignment of tags that
1. maximizes the revenue, and
2. minimizes the supply and demand mismatch
• A gigantic optimization problem
• Billions of users and thousands of tags
• Need to find a solution within 2 hours
[Diagram: assignment matrix A between users u1, u2, ..., un (10^9 users) and tags a1, a2, ..., am (10^5 tags)]
An illustrative formulation of this assignment problem is sketched below.
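One illustrative way to write such an assignment problem (purely hypothetical notation; the slides do not give the actual objective): let A_ut in {0, 1} indicate that user u is assigned tag t, s_ut the user-tag score, d_t the advertising budget (supply) available for tag t, γ a trade-off weight, and k the number of tags per user:

```latex
\max_{A \in \{0,1\}^{n \times m}} \;\; \sum_{u,t} A_{ut}\, s_{ut}
\;-\; \gamma \sum_{t} \Big( \sum_{u} A_{ut} - d_t \Big)^{2}
\qquad \text{s.t.} \qquad \sum_{t} A_{ut} \le k \;\; \forall u
```

The first term rewards revenue-generating assignments and the second penalizes per-tag supply/demand mismatch; after relaxing A to be continuous, this is the kind of gigantic optimization problem that dual random projection is applied to on the next slide.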
Minimize Mismatch: Global Optimization
• Apply dual random projection to efficiently find the solution for A
[Diagram: assignment matrix A between users u1, ..., un (10^9 users) and ads a1, ..., am (10^4 ads); random projection is applied, and the optimal solution and dual variables of the projected problem are obtained]
Implementation
• Implemented with MapReduce (a generic sketch of the map/reduce pattern follows)
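A generic illustration of the map/reduce pattern mentioned here (not the production pipeline; the shard layout and field names are hypothetical): mappers process shards of users and emit per-tag partial values, and the reducer sums them per tag.

```python
# Map/reduce sketch: mappers emit (tag, value) pairs for their shard of users,
# and the reducer sums the values per tag.
from collections import defaultdict

def mapper(user_shard):
    """Emit (tag, score) pairs for every user in this shard."""
    for user in user_shard:
        for tag, score in user["tag_scores"].items():
            yield tag, score

def reducer(pairs):
    """Sum the emitted values per tag."""
    totals = defaultdict(float)
    for tag, value in pairs:
        totals[tag] += value
    return dict(totals)

# Toy usage with two shards of (hypothetical) users:
shards = [
    [{"tag_scores": {"sports": 0.9, "autos": 0.1}}],
    [{"tag_scores": {"sports": 0.4, "fashion": 0.7}}],
]
pairs = (pair for shard in shards for pair in mapper(shard))
print(reducer(pairs))   # per-tag totals across both shards
```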
Results in Online Display Ads
• Reduced the supply and demand mismatch
[Figure: supply/demand mismatch before vs. after optimization]
What Is Next?
• Impossibility theorems exist for many randomized algorithms in ML
  • Passive learning
  • Active learning
  • Data clustering
  • Matrix completion
  • Differential privacy
  • Compressive sensing
  • Low-rank matrix approximation
  • ……