Business Proprietary & Confidential
Quantum Clustering
Sigalit Bechler, Data Researcher
SimilarWeb & Tel-Aviv University
December 1, 2014
• SimilarWeb – a quick introduction
• Quantum Clustering
Agenda
SimilarWeb
• Founded: 2007
• Funding: $65M
• Offices: 6
• Employees: 300
Some of our clients
What We Do
We Provide Digital Insights to the Entire World
60M WEBSITES DAILY
FOR EVERY WEBSITE:
• TRAFFIC METRICS
• TRAFFIC SOURCES
• AUDIENCE
• INDUSTRY
• CONTENT
2M MOBILE APPS DAILY
FOR EVERY MOBILE APP:
• RATING
• ENGAGEMENT
• APP STORE DATA
• CATEGORY
• KEYWORDS
INGEST: INTERNATIONAL PANEL, CRAWLING, ISP DATA, LEARNING SET
• 90K events/sec
• 4TB/day compressed
BATCH & ON DEMAND PROCESSING:
• 100TB i/o a day
• > 150 machines just in processing cluster
• Statistical & machine learning algorithms
Quantum clustering
Prof. David Horn and Dr. Assaf Gottlieb, Phys. Rev. Lett. 88 (2002) 018702
• Unsupervised learning problem - dealing with unlabeled data
• Goal: group together elements that are similar to each other in some sense
• We usually have an idea of, or a preference for, what this “sense” should be
• May discover new patterns
Clustering - general overview
[Table: label, feature1, feature2, feature3, feature4]
• The user identity is unknown
• Leaving it in for the example
Clustering - general overview
[Table: label, feature1, feature2, feature3, feature4; every label shown as “?”]
• Grouping by gender
Clustering - general overview
[Table: label, feature1, feature2, feature3, feature4]
• Grouping by fields of interest
Clustering - general overview
[Table: label, feature1, feature2, feature3, feature4]
Quantum Clustering - Motivation
• Relatively easy clustering task
• Still need to set the number of clusters manually.
• Very complex clustering task.
• Unbiased analysis of X-ray absorption data
Quantum Clustering - Example
Analyzing Big Data with Dynamic Quantum Clustering. M. Weinstein, F. Meirer, A. Hume, Ph. Sciau, G. Shaked, R. Hofstetter, E. Persi, A. Mehta, D. Horn. http://arxiv.org/abs/1310.2700
• Information era - big data
• Massive collection of data
• Strong presence of outliers
• Unknown structures
• Non-trivial patterns
Why is it important?
Quantum Clustering
Distributed computation technologies
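Quantum clustering - the potential trick
1. Turn data-points into Gaussians of width $\sigma$ centered around the data points $\mathbf{x}_i$: $\psi(\mathbf{x}) = \sum_i e^{-(\mathbf{x}-\mathbf{x}_i)^2 / 2\sigma^2}$
2. Plug $\psi$ into the Schrödinger equation, $\left(-\frac{\sigma^2}{2}\nabla^2 + V(\mathbf{x})\right)\psi = E\psi$, and solve for $V(\mathbf{x})$. Define the solution for V as the potential transform.
• Single point → Gaussian → $V(\mathbf{x}) = \frac{(\mathbf{x}-\mathbf{x}_1)^2}{2\sigma^2}$ (harmonic potential), with $E = d/2$ in d dimensions
• Multi-points: $V(\mathbf{x}) = E + \frac{\frac{\sigma^2}{2}\nabla^2\psi}{\psi} = E - \frac{d}{2} + \frac{1}{2\sigma^2\psi}\sum_i (\mathbf{x}-\mathbf{x}_i)^2 e^{-(\mathbf{x}-\mathbf{x}_i)^2 / 2\sigma^2}$
3. Move each data point toward the nearest minimum of V by gradient descent on the potential surface: $\mathbf{y}_i(t+\Delta t) = \mathbf{y}_i(t) - \eta\,\nabla V(\mathbf{y}_i(t))$, starting from $\mathbf{y}_i(0) = \mathbf{x}_i$.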
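The three steps above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses the closed form of the potential for Gaussians of width sigma, V(x) - E = -d/2 + (1/(2 sigma^2 psi)) * sum_i (x - x_i)^2 exp(-(x - x_i)^2 / (2 sigma^2)), a fixed step size and step count, and a simple distance threshold to group points by where they end up. The function names and the parameters `eta`, `steps` and `tol` are hypothetical choices for illustration.

```python
import numpy as np

def potential_and_gradient(y, data, sigma):
    """Return V - E and grad V at query points y (m, d), given data (n, d)."""
    d = data.shape[1]
    diff = y[:, None, :] - data[None, :, :]        # pairwise differences (m, n, d)
    r2 = np.sum(diff ** 2, axis=2)                 # squared distances (m, n)
    w = np.exp(-r2 / (2.0 * sigma ** 2))           # one Gaussian per data point
    psi = w.sum(axis=1)                            # step 1: psi(y)
    a = (r2 * w).sum(axis=1)                       # sum_i r_i^2 * w_i
    v = -d / 2.0 + a / (2.0 * sigma ** 2 * psi)    # step 2: V(y) - E
    # Analytic gradient of a / (2 sigma^2 psi) via the quotient rule.
    grad_a = np.sum((w * (2.0 - r2 / sigma ** 2))[:, :, None] * diff, axis=1)
    grad_psi = -np.sum(w[:, :, None] * diff, axis=1) / sigma ** 2
    grad_v = (grad_a / psi[:, None]
              - (a / psi ** 2)[:, None] * grad_psi) / (2.0 * sigma ** 2)
    return v, grad_v

def quantum_cluster(data, sigma, eta=0.1, steps=200):
    """Step 3: slide a replica of every data point downhill on V."""
    y = np.array(data, dtype=float)                # replicas start at the data
    for _ in range(steps):
        _, grad = potential_and_gradient(y, data, sigma)
        y -= eta * grad                            # fixed step size (assumption)
    return y

def label_by_destination(final, tol):
    """Give the same label to replicas that stopped within tol of each other."""
    centers, labels = [], []
    for p in final:
        for k, c in enumerate(centers):
            if np.linalg.norm(p - c) < tol:
                labels.append(k)
                break
        else:
            centers.append(p)
            labels.append(len(centers) - 1)
    return labels
```

A typical call would be `labels = label_by_destination(quantum_cluster(data, sigma=1.0), tol=0.5)`. The constant E is dropped because it shifts V without moving its minima; sigma is the only parameter of the potential itself.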
Quantum clustering – reasoning
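• Why does it make sense?
• $-\frac{\sigma^2}{2}\nabla^2$: models the divergence effects from the cluster center.
• V(x): the effects that bind points from the same cluster together.
• We may say that we are looking for the minima of V(x), since this is where the divergence effects are minimal (slow changes: small numerator; high density: large denominator), with $\psi$ the sum of Gaussians of width $\sigma$ centered on the data points: $V(\mathbf{x}) = E + \frac{\frac{\sigma^2}{2}\nabla^2\psi}{\psi}$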
• SVD may be performed prior to the clustering: $X = U S V^T$; perform QC on U or V
• Resolves the problem that each feature has a different dimension type (units) and scale.
• Enables dimension reduction to the components with the highest variance.
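The SVD preprocessing step can be sketched with NumPy as follows. This is an illustrative sketch only: centering the columns before the decomposition and keeping the first `k` columns of U are assumptions made here, and `svd_reduce` is a hypothetical helper name, not something defined in the talk.

```python
import numpy as np

def svd_reduce(x, k, center=True):
    """Thin SVD X = U S V^T; return the k leading columns of U, plus S and V^T."""
    x = np.asarray(x, dtype=float)
    if center:
        x = x - x.mean(axis=0)     # assumption: remove per-feature means first
    # full_matrices=False gives the thin SVD; singular values come back sorted
    # in descending order, so the leading columns carry the most variance.
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return u[:, :k], s, vt
```

The rows of the returned `u[:, :k]` are unit-free coordinates on a comparable scale, which is why clustering can then be run on them instead of on the raw features.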
A topographic map of the probability distribution for the crab data set with σ=1/2 using principal components 2 and 3. There exists only one maximum.
A topographic map of the potential for the crab data set with σ=1/2 using principal components 2 and 3. The four minima are denoted by crossed circles. The contours are set at values V=cE for c=0.2,…,1.
The Crabs Example (from Ripley’s textbook), 4 classes, 50 samples each, d=5
[Figure panels: the data; 3D plot of the potential]
Quantum clustering - summary
• Built-in capability to handle outliers (divergence part): no need for additional parameters or processes, no effect on the number of significant clusters
• The cluster may be a line or other shape and not necessarily a point in the feature space.
• The clusters are not defined by geometric or probability considerations alone
• No need to pre-define the number of clusters
• An approximate quantum clustering variant exists that improves time complexity.
• Sensitive to small variations in the data density, unlike approaches based on geometric considerations alone.
• Possible distributed calculation:
• Since all we have to do is calculate V and ∇V for every data point, the calculation can be split so that each point is handled separately on a different machine
• Performed exceptionally well at exposing hidden patterns and structures in data from a wide range of fields: finance, online marketing, experimental physics, speech recognition, biological data.
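The distributed-calculation bullet above can be illustrated with a small sketch. It assumes the closed form of the potential for Gaussians of width sigma (V up to the constant E) and uses a local thread pool as a stand-in for separate machines; `potential_chunk`, `potential_distributed` and `n_workers` are hypothetical names. Each worker needs the full data set (read-only) but only its own chunk of query points, so no communication between workers is required.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def potential_chunk(chunk, data, sigma):
    """V - E at every point of one chunk; independent of all other chunks."""
    d = data.shape[1]
    r2 = ((chunk[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-r2 / (2.0 * sigma ** 2))
    return -d / 2.0 + (r2 * w).sum(axis=1) / (2.0 * sigma ** 2 * w.sum(axis=1))

def potential_distributed(points, data, sigma, n_workers=4):
    """Split the query points into chunks and evaluate each chunk separately."""
    chunks = np.array_split(points, n_workers)
    # Threads stand in for machines here (assumption); a cluster framework
    # would ship each chunk to a different node instead.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda c: potential_chunk(c, data, sigma), chunks))
    return np.concatenate(parts)
```

The same chunking applies to the gradient of V, since it is also a sum over the data evaluated at a single point.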
Quantum clustering
• Physics may provide an interesting perspective on questions that at first glance have no connection to physics.
• It has been done in scale-space theory
• Simulated annealing
• In bioinformatics, for extracting protein structure
• And many more
• Next steps: implement in a distributed manner, examine this algorithm on web data, improve time complexity, explore approximate QC, theoretical research.
Quantum clustering
Thank You!
Get to know SimilarWeb: https://www.similarweb.com/
References
Prof David Horn Homepage: http://horn.tau.ac.il/