Imada presentation
Martin R. [email protected]
Outline
• Personal introduction
• What is Colourbox?
• Why is Colourbox interesting?
• Similar images
• Search result ranking
• Recommendations
• Why Colourbox?
• Open position
• Questions
Who am I? Why am I here?
• Me
• Graduated from IMADA, 2010
• Ph.D. in Computer Science
• Online Algorithms
• Technical Project Manager & System Architect
• Why this talk?
• Promote Colourbox
• There are interesting jobs on Funen
Colourbox
• Microstock photography company
• Resell images, vector graphics, videos
• March 2006
• 3 employees, 50 users, 50,000 images, 150 new images daily
• November 2011
• 21 employees, 65,000 users, 2,000,000 images, 5,000 new images daily
• Currently in top 10 of all stock sites, aiming at #1
Colourbox
• Only stock site that offers a flat rate
• Download all you want for €249,- per month
• Search, find, download
• Browse, get inspired, download
The Tech
• Built using open source software
• HTML(5), CSS(3), and JavaScript (jQuery) front-end
• Varnish, Lighttpd, and Memcached
• MySQL (Percona) database
• PHP backend
• PHP, Python, and C++ scripts
• Self-developed search engine (Colourit)
• Using Python and C
• Cloud based on Amazon EC2 and S3
The Geek Side
• Techniques from mathematics and computer science
• Distributed/parallel computing
• Vector mathematics
• Various tree structures
• Set intersection
• Cache oblivious algorithms
• Clustering algorithms
• Ranking algorithms
• Markov chains
• etc...
Similar images
• Given an image, what other images look similar to it?
• Inspire
• Browse
• All images have keywords
• The keyword-to-image association is weighted
• How pronounced is the keyword for the image?
• Calculated automatically (more later)
Similar images
• Each keyword is a dimension in keyword vector space
• Each image is then represented as a vector in this space
• The projection onto each dimension is the weight of the corresponding keyword
• Example
• (goat, 96), (white, 94), (outside, 50)
• Vector (x, y, z, w) = (0.96, 0.94, 0.5, 0)
• (goat, 47), (white, 81), (day, 19)
• Vector (x, y, z, w) = (0.47, 0.81, 0, 0.19)
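The mapping from weighted keywords to vectors could be sketched as follows. This is only an illustration: the helper name to_vector and the fixed dimension order (goat, white, outside, day) are assumptions, and weights are taken to be percentages as in the example above.

```python
def to_vector(keyword_weights, dimensions):
    """Map {keyword: weight in percent} onto a fixed keyword order.

    Keywords the image does not have get weight 0.
    (Hypothetical helper, not Colourbox's actual code.)
    """
    return [keyword_weights.get(keyword, 0) / 100 for keyword in dimensions]


# One dimension per keyword; the order is an arbitrary assumption.
dimensions = ["goat", "white", "outside", "day"]

image_a = {"goat": 96, "white": 94, "outside": 50}
image_b = {"goat": 47, "white": 81, "day": 19}

vec_a = to_vector(image_a, dimensions)
vec_b = to_vector(image_b, dimensions)
print(vec_a, vec_b)
```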
Similar images
• Similarity is then the angle between two vectors
• Easily calculated using high school math
• Result between 0 and 90 degrees
• Example (cont.)
• (0.96, 0.94, 0.5, 0) and (0.47, 0.81, 0, 0.19)
• Approx 27.73 degrees
• Do two images with similarity of 27.73 degrees look similar?
• Experiments determined the cut-off
$\vec{u} \cdot \vec{v} = \cos(\theta)\,|\vec{u}|\,|\vec{v}|$
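Solving the dot product identity for the angle gives a short calculation. The sketch below assumes plain Python lists or tuples of weights and nonzero vectors; the function name angle_degrees is made up for illustration.

```python
import math


def angle_degrees(u, v):
    """Angle between two keyword vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


# The two example vectors from the slides.
theta = angle_degrees((0.96, 0.94, 0.5, 0), (0.47, 0.81, 0, 0.19))
print(round(theta, 2))
```

Because all weights are non-negative, the cosine never drops below 0, which is why the result stays between 0 and 90 degrees.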
Similar images
• 2,000,000 images yield about 2,000,000,000,000 pairwise comparisons
• No job dependencies
• No data modifications
• Relatively small data size
• Each keyword is identified by a number
• Very easy to do in parallel and distribute
• Speed up using a trick from cache oblivious algorithms
• This is not a one-time thing
• Keywords and weights change
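Since no comparison depends on another, the pairs can be cut into independent jobs. The sketch below splits the index space into square tiles, each of which could be handed to a separate worker. Tiling is a common locality technique chosen here as an assumption; the talk does not say which cache oblivious trick was actually used, and the names pair_count, tile_jobs, and pairs_in_tile are hypothetical.

```python
def pair_count(n):
    """Number of unordered image pairs among n images."""
    return n * (n - 1) // 2


def tile_jobs(n, tile):
    """Yield (row_range, col_range) tiles that cover every pair i < j once."""
    for r in range(0, n, tile):
        for c in range(r, n, tile):
            yield (r, min(r + tile, n)), (c, min(c + tile, n))


def pairs_in_tile(rows, cols):
    """Yield the index pairs (i, j) with i < j that fall inside one tile."""
    for i in range(*rows):
        for j in range(max(i + 1, cols[0]), cols[1]):
            yield i, j


# Each tile only touches two small slices of the image list, so it can be
# processed by any worker without coordination.
jobs = list(tile_jobs(10, 4))
all_pairs = [pair for rows, cols in jobs for pair in pairs_in_tile(rows, cols)]
print(len(jobs), len(all_pairs), pair_count(2_000_000))
```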
Ranking of results
• How to rank search results?
• Want the “best” results first
• First solution: Use number of downloads as parameter
• Problems
• Old good images rank over new excellent images
• Wrong keywords distort the results
Ranking of results
• Harvest information from the users
• A clicked/downloaded image
• Matched the search string well
• Is a “good” image
• A shown-but-not-clicked image either
• Does not match the search string well, or
• Is a “bad” image
Ranking of results
• The keyword-to-image association is weighted
• Keyword weights are updated when
• a keyworder assigns a keyword (high weight)
• a supplier assigns a keyword (high weight)
• a user clicks on a photo presented by a search
• a user does NOT click on a photo presented
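One way the click feedback could adjust weights is sketched below. The update rule (move a weight a fixed fraction toward 1 on a click and toward 0 otherwise) and the rate of 0.1 are assumptions for illustration only; the talk does not give the actual formula, and this rule does not reproduce the exact numbers in the "Summer Lemon" example. The starting weights are borrowed from that example, and which image is "first" is also an assumption.

```python
def update_weight(weight, clicked, rate=0.1):
    """Nudge a keyword weight toward 1 on a click, toward 0 otherwise.

    Hypothetical rule; the result always stays within [0, 1].
    """
    target = 1.0 if clicked else 0.0
    return weight + rate * (target - weight)


def apply_search_feedback(images, query_keywords, clicked_id, rate=0.1):
    """Update only the searched keywords on every image that was shown."""
    for image_id, weights in images.items():
        for keyword in query_keywords:
            if keyword in weights:
                weights[keyword] = update_weight(
                    weights[keyword], image_id == clicked_id, rate
                )


# Search "Summer Lemon"; the user clicks the first result.
images = {
    "first": {"lemon": 0.9, "summer": 0.8, "apple": 0.1},
    "second": {"lemon": 0.7, "summer": 0.9, "apple": 0.0},
}
apply_search_feedback(images, ["summer", "lemon"], clicked_id="first")
print(images)
```

Keywords that were not part of the search ("apple" here) are left untouched, so a single search only moves the weights it has evidence about.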
Ranking of results
• Search “Summer Lemon”
• User clicks first result
• Pros
• Second image ranked lower for “Lemon”
• Cons
• “Summer” ranked lower on second image
• Fixed by subsequent searches
Lemon (0.7), Summer (0.9), Apple (0.0)
Lemon (0.9), Summer (0.8), Apple (0.1)
Lemon (0.65), Summer (0.8), Apple (0.0)
Lemon (0.95), Summer (0.86), Apple (0.1)
Ranking of results
• Images with
• Wrong keywords are ranked very low over time
• Good keywords are ranked higher
• Great images are ranked higher overall
• New excellent images can rank over old mediocre images
Recommendations
• “You are currently looking at image X, and you might be interested in images Y, Z, and W”
Recommendations
• What images are connected?
• Let’s track our users to find out
Recommendations
• Enter Markov chains
• Using a Markov chain of order 1, the probability of going from media X to media Y is
• The number of times path X -> Y was followed, divided by
• The total number of times any path going out of image X was followed
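The order-1 estimate above amounts to counting transitions and normalizing per source image. The sketch below assumes each tracked user session is available as an ordered list of image IDs; the session data and the names transition_probabilities and recommend are made up for illustration.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(X -> Y) as count(X -> Y) / count(all paths out of X)."""
    counts = defaultdict(Counter)
    for session in sessions:
        # Order 1: only consecutive pairs of viewed images matter.
        for src, dst in zip(session, session[1:]):
            counts[src][dst] += 1
    probs = {}
    for src, outgoing in counts.items():
        total = sum(outgoing.values())
        probs[src] = {dst: n / total for dst, n in outgoing.items()}
    return probs


def recommend(probs, image, k=3):
    """Return up to k most likely next images, ties broken by ID."""
    ranked = sorted(probs.get(image, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [dst for dst, _ in ranked[:k]]


# Hypothetical browsing sessions.
sessions = [["X", "Y", "Z"], ["X", "Y"], ["X", "W"], ["Y", "Z"]]
probs = transition_probabilities(sessions)
print(probs["X"], recommend(probs, "X"))
```

Here X -> Y was followed twice and X -> W once, so Y is recommended ahead of W when a user is looking at X.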
Why Colourbox?
• We are
• small - 15 people no more than 15 steps apart
• flat - no long chains of command
• flexible - we can act on a good idea immediately
• a 2011 Gazelle - we are still hiring while others are still firing
• We have
• Relaxed atmosphere
• Flexible work hours
• Candy cabinet, world class coffee machine, and stunning view :-)
• etc...
Why Colourbox?
• You get
• to work on fun problems
• great colleagues
• an international outlook
• to serve customers who are excited about us
• to be part of a company which aims to be #1
• New projects
• SkyFish - Company Colourbox
• Zulubox - to articles what Colourbox is to images
We are hiring!
• Software Developer – front-end systems
• Focus on HTML5, JS, PHP, SQL, etc.
• Can implement a pixel-perfect design from a PSD
• Can implement scalable code that also performs well when it is executed 50 times per second
• You know your way around Linux
• Start August 1st
• We are constructing a new office building
• Unsolicited applications are always welcome