IMADA presentation

Martin R. Ehmsen [email protected] www.colourbox.com


Post on 24-Dec-2014



Outline

• Personal introduction

• What is Colourbox?

• Why is Colourbox interesting?

• Similar images

• Search result ranking

• Recommendations

• Why Colourbox?

• Open position

• Questions


Who am I? Why am I here?

• Me

• Graduated from IMADA, 2010

• Ph.D. in Computer Science

• Online Algorithms

• Technical Project Manager & System Architect

• Why this talk?

• Promote Colourbox

• There are interesting jobs on Funen


Colourbox

• Microstock photography company

• Resell images, vector graphics, videos

• March 2006

• 3 employees, 50 users, 50,000 images, 150 new images daily

• November 2011

• 21 employees, 65,000 users, 2,000,000 images, 5,000 new images daily

• Currently in top 10 of all stock sites, aiming at #1


Colourbox

• Only stock site that offers flat rate

• Download all you want for €249,- per month

• Search, find, download

• Browse, get inspired, download


The Tech

• Built using open source software

• HTML(5), CSS(3), and JavaScript (jQuery) front-end

• Varnish, Lighttpd, and Memcached

• MySQL (Percona) database

• PHP backend

• PHP, Python, and C++ scripts

• Self-developed search engine (Colourit)

• Using Python and C

• Cloud based on Amazon EC2 and S3


The Geek Side

• Techniques from mathematics and computer science

• Distributed/parallel computing

• Vector mathematics

• Various tree structures

• Set intersection

• Cache oblivious algorithms

• Clustering algorithms

• Ranking algorithms

• Markov chains

• etc...


Similar images

• Given an image, what other images look similar to it?

• Inspire

• Browse

• All images have keywords

• The keyword-to-image association is weighted

• How pronounced is the keyword for the image?

• Calculated automatically (more later)


Similar images

• Each keyword is a dimension in keyword vector space

• Each image is then represented as a vector in this space

• The projection onto each dimension is the weight of the corresponding keyword

• Example

• (goat, 96), (white, 94), (outside, 50)

• Vector (x, y, z, w) = (0.96, 0.94, 0.5, 0)

• (goat, 47), (white, 81), (day, 19)

• Vector (x, y, z, w) = (0.47, 0.81, 0, 0.19)
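
A minimal Python sketch of the mapping in the example above. The helper name `to_vector`, the fixed dimension order (goat, white, outside, day), and the scaling by 1/100 are assumptions read off the slide; the talk does not show how Colourbox actually stores these vectors.

```python
# Hypothetical helper, not Colourbox code.
# The dimension order is an assumption taken from the slide's example.
DIMENSIONS = ["goat", "white", "outside", "day"]


def to_vector(weighted_keywords, dimensions=DIMENSIONS):
    """Return one coordinate per dimension; missing keywords get 0.0."""
    weights = dict(weighted_keywords)
    # Slide weights are on a 0-100 scale; the vectors use 0-1.
    return tuple(weights.get(name, 0) / 100 for name in dimensions)


image_a = to_vector([("goat", 96), ("white", 94), ("outside", 50)])
image_b = to_vector([("goat", 47), ("white", 81), ("day", 19)])
print(image_a)
print(image_b)
```

With millions of images and many keywords, a sparse representation (only the non-zero keywords per image) would likely be more practical than a dense tuple; the dense form is used here only to mirror the slide.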


Similar images

• Similarity is then the angle between two vectors

• Easily calculated using high school math

• Result between 0 and 90 degrees

• Example (cont.)

• (0.96, 0.94, 0.5, 0) and (0.47, 0.81, 0, 0.19)

• Approx 27.73 degrees

• Do two images with similarity of 27.73 degrees look similar?

• Experiments determined the cut-off

$\vec{u} \cdot \vec{v} = \cos(\theta)\,|\vec{u}|\,|\vec{v}|$
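
The angle in the example can be checked with a few lines of Python. This is a sketch that solves the dot-product identity above for θ; the function name `angle_degrees` is illustrative and not taken from the talk.

```python
import math


def angle_degrees(u, v):
    """Angle between two vectors, from u . v = cos(theta) * |u| * |v|."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


u = (0.96, 0.94, 0.5, 0.0)
v = (0.47, 0.81, 0.0, 0.19)
print(round(angle_degrees(u, v), 2))  # 27.73, matching the slide
```

Because all keyword weights are non-negative, the dot product is never negative, which is why the result always falls between 0 and 90 degrees.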


Similar images

• 2,000,000 images yield roughly 2,000,000,000,000 pairwise comparisons

• No job dependencies

• No data modifications

• Relatively small data size

• Each keyword is identified by a number

• Very easy to do in parallel and distribute

• Speed up using a trick from cache oblivious algorithms

• This is not a one-time thing

• Keywords and weights change
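
The talk does not say which cache-oblivious trick is used. One common idea from that area is to split the set of pairs recursively into smaller and smaller blocks, so that each block touches only a small range of images; the sketch below assumes that reading. All function names and the `leaf` block size are hypothetical.

```python
def pair_count(n):
    """Number of unordered image pairs among n images."""
    return n * (n - 1) // 2


def pairs_between(a_lo, a_hi, b_lo, b_hi, leaf=4):
    """Yield (i, j) for every i in [a_lo, a_hi) and j in [b_lo, b_hi)."""
    rows, cols = a_hi - a_lo, b_hi - b_lo
    if rows <= 0 or cols <= 0:
        return
    if rows <= leaf and cols <= leaf:
        # Small block: compare each image in one range with each in the other.
        for i in range(a_lo, a_hi):
            for j in range(b_lo, b_hi):
                yield (i, j)
    elif rows >= cols:
        # Split the longer side in half and recurse.
        mid = (a_lo + a_hi) // 2
        yield from pairs_between(a_lo, mid, b_lo, b_hi, leaf)
        yield from pairs_between(mid, a_hi, b_lo, b_hi, leaf)
    else:
        mid = (b_lo + b_hi) // 2
        yield from pairs_between(a_lo, a_hi, b_lo, mid, leaf)
        yield from pairs_between(a_lo, a_hi, mid, b_hi, leaf)


def all_pairs(lo, hi, leaf=4):
    """Yield every unordered pair (i, j) with i < j in [lo, hi) exactly once."""
    if hi - lo < 2:
        return
    mid = (lo + hi) // 2
    yield from all_pairs(lo, mid, leaf)
    yield from all_pairs(mid, hi, leaf)
    yield from pairs_between(lo, mid, mid, hi, leaf)


print(pair_count(2_000_000))  # 1999999000000, roughly 2 * 10**12
print(len(list(all_pairs(0, 10))) == pair_count(10))
```

Since no block depends on the result of another, each recursive sub-block could be handed to a separate worker, which fits the "no job dependencies" point on the slide.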


Ranking of results

• How to rank search results?

• Want the “best” results first

• First solution: Use number of downloads as parameter

• Problems

• Old good images rank over new excellent images

• Wrong keywords distort the results


Ranking of results

• Harvest information from the users

• A clicked/downloaded image

• Matched the search string well

• Is a “good” image

• A shown-but-not-clicked image either

• Does not match the search string well, or

• Is a “bad” image


Ranking of results

• The keyword-to-image association is weighted

• Keyword weights are updated when

• a keyworder assigns a keyword (high weight)

• a supplier assigns a keyword (high weight)

• a user clicks on a photo presented by a search

• a user does NOT click on a photo presented


Ranking of results

• Search “Summer Lemon”

• User clicks first result

• Pros

• Second image ranked lower for “Lemon”

• Cons

• “Summer” ranked lower on second image

• Fixed by subsequent searches

Lemon (0.7), Summer (0.9), Apple (0.0)

Lemon (0.9), Summer (0.8), Apple (0.1)

Lemon (0.65), Summer (0.8), Apple (0.0)

Lemon (0.95), Summer (0.86), Apple (0.1)
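
The feedback loop on this slide can be sketched in Python as follows. The talk does not give the update rule, so the one used here (move a weight a fixed fraction towards 1.0 on a click and towards 0.0 on a shown-but-not-clicked result) is an assumption and does not reproduce the slide's exact numbers, only their direction. The scoring function (sum of the searched keywords' weights), the `rate` parameter, and the reading of which weight set belongs to the clicked image are likewise assumptions.

```python
def update_weight(weight, clicked, rate=0.1):
    """Assumed rule: nudge towards 1.0 on a click, towards 0.0 otherwise."""
    if clicked:
        return weight + rate * (1.0 - weight)
    return weight - rate * weight


def apply_feedback(image, query_keywords, clicked, rate=0.1):
    """Return a copy of the weights, updated for the searched keywords only."""
    updated = {}
    for keyword, weight in image.items():
        if keyword in query_keywords:
            updated[keyword] = update_weight(weight, clicked, rate)
        else:
            updated[keyword] = weight
    return updated


def score(image, query_keywords):
    """Assumed relevance score: sum of the searched keywords' weights."""
    return sum(image.get(keyword, 0.0) for keyword in query_keywords)


query = {"summer", "lemon"}
first = {"lemon": 0.9, "summer": 0.8, "apple": 0.1}   # shown first, clicked
second = {"lemon": 0.7, "summer": 0.9, "apple": 0.0}  # shown, not clicked

first_after = apply_feedback(first, query, clicked=True)
second_after = apply_feedback(second, query, clicked=False)
print(first_after)
print(second_after)
```

As on the slide, the clicked image gains weight for both searched keywords, the other image loses weight for both (including the unwanted drop for "summer"), and "apple" is untouched because it was not part of the search.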


Ranking of results

• Images with

• Wrong keywords are ranked very low over time

• Good keywords are ranked higher

• Great images are ranked higher overall

• New excellent images can rank over old mediocre images


Recommendations

• “You are currently looking at image X, and you might be interested in images Y, Z, and W”


Recommendations

• What images are connected?

• Let’s track our users to find out


Recommendations

#2364906 #2964241 #2684393


Recommendations

• Enter Markov chains

• Using a Markov chain of order 1, the probability of going from media X to media Y is

• The number of times the path X -> Y was followed, divided by

• The total number of times any path out of X was followed
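
A short Python sketch of this estimate: count every observed step X -> Y in the tracked browsing paths, then divide by the total number of steps leaving X. The session data, the function names, and the `recommend` helper are made up for illustration; the talk does not describe how Colourbox stores or queries these counts.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate first-order transition probabilities from browsing paths."""
    counts = defaultdict(Counter)
    for path in sessions:
        # Count each consecutive step X -> Y in the session.
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1
    probabilities = {}
    for current, outgoing in counts.items():
        total = sum(outgoing.values())  # all steps leaving `current`
        probabilities[current] = {
            following: count / total for following, count in outgoing.items()
        }
    return probabilities


def recommend(probabilities, current, k=3):
    """Return up to k media with the highest transition probability."""
    outgoing = probabilities.get(current, {})
    return sorted(outgoing, key=outgoing.get, reverse=True)[:k]


# Hypothetical tracked sessions; each list is one user's path through media.
sessions = [["X", "Y", "Z"], ["X", "Y"], ["X", "W"], ["X", "Z", "Y"]]
probabilities = transition_probabilities(sessions)
print(probabilities["X"])
print(recommend(probabilities, "X", k=1))
```

In this toy data, two of the four steps out of X lead to Y, so P(X -> Y) is 0.5 and Y is the first recommendation for a user looking at X.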


Why Colourbox?

• We are

• small - 15 people no more than 15 steps apart

• flat - no long chains of command

• flexible - we can move on a good idea immediately

• a 2011 Gazelle - we are still hiring while others are still firing

• We have

• Relaxed atmosphere

• Flexible work hours

• Candy cabinet, world class coffee machine, and stunning view :-)

• etc...


Why Colourbox?

• You get

• to work on fun problems

• great colleagues

• an international outlook

• to serve customers who are excited about us

• to be part of a company which aims to be #1

• New projects

• SkyFish - Company Colourbox

• Zulubox - to articles what Colourbox is to images


We are hiring!

• Software Developer – front-end systems

• Focus on HTML5, JS, PHP, SQL, etc.

• Can implement a pixel-perfect design from a PSD

• Can implement scalable code that also performs well when it is executed 50 times per second

• You know your way around Linux

• Start August 1st

• We are constructing a new office building

• Unsolicited applications are always welcome