evolving search relevancy: presented by james strassburg, direct supply


Upload: lucidworks

Post on 13-Jul-2015


Evolving Search Relevancy

James Strassburg
Senior Architect - Direct Supply

@jstrassburg  

Agenda

• An Optimization Problem
• Genetic Algorithm Overview
• Modeling Solr Parameters
• Fitness Function

sir can you help me… ????

"iam from indonesia want to build search engine like a Google and i want to build the system using Genetic Algorithm but iam confused what will i do first. Thanks before."

Search Algorithm Parameters

/select?q=foo&defType=dismax

&qf=name^20+desc^10

&pf=name^10&ps=3&mm=2

&bf="ord(popularity)^0.05"

and many more
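As a rough sketch of how these parameters end up in a request, the snippet below assembles the query string in Python. The function name `build_select_url` and its argument layout are hypothetical, and the base URL is borrowed from the sample query later in the deck; only the Solr parameter names (`q`, `defType`, `qf`, `pf`, `ps`, `mm`) come from the slide.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, qf, pf, ps, mm):
    """Assemble a dismax /select URL (hypothetical helper, not part of Solr)."""
    params = {
        "q": query,
        "defType": "dismax",
        # qf and pf are space-separated field^boost pairs.
        "qf": " ".join(f"{field}^{boost}" for field, boost in qf.items()),
        "pf": " ".join(f"{field}^{boost}" for field, boost in pf.items()),
        "ps": ps,
        "mm": mm,
    }
    # urlencode turns spaces into '+' and percent-encodes '^' as %5E.
    return f"{base_url}/select?{urlencode(params)}"


url = build_select_url(
    "http://localhost:8983/solr/restaurantsCollection",
    "foo",
    qf={"name": 20, "desc": 10},
    pf={"name": 10},
    ps=3,
    mm=2,
)
print(url)
```

Every number passed in here is a tunable weight, which is what makes the parameter set a candidate for optimization.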

Where did those numbers come from?

I made them up… shhhhhhh.

Then we tweaked them after testing.

An Optimization Problem

So, how do we know we have the best set of numbers? Or even a good set? We have an optimization problem.

Sample Schema

<field name="name" type="text_en" indexed="true" stored="true" required="true" multiValued="false" omitNorms="true"/>

<field name="description" type="text_en" indexed="true" stored="true" multiValued="false" omitNorms="true"/>

Sample Data Set

[{
  "name":"Red Lobster",
  "description":"We deliver the freshest caught seafood every day."
},{
  "name":"Joe's Crab Shack",
  "description":"We serve delicious red crabs, rock crabs, large lobsters, and other delicious seafood. Our lobsters are our specialty."
}]

http://localhost:8983/solr/restaurantsCollection/select?q=red+lobster&defType=dismax&qf=name+description&indent=true&fl=name+description

Genetic Algorithms

• A tool for solving optimization problems
• Based on ideas from genetics, evolution, and natural selection
• DEAP – Distributed Evolutionary Algorithms in Python

Genetic Algorithms

• Define candidate solution encoding
• Define a fitness function
• Generate random solutions
• Select candidates for reproduction
• Use crossover and mutation to create a new generation
• Repeat until some stopping criterion is met
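The steps above can be sketched in plain Python without any library. This is a toy loop, not the DEAP-based code from the talk: the function names, tournament selection, uniform crossover, the default rates, and the fixed generation count as the stopping criterion are all assumptions chosen for brevity.

```python
import random


def evolve(fitness, n_bits, pop_size=20, generations=40,
           mutation_rate=0.05, seed=0):
    """Toy genetic algorithm over fixed-length bit lists (illustrative only)."""
    rng = random.Random(seed)

    def select(population):
        # Tournament selection: the fittest of three random candidates.
        return max(rng.sample(population, 3), key=fitness)

    # Generate random candidate solutions.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    # Stopping criterion (assumed): a fixed number of generations.
    for _ in range(generations):
        next_generation = []
        while len(next_generation) < pop_size:
            parent1, parent2 = select(population), select(population)
            # Uniform crossover: each gene comes from one parent at random.
            child = [rng.choice(pair) for pair in zip(parent1, parent2)]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_generation.append(child)
        population = next_generation
        best = max(population + [best], key=fitness)
    return best


# Toy fitness: count of 1 bits, so the ideal candidate is all ones.
best = evolve(fitness=sum, n_bits=8)
print(best, sum(best))
```

For search relevancy, the bit list would encode Solr boosts and the fitness function would score the resulting search results.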

Crossover and Mutation

Parent 1: [1,0,1,1,1,0,1,1]

Parent 2: [0,0,0,0,1,1,1,1]

Child: [1,0,0,1,1,0,1,0]
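One way to arrive at the child shown above is uniform crossover followed by a single bit flip. The slide does not say which operators were used, so the crossover mask and the mutated position below are picked by hand for illustration; in a real run both would be random.

```python
def uniform_crossover(parent1, parent2, mask):
    """Take the gene from parent1 where mask is 1, otherwise from parent2."""
    return [a if m else b for a, b, m in zip(parent1, parent2, mask)]


def mutate(candidate, positions):
    """Flip the bits at the given positions."""
    return [bit ^ 1 if i in positions else bit
            for i, bit in enumerate(candidate)]


parent1 = [1, 0, 1, 1, 1, 0, 1, 1]
parent2 = [0, 0, 0, 0, 1, 1, 1, 1]
mask = [1, 1, 0, 1, 1, 1, 0, 0]      # hand-picked for this example

crossed = uniform_crossover(parent1, parent2, mask)
child = mutate(crossed, positions={7})  # the last bit matches neither parent

print(crossed)  # [1, 0, 0, 1, 1, 0, 1, 1]
print(child)    # [1, 0, 0, 1, 1, 0, 1, 0]
```

The final bit of the child is 0 while both parents carry a 1 there, which is why a mutation step is needed to explain it.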

Encoding Parameters

>>> sys.float_info

sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)

Encoding Parameters

>>> import numpy

>>> single = numpy.float32(3.4)

>>> single

3.4000001

>>> half_single = numpy.float16(3.4)

>>> half_single

3.4004

Encoding Parameters

/select?q=foo&qf=field^35.2

versus

/select?q=foo&qf=field^35.3

Decimal / Fibonacci Encoding

• 0, 0.2, 0.4, 0.6, 0.8, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
• 16 values encode into 4 bits
• Supports fast evolution
• Avoids relative maxima

Decimal / Fibonacci Encoding

0.0 => [0, 0, 0, 0]

0.2 => [0, 0, 0, 1]

0.4 => [0, 0, 1, 0]

1 => [0, 1, 0, 1]

2 => [0, 1, 1, 0]

144 => [1, 1, 1, 1]
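The mapping above amounts to writing a value's position in the table as a 4-bit number. The sketch below assumes the sixteen values are 0, 0.2, 0.4, 0.6, 0.8, 1 followed by the Fibonacci numbers through 144; the 0.6 entry is inferred from the bit patterns (1 sits at index 5), not taken from the author's code. The names `BOOST_VALUES`, `encode`, and `decode` are hypothetical.

```python
# Assumed lookup table: index in the list == value of the 4-bit gene.
BOOST_VALUES = [0, 0.2, 0.4, 0.6, 0.8, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]


def encode(value):
    """Return the 4-bit list for a boost value from the table."""
    index = BOOST_VALUES.index(value)
    return [int(bit) for bit in format(index, "04b")]


def decode(bits):
    """Return the boost value for a 4-bit list."""
    index = int("".join(str(bit) for bit in bits), 2)
    return BOOST_VALUES[index]


print(encode(0.2))           # [0, 0, 0, 1]
print(encode(1))             # [0, 1, 0, 1]
print(decode([1, 1, 1, 1]))  # 144
```

Because every 4-bit pattern decodes to a valid boost, crossover and mutation can never produce an unusable candidate.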

Candidate Solution Encoding

/select?q=foo&qf=name^0.4+desc^13

0.4 => [0, 0, 1, 0]

13 => [1, 0, 1, 0]

Candidate Solution: [0, 0, 1, 0, 1, 0, 1, 0]
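Going the other way, a candidate solution can be cut into 4-bit genes and turned back into a `qf` value. This is a sketch under the same assumed lookup table (with 0.6 inferred as the fourth entry); `chromosome_to_qf` and the `FIELDS` list are hypothetical names.

```python
# Assumed lookup table: index in the list == value of the 4-bit gene.
BOOST_VALUES = [0, 0.2, 0.4, 0.6, 0.8, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]
FIELDS = ["name", "desc"]  # one 4-bit gene per field, in this order


def chromosome_to_qf(chromosome, fields=FIELDS):
    """Decode a flat bit list into a Solr qf string such as 'name^0.4 desc^13'."""
    parts = []
    for i, field in enumerate(fields):
        gene = chromosome[4 * i:4 * i + 4]
        index = int("".join(str(bit) for bit in gene), 2)
        parts.append(f"{field}^{BOOST_VALUES[index]}")
    # The space becomes '+' once the value is URL-encoded.
    return " ".join(parts)


qf = chromosome_to_qf([0, 0, 1, 0, 1, 0, 1, 0])
print(qf)  # name^0.4 desc^13
```

Adding another tunable parameter just means appending four more bits to the candidate.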

Fitness Function

• Measure how well a candidate solution solves the problem

• Should be very fast
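Speed matters because the fitness function runs once per candidate in every generation. One option, offered here as an assumption rather than something the talk prescribes, is to cache scores so that a candidate that reappears is not evaluated twice. `make_fitness` and `slow_score` are hypothetical names, and `slow_score` stands in for whatever actually queries the search engine.

```python
from functools import lru_cache


def make_fitness(score_candidate):
    """Wrap a scoring function so repeated candidates are scored only once."""

    @lru_cache(maxsize=None)
    def cached(candidate_key):
        return score_candidate(candidate_key)

    def fitness(candidate):
        # Lists are not hashable, so cache on a tuple of the bits.
        return cached(tuple(candidate))

    return fitness


calls = []


def slow_score(candidate_key):
    """Stand-in for an expensive evaluation (for example, running queries)."""
    calls.append(candidate_key)
    return sum(candidate_key)


fitness = make_fitness(slow_score)
first = fitness([1, 0, 1, 1])
second = fitness([1, 0, 1, 1])  # served from the cache
print(first, second, len(calls))
```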

Normalized Discounted Cumulative Gain

• Very relevant > relevant > not relevant
• Relevant results are more useful if they appear earlier
• Scores should be comparable irrespective of the query
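A minimal sketch of the metric, assuming relevance grades of 2 (very relevant), 1 (relevant), and 0 (not relevant) and the linear-gain form of DCG. For simplicity the ideal ordering is built from the returned results only; a fuller version would normalize against all judged documents for the query.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: later ranks are discounted by log2(rank + 1)."""
    return sum(gain / math.log2(rank + 1)
               for rank, gain in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the ideal ordering, giving a 0..1 score."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


good_order = ndcg([2, 1, 0])  # best result first
bad_order = ndcg([0, 1, 2])   # best result last
print(good_order, bad_order)
```

Dividing by the ideal score is what keeps the value in the same range for every query, so scores from different queries can be averaged into one fitness value.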

Precision and Recall

Precision – Likelihood that a returned result was correct
Recall – Likelihood that a relevant result was returned
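Both measures reduce to set arithmetic once there is a list of returned documents and a list of documents judged relevant. In this sketch the document names come from the sample data set, but the relevance judgment is invented for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Assumed judgment: only "Red Lobster" is relevant to the query "red lobster".
precision, recall = precision_recall(
    returned=["Red Lobster", "Joe's Crab Shack"],
    relevant=["Red Lobster"],
)
print(precision, recall)  # 0.5 1.0
```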

F-measure

• Harmonic mean of precision and recall
• Punishes outliers
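A short sketch of the balanced F-measure (often written F1), showing how the harmonic mean penalizes a lopsided pair compared with the arithmetic mean. The input values are made up for illustration.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f_measure(0.5, 0.5)  # both values equal
lopsided = f_measure(0.9, 0.1)  # same arithmetic mean of 0.5
print(balanced, lopsided)
```

Both pairs average to 0.5 arithmetically, but the lopsided pair scores about 0.18, so a candidate cannot earn a high fitness by maximizing one measure at the expense of the other.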

Analytics in Schema

<field name="searchTermInteractions" type="lowercase" indexed="true" stored="true" multiValued="true"/>

Demo

Resources

• DEAP - https://code.google.com/p/deap/
• My GitHub repo for this example - https://github.com/jstrassburg/evolving-search-relevancy
• @jstrassburg