liking relevance - php north east 2014
DESCRIPTION
Enabling Solr or ElasticSearch in your PHP application isn't that hard these days anymore. There are multiple libraries available which can turn any collection of data into a searchable index making it as accessible as the PHP language itself. But how do you become a pro in creating the right schema, query and data analyzer? This talk will take the attendee knee deep into the problems we faced in analyzing, faceting, getting the right user experience and of course the relevant results for the second-hand car site AutoTrack where you have to ability to search on more then 200 different attributes of a car! Now you may calculate how many unique different search queries that sums up to…TRANSCRIPT
TITEL DAG MAAND JAARPHP NORTH EAST 2014 JEROEN VAN DIJK
LIKING
RELEVANCE
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WILL I FIND WHAT I’M LOOKING FOR?
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
129 LINKEDIN PROFILE MATCHES
SEARCHING JEROEN VAN DIJK?
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
enrise.com/jeroen-van-dijk
phpbenelux.eu/jeroen-van-dijk
jrdk.nl/jeroen-van-dijk
twitter.com/jrvandijk
LINK: JEROEN VAN DIJK
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPGjoind.in/10916
LINK: JOIND.IN
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WHAT IS AUTOTRACK?
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
HOW MANY OPTIONS DO I HAVE?
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
NEED MORE?
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WHERE TO START?
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
HOW DO YOU SEARCH?
SOLR
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
SCHEMA.XML, WHERE DO I START?├── contrib ├── dist ├── docs ├── example │ ├── solr │ │ ├── bin │ │ └── collection1 │ │ └── conf │ │ ├── admin-extra.html │ │ ├── admin-extra.menu-bottom.html │ │ ├── admin-extra.menu-top.html │ │ ├── currency.xml │ │ ├── elevate.xml │ │ ├── mapping-FoldToASCII.txt │ │ ├── mapping-ISOLatin1Accent.txt │ │ ├── protwords.txt │ │ ├── schema.xml │ │ ├── scripts.conf │ │ ├── solrconfig.xml │ │ ├── spellings.txt │ │ ├── stopwords.txt │ │ ├── synonyms.txt │ │ ├── update-script.js └── licenses
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
DID YOU READ THE SCHEMA.XML? 1 <?xml version="1.0" encoding="UTF-8" ?> 2 <schema name="example" version="1.5"> 3 <!-- 4 This is the Solr schema file. This file should be named "schema.xml" and 5 should be in the conf directory under the solr home 6 (i.e. ./solr/conf/schema.xml by default) 7 or located where the classloader for the Solr webapp can find it. 8 9 This example schema is the recommended starting point for users. 10 It should be kept correct and concise, usable out-of-the-box. 11 12 For more information, on how to customize this file, please see 13 http://wiki.apache.org/solr/SchemaXml 14 --> 15 <types> 16 <!-- There is 61Kb of example schema --> 17 </types> 18 <fields> 19 <!-- documented in this file! --> 20 </fields> 21 <uniqueKey>id</uniqueKey> 22 <defaultSearchField>text</defaultSearchField> 23 </schema>
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WHAT ARE TYPES AND FIELDS? 1 <?xml version="1.0" encoding="UTF-8" ?> 2 <schema name="cars" version="1.5"> 3 <types> 4 <!-- A plain string type --> 5 <fieldType name="string" class=“solr.StrField" 6 sortMissingLast="true" omitNorms="true"/> 7 <!-- The model type for tokenizing & filtering synonyms --> 8 <fieldtype name="modelType" class="solr.TextField"> 9 <analyzer type="query"> 10 <tokenizer class="solr.KeywordTokenizerFactory"/> 11 <filter class="solr.SynonymFilterFactory" synonyms="models. 12 querytime.txt" ignoreCase="false" 13 expand="true"/> 14 </analyzer> 15 </fieldtype> 16 </types> 17 <fields> 18 <field name="merk" type="string" indexed="true" stored="true" 19 termVectors="true"/> 20 <field name="model" type="modelType" indexed="true" stored="true"/> 21 </fields> 22 <!-- ... --> 23 </schema>
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WHAT’S THE USAGE OF A COPY FIELD? 1 <?xml version="1.0" encoding="UTF-8" ?> 2 <schema name="cars" version="1.5"> 3 <types> 4 <!-- ... --> 5 </types> 6 <fields> 7 <!-- ... --> 8 <field name="text" type="text" indexed="true" stored="false" 9 multiValued="true"/> 10 </fields> 11 <uniqueKey>auto_id</uniqueKey> 12 <defaultSearchField>text</defaultSearchField> 13 <solrQueryParser defaultOperator="OR"/> 14 15 <copyField source="merk" dest="text"/> 16 <copyField source="model" dest="text"/> 17 <copyField source="uitvoering" dest="text"/> 18 <copyField source="aanbieder_informatie" dest="text"/> 19 <copyField source="interieur_kleur" dest="text"/> 20 <copyField source="bouwjaar" dest="text"/> 21 </schema>
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
HOW DID YOU INDEX DATA?
§Using Solr since version 1.3
§CSV & XML update request handlers
§Data import handler
§Explicitly chose Solr XML import
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
CAN IT BE MORE ABSTRACT?
Database
!
!
Nightly bulk export Trigger item export
!
!
Mapping
!
!
Solr
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
DO YOU HAVE SOME TIPS?
§Analyze your data up front
§Do not store, what you don’t want to visualize!
§Pay extra attention to columns you want to sort
§Create a well defined copy field for the query
§Do not use Solr as your persistent storage
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
NO SQL?
SELECT * FROM …
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WHICH QUERY HANDLERS ARE THERE?
§Standard
§Disjunction Max
§Extended Disjunction Max
§Create your own based on the above!
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WHAT IS SOLARIUM?
Extensible PHP Library
Usable in any PHP based framework
Abstracts raw Solr communication
@basdenooijer / @solariumproject
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
IS IT THAT EASY WITH SOLARIUM?
1 // get a select query instance 2 $client = new Solarium\Client($config); 3 $query = $client->createSelect(); 4 5 // define the output field 6 $query->setFields(array('auto_id', 'merk_model_uitvoering', 7 'aanbieder_informatie', 'score')); 8 $query->setSorts(array('score' => 'desc')); 9 10 // set the query 11 $query->setQuery('audi +avant +abs'); 12 13 $resultset = $client->select($query);
select?q=audi +avant +abs &fl=auto_id,merk_model_uitvoering,aanbieder_informatie,score&sort=score desc
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
HOW DO I USE THE DISMAX REQUEST HANDLER? 1 // define the output field 2 $query->setFields(array('auto_id', 'merk_model_uitvoering', 3 'aanbieder_informatie', 'score')); 4 $query->setSorts(array('score' => 'desc')); 5 6 // get the dismax component and set a boost query 7 $dismax = $query->getDisMax(); 8 $dismax->setQueryFields('uitvoering^2.3 text^0.9'); 9 // set the query 10 $query->setQuery('audi +avant +abs'); 11 12 $resultset = $client->select($query);
select?q=audi +avant +abs& fl=auto_id,merk_model_uitvoering,aanbieder_informatie,score&sort=score desc&defType=dismax&qf=uitvoering^2.3 text^0.9
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WHAT IS A BOOST FUNCTION? 1 // define the output field 2 $query->setFields(array('auto_id', 'merk_model_uitvoering', 2 'score')); 3 $query->setSorts(array('score' => 'asc')); 4 5 // get the dismax component and set a boost function 6 $dismax = $query->getDisMax(); 7 $dismax->setBoostFunctions( 8 ‘sqedist(x_coordinaat,y_coordinaat,155000,463000)'); 9 10 // set the query 11 $query->setQuery('audi +avant +abs');
select?q=audi +avant +abs &fl=auto_id,merk_model_uitvoering,score&sort=score asc&defType=dismax&bf=sqedist(x_coordinaat,y_coordinaat,155000,463000)
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WHY NOT USE WGS84 FOR DISTANCE SEARCH?
!
Solr 4.0 introduced
!
SpatialRecursivePrefixTreeFieldType
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPGQUERIES
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WHAT IS THE LUCENE RANGE QUERY SYNTAX?
maximum_speed:[1 TO 10]
publication_date:[20140301 TO 20140318]
publication_date:[* TO NOW]
any_range:[* TO *]
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
HOW DO I CREATE A RANGE QUERY? 1 // get the facetset component 2 $facetSet = $query->getFacetSet(); 3 4 // create a facet field instance and set options 5 $facet = $facetSet->createFacetRange('topsnelheid'); 6 $facet->setField('topsnelheid'); 7 $facet->setStart(120); 8 $facet->setGap(10); 9 $facet->setEnd(250); 10 11 // this executes the query and returns the result 12 $resultset = $client->select($query);
select?facet=true&facet.range={!key=topsnelheid}topsnelheid&f.topsnelheid.facet.range.start=90&f.topsnelheid.facet.range.end=250&f.topsnelheid.facet.range.gap=10
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
HOW DO I CREATE A RANGE QUERY? === Topsnelheid === [ ] 120.0 [31] [ ] 130.0 [616] [ ] 140.0 [2842] [ ] 150.0 [17254] [ ] 160.0 [15869] [ ] 170.0 [25597] [ ] 180.0 [29735] [ ] 190.0 [25370] [ ] 200.0 [17624] [ ] 210.0 [10256] [ ] 220.0 [6611] [ ] 230.0 [3322] [ ] 240.0 [1768]
select?facet=true&facet.range={!key=topsnelheid}topsnelheid&f.topsnelheid.facet.range.start=90&f.topsnelheid.facet.range.end=250&f.topsnelheid.facet.range.gap=10
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
HOW DO YOU DEFINE A CUSTOM RANGE?
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WHAT IS A FACET MULTI QUERY? 1 // get the facetset component 2 $facetSet = $query->getFacetSet(); 3 4 // create a facet field instance and set options 5 $facet = $facetSet->createFacetMultiQuery( 6 array('key' => 'topsnelheid')); 7 $facet->createQuery( 8 'topsnelheid[*TO140]', 'topsnelheid:[* TO 140]'); 9 $facet->createQuery( 10 'topsnelheid[141TO150]', 'topsnelheid:[141 TO 150]'); 11 12 // this executes the query and returns the result 13 $resultset = $client->select($query);
select?facet=true&facet.query={!key=topsnelheid[*TO140]}topsnelheid:[* TO 140]&facet.query={!key=topsnelheid[141TO150]}topsnelheid:[141 TO 150]
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
BUT WHAT ABOUT THE OTHER FIELDS?
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
RESULT
GROUPING
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WHAT ABOUT DIFFERENT CAR OPTIONS?
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WHAT ABOUT DIFFERENT CAR OPTIONS?
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
CREATING GROUPING QUERY
1 // define the output field 2 $query->setFields(array('auto_id', 3 'merk_model_uitvoering', 'score')); 3 $query->setSorts(array('score' => 'asc')); 4 5 // get group component and create two query groups 6 $group = $query->getGrouping(); 7 $group->addField('uitvoering_carrosserievorm'); 8 9 $query->setQuery('+cabriolet +abs');
select?q=+cabriolet +abs&fl=auto_id,merk_model_uitvoering,score&sort=score asc&group=true &group.field=uitvoering_carrosserievorm
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
PIVOT
FACETING
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
ISN’T THAT GOOGLE SPREADSHEETS?
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WHAT IS A DECISION TREE?
!
Audi !
!
A3 A4 !
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
IS IT EASY TO CREATE A PIVOT FACET?
1 $query->setQuery('+cabriolet +abs'); 2 3 // get the facetset component 4 $facetSet = $query->getFacetSet(); 5 6 // create a facet pivot instance 7 $facet = $facetSet->createFacetPivot('merk-model'); 8 $facet->addFields('merk,model'); 9 10 // this executes the query and returns the result 11 $resultset = $client->select($query);
select?q=+cabriolet +abs &facet=true&facet.pivot=merk,model
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
SEEMS SO I GUESS?
[ ] AUDI [7] [ ] -- A4 [7] [ ] BMW [4] [ ] -- 3-SERIE [3] [ ] -- 1-SERIE [1] [ ] VOLKSWAGEN [3] [ ] -- GOLF [2] [ ] -- NEW BEETLE [1] [ ] PEUGEOT [3] [ ] -- 207 [1] [ ] -- 306 [1] [ ] -- 307 [1]
select?q=+cabriolet +abs &facet=true&facet.pivot=merk,model
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
MULTI SELECT
FACETING
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WHERE ARE THE ZEROES?
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
FACET MINCOUNT 0
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WHERE DID LPG GO?
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
FACET MINCOUNT 1
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
DO YOU SEE ONLY THE RELEVANT MODELS?
=== Model === [ ] A3 (11) [ ] A4 (19) [ ] A6 (7) [ ] A8 (2) !
!
facet=true&facet.field=model&facet.mincount=1&fq=aanbieder_id:1&fq=merk:AUDI
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
IS THIS THE EXPECTED RESULT?
=== Model === [ ] A3 (11) [X] A4 (19) [ ] A6 (7) [ ] A8 (2) !
!
facet=true&facet.field=model&facet.mincount=1&fq=aanbieder_id:1&fq=merk:AUDI&fq=model:A4
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
OR THIS ONE?
=== Model === [X] A4 (19) !
!
facet=true&facet.field=model&facet.mincount=1&fq=aanbieder_id:1&fq=merk:AUDI&fq=model:A4
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
TAGGING AND EXCLUDING TO THE RESCUE? 1 $query->addFilterQueries(array( 2 array('key'=>'aanbieder_id', ‘query’=>'aanbieder_id:1', 3 'tag'=>'inner'), 4 array('key'=>'merk', 'query'=>'merk:AUDI', 5 'tag'=>'inner'), 6 array('key'=>'model', 'query'=>'model:A4', 7 'tag'=>'outer'))); 8 9 // get the facetset component 10 $facetSet = $query->getFacetSet(); 11 $facetSet->setMinCount(1); 12 $facetSet->createFacetField(array('key'=>'model', 13 ‘field'=>'model', 'exclude'=>'outer'));
facet=true&facet.field={!key=model ex=outer}model&facet.mincount=1& fq={!tag:inner}aanbieder_id:1&fq={!tag=inner}merk:AUDI&fq={!tag=outer}model:A4
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
IS THIS BETTER?
=== Model === [ ] A3 (11) [X] A4 (19) [ ] A6 (7) [ ] A8 (2) !
!facet=true&facet.field={!key=model ex=outer}model&facet.mincount=1& fq={!tag:inner}aanbieder_id:1&fq={!tag=inner}merk:AUDI&fq={!tag=outer}model:A4
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WHERE ARE THE COUNTS?
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
FACET MINCOUNT 0
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
DID YOU JUST CREATE THE FACET TWICE? 1 $query->addFilterQueries(array( 2 array('query'=>'aanbieder_id:1', 'tag'=>'inner'), 3 array('query'=>'carrosserievorm:CABRIOLET', 4 'tag'=>'outer'))); 5 6 // get the facetset component and add fields 7 $facetSet = $query->getFacetSet(); 9 9 $facetSet->createFacetField(array('key'=>'cv', 10 'field'=>'carrosserievorm', 'exclude'=>'outer')); 11 $facetSet->createFacetField(array('key'=>'cv_internal', 12 'field'=>'carrosserievorm'));
select?facet=true&fq={!tag=inner}aanbieder_id:1& fq={!tag=outer}carrosserievorm:CABRIOLET& facet.field={!key=cv ex=outer}carrosserievorm& facet.field={!key=cv_internal}carrosserievorm
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WHAT IS THE EXPECTED RESULT?=== CV === [ ] HATCHBACK [514] [ ] STATIONWAGEN [510] [ ] SEDAN [188] [ ] MPV [185] [ ] SUV/TERREINWAGEN [108] [X] CABRIOLET [71]
select?facet=true&fq={!tag=inner}aanbieder_id:1& fq={!tag=outer}carrosserievorm:CABRIOLET& facet.field={!key=cv ex=outer}carrosserievorm& facet.field={!key=cv_internal}carrosserievorm
=== CV_INTERNAL === [ ] HATCHBACK [0] [ ] STATIONWAGEN [0] [ ] SEDAN [0] [ ] MPV [0] [ ] SUV/TERREINWAGEN [0] [ ] COUPE [0] [ ] BEDRIJFSWAGEN [0] [X] CABRIOLET [26] [ ] PERSONENBUS [0]
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
WHAT IS THE EXPECTED RESULT?
!
=== CV === [ ] HATCHBACK [0] [ ] STATIONWAGEN [0] [ ] SEDAN [0] [ ] MPV [0] [ ] SUV/TERREINWAGEN [0] [X] CABRIOLET [26]
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
INTERESTED IN SOME STATISTICS?
1 index, 2 web applications
!
250.000 cars, over 200 attributes per car
!
Only ~ 500Mb search index, easily run in memory
imap://[email protected]:993/fetch%3EUID%3E/INBOX%3E13466?part=1.2&type=image/jpeg&filename=foto.JPG
Thank you!
Jeroen van Dijk | jrvandijk | joind.in/10916