document indexing and scoring in solr ist 441 spring 2010 instructor: dr. c. lee giles presented by:...

24
Document Indexing and Document Indexing and Scoring in Solr Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian Khabsa

Upload: esther-johnston

Post on 04-Jan-2016

224 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

Document Indexing and Scoring Document Indexing and Scoring in Solrin Solr

IST 441 Spring 2010Instructor: Dr. C. Lee GilesPresented by: Sujatha DasSlides courtesy: Saurabh Kataria/Madian Khabsa

Page 2: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

OutlineOutlineBrief overview of Solr

ArchitectureIndexing Scoring/Searching

2

Page 3: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

Features of SolrFeatures of SolrBuilt on top of LuceneXML/HTTP interfacesSchema to define types and

fieldsWeb adminstration interfaceCaching/replication/extensibility

Spring 2006 3

Page 4: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

YouSeer ArchitectureYouSeer Architecture

Spring 2008 4

File System

WWW

FS Crawler

Crawl(Heritrix)

PDFHTMLDOCTXT…

TXTparser

PDFparser

HTMLparser

SolrDocu-ments

StopAnalyzer

YourAnalyzer

StandardAnalyzer

indexer

indexer

Index

sear

cher

sear

cher

Crawling(Heritrix) ParsingIndexing/Searching(Solr)

Searching

YouSeer

Page 5: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

Lucene’s index (conceptual)Lucene’s index (conceptual)

Spring 2008 5

Index

Document

Document

Document

Document

Field

Field

Field

Field

Field

Name Value

Page 6: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

Solr documentsSolr documents

Spring 2008 6

Solr accepts well formatted XML documents

<add> <doc>

◦ <field name=“URL">www.cnn.com</field> ◦ <field name=“title">CNN Breaking News –

Obama wins</field>◦ <field name=“content">Barack Obama is the

44th president of the USA</field>◦ <field name=“pubDate">2008-11-

06T23:59:59.999Z</field> </doc> </add>

Page 7: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

Adding documents via Adding documents via YouSeerYouSeer

Spring 2006 7

To index documents crawled by heritrix:◦Navigate to ~/youseer/release◦Run: java –jar YouSeer.jar

http://localhost:90XX/solr/update /absolute/path/to/arc/files /cachingDirectory 1 0 Solr URL The full path to the ARC files The virtual directory which maps to the cached

files Number of threads, please keep it 1 Waiting Time between retries

Page 8: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

YouSeer workflowYouSeer workflow

Spring 2006 8

Iterates on the compressed files (ARC files), and process the documents

Extract the textual content of the document, and parse metadata (from heritrix)

Generate an XML file as outputThis XML file is submitted to the index

via the url◦http://localhost:90XX/solr/update

Page 9: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

Analyzers specify how the text in a field is to Analyzers specify how the text in a field is to be indexedbe indexed

◦ Options in Lucene WhitespaceAnalyzer

divides text at whitespace SimpleAnalyzer

divides text at non-letters convert to lower case

StopAnalyzer SimpleAnalyzer removes stop words

StandardAnalyzer good for most European Languages removes stop words convert to lower case

Create you own Analyzers

Spring 20089

Page 10: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

An example of analyzing a document

Spring 2008 10

Page 11: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

Analyzers are specified in Analyzers are specified in schema.xmlschema.xml <fieldType name="integer" class="solr.IntField" omitNorms="true"/>

<fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>

<fieldType name="textTight" class="solr.TextField" positionIncrementGap="100" >

<analyzer>

<tokenizer class="solr.WhitespaceTokenizerFactory"/>

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>

<filter class="solr.LowerCaseFilterFactory"/>

<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>

<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

</analyzer>

</fieldType>

Spring 2006 11

Page 12: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

What is an index comprised of ?What is an index comprised of ?

Spring 2008 12

Inverted Index (Inverted File)

Doc 1:

Penn State Football …

football

Doc 2:

Football players … State

Postingid

word doc offset

1 football Doc 1 3

Doc 1 67

Doc 2 1

2 penn Doc 1 1

3 players Doc 2 2

4 state Doc 1 2

Doc 2 13

PostingTable

Page 13: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

Postings tablePostings table

Postings table is a fast look-up mechanism ◦ Key: word◦ Value: posting id, satellite data (#df, offset, …)

Lucene implements the posting table with Java’s hash table◦ Objectified from java.util.Hashtable◦ Hash function depends on the JVM

hc2 = hc1 * 31 + nextChar Postings table usage

◦ Indexing: insertion (new terms), update (existing terms)

◦ Searching: lookup, and construct document vector

Spring 2008 13

Page 14: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

Lucene Index Files: Field infos file (.fnm)

Spring 2008 14

Format: FieldsCount, <FieldName, FieldBits>FieldsCount the number of fields in the indexFieldName the name of the field in a stringFieldBits a byte and an int where the lowest

bit of the byte shows whether the field is indexed, and the int is the id of the term

1, <content, 0x01>

Page 15: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

Lucene Index Files: Term Dictionary file (.tis)

Spring 2008 15

Format: TermCount, TermInfosTermInfos <Term, DocFreq>Term <PrefixLength, Suffix, FieldNum>

This file is sorted by Term. Terms are ordered first lexicographically by the term's field name, and within that lexicographically by the term's textTermCount the number of terms in the documentsTerm Term text prefixes are shared. The PrefixLength is the

number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".

FieldNumber the term's field, whose name is stored in the .fnm file

4,<<0,football,1>,2> <<0,penn,1>, 1> <<1,layers,1>,1> <<0,state,1>,2>

Document Frequency can be obtained from this file.

Page 16: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

Lucene Index Files: Term Info index (.tii)

Spring 2008 16

Format: IndexTermCount, IndexInterval, TermIndicesTermIndices <TermInfo, IndexDelta>

This contains every IndexInterval th entry from the .tis file, along with its location in the "tis" file. This is designed to be read entirely into memory and used to provide random access to the "tis" file.IndexDelta determines the position of this term's

TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry.4,<football,1> <penn,3><layers,2> <state,1>

Page 17: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

Lucene Index Files: Frequency file (.frq)

Spring 2008 17

Format: <TermFreqs>TermFreqs TermFreqs TermFreq DocDelta, Freq?

TermFreqs are ordered by term (the term is implicit, from the .tis file).TermFreq entries are ordered by increasing document number.DocDelta determines both the document number and the

frequency. In particular, DocDelta/2 is the difference between this document number and the previous document number (or zero when this is the first document in a TermFreqs). When DocDelta is odd, the frequency is one. When DocDelta is even, the frequency is read as the next Int.

For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven would be the following sequence of Ints: 15, 8, 3

<<2, 2, 3> <3> <5> <3, 3>>Term Frequency can be obtained from this file.

Page 18: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

Lucene Index Files: Position file (.prx)

Spring 2008 18

Format: <TermPositions>TermPositions <Positions> Positions <PositionDelta >

TermPositions are ordered by term (the term is implicit, from the .tis file).Positions entries are ordered by increasing document number (the document number is implicit from the .frq file).PositionDelta the difference between the position of the current

occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document).

For example, the TermPositions for a term which occurs as the fourth term in one document, and as the fifth and ninth term in a subsequent document, would be the following sequence of Ints: 4, 5, 4

<<3, 64> <1>> <<1> <0>> <<0> <2>> <<2> <13>>

Page 19: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

Query ProcessingQuery Processing

Spring 2008 19

Query

Term Dictionary(Random file

access)

Term Info Index

(in Memory)

Constant time

Constant time

Frequency File(Random file

access)

Con

stan

t tim

e

Position File(Random file

access)

Constant time

Field info(in Memory)

Constant time

Page 20: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

Searching solr indexSearching solr index

Spring 2008 20

Types of query:◦ Boolean

◦ [IST441 Giles] [IST441 OR Giles][java AND NOT SUN]

◦ wildcard◦ [nu?ch] [nutc*]

◦ phrase◦ [“JAVA TOMCAT”]

◦ Proximity◦ [“lucene nutch” ~10]

◦ Fuzzy◦ [roam~] matches roams and foam

◦ Field:value◦ …

Page 21: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

Scoring Function is specified in schema.xmlScoring Function is specified in schema.xmlSimilarity

score(Q,D)   =   coord(Q,D)  · queryNorm(Q)  

·  ∑ t in Q ( tf(t in D)  ·  idf(t)2  

·  t.getBoost() · norm(D) )

Question:◦ What type of IR model does Lucene use?

factors◦ term-based factors

tf(t in D) : term frequency of term t in document d default implementation

idf(t): inverse document frequency of term t in the entire corpus default implementation

Spring 2008 21

1)]1/(ln[ ++docFreqNDocs

requencyraw term f

Page 22: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

Default Scoring FunctionDefault Scoring Function

Spring 2008 22

•coord(Q,D) = overlap between Q and D / maximum overlapMaximum overlap is the maximum possible length of overlap between

Q and D

•queryNorm(Q) = 1/sum of square weight½ sum of square weight = q.getBoost()2 · ∑ t in Q ( idf(t) · t.getBoost() )2

If t.getBoost() = 1, q.getBoost() = 1 Then, sum of square weight = ∑ t in Q ( idf(t) )2

thus, queryNorm(Q) = 1/(∑ t in Q ( idf(t) )2) ½

• norm(D) = 1/number of terms½ (This is the normalization by the total number of terms in a document. Number of terms is the total number of terms appeared in a document D.)

Page 23: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

Example:Example: D1: hello, please say hello to him. D2: say goodbye Q: you say hello

◦ coord(Q, D) = overlap between Q and D / maximum overlap coord(Q, D1) = 2/3, coord(Q, D2) = 1/2,

◦ queryNorm(Q) = 1/sum of square weight½ sum of square weight = q.getBoost()2 · ∑ t in Q ( idf(t) · t.getBoost() )2 t.getBoost() = 1, q.getBoost() = 1 sum of square weight = ∑ t in Q ( idf(t) )2 queryNorm(Q) = 1/(0.59452+12) ½ =0.8596

◦ tf(t in d) = frequency½ tf(you,D1) = 0, tf(say,D1) = 1, tf(hello,D1) = 2½ =1.4142 tf(you,D2) = 0, tf(say,D2) = 1, tf(hello,D2) = 0

◦ idf(t) = ln (N/(nj+1)) + 1 idf(you) = 0, idf(say) = ln(2/(2+1)) + 1 = 0.5945, idf(hello) = ln(2/(1+1)) +1 = 1

◦ norm(D) = 1/number of terms½

norm(D1) = 1/6½ =0.4082, norm(D2) = 1/2½ =0.7071

◦ Score(Q, D1) = 2/3*0.8596*(1*0.59452+1.4142*12)*0.4082=0.4135◦ Score(Q, D2) = 1/2*0.8596*(1*0.59452)*0.7071=0.1074

Spring 2008 23

Page 24: Document Indexing and Scoring in Solr IST 441 Spring 2010 Instructor: Dr. C. Lee Giles Presented by: Sujatha Das Slides courtesy: Saurabh Kataria/Madian

ReferencesReferenceshttp://youseer.sourceforge.net/doc/Tutor

ial.pdfhttp://lucene.apache.org/solr/http://lucene.apache.org/java/2_4_0/sco

ring.html

Spring 2006 24