document indexing and scoring in solr ist 441 spring 2010 instructor: dr. c. lee giles presented by:...

Document Indexing and Scoring Document Indexing and Scoring in Solrin Solr

IST 441 Spring 2010Instructor: Dr. C. Lee GilesPresented by: Sujatha DasSlides courtesy: Saurabh Kataria/Madian Khabsa

OutlineOutlineBrief overview of Solr

ArchitectureIndexing Scoring/Searching

2

Features of SolrFeatures of SolrBuilt on top of LuceneXML/HTTP interfacesSchema to define types and

fieldsWeb adminstration interfaceCaching/replication/extensibility

Spring 2006 3

YouSeer ArchitectureYouSeer Architecture

Spring 2008 4

File System

WWW

FS Crawler

Crawl(Heritrix)

PDFHTMLDOCTXT…

TXTparser

PDFparser

HTMLparser

SolrDocu-ments

StopAnalyzer

YourAnalyzer

StandardAnalyzer

indexer

indexer

Index

sear

cher

sear

cher

Crawling(Heritrix) ParsingIndexing/Searching(Solr)

Searching

YouSeer

Lucene’s index (conceptual)Lucene’s index (conceptual)

Spring 2008 5

Index

Document

Document

Document

Document

Field

Field

Field

Field

Field

Name Value

Solr documentsSolr documents

Spring 2008 6

Solr accepts well formatted XML documents

<add> <doc>

◦ <field name=“URL">www.cnn.com</field> ◦ <field name=“title">CNN Breaking News –

Obama wins</field>◦ <field name=“content">Barack Obama is the

44th president of the USA</field>◦ <field name=“pubDate">2008-11-

06T23:59:59.999Z</field> </doc> </add>

Adding documents via Adding documents via YouSeerYouSeer

Spring 2006 7

To index documents crawled by heritrix:◦Navigate to ~/youseer/release◦Run: java –jar YouSeer.jar

http://localhost:90XX/solr/update /absolute/path/to/arc/files /cachingDirectory 1 0 Solr URL The full path to the ARC files The virtual directory which maps to the cached

files Number of threads, please keep it 1 Waiting Time between retries

YouSeer workflowYouSeer workflow

Spring 2006 8

Iterates on the compressed files (ARC files), and process the documents

Extract the textual content of the document, and parse metadata (from heritrix)

Generate an XML file as outputThis XML file is submitted to the index

via the url◦http://localhost:90XX/solr/update

Analyzers specify how the text in a field is to Analyzers specify how the text in a field is to be indexedbe indexed

◦ Options in Lucene WhitespaceAnalyzer

divides text at whitespace SimpleAnalyzer

divides text at non-letters convert to lower case

StopAnalyzer SimpleAnalyzer removes stop words

StandardAnalyzer good for most European Languages removes stop words convert to lower case

Create you own Analyzers

Spring 20089

An example of analyzing a document

Spring 2008 10

Analyzers are specified in Analyzers are specified in schema.xmlschema.xml <fieldType name="integer" class="solr.IntField" omitNorms="true"/>

<fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>

<fieldType name="textTight" class="solr.TextField" positionIncrementGap="100" >

<analyzer>

<tokenizer class="solr.WhitespaceTokenizerFactory"/>

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>

<filter class="solr.LowerCaseFilterFactory"/>

<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>

<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

</analyzer>

</fieldType>

Spring 2006 11

What is an index comprised of ?What is an index comprised of ?

Spring 2008 12

Inverted Index (Inverted File)

Doc 1:

Penn State Football …

football

Doc 2:

Football players … State

Postingid

word doc offset

1 football Doc 1 3

Doc 1 67

Doc 2 1

2 penn Doc 1 1

3 players Doc 2 2

4 state Doc 1 2

Doc 2 13

PostingTable

Postings tablePostings table

Postings table is a fast look-up mechanism ◦ Key: word◦ Value: posting id, satellite data (#df, offset, …)

Lucene implements the posting table with Java’s hash table◦ Objectified from java.util.Hashtable◦ Hash function depends on the JVM

hc2 = hc1 * 31 + nextChar Postings table usage

◦ Indexing: insertion (new terms), update (existing terms)

◦ Searching: lookup, and construct document vector

Spring 2008 13

Lucene Index Files: Field infos file (.fnm)

Spring 2008 14

Format: FieldsCount, <FieldName, FieldBits>FieldsCount the number of fields in the indexFieldName the name of the field in a stringFieldBits a byte and an int where the lowest

bit of the byte shows whether the field is indexed, and the int is the id of the term

1, <content, 0x01>

Lucene Index Files: Term Dictionary file (.tis)

Spring 2008 15

Format: TermCount, TermInfosTermInfos <Term, DocFreq>Term <PrefixLength, Suffix, FieldNum>

This file is sorted by Term. Terms are ordered first lexicographically by the term's field name, and within that lexicographically by the term's textTermCount the number of terms in the documentsTerm Term text prefixes are shared. The PrefixLength is the

number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".

FieldNumber the term's field, whose name is stored in the .fnm file

4,<<0,football,1>,2> <<0,penn,1>, 1> <<1,layers,1>,1> <<0,state,1>,2>

Document Frequency can be obtained from this file.

Lucene Index Files: Term Info index (.tii)

Spring 2008 16

Format: IndexTermCount, IndexInterval, TermIndicesTermIndices <TermInfo, IndexDelta>

This contains every IndexInterval th entry from the .tis file, along with its location in the "tis" file. This is designed to be read entirely into memory and used to provide random access to the "tis" file.IndexDelta determines the position of this term's

TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry.4,<football,1> <penn,3><layers,2> <state,1>

Lucene Index Files: Frequency file (.frq)

Spring 2008 17

Format: <TermFreqs>TermFreqs TermFreqs TermFreq DocDelta, Freq?

TermFreqs are ordered by term (the term is implicit, from the .tis file).TermFreq entries are ordered by increasing document number.DocDelta determines both the document number and the

frequency. In particular, DocDelta/2 is the difference between this document number and the previous document number (or zero when this is the first document in a TermFreqs). When DocDelta is odd, the frequency is one. When DocDelta is even, the frequency is read as the next Int.

For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven would be the following sequence of Ints: 15, 8, 3

<<2, 2, 3> <3> <5> <3, 3>>Term Frequency can be obtained from this file.

Lucene Index Files: Position file (.prx)

Spring 2008 18

Format: <TermPositions>TermPositions <Positions> Positions <PositionDelta >

TermPositions are ordered by term (the term is implicit, from the .tis file).Positions entries are ordered by increasing document number (the document number is implicit from the .frq file).PositionDelta the difference between the position of the current

occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document).

For example, the TermPositions for a term which occurs as the fourth term in one document, and as the fifth and ninth term in a subsequent document, would be the following sequence of Ints: 4, 5, 4

<<3, 64> <1>> <<1> <0>> <<0> <2>> <<2> <13>>

Query ProcessingQuery Processing

Spring 2008 19

Query

Term Dictionary(Random file

access)

Term Info Index

(in Memory)

Constant time

Constant time

Frequency File(Random file

access)

Con

stan

t tim

e

Position File(Random file

access)

Constant time

Field info(in Memory)

Constant time

Searching solr indexSearching solr index

Spring 2008 20

Types of query:◦ Boolean

◦ [IST441 Giles] [IST441 OR Giles][java AND NOT SUN]

◦ wildcard◦ [nu?ch] [nutc*]

◦ phrase◦ [“JAVA TOMCAT”]

◦ Proximity◦ [“lucene nutch” ~10]

◦ Fuzzy◦ [roam~] matches roams and foam

◦ Field:value◦ …

Scoring Function is specified in schema.xmlScoring Function is specified in schema.xmlSimilarity

score(Q,D) = coord(Q,D) · queryNorm(Q)

· ∑ t in Q ( tf(t in D) · idf(t)2

· t.getBoost() · norm(D) )

Question:◦ What type of IR model does Lucene use?

factors◦ term-based factors

tf(t in D) : term frequency of term t in document d default implementation

idf(t): inverse document frequency of term t in the entire corpus default implementation

Spring 2008 21

1)]1/(ln[ ++docFreqNDocs

requencyraw term f

Default Scoring FunctionDefault Scoring Function

Spring 2008 22

•coord(Q,D) = overlap between Q and D / maximum overlapMaximum overlap is the maximum possible length of overlap between

Q and D

•queryNorm(Q) = 1/sum of square weight½ sum of square weight = q.getBoost()2 · ∑ t in Q ( idf(t) · t.getBoost() )2

If t.getBoost() = 1, q.getBoost() = 1 Then, sum of square weight = ∑ t in Q ( idf(t) )2

thus, queryNorm(Q) = 1/(∑ t in Q ( idf(t) )2) ½

• norm(D) = 1/number of terms½ (This is the normalization by the total number of terms in a document. Number of terms is the total number of terms appeared in a document D.)

Example:Example: D1: hello, please say hello to him. D2: say goodbye Q: you say hello

◦ coord(Q, D) = overlap between Q and D / maximum overlap coord(Q, D1) = 2/3, coord(Q, D2) = 1/2,

◦ queryNorm(Q) = 1/sum of square weight½ sum of square weight = q.getBoost()2 · ∑ t in Q ( idf(t) · t.getBoost() )2 t.getBoost() = 1, q.getBoost() = 1 sum of square weight = ∑ t in Q ( idf(t) )2 queryNorm(Q) = 1/(0.59452+12) ½ =0.8596

◦ tf(t in d) = frequency½ tf(you,D1) = 0, tf(say,D1) = 1, tf(hello,D1) = 2½ =1.4142 tf(you,D2) = 0, tf(say,D2) = 1, tf(hello,D2) = 0

◦ idf(t) = ln (N/(nj+1)) + 1 idf(you) = 0, idf(say) = ln(2/(2+1)) + 1 = 0.5945, idf(hello) = ln(2/(1+1)) +1 = 1

◦ norm(D) = 1/number of terms½

norm(D1) = 1/6½ =0.4082, norm(D2) = 1/2½ =0.7071

◦ Score(Q, D1) = 2/3*0.8596*(1*0.59452+1.4142*12)*0.4082=0.4135◦ Score(Q, D2) = 1/2*0.8596*(1*0.59452)*0.7071=0.1074

Spring 2008 23

ReferencesReferenceshttp://youseer.sourceforge.net/doc/Tutor

ial.pdfhttp://lucene.apache.org/solr/http://lucene.apache.org/java/2_4_0/sco

ring.html

Spring 2006 24

document indexing and scoring in solr ist 441 spring 2010 instructor: dr. c. lee giles presented by:...

Documents

tis spring

fnm spring

z spring

lucene index files

inverted index

field infos file

outputthis xml file

term dictionary file