Document Indexing and Scoring in Solr
IST 441 Spring 2010
Instructor: Dr. C. Lee Giles
Presented by: Sujatha Das
Slides courtesy: Saurabh Kataria/Madian Khabsa
Outline
◦ Brief overview of Solr
◦ Architecture
◦ Indexing
◦ Scoring/Searching
Features of Solr
◦ Built on top of Lucene
◦ XML/HTTP interfaces
◦ Schema to define types and fields
◦ Web administration interface
◦ Caching/replication/extensibility
YouSeer Architecture
[Figure: YouSeer pipeline in three stages. Crawling (Heritrix for the WWW, FS Crawler for the file system) yields PDF, HTML, DOC, TXT, … files. Parsing (TXT, PDF, and HTML parsers) turns them into Solr documents. Indexing/Searching (Solr) passes them through an analyzer (StopAnalyzer, StandardAnalyzer, or YourAnalyzer) and the indexer into the Index, which the searcher queries on behalf of the YouSeer search interface.]
Lucene's index (conceptual)
[Figure: an Index contains Documents; each Document contains Fields; each Field is a Name/Value pair.]
Solr documents
Solr accepts well-formed XML documents:
<add>
  <doc>
    <field name="URL">www.cnn.com</field>
    <field name="title">CNN Breaking News - Obama wins</field>
    <field name="content">Barack Obama is the 44th president of the USA</field>
    <field name="pubDate">2008-11-06T23:59:59.999Z</field>
  </doc>
</add>
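An add message like the one above can also be generated programmatically rather than written by hand. The following is a minimal Python sketch, not part of YouSeer or Solr; the function name build_add_xml is hypothetical, and only the standard library is used so that special characters in field values are escaped correctly.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Return an <add><doc>...</doc></add> message as a string.

    `fields` is a list of (name, value) pairs; the field names must
    match fields declared in the Solr schema (assumed here).
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value  # ElementTree escapes &, <, > on output
    return ET.tostring(add, encoding="unicode")


example = build_add_xml([
    ("URL", "www.cnn.com"),
    ("title", "CNN Breaking News - Obama wins"),
    ("content", "Barack Obama is the 44th president of the USA"),
    ("pubDate", "2008-11-06T23:59:59.999Z"),
])
print(example)
```

Building the message through an XML library instead of string concatenation keeps the document well-formed even when a title or body contains characters such as & or <.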
Adding documents via YouSeer
To index documents crawled by Heritrix:
◦ Navigate to ~/youseer/release
◦ Run: java -jar YouSeer.jar http://localhost:90XX/solr/update /absolute/path/to/arc/files /cachingDirectory 1 0
The arguments, in order:
◦ http://localhost:90XX/solr/update: the Solr URL
◦ /absolute/path/to/arc/files: the full path to the ARC files
◦ /cachingDirectory: the virtual directory which maps to the cached files
◦ 1: number of threads (please keep it at 1)
◦ 0: waiting time between retries
YouSeer workflow
◦ Iterates over the compressed files (ARC files) and processes the documents
◦ Extracts the textual content of each document and parses the metadata (from Heritrix)
◦ Generates an XML file as output
◦ Submits this XML file to the index via the URL http://localhost:90XX/solr/update
Analyzers specify how the text in a field is to be indexed
• Options in Lucene
  ◦ WhitespaceAnalyzer: divides text at whitespace
  ◦ SimpleAnalyzer: divides text at non-letters; converts to lower case
  ◦ StopAnalyzer: SimpleAnalyzer, plus removes stop words
  ◦ StandardAnalyzer: good for most European languages; removes stop words; converts to lower case
• Create your own Analyzers
An example of analyzing a document
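As an illustration of the analyzer options listed earlier, here is a minimal Python sketch that imitates three of them on a short string. It is not Lucene code: the function names and the STOP_WORDS set are assumptions made for this example, and StandardAnalyzer is left out because its grammar-based tokenizer is not easily reduced to a few lines.

```python
import re

# Hypothetical stop-word list for illustration; Lucene ships its own list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in"}


def whitespace_analyzer(text):
    """Divide text at whitespace; keep case and punctuation."""
    return text.split()


def simple_analyzer(text):
    """Divide text at non-letters and convert to lower case."""
    # [^\W\d_]+ matches runs of letters only (no digits, no underscore).
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def stop_analyzer(text):
    """SimpleAnalyzer behaviour, then remove stop words."""
    return [token for token in simple_analyzer(text) if token not in STOP_WORDS]


sample = "The Penn-State Football team"
print(whitespace_analyzer(sample))  # ['The', 'Penn-State', 'Football', 'team']
print(simple_analyzer(sample))      # ['the', 'penn', 'state', 'football', 'team']
print(stop_analyzer(sample))        # ['penn', 'state', 'football', 'team']
```

The same input yields a different set of index terms under each analyzer, which is why the analyzer used at index time should match the one used at query time.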
Analyzers are specified in schema.xml
<fieldType name="integer" class="solr.IntField" omitNorms="true"/>
<fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="textTight" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
What is an index comprised of?
Inverted Index (Inverted File)
Doc 1: Penn State Football … football
Doc 2: Football players … State

Posting table:
Posting id  word      doc    offset
1           football  Doc 1  3
                      Doc 1  67
                      Doc 2  1
2           penn      Doc 1  1
3           players   Doc 2  2
4           state     Doc 1  2
                      Doc 2  13
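The posting table above can be reproduced in miniature. The sketch below is a plain-Python illustration, not Lucene's implementation; build_inverted_index is a hypothetical helper, and the two documents are shortened to three words each, so the offsets 67 and 13 from the slide do not appear.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lower-cased word to a list of (doc_id, position) postings.

    Positions are 1-based word offsets, as in the posting table.
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].append((doc_id, position))
    # Sort by word so the result reads like a term dictionary.
    return dict(sorted(index.items()))


docs = {
    "Doc 1": "Penn State Football",
    "Doc 2": "Football players State",
}
index = build_inverted_index(docs)
for word, postings in index.items():
    print(word, postings)
```

A query for a word then becomes a single dictionary lookup that returns every document and offset where the word occurs.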
Postings table
• The postings table is a fast look-up mechanism
  ◦ Key: word
  ◦ Value: posting id, satellite data (#df, offset, …)
• Lucene implements the postings table with Java's hash table
  ◦ Objectified from java.util.Hashtable
  ◦ Hash function depends on the JVM: hc2 = hc1 * 31 + nextChar
• Postings table usage
  ◦ Indexing: insertion (new terms), update (existing terms)
  ◦ Searching: lookup, and construct document vector
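The recurrence hc2 = hc1 * 31 + nextChar is the one documented for java.lang.String.hashCode. A minimal Python sketch of it follows; the function name is hypothetical, and the 32-bit wrap-around is added by hand because Python integers do not overflow the way a Java int does. Characters outside the Basic Multilingual Plane are not handled the way Java's UTF-16 strings would handle them.

```python
def java_string_hash(text):
    """Compute h = h * 31 + char over the string, as a signed 32-bit int."""
    h = 0
    for ch in text:
        # Keep only the low 32 bits, mimicking Java int overflow.
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as signed.
    return h - (1 << 32) if h >= (1 << 31) else h


print(java_string_hash("ab"))     # 97 * 31 + 98 = 3105
print(java_string_hash("hello"))  # 99162322
```

A hash table then uses this integer (reduced modulo the table size) to choose the bucket in which the term's posting entry is stored.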
Lucene Index Files: Field infos file (.fnm)
Format: FieldsCount, <FieldName, FieldBits>
◦ FieldsCount: the number of fields in the index
◦ FieldName: the name of the field in a string
◦ FieldBits: a byte and an int, where the lowest bit of the byte shows whether the field is indexed, and the int is the id of the term
Example: 1, <content, 0x01>
Lucene Index Files: Term Dictionary file (.tis)
Format: TermCount, TermInfos
◦ TermInfos: <Term, DocFreq>
◦ Term: <PrefixLength, Suffix, FieldNum>
This file is sorted by Term. Terms are ordered first lexicographically by the term's field name, and within that lexicographically by the term's text.
◦ TermCount: the number of terms in the documents
◦ Term: term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be prepended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
◦ FieldNum: the term's field, whose name is stored in the .fnm file
Example: 4, <<0,football,1>,2> <<0,penn,1>,1> <<1,layers,1>,1> <<0,state,1>,2>
Document frequency can be obtained from this file.
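The shared-prefix encoding of term text can be sketched in a few lines of Python. This illustrates only the PrefixLength/Suffix idea; it does not read or write the actual .tis byte layout, and the function names are hypothetical.

```python
def prefix_encode(sorted_terms):
    """Encode each term as (prefix_length, suffix) relative to the previous term."""
    encoded = []
    previous = ""
    for term in sorted_terms:
        shared = 0
        limit = min(len(previous), len(term))
        while shared < limit and previous[shared] == term[shared]:
            shared += 1
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded


def prefix_decode(encoded):
    """Rebuild the full terms from (prefix_length, suffix) pairs."""
    terms = []
    previous = ""
    for prefix_length, suffix in encoded:
        term = previous[:prefix_length] + suffix
        terms.append(term)
        previous = term
    return terms


print(prefix_encode(["bone", "boy"]))
# [(0, 'bone'), (2, 'y')]
print(prefix_encode(["football", "penn", "players", "state"]))
# [(0, 'football'), (0, 'penn'), (1, 'layers'), (0, 'state')]
```

The second call reproduces the PrefixLength and Suffix values of the four-term example: "players" shares only the initial "p" with "penn", so it is stored as prefix length 1 and suffix "layers".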
Lucene Index Files: Term Info index (.tii)
Format: IndexTermCount, IndexInterval, TermIndices
◦ TermIndices: <TermInfo, IndexDelta>
This contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. It is designed to be read entirely into memory and used to provide random access to the .tis file.
◦ IndexDelta: determines the position of this term's TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry.
Example: 4, <football,1> <penn,3> <layers,2> <state,1>
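To show why keeping only every IndexInterval-th term in memory is enough for fast lookup, here is a minimal Python sketch. It is an illustration under simplifying assumptions: a sorted Python list stands in for the on-disk .tis file, list positions stand in for file offsets, and both function names are hypothetical.

```python
import bisect


def build_sparse_index(sorted_terms, interval):
    """Keep every interval-th term together with its position in the full list."""
    return [(sorted_terms[i], i) for i in range(0, len(sorted_terms), interval)]


def lookup(term, sorted_terms, sparse_index, interval):
    """Return the position of term in sorted_terms, or None if absent."""
    keys = [indexed_term for indexed_term, _ in sparse_index]
    # Binary search in memory for the last indexed term <= the query term.
    slot = bisect.bisect_right(keys, term) - 1
    if slot < 0:
        return None
    start = sparse_index[slot][1]
    # Scan at most `interval` entries of the full dictionary from there.
    for i in range(start, min(start + interval, len(sorted_terms))):
        if sorted_terms[i] == term:
            return i
    return None


terms = ["bone", "boy", "football", "penn", "players", "state"]
sparse = build_sparse_index(terms, interval=2)
print(sparse)                            # [('bone', 0), ('football', 2), ('players', 4)]
print(lookup("penn", terms, sparse, 2))  # 3
```

The in-memory structure holds only a fraction of the terms, and each lookup touches a bounded stretch of the full dictionary, which is the trade-off the IndexInterval parameter controls.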
Lucene Index Files: Frequency file (.frq)
Format: <TermFreqs>
◦ TermFreqs: <TermFreq>
◦ TermFreq: DocDelta, Freq?
TermFreqs are ordered by term (the term is implicit, from the .tis file). TermFreq entries are ordered by increasing document number.
◦ DocDelta: determines both the document number and the frequency. In particular, DocDelta/2 is the difference between this document number and the previous document number (or zero when this is the first document in a TermFreqs). When DocDelta is odd, the frequency is one. When DocDelta is even, the frequency is read as the next Int.
For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven would be the following sequence of Ints: 15, 8, 3
Example: <<2, 2, 3> <3> <5> <3, 3>>
Term frequency can be obtained from this file.
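The DocDelta rule can be checked with a small encoder and decoder. This Python sketch works on plain lists of integers and ignores the variable-length byte encoding used on disk; the function names are hypothetical.

```python
def encode_term_freqs(postings):
    """Encode [(doc_number, frequency), ...] sorted by doc number into Ints."""
    out = []
    previous_doc = 0
    for doc, freq in postings:
        delta = doc - previous_doc
        if freq == 1:
            out.append(2 * delta + 1)      # odd DocDelta: frequency is one
        else:
            out.extend([2 * delta, freq])  # even DocDelta: frequency follows
        previous_doc = doc
    return out


def decode_term_freqs(ints):
    """Invert encode_term_freqs."""
    postings = []
    doc = 0
    i = 0
    while i < len(ints):
        doc_delta = ints[i]
        i += 1
        doc += doc_delta // 2
        if doc_delta % 2 == 1:
            freq = 1
        else:
            freq = ints[i]
            i += 1
        postings.append((doc, freq))
    return postings


print(encode_term_freqs([(7, 1), (11, 3)]))  # [15, 8, 3]
print(decode_term_freqs([15, 8, 3]))         # [(7, 1), (11, 3)]
```

Folding the "frequency is one" case into the low bit of DocDelta saves an integer for every term that occurs once in a document, which is the common case.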
Lucene Index Files: Position file (.prx)
Format: <TermPositions>
◦ TermPositions: <Positions>
◦ Positions: <PositionDelta>
TermPositions are ordered by term (the term is implicit, from the .tis file). Positions entries are ordered by increasing document number (the document number is implicit from the .frq file).
◦ PositionDelta: the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document).
For example, the TermPositions for a term which occurs as the fourth term in one document, and as the fifth and ninth term in a subsequent document, would be the following sequence of Ints: 4, 5, 4
Example: <<3, 64> <1>> <<1> <0>> <<0> <2>> <<2> <13>>
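The position deltas can be sketched the same way. In this Python illustration (hypothetical function names, plain integer lists rather than the on-disk encoding), decoding needs the per-document frequencies, which is why the document numbers and counts are said to be implicit from the .frq file.

```python
def encode_positions(positions_per_doc):
    """Encode a list of per-document position lists as PositionDeltas."""
    out = []
    for positions in positions_per_doc:
        previous = 0  # deltas restart at zero for every document
        for position in positions:
            out.append(position - previous)
            previous = position
    return out


def decode_positions(ints, freqs):
    """Rebuild per-document positions; freqs gives the occurrences per document."""
    values = iter(ints)
    result = []
    for freq in freqs:
        position = 0
        doc_positions = []
        for _ in range(freq):
            position += next(values)
            doc_positions.append(position)
        result.append(doc_positions)
    return result


print(encode_positions([[4], [5, 9]]))    # [4, 5, 4]
print(decode_positions([4, 5, 4], [1, 2]))  # [[4], [5, 9]]
```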
Query Processing
[Figure: a Query is resolved against the Term Info Index (in memory, constant time), the Term Dictionary (random file access, constant time), the Frequency File (random file access, constant time), the Position File (random file access, constant time), and the Field info (in memory, constant time).]
Searching Solr index
Types of query:
◦ Boolean: [IST441 Giles] [IST441 OR Giles] [java AND NOT SUN]
◦ Wildcard: [nu?ch] [nutc*]
◦ Phrase: ["JAVA TOMCAT"]
◦ Proximity: ["lucene nutch"~10]
◦ Fuzzy: [roam~] matches roams and foam
◦ Field:value
◦ …
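Two of the query types above, wildcard and fuzzy, can be illustrated with a short Python sketch. This is not how Lucene evaluates them internally: the wildcard check borrows shell-style matching from the standard library (which happens to share the ? and * conventions), and the fuzzy check assumes that similarity is judged by edit distance, with the function names chosen for this example.

```python
from fnmatch import fnmatchcase


def wildcard_match(pattern, term):
    """? matches exactly one character, * matches any run of characters."""
    return fnmatchcase(term, pattern)


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(wildcard_match("nu?ch", "nutch"))  # True
print(wildcard_match("nutc*", "nutch"))  # True
print(edit_distance("roam", "roams"))    # 1
print(edit_distance("roam", "foam"))     # 1
```

Both "roams" and "foam" are one edit away from "roam", which is consistent with the fuzzy query [roam~] matching them.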
Scoring function is specified in schema.xml
Similarity
$\mathrm{score}(Q,D) = \mathrm{coord}(Q,D) \cdot \mathrm{queryNorm}(Q) \cdot \sum_{t \in Q} \big( \mathrm{tf}(t \in D) \cdot \mathrm{idf}(t)^2 \cdot \text{t.getBoost()} \cdot \mathrm{norm}(D) \big)$
Question:
◦ What type of IR model does Lucene use?
Factors
◦ Term-based factors
  tf(t in D): term frequency of term t in document D; default implementation: $\sqrt{\text{raw term frequency}}$
  idf(t): inverse document frequency of term t in the entire corpus; default implementation: $\ln[\mathrm{NDocs}/(\mathrm{docFreq}+1)] + 1$
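The two default term-based factors translate directly into code. The following Python sketch simply restates the formulas above; the function and parameter names are chosen for this example and are not Lucene's API.

```python
import math


def tf(raw_frequency):
    """Default term-frequency factor: square root of the raw count."""
    return math.sqrt(raw_frequency)


def idf(doc_freq, num_docs):
    """Default inverse document frequency: ln(NDocs / (docFreq + 1)) + 1."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0


print(round(tf(2), 4))      # 1.4142
print(round(idf(2, 2), 4))  # 0.5945
print(idf(1, 2))            # 1.0
```

The square root damps the effect of a term repeated many times in one document, and the +1 in the denominator keeps the logarithm defined when docFreq is zero.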
Default Scoring Function
• coord(Q,D) = overlap between Q and D / maximum overlap
  Maximum overlap is the maximum possible length of overlap between Q and D.
• $\mathrm{queryNorm}(Q) = 1/(\text{sum of square weight})^{1/2}$
  $\text{sum of square weight} = \text{q.getBoost()}^2 \cdot \sum_{t \in Q} (\mathrm{idf}(t) \cdot \text{t.getBoost()})^2$
  If t.getBoost() = 1 and q.getBoost() = 1, then $\text{sum of square weight} = \sum_{t \in Q} \mathrm{idf}(t)^2$,
  thus $\mathrm{queryNorm}(Q) = 1/\big(\sum_{t \in Q} \mathrm{idf}(t)^2\big)^{1/2}$
• $\mathrm{norm}(D) = 1/(\text{number of terms})^{1/2}$
  This is the normalization by the total number of terms in a document. Number of terms is the total number of terms that appear in document D.
Example:
D1: hello, please say hello to him.
D2: say goodbye
Q: you say hello
◦ coord(Q, D) = overlap between Q and D / maximum overlap
  coord(Q, D1) = 2/3, coord(Q, D2) = 1/2
◦ $\mathrm{queryNorm}(Q) = 1/(\text{sum of square weight})^{1/2}$
  $\text{sum of square weight} = \text{q.getBoost()}^2 \cdot \sum_{t \in Q} (\mathrm{idf}(t) \cdot \text{t.getBoost()})^2$
  t.getBoost() = 1, q.getBoost() = 1
  $\text{sum of square weight} = \sum_{t \in Q} \mathrm{idf}(t)^2$
  $\mathrm{queryNorm}(Q) = 1/(0.5945^2 + 1^2)^{1/2} = 0.8596$
◦ $\mathrm{tf}(t \in D) = \text{frequency}^{1/2}$
  tf(you, D1) = 0, tf(say, D1) = 1, tf(hello, D1) = $2^{1/2}$ = 1.4142
  tf(you, D2) = 0, tf(say, D2) = 1, tf(hello, D2) = 0
◦ $\mathrm{idf}(t) = \ln(N/(n_j + 1)) + 1$
  idf(you) = 0, idf(say) = ln(2/(2+1)) + 1 = 0.5945, idf(hello) = ln(2/(1+1)) + 1 = 1
◦ $\mathrm{norm}(D) = 1/(\text{number of terms})^{1/2}$
  norm(D1) = $1/6^{1/2}$ = 0.4082, norm(D2) = $1/2^{1/2}$ = 0.7071
◦ $\mathrm{Score}(Q, D1) = 2/3 \cdot 0.8596 \cdot (1 \cdot 0.5945^2 + 1.4142 \cdot 1^2) \cdot 0.4082 = 0.4135$
◦ $\mathrm{Score}(Q, D2) = 1/2 \cdot 0.8596 \cdot (1 \cdot 0.5945^2) \cdot 0.7071 = 0.1074$
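The worked example can be re-computed with a short Python sketch. It follows the conventions of the example rather than any particular Lucene release, and these are assumptions of the sketch: maximum overlap is taken as the smaller of the number of distinct query terms and distinct document terms, a query term that occurs in no document is given idf 0, all boosts are 1, and tokenization is lower-cased runs of letters. The function names are hypothetical.

```python
import math
import re


def tokenize(text):
    """Lower-case the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def score(query, doc, corpus):
    """score(Q, D) = coord * queryNorm * sum(tf * idf^2) * norm, all boosts = 1."""
    q_terms = sorted(set(tokenize(query)))
    d_terms = tokenize(doc)
    corpus_terms = [tokenize(d) for d in corpus]
    n_docs = len(corpus)

    def idf(term):
        doc_freq = sum(1 for terms in corpus_terms if term in terms)
        if doc_freq == 0:
            return 0.0  # convention of the worked example for unseen terms
        return math.log(n_docs / (doc_freq + 1)) + 1.0

    overlap = sum(1 for t in q_terms if t in d_terms)
    max_overlap = min(len(q_terms), len(set(d_terms)))  # example's definition
    coord = overlap / max_overlap
    query_norm = 1.0 / math.sqrt(sum(idf(t) ** 2 for t in q_terms))
    norm = 1.0 / math.sqrt(len(d_terms))
    weighted = sum(math.sqrt(d_terms.count(t)) * idf(t) ** 2 for t in q_terms)
    return coord * query_norm * weighted * norm


d1 = "hello, please say hello to him."
d2 = "say goodbye"
q = "you say hello"
score_d1 = score(q, d1, [d1, d2])
score_d2 = score(q, d2, [d1, d2])
print(round(score_d1, 4), round(score_d2, 4))
```

D1 ranks above D2 because it matches two of the three query terms and contains "hello", the rarer and therefore higher-idf term, twice.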
References
◦ http://youseer.sourceforge.net/doc/Tutorial.pdf
◦ http://lucene.apache.org/solr/
◦ http://lucene.apache.org/java/2_4_0/scoring.html