information retrieval apache solr use case - bioassist retrieval & apache solr use case may 21,...
Information retrieval
• Information need → representation → query
• Information items → representation → indexed items
• Query vs. indexed items: relevance?
• Retrieved information
• Evaluation / relevance feedback
Text vector representation
• “Bag of words”
• “Bag of phrases”
• With stemming/normalization
• N-gram
• ...
• Concept based
• Content-bearing words, stop list

Term      Term frequency
Bag       2
Of        2
Words     1
Text      1
N-gram    1
...       ...
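The representations listed above can be sketched in a few lines of Python. This is only an illustration: the stop list, the whitespace tokenizer, and the function names are assumptions made for the example, not something taken from the slides or from Solr.

```python
from collections import Counter

# Hypothetical stop list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "and"}

def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]

def bag_of_words(text, stop_words=STOP_WORDS):
    """Map each content-bearing term to its frequency."""
    return Counter(tok for tok in tokenize(text) if tok not in stop_words)

def ngrams(text, n=2):
    """Word n-grams as tuples, in order of occurrence."""
    toks = tokenize(text)
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

bow = bag_of_words("The bag of words and the bag of phrases")
bigrams = ngrams("bag of words", n=2)
```

Here `bow` counts "bag" twice and "words" and "phrases" once each, while the stop words are dropped; stemming or normalization would be an extra step between tokenizing and counting.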
Relevance #2
• Vector space model
– $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
– $q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$
• Probability model
• Citation analysis model
* wikipedia
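In the vector space model, a document and a query are both weight vectors over the same t terms, so relevance can be scored by comparing the two vectors. The slide does not say which comparison is used; the sketch below assumes cosine similarity, a common choice, and represents vectors as sparse dictionaries. The document names and weights are made up for the example.

```python
import math

def cosine(d, q):
    """Cosine of the angle between two sparse vectors (term -> weight)."""
    dot = sum(w * q.get(term, 0.0) for term, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_d * norm_q)

# Hypothetical documents and query, all weights set to 1 for simplicity.
docs = {
    "d1": {"cpu": 1.0, "memory": 1.0},
    "d2": {"memory": 1.0, "disk": 1.0},
    "d3": {"disk": 1.0},
}
query = {"cpu": 1.0, "memory": 1.0}

# Rank document names by decreasing similarity to the query.
ranked = sorted(docs, key=lambda name: cosine(docs[name], query), reverse=True)
```

With these toy vectors, `d1` matches both query terms, `d2` one, and `d3` none, so the ranking is `d1`, `d2`, `d3`.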
Term frequency & inverse document frequency
• Some terms are more important than others
• w = TF · IDF
• $\mathrm{TF}_{t,d}$
– the frequency of occurrence of a term t in document d
• $\mathrm{IDF}_t$
– $\mathrm{IDF}_t = \log_{10}(N / n_t)$
– N is the number of documents in the collection, and $n_t$ is the number of documents in which term t occurs
– N = 100, $n_t$ = 25: idf ≈ 0.6
– N = 100, $n_t$ = 1: idf = 2
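The two worked examples are consistent with a base-10 logarithm of N / n_t, since log10(100/25) ≈ 0.6 and log10(100/1) = 2. A minimal sketch under that assumption (the function names are chosen for the example):

```python
import math

def idf(n_docs, n_docs_with_term):
    """Inverse document frequency, assuming a base-10 logarithm."""
    return math.log10(n_docs / n_docs_with_term)

def tf_idf(tf, n_docs, n_docs_with_term):
    """Term weight w = TF * IDF."""
    return tf * idf(n_docs, n_docs_with_term)

common = idf(100, 25)   # term occurs in a quarter of the collection
rare = idf(100, 1)      # term occurs in a single document
weight = tf_idf(3, 100, 1)
```

A term that occurs in only one document gets a much larger weight than one that occurs in a quarter of the collection, which is the sense in which some terms are more important than others.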
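Evaluation
• Recall: $\text{recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|} = \frac{TP}{TP + FN}$
• Precision: $\text{precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|} = \frac{TP}{TP + FP}$
Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. 1972.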
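Both measures follow directly from the sets of retrieved and relevant items. A small sketch, with made-up article identifiers and a hypothetical helper name:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of item ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Four articles retrieved, three relevant, two in common.
p, r = precision_recall(
    retrieved=["a1", "a2", "a3", "a4"],
    relevant=["a2", "a4", "a7"],
)
```

In this example half of the retrieved items are relevant (precision 0.5) and two of the three relevant items were found (recall 2/3).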
Solr: a brief history
• 2004, CNET
– search capability for reviews, news, price offers, etc.
• 2006, joined Apache as a subproject of Lucene
– Lucene is a high-performance, full-featured text search engine library in Java
– indexes 20MB/minute on Pentium M 1.5GHz
– 1MB memory requirement
– index size 20-30% of text size
• Nov 2009, Solr 1.4 released
• Who is using Solr
– WhiteHouse.gov; AOL.com; SourceForge.net; Netflix.com; Plaxo.com; ...
Solr: features
• XML over HTTP: update via POST, query via GET.
• Support tokenizing, stemming, normalization
• Web interface to invoke, monitor, analyze the search engine.
• Support load distribution: server replication.
• Caching, auto-warming
• External file-based configuration of stopword lists, synonym lists, and protected word lists
• Import data from DB or other sources
• ...
Solr: update example
• http://localhost:8983/solr/update/?
• Add
    <add>
      <doc>
        <field name="ID">m00123456</field>
        <field name="product">memory</field>
        <field name="volume">2G</field>
        <field name="supplier">Kingston</field>
      </doc>
      [<doc> ... </doc>[<doc> ... </doc>]]
    </add>
• Delete
    <delete>
      <id>m00123456</id>
      <query>supplier:Kingston</query>
    </delete>
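As a rough illustration of "update via POST", the sketch below builds the `<add>` message with Python's standard library and wraps it in a POST request for the update URL from the slide. The helper names are hypothetical, and the request is only constructed, not sent, so the example runs without a Solr server.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Update URL from the slide (trailing '?' dropped).
UPDATE_URL = "http://localhost:8983/solr/update/"

def build_add(docs):
    """Serialize a list of {field name: value} dicts as a Solr <add> message."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")

def build_request(xml_body, url=UPDATE_URL):
    """Wrap an XML message in an HTTP POST request (not sent here)."""
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )

payload = build_add([
    {"ID": "m00123456", "product": "memory", "volume": "2G", "supplier": "Kingston"},
])
request = build_request(payload)

# To send it to a running Solr instance:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Added or deleted documents normally become visible to searches only after a commit, which can be posted to the same URL as a `<commit/>` message.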
Solr: query example
• http://localhost:8983/solr/select/?q=cpu+memory&qt=standard
• Default relation between search terms: OR
• Can set the minimum number of terms that must match
• Results are ranked
– Vector space model, TF·IDF
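The query URL above can be assembled from its parts with the standard library, which also takes care of encoding the space between search terms as `+`. The helper name is an assumption for the example; only the `q` and `qt` parameters shown on the slide are used.

```python
from urllib.parse import urlencode

# Select URL from the slide.
SELECT_URL = "http://localhost:8983/solr/select/"

def build_query_url(terms, handler="standard", base=SELECT_URL):
    """Build a GET URL that searches for the given terms."""
    params = {"q": " ".join(terms), "qt": handler}
    return base + "?" + urlencode(params)

url = build_query_url(["cpu", "memory"])
```

The resulting `url` is the one shown on the slide and could be fetched with `urllib.request.urlopen(url)` against a running instance.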
Article DB & Author DB in Solr
• Article profile DB
– 19 million PubMed articles
– Each article: title, abstract, MeSH, keyword, chemicals
– list of concepts => article DB
• <PMID><date><list of concepts>
– Query using a PMID
• [C0041221, 0.39],[C0879626, 0.03], ...
• Author profile DB
– 2.14 million unique authors
– Identify all articles of each unique author
• Query the article DB
• [C0041221, 0.39],[C0879626, 0.03], ...
– Each author: list of expertise => author DB
• 39 x C0041221 , 3 x C0879626
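The example values suggest that a concept weight such as 0.39 is turned into 39 repetitions of the concept identifier, so that the term frequency stored in Solr reflects the weight. The sketch below illustrates that reading; the scale factor of 100, the summing across an author's articles, and all function names are assumptions, not details confirmed by the slides.

```python
from collections import Counter

def article_to_counts(profile, scale=100):
    """Turn [(concept id, weight), ...] into integer repetition counts."""
    return Counter({cid: round(weight * scale) for cid, weight in profile})

def author_profile(article_profiles, scale=100):
    """Sum the concept counts over all articles of one author (assumed rule)."""
    total = Counter()
    for profile in article_profiles:
        total.update(article_to_counts(profile, scale))
    return total

def as_text(counts):
    """Repeat each concept id `count` times, giving text that can be indexed."""
    return " ".join(" ".join([cid] * n) for cid, n in sorted(counts.items()))

one_article = [("C0041221", 0.39), ("C0879626", 0.03)]
expertise = author_profile([one_article])
expertise_text = as_text(expertise)
```

For the single article above, `expertise` holds 39 occurrences of C0041221 and 3 of C0879626, matching the "39 x C0041221, 3 x C0879626" example.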
Cook an expert-finder query
• Identify a set of articles in an area of interest
– a list of PMIDs
• Query the article DB using this list
– [C4249671, 1], [C0041234, 0.79], ...
• Construct the query
– {C4249671, C0041234, ...}
• Can specify the minimum number of matches
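One way to picture the "cooking" step: merge the concept profiles returned for the chosen PMIDs, keep the strongest concepts, and join their identifiers into a query string. How the slides actually combine the weights is not stated; the sketch assumes summing per concept and rescaling so the top concept has weight 1. The function name and the `top_k` cutoff are hypothetical.

```python
from collections import defaultdict

def cook_query(article_profiles, top_k=10):
    """Merge concept profiles and return (ranked concepts, query string)."""
    totals = defaultdict(float)
    for profile in article_profiles:
        for cid, weight in profile:
            totals[cid] += weight          # assumed: weights add up
    if not totals:
        return [], ""
    peak = max(totals.values())
    ranked = sorted(
        ((cid, weight / peak) for cid, weight in totals.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )[:top_k]
    # Space-separated terms; the default relation between terms is OR.
    query = " ".join(cid for cid, _ in ranked)
    return ranked, query

# Hypothetical profiles for two articles in the area of interest.
profiles = [
    [("C4249671", 0.5), ("C0041234", 0.4)],
    [("C4249671", 0.5), ("C0041234", 0.39)],
]
concepts, query = cook_query(profiles)
```

The `query` string can then be sent to the author DB in the same way as any other Solr query, and the ranked results are the candidate experts.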
Summary on expert-finder
• Rank the authors based on the similarity between their publications and the publications belonging to a particular field
[Diagram: PubMed, Article DB, Author DB; area of interest → cooked query → a ranked list: 1. Barend, 2. Rob, 3. Machiel, 4. Marc, ...]