Information Retrieval & Apache Solr Use Case May 21, 2010 Leon Mei



Outline

• Information retrieval & text mining

• Apache Lucene/Solr

• Use case: expert finder

Information retrieval

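[Figure: the retrieval process. An information need is represented as a query; information items are represented as indexed items. The query is matched against the indexed items (relevance?), giving the retrieved information, which feeds evaluation and relevance feedback.]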

Text vector representation

• “Bag of words”

• “Bag of phrases”

• With stemming/normalization

• N-gram

• ...

• Concept based

• Content-bearing words, stop list

Term      Term frequency
Bag       2
Of        2
Words     1
Text      1
N-gram    1
...       ...
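The term/frequency table above can be reproduced with a few lines of Python. This is a minimal sketch, not the method used in the talk: the tokenizer regex and the small `STOP_WORDS` set are assumptions made for illustration, and "of" is deliberately left out of the stop list because the slide's table counts it.

```python
import re
from collections import Counter

# Hypothetical stop list, for illustration only.
STOP_WORDS = {"a", "an", "the", "and", "with"}

def term_frequencies(text, stop_words=STOP_WORDS):
    """Return a Counter mapping each lower-cased term to its frequency."""
    # Keep hyphenated tokens such as "n-gram" together.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

tf = term_frequencies("Bag of words. Bag of phrases. Text N-gram")
for term, count in tf.most_common():
    print(term, count)
```

Stemming and normalization, as listed on the slide, would be applied to each token before counting.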

Relevance #1

• Boolean

– e.g. query “term 1 term 2 term 3”

Relevance #2

• Vector space model

– $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$

– $q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$

• Probability model

• Citation analysis model

* Source: Wikipedia
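In the vector space model, a document vector and a query vector are usually compared by the cosine of the angle between them. The slide does not name the similarity measure, so cosine similarity is an assumption here; the sketch below only illustrates the idea for two weight vectors of equal length.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between a document vector d and a query vector q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        # An all-zero vector shares no terms with anything.
        return 0.0
    return dot / (norm_d * norm_q)

# Illustrative weights over a three-term vocabulary.
print(cosine_similarity([0.6, 0.0, 2.0], [0.6, 0.0, 0.0]))
```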

Term frequency & inverse document frequency

• Some terms are more important than others

• $w_{t,d} = TF_{t,d} \cdot IDF_t$

• $TF_{t,d}$

– the frequency of occurrence of term t in document d

• $IDF_t = \log_{10}(N / n_t)$

– N is the number of documents in the collection, and $n_t$ is the number of documents in which term t occurs

– N = 100, $n_t$ = 25: IDF ≈ 0.6

– N = 100, $n_t$ = 1: IDF = 2
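The two worked values on this slide are consistent with a base-10 logarithm of N / nt. The sketch below assumes that form (the base is inferred from the numbers, not stated in the transcript) and reproduces both values; the function names are illustrative.

```python
import math

def idf(n_docs, n_docs_with_term):
    """Inverse document frequency, assuming a base-10 logarithm."""
    return math.log10(n_docs / n_docs_with_term)

def tf_idf(tf, n_docs, n_docs_with_term):
    """Weight of a term in a document: w = TF * IDF."""
    return tf * idf(n_docs, n_docs_with_term)

print(round(idf(100, 25), 1))  # term occurs in a quarter of the collection
print(round(idf(100, 1), 1))   # term occurs in a single document
```

A term found in only one document out of 100 gets more than three times the weight of a term found in 25 of them, which is the sense in which some terms are more important than others.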

Evaluation

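• $\text{precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|} = \frac{TP}{TP + FP}$

• $\text{recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|} = \frac{TP}{TP + FN}$

Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. 1972.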
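Precision and recall can be computed directly from the set of retrieved documents and the set of relevant documents. A minimal sketch, with made-up document IDs:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, three relevant, two in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(p, r)
```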

Solr: a brief history

• 2004, CNET

– search capability for reviews, news, price offers, etc.

• 2006, joined Apache under the Lucene project

– Lucene is a high-performance, full-featured text search engine library in Java

– 20 MB/minute indexing on a Pentium M 1.5 GHz

– 1 MB memory requirement

– index size 20-30% of text size

• Nov 2009, Solr 1.4 released

• Who is using Solr

– WhiteHouse.gov; AOL.com; SourceForge.net; Netflix.com; Plaxo.com; ...

Solr: features

• XML over HTTP: update via POST, query via GET.

• Supports tokenizing, stemming, normalization

• Web interface to invoke, monitor, and analyze the search engine.

• Supports load distribution: server replication.

• Caching, auto-warming

• External file-based configuration of stopword lists, synonym lists, and protected word lists

• Import data from DB or other sources

• ...

Solr: update example

• http://localhost:8983/solr/update/?

• Add

    <add>
      <doc>
        <field name="ID">m00123456</field>
        <field name="product">memory</field>
        <field name="volume">2G</field>
        <field name="supplier">Kingston</field>
      </doc>
      [<doc> ... </doc> [<doc> ... </doc>]]
    </add>

• Delete

    <delete>
      <id>m00123456</id>
      <query>supplier:Kingston</query>
    </delete>
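An add message like the one on this slide can be built and posted from a script. The sketch below uses only the Python standard library; the helper names `build_add_xml` and `post_update` are hypothetical, the URL is the one from the slide, and `post_update` assumes a Solr instance is actually listening there, so it is defined but not called.

```python
import urllib.request
import xml.etree.ElementTree as ET

SOLR_UPDATE_URL = "http://localhost:8983/solr/update/"  # from the slide

def build_add_xml(docs):
    """Build an <add> message from a list of {field name: value} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

def post_update(xml_body, url=SOLR_UPDATE_URL):
    """POST an XML update message to Solr and return the response body."""
    request = urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")

xml_body = build_add_xml([{"ID": "m00123456", "product": "memory"}])
print(xml_body)
```

Added documents generally become searchable only after a commit, which can be sent to the same URL as a separate `<commit/>` message.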

Solr: query example

• http://localhost:8983/solr/select/?q=cpu+memory&qt=standard

• Default relation between search terms: OR

• Can set the minimum number of matching terms

• Results are ranked

– Vector space model, TF·IDF
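The query URL on this slide can be assembled with the standard library so that the search terms are encoded correctly. A minimal sketch; `build_select_url` is a hypothetical helper, and only the `q` and `qt` parameters shown on the slide are used.

```python
from urllib.parse import urlencode

SOLR_SELECT_URL = "http://localhost:8983/solr/select/"  # from the slide

def build_select_url(terms, handler="standard", base_url=SOLR_SELECT_URL):
    """Return a Solr select URL for the given search terms."""
    # urlencode turns the space between terms into "+".
    params = {"q": " ".join(terms), "qt": handler}
    return base_url + "?" + urlencode(params)

url = build_select_url(["cpu", "memory"])
print(url)
```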

Expert finder

Person DB (PubMed authors)

Who are the experts on malaria?

Article DB & Author DB in Solr

• Article profile DB

– 19 million PubMed articles

– Each article: title, abstract, MeSH, keyword, chemicals

– list of concepts => article DB

• <PMID><date><list of concepts>

– Query using a PMID

• [C0041221, 0.39],[C0879626, 0.03], ...

• Author profile DB

– 2.14 million unique authors

– Identify all articles of each unique author

• Query the article DB

• [C0041221, 0.39],[C0879626, 0.03], ...

– Each author: list of expertise => author DB

• 39 x C0041221 , 3 x C0879626
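The example suggests that concept weights such as 0.39 and 0.03 become repeated concept IDs (39 and 3 occurrences) in the author document, so that term frequency carries the weight. The slide does not spell out the rule, so the sketch below is an assumption: it sums each concept's weight over all of an author's articles, multiplies by a hypothetical `scale` of 100, and rounds.

```python
from collections import defaultdict

def author_profile(article_profiles, scale=100):
    """Aggregate per-article concept weights into per-author repeat counts."""
    totals = defaultdict(float)
    for profile in article_profiles:
        for concept, weight in profile:
            totals[concept] += weight
    # Assumed rule: weight * scale, rounded to a whole number of repeats.
    return {concept: round(weight * scale) for concept, weight in totals.items()}

def profile_as_text(counts):
    """Expand repeat counts into the text of an author document."""
    return " ".join(" ".join([concept] * n) for concept, n in counts.items())

counts = author_profile([[("C0041221", 0.39), ("C0879626", 0.03)]])
print(counts)
```

Indexing the expanded text in the author DB lets Solr's TF·IDF ranking treat a heavily weighted concept as a frequently occurring term.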

Cook an expert-finder query

• Identify a set of articles in an area of interest

– a list of PMIDs

• Query the article DB using this list

– [C4249671, 1], [C0041234, 0.79], ...

• Construct the query

– {C4249671, C0041234, ...}

• Can specify the minimum number of matches
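The cooking steps can be sketched as a small function that merges the concept lists returned for each PMID and keeps the strongest concepts as query terms. How the talk's system selects and combines concepts is not given in the transcript; summing weights and the `top_k` cut-off are assumptions, and the second article's weights are illustrative.

```python
from collections import defaultdict

def cook_query(article_profiles, top_k=10):
    """Build a space-separated concept query from per-article concept lists."""
    totals = defaultdict(float)
    for profile in article_profiles:
        for concept, weight in profile:
            totals[concept] += weight
    # Highest total weight first; ties broken by concept ID for stable output.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return " ".join(concept for concept, _ in ranked[:top_k])

query = cook_query(
    [
        [("C4249671", 1.0), ("C0041234", 0.79)],
        [("C0041234", 0.10), ("C0879626", 0.03)],
    ],
    top_k=2,
)
print(query)
```

Because Solr's default relation between terms is OR, a query of this form matches any author whose profile contains at least one of the concepts, and the ranking favors authors who share more of them.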

Summary on expert-finder

• Rank the authors based on the similarity between their publications and the publications that belong to a particular field

[Figure: PubMed, article DB and author DB. An area of interest is turned into a cooked query, which returns a ranked list of authors: 1. Barend, 2. Rob, 3. Machiel, 4. Marc, ...]

Acknowledgement

• Christine Chichester (NBIC)

• Bharat Singh (NBIC)

• Marc Weeber (Knewco, Inc)

• Jun Wang (University College London)