the lailaps search engine - core

27
The LAILAPS Search Engine - A Feature Model for Relevance Ranking in Life Science Databases International Symposium on Integrative Bioinformatics 2010 [email protected] IPK Gatersleben, March 22, 2010 M Lange , K Spies, C Colmsee, S Flemming, M Klapperstück, U Scholz Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Germany Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

Upload: others

Post on 25-Jan-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The LAILAPS Search Engine - CORE

The LAILAPS Search Engine-

A Feature Model for Relevance Ranking in Life Science Databases

International Symposium on Integrative Bioinformatics 2010

[email protected] Gatersleben, March 22, 2010

M Lange, K Spies, C Colmsee, S Flemming, M Klapperstück, U Scholz

Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Germany

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 2: The LAILAPS Search Engine - CORE

Outline

• Information Retrieval in Live Science

• The LAILAPS Approach

• Query Engine & Feature Scoring

• Relevance Prediction

• The LAILAPS System

03/22/2010 2Matthias Lange @ IPK-Gatersleben, Germany

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 3: The LAILAPS Search Engine - CORE

Life Science Data Universe

03/22/2010 3Matthias Lange @ IPK-Gatersleben, Germany

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 4: The LAILAPS Search Engine - CORE

Life Science Data Universe

03/22/2010 4Matthias Lange @ IPK-Gatersleben, Germany

Data Domain Entry#

Literature (PubMed): 19*106

Molecules (PubChem) 62*106

DNA-sequences (GenBank): 112*106

Data Domain Entry#

Protein sequences(UniProt): 11*106

Protein functions(OLS, PDB, INTACT, PFAM): 1*106

Data Domain Entry#

Pathway (KEGG): 0.1*106

Genes (KEGG): 5*106

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 5: The LAILAPS Search Engine - CORE

Life Science Data Universe

03/22/2010 5Matthias Lange @ IPK-Gatersleben, Germany

words: 1111

lines: 164 words: 1516

lines: 238

chlorophyll synthase

476 entries of chlorophyll synthase

on average 1200 words per entry

on average 200 lines

571,200 words

95,200 lines

1322 pages A4

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 6: The LAILAPS Search Engine - CORE

Life Science Data Universe - Search

03/22/2010 6Matthias Lange @ IPK-Gatersleben, Germany

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 7: The LAILAPS Search Engine - CORE

The Story of Searching Data

03/22/2010 7

1. 37% of life science scientists spending over 80% of working time in front of a computer2. 47% of all scientist make daily use of search engines

Divoli A, Hearst MA, Wooldridge MA., Evidence for showing gene/protein name suggestions in bioscience literature search interfaces., Pac Symp Biocomput. 2008:568-79

Matthias Lange @ IPK-Gatersleben, Germany

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 8: The LAILAPS Search Engine - CORE

The LAILAPS Approach

03/22/2010 8Matthias Lange @ IPK-Gatersleben, Germany

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 9: The LAILAPS Search Engine - CORE

Motivation for LAILAPS

• platform for information retrieval over isolated databases

• content sensitive relevance ranking

• user profiles for relevance estimation

• estimation of relevance factors by tracking user behavior

03/22/2010 9Matthias Lange @ IPK-Gatersleben, Germany

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 10: The LAILAPS Search Engine - CORE

LAILPAS Search Engine

03/22/2010 10Matthias Lange @ IPK-Gatersleben, Germany

user login

phrase and conjunctive

multi term queries

hit excerpt &

link to data browser

feature

vector

relevance

score

synonym &

database filter

relevance

ratingJavascript data

browsing behaviour

tracker

chart type

feature scores

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 11: The LAILAPS Search Engine - CORE

LAILAPS Feature Model

1. attribute in the record

2. database in which the hit was found

3. frequency of hits per attribute and record

4. co-occurrence of query terms

5. keyword surrounding the hit position

6. organism that is reported in the record

7. sequence length

8. text position of the hit in the record structure

9. synonym that produced the hit

03/22/2010 11Matthias Lange @ IPK-Gatersleben, Germany

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 12: The LAILAPS Search Engine - CORE

LAILAPS Ranking Workflow

03/22/2010 12Matthias Lange @ IPK-Gatersleben, Germany

query Q = t1 … tn

query resultR = (t,d,{(a,p)})

entries feature scores( 1… 9)

relevance score

relevance ranking

1 … n

find:using a text index system

score:using features F1 … F9

relevance ( 1… 9) :using a feed forward neural network

rank: sorting ofrelevance scores

chlorophyll synthase

('chlorophyll', 'Q9MBA1', {('DE', 83), ('RT',2)}

('synthase', 'Q9MBA1', {('DE', 97)}

scorefreq= 3

score co-ocurence= 1-(distance/maxdist) *

{cor: 1; wrong: 0.3}

. . .

Nfeedforward :(neurons = 20, input =9, output =1, edges = 7 x 4)

Q9MBA1 = N(68, 27,71,33,80,0,31,0,0) = 0.935

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 13: The LAILAPS Search Engine - CORE

Query Engine: Text Index

• decompose text into textual atoms– avoid line breaks in tokens

– do not decompose number groups

– keep chars in token _-/*+.

– ignore chars in token ()[]<>'',:'

– token delimiter whitspace and =;\t

• stop words (ignore frequent but useless tokens)– standard in natural language

• but, his, nor, than, was, by, how, not, that, we, can, however, of, the, were …

– more in life sciences

• product, locus_tag, db_xref, region, source, DB, organism, CDS, xref, interpro, submission, submitted, pfam, binding, table, embl, geneid, genomic, transl, cdd, sequence, pubmed, taxon, IEA, NIH, genbank, data, here …

• synonyms (from databases Brenda, UniProt)– Synonyms Relations: 67521

– Protein names: 76869

09/12/2008 13M. Lange, K. Spies @ IPK-Gatersleben

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 14: The LAILAPS Search Engine - CORE

Relevance Prediction

03/22/2010 14Matthias Lange @ IPK-Gatersleben, Germany

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 15: The LAILAPS Search Engine - CORE

Relevance Prediction: Confidence Classes

03/22/2010 15Matthias Lange @ IPK-Gatersleben, Germany

confidenceclass

accessioncurated

ranklailapsrank

relevant problem

1Acc1 1 5 no wrong ranking

Acc2 2 2 yes

Acc3 3 3 yes

2Acc4 - 4 -

additional hit (e.g. synonm)

Acc5 4 5 yes

Acc6 5 - no database version

Acc7 6 6 yes

3Acc8 7 7 yes

Acc9 8 - no text decomposition

Acc10 9 8 yes

Acc11 10 1 no wrong ranking

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 16: The LAILAPS Search Engine - CORE

Relevance Prediction: Training Data

03/22/2010 16Matthias Lange @ IPK-Gatersleben, Germany

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 17: The LAILAPS Search Engine - CORE

Relevance Prediction: Training Data

• 22 protein queries using data retrieval systems

• 1069 manual ranked database records

• 3 quality classes (high, medium, low)

03/22/2010 17Matthias Lange @ IPK-Gatersleben, Germany

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 18: The LAILAPS Search Engine - CORE

Relevance Prediction: Neural Network Training

• split of training/validation data:80/20

• training epochs:500

• mean square error:0.33

• type:feed forward

• architecture:7-4

03/22/2010 18Matthias Lange @ IPK-Gatersleben, Germany

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 19: The LAILAPS Search Engine - CORE

User Ranking Profile: Relevance Rating

• Explicit rating

• Implicit rating: tracking browsing behaviorusing JavaScript– clicked result entries

– clicked entries above, below

– activity time

– scroll amount

– mouse movement

– lost / got focus

– text selection

03/22/2010 19Matthias Lange @ IPK-Gatersleben, Germany

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 20: The LAILAPS Search Engine - CORE

Relevance Prediction: Workflow

03/22/2010 20Matthias Lange @ IPK-Gatersleben, Germany

PredictedRating

Training Data

QueryResults

FeedbackSystem

RelevancePrediction

ManualRating

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 21: The LAILAPS Search Engine - CORE

The LAILAPS System

03/22/2010 21Matthias Lange @ IPK-Gatersleben, Germany

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 22: The LAILAPS Search Engine - CORE

LAILAPS – Technical Background

03/22/2010 22Matthias Lange @ IPK-Gatersleben, Germany

Search Form

Frontend

Backend/Logic

Network Communication

Color Code – Software Modules:

Result BrowserAdministration

Configuration

ORACLE 11g

Render Engine

DBMS

Text QueryResult

Ranking

Feedback

Tracking

RMI

Tapestry / iSpring

JAVA 1.5, JOONE,

SPRING

HTTP/SQLNet

Color Code - Technology:

ORACLE 11g

HTML

3rd-party Retrieval

System (e.g.

SRS@EBI, in house)

SQL

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 23: The LAILAPS Search Engine - CORE

LAILAPS Customization

• database to be indexed

03/22/2010 23Matthias Lange @ IPK-Gatersleben, Germany

• synonyms

• configuration

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 24: The LAILAPS Search Engine - CORE

LAILAPS Benchmark

03/22/2010 24Matthias Lange @ IPK-Gatersleben, Germany

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 25: The LAILAPS Search Engine - CORE

Benchmark of Neuronal Network Ranking

03/22/2010 25Matthias Lange @ IPK-Gatersleben, Germany

Pre

cis

ion

in %

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 26: The LAILAPS Search Engine - CORE

Conclusion1. LAILAPS as search engine for life science dabases

2. „objective subjectiveness“ – learning of human relevance classes

3. cost-reduction of scientific information retrieval

4. user relevance profiles

5. deterministic quality of database research

03/22/2010 26Matthias Lange @ IPK-Gatersleben, Germany

# database records estimated effort in person day

- 50 0,5

51 – 100 1

101 - 250 > 1

> 250 objective rating hardly not possible

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0

Page 27: The LAILAPS Search Engine - CORE

LAILAPS Project:http://lailaps.ipk-gatersleben.de

Contact:[email protected]

03/22/2010 27Matthias Lange @ IPK-Gatersleben, Germany

Acknowledgement

Jinbo Chen (IPK Gatersleben)

Mandy Weißbach (IPK Gatersleben)

Gregor Haberhauer (BASF)

Michael Leps (BASF Plant Science)

Jens Stein (BASF Plant Science)

Röbbe Wünschiers (FH Mitweidae)

Nat

ure

Pre

cedi

ngs

: doi

:10.

1038

/npr

e.20

10.4

337.

1 : P

oste

d 7

Apr

201

0