
EBI is an Outstation of the European Molecular Biology Laboratory.

Gautier Koscielny
VectorBase Meeting

08 February 2012, EBI

VectorBase Text Search Engine


History of text search

• Up to 2009:
  • Notre Dame University maintained the main site text search
  • At the time, there was no text search module available in the version of Ensembl installed.
• In 2010:
  • The Ensembl installation was updated to reflect the latest Ensembl Genomes installation.
  • Text search technology became available
  • At the time, Ensembl search was based on the EB-EYE indices


Challenges in 2010

• How to integrate the new Lucene EB-EYE indices in the main site?

• Multiple sources of indexing VectorBase (expression, community annotations, etc.)

• Relied on good will from external services to update the EB-EYE indices from VectorBase core databases

• Relied on an XML dump of the core database
  • Time-consuming task
  • Difficult to index new datatypes or resources


Requirements

• Framework to generate indices at any time
  • Can reflect new community annotations (CAP)
  • Ontology information
  • New datasources: literature
• Search to serve Lucene indices from different providers:
  • Gene annotation, x-refs, comparative genomics data (EBI)
  • Microarray and gene expression data (Imperial)
  • CAP (Notre Dame)
• Indexing must be fast, easy to use and maintain
• Search can be plugged into different tools:
  • Main VectorBase website
  • Ensembl genome browser


Architecture

[Architecture diagram. Data sources: Ensembl, FuncGen, CAP. Index providers: EBI, Imperial, Notre Dame, each producing a Lucene index file. The VectorBase Search Service Layer sits on top of the index files and serves clients over SOAP.]


What is being searched?

• Genomic information (Ensembl databases)
  • Gene models
  • Variation
  • Probes
  • Orthologs
• Expression data (Imperial)
• CAP
• Ontologies (idomal, miro, anatomy)
• Population genomics (Imperial)


Generating Ensembl indices at the EBI

• Based on a direct connection to the database(s)
• Uses a configuration file containing the description of objects and their types
  • Database connection (staging-1, …)
  • Database type (core, funcgen, variation)
  • Genome (aedes_aegypti)
  • Homologies
• Each object in the configuration file is represented by a Java class
• The configuration loader automatically creates an instance of each type using the class loader.
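
The loader idea can be sketched in plain Java. This is a minimal illustration, not the framework's code: the class names (ConfigLoaderSketch, ConfigObject, DatabaseConnection, Genome), the property keys and the map-based configuration format are all assumptions, since the slides do not show the real ones. Each entry names a class, and the loader resolves it with Class.forName and instantiates it by reflection.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these class names come from the VectorBase framework.
class ConfigLoaderSketch {

    /** Contract assumed for every object described in the configuration file. */
    interface ConfigObject {
        void configure(Map<String, String> properties);
        String describe();
    }

    /** Assumed object for a "database connection" entry. */
    public static class DatabaseConnection implements ConfigObject {
        private String host;
        public void configure(Map<String, String> properties) { host = properties.get("host"); }
        public String describe() { return "connection to " + host; }
    }

    /** Assumed object for a "genome" entry. */
    public static class Genome implements ConfigObject {
        private String name;
        public void configure(Map<String, String> properties) { name = properties.get("name"); }
        public String describe() { return "genome " + name; }
    }

    /** Resolves each entry's class by name and instantiates it through reflection. */
    static List<ConfigObject> load(Map<String, Map<String, String>> entries) {
        List<ConfigObject> objects = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : entries.entrySet()) {
            try {
                Class<?> type = Class.forName(entry.getKey());
                ConfigObject object = (ConfigObject) type.getDeclaredConstructor().newInstance();
                object.configure(entry.getValue());
                objects.add(object);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot instantiate " + entry.getKey(), e);
            }
        }
        return objects;
    }

    public static void main(String[] args) {
        // Stand-in for a parsed configuration file: class name -> properties.
        Map<String, Map<String, String>> entries = new LinkedHashMap<>();
        entries.put(DatabaseConnection.class.getName(), Map.of("host", "staging-1"));
        entries.put(Genome.class.getName(), Map.of("name", "aedes_aegypti"));
        for (ConfigObject object : load(entries)) {
            System.out.println(object.describe());
        }
    }
}
```

With this design a new datatype only needs a new class and a new configuration entry; the loader itself does not change.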


Example of configuration file


Procedure (for Ensembl indices)

Databases: core, funcgen, variation, compara

1. If compara is defined, get all homologies
2. For each genome in turn:
   3. Get all gene, transcript, exon, protein and xref information from core
   4. Get all reporters from funcgen and their mapping to gene models
   5. Get all variations and their relation to gene models
   6. Associate all existing homologies to the genes
   7. Create a Lucene Document for all genes
8. The indices are copied to Notre Dame University
9. Tomcat instance is restarted
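
Steps 1 to 7 can be sketched as a loop over genomes. This is only an illustration under stated assumptions: in-memory maps stand in for the core, funcgen, variation and compara databases, a plain field map stands in for a Lucene Document, and the class, method and field names are invented for the example. Copying the indices and restarting Tomcat (steps 8 and 9) are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the indexing loop; in-memory maps stand in for the
// core, funcgen, variation and compara databases.
class IndexingProcedureSketch {

    /** Builds one flat "document" (field name -> value) per gene. */
    static List<Map<String, String>> buildDocuments(
            Map<String, List<String>> genesByGenome,     // core: genome -> gene ids
            Map<String, List<String>> reportersByGene,   // funcgen: gene id -> reporters
            Map<String, List<String>> variationsByGene,  // variation: gene id -> variations
            Map<String, List<String>> homologsByGene) {  // compara: loaded once, may be empty
        List<Map<String, String>> documents = new ArrayList<>();
        for (Map.Entry<String, List<String>> genome : genesByGenome.entrySet()) {
            for (String geneId : genome.getValue()) {
                Map<String, String> doc = new LinkedHashMap<>();
                doc.put("species", genome.getKey());
                doc.put("id", geneId);
                doc.put("reporters", String.join(" ", reportersByGene.getOrDefault(geneId, List.of())));
                doc.put("variations", String.join(" ", variationsByGene.getOrDefault(geneId, List.of())));
                doc.put("homologs", String.join(" ", homologsByGene.getOrDefault(geneId, List.of())));
                documents.add(doc);
            }
        }
        return documents;
    }

    public static void main(String[] args) {
        // Placeholder identifiers, not real VectorBase data.
        List<Map<String, String>> docs = buildDocuments(
                Map.of("aedes_aegypti", List.of("gene1")),
                Map.of("gene1", List.of("probeA")),
                Map.of(),
                Map.of("gene1", List.of("gene9")));
        docs.forEach(System.out::println);
    }
}
```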


Ensembl object mapping in Java

• Ensembl concepts are mapped to equivalent Java data access objects (DAO)

• All Ensembl concepts are stored in memory and removed when a Lucene Document is created

[Class diagram. Labels: EnsemblFeature; Gene (extends, contains); Transcripts, translations, exons; Homology (extends); Xref (contains).]
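
One possible reading of the class diagram, sketched in Java. The inheritance and containment shown here (Gene and Homology extending EnsemblFeature, Gene holding transcripts, xrefs and homologies) are an interpretation of the diagram labels, and every field and method name is an assumption. The flush method illustrates the point that objects are dropped from memory once their document has been produced.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One possible reading of the class diagram; all field names are assumptions.
class ObjectMappingSketch {

    static class EnsemblFeature {
        final String stableId;
        EnsemblFeature(String stableId) { this.stableId = stableId; }
    }

    static class Xref {
        final String database;
        final String accession;
        Xref(String database, String accession) {
            this.database = database;
            this.accession = accession;
        }
    }

    static class Homology extends EnsemblFeature {
        final String species;
        Homology(String stableId, String species) {
            super(stableId);
            this.species = species;
        }
    }

    static class Gene extends EnsemblFeature {
        final List<String> transcriptIds = new ArrayList<>();
        final List<Xref> xrefs = new ArrayList<>();
        final List<Homology> homologies = new ArrayList<>();
        Gene(String stableId) { super(stableId); }
    }

    /** Flattens a cached gene to text and removes it from the cache; returns null if absent. */
    static String flush(Map<String, Gene> cache, String stableId) {
        Gene gene = cache.remove(stableId);
        if (gene == null) {
            return null;
        }
        StringBuilder text = new StringBuilder(gene.stableId);
        for (String id : gene.transcriptIds) text.append(' ').append(id);
        for (Xref x : gene.xrefs) text.append(' ').append(x.database).append(':').append(x.accession);
        for (Homology h : gene.homologies) text.append(' ').append(h.stableId);
        return text.toString();
    }

    public static void main(String[] args) {
        Map<String, Gene> cache = new HashMap<>();
        Gene gene = new Gene("gene1"); // placeholder identifier
        gene.transcriptIds.add("gene1-RA");
        cache.put(gene.stableId, gene);
        System.out.println(flush(cache, "gene1"));
        System.out.println("cached genes left: " + cache.size());
    }
}
```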


Creating a Lucene document

• A document is a container for the index
• Each document defines one or several fields
• The framework creates a document per gene
• Each field can store its value (or not)
• Each field can be indexed (or not)
• The text stored can be compressed.
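
The difference between a stored field and an indexed field can be shown with a toy model. This is deliberately not the Lucene API: Field and ToyIndex are made-up classes, and tokenisation is a plain whitespace split. An indexed field can be searched; a stored field can be read back from a hit; a field can be either, both or neither. Compression is not modelled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of stored vs indexed fields; this is NOT the Lucene API.
class FieldFlagsSketch {

    static final class Field {
        final String name;
        final String value;
        final boolean stored;
        final boolean indexed;
        Field(String name, String value, boolean stored, boolean indexed) {
            this.name = name;
            this.value = value;
            this.stored = stored;
            this.indexed = indexed;
        }
    }

    static final class ToyIndex {
        private final List<List<Field>> documents = new ArrayList<>();
        private final Map<String, Set<Integer>> postings = new HashMap<>();

        /** Adds a document and returns its id; only indexed fields become searchable. */
        int add(List<Field> document) {
            int docId = documents.size();
            documents.add(document);
            for (Field field : document) {
                if (!field.indexed) continue;
                for (String token : field.value.toLowerCase(Locale.ROOT).split("\\s+")) {
                    postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
                }
            }
            return docId;
        }

        /** Returns the ids of documents with an indexed field containing the term. */
        Set<Integer> search(String term) {
            return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
        }

        /** Returns only the values of stored fields for a hit. */
        Map<String, String> stored(int docId) {
            Map<String, String> values = new LinkedHashMap<>();
            for (Field field : documents.get(docId)) {
                if (field.stored) values.put(field.name, field.value);
            }
            return values;
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int id = index.add(List.of(
                new Field("id", "gene1", true, true),
                new Field("description", "odorant receptor", false, true)));
        System.out.println(index.search("odorant"));
        System.out.println(index.stored(id));
    }
}
```

Indexing a description without storing it keeps the index small while still letting a query match on it; the stored gene id is then enough to link back to the full record.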


Gene Document

• Fields:
  • Gene id, name, description
  • Coordinates: seq region name, start, end
  • Species, feature type (gene), source (biotype), genomic unit
  • Transcript count, transcript stable ids
  • Exon count, exon stable ids
  • Peptide count, peptide stable ids, domains
  • Core xrefs
  • Variation xrefs (if available)
  • Funcgen xrefs (if available)
  • Compara homologs (if available)


CAP indices

• A GFF parser extracts gene and transcript models.
• Name, description, submitter and chromosome location are indexed.
• Very fast
• Could be updated overnight if required.
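
A minimal sketch of such a parser, assuming GFF3-style input (nine tab-separated columns, with key=value pairs separated by semicolons in the last column). The slides do not say which GFF version or which attribute keys CAP uses, so the class name and the submitter key in the example are assumptions. Only gene and mRNA rows are kept, and percent-decoding of attribute values is not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical GFF3-style line parser; attribute keys are assumptions.
class GffSketch {

    static final class Feature {
        final String type;
        final String seqId;
        final int start;
        final int end;
        final Map<String, String> attributes;
        Feature(String type, String seqId, int start, int end, Map<String, String> attributes) {
            this.type = type;
            this.seqId = seqId;
            this.start = start;
            this.end = end;
            this.attributes = attributes;
        }
    }

    /** Parses one line; returns empty for comments, malformed rows and other feature types. */
    static Optional<Feature> parseLine(String line) {
        if (line.isEmpty() || line.startsWith("#")) {
            return Optional.empty();
        }
        String[] cols = line.split("\t");
        if (cols.length != 9) {
            return Optional.empty();
        }
        String type = cols[2];
        if (!type.equals("gene") && !type.equals("mRNA")) {
            return Optional.empty();
        }
        Map<String, String> attributes = new LinkedHashMap<>();
        for (String pair : cols[8].split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                attributes.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        // Integer.parseInt throws NumberFormatException on non-numeric coordinates.
        return Optional.of(new Feature(type, cols[0],
                Integer.parseInt(cols[3]), Integer.parseInt(cols[4]), attributes));
    }

    public static void main(String[] args) {
        // Placeholder row, not real CAP data.
        String row = "2L\tCAP\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=example;submitter=someone";
        parseLine(row).ifPresent(f ->
                System.out.println(f.seqId + ":" + f.start + "-" + f.end + " " + f.attributes));
    }
}
```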


Expression data/Population genomics

• Constructed by Bob McCallum (Imperial)


Ontologies

• Ontology terms are indexed.
• An OBO parser extracts each term in turn.
• Accession, name and description are parsed by default
• Extra fields are parsed depending on the completeness of each term.
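
A minimal sketch of term extraction from OBO flat-file text, assuming the usual layout of [Term] stanzas made of "tag: value" lines. The class and method names are invented for the example. Every tag found in a stanza is kept, which is one simple way to pick up extra fields when a term has them; trailing "!" comments and quoted-value syntax are not interpreted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical OBO stanza reader; it keeps every tag of each [Term] stanza.
class OboSketch {

    /** Parses [Term] stanzas into tag -> values maps, in file order. */
    static List<Map<String, List<String>>> parseTerms(List<String> lines) {
        List<Map<String, List<String>>> terms = new ArrayList<>();
        Map<String, List<String>> current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith("[")) {
                // A new stanza starts; only [Term] stanzas are collected.
                current = line.equals("[Term]") ? new LinkedHashMap<>() : null;
                if (current != null) {
                    terms.add(current);
                }
            } else if (current != null && !line.isEmpty() && !line.startsWith("!")) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    String tag = line.substring(0, colon).trim();
                    String value = line.substring(colon + 1).trim();
                    current.computeIfAbsent(tag, k -> new ArrayList<>()).add(value);
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Placeholder term, not taken from a real ontology.
        List<String> lines = List.of(
                "format-version: 1.2",
                "",
                "[Term]",
                "id: EX:0000001",
                "name: example term",
                "def: \"An example definition.\" []");
        parseTerms(lines).forEach(System.out::println);
    }
}
```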


SOAP interface

• 2 procedures: getNbOfResults, getResults (see wiki)
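
The slide names the two procedures but not their signatures, so the sketch below is a guess at their shape: a count call and a paged result call, both scoped to a search domain. The parameter lists, the SearchService interface and the in-memory implementation are all assumptions for illustration; the actual contract is the one documented on the wiki.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical shape of the two procedures; the real signatures are on the wiki.
class SearchServiceSketch {

    interface SearchService {
        int getNbOfResults(String domain, String query);
        List<String> getResults(String domain, String query, int start, int size);
    }

    /** In-memory stand-in that matches by case-insensitive substring. */
    static final class InMemorySearchService implements SearchService {
        private final Map<String, List<String>> entriesByDomain;

        InMemorySearchService(Map<String, List<String>> entriesByDomain) {
            this.entriesByDomain = entriesByDomain;
        }

        private List<String> matches(String domain, String query) {
            String needle = query.toLowerCase(Locale.ROOT);
            List<String> hits = new ArrayList<>();
            for (String entry : entriesByDomain.getOrDefault(domain, List.of())) {
                if (entry.toLowerCase(Locale.ROOT).contains(needle)) {
                    hits.add(entry);
                }
            }
            return hits;
        }

        public int getNbOfResults(String domain, String query) {
            return matches(domain, query).size();
        }

        public List<String> getResults(String domain, String query, int start, int size) {
            List<String> hits = matches(domain, query);
            // Clamp the requested page to the available range.
            int from = Math.min(Math.max(start, 0), hits.size());
            int to = (int) Math.min((long) from + Math.max(size, 0), hits.size());
            return hits.subList(from, to);
        }
    }

    public static void main(String[] args) {
        SearchService service = new InMemorySearchService(Map.of(
                "genome", List.of("odorant receptor 1", "odorant receptor 2", "kinase"),
                "ontology", List.of("odorant binding")));
        System.out.println(service.getNbOfResults("genome", "odorant"));
        System.out.println(service.getResults("genome", "odorant", 0, 1));
    }
}
```

Separating the count from the result page lets a client show totals first and fetch only the rows it displays.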


To do list

• Front-end:
  • All domains should be queried to produce an ‘Entrez’-like page.
  • So, search all by default and display a count per domain
  • Could be a very simple result page (see next slide for mock-up)
• Updates:
  • We could update some of the domains more frequently
  • CAP is a good candidate.
• Other technologies:
  • Other technologies can be used
  • Auto-completion
  • SOLR


Result page

Genome (1693)

Expression (3693)

Ontology (70)

Population (30)