biosolr - searching the stuff of life - lucene/solr revolution 2015

38
OCTOBER 13-16, 2016 AUSTIN, TX

Upload: charlie-hull

Post on 15-Apr-2017

1.633 views

Category:

Software


3 download

TRANSCRIPT

Page 1: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

OCTOBER 13-16, 2016 • AUSTIN, TX

Page 2: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

Searching the Stuff of Life: BioSolrMatt Pearce & Alan Woodward

Senior Developers, Flax – www.flax.co.uk

Page 3: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

•Building open source search applications since 2001•Independent, honest advice and analysis•Expert design & development, Apache Solr committers•UK Authorized Partner of•Test-driven relevancy and performance tuning•Custom training & mentoring

02

Page 4: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

02

Page 5: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01

•The European Bioinformatics Institute•Part of the European Molecular Biology Laboratory

•Based on the Wellcome Genome Campus in Hinxton, Cambridge, UK.

•Maintains the world’s most comprehensive range of freely available and up-to-date molecular databases, serving millions of researchers – indexing over 1 billion items.

•BioSolr project involves two teams from EMBL-EBI:•Protein Data Bank in Europe (PDBe)•Samples, Phenotypes and Ontologies (SPOt)

Page 6: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

02The genesis of BioSolr

•Grant Ingersoll visits the Wellcome Campus in July ’13•Around 90 people attend•Show of hands indicates 75% using Lucene/Solr•Sameer Velankar of EMBL-EBI identifies grant funding•Flax and EMBL-EBI apply successfully to the BBSRC

Page 7: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

03BioSolr•One year BBSRC funded project from September 2014

•“to significantly advance the state of the art with regard to indexing and querying biomedical data with freely available open source software”

•Outputs:•Workshops•Papers & presentations•Software (Open Source, of course!)•Documentation

•Inputs: from the PDBe & SPOt teams

Page 8: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01BioSolr

•Tom Winch•Working on site with Sameer Velankar & the PDBe team•Facet.contains, Xjoin, Federated Search

•Matt Pearce•Working on site with Tony Burdett & the SPOt team•Indexing ontologies

Page 9: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01BioSolr & PDBe - Introduction

•Protein Data Bank (PDBe)

•facet.contains (autosuggest)•https://issues.apache.org/jira/browse/SOLR-1387 •In Solr 5.1

•Xjoin (searching external sources)•https://issues.apache.org/jira/browse/SOLR-7341

•Federated search

Page 10: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Xjoin concepts

•The problem – you have data in an external data store which is not suitable for indexing in Solr•The data may be from a live source, for example.

•You need to match data from your search results against data from one or more of these external sources for display or analysis.

Page 11: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Xjoin implementation•Xjoin is implemented as a Solr search component.•There should be one configured instance per external source.

•XJoinResultsFactory interface defines the search behaviour:•Communicates with the external source to carry out the query•Configured in solrconfig.xml, along with presets•Returns results as an XjoinResults object•Results are keyed by a string ID, defined in the configuration

•User is required to provide the implementation of this interface for each external source being used

Page 12: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Xjoin configuration

•Configure the XJoinSearchComponent in solrconfig.xml with details of your XJoinResultsFactory implementations

•Add the search component to the search request handler•Needs to be in both first-components and last-components sections.

Page 13: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Xjoin results handling

•Uses a query parser based on TermsQParserPlugin•Allows the same methods as TermsQParserPlugin•Enables the use of multiple external sources, joined using Boolean operators.

•Results are returned in a separate block•Similar to how highlights are returned.

Page 14: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Federated search - introduction

•Problem: we need to search data sets split across multiple locations (even different countries)•Records may contain different fields.

•Similar to pre-SolrCloud distributed search

Page 15: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Federated search challenges: result counts

•The same document may appear in more than one shard, so they need to be aggregated.•Also applies to facet counts.

One solution:•Shards return all document IDs rather than the number found.•Aggregator builds a set of unique documents from the returned results.•Simple for small result sets but inefficient for large sets.•Estimate the result count, using statistical methods:•If two shards always return similar counts, overlap likely to be high;•If they don’t, overlap will be small so dataset can be treated as independent and number found added to total.

Page 16: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Federated search challenges: merging document sets

•Problem: documents are not unique across data sets.•Default behaviour is to use the first instance of a document, ignoring others.•Datasets may contain different fields – we need all versions.

One solution:•Use a custom MergeStrategy to build the ID list.•Cannot use a Grouper – prevents result grouping in the query.

•Scoring is also a challenge!

Page 17: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Federated search challenges: merging document data

•Problem: document data may not be the same across data sets.

A solution:•Merge documents together into a single, composite document.•Potentially use an aggregation schema describing merge process.•Merge strategy must be capable of merging disparate field types.

Page 18: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01BioSolr & SPOt – Indexing ontologies

Washington, N. & Lewis, S. (2008) Ontologies: Scientific Data Sharing Made Easy. Nature Education 1(3):5

Page 19: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Indexing ontologies – the problem

•You have a collection of documents annotated with ontology references.

•You want to search both the documents and the associated ontology data.•This may include associated nodes – “has location”, “is part of”, etc.

•Faceting the ontology references would be nice!•(especially if the facets can be presented in a tree)

Page 20: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Approach 1

Keep the data separate

documents

Documents

Indexer

Documents

Indexer

ontology

Ontology

Indexer

Page 21: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Approach 1 - steps

•Index the documents with the node annotations but no further detail.

•Index the ontology in its own core.

•Search the documents, then cross-match against the ontology.

•BUT requires multiple calls, doesn’t allow searching both cores at the same time.

Page 22: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Approach 2

Add some ontology data to your documents

Documents

Indexer Ontology

documents

Page 23: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Approach 2 – step 1

•Index the documents with the node annotations.

•While indexing, look up the node references, with their labels and synonyms.

•Easier to include the ontology references in your search.

•Can boost fields as required.

•Faster to search

Page 24: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Approach 2 – step 2

•Expand the ontology data being stored.•Include single (or multi) level parent and child nodes, with their labels.

•Use dynamic fields to store additional relationships.

•Dynamic fields allow searches across specific relationship types (“is part of”, “has location”, etc.)

•BUT requires additional Solr look-ups to be fully dynamic •(using /admin/luke to look through the current schema for dynamic fields).

Page 25: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Approach 2 – search screen

•General search box•Options to include child and parent labels (one step removed)

•Dynamically-generated additional relationship search options

Page 26: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Approach 2 – search results

Page 27: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Approach 2 – final result

•We have developed an UpdateProcessor to do this as part of the update chain.

•The user defines the ontology location (file, URL) and the field to reference.

•Aims for convention over configuration for remaining properties.•Field names and ontology annotation details all customisable.

•A similar plugin for ElasticSearch has also been developed.

Page 28: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Aside: facet trees

•Additional search component to return facets from ontology references in tree form.

•Extends FacetComponent•Takes initial facets from results, searches hierarchical references to build the tree during facet generation.

•Avoids multiple calls to Solr from client-side to build the tree.

•A second collection can be used to search the ontologies...•… BUT it must be part of the same Solr instance

Page 29: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Aside: facet trees - challenges

•Nodes with multiple parents may appear more than once…•… not yet found a solution for this!

•Default behaviour is to return entire tree.•Not always useful – may need to drill through several layers to get to useful entries.

•Solution: prune the tree!

Page 30: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Aside: facet trees - pruning

•Multiple pruning options available:•Simple pruning: remove layers with no useful information.

•Datapoint pruning: return the highest counts at top-level, remainder in “Other” section.

Page 31: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Approach 3

•Search the ontology, and cross-match with the documents.

•Allow SPARQL queries over the ontology index.•Enables complex searches over ontology relationships.•(SPARQL is a semantic query language)

Page 32: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Approach 3

Page 33: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Adding Apache Jena

•We use Apache Jena to provide TDB-querying with SPARQL.

•Jena uses Solr to search specified text fields – set via configuration files.•Uses its own Triple Store for other fields and relationship queries.•Return the reference URI in the returned fields to cross-match with documents.

•Use a filter query to choose the matched documents.•Not that different from Xjoin!

Page 34: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Generic ontology indexer

•Stand-alone application to index different ontologies•Allows separate configuration for each ontology•Plugins may be used:•At item level, to external data some/all nodes;•At ontology level, to push the ontology into MongoDB, etc.

•Cross-pollination with the EBI's Ontology Lookup Service site (currently in beta)

Page 35: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Conclusions – what have we achieved?

•Searching across multiple external datasets (xjoin)•See SOLR-7341 (and please up-vote!)

•Searching across multiple Solr nodes across different campuses.•Indexing ontologies – both with and without document data.•Enriching documents with ontology data.•(…and facet trees)

Page 36: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Get Involved!

•Check out the github page: https://github.com/flaxsearch/BioSolr

•Vote for Xjoin: https://issues.apache.org/jira/browse/SOLR-7341

Page 37: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Upcoming events

•7th December: SWAT4LS tutorial session in Cambridge, UK

•Will cover ontology indexing and search

•http://www.swat4ls.org/workshops/cambridge2015/

•3rd/4th February: Workshop at the EBI campus, Hinxton, UK

•http://www.ebi.ac.uk/pdbe/about/events/open-source-search-bioinformatics

Page 38: BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

01Thank you for listening!

Matt Pearce

[email protected]

www.flax.co.uk/blog

+44 (0) 8700 118334

Twitter: @flaxsearch