US UGM 2014 - Kevin Clark (Genentech): Searching project team documents with Document to Database...

Post on 11-Jul-2015

Kevin P. Clark, Ph.D.

ChemAxon UGM, Cambridge, MA

September 2014

Searching Project Team Documents with D2DB

Outline

• Use case: search and mine project team documents

• Text searching using Apache SOLR™

• Structure searching using D2DB

• Conclusion

Use Case: Searching and mining project team documents

• Project teams generate numerous documents

• Project team reviews

• Target Candidate Profile

• Regular medicinal chemistry design sessions

• Presentations and reports

• HTS and Fragment screening

• In vivo reports (e.g., safety and PK/PD)

• Computational chemistry presentations and docking ideas

• Structural biology presentations

• Diagnostics and biomarker reports

• Publications

• Most documents are generated by the distributed project team

Google Drive to manage project team documents

• Rationale

• Ease of use

• No workflow functionality required

• Access to our partners and CROs

• Versioning and simultaneous editing of Google native documents

• Much of the administration in the hands of the project teams

• Organization and structure

• Access permissions

• Shortcomings

• No wildcard or substring search

• Users have a difficult time finding documents

• Today over 84K project team documents

SOLR provides text searching of project team documents

• Open source enterprise search platform from Apache Lucene™

• Full-text search

• Faceted search

• Hit highlighting

• Wildcard searches

• Regular expression searches

• Proximity searches

• Fuzzy searches

• Documents from Google Drive are copied to a file system inside our firewall every 30 minutes

• Security details

• SOLR servers restrict access by Internet Protocol (IP) address

• LucidWorks implemented the LDAP integration for authorization
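
The search types listed above map onto the standard Lucene query syntax that SOLR accepts. The following is a minimal sketch of how such requests could be assembled, not the application described in the talk. The host, the core name (project_docs), and the field names (project, file_type, content) are assumptions made for illustration only.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical endpoint: host and core name are NOT from the talk.
SOLR_SELECT = "http://localhost:8983/solr/project_docs/select"

# One example per search type, written in Lucene query syntax.
EXAMPLE_QUERIES = {
    "wildcard": "inhib*",                           # zero or more trailing characters
    "regex": "/inhibit(or|ion)/",                   # regular expression between slashes
    "proximity": '"time dependent inhibition"~3',   # phrase terms within 3 positions
    "fuzzy": "oseltamivir~2",                       # up to 2 edits away from the term
}


def build_search_url(query, facet_fields=("project", "file_type"),
                     highlight_field="content", rows=10):
    """Build a select URL with faceting and hit highlighting turned on.

    The facet and highlight field names are hypothetical placeholders.
    """
    params = [
        ("q", query),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
        ("hl", "true"),
        ("hl.fl", highlight_field),
    ]
    params += [("facet.field", name) for name in facet_fields]
    return SOLR_SELECT + "?" + urlencode(params)


if __name__ == "__main__":
    for name, query in EXAMPLE_QUERIES.items():
        url = build_search_url(query)
        # Round-trip the URL to confirm the query survived encoding.
        assert parse_qs(urlsplit(url).query)["q"] == [query]
        print(name, url)
```

Wildcard, regular expression, and fuzzy queries are matched against indexed terms, so the results depend on how the text field is analyzed (for example, whether it is lowercased).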

Text search input page

SOLR text search result page

Introducing structure search with D2DB

• SOLR search application allows users to:

• Find documents by full text search

• Facet the results for narrowing the hit list

• Wildcard and regular expression search

• Proximity and fuzzy searches

• With text-based search

• Partial corporate identifiers (G*1234)

• Partial corporate identifiers for a project (“Project A” AND G*1234)

• HTS hit follow up for “Project A”

• How did other teams handle time dependent inhibition (TDI)?

• Structure searching examples

• Finding documents by structure without corporate identifier

• Find all documents with a particular substructure
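
The partial-identifier examples above can be sketched as follows. This is an illustration only: the identifiers in the candidate list are made up, and Python's fnmatch is used merely to emulate what the * wildcard would accept, not to reproduce how SOLR evaluates the query. Lucene query syntax expects straight double quotes around a phrase.

```python
from fnmatch import fnmatchcase


def phrase(text):
    """Wrap text in straight double quotes so it is searched as a phrase."""
    return '"' + text.replace('"', '\\"') + '"'


def and_query(*clauses):
    """Join clauses with the Lucene AND operator."""
    return " AND ".join(clauses)


# Query strings corresponding to the two partial-identifier examples.
partial_id = "G*1234"
project_query = and_query(phrase("Project A"), partial_id)

# Made-up identifiers, used only to show which strings the wildcard accepts.
candidates = ["G00001234", "G1234", "G00012345", "X00001234"]
matches = [c for c in candidates if fnmatchcase(c, partial_id)]

if __name__ == "__main__":
    print(project_query)
    print(matches)
```

The wildcard accepts any run of characters, including none, between the leading G and the trailing 1234, which is why a partial identifier is enough to locate a document.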

Extracting chemical information from documents with D2DB

• ChemAxon’s Naming Technology

• IUPAC names

• Common names

• Drug trade names

• SMILES

• InChI

• CAS registry numbers

• Embedded structures

• ChemDraw

• SymyxDraw

• MarvinSketch

• Optical structure recognition

• OSRA

• CLiDE

• Imago
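
To give a feel for what recognizing one of these identifier types in free text involves, here is a minimal sketch that finds CAS Registry Numbers and verifies their check digit. It is not how D2DB or ChemAxon's naming technology works internally, which the talk does not describe. The function names are hypothetical, and the sample text is invented.

```python
import re

# A CAS Registry Number has 2-7 digits, then 2 digits, then 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def is_valid_cas(candidate):
    """Return True if the string is a well-formed CAS number with a correct check digit."""
    match = CAS_PATTERN.fullmatch(candidate)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10 == check_digit


def find_cas_numbers(text):
    """Return all substrings that look like CAS numbers and pass the checksum."""
    return [m.group(0) for m in CAS_PATTERN.finditer(text)
            if is_valid_cas(m.group(0))]


if __name__ == "__main__":
    sample = "Water (7732-18-5), caffeine (58-08-2), and a typo: 7732-18-4."
    print(find_cas_numbers(sample))
```

The checksum filters out most digit groups that merely resemble a CAS number, such as the mistyped final entry in the sample text.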

Configuring and running D2DB

• Edit the configuration properties file

• Specify the database parameters

• db.type = oracle

• db.host = orcl.gene.com

• db.port = 1521

• db.name = orcl.gene.com

• db.username = scott

• db.password = tiger

• Specify other options such as

• d2s.options = -osra

• d2db.threads = 16

• d2db.structure_table.format = mol

• Run d2db from command line

• ./d2db d2db.conf create

• ./d2db d2db.conf index <path>
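
The steps above can be sketched as a small script that writes the properties file and assembles the two command lines. The property keys, their values, and the d2db invocations are copied from the slide (scott/tiger are the slide's placeholder credentials). The helper names, the temporary directory, and the documents path /data/project_docs are assumptions for illustration. The script only prints the commands; it does not run d2db.

```python
import tempfile
from pathlib import Path

# Keys and values as shown on the slide.
SETTINGS = {
    "db.type": "oracle",
    "db.host": "orcl.gene.com",
    "db.port": "1521",
    "db.name": "orcl.gene.com",
    "db.username": "scott",
    "db.password": "tiger",
    "d2s.options": "-osra",
    "d2db.threads": "16",
    "d2db.structure_table.format": "mol",
}


def render_properties(settings):
    """Render the settings as 'key = value' lines, one per property."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items()) + "\n"


def d2db_commands(conf_path, documents_path):
    """Return the create and index command lines as argument lists."""
    return [
        ["./d2db", str(conf_path), "create"],
        ["./d2db", str(conf_path), "index", str(documents_path)],
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        conf_path = Path(workdir) / "d2db.conf"
        conf_path.write_text(render_properties(SETTINGS))
        print(conf_path.read_text())
    # Hypothetical documents directory; substitute the real path.
    for command in d2db_commands("d2db.conf", "/data/project_docs"):
        print(" ".join(command))
```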

Integration of D2DB using ChemAxon’s technology

• Marvin for JS

• Google dropping support for NPAPI in Chrome

• Replacing ChemDraw plug-in with Marvin for JS

• ChemDraw 14 supports copy/paste molfile

• JChem cartridge for structure searching

• Successfully migrated from Accord to JChem cartridge

• Performance improvement

• D2DB

• Naming technology

• Embedded chemistry

• Optical structure recognition

Embedded ChemDraw structure for Oseltamivir (Tamiflu)

Exact structure search for Oseltamivir (Tamiflu)

Results page for the structure searching

Extending D2DB to recognize corporate identifiers

• Many of our documents contain references to Gnumbers, our corporate identifiers

• Working with Daniel Bonniot, D2DB now adds structures for corporate identifiers

• Configuration file changes
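
The general idea of resolving corporate identifiers to structures can be illustrated with a short sketch. Everything specific here is an assumption: the talk does not give the Gnumber format or the D2DB configuration changes, so the pattern (G followed by eight digits), the in-memory registry, the identifiers, and the placeholder SMILES are all hypothetical and stand in for a lookup against a real registration system.

```python
import re

# Hypothetical identifier format: 'G' followed by exactly eight digits.
GNUMBER_PATTERN = re.compile(r"\bG\d{8}\b")

# Made-up registry mapping identifiers to placeholder SMILES strings.
REGISTRY = {
    "G00001234": "CCO",
    "G00005678": "c1ccccc1",
}


def structures_for_identifiers(text, registry):
    """Map each registered identifier found in the text to its structure."""
    found = {}
    for identifier in GNUMBER_PATTERN.findall(text):
        smiles = registry.get(identifier)
        if smiles is not None:  # skip identifiers that are not registered
            found[identifier] = smiles
    return found


if __name__ == "__main__":
    sample = "Compare G00001234 with G00005678; G00009999 is not registered."
    print(structures_for_identifiers(sample, REGISTRY))
```

Once an identifier is resolved to a structure, a document that mentions only the identifier becomes reachable by structure search.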

Conclusion

• Indexing results

• 79K out of 84K documents have been indexed

• 5K documents failed due to a few common exceptions

• Extracted over 150K structures, including 30K from Gnumbers

• D2DB conclusion

• Easy to configure and run

• ChemAxon (Daniel Bonniot) was very responsive to requests and questions

• Enabled structure searching of project team documents

• Future directions

• Collaborate with ChemAxon to resolve exceptions

• Evaluate CLiDE

• Combine text and structure searching across project team documents
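
For a sense of scale, the indexing results reported above can be turned into rates. The counts are the rounded figures from the slide, so the percentages are approximate, and since the structure total is given as "over 150K", the Gnumber share is an upper estimate.

```python
# Rounded counts as reported on the slide.
total_documents = 84_000
indexed_documents = 79_000
failed_documents = 5_000
extracted_structures = 150_000   # reported as "over 150K"
gnumber_structures = 30_000

indexed_pct = 100 * indexed_documents / total_documents
failed_pct = 100 * failed_documents / total_documents
gnumber_pct = 100 * gnumber_structures / extracted_structures

if __name__ == "__main__":
    print(f"Indexed: {indexed_pct:.1f}% of documents")
    print(f"Failed: {failed_pct:.1f}% of documents")
    print(f"Structures from Gnumbers: about {gnumber_pct:.0f}%")
```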
