usugm 2014 - kevin clark (genentech): searching project team documents with document to database...

Kevin P. Clark, Ph.D.

ChemAxon UGM, Cambridge, MA

September 2014

Searching Project Team Documents with D2DB

Outline

• Use case: search and mine project team documents

• Text searching using Apache SOLR™

• Structure searching using D2DB

• Conclusion

Use Case: Searching and mining project team documents

• Project teams generate numerous documents

• Project team reviews

• Target Candidate Profile

• Regular medicinal chemistry design session

• Presentations and reports

• HTS and Fragment screening

• In vivo reports (e.g., safety and PK/PD)

• Computational chemistry presentations and docking ideas

• Structural biology presentations

• Diagnostics and biomarker reports

• Publications

• Most documents are generated by the distributed project team

Google Drive to manage project team documents

• Rationale

• Ease of use

• No workflow functionality required

• Access to our partners and CROs

• Versioning and simultaneous editing of Google native documents

• Much of the administration is in the hands of the project teams

• Organization and structure

• Access permissions

• Shortcomings

• No wildcard or substring search

• Users have a difficult time finding documents

• Today over 84K project team documents

SOLR provides text searching of project team documents

• Open source enterprise search platform from Apache Lucene™

• Full-text search

• Faceted search

• Hit highlighting

• Wildcard searches

• Regular expression searches

• Proximity searches

• Fuzzy searches

• Documents from Google Drive are copied to a file system inside our firewall every 30 minutes

• Security details

• SOLR servers restrict access by Internet Protocol (IP) address

• LucidWorks implemented the LDAP integration for authorization
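The query types listed above can be sketched as request URLs against a SOLR select handler. This is a minimal illustration, not the application described in the talk: the endpoint URL and the field names `project` and `content` are assumptions, and the query strings are made-up examples in Lucene query-parser syntax.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the slides do not give the real server or core name.
SOLR_SELECT_URL = "http://localhost:8983/solr/project_docs/select"

# One illustrative query per search type, in Lucene query-parser syntax.
EXAMPLE_QUERIES = {
    "wildcard": "G*1234",                          # any characters between G and 1234
    "regex": "/G[0-9]+1234/",                      # regular expression between slashes
    "proximity": '"time dependent inhibition"~5',  # phrase terms within 5 positions
    "fuzzy": "oseltamivir~2",                      # up to 2 edits away from the term
}


def build_search_url(query, facet_field="project", highlight_field="content", rows=10):
    """Build a select URL with faceting and hit highlighting switched on.

    The field names are assumptions made for this sketch.
    """
    params = [
        ("q", query),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", facet_field),
        ("hl", "true"),
        ("hl.fl", highlight_field),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    for kind, query in EXAMPLE_QUERIES.items():
        print(kind, build_search_url(query))
```

Building the parameters as a list of pairs keeps the URL encoding in one place, so quotes, slashes, and wildcards in the query are escaped consistently.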

System

Text search input page

SOLR text search result page

Introducing structure search with D2DB

• SOLR search application allows users to:

• Find documents by full text search

• Facet the results for narrowing the hit list

• Wildcard and regular expression search

• Proximity and fuzzy searches

• With text based search

• Partial corporate identifiers (G*1234)

• Partial corporate identifiers for a project (“Project A” AND G*1234)

• HTS hit follow up for “Project A”

• How did other teams handle time dependent inhibition (TDI)?

• Structure searching examples

• Finding documents by structure without corporate identifier

• Find all documents with a particular substructure
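To make the partial-identifier search concrete, the sketch below shows which strings a pattern such as G*1234 would match. It uses Python's shell-style wildcard matching only to illustrate the semantics; it is not how SOLR evaluates the query, and the sample identifiers are invented because the slides do not give the real Gnumber format.

```python
from fnmatch import fnmatchcase


def filter_identifiers(identifiers, pattern):
    """Return the identifiers that match a shell-style wildcard pattern."""
    return [ident for ident in identifiers if fnmatchcase(ident, pattern)]


# Made-up identifiers for illustration only.
sample = ["G00001234", "G00991234", "G00001235", "X00001234"]
matches = filter_identifiers(sample, "G*1234")
print(matches)  # ['G00001234', 'G00991234']
```

A text search of this kind still requires the identifier to appear in the document, which is the gap that structure searching is meant to close.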

Extracting chemical information from documents with D2DB

• ChemAxon’s Naming Technology

• IUPAC names

• Common names

• Drug trade names

• SMILES

• InChI

• CAS registry numbers

• Embedded structures

• ChemDraw

• SymyxDraw

• MarvinSketch

• Optical structure recognition

• OSRA

• CLiDE

• Imago

Configuring and running D2DB

• Edit the configuration properties file

• Specify the database parameters

• db.type = oracle

• db.host = orcl.gene.com

• db.port = 1521

• db.name = orcl.gene.com

• db.username = scott

• db.password = tiger

• Specify other options such as

• d2s.options = -osra

• d2db.threads = 16

• d2db.structure_table.format = mol

• Run d2db from command line

• ./d2db d2db.conf create

• ./d2db d2db.conf index <path>
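The settings and commands above can be collected into a small helper that renders the configuration text and lists the two invocations. This is a sketch under stated assumptions: the keys, values, and command forms are copied from the slide (the credentials are the slide's placeholders), the `key = value` layout follows how the slide prints them, and the documents path stays a parameter because the slide only shows `<path>`. The commands are returned as argument lists and are not executed.

```python
# Keys and values as shown on the slide.
D2DB_SETTINGS = {
    "db.type": "oracle",
    "db.host": "orcl.gene.com",
    "db.port": "1521",
    "db.name": "orcl.gene.com",
    "db.username": "scott",
    "db.password": "tiger",
    "d2s.options": "-osra",
    "d2db.threads": "16",
    "d2db.structure_table.format": "mol",
}


def render_config(settings=D2DB_SETTINGS):
    """Render the settings as 'key = value' lines, one setting per line."""
    return "".join(f"{key} = {value}\n" for key, value in settings.items())


def d2db_commands(config_path, documents_path):
    """Return the create and index command lines as argument lists."""
    return [
        ["./d2db", str(config_path), "create"],
        ["./d2db", str(config_path), "index", str(documents_path)],
    ]


if __name__ == "__main__":
    print(render_config(), end="")
    # "docs" is a hypothetical directory standing in for <path>.
    for argv in d2db_commands("d2db.conf", "docs"):
        print(" ".join(argv))
```

Keeping the commands as argument lists means they can be passed to `subprocess.run` without shell quoting problems if the helper is later used to launch the indexer.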

Integration of D2DB using ChemAxon’s technology

• Marvin for JS

• Google dropping support for NPAPI in Chrome

• Replacing the ChemDraw plug-in with Marvin for JS

• ChemDraw 14 supports copy/paste of molfiles

• JChem cartridge for structure searching

• Successfully migrated from Accord to JChem cartridge

• Performance improvement

• D2DB

• Naming technology

• Embedded chemistry

• Optical structure recognition

Embedded ChemDraw structure for Oseltamivir (Tamiflu)

Exact structure search for Oseltamivir (Tamiflu)

Results page for the structure searching

Extending D2DB to recognize corporate identifiers

• Many of our documents contain references to Gnumbers, our corporate identifiers

• Working with Daniel Bonniot, D2DB now adds structures for corporate identifiers

• Configuration file changes

Conclusion

• Indexing results

• 79K out of 84K documents (about 94%) have been indexed

• 5K documents failed due to a few common exceptions

• Extracted over 150K structures with 30K from Gnumbers

• D2DB conclusion

• Easy to configure and run

• ChemAxon (Daniel Bonniot) was very responsive to requests and questions

• Enabled structure searching of project team documents

• Future directions

• Collaborate with ChemAxon to resolve exceptions

• Evaluate CLiDE

• Combine text and structure searching across project team documents
