![Page 1: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/1.jpg)
Gregory Landrum, NIBR IT, Novartis Institutes for BioMedical Research, Basel
5th KNIME Users Group Meeting
Zurich, 2 February 2012
KNIME in NIBR: Stories from Industry
Basel, Switzerland
![Page 2: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/2.jpg)
KNIME in NIBR
§ Infrastructure
§ Node development
• Open-source & in-house
• Sponsored
§ Examples
![Page 3: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/3.jpg)
Infrastructure
§ Enterprise servers + cluster integration running in Cambridge, Basel
§ Standard releases for Windows, Linux, Mac
§ Nightly builds for users comfortable on the bleeding/leading edge
![Page 4: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/4.jpg)
Node development: open source
§ Chemistry nodes based on the RDKit
• open-source cheminformatics toolkit
• usable from C++, Python, Java
• NIBR scientists/developers actively participate
• www.rdkit.org
§ Standard cheminformatics tasks + some nice extras
§ Developed both in-house and together with knime.com
![Page 5: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/5.jpg)
Node development: in-house
§ Connections to internal data sources
§ Wrappers around in-house developed algorithms
§ Connection to our web service framework for cheminformatics services
![Page 6: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/6.jpg)
Generic CIx service node
![Page 7: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/7.jpg)
Sponsored node development
§ Modifications to naïve Bayes nodes to support fingerprints
§ Fingerprint naïve Bayes supporting unbalanced datasets
§ Database schema browser
§ Improvements to Python integration
§ Improvements to database connector, readers
§ Ensemble tree classifier (in progress)
![Page 8: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/8.jpg)
Case studies
![Page 9: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/9.jpg)
Combining databases
§ Question: what kind of activity might I expect to see for a given compound?
§ Do a similarity search in our database of internal compounds
§ Look up assays where those compounds have been tested
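The two-step lookup above can be sketched in plain Python. This is only an illustration: the compound table, assay table, identifiers, set-based fingerprints, and the use of Tanimoto similarity are all assumptions made for the example, not the actual NIBR databases or nodes. The activity cutoff of 8 is used purely as an example threshold.

```python
from collections import defaultdict

# Hypothetical stand-ins for the two internal databases (not real data).
# Fingerprints are modelled as sets of "on" bit indices.
COMPOUNDS = {
    "cmpd-A": {1, 2, 3, 4},
    "cmpd-B": {1, 2, 3, 9},
    "cmpd-C": {7, 8, 9},
}
# (compound id, assay name, activity on a negative-log scale)
ASSAY_RESULTS = [
    ("cmpd-A", "assay-1", 8.5),
    ("cmpd-A", "assay-2", 5.1),
    ("cmpd-B", "assay-1", 8.2),
    ("cmpd-C", "assay-3", 9.0),
]


def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def neighbors(query_fp, min_similarity):
    """Step 1: similarity search over the compound table."""
    hits = [(cid, tanimoto(query_fp, fp)) for cid, fp in COMPOUNDS.items()]
    hits = [h for h in hits if h[1] >= min_similarity]
    return sorted(hits, key=lambda h: -h[1])


def assays_for(neighbor_ids, min_activity=None):
    """Step 2: look up assay results for the neighbors, grouped by assay."""
    by_assay = defaultdict(list)
    for cid, assay, activity in ASSAY_RESULTS:
        if cid in neighbor_ids and (min_activity is None or activity > min_activity):
            by_assay[assay].append((cid, activity))
    return dict(by_assay)


query = {1, 2, 3, 5}
hits = neighbors(query, min_similarity=0.5)
active = assays_for({cid for cid, _ in hits}, min_activity=8)
print(hits)
print(active)
```

Grouping the results by assay rather than by compound makes it easy to see where the neighbors of the query are most active.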
![Page 10: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/10.jpg)
Combining databases
§ More browsing of those results: where are those neighbors most active?
p(Activity) > 8
![Page 11: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/11.jpg)
Combining databases
§ More browsing of those results: show me the most active neighbors
p(Activity) > 8
![Page 12: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/12.jpg)
Parallel virtual screening example
§ Goal: find some interesting compounds to be screened for a new project
§ 2D similarity searches across two databases:
• NIBR powder archive
• Catalogs from trusted vendors
§ About 7 million compounds total.
§ Use several different fingerprints
Finton Sirockin (GDC/CADD)
![Page 13: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/13.jpg)
The basic process
§ Generate fingerprints for database and queries
§ Calculate similarities with the Erlwood Fingerprint Similarity node
§ Sort, filter, standardize
§ Report
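A minimal sketch of the four steps, outside KNIME. Everything here is an assumption for illustration: `toy_fingerprint` hashes character n-grams of a SMILES string and is not a chemically meaningful fingerprint, the molecule names are made up, and Tanimoto is assumed as the similarity measure (the slides do not say which metric the similarity node was configured with).

```python
import zlib

N_BITS = 64  # toy fingerprint length (assumption)


def toy_fingerprint(smiles, n=2):
    """Hash character n-grams of a SMILES string into an integer bit vector."""
    bits = 0
    for i in range(len(smiles) - n + 1):
        bits |= 1 << (zlib.crc32(smiles[i:i + n].encode()) % N_BITS)
    return bits


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto similarity between two integer bit vectors."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def screen(query_smiles, database, min_similarity=0.0, top_k=2):
    """Generate fingerprints, score, sort, filter, and return the top hits."""
    q = toy_fingerprint(query_smiles)
    scored = [(tanimoto(q, toy_fingerprint(s)), name) for name, s in database.items()]
    scored = [hit for hit in scored if hit[0] >= min_similarity]
    scored.sort(key=lambda hit: (-hit[0], hit[1]))  # best first, ties by name
    return scored[:top_k]


# Hypothetical three-compound "database".
database = {"ethanol": "CCO", "ethylamine": "CCN", "benzene": "c1ccccc1"}
report = screen("CCO", database)
for score, name in report:
    print(f"{name}\t{score:.2f}")
```

In a real workflow the fingerprint function would be replaced by a proper cheminformatics fingerprint; the sort, filter, and report steps keep the same shape.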
![Page 14: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/14.jpg)
Combining the pieces
• Workflow is run for each query
• Fingerprints calculated for each type of search
- 600 – 11 000 s
- Needs to be calculated only once, even for n queries
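The point about calculating database fingerprints only once can be illustrated with a small sketch. The fingerprint function, the two fingerprint "kinds", and the data are hypothetical placeholders; a counter stands in for the expensive calculation so the saving is visible.

```python
calls = {"n": 0}  # counts how often the (pretend) slow step runs


def expensive_fingerprint(structure, kind):
    """Placeholder for a slow fingerprint calculation (hypothetical)."""
    calls["n"] += 1
    size = {"short": 2, "long": 3}[kind]
    return frozenset(structure[i:i + size] for i in range(len(structure) - size + 1))


def tanimoto(a, b):
    """Tanimoto similarity between two sets of features."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def build_index(database, kinds):
    """Fingerprint the whole database once per fingerprint type."""
    return {
        kind: {name: expensive_fingerprint(s, kind) for name, s in database.items()}
        for kind in kinds
    }


def best_hit(query, index):
    """Per query, only the query itself is fingerprinted; the index is reused."""
    hits = {}
    for kind, fps in index.items():
        q = expensive_fingerprint(query, kind)
        hits[kind] = max(fps, key=lambda name: tanimoto(q, fps[name]))
    return hits


database = {"d1": "CCOC", "d2": "CCNC", "d3": "c1ccccc1"}
queries = ["CCOC", "CCNC", "CCO", "c1ccc"]

index = build_index(database, ["short", "long"])    # 3 compounds x 2 kinds
results = [best_hit(q, index) for q in queries]     # 4 queries x 2 kinds
print(calls["n"], results)
```

With the index reused, the slow step runs once per database compound and once per query for each fingerprint type, instead of once per database compound for every query.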
![Page 15: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/15.jpg)
Cluster usage reporting
§ Present a dashboard with a comprehensive view of current and historical usage of our HPC cluster infrastructure
§ Three phases of processing:
• Input from raw SGE files off the clusters at each site
• Steps A–C: data pre-processing, filtering & date-time object conversion
- All logs are gathered into a single file kept in RAM
- Java nodes convert Unix time to KNIME date objects
- Bash nodes run awk manipulations, which are faster natively on Linux
• Steps D–I: execute concurrently
- KNIME Statistics and grouping are heavily used
- Step H spawns cluster jobs to gather user usage statistics
§ Present summarized and aggregated data using Spotfire
Mike Derby (NIBR IT) Varun Shivashankar (NIBR IT)
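The pre-processing and grouping steps can be sketched in plain Python. The record layout below (`user:start:end:slots`) is a simplified, hypothetical format invented for this example and is not the real SGE accounting file layout; the user names and numbers are made up.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical, simplified log: user:start (Unix s):end (Unix s):slots
RAW_LOG = """\
alice:1325376000:1325379600:4
bob:1325462400:1325466000:1
alice:1328054400:1328061600:2
"""


def parse_line(line):
    """Split one record and convert Unix timestamps to timezone-aware datetimes."""
    user, start, end, slots = line.split(":")
    start_dt = datetime.fromtimestamp(int(start), tz=timezone.utc)
    end_dt = datetime.fromtimestamp(int(end), tz=timezone.utc)
    return user, start_dt, end_dt, int(slots)


def core_hours_by_user_month(log_text):
    """Group records by (user, month) and sum core-hours."""
    totals = defaultdict(float)
    for line in log_text.splitlines():
        if not line.strip():
            continue
        user, start_dt, end_dt, slots = parse_line(line)
        hours = (end_dt - start_dt).total_seconds() / 3600
        totals[(user, start_dt.strftime("%Y-%m"))] += hours * slots
    return dict(totals)


usage = core_hours_by_user_month(RAW_LOG)
for (user, month), hours in sorted(usage.items()):
    print(f"{user},{month},{hours:.1f}")
```

The printed lines have the shape of a small summary CSV, which is the kind of aggregated output a dashboard tool can poll.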
![Page 16: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/16.jpg)
The workflow
• Usage data input files: original logs, 2 GB – 4 GB in size × 4 clusters
• Resulting file of summarized data: user_usage_DUS.csv, 1.9M
![Page 17: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/17.jpg)
The complexity
![Page 18: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/18.jpg)
The report: historical data
![Page 19: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/19.jpg)
The dashboard
Written out to a UNC path, read every few minutes by Spotfire Server. Data are generated either from scripts or by KNIME running headless.
![Page 20: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/20.jpg)
Predicting which target a molecule will hit
§ Goal: build a model to predict which of a set of targets a molecule is most likely to hit
§ Method: RDKit atom-pair fingerprints and a new KNIME learner that builds ensembles of truncated decision trees (sponsored development with knime.com)
§ Validation data set: active molecules from 50 different ChEMBL assays¹
¹ Heikamp, K. & Bajorath, J. Large-Scale Similarity Search Profiling of ChEMBL Compound Data Sets. J. Chem. Inf. Model. 51, 1831-1839 (2011).
![Page 21: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/21.jpg)
Predicting which target a molecule will hit
§ 11561 data points, 50 classes
§ 50 trees, random descriptor selection
![Page 22: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/22.jpg)
About that scaling…
![Page 23: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/23.jpg)
Predicting which target a molecule will hit
§ 11561 data points, 50 classes
§ 50 trees, random descriptor selection
§ out-of-bag prediction error: 5.8%
§ mean error from cross validation: 4.2%
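For readers unfamiliar with the technique, here is a small pure-Python sketch of an ensemble of depth-limited ("truncated") decision trees with random descriptor selection and an out-of-bag error estimate. It is not the KNIME learner: the Gini split criterion, the square-root feature subset, the depth limit of 3, and the trivially separable toy data are all assumptions chosen to keep the example short.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow_tree(rows, labels, features, max_depth, rng):
    """Depth-limited tree on binary features: a class label or (feature, branch0, branch1)."""
    if max_depth == 0 or len(set(labels)) == 1:
        return majority(labels)
    # Random descriptor selection: only a random subset is tried at each node.
    candidates = rng.sample(features, max(1, int(len(features) ** 0.5)))
    best = None
    for f in candidates:
        left = [i for i, r in enumerate(rows) if r[f] == 0]
        right = [i for i, r in enumerate(rows) if r[f] == 1]
        if not left or not right:
            continue
        score = sum(len(s) * gini([labels[i] for i in s]) for s in (left, right))
        if best is None or score < best[0]:
            best = (score, f, left, right)
    if best is None:
        return majority(labels)
    _, f, left, right = best
    branches = [
        grow_tree([rows[i] for i in s], [labels[i] for i in s], features, max_depth - 1, rng)
        for s in (left, right)
    ]
    return (f, branches[0], branches[1])


def predict_tree(tree, row):
    while isinstance(tree, tuple):
        f, zero_branch, one_branch = tree
        tree = one_branch if row[f] else zero_branch
    return tree


def fit_ensemble(rows, labels, n_trees=50, max_depth=3, seed=0):
    """Bagged trees; each row is scored only by trees that did not train on it."""
    rng = random.Random(seed)
    n = len(rows)
    features = list(range(len(rows[0])))
    trees, oob_votes = [], [Counter() for _ in range(n)]
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        tree = grow_tree([rows[i] for i in sample], [labels[i] for i in sample],
                         features, max_depth, rng)
        trees.append(tree)
        for i in set(range(n)) - set(sample):  # out-of-bag rows for this tree
            oob_votes[i][predict_tree(tree, rows[i])] += 1
    scored = [(v.most_common(1)[0][0], labels[i]) for i, v in enumerate(oob_votes) if v]
    oob_error = sum(p != y for p, y in scored) / len(scored) if scored else float("nan")
    return trees, oob_error


def predict(trees, row):
    return majority([predict_tree(t, row) for t in trees])


# Toy data (assumption): three classes, each switching on its own block of 4 bits.
prototypes = {"A": (1,) * 4 + (0,) * 8, "B": (0,) * 4 + (1,) * 4 + (0,) * 4, "C": (0,) * 8 + (1,) * 4}
rows = [fp for fp in prototypes.values() for _ in range(20)]
labels = [name for name in prototypes for _ in range(20)]

trees, oob_error = fit_ensemble(rows, labels)
print(f"out-of-bag error: {oob_error:.3f}")
```

The out-of-bag estimate comes for free from the bootstrap sampling, which is why it can be reported alongside a separate cross-validation error.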
![Page 24: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/24.jpg)
Predicting which target a molecule will hit
§ mistakes tend to be in families
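One way to check a statement like this is to count confusions by (true, predicted) pair and ask what fraction of the errors stay within a target family. The sketch below uses made-up target names, families, and predictions purely for illustration; none of them come from the actual data set.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count every (true, predicted) pair; off-diagonal pairs are the mistakes."""
    return Counter(zip(true_labels, predicted_labels))


def top_confusions(counts, k=3):
    """Most frequent misclassifications, ties broken alphabetically."""
    errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    return sorted(errors.items(), key=lambda item: (-item[1], item[0]))[:k]


def within_family_fraction(counts, family_of):
    """Fraction of all errors where true and predicted target share a family."""
    errors = [(t, p, n) for (t, p), n in counts.items() if t != p]
    total = sum(n for _, _, n in errors)
    same = sum(n for t, p, n in errors if family_of[t] == family_of[p])
    return same / total if total else 0.0


# Hypothetical targets and families.
family_of = {
    "kinase-1": "kinase", "kinase-2": "kinase",
    "gpcr-1": "gpcr", "gpcr-2": "gpcr",
    "protease-1": "protease",
}
true = ["kinase-1", "kinase-1", "kinase-2", "kinase-2",
        "gpcr-1", "gpcr-1", "gpcr-2", "protease-1"]
pred = ["kinase-1", "kinase-2", "kinase-2", "kinase-1",
        "gpcr-1", "gpcr-2", "gpcr-2", "kinase-1"]

counts = confusion_counts(true, pred)
top = top_confusions(counts, k=2)
fraction = within_family_fraction(counts, family_of)
print(top, fraction)
```

A high within-family fraction suggests the model confuses closely related targets rather than making arbitrary errors.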
![Page 25: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/25.jpg)
Drilling into the confusion matrix
![Page 26: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/26.jpg)
Drilling into the confusion matrix
![Page 27: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/27.jpg)
Drilling into the confusion matrix
![Page 28: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/28.jpg)
Drilling into the confusion matrix
![Page 29: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/29.jpg)
Drilling into the confusion matrix
![Page 30: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/30.jpg)
Drilling into the confusion matrix
![Page 31: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/31.jpg)
Acknowledgements
§ NIBR
• John Davies (CPC)
• Richard Lewis (GDC)
• Steve Litster (NIBR IT)
• Andy Palmer (NIBR IT)
• Patrick Warren (NIBR IT)
• Case studies
- Finton Sirockin (GDC)
- Mike Derby (NIBR IT)
- Varun Shivashankar (NIBR IT)
- John Davies (CPC)
• Node development
- Manuel Schwarze (NIBR IT)
- Dillip Kumar Mohanty (NIBR IT)
- Sudip Ghosh (NIBR IT)
• Marc Litherland (NIBR IT)
§ knime.com
• Michael Berthold
• Bernd Wiswedel
• Thorsten Meinl
• Peter Ohl
§ Simon Richards (Lilly)
![Page 32: KNIME in NIBR Stories from Industry](https://reader036.vdocuments.mx/reader036/viewer/2022071613/61571f6976ee6d48c051b862/html5/thumbnails/32.jpg)
Teach • Discover • Treat
the power of collaborative efforts
Join the Teach-Discover-Treat initiative: participate in our symposium* and compete on one or more challenges!
*ACS Spring Meeting, March 25th, 1:30pm to 5:00pm, San Diego Convention Center, Room 26A
Goal: Provide high-quality computational chemistry tutorials that impact education and drug discovery for neglected diseases
§ Requirements: use freely available software tools; datasets will be provided with a focus on targets for neglected diseases
§ Criteria to judge: quality of the model (statistical measures), clarity of the tutorial (suitable for an undergraduate course), innovative application of computational technique(s)
§ Awards: travel awards to cover travel expenses for presenting work at the COMP symposium
§ Presentation of awardees at the ACS Spring 2013 meeting (New Orleans)
More information and access to data sets coming in March. Bookmark www.teach-discover-treat.org