UGM 2007
Miklós Vargyas*, Judit Vaskó-Szedlár
What’s new in LibraryMCS
Talk Overview
• Introduction to LibraryMCS
– Concepts, motivation
– Main features
– GUI
• 2006 Roadmap accomplishment
• New features in detail
– Performance
– Iterative clustering
– Additive clustering
• Current roadmap and wishlist
Introduction – Concept of MCS
Maximum Common Substructure
Looks simple, yet hard to compute!
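Why the MCS is hard to compute can be seen in a brute-force sketch. The Python below is an illustration only, not LibraryMCS code. It assumes molecules are given as a list of atom labels plus a bond table (`mol` and `mcs_size` are hypothetical names), and it searches for the largest connected common induced substructure by growing every consistent atom-to-atom mapping. The number of candidate mappings grows exponentially with molecule size, so exact search of this kind is only feasible for small inputs.

```python
def mol(labels, bond_list):
    """Build a toy molecule: atom labels plus a {frozenset({i, j}): order} bond table."""
    return labels, {frozenset((i, j)): order for i, j, order in bond_list}


def mcs_size(mol_a, mol_b):
    """Atom count of the largest connected common induced substructure (brute force)."""
    labels_a, bonds_a = mol_a
    labels_b, bonds_b = mol_b
    best = 0
    seen = set()  # mappings already explored, to avoid repeating work

    def bond(bonds, i, j):
        return bonds.get(frozenset((i, j)))  # None when the atoms are not bonded

    def extend(mapping):
        nonlocal best
        key = frozenset(mapping.items())
        if key in seen:
            return
        seen.add(key)
        best = max(best, len(mapping))
        used_b = set(mapping.values())
        for u in range(len(labels_a)):
            if u in mapping:
                continue
            for v in range(len(labels_b)):
                if v in used_b or labels_a[u] != labels_b[v]:
                    continue
                # Bonds to already mapped atoms must agree in both molecules.
                pairs = [(bond(bonds_a, u, x), bond(bonds_b, v, y))
                         for x, y in mapping.items()]
                consistent = all(p == q for p, q in pairs)
                # The new atom must touch the mapped part (keeps it connected).
                connected = not mapping or any(p is not None for p, _ in pairs)
                if consistent and connected:
                    mapping[u] = v
                    extend(mapping)
                    del mapping[u]

    extend({})
    return best


propanol = mol(["C", "C", "C", "O"], [(0, 1, 1), (1, 2, 1), (2, 3, 1)])
ethanol = mol(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ethylamine = mol(["C", "C", "N"], [(0, 1, 1), (1, 2, 1)])

print(mcs_size(propanol, ethanol))    # common part C-C-O
print(mcs_size(ethanol, ethylamine))  # common part C-C
```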
Introduction – Motivations
• MCS based clustering
– More intuitive than similarity based
– Closer to the chemists' gold standard
• Initial requirements
– Focused set analysis
  • screens: 2000 – 10000 structures
  • lead optimization: 3000 – 5000 structures
– Should be hierarchical (outliers)
– Ultimate goal: cluster 5000 compounds in 5 seconds
• Further application areas
– Library profiling
– Compound acquisition
Introduction – Main features
• MCS based hierarchical clustering
• Flexible search options
• No theoretical size limitation
• Fast operation
• Filtering by chemical properties
• Cluster statistics
• Hierarchy browser
GUI – Dendrogram view
• Interactive navigation, selection
• Zoom & move
GUI – Molecule view
GUI – SAR-table
• Cluster statistics, structure filtering by properties
GUI – R-table
2006 Roadmap accomplishment
...
Preserving rings
Iterative clustering
• Outliers
– Singletons
– Large blobby clusters
• Aim
– Minimise number of singletons
– Maintain high quality
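The slides do not say how the iteration works. One plausible reading, sketched below, is to cluster with a strict threshold first and then re-cluster only the leftover singletons with a relaxed one, so that the first-pass clusters stay tight. Everything here is an assumption for illustration: the longest common substring of SMILES-like strings stands in for a real MCS, and `common_core`, `cluster_once` and `cluster_iteratively` are hypothetical names, not LibraryMCS functions.

```python
from difflib import SequenceMatcher


def common_core(a, b):
    """Toy stand-in for an MCS: the longest common substring of two strings."""
    m = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def cluster_once(items, min_core):
    """Greedy pass: join the first cluster whose shared core stays >= min_core."""
    clusters = []  # each cluster is [core, members]
    for s in items:
        for c in clusters:
            core = common_core(c[0], s)
            if len(core) >= min_core:
                c[0] = core
                c[1].append(s)
                break
        else:
            clusters.append([s, [s]])
    return clusters


def cluster_iteratively(items, thresholds):
    """Re-cluster the singletons of each pass with the next (looser) threshold."""
    final, pending = [], list(items)
    for t in thresholds:
        clusters = cluster_once(pending, t)
        final += [c for c in clusters if len(c[1]) > 1]
        pending = [c[1][0] for c in clusters if len(c[1]) == 1]
    final += [[s, [s]] for s in pending]  # what is still left stays a singleton
    return final


data = ["c1ccccc1CCO", "c1ccccc1CCN", "c1ccccc1Cl", "CCCCO", "CCCCN", "NCC(=O)O"]

strict_only = cluster_once(data, 8)
iterated = cluster_iteratively(data, [8, 3])
for core, members in iterated:
    print(core, members)
```

On this toy input a single strict pass leaves three singletons, while the second, looser pass groups two of them and leaves one.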
Additive clustering
[Diagram labels: Corporate database; Pre-clustering, stored; new set; registration; Cluster diversity enrichment]
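The idea of adding a new set to a stored pre-clustering can be sketched as follows. This is a guess at the general scheme, not the LibraryMCS implementation: each stored cluster is assumed to be represented by its common core, plain substring containment stands in for a real substructure search, and `assign_new_structures` is a hypothetical name. New structures join the stored cluster with the largest matching core, and anything that matches no core opens a new cluster, so the existing clusters never need to be recomputed.

```python
def assign_new_structures(stored, new_items):
    """Add new structures to pre-computed clusters without re-clustering.

    stored: dict mapping a cluster core (string) to its list of members.
    Returns a new dict; the input is left unchanged.
    """
    clusters = {core: list(members) for core, members in stored.items()}
    for s in new_items:
        # Toy substructure test: the core must occur as a substring.
        matches = [core for core in clusters if core in s]
        if matches:
            clusters[max(matches, key=len)].append(s)
        else:
            clusters[s] = [s]  # no stored cluster fits: start a new one
    return clusters


stored = {
    "c1ccccc1C": ["c1ccccc1CCO", "c1ccccc1CCN", "c1ccccc1Cl"],
    "CCCC": ["CCCCO", "CCCCN"],
}
new_set = ["c1ccccc1CBr", "CCCCCl", "O=S(=O)(N)C"]

updated = assign_new_structures(stored, new_set)
for core, members in updated.items():
    print(core, members)
```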
Performance
• Depends on various factors
– average structure size
– diversity
– minimal required MCS size
– atom/bond constraints
[Bar chart: CombiLib, MixedLib and Maybridge; series: Normal, Fast, Fastest; vertical axis 0 to 16, unlabelled]
Performance
• Scales linearly
[Chart: running time (sec) against structure count, 0 to 35000 structures; series: 2006, 2007, Linear (2007)]
Performance
• Maximum speed achieved: 1 000 structures/s
[Chart: run time (s) against library size, 100 to 100000 structures; series: Ward 512, Jarp 512, LibMCS 6]
• Memory requirements
– scalable
– 50 000 structures occupy < 100 MB
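The figures quoted in the talk can be cross-checked with simple arithmetic. The snippet below uses only numbers from the slides. The one assumption is that 1 MB means 1024 × 1024 bytes, and since the slide says "less than 100 MB" the per-structure figure is an upper bound.

```python
# Ultimate goal from the motivation slide: 5000 compounds in 5 seconds.
goal_rate = 5000 / 5                 # structures per second

# Maximum speed reported on the performance slide.
peak_rate = 1000                     # structures per second

# Memory: 50 000 structures in under 100 MB (assuming 1 MB = 1024 * 1024 bytes).
bytes_per_structure = 100 * 1024**2 / 50_000

print(goal_rate, peak_rate)          # the reported peak matches the goal rate
print(round(bytes_per_structure))    # upper bound in bytes, roughly 2 KB
```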
In the pipeline
• Multi-stage clustering
• Additive clustering
• Disconnected MCS (Maximum Overlapping Set)
• Enhanced R-group decomposition
• Markush export
• Further clustering criteria
– Ring count
• Performance tuning
– Easier control of memory usage
Current roadmap and wishlist
• Simpler table view
• IJC integration
• Multi-cluster members
• Clustering million compound libraries
• Integrate Chemical Terms
• Stereo care MCS
Acknowledgements
• Co-workers
– Péter Vadász
– Judit Vaskó-Szedlár
• Ideas
– Ferenc Csizmadia, Szabolcs Csepregi, Ákos Papp, György Pirok
• Partners, early adopters