TRANSCRIPT
August 2010, ACS National meeting, Boston
Representation of Markush structures — from molecules towards patents
Szabolcs Csepregi
Solutions for Cheminformatics
Contents
• ChemAxon
• What are Markush structures?
• How to get them?
• What can be done with them?
  – Enumeration
  – Storage, search
• Challenges in chemical representation
• Under development
ChemAxon
• Cheminformatics toolkits and applications
• HQ: Budapest, Hungary
• Founded: 1998
• Main customers: pharma, biotech, publishing
• 3rd party applications and web sites (e.g. Integrity, Reaxys, PDB ligand search, ELNs, registration systems, etc.)
ChemAxon
Main products:
  – Structure drawing & visualization (Marvin family)
  – Chemical DB tools (JChem family)
  – Property predictions (Calculator plugins)
  – Drug discovery tools (Reactor, JKlustor, etc.)
Development strategy: customer-driven
What are Markush structures and how to get them?
Markush structures
Generic notation for describing many molecules (= Markush library) in a compact form.
Main usage:
  – Combinatorial chemistry
  – Chemistry-related patents
Markush structures
• Current features handled:
  – R-groups
  – Atom lists, bond lists
  – Position variation bond
  – Link nodes
  – Repeating units
  – Homology groups (aryl, alkyl, etc.)
ChemAxon Markush project
Goals:
  – Extend structural search capabilities to combinatorial Markush structures
  – Markush enumeration
Complications:
  – Practical examples may be very complex; methods using explicit enumeration may be impossible
  – Extension of current molecular formats (generic features)
Timeline:
  – Pilot study started in 2005 Q4
  – First prototype shown at UGM, June 2006
  – Released in JChem 5.0, 2008
  – Markush DARC format support: 5.3.0, 2010
How to get Markush structures?
• Drawing – Marvin Sketch
How to get Markush structures?
• Patent literature – Markush DARC format (*.vmn)
• Compatible with Thomson Reuters MMS patent Markush database (Test set available.)
How to get Markush structures?
Combinatorial chemistry – Reagent clipping
  1. Replace reacting group with attachment point (Reactor tool)
  2. Turn fragments into R-group definitions (Molconvert tool)
  3. Add a scaffold (Molconvert tool)
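The three clipping steps can be sketched in plain Python on SMILES-like strings. This is only a toy illustration, not the Reactor or Molconvert tools: the function names, the `[*]` attachment marker, the `[R1]` scaffold label and the assumption that the reacting group is a trailing substring are all hypothetical.

```python
# Toy sketch of reagent clipping; all names and string conventions are hypothetical.
ATTACHMENT = "[*]"  # assumed marker for an attachment point


def clip_reagent(smiles: str, reacting_group: str) -> str:
    """Step 1: replace a trailing reacting group with an attachment point."""
    if not smiles.endswith(reacting_group):
        raise ValueError(f"{smiles!r} does not end with {reacting_group!r}")
    return smiles[: -len(reacting_group)] + ATTACHMENT


def build_markush(scaffold: str, reagent_sets: dict, reacting_group: str) -> dict:
    """Steps 2 and 3: turn clipped fragments into R-group definitions, add a scaffold."""
    rgroups = {
        name: [clip_reagent(s, reacting_group) for s in reagents]
        for name, reagents in reagent_sets.items()
    }
    return {"scaffold": scaffold, "rgroups": rgroups}


# Two acyl chlorides clipped at the chlorine, attached to an amine scaffold at R1.
acyl_chlorides = ["CC(=O)Cl", "c1ccccc1C(=O)Cl"]
markush = build_markush("[R1]NC1CCCCC1", {"R1": acyl_chlorides}, "Cl")
```

A real implementation would identify the reacting group by substructure matching rather than by string suffix.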
How to get Markush structures?
Combinatorial chemistry – R-group decomposition
  1. Filter and identify ligands in chemical library
  2. Create Markush structure from R-table (R-group decomposition tool)
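Step 2 can be sketched as follows, assuming the decomposition has already produced an R-table with one row per molecule. The data layout and the function name are hypothetical and do not reflect the R-group decomposition tool itself.

```python
# Toy sketch: build R-group definitions from an R-table (hypothetical data layout).
def markush_from_rtable(scaffold: str, rtable: list) -> dict:
    """Collect the distinct substituents seen at each R position, in first-seen order."""
    rgroups = {}
    for row in rtable:
        for position, substituent in row.items():
            members = rgroups.setdefault(position, [])
            if substituent not in members:
                members.append(substituent)
    return {"scaffold": scaffold, "rgroups": rgroups}


rtable = [
    {"R1": "C[*]", "R2": "F[*]"},
    {"R1": "CC[*]", "R2": "F[*]"},
    {"R1": "C[*]", "R2": "Cl[*]"},
]
markush = markush_from_rtable("c1cc([R1])ccc1[R2]", rtable)
```

Note that the resulting Markush structure covers 2 × 2 = 4 combinations although the table holds only 3 molecules: a Markush built this way generally describes a superset of the input library.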
What to do with them?
Markush Enumeration
• Markush enumeration plugin
  – Full enumeration
  – Selected parts only
  – Random enumeration
  – Calculate library size
  – Scaffold alignment and coloring
  – Markush code
  – Optional example homology group enumeration
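Full enumeration, random enumeration and library size calculation can be illustrated with a minimal Python sketch. It is not the Markush enumeration plugin: the template syntax (`{R1}` placeholders in a SMILES-like string) and all function names are assumptions made for the example.

```python
import itertools
import math
import random

# Hypothetical Markush: a scaffold template plus flat R-group definitions.
scaffold = "c1ccc({R1})cc1{R2}"
rgroups = {"R1": ["F", "Cl", "Br"], "R2": ["C", "CC"]}


def library_size(rgroups: dict) -> int:
    """Library size is the product of the R-group definition sizes."""
    return math.prod(len(members) for members in rgroups.values())


def enumerate_all(scaffold: str, rgroups: dict):
    """Full enumeration: yield one structure per combination of R-group members."""
    names = list(rgroups)
    for combo in itertools.product(*(rgroups[name] for name in names)):
        yield scaffold.format(**dict(zip(names, combo)))


def enumerate_random(scaffold: str, rgroups: dict, count: int, seed: int = 0) -> list:
    """Random enumeration: sample members independently for each R-group."""
    rng = random.Random(seed)
    return [
        scaffold.format(**{name: rng.choice(members) for name, members in rgroups.items()})
        for _ in range(count)
    ]


full_library = list(enumerate_all(scaffold, rgroups))
sample = enumerate_random(scaffold, rgroups, count=3)
```

Random sampling stays cheap even when the full library is far too large to enumerate, which is why both modes are useful side by side.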
Markush storage & search
• JChem Base and Instant JChem
• No enumeration involved
• Can handle complex Markush structures (10^40 or more)
• Substructure and Full structure search
• Broad translation of homology groups is supported. (Homology in DB, specific in query.)
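Why search has to avoid enumeration can be seen by counting combinations without generating them. The sketch below is an illustration only, assuming nested R-group definitions written as `{R1}` placeholders and no cyclic references; the names are hypothetical and it counts combinations, not unique structures.

```python
import math
import re

# Hypothetical nested Markush: a definition member may reference another R-group.
definitions = {
    "R1": ["F", "Cl", "C{R2}"],
    "R2": ["O", "N", "S"],
}
PLACEHOLDER = re.compile(r"\{(R\d+)\}")


def count_fragment(fragment: str, definitions: dict) -> int:
    """Combinations for one fragment: product over the R-groups it references."""
    return math.prod(
        count_rgroup(name, definitions) for name in PLACEHOLDER.findall(fragment)
    )


def count_rgroup(name: str, definitions: dict) -> int:
    """Combinations for one R-group: sum over its alternative members."""
    return sum(count_fragment(member, definitions) for member in definitions[name])


# R2 has 3 members, R1 has 1 + 1 + 3 = 5, so two R1 positions give 5 * 5 = 25.
size = count_fragment("c1ccc({R1})cc1{R1}", definitions)

# Twenty independent positions with 100 alternatives each already give 10^40.
big = 100 ** 20
```

Counting is a sum-and-product pass over the definitions, so it stays fast at sizes where storing or matching enumerated structures is out of the question.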
Markush storage & search
Substructure hit visualization
Query
Result in original Markush
Markush storage & search
Substructure hit visualization: "Markush structure reduction"
Query
Result in original Markush
Reduced result
Main use cases
• Refining / visualizing patent search hits
• White space analysis
• Patent busting
• Markush structure curation
• In-house storage of small Markush DB
• etc.
MMS evaluation Instant JChem project
Challenges in chemical representation (solved)
Representation - What we already had
Generic notation in queries:
• Atom lists, bond lists
• R-group queries (Problem: RGFile R-logic and patent R-logic are different. Solution: just ignore R-logic.)
• Link nodes
• Some generic atoms (X) – represented as pseudo atoms.
Single or double
Challenge 1: Attachment point
• Multiple attachment points: ligand order and attachment order
  Heavily used in Markush DARC (up to 8 attachments!)
• Represented as atom property
Parent group (root)
R-group definitions
Order of ligands for G15 (R15)
Attachment points for definitions
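How ligand order and attachment order fit together can be shown with a small data-structure sketch. It is a hypothetical model, not ChemAxon's internal representation: the class names, fields and atom indices are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Definition:
    """One R-group definition member with numbered attachment points."""
    fragment: str
    attachment_atoms: dict  # attachment point number -> atom index in the fragment


@dataclass
class RGroupAtom:
    """An R-group atom in the parent group with an order of ligands."""
    name: str
    ligand_order: dict  # attachment point number -> neighbor atom index in the parent


def connections(ratom: RGroupAtom, definition: Definition) -> list:
    """Pair each parent neighbor with the fragment atom it bonds to."""
    if set(ratom.ligand_order) != set(definition.attachment_atoms):
        raise ValueError("ligand order and attachment points do not match")
    return [
        (ratom.ligand_order[n], definition.attachment_atoms[n])
        for n in sorted(ratom.ligand_order)
    ]


# An asymmetric two-point fragment: attachment 1 on the O, attachment 2 on the N.
linker = Definition("OCCN", {1: 0, 2: 3})
g15 = RGroupAtom("G15", {1: 3, 2: 7})
g15_swapped = RGroupAtom("G15", {1: 7, 2: 3})

bonds = connections(g15, linker)
bonds_swapped = connections(g15_swapped, linker)
```

Swapping the ligand order changes which parent atom bonds to which end of the fragment, so for an asymmetric fragment the two orders describe different molecules.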
Challenge 1: Attachment point
• Embedded R-groups: grandparent relations may be needed between attachment points:
  G3’s attachment point "1" is mapped to G4’s attachment point "1"
Challenge 1: Attachment point
• Temporary representation: attached data
  – ligand order
  – attachment point in R-group definition
  – still an atom property
  – ligand order sometimes in parent group (grandparent relation)
Order of ligands for R2
Attachment points for definitions
Challenge 1: Attachment point
• Real attachment object with bond (under development)
– eliminates need for grandparent relations table:
Order of ligands for R4
Attachment point for R3
Order of ligands for R2
Attachment points for definitions
Challenge 2: Abbreviations
• Superatom S-groups were originally in Marvin (~700 built-in shortcuts)
  – Expand / Contract
  – Search code already handled them in specific structures.
• Markush DARC had 21 shortcuts + 31 peptides.
• Attachment point next to abbreviations
  – Needed to be visible "outside" and handled correctly "inside".
  – New attachment point solves this also:
Challenge 3: Homology groups (generics)
• Pseudoatom representation
• Naming (Still looking for the most descriptive "long" names.)
• Extra conditions: general atom property framework (under development)
Markush DARC name   "Long name"
CHK                 alkyl
CYC                 carboAlicyclyl
ARY                 carboAryl
HEA                 heteroMonoAryl
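The name mapping above lends itself to a simple lookup. The four code/name pairs come from the slide; the dictionary and helper names are hypothetical, and the real list of homology groups is longer than these four examples.

```python
# Code-to-name pairs are the four examples shown on the slide; names here are hypothetical.
HOMOLOGY_LONG_NAMES = {
    "CHK": "alkyl",
    "CYC": "carboAlicyclyl",
    "ARY": "carboAryl",
    "HEA": "heteroMonoAryl",
}


def long_name(darc_code: str) -> str:
    """Translate a Markush DARC homology code into its descriptive long name."""
    try:
        return HOMOLOGY_LONG_NAMES[darc_code.upper()]
    except KeyError:
        raise ValueError(f"unknown Markush DARC homology code: {darc_code!r}") from None
```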
Challenge 4: Frequency variation
• Link nodes
• Repeating units: modified SRU
• Multipliers:
  – special SRU, 1 outer bond
  – (Currently visualization only.)
• Moieties:
  – special SRU, 0 outer bonds
  – to describe (variable) stoichiometry
  – (Currently visualization only.)
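Frequency variation of a repeating unit can be illustrated by expanding a linear unit over a range of repetition counts. This is a toy sketch on SMILES-like strings with a hypothetical function name; it only works for linear units with exactly two outer bonds and says nothing about how the modified SRU is stored.

```python
def expand_repeating_unit(prefix: str, unit: str, suffix: str, counts) -> list:
    """Return one structure string per allowed repetition count of a linear unit."""
    return [prefix + unit * n + suffix for n in counts]


# A methyl-capped chain with 1 to 3 ethylene glycol units.
chains = expand_repeating_unit("C", "OCC", "O", range(1, 4))
```

Here `chains` holds `COCCO`, `COCCOCCO` and `COCCOCCOCCO`, one member of the Markush library per repetition count.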
Challenge 5: Position variation bond
• New special S-group type
• Relocatable multicenter atom represents group for bonds
• Also useful to represent multicenter charge and coordination compounds:
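What a position variation bond stands for can be sketched by enumerating the alternative attachment positions. The example is hypothetical: the ring is given as a list of SMILES-like atom tokens, the multicenter is modeled as a set of token indices, and the function name is an assumption.

```python
def enumerate_positions(atom_tokens: list, center_positions: list, substituent: str) -> list:
    """Attach the substituent to each atom of the multicenter in turn."""
    results = []
    for position in center_positions:
        tokens = list(atom_tokens)
        tokens[position] = tokens[position] + "(" + substituent + ")"
        results.append("".join(tokens))
    return results


# Toluene ring tokens; the chlorine may sit on any of ring atoms 1, 2 or 3.
toluene = ["Cc1", "c", "c", "c", "c", "c1"]
isomers = enumerate_positions(toluene, [1, 2, 3], "Cl")
```

The three strings correspond to ortho-, meta- and para-chlorotoluene, which a single position variation bond describes in one drawing.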
What (else) keeps us busy
Under development
• Further improvements in Markush DARC support:
  – Ring segment groups (XX form a ring)
  – New, more robust representation for attachment points
  – Homology properties (low alkyl, fused aryl, C1-3, N2-5, etc.)
• Ranking of results
• New ways to navigate/zoom Markush structures
• Maximum common substructure search
• Biased enumeration and covering Markush – based on examples in patent.
• Improve search speed to handle larger Markush sets.
• Other Markush formats – Markush InChI standard committee
• Overlap analysis of Markush structures
• Conditions for Markush variables
Summary
• Markush structure storage, search and enumeration at ChemAxon, now with patent coverage
• Compatible patent data is available from Thomson Reuters
• Well thought out chemical representation
• Continuous development, improvements in the pipeline
Acknowledgements
• Development team: Nóra Máté, Róbert Wágner, Szilárd Dóránt, Tamás Csizmazia, Tim Dudgeon, Erika Bíró, Ali Baharev, Ferenc Csizmadia, et al.
• Tim Miller, Steve Hajkowski, Gez Cross and Linda Clark at Thomson Reuters for useful discussions, help and example Markush DARC files
• Many early adopters and colleagues within the field for suggestions and feedback