A talk at the Molecular Informatics Open Source Meeting (MIOSS) at the European Bioinformatics Institute (EMBL-EBI) in Hinxton, Cambridge, United Kingdom.
3. The CDK after 16 years
• 16,521 commits made by 115 contributors
• 564,171 lines of code
• mostly written in Java
• well established, mature codebase
• maintained by a large development team
• with stable year-on-year commits
• estimated 151 years of effort (COCOMO model)
• first commit in October 2000
• most recent commit 1 day ago
The Chemistry Development Kit (CDK)
Open Source Cheminformatics in Java
20. Development Model
• Open Source Principles
• Release Early, Release Often
• All the Raymond stuff (Cathedral …
• Persistence
• People contribute what they need
• You need a Doctor Who who cares
• code quality, build systems, etc
22. • Maven describes how a project is built and its dependencies.
• Simplifies both building from source or linking a distributed JAR.
• Dependencies are dynamically downloaded and kept in sync.
• “Convention over configuration”
• Many new Java projects choose Maven; the CDK is 15+ years old, so the
switch was a challenge.
• Splitting one source tree into 76 interdependent modules.
• Modularisation was started by Egon Willighagen in Ant.
• Test Fail = Build Fail, which required resolving 150+ existing regressions.
• CDK 1.5.10+ is available from The Central Repository, a geographically
distributed collection of dedicated servers.
Maven Switch Over
tinyurl.com/cdk-mavencentral
23. 1.5.x: Cleaner, More Efficient, More Robust, More Stable
Example: Generate depiction of a molecule.
1.4.x 1.5.x
+ Improved Layout
+ Improved Render
+ Easy highlighting
+ Abbreviations
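As a rough illustration of the depiction example, the following minimal sketch parses a SMILES string and renders it to SVG with the oxygen atoms highlighted. It assumes a CDK release that includes `DepictionGenerator` (1.5.13 or later) on the classpath; the class name `DepictSketch`, the method `depictSvg`, and the choice of highlighting oxygens are hypothetical and only for illustration.

```java
import java.awt.Color;
import java.util.ArrayList;
import java.util.List;

import org.openscience.cdk.depict.DepictionGenerator;
import org.openscience.cdk.exception.CDKException;
import org.openscience.cdk.interfaces.IAtom;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.interfaces.IChemObject;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smiles.SmilesParser;

// Hypothetical helper class, not part of the CDK.
final class DepictSketch {

    /** Parses a SMILES string and returns an SVG depiction with oxygens highlighted in red. */
    static String depictSvg(String smiles) {
        try {
            SmilesParser parser = new SmilesParser(SilentChemObjectBuilder.getInstance());
            IAtomContainer mol = parser.parseSmiles(smiles);

            // Collect the atoms to highlight (an arbitrary choice for this sketch).
            List<IChemObject> oxygens = new ArrayList<>();
            for (IAtom atom : mol.atoms()) {
                if ("O".equals(atom.getSymbol())) {
                    oxygens.add(atom);
                }
            }

            // 2D coordinates are generated by the depiction generator when missing.
            return new DepictionGenerator()
                    .withAtomColors()
                    .withHighlight(oxygens, Color.RED)
                    .depict(mol)
                    .toSvgStr();
        } catch (CDKException e) {
            throw new IllegalArgumentException("Could not depict: " + smiles, e);
        }
    }
}
```

Calling `DepictSketch.depictSvg("CC(=O)O")` should return an SVG document as a string; the same `Depiction` object can also be written to a file with `writeTo(path)`.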
27. Example: SMARTS match for intramolecular Hydrogen Bonds:
O=[C,N]aa[N,O;!H0] in NCI Aug00 (~250,000 molecules) [1-3]
1.4.x: 16 mins (64 err)
1.5.x: 16 secs (0 err)
+ Lazy algorithm
+ Stereochemistry Match
+ Component Grouping
+ Adaptive (e.g. ring membership only if needed)
+ New Pattern API, hides differences between SMARTS/Substructure/Isomorphism queries
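The SMARTS query above can be tried through the Pattern API roughly as follows. This is a sketch under assumptions: it targets a newer CDK release (2.2 or later), where `SmartsPattern` lives in `org.openscience.cdk.smarts`; in the 1.5.x series the class was in `org.openscience.cdk.smiles.smarts`. The class `HBondSketch` and its method names are hypothetical.

```java
import org.openscience.cdk.exception.CDKException;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.isomorphism.Pattern;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smarts.SmartsPattern;
import org.openscience.cdk.smiles.SmilesParser;

// Hypothetical helper class, not part of the CDK.
final class HBondSketch {

    // The intramolecular hydrogen bond query from the slide.
    private static final String QUERY = "O=[C,N]aa[N,O;!H0]";

    private static IAtomContainer parse(String smiles) {
        try {
            return new SmilesParser(SilentChemObjectBuilder.getInstance()).parseSmiles(smiles);
        } catch (CDKException e) {
            throw new IllegalArgumentException("Invalid SMILES: " + smiles, e);
        }
    }

    /** Returns true if the molecule contains at least one match of the query. */
    static boolean hasIntramolecularHBond(String smiles) {
        Pattern pattern = SmartsPattern.create(QUERY, SilentChemObjectBuilder.getInstance());
        return pattern.matches(parse(smiles));
    }

    /** Counts matches that cover distinct sets of atoms. */
    static int countUniqueMatches(String smiles) {
        Pattern pattern = SmartsPattern.create(QUERY, SilentChemObjectBuilder.getInstance());
        return pattern.matchAll(parse(smiles)).countUnique();
    }
}
```

Salicylic acid (`OC(=O)c1ccccc1O`) has a hydroxyl ortho to the carbonyl and should match, whereas benzoic acid (`OC(=O)c1ccccc1`) should not. Because the code only depends on the `Pattern` type, a substructure query built with `Pattern.findSubstructure` could be swapped in without changing the matching calls.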
[1] Weininger D. Chemistry Cartridge CGI Examples. EMug (1998)
[2] Sayle R. Cheminformatics Toolkits: a personal perspective, RDKit UGM (2012)
[3] May J. All The Small Things. http://efficientbits.blogspot.co.uk/2013/10/
28. Robustness
CDK 1.5.x moves away from default atom type perception/sanitisation.
+ Much faster
+ High fidelity IO: round trip [CH2] through SMILES/InChI/Molfile
+ Exact Kekulization
+ Exact ring perception
+ Portable canonical Kekulé SMILES
+ Multiple aromaticity models, “Horses for courses”
+ Accurate MMFF94 partial charges
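The round-trip point can be checked with a small sketch like the one below, which reads a SMILES string and writes it straight back out without any atom typing step. It assumes CDK 2.0 or later, where `SmilesGenerator` takes `SmiFlavor` options; the class `RoundTripSketch` and its method are hypothetical names.

```java
import org.openscience.cdk.exception.CDKException;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smiles.SmiFlavor;
import org.openscience.cdk.smiles.SmilesGenerator;
import org.openscience.cdk.smiles.SmilesParser;

// Hypothetical helper class, not part of the CDK.
final class RoundTripSketch {

    /** Parses a SMILES string and regenerates SMILES from the resulting structure. */
    static String roundTrip(String smiles) {
        try {
            SmilesParser parser = new SmilesParser(SilentChemObjectBuilder.getInstance());
            IAtomContainer mol = parser.parseSmiles(smiles);

            // No atom type perception or sanitisation is applied here:
            // the hydrogen count given in the input is kept as read.
            return new SmilesGenerator(SmiFlavor.Generic).create(mol);
        } catch (CDKException e) {
            throw new IllegalArgumentException("Could not round trip: " + smiles, e);
        }
    }
}
```

With this sketch, `RoundTripSketch.roundTrip("[CH2]")` is expected to give back `[CH2]`, since the unusual valence is preserved rather than "corrected".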
Stability
Java APIs can be more fluid than native: aim to keep public API fixed.
Continuous integration and regression testing with Jenkins and Travis.
29. Stereochemistry
Tetrahedral, CisTrans, Extended Tetrahedral (Allene)
Representation and round tripping between formats
Query Matching
Perspective conversion (Haworth, Chairs, Fischer)
File Formats
Molfile Sgroup support: Repeat Units, Display Shortcuts
CXSMILES
Coming soon: HELM 2.0
Fingerprints
Count fingerprints
Efficient Circular Fingerprint and Model Building
Clark et al. J. Cheminform. 6:38 (2014)
Coming soon: FPS readers, mmap indexes
Updated and Super Quick Fundamental Algorithms
Ring Finding - May and Steinbeck. J. Cheminform. 6:3 (2014)
Subgraph Isomorphism
Canonical labelling
Aromaticity
Kekulization
Molecular Hash Codes, Automorphism Group, and much more
Other Features of 1.5.x
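As one small example of the fundamental algorithms listed above, ring finding is exposed through the `Cycles` utility. The sketch below counts the rings in a smallest set of smallest rings (SSSR) for a molecule given as SMILES. It assumes a CDK 1.5.x or later release on the classpath; `RingSketch` and `sssrCount` are hypothetical names.

```java
import org.openscience.cdk.exception.CDKException;
import org.openscience.cdk.graph.Cycles;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smiles.SmilesParser;

// Hypothetical helper class, not part of the CDK.
final class RingSketch {

    /** Returns the number of rings in an SSSR of the molecule. */
    static int sssrCount(String smiles) {
        IAtomContainer mol;
        try {
            mol = new SmilesParser(SilentChemObjectBuilder.getInstance()).parseSmiles(smiles);
        } catch (CDKException e) {
            throw new IllegalArgumentException("Invalid SMILES: " + smiles, e);
        }
        // Cycles offers several ring sets; SSSR is just one of them.
        return Cycles.sssr(mol).numberOfCycles();
    }
}
```

For instance, benzene (`c1ccccc1`) has one ring, naphthalene (`c1ccc2ccccc2c1`) has two, and hexane (`CCCCCC`) has none. Other ring sets are available from the same class through methods such as `Cycles.mcb` and `Cycles.relevant`.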