Challenges and solutions on open-source/open-data fronts
1. Some "challenges" on the open-source/open-data front
Along with a few thoughts on solutions
Greg Landrum
MIOSS, Hinxton
May 2016
T5 Informatics GmbH
greg.landrum@t5informatics.com
@dr_greg_landrum
This work is licensed under a
Creative Commons Attribution 4.0
International License.
2. T5 Informatics 2
First things first: what's T5 Informatics?
● Commercial organization built around the open-source RDKit toolkit.
● Very new: founded in March 2016
● Offers maintenance contracts, support, and training for the RDKit, as well as
custom development work
● Still very much an experiment
● Some thoughts about the business model here: https://medium.com/@greg.landrum_t5
5. T5 Informatics 5
The interoperability problem
The simple, one-slide version
[Figure: RDKit output and CDK output side by side for # rotatable bonds, exact mass, AMW, TPSA, calculated logP, # heavy atoms, and donor/acceptor counts. Annotation: "Donors and acceptors, oh my!"]
Task: generate a set of standard “Lipinski” parameters for Esomeprazole
Good luck if any of those descriptors are used in your QSAR model and you
pick the wrong software.
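Two of the descriptors in that comparison, exact mass and AMW, are simple enough to compute without any toolkit, which makes them a handy illustration of how much depends on definitions and reference data. The sketch below is not taken from either toolkit: it assumes the molecular formula C17H19N3O3S for esomeprazole and uses a small hard-coded table of atomic masses. A toolkit that ships a slightly different table will print slightly different digits.

```python
# Atomic masses hard-coded for this sketch (assumption: a given toolkit
# may ship a table with more digits or slightly different values).
AVERAGE_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307400,
    "O": 15.99491462,
    "S": 31.97207100,
}

# Esomeprazole: C17H19N3O3S
ESOMEPRAZOLE = {"C": 17, "H": 19, "N": 3, "O": 3, "S": 1}


def formula_mass(formula, mass_table):
    """Sum count * atomic mass over every element in the formula."""
    return sum(count * mass_table[element] for element, count in formula.items())


# AMW uses natural-abundance average masses; exact mass uses the most
# abundant isotope of each element.
amw = formula_mass(ESOMEPRAZOLE, AVERAGE_MASS)
exact_mass = formula_mass(ESOMEPRAZOLE, MONOISOTOPIC_MASS)

print(f"AMW:        {amw:.3f}")
print(f"Exact mass: {exact_mass:.4f}")
print(f"Difference: {amw - exact_mass:.3f}")
```

The two numbers differ by roughly 0.3 mass units for the same molecule, so a model trained on one definition and fed the other is already off before any real implementation differences come into play.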
6. T5 Informatics 6
Looking things up is hard too...
[Figure: lookup results from ChEMBL, PubChem, and ChemSpider]
Amusingly, they all have different structure drawings
7. T5 Informatics 7
The interoperability problem
● Processing chemical and biological data is hard and people have different
workflows.
● We will always be using multiple tools to analyze and present results
● There are standard algorithms, but different implementations lead to different
results
● One thing that would help is a single implementation that’s usable in many
different places
● If the source is open, it can be archived and packaged to provide
reproducibility and allow new work to build on a standard framework
● This is the approach we’ve taken with the RDKit
Note: there’s another big mess around file formats and data quality, but that’s the
topic for another session (or three)
8. T5 Informatics 8
The RDKit code ecosystem (1)
[Diagram, top to bottom:]
C++: core data structures and algorithms
PostgreSQL, Boost.Python, SWIG
Python, Java, C#
Jupyter, Pandas, KNIME
(1) “ecodesystem”? Probably not.
The exact same implementation is available in all endpoints
9. T5 Informatics 9
● Business-friendly BSD license
● Runs on Linux/Mac/Windows
● Commercial support available
● Releases every six months
● Active and engaged community
● Usable from Python (2 or 3), C++, C#, or Java
● Basic functionality highlights:
○ Chemical reactions
○ 2D depiction
○ Substructure searching
○ Canonical SMILES
○ Gasteiger-Marsili charges
○ Molecular standardization
● 2D Functionality highlights:
○ RECAP and BRICS support
○ Multi-molecule MCS
○ Similarity maps
○ Functional group filters
○ Diversity picking
● Supported fingerprint highlights:
○ Morgan/Feature Morgan (ECFP/FCFP-like)
○ RDKit (Daylight-like)
○ Atom-pairs and topological torsions
○ MACCS keys
○ Avalon
○ Fast similarity searching from FPB files
● Descriptor highlights:
○ Hall-Kier kappa and chi descriptors
○ SLogP, SMR, TPSA
○ MQN
○ “MOE-like” VSA
○ Compositional (number of donors, number of
rings, number of heterocycles, etc.)
● 3D Functionality highlights:
○ 2D->3D conversion/conformational analysis
via distance geometry
○ UFF and MMFF94/MMFF94S
implementations for cleaning up structures
○ Feature maps and feature-map vectors
○ Shape-based similarity
○ RMSD-based molecule-molecule alignment
○ Open3DAlign implementation
○ Integration with PyMOL
○ Torsion Fingerprint Differences
The RDKit
An open-source toolkit for cheminformatics
www.rdkit.org
12. T5 Informatics 12
Some questions
1. Where are our most common file/interchange formats actually defined? How
do we know what they mean?
2. Do we need new interchange format(s)?
3. How should we standardize molecules?
13. T5 Informatics 13
Question 3: standardizing molecules
● I want to see this molecule the way it'd be stored in PubChem, or ChEMBL, or
Open PHACTS, or ...
● I want to standardize this molecule so that I can register it, if necessary
● … but I want to standardize it using my rules.
Looks like we're going to be talking about this tomorrow.
14. T5 Informatics 14
Question 1: formats
● Definitions: what's the syntax? What does this term mean?
○ SMILES:
■ Daylight's reference: http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
■ OpenSMILES: http://www.opensmiles.org/opensmiles.html
○ CTAB/MOL/SDF:
■ ctfile.pdf (somewhat publicly available)
■ Various MDL/Symyx/Accelrys manuals (not publicly available)
○ SMARTS:
■ Daylight's reference: http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
● Testing/Visualization: is this valid? What does this represent?
○ SMILES: used to be the depict.cgi server.
○ CTAB/MOL/SDF: your most trusted chemical editor, maybe two of them
○ SMARTS: used to be depictmatch.cgi
I picked this subset because I think it covers the most common molecular
interchange formats. There are of course many other possibilities
15. T5 Informatics 15
Formats and validation
Reasons you might want this:
● Is "C1.C1" a valid SMILES? What does it correspond to?
● Is "C1CCC=1" a valid SMILES? What does it correspond to?
● What does this mean?
Amusing fact: there's a 12+ page explanation of how
tetrahedral stereochemistry should be handled in
MOL blocks in one of those non-public documents
That's bad enough and I didn't even talk about S-groups, R-groups or
query features in CTAB/MOL...
… or recursive SMARTS
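The two SMILES questions on this slide can be explored with a toy parser. This is a minimal sketch, not any toolkit's implementation: `parse_tiny_smiles` is a made-up name, and it only handles single-letter atoms, the bond symbols - = #, single-digit ring closures and the "." separator. It follows the reading in the OpenSMILES document, as far as I understand it, that ring-closure digits may pair up across a dot and that a ring bond's order may be written at either end. Other readers may well reject these inputs, which is rather the point.

```python
BOND_ORDERS = {"-": 1, "=": 2, "#": 3}
ATOMS = "BCNOPSFI"  # single-letter atoms only; no Cl, Br, brackets or aromatics


def parse_tiny_smiles(smiles):
    """Return (atoms, bonds) for a tiny SMILES subset.

    atoms is a list of element symbols; bonds maps (i, j) with i < j
    to a bond order. Branches are not supported.
    """
    atoms = []
    bonds = {}
    open_rings = {}   # ring digit -> (atom index, explicit bond order or None)
    previous = None   # atom the next atom would bond to
    pending = None    # explicit bond order waiting for its second atom

    for position, char in enumerate(smiles):
        if char in BOND_ORDERS:
            pending = BOND_ORDERS[char]
        elif char == ".":
            previous, pending = None, None  # the next atom gets no chain bond
        elif char in ATOMS:
            atoms.append(char)
            current = len(atoms) - 1
            if previous is not None:
                bonds[(previous, current)] = pending or 1
            previous, pending = current, None
        elif char.isdigit():
            if previous is None:
                raise ValueError(f"ring closure without an atom at {position}")
            if char in open_rings:
                start, start_order = open_rings.pop(char)
                if start_order and pending and start_order != pending:
                    raise ValueError(f"conflicting ring bond orders at {position}")
                if start == previous or (start, previous) in bonds:
                    raise ValueError(f"ring closure duplicates a bond at {position}")
                bonds[(start, previous)] = start_order or pending or 1
            else:
                open_rings[char] = (previous, pending)
            pending = None
        else:
            raise ValueError(f"unsupported character {char!r} at {position}")

    if open_rings:
        raise ValueError(f"unclosed ring closure(s): {sorted(open_rings)}")
    return atoms, bonds


for text in ("C1.C1", "C1CCC=1"):
    print(text, parse_tiny_smiles(text))
```

Under these rules "C1.C1" comes out as two carbons joined by a single bond (ethane), and "C1CCC=1" as a four-membered ring whose closing bond is double (cyclobutene).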
16. T5 Informatics 16
A concrete suggestion
● Formats:
○ OpenSMILES: revive this effort and address outstanding questions (already happening)
○ OpenSMARTS: find a group of interested participants and assemble and publish an open
definition (similar to what happened with OpenSMILES).
■ Requires: organizer, participants, sample data
○ OpenCTAB: find a group of interested participants, agree on the subset that will be included,
and assemble and publish an open definition
■ Requires: organizer, participants, sample data
● Validation/Visualization:
○ A fully open-source (and permissively licensed) web service that returns images (PNG or
SVG) for a provided input in one of the supported formats. This service would ideally have
good error reporting to help identify problems in the input
○ A hosted version of this service useable by the community
○ A fully open-source (and permissively licensed) basic web application for providing input and
seeing the results
○ A hosted version of the web application
As long as we don't extend any of the formats, we don't need to worry (too
much) about adoption or vendor support: it's already there
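To make "good error reporting" a bit more concrete, here is a rough sketch of what the request-handling side of such a service might return. Everything in it is an assumption for illustration: the `smiles` query parameter, the `handle_request` and `find_problems` names, the JSON layout and the status codes. It only checks bracket and parenthesis balance; actual parsing and rendering to PNG or SVG would be delegated to a toolkit and is not shown.

```python
import json
from urllib.parse import parse_qs


def find_problems(smiles):
    """Return a list of {"position", "message"} dicts for simple syntax problems."""
    problems = []
    open_parens = []     # positions of "(" still waiting for ")"
    open_bracket = None  # position of a "[" still waiting for "]"
    for position, char in enumerate(smiles):
        if char == "[":
            if open_bracket is not None:
                problems.append({"position": position, "message": "nested '['"})
            open_bracket = position
        elif char == "]":
            if open_bracket is None:
                problems.append({"position": position, "message": "']' without '['"})
            open_bracket = None
        elif char == "(" and open_bracket is None:
            open_parens.append(position)
        elif char == ")" and open_bracket is None:
            if open_parens:
                open_parens.pop()
            else:
                problems.append({"position": position, "message": "')' without '('"})
    if open_bracket is not None:
        problems.append({"position": open_bracket, "message": "'[' never closed"})
    for position in open_parens:
        problems.append({"position": position, "message": "'(' never closed"})
    return problems


def handle_request(query_string):
    """Map a query string such as 'smiles=CCO' to (HTTP status, JSON body)."""
    params = parse_qs(query_string)
    if "smiles" not in params:
        problem = {"position": None, "message": "missing 'smiles' parameter"}
        return 400, json.dumps({"ok": False, "problems": [problem]})
    smiles = params["smiles"][0]
    problems = find_problems(smiles)
    if problems:
        return 422, json.dumps({"ok": False, "input": smiles, "problems": problems})
    # A real service would pass the input to a toolkit here and return an image.
    return 200, json.dumps({"ok": True, "input": smiles})


print(handle_request("smiles=CC(C"))
```

Reporting a character position with each message is the part that matters: it lets the web application highlight exactly where the input went wrong.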
17. T5 Informatics 17
Question 2: new format(s)?
Some possible reasons for this:
● Efficiently storing large groups of molecules with associated data. Perhaps
data beyond basic types like text and numbers
● Having something well documented and clear
● Having something a bit easier to parse (for both computers and humans)
● Andrew provided others in his talk
Functional:
● Doing something reasonable with partial or "odd" stereochemistry
● Doing something reasonable with non-traditional bond types (like what you
find in organometallics)
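As a purely hypothetical illustration of the first and third reasons (molecules plus structured data, easy to parse), a container could be as simple as one JSON object per line. This is not a proposal from the talk: the `smiles` and `data` field names and the records themselves are made up to show the shape, and the sketch says nothing about the harder problem of what the structure string itself should be.

```python
import io
import json


def write_records(handle, records):
    """Write one JSON object per line."""
    for record in records:
        handle.write(json.dumps(record, sort_keys=True) + "\n")


def read_records(handle):
    """Yield one record per non-empty line, so large files can be streamed."""
    for line in handle:
        if line.strip():
            yield json.loads(line)


# Placeholder records: the values are invented purely to show nested data.
records = [
    {"smiles": "CCO", "data": {"name": "ethanol", "tags": ["solvent", "alcohol"]}},
    {"smiles": "CC", "data": {"name": "ethane", "replicates": [[1.0, 1.5], [2.0]]}},
]

buffer = io.StringIO()
write_records(buffer, records)

buffer.seek(0)
round_tripped = list(read_records(buffer))
print(round_tripped == records)
```

Because each line stands alone, a reader never has to hold more than one molecule in memory, and the data attached to a molecule can be lists or nested objects rather than just text and numbers.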
18. T5 Informatics 18
Dealing with metals
Just a quick example to show what a train wreck things currently are
25. T5 Informatics 25
A concrete suggestion
Ok, really just a collection of bullet points, mainly reasons why this is nuts
● The biggest problem is going to be adoption
● Assumption: anything that is used only (or mostly) by toolkits is going to be
easier to get adopted than anything requiring a sketcher
● Some parts are easier than others:
○ A format for dealing with large numbers of molecules + data is probably not that bad. Adoption
is at the toolkit level
○ A format for molecules is harder… It needs support within both sketchers and readers. Oh,
and reference data that can be used to develop and validate the format.
● Still, maybe HELM and (maybe) MMTF show that this is possible?
● Get a group of interested people together and start a discussion?
26. T5 Informatics 26
Wrapping up
The questions:
1. Where are our most common file/interchange formats actually defined?
2. Do we need new interchange format(s)?
3. How should we standardize molecules?
And the RDKit:
● Liberally licensed open-source chemistry toolkit accessible from many places