Some "challenges" on the
open-source/open-data front
Along with a few thoughts on solutions
Greg Landrum
MIOSS, Hinxton
May 2016
T5 Informatics GmbH
greg.landrum@t5informatics.com
@dr_greg_landrum
This work is licensed under a
Creative Commons Attribution 4.0
International License.
First things first: what's T5 Informatics?
● Commercial organization built around the open-source RDKit toolkit.
● Very new: founded in March 2016
● Offers maintenance contracts, support, and training for the RDKit, as well as
custom development work
● Still very much an experiment
● Some thoughts about the business model here: https://medium.com/@greg.landrum_t5
Background
Flashback to earlier this year
The interoperability problem
The simple, one-slide version
Task: generate a set of standard “Lipinski” parameters for esomeprazole
[Figure: RDKit output and CDK output compared for # rotatable bonds, exact mass, AMW, TPSA, calculated logP, and # heavy atoms]
Donors and acceptors, oh my!
Good luck if any of those descriptors are used in your QSAR model and you
pick the wrong software.
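Two of the descriptors named on this slide, exact mass and AMW (read here as average molecular weight, which is an assumption), already differ by definition before any toolkit is involved. A minimal sketch using only the Python standard library and the molecular formula of esomeprazole, C17H19N3O3S; the rounded atomic masses in the two tables are assumptions of this sketch, not values taken from the RDKit or the CDK:

```python
# Esomeprazole (like racemic omeprazole) has the molecular formula C17H19N3O3S.
FORMULA = {"C": 17, "H": 19, "N": 3, "O": 3, "S": 1}

# Isotope-averaged atomic weights, rounded (assumption of this sketch).
AVERAGE_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}

# Masses of the most abundant isotope of each element, rounded (assumption).
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "S": 31.9720707,
}


def formula_mass(formula, table):
    """Sum count * atomic mass over the elements of a formula."""
    return sum(count * table[element] for element, count in formula.items())


average_mw = formula_mass(FORMULA, AVERAGE_MASS)
exact_mass = formula_mass(FORMULA, MONOISOTOPIC_MASS)
# Heavy atoms are all atoms other than hydrogen.
heavy_atoms = sum(count for element, count in FORMULA.items() if element != "H")

print(f"average MW : {average_mw:.2f}")
print(f"exact mass : {exact_mass:.4f}")
print(f"heavy atoms: {heavy_atoms}")
```

The two masses differ by roughly 0.3 for this molecule, so a model that was trained on one definition and fed the other is already off before any differences between implementations come into play.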
Looking things up is hard too...
[Figure: lookup results from ChEMBL, PubChem, and ChemSpider]
Amusingly, they all have different structure drawings
The interoperability problem
● Processing chemical and biological data is hard and people have different
workflows.
● We will always be using multiple tools to analyze and present results
● There are standard algorithms, but different implementations lead to different
results
● One thing that would help is a single implementation that’s usable in many
different places
● If the source is open, it can be archived and packaged to provide
reproducibility and allow new work to build on a standard framework
● This is the approach we’ve taken with the RDKit
Note: there’s another big mess around file formats and data quality, but that’s the
topic for another session (or three)
The RDKit code ecosystem [1]
● C++: core data structures and algorithms
● PostgreSQL
● Boost.Python: Python (Jupyter, Pandas)
● SWIG: Java, C# (KNIME)
The exact same implementation is available in all endpoints
[1] “ecodesystem”? Probably not.
The RDKit
An open-source toolkit for cheminformatics
www.rdkit.org
● Business-friendly BSD license
● Runs on Linux/Mac/Windows
● Commercial support available
● Releases every six months
● Active and engaged community
● Usable from Python (2 or 3), C++, C#, or Java
● Basic functionality highlights:
○ Chemical reactions
○ 2D depiction
○ Substructure searching
○ Canonical SMILES
○ Gasteiger-Marsili charges
○ Molecular standardization
● 2D Functionality highlights:
○ RECAP and BRICS support
○ Multi-molecule MCS
○ Similarity maps
○ Functional group filters
○ Diversity picking
● Supported fingerprint highlights:
○ Morgan/Feature Morgan (ECFP/FCFP-like)
○ RDKit (Daylight-like)
○ Atom-pairs and topological torsions
○ MACCS keys
○ Avalon
○ Fast similarity searching from FPB files
● Descriptor highlights:
○ Hall-Kier kappa and chi descriptors
○ SLogP, SMR, TPSA
○ MQN
○ “MOE-like” VSA
○ Compositional (number of donors, number of
rings, number of heterocycles, etc.)
● 3D Functionality highlights:
○ 2D->3D conversion/conformational analysis
via distance geometry
○ UFF and MMFF94/MMFF94s
implementations for cleaning up structures
○ Feature maps and feature-map vectors
○ Shape-based similarity
○ RMSD-based molecule-molecule alignment
○ Open3DAlign implementation
○ Integration with PyMOL
○ Torsion Fingerprint Differences
Let's go back a few slides
End of the flashback
Some questions
1. Where are our most common file/interchange formats actually defined? How
do we know what they mean?
2. Do we need new interchange format(s)?
3. How should we standardize molecules?
Question 3: standardizing molecules
● I want to see this molecule the way it'd be stored in PubChem, or ChEMBL, or
Open PHACTS, or ...
● I want to standardize this molecule so that I can register it, if necessary
● … but I want to standardize it using my rules.
Looks like we're going to be talking about this tomorrow.
Question 1: formats
● Definitions: what's the syntax? What does this term mean?
○ SMILES:
■ Daylight's reference: http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
■ OpenSMILES: http://www.opensmiles.org/opensmiles.html
○ CTAB/MOL/SDF:
■ ctfile.pdf (somewhat publicly available)
■ Various MDL/Symyx/Accelrys manuals (not publicly available)
○ SMARTS:
■ Daylight's reference: http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
● Testing/Visualization: is this valid? What does this represent?
○ SMILES: used to be the depict.cgi server.
○ CTAB/MOL/SDF: your most trusted chemical editor, maybe two of them
○ SMARTS: used to be depictmatch.cgi
I picked this subset because I think it covers the most common molecular
interchange formats. There are of course many other possibilities
Formats and validation
Reasons you might want this:
● Is "C1.C1" a valid SMILES? What does it correspond to?
● Is "C1CCC=1" a valid SMILES? What does it correspond to?
● What does this mean? [figure not preserved]
Amusing fact: there's a 12+ page explanation of how tetrahedral stereochemistry
should be handled in MOL blocks in one of those non-public documents
That's bad enough and I didn't even talk about S-groups, R-groups or
query features in CTAB/MOL...
… or recursive SMARTS
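The two SMILES questions above can be explored with a toy parser. The sketch below is a hypothetical helper (not part of any toolkit) that follows one reading of the OpenSMILES rules: ring-closure digits pair up even across a dot, and the bond symbol may sit on either digit. It handles only a tiny subset of SMILES and is meant to show the bookkeeping, not to be a validator:

```python
BOND_ORDERS = {"-": 1, "=": 2, "#": 3}


def parse_simple_smiles(smiles):
    """Return (atoms, bonds) for a tiny SMILES subset.

    Supported: the one-letter atoms B, C, N, O, P, S, F, I, single-digit
    ring closures, the bond symbols - = #, branches, and '.'.
    Anything else raises ValueError.
    """
    atoms = []          # element symbol per atom index
    bonds = []          # (atom_i, atom_j, bond order)
    open_rings = {}     # ring digit -> (atom index, explicit order or None)
    branch_stack = []   # atoms to return to when a branch closes
    prev = None         # atom the next atom bonds to (None after '.')
    pending = None      # explicit bond order waiting for its second atom

    for pos, ch in enumerate(smiles):
        if ch in "BCNOPSFI":
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx, pending or 1))
            prev, pending = idx, None
        elif ch in BOND_ORDERS:
            pending = BOND_ORDERS[ch]
        elif ch in "0123456789":
            if prev is None:
                raise ValueError(f"ring closure without an atom at position {pos}")
            if ch in open_rings:
                other, other_order = open_rings.pop(ch)
                if pending and other_order and pending != other_order:
                    raise ValueError(f"conflicting bond orders on ring closure {ch}")
                bonds.append((other, prev, pending or other_order or 1))
            else:
                open_rings[ch] = (prev, pending)
            pending = None
        elif ch == ".":
            prev, pending = None, None   # no bond to the next atom
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            if not branch_stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            prev = branch_stack.pop()
        else:
            raise ValueError(f"unsupported character {ch!r} at position {pos}")

    if open_rings:
        raise ValueError(f"unclosed ring closure(s): {sorted(open_rings)}")
    if branch_stack:
        raise ValueError("unclosed branch")
    return atoms, bonds


for text in ("C1.C1", "C1CCC=1"):
    atoms, bonds = parse_simple_smiles(text)
    print(text, atoms, bonds)
```

Under this reading, "C1.C1" comes out as two carbons joined by a single bond (ethane) and "C1CCC=1" as a four-membered ring with one double bond (cyclobutene). Whether every toolkit agrees is exactly the kind of question an open definition and a reference validator would settle.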
A concrete suggestion
● Formats:
○ OpenSMILES: revive this effort and address outstanding questions (already happening)
○ OpenSMARTS: find a group of interested participants and assemble and publish an open
definition (similar to what happened with OpenSMILES).
■ Requires: organizer, participants, sample data
○ OpenCTAB: find a group of interested participants, agree on the subset that will be included,
and assemble and publish an open definition
■ Requires: organizer, participants, sample data
● Validation/Visualization:
○ A fully open-source (and permissively licensed) web service that returns images (PNG or
SVG) for a provided input in one of the supported formats. This service would ideally have
good error reporting to help identify problems in the input
○ A hosted version of this service useable by the community
○ A fully open-source (and permissively licensed) basic web application for providing input and
seeing the results
○ A hosted version of the web application
As long as we don't extend any of the formats, we don't need to worry (too
much) about adoption or vendor support: it's already there
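To make the validation-service idea a little more tangible, here is a minimal sketch of what the error-reporting half could look like as a WSGI application, using only the Python standard library. Everything in it is an assumption for illustration: the `format` and `input` query parameters, the JSON report shape, and the stand-in `find_problems` check, which only looks at bracket balance. A real service would pass the input to a cheminformatics toolkit and return a PNG or SVG depiction on success.

```python
import json
from urllib.parse import parse_qs

# Formats this sketch accepts; CTAB is left out because it is multi-line.
SUPPORTED_FORMATS = ("smiles", "smarts")


def find_problems(text):
    """Report unbalanced '()' and '[]' together with character positions.

    Stand-in check so the sketch runs on its own; it says nothing about
    whether the input is chemically meaningful.
    """
    problems = []
    closers = {")": "(", "]": "["}
    stack = []  # (opening character, position)
    for pos, ch in enumerate(text):
        if ch in "([":
            stack.append((ch, pos))
        elif ch in closers:
            if stack and stack[-1][0] == closers[ch]:
                stack.pop()
            else:
                problems.append(f"unexpected {ch!r} at position {pos}")
    problems.extend(f"unclosed {ch!r} opened at position {pos}" for ch, pos in stack)
    return problems


def app(environ, start_response):
    """WSGI entry point: ?format=smiles&input=... returns a JSON report."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    fmt = query.get("format", ["smiles"])[0]
    text = query.get("input", [""])[0]

    if fmt not in SUPPORTED_FORMATS:
        problems = [f"unsupported format {fmt!r}"]
    elif not text:
        problems = ["no input provided"]
    else:
        problems = find_problems(text)

    status = "200 OK" if not problems else "400 Bad Request"
    body = json.dumps({"format": fmt, "input": text, "problems": problems}).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    # Rendering the depiction (PNG or SVG) is deliberately left out of this sketch.
    return [body]
```

The application can be served locally with `wsgiref.simple_server.make_server("localhost", 8000, app).serve_forever()`. Characters such as `#` and `+` need to be percent-encoded in the query string, which is one more reason a small web front end for entering the input would be useful.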
Question 2: new format(s)?
Some possible reasons for this:
● Efficiently storing large groups of molecules with associated data. Perhaps
data beyond basic types like text and numbers
● Having something well documented and clear
● Having something a bit easier to parse (for both computers and humans)
● Andrew provided others in his talk
Functional:
● Doing something reasonable with partial or "odd" stereochemistry
● Doing something reasonable with non-traditional bond types (like what you
find in organometallics)
Dealing with metals
Just a quick example to show what a train-wreck things currently are
Dealing with metals: cisplatin
[Four slides of figures, not preserved]
Dealing with metals: hemin
[Figures: representation from DrugBank; representation from PubChem; one further slide not preserved]
A concrete suggestion
Ok, really just a collection of bullet points, mainly reasons why this is nuts
● The biggest problem is going to be adoption
● Assumption: anything that is used only (or mostly) by toolkits is going to be
easier than anything requiring a sketcher
● Some parts are easier than others:
○ A format for dealing with large numbers of molecules + data is probably not that bad. Adoption
is at the toolkit level
○ A format for molecules is harder… It needs support within both sketchers and readers. Oh,
and reference data that can be used to develop and validate the format.
● Still, maybe HELM and (maybe) MMTF show that this is possible?
● Get a group of interested people together and start a discussion?
Wrapping up
The questions:
1. Where are our most common file/interchange formats actually defined?
2. Do we need new interchange format(s)?
3. How should we standardize molecules?
And the RDKit:
● Liberally licensed open-source chemistry toolkit accessible from many places
Thanks!
greg.landrum@t5informatics.com
Interested? Want More?
www.rdkit.org
5th User Group meeting 26-28 October in Basel
@RDKit_org
@dr_greg_landrum

Google BigQuery for analysis of scientific datasets: Interactive exploration ...
 
Building useful models for imbalanced datasets (without resampling)
Building useful models for imbalanced datasets (without resampling)Building useful models for imbalanced datasets (without resampling)
Building useful models for imbalanced datasets (without resampling)
 
Moving from Artisanal to Industrial Machine Learning
Moving from Artisanal to Industrial Machine LearningMoving from Artisanal to Industrial Machine Learning
Moving from Artisanal to Industrial Machine Learning
 
Building useful models for imbalanced datasets (without resampling)
Building useful models for imbalanced datasets (without resampling)Building useful models for imbalanced datasets (without resampling)
Building useful models for imbalanced datasets (without resampling)
 
How Do You Build and Validate 1500 Models and What Can You Learn from Them?
How Do You Build and Validate 1500 Models and What Can You Learn from Them? How Do You Build and Validate 1500 Models and What Can You Learn from Them?
How Do You Build and Validate 1500 Models and What Can You Learn from Them?
 
Interactive and reproducible data analysis with the open-source KNIME Analyti...
Interactive and reproducible data analysis with the open-source KNIME Analyti...Interactive and reproducible data analysis with the open-source KNIME Analyti...
Interactive and reproducible data analysis with the open-source KNIME Analyti...
 
Is one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchIs one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical research
 
Large scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent dataLarge scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent data
 
Machine learning in the life sciences with knime
Machine learning in the life sciences with knimeMachine learning in the life sciences with knime
Machine learning in the life sciences with knime
 
Open-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKitOpen-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKit
 
Open-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databasesOpen-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databases
 
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...
 
Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...
 

Recently uploaded

SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxSwapnil Therkar
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionPriyansha Singh
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 

Recently uploaded (20)

SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorption
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 

Challenges and solutions on open-source/open-data fronts

  • 1. Some "challenges" on the open-source/open-data front Along with a few thoughts on solutions Greg Landrum MIOSS, Hinxton May 2016 T5 Informatics GmbH greg.landrum@t5informatics.com @dr_greg_landrum This work is licensed under a Creative Commons Attribution 4.0 International License.
  • 2. First things first: what's T5 Informatics? ● Commercial organization built around the open-source RDKit toolkit ● Very new: founded in March 2016 ● Offers maintenance contracts, support, and training for the RDKit, as well as custom development work ● Still very much an experiment ● Some thoughts about the business model here: https://medium.com/@greg.landrum_t5
  • 4. Flashback to earlier this year
  • 5. The interoperability problem: the simple, one-slide version. Task: generate a set of standard “Lipinski” parameters for Esomeprazole. [Figure: RDKit output vs. CDK output for # rotatable bonds, exact mass, AMW, TPSA, calculated logP, # heavy atoms, and donors and acceptors (oh my!)] Good luck if any of those descriptors are used in your QSAR model and you pick the wrong software.
  • 6. Looking things up is hard too... [Screenshots: ChEMBL, PubChem, ChemSpider] Amusingly, they all have different structure drawings.
  • 7. The interoperability problem ● Processing chemical and biological data is hard, and people have different workflows ● We will always be using multiple tools to analyze and present results ● There are standard algorithms, but different implementations lead to different results ● One thing that would help is a single implementation that's usable in many different places ● If the source is open, it can be archived and packaged to provide reproducibility and allow new work to build on a standard framework ● This is the approach we've taken with the RDKit. Note: there's another big mess around file formats and data quality, but that's the topic for another session (or three).
  • 8. The RDKit code ecosystem (“ecodesystem”? Probably not.) C++: core data structures and algorithms. [Diagram: PostgreSQL, Boost.Python, SWIG, Python, Java, C#, Jupyter, Pandas, KNIME] The exact same implementation is available in all endpoints.
  • 9. The RDKit: an open-source toolkit for cheminformatics (www.rdkit.org) ● Business-friendly BSD license ● Runs on Linux/Mac/Windows ● Commercial support available ● Releases every six months ● Active and engaged community ● Usable from Python (2 or 3), C++, C#, or Java ● Basic functionality highlights: ○ Chemical reactions ○ 2D depiction ○ Substructure searching ○ Canonical SMILES ○ Gasteiger-Marsili charges ○ Molecular standardization ● 2D functionality highlights: ○ RECAP and BRICS support ○ Multi-molecule MCS ○ Similarity maps ○ Functional group filters ○ Diversity picking ● Supported fingerprint highlights: ○ Morgan/Feature Morgan (ECFP/FCFP-like) ○ RDKit (Daylight-like) ○ Atom-pairs and topological torsions ○ MACCS keys ○ Avalon ○ Fast similarity searching from FPB files ● Descriptor highlights: ○ Hall-Kier and descriptors ○ SLogP, SMR, TPSA ○ MQN ○ “MOE-like” VSA ○ Compositional (number of donors, number of rings, number of heterocycles, etc.) ● 3D functionality highlights: ○ 2D->3D conversion/conformational analysis via distance geometry ○ UFF and MMFF94/MMFF94S implementations for cleaning up structures ○ Feature maps and feature-map vectors ○ Shape-based similarity ○ RMSD-based molecule-molecule alignment ○ Open3DAlign implementation ○ Integration with PyMOL ○ Torsion Fingerprint Differences
  • 10. Let's go back a few slides
  • 11. End of the flashback
  • 12. Some questions 1. Where are our most common file/interchange formats actually defined? How do we know what they mean? 2. Do we need new interchange format(s)? 3. How should we standardize molecules?
  • 13. Question 3: standardizing molecules ● I want to see this molecule the way it'd be stored in PubChem, or ChEMBL, or Open PHACTS, or ... ● I want to standardize this molecule so that I can register it, if necessary ● … but I want to standardize it using my rules. Looks like we're going to be talking about this tomorrow.
  • 14. Question 1: formats ● Definitions: what's the syntax? What does this term mean? ○ SMILES: ■ Daylight's reference: http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html ■ OpenSMILES: http://www.opensmiles.org/opensmiles.html ○ CTAB/MOL/SDF: ■ ctfile.pdf (somewhat publicly available) ■ Various MDL/Symyx/Accelrys manuals (not publicly available) ○ SMARTS: ■ Daylight's reference: http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html ● Testing/visualization: is this valid? What does this represent? ○ SMILES: used to be the depict.cgi server ○ CTAB/MOL/SDF: your most trusted chemical editor, maybe two of them ○ SMARTS: used to be depictmatch.cgi. I picked this subset because I think it covers the most common molecular interchange formats. There are of course many other possibilities.
  • 15. Formats and validation. Reasons you might want this: ● Is "C1.C1" a valid SMILES? What does it correspond to? ● Is "C1CCC=1" a valid SMILES? What does it correspond to? ● What does this mean? [Figure] Amusing fact: there's a 12+ page explanation of how tetrahedral stereochemistry should be handled in MOL blocks in one of those non-public documents. That's bad enough, and I didn't even talk about S-groups, R-groups or query features in CTAB/MOL... … or recursive SMARTS.
  • 16. A concrete suggestion ● Formats: ○ OpenSMILES: revive this effort and address outstanding questions (already happening) ○ OpenSMARTS: find a group of interested participants and assemble and publish an open definition (similar to what happened with OpenSMILES) ■ Requires: organizer, participants, sample data ○ OpenCTAB: find a group of interested participants, agree on the subset that will be included, and assemble and publish an open definition ■ Requires: organizer, participants, sample data ● Validation/Visualization: ○ A fully open-source (and permissively licensed) web service that returns images (PNG or SVG) for a provided input in one of the supported formats. This service would ideally have good error reporting to help identify problems in the input ○ A hosted version of this service usable by the community ○ A fully open-source (and permissively licensed) basic web application for providing input and seeing the results ○ A hosted version of the web application. As long as we don't extend any of the formats, we don't need to worry (too much) about adoption or vendor support: it's already there.
  • 17. Question 2: new format(s)? Some possible reasons for this: ● Efficiently storing large groups of molecules with associated data, perhaps data beyond basic types like text and numbers ● Having something well documented and clear ● Having something a bit easier to parse (for both computers and humans) ● Andrew provided others in his talk. Functional: ● Doing something reasonable with partial or "odd" stereochemistry ● Doing something reasonable with non-traditional bond types (like what you find in organometallics)
  • 18. Dealing with metals: just a quick example to show what a train-wreck things currently are
  • 19. Dealing with metals: cisplatin
  • 20. Dealing with metals: cisplatin
  • 21. Dealing with metals: cisplatin
  • 22. Dealing with metals: cisplatin
  • 23. Dealing with metals: hemin. Representation from DrugBank; representation from PubChem
  • 24. Dealing with metals: hemin
  • 25. A concrete suggestion (OK, really just a collection of bullet points, mainly reasons why this is nuts) ● The biggest problem is going to be adoption ● Assumption: anything that is used only (or mostly) by toolkits is going to be easier than anything requiring a sketcher ● Some parts are easier than others: ○ A format for dealing with large numbers of molecules + data is probably not that bad. Adoption is at the toolkit level ○ A format for molecules is harder… It needs support within both sketchers and readers. Oh, and reference data that can be used to develop and validate the format ● Still, maybe HELM and (maybe) MMTF show that this is possible? ● Get a group of interested people together and start a discussion?
  • 26. Wrapping up. The questions: 1. Where are our most common file/interchange formats actually defined? 2. Do we need new interchange format(s)? 3. How should we standardize molecules? And the RDKit: ● Liberally licensed open-source chemistry toolkit accessible from many places
  • 27. Thanks! greg.landrum@t5informatics.com Interested? Want more? www.rdkit.org 5th User Group meeting, 26-28 October in Basel. @RDKit_org @dr_greg_landrum
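
The two SMILES questions on slide 15 ("C1.C1" and "C1CCC=1") both come down to how ring-closure digits are paired. The sketch below is not from the talk and is not how the RDKit (or any toolkit) parses SMILES; it is a minimal, standard-library-only illustration that assumes an OpenSMILES-style reading, in which a ring-closure digit may be matched across a dot and a bond symbol may appear on either digit of a pair. The function name `ring_closures` and the restricted token set are hypothetical choices made for this example.

```python
import re

# Simplified tokenizer (an assumption of this sketch): bracket atoms,
# organic-subset atoms, bond symbols, ring-closure labels, branches and dots.
TOKEN = re.compile(
    r"(?P<atom>\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops])"
    r"|(?P<bond>[-=#$:/\\])"
    r"|(?P<ring>%\d\d|\d)"
    r"|(?P<other>[().])"
)


def ring_closures(smiles):
    """Pair up ring-closure labels in a SMILES string.

    Returns a list of (first_atom_index, second_atom_index, bond_symbol)
    tuples; bond_symbol is None when no bond symbol was written.
    Raises ValueError for labels that are never closed or that carry
    conflicting bond symbols.
    """
    open_rings = {}      # ring label -> (atom index, bond symbol or None)
    closures = []
    atom_index = -1      # index of the most recently read atom
    pending_bond = None  # bond symbol seen since the last atom/label
    pos = 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unrecognized input at position {pos}")
        pos = match.end()
        kind, text = match.lastgroup, match.group()
        if kind == "atom":
            atom_index += 1
            pending_bond = None
        elif kind == "bond":
            pending_bond = text
        elif kind == "ring":
            if atom_index < 0:
                raise ValueError("ring-closure label before the first atom")
            label = int(text.lstrip("%"))
            if label in open_rings:
                start, start_bond = open_rings.pop(label)
                if start == atom_index:
                    raise ValueError("ring closure bonds an atom to itself")
                if start_bond and pending_bond and start_bond != pending_bond:
                    raise ValueError("conflicting bond symbols on ring closure")
                closures.append((start, atom_index, start_bond or pending_bond))
            else:
                open_rings[label] = (atom_index, pending_bond)
            pending_bond = None
        else:
            # Branch parentheses and dots do not affect label pairing here.
            pending_bond = None
    if open_rings:
        raise ValueError(f"unclosed ring labels: {sorted(open_rings)}")
    return closures


print(ring_closures("C1.C1"))    # one bond joining atoms 0 and 1
print(ring_closures("C1CCC=1"))  # a double bond closing a four-membered ring
```

Under that reading, "C1.C1" describes a single bond between two carbons (the same molecule as "CC") and "C1CCC=1" describes cyclobutene. Whether a given toolkit accepts either string is exactly the kind of question an open definition and a reference validation service would settle.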