Hosting public domain chemicals
data online for the community – the
challenges of handling materials
Antony Williams
Opportunities in Materials Informatics, University of Wisconsin-Madison
February 9th, 2015
0000-0002-2668-4821
About Me…
• I am NOT a materials chemist
• I am an NMR spectroscopist by training
• Worked on a LIMS while at Kodak
• 10 years in commercial cheminformatics
• Built the ChemSpider database as a hobby
• Worked on validating compounds on Wikipedia
• Manage cheminformatics team for RSC
• Believer in the value of social networking and
Open Data for science
• Dane Morgan asked me to tell jokes…
I would tell a chemistry joke…
But all of the good ones…
An ambitious idea….
• Let’s map together all online chemistry data
and build systems to integrate it
• Heck, let’s integrate chemistry and biology
data and add in disease data too if we can
• Let’s extract property data and model it and
see if we can extract new relationships –
quantitative and qualitative
• Let’s make it all available on the web…for free
What about this….
• We’re going to map the world
• We’re going to take photos of as many places
as we can and link them together
• We’ll let people annotate and curate the map
• Then let’s make it available free on the web
• We’ll make it available for decision making
• Put it on Mobile Devices, give it away…
Where is chemistry online?
• Encyclopedic articles (Wikipedia)
• Chemical vendor databases
• Metabolic pathway databases
• Property databases
• Patents with chemical structures
• Drug Discovery data
• Scientific publications
• Compound aggregators
• Blogs/Wikis and Open Notebook Science
Chemistry on the Internet…
• Most searching for chemistry on the internet…
• Name searching Google/Bing/Yahoo
• Name searching Wikipedia
• Name searching Wolfram Alpha
• Name, name, name, name…searching
• Structure searching DOZENS of websites, each
with different information or…
• Search ONE website integrating the others!
• ~30 million chemicals and growing
• Data sourced from >500 different sources
• Crowd sourced curation and annotation
• Ongoing deposition of data from our
journals and our collaborators
• Structure centric hub for web-searching
• …and a really big dictionary!!!
• Note…NOT all websites connected
ChemSpider
Experimental/Predicted Properties
Literature references
Patent references
RSC Books
Google Books
Vendors and data sources
APIs
Organic Chemistry is hard…
…it has alkynes of trouble
Flavors of Chemistry
Molfiles
10 9 0 0 1 0 0 0 0 0 1 V2000
31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3 1 2 0 0 0 0
4 1 1 0 0 0 0
9 1 1 0 0 0 0
7 2 1 0 0 0 0
5 2 2 0 0 0 0
8 2 1 0 0 0 0
6 4 1 0 0 0 0
4 10 1 6 0 0 0
7 6 1 0 0 0 0
M END
Molfiles
• Molfiles are the primary exchange format
between structure drawing packages
• Can be different between different drawing
packages
• Most commonly carry X,Y coordinates for layout
• Can support polymers, organometallics, etc.
• Can carry 3D coordinates
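As a rough illustration of what the block above contains, the sketch below reads the counts line, the atom block and the bond block of the example. It assumes whitespace-separated fields, because the fixed-width columns of the original file were collapsed on the slide; a real V2000 molfile also has three header lines and should be read by column position or with a cheminformatics toolkit. The function name `parse_v2000` is made up for this example.

```python
from collections import Counter

# The example from the slide (header lines are absent there, so they are absent here).
MOLFILE_BODY = """\
10 9 0 0 1 0 0 0 0 0 1 V2000
31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3 1 2 0 0 0 0
4 1 1 0 0 0 0
9 1 1 0 0 0 0
7 2 1 0 0 0 0
5 2 2 0 0 0 0
8 2 1 0 0 0 0
6 4 1 0 0 0 0
4 10 1 6 0 0 0
7 6 1 0 0 0 0
M END
"""


def parse_v2000(text):
    """Return (atoms, bonds) from a whitespace-collapsed V2000 connection table."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    counts = lines[0].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for ln in lines[1:1 + n_atoms]:
        x, y, z, symbol = ln.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for ln in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        first, second, order, stereo = (int(v) for v in ln.split()[:4])
        bonds.append((first, second, order, stereo))
    return atoms, bonds


atoms, bonds = parse_v2000(MOLFILE_BODY)
heavy_atoms = Counter(symbol for symbol, *_ in atoms)
double_bonds = [b for b in bonds if b[2] == 2]
stereo_bonds = [b for b in bonds if b[3] != 0]
print(dict(heavy_atoms), len(double_bonds), stereo_bonds)
```

The fourth field of a bond line is the stereo flag, which is how a drawing package records a wedge or hash bond alongside the X,Y layout.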
SMILES
• SMILES is a common format
• Can support polymers,
organometallics, etc.
• Does NOT carry X,Y or Z
coordinates for layout so
requires layout algorithms –
can be problematic!
• Generally different between
drawing packages
Stereo
Tautomeric forms
Vendor-dependent SMILES
ACD/Labs
CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O
OpenEye
CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C
ChEMBL
CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(=CCC1=C(C)C(=O)c2ccccc2C1=
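To make the point concrete, the sketch below takes the ACD/Labs and OpenEye strings above and shows that they differ as text while containing the same heavy atoms. This is only a crude sanity check written for this example (the tokenizer and the name `heavy_atom_counts` are assumptions, not part of any toolkit); it says nothing about connectivity or stereo, and proving that two SMILES describe the same structure needs canonicalization in a cheminformatics toolkit.

```python
import re
from collections import Counter

# Bracket atoms first, then two-letter halogens, then the single-letter
# organic subset (uppercase aliphatic, lowercase aromatic).
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Inside a bracket atom: optional isotope digits, then the element symbol.
BRACKET_SYMBOL = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms per element in a SMILES string (rough tokenizer)."""
    counts = Counter()
    for token in ATOM_TOKEN.findall(smiles):
        if token.startswith("["):
            symbol = BRACKET_SYMBOL.match(token).group(1)
        else:
            symbol = token
        symbol = symbol.capitalize()  # aromatic 'c' counts as carbon
        if symbol != "H":
            counts[symbol] += 1
    return counts


ACD_LABS = "CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O"
OPENEYE = "CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C"

same_text = ACD_LABS == OPENEYE
same_atoms = heavy_atom_counts(ACD_LABS) == heavy_atom_counts(OPENEYE)
print(same_text, same_atoms, dict(heavy_atom_counts(ACD_LABS)))
```

A plain string comparison would treat these as two different compounds, which is why vendor-dependent SMILES are a poor database key. Note also that only the OpenEye string carries the `/C=C/` double-bond geometry.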
Chemists are good…
The InChI Identifier
InChI
• SINGLE code base managed by IUPAC –
integrated into drawing packages. No
variability as with SMILES
• InChI Strings can be reversed to structures –
same problem as with SMILES – no layout
• Adopted by the community (databases, blogs,
Wikipedia) – good for searching the internet
Multiple Layers
Tautomers
Stereo
InChI Strings Hash to InChIKeys
Structure search the web
Exact Search
Skeleton Search
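The difference between an exact search and a skeleton search can be sketched with the layout of a standard InChIKey: a 14-letter first block hashed from the connectivity, then a second block covering the remaining layers (stereo, isotopes) plus a standard/non-standard flag and a version letter, then a protonation character. The helper names below are made up for this example, and the two alanine keys are quoted from memory, so they should be checked against a database before being relied on.

```python
import re

# 14 letters - 8 letters + flag (S/N) + version letter - protonation letter
INCHIKEY = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Split an InChIKey into its blocks, or raise ValueError if malformed."""
    match = INCHIKEY.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, detail, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,        # connectivity hash
        "detail": detail,            # hash of stereo/isotope layers
        "standard": flag == "S",
        "version": version,
        "protonation": protonation,
    }


def exact_match(key_a, key_b):
    return key_a.strip() == key_b.strip()


def skeleton_match(key_a, key_b):
    return split_inchikey(key_a)["skeleton"] == split_inchikey(key_b)["skeleton"]


# Assumed values: L-alanine and alanine drawn without stereochemistry.
L_ALANINE = "QNAYBMKLOCPYGJ-REOHGBJASA-N"
ALANINE_NO_STEREO = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"

print(exact_match(L_ALANINE, ALANINE_NO_STEREO))     # different full keys
print(skeleton_match(L_ALANINE, ALANINE_NO_STEREO))  # same first block
```

Searching the web for the first block alone finds every stereo variant of a skeleton; searching for the full key finds only the exact form.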
Data Quality/Standardization
• MANY structures online are MISREPRESENTED: the
structure shown is not the compound it is meant to be.
• Commonly you will have better success finding
information by name searches than structure –
with many caveats of course…
• Validating chemical structure representations
is laborious work – and it’s shocking to review
data…
Data Quality Issues
Williams and Ekins, DDT, 16: 747-750 (2011)
Science Translational Medicine 2011
Data quality is a known issue
Substructure  | # of Hits | # of Correct Hits | No stereochemistry | Incomplete stereochemistry | Complete but incorrect stereochemistry
Gonane        | 34        | 5                 | 8                  | 21                         | 0
Gon-4-ene     | 55        | 12                | 3                  | 33                         | 7
Gon-1,4-diene | 60        | 17                | 10                 | 23                         | 10
Only 34 out of 149 structures were correct!
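The headline figure can be re-derived from the table. The short check below confirms that each row's four categories add up to its hit count and that the totals give 34 correct structures out of 149 (about 23%). The variable names are illustrative only.

```python
# (hits, correct, no stereo, incomplete stereo, complete but incorrect stereo)
ROWS = {
    "Gonane": (34, 5, 8, 21, 0),
    "Gon-4-ene": (55, 12, 3, 33, 7),
    "Gon-1,4-diene": (60, 17, 10, 23, 10),
}

# Each hit falls into exactly one of the four categories.
rows_consistent = all(hits == sum(rest) for hits, *rest in ROWS.values())

total_hits = sum(row[0] for row in ROWS.values())
total_correct = sum(row[1] for row in ROWS.values())
percent_correct = round(100 * total_correct / total_hits, 1)

print(rows_consistent, total_hits, total_correct, percent_correct)
```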
Patent data in public
databases
You just can’t trust atoms!
They make up everything…
ALL variants of Yohimbine!!!
What’s Methane? OLD PUBCHEM
What ELSE is Methane???
NEW PUBCHEM
Depiction vs Accurate
Representation
What is the Structure of Vitamin K1?
Standardize
• Use the SRS as a guidance document for
standardization
• Adjust as necessary to our needs
Nitro groups
Salt and Ionic Bonds
Ammonium salts
Can we MAKE Quality Data?
• We are building systems for everyone to
validate and standardize their data
DICTIONARIES are powerful
• Search all forms of structure IDs
• Systematic name(s)
• Trivial Name(s)
• SMILES
• InChI Strings
• InChIKeys
• Database IDs
• Registry Number
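A minimal sketch of such a dictionary is shown below: every identifier, whatever its type, points at one structure record. The class, the `record-1` ID and the normalization rules are assumptions made for this example rather than a description of how ChemSpider is built. Names are matched case-insensitively, while SMILES, InChI and InChIKey strings are kept case-sensitive because case carries meaning there (`c` is aromatic carbon, `C` is aliphatic).

```python
CASE_SENSITIVE = {"smiles", "inchi", "inchikey"}


class IdentifierDictionary:
    """Map many identifiers (names, SMILES, keys, registry numbers) to one record."""

    def __init__(self):
        self._index = {}

    def add(self, record_id, kind, value):
        value = value.strip()
        if kind in CASE_SENSITIVE:
            self._index[(True, value)] = record_id
        else:
            self._index[(False, value.casefold())] = record_id

    def lookup(self, query):
        query = query.strip()
        # Try the case-sensitive identifiers first, then names and numbers.
        hit = self._index.get((True, query))
        if hit is None:
            hit = self._index.get((False, query.casefold()))
        return hit


lookup_table = IdentifierDictionary()
# "record-1" is a hypothetical internal ID for aspirin.
for kind, value in [
    ("systematic_name", "2-acetyloxybenzoic acid"),
    ("trivial_name", "aspirin"),
    ("trivial_name", "acetylsalicylic acid"),
    ("smiles", "CC(=O)Oc1ccccc1C(=O)O"),
    ("inchikey", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("registry_number", "50-78-2"),
]:
    lookup_table.add("record-1", kind, value)

print(lookup_table.lookup("Aspirin"), lookup_table.lookup("50-78-2"))
```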
Many Names, One Structure
But big and often noisy
Text-Mining and Markup…
With links out to platforms
Dictionaries are invaluable
Text Mining on IUPAC Names
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-
thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride
( 5 ml ) and benzene ( 50 ml ) were charged into a glass
reaction vessel equipped with a mechanical stirrer ,
thermometer and reflux condenser .
The reaction mixture was heated at reflux with stirring , for a
period of about one-half hour .
After this time the benzene and unreacted thionyl chloride
were stripped from the reaction mixture under reduced
pressure to yield the desired product N-(β-chloroethyl)-N-
methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a
solid residue
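Dictionary-based markup of a passage like this can be sketched in a few lines. The example below is a naive matcher written for illustration (the two-entry dictionary and the function name `annotate` are assumptions); it runs over an abridged excerpt of the passage and reports where known names occur. Systematic names such as the urea derivatives need name-to-structure conversion instead, since no dictionary can list them all.

```python
import re

# Tiny illustrative dictionary: name -> molecular formula.
DICTIONARY = {
    "thionyl chloride": "SOCl2",
    "benzene": "C6H6",
}

# Abridged excerpt of the passage above.
TEXT = (
    "thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass "
    "reaction vessel . After this time the benzene and unreacted thionyl chloride "
    "were stripped from the reaction mixture under reduced pressure"
)


def annotate(text, dictionary):
    """Return (start, end, name, formula) for each dictionary name found in text."""
    # Longest names first so that a longer name wins over a shorter one inside it.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b", re.IGNORECASE
    )
    hits = []
    for match in pattern.finditer(text):
        name = match.group(1).lower()
        hits.append((match.start(), match.end(), name, dictionary[name]))
    return hits


hits = annotate(TEXT, DICTIONARY)
for start, end, name, formula in hits:
    print(start, end, name, formula)
```

Exact matching of this kind is brittle: the passage itself spells the same ring system both "thiadiazol" and "thiaidazol", and a strict matcher would treat those as different names.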
Name to Structure Conversion
ChemSpider “Annotations”
• Users can add
• Descriptions, Syntheses and Commentaries
• Links to PubMed articles
• Links to articles via DOIs
• Add spectral data
• Add Crystallographic Information Files
• Add photos
• Add MP3 files
• Add Videos
Spectral Data
• Spectral data to be deposited in standard
formats – JCAMP or images
• All spectra available at:
http://www.chemspider.com/spectra.aspx
• Data are deposited on a regular basis
• Students
• Chemical vendors
• Growing collection now
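One reason to prefer JCAMP over images is that the file describes itself. JCAMP-DX stores metadata as labelled records of the form `##LABEL=value`, so a few lines of code can recover the title, data type and units. The sketch below reads only those header records; the sample text is a made-up, incomplete fragment, and the function name is an assumption for this example.

```python
import re

# Hypothetical fragment; a real file carries more records (FIRSTX, LASTX, ...).
SAMPLE = """\
##TITLE=hypothetical sample spectrum
##JCAMP-DX=4.24
##DATA TYPE=INFRARED SPECTRUM
##XUNITS=1/CM
##YUNITS=TRANSMITTANCE
##NPOINTS=3
##XYDATA=(X++(Y..Y))
4000 0.98 0.97 0.95
##END=
"""


def parse_jcamp_header(text):
    """Collect the ##LABEL=value records of a JCAMP-DX file into a dict."""
    header = {}
    for line in text.splitlines():
        line = line.split("$$", 1)[0].strip()  # '$$' starts a comment
        if not line.startswith("##") or "=" not in line:
            continue  # data rows and blank lines
        label, value = line[2:].split("=", 1)
        # Labels are compared ignoring case, spaces, dashes, slashes, underscores.
        key = re.sub(r"[ \-/_]", "", label).upper()
        header[key] = value.strip()
    return header


header = parse_jcamp_header(SAMPLE)
print(header["TITLE"], header["DATATYPE"], header["XUNITS"], header["NPOINTS"])
```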
Student Submissions
Data on ChemSpider
Spectral Data EXTRACTION
ORIGINAL
EXTRACTED
It’s exactly the WRONG WAY!
• We should NOT be mining data out of future
publications
• Structures should be submitted “correctly”
• Spectra should be in digital spectral formats,
not images
• ESI should be RICH and interactive,
preferably with OPEN DATA
An Adventure into the World of
Small but significant contribution..
ChemSpider SyntheticPages
Micropublishing with Peer Review
(a chemical synthesis blog?)
Multi-Step Synthesis
Interactive Data
Chemistry data is of value?
• Reference databases generate hundreds of
millions of dollars/euros per year
• So much data generated that could go public
• Maybe 5% of all data generated is published
• There is no “Journal of Failed Experiments”
• Funding agencies are starting to demand Open Data
• Scientists want funding but also recognition
A shift to Openness
How will I get recognized?
• Who in the room has an ORCID?
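An ORCID is more than a label: its last character is a check digit computed with the ISO 7064 MOD 11-2 scheme, so a mistyped identifier can usually be caught before it is attached to a deposition. The sketch below applies that check to the ORCID shown on the title slide; the function names are chosen for this example.

```python
def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """True if the 16-character ORCID has the right shape and check character."""
    compact = orcid.replace("-", "").strip()
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()


print(is_valid_orcid("0000-0002-2668-4821"))  # the ORCID from the title slide
print(is_valid_orcid("0000-0002-2668-4822"))  # last digit altered
```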
Deposition of Research Data
• If we manage compounds, syntheses and
analytical data…
• If we have security and provenance of data…
• If we deliver user interfaces to satisfy the
various use cases…
• Then we have delivered electronic lab
notebooks for chemistry laboratories. ELNs
are research data repositories
Recognition: need to have Impact
Quantitating scientists?
National Information Standards
Organization and “Altmetrics”
http://www.niso.org/apps/group_public/download.php/13295/niso_altmetrics_white_paper_draft_v4.pdf
What are we building?
• We are building the “RSC Data Repository”
• Containers for compounds, reactions, analytical
data, tabular data
• Algorithms for data validation and standardization
• Flexible indexing and search technologies
• A platform for modeling data and hosting existing
models and predictive algorithms
Compounds
Reactions
Analytical data
Crystallography data
Deposition of Data
• Developing systems that provide
feedback to users regarding data quality
• Validate/standardize chemical compounds
• Check for balanced reactions
• Check spectral data
• EXAMPLE Future work
• Properties – compare experimental to predicted
• Automated structure verification - NMR
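Of the checks listed, the balanced-reaction test is the simplest to sketch: count the atoms of each element on both sides and compare. The example below is a minimal version under stated assumptions. It works on plain molecular formulas with stoichiometric coefficients, ignores charges, and does not handle parentheses or hydrates. The function names are illustrative, not those of any RSC system.

```python
import re
from collections import Counter


def atom_counts(formula):
    """Count atoms in a simple formula such as 'C2H6O' (no parentheses)."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def side_counts(side):
    """Total atoms for one side of a reaction given as (coefficient, formula) pairs."""
    total = Counter()
    for coefficient, formula in side:
        for symbol, n in atom_counts(formula).items():
            total[symbol] += coefficient * n
    return total


def is_balanced(reactants, products):
    return side_counts(reactants) == side_counts(products)


# CH4 + 2 O2 -> CO2 + 2 H2O
balanced = is_balanced([(1, "CH4"), (2, "O2")], [(1, "CO2"), (2, "H2O")])
# CH4 + O2 -> CO2 + 2 H2O (oxygen short on the left)
unbalanced = is_balanced([(1, "CH4"), (1, "O2")], [(1, "CO2"), (2, "H2O")])
print(balanced, unbalanced)
```

A deposition system could return the per-element difference between the two sides as feedback, rather than a bare pass or fail.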
So we know about ORGANICS
• Comment – you don’t know all of the
challenges until you start to work in the area!
• We, and cheminformatics companies, have
solved MANY, but not all, of the issues
regarding organic chemistry management
• The majority of our approaches do not map to
materials
• No standard ways to represent compounds
• No InChI for materials
Questions to consider…
• Organics are hard enough!
• What are your best dictionaries of materials?
• We have chemical ontologies. Status for materials?
• Is open annotation of your databases possible?
• What standards do you have for materials data
exchange?
Polymorphism is common
Known Challenges
• Many materials are non-stoichiometric
• How to represent composite materials (e.g.
supported catalysts)?
• Methods to distinguish novelty in materials
(equivalent to diversity in organic structures)?
• Many more that I will learn about at this workshop…?
Collaboration is key
Internet Data
The Future
Commercial Software
Pre-competitive Data
Open Science
Open Data
Publishers
Educators
Open Databases
Chemical Vendors
Small organic molecules
Undefined materials
Organometallics
Nanomaterials
Polymers
Minerals
Particle bound
Links to Biologicals
Thank you
Email: williamsa@rsc.org
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams

Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
 
Recombinant DNA technology( Transgenic plant and animal)
Recombinant DNA technology( Transgenic plant and animal)Recombinant DNA technology( Transgenic plant and animal)
Recombinant DNA technology( Transgenic plant and animal)
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 

Hosting public domain chemicals data online for the community – the challenges of handling materials

  • 1. Hosting public domain chemicals data online for the community – the challenges of handling materials. Antony Williams, Opportunities in Materials Informatics, University of Wisconsin-Madison, February 9th, 2015. ORCID: 0000-0002-2668-4821
  • 2. About Me… • I am NOT a materials chemist • I am an NMR spectroscopist by training • Worked on a LIMS while at Kodak • 10 years in commercial cheminformatics • Built the ChemSpider database as a hobby • Worked on validating compounds on Wikipedia • Manage cheminformatics team for RSC • Believer in the value of social networking and Open Data for science • Dane Morgan asked me to tell jokes…
  • 3. I would tell a chemistry joke… But all of the good ones…
  • 4. An ambitious idea…. • Let’s map together all online chemistry data and build systems to integrate it • Heck, let’s integrate chemistry and biology data and add in disease data too if we can • Let’s extract property data and model it and see if we can extract new relationships – quantitative and qualitative • Let’s make it all available on the web…for free
  • 6. What about this…. • We’re going to map the world • We’re going to take photos of as many places as we can and link them together • We’ll let people annotate and curate the map • Then let’s make it available free on the web • We’ll make it available for decision making • Put it on Mobile Devices, give it away…
  • 7. Where is chemistry online? • Encyclopedic articles (Wikipedia) • Chemical vendor databases • Metabolic pathway databases • Property databases • Patents with chemical structures • Drug Discovery data • Scientific publications • Compound aggregators • Blogs/Wikis and Open Notebook Science
  • 8. Chemistry on the Internet… • Most searching for chemistry on the internet… • Name searching Google/Bing/Yahoo • Name searching Wikipedia • Name searching Wolfram Alpha • Name, name, name, name…searching • Structure searching DOZENS of websites, each with different information or…
  • 9. Chemistry on the Internet… • Most searching for chemistry on the internet… • Name searching Google/Bing/Yahoo • Name searching Wikipedia • Name searching Wolfram Alpha • Name, name, name, name…searching • Structure searching DOZENS of websites, each with different information or… • Search ONE website integrating the others!
  • 10. • ~30 million chemicals and growing • Data sourced from >500 different sources • Crowd sourced curation and annotation • Ongoing deposition of data from our journals and our collaborators • Structure centric hub for web-searching • …and a really big dictionary!!! • Note…NOT all websites connected
  • 19. Vendors and data sources
  • 20. APIs
  • 21. APIs
  • 23. …it has alkynes of trouble
  • 25. Molfiles
       10  9  0  0  1  0  0  0  0  0  1 V2000
       31.2937   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
       26.6526   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
       31.2937   -7.7066    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
       30.1161   -9.6877    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
       25.5096   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
       28.9731   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
       27.8163   -9.7016    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
       26.6664   -7.7066    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
       32.4367   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
       30.1161  -11.0177    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
        3  1  2  0  0  0  0
        4  1  1  0  0  0  0
        9  1  1  0  0  0  0
        7  2  1  0  0  0  0
        5  2  2  0  0  0  0
        8  2  1  0  0  0  0
        6  4  1  0  0  0  0
        4 10  1  6  0  0  0
        7  6  1  0  0  0  0
      M  END
  • 26. Molfiles • Molfiles are the primary exchange format between structure drawing packages • Can be different between different drawing packages • Most commonly carry X,Y coordinates for layout • Can support polymers, organometallics, etc. • Can carry 3D coordinates
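To make the contents of a Molfile concrete, here is a minimal sketch that reads the counts line, atom block and bond block of a V2000 Molfile. The function name `parse_v2000` and the three-atom `EXAMPLE` are assumptions made for illustration. Real Molfiles use fixed-width columns and carry a three-line header, so whitespace splitting as used here only handles the simple case.

```python
def parse_v2000(text):
    """Parse atoms and bonds from a V2000 Molfile (whitespace-split sketch).

    Assumption: fields are separated by whitespace. Fixed-width files with
    fused columns (for example, more than 99 atoms) need column slicing.
    """
    lines = text.splitlines()
    # The counts line is the one that ends with the version tag "V2000".
    start = next(i for i, ln in enumerate(lines) if ln.rstrip().endswith("V2000"))
    counts = lines[start].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for ln in lines[start + 1 : start + 1 + n_atoms]:
        x, y, z, symbol = ln.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    first_bond = start + 1 + n_atoms
    for ln in lines[first_bond : first_bond + n_bonds]:
        a, b, order = (int(v) for v in ln.split()[:3])
        bonds.append((a, b, order))  # atom indices are 1-based in the file
    return atoms, bonds


# Hypothetical three-atom example (O=C=O), written by hand for this sketch.
EXAMPLE = """\
  3  2  0  0  0  0  0  0  0  0  1 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.2000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  1  3  2  0  0  0  0
M  END
"""

atoms, bonds = parse_v2000(EXAMPLE)
```

The coordinates recovered this way are what a drawing package uses for layout, which is the information a SMILES string does not carry.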
  • 27. SMILES • SMILES is a common format • Can support polymers, organometallics, etc. • Does NOT carry X,Y or Z coordinates for layout so requires layout algorithms – can be problematic! • Generally different between drawing packages
  • 33. InChI • SINGLE code base managed by IUPAC – integrated into drawing packages. No variability as with SMILES • InChI Strings can be reversed to structures – same problem as with SMILES – no layout • Adopted by the community (databases, blogs, Wikipedia) – good for searching the internet
  • 37. InChIStrings Hash to InChIKeys
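The idea of hashing a long identifier string into a short, fixed-length search key can be sketched as follows. This is NOT the real InChIKey algorithm, which produces a 27-character key with its own block structure and encoding. The helper `toy_key` and its 14-character hex output are assumptions chosen only to show why a hash is convenient for web searching and indexing.

```python
import hashlib

def toy_key(inchi: str, length: int = 14) -> str:
    """Return a fixed-length key for an InChI string.

    Illustrative only: a truncated SHA-256 hex digest, not a real InChIKey.
    """
    digest = hashlib.sha256(inchi.encode("utf-8")).hexdigest()
    return digest[:length].upper()

METHANE = "InChI=1S/CH4/h1H4"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

# Keys have the same length regardless of how long the input string is,
# so they are easy to index and to paste into a search engine.
index = {toy_key(s): s for s in (METHANE, ETHANOL)}
```

Because a hash is one-way, the key alone cannot be converted back into a structure. A lookup table such as `index` is needed to get from the key to the full string.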
  • 41. Data Quality/Standardization • MANY structures meant to be something online are MISREPRESENTED. • Commonly you will have better success finding information by name searches than structure – with many caveats of course… • Validating chemical structure representations is laborious work – and it’s shocking to review data…
  • 42. Data Quality Issues Williams and Ekins, DDT, 16: 747-750 (2011) Science Translational Medicine 2011
  • 43. Data quality is a known issue
  • 44. Data quality is a known issue
  • 45.
      Substructure     # of Hits   # of Correct Hits   No stereochemistry   Incomplete stereochemistry   Complete but incorrect stereochemistry
      Gonane           34          5                   8                    21                           0
      Gon-4-ene        55          12                  3                    33                           7
      Gon-1,4-diene    60          17                  10                   23                           10
      Only 34 out of 149 structures were correct!
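The summary figure on slide 45 follows directly from the per-substructure counts. The short check below recomputes it. The variable names are assumptions, and the numbers are copied from the slide.

```python
# (hits, correct hits) per substructure, as listed on slide 45
rows = {
    "Gonane": (34, 5),
    "Gon-4-ene": (55, 12),
    "Gon-1,4-diene": (60, 17),
}

total_hits = sum(hits for hits, _ in rows.values())
total_correct = sum(correct for _, correct in rows.values())
fraction_correct = total_correct / total_hits

print(f"{total_correct} of {total_hits} correct "
      f"({100 * fraction_correct:.1f}%)")
```

In other words, fewer than one in four of the retrieved structures was fully correct.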
  • 46. Patent data in public databases
  • 47. Patent data in public databases
  • 48. You just can’t trust atoms!
  • 49. You just can’t trust atoms! They make up everything…
  • 50. ALL variants of Yohimbine!!!
  • 52. What ELSE is Methane???
  • 56. What is the Structure of Vitamin K1?
  • 57. Standardize • Use the SRS as a guidance document for standardization • Adjust as necessary to our needs
  • 59. Salt and Ionic Bonds
  • 61. Can we MAKE Quality Data? • We are building systems for everyone to validate and standardize their data
  • 62. DICTIONARIES are powerful • Search all forms of structure IDs • Systematic name(s) • Trivial Name(s) • SMILES • InChI Strings • InChIKeys • Database IDs • Registry Number
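A dictionary of this kind can be pictured as one index that maps every known identifier to a single structure record. The sketch below is an assumption about how such a lookup might be organized, not a description of how ChemSpider is implemented. The class `StructureDictionary` and the record ID `rec-methane` are hypothetical.

```python
class StructureDictionary:
    """Map many identifier types onto one record ID (illustrative sketch)."""

    def __init__(self):
        self._index = {}

    def add(self, record_id, kind, value):
        value = value.strip()
        # Names are matched case-insensitively. SMILES and InChI strings are
        # case-sensitive, so they are stored exactly as given.
        key = value.casefold() if kind == "name" else value
        self._index[key] = record_id

    def lookup(self, query):
        query = query.strip()
        if query in self._index:            # exact match: SMILES, InChI, IDs
            return self._index[query]
        return self._index.get(query.casefold())  # fall back to name match


d = StructureDictionary()
d.add("rec-methane", "name", "Methane")
d.add("rec-methane", "name", "Marsh gas")
d.add("rec-methane", "smiles", "C")
d.add("rec-methane", "inchi", "InChI=1S/CH4/h1H4")
d.add("rec-methane", "registry", "74-82-8")
```

With this layout, `d.lookup("marsh gas")`, `d.lookup("C")` and `d.lookup("74-82-8")` all resolve to the same record. The sketch ignores the noisy case where one name points at several structures.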
  • 63. Many Names, One Structure
  • 64. But big and often noisy
  • 67. With links out to platforms
  • 69. Text Mining on IUPAC Names: The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
  • 70. Text Mining on IUPAC Names: The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
  • 71. Name to Structure Conversion
  • 72. Name to Structure Conversion
  • 73. ChemSpider “Annotations” • Users can add • Descriptions, Syntheses and Commentaries • Links to PubMed articles • Links to articles via DOIs • Add spectral data • Add Crystallographic Information Files • Add photos • Add MP3 files • Add Videos
  • 74. Spectral Data • Spectral data to be deposited in standard formats – JCAMP or images • All spectra available at: http://www.chemspider.com/spectra.aspx • Data are deposited on a regular basis • Students • Chemical vendors • Growing collection now
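JCAMP-DX files are plain text in which metadata appears as labeled records of the form `##LABEL=value`. A minimal sketch for reading those header records is shown below. The function name `read_jcamp_header` and the sample file content are assumptions for illustration, and the sketch does not decode the compressed data tables that real files may contain.

```python
def read_jcamp_header(text):
    """Collect '##LABEL=value' records from JCAMP-DX text into a dict."""
    header = {}
    for line in text.splitlines():
        line = line.split("$$", 1)[0].strip()  # "$$" starts a comment
        if line.startswith("##") and "=" in line:
            label, value = line[2:].split("=", 1)
            header[label.strip().upper()] = value.strip()
    return header


# Hypothetical sample written by hand for this sketch.
SAMPLE = """\
##TITLE=Example spectrum
##JCAMP-DX=4.24
##DATA TYPE=INFRARED SPECTRUM
##XUNITS=1/CM
##YUNITS=TRANSMITTANCE
##NPOINTS=3
##XYDATA=(X++(Y..Y))
4000 0.91 0.90 0.88
##END=
"""

header = read_jcamp_header(SAMPLE)
```

Having units and point counts available as machine-readable fields is what makes a digital spectrum more useful than an image of the same spectrum.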
  • 79. It’s exactly the WRONG WAY! • We should NOT be mining data out of future publications • Structures should be submitted “correctly” • Spectra should be digital spectral formats, not images • ESI should be RICH and interactive, preferably with OPEN DATA
  • 80. An Adventure into the World of Small but significant contribution..
  • 82. Micropublishing with Peer Review (a chemical synthesis blog?)
  • 85. Chemistry data is of value? • Reference databases generate hundreds of millions of dollars/euros per year • So much data generated that could go public • Maybe 5% of all data generated is published • There is no “Journal of Failed Experiments” • Funding agencies are starting to demand Open Data • Scientists want funding but also recognition
  • 86. A shift to Openness
  • 87. How will I get recognized? • Who in the room has an ORCID?
  • 88. Deposition of Research Data • If we manage compounds, syntheses and analytical data… • If we have security and provenance of data… • If we deliver user interfaces to satisfy the various use cases… • Then we have delivered electronic lab notebooks for chemistry laboratories. ELNs are research data repositories
  • 89. Recognition: need to have Impact
  • 91. National Information Standards Organization and “Altmetrics” http://www.niso.org/apps/group_public/download.php/13295/niso_altmetrics_white_paper_draft_v4.pdf
  • 92. What are we building? • We are building the “RSC Data Repository” • Containers for compounds, reactions, analytical data, tabular data • Algorithms for data validation and standardization • Flexible indexing and search technologies • A platform for modeling data and hosting existing models and predictive algorithms
  • 97. Deposition of Data • Developing systems that provide feedback to users regarding data quality • Validate/standardize chemical compounds • Check for balanced reactions • Check spectral data • EXAMPLE Future work • Properties – compare experimental to predicted • Automated structure verification - NMR
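One of the checks listed above, testing whether a reaction is balanced, can be sketched with simple atom counting. This is an illustrative assumption about how such a check might work, not the RSC implementation. The helpers `count_atoms`, `side_total` and `is_balanced` are hypothetical and handle only plain formulas without brackets, charges or hydrates.

```python
import re
from collections import Counter

def count_atoms(formula):
    """Count atoms in a simple formula such as 'C2H6O'."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts

def side_total(side):
    """Sum atom counts over (coefficient, formula) pairs."""
    total = Counter()
    for coefficient, formula in side:
        for symbol, n in count_atoms(formula).items():
            total[symbol] += coefficient * n
    return total

def is_balanced(reactants, products):
    """True when both sides contain the same number of each element."""
    return side_total(reactants) == side_total(products)


# CH4 + 2 O2 -> CO2 + 2 H2O
balanced = is_balanced([(1, "CH4"), (2, "O2")], [(1, "CO2"), (2, "H2O")])
# CH4 + O2 -> CO2 + H2O (coefficients missing)
unbalanced = is_balanced([(1, "CH4"), (1, "O2")], [(1, "CO2"), (1, "H2O")])
```

A deposition system could report the per-element difference between the two sides as feedback, instead of returning only a yes/no answer.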
  • 98. So we know about ORGANICS • Comment – you don’t know all of the challenges until you start to work in the area! • We, and cheminformatics companies, have solved MANY, but not all of the issues regarding organic chemistry management • The majority of our approaches do not map to materials • No standard ways to represent compounds • No InChI for materials
  • 99. Questions to consider… • Organics are hard enough! • What are your best dictionaries of materials? • We have chemical ontologies. Status for materials? • Is open annotation of your databases possible? • What standards do you have for materials data exchange?
  • 101. Known Challenges • Many materials are non-stoichiometric • How to represent composite materials (e.g. supported catalysts)? • Methods to distinguish novelty in materials (equivalent to diversity in organic structures)? • Many more I will learn at this workshop..?
  • 103. Internet Data The Future Commercial Software Pre-competitive Data Open Science Open Data Publishers Educators Open Databases Chemical Vendors Small organic molecules Undefined materials Organometallics Nanomaterials Polymers Minerals Particle bound Links to Biologicals
  • 104. Thank you Email: williamsa@rsc.org ORCID: 0000-0002-2668-4821 Twitter: @ChemConnector Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams