The Royal Society of Chemistry hosts one of the world’s richest collections of online chemistry data, free to access for the community. ChemSpider presently hosts over 30 million unique chemical compounds together with associated data, accessible via a number of search techniques. With almost 50,000 unique users per day from around the world, the site offers scientists the ability to investigate the world of small molecules via property searches, analytical data and predictive models. The challenges associated with providing a similar platform for “materials” are manifold but, if they could be addressed, such a platform would offer a valuable service to the materials community. This presentation will provide an overview of how ChemSpider was built, our efforts to expand its capabilities into a more encompassing data repository, and some of the challenges we face in embracing the diverse world of materials informatics and online data access.
Hosting public domain chemicals data online for the community – the challenges of handling materials
1. Hosting public domain chemicals data online for the community – the challenges of handling materials
Antony Williams
Opportunities in Materials Informatics, University of Wisconsin-Madison
February 9th, 2015
0000-0002-2668-4821
2. About Me…
• I am NOT a materials chemist
• I am an NMR spectroscopist by training
• Worked on a LIMS while at Kodak
• 10 years in commercial cheminformatics
• Built the ChemSpider database as a hobby
• Worked on validating compounds on Wikipedia
• Manage cheminformatics team for RSC
• Believer in the value of social networking and
Open Data for science
• Dane Morgan asked me to tell jokes…
3. I would tell a chemistry joke…
But all of the good ones…
4. An ambitious idea….
• Let’s map together all online chemistry data
and build systems to integrate it
• Heck, let’s integrate chemistry and biology
data and add in disease data too if we can
• Let’s extract property data and model it and
see if we can extract new relationships –
quantitative and qualitative
• Let’s make it all available on the web…for free
5.
6. What about this….
• We’re going to map the world
• We’re going to take photos of as many places
as we can and link them together
• We’ll let people annotate and curate the map
• Then let’s make it available free on the web
• We’ll make it available for decision making
• Put it on Mobile Devices, give it away…
7. Where is chemistry online?
• Encyclopedic articles (Wikipedia)
• Chemical vendor databases
• Metabolic pathway databases
• Property databases
• Patents with chemical structures
• Drug Discovery data
• Scientific publications
• Compound aggregators
• Blogs/Wikis and Open Notebook Science
8. Chemistry on the Internet…
• Most searching for chemistry on the internet…
• Name searching Google/Bing/Yahoo
• Name searching Wikipedia
• Name searching Wolfram Alpha
• Name, name, name, name…searching
• Structure searching DOZENS of websites, each
with different information or…
9. Chemistry on the Internet…
• Most searching for chemistry on the internet…
• Name searching Google/Bing/Yahoo
• Name searching Wikipedia
• Name searching Wolfram Alpha
• Name, name, name, name…searching
• Structure searching DOZENS of websites, each
with different information or…
• Search ONE website integrating the others!
10.
• ~30 million chemicals and growing
• Data sourced from >500 different sources
• Crowd sourced curation and annotation
• Ongoing deposition of data from our
journals and our collaborators
• Structure centric hub for web-searching
• …and a really big dictionary!!!
• Note…NOT all websites connected
26. Molfiles
• Molfiles are the primary exchange format
between structure drawing packages
• Can be different between different drawing
packages
• Most commonly carry X,Y coordinates for layout
• Can support polymers, organometallics, etc.
• Can carry 3D coordinates
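To make the coordinate point concrete, here is a minimal sketch of reading the atom and bond blocks of a V2000 molfile by fixed-width columns. The sample record (heavy atoms of ethanol with an invented 2D layout) and the function names are assumptions for illustration; real files carry many more fields and variants than this handles.

```python
# Illustrative V2000 molfile reader: a sketch, not a complete parser.
# The sample record and its coordinates are hypothetical.
SAMPLE_MOLFILE = "\n".join([
    "ethanol (heavy atoms only)",  # header line 1: title
    "  hypothetical-sketch",       # header line 2: program/timestamp
    "",                            # header line 3: comment
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.2990    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.5981    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
])


def parse_molfile(text):
    """Return (atoms, bonds) read from the fixed-width V2000 columns."""
    lines = text.split("\n")
    counts = lines[3]                      # counts line follows 3 header lines
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x = float(line[0:10])
        y = float(line[10:20])
        z = float(line[20:30])
        symbol = line[31:34].strip()
        atoms.append((symbol, x, y, z))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # (first atom index, second atom index, bond order), 1-based indices
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds


def has_3d_coordinates(atoms):
    """Treat the record as 2D layout only when every Z coordinate is zero."""
    return any(z != 0.0 for _, _, _, z in atoms)


atoms, bonds = parse_molfile(SAMPLE_MOLFILE)
print(atoms)
print(bonds)
print(has_3d_coordinates(atoms))
```

Because the columns are positional, two drawing packages can write the same structure with different coordinates, atom order or header lines, which is one reason molfiles differ between packages.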
27. SMILES
• SMILES is a common format
• Can support polymers,
organometallics, etc.
• Does NOT carry X,Y or Z
coordinates for layout so
requires layout algorithms –
can be problematic!
• Generally different between
drawing packages
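As a small illustration of why SMILES strings vary, the sketch below shows three different but valid strings for ethanol that all contain the same explicitly written atoms. The helper name is hypothetical and the tokenizer covers only a simple subset of SMILES; matching atom counts is a necessary condition for two strings to be the same compound, not a sufficient one, and a real comparison needs a canonicalization step from a cheminformatics toolkit.

```python
import re
from collections import Counter

# Tokens: bracket atoms, two-letter organic-subset atoms, then one-letter
# aliphatic and aromatic atoms. Bonds, branches and ring digits are skipped.
_ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")
_BRACKET_SYMBOL = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def explicit_atom_counts(smiles):
    """Count atoms written explicitly in a SMILES string (implicit H ignored)."""
    counts = Counter()
    for token in _ATOM.findall(smiles):
        if token.startswith("["):
            match = _BRACKET_SYMBOL.match(token)
            if match is None:        # e.g. a wildcard atom; skipped in this sketch
                continue
            symbol = match.group(1)
        else:
            symbol = token
        counts[symbol.capitalize()] += 1   # aromatic "c" is counted as "C"
    return counts


# Three different strings that a person or a package might write for ethanol.
variants = ["CCO", "OCC", "C(O)C"]
for smiles in variants:
    print(smiles, dict(explicit_atom_counts(smiles)))
```

Note that none of these strings carries any X, Y or Z information, so any depiction has to be generated by a layout algorithm.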
33. InChI
• SINGLE code base managed by IUPAC –
integrated into drawing packages. No
variability, unlike SMILES
• InChI Strings can be reversed to structures –
same problem as with SMILES – no layout
• Adopted by the community (databases, blogs,
Wikipedia) – good for searching the internet
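For readers who have not looked inside an InChI, the sketch below splits one into its slash-separated layers, using the standard InChI for ethanol. The function name is an assumption, and the code only splits an existing string: generating an InChI from a structure requires the IUPAC software itself. Layers that can repeat in more complex identifiers would overwrite each other in this simple dictionary.

```python
def split_inchi(inchi):
    """Split an InChI string into version, formula and single-letter layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    layers = {
        "version": parts[0],
        # Assumes a formula layer is present, which holds for typical organics.
        "formula": parts[1] if len(parts) > 1 else "",
    }
    for part in parts[2:]:
        if part:
            layers[part[0]] = part[1:]   # e.g. "c" connectivity, "h" hydrogens
    return layers


ethanol = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol)
```

The connectivity layer numbers atoms canonically, which is why the same compound gives the same string regardless of how it was drawn, and also why no layout survives the round trip.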
41. Data Quality/Standardization
• MANY structures meant to be something
online are MISREPRESENTED.
• Commonly you will have better success finding
information by name searches than structure –
with many caveats of course…
• Validating chemical structure representations
is laborious work – and it’s shocking to review
data…
69. Text Mining on IUPAC Names
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-
thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride
( 5 ml ) and benzene ( 50 ml ) were charged into a glass
reaction vessel equipped with a mechanical stirrer ,
thermometer and reflux condenser .
The reaction mixture was heated at reflux with stirring , for a
period of about one-half hour .
After this time the benzene and unreacted thionyl chloride
were stripped from the reaction mixture under reduced
pressure to yield the desired product N-(β-chloroethyl)-N-
methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a
solid residue
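A heavily simplified sketch of one text-mining step on the tokenized passage above: pulling out reagent names that are followed by a bracketed amount. The regular expression and function name are assumptions for illustration; recognising full systematic names such as the urea derivatives in this passage, and converting them to structures, needs dedicated name-to-structure software rather than a pattern like this.

```python
import re

# One sentence from the passage, in its tokenized form (spaces around brackets).
SENTENCE = ("thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged "
            "into a glass reaction vessel")

# One or two lowercase words (not starting with "and"), then "( <number> <unit> )".
AMOUNT = re.compile(
    r"\b((?!and\b)[a-z]+(?: [a-z]+)?) \( (\d+(?:\.\d+)?) (ml|g|mg) \)"
)


def extract_amounts(text):
    """Return (name, value, unit) tuples for each reagent with a stated amount."""
    return [(name, float(value), unit)
            for name, value, unit in AMOUNT.findall(text)]


amounts = extract_amounts(SENTENCE)
print(amounts)
```

Even this toy pattern shows the fragility of mining published text: a misspelled name or an unusual unit silently drops a reagent from the result.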
73. ChemSpider “Annotations”
• Users can add
• Descriptions, Syntheses and Commentaries
• Links to PubMed articles
• Links to articles via DOIs
• Add spectral data
• Add Crystallographic Information Files
• Add photos
• Add MP3 files
• Add Videos
74. Spectral Data
• Spectral data to be deposited in standard
formats – JCAMP or images
• All spectra available at:
http://www.chemspider.com/spectra.aspx
• Data are deposited on a regular basis
• Students
• Chemical vendors
• Growing collection now
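To show what a deposited JCAMP file offers that an image does not, here is a minimal sketch that reads the labeled data records ("##LABEL= value" lines) of a JCAMP-DX file into a dictionary. The sample spectrum is invented and the function name is an assumption; compressed data encodings and multi-block files are not handled.

```python
# Hypothetical three-point spectrum, for illustration only.
SAMPLE_JCAMP = """##TITLE= hypothetical IR spectrum
##JCAMP-DX= 4.24
##DATA TYPE= INFRARED SPECTRUM
##XUNITS= 1/CM
##YUNITS= TRANSMITTANCE
##NPOINTS= 3
##XYPOINTS= (XY..XY)
4000.0, 0.95
2000.0, 0.60
1000.0, 0.80
##END=
"""


def parse_jcamp_records(text):
    """Map normalized labels to values; continuation lines join the last label."""
    records = {}
    label = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("##"):
            name, _, value = line[2:].partition("=")
            # Labels are compared in upper case, ignoring spaces, dashes,
            # slashes and underscores.
            label = "".join(ch for ch in name.upper() if ch not in " -/_")
            records[label] = value.strip()
        elif label is not None:
            records[label] = (records[label] + "\n" + line).strip()
    return records


records = parse_jcamp_records(SAMPLE_JCAMP)
print(records["DATATYPE"], records["XUNITS"], records["NPOINTS"])
```

With the units, data type and point values machine-readable, a deposited spectrum can be searched, overlaid and validated, none of which is possible with a picture of the same spectrum.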
79. It’s exactly the WRONG WAY!
• We should NOT be mining data out of future
publications
• Structures should be submitted “correctly”
• Spectra should be in digital spectral formats,
not images
• ESI should be RICH and interactive,
preferably with OPEN DATA
85. Chemistry data is of value?
• Reference databases generate hundreds of
millions of dollars/euros per year
• So much data generated that could go public
• Maybe 5% of all data generated is published
• There is no “Journal of Failed Experiments”
• Funding agencies are starting to demand Open Data
• Scientists want funding but also recognition
87. How will I get recognized?
• Who in the room has an ORCID?
88. Deposition of Research Data
• If we manage compounds, syntheses and
analytical data…
• If we have security and provenance of data…
• If we deliver user interfaces to satisfy the
various use cases…
• Then we have delivered electronic lab
notebooks for chemistry laboratories. ELNs
are research data repositories
92. What are we building?
• We are building the “RSC Data Repository”
• Containers for compounds, reactions, analytical
data, tabular data
• Algorithms for data validation and standardization
• Flexible indexing and search technologies
• A platform for modeling data and hosting existing
models and predictive algorithms
97. Deposition of Data
• Developing systems that provide
feedback to users regarding data quality
• Validate/standardize chemical compounds
• Check for balanced reactions
• Check spectral data
• EXAMPLE Future work
• Properties – compare experimental to predicted
• Automated structure verification - NMR
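One of the checks above, testing whether a reaction is balanced, can be sketched by counting atoms of each element on both sides. This is a minimal illustration with hypothetical function names, assuming reactions are supplied as plain molecular formulas separated by "+" and "->"; it does not handle charges, parentheses or hydrates, and it is not a description of the RSC implementation.

```python
import re
from collections import Counter

_ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")
_SPECIES = re.compile(r"^(\d*)\s*([A-Z][A-Za-z0-9]*)$")


def formula_counts(formula):
    """Count atoms per element in a simple formula such as 'C2H5Cl'."""
    counts = Counter()
    for symbol, number in _ELEMENT.findall(formula):
        counts[symbol] += int(number) if number else 1
    return counts


def side_counts(side):
    """Sum atom counts over one side of a reaction, honouring coefficients."""
    total = Counter()
    for species in side.split("+"):
        match = _SPECIES.match(species.strip())
        if match is None:
            raise ValueError(f"cannot parse species: {species!r}")
        coefficient = int(match.group(1)) if match.group(1) else 1
        for element, count in formula_counts(match.group(2)).items():
            total[element] += coefficient * count
    return total


def is_balanced(reaction):
    """True when both sides of 'A + B -> C + D' contain the same atoms."""
    left, right = reaction.split("->")
    return side_counts(left) == side_counts(right)


# An alcohol-to-chloride conversion with thionyl chloride, written with formulas.
print(is_balanced("C2H6O + SOCl2 -> C2H5Cl + SO2 + HCl"))
print(is_balanced("C2H6O + SOCl2 -> C2H5Cl + SO2"))
```

A failed check is useful feedback at deposition time: it usually means a by-product was omitted or a structure was drawn incorrectly.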
98. So we know about ORGANICS
• Comment – you don’t know all of the
challenges until you start to work in the area!
• We, and cheminformatics companies, have
solved MANY, but not all of the issues
regarding organic chemistry management
• The majority of our approaches do not map to
materials
• No standard ways to represent compounds
• No InChI for materials
99. Questions to consider…
• Organics are hard enough!
• What are your best dictionaries of materials?
• We have chemical ontologies. Status for materials?
• Is open annotation of your databases possible?
• What standards do you have for materials data
exchange?
101. Known Challenges
• Many materials are non-stoichiometric
• How to represent composite materials (e.g.
supported catalysts)?
• Methods to distinguish novelty in materials
(equivalent to diversity in organic structures)?
• Many more I will learn at this workshop…?
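The non-stoichiometry problem can be illustrated with a small sketch: a formula parser that accepts fractional subscripts, applied to an iron-deficient iron oxide written as Fe0.95O. The function names are assumptions, and this addresses only composition; it says nothing about structure, phases or supports, which is exactly where the integer-based, connection-table representations used for organics stop working.

```python
import re

# Element symbol followed by an optional integer or decimal amount.
_TERM = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_composition(formula):
    """Return element amounts as floats, allowing fractional subscripts."""
    composition = {}
    for symbol, amount in _TERM.findall(formula):
        value = float(amount) if amount else 1.0
        composition[symbol] = composition.get(symbol, 0.0) + value
    return composition


def atomic_fractions(composition):
    """Normalize amounts so that they sum to one."""
    total = sum(composition.values())
    return {element: amount / total for element, amount in composition.items()}


wustite = parse_composition("Fe0.95O")
print(wustite)
print(atomic_fractions(wustite))
```

Two samples of the "same" material can then differ continuously in composition, so deciding when two records describe one material becomes a tolerance question rather than an exact match.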
103. Internet Data
The Future
[Diagram labels] Commercial Software; Pre-competitive Data; Open Science; Open Data; Publishers; Educators; Open Databases; Chemical Vendors
[Diagram labels] Small organic molecules; Undefined materials; Organometallics; Nanomaterials; Polymers; Minerals; Particle bound; Links to Biologicals
104. Thank you
Email: williamsa@rsc.org
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams