Activities at the Royal Society of
Chemistry to Gather, Extract and
Analyze Big Datasets in Chemistry
RSC-CICAG Meeting
April 22nd, 2015
What of the World of Chemistry?
Prophetic Enumeration
What of the World of Chemistry?
“The InChIKey indexing has therefore turned
Google into a de-facto open global chemical
information hub by merging links to most
significant sources, including over 50 million
PubChem and ChemSpider records.”
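A minimal sketch of how an InChIKey can act as a web search term, as the quote describes. It only checks the layout of a standard InChIKey (14 letters, 10 letters, 1 letter) and does not validate the key against a structure. The function names and the choice of search URL are assumptions made for illustration.

```python
import re
from urllib.parse import quote_plus

# Layout of a standard InChIKey: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def is_inchikey(text: str) -> bool:
    """Return True if the text has the layout of a standard InChIKey."""
    return bool(INCHIKEY_RE.match(text.strip()))


def web_search_url(inchikey: str) -> str:
    """Build an exact-phrase web search URL for an InChIKey (hypothetical helper)."""
    if not is_inchikey(inchikey):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return "https://www.google.com/search?q=" + quote_plus(f'"{inchikey}"')


ASPIRIN_KEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"  # InChIKey of aspirin
print(web_search_url(ASPIRIN_KEY))
```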
RSC’s ChemSpider
>34 million chemicals from >500 sources and
>40,000 users per day
Not Dealing With Big Data…
Is Openness Changing Things?
Open Access/Data Mandates
Open Access funder mandates…
We hear about the Open Data…
Chemistry Open Data???
• Where are all of the Open Chemistry Data?
• Not that much showing up yet from scientists
• Is there a willingness to contribute more?
• Many concerns about IP and much lip service
• Can we harvest more?
• Yes
There are Efforts…
RSC >36,000 Articles in 2015
• Consider articles published by RSC in 2015
• How many compounds?
• How many reactions?
• How many figures?
• How many properties?
• How many spectra?
• How many, how many, how many?
The Graph of Relationships is Lost
The flexibility of querying…
• What’s the structure?
• Are they in our file?
• What’s similar?
• What’s the target?
• Pharmacology data?
• Known pathways?
• Working on now?
• Connections to disease?
• Expressed in right cell type?
• Competitors?
• IP?
Publications: a summary of work
• Scientific publications are a summary of work
• Is all work reported?
• How much science is lost to pruning?
• What of value sits in notebooks and is lost?
• Publications offering access to “real data”?
• How much data is lost?
• How many compounds never reported?
• How many syntheses fail or succeed?
• How many characterization measurements?
If I wanted to share data…
• I’ve performed a few dozen chemical syntheses
• I’ve run thousands of analytical spectra
• I’ve generated thousands of NMR assignments
• I’ve probably published <5% of all my work; most is lost
• Things can be different today in terms of sharing
• I would like to share more data, would like at
least provenance traced to me and somehow
to be acknowledged for the contribution
How Many Structures Can You
Generate From a Formula?
My research…in this CASE
Some NMR…
In researcher mode…
• I want to access and use data
• I want to:
• Download molecules
• Download tables
• Download spectra
• Download figures
• Then reprocess, replot, repurpose
The Challenge of Data Analysis
• NO access to raw data files – in binary or even
standard file formats for processing
• Figures are close to USELESS for 2D NMR –
representative not accurate shifts
• Tabulated shifts are in PDF files and needed
transcribing – where are CSV files???
• TORTUROUS WORK!!!!
• What if we wanted to do this for all manuscripts
submitted to RSC? Of course it is Feasible…
Community Norms
• Some wonderful community norms & mandates!
• Deposit crystal structures in CSD
• Deposit Proteins in PDB
• Deposit gene sequences in GenBank
• Increasingly deposit bioassay data in PubChem
But what of general chemistry?
• We publish into document formats
• Could publishers help drive a community
norm for:
• Chemical compound registration
• Spectral data
• Property data
• What else?
• Who would host it? How would it be funded?
Not even a References Standard
We can solve for Authors…
Will it be used though??? YES!
Moves in Supplementary Info
The challenges of analytical data
• Vendors produce complex proprietary data
formats and standard formats are required
(JCAMP, NetCDF, AnIML)
• ChemSpider already hosts thousands of JCAMP spectra
• Data validation approaches understood
• There are a myriad of analytical data types…
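A rough sketch of the kind of format check a repository could run on a deposited JCAMP-DX file: read the `##LABEL=value` header records and report any expected field that is missing. The sample record, the list of required labels and the helper names are all hypothetical, and the label normalization reflects my understanding of the JCAMP-DX convention rather than a full implementation of the standard.

```python
# Labels that start the data table; header parsing stops there (illustrative subset).
DATA_TABLE_LABELS = {"XYDATA", "XYPOINTS", "PEAKTABLE"}
# Fields this sketch expects in every record (an assumption, not the full standard).
REQUIRED = {"TITLE", "JCAMPDX", "DATATYPE", "XUNITS", "YUNITS", "NPOINTS"}


def normalize_label(label: str) -> str:
    """Upper-case a label and drop spaces, hyphens, slashes and underscores."""
    return "".join(ch for ch in label.upper() if ch not in " -/_")


def parse_jcamp_header(text: str) -> dict[str, str]:
    """Collect the ##LABEL=value records that precede the data table."""
    header: dict[str, str] = {}
    for line in text.splitlines():
        if not line.startswith("##"):
            continue
        label, sep, value = line[2:].partition("=")
        if not sep:
            continue
        key = normalize_label(label)
        if key in DATA_TABLE_LABELS:
            break
        header[key] = value.strip()
    return header


def missing_fields(header: dict[str, str]) -> set[str]:
    """Return the expected header fields that the record does not provide."""
    return REQUIRED - set(header)


# Hypothetical record, written only to exercise the parser.
SAMPLE = """\
##TITLE=hypothetical example record
##JCAMP-DX=5.01
##DATA TYPE=NMR SPECTRUM
##XUNITS=HZ
##YUNITS=ARBITRARY UNITS
##NPOINTS=4
##XYDATA=(X++(Y..Y))
0 10 20 30 40
##END=
"""

header = parse_jcamp_header(SAMPLE)
print(header["DATATYPE"], missing_fields(header))
```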
Analytical data
Encouraging data deposition
• Open Data mandates don’t offer solutions
• We would like to host:
• Compounds, Reactions, Spectra, Images,
Figures, Graphs etc.
• We will offer embargoing, collaborative sharing
and public release of data
• Integration to Electronic Lab Notebooks and
Institutional Repositories for deposition
RSC Repository Architecture
doi: 10.1007/s10822-014-9784-5
Registering of Data
• We hear…“We need standards”
There are Standards!
There are standards
• JCAMP, NetCDF, SPC, AnIML for analytical
data
• Plus newer efforts in development – Allotrope
Foundation efforts
There are Ontologies in Use
Registering of Data
• We hear…“We need standards”
• Many standards exist already!
• GREAT progress can be made with
• Data checking and “warnings”
• Normalization and standardization
• SIMPLE checks would help databases
• “High-quality databases” have rigorous checks
in place
Data Quality Issues
Williams and Ekins, DDT, 16: 747-750 (2011)
Science Translational Medicine 2011
Data quality is a known issue
Substructure     # of Hits   # of Correct Hits   No stereochemistry   Incomplete stereochemistry   Complete but incorrect stereochemistry
Gonane           34          5                   8                    21                           0
Gon-4-ene        55          12                  3                    33                           7
Gon-1,4-diene    60          17                  10                   23                           10
Only 34 out of 149 structures were correct!
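The headline figure can be rechecked from the table itself. This small sketch uses only the numbers on the slide, confirms that each row's categories add up to its hit count, and computes the overall fraction of correct structures.

```python
# (substructure, hits, correct, no stereo, incomplete stereo, complete but incorrect stereo)
ROWS = [
    ("Gonane", 34, 5, 8, 21, 0),
    ("Gon-4-ene", 55, 12, 3, 33, 7),
    ("Gon-1,4-diene", 60, 17, 10, 23, 10),
]

# Every hit should fall into exactly one of the four categories.
for name, hits, correct, *errors in ROWS:
    assert correct + sum(errors) == hits, name

total_hits = sum(row[1] for row in ROWS)
total_correct = sum(row[2] for row in ROWS)
print(f"{total_correct}/{total_hits} correct ({total_correct / total_hits:.1%})")
# prints: 34/149 correct (22.8%)
```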
Patent data in public databases
EXPERTS must get it right?!
The value of a validated dictionary
Compounds are challenging…
The Open PHACTS community ecosystem
Open PHACTS
• Innovative Medicines Initiative EU project
• 16 Million Euros, 3 years – meshing chemistry
and biology Open Data primarily
• Semantic web project and driven by ODOSOS
– Open Data, Open Source, Open Standards
• RSC developed the chemistry registration
system and “CVSP”
CVSP: Validate and Standardize
CVSP Rules Sets
CVSP Filtering of DrugBank
CVSP is Open to Anyone!
What if…
• CVSP was used to check molecular files
before submitting to publishers or databases?
• Publishers used CVSP to check their data?
• All rules were openly available for adoption?
• Standards, a community norm, access to data
What if we could do the same…
• Check/validate procedures:
• File format checking (think CIF checker)
• Nomenclature checking
• Compare experimental vs. predicted data
and flag suspicious data for inspection
• Physchem parameter comparisons
• NMR shift prediction (and assignment)
Building a BIG Data Repository
• We have validation procedures in place:
• Compound validation
• Reaction checking
• Analytical data formats (in development)
• But how long to get to a Big Data Repository?
• Users want to get data more than contribute!
• Where can we find data???
The RSC Archive
• Over 300,000 articles containing chemistry
• Compounds, reactions, property data,
spectral data, the usual….
• Document formats to analyze and extract
• Previous experience with “Prospecting”
compounds
Electronic Supplementary Info
What was our NextMove?
• Daniel Lowe worked on text-mining and
named-entity recognition at University of
Cambridge
• Extracted millions of chemical reactions from
US Patents
• Working with NextMove products (LeadMine
and CaffeineFix) and optimization by Daniel
What could we get?
PhysChem first: Melting Points
• Melting/sublimation/decomposition points
extracted for 287,635 distinct compounds from
1976-2014 USPTO patent applications/grants
• Sanity checks used to flag dubious values –
probably 130-4°C
• Non-melting outcomes recorded, e.g. mp 147-150°C. (subl.)
• What models could be built?
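A hedged sketch of what such extraction and sanity checking might look like. It is not the NextMove pipeline: the regular expression, the expansion of shorthand ranges such as 130-4°C to 130-134°C, the outcome keywords and the 50 °C width threshold are all assumptions chosen for illustration.

```python
import re
from typing import NamedTuple, Optional

MAX_RANGE_WIDTH = 50.0  # hypothetical threshold for flagging an implausibly wide range

MP_RE = re.compile(r"(-?\d+(?:\.\d+)?)(?:\s*[-–]\s*(\d+(?:\.\d+)?))?\s*°\s*C")


class MeltingPoint(NamedTuple):
    low: float
    high: float
    outcome: str   # "melting", "sublimation" or "decomposition"
    dubious: bool  # True when the value fails a sanity check


def _expand_upper(low: str, high: str) -> str:
    """Expand a shorthand upper bound, e.g. low='130', high='4' gives '134'."""
    shorthand = (
        low.isdigit() and high.isdigit()
        and len(high) < len(low) and int(high) < int(low)
    )
    return low[: len(low) - len(high)] + high if shorthand else high


def parse_melting_point(text: str) -> Optional[MeltingPoint]:
    """Extract a melting range in °C and classify the recorded outcome."""
    match = MP_RE.search(text)
    if match is None:
        return None
    low_s = match.group(1)
    high_s = _expand_upper(low_s, match.group(2) or low_s)
    low, high = float(low_s), float(high_s)
    lowered = text.lower()
    if "subl" in lowered:
        outcome = "sublimation"
    elif "dec" in lowered:
        outcome = "decomposition"
    else:
        outcome = "melting"
    dubious = high < low or (high - low) > MAX_RANGE_WIDTH
    return MeltingPoint(low, high, outcome, dubious)


print(parse_melting_point("mp 147-150°C. (subl.)"))
print(parse_melting_point("130-4°C"))
```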
QSPR/QSAR modelling in
OCHEM http://ochem.eu
Modeling “BIG data”
• Melting point models developed with ca. 300k compounds
• Required 34 GB memory and about 400 MB disk space (zipped)
• Matrix with 2×10^11 entries (300k molecules × 700k descriptors)
• >12k core-hours (>600 CPU-days) for parameter optimization
• Parallelized on > 600 cores with up to 24 cores per one task
• Consensus model as average of individual models
• Accuracy of consensus model is ~33.6 °C for drug-like region
compounds
• Models publicly available at http://ochem.eu
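The scale quoted above can be reproduced with simple arithmetic. The sketch below computes the size of the descriptor matrix and, as a comparison that is my own inference and not a statement from the slides, how much memory a dense float32 layout would need. It also shows a consensus prediction as a plain average of individual model outputs; the function name is hypothetical.

```python
from statistics import mean

N_MOLECULES = 300_000
N_DESCRIPTORS = 700_000

entries = N_MOLECULES * N_DESCRIPTORS  # 2.1e11, i.e. about 2×10^11
dense_float32_gb = entries * 4 / 1e9   # 4 bytes per value in a dense layout

print(f"{entries:.1e} entries, {dense_float32_gb:.0f} GB if stored densely")


def consensus_prediction(predictions: list[float]) -> float:
    """Average the melting points (°C) predicted by the individual models."""
    return mean(predictions)
```

A dense layout would need far more than the 34 GB reported, which suggests (but does not prove) that most descriptor values are zero and were stored sparsely.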
Distribution of MPs in the analyzed sets
[Figure: data density against melting point, x-axis from –200 to 500, for the OCHEM, Enamine, Bradley, Bergström and PATENTS sets]
PhysChem parameters
• Melting point model and data – good data
extracted and filtered “automagically”
• Boiling point data next – pressure dependence
• What next – logP, pKa, aq/non-aq. Solubility
• Prove the algorithms on US Patent Collection
then apply to RSC archive
• Ideally plumb the algorithms for all new papers
• More ideal – authors submit DATA!
A Recent Talk at ACS/Denver
http://www.slideshare.net/AntonyWilliams/
Spectral Data
ChemSpider ID 24528095 1H NMR
ChemSpider ID 24528095 13C NMR
ChemSpider ID 24528095 H-H COSY
ESI – Text Spectra
We want to find text spectra?
• We can find and index text spectra: 13C NMR
(CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH,
benzylic methane), 30.77 (CH, benzylic
methane), 66.12 (CH2), 68.49 (CH2), 117.72,
118.19, 120.29, 122.67, 123.37, 125.69, 125.84,
129.03, 130.00, 130.53 (ArCH), 99.42, 123.60,
134.69, 139.23, 147.21, 147.61, 149.41,
152.62, 154.88 (ArC)
• What would be better are spectral figures – and
include assignments where possible!
1H NMR (CDCl3, 400 MHz):
δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t,
1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz,
C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
Mestrelab Mnova NMR
NMR Spectra
• 2,316,005 distinct spectra in 2001-2015 USPTO
Nucleus Count
H 1993384
C 173970
Unknown 107439
F 22158
P 16333
B 980
Si 715
Pt 275
N 170
V 101
<parse>
<nmrElement isotope="1" element="H">1H</nmrElement>
<nmrMethodAndSolvent>DMSO-d6, 400 MHz</nmrMethodAndSolvent>
<peak>
<peakValue>1.04</peakValue>
<peakAnnotation>t, 6H; J=7.9 Hz, -CH3</peakAnnotation>
</peak>
<peak>
<peakValue>1.38</peakValue>
<peakAnnotation>q, 4H; J=7.9 Hz, Ge-CH2-</peakAnnotation>
</peak>
<peak>
<peakValue>6.88</peakValue>
<peakAnnotation>d, 4H; J=8.5 Hz, Ar-H3,5</peakAnnotation>
</peak>
<peak>
<peakValue>7.58</peakValue>
<peakAnnotation>d, 4H; J=8.5 Hz, Ar-H2,6</peakAnnotation>
</peak>
<peak>
<peakValue>10.53</peakValue>
<peakAnnotation>s, 2H, OH</peakAnnotation>
</peak>
</parse>
1H-NMR (DMSO-d6, 400 MHz): δ=1.04 (t, 6H; J=7.9 Hz, -CH3), 1.38
(q, 4H; J=7.9 Hz, Ge-CH2-), 6.88 (d, 4H; J=8.5 Hz, Ar-H3,5), 7.58 (d,
4H; J=8.5 Hz, Ar-H2,6), 10.53 (s, 2H, OH)
1H-NMR (DMSO-d6, 400 MHz): 1.04 (t, 6H; J=7.9 Hz, -CH3), 1.38 (q,
4H; J=7.9 Hz, Ge-CH2-), 6.88 (d, 4H; J=8.5 Hz, Ar-H3,5), 7.58 (d, 4H;
J=8.5 Hz, Ar-H2,6), 10.53 (s, 2H, OH)
[Figure labels: original spectra, parse tree, normalized spectra]
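The step from a text spectrum to a parse tree and back to a normalized string can be sketched with two regular expressions. This is not the parser used for the patent work: the patterns, the dictionary layout and the function names are assumptions, and the peak pattern does not handle annotations that contain nested brackets.

```python
import re

# Header such as "1H-NMR (DMSO-d6, 400 MHz):" giving isotope, element and conditions.
HEADER_RE = re.compile(r"^\s*(\d+)\s*([A-Z][a-z]?)[-\s]*NMR\s*\(([^)]*)\)\s*:\s*")
# A shift followed by a bracketed annotation, e.g. "1.04 (t, 6H; J=7.9 Hz, -CH3)".
PEAK_RE = re.compile(r"(\d+(?:\.\d+)?)\s*\(([^()]*)\)")


def parse_text_spectrum(text: str) -> dict:
    """Split a text spectrum into nucleus, conditions and a list of peaks."""
    header = HEADER_RE.match(text)
    if header is None:
        raise ValueError("no recognizable NMR header")
    isotope, element, conditions = header.groups()
    # Drop a leading "δ" or "δ=" so that only the peak list remains.
    body = re.sub(r"^δ\s*=?\s*", "", text[header.end():])
    peaks = [
        {"shift": shift, "annotation": annotation.strip()}
        for shift, annotation in PEAK_RE.findall(body)
    ]
    return {
        "isotope": isotope,
        "element": element,
        "conditions": conditions.strip(),
        "peaks": peaks,
    }


def normalize(parsed: dict) -> str:
    """Write the parsed spectrum back out in one consistent layout."""
    peaks = ", ".join(f"{p['shift']} ({p['annotation']})" for p in parsed["peaks"])
    return f"{parsed['isotope']}{parsed['element']}-NMR ({parsed['conditions']}): {peaks}"


ORIGINAL = (
    "1H-NMR (DMSO-d6, 400 MHz): δ=1.04 (t, 6H; J=7.9 Hz, -CH3), 1.38 "
    "(q, 4H; J=7.9 Hz, Ge-CH2-), 6.88 (d, 4H; J=8.5 Hz, Ar-H3,5), 7.58 (d, "
    "4H; J=8.5 Hz, Ar-H2,6), 10.53 (s, 2H, OH)"
)

parsed = parse_text_spectrum(ORIGINAL)
print(normalize(parsed))
```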
NMR extracted as f(year)
[Figure: cumulative distinct NMR spectra extracted (0 to 2,500,000) against year of publication (1976 to 2014), for USPTO grants and USPTO applications]
NMR solvents
CDCl3        48.5%
DMSO-d6      38.3%
CD3OD        8.7%
D2O          1.1%
Acetone-d6   1.0%
MeOD         1.0%
Others       1.4%
Others: CD2Cl2, CD3CN-d3, C6D6, Pyridine-d5, THF-d8, CD3Cl,
dimethylformamide-d7, d1-trifluoroacetic acid, methanol-d3, acetic
acid-d4, toluene-d8, sulfuric acid-d2, 1,1,2,2-tetrachloroethane-d2,
CD3OCD3, dioxane-d8, 1,2-dichloroethane-d4
1H-NMR frequency over time
[Figure: 1H-NMR frequency (0 to 450 MHz) against year of patent filing (1976 to 2014)]
Sounds easy right?
• Potential for errors with names
• No name extracted for structure
• Incomplete names extracted
• Misassociation of names with structures
• Incorrect conversion of names to structures
BIGGEST problem - BRACKETS
• Brackets in names are a big problem: either
an additional bracket or a missing bracket
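A simple stack-based check can report extra or missing brackets in an extracted name before it is passed to a name-to-structure converter. This is a generic sketch with hypothetical function names, not part of any of the tools named in the talk.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def bracket_problems(name: str) -> list[str]:
    """List unexpected closing brackets and unclosed opening brackets in a name."""
    problems = []
    stack = []  # (bracket, position) of every opener that is still unmatched
    for position, ch in enumerate(name):
        if ch in OPENERS:
            stack.append((ch, position))
        elif ch in PAIRS:
            if stack and stack[-1][0] == PAIRS[ch]:
                stack.pop()
            else:
                problems.append(f"unexpected '{ch}' at position {position}")
    problems.extend(f"unclosed '{ch}' at position {position}" for ch, position in stack)
    return problems


GOOD = "di-tert-butyl (4S)-N-(tert-butoxycarbonyl)-4-{4-[3-(tosyloxy)propyl]benzyl}-L-glutamate"
BROKEN = GOOD.replace("]", "", 1)  # simulate a bracket lost during extraction

print(bracket_problems(GOOD))
print(bracket_problems(BROKEN))
```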
Cannot be converted
• https://www.google.co.uk/patents/US20050187390A1
• 2-[2-(4′-carbamoyl-4-methoxy-biphen-2-yl)-
quinolin-6-yl]-1-cyclohexyl-1H-
benzoimidazole-5-carboxylic Acid
• OPSIN expects biphenyl-2-yl
OCR error Correction
• https://www.google.co.uk/patents/WO2012150220A1
• di-terf-butyl (4S)-/V-(fert-butoxycarbonyl)-4-{4-[3-
(tosyloxy)propyl]benzyl}-L-glutamate
CaffeineFix corrected to:
• di-tert-butyl (4S)-N-(tert-butoxycarbonyl)-4-{4-[3-
(tosyloxy)propyl]benzyl}-L-glutamate
Corrections made: f → t, /V → N, f → t
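The general idea of dictionary-guided correction can be illustrated with Python's standard difflib module. This does not describe how CaffeineFix works internally: the fuzzy-matching approach, the tiny lexicon of name fragments, the glyph confusion table and the cutoff are all assumptions chosen so that the example above is reproduced.

```python
import re
from difflib import get_close_matches

# Known glyph confusion taken from the example above.
CONFUSIONS = {"/V": "N"}
# Hypothetical lexicon of valid name fragments; a real one would be far larger.
LEXICON = sorted({
    "di", "tert", "butyl", "butoxycarbonyl",
    "tosyloxy", "propyl", "benzyl", "glutamate",
})


def correct_name(name: str, lexicon: list[str] = LEXICON, cutoff: float = 0.7) -> str:
    """Replace unknown alphabetic fragments with their closest lexicon entry."""
    for wrong, right in CONFUSIONS.items():
        name = name.replace(wrong, right)
    # Split into alphabetic fragments and the separators between them.
    parts = re.split(r"([^A-Za-z]+)", name)
    fixed = []
    for part in parts:
        if len(part) >= 3 and part.isalpha() and part.lower() not in lexicon:
            match = get_close_matches(part.lower(), lexicon, n=1, cutoff=cutoff)
            if match:
                part = match[0]
        fixed.append(part)
    return "".join(fixed)


OCR_NAME = "di-terf-butyl (4S)-/V-(fert-butoxycarbonyl)-4-{4-[3-(tosyloxy)propyl]benzyl}-L-glutamate"
print(correct_name(OCR_NAME))
```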
Sounds easy right?
• Textual Spectrum descriptions have issues
• Transcription errors (rare)
• Subjective interpretation (very common)
• Incomplete listing of shifts
• No/incomplete couplings/multiplicities listed
• Overlap of multiplets (very common)
• Labile protons – included/excluded/partial
• No peak width indications – especially labiles
• No peak shape indications – dynamic exchange
• Presence of rotamers
• Impurities included or misidentified
• Solvent peak belonging to the compound
• Wrong number of nuclei
Problems Generating Spectra
• Multiplicities but no coupling constants
• δ 1H NMR (300 MHz, CDCl3): 1.48 (t, 3H),
4.15 (q, 2H), 7.03 (td, 1H), 7.16 (td, 1H),
7.49 (m, 1H), 7.70 (dd, 1H), 7.88 (dd, 1H),
8.77 (d, 1H)
Problems Generating Spectra
• PARTIAL couplings only for ca. 90% of spectra!
• δ 1H NMR (300 MHz, CDCl3): 0.48-0.66 (m, 2H)
0.75-0.95 (m, 2H), 1.80 (s, 1H), 3.86 (s, 3H),
5.56 (s, 2H), 6.59 (d, J=8.50 Hz, 1H), 7.03 (dd,
J=8.50, 2.15 Hz, 1H), 7.60 (s, 1H)
Error Detection
1H NMR (400 MHz, CDCl3) d ppm 11.47-12.05
(1H), 7.97-8.24 (1H), 7.61-7.97 (2H), 7.28-7.61
(2H), 7.21 (1H), 5.27 (1H), 3.70-4.74 (8H), 2.80-
3.16 (2H), 2.46-2.80 (2H), 1.87-2.45 (2H), 1.35-
1.77 (11H), 1.24 (18H), 0.87 (3H) associated
with Glyceryl Monolaurate
Error Detection
• 54 hydrogens counted in the reported spectrum.
Glyceryl Monolaurate has only 30 hydrogens.
• Title was: “Polymerization of Monomer 4 with
Glyceryl Monolaurate”
• Text-mining title missed compound: Monomer 4
is the compound below
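The hydrogen-count check can be sketched in a few lines: add up the integrals in the reported spectrum and compare them with the hydrogens in the formula of the associated compound. The formula C15H30O4 for glyceryl monolaurate is supplied here by hand (it agrees with the 30 hydrogens quoted above); in a real pipeline it would come from the extracted structure. The function names are hypothetical.

```python
import re

REPORTED = (
    "11.47-12.05 (1H), 7.97-8.24 (1H), 7.61-7.97 (2H), 7.28-7.61 (2H), "
    "7.21 (1H), 5.27 (1H), 3.70-4.74 (8H), 2.80-3.16 (2H), 2.46-2.80 (2H), "
    "1.87-2.45 (2H), 1.35-1.77 (11H), 1.24 (18H), 0.87 (3H)"
)
GLYCERYL_MONOLAURATE = "C15H30O4"  # assumed molecular formula


def hydrogens_in_spectrum(text: str) -> int:
    """Sum the integrals written as (nH) in a text spectrum."""
    return sum(int(n) for n in re.findall(r"\((\d+)\s*H\)", text))


def hydrogens_in_formula(formula: str) -> int:
    """Count hydrogen atoms in a simple molecular formula such as C15H30O4."""
    total = 0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element == "H":
            total += int(count or 1)
    return total


observed = hydrogens_in_spectrum(REPORTED)
expected = hydrogens_in_formula(GLYCERYL_MONOLAURATE)
if observed != expected:
    print(f"flag for inspection: spectrum has {observed} H, formula has {expected} H")
```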
Text-mined spectra
• In the process of converting spectra into visual
depictions many challenges identified
• Validation approaches include:
• NMR prediction and validation
• Hosting “extracted text spectra” plus depictions
– full provenance to source
• Application to RSC archive will come later
ESI Data also contains figures
“Where is the real data please?”
FIGURE
DATA
Data added to ChemSpider
Manual Curation Layer
• ChemSpider has had a manual curation
layer for >8 years
• Users can annotate data on ChemSpider
• We do receive useful feedback from the
community on the data and are optimistic!
Extraction is the WRONG WAY
• We should NOT have to mine data out: it should arrive in digital form!
• Structures should be submitted “correctly”
• Spectra should be digital spectral formats,
not images
• ESI should be RICH and interactive
• Data should be open, available, with metadata
and provenance
• Can we encourage depositions????
An EPSRC Call
“…the identification of the need for a UK
national service for the provision of a
searchable, electronic chemical database
for the UK academic research community.”
National Chemical Database Service
Community Data Repository
• Automated depositions of data
• Electronic Lab Notebooks as feeds
• National services feeding the repository –
crystallography, mass spectrometry
• Accessing open data from other projects
The PharmaSea Website
What can drive participation?
• What can drive scientists to participate and
contribute?
• Ensuring provenance of their data for reuse
• Mandates from funding agencies
• Improved systems to ease contribution
• Additional contributions to science
• Improved publishing processes
• Recognition for contributions
AltMetrics as Scientist Impact
My opinions…
• Yes, platform development is critical
• Yes, ease-of-use/efficiency is necessary
• Yes, standards can be improved
• The greatest shifts will come from:
• An increased willingness to share
• More training in chemical information
• Working towards new community norms
• The majority of change is bottom-up
Internet Data
The Future
Commercial Software
Pre-competitive Data
Open Science
Open Data
Publishers
Educators
Open Databases
Chemical Vendors
Small organic molecules
Undefined materials
Organometallics
Nanomaterials
Polymers
Minerals
Particle bound
Links to Biologicals
Acknowledgments
• Data Repository Team and ChemSpider Team
• Daniel Lowe (NextMove software)
• Igor Tetko (HelmholtzZentrum München)
• Carlos Coba (Mestrelab Research)
Thank you
Email: tony27587@gmail.com
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams

CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Genomic DNA And Complementary DNA Libraries construction.
Genomic DNA And Complementary DNA Libraries construction.Genomic DNA And Complementary DNA Libraries construction.
Genomic DNA And Complementary DNA Libraries construction.k64182334
 
The Black hole shadow in Modified Gravity
The Black hole shadow in Modified GravityThe Black hole shadow in Modified Gravity
The Black hole shadow in Modified GravitySubhadipsau21168
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxSwapnil Therkar
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsssuserddc89b
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trssuser06f238
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzohaibmir069
 

Recently uploaded (20)

Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Genomic DNA And Complementary DNA Libraries construction.
Genomic DNA And Complementary DNA Libraries construction.Genomic DNA And Complementary DNA Libraries construction.
Genomic DNA And Complementary DNA Libraries construction.
 
The Black hole shadow in Modified Gravity
The Black hole shadow in Modified GravityThe Black hole shadow in Modified Gravity
The Black hole shadow in Modified Gravity
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physics
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 tr
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistan
 

Activities at the Royal Society of Chemistry to gather, extract and analyze big datasets in chemistry

  • 1. Activities at the Royal Society of Chemistry to Gather, Extract and Analyze Big Datasets in Chemistry RSC-CICAG Meeting April 22nd 2015
  • 3. What of the World of Chemistry?
  • 4. What of the World of Chemistry?
  • 6. What of the World of Chemistry?
  • 7. What of the World of Chemistry? “The InChIKey indexing has therefore turned Google into a de-facto open global chemical information hub by merging links to most significant sources, including over 50 million PubChem and ChemSpider records.”
  • 8. What of the World of Chemistry?
  • 9. RSC’s ChemSpider >34 million chemicals from >500 sources and >40,000 users per day
  • 10. Not Dealing With Big Data…
  • 12. Open Access/Data Mandates Open Access funder mandates…
  • 13. We hear about the Open Data…
  • 14. Chemistry Open Data??? • Where are all of the Open Chemistry Data? • Is there a willingness to contribute more? • Can we harvest more?
  • 15. Chemistry Open Data??? • Where are all of the Open Chemistry Data? • Not that much showing up yet from scientists • Is there a willingness to contribute more? • Can we harvest more?
  • 16. Chemistry Open Data??? • Where are all of the Open Chemistry Data? • Not that much showing up yet from scientists • Is there a willingness to contribute more? • Many concerns about IP and much lip service • Can we harvest more?
  • 17. Chemistry Open Data??? • Where are all of the Open Chemistry Data? • Not that much showing up yet from scientists • Is there a willingness to contribute more? • Many concerns about IP and much lip service • Can we harvest more? • Yes
  • 19. RSC >36,000 Articles in 2015 • Consider articles published by RSC in 2015 • How many compounds? • How many reactions? • How many figures? • How many properties? • How many spectra? • How many, how many, how many?
  • 20. The Graph of Relationships is Lost
  • 21. The flexibility of querying… What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known pathways? Working on now? Connections to disease? Expressed in right cell type? Competitors? IP?
  • 22. Publications-summary of work • Scientific publications are a summary of work • Is all work reported? • How much science is lost to pruning? • What of value sits in notebooks and is lost? • Publications offering access to “real data”? • How much data is lost? • How many compounds never reported? • How many syntheses fail or succeed? • How many characterization measurements?
  • 23. If I wanted to share data… • I’ve performed a few dozen chemical syntheses • I’ve run thousands of analytical spectra • I’ve generated thousands of NMR assignments • I’ve probably published <5% of all work; most is lost • Things can be different today in terms of sharing • I would like to share more data, and would like at least to have provenance traced to me and to be acknowledged for the contribution
  • 24. How Many Structures Can You Generate From a Formula?
  • 27. In researcher mode… • I want to access and use data • I want to: • Download molecules • Download tables • Download spectra • Download figures • Then reprocess, replot, repurpose
  • 28. The Challenge of Data Analysis • NO access to raw data files – in binary or even standard file formats for processing • Figures are close to USELESS for 2D NMR – representative not accurate shifts • Tabulated shifts are in PDF files and needed transcribing – where are CSV files??? • TORTUROUS WORK!!!! • What if we wanted to do this for all manuscripts submitted to RSC? Of course it is Feasible…
  • 29. Community Norms • Some wonderful community norms & mandates! • Deposit crystal structures in CSD • Deposit Proteins in PDB • Deposit gene sequences in Genbank • Increasingly deposit bioassay data in Pubchem
  • 30. But what of general chemistry? • We publish into document formats • Could publishers help drive a community norm for: • Chemical compound registration • Spectral data • Property data • What else? • Who would host it? How would it be funded?
  • 31. Not even a References Standard
  • 32. We can solve for Authors… Will it be used though??? YES!
  • 34. The challenges of analytical data • Vendors produce complex proprietary data formats, so standard formats are required (JCAMP, NetCDF, AnIML) • ChemSpider already hosts thousands of JCAMP spectra • Data validation approaches understood • There are a myriad of analytical data types…
  • 36. Encouraging data deposition • Open Data mandates don’t offer solutions • We would like to host: • Compounds, Reactions, Spectra, Images, Figures, Graphs etc. • We will offer embargoing, collaborative sharing and public release of data • Integration to Electronic Lab Notebooks and Institutional Repositories for deposition
  • 37. RSC Repository Architecture doi: 10.1007/s10822-014-9784-5
  • 38. Registering of Data • We hear…“We need standards”
  • 42. There are standards • JCAMP, NetCDF, SPC, AnIML for analytical data • Plus newer efforts in development – Allotrope Foundation efforts
  • 44. Registering of Data • We hear…“We need standards” • Many standards exist already! • GREAT progress can be made with •Data checking and “warnings” •Normalization and standardization •SIMPLE checks would help databases •“High-quality databases” have rigorous checks in place
  • 45. Data Quality Issues Williams and Ekins, DDT, 16: 747-750 (2011) Science Translational Medicine 2011
  • 46. Data quality is a known issue
  • 47. Data quality is a known issue
  • 48.
      Substructure  | # of Hits | # of Correct Hits | No stereochemistry | Incomplete stereochemistry | Complete but incorrect stereochemistry
      Gonane        | 34        | 5                 | 8                  | 21                         | 0
      Gon-4-ene     | 55        | 12                | 3                  | 33                         | 7
      Gon-1,4-diene | 60        | 17                | 10                 | 23                         | 10
      Only 34 out of 149 structures were correct!
  • 49. Patent data in public databases
  • 50. Patent data in public databases
  • 56. EXPERTS must get it right?!
  • 57. The value of a validated dictionary
  • 59. The Open PHACTS community ecosystem
  • 60. Open PHACTS • Innovative Medicines Initiative EU project • 16 Million Euros, 3 years – meshing chemistry and biology Open Data primarily • Semantic web project and driven by ODOSOS – Open Data, Open Source, Open Standards • RSC developed the chemistry registration system and “CVSP”
  • 61. CVSP: Validate and Standardize
  • 63. CVSP Filtering of DrugBank
  • 64. CVSP Filtering of DrugBank
  • 65. CVSP is Open to Anyone!
  • 66. What if… • CVSP was used to check molecular files before submitting to publishers or databases? • Publishers used CVSP to check their data? • All rules were openly available for adoption? • Standards, a community norm, access to data
  • 67. What if we could do the same… • Check/validate procedures: • File format checking (think CIF checker) • Nomenclature checking • Compare experimental vs. predicted data and flag suspicious data for inspection • Physchem parameter comparisons • NMR shift prediction (and assignment)
  • 68. Building a BIG Data Repository • We have validation procedures in place: • Compound validation • Reaction checking • Analytical data formats (in development) • But how long to get to a Big Data Repository? • Users want to get data more than contribute! • Where can we find data???
  • 69. The RSC Archive • Over 300,000 articles containing chemistry • Compounds, reactions, property data, spectral data, the usual…. • Document formats to analyze and extract • Previous experience with “Prospecting” compounds
  • 71. What was our NextMove? • Daniel Lowe worked on text-mining and named-entity recognition at University of Cambridge • Extracted millions of chemical reactions from US Patents • Working with NextMove products (LeadMine and CaffeineFix) and optimization by Daniel
  • 73. PhysChem first: Melting Points • Melting/sublimation/decomposition points extracted for 287,635 distinct compounds from 1976-2014 USPTO patent applications/grants • Sanity checks used to flag dubious values, e.g. a flagged value that was probably meant to read 130-4°C • Non-melting outcomes recorded e.g. mp 147-150°C. (subl.) • What models could be built?
  • 74. QSPR/QSAR modelling in OCHEM http://ochem.eu
  • 75. Modeling “BIG data” • Melting point models developed with ca. 300k compounds • Required 34 GB of memory and about 400 MB of disk space (zipped) • Matrix with 2×10^11 entries (300k molecules x 700k descriptors) • >12k core-hours (>600 CPU-days) for parameter optimization • Parallelized on >600 cores with up to 24 cores per task • Consensus model as average of individual models • Accuracy of consensus model is ~33.6 °C for drug-like region compounds • Models publicly available at http://ochem.eu
  • 76. Distribution of MPs in the analyzed sets [Figure: data density versus melting point, axis from –200 to 500, for the OCHEM, Enamine, Bradley, Bergström and PATENTS sets]
  • 77. PhysChem parameters • Melting point model and data – good data extracted and filtered “automagically” • Boiling point data next – pressure dependence • What next – logP, pKa, aq/non-aq. Solubility • Prove the algorithms on US Patent Collection then apply to RSC archive • Ideally plumb the algorithms for all new papers • More ideal – authors submit DATA!
  • 78. A Recent Talk at ACS/Denver http://www.slideshare.net/AntonyWilliams/
  • 83. ESI – Text Spectra
  • 84. We want to find text spectra? • We can find and index text spectra: 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC) • What would be better are spectral figures, including assignments where possible!
  • 85. 1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
  • 87. NMR Spectra • 2,316,005 distinct spectra in 2001-2015 USPTO
      Nucleus   Count
      H         1,993,384
      C         173,970
      Unknown   107,439
      F         22,158
      P         16,333
      B         980
      Si        715
      Pt        275
      N         170
      V         101
  • 88. Original spectra: 1H-NMR (DMSO-d6, 400 MHz): δ=1.04 (t, 6H; J=7.9 Hz, -CH3), 1.38 (q, 4H; J=7.9 Hz, Ge-CH2-), 6.88 (d, 4H; J=8.5 Hz, Ar-H3,5), 7.58 (d, 4H; J=8.5 Hz, Ar-H2,6), 10.53 (s, 2H, OH)
      Parse tree:
      <parse>
        <nmrElement isotope="1" element="H">1H</nmrElement>
        <nmrMethodAndSolvent>DMSO-d6, 400 MHz</nmrMethodAndSolvent>
        <peak> <peakValue>1.04</peakValue> <peakAnnotation>t, 6H; J=7.9 Hz, -CH3</peakAnnotation> </peak>
        <peak> <peakValue>1.38</peakValue> <peakAnnotation>q, 4H; J=7.9 Hz, Ge-CH2-</peakAnnotation> </peak>
        <peak> <peakValue>6.88</peakValue> <peakAnnotation>d, 4H; J=8.5 Hz, Ar-H3,5</peakAnnotation> </peak>
        <peak> <peakValue>7.58</peakValue> <peakAnnotation>d, 4H; J=8.5 Hz, Ar-H2,6</peakAnnotation> </peak>
        <peak> <peakValue>10.53</peakValue> <peakAnnotation>s, 2H, OH</peakAnnotation> </peak>
      </parse>
      Normalized spectra: 1H-NMR (DMSO-d6, 400 MHz): 1.04 (t, 6H; J=7.9 Hz, -CH3), 1.38 (q, 4H; J=7.9 Hz, Ge-CH2-), 6.88 (d, 4H; J=8.5 Hz, Ar-H3,5), 7.58 (d, 4H; J=8.5 Hz, Ar-H2,6), 10.53 (s, 2H, OH)
  • 89. NMR extracted as f(year) [Figure: cumulative distinct NMR extracted (0 to 2,500,000) versus year of publication (1976-2014), for USPTO grants and USPTO applications]
  • 90. NMR solvents [Pie chart: segment labels 48.5%, 38.3%, 8.7%, 1.1%, 1.0%, 1.0%, 1.4%; legend CDCl3, DMSO-d6, CD3OD, D2O, Acetone-d6, MeOD, Others] Others: CD2Cl2, CD3CN-d3, C6D6, Pyridine-d5, THF-d8, CD3Cl, dimethylformamide-d7, d1-trifluoroacetic acid, methanol-d3, acetic acid-d4, toluene-d8, sulfuric acid-d2, 1,1,2,2-tetrachloroethane-d2, CD3OCD3, dioxane-d8, 1,2-dichloroethane-d4,
  • 91. 1H-NMR frequency over time [Figure: 1H-NMR frequency (0 to 450 MHz) versus year of patent filing (1976-2014)]
  • 92. Sounds easy right? • Potential for errors with names • No name extracted for structure • Incomplete names extracted • Misassociation of names with structures • Incorrect conversion of names to structures
  • 93. BIGGEST problem - BRACKETS • Brackets in names are a big problem: either an additional bracket or a missing bracket
  • 94. Cannot be converted • https://www.google.co.uk/patents/US20050187390A1 • 2-[2-(4′-carbamoyl-4-methoxy-biphen-2-yl)-quinolin-6-yl]-1-cyclohexyl-1H-benzoimidazole-5-carboxylic Acid • OPSIN expects biphenyl-2-yl
  • 95. OCR error Correction • https://www.google.co.uk/patents/WO2012150220A1 • di-terf-butyl (4S)-/V-(fert-butoxycarbonyl)-4-{4-[3-(tosyloxy)propyl]benzyl}-L-glutamate • CaffeineFix corrected to: di-tert-butyl (4S)-N-(tert-butoxycarbonyl)-4-{4-[3-(tosyloxy)propyl]benzyl}-L-glutamate • Corrections made: f --> t, /V --> N, f --> t
  • 96. Sounds easy right? • Textual Spectrum descriptions have issues • Transcription errors (rare) • Subjective interpretation (very common) • Incomplete listing of shifts • No/incomplete couplings/multiplicities listed • Overlap of multiplets (very common) • Labile protons – included/excluded/partial
  • 97. Sounds easy right? • Textual Spectrum descriptions have issues • No peak width indications – especially labiles • No peak shape indications – dynamic exchange • Presence of rotamers • Impurities included or misidentified • Solvent peak belonging to the compound • Wrong number of nuclei
  • 98. Problems Generating Spectra • Multiplicities but no coupling constants • δ 1H NMR (300 MHz, CDCl3): 1.48 (t, 3H), 4.15 (q, 2H), 7.03 (td, 1H), 7.16 (td, 1H), 7.49 (m, 1H), 7.70 (dd, 1H), 7.88 (dd, 1H), 8.77 (d, 1H)
  • 99. Problems Generating Spectra • PARTIAL couplings only for ca. 90% of spectra! • δ 1H NMR (300 MHz, CDCl3): 0.48-0.66 (m, 2H) 0.75-0.95 (m, 2H), 1.80 (s, 1H), 3.86 (s, 3H), 5.56 (s, 2H), 6.59 (d, J=8.50 Hz, 1H), 7.03 (dd, J=8.50, 2.15 Hz, 1H), 7.60 (s, 1H)
  • 100. Error Detection 1H NMR (400 MHz, CDCl3) δ ppm 11.47-12.05 (1H), 7.97-8.24 (1H), 7.61-7.97 (2H), 7.28-7.61 (2H), 7.21 (1H), 5.27 (1H), 3.70-4.74 (8H), 2.80-3.16 (2H), 2.46-2.80 (2H), 1.87-2.45 (2H), 1.35-1.77 (11H), 1.24 (18H), 0.87 (3H) associated with Glyceryl Monolaurate
  • 101. Error Detection • 54 hydrogens counted in the reported spectrum. Glyceryl Monolaurate has only 30 hydrogens. • Title was: “Polymerization of Monomer 4 with Glyceryl Monolaurate” • Text-mining title missed compound: Monomer 4 is the compound below
  • 102. Text-mined spectra • In the process of converting spectra into visual depictions many challenges identified • Validation approaches include: • NMR prediction and validation • Hosting “extracted text spectra” plus depictions – full provenance to source • Application to RSC archive will come later
  • 103. ESI Data also contains figures
  • 104. “Where is the real data please?” FIGURE DATA
  • 105. Data added to ChemSpider
  • 106. Manual Curation Layer • ChemSpider has had a manual curation layer for >8 years • Users can annotate data on ChemSpider • We do receive useful feedback from the community on the data and are optimistic!
  • 107. Extraction is the WRONG WAY • We should NOT have to mine data out – it should be supplied in digital form! • Structures should be submitted “correctly” • Spectra should be in digital spectral formats, not images • ESI should be RICH and interactive • Data should be open, available, with metadata and provenance • Can we encourage depositions????
  • 108. An EPSRC Call “…the identification of the need for a UK national service for the provision of a searchable, electronic chemical database for the UK academic research community.”
  • 110. Community Data Repository • Automated depositions of data • Electronic Lab Notebooks as feeds • National services feeding the repository – crystallography, mass spectrometry • Accessing open data from other projects
  • 113. What can drive participation? • What can drive scientists to participate and contribute? • Ensuring provenance of their data for reuse • Mandates from funding agencies • Improved systems to ease contribution • Additional contributions to science • Improved publishing processes • Recognition for contributions
  • 115. My opinions… • Yes, platform development is critical • Yes, ease-of-use/efficiency is necessary • Yes, standards can be improved • The greatest shifts will come from: • An increased willingness to share • More training in chemical information • Working towards new community norms • The majority of change is bottom-up
  • 116. Internet Data The Future Commercial Software Pre-competitive Data Open Science Open Data Publishers Educators Open Databases Chemical Vendors Small organic molecules Undefined materials Organometallics Nanomaterials Polymers Minerals Particle bound Links to Biologicals
  • 117. Acknowledgments • Data Repository Team and ChemSpider Team • Daniel Lowe (NextMove software) • Igor Tetko (HelmholtzZentrum München) • Carlos Coba (Mestrelab Research)
  • 118. Thank you Email: tony27587@gmail.com ORCID: 0000-0002-2668-4821 Twitter: @ChemConnector Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams
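
The text-spectrum slides (84-88) and the error-detection slides (100-101) describe pulling peaks out of a free-text NMR description and comparing the reported hydrogen count with the compound's formula. The sketch below is one minimal way that idea could look in Python; it is not the LeadMine or CaffeineFix implementation. The regular expressions, the function names (`parse_text_spectrum`, `hydrogen_count`, `hydrogen_mismatch`) and the assumption that annotations contain no nested parentheses are all illustrative choices, not something taken from the slides.

```python
import re

# A shift or shift range followed by one parenthesised annotation, e.g.
# "1.04 (t, 6H; J=7.9 Hz, -CH3)" or "11.47-12.05 (1H)".
# Assumption: annotations contain no nested parentheses.
PEAK_RE = re.compile(r"(\d+\.\d+)(?:\s*[-–]\s*(\d+\.\d+))?\s*\(([^()]*)\)")
# An integration such as "6H"; "7.9 Hz" is rejected by the \b after the H.
H_COUNT_RE = re.compile(r"\b(\d+)\s*H\b")


def parse_text_spectrum(text):
    """Return one dict per peak: low/high shift in ppm, nH, raw annotation."""
    peaks = []
    for match in PEAK_RE.finditer(text):
        first = float(match.group(1))
        second = float(match.group(2)) if match.group(2) else first
        integration = H_COUNT_RE.search(match.group(3))
        peaks.append({
            "low": min(first, second),
            "high": max(first, second),
            "nH": int(integration.group(1)) if integration else None,
            "annotation": match.group(3).strip(),
        })
    return peaks


def hydrogen_count(peaks):
    """Sum the integrations; peaks without an integration count as zero."""
    return sum(p["nH"] or 0 for p in peaks)


def hydrogen_mismatch(peaks, expected_h):
    """True when the reported hydrogens disagree with the expected count."""
    return hydrogen_count(peaks) != expected_h


# Normalized spectrum from slide 88.
normalized = (
    "1H-NMR (DMSO-d6, 400 MHz): 1.04 (t, 6H; J=7.9 Hz, -CH3), "
    "1.38 (q, 4H; J=7.9 Hz, Ge-CH2-), 6.88 (d, 4H; J=8.5 Hz, Ar-H3,5), "
    "7.58 (d, 4H; J=8.5 Hz, Ar-H2,6), 10.53 (s, 2H, OH)"
)
# Spectrum from slide 100, wrongly associated with Glyceryl Monolaurate.
suspect = (
    "1H NMR (400 MHz, CDCl3) δ ppm 11.47-12.05 (1H), 7.97-8.24 (1H), "
    "7.61-7.97 (2H), 7.28-7.61 (2H), 7.21 (1H), 5.27 (1H), 3.70-4.74 (8H), "
    "2.80-3.16 (2H), 2.46-2.80 (2H), 1.87-2.45 (2H), 1.35-1.77 (11H), "
    "1.24 (18H), 0.87 (3H)"
)

if __name__ == "__main__":
    print(len(parse_text_spectrum(normalized)), "peaks in the slide 88 example")
    print(hydrogen_count(parse_text_spectrum(suspect)), "hydrogens reported")
    # Slide 101 gives 30 hydrogens for Glyceryl Monolaurate.
    print("flag for inspection:", hydrogen_mismatch(parse_text_spectrum(suspect), 30))
```

A check this simple only flags a spectrum for inspection. As slides 96-97 point out, labile protons, impurities and overlapping multiplets can all shift the count without the structure being wrong.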

Editor's Notes

  1. US20140329929A1: the melting point and both NMR spectra are associated with the compound. Other physical quantities, e.g. volumes, pressures etc., are also detected.
  2. Mostly melting points (as opposed to sublimation/decomposition). Dubious values are usually mistakes in the original document, e.g. in this case probably a missing hyphen.
  3. All duplicates were removed. The Patents dataset was used to develop models; the other sets were used to evaluate the performance of the models. The number of compounds was smallest in the Bergström dataset, N=277 (see re-print for more statistics).
  4. Although some descriptors had high dimensionality, the number of non-zero elements was rather small. This allowed efficient use of a sparse data storage format and the SVM method to analyse them.
  5. The parallelized code is no Size of ANN model does not depend on the number of training samples but on the size of descriptors x size of hidden neurons. LibSVM model size is proportional to the data (number of support vectors, i.e., training points) selected by the model.
  6. Since there are only few compounds outside of “drug-like” region (50 - 250) °C – shown on the plot, the prediction accuracy of the models is low for these subspaces.
  7. Unknown spectra are almost always hydrogen. As carbon shifts are so different to hydrogen a very crude check could partition the unknowns into proton and carbon NMR. Small numbers of other obscure spectra also found (but also false positives due to really bizarre “OCR” errors of hydrogen or the like, i.e. 1 in a million errors :-p)
  8. Parse tree is intentionally quite coarse, as in the general case the nmrMethodAndSolvent and peakAnnotations are essentially free text. A normalized spectrum can be generated from the parse tree so as to present a common spectral format to downstream tools. In this case the normalized spectrum is virtually identical to the original (δ= removed)
  9. List of others probably isn’t completely comprehensive (solvent is free text!). 2 million spectra (from USPTO applications) have identified solvents
  10. Excluded results < 1 MHz and > 1 GHz (…mixing up Hz and MHz not uncommon!). Just to confuse things this is from the grant data while the previous data was from applications :-p
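
Notes 7 and 10 describe two crude filters: partitioning spectra of unknown nucleus into proton and carbon by their shift values, and excluding spectrometer frequencies below 1 MHz or above 1 GHz. A minimal Python sketch of both follows. The 15 ppm cut-off and the names `guess_nucleus` and `plausible_frequency_mhz` are assumptions chosen for illustration; the notes do not say which threshold was actually used.

```python
def guess_nucleus(shifts_ppm, proton_max_ppm=15.0):
    """Crudely label a spectrum "1H" or "13C" from its largest shift.

    Assumption: proton shifts rarely exceed about 15 ppm, while carbon
    spectra usually contain at least one peak well above that value.
    """
    if not shifts_ppm:
        return None
    return "1H" if max(shifts_ppm) <= proton_max_ppm else "13C"


def plausible_frequency_mhz(value_mhz):
    """Keep only frequencies between 1 MHz and 1 GHz (1000 MHz), as in note 10."""
    return 1.0 <= value_mhz <= 1000.0


if __name__ == "__main__":
    # Shifts taken from the 1H example on slide 88.
    print(guess_nucleus([1.04, 1.38, 6.88, 7.58, 10.53]))
    # A few shifts from the 13C example on slide 84.
    print(guess_nucleus([14.12, 30.11, 66.12, 154.88]))
    # A value recorded in Hz instead of MHz lands far outside the window.
    print(plausible_frequency_mhz(400.0), plausible_frequency_mhz(400_000_000.0))
```

The nucleus guess will mislabel a carbon spectrum whose peaks all fall below the cut-off, so it suits only the rough partitioning the note describes.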