Does bigger mean better in the
world of chemistry databases?
Antony Williams1 and Christopher Southan2
1) National Center for Computational Toxicology, U.S. Environmental Protection Agency, RTP, NC, USA
ORCID ID 0000-0002-2668-4821
2) TW2Informatics Ltd, Göteborg, Sweden 42166
ORCID ID 0000-0001-9580-0446
Views expressed in this presentation are those of the authors and do not necessarily reflect the views or policies of the U.S. EPA
The Good News….
• We’ve never had it so good (UniChem ~160 million)
• Sustained growth since 2Q2017
– Scifinder +25 million
– ChemSpider +16 million
– UniChem +14 million
– PubChem +6 million
• Massively enabling for chemistry and bioactivity
• All four should be congratulated! Public databases
in particular (where InChI is the great enabler)
Data quality in public domain
databases is challenging…
• Data quality in free web-based databases!
Database Quality and Noise
• Intuitively understood but difficult to quantify
• Some aspects are inherently challenging for
cheminformatics (e.g. tautomer handling, Kekulisation of
complex cycles, atropisomers, exotic organometallic
compounds, challenging layouts and renderings)
• Other challenges are just difficult (e.g. which stereo
enumerations were experimentally confirmed, or whether
the bioassays used undefined racemates)
Taxol (Paclitaxel) is noisy….
• CID 36314 most “popular” with 304 singleton submissions and 532 mixtures
• First submitted by NIAID on 2004-09-15 as SID: 598380 (but is it correct?)
• 154 have different stereo (some MAY be correctly synthesized)
• 34 have different isotopes and 12 of these have same stereo
• 532 mixture SIDs merge to 354 distinct CID mixtures and components
• 66 vendors will sell you CID 36314
• 59 vendors will sell you one of the other 166 different CIDs
• Sigma-Aldrich submitted the identical structure 10 times (as different SIDs)
• ZINC links to vendors for 17 of the 167
• 64 of 167 CIDs are single-sources, 17 of which are vendors
• 12 of 167 CIDs include RN 33069-62-4 as a synonym
• 12 of 167 are flagged as active in different BioAssays
Will the correct Microcystin LR Stand Up?
ChemSpider Skeleton Search
Comparing ChemSpider Structures
Other Searches
CASRN 3022-92-2 on PubChem
https://pubchem.ncbi.nlm.nih.gov/#query=3022-92-2
FOUR different structures, THREE different skeletons
Comparisons…
ChemIDplus, ChemSpider, SciFinder
Database Quality and Noise
• Common problems: source errors in CAS-RN
mappings, name-to-structure conversion errors,
authors ignoring IUPAC rules for chemical naming
• We accept some intrinsically noisy sources as a
value compromise (e.g. large vendor aggregations
and automated document extraction feeds)
• Some databases index substances without
structures: antibodies, large peptides and molasses –
not currently mappable but may have linked data
Known issues with public databases (1)
• Different sets of chemistry rules and submission filters
• Operations seem to be focussed on data expansion,
with less effort put into quality
• No inter-resource intersection statistics
• Some useful boutique databases do not submit
• Massive coverage gaps: much of the literature is not
extracted into the public databases
• Coverage gaps from non-document sources (e.g.
open drug discovery ELNs)
• Not all are fully open, searchable and downloadable
Known issues with public databases (2)
• Unknown extent of contamination by virtuals
• Confounding circularity – identical submissions
between systems, with consequent degradation of
mappings
• Expert chemical curation, biocuration and
crowdsourced fixing do not scale
• Public databases are susceptible to exploitation by
opportunistic and low-quality submitters
• Large databases aggregate different types of errors
• No real indication of collaboration between the public
databases to solve the issues of data quality
Quality has many aspects
• Getting structures to round-trip (Molfile, IUPAC name,
SMILES, InChI string and InChIKey all concordant and
rendered at least reasonably) – but no surprise
– Issues with V2000/V3000 exchange; molfiles are imperfect
– InChI is powerful but imperfect and extensions are underway
– Manually generated IUPAC names can be very low quality
• Submission filtering rules to ensure plausible
structures (e.g. “Chessboardanes”)
• Tracking molecular “multiplexing” (i.e. InChIKey inner
layer)
• Automated document extraction of chemistry is noisy
(SureChEMBL, IBM, Springer, Thieme)
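One way to track the “multiplexing” mentioned above is to group records by the first block of the InChIKey, which hashes the connectivity skeleton, so stereo and isotope variants of one skeleton fall into the same group. A minimal sketch, assuming records arrive as (identifier, InChIKey) pairs; the function name and the keys below are made-up placeholders, not real compounds.

```python
from collections import defaultdict


def group_by_skeleton(records):
    """Group (record_id, inchikey) pairs by the first InChIKey block."""
    groups = defaultdict(list)
    for record_id, inchikey in records:
        # The first hyphen-separated block is the connectivity hash.
        skeleton = inchikey.split("-")[0]
        groups[skeleton].append(record_id)
    return dict(groups)


# Hypothetical records: three variants of one skeleton plus one unrelated entry.
records = [
    ("rec1", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    ("rec2", "AAAAAAAAAAAAAA-CCCCCCCCSA-N"),
    ("rec3", "AAAAAAAAAAAAAA-UHFFFAOYSA-N"),
    ("rec4", "DDDDDDDDDDDDDD-UHFFFAOYSA-N"),
]

groups = group_by_skeleton(records)
# Skeletons represented by more than one record are the "multiplexed" ones.
multiplexed = {key: ids for key, ids in groups.items() if len(ids) > 1}
```

The size of each group gives a rough count of how many variants of one skeleton a collection holds; it does not say which variant is correct.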
Applications of public databases to
non-targeted analysis
• Non-targeted analysis for structure identification
and forensic analysis
• The number of hits retrieved by mass/formula
searches explodes because of poorly represented
chemicals – especially stereo issues
• The number of hits makes it much harder to
rank candidate collections based on meta-data
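The hit explosion described above can be illustrated with a toy mass search. This is only a sketch under stated assumptions: the candidates, masses and source counts are invented, source count stands in for meta-data ranking, and stereo variants are collapsed by InChIKey first block before ranking.

```python
def mass_search(candidates, target_mass, ppm=5.0):
    """Return candidates whose mass lies within +/- ppm of the target."""
    tolerance = target_mass * ppm / 1e6
    return [c for c in candidates if abs(c["mass"] - target_mass) <= tolerance]


def collapse_and_rank(hits):
    """Keep the best-sourced record per skeleton, then rank by source count."""
    best = {}
    for hit in hits:
        skeleton = hit["inchikey"].split("-")[0]
        if skeleton not in best or hit["sources"] > best[skeleton]["sources"]:
            best[skeleton] = hit
    return sorted(best.values(), key=lambda h: h["sources"], reverse=True)


# Hypothetical candidate list (all values are placeholders).
candidates = [
    {"name": "A1", "mass": 300.1000, "inchikey": "AAAAAAAAAAAAAA-BBBBBBBBSA-N", "sources": 40},
    {"name": "A2", "mass": 300.1000, "inchikey": "AAAAAAAAAAAAAA-CCCCCCCCSA-N", "sources": 2},
    {"name": "A3", "mass": 300.1000, "inchikey": "AAAAAAAAAAAAAA-UHFFFAOYSA-N", "sources": 5},
    {"name": "B1", "mass": 300.1010, "inchikey": "DDDDDDDDDDDDDD-UHFFFAOYSA-N", "sources": 90},
    {"name": "C1", "mass": 300.2000, "inchikey": "EEEEEEEEEEEEEE-UHFFFAOYSA-N", "sources": 7},
]

hits = mass_search(candidates, target_mass=300.1000, ppm=5.0)
ranked = collapse_and_rank(hits)
```

Here four raw hits collapse to two skeletons, which is the kind of reduction that makes meta-data ranking tractable.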
Quantifying noise in PubChem
No other database offers this!
PubChem chemistry rules are not perfect but are transparent
and can be sliced and diced in useful detail, e.g.
• Mixture counts (covalent units >1)
• Explicit interrogation of stereo
• Counts of unique structures (single-source)
• Relationship mapping via individual entries and the
PubChem Identifier Exchange Service (up to ~5K)
• These types of stats are informative but should not
be overinterpreted
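The kind of slicing listed above can be sketched on a local table of records. The keys CovalentUnitCount, UndefinedAtomStereoCount and IsotopeAtomCount follow PubChem's computed-property names. The `source_count` field, the function name and all the values are assumptions for illustration, and a covalent unit count above 1 is treated here as a mixture.

```python
def noise_summary(records):
    """Return the share of records falling into each noise-related category."""
    total = len(records)

    def share(predicate):
        return sum(1 for r in records if predicate(r)) / total if total else 0.0

    return {
        "mixtures": share(lambda r: r["CovalentUnitCount"] > 1),
        "undefined_stereo": share(lambda r: r["UndefinedAtomStereoCount"] > 0),
        "isotope_labelled": share(lambda r: r["IsotopeAtomCount"] > 0),
        "single_source": share(lambda r: r["source_count"] == 1),
    }


# Hypothetical records; real values would come from a PubChem download or query.
records = [
    {"CovalentUnitCount": 1, "UndefinedAtomStereoCount": 0, "IsotopeAtomCount": 0, "source_count": 12},
    {"CovalentUnitCount": 2, "UndefinedAtomStereoCount": 1, "IsotopeAtomCount": 0, "source_count": 1},
    {"CovalentUnitCount": 1, "UndefinedAtomStereoCount": 3, "IsotopeAtomCount": 0, "source_count": 1},
    {"CovalentUnitCount": 1, "UndefinedAtomStereoCount": 0, "IsotopeAtomCount": 51, "source_count": 1},
]

summary = noise_summary(records)
```

As the slide says, such shares are informative for comparing sources or snapshots but should not be overinterpreted.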
Surprising result (I)
• A big increase in unique single-source content
• Judging by the metrics above, PubChem has not changed
much despite doubling in content since 2013
• Except for a big increase in uniqueness plus a slight increase in undefined chirality
Surprising result (II)
• Patents high in mixtures
• Vendors low for partial chirality
• Uniqueness in patents is underestimated (i.e. millions of structures
were extracted by both SureChEMBL and IBM, but by only those two)
Not such a surprising result
• Sources can be quite different, e.g. the comparison between ZINC and
EPA/DSSTox above
• ZINC virtually enumerates stereo, which increases uniqueness
• The intersect is 275,000 CIDs
Challenges with making improvements
• No quick fixes – we’ve been discussing it for over a
decade...
• Acknowledging quality and noise issues gives us a
chance of not being confounded by them
• But this is problematic for less experienced users
• PubChem allows you to filter just about anything,
either pre- or post-analysis
Challenges with making improvements
• Uniqueness is a two-edged sword - value or junk?
• Would be nice if someone made a widget that gave a
quick quality stats overview for chemical sets
– Chemical structures vs. CASRNs vs. names and other identifiers
• Standalone curated databases can give cleaner
results compared with the same content registered
elsewhere, e.g. 875k chemicals from CompTox
Chemicals Dashboard nested in 96 million in
PubChem. Standardization is not lossless...
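A quick quality overview of the kind wished for above could start by counting identifiers that point to more than one structure. A minimal sketch, assuming each row carries a CASRN, a name and an InChIKey; the function names, field names, compound names and InChIKeys are placeholders, and the CASRN is simply reused from the earlier slide as an example string.

```python
from collections import defaultdict


def identifier_conflicts(rows, id_field, structure_field="inchikey"):
    """Map each identifier to its structures, keeping only ambiguous ones."""
    seen = defaultdict(set)
    for row in rows:
        identifier = row.get(id_field)
        if identifier:  # skip rows where this identifier is missing
            seen[identifier].add(row[structure_field])
    return {i: sorted(s) for i, s in seen.items() if len(s) > 1}


def quality_overview(rows):
    """Summarise record counts and identifier-to-structure conflicts."""
    return {
        "records": len(rows),
        "distinct_structures": len({r["inchikey"] for r in rows}),
        "casrn_conflicts": len(identifier_conflicts(rows, "casrn")),
        "name_conflicts": len(identifier_conflicts(rows, "name")),
    }


# Hypothetical rows: one CASRN and one name each map to two structures.
rows = [
    {"casrn": "3022-92-2", "name": "compound X", "inchikey": "AAAAAAAAAAAAAA-UHFFFAOYSA-N"},
    {"casrn": "3022-92-2", "name": "compound X", "inchikey": "DDDDDDDDDDDDDD-UHFFFAOYSA-N"},
    {"casrn": "0000-00-0", "name": "compound Y", "inchikey": "EEEEEEEEEEEEEE-BBBBBBBBSA-N"},
    {"casrn": None, "name": "compound Y", "inchikey": "EEEEEEEEEEEEEE-BBBBBBBBSA-N"},
]

overview = quality_overview(rows)
```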
Standardization and standards
V3000 Stereochemistry Support
Original Dashboard
PubChem Standardized
Standardization and standards
Markush Representations
Standardization Efforts
The Power but Confusion of CASRNs
• CASRNs have only one true validation path
• CommonChemistry was a GREAT START for
Wikipedia CAS Validation – but out of date
Validation of CASRNs
• Automated bulk validation of CASRNs is
possible only with assistance from CAS
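What can be checked without CAS is the check digit built into every CASRN. This catches transcription errors only; it says nothing about whether the number was ever assigned or is mapped to the right structure. A minimal sketch with a hypothetical function name, using the two CASRNs quoted earlier in these slides:

```python
import re


def casrn_checksum_ok(casrn: str) -> bool:
    """Check the format and check digit of a CAS Registry Number string."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", casrn.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * weight for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == check_digit


print(casrn_checksum_ok("33069-62-4"))  # CASRN listed as a synonym on the Taxol slide
print(casrn_checksum_ok("3022-92-2"))   # CASRN searched on PubChem in the earlier slide
print(casrn_checksum_ok("3022-92-3"))   # same number with a deliberately wrong check digit
```

A number that passes this test is merely well-formed, which is why bulk validation of the CASRN-to-structure mapping still needs CAS.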
Automated Patent Extraction
• Classic dilemma between very high value and noise
• ChemSpider chose to forgo patent data because of
quality issues
• PubChem have done a herculean job on their feeds
from IBM, SCRIPDB, SureChEMBL and NextMove!
(e.g. indexing 3 million patent documents in the new
interface)
Patent CIDs by year (cumulative)
• SureChEMBL is the only major source regularly updating
• Will there be a post-2017 IBM refresh?
• “News flash” Google Patents has started incorporating searchable
chemistry extraction – so will this become a complementary feed?
Virtual deuteration: Is there really d-51 Paclitaxel??
• Left: PubChem CID 42599845, drawn by Thomson/Derwent
• Right: exemplification in US20090069410 from Protia
• Filed 100s of deuterated drug patents in 2008/9, with Czarnik
as sole inventor (but no evidence he actually made ‘em)
• Protia, Auspex and Concert filings have led to 1000s of
virtually deuterated drugs in PubChem
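A crude screen for virtual deuteration is the fraction of hydrogens written as deuterium in a molecular formula. Paclitaxel is C47H51NO14, so a “d-51” form replaces every hydrogen. The sketch below assumes formulas write deuterium as D and contain no brackets or charges. The function names are hypothetical, and a high fraction is only a prompt to check the source document, not proof that a compound was never made.

```python
import re


def element_counts(formula):
    """Count atoms per element symbol in a simple, unbracketed formula."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def deuterated_fraction(formula):
    """Return D / (D + H) for the formula, or 0.0 if it has no hydrogens."""
    counts = element_counts(formula)
    deuterium = counts.get("D", 0)
    protium = counts.get("H", 0)
    total = deuterium + protium
    return deuterium / total if total else 0.0


for formula in ("C47H51NO14", "C47H48D3NO14", "C47D51NO14"):
    print(formula, round(deuterated_fraction(formula), 3))
```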
Observations
• Our massively valuable open chemical database
ecosystem is noisy, vulnerable and under-resourced
– so we need to engage collectively for enhancements
• Expansion of big databases is good, but unless they
push back on submitters over primary data quality it’s
a losing battle
• Crowdsourcing does not scale – so could artificial
intelligence/machine learning improve some of the
structural standardisation/noise/quality issues?
Observations
• Are 64 million/50% unique vendor compounds in
PubChem too many? (e.g. cap the number of suppliers for
common compounds?)
• None of us would have a problem with virtual “make on
demand” compounds if they are clearly tagged
• Springer and Thieme index their automatically extracted
chemistry against documents – so what about ACS, RSC,
Wiley, Elsevier, ChemRxiv, others?
• Data changes - ChemSpider July 2016: 57 million from
517 sources; August 2019: 75 million from 270 sources
Conclusions
• How do we get the situation to change???
– More collaboration?
– More sharing?
– More standards?
• For now the biggest shift is likely education
– the community needs awareness of the
issues in large public resources
Acknowledgements
• All of the contributors of data to the public databases
• The hosts (and funders) of the individual databases
• The PubChem and ChemSpider teams for answering
queries

Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistan
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Scheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxScheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docx
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
Luciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxLuciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptx
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.ppt
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 

Quality and noise in big chemistry databases

  • 1. Does bigger mean better in the world of chemistry databases? Antony Williams (1) and Christopher Southan (2). 1) National Center for Computational Toxicology, U.S. Environmental Protection Agency, RTP, NC, USA; ORCID ID 0000-0002-2668-4821. 2) TW2Informatics Ltd, Göteborg, Sweden 42166; ORCID ID 0000-0001-9580-0446. Views expressed in this presentation are those of the authors and do not necessarily reflect the views or policies of the U.S. EPA.
  • 3. The Good News…. • We’ve never had it so good (UniChem ~160 million) • Sustained growth since 2Q2017 – SciFinder +25 million – ChemSpider +16 million – UniChem +14 million – PubChem +6 million • Massively enabling for chemistry and bioactivity • All four should be congratulated! Public databases in particular (where InChI is the great enabler)
  • 4. Data quality in public domain databases is challenging… • Data quality in free web-based databases!
  • 5. Database Quality and Noise • Intuitively understood but difficult to quantitate • Some aspects are inherently challenging for cheminformatics (e.g. tautomer handling, Kekulisation of complex cycles, atropisomers, exotic metalloorganic compounds, challenging layouts and renderings) • Other challenges are just difficult (e.g. which stereo enumerations were experimentally confirmed, or whether the bioassays used undefined racemates)
  • 6. Taxol (Paclitaxel) is noisy…. • CID 36314 is the most “popular”, with 304 singleton submissions and 532 mixtures • First submitted by NIAID on 2004-09-15 as SID 598380 (but is it correct?) • 154 have different stereo (some MAY be correctly synthesized) • 34 have different isotopes and 12 of these have the same stereo • 532 mixture SIDs merge to 354 distinct CID mixtures and components • 66 vendors will sell you CID 36314 • 59 vendors will sell you one of the other 166 different CIDs • Sigma-Aldrich submitted the identical structure 10 times (as different SIDs) • ZINC links to vendors for 17 of the 167 • 64 of 167 CIDs are single-source, 17 of which are from vendors • 12 of 167 CIDs include RN 33069-62-4 as a synonym • 12 of 167 are flagged as active in different BioAssays
  • 7. Will the correct Microcystin LR stand up? • ChemSpider skeleton search
  • 11. CASRN 3022-92-2 on PubChem https://pubchem.ncbi.nlm.nih.gov/#query=3022-92-2 • FOUR different structures, THREE different skeletons
  • 13. Database Quality and Noise • Common problems: source errors for CAS-RN mappings, name-to-structure conversion errors, authors ignoring IUPAC rules for chemical naming • We accept some intrinsically noisy sources as a compromise for their value (e.g. large vendor aggregations and automated document extraction feeds) • Some databases index substances without structures: antibodies, large peptides and molasses – not currently mappable but may have linked data
  • 14. Known issues with public databases (1) • Different sets of chemistry rules and submission filters • Operations seem to be focussed on data expansion, with less effort put into quality • No inter-resource intersection statistics • Some useful boutique databases do not submit • Massive coverage gaps: much of the literature is not extracted into the public databases • Coverage gaps from non-document sources (e.g. open drug discovery ELNs) • Not all are fully open, searchable and downloadable
  • 15. Known issues with public databases (2) • Unknown extent of contamination by virtual compounds • Confounding circularity – identical submissions between systems, with consequent degradation of mappings • Expert chemical curation, biocuration and crowd-sourced fixing do not scale • Public databases are susceptible to exploitation by opportunistic and low-quality submitters • Large databases aggregate different types of errors • No real indication of collaboration between the public databases to solve the issues of data quality
  • 16. Quality has many aspects • Getting structures to round-trip (Molfile, IUPAC name, SMILES, InChI string and InChIKey all concordant and rendered at least reasonably) – but no surprise – Issues with V2000/V3000 exchange, and molfiles are imperfect – InChI is powerful but imperfect, and extensions are underway – Manually generated IUPAC names can be of very low quality • Submission filtering rules to ensure plausible structures (e.g. “Chessboardanes”) • Tracking molecular “multiplexing” (i.e. InChIKey inner layer) • Automated document extraction of chemistry is noisy (SureChEMBL, IBM, Springer, Thieme)
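The InChIKey bullet on this slide can be illustrated with a minimal Python sketch, assuming a plain list of standard InChIKeys is available. It groups keys by their first 14-character block (the hash of formula and connectivity) to show how many stereo, isotopic or charge variants sit on one skeleton. The function names and the example keys are made up for illustration and do not correspond to real compounds or to anything in the presentation.

```python
import re
from collections import defaultdict

# Standard InChIKey layout: 14 letters, then 10 letters, then 1 letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{10})-([A-Z])$")


def split_inchikey(key):
    """Return (first_block, second_block, protonation_flag) of an InChIKey."""
    match = INCHIKEY_RE.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return match.groups()


def multiplexing(keys):
    """Map each first block to the set of full keys that share it."""
    groups = defaultdict(set)
    for key in keys:
        first_block, _, _ = split_inchikey(key)
        groups[first_block].add(key.strip())
    return dict(groups)


# Hypothetical keys, built programmatically so the block lengths are right.
example_keys = [
    "A" * 14 + "-UHFFFAOYSA-N",
    "A" * 14 + "-BBBBBBBBSA-N",
    "A" * 14 + "-CCCCCCCCSA-N",
    "Z" * 14 + "-UHFFFAOYSA-N",
]
groups = multiplexing(example_keys)
for first_block, members in sorted(groups.items()):
    print(first_block, len(members))
```

A skeleton with many full keys attached is not necessarily wrong, but it is a cheap flag for records that deserve a closer look.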
  • 17. Applications of public databases to non-targeted analysis • Non-targeted analysis for structure identification and forensic analysis • The number of hits retrieved by mass/formula searches explodes because of poorly represented chemicals – especially stereo issues • The number of hits makes it much harder to rank candidate collections based on meta-data
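One way to picture the hit explosion and the ranking problem is the following Python sketch. It assumes each hit from a mass or formula search carries an InChIKey and a list of source names; both field names are hypothetical. Hits that share an InChIKey first block are collapsed into one candidate, and candidates are ranked by the number of distinct sources, a simple stand-in for meta-data ranking rather than any method described in the talk.

```python
from collections import defaultdict


def collapse_and_rank(hits):
    """Merge hits sharing an InChIKey first block, then rank by source count.

    Each hit is a dict with 'inchikey' (str) and 'sources' (iterable of names).
    Returns a list of (first_block, n_variants, n_sources), best supported first.
    """
    variants = defaultdict(set)
    sources = defaultdict(set)
    for hit in hits:
        block = hit["inchikey"].split("-")[0]
        variants[block].add(hit["inchikey"])
        sources[block].update(hit["sources"])
    ranked = [(b, len(variants[b]), len(sources[b])) for b in variants]
    # Sort by descending source count, then by block for a stable order.
    ranked.sort(key=lambda row: (-row[2], row[0]))
    return ranked


# Hypothetical hits; keys and source names are placeholders.
hits = [
    {"inchikey": "A" * 14 + "-UHFFFAOYSA-N", "sources": ["vendor_1"]},
    {"inchikey": "A" * 14 + "-BBBBBBBBSA-N", "sources": ["vendor_1", "vendor_2"]},
    {"inchikey": "Q" * 14 + "-UHFFFAOYSA-N", "sources": ["vendor_3", "db_1", "db_2"]},
]
ranked = collapse_and_rank(hits)
for row in ranked:
    print(row)
```

Collapsing discards stereo information, so it suits a first ranking pass; the stereo variants still need inspecting once a skeleton looks plausible.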
  • 18. Quantifying noise in PubChem • No other database offers this! • PubChem chemistry rules are not perfect but are transparent and can be sliced and diced in useful detail, e.g. • Mixture counts (covalent units >1) • Explicit interrogation of stereo • Counts of unique structures (single-source) • Relationship mapping via individual entries and the PubChem Identifier Exchange Service (up to ~5K) • These types of stats are informative but should not be overinterpreted
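The kinds of counts listed on this slide can be sketched in a few lines of Python once the records have been downloaded. This is a minimal sketch with assumptions: the field names `covalent_units`, `undefined_stereocenters` and `n_sources` are hypothetical and not PubChem's own property names, and a mixture is taken to be a record with more than one covalently bonded unit.

```python
def noise_summary(records):
    """Return fractions of mixture, undefined-stereo and single-source records."""
    total = len(records)
    if total == 0:
        return {"mixture": 0.0, "undefined_stereo": 0.0, "single_source": 0.0}
    mixture = sum(1 for r in records if r["covalent_units"] > 1)
    undefined = sum(1 for r in records if r["undefined_stereocenters"] > 0)
    single = sum(1 for r in records if r["n_sources"] == 1)
    return {
        "mixture": mixture / total,
        "undefined_stereo": undefined / total,
        "single_source": single / total,
    }


# Hypothetical records for illustration only.
records = [
    {"covalent_units": 1, "undefined_stereocenters": 0, "n_sources": 5},
    {"covalent_units": 2, "undefined_stereocenters": 0, "n_sources": 1},
    {"covalent_units": 1, "undefined_stereocenters": 2, "n_sources": 1},
    {"covalent_units": 1, "undefined_stereocenters": 0, "n_sources": 3},
]
summary = noise_summary(records)
print(summary)
```

As the slide cautions, such fractions describe a collection without judging it: a high single-source share may indicate novel chemistry as easily as junk.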
  • 19. Surprising result (I) • A big increase in unique single-source content • Judging by the metrics above, PubChem has not changed much after doubling in content since 2013 • Except big < uniqueness plus slight < undefined chirality
  • 20. Surprising result (II) • Patents high in mixtures • Vendors low for partial chirality • Uniqueness in patents is underestimated (i.e. millions of structures extracted by SureChEMBL and IBM but only those two)
  • 21. Not such a surprising result • Sources can be quite different e.g. comparison between ZINC and EPA/DSSTox above • ZINC virtually enumerates stereo which < uniqueness • The intersect is 275,000 CIDs
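An intersect such as the one quoted for ZINC and EPA/DSSTox amounts to a set operation over shared identifiers. The Python sketch below assumes each source has been reduced to a collection of comparable identifiers (CIDs or InChIKeys); the function name and the toy identifiers are illustrative only.

```python
def overlap_stats(source_a, source_b):
    """Return shared and source-specific counts plus the Jaccard index."""
    a, b = set(source_a), set(source_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Toy identifiers standing in for CIDs from two hypothetical sources.
stats = overlap_stats([101, 102, 103], [102, 103, 104, 105])
print(stats)
```

The result depends on the identifier chosen: comparing full InChIKeys treats stereo-enumerated variants as distinct, whereas comparing first blocks merges them.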
  • 22. Challenges with making improvements • No quick fixes – we’ve been discussing it for over a decade... • Acknowledging quality and noise issues gives us a chance of not being confounded by them • But this is problematic for less experienced users • PubChem allows you to filter just about anything, either pre- or post-analysis
  • 23. Challenges with making improvements • Uniqueness is a two-edged sword - value or junk? • Would be nice if someone made a widget that gave a quick quality stats overview for chemical sets – Chemical structures vs. CASRNs vs. names and other identifiers • Standalone curated databases can give cleaner results compared with the same content registered elsewhere, e.g. 875k chemicals from CompTox Chemicals Dashboard nested in 96 million in PubChem. Standardization is not lossless...
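The "widget" wished for on this slide might start from something as small as the Python sketch below. It assumes each record carries a structure key plus optional CASRN and name fields (all field names hypothetical) and counts identifiers that point at more than one structure, the situation shown earlier for CASRN 3022-92-2. It is one possible starting point under those assumptions, not a tool the authors describe.

```python
from collections import defaultdict


def conflicting_identifiers(records, field):
    """Return {identifier: structure keys} for identifiers on >1 structure."""
    seen = defaultdict(set)
    for rec in records:
        value = rec.get(field)
        if value:
            seen[value].add(rec["structure_key"])
    return {ident: keys for ident, keys in seen.items() if len(keys) > 1}


def quality_overview(records):
    """Summarise identifier consistency for a chemical set."""
    return {
        "records": len(records),
        "casrn_conflicts": len(conflicting_identifiers(records, "casrn")),
        "name_conflicts": len(conflicting_identifiers(records, "name")),
        "missing_casrn": sum(1 for r in records if not r.get("casrn")),
    }


# Placeholder records; "RN-1" and "S1" are not real identifiers.
records = [
    {"structure_key": "S1", "casrn": "RN-1", "name": "compound a"},
    {"structure_key": "S2", "casrn": "RN-1", "name": "compound a (isomer)"},
    {"structure_key": "S3", "casrn": "RN-2", "name": "compound b"},
    {"structure_key": "S4", "casrn": None, "name": "compound b"},
]
overview = quality_overview(records)
print(overview)
```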
  • 24. Standardization and standards: V3000 Stereochemistry Support • Figure labels: Original Dashboard PubChem Standardized
  • 27. The Power but Confusion of CASRNs • CASRNs have only one true validation path • CommonChemistry was a GREAT START for Wikipedia CAS Validation – but out of date
  • 28. Validation of CASRNs • Automated bulk validation of CASRNs is possible only with assistance from CAS
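What can be automated without CAS is the check-digit test, which catches malformed or mistyped numbers only. It cannot confirm that a number is really assigned to a given structure, which is the validation the slide refers to. The Python sketch below applies the published check-digit rule: weight the digits before the check digit by position counted from the right, sum, and take the result modulo 10. The function name is illustrative.

```python
import re

# CAS Registry Number layout: 2 to 7 digits, 2 digits, 1 check digit.
CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def casrn_checksum_ok(casrn):
    """Return True if the string is well-formed and its check digit matches."""
    match = CAS_RE.match(casrn.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Rightmost body digit has weight 1, the next has weight 2, and so on.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


# The two numbers below appear earlier in this presentation.
for rn in ("33069-62-4", "3022-92-2", "33069-62-5"):
    print(rn, casrn_checksum_ok(rn))
```

Both numbers taken from the slides pass, and the deliberately altered third one fails. Passing this test says nothing about which of the four PubChem structures for 3022-92-2 is the right one.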
  • 29. Automated Patent Extraction • Classic dilemma between very high value and noise • ChemSpider chose to forego patent data because of quality issues • PubChem have done a herculean job on their feeds from IBM, SCRIPDB, SureChEMBL and NextMove! (e.g. indexing 3 million patent documents in the new interface)
  • 30. Patent CIDs by year (cumulative) • SureChEMBL is the only major source regularly updating • Will there be a post-2017 IBM refresh? • “News flash” Google Patents has started incorporating searchable chemistry extraction – so will this become a complementary feed?
  • 31. Virtual deuteration: Is there really d-51 Paclitaxel?? • Left: PubChem CID 42599845 drawn by Thomson/Derwent • Right: Exemplification in US20090069410 from Protia • Filed 100s of deuterated drug patents 2008/9, Czarnik sole inventor (but no evidence he actually made ‘em) • Protia, Auspex and Concert filings have led to 1000s of virtually deuterated drugs in PubChem
  • 32. Observations • Our massively-valuable open chemical database ecosystem is noisy, vulnerable and under-resourced – so we need to engage collectively for enhancements • Expansion of big databases is good, but unless they push back against the primary quality of submitters it’s a losing battle • Crowdsourcing does not scale – so could artificial intelligence/machine learning improve some of the structural standardisation/noise/quality issues?
  • 33. Observations • Are 64 million/50% unique vendor compounds in PubChem too much? (e.g. cap the number of suppliers for common compounds?) • None of us would have a problem with virtual “make on demand” compounds if they are clearly tagged • Springer and Thieme index their automatically extracted chemistry against documents – so what about ACS, RSC, Wiley, Elsevier, ChemRxiv, others? • Data changes - ChemSpider July 2016: 57 million from 517 sources; August 2019: 75 million from 270 sources
  • 34. Conclusions • How do we get the situation to change??? – More collaboration? – More sharing? – More standards? • For now the biggest shift is likely education – the community needs awareness of the issues in large public resources
  • 35. Acknowledgements • All of the contributors of data to the public databases • The hosts (and funders) of the individual databases • The PubChem and ChemSpider teams for answering queries