Mass spectrometry analyses at the US-EPA, especially non-targeted analysis studies, are highly dependent on the cheminformatics efforts which have been underway within the agency for almost a decade. These research efforts have resulted in a rich data infrastructure based on the DSSTox database, data integration approaches based on a structure standardization approach to produce “MS-ready” structures, and a number of supporting data types to facilitate ranking of non-targeted analysis candidates. This presentation will provide an overview of all tools in development and the integrated nature of the applications based on the underlying chemistry data. This includes the development of the underlying chemistry database of >1.2 million chemical substances (DSSTox), approaches to structure standardization to facilitate structure-substance mapping, development of a spectral database of >150,000 spectra for >25,000 chemicals, a database of >3000 analytical methods, prediction models for LCMS amenability, and an application for the profiling of toxicity hazards for batches of chemical substances. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
Cheminformatics tools and chemistry data underpinning mass spectrometry analyses at the US Environmental Protection Agency
1. Cheminformatics tools and chemistry data
underpinning mass spectrometry
analyses at the US-EPA
March 2024: Spring Meeting, New Orleans, LA
http://www.orcid.org/0000-0002-2668-4821
The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA
Antony Williams
Center for Computational Toxicology and Exposure, US-EPA, RTP, NC
2. The role of cheminformatics at EPA
• Our branch is in the Center for Computational
Toxicology and Exposure (CCTE)
• We develop curated chemistry data streams to
support our applications and models
• We develop prediction models, web-based
applications and data streams to support others
• Today’s presentation: how do our efforts
support mass spec. and especially NTA efforts?
– What’s public and what’s in development?
3. Why Does EPA Need Measurement Data?
• Measurement data needed to ensure chemical
safety
• Characterize risk
• Regulate use & disposal
• Manage human & ecological exposures
• Ensure compliance under federal statutes
[Figure: “Chemical Monitoring Needs” diagram with the labels Hazard Identification, Dose-Response Assessment, Exposure Assessment and Risk Characterization]
4. Challenges
• High-quality monitoring data are unavailable for most chemicals
• Measurement data normally generated using “targeted” methods
• Targeted analytical methods:
- Require a priori knowledge of chemicals of interest
- Produce data for few selected analytes (10s-100s)
- Require standards for method development & compound quantitation
- Are blind to emerging contaminants
- Can’t keep pace with needs of 21st century risk characterizations
• Data gaps being filled with exposure models and “NTA” methods
5. Relevant Questions of NTA Studies?
• Which chemicals are where?
• Do we see any “new” chemicals?
• Do observed co-occurrences highlight:
– Important exposure sources?
– Stressor-response relationships?
• What is the concentration of each chemical?
• Do estimated concentrations suggest unacceptable risk?
• How does cheminformatics support this effort?
6. Everything is underpinned by the
DSSTox Database
• >1.2M substances
• Highly curated data
• Mapped relationships
• The data are made
available via the
Dashboard…
10. Batch Searching is a big enabler
https://pubs.acs.org/doi/10.1021/acs.jcim.0c01273
11. Batch Searching
• Singleton searches are useful but we work
with thousands of masses and formulae!
• Typical questions
– What is the list of chemicals for the formula CxHyOz?
– What is the list of chemicals for a mass +/- error?
– Can I get chemical lists in Excel files? In SDF files?
– Can I include properties in the download file?
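The mass and formula questions above can be sketched in a few lines of Python. This is a minimal illustration, not the Dashboard's implementation: the function names, the 5 ppm default tolerance and the two-entry reference dictionary are assumptions made for the example, and only C, H, N and O are supported.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}


def monoisotopic_mass(formula):
    """Sum isotope masses for a simple formula such as 'C14H22N2O3'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def mass_window(mass, ppm):
    """Return the (low, high) bounds of a mass +/- ppm error window."""
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta


def batch_search(query_masses, reference, ppm=5.0):
    """Map each query mass to the reference names whose mass is in its window.

    `reference` is a hypothetical {name: formula} dictionary standing in
    for a real chemical database.
    """
    hits = {}
    for query in query_masses:
        low, high = mass_window(query, ppm)
        hits[query] = [
            name
            for name, formula in reference.items()
            if low <= monoisotopic_mass(formula) <= high
        ]
    return hits


# Hypothetical reference entries, for illustration only.
reference = {"candidate_A": "C14H22N2O3", "candidate_B": "C6H12O6"}
print(batch_search([266.16304], reference))
```

A real batch search would also need to handle adducts, other elements and export formats such as Excel or SDF, none of which are shown here.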
14. Chemical Lists
• Chemical lists are focused on regulations,
specific research efforts and categories
• 425 lists and growing
– TSCA Inventory
– Clean Water Act Hazardous Substances
– Consumer Products database
– Chemicals of Emerging Concern
– PFAS lists
– Extractables and Leachables
– …lists are versioned and updated and new lists added
20. Benefits of bringing it all together
• The true dashboard benefit is integration
• Rank potential candidates for toxicity using
available data – hazard, exposure, in vitro
21. Supporting Exposomics Research
• DSSTox database substances map to
– Their structures (mass/formulae/InChIs etc)
– Hazard data: human, mammalian and ecotox
– Exposure data: products in commerce, categories and
functional use, measured concentrations, etc.
• There are many types of metadata that can
be used for candidate ranking (old approach)
22. Data Source Ranking of
“known unknowns”
• A mass and/or formula search is
for an unknown chemical but it
is a known chemical contained
within a reference database
• Most likely candidate chemicals
have the most associated data
sources, most associated
literature articles or both
[Figure: a query (formula C14H22N2O3, mass 266.16304) is searched against a chemical reference database, returning sorted candidate structures]
23. Data Streams for Ranking
• Dashboard Data Sources
• PubChem Data Source Count
• PubMed Reference Count
• ToxCast in vitro bioactivity
• Presence in Consumer Products database
• Predicted physicochemical properties
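As a rough sketch of data source ranking, candidates returned by a mass or formula search can be sorted so that those with the most data sources come first, with literature count as a tie-breaker. The `Candidate` class, its field names and the counts below are hypothetical; the slides do not specify how the individual data streams are weighted, so this ordering is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    data_sources: int  # e.g. number of associated data sources
    pubmed_refs: int   # e.g. number of associated literature articles


def rank_candidates(candidates):
    """Sort candidates by data source count, then by literature count."""
    return sorted(
        candidates,
        key=lambda c: (c.data_sources, c.pubmed_refs),
        reverse=True,
    )


# Hypothetical candidates sharing one formula; the counts are made up.
candidates = [
    Candidate("isomer_1", data_sources=3, pubmed_refs=10),
    Candidate("isomer_2", data_sources=12, pubmed_refs=2),
    Candidate("isomer_3", data_sources=12, pubmed_refs=40),
]
for candidate in rank_candidates(candidates):
    print(candidate.name, candidate.data_sources, candidate.pubmed_refs)
```

Other streams from the list above, such as bioactivity or product presence, could be appended to the sort key in the same way.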
24. BIG databases are GREAT!
[Figure: chart of the number of chemical substances (log scale, 10^4 to 10^9) in PubChem, CAS Registry, ChemSpider, EPA DSSTox and Blood Exposome]
• Thanks to all of the public database efforts
• So much benefit from what’s been done
• There are hundreds of them at this point…
25. Is a bigger database better?
• ChemSpider was 26 million chemicals for
the original work
• Much BIGGER today
• Is bigger better??
• Are there other metadata to use for ranking?
26. Comparing Search Performance
• When dashboard contained 720k chemicals
• Only 3% of ChemSpider size
• What was the comparison in performance?
27. How did performance compare?
For the same 162 chemicals,
Dashboard outperforms
ChemSpider for both Mass and
Formula Ranking
31. PubChem – “virtual chemistry”
• Other databases grow quickly…a lot of “virtual
chemistry” and “make on demand” compounds.
• Efforts such as the BloodExposome and
PubChemLite are critical to focus efforts
32. Applications at the EPA
• We have ongoing efforts applying
NTA to multiple challenges including
– PFAS identification
– Pesticides in various matrices
– CECs in water
– Biosolids
• Examples include…
34. Example 1: Consumer Product Analysis
Many chemicals
observed in
consumer product
extracts
More observed
chemicals not
known to be in
consumer products
Why might the
‘other’ chemicals be
in the products?
Many observed
chemicals known to
be in consumer
products
36. Example 2: Recycled Product Analysis
Significant differences
between chemicals in
recycled vs. virgin products
for certain product & use
categories
Most differences observed in
paper products and
construction materials
Some uses (e.g., fragrances)
highly represented across all
product/use categories
38. Supporting Exposomics Research
• DSSTox database substances map to
– Their structures (mass/formulae/InChIs etc)
– Hazard data: human, mammalian and ecotox
– Exposure data: products in commerce, categories
and functional use, measured concentrations, etc.
• Structures have to be standardized…
39. “MS-Ready Chemicals”
• MS-Ready chemical standardization is ESSENTIAL to our
support of Non-Targeted Analysis
• It links chemicals across the Dashboard and facilitates
detection linking back to products in commerce
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0299-2
40. Predicted Mass Spectra
http://cfmid.wishartlab.com/
• MS/MS spectra prediction for ESI+, ESI-, and EI
• Predictions generated for MS-Ready structures
• Use experimental vs predicted spectral searches
for candidate identification
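One common way to compare an experimental spectrum with predicted spectra is a binned cosine similarity. The sketch below is illustrative only: the slides do not say which scoring function is used, it does not call CFM-ID, and the function names, bin width and peak lists are assumptions. Spectra are given as lists of (m/z, intensity) pairs.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=0.01):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[round(mz / bin_width)] += intensity
    return binned


def cosine_similarity(spec_a, spec_b, bin_width=0.01):
    """Cosine similarity between two spectra after binning (0.0 to 1.0)."""
    a = bin_spectrum(spec_a, bin_width)
    b = bin_spectrum(spec_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_spectral_match(experimental, predicted_by_candidate):
    """Return (candidate, score) pairs sorted from best to worst match."""
    scores = [
        (name, cosine_similarity(experimental, predicted))
        for name, predicted in predicted_by_candidate.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical peak lists, for illustration only.
experimental = [(100.05, 50.0), (150.10, 100.0)]
predicted = {
    "candidate_A": [(100.05, 40.0), (150.10, 100.0)],
    "candidate_B": [(200.00, 10.0)],
}
print(rank_by_spectral_match(experimental, predicted))
```

Fixed-width binning is the simplest choice; a tolerance-based peak matching scheme would avoid splitting near-identical peaks across a bin boundary.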
41. Predicted Data Already Public
Publication and Data Files
https://epa.figshare.com/articles/CFM-ID_Paper_Data/7776212/1
45. Candidate Identification is only
PART of the process
• Whatever the approach for candidate
identification, chemical hazard is important
• Hazard Comparison Profiling is important
https://www.epa.gov/chemical-research/cheminformatics
47. AMOS: Analytical Methods and
Spectra Database
• Three types of data in the database:
– Methods (regulatory, lab manuals and SOPs, publications,
tech notes)
– Spectra (from public domain and our own laboratories)
– Fact Sheets (harvested from SWGDRUG and other sites)
• Some methods have associated spectra
• Some data are just externally linked
• Currently contains around 200,000 spectra,
700,000 external links, 3000 “Fact Sheets”
and ~4000 methods
• ALL data are growing in number
52. Linking to actual spectra
• We are doing a lot of chemical curation as
we build the database
53. Rules need optimizing for
MS-Ready standardization
• We can now add/tweak the rules…add new
rules, edit existing rules
54. Example: Tautomer Rules
• We control rules for
– Tautomers
– Mesomers
– Neutralize/De-radicalize
– Break salts
– Standard checks
– etc….
• Necessary for mapping
chemicals in DSSTox
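To illustrate how a standardization rule can link substances, the toy sketch below applies only a "break salts" step and groups substances that share the resulting structure. It is not the MS-Ready workflow: the counterion list and function names are hypothetical, SMILES strings are compared as plain text (so inputs are assumed to be canonical already), and tautomer, mesomer and neutralization rules are not attempted.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny counterion list; real rule sets are far larger.
COUNTERIONS = {"Cl", "Br", "[Na+]", "[K+]", "[Cl-]"}


def strip_counterions(smiles):
    """Drop listed counterion fragments from a dot-separated SMILES string."""
    fragments = smiles.split(".")
    kept = [fragment for fragment in fragments if fragment not in COUNTERIONS]
    # If every fragment is a listed counterion, leave the input unchanged.
    return ".".join(kept) if kept else smiles


def group_by_stripped_structure(substances):
    """Group substance names by their counterion-stripped SMILES."""
    groups = defaultdict(list)
    for name, smiles in substances.items():
        groups[strip_counterions(smiles)].append(name)
    return dict(groups)


substances = {
    "methylamine": "CN",
    "methylamine hydrochloride": "CN.Cl",
}
print(group_by_stripped_structure(substances))
```

Both substances map to the same stripped structure, which is the kind of structure-substance link the standardization rules are meant to provide.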
55. Manual Curation and Annotation
Analytical QC data for Tox21
• ~9000 chemicals with tens of thousands of
spectra (LCMS, GCMS & NMR)
• These data will feed prediction algorithms…
60. Conclusions
• Our data resources underpin our research
efforts – data quality and curation are key
• Our web-based applications deliver our data
to the community for multiple use cases
• Our support for Exposomics is multi-fold
– Curated chemistry data streams
– Experimental and predicted properties, toxicity, etc.
• The NTA WebApp in development will use
all of these data streams to support analysis
61. Acknowledgments
• DSSTox curation team
• CCTE IT team for software development, DevOps
• Mass spectrometry scientists across EPA,
especially the NTA team
• Open Databases: PubChem, ChEMBL, MoNA,
MassBank, GNPS, SWGDRUG, Cayman Chemical
• Instrument vendors – many have contributed
methods to the AMOS database
• …and thank you for your time
62. Contact Information
• Contact info: williams.antony@epa.gov
• We fully support Open Data so ask us for what
you need