Big Data Helps Elucidate Natural Product Structures

Cheminformatics and the Structure Elucidation of Natural Products
(or can Big Data help elucidate structures?)
Antony Williams
5th Brazilian Conference of Natural Products
October 27th, 2015
ORCID ID: 0000-0002-2668-4821

A Bit About Me…
• NMR spectroscopist by training
• Chief Science Officer ACD/Labs Software
• One of the founders of the ChemSpider database
• VP for Cheminformatics at RSC

Why is this important?
• Structure verification and elucidation of
1000s of compounds
• NMR predictors with >2,000,000 shifts &
Computer-Assisted Structure Elucidation
• Made >20,000,000 chemical compounds
& data freely accessible to the community
• Grew the dataset to >30,000,000
chemicals & used it for structure elucidation
• Big data can assist structure identification

The Agenda…
• Dereplication using prior knowledge
• The increasing prevalence of online content
• Data generation is not the issue. Analysis is.
• Computer-assisted structure elucidation
• New experiments to improve elucidation
• Rethink data-sharing through publications!

…for each natural product dereplicated, at an
average cost of $300 … a savings of $50,000 is
incurred in isolation and identification time.

Dereplication
• There are ca. 200,000 known natural products
• The chance for rediscovery is very high!
• We need efficient “dereplication” processes
• Most general approach – acquire analytical
data and search existing databases…

Scale of Dereplication Exercise
0.5 – 2 mg extract
4 mL agar slope Petri dish
Bioassay & HPLC/UV/MS/NMR evaluation
100 mg sponge
With gratitude to John Blunt

Approaches to Dereplication
For each compound isolated:
• Desirable to know:
  Taxonomy of organism
  Molecular Wt/formula
  UV Spectrum
  1H NMR Spectrum
  [13C NMR Spectrum if possible]
• Identify as known or new compound. If known STOP.
• If new then acquire data:
  1D and 2D NMR array, MS with fragmentation, IR, [α]D, ORD
• Fully elucidate structure

What Databases are Available?
Public
ChemSpider
CSLS
PubChem
NMRShiftDB
Naproc-13
SuperNatural
SDBS
Private
All Pharma
GVK Biosciences NPD
UC UV DB
DTU UV DB
Marine NP DB
GVK NP DB
InterMed UV DB
InterMed NMR DB
Novartis IR DB
Natl. Centre Plant Metabol.
CH-NMR-NP
Commercial
SciFinder
SpecInfo
(Crossfire) Beilstein
Crossfire Gmelin
Reaxys
ACD Spectral Libraries
NaprAlert
Dict. Natural Products
Dict. Marine Nat. Prods
AntiBase
MarinLit
AntiMarin
With gratitude to John Blunt

Nominal Mass Searching
[Figure: TOF MS ES+ spectrum of sample PU10-F2, m/z 220-560, with the M+H peak labeled]
MW 480
MF = C28H36N2O5
Search MW = 480 in Dict. Nat. Prod.
562 hits out of 230,000 compounds!!!

Molecular Formula Searching
Search MF=C28H36N2O5
in Dict. Nat. Prod.
2 hits out of 230,000
compounds!!!
Compare UV spectrum and 1H NMR features
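
The two search keys above, nominal mass and molecular formula, can be related with a few lines of Python. This is a minimal sketch, not the search code of any database: the function names are hypothetical, only C, H, N and O are tabulated, and the isotope masses are standard values rounded to eight decimals.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes, and nominal masses.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}
NOMINAL = {"C": 12, "H": 1, "N": 14, "O": 16}


def parse_formula(formula):
    """Parse a simple formula such as 'C28H36N2O5' into {element: count}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def nominal_mass(formula):
    """Integer mass used for a nominal mass search."""
    return sum(NOMINAL[el] * n for el, n in parse_formula(formula).items())


def monoisotopic_mass(formula):
    """Exact mass of the all-light-isotope species."""
    return sum(MONOISOTOPIC[el] * n for el, n in parse_formula(formula).items())


print(nominal_mass("C28H36N2O5"))
print(round(monoisotopic_mass("C28H36N2O5"), 4))
```

Many formulae share one nominal mass, which is why the formula search is so much more selective than the MW = 480 search.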

How many isomers for a formula?
C10H17Br2ClO2: 50,502,293
C15H22O2: 138,136,211,624
C15H20O: 37,568,150,635
C12H12O3: 68,930,547,646
C13H20O3: 14,431,269,166
C11H12N2O2: 3⋅10^11 < n < 10^12

1H NMR spectrum, CD3OD
[Figure: 1H NMR spectrum, 1-7 ppm, with integrals]
Annotated features: 1 x triplet methyl, 3 x methoxy, 3 x olefinic H, solvent

NMR Features Dereplication
• 1 of 5 hits from 230,000 compounds
• The ONLY hit if MW = 480 included

Dereplication in MarinLit Online
• Can be achieved using
• 1H NMR features e.g. number of Me groups
• 13C and 1H chemical shifts
• Molecular formula (complete or partial)
• UV maxima
• Exact mass
• OR a combination of any or all of the above.
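
Combining criteria amounts to applying successive filters to the same set of records. The sketch below illustrates the idea only: the `Record` fields, the three library entries and the `dereplicate` helper are all hypothetical and do not reflect the MarinLit schema or its search interface.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str
    formula: str
    methyl_singlets: int
    methyl_doublets: int
    methoxy_singlets: int


# Hypothetical library entries, invented purely for illustration.
LIBRARY = [
    Record("compound A", "C30H50O5", 4, 4, 1),
    Record("compound B", "C30H50O5", 6, 2, 0),
    Record("compound C", "C28H36N2O5", 4, 4, 1),
]


def dereplicate(library, **criteria):
    """Return the records whose attributes equal every supplied criterion."""
    return [
        record for record in library
        if all(getattr(record, key) == value for key, value in criteria.items())
    ]


# First pass: 1H NMR features only.
first_pass = dereplicate(LIBRARY, methyl_singlets=4, methyl_doublets=4,
                         methoxy_singlets=1)
# Second pass: add the molecular formula to the same query.
second_pass = dereplicate(first_pass, formula="C30H50O5")

print([r.name for r in first_pass])
print([r.name for r in second_pass])
```

Each added criterion can only shrink the hit list, so cheap features are worth applying before acquiring more data.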

1H NMR Spectrum - new or known?
9 Me groups are obvious (from integrals)
Search of MarinLit: 9 Me gave 628 answers

Characterizing the spectrum further
4 Me singlets
4 Me doublets
1 OMe singlet
Aromatic protons
Search MarinLit for 9 total methyls: 4 singlets, 4 doublets, 1 OMe gave 39 answers

COSY spectrum
This implies a 1,2,4-trisubstituted aromatic system
A broad singlet coupled/on-coupled to 2 doublets

4 Me singlets 4 Me doublets
1 OMe singlet
4 singlets, 4 doublets, 1 OMe, 1,2,4-trisubstituted aromatic
2 answers only

Comparison of NMR data confirmed that the unknown had this structure

Commercial Assigned Databases
>320,000 assigned
chemical structures
>2,500,000 shifts

Searching Assigned Databases
• mI = 306.1 – 306.2
• 591/322,319 hits

Searching Assigned Databases
• 10 13C shifts to +/- 3.0 ppm
• 5 1H shifts to +/- 0.3 ppm
• 7 hits – very different
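
A shift search with a tolerance can be pictured as pairing each query shift with an unused reference shift that lies within the window. The following is a simplified sketch of that idea, not the algorithm of any commercial product; the function name and the example shift lists are hypothetical.

```python
def count_matched_shifts(query, reference, tolerance):
    """Count query shifts paired one-to-one with reference shifts that lie
    within +/- tolerance ppm (greedy pairing over the sorted lists)."""
    query, reference = sorted(query), sorted(reference)
    matched = 0
    j = 0
    for q in query:
        # Skip reference shifts too far upfield of q to fall in the window.
        while j < len(reference) and reference[j] < q - tolerance:
            j += 1
        if j < len(reference) and abs(reference[j] - q) <= tolerance:
            matched += 1
            j += 1  # each reference shift is used at most once
    return matched


# Hypothetical 13C shifts (ppm) for a query and one database entry.
query_13c = [17.6, 20.2, 174.1]
entry_13c = [18.1, 19.5, 80.4, 173.5]

print(count_matched_shifts(query_13c, entry_13c, tolerance=3.0))
print(count_matched_shifts(query_13c, entry_13c, tolerance=0.3))
```

An entry would count as a hit when every query shift finds a partner. A wide window such as 3.0 ppm for 13C tolerates solvent and referencing differences at the cost of more hits.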

Including 15N, 19F and 31P data

Experimental vs. Experimental
Differences between C13 shifts are generally small

Searching experimental data
30 seconds from peak-picking to suggested molecules

Experimental vs. Predicted
Differences between exp. and pred. C13 shifts can be
larger – useful to limit number of shifts searched

The Agenda…
• Dereplication using prior knowledge
• Increasing prevalence of free online content
• Data generation is not the issue. Analysis is.
• Computer-assisted structure elucidation
• New experiments to improve elucidation
• Rethink data-sharing through publications!

Online content also available!
NMRShiftDB http://nmrshiftdb.nmr.uni-koeln.de/

Online content also available!
www.nmrdb.org

Mining Big Data for Natural Products???
• ~35 million chemicals and growing
• Data sourced from ~500 different sources
• Structure-centric hub for web-searching
• Already used by many mass spectrometry software packages for structure ID

ChemSpider Interface – no NMR

26 hits out of 35,000,000
Ranked by # of References

What can I find on ChemSpider?

What can I find? All for free…

NMR Predictions on ChemSpider
Data for Dereplication
Compound 1 Compound 2
• fC = full composition (C0-100 H0-100 O0-20 N0-10)
• lC = limited composition (C10-30 H25-40 O0-15 N0-5)

Large Fragments can be found
Top 2 hits searched by 1H chemical shifts. Hits ranked by the 1H NMR deviation and filtered with C10-30 H25-40 O0-15 N0-5, Good List and Bad List. Good List was determined from 1H shifts, integrals and 1H-1H COSY.

Marine Natural Product Example
• Search nominal mass 490-491 gave the following results:
ChemSpider: 46,234
SciFinder: 171,904
Dictionary of Natural Products: 537
Dictionary of Marine Natural Products: 90
MarinLit: 94
AntiMarin: 131
• Molecular formula obtained C30H50O5 (490.3658):
ChemSpider: 208
SciFinder: 2,366
Dictionary of Natural Products: 238
MarinLit: 43
AntiMarin: 48
Focused Datasets Valuable
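
The hit counts quoted for this example can be turned into a reduction factor per database, which makes the point about focused datasets quantitative. This is plain arithmetic on the numbers above; the dictionary layout and the function name are illustrative choices.

```python
# Hit counts from the example: (nominal mass 490-491, formula C30H50O5).
HITS = {
    "ChemSpider": (46234, 208),
    "SciFinder": (171904, 2366),
    "Dictionary of Natural Products": (537, 238),
    "MarinLit": (94, 43),
    "AntiMarin": (131, 48),
}


def reduction_factor(mass_hits, formula_hits):
    """How many times smaller the hit list becomes with the formula."""
    return mass_hits / formula_hits


for name, (mass_hits, formula_hits) in HITS.items():
    factor = reduction_factor(mass_hits, formula_hits)
    print(f"{name}: {mass_hits} -> {formula_hits} ({factor:.1f}x fewer)")
```

The general-purpose collections shrink by one to two orders of magnitude, while the focused natural product databases start small and shrink only two- to three-fold.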

Approaches to Dereplication
For each compound isolated:
• Desirable to know:
  Taxonomy of organism
  Molecular wt/formula
  UV Spectrum
  1H NMR Spectrum
  [13C NMR Spectrum]
• Identify as known or new compound. If known STOP.
• If new then acquire data:
  1D and 2D NMR array, MS with fragmentation, IR, [α]D, ORD
• Fully elucidate structure

Modern NMR Technologies
• Even a basic array of 1D/2D experiments can
provide the relevant data in the majority of cases
• The past few years have seen improvements in:
• Hardware: Magnets, Probes and RF
• Software: Data acquisition and processing
• Pulse sequences to probe direct and (very) long-range homo- and heteronuclear correlations

Magnetic Field Strength over time

NMR Developments – 30 years of improvements
• 1984 – First report of cryogenic NMR probe
• 1986 – HMBC experiment reported
• 1991 – First commercial 3 mm gradient inverse probes.
• 1996 – ADEQUATE NMR experiments first reported.
• 1996 – 1H-15N HMBC applications reported.
• 1998 – Commercial 1.7 mm gradient inverse triple probes.
• 1999 – First commercial cryogenic NMR probes delivered.
• 2000 – First 3 mm prototype cryoprobe developed.
• 2006 – First 1.7 mm MicroCryoProbes™ delivered.
• 2009 – Pure shift HSQC experiments developed.
• 2014 – 1,1- and 1,n-HD-ADEQUATE experiments
With gratitude to Gary E. Martin

COSY Correlations
Vicinal H-H couplings
Geminal H-H couplings
[Figure: numbered structure annotated with COSY correlations]

HMBC Correlations (8 Hz Optimized)
[Figure: numbered structure annotated with HMBC correlations]

Always new sequences coming:
1,1- and 1,n-HD-ADEQUATE
Examples show all three scenarios for 1,1- and 1,n-HD-ADEQUATE
correlations for cryptospirolepine.

Adoption can take a long time
HSQC vs. HMQC took > 20 years!
• HMQC is an older technique and affords lower F1 resolution.
• HSQC is a better technique but SLOWLY supplanted HMQC!
Year Range   #HMQC reports   #HSQC reports
1990-94      52              10
1995-99      177             39
2000-04      346             111
2005-09      358             266
2010-14      345             423
Totals       1278            849
From: A. Williams, G.E. Martin, & D.J. Rovnyak, “Increasing the Adoption of Advanced Techniques for the Structure Elucidation of Natural
Products,” from Modern NMR Approaches to the Structure Elucidation of Natural Products, vol. 1, A.J. Williams, G.E. Martin, and D.J. Rovnyak,
Eds., RSC, London, 2015.
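
The slow crossover is easier to see as the HSQC share of reports in each period. The snippet below only re-expresses the counts in the table; the variable and function names are illustrative.

```python
# Report counts per period from the table: (HMQC reports, HSQC reports).
REPORTS = {
    "1990-94": (52, 10),
    "1995-99": (177, 39),
    "2000-04": (346, 111),
    "2005-09": (358, 266),
    "2010-14": (345, 423),
}


def hsqc_share(hmqc, hsqc):
    """Fraction of the combined HMQC + HSQC reports that used HSQC."""
    return hsqc / (hmqc + hsqc)


for period, (hmqc, hsqc) in REPORTS.items():
    print(f"{period}: HSQC share {hsqc_share(hmqc, hsqc):.0%}")

# First period in which HSQC reports outnumber HMQC reports.
first_majority = next(p for p, (hmqc, hsqc) in REPORTS.items() if hsqc > hmqc)
print(first_majority)
```

HSQC reports first outnumber HMQC reports in 2010-14, the last period in the table.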

50 years of iterative development
DENDRAL
NMR-SAMS
SENECA
SpecInfo
ACD/Labs
CMC-SE
LSD
Others…

Computer Assisted Structure
Elucidation: Methodology
• Interpret data to extract knowledge
• Molecular Formula
• Integrals
• Chemical shifts
• Multiplicity
• Connectivity
• Known fragments
• Known exclusions
• Search structure space to derive all structures
• Rank-order based on set criteria
• Predicted chemical shift
• Mass Spec Fragmentation

Remember how many isomers
C10H17Br2ClO2: 50,502,293
C15H22O2: 138,136,211,624
C15H20O: 37,568,150,635
C12H12O3: 68,930,547,646
C13H20O3: 14,431,269,166
C11H12N2O2: 3⋅10^11 < n < 10^12

Computer-Aided Structure Elucidation
• Eliminate “superfluous” isomers by
imposing different structural constraints
• Structural constraints are from:
• Spectral data of various types:
• NMR shifts/multiplicity constrain atom
types; Correlations constrain connectivities
• MS constrains formula and fragments
• IR constrains functional groups
• Prior information – sample origin
• Chemical rules – valence, ring size,
charge, etc.
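
One simple valence-based constraint is the number of rings plus double bonds implied by the formula alone. It is not named on the slide and is shown here only as an example of a chemical rule. The sketch uses the standard formula (2C + 2 + N - H - X)/2 and assumes only C, H, N, O and halogens with their usual valences; the function name is hypothetical.

```python
import re


def ring_double_bond_equivalents(formula):
    """Rings plus double bonds from the formula: (2C + 2 + N - H - X) / 2.

    Assumes normal valences; oxygen (divalent) does not change the count.
    """
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    carbons = counts.get("C", 0)
    nitrogens = counts.get("N", 0)
    monovalent = sum(counts.get(el, 0) for el in ("H", "F", "Cl", "Br", "I"))
    return (2 * carbons + 2 + nitrogens - monovalent) / 2


for formula in ("C10H17Br2ClO2", "C15H22O2", "C11H12N2O2"):
    print(formula, ring_double_bond_equivalents(formula))
```

Any generated isomer must account for exactly this many rings and double bonds, so the value is fixed before a single spectrum is interpreted.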

2D NMR spectra: Extraction of Structural Information: COSY/HMBC
• COSY: 1H-1H coupling through 3 bonds
• HMBC: 1H-13C coupling through 2/3 bonds
[Figure: structure annotated with 13C chemical shifts (17.60 to 174.10 ppm) and COSY/HMBC correlations]

1D & 2D NMR Synchronized Processing
The software displays correlations for assigned spectra and structures, and highlights
correlations that are likely to be erroneous.

Molecular Connectivity Diagram (MCD)
Molecular Formula C20H30O4
Atoms in the MCD (13C shift in ppm, with atom labels such as (fb) as shown in the diagram):
• CH3: 17.60(fb), 18.13(fb), 20.20(fb), 31.40(fb)
• CH2: 18.09(fb), 19.10(fb), 19.50(fb), 19.50(fb), 28.20(fb), 29.20(fb), 41.20(fb)
• CH: 34.30(fb), 42.20(fb), 63.30
• C: 33.40(fb), 61.20, 67.80, 68.10, 80.40, 174.10
• Heteroatoms: OH, O, O, O
Use the spectroscopist's experience to add bonds:
Create C=O, COOH, ring systems, etc.
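
Before any bonds are added, the atoms in the MCD should account for the molecular formula. The check below uses the multiplicities and 13C shifts from the diagram; the data layout and the function name are illustrative and are not part of any software.

```python
from collections import Counter

# Carbons of the MCD as (attached hydrogens, 13C shift in ppm), from the diagram.
CARBONS = [
    (3, 17.60), (2, 18.09), (3, 18.13), (2, 19.10), (2, 19.50), (2, 19.50),
    (3, 20.20), (2, 28.20), (2, 29.20), (3, 31.40), (0, 33.40), (1, 34.30),
    (2, 41.20), (1, 42.20), (0, 61.20), (1, 63.30), (0, 67.80), (0, 68.10),
    (0, 80.40), (0, 174.10),
]
OXYGENS = 4      # one OH plus three further O atoms
HYDROXYL_H = 1   # the hydrogen of the OH group


def mcd_composition(carbons, oxygens, heteroatom_hydrogens):
    """Sum the atoms present in the MCD into an element count."""
    hydrogens = sum(h for h, _ in carbons) + heteroatom_hydrogens
    return Counter(C=len(carbons), H=hydrogens, O=oxygens)


composition = mcd_composition(CARBONS, OXYGENS, HYDROXYL_H)
print(dict(composition))
```

The result is C20H30O4: the 29 carbon-bound hydrogens plus the hydroxyl hydrogen match the stated molecular formula.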

Not that easy though…
“Nonstandard Correlations”
“Standard” and “Nonstandard” correlations are experimentally indistinguishable
If 2D NMR data contain both “Standard” and “Nonstandard” correlations we see contradictions in interpretation
[Figure: carbon chain schematic showing standard COSY and HMBC correlations]

Non-standard Correlation Example
[Figure: numbered structure (atoms 1-22) showing two 6-bond correlations]

Strychnine Non-standard Correlations
[Figure: numbered strychnine structures annotated with nJCH correlations (2JCH, 3JCH, 4JCH, 5JCH)]

Structure Generation combined with
Structural and Spectral Filtering
• Internal Badlist
• User Badlist
• User Goodlist
• Rings: Obligatory, Forbidden
• Bredt’s Rule
• Maximum Match Factor
• Filter Tolerance: Tight, Medium, Loose

Selection of the Preferable Structure
• Remove duplicates
• 1H and 13C shift calculation for all output structures
• Rank structures in ascending order of average
chemical shift deviation
• Structure with minimum d is the most probable.
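
The ranking step can be sketched as follows. This is a simplified illustration, not the procedure of any CASE program: real software compares shifts atom by atom using assignments, whereas this sketch pairs experimental and predicted shifts in sorted order. The candidate names and predicted shifts are hypothetical.

```python
def mean_shift_deviation(experimental, predicted):
    """Average |experimental - predicted| (ppm), pairing shifts in sorted order.

    Sorted-order pairing is a simplifying assumption of this sketch.
    """
    if len(experimental) != len(predicted):
        raise ValueError("shift lists must have the same length")
    pairs = zip(sorted(experimental), sorted(predicted))
    return sum(abs(e - p) for e, p in pairs) / len(experimental)


def rank_candidates(experimental, candidates):
    """Return (deviation, name) tuples in ascending order of deviation."""
    scored = [
        (mean_shift_deviation(experimental, shifts), name)
        for name, shifts in candidates.items()
    ]
    return sorted(scored)


# Hypothetical experimental 13C shifts and predictions for two candidates.
EXPERIMENTAL = [17.6, 61.2, 174.1]
CANDIDATES = {
    "candidate 1": [18.0, 60.0, 173.0],
    "candidate 2": [25.0, 70.0, 160.0],
}

ranking = rank_candidates(EXPERIMENTAL, CANDIDATES)
best_deviation, best_name = ranking[0]
print(best_name, round(best_deviation, 2))
```

The first entry of the ranking is the structure with the minimum average deviation d.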

Low Structural Information in 2D
Spectral Data: Use Fragment DB
• Number of observed 2D NMR correlations is
smaller than expected
• Deficit of hydrogen atoms results in a low number of
correlations
• Search in Fragment Library using the 13C NMR
spectrum and embed in the MCD

Example of Fragment Usage.
Symmetric molecule C56H78O12S1
Ashwagandhanolide
Small number of correlations
[Figure: structure diagrams annotated with 1H chemical shifts]

13C NMR Fragment search - 5524 found
Fragment # 1: C17H22O2
[Figure: experimental (Exp.) vs. fragment (Frag.) 13C NMR spectra]

Solution
• 960 MCDs were created using fragment #1
• Structure Generation from 960 MCDs gave 24
structures after filtering and 6 output structures.
• Total time was tg= 29 m 30 s

Wrong Molecular Formula
Only CHNO in formula assumed
J. Am. Chem. Soc., 2001, 123, 10870-10876.
Tetrahedron Letters, 2002, 43, 5707-5710.
FAB-MS: C31H54N4O8 ESI-MS: C31H54N4SO6
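
The two formulae differ only by replacing O2 with S, and that swap changes the mass very little. The arithmetic below is an illustration of how close the two candidates are, not a reconstruction of the published analysis; isotope masses are standard values rounded to eight decimals and the function name is hypothetical.

```python
# Monoisotopic masses (u) of the most abundant isotopes.
MONOISOTOPIC = {
    "C": 12.0, "H": 1.00782503, "N": 14.00307401,
    "O": 15.99491462, "S": 31.97207117,
}


def monoisotopic_mass(composition):
    """Exact mass for an element-count mapping such as {'C': 31, 'H': 54}."""
    return sum(MONOISOTOPIC[el] * n for el, n in composition.items())


fab_ms = {"C": 31, "H": 54, "N": 4, "O": 8}          # C31H54N4O8
esi_ms = {"C": 31, "H": 54, "N": 4, "O": 6, "S": 1}  # C31H54N4SO6

difference = monoisotopic_mass(fab_ms) - monoisotopic_mass(esi_ms)
ppm = difference / monoisotopic_mass(esi_ms) * 1e6

print(round(monoisotopic_mass(fab_ms), 4), round(monoisotopic_mass(esi_ms), 4))
print(round(difference, 4), round(ppm, 1))
```

Both formulae have nominal mass 610 and differ by less than 0.02 u, so a CHNO-only formula can fit the measured mass well while still being wrong.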

Wrong Initial Suggestion
13C shift at 173.50 ppm is O-C=O group
J. Nat. Prod., 2000, 63, 1677-1678.
J. Nat. Prod., 2003, 66, 716-718.
13C signal at 173 ppm led to COO bias
Data compared to a similar compound

Misinterpretation of 2D NMR Data
Presence of a guanidine group substituted with 2xCH3 groups
was hypothesized. Absence of an expected HMBC correlation
from methyls to C(159.0) ignored.
J. Org. Chem., 2004, 69, 9025-9029.
J. Org. Chem., 2008, 73, 8719-8722.
Misinterpreted HMBC signal
Verified by X-ray crystallography

Number of Skeletal Atoms
J. Cheminf. 2012, 4:5

MW Distribution
J. Cheminf. 2012, 4:5

New Experiments Influence CASE!
Cervinomycin
[Figure: numbered cervinomycin structure and its molecular connectivity diagram with (ob)/(fb) atom labels]

The Influence of Data on Elucidation Time: Cervinomycin
Columns: COSY, HSQC | 1H-13C HMBC (8 Hz, 4 Hz) | 1H-13C LR-HSQMBC (4 Hz, 2 Hz) | Structure Generation Time | # of Structures Generated
+ + + 49 h 314
+ + + + 37 h 4
+ + + + 150 s 7
+ + + + + 104 s 1

New Experiments
Cryptospirolepine over 20 years!
Inexplicably, the vinyl proton has no evident 2JCH correlation to the carbonyl! DFT predicted ~0.3 Hz coupling!
Synergistic interpretation and CASE applied to an array of 2D data elucidated this compound. Included new 1,1-ADEQUATE and 1,n-ADEQUATE data.
The absence of a 2JCH correlation from the vinyl proton to the adjacent carbonyl is perplexing.
A new long-range heteronuclear correlation NMR experiment was acquired: LR-HSQMBC.

Key 1,1-HD-ADEQUATE Correlations
• Experiment was optimized for 60 Hz
• Typical range for 1JCC sp2 couplings is 60-75 Hz
• The 2JCC coupling from C13 to C1/C11’ was calculated (DFT) to be 15.4 Hz, which would give a calculated intensity of 0.16 in this experiment.

Key 1,n-HD-ADEQUATE Correlations
• Experiment optimized for 7 Hz
• Typical range for nJCC couplings is approximately 2-7 Hz
• 2JCC correlations across carbonyls are typically 10-16 Hz
• Correlations were observed, including the 1JCC correlations from C13 to C2 and C13a that unavoidably “leak” into all 1,n-ADEQUATE spectra.

Revision of the [7.5.5] Core of
Cryptospirolepine to a [6.6.5] System
• Based on correlations from the 1,1- and 1,n-HD-ADEQUATE spectra,
the [7.5.5] core shown in red was revised to a [6.6.5] system.
• The γ-lactam was rearranged to a dehydropiperidinone.
• Key correlations were the 1JCC correlations from the vinyl CH to the
flanking carbonyl and quaternary carbons.

Could CASE methods sort out the structure?
Columns: 1,1-ADEQUATE (60 Hz) | 1,n-ADEQUATE (7 Hz) | 1H-13C HMBC (8 Hz, 4 Hz) | IDR HSQC-TOCSY (15 ms) | 1H-13C LR-HSQMBC (2 Hz, 4 Hz) | 1H-15N LR-HSQMBC (2 Hz) | Generation Time (s) | # Structures
+ >420 h >10,400
+ + + 140 6816
+ + + + 142 3360
+ + + + 40 522
+ + + + + 45 258
+ + + + + + + + 7 24
• Modern “1993” data set used as input failed to lead to
the generation of the structure in a 3-week calculation!
• More complete input data reduced the calculation to seconds!

Errors in published structures…

ChemSpider ID 24528095 C13 NMR

What would it take???
• PDFs containing text descriptions of spectra
are problematic for reinterpretation of data
• Publishers should host at least high-resolution
images of all spectra
• Really we need the data files!!!

Conclusions
• Dereplication is increasingly feasible using
online content
• Analysis of data is generally a bigger issue
than data generation itself
• Computer-assisted structure elucidation works
• Data-sharing associated with publications
needs rethinking

Acknowledgements
RSC/ChemSpider/MarinLit
• John Blunt
• Serin Dabb
• Valery Tkachenko
NMR (Book) Collaborators
• Gary Martin
• David Rovnyak
ACD/Labs
• Structure Elucidator
• Mikhail Elyashberg
• Kirill Blinov
• Arvin Moser
• Patrick Wheeler

Thank you
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
