Extracting Synthetic Knowledge from Reaction Databases
Orr Ravitz
SimBioSys Inc.
246th ACS National Meeting
ARChem – main concepts
A computer-aided synthesis design system.
The Approach:
• Comprehensive rule- and precedent-based retrosynthetic analysis back to available starting materials.
• Automated rule generation with manual rule curation.
• Generate many alternatives.
• Provide supporting literature examples.
• Allow user guidance and control.
Solution Display
Exploring Alternative Paths
Supporting Examples
Chemical Interference
Functional groups that may interfere with transformations are highlighted.
Functional Group Tolerance
Breakdown of the example set by the presence of functional groups beyond the reaction center provides evidence for compatibility.
Examples can be exported to the database’s web interface for further analysis.
Stereochemistry
Currently:
• Exact matches
• Starting materials
Coming soon:
• Rule-based
Essential Information
Automated extraction of knowledge:
• Reaction rules
• Yield values
• Chemical interference (functional group tolerance)
• Regioselectivity
• Stereochemistry
Data → (perceive) → Information → (generalize) → Knowledge
System Design
Reactions
Reaction Rules
Starting Materials
Expert Knowledge-bases
Target
Rule Extraction (Reactions → Reaction Rules)
Source reactions: esterification examples, other examples (··· → ···)
Reaction rules: esterification rule, other rule (··· → ···)
Reaction Perception (Reactions → Reaction Rules)
Source reaction: reaction file with atom mapping.
Extracted core: atoms attached to bonds changed, made or broken in the reaction.
Extended core: includes all structural motifs that are essential for the reaction to occur.
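As a rough illustration of the "extracted core" definition above, the sketch below compares bond orders between the two sides of an atom-mapped reaction and collects every atom attached to a bond that was changed, made or broken. The dictionary representation, the `bond` helper and the esterification atom numbering are assumptions made for this example, not ARChem's data model.

```python
# Hypothetical toy representation: each side of the reaction maps a pair of
# atom-map numbers to a bond order. This is not ARChem's file format.
Bonds = dict[frozenset[int], int]


def bond(a: int, b: int) -> frozenset[int]:
    """Key for the bond between mapped atoms a and b (order-independent)."""
    return frozenset((a, b))


def extracted_core(reactant: Bonds, product: Bonds) -> set[int]:
    """Atoms attached to bonds that are changed, made or broken."""
    core: set[int] = set()
    for pair in reactant.keys() | product.keys():
        # A missing bond counts as bond order 0.
        if reactant.get(pair, 0) != product.get(pair, 0):
            core |= pair
    return core


# Esterification with assumed map numbers:
# 1 = carbonyl C, 2 = carbonyl O, 3 = acid OH oxygen,
# 4 = alcohol O, 5 = alcohol C, 6 = carbon alpha to the carbonyl.
reactant = {bond(1, 2): 2, bond(1, 3): 1, bond(1, 6): 1, bond(4, 5): 1}
product = {bond(1, 2): 2, bond(1, 4): 1, bond(1, 6): 1, bond(4, 5): 1}

core = extracted_core(reactant, product)
print(sorted(core))  # [1, 3, 4]
```

Only the C-O bond that breaks and the C-O bond that forms contribute atoms; the carbonyl oxygen (2) is not in the extracted core and would have to be added by the core-extension step.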
Extending the Core: Passengers vs Drivers
The goal of chemical perception is to discriminate between structural features that are essential for the reaction (drivers) and those that are passengers.
Shell-based approach: 1st shell, 2nd shell
Graph-based methods are inappropriate.
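For reference, a purely topological shell expansion can be sketched as a breadth-first walk that adds every atom within a fixed number of bonds of the core. The adjacency-dictionary input and the function name are assumptions for illustration. The sketch also shows why such an expansion cannot tell drivers from passengers: it looks only at bond distance, not at what the neighboring atoms do in the mechanism.

```python
from collections import deque


def extend_by_shells(adjacency: dict[int, set[int]], core: set[int], shells: int) -> set[int]:
    """Return the core plus every atom within `shells` bonds of a core atom."""
    distance = {atom: 0 for atom in core}
    queue = deque(core)
    while queue:
        atom = queue.popleft()
        if distance[atom] == shells:
            continue  # do not expand beyond the requested shell
        for neighbour in adjacency[atom]:
            if neighbour not in distance:
                distance[neighbour] = distance[atom] + 1
                queue.append(neighbour)
    return set(distance)


# Hypothetical linear chain of five atoms: 1-2-3-4-5
chain = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}

print(sorted(extend_by_shells(chain, {1}, 1)))  # [1, 2]
print(sorted(extend_by_shells(chain, {1}, 2)))  # [1, 2, 3]
```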
Mechanism-Dependent Core Extension
Nucleophilic aromatic substitution:
• Addition/elimination mechanism: requires a π acceptor group in the ortho or para position
• Via organometallic intermediate
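The ortho/para requirement of the addition/elimination mechanism can be expressed as a small positional check on a six-membered ring. The substituent labels, the `PI_ACCEPTORS` set and the list-of-positions encoding are assumptions chosen for this sketch. A real perception step would work on the molecular graph.

```python
from typing import Optional

# Illustrative π acceptor labels; the slides do not enumerate them.
PI_ACCEPTORS = {"NO2", "CN", "C=O"}


def has_activating_acceptor(ring: list[Optional[str]], leaving_position: int) -> bool:
    """ring lists the substituent (or None) at each of the six ring positions, in order."""
    n = len(ring)
    # Offsets 1 and 5 are ortho, offset 3 is para on a six-membered ring.
    ortho_para = {(leaving_position + offset) % n for offset in (1, 3, 5)}
    return any(ring[i] in PI_ACCEPTORS for i in ortho_para)


para_nitro = ["Cl", None, None, "NO2", None, None]
meta_nitro = ["Cl", None, "NO2", None, None, None]

print(has_activating_acceptor(para_nitro, 0))  # True
print(has_activating_acceptor(meta_nitro, 0))  # False
```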
Reactions
Reaction Rules
Rule Extraction
Similar extended cores
Completed reaction rule
Common extracted core
Nucleofuge (NF): a leaving group that carries away the bonding electron pair.
Generalized rule
Generalized group (NF) is replaced by the most common group.
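Replacing the generalized group by its most common instance amounts to a frequency count over the examples that support the rule. The sketch below assumes each example has already been reduced to a leaving-group label. The labels and the function name are hypothetical.

```python
from collections import Counter


def most_common_group(leaving_groups: list[str]) -> str:
    """Return the leaving group that occurs most often in the example set."""
    counts = Counter(leaving_groups)
    return counts.most_common(1)[0][0]


# Hypothetical nucleofuges observed across examples with similar extended cores.
observed = ["Cl", "Br", "Cl", "OTs", "Cl", "Br"]

print(most_common_group(observed))  # Cl
```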
Interfering Functionality
Following rule abstraction, compatible functionality is detected by examining the examples:
Compatible
Interfering
• Moieties outside the extended core are listed as compatible.
• Other functional groups are inferred to be “possibly interfering”.
• Possibly interfering functionality is penalized in scoring and highlighted to the user.
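The compatible / possibly interfering split described above can be sketched with set operations: any functional group seen outside the extended core in at least one example counts as compatible, and every other group in a reference vocabulary is flagged. The group names and the `known_groups` vocabulary are assumptions for illustration.

```python
def classify_groups(
    groups_outside_core: list[set[str]], known_groups: set[str]
) -> tuple[set[str], set[str]]:
    """Split a functional-group vocabulary into compatible and possibly interfering."""
    # A group tolerated in any example is taken as evidence of compatibility.
    compatible = set().union(*groups_outside_core)
    possibly_interfering = known_groups - compatible
    return compatible, possibly_interfering


# Hypothetical example set: groups found outside the extended core of each example.
examples = [{"ether", "ester"}, {"nitro"}, {"ether"}]
vocabulary = {"ether", "ester", "nitro", "aldehyde", "thiol"}

compatible, interfering = classify_groups(examples, vocabulary)
print(sorted(compatible))   # ['ester', 'ether', 'nitro']
print(sorted(interfering))  # ['aldehyde', 'thiol']
```

Absence from the example set is only weak evidence, which is consistent with the slide's wording of "possibly" interfering.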
Regioselectivity – Main Steps
• Recognize the rule’s reaction type: electrophilic substitution, nucleophilic addition, etc.
• Only reactions prone to regioselectivity issues are subject to regio calculations.
• Identify competing sites.
• Identify substituents and other structural motifs that may influence the directionality.
• Collect statistics from the example set regarding selectivity in the reaction core as well as elsewhere in the molecule (chemoselectivity).
• Assign regioselectivity to the rule if predefined statistical requirements are met.
Collecting Statistics
Electrophilic aromatic substitutions
For each example in the database:
• Evaluate ring activation, including for heteroaromatic and fused rings.
• Evaluate the location, type and neighborhood of ring substituents.
• Identify symmetry.
• Compute environment signatures that include all aromatic features plus relevant substituents.
For each rule:
• Cluster reacting vs. non-reacting signature-equivalent sites for reactions with yield > 20%.
• Define regioselectivity if the ratio of examples is 10:1.
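The per-rule statistics can be sketched as two counters keyed by environment signature: how often a site with that signature reacted, and how often it was present but did not react. The tuple layout, the signature strings and the reading of "10:1" as "at least ten reacting examples per non-reacting one" are assumptions. Only the 20% yield cut-off and the 10:1 figure come from the slide.

```python
from collections import defaultdict

MIN_YIELD = 20.0  # percent; the slide uses "yield > 20%"
MIN_RATIO = 10.0  # the slide's 10:1, read here as a minimum ratio (assumption)


def regioselective_signatures(examples):
    """examples: iterable of (yield_percent, reacting_signature, non_reacting_signatures)."""
    reacted = defaultdict(int)
    unreacted = defaultdict(int)
    for yield_percent, reacting, competing in examples:
        if yield_percent <= MIN_YIELD:
            continue  # low-yield examples are not used as evidence
        reacted[reacting] += 1
        for signature in competing:
            unreacted[signature] += 1
    preferred = set()
    for signature, hits in reacted.items():
        misses = unreacted.get(signature, 0)
        # With no counter-examples, still require at least MIN_RATIO positive examples.
        if hits >= MIN_RATIO * max(misses, 1):
            preferred.add(signature)
    return preferred


# Hypothetical data: twelve good-yield examples reacting at "para", one 5.5% example at "meta".
data = [(84.0, "para", ["meta"])] * 12 + [(5.5, "meta", ["para"])]

print(regioselective_signatures(data))  # {'para'}
```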
Regio Example
X=Cl, 84%
X=Cl, 5.5%
Rejected
Misinterpreted yield value provided positive evidence
Stereochemistry – the challenge
• Efficient machine perception and representation of a broad range of synthetically important stereogenic types, including tetrahedral C, S, N and P, as well as alkenes, allenes and atropisomers.
• Representation of stereochemical reaction rules and stereochemical strategies.
• Develop a versatile stereochemical substructure algorithm to support retron matching.
• Efficient discovery of symmetry in stereochemically defined molecules and rules, to avoid duplicate routes.
• Stereoselectivity is captured inaccurately and inconsistently across common databases.
The Data
Database content | Portion of data | Notes
Unmapped examples | 14% | Reaction type unknown
Examples belonging to reactions with 5000 or more examples | 4% | Ubiquitous protection / deprotection reactions
Examples belonging to reactions with 20 or fewer examples | 16% | Bad atom maps (database errors); multistep reaction sequences
Generally usable examples | 65% |
[Figure: Examples with quoted selectivity values. x-axis: selectivity metric (yield, cs, de, ee); y-axis: % of database, 0 to 100.]
[Figure: Examples with selectivity above a threshold. x-axis: threshold selectivity values (> 0%, > 25%, > 50%, > 75%, > 90%, > 95%, > 98%); y-axis: % of available, 0 to 100; series: yield, cs, de, ee.]
Stereo-Rules Generation – A Different Approach
• Manually code rules for a diverse set of useful enantioselective and generally selective reaction types.
• Mine supporting examples from existing large reaction databases to discover reaction scope and limitations for each rule.
• Find effective strategies to aid planning of a stereocontrolled synthesis.
Reactions: Diels-Alder, Sharpless, Reduction of C=C, Reduction of C=O
70 reaction types with ee > 95% and more than 50 examples
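One way to read the selection criterion above is "reaction types that have more than 50 examples with ee > 95%". Under that assumed reading, the filter is a count per reaction type. The `(reaction_type, ee)` record layout and the sample data are hypothetical.

```python
from collections import Counter


def high_ee_reaction_types(examples, ee_threshold=95.0, min_examples=50):
    """Return {reaction_type: count} for types with more than min_examples high-ee examples."""
    counts = Counter(
        reaction_type
        for reaction_type, ee in examples
        if ee is not None and ee > ee_threshold
    )
    return {rtype: n for rtype, n in counts.items() if n > min_examples}


# Hypothetical records: (reaction type, reported ee in percent or None if not quoted).
records = (
    [("Diels-Alder", 97.0)] * 51
    + [("Diels-Alder", 60.0)] * 10
    + [("Epoxidation", 99.0)] * 50
    + [("Epoxidation", None)] * 5
)

print(high_ee_reaction_types(records))  # {'Diels-Alder': 51}
```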
Designing a Rule-Set
Reaction type | Bond alterations | Examples with ee ≥95% | Notes
Addition of C nucleophiles to C=C | CH + C=C → CCCH | 1603 | Mostly conjugate additions
Reduction of C=O | C=O → HCOH | 1553 | Any type of carbonyl
Addition of C nucleophiles to C=O | CH + C=O → CCOH | 1265 | Includes mostly aldols + alkynylations
Reduction of C=C | C=C → HCCH | 1120 | Wide variety of environments
Addition of C nucleophiles to C=N | CH + C=N → CCNH | 639 | Any type of C=N
Epoxidation of C=C | C=C → C1CO1 | 415 | Sharpless, Jacobsen, Shi etc.
Addition via R3B to C=C | C-B + C=C → CCCH | 329 | Mostly conjugate addition to enones
Addition via R2Zn to C=O | C-Zn + C=O → CCOH | 306 |
Dihydroxylation of C=C | C=C → HOCCOH | 266 |
Reduction of C=N | C=N → HCNH | 256 | Any type of C=N
Diels-Alder | C=C + C=CC=C → C1CCC=CC1 | 222 | Carbocyclic Diels-Alder
Cyclopropanation of C=C | C=N + C=C → C1CC1 | 222 | Via diazo precursor (carbene)
Mukaiyama Aldol | SiOC=C + C=O → O=CCCOH | 210 |
C substitution of Br | CH + CBr → CC | 199 |
[2+3] azomethine cycloaddition | C=NCH + C=C → N1CCCC1 | 198 |
Addition via R2Zn to C=C | C-Zn + C=C → CCCH | 162 | Mostly conjugate addition to enones
Addition via R3B to C=O | C-B + C=O → CCOH | 141 |
Oxidation of sulphides | S → S=O | 137 | Chiral sulphoxides
Perception of stereochemistry in structural diagrams
Enabling Technology
Stereocenter manipulation and stereo descriptors
Rotations (E + 8C3 + 3C2):
Op | Symmetry | 1 2 3 4
A | E | 1 2 3 4
B | C3^2 | 1 3 4 2
C | C3^1 | 1 4 2 3
D | C2 | 2 1 4 3
E | C3^1 | 2 3 1 4
F | C3^2 | 2 4 3 1
G | C3^2 | 3 1 2 4
H | C3^1 | 3 2 4 1
J | C2 | 3 4 1 2
K | C3^1 | 4 1 3 2
L | C3^2 | 4 2 1 3
M | C2 | 4 3 2 1
Reflection:
Op | 1 2 3 4
σ | 2 1 3 4
Conceptual Model
Stereo Descriptor
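The twelve rotations tabulated above are exactly the even permutations of the four ligand positions of a tetrahedral center, while a reflection is an odd permutation. The sketch below regenerates the rotation set from that fact and uses it to test whether two ligand orderings describe the same configuration. The function names and ligand labels are assumptions for illustration, not the descriptor implementation presented in the talk.

```python
from itertools import permutations


def is_even(perm: tuple[int, ...]) -> bool:
    """A permutation is even when it has an even number of inversions."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2 == 0


# Proper rotations of a tetrahedron acting on positions 1..4 (E + 8C3 + 3C2).
ROTATIONS = {p for p in permutations((1, 2, 3, 4)) if is_even(p)}


def same_configuration(order_a: tuple[str, ...], order_b: tuple[str, ...]) -> bool:
    """True if the two ligand orderings are related by a rotation (same handedness)."""
    relabel = tuple(order_a.index(ligand) + 1 for ligand in order_b)
    return relabel in ROTATIONS


print(len(ROTATIONS))                                                # 12
print((2, 1, 3, 4) in ROTATIONS)                                     # False
print(same_configuration(("a", "b", "c", "d"), ("a", "c", "d", "b")))  # True
print(same_configuration(("a", "b", "c", "d"), ("b", "a", "c", "d")))  # False
```

A permutation and its inverse have the same parity, so the result does not depend on which ordering is taken as the reference.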
Chemical constraints layer of representation
Enabling Technology
CONNECTIONS=1,2,3 FUSION=BIARYL
RINGS=5+6,6+7 BRIDGEHEAD=YES
DIFFRING=1 EPS=0,1
SAMERING=1 HETS=0,1,2
DIFF=1 NONAROMHETS=0,1,2
SAME=1 HALOGENS=0,1,2
ARYL=YES FGS=ALCOHOL
SPCENTRE=1,2,3 FGNOT=CARBONYL
CHARGE=YES PROP=EWG
HS=0,1,2 PROPNOT=Lg
Substructure search/match
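The constraint keywords above follow a `KEY=VALUE[,VALUE...]` pattern, so they can be read into a dictionary with a few lines of string handling. This sketch only parses the text. What each keyword means inside ARChem is not stated on the slide, and the function name is hypothetical.

```python
def parse_constraints(text: str) -> dict[str, list[str]]:
    """Parse whitespace-separated KEY=V1,V2,... tokens into {KEY: [V1, V2, ...]}."""
    constraints: dict[str, list[str]] = {}
    for token in text.split():
        key, _, values = token.partition("=")
        constraints[key] = values.split(",")
    return constraints


parsed = parse_constraints("CONNECTIONS=1,2,3 FUSION=BIARYL RINGS=5+6,6+7 HETS=0,1,2")

print(parsed["CONNECTIONS"])  # ['1', '2', '3']
print(parsed["RINGS"])        # ['5+6', '6+7']
```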
Reduction of Ketones to Secondary Alcohols
Level 0: Bond change constraints only
Level 1: + Environment constraints
Level 2: + Stereochemical constraints
Base ARChem rule
Hits | ee | de | (screen) | Notes
10,004 | | | (10,004) | Not unique to ketone → secondary alcohol conversion
8,442 | | | (10,004) | Unique to ketone → secondary alcohol conversion; 140 tolerated functional groups
6,525 | 3,457 | 4,711 | (6,765) | Enantioselective and diastereoselective examples
Dihydroxylation of Alkenes
Level 1: Bond changes with environment constraints
Level 2: + Stereochemical constraints
Level 3: + Substitution patterns
2,253 examples (2,416 screened)
Hits ee de (screen)
1,428 1,008 1,151 (1,634)
428 117 352 (444)
Hits ee de
681 578 552
526 289 418
206 131 168
12 10 11
236 89 191
123 51 103
51 27 41
8 4 7
Conclusions
• Useful chemical knowledge can be extracted algorithmically from reaction databases.
• Automation is crucial given the size and growth of databases.
• Different layers of knowledge are tightly entangled: regioselectivity, chemoselectivity and stereoselectivity overlap considerably.
• The extracted knowledge can be applied effectively in computer-aided synthesis design, and can empower chemists by offering new ideas and a broader perspective on the literature.
But...
The quality of extracted knowledge depends heavily on the accuracy and scope of the source data!
The Rule-Set
Cut-off threshold
Useful reactions
Noise
Distractions
Low utility
reactions
Bad atom maps (avoid)
Rare multistep reaction sequences (low utility)
Multiple concurrent reactions on substrate (very low utility)
Exotic heterocycle formation (promote)
Ubiquitous protection / deprotection FGIs such as alcohol/ester, amine/amide etc. (demote)
Conclusions
• A significant portion of the data is being lost due to mapping errors and other problems.
• Yield and selectivity information is captured inconsistently.
What can be done:
• Metadata perception can be improved (in progress).
• Mapping algorithms should reflect contemporary mechanistic understanding of reactions.
• Systematic mapping errors can be manually fixed (planned).
• Extracted rules can be manually curated (continuous).
Acknowledgements
SimBioSys
James Law - Regioselectivity
Victoria Lubitch
Yasamin Salmasi
Aniko Simon
Zsolt Zsoldos
Reaction Data
Elsevier – Reaxys
Wiley - CIRX
RSC - MOS
Accelrys - RefLib
University of Leeds
Tony Cook - Stereochemistry
Peter Johnson
Steve Marsden
Other Collaborators
ChemAxon
And…
ARChem users! THANK YOU!

Extracting Synthetic Knowledge from Reaction Databases - ARChem at the 246th ACS

  • 1. from Reaction Databases Orr Ravitz SimBioSys Inc. 246th ACS National Meeting Extracting Synthetic Knowledge
  • 2. ARChem – main concepts A computer-aided synthesis design system. The Approach:  Comprehensive rule- and precedent-based retrosynthetic analysis back to available starting materials.  Automated rule generation with manual rule curation.  Generate many alternatives.  Provide supporting literature examples.  Allow user guidance and control.
  • 6. Chemical Interference Functional groups that may interfere with transformations are highlighted.
  • 7. Functional Group Tolerance Break down of example set based on the presence of functional groups beyond the reaction center provides evidence for compatibility. Examples can be exported to database’s web interface for further analysis.
  • 8. Stereochemistry Currently:  Exact matches  Starting materials Coming soon:  Rule-based
  • 9. Essential Information Automated extraction of knowledge  Reaction rules  Yield values  Chemical interference - functional group tolerance  Regioselectivity  Stereochemistry Data Information Knowledge Perceive Generalize
  • 10. System Design (diagram labels): Reactions, Reaction Rules, Starting Materials, Expert Knowledge-bases, Target.
  • 11. Rule Extraction (Reactions → Reaction Rules). Source reactions are grouped by type: esterification examples give the esterification rule, and other examples give other rules.
  • 12. Reaction Perception (Reactions → Reaction Rules)
      - Source reaction: reaction file with atom mapping.
      - Extracted core: atoms attached to bonds changed, made or broken in the reaction.
      - Extended core: includes all structural motifs that are essential for the reaction to occur.
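The extracted-core definition on slide 12 can be sketched in a few lines. This is only an illustration, not ARChem's implementation: it assumes each side of an atom-mapped reaction is given as a dictionary from a pair of atom-map numbers to a bond order, and the helper names (`bonds`, `extracted_core`) and the esterification atom maps are hypothetical.

```python
def bonds(*triples):
    """Build {frozenset({map_i, map_j}): bond_order} from (i, j, order) triples."""
    return {frozenset((i, j)): order for i, j, order in triples}


def extracted_core(reactant_bonds, product_bonds):
    """Return (changed_bonds, core_atoms) for an atom-mapped reaction.

    A bond counts as changed if its order differs between the two sides;
    a bond absent on one side is treated as order 0 (made or broken).
    """
    changed = set()
    for bond in set(reactant_bonds) | set(product_bonds):
        if reactant_bonds.get(bond, 0) != product_bonds.get(bond, 0):
            changed.add(bond)
    core_atoms = set().union(*changed) if changed else set()
    return changed, core_atoms


# Hypothetical atom maps for an esterification (hydrogens implicit):
#   acid:    C1-C2(=O3)-O4     alcohol: O5-C6
#   ester:   C1-C2(=O3)-O5-C6  (O4 leaves as water)
reactants = bonds((1, 2, 1), (2, 3, 2), (2, 4, 1), (5, 6, 1))
products = bonds((1, 2, 1), (2, 3, 2), (2, 5, 1), (5, 6, 1))

changed, core = extracted_core(reactants, products)
```

Here the broken C2–O4 bond and the new C2–O5 bond are the changed bonds, so the extracted core is atoms 2, 4 and 5. Extending that core to the essential motifs (for example the carbonyl oxygen) is the separate perception step the slides go on to describe.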
  • 13. Extending the Core: Passengers vs Drivers. The goal of chemical perception is to discriminate between structural features that are essential for the reaction (drivers) and those that are passengers. Shell-based approach: 1st shell, 2nd shell. Graph-based methods are inappropriate.
  • 14. Mechanism-Dependent Core Extension. Nucleophilic aromatic substitution:
      - Addition/elimination mechanism: requires a π-acceptor group in the ortho or para position.
      - Via organometallic intermediate.
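One way to read slides 13 and 14 together is that a fixed-radius shell around the reaction core can miss a driver that sits further away, such as a para π-acceptor in a nucleophilic aromatic substitution. The sketch below illustrates that reading with a breadth-first shell expansion over a hand-built graph of 4-nitrochlorobenzene. The atom labels, the adjacency dictionary and the `shells` helper are assumptions made for the example, not part of ARChem.

```python
from collections import deque


def shells(adjacency, core_atoms, max_shell):
    """Breadth-first bond distances from the core, capped at max_shell bonds."""
    distance = {atom: 0 for atom in core_atoms}
    queue = deque(core_atoms)
    while queue:
        atom = queue.popleft()
        if distance[atom] == max_shell:
            continue  # do not expand beyond the outermost shell
        for neighbour in adjacency[atom]:
            if neighbour not in distance:
                distance[neighbour] = distance[atom] + 1
                queue.append(neighbour)
    return distance


# 4-nitrochlorobenzene, heavy atoms only: Cl on C1, nitro group on C4 (para).
adjacency = {
    "Cl": ["C1"],
    "C1": ["Cl", "C2", "C6"],
    "C2": ["C1", "C3"],
    "C3": ["C2", "C4"],
    "C4": ["C3", "C5", "N"],
    "C5": ["C4", "C6"],
    "C6": ["C5", "C1"],
    "N": ["C4", "O1", "O2"],
    "O1": ["N"],
    "O2": ["N"],
}

core = ["Cl", "C1"]  # atoms of the C-Cl bond broken in the substitution
two_shells = shells(adjacency, core, max_shell=2)
four_shells = shells(adjacency, core, max_shell=4)
```

With two shells the nitro nitrogen is not reached; it only appears at a distance of four bonds from the core. A mechanism-aware extension would instead pull in the ortho and para positions directly.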
  • 15. Rule Extraction (Reactions → Reaction Rules). Diagram labels: similar extended cores, common extracted core, generalized rule, completed reaction rule.
      - Nucleofuge (NF): a leaving group which carries away the bonding electron pair.
      - Generalized group (NF) is replaced by the most common group.
  • 16. Interfering Functionality. Following rule abstraction, compatible functionality is detected by examining the examples:
      - Moieties outside the extended core are listed as compatible.
      - Other functional groups will be inferred as 'possibly interfering'.
      - Possibly interfering functionality will be penalized in scoring and highlighted to the user.
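The compatible / possibly interfering split on slide 16 amounts to a set difference over the example set. The following is a minimal sketch under stated assumptions: each example is reduced to the set of functional groups found outside the extended core, `known_groups` stands for whatever functional-group vocabulary the system perceives, and the flat per-group penalty is a placeholder, since the slides do not say how the scoring penalty is computed.

```python
def classify_functional_groups(examples, known_groups):
    """Split known functional groups into compatible and possibly interfering.

    examples: iterable of sets, one per literature example, holding the
    functional groups observed outside the extended core.
    """
    examples = list(examples)
    compatible = set().union(*examples) if examples else set()
    possibly_interfering = set(known_groups) - compatible
    return compatible, possibly_interfering


def interference_penalty(target_groups, possibly_interfering, penalty_per_group=1.0):
    """Return the groups to highlight and a simple additive score penalty."""
    flagged = sorted(set(target_groups) & possibly_interfering)
    return flagged, penalty_per_group * len(flagged)


# Hypothetical example set for one rule.
known_groups = {"ester", "nitrile", "aldehyde", "amine", "ether"}
examples = [{"ester", "ether"}, {"nitrile"}, {"ester"}]

compatible, possibly_interfering = classify_functional_groups(examples, known_groups)
flagged, penalty = interference_penalty({"ether", "aldehyde"}, possibly_interfering)
```

In this toy data set the aldehyde never appears in a supporting example, so it is flagged for the user and penalized, while the ether is treated as tolerated.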
  • 17. Regioselectivity – Main Steps
      - Recognize the rule’s reaction type: electrophilic substitution, nucleophilic addition, etc.
      - Only reactions prone to regioselectivity are subject to regio calculations.
      - Identify competing sites.
      - Identify substituents and other structural motifs that may influence the directionality.
      - Collect statistics from the example set regarding selectivity in the reaction core as well as elsewhere in the molecule (chemoselectivity).
      - Assign regioselectivity to the rule if predefined statistical requirements are met.
  • 18. Collecting Statistics: electrophilic aromatic substitutions
      - For each example in the DB:
          - Evaluate ring activation, including for heteroaromatic rings and fused rings.
          - Evaluate location, type and neighborhood of ring substituents.
          - Identify symmetry.
          - Compute environment signatures that include all aromatic features plus relevant substituents.
      - For each rule:
          - Cluster reacting vs. non-reacting signature-equivalent sites for reactions with yield > 20%.
          - Define regioselectivity if the examples ratio is 10:1.
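The per-rule statistics on slide 18 can be sketched as a simple count over site signatures. The two thresholds (yield > 20% and a 10:1 ratio) come from the slide; everything else is an assumption for illustration: signatures are plain strings, each example lists its competing sites as `(signature, did_react)` pairs, the 10:1 ratio is treated as a lower bound, and a signature with no non-reacting counterexamples needs at least ten reacting examples.

```python
from collections import Counter

MIN_YIELD = 20.0  # percent; examples at or below this are ignored (slide 18)
MIN_RATIO = 10.0  # reacting : non-reacting, read here as "at least 10:1"


def site_statistics(examples):
    """Count how often each site signature reacts or stays untouched."""
    reacted, unreacted = Counter(), Counter()
    for example in examples:
        if example["yield"] is None or example["yield"] <= MIN_YIELD:
            continue
        for signature, did_react in example["sites"]:
            if did_react:
                reacted[signature] += 1
            else:
                unreacted[signature] += 1
    return reacted, unreacted


def selective_signatures(examples):
    """Signatures whose reacting : non-reacting ratio meets MIN_RATIO."""
    reacted, unreacted = site_statistics(examples)
    selected = {}
    for signature, n_react in reacted.items():
        n_not = unreacted[signature]
        # max(n_not, 1) avoids division by zero and demands >= 10 examples
        # when no counterexample exists (an assumption of this sketch).
        if n_react >= MIN_RATIO * max(n_not, 1):
            selected[signature] = (n_react, n_not)
    return selected


# Toy data: eleven good-yield examples reacting at "para", plus one
# low-yield example reacting at "meta" that the yield filter discards.
examples = [{"yield": 84.0, "sites": [("para", True), ("meta", False)]}] * 11
examples.append({"yield": 5.5, "sites": [("para", False), ("meta", True)]})

selected = selective_signatures(examples)
```

The yield filter matters: a low-yield example that is wrongly admitted (for instance because its yield value was misread) would count as evidence for the minor site.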
  • 19. Regio Example. X=Cl, 84%; X=Cl, 5.5% (rejected). A misinterpreted yield value provided positive evidence.
  • 20. Stereochemistry – the challenge
      - Efficient machine perception and representation of a broad range of synthetically important stereogenic types, including tetrahedral C, S, N and P, as well as alkenes, allenes and atropisomers.
      - Representation of stereochemical reaction rules and stereochemical strategies.
      - Develop a versatile stereochemical substructure algorithm to support retron matching.
      - Efficient discovery of symmetry in stereochemically defined molecules and rules, to avoid duplicate routes.
      - Stereoselectivity is captured inaccurately and inconsistently across common databases.
  • 21. The Data
      Columns: database content | portion of data | notes
      - Number of unmapped examples | 14% | Reaction type unknown
      - Number of examples belonging to reactions with 5000 or more examples | 4% | Ubiquitous protection / deprotection reactions
      - Number of examples belonging to reactions with 20 or fewer examples | 16% | Bad atom maps (database errors); multistep reaction sequences
      - General usable examples | 65% |
      [Chart: examples with quoted selectivity values; x-axis: selectivity metric (yield, cs, de, ee); y-axis: % of database]
      [Chart: examples with selectivity above a threshold; x-axis: threshold selectivity values (> 0% to > 98%); y-axis: % of available; series: yield, cs, de, ee]
  • 22. Stereo-Rules Generation – A Different Approach
      - Manually code rules for a diverse set of useful enantioselective and generally selective reaction types.
      - Mine supporting examples from existing large reaction databases to discover reaction scope and limitations for each rule.
      - Find effective strategies to aid planning of a stereocontrolled synthesis.
      Reactions shown: Diels-Alder, Sharpless, reduction of C=C, reduction of C=O.
  • 23. Designing a Rule-Set: 70 reaction types with ee > 95% and more than 50 examples.
      Columns: reaction type | bond alterations | examples with ee ≥ 95% | notes
      - Addition of C nucleophiles to C=C | CH + C=C → CCCH | 1603 | Mostly conjugate additions
      - Reduction of C=O | C=O → HCOH | 1553 | Any type of carbonyl
      - Addition of C nucleophiles to C=O | CH + C=O → CCOH | 1265 | Mostly aldols and alkynylations
      - Reduction of C=C | C=C → HCCH | 1120 | Wide variety of environments
      - Addition of C nucleophiles to C=N | CH + C=N → CCNH | 639 | Any type of C=N
      - Epoxidation of C=C | C=C → C1CO1 | 415 | Sharpless, Jacobsen, Shi, etc.
      - Addition via R3B to C=C | C-B + C=C → CCCH | 329 | Mostly conjugate addition to enones
      - Addition via R2Zn to C=O | C-Zn + C=O → CCOH | 306 |
      - Dihydroxylation of C=C | C=C → HOCCOH | 266 |
      - Reduction of C=N | C=N → HCNH | 256 | Any type of C=N
      - Diels-Alder | C=C + C=CC=C → C1CCC=CC1 | 222 | Carbocyclic Diels-Alder
      - Cyclopropanation of C=C | C=N + C=C → C1CC1 | 222 | Via diazo precursor (carbene)
      - Mukaiyama aldol | SiOC=C + C=O → O=CCCOH | 210 |
      - C substitution of Br | CH + CBr → CC | 199 |
      - [2+3] azomethine cycloaddition | C=NCH + C=C → N1CCCC1 | 198 |
      - Addition via R2Zn to C=C | C-Zn + C=C → CCCH | 162 | Mostly conjugate addition to enones
      - Addition via R3B to C=O | C-B + C=O → CCOH | 141 |
      - Oxidation of sulphides | S → S=O | 137 | Chiral sulphoxides
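The selection criterion on slide 23 (reaction types with high ee and more than 50 examples) is a grouped count. The sketch below assumes each mined example is reduced to a `(reaction_type, ee)` pair with `ee` in percent or `None` when no value is quoted. It uses ee ≥ 95%, following the table's column header; the slide title says ee > 95%, so the exact comparison is an assumption.

```python
from collections import Counter


def high_ee_reaction_types(examples, min_ee=95.0, min_examples=50):
    """Count high-ee examples per reaction type and keep well-populated types.

    examples: iterable of (reaction_type, ee) pairs; ee is a percentage or
    None when the database entry quotes no enantiomeric excess.
    """
    counts = Counter(
        reaction_type
        for reaction_type, ee in examples
        if ee is not None and ee >= min_ee
    )
    return {rtype: n for rtype, n in counts.items() if n > min_examples}


# Toy data, not taken from any database.
examples = (
    [("Reduction of C=O", 99.0)] * 51
    + [("Reduction of C=O", 40.0)] * 10
    + [("Epoxidation of C=C", 97.0)] * 50
    + [("Diels-Alder", None)] * 60
)

rule_candidates = high_ee_reaction_types(examples)
```

Only the first type survives here: the epoxidation has exactly 50 qualifying examples (not more than 50), and the Diels-Alder entries quote no ee at all.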
  • 24. Enabling Technology: perception of stereochemistry in structural diagrams; stereocenter manipulation and stereo descriptors. Conceptual model and stereo descriptor:
      Rotations (E + 8C3 + 3C2), columns: label | operation | ligand order 1 2 3 4
      - A | E | 1 2 3 4
      - B | C3^2 | 1 3 4 2
      - C | C3^1 | 1 4 2 3
      - D | C2 | 2 1 4 3
      - E | C3^1 | 2 3 1 4
      - F | C3^2 | 2 4 3 1
      - G | C3^2 | 3 1 2 4
      - H | C3^1 | 3 2 4 1
      - J | C2 | 3 4 1 2
      - K | C3^1 | 4 1 3 2
      - L | C3^2 | 4 2 1 3
      - M | C2 | 4 3 2 1
      Reflection:
      - σ | 2 1 3 4
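The rotation table on slide 24 lists the twelve even permutations of four ligands (the identity, eight 3-cycles and three double swaps), while the reflection is a single swap, which is odd. A short sketch can check that count and use permutation parity to decide whether two ligand orderings describe the same tetrahedral configuration. The function names are illustrative, and this is not presented as ARChem's descriptor algorithm.

```python
from itertools import permutations


def parity(perm):
    """0 for an even permutation, 1 for an odd one (inversion count mod 2)."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2


def cycle_type(perm):
    """Sorted cycle lengths of a permutation of 1..n in one-line form."""
    seen, lengths = set(), []
    for start in range(1, len(perm) + 1):
        if start in seen:
            continue
        length, current = 0, start
        while current not in seen:
            seen.add(current)
            current = perm[current - 1]
            length += 1
        lengths.append(length)
    return tuple(sorted(lengths))


# Proper rotations of a tetrahedral centre = even permutations of 4 ligands.
rotations = [p for p in permutations((1, 2, 3, 4)) if parity(p) == 0]
by_type = {}
for p in rotations:
    by_type.setdefault(cycle_type(p), []).append(p)


def same_configuration(ligands_a, ligands_b):
    """True if two orderings of the same four ligands differ by a rotation."""
    perm = tuple(ligands_a.index(ligand) + 1 for ligand in ligands_b)
    return parity(perm) == 0
```

For example, `same_configuration(("F", "Cl", "Br", "H"), ("Cl", "Br", "F", "H"))` is true because the two orderings differ by a 3-cycle, whereas swapping just two ligands gives the mirror-image configuration.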
  • 25. Enabling Technology: chemical constraints layer of representation, used in substructure search/match. Example constraints: CONNECTIONS=1,2,3; FUSION=BIARYL; RINGS=5+6,6+7; BRIDGEHEAD=YES; DIFFRING=1; EPS=0,1; SAMERING=1; HETS=0,1,2; DIFF=1; NONAROMHETS=0,1,2; SAME=1; HALOGENS=0,1,2; ARYL=YES; FGS=ALCOHOL; SPCENTRE=1,2,3; FGNOT=CARBONYL; CHARGE=YES; PROP=EWG; HS=0,1,2; PROPNOT=Lg.
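The `KEY=VALUE` constraints on slide 25 lend themselves to a small parser. The sketch below only shows the general idea of layering such constraints on top of a substructure match. The slides do not define what each keyword means, so the semantics used here (every keyword names an atom property and lists its allowed values, and a missing property fails the match) are assumptions, as are the function names.

```python
def parse_constraints(text):
    """Parse 'KEY=V1,V2 KEY2=V3' into {'KEY': ['V1', 'V2'], 'KEY2': ['V3']}."""
    constraints = {}
    for token in text.split():
        key, _, values = token.partition("=")
        constraints[key] = values.split(",")
    return constraints


def atom_satisfies(atom_properties, constraints):
    """True if every constrained property is present with an allowed value.

    Assumed semantics: values are compared as strings, and a property the
    atom does not carry counts as a failed constraint.
    """
    for key, allowed in constraints.items():
        if key not in atom_properties:
            return False
        if str(atom_properties[key]) not in allowed:
            return False
    return True


constraints = parse_constraints("CONNECTIONS=1,2,3 HS=0,1,2 ARYL=YES")

# Hypothetical perceived properties for two candidate atoms.
matching_atom = {"CONNECTIONS": 3, "HS": 1, "ARYL": "YES"}
rejected_atom = {"CONNECTIONS": 4, "HS": 0, "ARYL": "YES"}
```

A real matcher would also need keyword-specific logic (for example for negated keywords such as `FGNOT` or ring-size pairs such as `RINGS=5+6`), which this sketch deliberately leaves out.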
  • 26. Reduction of Ketones to Secondary Alcohols. Base ARChem rule, refined level by level:
      - Level 0, bond change constraints only: 10,004 hits (10,004 screened). Not unique to ketone → secondary alcohol conversion.
      - Level 1, + environment constraints: 8,442 hits (10,004 screened). Unique to ketone → secondary alcohol conversion; 140 tolerated functional groups.
      - Level 2, + stereochemical constraints: 6,525 hits, ee 3,457, de 4,711 (6,765 screened). Enantioselective and diastereoselective examples.
  • 27. Dihydroxylation of Alkenes
      - Level 1, bond changes with environment constraints: 2253 examples (2416 screened).
      - Level 2, + stereochemical constraints. Columns: hits | ee | de | (screened)
          - 1,428 | 1,008 | 1,151 | (1,634)
          - 428 | 117 | 352 | (444)
      - Level 3, + substitution patterns. Columns: hits | ee | de
          - 681 | 578 | 552
          - 526 | 289 | 418
          - 206 | 131 | 168
          - 12 | 10 | 11
          - 236 | 89 | 191
          - 123 | 51 | 103
          - 51 | 27 | 41
          - 8 | 4 | 7
  • 28. Conclusions
      - Useful chemical knowledge can be extracted algorithmically from reaction databases.
      - Automation is crucial given the size and growth of databases.
      - Different layers of knowledge are tightly entangled: regioselectivity, chemoselectivity and stereoselectivity overlap considerably.
      - The extracted knowledge can be applied effectively in computer-aided synthesis design, and can empower chemists by offering new ideas and a broader perspective on the literature.
      But the quality of extracted knowledge depends heavily on the accuracy and scope of the source data!
  • 29. The Rule-Set. Diagram labels: cut-off threshold, useful reactions, noise, distractions, low utility reactions.
      - Bad atom maps (avoid)
      - Rare multistep reaction sequences (low utility)
      - Multiple concurrent reactions on substrate (very low utility)
      - Exotic heterocycle formation (promote)
      - Ubiquitous protection / deprotection; FGIs such as alcohol/ester, amine/amide, etc. (demote)
  • 30. Conclusions
      - A significant portion of the data is being lost due to mapping errors and other problems.
      - Yield and selectivity information is captured inconsistently.
      What can be done:
      - Metadata perception can be improved (in progress).
      - Mapping algorithms should reflect contemporary mechanistic understanding of reactions.
      - Systematic mapping errors can be manually fixed (planned).
      - Extracted rules can be manually curated (continuous).
  • 31. Acknowledgements
      - SimBioSys: James Law (regioselectivity), Victoria Lubitch, Yasamin Salmasi, Aniko Simon, Zsolt Zsoldos
      - Reaction data: Elsevier (Reaxys), Wiley (CIRX), RSC (MOS), Accelrys (RefLib)
      - University of Leeds: Tony Cook (stereochemistry), Peter Johnson, Steve Marsden
      - Other collaborators: ChemAxon
      - And ARChem users. Thank you!