The document discusses building a structure-centric community for chemists by leveraging crowdsourcing and text-mining of open chemistry data on the internet. It describes ChemSpider's capabilities to search and aggregate chemical data from various sources by structure and property and its efforts to curate and link open access literature and patents to chemical structures. Challenges around data quality and ambiguity in chemical names are also covered. The goal is to enable new ways of searching chemistry information centered around chemical structures.
2. Imagine a time when ….
The internet is searchable by chemical structure and
substructure (e.g. Wikipedia, Google Scholar)
Chemistry articles are indexed and searchable by a free
online service
The web is linked together through the “language of
chemistry”
Publicly funded research data can be shared and
discussed in the open, perhaps as Open Notebook Science (ONS)?
Cheminformatics has as much of a public face as
bioinformatics
Building a Structure Centric Community for Chemists
3. ChemSpider - A Search Engine for Chemists
Questions a chemist might ask…
What is the melting point of n-butanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?
ChemSpider can answer all of these questions
4. What is a Structure?
Ask a computer…ask a chemist
5. Tell Me About Glutathione
6. Tell Me About Glutathione
7. Tell Me About Glutathione
8. Tell Me About Glutathione
9. Tell Me About Glutathione
10. Tell Me About Glutathione
12. Links out to KEGG
Kyoto Encyclopedia of Genes and Genomes
13. How many names does a compound have?
14. ChemSpider Data Content
Over 21.5 million unique chemical structures from ca. 150 data
sources
Online Databases – PubChem, DrugBank, KEGG, Wikipedia
Literature – PubMed, J Het Chem, Nature, RSC, Open Access
Chemical Vendors – over 40 different vendors and growing
Personal Depositions – individual contributions
Content database vendors
Analytical data collections
Patents
Web scraping
Content is linked back to the original data sources
15. Other Searches
What compounds have a mass of 300 +/- 0.001?
or search a combination of intrinsic/predicted properties
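A mass-window query like the one above amounts to a tolerance filter over stored masses. The following is a minimal sketch, not ChemSpider's implementation: the `Record` type, the `find_by_mass` helper, and the three sample records with made-up masses are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str
    mass: float  # mass in Da (hypothetical values below)


def find_by_mass(records, target, tolerance):
    """Return the records whose mass lies within target +/- tolerance."""
    return [r for r in records if abs(r.mass - target) <= tolerance]


# Made-up records, purely for illustration.
RECORDS = [
    Record("compound A", 299.9995),
    Record("compound B", 300.0008),
    Record("compound C", 300.0020),
]

hits = find_by_mass(RECORDS, target=300.0, tolerance=0.001)
print([r.name for r in hits])  # compounds A and B fall inside the window
```

Combining this with other intrinsic or predicted properties would simply add further conditions to the filter.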
18. The Quality of Data Online…
Aggregating data opens up quality issues
Structure-identifier associations are “dirty”
Structures are COMMONLY incorrect
Manual curation of small databases is enough work – what
about millions of structures?
Structures are far from perfect. What is a “correct structure”?
Full stereochemistry?
Historical timeline of structure?
Who is the authority?
19. Who holds THE Quality Authority?
Chemical Abstracts Service is the structural authority
today. 1400 employees, world standard in chemistry
information
101 years of knowledge, process and expertise.
How can an online, free access system peacefully
coexist with the authority?
20. Quality is a Major Issue - Search Butanol
Old example, now fixed
21. Wikipedia Chemistry Curation project
Only ca. 5000 organic structures, 7000 total
structures
Almost a year of work so far for a team of 6
people
Many errors removed in the process. Curation
process is a daily event for users/depositors
Slow and torturous process
http://en.wikipedia.org/wiki/Talk:Tacrolimus#IUPAC_Name_and_structure
22. Wikipedia Curation
Looking for self-consistency
across a Wikipedia Page
Primary key is the article TITLE
The chemical shown needs to
match the title
Cyclic self-consistency – and
decisions must get made
28. Thymol Blue on ChemSpider
Data online includes:
UV-vis spectrum
Measured experimental properties
Link to Wikipedia article
Links to chromatography details
Multiple identifiers/trade names etc.
Links to vendors/suppliers/other databases
Safety information
http://www.chemspider.com/q/thymol%20blue
29. Differences between ChemSpider/Wikipedia
ChemSpider | Wikipedia
>21 million unique structures | ~5000 organics, 2000 others
Complex queries – Properties, Text, structure/substructure, OA publishers, Data Sources, … | Text
Prediction of properties | No
Analytical Data | No, but links.
Active depositors/curators – 30 | Active editors > 50 (?)
6000 people/day; 1900 registered | ????
Compound monographs linked | Detailed compound monographs
30. Differences between Wikipedia/ChemSpider
Wikipedia | ChemSpider
Supported by tried and tested MediaWiki platform | Primarily Microsoft .NET technologies with OS components
Established infrastructure and Wikimedia Foundation team | “Out of a basement” on three servers and 5 volunteers
Chemistry is a subset of the ‘Pedia | Chemistry is the focus of ‘Spider
GFDL licensing for everything | Mixed “licensing”
Strong team of WP:Chem advocates, curators and admins | Growing team of advocates, curators and users
Worldwide reputation as quality source – good and bad | Growing reputation as focused on quality
31. Crowd-sourcing Curation
How to curate data for millions of structures?
Robot processes can clean up depositions
Search for Chloride and check molecular formula for Cl
Check for stereochemistry and remove names with stereo
Provide a simple-to-use platform to curate, annotate
and tag data
Provide curator administration to prevent vandalism
(Veropedia)
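The two robot checks named above can be sketched as simple rules. This is an illustrative assumption about how such checks might look, not the actual ChemSpider robots: the function names are hypothetical, the formula parser ignores brackets and charges, and the stereo pattern covers only (R)/(S)/(E)/(Z) descriptors, not cis/trans, D/L or (+)/(-).

```python
import re

# Element symbol (capital plus optional lowercase letter) followed by a count.
ELEMENT_RE = re.compile(r"([A-Z][a-z]?)(\d*)")
# Parenthesised descriptors such as (R), (E) or (2R,3S).
STEREO_RE = re.compile(r"\((?:\d*[RSEZ])(?:,\d*[RSEZ])*\)")


def elements_in_formula(formula):
    """Return the set of element symbols found in a simple molecular formula."""
    return {symbol for symbol, _count in ELEMENT_RE.findall(formula)}


def chloride_mismatch(name, formula):
    """Flag a record whose name says 'chloride' but whose formula has no Cl."""
    return "chloride" in name.lower() and "Cl" not in elements_in_formula(formula)


def stereo_name_mismatch(name, structure_has_stereo):
    """Flag a stereo-specific name attached to a structure without stereo."""
    return bool(STEREO_RE.search(name)) and not structure_has_stereo


# A deliberately wrong formula for benzyl chloride stands in for a bad deposition.
print(chloride_mismatch("benzyl chloride", "C7H8"))
print(chloride_mismatch("sodium chloride", "NaCl"))
print(stereo_name_mismatch("(R)-limonene", structure_has_stereo=False))
```

Flagged records would still go to a human curator; the robots only narrow down where to look.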
32. Post Comments
Anyone can “Post Comments” associated with a
structure. To curate data we require login to track
34. Crowd-sourcing Chemistry
Crowd-sourced curation: identify and tag errors, edit
names, synonyms, identify records for deprecation
ALSO
Crowd-sourced deposition: anyone can deposit data
(structures, text, images, analytical data)
38. Structure-Centric
We want to search “information” by structure, substructure,
similarity of structure
Specific focus on Open Chemistry at present
Standard approaches would be:
Identify chemical names “entity extraction”
Convert chemical names to structures and index
ChemSpider has a validated dictionary of structure-name
pairs
Use name extraction, name-conversion and dictionary
look-up. THEN curate.
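The approach above (dictionary look-up first, name conversion as a fallback, curation for whatever remains) can be sketched as follows. This is a minimal illustration under stated assumptions: `resolve_names` and `toy_converter` are hypothetical names, structures are represented as SMILES strings, and the converter is a stub standing in for real name-to-structure software.

```python
def resolve_names(names, dictionary, convert_name):
    """Resolve extracted names to structures.

    Returns (resolved, unresolved): resolved holds (name, structure, source)
    tuples; unresolved holds names left for manual curation.
    """
    resolved, unresolved = [], []
    for name in names:
        key = name.strip().lower()
        if key in dictionary:
            # Validated name-structure pair: the most trusted route.
            resolved.append((name, dictionary[key], "dictionary"))
            continue
        structure = convert_name(name)
        if structure is not None:
            resolved.append((name, structure, "converted"))
        else:
            unresolved.append(name)
    return resolved, unresolved


# Stand-in for a validated structure-name dictionary (SMILES values).
DICTIONARY = {"aspirin": "CC(=O)Oc1ccccc1C(=O)O", "dichloromethane": "ClCCl"}


def toy_converter(name):
    """Stub for name-to-structure conversion; knows a single name."""
    return {"methanol": "CO"}.get(name.lower())


resolved, unresolved = resolve_names(
    ["Aspirin", "dichloromethane", "methanol", "azo Schiff base 3"],
    DICTIONARY,
    toy_converter,
)
print(resolved)
print(unresolved)  # names that still need a curator
```

Recording the source of each hit lets curators review converted names before dictionary matches.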
39. “Entity Extraction”
Rule-based recognition of systematic names:
Use a lexicon of name fragments
Rules for identifying bounds of a name
Look-up dictionary:
Drug Names
Trivial Names
Numbers: Registry IDs, EINECS/ELINCS
Massive look-up dictionary of validated identifiers on
ChemSpider
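The two recognition routes above can be sketched together: a token counts as a systematic name if it can be spelled entirely from known name fragments, and as a dictionary hit if it appears in a look-up set. This is a toy sketch, not the production recogniser: the fragment list, the `TRIVIAL_NAMES` set and both function names are assumptions, and name-boundary rules (locants, brackets, multi-word names) are left out.

```python
import re

# Illustrative fragment list only; a real lexicon would be far larger.
FRAGMENTS = sorted(
    ["meth", "eth", "prop", "but", "di", "tri", "chloro", "amino",
     "phenyl", "an", "ane", "ol", "one", "yl"],
    key=len, reverse=True,
)
# Stand-in for a validated look-up dictionary of trivial and drug names.
TRIVIAL_NAMES = {"aspirin", "xanax", "glutathione"}


def segment(word, fragments=FRAGMENTS):
    """Return a list of fragments that spell `word` exactly, or None."""
    if not word:
        return []
    for frag in fragments:
        if word.startswith(frag):
            rest = segment(word[len(frag):], fragments)
            if rest is not None:
                return [frag] + rest
    return None


def extract_entities(text):
    """Return (token, kind) pairs for tokens recognised as chemical names."""
    hits = []
    for token in re.findall(r"[A-Za-z]+", text):
        word = token.lower()
        if word in TRIVIAL_NAMES:
            hits.append((token, "dictionary"))
            continue
        parts = segment(word)
        # Require two or more fragments so bare words such as "one" are skipped.
        if parts is not None and len(parts) >= 2:
            hits.append((token, "systematic"))
    return hits


TEXT = ("The mixture was washed with dichloromethane and recrystallized "
        "from ethanol; aspirin was added.")
print(extract_entities(TEXT))
```

Even this toy version shows why curation follows extraction: a permissive fragment list will also accept ordinary words that happen to be built from chemical fragments.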
41. Name Recognition
Azo aldehyde 2 was synthesized according to a
reported method [17]. To a stirred solution of azo aldehyde
2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C
were successively added (3,4-diaminophenyl)phenyl
methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous
MgSO4 (2.00 g, 16.67 mmol).
The resulting mixture was stirred for 6 hours at room
temperature [18]. The mixture was filtered and washed with
dichloromethane. Then the solvent was evaporated under
reduced pressure to give azo Schiff base 3 as a red solid, which
was recrystallized from 95% ethanol (1.28 g, 91%).
42. Name Recognition
Azo aldehyde 2 was synthesized according to a
reported method [17]. To a stirred solution of azo aldehyde
2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C
were successively added (3,4-diaminophenyl)phenyl
methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous
MgSO4 (2.00 g, 16.67 mmol).
The resulting mixture was stirred for 6 hours at room
temperature [18]. The mixture was filtered and washed with
dichloromethane. Then the solvent was evaporated under
reduced pressure to give azo Schiff base 3 as a red solid, which
was recrystallized from 95% ethanol (1.28 g, 91%).
43. How Many Chemical Names?
“She had the drive to derive success in any
venture and was well versed in Karate.
When the man in the tartan shirt
approached her with a dagger in his hand
she spat in his face, took the stance of a
commando and took advantage of his
shock to release the dagger from his grip
and causing him to recoil. He went home
and took an aspirin after the beating.”
44. How Many Chemical Names?
“She had the drive to derive success in any
venture and was well versed in Karate.
When the man in the tartan shirt
approached her with a dagger in his hand
she spat in his face, took the stance of a
commando and took advantage of his
shock to release the dagger from his grip
and causing him to recoil. He went home
and took an aspirin after the beating.”
45. ChemMantis
Chemical Markup And Nomenclature Transformation
Integrated System
46. Making Open Access Articles Searchable
Proof of Concept
Can we HOST Chemistry Open Access articles on
ChemSpider and add value?
Can we identify chemical names in Open Access articles
in a user-friendly manner?
Can we convert names to structures in Open Access
articles, expand ChemSpider, and provide structure
searching of Open Access chemistry articles?
Can we provide an environment for chemists to mark up
their own articles and crowd-source markup of an
archive?
47. Document markup
ChemSpider now hosting Open Access articles from
MDPI, Molecular Diversity Preservation International
Hosting the Molbank collection at present
48. A Standard for Document Markup?
NLM-DTD: National Library of Medicine Document
Type Definition
Approved markup definitions to apply to journal
articles – extended as necessary for our purposes
59. A Platform for Markup
Can we provide a platform for document markup for
chemists?
Workflow:
Upload Word docs or RTF files, or point to HTML and load
Apply entity extraction, convert names to structures, mark-up
automatically and ask for user participation
Publish final version with NLM-DTD markup
Deposit all structures on ChemSpider under embargo and
wait for article DOI to release
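The automatic mark-up step in the workflow above can be sketched as wrapping each recognised name in an XML element. This is a hedged illustration only: `named-content` with a `content-type` attribute is a generic element of the NLM journal DTDs, but the attribute value `chemical-name` and the `mark_up` helper are assumptions, not the tags ChemSpider actually emits.

```python
import re
from xml.sax.saxutils import escape

OPEN_TAG = '<named-content content-type="chemical-name">'  # value is assumed
CLOSE_TAG = "</named-content>"


def mark_up(text, names):
    """Wrap each occurrence of a known name in a tag and XML-escape the rest."""
    if not names:
        return escape(text)
    # Longest names first so that a longer name wins over a name it contains.
    alternatives = "|".join(
        re.escape(n) for n in sorted(names, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    parts, pos = [], 0
    for match in pattern.finditer(text):
        parts.append(escape(text[pos:match.start()]))
        parts.append(OPEN_TAG + escape(match.group(0)) + CLOSE_TAG)
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)


print(mark_up("Washed with dichloromethane & dried.", ["dichloromethane"]))
```

The word-boundary match is a simplification: names that begin or end with brackets or digits would need the boundary rules of a real recogniser, and the user-participation step would review the tags before publication.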
60. Challenges
Computer software can generate chemical names better
than the majority of chemists
The majority of chemical names are generated by
humans and are incorrect – they convert to the wrong
structure or are ambiguous
One name, Multiple Structures
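The "one name, multiple structures" problem can be detected by grouping name-structure pairs by name. This is a minimal sketch under stated assumptions: structures are given as SMILES strings that are already canonical (so string equality means structural equality), and `ambiguous_names` is a hypothetical helper.

```python
from collections import defaultdict


def ambiguous_names(pairs):
    """Return {name: sorted structures} for names mapped to 2+ structures."""
    by_name = defaultdict(set)
    for name, smiles in pairs:
        by_name[name.strip().lower()].add(smiles)
    return {
        name: sorted(structures)
        for name, structures in by_name.items()
        if len(structures) > 1
    }


# "butanol" is used for both butan-1-ol (CCCCO) and butan-2-ol (CCC(C)O).
PAIRS = [
    ("butanol", "CCCCO"),
    ("Butanol", "CCC(C)O"),
    ("ethanol", "CCO"),
    ("ethanol", "CCO"),
]

print(ambiguous_names(PAIRS))
```

Names flagged this way are candidates for curation rather than automatic conversion.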
72. Single Configuration File defines entities for markup
Algorithms can be built for certain
entities but the majority are dictionaries
– vendors, Phys Properties, Analytical
We can extend our system to support
your needs based on dictionaries – what
does NPG need/not need?
74. Entity Balloons
Structures are the
language of chemistry
Show structures to
chemists and search/link
from there
75. Other Dictionaries - Species
We are considering
Bacteria
Fungi
Enzymes
Viruses
PDB codes….
76. Integrations Out to Other Sources
77. Integrations Out to Other Sources
79. Manual Curation is Always Necessary
80. Text-Indexing and ChemSpider?
ChemSpider text-indexes almost 500,000 Open Access
and Free Access articles
Collection is growing and more publishers have already
agreed. Including theses in the future.
82. Conclusions
The quality of structure-based data online should
always be questioned – that includes ChemSpider
Data on ChemSpider are being added and curated on a
daily basis, but we always need more eyeballs helping
ChemSpider has a large validated structure-name
dictionary
Chemical name extraction and document markup are
very enabling