The document discusses the need for mandating data standards and open sharing in scientific publishing. It notes the loss of data reproducibility due to a lack of standardized data formats. The author advocates that publishers should require authors to register chemical compounds, submit spectral data in standard digital formats rather than images, and make data openly available with metadata and provenance. Adopting standards like CVSP could help validate structures and spectra before publication. While mandates may not be preferred, open data standards are important for scientific progress and reproducibility.
Our dire need to mandate data standards and expectations for scientific publishing
1. Our dire need to mandate data
standards and expectations for
scientific publishing
Antony Williams
ACS Denver, March 2015
2. Reproducibility, Reporting,
Sharing & Plagiarism
• I will present from the point of view of:
• Losing way too much of my own data!
• Someone who actively wants to share data
• My involvement with a chemistry database
• As a reviewer of publications
• As an author of scientific publications
• ..and as a replacement speaker…
7. What technical solutions though?
• Despite the push for Open Data the funders
are not really pushing solutions yet
• Institutional repositories are commonplace
• (Partial) solutions are becoming available
11. So what do I do…
• VP Strategic Development for RSC
• Manage the cheminformatics team
• Interested in Open Drug Discovery, Open Data
management, Cheminformatics standards
• But originally an NMR spectroscopist with a
focus on structure elucidation - very interested
in “CASE”, study of natural products
14. Studying DOZENS of compounds
• NO access to raw data files – in binary or
even standard file formats for processing
• Figures are close to USELESS for 2D NMR
– representative not accurate shifts
• Tabulated shifts are in PDF files and needed
transcribing – where are CSV files???
• TORTUROUS WORK!!!!
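As a minimal sketch of what a machine-readable shift table could look like, the snippet below writes and reads assignments as CSV using only Python's standard library. The column names (`atom`, `nucleus`, `shift_ppm`) and the atom labels are hypothetical, chosen for illustration rather than taken from any published convention.

```python
import csv
import io

# Hypothetical column names; the rows further down are illustrative only.
FIELDS = ["atom", "nucleus", "shift_ppm"]


def write_shift_table(rows, handle):
    """Write shift assignments as CSV so nobody has to transcribe a PDF."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def read_shift_table(handle):
    """Read the CSV back, converting shifts to floats for reprocessing."""
    return [
        {
            "atom": row["atom"],
            "nucleus": row["nucleus"],
            "shift_ppm": float(row["shift_ppm"]),
        }
        for row in csv.DictReader(handle)
    ]


rows = [
    {"atom": "C1", "nucleus": "13C", "shift_ppm": 14.12},
    {"atom": "C2", "nucleus": "13C", "shift_ppm": 66.12},
]

# Round trip through an in-memory file to show nothing is lost.
buffer = io.StringIO()
write_shift_table(rows, buffer)
buffer.seek(0)
table = read_shift_table(buffer)
```

A file like this, attached as supplementary data, could be loaded straight into a spreadsheet or a prediction tool without retyping.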
17. In researcher mode…
• I want to access and use data
• I want to:
• Download molecules
• Download tables
• Download spectra
• Download figures
• Then reprocess, replot, repurpose
18. Community Norms
• Some wonderful community norms and
mandates!
• Deposit crystal structures in CSD
• Deposit Proteins in PDB
• Deposit gene sequences in Genbank
• Increasingly deposit bioassay data in Pubchem
19. What of general chemistry?
• We publish into locked down files and then
“abstract” the data!
• Could publishers help drive a community
norm for:
• Chemical compound registration
• Spectral data
• Property data
• What else?
22. Could we at least improve
quality of compounds?
• Maybe forcing compound registration ahead
of time won’t work (would need a business
model etc.)
• But what can be done to help correct the
many issues we see with structures?
• Examples?
38. What if…
• CVSP was used to check and process all
ChemDraw, Molfiles, SDF files before
submitting to publishers or databases?
• Publishers used the CVSP API to check their
data?
• All the rules were openly available for adoption
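The slides do not show CVSP's rules or its API, so the sketch below does not reproduce either. It only illustrates the general idea of an automated check run before submission, assuming a V2000 Molfile: the counts line should agree with the atom and bond blocks, bonds should point at atoms that exist, and the file should end properly. The function name `check_molfile` and the messages are hypothetical.

```python
def check_molfile(text):
    """Return a list of problems found in a V2000 Molfile (empty if none)."""
    lines = text.splitlines()
    if len(lines) < 4:
        return ["file is too short to contain a counts line"]
    counts = lines[3]  # three header lines come before the counts line
    if "V2000" not in counts:
        return ["counts line is not marked V2000"]
    try:
        n_atoms = int(counts[0:3])
        n_bonds = int(counts[3:6])
    except ValueError:
        return ["counts line has unreadable atom or bond counts"]

    problems = []
    atom_lines = lines[4:4 + n_atoms]
    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    if len(atom_lines) < n_atoms or len(bond_lines) < n_bonds:
        problems.append("fewer atom or bond lines than the counts line declares")
    for bond in bond_lines:
        try:
            first, second = int(bond[0:3]), int(bond[3:6])
        except ValueError:
            problems.append("unreadable bond line")
            continue
        if not (1 <= first <= n_atoms and 1 <= second <= n_atoms):
            problems.append(f"bond refers to a missing atom: {bond.strip()}")
    if not any(line.strip() == "M  END" for line in lines):
        problems.append("missing 'M  END' terminator")
    return problems


# Illustrative two-atom record (C-O), built line by line to keep the spacing.
GOOD = "\n".join([
    "example",
    "  sketch",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.4000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  END",
])
# Same record with the bond pointing at a third atom that does not exist.
BAD = GOOD.replace("  1  2  1  0", "  1  3  1  0")

good_problems = check_molfile(GOOD)
bad_problems = check_molfile(BAD)
```

Real validation would go much further (valence, stereo, charge balance); the point is only that checks of this kind are cheap to run before a structure reaches a publisher.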
39. A Talk from Yesterday…
http://www.slideshare.net/AntonyWilliams/
45. We want to find text spectra?
• We can find and index text spectra: 13C NMR
(CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH,
benzylic methane), 30.77 (CH, benzylic
methane), 66.12 (CH2), 68.49 (CH2), 117.72,
118.19, 120.29, 122.67, 123.37, 125.69, 125.84,
129.03, 130.00, 130.53 (ArCH), 99.42, 123.60,
134.69, 139.23, 147.21, 147.61, 149.41,
152.62, 154.88 (ArC)
• What would be better are spectral figures – and
include assignments where possible!
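To illustrate how a text spectrum like the one above can be found and indexed, here is a minimal sketch using a regular expression. It is not the method used in the talk, which the slides do not describe. The pattern assumes the header has the form "nucleus NMR (solvent, frequency MHz)", and it does not handle negative shifts or shift ranges.

```python
import re

# Assumed header layout: "13C NMR (CDCl3, 100 MHz)".
HEADER = re.compile(
    r"(?P<nucleus>\d+[A-Z][a-z]?) NMR "
    r"\((?P<solvent>[^,]+), (?P<mhz>\d+(?:\.\d+)?) MHz\)"
)


def parse_text_spectrum(text):
    """Split a text spectrum into nucleus, solvent, frequency and shifts."""
    header = HEADER.search(text)
    if header is None:
        return None
    body = text[header.end():]
    # Drop parenthesised annotations such as "(CH3)" so their digits
    # are not mistaken for shifts.
    body = re.sub(r"\([^)]*\)", "", body)
    shifts = [float(s) for s in re.findall(r"\d+(?:\.\d+)?", body)]
    return {
        "nucleus": header.group("nucleus"),
        "solvent": header.group("solvent"),
        "mhz": float(header.group("mhz")),
        "shifts": shifts,
    }


example = (
    "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), "
    "66.12 (CH2), 117.72, 118.19"
)
parsed = parse_text_spectrum(example)
```

Note that the shifts survive but the assignments in parentheses are thrown away here, which is one reason text mining is a poor substitute for submitting the data digitally.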
47. Developing Proof-of-Concept
• Extract from 1976-2014 USPTO applications
Nucleus    Count
H          975543
C          56536
unknown*   44306
F          9429
P          3241
B          91
Si         62
Sn         22
Se         11
N          8
*unknown – text starts off with "NMR:" followed by a peak list (no nucleus given)
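A tally like this one can be sketched in a few lines, assuming each extracted text spectrum begins with an isotope label such as "1H NMR" or "19F-NMR". The extraction pipeline behind the real numbers is not described in the slides; the pattern and the sample strings below are made up for illustration.

```python
import re
from collections import Counter

# Assumed prefix: mass number, element symbol, then "NMR" with an
# optional space or hyphen in between.
PREFIX = re.compile(r"\s*\d+([A-Z][a-z]?)[ -]?NMR")


def nucleus_of(text):
    """Return the element symbol, or 'unknown' when no nucleus is stated."""
    match = PREFIX.match(text)
    return match.group(1) if match else "unknown"


# Illustrative samples, not taken from any patent.
samples = [
    "1H NMR (CDCl3, 400 MHz): δ 7.26 (s, 1H)",
    "13C NMR (CDCl3, 100 MHz): δ 14.12",
    "19F-NMR: δ 63.1",
    "NMR: 1.21 (t, 3H), 4.12 (q, 2H)",
    "1H-NMR (DMSO-d6): δ 2.50",
]
tally = Counter(nucleus_of(s) for s in samples)
```

The fourth sample lands in "unknown" because it starts with "NMR:" and names no nucleus, which is the case the footnote describes.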
50. Extraction is the WRONG WAY
• We should NOT have to mine data out – it should be in digital form!
• Structures should be submitted “correctly”
• Spectra should be digital spectral formats,
not images
• ESI should be RICH and interactive
• Data should be open, available, with metadata
and provenance
51. We can solve for Authors here
Will it be used though??? YES!
53. The challenges of analytical data
• Vendors produce complex proprietary data
formats and standard formats are required
(JCAMP, NetCDF, AnIML)
• ChemSpider already hosts thousands of JCAMP spectra
• Support of “assigned spectra” in place
• Data validation approaches understood
• There are a myriad of analytical data types…
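Part of the appeal of a standard format is that it can be read without vendor software. As a rough sketch, JCAMP-DX files are plain text built from labelled records of the form `##LABEL=value`, with `$$` starting a comment. The reader below collects those records and nothing more: it does not decode compressed data tables, and it skips the label normalisation and compound-file handling that a full reader would need. The sample file content is illustrative.

```python
def read_jcamp_records(text):
    """Collect '##LABEL=value' records; later lines extend the last record."""
    records = {}
    label = None
    for raw in text.splitlines():
        line = raw.split("$$", 1)[0].rstrip()  # strip inline comments
        if line.startswith("##") and "=" in line:
            label, value = line[2:].split("=", 1)
            label = label.strip().upper()
            records[label] = [value.strip()]
        elif label is not None and line.strip():
            records[label].append(line.strip())
    return records


# Illustrative peak-table file; the values are not real measurements.
SAMPLE = "\n".join([
    "##TITLE=example spectrum  $$ illustrative values",
    "##JCAMP-DX=5.01",
    "##DATA TYPE=NMR PEAK TABLE",
    "##XUNITS=PPM",
    "##YUNITS=ARBITRARY UNITS",
    "##NPOINTS=2",
    "##PEAK TABLE=(XY..XY)",
    "14.12, 1.0",
    "66.12, 1.0",
    "##END=",
])

records = read_jcamp_records(SAMPLE)
# The first entry is the table's form "(XY..XY)"; the rest are x, y pairs.
peaks = [
    tuple(float(part) for part in row.split(","))
    for row in records["PEAK TABLE"][1:]
]
```

Even this much gives a reviewer or a database actual numbers to check, where an image of a spectrum gives none.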
57. It’s Dangerous to Mandate
• Scientists prefer guidelines rather than rules
• It can be more work to meet mandates
• Mandates may discourage submissions to
journals
• But what’s good for science?
• Will the Open Data movement shift things?
• Will the latest generation share more?
58. Reproducibility, Reporting,
Sharing & Plagiarism
• If publishers demanded it of me…
• I would lose less of my own data!
• I would actively be sharing data
• As a reviewer of publications..enables me
• As an author of scientific publications..makes the
publications better I believe
• ..and I did my best as a replacement speaker…