Scott Edmunds' talk at CODATA 2019 on "Quantifying how FAIR is Hong Kong: The Hong Kong Shareability of Hong Kong University Research Experiment", 19th September 2019 in Beijing
2. The Hong Kong experience.
Asia’s Academic City?
8 Universities, many ranked top 50 worldwide
100K students (UG/PG/FT/PT)
1 major research funder (UGC/RGC)
UGC Policy: “Realization of making Hong Kong Asia's world city is only possible if it is based upon the platform of a very strong education and higher education sector.”
http://www.ugc.edu.hk/eng/ugc/policy/policy.htm
3. Research Data policies growing globally
http://ec.europa.eu/research/openscience/index.cfm?section=monitor&pg=researchdata#1
4. http://dx.doi.org/10.17477/jcea.2018.17.2.200
…meanwhile in Hong Kong
“This ambivalence was reflected by the chairman of the Research Grants Council, who stated in an interview that ‘there is no relationship between world-class research and release of data’, questioning whether anyone might be interested in the completeness of data. The chairman also saw a conflict between competitiveness and openness, arguing that the reputation of a researcher is built on publications, not on the underlying data.”
6. If Government doesn’t act, Universities need to lead the way
http://www.rss.hku.hk/integrity/research-data-records-management
7. First CRIS in HK, built upon Scholars Hub
http://hub.hku.hk/advanced-search?location=crisdataset
(CRIS = current research information system)
8. First CRIS in HK, built upon Scholars Hub
http://lib.hku.hk/researchdata/rpg.htm
“Beginning with the September 2017 intake, all HKU research postgraduate (RPG) students have responsibility for 1) using a data management plan (DMP), where applicable, to describe the use of data in preparation for, or in the generation of, their theses, and 2) depositing, where applicable, a dataset in the HKU Scholars Hub.”
9. Growing # of OA journals addressing this
http://dx.doi.org/10.1371/journal.pmed.1001607
11. http://reproducibility.cs.arizona.edu/
Arizona Repeatability in Computer Science Experiment
• 2015 study examining the extent to which Computer Systems researchers share their research artifacts (code)
• NSF policies on sharing code since 2005
• Examined 613 papers from ACM conferences & journals
• Attempted to locate the source code that backed up the results
• If found, tried to build the code
14. Can we do something similar in HK?
Teaching HKU MLIM students a module on data curation and management.
15. HKU Repeatability in HK Research Experiment
• HKU policy on data sharing from 2015
• PLOS policy mandating sharing of supporting data since March 1, 2014
• HKU has published ≈400 PLOS ONE papers from 2014 to date
• Can we quantify reproducibility in a sample of these?
• Compare with other, less stringent journals (e.g. journals ranked by Springer Nature data policy type [1])
• Can we follow Arizona and harness crowdsourced
(student) power?
1. https://www.springernature.com/gp/authors/research-data-policy/data-policy-types/12327096
16. HKU Repeatability in HK Research Experiment
• An easy exercise in literature curation for HKU MLIM students
• Set as a project for 59 students, 2017-2019
http://hub.hku.hk/simple-search?query=&location=publication&sort_by=score&order=desc&rpp=25&filter_field_1=journal&filter_type_1=equals&filter_value_1=plos+one&etal=0&filtername=dateIssued&filterquery=[2014+TO+2019]&filtertype=equals
18. HKU Repeatability in HK Research Experiment
https://scholarlykitchen.sspnet.org/2016/01/06/plos-one-shrinks-by-11-percent/
Rise (and fall) of megajournals
Driven by impact factor or “easier” data policies?
“Because data requirements are not uniform across all journals, PLOS has put itself at a disadvantage as far as attracting authors because other journals offer an easier path. If strictly enforced, this new policy is likely to result in a drop in submissions to PLOS journals. While no other mega-journal has been able to shake PLOS ONE’s hold on the market, this policy may provide an opening for competitors to gain on PLOS ONE and even overtake it.”
Can we quantify this?
19. HKU Repeatability in HK Research Experiment
• Students assigned 2 PLOS ONE + 2 Scientific Reports papers (268 total)
• Quickly scan the paper looking for supporting data
• If no data, go to the next paper
• If it uses data, is it all associated with the paper?
• If external data, is it available from a URL or accession?
• If “data available on request”, are the authors contactable?
• Spend up to ~10 minutes per article
• Add data into a shared Google Doc; the teacher double-checks & marks students on accuracy
Homework/Case study: literature curation exercise
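The per-paper checklist above amounts to one row per paper in a shared sheet. A minimal sketch of recording those assessments programmatically; the column names and the example row are illustrative assumptions, not the actual Google Doc schema:

```python
import csv
import os

# Hypothetical columns; the real curation sheet may differ.
FIELDS = ["handle", "journal", "has_data", "external_accession",
          "all_data_available", "contactable_on_request", "comments"]

def record_assessment(path, assessment):
    """Append one paper's curation result to a shared CSV sheet."""
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:          # write the header once, for a fresh file
            writer.writeheader()
        writer.writerow(assessment)

record_assessment("curation.csv", {
    "handle": "10722/223364",
    "journal": "PLOS ONE",
    "has_data": "yes",
    "external_accession": "",
    "all_data_available": "no",
    "contactable_on_request": "no response",
    "comments": "minimal anonymized dataset available upon request only",
})
```

Keeping the results in a plain CSV (rather than free text) makes the later accessibility tallies a simple filter-and-count.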
20. HKU Repeatability in HK Research Experiment
Alternative: web-scraping option (code in GitHub)…
https://github.com/jessesiu/hku_scholars_hub
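As a rough illustration of what the web-scraping route involves, here is a minimal sketch that pulls item handles out of a saved Scholars Hub search-results page. The regex and HTML shape are assumptions on my part; the actual scraper in the linked repository is more complete:

```python
import re

# Scholars Hub items are addressed by handles of the form 10722/NNNNN.
HANDLE_RE = re.compile(r'href="/handle/(10722/\d+)"')

def extract_handles(html):
    """Return the unique item handles linked from a results page,
    preserving the order in which they first appear."""
    seen, handles = set(), []
    for h in HANDLE_RE.findall(html):
        if h not in seen:
            seen.add(h)
            handles.append(h)
    return handles

sample = '<a href="/handle/10722/223364">…</a> <a href="/handle/10722/223364">dup</a>'
print(extract_handles(sample))  # ['10722/223364']
```

The handle list can then be fed back into per-item pages to harvest metadata, which is the step the manual student exercise replaces.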
21. HKU Repeatability in HK Research Experiment
See protocols on protocols.io: http://dx.doi.org/10.17504/protocols.io.6x7hfrn
Teachers' protocol: http://dx.doi.org/10.17504/protocols.io.6x8hfrw
Students' protocol: http://dx.doi.org/10.17504/protocols.io.6yahfse
22. HKU Repeatability in HK Research Experiment
Example
http://hub.hku.hk/handle/10722/223364
23. HKU Repeatability in HK Research Experiment
Example
Is there data presented in the paper? – Yes
Is there external data, and if so what is the link/accession? – No
Is all the data in the paper available? – No
Comments – Has a questionnaire, but no data, as the paper says the “minimal anonymized dataset will be made available upon request”
24. HKU Repeatability in HK Research Experiment
Example
If data is “available on request”, do the authors respond if contacted?
26. Interesting examples
Several examples of missing Infectious Disease data
http://www.vox.com/2015/6/17/8796225/mers-virus-data-sharing
http://www.nature.com/news/data-sharing-make-outbreak-research-open-access-1.16966
28. Data accessibility (flow diagram): 148 papers → 114 with data; 27 “data on request” (7 responded, 5 bounced, 17 no response); 7 missing → 121 papers with accessible data (82%)
29. Data accessibility (flow diagram): 120 papers → 79 with data; 16 “data on request” (8 responded, 8 no response); 25 missing → 87 papers with accessible data (72.5%)
30. External Data Sources
• A growing number of papers hosted data via general-purpose open-access repositories:
– figshare (12), Dryad (5), OSF (4), Zenodo (2), Dataverse (2), PANGAEA (2), DANS (1)
– Since 2016, figshare use has been dropping & OSF/Zenodo use increasing
– Large numbers of government, IR & institutional websites
– Other than one broken Dryad link, OA data repositories were much more stable than other URLs (many broken)
https://figshare.com/projects/HKU_Repeatability_in_HK_Research_Experiment/64118
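The "many broken URLs" observation rests on checking each data link, by hand or by script. A minimal sketch of such a link check using only the Python standard library; the function name and classification labels are my own:

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def check_url(url, timeout=10):
    """Classify a data link as 'ok', 'broken', or 'unreachable'
    by issuing a lightweight HEAD request."""
    try:
        req = Request(url, method="HEAD",
                      headers={"User-Agent": "link-check-sketch"})
        with urlopen(req, timeout=timeout):
            return "ok"                 # urlopen raises on 4xx/5xx
    except HTTPError:
        return "broken"                 # server answered with an error code
    except (URLError, ValueError):
        return "unreachable"            # DNS failure, timeout, or bad URL
```

Run over the harvested data links, this kind of check gives a crude but repeatable measure of link rot across repositories versus ad-hoc websites.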
32. Do not rely on handles
Instability of older HKU Scholars Hub identifiers & data
• Going back to older papers (collected in early 2017), 3/49 (6%) of handles have changed
• Checking back over time, the number of 2016/2017/2018 PLOS/SR papers listed keeps increasing (we have had to update our results)
33. Do not rely on “data available from our website”
http://bioinformatics.oxfordjournals.org/content/24/11/1381.long
34. Do not rely on “data available on request”
https://doi.org/10.1101/633255
35. Do not rely on “data available from the government”
HK Hospital Authority only shares data with researchers at UGC-funded universities in Hong Kong, with data access charges averaging 35,700 HKD per request [1]
1. https://www.accessinfo.hk/en/request/request_for_statistics_on_data_c
2. https://www.nature.com/articles/s41598-017-15579-z
Emailing the authors for the data:
“Thanks for your interest. I'm afraid we can't as the data came from our hospital authority which is highly strict in using of their data and would not allow us to use the data other the purposed we stated before.”
So why say it was available upon request?
36. Do not rely on GitHub (or google)
https://dev.to/mjraadi/if-you-don-t-know-now-you-know-github-is-restricting-access-for-users-from-iran-and-a-few-other-embargoed-countries-5ga9
37. Lessons Learned: never trust “data on request”
• “Data available on request” does not work (65% of requests failed after 2 attempts)
• Hong Kong Government (esp. Hospital Authority) data access policies are incompatible with international journal policies
• Email addresses are not checked by journals: 5 bounced (one wasn't even in a correct format); 1 example gave a postal address only
• The Data Access Committee system is not working: none of the DACs of the listed consortium/cohort projects responded to emails (Children of 1997, Guangzhou Biobank Cohort Study, JAGES, and China Research Center on Aging DACs)
• Even if authors respond, there are often problems:
• Terms & conditions, e.g. MTAs or co-authorship requirements; one could share only a sample of the processed data, not the raw data, as they were still writing publications
• Missing data, e.g. deleted raw sequencing data
https://figshare.com/projects/HKU_Repeatability_in_HK_Research_Experiment/64118
38. Lessons Learned: problems with Scholars Hub
• Unstable identifiers – 6% (3/49) examples changed in 2
years
• Unstable indexing – numbers of historic publications
keep increasing (self-reporting by authors?)
• Unstable source of datasets: one example of data in a
thesis that was blocked for a period
• Inconsistent indexing/metadata – one example lacked a
link/DOI to the paper, inconsistent keywords & tagging
• Inconsistent authorship – multiple, unused ORCID IDs
registered by HKU
https://figshare.com/projects/HKU_Repeatability_in_HK_Research_Experiment/64118
40. Importance of FAIR snapshots
Why GigaScience was set up: https://doi.org/10.1093/database/baz016
Foundational Principles
• Can’t trust “data available on request” – need independent, trusted broker
• Follow FAIR principles (Findability, Accessibility, Interoperability, and
Reusability) for data stewardship & offer unlimited data hosting
• Use globally unique and persistent (stable) identifiers, e.g. DataCite DOIs
• Need to take unlimited-sized snapshots of the “version of record” (data, code…)
• Increase Reusability with Interoperable CC licensing (we use CC0)
• Increase Findability & Reusability with rich open metadata (field specific,
DataCite, schema.org) and wide indexing (DataCite, NIH datamed, DCI, etc.)
41. Thanks to:
Laurie Goodman, Editor in Chief
Nicole Nogoy, Editor
Hans Zauner, Assistant Editor
Hongling Zhao, Assistant Editor
Peter Li, Lead Data Manager
Chris Hunter, Lead BioCurator
Chris Armit, Data Scientist
Mary Ann Tulli, Data Editor
Xiao (Jesse) Si Zhe, Database Developer
Chen Qi, Shenzhen Office.
@GigaScience
facebook.com/GigaScience
http://gigasciencejournal.com/blog/
Follow us:
www.gigasciencejournal.com
www.gigadb.org
+ Weibo & WeChat
+ HKU MLIM students