MARC records for archived websites on the Archive of Tomorrow project / Mark Haydn (National Library of Scotland) and Agnieszka Kurzeja (Cambridge University Library).
The presenters will discuss the metadata components of the National Library of Scotland-led ‘Archive of Tomorrow’ project, an 18-month multi-institutional collaboration focusing on capturing health resources online. Metadata work to be discussed includes the creation of a crosswalk to transform metadata produced in The British Library’s web archiving platform (ACT) into functioning MARC records, as well as subsequent enhancement work. Enhancements tested on the project included augmenting ACT metadata to generate authorised LCNAF headings; extending metadata using Wikidata and VIAF, ISNI and LC reconciliation services; and evaluating ‘automatic’ subject heading assignment at scale, experimenting with the National Library of Finland’s AI project ‘ANNIF’ as well as other bespoke approaches. In addition to outlining the development and status of this work, the presentation will touch on project challenges and limitations, and the presenters’ experiences getting to grips with new platforms while testing ANNIF.
In addition to discussing the technical elements of the work performed, other strands of the work relevant to conference themes - from performing authority control outside of traditional platforms to making progress with linked data - will be open for discussion/Q&A. Other areas of the project work suitable for incorporation in the presentation include:
• The incorporation of Content Advisories in records for websites that might contain sensitive content, relaying the findings of a literature review conducted by Mark and project Rights Officer Jasmine Hide.
• Our dependence on parallel projects elsewhere, with reference to development work at the BL, user communities online, and ANNIF and Wikidata use across the field.
• The dynamics of multi-institutional project work, in this case performed remotely by dedicated and seconded project staff, touching on learning new skills, reporting findings, and seeking additional support.
Paper presented at the Metadata & Discovery Group Conference & RDA Day (6th - 8th Sept 2023 at IET Austin Court, Birmingham)
1. National Library of Scotland
Leabharlann Nàiseanta na h-Alba
MARC records for archived websites
on the Archive of Tomorrow Project
Mark Simon Haydn, Metadata Analyst, Archive of Tomorrow project, National Library of Scotland
Agnieszka Kurzeja, Metadata Co-ordinator, Cambridge University Library
CILIP Metadata & Discovery Group Conference 2023
#CILIPMDG2023
2. Archive of Tomorrow project
• 18-month NLS-led collaboration between Legal Deposit libraries to collect a wide range of health discourse online, improving access to website captures available through the UK Web Archive
• Initial collecting focus on a wide range of COVID-19 resources, expanding to provide dedicated subcollections on wide-ranging health topics
• Project team including Web Archivists at NLS, CUL, Bodleian & University of Edinburgh as well as Project Manager, Rights Officer and Metadata Analyst; project also appointed two AoT research fellows and collaborated with Cambridge University Library Metadata Co-ordinator
• Collection available at webarchive.org.uk/en/ukwa/collection/4028 and data.nls.uk; research workshops held at NLS, Edinburgh University, Cambridge University Library
3. webarchive.org.uk
5. Open Access
Onsite (LDL) access only
6. Collection (“Wellbeing”) & Target (“Adopting Positivity Substack”) metadata in JSON format
- Derivative metadata for researchers (data.nls.uk)
- Repurposed to populate catalogue records
- Licensed for reuse
Wellbeing
Blogs and Social Media
Talking about Health
Health Organisations and Services
Medicine & Health
7.
• No in-built ACT metadata export available; first test records populated with TSV exports manually generated by BL
• BL developed API to enable metadata requests on demand, standardising output of ACT target and collection metadata
• Previous NLS experience crosswalking volunteer ISBD input into minimum viable bib record
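The crosswalk idea on this slide can be sketched in Python. This is an illustration only, not the project's actual workflow: the JSON key names (`title`, `url`, `description`, `language`), the placeholder URL, and the choice of MARC tags are all assumptions made for the example, and the real ACT API output is not reproduced here.

```python
import json

# Hypothetical ACT-style target record. The key names and the URL are
# assumptions for illustration, not the real API schema.
SAMPLE_TARGET = json.loads("""
{
  "title": "Adopting Positivity Substack",
  "url": "https://example.org/",
  "description": "Short description of the target.",
  "language": "EN"
}
""")


def crosswalk(target):
    """Map a target dict to a minimal list of (tag, indicators, subfields)."""
    fields = []
    if target.get("language"):
        # Raw value is carried over as-is; a later normalisation step
        # could turn "EN" into the MARC language code "eng".
        fields.append(("041", "  ", [("a", target["language"])]))
    fields.append(("245", "00", [("a", target["title"])]))
    if target.get("description"):
        fields.append(("520", "  ", [("a", target["description"])]))
    fields.append(("856", "40", [("u", target["url"])]))
    return fields


def to_text(fields):
    """Render fields in a simple line-per-field text form for inspection."""
    lines = []
    for tag, indicators, subfields in fields:
        body = "".join(f"${code}{value}" for code, value in subfields)
        lines.append(f"{tag} {indicators} {body}")
    return "\n".join(lines)


print(to_text(crosswalk(SAMPLE_TARGET)))
```

Keeping the crosswalk as a plain mapping step and leaving clean-up to a separate normalisation pass mirrors the two-stage approach described in the slides.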
8.
HMSO crosswalk and normalisation rule for volunteer cataloguing, developed by Carol Hunter and Ian Horobin
AOT crosswalk and AOT normalisation rule (Drools)
9. Excel transformations
(008)
https://www.oclc.org/content/dam/research/publications/2018/oclcresearch-wam-recommendations.pdf
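Building the fixed-length 008 field is the kind of string assembly that spreadsheet formulas are often used for. The Python sketch below shows the same idea. It relies only on the MARC 21 layout in which positions 00-17 and 35-39 are common to all material types and positions 18-34 depend on the material type. The specific values in the usage example (date type `c`, place code `stk`, a fill-character block for 18-34) are illustrative assumptions, not the project's actual coding decisions.

```python
from datetime import date


def build_008(entered, date_type, date1, date2, place, material, lang,
              modified=" ", source="d"):
    """Assemble a 40-character MARC 21 008 field.

    `material` is the ready-made 17-character block for positions 18-34,
    whose meaning depends on the material type.
    """
    if len(material) != 17:
        raise ValueError("material-specific block must be 17 characters")
    parts = [
        entered.strftime("%y%m%d"),  # 00-05 date entered on file
        date_type,                   # 06    type of date
        date1.ljust(4),              # 07-10 date 1
        date2.ljust(4),              # 11-14 date 2
        place.ljust(3),              # 15-17 place of publication
        material,                    # 18-34 material-specific positions
        lang,                        # 35-37 language
        modified,                    # 38    modified record
        source,                      # 39    cataloguing source
    ]
    field = "".join(parts)
    if len(field) != 40:
        raise ValueError(f"008 must be 40 characters, got {len(field)}")
    return field


# Illustrative values only; "|" is the MARC fill character (no attempt to code).
example = build_008(date(2023, 9, 6), "c", "2022", "9999", "stk",
                    "|" * 17, "eng")
print(example)
```

The length check is the main safeguard: a single missing character shifts every later position and silently changes the meaning of the field.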
10. Normalisation rules (Drools)
replaceControlContents "LDR.{6,1}" with "m"
replaceContents "041.a.EN" with "eng" if (exists "041.{0,*}.a.EN")
addControlField "007.cr#cnu###zznzz"
addField "040.{-,-}.a.StEdNL" if (not exists "040.a")
addField "336.{-,-}.a.text"
addSubField "336.{-,-}.b.txt"
removeField "362" if (exists "362.{-,-}.a.REF|N/A|VALUE!")
changeField "265" to "264"
Examples at https://developers.exlibrisgroup.com/blog/alma-normalization-rule-examples
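For readers unfamiliar with the rule syntax, the effect of a few of the fragments on this slide can be imitated in plain Python. This is a rough analogue written for illustration, not how the rules actually run. The record structure (a list of `(tag, subfields)` pairs) is an assumption, and so is treating `REF|N/A|VALUE!` as a pattern that matches anywhere in the value, for example in spreadsheet error strings such as `#N/A`.

```python
import re

# Pattern taken from the removeField rule. Matching it anywhere in the
# value is an assumption made for this sketch.
PLACEHOLDER = re.compile(r"REF|N/A|VALUE!")


def normalise(fields):
    """Apply a few normalisation steps to a list of (tag, subfields) pairs."""
    out = []
    for tag, subfields in fields:
        subfields = dict(subfields)  # copy so the input is not modified
        # replaceContents "041.a.EN" with "eng"
        if tag == "041" and subfields.get("a") == "EN":
            subfields["a"] = "eng"
        # removeField "362" if its $a holds a spreadsheet error value
        if tag == "362" and PLACEHOLDER.search(subfields.get("a", "")):
            continue
        # changeField "265" to "264"
        if tag == "265":
            tag = "264"
        out.append((tag, subfields))
    # addField "040.{-,-}.a.StEdNL" if (not exists "040.a")
    if not any(tag == "040" and "a" in sf for tag, sf in out):
        out.append(("040", {"a": "StEdNL"}))
    return sorted(out, key=lambda field: field[0])


record = [
    ("041", {"a": "EN"}),
    ("265", {"a": "Edinburgh"}),
    ("362", {"a": "#N/A"}),
]
for tag, subfields in normalise(record):
    print(tag, subfields)
```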
12. Enhancements
Variety in how creator organisations are described:
NHS, N.H.S., National Health Service
NLS Web Archivist Eilidh MacGlone assigns Wikidata QIDs during QC workflow:
QID added to unused ACT field
↳ VIAF ID extracted from Wikidata entry
↳ Linked LC/NACO authority record reconciled using OpenRefine
↳ Authorised name cropped and paired with JSON URI, with ISNIs where available
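The step of pulling identifiers out of a Wikidata entry can be sketched as follows. It assumes the entity has already been fetched as Wikibase JSON and uses the Wikidata properties P214 (VIAF ID) and P213 (ISNI). The sample entity, its QID and its identifier values are made-up placeholders, and the function name is hypothetical. The slides do not say which tooling the project used for this step.

```python
# Placeholder entity in the shape of Wikibase entity JSON.
# The QID and identifier values are made up for illustration.
SAMPLE_ENTITY = {
    "id": "Q0",
    "claims": {
        "P214": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P214",
                    "datavalue": {"value": "000000000", "type": "string"},
                }
            }
        ]
    },
}


def external_ids(entity):
    """Return the first VIAF ID and ISNI found on an entity, or None."""
    claims = entity.get("claims", {})

    def first_value(prop):
        for statement in claims.get(prop, []):
            snak = statement.get("mainsnak", {})
            # Skip "novalue"/"somevalue" snaks, which carry no datavalue.
            if snak.get("snaktype") == "value":
                return snak["datavalue"]["value"]
        return None

    return {
        "viaf": first_value("P214"),  # P214 = VIAF ID
        "isni": first_value("P213"),  # P213 = ISNI
    }


print(external_ids(SAMPLE_ENTITY))
```

Returning `None` for a missing identifier matches the "where available" caveat on the slide: not every organisation will have an ISNI recorded.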
13. LCSH & FAST analogues for ACT Collection and Subject terms, developed by Agnieszka Kurzeja, Metadata Co-ordinator, Cambridge University Libraries
14. Searching for Library of Congress Subject Headings
16. FAST Conversion
17.
FAST dataset download @ OCLC + prepared RDF files from National Library of Wales
Short-form target descriptions paired with target URIs
WARC -> WAT full text?
⚠️
👷🚧👷
Loading FAST.nt vocab using Docker, Bash
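Before loading a vocabulary file such as FAST.nt, it can be useful to check which URIs and labels it contains. The sketch below reads N-Triples lines and collects `skos:prefLabel` values. Whether the FAST download uses exactly this predicate is an assumption, the sample lines use placeholder example.org URIs, and escape sequences inside literals are left undecoded to keep the example short.

```python
import re

PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"

# Matches triples of the form: <subject> <predicate> "literal" .
# with an optional language tag or datatype after the literal.
TRIPLE = re.compile(
    r'^<([^>]+)>\s+<([^>]+)>\s+"((?:[^"\\]|\\.)*)"'
    r'(?:@[A-Za-z-]+|\^\^<[^>]+>)?\s*\.\s*$'
)


def pref_labels(lines):
    """Return {subject URI: preferred label} from N-Triples lines."""
    labels = {}
    for line in lines:
        match = TRIPLE.match(line)
        if match and match.group(2) == PREF_LABEL:
            labels[match.group(1)] = match.group(3)
    return labels


# Placeholder triples; real FAST URIs and labels are not reproduced here.
SAMPLE = [
    '<http://example.org/fast/1> <http://www.w3.org/2004/02/skos/core#prefLabel> "Health" .',
    '<http://example.org/fast/2> <http://www.w3.org/2004/02/skos/core#prefLabel> "Mental health"@en .',
    '<http://example.org/fast/1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .',
]

print(pref_labels(SAMPLE))
```

For a full vocabulary file, the same function can be fed an open file object line by line, so the whole file never needs to be held in memory.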
20. Challenges
Two stumbling blocks for using ANNIF at scale:
- Requires wide spread of tech skills to prepare vocabulary files, train engine, run at command line (eased by ANNIF Google Group: https://groups.google.com/g/annif-users)
- Most effective use would involve easy access to target full text (WARC-derivative WAT); currently only available at target level
Findings
Accessibility as priority, improving discovery of web archives through catalogue
Value of minimal viable records, data normalisation facilitating creation of RDA-compliant MARC records at scale