Slides accompanying a talk delivered by Dan Gillean at PASIG 2016, held at the Museum of Modern Art in New York, NY October 26-28, 2016.
These slides explore the roles that standards play in digital preservation, and introduce some of the key standards that Archivematica was designed with in mind, and which the system uses to help you capture technical, preservation, and administrative metadata when generating Archival Information Packages (AIPs) and Dissemination Information Packages (DIPs).
For more information about Archivematica, see: https://www.archivematica.org
Avoiding the 927 Problem: Standards, Digital Preservation, and Communities of Practice
1. Avoiding the 927 Problem:
Standards, Digital Preservation, and Communities of Practice
Dan Gillean
PASIG NYC 2016
October 26, 2016
2. What is a standard?
•A model or basis of comparison
•An agreed-upon set of characteristics,
definitions, and/or practices
•A minimum acceptable benchmark allowing for
quantitative or qualitative judgement
http://www.cas.edu/
3. De Jure vs De Facto
De Jure
• “According to law,” “By right”
• Declared to be standards by an
authority
• Top-down distribution
• Can be formalized from de facto
standards; can become de facto
as well via adoption
• Generally open
De Facto
• “In reality,” “As a matter of fact”
• Grow to be standards via
adoption
• Dependent on market or
community uptake
• Can become de jure standard
• Can be open or closed
4. Open vs Proprietary
•Open can sometimes just refer to availability – royalty free
•Open source: community-driven, open exchange of ideas
•Open proprietary: Privately developed or owned but
freely available for implementation
•Closed proprietary: Privately developed/owned, must pay
licensing fee to implement
7. Communities of practice
•Shared craft, domain, or profession
•Shared common interest in improvement
•Established via mutual engagement, joint
enterprise, and shared repertoire
Crowd, by James Cridland. https://www.flickr.com/photos/jamescridland/613445810
8. Designated community:
• An identified group of potential
Consumers who should be able to
understand the preserved information
“Since a key purpose of an OAIS is to
preserve information for a Designated
Community, the OAIS must understand the
Knowledge Base of its Designated
Community to understand the minimum
Representation Information that must be
maintained.“ (p. 2-4)
9. Standards are only useful if
we use them
http://www.salon.com/2016/06/16/black_holes_are_colliding_scientists_confirm_ripples_in_spacetime_partner/
10. Standards are only useful if
we use them
https://commons.wikimedia.org/wiki/File:Snowflake_01.svg
Special! Special! Special!
11. Standards are only useful if
we use them
The 927 problem:
https://xkcd.com/927/
14. ISO 14721
A reference model – not a
systems architecture!
https://wiki.archivematica.org/Overview
15. • Governance
• Organizational structure
• Staffing
• Procedural accountability
• Preservation policy framework
• Documentation
• Financial sustainability
• Security
ISO 16363
Reminds us that much of digital
preservation readiness is not technical
– it’s organizational
18. What is Archivematica?
Archivematica is a web-
and standards-based,
open-source application
which allows your
institution to preserve
long-term access to
trustworthy, authentic
and reliable digital
content.
Standards based
Open source
Customizable
Integrated w 3rd
party systems
Active community
20. PREMIS in METS XML
Archivematica AIP structure
Packaged according to BagIt specifications
Virus scan, normalization report, extraction log, etc
For browsing in Archivematica
Original + normalized
objects, submission
docs, original metadata
included at SIP creation
21. • Originally developed for exchange between
California Digital Library and Library of
Congress; specifications written up by IETF in
2008
• System agnostic, interoperable format for
storage and exchange
• “Bag and tag” approach: mandatory tag file
contains a manifest listing every file in the
payload together with its corresponding
checksum
BagIt
BagIt is a hierarchical file packaging format
designed to support disk-based or network-
based storage and transfer of arbitrary digital
content.
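The "bag and tag" layout described on this slide can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not Archivematica's packaging code: the function names `make_minimal_bag` and `verify_bag`, the choice of SHA-256, and the version string 0.97 are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def make_minimal_bag(bag_dir: Path, payload: dict[str, bytes]) -> list[str]:
    """Write payload files under data/, plus a bag declaration and a manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_name, content in sorted(payload.items()):
        target = data_dir / rel_name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # One manifest line per payload file: checksum, then the path in the bag.
        manifest_lines.append(f"{digest}  data/{rel_name}")
    # Bag declaration tag file (version string is an assumption for this sketch).
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return manifest_lines


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "example_bag"
        make_minimal_bag(bag, {"objects/report.txt": b"preserve me"})
        print(verify_bag(bag))
```

Because the manifest is plain text and the payload is an ordinary directory tree, any tool that can hash a file can check the bag, which is the sense in which the format is system agnostic.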
22. • It provides a wrapper for other metadata, such
as PREMIS and Dublin Core.
• It defines relationships between digital objects
and other digital objects, and between digital
objects and their metadata.
• It can be used to provide technical metadata
about digital objects (although Archivematica
doesn’t implement it that way: we wrap PREMIS
in it instead)
METS
METS, or Metadata Encoding and
Transmission Standard, was designed to
support inter-repository data exchange.
23. • It captures technical information about an object in order
to support the implementation of preservation strategies
such as normalization, migration or emulation (PREMIS
Object)
• It describes relationships between digital objects (PREMIS
Object)
• It provides an audit trail of actions taken by the digital
preservation repository to preserve the object (PREMIS
Event)
• It names the individuals, organizations and software tools
responsible for taking actions to preserve digital objects
(PREMIS Agent)
• It specifies the actions a repository is allowed to take to
preserve digital objects (PREMIS Rights)
PREMIS
PREMIS, or Preservation Metadata
Implementation Strategies, is the
recognized standard for metadata
about objects in a digital
preservation system.
It’s difficult to come up with a broad definition of standards without straying into the uselessly general, but essentially, standards give us a means and method for evaluation, comparison, and use. They are a descriptive declaration of a set of features or characteristics, with which we can measure an implementation.
De Jure example: ISO 8601, the International standard for date and time representations (YYYY-MM-DD). Its purpose is to provide an unambiguous and well-defined method of representing dates and times, especially in an international context where national and local conventions may vary greatly.
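To make the ambiguity concrete, here is a small Python sketch using only the standard library. The sample date string is invented for illustration; the point is that a slash-separated local convention can be read two ways, while the ISO 8601 form has one reading.

```python
from datetime import date, datetime, timezone

# A local convention: is this October 11 or November 10?
ambiguous = "10/11/2016"
as_month_first = datetime.strptime(ambiguous, "%m/%d/%Y").date()
as_day_first = datetime.strptime(ambiguous, "%d/%m/%Y").date()

# ISO 8601 fixes the order as year, month, day, so there is one reading.
opening_day = date.fromisoformat("2016-10-26")
iso_text = opening_day.isoformat()

# Date-times can carry an explicit UTC offset as well.
iso_stamp = datetime(2016, 10, 26, 14, 30, tzinfo=timezone.utc).isoformat()

if __name__ == "__main__":
    print(as_month_first, as_day_first)
    print(iso_text)
    print(iso_stamp)
```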
De Facto example: VHS format for videotape recorders, which won out over Betamax not because it was a better specification, but thanks to broader market adoption.
Open Source example: PCDM (the Portland Common Data Model) is quickly becoming a de facto, community-driven standard for Hydra implementers.
Example of evolution: the PDF format was developed in the early 1990s as a way to share documents with embedded fonts and images across diverse computer platforms. Developed first internally at Adobe as a closed proprietary standard, it was quickly released as an open proprietary standard in 1993. Through wide adoption, it became a de facto standard throughout the late 1990s and 2000s. In 2008 it was formally released as an open standard and adopted as ISO 32000-1:2008, making it an open de jure standard.
Within the practice of digital preservation, this is most useful for considering its implications across space and time – standards provide a method of contextualizing and interpreting our data and our metadata so it can be understood and used by others.
But let’s not forget that standards are not a Rosetta stone – they often come rife with presuppositions about the knowledge base of the reader. Digital preservation is a complex field full of jargon, and concepts requiring time and training to acquire.
So to whom exactly do our standards communicate across space and time, then?
I find it useful to think of the utility of standards within the framework of a community of practice. Originally coined as a pedagogical and social anthropology term, a community of practice refers to a group of people united via a shared craft, domain, or profession, with common goals and an interest in improvement. The term was first popularized by Jean Lave and Etienne Wenger, and it is useful to consider digital preservation as a domain bounded by a community of practice – and to conceptualize our standards as both an expression of this community, and an outcome of its shared goals.
In fact, in one of the touchstone standards of our field – the OAIS reference model, now recognized as ISO 14721 – we frame the long-term goals and intelligibility of our digital preservation efforts within a concept very similar to a community of practice. The reference model speaks of a “Designated Community”: those to whom the preserved information should remain understandable, based on their presupposed knowledge base. Standards - being a useful tool for evaluation, comparison, and use - therefore comprise a key part of the knowledge base that we will require to make the information we preserve accessible and comprehensible in the future.
All of this comes down to stating the obvious – Standards are only useful to a community of practice if we use them – correctly, and consistently, across time and space. Failing to do so can be akin to relegating our materials to a black hole – without the proper context and framework to interpret and evaluate the preserved information, how can we guarantee the information will be accessible and intelligible in the future?
There are many reasons why digital preservation standards might NOT be used. At the institution level, one common culprit is the special snowflake effect: “our records, our workflows, our needs are so unique and specialized, the existing standards cannot possibly meet our needs.” This can lead to custom metadata profiles and bespoke systems. More knowledge required for preservation becomes siloed to specific individuals or systems; the burden of documentation is higher, and the efforts required to migrate environments or share access across institutional boundaries become increasingly challenging.
Within our community of practice, however, we can sometimes make perfect the enemy of the good – or good enough. This is the 927 problem: seeking the magic bullet format, technology, or standard that will supersede all previous efforts and bring about a golden age of universal adoption. The 927 problem is the reinvention of the wheel, over and over again, sometimes at the expense of previous efforts. It can often mean just adding one more option to a crowded field, and further splitting our efforts along parallel but separate paths.
So what standards should we be using then? How can we evaluate?
If we expect our standards to be available for use and evaluation in the future, we should choose open standards – favoring openness is digital preservation best practice, from standards to formats to tools, and so on. Ideally they will be non-proprietary as well, to ensure they remain open. A standard need not be De Jure for it to be used in digital preservation, but we want to ensure that we’ve brought our collective expertise to bear in evaluating its validity and utility towards achieving our stated goals of long term preservation and access. And finally, no standard we adopt should force us to use a single tool or platform.
With these criteria in mind, let’s take a look at just a few standards we can use in service of the shared goals of our community of practice.
The starting point for any digital preservation standardization has become the OAIS Reference Model, AKA ISO 14721, and TRAC, or ISO 16363. Both OAIS and TRAC have become De Jure ISO standards with widespread adoption in our community of practice. OAIS provides us with a reference model for the functions and activities we need to consider for creating a comprehensive digital preservation environment, while TRAC supplements this with a series of metrics and requirements for auditing, monitoring, and evaluation needed to achieve full OAIS compliance.
It’s worth quickly noting that OAIS is NOT a systems architecture, but a conceptual model. In practice, it is highly unlikely and possibly even undesirable to consider your preservation environment as a single monolithic system. Instead, it will be many different tools, platforms, and locations, each able to do one job well instead of many tasks poorly.
Equally important to note is that digital preservation is not all tools and systems – much of it is organizational, covering internal policies and procedures, workflow documentation and accountability chains, mission statements, budgeting, staffing and succession planning. Regardless of your resources or the technical expertise you have in-house, considering and prioritizing these important aspects means that you can start working on digital preservation today.
Section 4 of TRAC is where things start getting really technical, and where many institutions shake their heads and defer action. How should we document every action taken during the preservation process? How should we capture all agents, both human and machine, involved in the process? What do we use to extract all the necessary technical, administrative, and preservation metadata, and how can we encode this in a standards-based way for it to be reusable and interoperable?
I’m going to talk about a few standards that can help you do just that, in the context of Archivematica.
Archivematica is an open-source digital preservation system that attempts to support standards-based workflows and outputs. Think of it as a standards-based sausage maker for generating what the OAIS reference model refers to as SIPs, AIPs, and DIPs – you provide the content to be preserved (the filling) and implement format policies based on your institutional needs, and Archivematica adds the standards-based “casing”, generating administrative, technical, and preservation metadata that is platform independent and storage agnostic.
Archivematica’s web-based dashboard was designed with the OAIS reference model in mind. There are no magical turnkey solutions to digital preservation (and you should be wary of anyone promising such), but Archivematica can help cover some of the more technical aspects of creating your preservation workflow in a standards-based manner.
Here’s a brief overview of an Archivematica Archival Information Package, or AIP. We package all AIPs according to the Library of Congress BagIt specification, and capture all relevant technical, administrative, preservation, and descriptive metadata using PREMIS, embedded in the METS XML included with each AIP and DIP, or Dissemination Information Package. The AIP is platform agnostic and interoperable – you choose your repository environment for long-term storage. There’s nothing about Archivematica’s AIPs or DIPs that requires Archivematica to open them in the future.
Let’s quickly look a bit closer at each of these standards I’ve referenced.
Here’s a quick look at how we embed PREMIS in METS. METS provides us with different sections for descriptive and administrative metadata, and within the administrative section we can embed PREMIS objects, rights, events, and agents.
Here we can see an example of how PREMIS object-level metadata is nested in the METS techMD. We simply use a wrapping element in METS, declare the standard used, and embed the PREMIS XML inside of it.
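For readers without the slide in front of them, the wrapping pattern can be sketched with Python's standard XML library. This is an illustration only, not Archivematica's actual output: the helper name `wrap_premis_object`, the ID value, and the use of the PREMIS 3 namespace are assumptions, and a schema-valid PREMIS object would need more elements than the single identifier shown here.

```python
import uuid
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
PREMIS_NS = "http://www.loc.gov/premis/v3"  # assumed PREMIS version for this sketch

ET.register_namespace("mets", METS_NS)
ET.register_namespace("premis", PREMIS_NS)


def q(namespace: str, tag: str) -> str:
    """Build a namespace-qualified tag name for ElementTree."""
    return f"{{{namespace}}}{tag}"


def wrap_premis_object(tech_md_id: str, object_uuid: str) -> ET.Element:
    """Nest a minimal PREMIS object inside METS techMD > mdWrap > xmlData."""
    tech_md = ET.Element(q(METS_NS, "techMD"), {"ID": tech_md_id})
    # mdWrap declares which standard the wrapped metadata follows.
    md_wrap = ET.SubElement(tech_md, q(METS_NS, "mdWrap"), {"MDTYPE": "PREMIS:OBJECT"})
    xml_data = ET.SubElement(md_wrap, q(METS_NS, "xmlData"))
    premis_object = ET.SubElement(xml_data, q(PREMIS_NS, "object"))
    identifier = ET.SubElement(premis_object, q(PREMIS_NS, "objectIdentifier"))
    ET.SubElement(identifier, q(PREMIS_NS, "objectIdentifierType")).text = "UUID"
    ET.SubElement(identifier, q(PREMIS_NS, "objectIdentifierValue")).text = object_uuid
    return tech_md


if __name__ == "__main__":
    element = wrap_premis_object("techMD_1", str(uuid.uuid4()))
    print(ET.tostring(element, encoding="unicode"))
```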
Here we have a real example of a PREMIS normalization event as captured in Archivematica’s METS XML – we capture information about the format policy used, the type of event, the tool used, the outcome of the action, and the agents involved in the event.
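The shape of such an event can be approximated in Python as follows. This is a hedged sketch rather than a copy of Archivematica's output: the function name `build_normalization_event` and every sample value passed to it are hypothetical, and the element layout assumes PREMIS 3, where the event detail sits inside an eventDetailInformation container.

```python
import uuid
import xml.etree.ElementTree as ET

PREMIS_NS = "http://www.loc.gov/premis/v3"  # assumed PREMIS version for this sketch
ET.register_namespace("premis", PREMIS_NS)


def p(tag: str) -> str:
    """Build a PREMIS-qualified tag name for ElementTree."""
    return f"{{{PREMIS_NS}}}{tag}"


def build_normalization_event(event_uuid, when, detail, outcome, agent_type, agent_value):
    """Assemble a minimal PREMIS event describing one normalization action."""
    event = ET.Element(p("event"))
    identifier = ET.SubElement(event, p("eventIdentifier"))
    ET.SubElement(identifier, p("eventIdentifierType")).text = "UUID"
    ET.SubElement(identifier, p("eventIdentifierValue")).text = event_uuid
    ET.SubElement(event, p("eventType")).text = "normalization"
    ET.SubElement(event, p("eventDateTime")).text = when
    # Free-text detail: a natural place to record the tool or policy applied.
    detail_info = ET.SubElement(event, p("eventDetailInformation"))
    ET.SubElement(detail_info, p("eventDetail")).text = detail
    outcome_info = ET.SubElement(event, p("eventOutcomeInformation"))
    ET.SubElement(outcome_info, p("eventOutcome")).text = outcome
    # Link the event to the agent (person, organization, or software) responsible.
    agent = ET.SubElement(event, p("linkingAgentIdentifier"))
    ET.SubElement(agent, p("linkingAgentIdentifierType")).text = agent_type
    ET.SubElement(agent, p("linkingAgentIdentifierValue")).text = agent_value
    return event


if __name__ == "__main__":
    # All values below are placeholders invented for the example.
    example = build_normalization_event(
        event_uuid=str(uuid.uuid4()),
        when="2016-10-26T14:30:00+00:00",
        detail="example conversion tool, hypothetical version",
        outcome="success",
        agent_type="preservation system",
        agent_value="example-system",
    )
    print(ET.tostring(example, encoding="unicode"))
```

In a METS file, an event like this would sit in the administrative section using the same wrapping approach as the object metadata, with the wrapper declaring PREMIS:EVENT as its metadata type.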