This document summarizes the workflows used by the American Archive of Public Broadcasting (AAPB) for preserving and providing access to historical public media content. It describes the multi-step processes for appraising, acquiring, ingesting, describing, digitizing where needed, and making accessible collections from public media organizations. Key aspects of the workflows involve coordinating with content donors, normalizing metadata, digitizing physical media, performing quality control checks, storing master and access files, and reviewing content for inclusion in the online reading room. The workflows involve collaboration between AAPB teams at WGBH and the Library of Congress.
3. a collaboration between
WGBH and the Library of Congress
Seeking to preserve and make accessible significant
historical content created by public media, and to
coordinate a national effort to save at-risk public media
before its content is lost to posterity
7. Be a focal point for discoverability of historical public media content;
Coordinate a national effort to preserve and make accessible historical public media content
Provide content creators with standards and best practices, guidance, training, and advice for storing,
processing, preserving, and making accessible their historical content, and for raising funds in order to
accomplish these tasks;
Disseminate content widely by facilitating the use of archival public media content by scholars, educators,
students, journalists, media producers, researchers, and the public, for the purpose of learning, informing,
and teaching;
Increase public awareness of the significance of historical public media and the need to preserve and make
accessible significant public broadcasting programs; and
Ensure the perpetuation of the archive by working toward financial sustainability.
Mission
9. Identified over 3 million items kept at
stations, archives, producers, and university
collections across the country
Collected 2.5 million inventory records
from 120 stations
Digitized and ingested 40,000 hours of
material initially from over 100 stations
– 5,000 hours from born digital files
Initial Collection
10. Collection growth
• Growing the collection by up to 25,000 hours of digitized content per year
• Assisting collection holders with digitization grant proposals and ingesting
digital files into our systems
• Recent acquisitions
– PBS NewsHour and predecessor series
– American Masters raw interviews
– Ken Burns’ The Civil War raw interviews
– Eyes on the Prize raw interviews
– NHPR presidential primary collection
– KBOO community radio programs
– NPACT coverage of Senate Watergate Hearings
– Southern California Public Radio environmental collection
– Vision Maker Media films
11. Goal: A Centralized Web Portal for
Discovery
• All AAPB digitized content on specific topics
discoverable through single searches
• Direct links to public media on other sites
• One-stop shopping for users
• Helps solve the separate silos syndrome
• DPLA as a model
12. “Preservation through Collaboration”
• within our WGBH AAPB team
• with the Library of Congress project partners
• with content creators/donors
• with legal counsel & Berkman Klein Center
• with digitization vendors
• with our marketing team
• with technical development partners
• with LIS programs
• with scholars
13. The Challenges at Hand
• Too much material, too few resources
• Access and rights
• Technology moving too fast
• Maintaining relationships with many donors
• Copyright
26. Louisiana Public Broadcasting
• Statewide PBS affiliate
except for New Orleans
• Headquartered in Baton
Rouge
• Went on the air on
September 6, 1975
27. American Archive
• Participating station
since 2009
• Created first
comprehensive
inventory
• Digitized 550 hours of
at-risk media
28. Louisiana Digital Media Archive
• Collaborative project
with the Louisiana
State Archives
• Launched on
January 20, 2015
• Available at
ladigitalmedia.org
29. National Digital Stewardship Residency
• Hired Eddy Colloton as AAPB NDSR resident
• Documented digitization workflow
• Created digital preservation plan
• Updated digitization workflow based on Eddy’s
recommendations
33. AAPB Access Workflow
• One of the first AAPB stations to use the AAPB
as a portal
• Send updated metadata and the LDMA link to
AAPB in a .csv file
• Do not send in digital files from in-house
digitization project
35. Not a Perfect Solution
• AAPB records are entered at the episode level
• We catalog our newsmagazines to the segment
level
• Will work with AAPB team to find a solution
37. Incoming Collections
• Working with multiple digitization vendors, including those
contracted by AAPB and contracted by contributing
organizations, to receive content into AAPB
• Also acquiring born digital and previously digitized content
submitted to us by donors
• Requires AAPB to be flexible in workflows and acceptance
criteria
39. Appraisal – Three types of projects
• AAPB identifies collection to be preserved, guided by Collection Development
Policy
– AAPB works with content creator/steward on digitization grant proposal (unless already
digitized/born digital)
– Content creator confirms total number of hours and assets to be delivered, provides item-level
appraisal inventory
– AAPB team determines when the collection can be acquired and plans project workflow
• Content creator contacts AAPB and is seeking to preserve collection
– Content creator provides total number of hours and assets to be delivered, provides summary
of collection
– AAPB team determines when the collection can be acquired and plans project workflow
• Collaborating archive has already preserved content and made it accessible, and
wishes to aggregate it
– Content creator provides total number of metadata assets to be delivered, provides summary
of collection. Note: in this workflow the recordings are not preserved at WGBH and LOC.
40. Collection Development Criteria
• Unique content
• Content has not been widely distributed
elsewhere or preserved elsewhere
• Content created and owned by station
• Content is at-risk due to its condition
• Comprehensiveness of the collection
• Content documents events, topics, places,
persons, opinions, or attitudes of historical,
cultural, political, sociological, anthropological,
scientific, educational, technological, or
aesthetic significance
• Content reflects significant international,
national, regional, state, or local culture, politics,
or society; or presents the viewpoints of
indigenous communities, subcultures, societal
groups, or population segments
• Content documents unique aspects of the style
and practice of radio and television journalism
• Older content
• Content with a significant impact when first
broadcast
• Content that does not merely illustrate material
available elsewhere in other types of media, such
as text or photographs, but includes unique
content not found in other sources
• Content that has received awards
• Raw footage, including interviews, that is
unique and represents significant historic events,
or some unique aspect of the local community
• Content that could support educational initiatives
• Content that the organization would allow the
American Archive of Public Broadcasting to make
available in the AAPB Online Reading Room
41. Contracting
• Organization must agree to and sign AAPB’s Deed of Gift agreement
– Donor donates rights, title and interest in digital copies
– Donor confirms current copyright ownership and control (donor controls all, some or
no rights)
– Donor assigns rights to AAPB
a. Assignment of copyright to AAPB
b. Dedication to public domain
c. Donor grants AAPB an irrevocable, non-exclusive, royalty-free worldwide perpetual license for
AAPB’s discretionary uses of the Donated Materials, in addition to all uses permitted by law.
Such discretionary uses may include but are not limited to cataloging, preservation, copying
and migration for preservation and access purposes, exhibition, display, and making works
available for non-commercial public access (including online), in accordance with AAPB policy
and with applicable law.
– Re-use by patrons
a. Donor may select among any Creative Commons license
b. Donor does not authorize AAPB to make Donated Materials available for re-use by patrons
– All metadata is made available in the public domain
42. WGBH Software and Systems
• Mac computing environment
• PBCore metadata standard
• Archival Management System
(metadata repository)
– PHP app on MySQL database
– MINT (data ingestion and crosswalk
software)
– PBCore API (metadata normalization
software)
– BagIt (packaging metadata for ingestion)
• Sony Ci (media host)
– Ruby scripts to batch upload / delete
– MediaBox to give limited research access
• AmericanArchive.org (public
website)
– Ruby on Rails
– Solr index
– Blacklight frontend
• Amazon S3 and Web Services
(web and document host)
• Google Sheets
• MediaInfo
• FFmpeg
• Sophos
• Terminal
44. Project Input
• Administrative metadata from appraisal inventory
• Digitized media (masters/proxies OR links out)
• Technical metadata
• Descriptive metadata
• Transcripts and/or closed caption files
• Thumbnails
45. Ingestion Workflow: Phase 1
• Provide donor with metadata template and assist in their creation of
appraisal item-level inventory.
• Receive final appraisal inventory with administrative, technical, and
descriptive metadata from donor.
• Normalize the metadata, mapping it to PBCore.
• Ingest metadata into AMS, which creates unique identifiers for each
record.
• Either receive digital files or coordinate transportation of physical tapes to
digitization vendor.
• Export project inventory containing all metadata paired with the new
unique identifiers (GUIDs) from AMS.
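The normalize-and-ingest step above can be sketched in a few lines: one inventory row becomes a minimal PBCore record with a newly minted identifier. The column names, GUID pattern, and field choices here are illustrative assumptions, not the actual AAPB template or AMS logic.

```python
import uuid
import xml.etree.ElementTree as ET

NS = "http://www.pbcore.org/PBCore/PBCoreNamespace.html"
ET.register_namespace("", NS)

def to_pbcore(row):
    # Mint a unique identifier the way AMS does on ingest; the
    # "cpb-aacip-" prefix shape is illustrative only.
    guid = f"cpb-aacip-{uuid.uuid4()}"
    doc = ET.Element(f"{{{NS}}}pbcoreDescriptionDocument")
    ET.SubElement(doc, f"{{{NS}}}pbcoreIdentifier", source="AMS").text = guid
    ET.SubElement(doc, f"{{{NS}}}pbcoreTitle").text = row["title"]
    ET.SubElement(doc, f"{{{NS}}}pbcoreAssetDate").text = row["date"]
    inst = ET.SubElement(doc, f"{{{NS}}}pbcoreInstantiation")
    ET.SubElement(inst, f"{{{NS}}}instantiationPhysical").text = row["format"]
    return guid, ET.tostring(doc, encoding="unicode")

# Hypothetical appraisal-inventory row (real templates have many more columns):
row = {"title": "Evening Report", "date": "1978-05-02", "format": "U-matic"}
guid, xml_record = to_pbcore(row)
print(xml_record)
```

A full, valid PBCore record needs more elements than this; the point is only that each row yields one record carrying the new GUID used to track the asset downstream.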
46. Ingestion Phase 1 [flowchart]
• We provide the donor with the metadata template and assistance.
• “Born Digital” path: donor has digital files (previously digitized or born digital) and sends WGBH the inventory, descriptive metadata, and digital files (or links).
• “Digitization Needed” path: donor has tapes (grant-funded digitization project) and sends WGBH the inventory and descriptive metadata; the tapes go to the digitization vendor, with the inventory matching AMS records to tapes.
47. Ingestion Workflow: Phase 2
Born Digital
• Check for viruses
• Verify inventory vs delivered
content
• Create checksums
• Create proxy files
• Create and upload MediaInfo
technical metadata
• Normalize file names
Digitization
• Send vendor project inventory
• Coordinate digitization with
vendor
• Receive technical metadata
and digitized files
• Check for viruses
• Confirm checksums and QC
files
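The "verify inventory vs delivered content" and "normalize file names" steps for born-digital intake might look roughly like this sketch; the inventory structure and GUID-based naming scheme are assumptions for illustration, not AAPB's actual conventions.

```python
from pathlib import Path

def normalize_name(guid, original_name):
    # Keep the original extension, lowercased; the GUID-based naming
    # pattern is an assumption for illustration.
    ext = Path(original_name).suffix.lower()
    return f"{guid}{ext}"

def reconcile(inventory, delivered_names):
    # inventory maps the donor's original filename to its AMS GUID.
    missing = sorted(set(inventory) - set(delivered_names))
    extra = sorted(set(delivered_names) - set(inventory))
    renames = {name: normalize_name(inventory[name], name)
               for name in set(inventory) & set(delivered_names)}
    return missing, extra, renames

inv = {"Tape 001 Final.MOV": "cpb-aacip-111", "Tape 002.MOV": "cpb-aacip-112"}
missing, extra, renames = reconcile(inv, ["Tape 001 Final.MOV", "stray.wav"])
print(missing)   # Tape 002.MOV was inventoried but not delivered
print(extra)     # stray.wav was delivered but not inventoried
print(renames)   # rename plan mapping originals to GUID-based names
```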
51. Content Flow: Born Digital Projects [flowchart]
• WGBH writes master files to LTO, validates checksums, and puts the LTO in the vault; master files also go to the Library of Congress.
• Proxies are uploaded to Sony Ci and thumbnails to AWS S3, with checksums validated.
• Preservation metadata goes into the Archival Management System (AMS), and a WGBH MARS record is created.
54. Digital Preservation
• WGBH generates or receives SIP checksums
• Create AIP on LTO, and another version on spinning disk.
– Use terminal to copy files and confirm checksums in batch processes.
• Create manifest of checksums on LTO.
• Put LTO in the vault
• Retention schedule assigned to LTO tapes.
• Upload preservation metadata (checksum, LTO number, drive number,
project code) to each AMS record in the collection.
• Put checksum manifests on department server.
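A minimal sketch of the batch checksum-and-confirm process described above, using Python's hashlib in place of the terminal commands; the file and manifest names are illustrative.

```python
import hashlib
import tempfile
from pathlib import Path

def md5_of(path):
    # Hash in 1 MB chunks so large master files never load into memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(src_dir, manifest_path):
    # Checksum every file under src_dir into an md5-style manifest.
    lines = [f"{md5_of(p)}  {p.name}"
             for p in sorted(Path(src_dir).iterdir()) if p.is_file()]
    Path(manifest_path).write_text("\n".join(lines) + "\n")

def confirm_copy(copy_dir, manifest_path):
    # Re-checksum a copy (e.g. the LTO or spinning-disk AIP) against the
    # manifest; returns the names of any files that fail validation.
    failures = []
    for line in Path(manifest_path).read_text().splitlines():
        digest, name = line.split("  ", 1)
        if md5_of(Path(copy_dir, name)) != digest:
            failures.append(name)
    return failures

# Throwaway demonstration; here the "copy" is the same directory.
tmp = Path(tempfile.mkdtemp())
(tmp / "cpb-aacip-123.mkv").write_bytes(b"master bytes")
write_manifest(tmp, tmp / "manifest-md5.txt")
print((tmp / "manifest-md5.txt").read_text())
```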
55. Arrangement and Description
• Project acquisition:
– Varies from project to project; is determined by goals of grant and
planned during appraisal.
– Usually more robust, involves more staff time and item-level
processing.
• Standard acquisition:
– All records uploaded immediately for on-location access.
– Interns perform our minimal viable cataloging workflow.
– Staff conducts Online Reading Room reviews of full series and some
items after acquisition has been ingested.
56. Online Public Access
• Online Reading Room totals more than 23,000 programs
available to anyone in the United States (30% of collection
and growing)
• Online access in accordance with copyright law, including
legal doctrine of fair use
• Access for research, educational and informational
purposes only
• Inclusion in the ORR determined by analysis of types of
programs and examination of individual series and
programs
57. ORR Review Workflow
• Check if series/asset is produced by organization that
signed a quitclaim
• Review the series/asset
• Determine the genre bucket
• If an ORR bucket, check for 3rd party content from
litigious organizations
• Put it in the Online Reading Room! (or don’t)
– If the series/asset is questionable, we have regular
meeting with lawyers for final decisions.
58. Access Workflow
• Run our ruby script to batch upload proxy files to Sony Ci
• Add Sony Ci identifiers and other functional metadata to
corresponding AMS records
• Use FFMPEG script to create thumbnail jpgs from video
proxies
• Upload thumbnails, transcripts, and closed caption files to
Amazon S3
• “Reindex” on americanarchive.org
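The FFmpeg thumbnail step could be sketched like this; the exact flags AAPB's script uses aren't given here, so this shows one common way to grab a single frame as a JPEG. The command is only assembled, and could then be run with subprocess.run(cmd, check=True).

```python
def thumbnail_cmd(proxy_path, out_jpg, seconds=30):
    # Seek to `seconds`, grab exactly one frame, overwrite any existing
    # output. A common recipe, not necessarily AAPB's exact script.
    return ["ffmpeg", "-y", "-ss", str(seconds), "-i", proxy_path,
            "-frames:v", "1", out_jpg]

cmd = thumbnail_cmd("cpb-aacip-123.mp4", "cpb-aacip-123.jpg")
print(" ".join(cmd))
```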
61. Preservation at the Library
• All files ingested into the “deep archive”
– Files written to T10K-C tapes
– Access copies kept at two geographically separate
locations
• Migrations performed every 3-5 years
• Checksums stored in MAVIS database
– Files are periodically validated
– Back-ups are used when problems occur
62. Inside the Data Center at NAVCC [photos: data storage systems; data processing systems]
63. Ingestion Workflows
• New workflows developed to accommodate AAPB material
– Ingestion outside of the normal “ordered” workflow at LC
– Digital files with no physical component
– MAVIS records need to be created before ingestion
– Automate as much as possible
• Metadata
– Ingesting metadata from several different sources
– Clean and map to MAVIS fields
– Issues with differences in how AMS and MAVIS are structured
• Different types of workflows
– Files from vendor vs files from a donor
– Born digital vs digitized content
– LTO vs hard drive
65. The Library as AAPB Contributor
• The Library holds a large collection of PBS and NET
material on 16mm and 2” video
• Watergate coverage and Impeachment hearings added
to the AAPB in November
• Challenges
– Funding for in-house digitization
– Develop new workflows
– Resource allocation
– Legal clearance
– Exporting metadata from MAVIS
67. Challenges and Future Goals
• Continually improve/adapt workflows
• Put Baton QC software into wider use
• Improving file delivery methods
• Improving metadata mapping and MAVIS
record creation
It’s a collection of radio and TV materials created by or for public television and radio in the US, dating back to the 1950s, to be preserved for historical purposes and for access by the public.
Who are we? WGBH is Boston’s public television station. We produce fully one third of the content broadcast on PBS, including the series you see here, as well as Downton Abbey and Sherlock. In addition to television, we have two radio stations and a large, award-winning Interactive department that is the number one producer for the sites you’ll find on PBS.org. As you can see, we produce a wide variety of programming, from public affairs to history and science, to children’s programs, arts, culture, drama, and how-tos. We have been on the air since 1951 with radio and 1955 with television.
At heart and through our mission we are an educational and cultural institution. We originated out of a consortium of universities in the Boston area. Because we have produced so much, we have a large archive of educational programming that is of interest to scholars and researchers, in addition to the public.
Project is a collaboration between the Library of Congress and WGBH. The Library will oversee the long term preservation of the digital files.
The American Archive of Public Broadcasting seeks to preserve and make accessible significant historical content created by public media, and to coordinate a national effort to save at-risk public media before its content is lost to posterity.
Our mission and goals are challenging. In addition to preserving, we want to ensure discoverability and access. We want to guide and support current content creators and stewards of the materials with best practices to protect this historic programming. We want to facilitate the use of the materials and increase public awareness of its importance. And of course we want to be able to sustain these goals into the future.
Initially, CPB funded an inventory project and then a large digitization project. Stations that participated in the inventory had the opportunity to choose items to be digitized – items important to them, or items whose contents they could only discover by digitizing and watching or listening to them. CPB chose a single vendor – Crawford Media – to do all the digitization. Tapes were sent to Crawford in Atlanta. In addition, about 5,000 hours of already-digital content was identified to be added to the collection. So in the end, the initial collection consists of about 40,000 hours of content from about 97 organizations, totaling 68,000 files. In 2013 CPB chose the collaboration between the Library of Congress and WGBH to be the future stewards of the Archive.
AAPB hopes to provide a centralized web portal of discovery where researchers, educators, students – really anyone – can find relevant public broadcasting programs existing either on our own site or on sites belonging to other archives and stations. With approximately 1,250 public radio and television stations in existence, one access point will aid scholars interested in researching how national or even international topics have been covered in divergent localities over the past 60+ years. AAPB has made a start at becoming that portal. If stations and archives operating their own websites will send us metadata, we will provide direct links from AAPB to digitized files on the other sites. For a researcher, this would be one-stop shopping. This is how the Digital Public Library of America (DPLA) operates, and we plan soon to make AAPB files accessible through searches on the DPLA website as well as on our own. We want to help solve the separate silos syndrome.
First the problem or challenge. There is much to do to save our audiovisual heritage. Collections are huge, technically complicated, access is sometimes an unmovable barrier because of rights, it’s expensive to preserve, and the material is deteriorating. Copies of materials exist in multiple locations. And the technology to create, save and make accessible is rapidly changing. Our best effort to save it is to collaborate. But that means talking to people and sharing information. How do you build multiple successful partnerships that range from the library next door to the institution across the ocean that speaks a different language?
4.5 assets reviewed per hour = about 11.5 months of review left
The LDMA impacts the way we participate with the AAPB
-In 2016, LPB was one of seven public media stations to participate in the AAPB National Digital Stewardship Residency (NDSR) project. We were able to hire a recent NYU MIAP graduate, Eddy Colloton, to serve as our resident. In all, he worked with us for seven months.
-During his residency, Eddy was tasked with documenting our digitization workflow, making recommendations for improvements, and putting together a digital preservation plan. The plan is available on the AAPB NDSR website at https://ndsr.americanarchive.org/2017/03/06/louisiana-public-broadcasting-digital-preservation-plan/
-When LPB began its in-house digitization project in 2014, we chose IMX50 as our preservation format based on our available resources at the time. IMX50 is a lossy compression format and we knew that we wanted to move to a lossless compression format to be in line with best practices. Eddy recommended that we move to FFV1 with an MKV wrapper as our new preservation format.
-In June of this year, we were able to repurpose one of our computers and purchase a piece of AJA equipment to simplify our digitization workflow. Our IT engineer, Adam Richard, also wrote a program that implements Eddy’s recommendations.
-Here is a flowchart of our new digitization workflow. I’ll try to simplify it for you.
-Our transfer engineer now creates a 10-bit uncompressed MOV file. At the end of the day, he presses a button to start running the program overnight.
-The program creates an FFV1 file with an MKV wrapper using ffmpeg and then performs a frame-by-frame hash comparison between the uncompressed MOV and the FFV1 files. If the hashes match, the MOV file is deleted. If the hashes don’t match, we use the MOV as our master file. This process, which was suggested by Eddy, incorporates code developed by the Irish Film Archive and CUNY-TV.
-The program also creates a mezzanine file, a 5MB MP4 file that we can edit with. It also runs MD5Deep (creates checksums), MediaInfo (technical metadata), and MediaConch (encoding checker) on these files. The next day, I create the web files using Forscene.
-All of these files are stored on three LTO6 tapes.
-The code for the program is available here: https://github.com/leta-lpb/lpb_archive
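LPB's actual program lives in the repository above; as a rough illustration of its two core operations, transcoding to FFV1/MKV and generating frame-level hashes for comparison, the ffmpeg invocations can be assembled like this. The flags shown are a common preservation recipe, not necessarily LPB's exact ones.

```python
def ffv1_cmd(mov_in, mkv_out):
    # Lossless FFV1 version 3 video in a Matroska wrapper, audio copied
    # as-is; a common recipe, not necessarily LPB's exact flags.
    return ["ffmpeg", "-i", mov_in, "-c:v", "ffv1", "-level", "3",
            "-c:a", "copy", mkv_out]

def framemd5_cmd(media_in, report_out):
    # Write per-frame MD5 hashes; matching reports for the MOV and the
    # MKV confirm the transcode was lossless before the MOV is deleted.
    return ["ffmpeg", "-i", media_in, "-f", "framemd5", report_out]

print(" ".join(ffv1_cmd("tape001.mov", "tape001.mkv")))
print(" ".join(framemd5_cmd("tape001.mkv", "tape001_mkv.framemd5")))
```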
-I do all of the cataloging for LPB’s assets and make those available on the LDMA, rights permitting.
-Through an IMLS planning grant for the LDMA, we were able to develop our own PBCore-based MySQL database. We share this database with the Louisiana State Archives so it has additional features that meet both of our needs.
-We also developed an API between this database and the LDMA front end that allows us to display selected PBCore fields on our individual asset pages on the LDMA.
-Because we have our own database, we do not use the AAPB AMS as our primary catalog.
Here is an example of how the metadata from our MySQL database is displayed on an individual asset page on the LDMA. We chose to only display selected metadata fields that we felt were most useful to the end user.
-Because we are already providing access to our assets through the LDMA, we worked with the AAPB team to become one of the first participating stations to use the AAPB as a portal, meaning our AAPB records include links to the videos that are available on the LDMA.
-This is a good solution for us because it is much easier and quicker for us to send in updated metadata instead of all of the digital files that we’ve transferred in-house.
-We’re able to drive new traffic to the LDMA and we’re also still contributing to the AAPB to make it as complete as possible.
-We started sending updated metadata for our records in 2016. At that point, the LPB records that were in the AAPB catalog were only up-to-date to the beginning of 2012 when we turned in our inventory records as a part of the American Archive Content Inventory Project (AACIP).
-Therefore, I have to send in updated metadata and thumbnail images for both existing records and new records that were created after 2012. I also have to keep track of what has been updated.
-I decided to do this through a .csv file
-For existing records – our local identifier, AAPB GUID, LDMA link, broadcast date, and updated description.
-For new records – local identifier, LDMA link, asset type, title(s), genre, duration, original broadcast date, copyright holder, and description.
-In 2017, I started creating separate spreadsheets for our new and existing records as a part of my cataloging workflow.
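The .csv hand-off could be generated like this sketch, using the existing-record columns listed above; the identifier values and link are made-up examples, and the exact column headings AAPB expects may differ.

```python
import csv
import io

# Columns for existing records, per the workflow above; values are invented.
FIELDS = ["local_identifier", "aapb_guid", "ldma_link",
          "broadcast_date", "updated_description"]

rows = [{
    "local_identifier": "LPB-0001",
    "aapb_guid": "cpb-aacip-00-000000000",
    "ldma_link": "http://ladigitalmedia.org/example-asset",
    "broadcast_date": "1985-03-14",
    "updated_description": "Updated description including segment summaries.",
}]

buf = io.StringIO()                      # stands in for the .csv file on disk
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```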
Here is an example of a LPB record on the AAPB with updated metadata and a link to the LDMA.
-Using the AAPB as a portal to our LDMA records is not a perfect solution for everything. For our newsmagazines, we decided to catalog those programs to the segment level instead of the episode level. In the AAPB catalog, our records are only at the episode level.
-This is not an issue for the content that we had digitized through the AAPB because the AAPB has copies of those digital files. I can send an updated description that includes all of the segment descriptions and a user can view the entire episode through the AAPB Online Reading Room.
-It is an issue for content that we’ve digitized in-house and have not sent the digital files to the AAPB. It requires three or four LDMA links to see an entire episode, and the AAPB AMS is not set up for this at this time. I’ll continue to work with the AAPB team to come up with a solution.
-We appreciate that the AAPB is flexible and recognizes that each participating station has a different set of needs and is willing to work with us to come up with solutions that work for both sides. Ultimately, this is to the benefit of our users.
Some of these phases occur concurrently, with different assets in different workflow phases, and arrangement and description phase varies most from project to project. Some may not be needed, like digitization for instance. Ingestion often includes a digitization phase, but doesn’t have to if stuff has already been digitized. We also revisit arrangement and description multiple times over a collection’s life. Some projects may require minimal arrangement and description if the donor has already done that work.
Appraisal happens at the collection level, not the item level. Either AAPB identifies a collection and seeks to preserve it, or AAPB is contacted by a creator who wants to preserve their collection, and AAPB determines whether we can take it and in what fiscal year. We are guided by our collection development policy, but in the last 1.5 years of growing the collection, we have accepted every collection that has come to our doorstep. There are three different appraisal situations. The third situation is the least laborious workflow, but it also accomplishes less of our mission.
30% of the deeds of gift we have received have been dedicated by the owner/creator under a Creative Commons license. You just have to ask, and often they will say yes! We also contract with digitization vendors at this stage, which can be a lot of work.
AMS is the heart of the work, and our systems and workflows are built around its functions. Biggest takeaway is that it primarily functions on the item level, as does our website, which greatly impacts how we imagine our work and workflows.
Inventory is item-level, in one way or another. We can ingest work-level metadata with multiple physical or digital instantiations, which is sort of item-level. Technical metadata is computer generated, as is as much descriptive metadata as possible.
The first step, involving the appraisal inventory, usually happens before contracting. An inventory needs to happen before we can continue with ingestion. We have basic required item-level metadata, which is pretty much just what is required to create a valid PBCore record. The unique identifiers are important, because we use those strings to track assets throughout all of our systems and workflows. If we’re aggregating the metadata and links to other archives, normalizing the metadata is the biggest part of that workflow.
Digital files include media and related materials, such as transcripts. This slide shows how ingestion differs for donors providing digitized files. We call the workflow “born digital,” but often that just means the records have already been digitized: we use the same workflow for truly born-digital assets and for assets that were digitized before we appraised them.
A lot of the work we do for born digital is in terminal, using bash scripts we have on hand or write for the project. I didn’t know bash but learned it to do this work. We provide digitization specifications to the vendor, who delivers masters, proxies, and technical metadata exactly as we expect. The Library now does most of the QC for master files.
The difference between the born-digital and digitization workflows is that for born digital we have to create the technical metadata and initial checksums ourselves.
Clearly a lot is going on, and a lot of systems are needed to create our digital archive environment.
We generate checksums using the terminal md5 command.
This phase varies most from project to project, and is planned during appraisal. Depending on the goals of the project and the content, it can vary a lot. For instance, the American Masters collection (restricted access) had almost no arrangement and description besides adding interview titles, while the NewsHour (full online access) has robust item-level description of 8,000+ assets. To date, budgets and the legal status of the recordings have determined how much description we do. Immediate upload of standard acquisitions is our version of MPLP.
Website displays information only at item level, is more “library” style than “archives”.
Talked about this at the "Bucket list" navigating copyright session that was at 9:45am earlier today!
Buckets are pretty broad, including News Reports that contain limited 3rd party content, News Magazines with limited art/3rd party content, documentaries with limited 3rd party content, talk shows,
Reindexing means running an application we built, which prompts the AMS to send stuff to Solr. When all the functional metadata in the AMS is right (and the video is hosted in Sony Ci), and the ingest is successful, the media will display on our website.
A lot of this is manual, an emphasis going forward is automating more of it.
The Library of Congress is the preservation arm of the AAPB. One of the reasons for this is that the Library has the resources to process and store the thousands of preservation files received through this project. At LC, we have a [insert PB] storage capacity. All files are ingested via our Packard Campus Workflow Application (PCWA) into our deep archive. Files are written to T10K-C tapes, with copies kept at two geographically separate locations. Files are migrated every 3-5 years and fixity is regularly verified. The metadata and checksums are stored in our MAVIS database. If a problem is detected, files are replaced from a back-up.
Within the next several months we will be developing a Station Advisory Committee comprising reps from participating orgs to provide guidance and advice to AAPB on future services, workflows, etc.
About to launch development of a more sustainable solution for metadata management, with input from the Station Advisory Committee on workflows and needs