Digital Preservation - Why, What and How

Problems, solutions and more...

Nir Sherwinter

on 10 October 2018

Digital Preservation
Research Data
Sustainable Digital Preservation
Why should we care?
Preservation Risks
Preservation Approaches
What is OAIS?
What is PREMIS?
What is METS?
How to get to sustainable digital preservation solution
Why should we care?
Stick figures by Randy Borum (http://community.articulate.com/members/RandyBorum/default.aspx)
Viking Lander data
NASA's early space records are suffering a similar fate, as Joe Miller recently discovered. The University of Southern California neurobiologist couldn't read magnetic tapes from the 1976 Viking landings on Mars. With the data in an unknown format, he had to track down printouts and hire students to retype everything. "All the programmers had died or left NASA," Miller said. "It was hopeless to try to go back to the original tapes."
Case Studies
We are living in a digital era:
Berkeley, How Much Info?
Proportion of original, unique publishing (2003)
A study by the University of California, Berkeley, more than 10 years ago showed that 92% of the materials the university published were born digital, meaning the original copy was digital. Paper publications were only 0.01% (only 1 in every 10,000 publications was paper).

Baker, M., Keeton, K., Martin, S., June 27, 2005
HP Laboratories Palo Alto

Massive storage failures
No matter how much money you spend on the system housing your data, there are still many ways in which it can fail and create opportunities for data to be lost, whether through hardware or software failure or an act of war. The longer you try to store data, the more likely this is to occur.

Mistaken erasure
Sometimes people accidentally delete things, and if it is the only copy, it is gone. At other times people decide that they no longer need a piece of data and delete it on purpose, only to find that it was in fact useful. The longer you try to store data, the more likely this is to occur.

Bit rot
No affordable digital storage is completely reliable over a long period of time. For example, some CDs have recently been shown to have a life span of only 2 years, which could cause significant problems for anyone relying on them. Other media such as magnetic tape also suffer various types of bit rot. The worst thing about this threat is that it often goes undetected until it is too late to recover the material. You would very nearly have to employ someone to check all your media all the time to minimise data losses, which would make most of these media too expensive to seriously consider in a preservation project. Bit rot is inevitable with any storage medium over a period of time.
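The constant checking described above is usually automated rather than done by hand. A minimal sketch, assuming a SHA-256 digest is recorded for every file at ingest and recomputed on a schedule; the function names and the manifest layout (relative path mapped to digest) are hypothetical, not taken from any particular preservation system:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a digest for every file under root (hypothetical manifest layout)."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_damaged(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose current digest differs from the manifest."""
    damaged = []
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(name)
    return damaged
```

A check like this only detects damage. Recovery still depends on having a second, intact copy to restore from.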

Outdated media
Over time all kinds of digital media become outdated. Technology is driven by innovation, which unfortunately leads to very short periods of relevance before obsolescence. Data stored on obsolete media becomes effectively useless if the appropriate hardware is not available to read it. This is a particularly difficult issue to manage where data is stored over long periods of time. Ideally, long term data storage should be technology independent; however, this is not practical. A Cornell University website has documented the lifespan of various storage media, with floppy disks lasting a whopping five years.

Outdated formats, applications and systems
As hardware becomes obsolete, so do file formats and the software that interprets them. A good example of this is WordPerfect: try to find a computer today that can read a WordPerfect document properly. Fortunately, systems and formats do not usually become obsolete at quite as rapid a pace as hardware.

Loss of context
Some data can be related, and this relationship can be vital to data interpretation. A good example of this might be the Rosetta Stone, discovered in Rashid, Egypt. The stone is engraved with the same text in three scripts (hieroglyphic, Demotic and ancient Greek), and without a "key" to what the symbols meant no one was able to read the hieroglyphic inscription. It took the French scholar Jean-François Champollion fourteen years to decipher the inscription. Can you imagine if you had to take that amount of time to decipher each document on your PC because someone had forgotten to preserve the relationship between that document and its key? It would be like trying to assemble Ikea furniture without instructions, a complete waste of time. Unfortunately, if this relationship is not identified and preserved when information is first stored, it is unlikely ever to be recovered. The longer the data is kept without this relationship, the less likely it is ever to be resolved.

Intentional attacks
Unfortunately, in the world we live in there are some people who intentionally destroy or damage digital assets for a variety of reasons. Because much of this information is currently held in open access repositories accessible via the internet, it is also vulnerable to attack. This is a threat to both long and short term storage.

Lack of resources
Many institutions simply do not have the resources, usually financial, to consider digital preservation. Preservation strategies are often overlooked as low priority and are likely to remain so until a major data loss scares people into action.

Organizational failure
This is a massive threat to long term digital storage of any kind. Technology is dynamic not only in its innovations but also in its market: vendors come and go, and competition kills off what once seemed to be very strong tech players. For this reason it would be folly to rely too heavily on any one vendor, system or sponsoring organization, because they change and often change quickly. Digital assets which need to be preserved long term must be protected from the failure of any one organisation. Unfortunately this is easily said but hard to plan for in such a dynamic environment.
Why Traditional Storage Systems Don't Help Us Save Stuff Forever?
Massive storage failures
Mistaken erasure
Bit rot
Outdated media
Outdated formats, applications and systems
Loss of context
Intentional attacks
Lack of resources
Organizational failure
Preservation Risks
Preservation Approaches
Bitstream Preservation
Bitstream preservation can be used as a foundation for other preservation strategies but is not adequate on its own for ensuring long term accessibility and authenticity. It involves simply storing the binary code (1s and 0s) that comprises a digital object, bearing in mind that the object will not be reproducible without the original combination of hardware and software that created it.
Advantage: It leaves the opportunity to go back to the 'original' record in this form and carry out different preservation techniques in the future.
Disadvantage: It is not suitable as a preservation strategy on its own.
Encapsulation
In the encapsulation approach, records are packaged as bitstream with metadata enabling a user in the future to display them. The leading example of this approach is the Victorian Electronic Records Strategy (VERS), the digital preservation program of ADRI member the Public Record Office Victoria.

In the VERS approach, record content is accepted in formats including text files, PDF, PDF/A, JPEG, TIFF and MPEG, encapsulated using an XML 'wrapper' containing a standard set of metadata elements, and authenticated using a digital signature. Each 'encapsulated' record can contain multiple documents that together form a record.
Advantage: Content and contextual information are kept together to minimise risk of loss.
Disadvantage: Can be 'records-centric', so not as effective for recording contextual information about people, organisations and functions.
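The encapsulation idea can be illustrated with a toy wrapper. This is only a sketch under stated assumptions: the element names (Record, Metadata, Title, Creator, Document) are hypothetical and are not the VERS schema, and a SHA-256 digest stands in for the digital signature a real system would apply.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET


def encapsulate(title: str, creator: str, documents: dict[str, bytes]) -> str:
    """Wrap documents and metadata in one XML string (hypothetical element names)."""
    record = ET.Element("Record")
    metadata = ET.SubElement(record, "Metadata")
    ET.SubElement(metadata, "Title").text = title
    ET.SubElement(metadata, "Creator").text = creator
    for name, payload in documents.items():
        doc = ET.SubElement(record, "Document", {"name": name})
        # A digest stands in for the digital signature a real system would apply.
        doc.set("sha256", hashlib.sha256(payload).hexdigest())
        doc.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(record, encoding="unicode")


def unpack(xml_text: str) -> dict[str, bytes]:
    """Recover the documents, refusing any whose digest no longer matches."""
    documents = {}
    for doc in ET.fromstring(xml_text).iter("Document"):
        payload = base64.b64decode(doc.text or "")
        if hashlib.sha256(payload).hexdigest() != doc.get("sha256"):
            raise ValueError(f"digest mismatch for {doc.get('name')}")
        documents[doc.get("name")] = payload
    return documents
```

Because the content and its metadata travel in one object, a future user needs only an XML parser and a Base64 decoder to get back to the original bitstreams.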
Emulation
Emulation is the replication of the functionality of an obsolete system. According to van der Hoeven, "Emulation does not focus on the digital object, but on the hard- and software environment in which the object is rendered. It aims at (re)creating the environment in which the digital object was originally created." Examples include emulating an Atari 2600 on a Windows system or emulating WordPerfect 1.0 on a Macintosh. Emulators may be built for applications, operating systems, or hardware platforms. Emulation has been a popular strategy for retaining the functionality of old video game systems, such as with the MAME project. The feasibility of emulation as a catch-all solution has been debated in the academic community.
Advantage: Has the potential to be more effective for preservation of databases and multimedia.
Disadvantage: Still relatively untested in digital records preservation.
Migration on Request
This approach involves preserving the bitstream of the record and developing a tool which will be capable of reproducing the intellectual content of the record in a different format. The tool must be developed before the record's format becomes obsolete. Migration is then only performed when a record is requested.
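The approach described above can be sketched as a store that never alters the original bitstream and looks up a conversion tool only at request time. Everything here is a hypothetical illustration: the Archive class, the CONVERTERS registry and the format labels are assumptions, and the Latin-1 to UTF-8 converter is a toy stand-in for a real migration tool.

```python
from typing import Callable

# Hypothetical registry: (source format, target format) -> converter.
Converter = Callable[[bytes], bytes]
CONVERTERS: dict[tuple[str, str], Converter] = {}


def register(source: str, target: str, converter: Converter) -> None:
    """Make a conversion tool available before the source format becomes obsolete."""
    CONVERTERS[(source, target)] = converter


class Archive:
    """Keeps original bitstreams untouched and converts only when asked."""

    def __init__(self) -> None:
        self._store: dict[str, tuple[str, bytes]] = {}

    def ingest(self, record_id: str, fmt: str, bitstream: bytes) -> None:
        self._store[record_id] = (fmt, bitstream)

    def request(self, record_id: str, target: str) -> bytes:
        fmt, bitstream = self._store[record_id]
        if fmt == target:
            return bitstream
        try:
            converter = CONVERTERS[(fmt, target)]
        except KeyError:
            raise LookupError(f"no converter from {fmt} to {target}") from None
        return converter(bitstream)  # the stored original is never modified


# Toy converter for illustration: Latin-1 text to UTF-8 text.
register("text/latin-1", "text/utf-8", lambda b: b.decode("latin-1").encode("utf-8"))
```

The LookupError branch marks the risk in this strategy: if no tool was registered in time, the record can still be retrieved but no longer rendered.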
Data formats which are open standards or which have published codes allow records to be reconstructed if applications are lost.
Tools for converting records to XML formats are now available as open source software.
Converting to a different format may cause the record to lose authenticity if essential characteristics are affected.
So, which approach should be taken?
A combination of all!
Let's learn from DigiMan!
What is OAIS?
Reference Model for an Open Archival Information System (OAIS)
Development led by the Consultative Committee for Space Data Systems (CCSDS)
Issued as CCSDS Recommendation (Blue Book) 650.0-B-1 (January 2002)
Also adopted as: ISO 14721:2003

An Open Archival Information System (or OAIS) is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community.
OAIS environment:
Producer provides the information
Management sets overall policy (not the day-to-day operations)
Consumer finds and acquires preserved information of interest
Designated Community is the set of Consumers who should be able to understand the preserved information.
Information is any type of knowledge that can be exchanged, and is expressed by some type of data.
For example: The information in a book is typically expressed by characters (the data) which, when combined with a knowledge of the language used (the Knowledge Base), are converted to more meaningful information. If the recipient does not know the language, then the book needs to be accompanied by a dictionary and grammar (i.e., Representation Information) in a form that is understandable using the recipient’s Knowledge Base.
In order for this Information Object to be successfully preserved, it is critical for an OAIS to clearly identify and understand the Data Object and its associated Representation Information.
- For digital information, this means the OAIS must clearly identify the bits and the Representation Information that applies to those bits.
- The OAIS must understand the Knowledge Base of its Designated Community to understand the minimum Representation Information that must be maintained.
The unit of exchange between an OAIS and its surrounding environment is an Information Package.
An Information Package is a conceptual container of two types of information:
- Content Information and
- Preservation Description Information (PDI).
The resulting package is viewed as being discoverable by virtue of the Descriptive Information.
Information Package variants:
- Submission Information Package (SIP)
- Archival Information Package (AIP)
- Dissemination Information Package (DIP)

Packages will need to vary depending upon their role, for example:
Imaging and e-journal projects often differentiate between their well-managed (and described) "master" files and the derived versions (thumbnails, JPEG files, PDFs) made available through the Web
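The three package variants above can be pictured as one structure passing through two transformations. A minimal sketch, assuming a package is just Content Information plus PDI plus Descriptive Information; the class, the function names and the dictionary keys are hypothetical simplifications, not part of the OAIS reference model itself:

```python
from dataclasses import dataclass, field


@dataclass
class InformationPackage:
    content: dict[str, bytes]                                  # Content Information
    pdi: dict[str, str] = field(default_factory=dict)          # Preservation Description Information
    descriptive: dict[str, str] = field(default_factory=dict)  # Descriptive Information


def ingest(sip: InformationPackage, archive_id: str) -> InformationPackage:
    """SIP -> AIP: keep everything and add archive-assigned PDI (hypothetical keys)."""
    pdi = {**sip.pdi, "identifier": archive_id, "provenance": "received from Producer"}
    return InformationPackage(dict(sip.content), pdi, dict(sip.descriptive))


def disseminate(aip: InformationPackage, wanted: set[str]) -> InformationPackage:
    """AIP -> DIP: hand the Consumer only the requested files."""
    subset = {name: data for name, data in aip.content.items() if name in wanted}
    return InformationPackage(subset, dict(aip.pdi), dict(aip.descriptive))
```

In the imaging example, the AIP would hold the master file and its derivatives, while a DIP delivered over the Web might carry only the thumbnail or the PDF.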
Now we can understand the diagram:
What is PREMIS?
Structural metadata captures physical structural relationships, such as which image is embedded within which website, as well as logical structural relationships, such as which page follows which in a digitized book.
Technical metadata for physical files includes technical information that applies to any file type, such as information about the software and hardware on which the digital object can be rendered or executed, or checksums and digital signatures to ensure fixity and authenticity. It also includes content type-specific technical information, such as image width for an image or elapsed time for an audio file.
Administrative metadata includes provenance information about who has cared for the digital object and what preservation actions have been performed on it, as well as rights and permission information that specifies, for example, access to the digital object, including which preservation actions are permissible.
Digital Preservation Tutorial
By Nir Sherwinter
Digital repositories are computer systems that ingest, store, manage, preserve, and provide access to digital content for the long term. This requires them to go beyond simple file or bitstream preservation. They must focus on preserving the information and not just the current file-based representation of this information. It is the actual information content of a document, data set, or sound or video recording that should be preserved, not the Microsoft Word file, the Excel spreadsheet, or the QuickTime movie. The latter represent the information content in a specific file format that will become obsolete in the future.
Preservation policies define how to manage digital assets in a repository to avert the risk of content loss. They specify, amongst other things, data storage requirements, preservation actions, and responsibilities. A preservation policy specifies digital preservation goals to ensure that:
Digital content is within the physical control of the repository;
Digital content can be uniquely and persistently identified and retrieved in the future;
All information is available so that digital content can be understood by its designated user community;
Significant characteristics of the digital assets are preserved even as data carriers or physical representations change;
Physical media are cared for;
Digital objects remain renderable or executable;
Digital objects remain whole and unimpaired and that it is clear how all the parts relate to each other; and
Digital objects are what they purport to be.
Descriptive metadata describes the intellectual entity through properties such as author and title, and supports discovery and delivery of digital content. It may also provide a historic context by, for example, specifying which print-based material was the original source for a digital derivative (source provenance).
All of these preservation functions depend on the availability of preservation metadata—information that describes the digital content in the repository to ensure its long-term accessibility. While the Open Archival Information System (OAIS) reference model defines a framework with a common vocabulary and provides a functional and information model for the preservation community, it does not define which specific metadata should be collected or how it should be implemented in order to support preservation goals.
Descriptive metadata
Structural metadata
Technical metadata for physical files
Administrative metadata
PREMIS (PREservation Metadata: Implementation Strategies) is an international working group concerned with developing metadata for use in digital preservation.
Some examples of a digital file’s potential semantic units would include:
the program on which the file was created
the version of that program
the operating system on which that program ran
who created the file
the rights associated with the file
when the file was ingested into the preservation system
dates the file was validated
and so on.
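The semantic units listed above can be pictured as fields of a single record kept alongside each file. A minimal sketch only: the class and field names are hypothetical labels for the listed examples, not the element names of the PREMIS Data Dictionary, and the rule that validation cannot precede ingest is an assumption added for illustration.

```python
from dataclasses import dataclass, field, asdict
from datetime import date


@dataclass
class FilePreservationRecord:
    creating_application: str   # the program on which the file was created
    application_version: str    # the version of that program
    operating_system: str       # the operating system on which that program ran
    creator: str                # who created the file
    rights: str                 # the rights associated with the file
    ingested_on: date           # when the file was ingested
    validated_on: list[date] = field(default_factory=list)  # dates the file was validated

    def record_validation(self, when: date) -> None:
        """Append a validation date; dates before ingest are rejected."""
        if when < self.ingested_on:
            raise ValueError("validation cannot precede ingest")
        self.validated_on.append(when)
```

Calling `asdict()` on such a record gives a plain dictionary that could then be serialized into whatever container format the repository uses.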
What is METS?
The Metadata Encoding and Transmission Standard (METS) is a metadata standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed using the XML schema language of the World Wide Web Consortium. The standard is maintained in the Network Development and MARC Standards Office of the Library of Congress, and is being developed as an initiative of the Digital Library Federation.
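To make the descriptive, administrative and structural parts concrete, the sketch below builds a bare METS skeleton with Python's standard XML library. It is an illustration under assumptions: the helper names `q` and `build_mets` and the ID values are hypothetical, the metadata sections are left empty, and the output has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_mets(label: str, files: dict[str, str]) -> ET.Element:
    """Build a bare METS skeleton; files maps a file ID to its location (URL)."""
    root = ET.Element(q("mets"), {"LABEL": label})
    ET.SubElement(root, q("dmdSec"), {"ID": "DMD1"})  # descriptive metadata goes here
    ET.SubElement(root, q("amdSec"), {"ID": "AMD1"})  # administrative metadata goes here
    file_grp = ET.SubElement(ET.SubElement(root, q("fileSec")), q("fileGrp"))
    struct_div = ET.SubElement(ET.SubElement(root, q("structMap")), q("div"))
    for file_id, href in files.items():
        file_el = ET.SubElement(file_grp, q("file"), {"ID": file_id})
        ET.SubElement(file_el, q("FLocat"), {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href})
        # The structural map points back at each file by its ID.
        ET.SubElement(struct_div, q("fptr"), {"FILEID": file_id})
    return root
```

`ET.tostring(build_mets(...), encoding="unicode")` then yields the XML text, with the file section listing the content files and the structural map tying them into one object.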

METS Profiles in use:
Musical Score (may be a score, score and parts, or a set of parts only)
Print Material (books, pamphlets, etc.)
Music Manuscript (score or sketches)
Recorded Event (audio or video)
PDF Document
Bibliographic Record
Compact Disc
In a standard system:
OAIS defines the preservation model and components.
PREMIS defines the metadata elements.
METS holds the metadata elements defined by PREMIS.
How to get to sustainable digital preservation solution
Now, let's get practical...
Using the LIFE project cost model, we can create a complete checklist:
L = Complete lifecycle cost over time 0 to T
Aq = Acquisition
I = Ingest
M = Metadata
Ac = Access
S = Storage
P = Preservation
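The symbols above can be turned into a small calculator. This is a minimal sketch, assuming the lifecycle cost L is the plain sum of the six elements, each already totalled over the period 0 to T; the class name, the field names and any figures passed in are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class LifecycleCosts:
    acquisition: float   # Aq
    ingest: float        # I
    metadata: float      # M
    access: float        # Ac
    storage: float       # S
    preservation: float  # P

    def total(self) -> float:
        """L: the sum of all six cost elements over the chosen period."""
        return sum(getattr(self, f.name) for f in fields(self))
```

Keeping the six elements as separate fields also makes the model usable as a checklist: an element left at zero is a cost that has not yet been estimated.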

Thank You!
It's important to understand that...
And always make sure that Mr. Bean won't get access to your valuable assets!