Fmt/Known...?

Adventures in format identification
by

jay gattuso

on 15 October 2013

Transcript of Fmt/Known...?

Adventures in Format Identification
fmt/known...?
1. Context
What?
~9 million files
~1 million intellectual entities
~100 TB of data
http://paperspast.natlib.govt.nz/cgi-bin/paperspast
http://atojs.natlib.govt.nz/cgi-bin/atojs
http://natlib.govt.nz/collections
Why?
National Library of New Zealand Act 2003

Part 4, Section 33.
"Provision of copies of public documents to National Library"
2. Technology...
Rosetta (Ex-Libris)
Validation includes:
Virus Check
Checksums
Format ID
Characterisation
All deposits go through validation checks
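A minimal sketch of the checksum step in a validation pass like the one listed above. The slides do not say which algorithm or tooling Rosetta uses, so SHA-256 and the names `file_checksum` and `verify_deposit` are assumptions for illustration only.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_deposit(path, expected_hex, algorithm="sha256"):
    """Compare a freshly computed digest with the one supplied at deposit."""
    return file_checksum(path, algorithm) == expected_hex.strip().lower()
```

A mismatch only says the bytes differ from what the depositor recorded; it does not say which copy is the correct one.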
Processes
PRONOM
The "what is a format" reference
Rules and Policy
Rule 1: No files to be ingested without (a) an identifiable file format type and (b) a valid format extension

Rule 2: If no PRONOM format identity is found, attempt to establish a signature based file format type. Failing that, attempt to establish an extension based file format type. Failing that.....

Policy: Where applicable, use the preconditioning policy to assert a file format identity prior to ingest (or during ingest process)
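Rule 2 describes an ordered fallback. It could be sketched roughly as below. The three identifier functions are hypothetical stand-ins (the real checks would be DROID/PRONOM, a locally written signature, and an extension lookup), and the extension table is invented for the example.

```python
import os
from typing import Callable, Optional, Sequence, Tuple

Identifier = Callable[[str], Optional[str]]


def identify_with_fallback(
    path: str, methods: Sequence[Tuple[str, Identifier]]
) -> Optional[Tuple[str, str]]:
    """Try each method in order; return (method name, result) for the first hit."""
    for name, method in methods:
        result = method(path)
        if result is not None:
            return name, result
    return None  # "Failing that.....": the slides leave this case open


# Hypothetical stand-ins; each returns None when it cannot identify the file.
def by_pronom(path: str) -> Optional[str]:
    return None


def by_local_signature(path: str) -> Optional[str]:
    return None


def by_extension(path: str) -> Optional[str]:
    known = {".rtf": "Rich Text Format"}  # illustrative table, not a real registry
    return known.get(os.path.splitext(path)[1].lower())


RULE_2_ORDER = [
    ("pronom", by_pronom),
    ("signature", by_local_signature),
    ("extension", by_extension),
]
```

Returning the method name alongside the result keeps a record of how weak the evidence for an identification was, which matters when the decision is written into provenance notes.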
Thelma Rene Kent panning for gold in the Arawata River. Kent, Thelma Rene, 1899-1946 Ref: 1/2-008741-F. Alexander Turnbull Library, Wellington, New Zealand. http://natlib.govt.nz/records/22775043
3. Example Workflow
How?
Via DROID, PRONOM is our primary filter, and has been for 5 years of gathering digital objects. In that time:
3 versions of DROID
45 Signature updates
(7 used in live system)
Objects that fail DROID are treated as anomalies
National Digital Heritage Archive
Est. 2004
The NLNZ digital storage & preservation layer
7 FTEs in business unit
~4 FTEs outside business unit
SaaS (commercial cloud) storage
IaaS (ditto) project in flight (2014)
National Library of New Zealand
Est. 1965
Manages the legal deposit mandate
www.natlib.govt.nz
Who?
Alexander Turnbull Library
Est. 1920
New Zealand’s national documentary heritage collections, including both published and unpublished items
Alexander Turnbull Library:

'Preserve, protect, develop, and make accessible for all the people of New Zealand the collections of that library in perpetuity and in a manner consistent with their status as documentary heritage and taonga'
Wellington Independent, Volume II, Issue 149, 17 March 1847, Page 1
http://openbuildings.com/buildings/national-library-of-new-zealand-profile-34223
First Contact
Transfer negotiated with Digital Archivists
Initial virus check completed
Files moved to NLNZ pre-deposit storage
Collection sized for qty and complexity
Meet with Preservation team to discuss any pre-deposit treatment
Treatment
Record file structure "as was"
Run through DROID
Find: extension mismatch, fmt/known, multiple PUIDs offered
Address issues (change extensions, explore new format sigs, select appropriate single PUID)
Record decisions, and file structure "as is"
Generate provenance notes where applicable
Example Actions
How does:
{\rtf0\mac
relate to
{\rtf1\ansi
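One way to see how the two headers relate is to parse them with a single pattern instead of matching one fixed byte string. This is only a sketch: the regular expression is an assumption written for this example, not the PRONOM signature for RTF, and `describe_rtf_header` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# "{\rtf", a numeric version, then a backslash and the first control word.
RTF_HEADER = re.compile(rb"^\{\\rtf(\d+)\\([a-z]+)")


def describe_rtf_header(data: bytes) -> Optional[Tuple[int, str]]:
    """Return (version number, first control word), or None if the header differs."""
    match = RTF_HEADER.match(data)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).decode("ascii")
```

Under this pattern both `{\rtf0\mac` and `{\rtf1\ansi` parse as the same family with a different version and character set, whereas a signature fixed on the bytes `{\rtf1` would miss the first.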
[4200060006003701
802d0a0046000300]
No extension
Write Professional v1.2
New Format
This file failed format ID because it's damaged, not because it's a new format
This is not a file we have seen before.
We do not know what the creating app was...
Outcomes
Write new sig
give to PRONOM
Either write new sig
give to PRONOM
or
get existing sig amended
Request new object.
Watching brief.
Possibly follow creation
process upstream to
find failures
Trouble......
All other cases - have valid
PRONOM format ID
4. Using PRONOM
Working Assumptions
A format ID (PUID) is constant
Object A will always be identified as PUID A
A format ID (PUID) is persistent
PUID A will never change
A format ID (PUID) never changes granularity
PUID A will never fracture into (PUID A and PUID B), or subsume PUID B
Research Areas
Persistence of PRONOM IDs over time
What does our use of different DROID versions and signatures over time look like?
What are the relative merits of MBS?
Is there an efficiency gain?
What's the accuracy cost?
How does the use of EOF markers in JPEG signatures affect IDs?
What is the EOI marker anyway?
Lessons
And so...?
However....
Good documentation is rare
Very few records are perfect
Interpretation of standards...
Thus, change is inevitable....
Some of what we use
How we work
What we've learnt
Jay Gattuso
National Library of New Zealand
National Digital Heritage Archive
Feb 2013
jay.gattuso@dia.govt.nz

Explored 26k objects of 61 PUIDs
4 versions of DROID and 5 signature versions
75% of PUIDs performed identically
26% of file types get multiple PUIDs
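A persistence check of this kind can be sketched as a comparison of identification results across runs. The data layout (one dictionary of file to PUID per tool/signature combination) and the name `unstable_files` are assumptions made for illustration; the slides do not describe how the study was actually scripted.

```python
from typing import Dict, Optional


def unstable_files(
    runs: Dict[str, Dict[str, Optional[str]]]
) -> Dict[str, Dict[str, Optional[str]]]:
    """Return the files whose PUID is not the same in every run.

    `runs` maps a run label (e.g. tool version plus signature version)
    to a {file name: PUID or None} mapping for that run.
    """
    all_files = set()
    for identified in runs.values():
        all_files.update(identified)

    changed = {}
    for name in sorted(all_files):
        per_run = {label: identified.get(name) for label, identified in runs.items()}
        # More than one distinct value means the identification moved.
        if len(set(per_run.values())) > 1:
            changed[name] = per_run
    return changed
```

A file that is missing from a run is treated as unidentified (`None`) there, so moving from "no ID" to a PUID also counts as a change.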
Smart scanning hints at considerable gains:
PUID Persistence is not assured:
Explored 11k objects of 55 PUIDs
41 different scan sizes
55% of the formats only need 64 bytes
7% of the formats need more than 64 KB
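The scan-size experiment can be approximated as: for each object, find the smallest leading window that still produces a match. The sketch below makes two loud assumptions: `matcher` is a hypothetical callable standing in for the signature engine, and only bytes from the beginning of the file are considered, whereas a real scan limit may also cover the end of the file.

```python
from typing import Callable, Iterable, Optional


def smallest_sufficient_window(
    data: bytes, matcher: Callable[[bytes], bool], sizes: Iterable[int]
) -> Optional[int]:
    """Return the smallest scan size (in bytes) at which `matcher` succeeds."""
    for size in sorted(sizes):
        if matcher(data[:size]):
            return size
    return None  # no tested window was large enough
```

Running this over a corpus and tallying the returned sizes per format would give the sort of distribution summarised above.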
Test EOF markers in JPEG & PDF sigs
Share Results
Eph-D-SOCIAL-1987-01
New Zealand Government. [1980s].
Alexander Turnbull Library
Share Effort
Share Resources
Share Ownership
100% accuracy with and without EOF marker in signature
EOF useful for finding damaged / incomplete files
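For JPEG, the end-of-file marker in question is EOI (End Of Image, bytes FF D9), paired with SOI (Start Of Image, bytes FF D8) at the start. A minimal sketch of using the two to flag possibly truncated files follows; the function names are hypothetical, and the check is a hint only, since a file can carry extra bytes after EOI and still open normally.

```python
SOI = b"\xff\xd8"  # JPEG Start Of Image marker
EOI = b"\xff\xd9"  # JPEG End Of Image marker


def check_jpeg_markers(data: bytes) -> dict:
    """Report whether the byte stream starts with SOI and ends with EOI."""
    return {
        "has_soi": data.startswith(SOI),
        "has_eoi": data.endswith(EOI),
    }


def looks_truncated(data: bytes) -> bool:
    """True when the start looks like JPEG but the closing marker is missing."""
    markers = check_jpeg_markers(data)
    return markers["has_soi"] and not markers["has_eoi"]
```

This mirrors the finding above: the start of the file is enough to identify the format, while the end marker is what exposes a damaged or incomplete object.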
Contribute to PRONOM
Commit to the issue:

Discussions
Signatures
Effort
Development
Early computers. K E Niven and Co :Commercial negatives. Ref: 1/2-240612-F. Alexander Turnbull Library, Wellington, New Zealand. http://natlib.govt.nz/records/22310339
Early computers. K E Niven and Co :Commercial negatives. Ref: 1/2-240383-F. Alexander Turnbull Library, Wellington, New Zealand. http://natlib.govt.nz/records/22786484
Sourced from LINZ. Crown Copyright reserved.
Booth, Macdonald & Co Ltd :Carlyle 3-furrow lever plough [and] Carlyle 3-furrow riding plough. [1907].. Booth, Macdonald Co Ltd :Farm implements catalogue no. 27. Christchurch, Timaru, Hastings, N.Z. 1st January, 1907.. Ref: Eph-A-FARM-MACHINERY-1907-01-04. Alexander Turnbull Library, Wellington, New Zealand. http://natlib.govt.nz/records/23015289
Horse team ploughing a field on a farm in England, World War I. Royal New Zealand Returned and Services' Association :New Zealand official negatives, World War 1914-1918. Ref: 1/2-014196-G. Alexander Turnbull Library, Wellington, New Zealand. http://natlib.govt.nz/records/22853134
Group photo of school children from Te One School, Chatham Islands. Burt, William Beverland, fl 1970s-1990s :Photographs of the Chatham Islands. Ref: 1/2-077399-G. Alexander Turnbull Library, Wellington, New Zealand. http://natlib.govt.nz/records/22781135
Emmerson's 2008 Fridge Door Collection. 2008. Emmerson, Rod, 1955- :[Digital cartoons published 3 July 2005 onwards in the New Zealand Herald.]. Ref: DCDL-0015225. Alexander Turnbull Library, Wellington, New Zealand. http://natlib.govt.nz/records/23171433