The Skeleton Test Corpus

Presentation for IDCC13
by Ross Spencer on 19 January 2013

Transcript of The Skeleton Test Corpus

Generation of a Skeleton Corpus of Digital Objects for the Validation and Evaluation of Format Identification Tools and Signatures
by Ross Spencer
Twitter: @beet_keeper
Ross Spencer - IDCC 2013-01-16T14:00+01:00
ht.ly/bHmd3
dank u // thank you!

CAFEBABE[01:0F]CAFED00D{5}BAADCAFE(0A|0D0A)BAADF00D
ONE: CA FE BA BE 08 CA FE D0 0D 00 00 00 00 00 BA AD CA FE 0D 0A BA AD F0 0D
TWO: CA FE BA BE 08 CA FE D0 0D 00 00 00 00 00 BA AD CA FE 0A BA AD F0 0D

?? | * | {n} | {m-n} | (a|b) | [a:b] | [!a] | [!a:b]

https://github.com/exponential-decay/skeleton-test-suite-generator

Skeleton suite generation time: 27.16615771s
Number of PRONOM records: 946
Number of formats with Signatures: 537
Number of files created: 672

Impact // Response

Duplicate Identifications : 13
fmt/11 and fmt/13
fmt/91 and fmt/92
fmt/60 and fmt/59
fmt/141 and fmt/142
fmt/353 and fmt/436
fmt/131 and fmt/441
fmt/373 and fmt/381
fmt/437 and fmt/353
fmt/438 and fmt/353
fmt/414 and x-fmt/135
x-fmt/34 and x-fmt 34
x-fmt/451 and x-fmt/452
x-fmt/116 and x-fmt/212

Regex Bugs* : 2
fmt/435 - dxf - trailing {1-2}
x-fmt/412 - jar - trailing {18-65531}
* Worked in DROID 4.0 not DROID 6.0

DROID Nuances Discovered : 1
No fallback to standard identification for container formats if left unidentified

~ Create ~
~ Test ~
~ Submit ~
~ Host ~
Methodology

Acknowledgements
Richard Brennan - DROID 6.1 Developer
Nicki Welch - Digital Archivist
David Clipsham - Signature Developer

Where next?
idk!
immediate impact
greater stability
lessons learned
DROID codebase
manual generation?
best return for investment
Container formats (ZIP, OLE2)

Conclusion
call to arms - Fetherston and Gollins (2012) : doi.org/10.2218/ijdc.v7i1.211
~ Identification ~
~ Feature Extraction ~
~ Validation ~

~ Effort involved in full test-corpus ~
Still Picture Interchange File Format - FMT/112
FFD8FFE800205350494646000100(00|01|02|03|04){11}(00|01|02|03|04|05){9}FFE8

Naively: Extract into 5 x 6 skeleton files
but: one profile is bi-level; five compression types apply to bi-level images only
so: (1 x 5) + (4 x 2) = 13 skeleton files not 30

Profile ID (P): 1 bi-level | 4 continuous-tone
Bits Per Sample (BPS): 1 bi-level | 5 continuous-tone
Compression (C): 5 bi-level | 2 continuous-tone
Resolution Units (R): 3 bi-level | 3 continuous-tone
(1 x 1 x 5 x 3) + (4 x 5 x 2 x 3) = 135 exemplar test-corpus files

Maintenance
More Results // GitHub Stuff ...

Failings
Benefits
~ Not always possible to submit sample files
~ IPR free
~ Size makes it distributable (400kb)
~ Already benefits FIDO
~ Demonstrated value - DROID v DROID nuances
~ 537 records 672 signatures, unique files
~ Artificial e.g. PDF 1.4 - %PDF-1.4%%EOF
~ Psychological illusion whether good enough?
~ Smaller set of files automatically
~ Signature change = unit test change (DROID)
~ Only tests identification mechanisms
~ Communicate ~
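
The signature syntax shown in the slides (?? | * | {n} | {m-n} | (a|b) | [a:b] | [!a] | [!a:b]) can be turned into skeleton bytes roughly as follows. This is a minimal sketch and not the code from skeleton-test-suite-generator: the names `TOKEN`, `expand` and `fill` are hypothetical, and several choices are assumptions made for illustration. Ranges such as [01:0F] take the midpoint (which happens to give the 08 seen in the slide example), {n} and ?? are filled with zero bytes, {m-n} uses its minimum length, * emits nothing, and one byte string is produced per combination of (a|b) alternatives.

```python
import itertools
import re

# One regex alternative per syntax element listed on the slide.
TOKEN = re.compile(
    r"\(([0-9A-Fa-f|]+)\)"                              # (a|b) alternatives
    r"|\[(!?)([0-9A-Fa-f]{2})(?::([0-9A-Fa-f]{2}))?\]"  # [a:b], [!a], [!a:b]
    r"|\{(\d+)(?:-(\d+|\*))?\}"                         # {n}, {m-n}
    r"|(\?\?)"                                          # ?? one wildcard byte
    r"|(\*)"                                            # * gap of any length
    r"|([0-9A-Fa-f]{2})"                                # literal byte
)


def expand(signature, fill=0x00):
    """Return every concrete byte string for a signature,
    one per combination of (a|b) alternatives."""
    sig = signature.replace(" ", "")
    parts = []  # each entry is a list of candidate byte strings
    pos = 0
    while pos < len(sig):
        m = TOKEN.match(sig, pos)
        if m is None:
            raise ValueError(
                f"unsupported syntax at offset {pos}: {sig[pos:pos + 10]!r}"
            )
        alt, neg, lo, hi, count, _upper, any_byte, gap, literal = m.groups()
        if alt is not None:
            parts.append([bytes.fromhex(a) for a in alt.split("|")])
        elif lo is not None:
            low = int(lo, 16)
            high = int(hi, 16) if hi else low
            if neg:
                # Assumption: take the first byte outside the excluded range.
                value = next(b for b in range(256) if not low <= b <= high)
            else:
                # Assumption: take the midpoint of the allowed range.
                value = (low + high) // 2
            parts.append([bytes([value])])
        elif count is not None:
            # {n} and {m-n}: pad with the minimum number of fill bytes.
            parts.append([bytes([fill]) * int(count)])
        elif any_byte is not None:
            parts.append([bytes([fill])])
        elif gap is not None:
            # Assumption: a * gap is satisfied by zero bytes.
            parts.append([b""])
        else:
            parts.append([bytes.fromhex(literal)])
        pos = m.end()
    return [b"".join(combo) for combo in itertools.product(*parts)]


if __name__ == "__main__":
    example = "CAFEBABE[01:0F]CAFED00D{5}BAADCAFE(0A|0D0A)BAADF00D"
    for skeleton in expand(example):
        print(skeleton.hex().upper())
```

Under these assumptions the slide example expands to the same two byte strings labelled ONE and TWO, and the FMT/112 signature expands to the naive 5 x 6 = 30 candidates, before the bi-level and compression constraints reduce the useful set to 13.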