Generation of a Skeleton Corpus of Digital Objects for the Validation and Evaluation of Format Identification Tools and Signatures

~ Create ~

~ Communicate ~

~ Test ~

Methodology

~ Submit ~

~ Host ~

Impact // Response :

Duplicate Identifications : 13

  • fmt/437 and fmt/353
  • fmt/438 and fmt/353
  • fmt/414 and x-fmt/135
  • x-fmt/34 and x-fmt 34
  • x-fmt/451 and x-fmt/452
  • x-fmt/116 and x-fmt/212
  • fmt/11 and fmt/13
  • fmt/91 and fmt/92
  • fmt/60 and fmt/59
  • fmt/141 and fmt/142
  • fmt/353 and fmt/436
  • fmt/131 and fmt/441
  • fmt/373 and fmt/381
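
A duplicate identification here means that one skeleton file came back with more than one PUID. A minimal sketch of how such cases might be picked out of a set of results, assuming the results have already been collected into a plain mapping of file name to returned PUIDs. The function name, the file names, and the mapping itself are hypothetical and are not DROID's output format.

```python
def duplicate_identifications(results):
    """Return {file name: sorted PUIDs} for files matched by more than one PUID.

    `results` is a hypothetical mapping of file name to the PUIDs a tool returned.
    """
    duplicates = {}
    for name, puids in results.items():
        unique = sorted(set(puids))
        if len(unique) > 1:
            duplicates[name] = unique
    return duplicates


# Hypothetical file names; the PUID pairs are two of those listed above.
results = {
    "skeleton-a": ["fmt/11", "fmt/13"],
    "skeleton-b": ["fmt/91", "fmt/92"],
    "skeleton-c": ["fmt/18"],
}

for name, puids in duplicate_identifications(results).items():
    print(name, " and ".join(puids))
```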

Regex Bugs* : 2

  • fmt/435 - dxf - trailing {1-2}
  • x-fmt/412 - jar - trailing {18-65531}

* Worked in DROID 4.0 but not in DROID 6.0

DROID Nuances Discovered : 1

  • No fallback to standard identification for container formats if left unidentified

Skeleton suite generation time: 27.16615771s

Number of PRONOM records: 946

Number of formats with Signatures: 537

Number of files created: 672

~ Not always possible to submit sample files

https://github.com/exponential-decay/skeleton-test-suite-generator

~ IPR free

~ Size makes it distributable (400kb)

Benefits

~ Already benefits FIDO

~ 537 records, 672 signatures, unique files

CAFEBABE[01:0F]CAFED00D{5}BAADCAFE(0A|0D0A)BAADF00D

~ Demonstrated value - DROID v DROID nuances

ONE: CA FE BA BE 08 CA FE D0 0D 00 00 00 00 00 BA AD CA FE 0D 0A BA AD F0 0D

TWO: CA FE BA BE 08 CA FE D0 0D 00 00 00 00 00 BA AD CA FE 0A BA AD F0 0D

?? | * | {n} | {m-n} | (a|b) | [a:b] | [!a] | [!a:b]
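
The example signature and the two skeleton files ONE and TWO show a signature being expanded into concrete bytes. A minimal sketch of that expansion in Python follows. It is not the code of skeleton-test-suite-generator. The names `expand` and `pick_bracket` are hypothetical, and the filler choices are assumptions: zero bytes for `??`, `{n}` and `{m-n}` (using the minimum m), nothing for `*`, the midpoint for `[a:b]`, and the first value outside the range for `[!a]` and `[!a:b]`. The midpoint rule happens to reproduce the 08 byte on the slide.

```python
import itertools
import re

# One alternative per piece of signature syntax.
TOKEN = re.compile(
    r"\((?P<alt>[0-9A-Fa-f|]+)\)"                                      # (a|b)
    r"|\[(?P<neg>!)?(?P<lo>[0-9A-Fa-f]+)(?::(?P<hi>[0-9A-Fa-f]+))?\]"  # [a:b] [!a] [!a:b]
    r"|\{(?P<min>\d+)(?:-(?:\d+|\*))?\}"                               # {n} {m-n}
    r"|(?P<any>\?\?)"                                                  # ??
    r"|(?P<star>\*)"                                                   # *
    r"|(?P<hex>(?:[0-9A-Fa-f]{2})+)"                                   # literal bytes
)


def pick_bracket(match):
    """Choose one value for a bracket expression (assumed rule, see lead-in)."""
    lo = match.group("lo")
    hi = match.group("hi") or lo
    width = len(lo) // 2
    low, high = int(lo, 16), int(hi, 16)
    if match.group("neg"):
        # Any value outside the excluded range will do.
        value = high + 1 if high + 1 < 256 ** width else low - 1
    else:
        value = (low + high) // 2
    return value.to_bytes(width, "big")


def expand(signature):
    """Return every concrete byte string the signature expands to."""
    parts = []  # each entry is a list of alternative byte strings
    pos = 0
    while pos < len(signature):
        match = TOKEN.match(signature, pos)
        if match is None:
            raise ValueError(f"unrecognised syntax at offset {pos}")
        pos = match.end()
        if match.group("hex"):
            parts.append([bytes.fromhex(match.group("hex"))])
        elif match.group("alt"):
            parts.append([bytes.fromhex(a) for a in match.group("alt").split("|")])
        elif match.group("lo"):
            parts.append([pick_bracket(match)])
        elif match.group("min"):
            parts.append([b"\x00" * int(match.group("min"))])
        elif match.group("any"):
            parts.append([b"\x00"])
        else:  # "*" contributes no bytes in this sketch
            parts.append([b""])
    # One output per combination of alternatives.
    return [b"".join(combo) for combo in itertools.product(*parts)]


files = expand("CAFEBABE[01:0F]CAFED00D{5}BAADCAFE(0A|0D0A)BAADF00D")
for data in files:
    print(data.hex(" ").upper())
```

Only the `(a|b)` alternatives multiply the number of files; every other construct is filled with a single value, which is why this signature yields exactly two files.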

~ Artificial e.g. PDF 1.4 - %PDF-1.4%%EOF

~ Psychological illusion of being good enough?

Failings

~ Smaller set of files automatically

~ Signature change = unit test change (DROID)

~ Only tests identification mechanisms

ht.ly/bHmd3

More Results // GitHub Stuff ...

call to arms - Fetherston and Gollins (2012) : doi.org/10.2218/ijdc.v7i1.211

~ Identification ~ Feature Extraction ~ Validation ~

Conclusion

Still Picture Interchange File Format - fmt/112

by Ross Spencer

Twitter: @beet_keeper

FFD8FFE800205350494646000100(00|01|02|03|04){11}(00|01|02|03|04|05){9}FFE8

~ Effort involved in full test-corpus ~

Naively: extract into 5 x 6 = 30 skeleton files

but: one profile is bi-level, and five compression types apply to bi-level images only

so: not 30, but (1 x 5) + (4 x 2) = 13 skeleton files

~ Effort involved in full test-corpus ~

P (Profile ID) - 1 bi | 4 continuous-tone

BPS (Bits Per Sample) - 1 bi | 5 continuous-tone

C (Compression) - 5 bi | 2 continuous-tone

R (Resolution Units) - 3 bi | 3 continuous-tone

(1 x 1 x 5 x 3) + (4 x 5 x 2 x 3) = 135 exemplar test-corpus files
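
The counts above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the per-field counts are copied from the slides, and the dictionary keys and the function name `corpus_size` are hypothetical.

```python
from math import prod

# Number of valid values per field, split by image class (from the slides).
BI_LEVEL = {"P": 1, "BPS": 1, "C": 5, "R": 3}
CONTINUOUS_TONE = {"P": 4, "BPS": 5, "C": 2, "R": 3}


def corpus_size(fields):
    """Count valid combinations of the given fields across both image classes."""
    return sum(
        prod(image_class[field] for field in fields)
        for image_class in (BI_LEVEL, CONTINUOUS_TONE)
    )


naive = 5 * 6                                   # every profile with every compression
skeleton = corpus_size(["P", "C"])              # only the fields the signature tests
exemplar = corpus_size(["P", "BPS", "C", "R"])  # all four fields

print(naive, skeleton, exemplar)
```

The skeleton count only varies the two fields that appear in the signature, while the exemplar count varies all four, which is where the difference in effort comes from.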

immediate impact

best return for investment

Where next?

manual generation?

greater stability

Container formats (ZIP, OLE2)

idk!

Maintenance

lessons learned

DROID codebase

Acknowledgements

  • Richard Brennan - DROID 6.1 Developer
  • Nicki Welch - Digital Archivist
  • David Clipsham - Signature Developer

dank u // thank you!

Ross Spencer - IDCC 2013-01-16T14:00+01:00
