Great Expectations: The Future of the Past (for REED)

Envisioning REED in the Digital Age Workshop 2011-04, Toronto

James Cummings

on 6 April 2011


Great Expectations: The Future of the Past (for REED)
James Cummings
InfoDev, Computing Services
University of Oxford
James.Cummings@oucs.ox.ac.uk

Why Am I Here?
I've been interested in REED records since my undergrad here at UofT
2002 IMC Paper on REED and 'Emergent Technologies' eventually published in REED in Review
2005 proof-of-concept TEI XML (and HTML) version of 'Wells' records and IMC paper on 'ReREEDing REED'
Also did a proof-of-concept REED volume search engine that returned DjVu or PDF pages.

Creating a REED Corpus
You've already heard about XML

But why descriptive markup at all?
Saying what things [are] has more power than saying what they are for!
Most importantly: it helps easy re-use of things in ways we never expected.

Things we expect
Interactive Research Website
Interlinking of all online REED-related resources
New ways of interrogating the data
And the Metadata as well!
New ways of undertaking REED-based research
Synonym-based multi-lingual intelligent searching

How else can this data be used?
Linguistic analysis?
Computational Codicology
Genealogical context building?
Still fairly straightforward stuff?

What do you do with a million REED records?
Does size really matter? (Records range greatly in size)
Millions of Words certainly... but what should we do with them?
Statistical palaeography? (Mining abbreviation expansion data?)

Decontextualisation
One of the problems with REED: you lose MS context.
But this is only because online images of the MS don't (or didn't) exist... if they did, we could link back to them!
Advances in technology will make transcription easier...
...but probably not automatic!
But REED records when re-contextualised are a goldmine of fragments of transcriptions that machines can use to learn the letter forms of that MS.
We won't fully transcribe all of these documents any time soon, certainly not in the next century.
(Though, I'd love to be proven wrong,
especially if I get to live to be 140...)

Village
Pre-1844 = Worcestershire
Post-1844 = Herefordshire
(e.g. Edvin Loach, a tiny exclave of Worcestershire)
Google Maps (With historical boundary maps!)
Date of record changes location...
So we can plot both date and location (Since REED sensibly preserves this information)
Correct Information @ Correct Granularity

But is decontextualisation a bad thing?
In Print:
REED Volume A
One Village
REED Volume B
Another Village
Two Miles
Online... the records can be plotted as two miles away.
Decontextualisation of records from MS = necessary evil
Decontextualisation of records from REED volumes = opportunity

So bearing that in mind....
It might seem a bit strange for me to say...
There. I said it!
What I DO care about is that they use the correct granularity of (preferably XML) markup and record the correct things using international standards.

** make it available as structured data (e.g. Excel instead of image scan of a table)

TEI XML underneath to generate HTML, PDF, RDF (linked data)

Hypothetical record
1601
Chamberlains Accounts ABC: MS AB/CD/EF
f 230v (6 January)
Item paid to the Earl of Essex's players -- 3s 4d

Another reason I don't care: If
correct information @ correct granularity
then we can programmatically transform to
standard formats of the future.

Linking REED Online Datasets
All REED online materials should use the same underlying methods for linking data and concepts
Each new resource then improves all existing ones

Future of the Past (for REED)
Greater integration with existing resources
Greater integration of existing REED resources
More use of REED materials as avenues into history, literature, politics, manuscript/archival studies, textual editing, etc.

If the future is already here, what's that on the horizon?
Prediction of long-term technological developments is difficult
Best solution:
mark up data in detailed granularity
mark up using international standards
expose underlying raw data
also provide automatic conversions to current popular flavours (e.g. linked data)
take advantage of third-party resources for other exports (e.g. google maps)
license clearly and openly for re-use
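The "automatic conversions" point can be illustrated with a minimal Python sketch. It assumes the hypothetical 1601 record from the slides has been marked up in a simplified, made-up XML vocabulary; the element and attribute names (`record`, `locus`, `amount`, `shillings`, and so on) and the functions `record_to_dict` and `record_to_html` are illustrative assumptions, not REED's or the TEI's actual schema. The idea is only that once the facts are marked up at a fine enough granularity, other formats can be generated mechanically.

```python
import json
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, simplified markup for the hypothetical record in the slides.
# Element and attribute names are illustrative only, not REED's or TEI's schema.
RECORD_XML = """
<record id="rec-1601-001">
  <date when="1601-01-06">6 January</date>
  <source>Chamberlains Accounts ABC: MS AB/CD/EF</source>
  <locus>f 230v</locus>
  <entry>Item paid to <name type="troupe">the Earl of Essex's players</name>
    <amount shillings="3" pence="4">3s 4d</amount></entry>
</record>
"""


def record_to_dict(xml_text):
    """Extract the marked-up facts from one record into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    entry = root.find("entry")
    amount = entry.find("amount")
    return {
        "id": root.get("id"),
        "date": root.find("date").get("when"),
        "source": root.findtext("source"),
        "locus": root.findtext("locus"),
        "payee": entry.findtext("name"),
        # 12 pence to the shilling, so 3s 4d becomes 40 pence.
        "pence": int(amount.get("shillings")) * 12 + int(amount.get("pence")),
        # Flatten the entry to running text and normalise whitespace.
        "text": " ".join("".join(entry.itertext()).split()),
    }


def record_to_html(rec):
    """Render the extracted record as a small HTML fragment."""
    return (
        f'<div class="record" id="{escape(rec["id"])}">'
        f'<time datetime="{escape(rec["date"])}">{escape(rec["date"])}</time> '
        f'<p>{escape(rec["text"])}</p></div>'
    )


if __name__ == "__main__":
    rec = record_to_dict(RECORD_XML)
    print(json.dumps(rec, indent=2))  # one possible "current popular flavour"
    print(record_to_html(rec))        # another view generated from the same source
```

Because the date, the payee, and the amount are each marked up separately, the JSON and HTML views both come from the same source without retyping anything; a further exporter (for example to RDF) would be one more function over the same dictionary.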
Web of Data
* make your stuff available on the web (whatever format but licensed)
*** non-proprietary format (e.g. CSV instead of Excel)
**** use URLs to identify things, so that people can point at your stuff
***** link your data to other people’s data to provide context

(Of mostly accounting and legal documents?)
REED
Statistical Historical Research?
And countless other things!
But that is still what you expect!

Humanity WILL [digitise] its cultural heritage!
When? 50 years? 100 years? 200 years? "eventually"?

TEI = Text Encoding Initiative

De facto standard for encoding historical texts

Deep Down: I don't really care if REED chooses to use the TEI!
(And TEI XML would still be the best choice to do this! So I'm relieved that REED is planning to follow the good advice to use TEI.)

The point is that the data format is less important than the way the data is modelled.
Correct Information @ Correct Granularity

Hopefully in developing its plans for future digital expansion, REED will bear some of these principles in mind.
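As a rough illustration of "correct information at correct granularity", the Edvin Loach example can be modelled so that the county depends on the date of the record. The sketch below is Python under stated assumptions: the `GAZETTEER` structure and the `county_for` function are hypothetical, the 1844 boundary is taken from the slide, and treating the year 1844 itself as already "post" is an arbitrary choice made here for simplicity.

```python
# Illustrative gazetteer: each place maps to a list of
# (first_year, last_year_exclusive, county) periods; None means open-ended.
# The 1844 boundary comes from the slide; counting 1844 itself as
# Herefordshire is an assumption made for this sketch.
GAZETTEER = {
    "Edvin Loach": [
        (None, 1844, "Worcestershire"),
        (1844, None, "Herefordshire"),
    ],
}


def county_for(place, year):
    """Return the county a place belonged to in the given year."""
    for start, end, county in GAZETTEER[place]:
        after_start = start is None or year >= start
        before_end = end is None or year < end
        if after_start and before_end:
            return county
    raise LookupError(f"no county recorded for {place} in {year}")


if __name__ == "__main__":
    print(county_for("Edvin Loach", 1601))  # Worcestershire
    print(county_for("Edvin Loach", 1850))  # Herefordshire
```

This only works because a record keeps both its date and its place as separate pieces of information; if either were flattened into free text, the lookup could not be automated.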

(Luckily, much of this it already plans to do (pending funding)!)

(Remember: eREED will be converting REED-edited extracts, not the original documents)

(And when it is done with low-hanging fruit of printed works, it will build upon the relatively small-scale work done on 'harder' things like medieval manuscripts.)

(And here I just mean: "take pretty pictures of" rather than "create digital scholarly editions of")

(Oh, and for my sins I'm also the elected director of the Digital Medievalist project, which runs an open access journal, mailing list, conference sessions, etc. for medievalists doing digital resource creation.)

(Or get funding to make your own, but if you do, release them openly... even (or especially) to giants like Google.)

REED must do anything it can to get more people to use its materials.

More use will eventually equal more funding.

Over the last decade (mainly on the TEI Technical Council) I've worked to highlight the benefit of TEI XML as a long-term preservation and rich encoding format.