JADH 2013: ODDly Pragmatic: Documenting encoding practices in Digital Humanities projects

A plenary presentation for JADH 2013; http://tinyurl.com/jc-JADH2013; Rough draft text at http://blogs.it.ox.ac.uk/jamesc/2013/09/21/oddly-pragmatic/
by

James Cummings

on 21 September 2013



Being able to seamlessly integrate highly complex textual structures in interoperable methods without significant conditions or intermediary agents is a fantasy.
If texts do interoperate seamlessly, with no careful and considered effort, then:
The markup in the texts is limited or of a mainly structural granularity
The method of interoperation or combined processing is superficial
There has been a loss of some intellectual content, or
the results gained by the interoperation are not significant
This is not a bad thing, nor a failing of digital humanities -- it is an opportunity. The necessary mediation, investigation, transformation, analysis and systems design is interesting and important!
We recently converted the 40188 texts of the Early English Books Online - Text Creation Partnership (EEBO-TCP) corpus to TEI P5 XML
A large portion of these will become public domain in 2015, so we're testing and improving the conversions and toolchain to do fun and interesting things with them
(Like create ePubs so we can read these early printed books on our iPads and phones with facing page images)
This is a fairly large corpus of texts, and in working out conversions from the TCP markup (TEI P3 in origin, but evolved separately since) we undertook various statistical analyses of our resulting TEI P5 texts
The comparison of the existing markup through use of TEI ODD was a crucial part of understanding the differences
<hi> 39,911,344
<lb> 21,300,024
<l> 8,068,582
<p> 7,640,775
<gap> 4,890,337
<desc> 4,890,207
<note> 4,231,424
<pb> 3,400,530
<item> 2,687,551
<cell> 2,404,536
The majority of these are structural in nature
There are only 78 distinct elements used in the entire corpus
This reflects the nature of the TCP encoding guidelines of basic structural and rendering markup
Interoperability between EEBO-TCP texts as they are is not a fair test; there is no linear progression between document size and markup complexity -- the granularity remains the same
Increasingly interpretative markup will be added to EEBO-TCP texts as more researchers start to use them as a base for further scholarship, and this will be a lot more interesting
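A minimal sketch of how element frequencies like those above could be tallied, assuming the converted texts are available as well-formed XML. The function name and the tiny sample document are invented for illustration; this is not the toolchain actually used for the EEBO-TCP analysis.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from io import BytesIO

def count_elements(sources):
    """Tally element names (namespace stripped) across XML files or file objects."""
    counts = Counter()
    for source in sources:
        # iterparse streams each document, so large texts need not fit in memory
        for _event, elem in ET.iterparse(source, events=("end",)):
            counts[elem.tag.rpartition("}")[2]] += 1
            elem.clear()  # discard content we have already counted
    return counts

# Invented sample standing in for one converted text
SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p>A <hi>short</hi> sample<lb/>with <hi>two</hi> highlights.</p>
<lg><l>One line of verse</l><l>and another</l></lg>
</body></text></TEI>"""

counts = count_elements([BytesIO(SAMPLE)])
for name, n in counts.most_common(3):
    print(f"<{name}> {n:,}")
print("distinct elements:", len(counts))
```

Passing a list of file paths instead of the in-memory sample would produce corpus-wide totals, and `len(counts)` gives the number of distinct elements in use.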
EEBO-TCP
Poetic Forms Online
Holinshed's Chronicles
Verse Miscellanies Online
EEBO-TCP to TEI P5 Conversions
Converted EEBO-TCP
Markup Frequencies
Markup Analysis:
Are these texts interesting?
http://www.english.ox.ac.uk/holinshed/
What is the TEI?
(The Text Encoding Initiative)
An international consortium of institutions, projects and individual members
A community of users and volunteers
A freely available manual containing a set of regularly maintained and updated recommendations: 'The Guidelines'
Definitions, examples, and discussion of over 530 markup distinctions for textual, image facsimile, genetic editing etc.
A mechanism for producing customized schemas for validating your project's digital texts
A set of free and openly licensed, customizable tools and stylesheets for transformations to many formats (e.g. HTML, Word, PDF, Databases, RDF/LinkedData, Slides, ePub, etc.)
A simple consensus-based way of organizing and structuring textual (and other) resources
A format for documenting your interpretation and understanding of a text (and how text functions)
An archival, well-understood, format for long-term preservation of digital data and metadata
Whatever you make it! It is a community-driven standard
What the TEI is not:
the only standard in this area
objective or non-interpretative
used consistently even within the same project (never mind in other ones)
fixed and unchanging
your research end-point
automatic, useful publication of your materials
TEI
NOT TEI
The benefits
of a shared vocabulary
far outweigh
any difficulties
OxGarage
Freely available web frontend to underlying XSLT conversions
REST-enabled API interface for scripts doing bulk conversions
Pipelined conversions through many steps (e.g. DOCX to TEI P5 to ePub)
Often uses TEI P5 as pivot format
http://www.oucs.ox.ac.uk/oxgarage/
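The pivot-format idea can be illustrated with a small sketch: if every format converts to or from one shared format, a multi-step pipeline is just a path through the graph of available converters. The registry and format names below are hypothetical, and this is not a description of how OxGarage itself is implemented or of its API.

```python
from collections import deque

# Hypothetical registry of single-step converters as (source, target) pairs.
# The format names are illustrative, not OxGarage's actual identifiers.
CONVERTERS = {
    ("docx", "tei"), ("odt", "tei"), ("html", "tei"),
    ("tei", "epub"), ("tei", "html"), ("tei", "docx"),
}

def find_pipeline(source, target, converters=CONVERTERS):
    """Breadth-first search for the shortest chain of formats from source to target."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for a, b in sorted(converters):
            if a == path[-1] and b not in seen:
                seen.add(b)
                queue.append(path + [b])
    return None  # no chain of converters connects the two formats

print(find_pipeline("docx", "epub"))
```

With a pivot, N formats need roughly 2N converters rather than one for every pair, at the cost of anything the pivot format cannot represent.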
Full TEI Schema
Modules
Simple analytic mechanisms
Certainty and uncertainty
Core elements
Corpus texts
Dictionaries
Performance texts
Tables, formulæ, notated music, and figures
Character and glyph documentation
The TEI Header
Feature structures
Linking, segmentation and alignment
Manuscript Description
Names and dates
Graphs, networks, and trees
Transcribed Speech
Documentation of TEI modules
Critical Apparatus
Default text structure
Transcription of primary sources
Verse structures
Customized TEI
Modules
Simple analytic mechanisms
Certainty and uncertainty
Core elements
Corpus texts
Dictionaries
Performance texts
Tables, formulæ, notated music, and figures
Character and glyph documentation
The TEI Header
Feature structures
Linking, segmentation and alignment
Manuscript Description
Names and dates
Graphs, networks, and trees
Transcribed Speech
Documentation of TEI modules
Critical Apparatus
Default text structure
Transcription of primary sources
Verse structures
TEI ODD Customisation
But of course you want to remove specific elements from these 'modules' as well.

No project will need all of them!
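As a rough sketch of what such a customisation records, the snippet below generates an ODD-style `schemaSpec` that selects a few modules and excludes some elements from one of them, using the `moduleRef` element with its `key` and `except` attributes. The schema identifier and the particular removals are invented for illustration; a real ODD is normally written by hand or with a tool such as Roma, and sits inside a full TEI document.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialise TEI as the default namespace

def build_schema_spec(ident, modules):
    """Return a <schemaSpec> selecting modules, optionally excluding elements.

    `modules` maps a TEI module key to a list of element names to leave out.
    """
    spec = ET.Element(f"{{{TEI_NS}}}schemaSpec", {"ident": ident, "start": "TEI"})
    for key, excluded in modules.items():
        ref = ET.SubElement(spec, f"{{{TEI_NS}}}moduleRef", {"key": key})
        if excluded:
            ref.set("except", " ".join(excluded))
    return spec

spec = build_schema_spec("myProject", {   # "myProject" is a made-up identifier
    "tei": [], "header": [], "textstructure": [],
    "core": ["analytic", "biblStruct", "monogr"],  # illustrative removals only
})
ET.indent(spec)  # requires Python 3.9 or later
print(ET.tostring(spec, encoding="unicode"))
```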
Chaining of TEI ODD Customisations
One of the interesting developments in TEI ODD design is that users are able to 'chain' customisations.

This means that if a TEI Community or Project makes a customisation then others can come along and make their own customisation that points to this project's TEI ODD as a source.

This enables projects to truly say "We're very much like that project over there (e.g. EpiDoc), but we need to add back in this element that they removed" and to document this in a machine-processable manner.

This leads to easier interoperability while maintaining flexibility
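The chaining idea can be pictured with a toy model in which each customisation names a source and records only what it removes or adds back. This is a conceptual sketch under that simplifying assumption, not how the TEI stylesheets actually resolve chained ODDs, and every name in it is invented.

```python
from dataclasses import dataclass

@dataclass
class Customisation:
    """Toy model of an ODD: a source plus elements removed or added back."""
    name: str
    source: "Customisation | None" = None   # None means this is the starting point
    base: frozenset = frozenset()            # only used when source is None
    remove: frozenset = frozenset()
    add: frozenset = frozenset()

    def elements(self):
        """Resolve the chain: take the source's elements, then apply our changes."""
        inherited = self.source.elements() if self.source else self.base
        return (inherited - self.remove) | self.add

# Invented element sets; real customisations involve hundreds of elements.
full = Customisation("full", base=frozenset({"p", "hi", "note", "gap", "persName"}))
community = Customisation("community", source=full,
                          remove=frozenset({"note", "persName"}))
ours = Customisation("ours", source=community, add=frozenset({"persName"}))

print(sorted(ours.elements()))
```

Here `ours` says, in effect, "like the community customisation, but with `persName` added back in", and the difference between the two is itself recorded as data.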
TEI ODD
Customisation

Image used with the kind permission of the Worshipful Company of Stationers and Newspaper Makers
Stationers' Register Online

William Godwin's Diary
About Godwin's Diary
William Godwin (1756-1836), philosopher, writer, political activist, husband of Mary Wollstonecraft, father of Mary Shelley
Main project: Dr Mark Philp, Dr David O'Shaughnessy, two students in Politics Dept.
Diary from 1788-1836; hi-res scans by Bodleian; all images and XML available under CC+BY+NC license
We converted DOCX to TEI; taught them XML, a reduced TEI schema, and SVN in 2 days; and built the website
They marked up and categorised every meal, meeting, event, text, name and person mentioned
They identified ~50,000 of the ~64,000 name instances; they provided notes on the identified people
http://godwindiary.bodleian.ox.ac.uk/
Open Data
Any research project needs to release its data openly, especially publicly funded projects
If others can't re-use, test, and examine your underlying data in reasonable formats it is not worth producing.
Increasingly funding bodies are calling for open data and open standards
The coolest thing to be done with your data will be thought of by someone else!
But to do this, research projects need centralised institutional support!
Benefits of SRO Customisation
Keying company would use any XML schema provided
Keying company charged per kilobyte of output
Original estimate of savings from using a byte-reduced schema: 40%
Actual savings from using a byte-reduced schema: more than 60%
Savings meant the project was able to include the Eyre, Rivington, and Plomer editions of the Register (1640-1708)
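To see why a byte-reduced schema lowers a per-kilobyte keying bill, here is a small sketch that renames elements to shorter names and measures the difference. The name mapping and sample entry are invented, not the project's actual schema; real savings depend on how markup-heavy the keyed text is.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from full element names to short keying names.
SHORT_NAMES = {"entry": "e", "persName": "pn", "title": "t", "date": "d"}

def shorten(xml_text, mapping=SHORT_NAMES):
    """Rename elements per `mapping` and return the re-serialised XML."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = mapping.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")

def saving(original, reduced):
    """Fraction of UTF-8 bytes saved by the reduced form."""
    before = len(original.encode("utf-8"))
    after = len(reduced.encode("utf-8"))
    return 1 - after / before

# Invented, markup-heavy sample entry
SAMPLE = ("<entry><persName>John Day</persName><title>A booke</title>"
          "<date>1562</date></entry>")
reduced = shorten(SAMPLE)
print(reduced)
print(f"saved {saving(SAMPLE, reduced):.0%} of the bytes")
```

Because the mapping is one-to-one, calling `shorten` with the inverse mapping restores the full element names after keying.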
ODDly Pragmatic:
Documenting encoding practices in Digital Humanities projects

Dr James Cummings
IT Services, University of Oxford
James.Cummings@it.ox.ac.uk
@jamescummings
http://tinyurl.com/jc-JADH2013
http://www.poeticformsonline.org/
The Unmediated Interoperability Fantasy
versemiscellaniesonline.bodleian.ox.ac.uk
www.poeticformsonline.org
www.cems.ox.ac.uk/holinshed/
The TEI Does Not Say:
Do what I say!
Instead it says:
Do what you need to do but explain it to me in ways I can understand
http://blogs.it.ox.ac.uk/jamesc/2013/09/21/oddly-pragmatic/
CC+BY