TEI, What Else?

"TEI, What Else?" -- A prezi for corpus3.aac.ac.at/showcase/index.php/workshop01

James Cummings

on 26 January 2015


Transcript of TEI, What Else?

TEI - What Else?
Dr James Cummings
University of Oxford
Full TEI Schema
Simple analytic mechanisms
Certainty and uncertainty
Core elements
Corpus texts
Performance texts
Tables, formulæ, notated music, and figures
Character and glyph documentation
The TEI Header
Feature structures
Linking, segmentation and alignment
Manuscript Description
Names and dates
Graphs, networks, and trees
Transcribed Speech
Documentation of TEI modules
Critical Apparatus
Default text structure
Transcription of primary sources
Verse structures
Customized TEI
Simple analytic mechanisms
Certainty and uncertainty
Core elements
Corpus texts
Performance texts
Tables, formulæ, notated music, and figures
Character and glyph documentation
The TEI Header
Feature structures
Linking, segmentation and alignment
Manuscript Description
Names and dates
Graphs, networks, and trees
Transcribed Speech
Documentation of TEI modules
Critical Apparatus
Default text structure
Transcription of primary sources
Verse structures
TEI ODD Customisation
But of course you want to remove specific elements from these 'modules' as well.

No project will need all of them!
We recently converted the 40188 texts of the Early English Books Online - Text Creation Partnership (EEBO-TCP) corpus to TEI P5 XML
A large portion of these will become public domain in 2015, so we're testing and improving the conversions and toolchain to do fun and interesting things with them
(Like create ePubs so we can read these early printed books on our iPads and phones with facing page images)
This is a fairly large corpus of texts. In working out conversions from the TCP markup (TEI P3, followed by separate evolution), we undertook various statistical analyses of the resulting TEI P5 texts
<hi> 39,911,344
<lb> 21,300,024
<l> 8,068,582
<p> 7,640,775
<gap> 4,890,337
<desc> 4,890,207
<note> 4,231,424
<pb> 3,400,530
<item> 2,687,551
<cell> 2,404,536
The majority of these are structural in nature
There are only 78 distinct elements used in the entire corpus
This reflects the nature of the TCP encoding guidelines of basic structural and rendering markup
Interoperability between EEBO-TCP texts as they are is not a fair test; there is no linear progression between document size and markup complexity -- the granularity remains the same
Increasingly, interpretative markup will be added to EEBO-TCP texts as more researchers start to use them as a base for further scholarship, and this will be a lot more interesting
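Element frequencies like those above can be gathered with a short script. The following is a minimal sketch in Python, not the toolchain actually used for EEBO-TCP; the sample document is invented for illustration, and the function name `count_elements` is an assumption.

```python
from collections import Counter
import io
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def count_elements(xml_sources):
    """Tally element names across an iterable of XML paths or file-like objects."""
    counts = Counter()
    for source in xml_sources:
        # iterparse streams each document instead of loading it whole
        for _event, elem in ET.iterparse(source, events=("end",)):
            counts[elem.tag.removeprefix(TEI_NS)] += 1
            elem.clear()  # drop content that has already been counted
    return counts

# Hypothetical miniature document, invented for illustration only.
SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>A <hi>short</hi> line<lb/>and <hi>another</hi><lb/></p>
    <p>Second <hi>paragraph</hi></p>
  </body></text>
</TEI>"""

counts = count_elements([io.BytesIO(SAMPLE)])
for name, n in counts.most_common(3):
    print(f"<{name}> {n:,}")
print("distinct elements:", len(counts))
```

For a real corpus the list passed to `count_elements` would hold file paths rather than an in-memory sample. `str.removeprefix` needs Python 3.9 or later.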
The Unmediated
Interoperability Fantasy
Being able to seamlessly integrate highly complex textual structures in interoperable methods without significant conditions or intermediary agents is a fantasy.
If texts do seamlessly interoperate unproblematically with no careful and considered effort then:
The markup in the texts is limited or of a mainly structural granularity
The method of interoperation or processing is superficial
The results are not significant
This is not a bad thing, nor a failing of digital humanities -- it is an opportunity. The necessary mediation, investigation, transformation, analysis and systems design is interesting and important!
Image used with the kind permission of the Worshipful Company of Stationers and Newspaper Makers
Chaining of TEI ODD Customisations
One of the interesting developments in TEI ODD design is that soon users will be able to 'chain' customisations.

This means that if a TEI Community or Project makes a customisation then others can come along and make their own customisation that points to this project's TEI ODD as a source.

This enables projects to truly say "We're very much like that project over there (e.g. EpiDoc), but we need to add back in this element that they removed" and to document this in a machine-processable manner.
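The idea of chaining can be illustrated with a toy model. Real customisations are written in TEI ODD (XML) and resolved by ODD-processing tools; the Python sketch below only mimics the logic. The class `Customisation`, its fields, and the three-module element set are all hypothetical and heavily simplified.

```python
from dataclasses import dataclass, field
from typing import Optional

# Tiny stand-in for the full TEI element set; real modules hold far more.
FULL_TEI = {
    "core": {"p", "hi", "note", "gap"},
    "namesdates": {"persName", "placeName"},
    "verse": {"caesura", "rhyme"},
}

@dataclass
class Customisation:
    """Hypothetical model of one customisation; not a real ODD processor."""
    name: str
    source: Optional["Customisation"] = None   # None means "start from full TEI"
    modules: set = field(default_factory=set)  # modules selected (root of chain only)
    delete: set = field(default_factory=set)   # elements removed at this step
    restore: set = field(default_factory=set)  # elements added back from full TEI

    def resolve(self) -> set:
        """Return the element names allowed after applying the whole chain."""
        if self.source is None:
            allowed = set().union(*(FULL_TEI[m] for m in self.modules))
        else:
            allowed = self.source.resolve()
        every_element = set().union(*FULL_TEI.values())
        return (allowed - self.delete) | (self.restore & every_element)

# A community customisation drops <gap> and <placeName> ...
community = Customisation(
    "community", modules={"core", "namesdates"}, delete={"gap", "placeName"}
)
# ... and a project that is "very much like" it adds <gap> back in.
project = Customisation("project", source=community, restore={"gap"})

print(sorted(community.resolve()))
print(sorted(project.resolve()))
```

The project records only its difference from the community customisation, so a later change upstream flows through the chain when it is resolved again.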
What is the TEI?
(The Text Encoding Initiative)
An international consortium of institutions, projects and individual members
A community of users and volunteers
A freely available manual: a set of regularly maintained and updated recommendations, 'The Guidelines'
Definitions, examples, and discussion of over 530 markup distinctions for textual, image facsimile, genetic editing etc.
A mechanism for producing customized schemas for validating your project's digital texts
A set of free and openly licensed, customizable tools and stylesheets for transformations to many formats (e.g. HTML, Word, PDF, Databases, RDF/LinkedData, Slides, ePub, etc.)
A simple consensus-based way of organizing and structuring textual (and other) resources
A format for documenting your interpretation and understanding of a text (and how text functions)
An archival, well-understood, format for long-term preservation of digital data and metadata
make it! It is a community-driven standard
What the TEI is not:
the only standard in this area
objective or non-interpretative
used consistently even within the same project (never mind in other ones)
fixed and unchanging
your research end-point
automatic useful publication of your materials
What Does Markup Enable?
Markup Enables Many Things
It makes explicit to a machine what is implicit to a human
It encodes our assumptions and understanding of a text
It enables us to impose structure(s) on data
It helps us relate information/resources/metadata together
It lets us categorise aspects of texts that interest us
It helps us share our resources and intellectual endeavours through choosing to use a common vocabulary
But that is just the tip of the iceberg!
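As a small illustration of making the implicit explicit: once names are tagged, a program can find them without guessing. This is a minimal sketch using Python's standard library; the sentence and the `ref` identifiers are invented for the example.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sentence; the ref values are hypothetical identifiers.
SNIPPET = """<p xmlns="http://www.tei-c.org/ns/1.0">
  <persName ref="#godwin">Godwin</persName> dined with
  <persName ref="#wollstonecraft">Mary Wollstonecraft</persName> in
  <placeName>London</placeName>.
</p>"""

paragraph = ET.fromstring(SNIPPET)

# A human reader knows which words are people and which are places;
# the markup states it so that a machine can act on it too.
people = [(p.get("ref"), p.text) for p in paragraph.findall("tei:persName", NS)]
places = [pl.text for pl in paragraph.findall("tei:placeName", NS)]

print(people)
print(places)
```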
More TEI Markup
Genetic Editing
Stand Off Markup
XML Databases
XML: Pros and Cons
Pros: Expressive freedom; Handles complexity; Human readable; Arbitrary nested data relationships; Schema validation
Cons: Verbose markup; Slow to process; Complex processing; External dependencies
Relational Databases: Pros and Cons
Pros: Well understood; Stable & Optimised; Standard Query Language; Easy to understand tables; Available tools
Cons: Un[real]istic Structures; No complex base types; Limits of SQL for complex data; Complex relationship linking; Hard to use others' databases
Relational Databases have a much longer history, but XML is now used pervasively throughout our digital world
If your data is table-like (e.g. a telephone book) then databases are better; if your data has arbitrarily deeply-nested structures (names inside paragraphs inside sections) then XML is the right solution
There are efficient ways to query large XML text-bases, but databases enable large querying of simple structures more easily
It isn't about format: XML is good for some things, databases are good for other things
It is about the preservation and interrogation of intellectual content
It is about finding, understanding, and creating human cultural heritage
It is about enabling research
XML is a good verbose storage format even when you are using other formats for delivery
Outputs such as JSON and Linked Open Data (in RDF) are common (often lossy) outputs but are problematic as primary formats
XML enables as many outputs as you can conceive; there is never only one target output (if you let others use your data)
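The points about nesting and lossy outputs can be sketched briefly. The document below is invented and uses plain, un-namespaced XML for brevity; it is an illustration under those assumptions, not a recommended export format.

```python
import json
import xml.etree.ElementTree as ET

# Invented example: names inside paragraphs inside (sub)sections.
DOC = """<div type="section">
  <p>Letter from <name>Ann</name> to <name>Ben</name>.</p>
  <p>No names here.</p>
  <div type="subsection"><p>Reply by <name>Ben</name>.</p></div>
</div>"""

root = ET.fromstring(DOC)

# iter() walks the tree at any depth, so nesting level does not matter.
names = [n.text for n in root.iter("name")]

# Flattening to one record per paragraph is convenient for delivery,
# but it is lossy: the position of each name inside the sentence and
# the section/subsection nesting are both gone.
records = [
    {"text": "".join(p.itertext()), "names": [n.text for n in p.findall("name")]}
    for p in root.iter("p")
]
print(json.dumps(records, indent=2))
```

The JSON can always be regenerated from the XML, but the XML cannot be rebuilt from the JSON, which is why the richer format is kept as the primary one.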
Holinshed's Chronicles
TEI ODD Customisation
Stationers' Register
William Godwin's Diary
Wandering Jew's Chronicle
Verse Miscellanies Online
Almost everything we do
these days on computers
needs some form of markup
By using a standardised vocabulary
we are able to re-use tools, programs,
and concepts across projects and
Freely available web frontend to underlying XSLT conversions
REST-enabled API interface for scripts doing bulk conversions
Pipelined conversions through many steps (e.g. DOCX to TEI P5 to ePub)
Often uses TEI P5 as pivot format
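The pivot-format idea can be sketched as follows. This is a toy illustration only: the converter registry, the functions `plan` and `convert`, and the stub converters are hypothetical and do not reflect the real service's API or code.

```python
from typing import Callable, Dict, List, Tuple

PIVOT = "tei"

# (source, target) -> converter. These stubs only wrap the payload in a
# label; real converters would run XSLT or similar transformations.
CONVERTERS: Dict[Tuple[str, str], Callable[[str], str]] = {
    ("docx", "tei"): lambda data: f"tei({data})",
    ("tei", "epub"): lambda data: f"epub({data})",
    ("tei", "html"): lambda data: f"html({data})",
}

def plan(source: str, target: str) -> List[Tuple[str, str]]:
    """Return conversion steps, routing through the pivot when needed."""
    if (source, target) in CONVERTERS:
        return [(source, target)]
    steps = [(source, PIVOT), (PIVOT, target)]
    missing = [step for step in steps if step not in CONVERTERS]
    if missing:
        raise ValueError(f"no converter for {missing}")
    return steps

def convert(data: str, source: str, target: str) -> str:
    """Run the planned steps in order, feeding each output to the next."""
    for step in plan(source, target):
        data = CONVERTERS[step](data)
    return data

print(convert("report", "docx", "epub"))
```

With a pivot, supporting a new format means writing one converter into the pivot and one out of it, rather than one for every pair of formats.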
EEBO-TCP to TEI P5 Conversions
Converted EEBO-TCP Markup Frequencies
Markup Analysis: Are these texts interesting?
About Godwin's Diary
William Godwin (1756-1836), philosopher, writer, political activist, husband of Mary Wollstonecraft, father of Mary Shelley
Main project: Dr Mark Philp, Dr David O'Shaughnessy, two students in Politics Dept.
Diary from 1788-1836; hi-res scans by Bodleian; all images and XML available under CC+BY+NC license
We converted DOCX to TEI and taught them XML, a reduced TEI schema, and SVN in 2 days
They marked up and categorised every meal, meeting, event, text, name and person mentioned
They identified ~50,000 of the ~64,000 name instances; they provided notes on the identified people
The benefits of a shared vocabulary
far outweigh
any difficulties
TEI -- What Else?
Avoid manual changes
Script transformations
and conversions
Re-run transformations
with later improvements
Open Data
Any research project
to release its data openly, especially publicly funded projects
If others can't re-use, test, and examine your underlying data in reasonable formats it is not worth producing.
Increasingly funding bodies are calling for open data and open standards
The coolest thing to be done with your data will be thought of by someone else
Verb semantics and argument realization in pre-modern Japanese
License: CC+By
(even this prezi is stored as XML!)