What is the TEI? And why should I care? (A brief introduction for classicists)

An introduction for digital classicists for the Digital Classics: Oxford ancient history seminar, 2015-01-27; License: CC BY-NC
by

James Cummings

on 19 February 2015


Transcript of What is the TEI? And why should I care? (A brief introduction for classicists)

Types of Markup
Procedural Markup:
RED INK ON; print "-£1000"; RED INK OFF
Descriptive Markup
It is usually more useful to mark up what we think things represent (in a source text, in our understanding of the data, etc.) rather than what they look like.
Using descriptive markup enables us to make explicit the distinctions we want to make when processing a string of characters
It gives us a way of naming, characterising, and annotating textual data in a formalised way and recording this for re-use
Also called 'Encoding' or 'Annotation'
Separation of Form and Content
Presentational markup cares more about fonts and layout than meaning
Descriptive markup says what things are, and leaves the rendition or processing of them for a separate step
Separating the form of something from its content makes its re-use more flexible
It also allows easy changes of presentation across a large number of documents

Markup as an Intellectual Activity
The application of markup to a document can be a scholarly activity
In deciding what markup to apply, and how this represents the understanding being modelled, one is acting as an editor
There is almost no such thing as neutral markup -- all of it involves interpretation
Markup assists in answering research questions: but understanding the markup decisions which enable those answers can be a research activity in itself
What is the Point of Markup?
Markup is used in many different fields, for many different purposes: storing data, relating information, encoding understanding, preserving metadata
Markup is a way of making our knowledge or understanding about a text explicit
Markup strives to make explicit (to a machine) what is implicit (to a person)
Markup assists us in facilitating re-use of the same material:
in different formats
in different contexts
by different sorts of users
Markup
Presentational Markup:
\textcolor{red}{-£1000}
Descriptive Markup:
<measure unit="pounds" value="-1000">One thousand pounds in debt</measure>
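A minimal sketch of why the descriptive form is easier to re-use: the same element can be given different renditions in a separate processing step. The function names `render_plain` and `render_latex` and the `UNIT_SYMBOLS` table are hypothetical, invented for this illustration; only Python's standard `xml.etree.ElementTree` is used.

```python
import xml.etree.ElementTree as ET

SOURCE = '<measure unit="pounds" value="-1000">One thousand pounds in debt</measure>'

# Assumption for this sketch: only the unit used on the slide is mapped.
UNIT_SYMBOLS = {"pounds": "£"}

def render_plain(measure):
    """Hypothetical renderer: show the amount as plain text."""
    value = int(measure.get("value"))
    symbol = UNIT_SYMBOLS[measure.get("unit")]
    sign = "-" if value < 0 else ""
    return f"{sign}{symbol}{abs(value)}"

def render_latex(measure):
    """Hypothetical renderer: negative amounts go in red, as in the presentational example."""
    text = render_plain(measure)
    if int(measure.get("value")) < 0:
        return r"\textcolor{red}{" + text + "}"
    return text

measure = ET.fromstring(SOURCE)
print(render_plain(measure))
print(render_latex(measure))
```

The decision to use red ink now lives in the renderer rather than in the data, so changing the presentation does not require touching the encoded text.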
Compare the Markup
About XML
XML is structured data represented as strings of text
XML looks like HTML, except that:
XML is extensible
XML must be well-formed
XML can be validated
XML is application-, platform-, and vendor- independent
XML empowers the content provider and facilitates data integration and migration
It is one of the best plain text long-term preservation formats for textual data that we have
XML Terminology
<?xml version="1.0" ?>
<root xmlns="http://namespace/">
  <element attribute="value">
    content
    <childElement type="empty"/>
    content
  </element>
  <!-- comment -->
</root>
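As a rough illustration of how these terms surface when a program reads the document, the sketch below parses a compact copy of the example with Python's standard `xml.etree.ElementTree`. The variable names are invented for this sketch; note that this parser reports the namespace as part of each element name and, by default, discards comments.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0" ?>
<root xmlns="http://namespace/">
  <element attribute="value">content<childElement type="empty"/>content</element>
  <!-- comment -->
</root>"""

root = ET.fromstring(DOC)
element = root[0]    # first (and only) child element of the root
child = element[0]   # the 'empty' child element

print(root.tag)                  # root element name, qualified by its namespace
print(element.get("attribute"))  # attribute value
print(element.text)              # content before the child element
print(child.get("type"))         # attribute on the empty element
print(child.tail)                # content after the child element
```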
Annotation by nesting vs standoff
<taxonomy>
  <category xml:id="lit">
    <catDesc>Literature</catDesc>
    <category xml:id="prose">
      <catDesc>Prose Texts</catDesc>
      <category xml:id="nov">
        <catDesc>Novels</catDesc>
      </category>
    </category>
    <category xml:id="poe">
      <catDesc>Poetry</catDesc>
      <category xml:id="sonnets">
        <catDesc>Sonnets</catDesc>
        <category xml:id="petSon">
          <catDesc>Petrarchan Sonnets</catDesc>
        </category>
        <category xml:id="ShakeSon">
          <catDesc>Shakespearean Sonnets</catDesc>
        </category>
        <category xml:id="spensSon">
          <catDesc>Spenserian Sonnets</catDesc>
        </category>
      </category>
    </category>
    <category xml:id="drama">
      <catDesc>Dramatic texts</catDesc>
    </category>
  </category>
</taxonomy>
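Because the categories are nested, the place of any category in the hierarchy can be recovered by walking the tree. The following is a minimal sketch over a trimmed copy of the taxonomy (only one branch is kept); the helper `path_to` is hypothetical and written for this illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

TAXONOMY = """<taxonomy>
  <category xml:id="lit">
    <catDesc>Literature</catDesc>
    <category xml:id="poe">
      <catDesc>Poetry</catDesc>
      <category xml:id="sonnets">
        <catDesc>Sonnets</catDesc>
        <category xml:id="petSon">
          <catDesc>Petrarchan Sonnets</catDesc>
        </category>
      </category>
    </category>
  </category>
</taxonomy>"""

def path_to(node, target_id, trail=()):
    """Return the catDesc labels from the top category down to target_id, or None."""
    for category in node.findall("category"):
        here = trail + (category.findtext("catDesc").strip(),)
        if category.get(XML_ID) == target_id:
            return list(here)
        found = path_to(category, target_id, here)
        if found:
            return found
    return None

taxonomy = ET.fromstring(TAXONOMY)
print(path_to(taxonomy, "petSon"))
```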
XML Syntax
There is a single root node containing the whole of an XML document
Each subtree is properly nested within the root node
Element/attribute names and values are always case sensitive
Start-tags and end-tags are always mandatory (except for combined start-and-end tags, called 'empty elements', such as <pb/> and <gap/>)
Attribute values are always quoted

XML in Practice
XML
Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879).
It uses ISO 10646 (also known as Unicode)
Originally designed to meet the challenges of large-scale electronic publishing, XML also now plays an indispensable role in the exchange of a wide variety of data on the Web and elsewhere.
Its success means that general tools are ubiquitous and how it works is well-understood.

You use XML every day -- you just don't realise it.
XML
XML declaration
root element
namespace
element
attribute and value
an 'empty' child element
comment
content
<?xml version="1.0" encoding="utf-8" ?>
<div n="1">
  <head>SCENE I. On a ship at sea: a tempestuous noise of thunder and lightning heard.</head>
  <stage>Enter a Master and a Boatswain</stage>
  <sp>
    <speaker>Master</speaker>
    <p>Boatswain!</p>
  </sp>
  <sp>
    <speaker>Boatswain</speaker>
    <p>Here, master: what cheer?</p>
  </sp>
  <sp>
    <speaker>Master</speaker>
    <p>Good, speak to the mariners: fall to't, yarely,</p>
    <p>or we run ourselves aground: bestir, bestir.</p>
  </sp>
  <stage>Exit</stage>
</div>
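Once speeches, speakers and stage directions are marked up like this, simple questions about the scene can be answered mechanically. A minimal sketch, using a compact copy of the scene (the head is omitted) and Python's standard library; the variable names are invented for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SCENE = """<div n="1">
  <stage>Enter a Master and a Boatswain</stage>
  <sp><speaker>Master</speaker><p>Boatswain!</p></sp>
  <sp><speaker>Boatswain</speaker><p>Here, master: what cheer?</p></sp>
  <sp><speaker>Master</speaker>
    <p>Good, speak to the mariners: fall to't, yarely,</p>
    <p>or we run ourselves aground: bestir, bestir.</p>
  </sp>
  <stage>Exit</stage>
</div>"""

scene = ET.fromstring(SCENE)

# Count speeches per speaker by reading the <speaker> inside each <sp>.
speeches = Counter(sp.findtext("speaker") for sp in scene.findall("sp"))

# Collect the stage directions in document order.
stage_directions = [stage.text for stage in scene.findall("stage")]

print(speeches)
print(stage_directions)
```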

<div xml:id="sonnet116" ana="#ShakeSon">
  <head>Sonnet 116</head>
  <lg type="stanza">
    <l>Let me not to the marriage of true <rhyme label="a">minds</rhyme></l>
    <l>Admit impediments. Love is not <rhyme label="b">love</rhyme></l>
    <l>Which alters when it alteration <rhyme label="a">finds</rhyme>,</l>
    <l>Or bends with the remover to <rhyme label="b">remove</rhyme>:</l>
  </lg>
  <lg type="stanza">
    <l>O no; it is an ever-fixed <rhyme label="c">mark</rhyme>,</l>
    <l>That looks on tempests, and is never <rhyme label="d">shaken</rhyme>;</l>
    <l>It is the star to every wandering <rhyme label="c">bark</rhyme>,</l>
    <l>Whose worth's unknown, although his height be <rhyme label="d">taken</rhyme>.</l>
  </lg>
  <lg type="stanza">
    <l>Love's not Time's fool, though rosy lips and <rhyme label="e">cheeks</rhyme></l>
    <l>Within his bending sickle's compass <rhyme label="f">come</rhyme>;</l>
    <l>Love alters not with his brief hours and <rhyme label="e">weeks</rhyme>,</l>
    <l>But bears it out even to the edge of <rhyme label="f">doom</rhyme>.</l>
  </lg>
  <lg type="couplet">
    <l>If this be error and upon me <rhyme label="g">proved</rhyme>,</l>
    <l>I never writ, nor no man ever <rhyme label="g">loved</rhyme>.</l>
  </lg>
</div>
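Marking the rhyme words explicitly means the rhyme scheme can be read off by a program rather than inferred. A minimal sketch over the first quatrain only, labelled here on the ABAB pattern; the helpers `rhyme_scheme` and `rhyme_words` are hypothetical names for this illustration.

```python
import xml.etree.ElementTree as ET

STANZA = """<lg type="stanza">
  <l>Let me not to the marriage of true <rhyme label="a">minds</rhyme></l>
  <l>Admit impediments. Love is not <rhyme label="b">love</rhyme></l>
  <l>Which alters when it alteration <rhyme label="a">finds</rhyme>,</l>
  <l>Or bends with the remover to <rhyme label="b">remove</rhyme>:</l>
</lg>"""

def rhyme_scheme(lg):
    """Concatenate the rhyme labels in line order."""
    return "".join(rhyme.get("label") for rhyme in lg.iter("rhyme"))

def rhyme_words(lg):
    """Group the rhyming words under their labels."""
    groups = {}
    for rhyme in lg.iter("rhyme"):
        groups.setdefault(rhyme.get("label"), []).append(rhyme.text)
    return groups

stanza = ET.fromstring(STANZA)
print(rhyme_scheme(stanza))
print(rhyme_words(stanza))
```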
What does it mean to be well-formed?
An XML document is encoded as a linear string of characters
It usually begins with an XML declaration (which looks like a special processing instruction)
Element occurrences are marked by start and end-tags
The characters < and & are Magic and must always be "escaped" using &lt; or &amp; if you want to use them as themselves
Comments are delimited by <!-- and -->
Attribute name/value pairs are supplied on the start-tag and may be given in any order
xml:id="uniqueID" and xml:lang="languageCode"
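To illustrate the escaping rule for < and &, the sketch below escapes a sentence with the standard library's `xml.sax.saxutils.escape` and shows that the parser hands back the original characters. The sample sentence and the helper `parses` are invented for this illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "if a < b & b < c then a < c"

# escape() replaces & and < (and also >) with their entity references.
escaped = escape(raw)
print(escaped)

# The parser turns the entity references back into the original characters.
elem = ET.fromstring("<p>" + escaped + "</p>")
print(elem.text)

def parses(source):
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True

# Without escaping, the same sentence is not well-formed.
print(parses("<p>" + raw + "</p>"))
```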
The XML Format
<!-- In document stand-off linking-->
<linkGrp>
  <link target="#ShakeSon #sonnet116"/>
  <!-- more links -->
</linkGrp>

<!-- Out of document linking -->
<linkGrp>
  <link target="http://www.example.com/taxonomy.xml#ShakeSon http://www.example.com/poems.xml#sonnet116"/>
  <!-- more links -->
</linkGrp>
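A stand-off link is just a whitespace-separated list of pointers, so a processor can split it and separate each document address from its fragment identifier. A minimal sketch, assuming both links are placed in a single `linkGrp` for brevity; `resolve_links` is a hypothetical helper, and `urldefrag` comes from Python's standard `urllib.parse`.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

LINKS = """<linkGrp>
  <link target="#ShakeSon #sonnet116"/>
  <link target="http://www.example.com/taxonomy.xml#ShakeSon http://www.example.com/poems.xml#sonnet116"/>
</linkGrp>"""

def resolve_links(link_grp):
    """For each link, return a list of (document, fragment) pairs.

    An empty document string means the pointer refers to the current document.
    """
    resolved = []
    for link in link_grp.iter("link"):
        pointers = link.get("target").split()
        resolved.append([tuple(urldefrag(pointer)) for pointer in pointers])
    return resolved

links = resolve_links(ET.fromstring(LINKS))
for pairs in links:
    print(pairs)
```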

Note: You can be valid in addition to being well-formed. This means you obey the rules of a specified schema, such as the Guidelines of the Text Encoding Initiative
Test your XML Knowledge
Well-Formedness vs Validity
Being well-formed means you obey the rules of the XML syntax (e.g. proper nesting, quoted attribute values); all XML must be well-formed, otherwise processing must stop
Being valid means that, in addition, you obey rules about which elements are allowed where, what attributes they have, and what their values may be
Common schema languages include:
Relax NG (Compact or XML Syntax)
W3C XML Schema
DTD Language
Or the Text Encoding Initiative has a meta-schema customisation language (TEI ODD) which enables generation of all of these
DTDs are very dated, don't cope with namespaces, and have other problems. We recommend Relax NG or TEI ODD.
XML Vocabularies
There are a huge number of XML vocabularies available, many overlapping and redundant
Wikipedia lists around 250 of them, and there are many which are not listed there:
http://en.wikipedia.org/wiki/List_of_XML_markup_languages

There probably exists an XML vocabulary for the data you use: it is better to use an existing format than re-invent the wheel
We (researchsupport@it.ox.ac.uk) can help you choose a markup language suitable to your field, work, research, or project
The university is a long-term supporter of the Text Encoding Initiative (TEI) Guidelines, which provide an extremely flexible and extensible vocabulary
http://www.tei-c.org/
XML Editors
There are many XML editors available, both free and proprietary
We use the oXygen XML editor, for which the University has a site license
You want an editor which provides
syntax highlighting
continual schema validation
content completion
node collapsing
XPath/XQuery searching
built-in XSLT transformations
multi-platform
Which are well-formed XML?
<seg>some text</seg>
<seg> <w>some</w> <hi>text</hi> </seg>
<seg> <w>some <hi></w> text</hi> </seg>
<seg type="text">some text</seg>
<seg type=text>some text</seg>
<seg type="text"> some text <seg/>
<seg type="text"> some text<gap/> </seg>
<seg type="text">some text</Seg>
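One way to check your answers is to hand each candidate to an XML parser, which must reject anything that is not well-formed. A minimal sketch using Python's standard `xml.etree.ElementTree`; the helper `is_well_formed` is a hypothetical name for this illustration.

```python
import xml.etree.ElementTree as ET

CANDIDATES = [
    '<seg>some text</seg>',
    '<seg> <w>some</w> <hi>text</hi> </seg>',
    '<seg> <w>some <hi></w> text</hi> </seg>',
    '<seg type="text">some text</seg>',
    '<seg type=text>some text</seg>',
    '<seg type="text"> some text <seg/>',
    '<seg type="text"> some text<gap/> </seg>',
    '<seg type="text">some text</Seg>',
]

def is_well_formed(source):
    """Return True if the parser accepts the string, False if it raises a parse error."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True

results = [is_well_formed(candidate) for candidate in CANDIDATES]
for candidate, ok in zip(CANDIDATES, results):
    print("well-formed" if ok else "NOT well-formed", candidate)
```

Running it reveals the answers; each rejection corresponds to one of the syntax rules (nesting, quoting, closing tags, case sensitivity).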
What Does Markup Enable?
Markup Enables Many Things
It makes explicit to a machine what is implicit to a human
It encodes our assumptions and understanding of a text
It enables us to impose structure(s) on data
It helps us relate information/resources/metadata together
It lets us categorise aspects of texts that interest us
It helps us share our resources and intellectual activity through choosing to use a common vocabulary
But that is just the tip of the iceberg!
Almost everything we do these days on computers needs some form of markup
By using a standardised vocabulary we are able to re-use tools, programs, and concepts across projects and disciplines
(even this prezi is stored as XML!)
XML
Databases
XML Databases
XML: Pros and Cons
Pros                                  Cons
Expressive freedom                    Verbose markup
Handles complexity                    Slow to process
Human readable                        Complex processing
Arbitrary nested data relationships   Unpredictability
Schema validation                     External dependencies
Relational Databases: Pros and Cons
Pros                                  Cons
Well understood                       Unrealistic structures
Stable & Optimised                    No complex base types
Standard Query Language               Limits of SQL for complex data
Easy to understand tables             Complex relationship linking
Available tools                       Hard to use others' databases
XML vs DB
Relational Databases have a much longer history, but XML is now used pervasively throughout our digital world
If your data is table-like (e.g. a telephone book) then databases are better; if your data has arbitrarily deeply-nested structures (names inside paragraphs inside sections) then XML is the right solution
There are efficient ways to query large XML text-bases, but databases make large-scale querying of simple structures easier
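To make the nested case concrete, the sketch below queries names inside paragraphs inside sections, at whatever depth they occur. The sample text, its names, and the variable names are all invented for this illustration; the path syntax is the limited XPath subset supported by Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Invented sample data: sections (div) containing paragraphs containing names.
TEXT = """<text>
  <div n="1">
    <p>The inscription names <name>Gaius</name> and <name>Julia</name>.</p>
  </div>
  <div n="2">
    <p>A later hand added <name>Marcus</name>.</p>
    <div n="2.1"><p>No names here.</p></div>
  </div>
</text>"""

root = ET.fromstring(TEXT)

# Every <name> directly inside a <p>, wherever that <p> sits in the tree.
all_names = [name.text for name in root.findall(".//p/name")]

# Names grouped by top-level section, searching each section at any depth.
names_by_section = {
    div.get("n"): [name.text for name in div.iter("name")]
    for div in root.findall("div")
}

print(all_names)
print(names_by_section)
```

Representing the same data in tables would need a separate row for every paragraph and name, plus keys to record which contains which.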
Format
It isn't about format: XML is good for some things, databases are good for other things
It is about the preservation and interrogation of intellectual content
It is about finding, understanding, and creating human cultural heritage
It is about enabling research
Outputs
XML is a good verbose storage format even when you are using other formats for delivery
Outputs such as JSON and Linked Open Data (in RDF) are common (often lossy) outputs but are problematic as primary formats
XML enables as many outputs as you can conceive; there is never only one target output (if you let others use your data)
What is the TEI?
(The Text Encoding Initiative)
An international consortium of institutions, projects and individual members
A community of users and volunteers
A freely available manual: a set of regularly maintained and updated recommendations, 'The Guidelines'
Definitions, examples, and discussion of over 540 markup distinctions for textual, image facsimile, genetic editing etc.
A mechanism for producing customized schemas for validating your project's digital texts
A set of free and openly licensed, customizable tools and stylesheets for transformations to many formats (e.g. HTML, Word, PDF, Databases, RDF/LinkedData, Slides, ePub, etc.)
A simple consensus-based way of organizing and structuring textual (and other) resources
A format for documenting your interpretation and understanding of a text (and how text functions)
An archival, well-understood, format for long-term preservation of digital data and metadata
Whatever you make it! It is a community-driven standard
What the TEI is not:
the only standard in this area
objective or non-interpretative
used consistently even within the same project (never mind in other ones)
fixed and unchanging
your research end-point
automatic publication or long-term preservation
T E I
NOT TEI
The benefits of a shared vocabulary far outweigh the effort of learning the TEI
What is the TEI?
And Why Should I Care?
(A brief introduction for classicists)
Dr James Cummings, Academic IT Services
researchsupport@it.ox.ac.uk

@jamescummings
http://tinyurl.com/jc-classics-2015-01-27
TEI
Full TEI Schema
Modules
Simple analytic mechanisms
Certainty and uncertainty
Core elements
Corpus texts
Dictionaries
Performance texts
Tables, formulæ, notated music, and figures
Character and glyph documentation
The TEI Header
Feature structures
Linking, segmentation and alignment
Manuscript Description
Names and dates
Graphs, networks, and trees
Transcribed Speech
Documentation of TEI modules
Critical Apparatus
Default text structure
Transcription of primary sources
Verse structures
Customized TEI
Modules
Simple analytic mechanisms
Certainty and uncertainty
Core elements
Corpus texts
Dictionaries
Performance texts
Tables, formulæ, notated music, and figures
Character and glyph documentation
The TEI Header
Feature structures
Linking, segmentation and alignment
Manuscript Description
Names and dates
Graphs, networks, and trees
Transcribed Speech
Documentation of TEI modules
Critical Apparatus
Default text structure
Transcription of primary sources
Verse structures
TEI ODD Customisation
But of course you want to remove specific elements from these 'modules' as well.

No
project will need all of them!
Chaining of TEI ODD Customisations
One of the interesting developments in TEI ODD design is that you can 'chain' customisations.

This means that if a TEI Community or Project makes a customisation then others can come along and make their own customisation that points to this project's TEI ODD as a source.

This enables projects to truly say "We're very much like that project over there (e.g. EpiDoc), but we need to add back in this element that they removed" and to document this in a machine-processable manner.
EpiDoc
EpiDoc is an international, collaborative effort that provides guidelines and tools for encoding scholarly and educational editions of ancient documents.
While it focuses on epigraphical documents, it is also used for other ancient documents
EpiDoc is a pure TEI P5 Subset
In creating your own digital classics works, you will want to know EpiDoc:
so you can use existing transcriptions or metadata
to publish your own transcriptions if you want others to use them
Adopting recommended methods of encoding helps with the long-term preservation of the data created and its interoperability with existing or future tools
OxGarage
Freely available web frontend to underlying XSLT conversions
REST-enabled API interface for scripts doing bulk conversions
Pipelined conversions through many steps (e.g. DOCX to TEI P5 to ePub)
Often uses TEI P5 as pivot format

Why would you want those things?
because we need to interchange resources
  between people
  (increasingly) between machines
because we need to integrate resources
  of different media types
  from different technical contexts
because we need to preserve resources
  cryogenics is not the (full) answer!
  we need to preserve metadata as well as data
A document is "TEI Conformant" if and only if it:
is a well-formed XML document
can be validated against a TEI Schema, that is, a schema derived from the TEI Guidelines
conforms to the TEI Abstract Model
uses the TEI Namespace (and other namespaces where relevant) correctly
is documented by means of a TEI Conformant ODD file which refers to the TEI Guidelines
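Two of these criteria can be checked mechanically with very little code. The sketch below tests only well-formedness and whether the root element is in the TEI namespace (http://www.tei-c.org/ns/1.0); it does not validate against a schema or check the abstract model, so it is nowhere near a full conformance test. The helper `basic_tei_checks` and the sample documents are invented for this illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

def basic_tei_checks(source):
    """Return (well_formed, root_in_tei_namespace) for an XML string.

    This covers only two of the conformance criteria; schema validation
    and the abstract model are out of scope for this sketch.
    """
    try:
        root = ET.fromstring(source)
    except ET.ParseError:
        return (False, False)
    return (True, root.tag.startswith("{" + TEI_NS + "}"))

# Invented sample documents for the sketch.
in_namespace = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/><text/></TEI>'
no_namespace = '<TEI><teiHeader/><text/></TEI>'
broken = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text></TEI>'

for sample in (in_namespace, no_namespace, broken):
    print(basic_tei_checks(sample))
```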

TEI Conformance
Standardization should not mean ‘Do what I do’, but rather ‘Explain what you need to do in terms I can understand’

Version Date
2.7.0 2014-09-16
2.6.0 2014-01-20
2.5.0 2013-07-26
2.4.0 2013-07-05
2.3.0 2013-01-17
2.2.0 2012-10-25
2.1.0 2012-05-15
2.0.2 2012-02-02
2.0.1 2011-12-22
2.0.0 2011-12-16
1.9.1 2011-03-05
1.9.0 2011-02-25
1.8.0 2010-11-05
1.7.0 2010-07-06
1.6.0 2010-02-12
1.5.0 2009-11-08
1.4.1 2009-07-01
1.4.0 2009-06-20
1.3.0 2009-02-01
1.2.0 2008-10-31
1.1.0 2008-07-04
1.0.1 2008-02-03
1.0.0 2007-11-02
Versions of TEI P5
TEI Development
The TEI Guidelines are constantly being improved and as such are an evolving history of Digital Humanities concerns
But any individual project can choose to stay with any earlier version
The elected TEI Technical Council takes bug reports and feature requests from the community and implements them
You (yes, you!) can participate in the community mailing list (TEI-L) and point out bugs or make feature requests on:
http://tei.sourceforge.net
TEI
Descriptive and Historical Data
Text Transcription
Vocabularies and Indexing Terms
EpiDoc Guidelines
http://www.stoa.org/epidoc/gl/latest/
Tom Elliott, Gabriel Bodard, Elli Mylonas, Simona Stoyanova, Charlotte Tupman, Scott Vanderbilt, et al. (2007-2014), EpiDoc Guidelines: Ancient documents in TEI XML (Version 8). Available: http://www.stoa.org/epidoc/gl/latest/.
Leiden Underdotting (ambiguous characters)
EpiDoc Example
EpiDoc
http://www.tei-c.org/oxgarage/