digital archives

by Matt Milner, 10 July 2013

Transcript of digital archives

Document Contents
It is a misnomer to think that the digital involves no humanistic manual labour.
Digitization always involves a balance between automated computation and continued manual labour
For each project the balance is different depending upon the degree of organization within the documents
structured - census records or tabular text - the easiest, as a certain amount of 'manual' labour has already occurred
unstructured - prose, notes, etc. - requires more manual attention
the transformation of a photographed document into a text-based version or representation can involve a mix of automated and manual methods
So you want to create a
Digital Archive?

Making a Digital Archive
Requires a wide array of skilled researchers and experts - not merely in the humanities domain.
Meticulous and exacting planning up front will save money and time.
Lots of attention!
But it's well worth it!

archive, n. or v.
/arkaiv/ /-kiv/

I. 1. A place in which public records or other important historic documents are kept. Now only in pl.
2. A historical record or document so preserved. Now chiefly in pl.
II. trans. To place or store in an archive; in Computing, to transfer to a store containing infrequently used files, or to a lower level in the hierarchy of memories, esp. from disc to tape.
Digitization
Matthew Milner, PhD LMS
religious and cultural life in late medieval & early modern England
digital humanist working on web applications & historical research
... new collaborative project on early modern English prosopography

Metadata
Metadata is 'data about data' - but this can be ambiguous
Metadata about the data itself - like when the data was collected or created
or
Metadata as a kind of metacontent - data surrounding or in addition to content, data about something, such as authorship, place of publication, etc.

In practice both are 'metadata', but the distinction is important in terms of historicizing data
the former is not historical, while the latter is.
Content Management
Workflows & Project Management
Website Design
Teaching & Curation
Text Analysis & Modelling
... in Six Easy Steps!
Equipment & Lighting
Quality & Standards
Costs
Metadata?
Archival Metadata
Standards
The Sources
Jewish Community Minute Books, c. 1500-c. 1800

Accounts of communal life in Ashkenazi centres which include records of births, marriages, deaths; rites of passage; regulations and injunctions on social life; local and customary interpretations of law; legal findings and court records; community governance and financial accounts; various other records.

Loosely structured, roughly chronological, in folios; between c. 100-200 communities; actual pages numbering in the thousands, in various folio sizes; hand-written in a fairly stable scribal hand with fairly standardized legal and clerical terminologies.
The Researcher
The Archivist
The Student
Partners
Community & Wider Public
Access & Rights
Digitization - A Brief History
Some things to think about...
Your Digital Archive - a Rough Sketch...
Authority
Metadata vs Data?
Managing the stuff...
The Basic Structure
Common Solutions I - Back-End
Common Solutions II - Front Ends
Digital Asset Management (DAM)
Omeka
... some hints
Mobile Website?
Analytics & Tools
Styling & Headers
What are the main sections?
Audiences & Primary Purpose
automation vs manual
auto: optical character recognition
transcribe bentham
auto: natural language processing
juxta / juxta commons
manual: transcriptions & crowd
Digital Texts
catma
esthr
Mallet - Topic Modeling
Voyant Tools
Digitization & Preservation
Student Involvement
Transcriptions & Editing
Lesson Plans
The Virtual Classroom
Project Assessment & Review
Prioritization
Project Development
Concurrent and Stages
Project Life-Cycle
Project Administration
specialist faculty and graduate students with the necessary skills to examine the documents without much assistance, using the documents and data in their most extensive ways.
specialists primarily interested in the preservation of documents and the promotion of their usage; for fragile documents, they will encourage others to use digitized formats before handling the actual documents.
undergraduates and graduates who may not have a full research agenda; teaching focused not so much on content as on a) exposure to Jewish history, historical culture, and documentary evidence and/or b) palaeography, or archival students interested in learning skills to handle similar documents.
institutions, agencies, and private partners who want to see collections receive exposure and usage; philanthropic objectives for private partners, and research / cultural heritage objectives for agencies.
non-specialist audience interested in cultural heritage; most are unlikely to use the documents directly, though there will likely be a small but strong core of dedicated users interested in these documents as a matter of public communal history. Most, however, will likely want a curated exhibition of holdings rather than diving into the digitized documents directly.

non-specialist audience interested in cultural heritage, mainly focused on curated exhibits of holdings rather than research-level interaction with documents. They want to be told the story.
What kinds of access and rights of use will each audience have?
What kinds of rights will each archive give you in order to use and present materials online to these audiences?
What kinds of licenses might be employed to help users understand what their obligations are?
Will the same rights pertain to the entire Archive? For all users within a given audience?

Will the same rights pertain to images as to digitized or transcribed texts?
Who will have rights to any work-product or research conducted on the website?
Whose intellectual property - not merely visitors, but also researchers on the project - is paramount? Graduate student workers? Postdoctoral fellows?

How will institutional policies affect your decisions?
Digitization is about creating a digital representation of an artifact
Roberto Busa, an Italian Jesuit priest, convinced IBM in 1946 to help organize the lemmatization of the works of Thomas Aquinas.
It took 30 years, resulting in the Index Thomisticus.

The Text Encoding Initiative (TEI) adopted the XML format in 2002, after XML arrived in the late 1990s. TEI itself, however, is older, beginning in the late 1980s, before the web.

Busa and TEI saw digitization as the creation of a new representation or edition of a text, either by breaking it down into lemmas, or by encoding a representation of it in TEI format.

Digital photography revolutionized preservation by offering a digital means to create high-quality electronic representations of archival holdings.

Lasers are now being used to create 3D representations...
Highly successful digital projects tend to:
prioritize either content or method in their work at a given time, but not both simultaneously.
A digitization project should follow established procedures and practices, reusing and repurposing existing frameworks and methods wherever possible.

That said...

If something doesn't work - change the method.
If your digitization project needs new methods or tools, invent them.

Simply be cognizant of what the priority is - it is easy to bite off more than you can chew....
Wherever and Whenever possible - DON'T reinvent the wheel!
a website (virtual research environment, web application, digital archive, virtual learning environment, online collaborative environment, digital museum) that will make the Minute Books of Ashkenazi communities available to research, community, and learning audiences. Like a physical archive, it will need to build its research, outreach, and pedagogical programs on top of a strong archival and repository foundation for its holdings.
audiences should help guide the production of the website, but the first objective is digitization

Throughout the process there will be a nearly constant need for reflection on the relationship between automated computational processes and manual human research and editorial work.
Digitization
Metadata
Content Management
Website Design
Document Contents
Tagging & Modeling
Teaching
Curation
Dissemination
Workflow
Cyber-Infrastructure
to make a digitization project happen you need a robust infrastructure composed of...
Skilled Personnel - not 'technicians' but researchers and assistants who have a variety of skills, from computer programming to palaeography, and everything in between. And a project manager.
Hardware - of various kinds: cameras, scanners, servers (virtual or physical), laptops, hard drives, cables, notebooks, ... these costs are usually underestimated by humanities researchers
Software - for Image handling, Optical Character Recognition, website design and management, etc.
Oxford
-Kontron Progress 3012
-Dicomed cameras
Penn
4x5 View Better Light Camera
resolution up to 12,000 x 15,900 pixels (1.1 GB files); these surpass single-shot backs using the new Kodak 39-megapixel sensor (225 MB)
Digitization is a costly endeavour - here are some breakdowns...
Linked Open Data
Each archive has (or should have) its own metadata associated with its holdings - this metadata might not itself be digitized, but may exist in a card or bound catalogue.
In addition to existing metadata, such as titles or call / manuscript numbers, varying archives may have information on physical descriptions of the holding - foliation, binding, ownership, materials, inks, provenance etc.
Some archives might exert rights management over catalogue or metadata in addition to images
MSS that have moved or changed hands might have metadata in more than one archive
Diversity of metadata - standards have emerged to help ensure interoperability, both between human users (archivists and researchers) AND machines.
Standards are now also 'schemas' or vocabularies for describing metadata - they are important for data exchange and research - some are....
see http://www.loc.gov/standards/
EAD (Encoded Archival Description) - the archival response to MARC, for finding archival materials. Library of Congress. http://www.loc.gov/ead/
Dublin Core - popularized the idea of "core metadata" for simple and generic resource descriptions. Probably the most widely used standard for artifacts. http://dublincore.org/
OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) - defines a mechanism for harvesting records containing metadata from repositories. http://www.openarchives.org/
METS (Metadata Encoding & Transmission Standard) - a structure for encoding descriptive, administrative, and structural metadata.
PREMIS (Preservation Metadata) - a data dictionary and supporting XML schemas for the core preservation metadata needed to support the long-term preservation of digital materials.
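By way of illustration, here is a minimal sketch of a Dublin Core record built with Python's standard library. The element names come from the Dublin Core element set, but the values and the folio described are hypothetical.

```python
# A minimal, illustrative Dublin Core record for one digitized folio.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

record = ET.Element("record")
for name, value in [
    ("title", "Minute Book, Community X, fol. 12r"),  # hypothetical holding
    ("creator", "Unknown scribe"),
    ("date", "1652"),
    ("format", "image/tiff"),
    ("rights", "Courtesy of the holding archive"),
]:
    ET.SubElement(record, f"{{{DC}}}{name}").text = value

print(ET.tostring(record, encoding="unicode"))
```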
Along with standardized ways of describing metadata, we're seeing the emergence of authority files and systems to help handle authors and places - analogous to ISBNs, ISMNs, and ISSNs in the bibliographic world.

Virtual International Authority File (VIAF, OCLC) http://www.viaf.org
a joint project of several national libraries plus selected regional and trans-national library agencies. The project's goal is to lower the cost and increase the utility of library authority files by matching and linking widely used authority files and making that information available on the web.

GeoNames - http://www.geonames.org
a geographical database available for download free of charge under a Creative Commons attribution license. It contains over 10 million geographical names and consists of over 8 million unique features, of which 2.8 million are populated places and 5.5 million are alternate names.
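As a rough sketch of what using such an authority service might look like, the snippet below queries the GeoNames search web service to resolve a place name to a stable identifier (it assumes the Python requests library and a registered GeoNames username; "demo" is only a placeholder).

```python
import requests

resp = requests.get(
    "http://api.geonames.org/searchJSON",
    params={"q": "Frankfurt am Main", "maxRows": 1, "username": "demo"},  # placeholder username
    timeout=10,
)
hit = resp.json()["geonames"][0]
print(hit["geonameId"], hit["name"], hit["countryName"])  # stable identifier plus resolved name
```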
When does metadata about an artifact become data, and for whom?
How do historians use and interrogate this data?
Can they add to it?
Must they do so in prose only?

Are the publication or dissemination activities of a digital archive merely to harvest and present existing data or to interrogate it and amend it?

A digital archive gives historical scholars the means to
edit and amend existing metadata
add new metadata, or conceive of metadata as historical data - such as ownership, prosopographical data, place names, purchase values of artifacts, book and MSS production practices, etc.
through the publication of newly amended data, offer more up-to-date data for other scholars.

Metadata standardization - through schemas and vocabularies - is fuelling a new mode of data exchange called 'Linked Open Data', which sees machine-readable data offered across websites. This is turning the web itself into a vast database of sorts, allowing scholars to disseminate not only their work in prose format, but also factual assertions as XML- or JSON-based data.

Linked Open Data is part of the 'Semantic Web', which aims to make online resources easier to find and to make searching more precise.

It is based on machine access points called Application Programming Interfaces (APIs) that allow data to be queried and 'understood' by servers, provided with standardized metadata descriptions.

DBPedia.org collates much of this material - as of Sept. 2012 the English version describes 3.77 million things, including 764,000 persons, 573,000 places, 333,000 creative works, 192,000 organizations, 202,000 species and 5,500 diseases.
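As a rough sketch of consuming Linked Open Data, the query below asks DBpedia's public SPARQL endpoint for Jeremy Bentham's birthplace and reads the JSON results (it assumes the Python requests library; the resource and property are illustrative).

```python
import requests

query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?birthPlace WHERE {
  <http://dbpedia.org/resource/Jeremy_Bentham> dbo:birthPlace ?birthPlace .
}
"""
resp = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
    timeout=10,
)
for row in resp.json()["results"]["bindings"]:
    print(row["birthPlace"]["value"])  # URIs of matching places
```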
... digitization will result in files (images - TIFFs, JPEGs) and text-based metadata, which will need to be stored.

Cyber-infrastructure: Hardware & Software 'stacks' or configurations for managing this data and content.

Comprised of disk space in various formats, machine(s) suited to sorting and presenting the material over the web, and software to facilitate the management of the machines, the handling of data, the storage and presentation of the files, the manipulation and presentation of the metadata/data, and control of access to it all... all while being flexible, responsive... and cost effective.
The basic system architecture for a web application has three layers - the back end comprising the data storage
database
file storage
an application or logic layer for processing data
and a front end, or website, for user interaction.
Each layer can be quite complex:
the entire back end can rest on one or more servers, running similar or distinct operating systems, using one or a network of hard drives for storage.
the application or logic layer can have numerous applications for handling and processing data, authentication, etc. email, scheduled jobs, non-browser access etc.
the presentation layer or user-interface, is browser-based, but can employ any number of applications or software libraries to help users with the data

each layer must be built somehow - and requires considerable forethought. Should software be open source? What kinds of licenses might be needed? How will distinct servers and applications relate to one another? (A minimal sketch of the three layers follows below.)
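A minimal sketch of the three layers in a single file, assuming Python with Flask and SQLite: storage (back end), a query function (logic layer), and a route that returns HTML (presentation layer). The table and file names are illustrative.

```python
import sqlite3
from flask import Flask

app = Flask(__name__)

def list_folios():
    # Logic layer: read item metadata out of the storage layer.
    with sqlite3.connect("archive.db") as db:
        return db.execute("SELECT shelfmark, title FROM folios").fetchall()

@app.route("/")
def index():
    # Presentation layer: a minimal HTML listing for the browser.
    items = "".join(f"<li>{s}: {t}</li>" for s, t in list_folios())
    return f"<h1>Holdings</h1><ul>{items}</ul>"

if __name__ == "__main__":
    app.run(debug=True)
```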
Back-end stacks comprise a server, a database system, and a scripting language for processing data and creating the presentation (web) layer. Two are prominent - Microsoft- and Linux-based.
Microsoft Stack
Windows-based; SQL Server; ASP & .Net
*AMP Stack
Various operating systems, usually Linux; MySQL/PostgreSQL; PHP
We're in the heyday of Content Management Systems (CMS), which allow users to generate websites fairly easily using data-management architectures. These sit in the application and presentation layers. Well-known ones include...
Drupal
Joomla
Wordpress
Others are more frameworks or libraries allowing rapid custom development
Django
Zend
CakePhp
Sencha
each of these organizes posts, pages, and forms, handles images, offers plugins, etc. for various types of database-driven websites and applications. Some are more customizable than others. Most are open source or community supported. They use scripting languages like PHP, Python, Java, JavaScript, C++, etc.
Systems which combine back end and front end to give users a ready-made, customizable digital asset management system are becoming popular solutions for libraries and research teams.
Canadian success story - Islandora (http://www.islandora.org)
Combines Fedora Commons (a digital object repository), Drupal (which uses MySQL & PHP), and other applications including Solr (Apache's blindingly fast search platform) for managing large, searchable collections of digital assets of any type.
Doesn't just mean content acquisition or creation of the online platform - it means a long-term strategy for programming milestones, versioning, backups, etc.
Milestones - development 'finish' points where certain tasks are completed. Often associated with software or application development.
Versioning - many digital projects use Git or SVN, applications that monitor and back up versions of a digital project as you go - so if there's a problem, you can 'roll things back' to a good version.
Versioning and Milestones are related to different tiers of the development process - from Local, Development, Testing, Staging, and Production level versions of software.
Your cyber infrastructure needs to account for these stages
collection management software from the Center for History and New Media at George Mason University
Custom Systems
Many projects still build their own custom content management systems. This can be intensive, but is also the only way to make sure you have everything YOUR project needs.
File Management
Database (MySQL/Postgresql)
Php / Python
Web Server (Apache)
Java
... most often the case with non-Latin script languages. Although CMSs are Unicode compliant, they aren't as thoroughly tested with such scripts - a quick round-trip check is sketched below.
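A minimal round-trip check of that kind, assuming Python and SQLite: store one line of Hebrew transcription and confirm it comes back unchanged. The sample text, table, and file names are illustrative.

```python
import sqlite3

sample = "פנקס הקהילה"  # illustrative Hebrew line ("community minute book")

with sqlite3.connect("archive.db") as db:
    db.execute("CREATE TABLE IF NOT EXISTS lines (folio TEXT, text TEXT)")
    db.execute("INSERT INTO lines VALUES (?, ?)", ("12r", sample))
    (stored,) = db.execute("SELECT text FROM lines WHERE folio = '12r'").fetchone()

assert stored == sample  # the Unicode round-trip is preserved
print(stored)
```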
Who is visiting the website and why?
Are there several ways to 'enter' the website, depending upon use and experience?
Do you need to appeal to other websites' designs and features because users expect them?
Do different audiences have different sections of the website?
Can users login to the website?
Who monitors users' online activities - especially if there are message boards etc?
Who can gain access and how?
Common sections include:
Home
About
Research
Contact Us
Events
News
Search or Explore
selection of main headers or sections for the website is crucial to user navigation around the site
With an archive, the search and browsing features will be essential. How will users find what they're looking for - even if they don't know what they're after?
Clean designs are best, with light backgrounds and dark fonts.
Content Management Systems allow for the creation of 'themes' which give a similar design to all pages; custom websites have a bit more flexibility - but the principle that webpages should follow a similar design, colour scheme, etc. is a good one.
Wherever possible, have stable headers and footers to pages - often these are places for menus, allowing users to easily move around the site.
Users, increasingly, are annoyed by long pages - shorter pages with 'tabs' or less content are visited more often
take advantage of HTML5 and its new methods of presenting rich web content that has low overhead.
developing cross-browser compliant websites can be tricky - Internet Explorer is the bane of website designers, but commonly found in institutions and organizations - this makes development of academic websites a bit troublesome sometimes.
only load what is needed in a webpage - pages should be as 'skinny' as possible, making them fast and light, and thus easier to use
users typically want or need to be able to find information within 3 clicks of entering a website.
This is something of a debate at the moment - should academic websites have mobile versions?
It's not clear - depends on audience, and use of the website.
A Digital Archive - perhaps not. But if texts are downloadable for easier reading - perhaps so. Or if there are discussion boards.
Mobile sites can be costly to develop; they have different forms in contrast to desktop sites.
The project should exploit free analytics and tools from providers like Google.
Which pages do users like or dislike?
How do they move through the site?
How many visitors do you have a month?
How many visitors come from search engines?
How many visitors from which countries?
Find broken links and pages

Such statistics can be invaluable for funding agencies and institutional partners' impact statements.
uses signal processing to render images into text
modern fonts are the easiest, but advances (here in Montreal!) are allowing for the recognition of complex handwritten Arabic scripts
requires training sets
never perfect - there are always errors; the question is the margin of error, and whether it is acceptable
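A minimal OCR sketch, assuming the open-source Tesseract engine via the pytesseract wrapper (one tool among several; Tesseract itself must be installed separately, and the file name is illustrative).

```python
from PIL import Image
import pytesseract

page = Image.open("folio_012r.tif")        # hypothetical digitized folio
text = pytesseract.image_to_string(page)   # raw OCR output - will still need manual correction
print(text)
```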
Software:
Use of manual labour in transcriptions usually occurs in two distinct ways:
1. Correction of OCR results - research assistants comb through OCR texts, examining both the originals and the output, noting and amending any errors. Essentially a 'clean up' job
2. Transcriptions - skilled users, either assistants, or 'the crowd' create manual transcriptions of texts using online editors and transcription tools.
In these cases usually there's overlap - two or more users transcribe a text, while a third reconciles errors.
A final possibility is crowd-sourcing the correction of OCR output errors.

Can be cost-heavy if paid; crowd-sourcing can be difficult to manage, but also has the added bonus of community building.
Juxta is an open-source tool for comparing and collating multiple witnesses to a single textual work. Originally designed to aid scholars and editors in examining the history of a text from manuscript to print versions, Juxta offers a number of possibilities for humanities computing and textual scholarship. Juxta Commons is the online version.
TPen
T-PEN is a web-based tool for working with images of manuscripts. Users attach transcription data (new or uploaded) to the actual lines of the original manuscript in a simple, flexible interface.
Project based at UCL focused on the online transcription of original and unstudied manuscript papers written by Jeremy Bentham.
Uses MediaWiki - an open-source wiki platform.
Crowdsourced - uses volunteers to go through some 60,000 pages of works
Volunteers transcribe in TEI xml format in a 'transcription desk'
Roughly 52% complete, having started in 2010.
Encoded texts can be treated differently from simple OCR output or raw images.
Basic functionalities:
Full-text searching
Collations and word-counts

More complex:
Collaborative or automated tagging and analysis
Corpus Analysis and Distant Reading

More speculative:
Topic Modelling & Subject / Classification
Named Entity Extraction
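A minimal sketch of the basic functionalities above - word counts and a crude full-text lookup - over a plain-text transcription, assuming Python's standard library; the file name and search term are illustrative.

```python
from collections import Counter
import re

with open("minutebook_transcription.txt", encoding="utf-8") as fh:
    text = fh.read()

tokens = re.findall(r"\w+", text.lower())   # naive tokenization
counts = Counter(tokens)

print(counts.most_common(10))                                     # word counts
print([i for i, t in enumerate(tokens) if t == "marriage"][:5])   # positions of a search term
```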
Voyant Tools is a web-based text reading and analysis environment. It’s designed to make it easy for you to work with your own text or collection of texts in a variety of formats, including plain text, HTML, XML, PDF, RTF, and MS Word.
Computer Aided Textual Markup & Analysis
CATMA is now an online research suite comprising tagging, analysis, and visualization modules. While the analysis and visualization modules are automated, tagging is done manually, either by individuals or groups who use self-defined or predefined tagsets to describe their texts.
Criminal Intent
The Datamining with Criminal Intent project brings together three online resources: the Old Bailey Online, Zotero, and TAPoR. It allows users to study the rich Old Bailey resource (127 million words of trial accounts) using analytical tools from TAPoR like Voyeur and information management tools like Zotero.
A topic modeling tool takes a single text (or corpus) and looks for patterns in the use of words; it is an attempt to inject semantic meaning into vocabulary. Topic models represent a family of computer programs that extract topics from texts.
Machine Learning for LanguagE ToolkiT
MALLET generated a list of thirty topics comprised of twenty words each from Martha Ballard’s Diary (1785-1812). Below is a quick sample of what the program “thinks” are some of the topics in the diary:

MIDWIFERY: birth deld safe morn receivd calld left cleverly pm labour fine reward arivd infant expected recd shee born patient
CHURCH: meeting attended afternoon reverend worship foren mr famely performd vers attend public supper st service lecture discoarst administred supt
DEATH: day yesterday informd morn years death ye hear expired expird weak dead las past heard days drowned departed evinn
GARDENING: gardin sett worked clear beens corn warm planted matters cucumbers gatherd potatoes plants ou sowd door squash wed seeds
SHOPPING: lb made brot bot tea butter sugar carried oz chees pork candles wheat store pr beef spirit churnd flower
ILLNESS: unwell mr sick gave dr rainy easier care head neighbor feet relief made throat poorly takeing medisin ts stomach
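MALLET itself is a Java toolkit; a comparable sketch in Python uses the gensim library's LDA implementation. The tiny toy corpus below is purely illustrative - real topic models need thousands of documents.

```python
from gensim import corpora, models

docs = [
    ["birth", "safe", "labour", "infant", "patient"],
    ["meeting", "worship", "service", "lecture", "supper"],
    ["garden", "corn", "planted", "potatoes", "seeds"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)                       # top words per inferred topic
```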

A website or digital archive is not like a book - it does not 'finish' in the same way. It must be maintained, and sustained, or it loses its currency.
Milestones give projects natural 'cycles' or completion points, which translate into grant, publication, and review cycles
Participants need moments to turn attention elsewhere, reevaluate participation, etc.
Teams need moments to assess, bring in new members, etc.
Cyclical development is key for long-term sustainability and planning.
Long-term partnerships will ease these burdens for academics, especially when a project is associated with an institution
An important discussion within the Digital Humanities community is how digital projects, including archives, should be assessed and reviewed following traditional humanistic peer critique.
Review should be seen as part of the process.
Sections of the website should detail methodologies and processes for digitization, treating it as an intellectual problem, not merely a matter-of-fact practical affair.

How did the project ...
go about producing its digitized images? what standards were used?
transcribe or create its digital texts? what problems were encountered?
involve student researchers?
report its methodological considerations?
I. Collection Preparation
Select the collection or other body of content
Production Workplan
Organize: give the collection its structure and/or arrangement
Apply digital naming conventions
Determine copyright or other restrictions
Obtain Repository Space
II. Contracting for Digital Conversion
Most collections are digitized by contractors who specialize in various types of originals
III. Digital Capture
Technical considerations at capture and post-processing time
text conversion, formats, headers, compression, and delivery
IV. Quality Review
V. Archive
VI. Assemble Collection
Cameras & Scanners
A computer-controlled camera captures images up to 80 megapixels with 48-bit, uninterpolated color. Archival quality, 600 dpi images are possible for codices with spine lengths up to 16.5 inches.
You or Them?
Digitization is a production process which has standards and calls for quality reviews
All stress the need for experimentation and 'benchmarking' to calibrate the equipment
Some minimum standards are:
Penn State Libraries:
Text-only or with line drawings: 600 dpi 1-bit TIFFs, ITU-4 compression.
Text with black-and-white illustrations: 300 dpi 8-bit grayscale TIFFs, no compression.
Text with color illustrations or backgrounds: 300 dpi 24-bit color TIFFs, no compression.
Images: 300 dpi 24-bit color TIFFs, no compression.
Oxford:
scanning of graphics is often in the range of 100-600dpi (although 200dpi as a bottom level is more common)
the archiving format of choice is uncompressed TIFFs (IBM-compatible) for colour/greyscale, and Group 4 compressed TIFFs for bi-tonal scans (ranging from 400-600dpi)
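A minimal sketch of producing an archival master along the lines of the standards above, assuming the Python Pillow imaging library: an uncompressed 24-bit colour TIFF at 300 dpi. The file names are illustrative.

```python
from PIL import Image

master = Image.open("capture_raw.tif").convert("RGB")  # 24-bit colour
master.save(
    "folio_012r_master.tif",
    format="TIFF",
    dpi=(300, 300),
    compression=None,  # store the master uncompressed
)
```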

ANSI/NISO Z39.87-2006 Data Dictionary – Technical Metadata for Digital Still Images. "This standard defines a set of metadata elements for raster digital images to enable users to develop, exchange, and interpret digital image files." Also see the Library of Congress, NISO MIX Official Web Site for the current MIX schema and documentation.
http://hul.harvard.edu/ois/systems/drs/imagemetadata.pdf
http://www.bodley.ox.ac.uk/scoping/digitization.html
http://www.loc.gov/standards/
(3987 images 6” x 9”, strippable journal, 600 dpi bi-tonal TIFFs, plus 50 photographs, 600dpi greyscale 8-bit TIFFs)

Using Minolta PS3000P Scanner (scanner available)
Hourly production rate: 80 pages
Overall cost (including weighted annual salary but not hardware): £715.50 (i.e. 18p per page)
Overall cost plus Adobe capture software (for photographs): £1203.13 (30p per page)

Flat-bed scanner with sheet-feeder (Fujitsu M3093DE/DG for Midland History)
Hourly production rate: 180 pages
Overall cost (including weighted annual salary): £362.03 (i.e. 8p per page).
Overall cost if hiring scanner, plus Adobe Capture Software: £4449.66 (£1.12 per page)
Overall cost if buying scanner plus Adobe Capture Software: £5092.75 (£1.28 per page)

OCR Processing Time: 133 hours (@ 2 mins per page)
Proof Reading Time: 665 hours (@10 mins per page)

Total cost: £11,456.64 (or £2.94 per page)
Oxford: Midland History (1971-1997)
While quite a bit of digitization occurs in western archives, other teams are traveling to remote locations to do their own digitization on site - as at St Catherine's Monastery in the Sinai.
Travel Kits
Other kinds of Imagery
Jewish Theological Seminary MS 8224
OCRopus - an open-source OCR system that aims primarily at high-volume document conversion
Gamera - a toolkit for building document image recognition systems
ABBYY FineReader - industry leader in OCR
Combines computer science, linguistics, and artificial intelligence research to investigate, through machine learning and language theory, how computers might come to 'understand' or better process language - including digitized texts. It involves, among other things:
Named Entity Recognition - finding names of people, places, and things
Discourse Analysis - who is talking to whom
Natural Language Understanding - first order logic for semantic processing for computers
Parts-of-Speech Tagging - to help identify and train computers on complex textual / linguistic operations
Relationship Extraction, Sentiment Analysis... it goes on.
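A minimal named entity recognition sketch, assuming the Python spaCy library and its small English model (en_core_web_sm, which must be downloaded first); the sample sentence is illustrative.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Roberto Busa persuaded IBM in 1946 to help index the works of Thomas Aquinas.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. persons, organizations, dates
```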
Tagging is also related to subject classification - can it combine topic modeling, natural language processing, and user-generated data to solve classic gaps in cataloguing?
Boston University project 'ESTHr' examined these questions -

Subject Headings - typically hierarchical, constructs of the dominant norm, reinforce conceits rather than interrogate them, predefined, and linked to finding aids on a ground floor of a library or archive.
Tagging or Keywords - non-hierarchical, can be subversive and interrogate norms, can be pre-defined or can be user-determined, linked to ongoing research and user interests, suited to finding things in an online archive as a conceptual virtual space

http://open.bu.edu/xmlui/bitstream/handle/2144/2404/ESTHRwhitepaper.pdf
Evolutionary Subject Tagging in the Humanities
How will students be involved in the digital archive? Both as users and as producers of data, or in the processing of images, preservation etc?
Important question for funding agencies...
What kinds of students - Graduate, Undergraduate, Secondary?
Student Research Assistants are currently involved in digitization and preservation processes
Library Science and Archival Programs employ or offer internships for students in MLS/MLIS programs
Visiting teams of researchers bring along students, either as Research Assistants, Interns, or as part of course work - Virtual Textile Project
Often these students will not be interested in the content of manuscripts, but rather the archival practices and methods the project uses. They may have tangential relationships with a completed Digital Archive
Students with the necessary experience or aptitudes for manuscript content are often employed as transcribers or editors for digitization work.
Students with computational skills often monitor or manage workflows for large crowd-sourcing operations.
Classic 'Research Assistant' tasks and positions.
Many digital archives are used in online 'virtual' classroom teaching - not only to expose students to documentary evidence, but also for palaeography instruction.
Some digital archives have classroom components themselves to facilitate this kind of instruction, or they combine palaeography with the transcription processes.
Some virtual classrooms operate on a for-credit basis, some are more volunteer or open style, while others offer paid instruction using the online resources as part of their business model.
Some digital archives also offer lesson plans for undergraduate and secondary school instruction. These are closely linked to curated online collections, and comprise various levels of exposure to online materials.
For lower levels this often means working with transcriptions alongside digital images; for upper levels it might mean more palaeography.
Lesson plans are well structured and usually thematic, tracing historical issues or problems through a variety of sources.
Many include online exercises, teaching aids, and glossaries
Curation - Public History and the Online Museum
Digital Archives are, in essence, virtual museums
Many comprise 'collections' which are curated, tied to particular funding bodies and partners for particular purposes, and are linked to public awareness and events related to specific holdings.
Infographics & Visualizations
Digital collections are now being transformed into new visual formats to illustrate complex historical processes and phenomena - all by means of digitized sources and data.
Public History
Can mean public access to holdings or records
More often than not, emphasis on narrative and storytelling using holdings as means to tell the tale of a particular past
Increasingly 'geo-located' using Google Maps or other GIS enhancements, to describe the places inside a source, or how a source relates to a particular location
Highly curated for local histories as well as online 'tourists'
Showcases aspects of the collection
Dissemination
are public outreach and dissemination just about teaching, curation, or publication?
Access to Metadata
Research Methodologies and Processes
Digital archives regularly publish their research methods and findings online - imaging and metadata standards (as part of quality attestation), new finds, visits to particular archives and new collections etc.
Blogging is a very common way of doing this, very popular among archivists and digital librarians
Access to the Digital Archive through a website and browser is one form of dissemination - but with the world of Linked Open Data, there are other possibilities
XML- or JSON-based machine-readable files and formats allow ongoing and new research metadata to enter the public realm
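A minimal sketch of such a machine-readable export, assuming Python's standard library; the record, identifier, and URL are illustrative placeholders, and a real export would be generated from the archive's database.

```python
import json

holdings = [
    {
        "id": "community-x-mb-1",  # hypothetical identifier
        "title": "Minute Book, Community X, 1650-1712",
        "language": "he",
        "rights": "Images courtesy of the holding archive",
        "image": "https://example.org/images/community-x-mb-1/fol12r.jpg",  # placeholder URL
    }
]

with open("holdings.json", "w", encoding="utf-8") as fh:
    json.dump(holdings, fh, ensure_ascii=False, indent=2)  # machine-readable alongside the website
```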
Others...
DSpace
CONTENTdm (OCLC)
Prioritization is key in any application or digitization development process. It is a form of triage which sequences what needs to happen before something else does!
Assessment of what kinds of components - what is it your Digital Archive needs to do first and foremost, and what are secondary commitments or interests?
Digitization - production of images and metadata essential
What standards are going to be followed?
What systems will be used?
When will they be ready for testing and use?
When will certain personnel be needed?
Can experienced users and developers use the 'backend' prior to the completion of the 'front end'?
What kinds of process dependencies are there within the digital archive's systems architecture?
Once prioritization has occurred, and you have a sense of what milestones are appropriate, it becomes a matter of ascertaining what can take place concurrently in terms of development
Stage 1
Estimation of required disk space, servers, etc., and personnel; preparation of the collection and collection standards; obtaining equipment, etc.
Stage 2
Trial digitization of folios; collation of archival metadata; temporary storage of data and backups; commencement of work on the platform; assessment of methods, etc.
Stage 3
Rigorous critique of the uploading and documentation apparatus within the platform; second trial of digitization?
Stage 4
Full digitization of folios and collation of metadata; the platform is in beta and goes out for wider testing with the user base, for interaction, etc.
DO A TRIAL SET FIRST!!!
With so many moving parts... you'll need an effective project manager and an administrative apparatus
Administrative software or a website doesn't have to be part of the digital archive itself!
INKE - has an embedded management researcher within its team....
GitHub
BaseCamp