High Accuracy Metadata

Ashleigh Faith

on 18 December 2013

Transcript of High Accuracy Metadata

Database used:
SAE International
Mobility engineering company with content including:

From the automotive, commercial vehicle, and aerospace industry dating back to 1906.

Backdrop: Phase 1
Before 2010, SAE used over 14,000 terms, called tech tracks, to manually index all content.

No automation or librarian was involved in taxonomy metadata assignment.
High Accuracy Metadata and Machine Learning:
A librarian's success
Presented by Ashleigh Faith
Library Science Doctoral Student at the University of Pittsburgh iSchool
iSchool Doctoral Guild
November 15, 2013
12:30-1:30 PM, Room 1A04 Information Science Building
35,000 standards
98,000 technical papers
260 training courses
2,000 magazines
1,000 books
Phase 3: Training
SAE's taxonomy is a complex taxonomy:
That is, engineering, technical, and scientific taxonomies are complex because of inherently ambiguous vernacular, multiword expressions, variant definitions, and diverse content types.
Example: A rocker cover is a type of engine valve cover. Typical AI software would often produce mistags such as rocker chair, rocker musician, etc.
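One simple way to illustrate the multiword problem is to match longer taxonomy terms before shorter ones, so that "rocker cover" is tagged as a unit and the bare word "rocker" never gets a chance to match inside it. The sketch below is only an illustration under that assumption; the function name `tag_terms` and the sample taxonomy are hypothetical, and it does not represent how Temis software or the SAE workflow actually resolves ambiguity.

```python
import re

def tag_terms(text, taxonomy):
    """Match taxonomy terms, trying longer multiword terms before shorter ones."""
    remaining = text.lower()
    found = []
    for term in sorted(taxonomy, key=len, reverse=True):
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, remaining):
            found.append(term)
            # Blank out the matched text so shorter terms cannot re-match inside it.
            remaining = re.sub(pattern, " ", remaining)
    return found

# Hypothetical taxonomy and sentence, for illustration only.
taxonomy = ["rocker cover", "rocker", "chair"]
tags = tag_terms("Replace the rocker cover before testing.", taxonomy)
print(tags)
```

Phrase matching of this kind only handles the multiword part of the problem; variant definitions and ambiguous vernacular still need the training and linguistic analysis described in the presentation.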
Testing was done to find the best approach to automating the indexing process. It found that successfully indexing complex taxonomies automatically requires a robust training regimen built on linguistic analysis.
Training did not commence until Fall of 2012
All content must be in machine readable format

Clean metadata

Established workflow

Validated machine taxonomy

Documents should be found by subject matter expert indexers/catalogers
Training Specs
The number of training documents required varies. The usual standard is 50 pristine document examples, but test what works best for your content.

Pristine indicates a document that uses the most concentrated application of the term without a high degree of other taxonomy terms represented (the Golden Mean Theory).

Training documents serve as examples of what the AI should look for in a document for classification.

A training set should be created for each classification term in the taxonomy.
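The idea of a pristine document can be approximated numerically. The sketch below scores a document by the share of all taxonomy-term mentions that belong to the target term, then keeps the top-scoring documents for that term's training set. The scoring rule, the function names, and the sample texts are assumptions made for illustration; the presentation does not specify how pristine documents were measured.

```python
import re

def term_count(text, term):
    """Count whole-phrase, case-insensitive occurrences of a taxonomy term."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def pristine_score(text, target, taxonomy):
    """Share of all taxonomy-term mentions that belong to the target term."""
    target_hits = term_count(text, target)
    other_hits = sum(term_count(text, t) for t in taxonomy if t != target)
    total = target_hits + other_hits
    return target_hits / total if total else 0.0

def select_training_docs(docs, target, taxonomy, limit=50):
    """Keep the `limit` documents that are most concentrated on the target term."""
    ranked = sorted(docs, key=lambda d: pristine_score(d, target, taxonomy), reverse=True)
    return ranked[:limit]

# Hypothetical taxonomy and documents, for illustration only.
taxonomy = ["rocker cover", "gasket", "brake pad"]
docs = [
    "The rocker cover seals the valve train. A rocker cover gasket prevents leaks.",
    "Brake pad wear and gasket failure are compared with rocker cover leaks.",
]
best = select_training_docs(docs, "rocker cover", taxonomy, limit=1)
```

The default `limit=50` mirrors the usual standard mentioned above. Note that this simple count would double-count taxonomy terms that overlap one another, so a real taxonomy would need more careful matching.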
Cluster Analysis
*Generated from Temis Luxid
Linguistic Rule Analysis
Training outcome
The SAE project accuracy above equates to 89% (calculated using precision, recall, and F-measure).
The average human indexing accuracy is 91%, and the average AI indexing accuracy is 75%.
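For readers unfamiliar with these measures, the sketch below shows the standard precision, recall, and F-measure (F1) calculation for one document, comparing AI-assigned terms against a human indexer's terms. The term lists are hypothetical and are not SAE data, and the presentation does not say exactly how its 89% figure was aggregated across documents.

```python
def precision_recall_f1(assigned, expected):
    """Compare AI-assigned terms with a human indexer's terms for one document."""
    assigned, expected = set(assigned), set(expected)
    true_positives = len(assigned & expected)
    # Precision: how many of the assigned terms were correct.
    precision = true_positives / len(assigned) if assigned else 0.0
    # Recall: how many of the expected terms were found.
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical term assignments, for illustration only.
ai_terms = ["engine valve covers", "gaskets", "seats"]
human_terms = ["engine valve covers", "gaskets", "cylinder heads", "seals"]
p, r, f = precision_recall_f1(ai_terms, human_terms)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```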
Linguistic analysis can be performed by humans, by using AI tools such as cluster analysis and term matrix statistics, or a combination of both (ideal).
A taxonomy and hierarchy scheme with 891 unique mobility engineering terms was first created through cluster analysis of content.

A hybrid, semi-automatic indexing initiative was also started in 2010 with Temis software. Only 66 high-level terms were trained.

The hybrid approach was used until true training of the AI system could be completed.
Hybrid Approach: Phase 2
Content from before 2010 can now be indexed semi-automatically, leading to 95% more indexed content
A sample of content was taken from each year for a human indexer to validate AI assignment
Corrections were identified and training was modified
Automatic taxonomy assignment was recorded in each record's metadata
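The per-year validation sample described in these steps could be drawn as follows. This is a minimal sketch, assuming each record is a dict with a hypothetical "year" key; the sample size per year and the fixed random seed are illustrative choices, not details given in the presentation.

```python
import random
from collections import defaultdict

def sample_per_year(records, per_year, seed=0):
    """Pick up to `per_year` records from each publication year for human review."""
    rng = random.Random(seed)  # fixed seed so the review sample is reproducible
    by_year = defaultdict(list)
    for record in records:
        by_year[record["year"]].append(record)
    sample = {}
    for year, items in sorted(by_year.items()):
        sample[year] = rng.sample(items, min(per_year, len(items)))
    return sample

# Hypothetical records, for illustration only.
records = [{"id": i, "year": 1906 + (i % 3)} for i in range(10)]
review_queue = sample_per_year(records, per_year=2)
```

A human indexer would then compare the AI-assigned terms on each sampled record with the terms they would have chosen, and feed the corrections back into training.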

This technique was also applied to

These processes can be built into governance policies to ensure accurate indexing

Continuity created throughout content for searchability and credibility

Data is more manageable and content analytics have been improved
Tech Brief Media Group (NASA, Military Defense, and Medical taxonomy)
Electric Vehicle Global Technology Library
Sub-taxonomy from the NATO Terminology Directive
Documents used should be roughly the same length

Documents should not be dominated by one author

A stop-word list should be created and regularly updated with terms that are common in content

There can be no duplication of training documents

The content that will be classified most often should be identified, and training documents should match its style
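Several of these preparation rules can be checked automatically before training begins. The sketch below flags duplicate documents, widely varying lengths, and a dominant author. The dict keys "text" and "author", the function name, and both thresholds are assumptions chosen for illustration; the presentation gives no numeric cut-offs.

```python
from collections import Counter

def check_training_set(docs, max_author_share=0.5, max_length_ratio=2.0):
    """Return a list of warnings for a candidate training set.

    Each doc is a dict with hypothetical keys "text" and "author".
    """
    warnings = []
    texts = [d["text"].strip().lower() for d in docs]
    # Rule: there can be no duplication of training documents.
    if len(set(texts)) < len(texts):
        warnings.append("duplicate documents")
    # Rule: documents should be roughly the same length (measured in words).
    lengths = [len(t.split()) for t in texts]
    if lengths:
        shortest, longest = min(lengths), max(lengths)
        if shortest == 0 or longest / shortest > max_length_ratio:
            warnings.append("document lengths vary widely")
    # Rule: documents should not be dominated by one author.
    authors = Counter(d["author"] for d in docs)
    if docs and authors.most_common(1)[0][1] / len(docs) > max_author_share:
        warnings.append("one author dominates")
    return warnings

# Hypothetical training set, for illustration only.
docs = [
    {"text": "Rocker cover gasket sealing methods", "author": "A"},
    {"text": "Rocker cover gasket sealing methods", "author": "A"},
    {"text": "Valve cover torque sequence for aluminum heads", "author": "A"},
    {"text": "Brake pad wear", "author": "B"},
]
problems = check_training_set(docs)
print(problems)
```

Checks like these only catch mechanical problems; whether a document matches the style of the content to be classified still needs review by a subject matter expert.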
Training Prep
Contact Information
Web www.linkedin.com/in/ashleighnfaith/
Thank you for your time
