High Accuracy Metadata
Transcript of High Accuracy Metadata
Mobility engineering company with content including:
From the automotive, commercial vehicle, and aerospace industry dating back to 1906.
Backdrop: Phase 1
Before 2010, SAE used over 14,000 terms, called tech tracks, to manually index all content.
There was no automation or librarian involved with taxonomy metadata assignment.
High Accuracy Metadata and Machine Learning:
A librarian's success
Presented by Ashleigh Faith
Library Science Doctoral Student at the University of Pittsburgh iSchool
iSchool Doctoral Guild
November 15, 2013
12:30-1:30 PM, Room 1A04 Information Science Building
98,000 technical papers
260 training courses
Phase 3: Training
SAE's taxonomy is a complex taxonomy:
That is, engineering, technical, and scientific taxonomies are complex because of inherently ambiguous vernacular, multiword expressions, variant definitions, and diverse content types.
Example: A rocker cover is a type of engine valve cover. Typical AI software would often mistag it with unrelated senses such as rocker chair or rocker musician.
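The rocker cover example can be sketched as a simple context rule. This is only an illustration of the idea, not how Temis Luxid works: the rule table `CONTEXT_RULES`, the function `tag_engineering_sense`, and the three-word window are all assumptions made for this sketch.

```python
import re

# Hypothetical rule table: an ambiguous word maps to the context words
# that confirm its engineering sense. Illustrative values only.
CONTEXT_RULES = {
    "rocker": {"cover", "arm", "valve", "engine"},
}


def tag_engineering_sense(text, window=3):
    """Return the ambiguous words whose engineering sense is confirmed
    by a cue word within `window` tokens on either side."""
    tokens = re.findall(r"[a-z]+", text.lower())
    confirmed = set()
    for i, token in enumerate(tokens):
        cues = CONTEXT_RULES.get(token)
        if cues is None:
            continue  # not an ambiguous word we have a rule for
        nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if cues.intersection(nearby):
            confirmed.add(token)
    return confirmed


engine_text = "Remove the rocker cover to inspect the valve train."
furniture_text = "She sat in the rocker chair on the porch."
print(tag_engineering_sense(engine_text))
print(tag_engineering_sense(furniture_text))
```

A rule like this only tags "rocker" when engineering vocabulary appears nearby, which is the kind of distinction the linguistic analysis is meant to capture.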
Testing was done to find the best approach to automating the indexing process. It was found that, to index complex taxonomies automatically with success, a robust training regimen through linguistic analysis was needed.
Training did not commence until Fall of 2012
All content must be in machine readable format
Validated machine taxonomy
Documents should be found by subject matter expert indexers/catalogers
The number of training documents required varies. The usual standard is 50 pristine example documents, but test what works best for your content.
Pristine indicates a document that uses the most concentrated application of the term without a high degree of other taxonomy terms represented (the Golden Mean Theory).
Training documents serve as an example for what AI should look for in a document for classification.
A training set should be created for each classification term in the taxonomy.
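One way to make "pristine" concrete is to score each candidate document by how much of its taxonomy vocabulary belongs to the target term. The sketch below is an assumption for illustration, not the measure used in the SAE project: `pristine_score` and `select_training_docs` are hypothetical names, and simple phrase counting stands in for real linguistic analysis.

```python
import re


def pristine_score(text, target_term, other_terms):
    """Share of taxonomy-term mentions in `text` that are the target term.
    1.0 means only the target term appears; 0.0 means it never appears."""
    lowered = text.lower()

    def count(term):
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        return len(re.findall(pattern, lowered))

    target = count(target_term)
    others = sum(count(term) for term in other_terms)
    total = target + others
    return target / total if total else 0.0


def select_training_docs(docs, target_term, other_terms, k=50):
    """Pick the k highest-scoring documents as a candidate training set."""
    ranked = sorted(
        docs,
        key=lambda doc: pristine_score(doc, target_term, other_terms),
        reverse=True,
    )
    return ranked[:k]


docs = [
    "Brake pads wear. Brake pads overheat. The engine is fine.",
    "The engine and the tire were replaced; brake pads were not.",
]
best = select_training_docs(docs, "brake pads", ["engine", "tire"], k=1)
print(best[0])
```

Candidates chosen this way would still need review by a subject matter expert indexer before being used for training.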
*Generated from Temis Luxid
Linguistic Rule Analysis
The SAE project accuracy above equates to 89% (calculated using precision, recall, and F-measure).
The average human indexing accuracy is 91% and the average AI indexing accuracy is 75%.
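For readers unfamiliar with the measures named above, here is a minimal sketch of precision, recall, and F-measure for a single document. The term sets are invented for illustration and are not SAE data, and the function name `precision_recall_f1` is an assumption.

```python
def precision_recall_f1(assigned, expected):
    """Compare the terms the system assigned with the terms a human
    indexer expected, and return (precision, recall, F1)."""
    assigned, expected = set(assigned), set(expected)
    true_positives = len(assigned & expected)
    # Precision: how many assigned terms were correct.
    precision = true_positives / len(assigned) if assigned else 0.0
    # Recall: how many expected terms were found.
    recall = true_positives / len(expected) if expected else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: three terms assigned, four expected, two in common.
machine_terms = {"brakes", "engines", "tires"}
human_terms = {"brakes", "engines", "fuel systems", "emissions"}
p, r, f1 = precision_recall_f1(machine_terms, human_terms)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Averaging these per-document scores over a validated sample is one common way to arrive at a single accuracy figure.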
Linguistic analysis can be performed by humans, by using AI tools such as cluster analysis and term matrix statistics, or a combination of both (ideal).
A taxonomy and hierarchy scheme with 891 unique mobility engineering terms was first created through cluster analysis of content.
A hybrid, semi-automatic indexing initiative was also started in 2010 with Temis software. Only 66 high-level terms were trained.
The hybrid approach was used until true training of the AI system could be completed.
Hybrid Approach: Phase 2
Content from before 2010 can now be indexed semi-automatically, leading to 95% more indexed content
A sample of content was taken from each year for a human indexer to validate AI assignment
Corrections were identified and training was modified
Automatic taxonomy assignment was recorded in each record's metadata
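The validation step, taking a sample of content from each year for human review, can be sketched as stratified sampling. The record layout (a dict with a `year` key), the function name `sample_per_year`, and the sample size are assumptions for this sketch.

```python
import random
from collections import defaultdict


def sample_per_year(records, per_year=1, seed=0):
    """Group records by publication year and draw a random sample from
    each year for a human indexer to validate."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    by_year = defaultdict(list)
    for record in records:
        by_year[record["year"]].append(record)
    sample = {}
    for year in sorted(by_year):
        group = by_year[year]
        # Never ask for more records than a year actually has.
        sample[year] = rng.sample(group, min(per_year, len(group)))
    return sample


records = [
    {"id": "a", "year": 1990},
    {"id": "b", "year": 1990},
    {"id": "c", "year": 1990},
    {"id": "d", "year": 2005},
]
for year, picked in sample_per_year(records, per_year=2).items():
    print(year, [r["id"] for r in picked])
```

Sampling by year keeps older and newer content equally represented in the review, so corrections feed back into training for the whole archive.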
This technique was also applied to
These processes can be built into governance policies to ensure accurate indexing
Continuity created throughout content for searchability and credibility
Data is more manageable and content analytics have been improved
Tech Brief Media Group (NASA, Military Defense, and Medical taxonomy)
Electric Vehicle Global Technology Library
Sub-taxonomy from the NATO Terminology Directive
Documents used should be roughly the same length
Documents should not be dominated by one author
A stop-word list should be created and regularly updated with terms that are common in content
There can be no duplication of training documents
The type of content that will be classified most often should be identified, and training documents should fit that style
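Several of the training-set guidelines above can be checked automatically before training starts. The sketch below is a hedged illustration only: the document layout (dicts with `text` and `author` keys), the function name `check_training_set`, and both thresholds are assumptions, and it does not cover the stop-word list or content style.

```python
from collections import Counter
from statistics import median


def check_training_set(docs, max_author_share=0.5, length_tolerance=0.5):
    """Return a list of warnings for a candidate training set.
    Each doc is a dict with 'text' and 'author' keys (assumed layout)."""
    if not docs:
        return ["training set is empty"]
    warnings = []

    # No duplication: compare texts after normalizing case and whitespace.
    normalized = [" ".join(doc["text"].lower().split()) for doc in docs]
    if len(set(normalized)) < len(normalized):
        warnings.append("duplicate documents found")

    # Roughly the same length: flag word counts far from the median.
    lengths = [len(text.split()) for text in normalized]
    middle = median(lengths)
    if any(abs(n - middle) > length_tolerance * middle for n in lengths):
        warnings.append("document lengths vary widely")

    # Not dominated by one author: check the most frequent author's share.
    author, count = Counter(doc["author"] for doc in docs).most_common(1)[0]
    if count / len(docs) > max_author_share:
        warnings.append(f"author dominates: {author}")

    return warnings


candidate_set = [
    {"text": "a b c d", "author": "X"},
    {"text": "A  b c d", "author": "X"},
    {"text": "e f g h", "author": "X"},
]
print(check_training_set(candidate_set))
```

An empty list of warnings does not make a training set good; it only means these mechanical checks passed and expert review can proceed.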
Thank you for your time