- All content must be in machine-readable format
- Validated machine taxonomy
- Documents should be selected by subject matter expert indexers/catalogers
- Documents used should be roughly the same length
- Documents should not be dominated by one author
- A stop-word list should be created and regularly updated with terms that are common in content
- There can be no duplication of training documents
- The content that will be classified most often should be identified, and training documents should match that style
- The number of training documents required varies; the usual standard is 50 pristine example documents, but test what works best for your content.
- Pristine indicates a document that uses the most concentrated application of the term without a high degree of other taxonomy terms represented (the Golden Mean Theory).
- Training documents serve as examples of what the AI should look for in a document when classifying it.
- A training set should be created for each classification term in the taxonomy.
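Several of the criteria above (no duplicates, similar lengths, no single dominant author) can be checked mechanically before training. The following is a minimal sketch, not the tooling used in the SAE project; the `text` and `author` keys and both thresholds are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean


def check_training_set(docs, max_author_share=0.5, length_tolerance=0.5):
    """Return a list of warnings for one taxonomy term's training set.

    docs: list of dicts with the hypothetical keys "text" and "author".
    """
    if not docs:
        return ["training set is empty"]
    warnings = []

    # No duplication: compare lowercased, whitespace-normalized text.
    normalized = [" ".join(d["text"].lower().split()) for d in docs]
    duplicates = [t for t, n in Counter(normalized).items() if n > 1]
    if duplicates:
        warnings.append(f"{len(duplicates)} duplicated document(s)")

    # Roughly the same length: flag word counts far from the mean.
    lengths = [len(t.split()) for t in normalized]
    avg = mean(lengths)
    outliers = [n for n in lengths if abs(n - avg) > length_tolerance * avg]
    if outliers:
        warnings.append(f"{len(outliers)} document(s) far from the mean length")

    # Not dominated by one author.
    author, count = Counter(d["author"] for d in docs).most_common(1)[0]
    if count / len(docs) > max_author_share:
        warnings.append(f"author {author!r} wrote {count} of {len(docs)} documents")

    return warnings
```

A check like this only catches exact duplicates after normalization; near-duplicates (for example, a paper and its revised reprint) would still need a human reviewer or a similarity measure.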
Training outcome
Reflections
- Content from before 2010 can now be indexed semi-automatically, leading to 95% more indexed content
- A sample of content was taken from each year for a human indexer to validate AI assignment
- Corrections were identified and training was modified
- Automatic taxonomy assignment was recorded in each record's metadata
- This technique was also applied to
- These processes can be built into governance policies to ensure accurate indexing
- Continuity created throughout content for searchability and credibility
- Data is more manageable and content analytics have been improved
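Taking a validation sample from each year is a form of stratified sampling. A minimal sketch follows, assuming each record is a dict with a hypothetical `year` key; the sample size per year is also an assumption, since the slides do not state one.

```python
import random
from collections import defaultdict


def sample_per_year(records, per_year=5, seed=0):
    """Draw up to `per_year` records from every publication year."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group records by year so that every year is represented.
    by_year = defaultdict(list)
    for record in records:
        by_year[record["year"]].append(record)

    # Sample without replacement; small years contribute all their records.
    return {
        year: rng.sample(items, min(per_year, len(items)))
        for year, items in sorted(by_year.items())
    }
```

The human indexer then reviews only the sampled records, and any corrections feed back into the training documents.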
High Accuracy Metadata and Machine Learning:
A Librarian's Success
Database used:
SAE International
Phase 3: Training
Training did not commence until Fall of 2012
- 35,000 standards
- 98,000 technical papers
- 260 training courses
- 2,000 magazines
- 1,000 books
Mobility engineering company with content including:
From the automotive, commercial vehicle, and aerospace industry dating back to 1906.
- SAE's taxonomy is a complex taxonomy:
- Engineering, technical, and scientific taxonomies are complex because of inherently ambiguous vernacular, multiword expressions, variant definitions, and diverse content types.
Testing was done to find the best approach to automating the indexing process. It showed that indexing complex taxonomies automatically and successfully requires a robust training regimen based on linguistic analysis.
Backdrop: Phase 1
Training Specs
- Before 2010, SAE used over 14,000 terms, called tech tracks, to manually index all content.
- No automation or librarian was involved in taxonomy metadata assignment
Hybrid Approach: Phase 2
Thank you for your time
Questions/Comments?
- A taxonomy and hierarchy scheme with 891 unique mobility engineering terms was first created through cluster analysis of content.
- A hybrid, semi-automatic indexing initiative was also started in 2010 with Temis software. Only 66 high-level terms were trained.
- The hybrid approach was used until true training of the AI system could be completed.
Contact Information
Email
anp114@pitt.edu
Web www.linkedin.com/in/ashleighnfaith/
Linguistic Rule Analysis
- Tech Brief Media Group (NASA, Military Defense, and Medical taxonomy)
- Electric Vehicle Global Technology Library
- Sub-taxonomy from the NATO Terminology Directive
Linguistic analysis can be performed by humans, by using AI tools such as cluster analysis and term matrix statistics, or a combination of both (ideal).
- The SAE project accuracy above equates to 89% (calculated using precision, recall, and F-measure)
- The average human indexing accuracy is 91% and the average AI indexing accuracy is 75%
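For readers unfamiliar with the measures named above, precision, recall, and F-measure can be computed from counts of correct and incorrect tag assignments. This is a minimal sketch using the standard definitions; the counts in the usage lines are illustrative only and are not SAE's figures.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and the balanced F-measure (F1)."""
    predicted = true_positives + false_positives  # tags the system assigned
    actual = true_positives + false_negatives     # tags a human would assign

    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0

    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative counts: 80 correct tags, 10 wrong tags, 20 missed tags.
p, r, f = precision_recall_f1(80, 10, 20)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Precision answers "of the tags assigned, how many were right?", recall answers "of the tags that should have been assigned, how many were found?", and F1 is their harmonic mean.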
Presented by Ashleigh Faith
Library Science Doctoral Student at the University of Pittsburgh iSchool
iSchool Doctoral Guild
November 15, 2013
12:30-1:30 PM, Room 1A04 Information Science Building
Example: A rocker cover is a type of engine valve cover. Typical AI software would usually produce mistags such as rocker chair, rocker musician, etc.
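The idea behind a linguistic rule for this example can be sketched as a context check: accept "rocker" as an engine term only when engine vocabulary appears nearby. This is a hypothetical illustration in plain Python, not Temis Luxid rule syntax, and the context word list and window size are assumptions.

```python
import re

# Hypothetical context vocabulary; a real rule set would come from
# linguistic analysis of the content, not from this short list.
ENGINE_CONTEXT = {"engine", "valve", "cylinder", "camshaft", "gasket"}


def is_engine_rocker(text, window=10):
    """Return True if 'rocker' occurs within `window` words of engine vocabulary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    for i, token in enumerate(tokens):
        if token == "rocker":
            nearby = tokens[max(0, i - window): i + window + 1]
            if ENGINE_CONTEXT & set(nearby):
                return True
    return False


print(is_engine_rocker("The rocker cover seals the top of the engine valve train."))
print(is_engine_rocker("She sat in the rocker chair on the porch."))
```

The first sentence is accepted because "engine" and "valve" fall inside the window; the second is rejected because no engine vocabulary appears near "rocker".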
*Generated from Temis Luxid