CSE 891

Jared Wein

on 28 April 2010

Harnessing the Internet for Automatic Generation of an In-Domain Language Model

Focus: Improving the accuracy of a Large Vocabulary Continuous Speech Recognition system that is used for transcription of class lectures

The current implementation uses a BBC corpus and augments the corpus with text from the PowerPoint slides

I focused on using internet resources to augment the current implementation

Use the presentation's title to find related content online:

Google search specific to wikipedia.org
Google general search
Google search specific to *.edu

Data from the internet requires a lot of work to clean up:
Excluding binary data
Removing JavaScript and CSS
Removing HTML tags

In conclusion:

The added value depends heavily on the presenter choosing a good presentation title
Language on the internet does not necessarily reflect spoken language (for example, internet lingo)
Ambiguous topics can produce unexpected augmentations if an unintended sense of the title is more prevalent online
Surprisingly, Wikipedia did not contribute much gain, but the Google general and .edu-specific searches each independently increased accuracy by about five percent
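
The search and cleanup steps listed earlier in this presentation could be sketched roughly as below. This is only an illustration under stated assumptions, not the code used in the project: the names `build_queries`, `is_text_response`, `TextExtractor`, and `clean_page` are hypothetical, the `site:` search operator is assumed as the way to restrict a Google query to wikipedia.org or .edu pages, and the binary-data check is assumed to be a simple test on the HTTP Content-Type header. Fetching the search results themselves is left out.

```python
from html.parser import HTMLParser

# Hypothetical scopes mirroring the three searches named in the slides.
# The "site:" restriction is an assumption about how the searches were scoped.
SEARCH_SCOPES = {
    "wikipedia": "site:wikipedia.org",
    "general": "",
    "edu": "site:.edu",
}


def build_queries(title):
    """Return one query string per search scope for a presentation title."""
    return {
        name: f"{title} {restriction}".strip()
        for name, restriction in SEARCH_SCOPES.items()
    }


def is_text_response(content_type):
    """Exclude binary data: accept only HTML or plain-text media types."""
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in {"text/html", "text/plain"}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Tags are never passed here, so keeping only data drops the markup.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces (tag boundaries are treated as word
        # boundaries) and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def clean_page(content_type, body):
    """Return cleaned text for a fetched page, or None for binary data."""
    if not is_text_response(content_type):
        return None
    parser = TextExtractor()
    parser.feed(body.decode("utf-8", errors="replace"))
    parser.close()
    return parser.text()
```

For example, `build_queries("Speech Recognition")` yields one query per scope, and `clean_page` applied to each downloaded result returns plain text that could be appended to the corpus, or `None` when the response is not text.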
