The Naive Bayes Model
Background Information
- Two bags of words: e = {ei} and f = {fj} (the standard NB setup)
- e ~ the occurrence (context) of the polysemous* word w
- f ~ the lexical knowledge about the sense s of w manifested by e
- WSD is formulated as identifying the best sense s* in the sense inventory J
- Soft (probabilistic) association between e and f instead of overlap counts
- Lesk, historically, used hard, binary association (overlap) counts
- p(ei|fj) is easier to estimate than p(fj|ei)
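A minimal sketch of how such a soft, naive-Bayes-style association could be scored, assuming the conditional estimates p(ei|fj) and the marginals p(ei) are already available (e.g., from MLE over a corpus, as in the Probability Estimation section). The averaging over f and the back-off to the marginal are illustrative choices, not necessarily the paper's exact formulation.

    import math
    from typing import Dict, List, Tuple

    def score_sense(context: List[str],                   # e = {ei}: bag of context words
                    knowledge: List[str],                  # f = {fj}: knowledge bag for one sense
                    p_cond: Dict[Tuple[str, str], float],  # p(ei|fj) estimates
                    p_marg: Dict[str, float],              # p(ei) estimates, used as back-off
                    eps: float = 1e-9) -> float:
        """Log-score of a sense's knowledge bag f against the context bag e."""
        log_score = 0.0
        for e_i in context:
            # Soft association: average p(ei|fj) over the knowledge bag,
            # backing off to the marginal p(ei) when no estimate exists.
            avg = sum(p_cond.get((e_i, f_j), p_marg.get(e_i, eps))
                      for f_j in knowledge) / max(len(knowledge), 1)
            log_score += math.log(avg + eps)
        return log_score

    def disambiguate(context: List[str],
                     sense_knowledge: Dict[str, List[str]],  # sense id -> its knowledge bag f
                     p_cond: Dict[Tuple[str, str], float],
                     p_marg: Dict[str, float]) -> str:
        """s* = the sense in the inventory with the highest NB-style score."""
        return max(sense_knowledge,
                   key=lambda s: score_sense(context, sense_knowledge[s], p_cond, p_marg))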
Lesk's Algorithm
Word Sense Disambiguation
Review of:
Applying a Naive Bayes Similarity Measure to Word Sense Disambiguation
- Pioneering dictionary-based method (Lesk, 1986)
- Assumption: words near each other in text are related, so the correct senses share words in their dictionary definitions
- Extremely sensitive to the exact wording of definitions (known issue)
- Classic example: "pine cone", disambiguated via the overlap between the definitions of pine and cone (see the overlap sketch after this list)
- Has since been extended and criticized
- WSD: picking the intended meaning of a word in context (e.g. bass: the fish vs. the instrument)
- Impacts:
  - search engines
  - understanding of anaphora
  - other downstream tasks (by extrapolation)
- Techniques:
  - dictionary-based methods using knowledge from lexical resources
  - a classifier trained for each distinct word on manually curated examples (most successful)
  - clustering word occurrences
- Difficulties:
  - differences between dictionaries (sense inventories)
  - part-of-speech tagging
  - common sense is not easy to encode
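As a contrast to the NBM's soft association, here is a minimal sketch of the hard, binary overlap counting behind (simplified) Lesk; the toy glosses and the context-vs-gloss comparison are illustrative assumptions, not Lesk's original gloss-vs-gloss procedure.

    from typing import Dict, List

    def overlap(bag_a: List[str], bag_b: List[str]) -> int:
        """Hard, binary association: count distinct words shared by two bags."""
        return len(set(bag_a) & set(bag_b))

    def simplified_lesk(context: List[str], sense_glosses: Dict[str, List[str]]) -> str:
        """Pick the sense whose gloss shares the most words with the context."""
        return max(sense_glosses, key=lambda s: overlap(context, sense_glosses[s]))

    # Toy illustration in the spirit of the "pine cone" example (invented glosses):
    pine_senses = {
        "pine#tree":  "kind of evergreen tree with needle-shaped leaves".split(),
        "pine#yearn": "waste away through sorrow or illness".split(),
    }
    print(simplified_lesk("pine cone evergreen tree".split(), pine_senses))  # -> pine#tree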
Relevant Advances Since 1986
- Originally, overlap was limited to exact word matches between definitions
- Length-sensitive matching (2002)
- Tree-matching (2009)
- Vector space models (2012)
- Naive Bayes model (2014) <- this paper
- Example: "Jill and Mary are mothers." vs. "Jill and Mary are sisters."
Incorporating Additional Lexical Knowledge
Probability Estimation
- NBM input: bags of words
- A text is regarded as one large, grammarless bag of words (multiplicity kept intact)
- Each knowledge source is tokenized and added to f, so f can carry a large amount of lexical knowledge
- Hypothesis: synonyms and hyponyms offer stronger semantic specification that helps distinguish the senses of a given ambiguous word
  - Implies synonyms and hyponyms are more effective knowledge sources for WSD
  - Co-occurrence of hyponyms results in strong gloss-definition associations in WSD
- Maximum likelihood estimation (MLE) selected for the probabilities
  - Simplistic, but demonstrates the model's strength even with crude probability estimates
- Notation:
  - c(x) is the count of word x
  - c(.) is the corpus size
  - c(x,y) is the joint (co-occurrence) count of x and y
  - |v| is the dimension of vector v (size of the bag)
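A minimal sketch of how these pieces could fit together, under stated assumptions: the knowledge bag f for a sense is built from its WordNet gloss, synonyms, and hyponyms, and the probabilities are crude MLE estimates with p(x) = c(x)/c(.) and p(x|y) = c(x,y)/c(y). The WordNet fields and the sentence-level co-occurrence counting are illustrative choices, not the paper's exact setup.

    from collections import Counter
    from typing import Callable, Iterable, List, Tuple

    from nltk.corpus import wordnet as wn   # requires the NLTK 'wordnet' data package

    def knowledge_bag(synset) -> List[str]:
        """Lexical-knowledge bag f for one sense: gloss + synonyms + hyponyms."""
        words = synset.definition().lower().split()                              # gloss
        words += [l.lower().replace("_", " ") for l in synset.lemma_names()]     # synonyms
        for hypo in synset.hyponyms():                                           # hyponyms
            words += [l.lower().replace("_", " ") for l in hypo.lemma_names()]
        return words

    def mle_estimates(sentences: Iterable[List[str]]) -> Tuple[Callable, Callable]:
        """Crude MLE estimates from a tokenized corpus."""
        c_x, c_xy, corpus_size = Counter(), Counter(), 0   # c(x), c(x,y), c(.)
        for sent in sentences:
            corpus_size += len(sent)
            c_x.update(sent)
            for i, x in enumerate(sent):                   # co-occurrence within a sentence
                for y in sent[i + 1:]:
                    c_xy[(x, y)] += 1
                    c_xy[(y, x)] += 1

        def p(x: str) -> float:                 # p(x) = c(x) / c(.)
            return c_x[x] / corpus_size if corpus_size else 0.0

        def p_cond(x: str, y: str) -> float:    # p(x|y) = c(x,y) / c(y)
            return c_xy[(x, y)] / c_x[y] if c_x[y] else 0.0

        return p, p_cond

    # Example: knowledge bags for the senses of "bass"
    bags = {s.name(): knowledge_bag(s) for s in wn.synsets("bass")}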
*polysemous <- capacity for a word/phrase to have multiple related meanings
Evaluation
Naive Bayes Models for WSD, and Lesk Variants (Related Work)
Comparing Lexical Knowledge Sources
Data, Scoring, and Pre-processing
- Goal: study the effect of different types of lexical knowledge in WSD
- Outperformed the state of the art
- Underperformed against manually curated and selected datasets
- Model performance is evaluated in terms of WSD accuracy (how often s* is correct)
  - No conflicting answers (there is always a single best sense)
- Multi-word expressions (MWEs) were ignored in 2 of the 3 datasets
  - For MWE-related answers, the harmonic mean of scoring them all incorrect and scoring them all correct was used
- Pre-processing:
  - lower-casing
  - stop-word removal
  - lemmatization of the datasets
  - tokenization of MWE instances
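A minimal pre-processing sketch along these lines, assuming NLTK (with the punkt, stopwords, and wordnet data packages); the exact tools and stop-word list used by the paper are not specified in these notes.

    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    _lemmatizer = WordNetLemmatizer()
    _stop_words = set(stopwords.words("english"))

    def preprocess(text: str) -> list:
        """Lower-case, tokenize, drop stop words and punctuation, lemmatize."""
        tokens = word_tokenize(text.lower())
        return [_lemmatizer.lemmatize(t) for t in tokens
                if t.isalpha() and t not in _stop_words]

    print(preprocess("The bass swam near the bottom of the river."))
    # e.g. -> ['bass', 'swam', 'near', 'bottom', 'river']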
Top 2 Unsupervised Methods
- Past: a classifier trained for each distinct word
  - 1992 - information retrieval system
  - 2000 - ensemble of NB classifiers with varying window sizes
  - 2009 - unsupervised NB classifier trained with the expectation-maximization algorithm
- These (and other) Lesk variants focused on extending the gloss to increase overlap
- This paper instead aims to make better use of the available lexical knowledge
References
Paper: ftp://learning.cs.toronto.edu/public_html/cs/ftp/pub/gh/Wang+Hirst-ACL-2014.pdf
WSD: http://en.wikipedia.org/wiki/Word-sense_disambiguation (yeah, yeah, wikipedia)