
The Naive Bayes Model

Background Information

Features

  • e = {ei} and f = {fj} (typical NB)
  • e ~ polysemous* word w
  • f ~ lexical knowledge about the sense s of w manifested by e
  • WSD is formulated as identifying the sense s* in the sense inventory J
  • Soft (probabilistic) association between e and f is used for overlap counts
  • Lesk, historically, used hard binary association counts
  • p(ei|fj) is easier to estimate than p(fj|ei)
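
In this notation the decision rule is the standard Naive Bayes one; as a minimal sketch (the paper's exact derivation may differ, but the bullets above imply scoring each candidate sense by per-word likelihoods assembled from the p(ei|fj) terms):

```latex
s^{*} = \arg\max_{s \in J} \; p(s) \prod_{i} p(e_i \mid s)
```

Here the sense s is represented by its knowledge bag f = {fj}, which is why the easier-to-estimate p(ei|fj) quantities serve as the building blocks.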

Lesk's Algorithm

Word Sense Disambiguation

Review of:

Applying a Naive Bayes Similarity Measure to Word Sense Disambiguation

  • Pioneer dictionary-based method (1986)
  • Words nearby each other in text are related
  • Extremely sensitive to the exact wording of definitions (a known issue)
  • Classic example: disambiguating "cone" near "pine" (pine cone)
  • Has been extended and criticized
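
A minimal sketch of the gloss-overlap idea behind (simplified) Lesk, in Python; the sense labels, glosses, and tokenization here are hypothetical stand-ins, not Lesk's or the paper's actual data:

```python
def lesk_overlap(context_words, sense_glosses):
    """Pick the sense whose gloss shares the most words with the context (hard binary overlap)."""
    context = set(context_words)
    best_sense, best_score = None, -1
    for sense, gloss in sense_glosses.items():
        score = len(context & set(gloss.lower().split()))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Example: disambiguating "cone" in a context mentioning "pine"
glosses = {
    "cone_geometry": "a solid with a circular base tapering to a point",
    "cone_botany":   "the seed-bearing fruit of a pine or fir tree",
}
print(lesk_overlap({"pine", "tree", "needles"}, glosses))  # -> cone_botany
```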

Algorithm (Pseudocode)

  • Picking the intended meaning of a word (e.g. bass)
  • Impacts:
      • search engines
      • understanding of anaphora
      • extrapolation, etc.
  • Techniques:
      • dictionary-based methods using knowledge from lexical resources
      • a classifier trained for each distinct word on manually curated examples (most successful)
      • clustering word occurrences
  • Difficulties:
      • dictionary differences
      • part-of-speech tagging
      • common sense is not easy to encode

Relevant Advances Since 1986

  • Originally, overlap was limited to exact word matches between definitions
  • Length-sensitive matching (2002)
  • Tree-matching (2009)
  • Vector space models (2012)
  • Naive Bayes model (2014) <- this paper

Example sentence pair:

  • "Jill and Mary are mothers."
  • "Jill and Mary are sisters."

Incorporating Additional Lexical Knowledge

Probability Estimation

  • NBM input: Bag-Of-Words
  • Text is regarded as one large, grammarless bag of words (multiplicity intact)
  • Tokenized knowledge is added to f, giving a large bag of lexical knowledge
  • Hypothesis: synonyms and hyponyms offer stronger semantic specification that helps distinguish the senses of a given ambiguous word (see the WordNet sketch below)
  • Implies synonyms and hyponyms are more effective knowledge sources for WSD
  • Co-occurrence of hyponyms results in strong gloss-definition associations in WSD
  • Maximum-likelihood estimation (MLE) selected
  • Simplistic
  • Demonstrates strength even with crude probability estimation
  • c(x) is the count of word x
  • c(.) is the corpus size
  • c(x,y) is the joint count of x and y
  • |v| is the dimension of vector v
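
Given this count notation, the maximum-likelihood estimates presumably take the usual relative-frequency form (a sketch consistent with the definitions above, not copied verbatim from the paper):

```latex
p(x) \approx \frac{c(x)}{c(\cdot)} , \qquad
p(x \mid y) \approx \frac{c(x, y)}{c(y)}
```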

*polysemous <- capacity for a word/phrase to have multiple related meanings
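
As a concrete illustration of the synonym/hyponym knowledge sources hypothesized above, here is a minimal sketch using NLTK's WordNet interface; it is illustrative only and not necessarily how the paper assembles its knowledge bags:

```python
from nltk.corpus import wordnet as wn  # one-time setup: nltk.download('wordnet')

def knowledge_bag(word):
    """Collect gloss, synonym, and hyponym tokens for each WordNet sense of `word`."""
    bags = {}
    for synset in wn.synsets(word):
        tokens = synset.definition().lower().split()                # gloss words
        tokens += [name.lower() for name in synset.lemma_names()]   # synonyms
        for hypo in synset.hyponyms():                               # hyponyms
            tokens += [name.lower() for name in hypo.lemma_names()]
        bags[synset.name()] = tokens
    return bags

# e.g. knowledge_bag("bass") gives one token bag per sense of "bass"
```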

Evaluation

Naive Bayes Models with WSD and Lesk (Related Work)

Comparing Lexical Knowledge Sources

Data, Scoring, and Pre-processing

  • Goal: study the effect of different types of lexical knowledge on WSD
  • Outperformed the state of the art
  • Underperformed against manually curated and selected datasets
  • Model performance is evaluated in terms of WSD accuracy (choosing s*)
  • No conflicting answers (there is always a single best sense)
  • Multi-word expressions (MWEs) were ignored in 2 of the 3 datasets
  • For MWE-related answers, the harmonic mean of scoring them all incorrect and all correct was used
  • Pre-processing (sketched below):
      • lower-casing
      • stop-word removal
      • lemmatization of the datasets
      • tokenization of MWE instances
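
A minimal sketch of that pre-processing pipeline using NLTK (illustrative; the exact tooling used in the paper is not specified here):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# One-time setup: nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

STOP = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    """Lower-case, drop stop words and punctuation, then lemmatize."""
    tokens = [t.lower() for t in nltk.word_tokenize(text)]
    return [LEMMATIZER.lemmatize(t) for t in tokens if t.isalpha() and t not in STOP]

# e.g. preprocess("The bass swam near the pine cones.") -> ['bass', 'swam', 'near', 'pine', 'cone']
```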

Hypothesis Confirmed

Top 2 Unsupervised Methods

  • Past: Classifier for each distinct word approach
  • 1992 - Information retrieval system
  • 2000 - Ensemble NB classifiers with varying window size
  • 2009 - Unsupervised NB classifier with expectation-maximization algorithm
  • These (and other) Lesk variants focused on extending the gloss to increase overlap
  • This paper aims to make better use of available lexical knowledge

By: Clayton Turner

References

Paper: Wang & Hirst (ACL 2014), "Applying a Naive Bayes Similarity Measure to Word Sense Disambiguation": ftp://learning.cs.toronto.edu/public_html/cs/ftp/pub/gh/Wang+Hirst-ACL-2014.pdf

WSD: http://en.wikipedia.org/wiki/Word-sense_disambiguation (yeah, yeah, Wikipedia)