Boosting the Coverage of a Semantic Lexicon by Automatically Extracted Event Nominalizations

LREC 2012

Kata Gábor

29 May 2012


Outline:
  • Motivation and Context
  • Related Resources
  • Lexical Acquisition: Structure
  • Event Nominalizations
  • Lexical Acquisition: Corpus
  • Lexical Acquisition: Distributional Similarity
  • Lexical Acquisition: Morphological Similarity
  • Lexical Acquisition: Event Indicator
  • Acquisition of Nominalizations: Results
  • Enriching the Ontology: The WOLF
  • Enriching the Ontology: Heuristic Method
  • Enriching the Ontology: Distributional Method
  • Results & Evaluation of Nominalizations



SCRIBO: Semi-automatic and Collaborative Retrieval of Information Based on Ontologies
Semi-automatic creation of information extraction patterns
Semi-automatic ontology population
Focus: event nominalizations
  • nouns with an argument structure (a subset of the base verb's subcat frame)
Method: lexical acquisition from corpora
  • distributional properties
  • morphological properties
Related resources:
  • (Tanguy and Hathout, 2002): source: lexical resources, the internet; type: morphologically related; size: 9,393 V-N couples
  • Jeux de Mots lexicon (Lafourcade and Joubert, 2008): source: users (online game); type: all V-N couples

Advantages of the proposed method:
  • not limited to morphologically related V-N couples:
      tomber (to fall, V) -> chute (fall, N)
      rouler (to drive/to travel, V) -> circulation (traffic, N)
  • only events are included
  • corpus- and domain-specific data are acquired
  • typical distributional contexts can be displayed, which facilitates manual or automatic disambiguation
Transfer of verbal argument structure to event nominalizations:
  • the noun denoting the event inherits all or a subset of the verb's theta roles
  • the syntactic realization of these arguments differs: systematic and idiosyncratic divergences
Importance of nominal subcategorization:
  • parsing: attachment ambiguities
  • more detailed lexical information for NLP applications: information extraction, semantic role labeling (SRL)
Specificities of nominal subcategorization and SRL (compared to verbs):
  • lack of data (lexical resources, annotated corpora)
  • lexical exceptions to derivational regularities
  • irregular or unpredictable mapping between argument structures; ambiguous roles
VERB-NOUN argument structure similarity
morphological similarity
'event indicator' (distributional metric)
list of candidates
corpus: 700 million words, parsed with the FRMG TAG parser
syntactic distributions represented as
1) dependency triplets
2) full subcat patterns
action (event) / result / participant nominalizations
similarity is calculated from dependency triplets rather than complete frames, to avoid data sparseness and for better comparability
subcategorization filtering of the representations:
  • extract full subcat patterns -> filtering by PMI -> list of verb-specific subcat frames
  • extract dependency relation types -> filtering by PMI -> list of dependency relation types
  • dependency relations: relation types and the lemmata occurring in them
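The PMI filtering step can be sketched as follows; the counts, the frame labels, and the threshold are illustrative assumptions, not values from the paper:

```python
import math
from collections import Counter

def pmi(pair_count, verb_count, frame_count, total):
    """Pointwise mutual information between a verb and a subcat frame."""
    p_joint = pair_count / total
    p_verb = verb_count / total
    p_frame = frame_count / total
    return math.log2(p_joint / (p_verb * p_frame))

# hypothetical corpus counts for the verb 'refuser' and two frames
pair_counts = Counter({("refuser", "obj:de+inf"): 120, ("refuser", "subj"): 300})
verb_totals = Counter({"refuser": 500})
frame_totals = Counter({"obj:de+inf": 4000, "subj": 90000})
N = 1_000_000  # total number of (verb, frame) observations

scored = {vf: pmi(c, verb_totals[vf[0]], frame_totals[vf[1]], N)
          for vf, c in pair_counts.items()}
# keep only frames whose PMI exceeds a threshold (assumed value)
filtered = {vf: s for vf, s in scored.items() if s > 0.0}
```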
comparison of distributions:
  • metric of similarity; mapping between roles; ambiguity
  • the filtering only applies to the representation of verbs
  • dependency triplet instances are characterized by relation type / lemma / morphological description
mapping rules:
  • prepositional complements -> direct transfer
  • ambiguity of the preposition 'de' (transitive vs. intransitive use):
      transitive verb -> mapped to direct object function
      intransitive verb -> mapped to subject function, together with the preposition 'par'
comparison of distributions: Dice index
ranked list of candidates
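A minimal sketch of the Dice comparison over dependency-triplet contexts; the (relation, lemma) context pairs shown are invented for illustration:

```python
def dice(contexts_v, contexts_n):
    """Dice index between two sets of dependency-triplet contexts."""
    shared = len(contexts_v & contexts_n)
    return 2 * shared / (len(contexts_v) + len(contexts_n))

# hypothetical contexts observed with a verb and a candidate noun,
# after mapping the nominal argument positions onto the verbal ones
v_ctx = {("obj", "contrat"), ("subj", "ministre"), ("obj", "accord")}
n_ctx = {("obj", "contrat"), ("subj", "ministre"), ("de", "traité")}
score = dice(v_ctx, n_ctx)  # 2*2 / (3+3) ≈ 0.67
```

Candidate nouns can then be ranked per verb by this score.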
detect morphologically related candidates among the distributionally similar candidates
-> morphological similarity is only taken into account when the candidate also has a similar distribution
metric: edit distance
  • verb stem: obtained by deleting the infinitival suffix (-er/-re/-ir)
  • noun stem: either the full form or the form stripped of a potential nominalization suffix (suffix list obtained semi-automatically from other resources)
a morphological link is established if:
  • edit distance between the two stems < 3;
  • (length of the verb) - (edit distance) > 3;
  • noun stem + infinitival suffix = verb
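These criteria can be sketched as below; the suffix list and the way the conditions combine (distance test, or exact stem + suffix match) are assumptions for illustration:

```python
def edit_distance(a, b):
    """Standard Levenshtein distance with a single rolling row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def verb_stem(verb):
    """Delete the infinitival suffix (-er/-re/-ir)."""
    for suf in ("er", "re", "ir"):
        if verb.endswith(suf):
            return verb[: -len(suf)]
    return verb

# hypothetical suffix list; the paper derives one semi-automatically
NOM_SUFFIXES = ("ation", "ment", "age", "ure")

def morphologically_linked(verb, noun):
    vs = verb_stem(verb)
    stems = [noun] + [noun[: -len(s)] for s in NOM_SUFFIXES if noun.endswith(s)]
    for ns in stems:
        d = edit_distance(vs, ns)
        if (d < 3 and len(verb) - d > 3) or ns + verb[len(vs):] == verb:
            return True
    return False
```

For example, signer/signature is linked (stems 'sign'/'signat', distance 2), while tomber/chute is not.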
problem: shared contexts without shared meaning/aspect, especially with nouns not denoting an event/action
  • example: a verb (to shake) with the subject 'la terre' (the earth)
  • false nominalization candidate: 'pomme' (apple), sharing the context 'pomme de terre' (potato)
suggested solution: the 'event indicator' metric
event nominalizations tend to occur in an argument position of verbs that semantically select an event as an argument
  • events can be referred to by clauses, infinitives or event nominalizations
  • event nominalizations therefore have a distribution similar to that of clauses and infinitives
method:
  • extract verbs which subcategorize for an infinitive or a clause
  • manually map the syntactic realization of verb-type complements to that of nouns (e.g. refuser de signer -> refuser la signature)
event indicator = ((occurrences in event-like contexts) / (total number of occurrences)) * (number of types of event-like contexts)
filter out 25% of candidates with the lowest 'event indicator' value
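A sketch of this metric and of the quartile filter; the context counts for 'signature' and 'pomme' are invented:

```python
def event_indicator(event_ctx_counts, total_occurrences):
    """(occurrences in event-like contexts / total occurrences)
    * (number of distinct event-like context types)."""
    event_occ = sum(event_ctx_counts.values())
    return (event_occ / total_occurrences) * len(event_ctx_counts)

def drop_bottom_quartile(scores):
    """Filter out the 25% of candidates with the lowest score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[: -(len(ranked) // 4) or None]

# hypothetical counts of each noun as argument of event-selecting verbs
signature = {"refuser": 12, "empêcher": 5, "commencer": 3}
pomme = {"refuser": 1}
event_indicator(signature, 200)  # (20/200) * 3 = 0.3
event_indicator(pomme, 500)      # (1/500) * 1 = 0.002
```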
  • 3,351 verbs with at least 50 occurrences in the corpus
  • 2,424 verbs with at least one candidate
  • after an additional filtering step based on distributional similarity: 1,136 verbs
results presented as tickets for manual evaluation
manual evaluation of 171 randomly selected tickets: on the basis of the contexts provided, judge whether they correspond to a correct nominalization (w.r.t. the meaning and the assignment of semantic roles to complements)
113 candidates (70%) were validated as correct nominalizations
sources of error:
  • noisy input (lemmatization errors, parse errors)
  • antonymy and hyponymy relations
  • semantically close verb/noun couples with different semantic role assignment (e.g. acheter 'to buy' / vente 'selling')
  • the nominalization is correct w.r.t. a given semantic role but does not refer to the same event (e.g. enlever 'to kidnap' / disparition 'disappearance')
  • the nominalization is correct but the verb and the noun do not denote events (e.g. peser 'to weigh' / poids 'weight')
WOLF: WOrdNet Libre du Français, the free French WordNet (Sagot and Fiser, 2008)
  • structurally equivalent to the Princeton WordNet: a hierarchy of nodes denoting synsets
  • built automatically from the PWN and various multilingual resources
  • current coverage: 32,351 non-empty French synsets, 38,001 French lemmata
idea: use the nominalization relations and the semantic relations present in the WOLF to fill up verbal synsets
difficulty: high level of verbal polysemy
two tasks: filling empty synsets with verbs, and adding verbs to existing synsets
exploit derivational links already present in the WOLF (coming from the PWN):
  for each verb with one or more candidates:
    for each of its nominalization candidates:
      if the noun is already present in the WOLF:
        extract every verbal synset of the WOLF linked to the noun's synset by a derivational relation
-> 2,353 {V, S} candidates, where verb V is suggested to be added to synset S
heuristic method for filling synsets: if there is only one empty synset among the proposed verbal synsets, V is added to that synset
result: 377 empty synsets filled (45 of them with more than one verb)
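A sketch of this heuristic; the synset identifiers and the emptiness test are invented placeholders:

```python
def place_by_heuristic(verb, candidate_synsets, is_empty):
    """Add the verb to a synset only when exactly one candidate synset
    is empty; otherwise defer to the distributional method."""
    empties = [s for s in candidate_synsets if is_empty(s)]
    return empties[0] if len(empties) == 1 else None

# hypothetical synsets proposed for a verb via derivational links
synsets = ["fra-00123-v", "fra-00456-v"]
empty_ids = {"fra-00123-v"}
placed = place_by_heuristic("signer", synsets, empty_ids.__contains__)
```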
the 1,716 {V,S} candidates with more than one empty synset were positioned using a distributional method combining two metrics:

likeliness(V,S) = correctness of the nominalization + semantic similarity of V and the synset
  • nominweight(V,S) = distributional similarity between V and the noun based on which synset S is proposed
  • semweight(V,S) = distributional similarity between V and the bag of words around the synset: every word either in synset S or in any synset linked to S within at most three steps of hypo- or hypernymy relations; semweight corresponds to the minimum of the distances between V and the bag of words
combination of semweight and nominweight:
  • learn the weights of the features with the MegaM classifier
  • training data:
      positive {V,S} candidates: proposed candidates which already figure in the WOLF
      negative {V,S} candidates: non-empty synsets which do not contain the verb V
  • features: semweight, nominweight, (semweight * nominweight)
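The paper trains MegaM, a maximum-entropy classifier, on these three features; the sketch below substitutes a plain logistic model with hand-set illustrative weights to show how the features combine into a confidence score:

```python
import math

def features(semweight, nominweight):
    return [semweight, nominweight, semweight * nominweight]

def likeliness(weights, bias, semweight, nominweight):
    """Logistic combination of the two metrics and their product."""
    z = bias + sum(w * x for w, x in zip(weights, features(semweight, nominweight)))
    return 1 / (1 + math.exp(-z))

# illustrative weights; in the paper these are learned by MegaM
w, b = [2.0, 3.0, 1.0], -2.5
strong = likeliness(w, b, 0.8, 0.6)  # high on both metrics -> high confidence
weak = likeliness(w, b, 0.1, 0.1)    # low on both metrics -> low confidence
```

Candidates can then be ranked by this confidence value.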
Authors: Kata Gábor, Marianna Apidianaki, Benoît Sagot, Eric Villemonte de la Clergerie
manual evaluation of {V,S} candidates coming from a correct nominalization:
  • candidates positioned by the heuristic method: 63 randomly chosen candidates evaluated, 95% correct assignments
  • candidates positioned by the distributional method: 63% correct candidates among the totality of the generated {V,S} tickets
  • ranked by the confidence value:
      highest-ranked 25%: 80% precision
      highest-ranked 50%: 69% precision
1) unsupervised method for extracting event nominalizations:
  • distributional similarity
  • morphological similarity
  • 'event indicator' metric
principal advantages:
  • ability to extract morphologically unrelated nominalization candidates
  • ability to visualize distributional contexts and facilitate manual validation of candidates
  • possibility to extract shared arguments and their semantic characteristics
2) ontology population based on the extracted derivational links:
  • generation of {lemma, synset} candidates
  • heuristic method for unambiguous candidates (95% precision for correct nominalization candidates; 78% precision over the totality of candidates)
  • distributional method for ambiguous candidates
  • the confidence value facilitates the manual completion of verbal synsets for ambiguous candidates