Overview of EFNILEX 2008-2012

EFNIL meeting, 2012 Budapest

Eniko Heja

on 26 October 2012

Transcript of Overview of EFNILEX 2008-2012

E. Héja, D. Takács, T. Váradi
{eheja, takdavid, varadi}@nytud.hu

General hypothesis: P(tr) and the ratio of the frequencies of the source and target lemmata provide hints on the semantic relations between the source and target lemmata.
Dyvik, 2002: “Translations come about when translators evaluate the degree of interpretational equivalence between linguistic expressions in specific contexts. In many ways such evaluations, made without any theoretical concerns in mind, seem more reliable as sources of semantic information than the careful paraphrases of the semanticist or the meaning descriptions of the lexicographer.”
We interpreted the basic assumptions behind their method in translational terms

Crosslingual semantic relations 1.

Since the most likely translation of ‘faire’ in the one-token dictionary is ‘doen’ rather than 'maken', our method predicts that this translation is not transparent for the user

The proportion of whole complementation frames should be increased among translation candidates

Some heuristics should be used to filter out wrong complementation frames in the frame list
E.g.: “too long” frames have to be disregarded

The diversity of the data should be reduced
Fewer syntactic categories should be used to characterize verbal structures

To what extent might LT methods help the creation of bilingual dictionaries?

Facilitate lexicographic work
Directly for end-users

Medium-sized dictionaries covering every-day language use: 15,000-25,000 entries

Targeted language pairs: Hungarian-Lithuanian, Dutch-French

Hypothesis 1: ’Semantically closely related words ought to have strongly overlapping sets of translations.’ (Dyvik, 2002)
Interpretation: Two lemmata are translational synonyms if
Translational probability is high and the frequencies of the source and target lemmata are close.
The straight and reverse translational probabilities are both high.
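
A minimal sketch of the two conditions listed above, written as two separate tests because the slide does not say whether they are alternatives or must hold together. The class and function names, the thresholds (min_p = 0.9, max_ratio = 3.0) and the example figures are all assumptions made for illustration; they are not taken from the EFNILEX implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    source: str
    target: str
    p_tr: float          # P(target | source), the "straight" probability
    p_tr_reverse: float  # P(source | target), the "reverse" probability
    f_source: int        # corpus frequency of the source lemma
    f_target: int        # corpus frequency of the target lemma


def frequency_ratio(c: Candidate) -> float:
    """Ratio of the larger to the smaller lemma frequency (always >= 1)."""
    larger = max(c.f_source, c.f_target)
    smaller = min(c.f_source, c.f_target)
    return float("inf") if smaller == 0 else larger / smaller


def close_frequency_synonym(c: Candidate, min_p: float = 0.9,
                            max_ratio: float = 3.0) -> bool:
    """Condition 1: high P(tr) and close source/target frequencies.
    Both thresholds are hypothetical defaults."""
    return c.p_tr >= min_p and frequency_ratio(c) < max_ratio


def bidirectional_synonym(c: Candidate, min_p: float = 0.9) -> bool:
    """Condition 2: straight and reverse probabilities are both high."""
    return c.p_tr >= min_p and c.p_tr_reverse >= min_p


# Invented figures, for illustration only.
good = Candidate("lemma_a", "lemma_b", 1.0, 0.95, 120, 150)
poor = Candidate("lemma_c", "lemma_d", 0.95, 0.10, 120, 2000)
print(close_frequency_synonym(good), bidirectional_synonym(good))
print(close_frequency_synonym(poor), bidirectional_synonym(poor))
```

Keeping the two conditions as separate functions leaves the choice of combining them (and / or) to the caller.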
Atkins & Rundell (2008:467) state, ’The perfect translation – where an SL word exactly matches a TL word – is rare in general language, except for the names of objects in the real world (natural kind terms, artefacts, places, etc.)’.
Manual evaluation of Slovenian and Hungarian translation pairs (p(tr)=1, frequency ratio is less than 3)
104 noun-to-noun translations (out of 136 pairs).
proper names, illnesses, professions
a couple of abstract nouns: ihlet (’inspiration’) and botrány (’scandal’).

This question arises only in the case of MWE-MWE translations

MWE-one-token unit translations (or vice versa) are surely not transparent

Goal: To extract parallel verbal structures
E.g.: FR “faire partie de …” NL “deel uitmaken van…” (make part of…)

(1) Monolingual phase:
Automatic extraction of verbal structures via a suitable algorithm
Detection of the proper verbal structures on each side of the parallel corpus
Merging them as if they were one-token expressions:
faire partie de => faire***partie***de
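
The merging step can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the sentence is taken to be a list of lemmas, the structures are taken to be contiguous lemma sequences (the long-distance structures mentioned elsewhere in the talk would need the syntactic analysis to locate their parts), and the function name merge_structures is hypothetical. Only the "***" separator comes from the slide.

```python
SEPARATOR = "***"  # separator shown on the slide


def merge_structures(tokens, structures):
    """Replace each contiguous occurrence of a known structure by one merged token.

    tokens:     list of lemmas for one sentence
    structures: iterable of lemma sequences, e.g. [("faire", "partie", "de")]
    """
    # Try longer structures first so that a long match wins over a shorter one.
    ordered = sorted((tuple(s) for s in structures if s), key=len, reverse=True)
    merged = []
    i = 0
    while i < len(tokens):
        for structure in ordered:
            if tuple(tokens[i:i + len(structure)]) == structure:
                merged.append(SEPARATOR.join(structure))
                i += len(structure)
                break
        else:
            # No structure starts at this position: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


sentence = ["il", "faire", "partie", "de", "le", "groupe"]
print(merge_structures(sentence, [("faire", "partie", "de")]))
```

After this step the word aligner sees faire***partie***de as an ordinary token, which is what allows the bilingual phase to treat it like a one-token expression.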

(2) Bilingual phase: Finding the relevant translation pair via word alignment

As the word alignment algorithm considers only one-token expressions, multi-word expressions (MWEs) should be treated in a different way

Strict Parameter Settings
With these parameters there are only 1,352 translation candidate pairs in the fr-nl dictionary

Dictionary Browser:
The user can browse the automatically generated databases

This approach diminishes the reliance on lexicographers’ intuition

When characterizing SL linguistic units (LUs) to be included in the dictionary
When finding the translational equivalents
Usage-based, representative translations
Clear ranking between more likely and less likely translations
Most-used translation equivalents are ranked higher

Example
All the bigrams appear in the DB

Even if their translation is not available with the current parameter setting

Problem: different parameter settings are needed than in the case of one-token units

Nouns are accessible from the verbs they are selected by

Verbs are accessible from the objects they are used with

Parallel corpus with deep syntactic analysis
Algorithm to detect verbal structures automatically

Relaxed parameters
Strict parameters

(3) Word cloud:
font size reflects P(tr)
colours give hints on the semantic relation between SL and TL words

(4) Bilingual concordance is shown by clicking on the translation candidate

Hypothesis 3: ’If a word a is a hyponym of a word b (such as tasty of good, for example), then the possible translations of a ought to be a subset of the possible translations of b.’ (Dyvik, 2002)
Interpretation: if the sum of the target lemma frequencies is close to the source lemma frequency and the sum of their translation probabilities is high then the target lemmata represent submeanings of the source word.
The submeanings might be related or homonyms
We cannot automatically tell apart related meanings and homonyms at the present stage

An example of the extracted verbal structures and their translation candidates

Example: Extracted verbal structures (Dutch)
Too many features: ‘does not use sg’, ‘use adj+sg’, ‘give adj+sg’.

If the ADJs and the negation were not considered, the resulting frames would be exactly what could be “expected”.

Relaxed Parameter Settings
The frequency of the source lemma and that of the target lemma must exceed a threshold:
min f(S) = min f(T) = 5

Extraction of Verbal Structures

EFNILEX Project 2008-2012
An Overview

Task
Workflow (French-Dutch)
Prerequisites
efnilex.efnil.org

Which syntactic features are relevant in the characterization of verbal structures?
More features => more detailed description of verbal structures
More features => less data for a data type, since the diverse tags increase the number of types in the corpus.
Head of the direct dependents of the verb
The dependency relation (complement, adjunct, the preposition denoting the type of the dependency, semantic annotation if any)
Adjectives or possible complements were kept, all the other head modifiers were omitted
Determiners were dropped

Therefore:

Method based on word alignment on parallel corpora.
Cons: It is tedious (if possible at all) to collect parallel corpora of a proper size.

Background
Cut Board
Dictionary Browser
Rich, Uniform Corpus Representation
MWEs
2012
Dictionary Query System

Additional workflow for verbal structures

Dictionary Browser
(1) Translation candidates are ranked based on their likelihood => most used translation candidates come first
(2) Displays the distribution of translations based on P(tr) and the frequency ratio between the SL word and the corresponding translation

Detailed Workflow
Lexicographically Interesting Translations
Conclusion
Verbs + Object Structures in the Dictionary Browser
Crosslingual semantic relations 2.
Crosslingual semantic relations 3.
Crosslingual semantic relations 4.
Crosslingual semantic relations

Pros
Bilingual concordances are provided automatically
Reversing the dictionary is straightforward: the workflow is symmetric.

Underlying Database
One-token Units
Long-distance (Verbal Structures)
Collocations
User Feedback, Documentation, Dissemination

How to find the ideal parameter settings?
Trade-off between precision and coverage: the larger the dictionary, the more incorrect translation candidates it contains

Translation probability must exceed a certain threshold:
min p(tr) = 0.001
This parameter setting results in 104,039 translation candidate pairs in the fr-nl dictionary

The frequency of the source lemma and that of the target lemma must exceed a threshold:
min f(S) = min f(T) = 100
Translation probability must exceed a certain threshold:
min p(tr) = 0.5

Cut Board:
Customization according to the above parameters
Customization: Different parameter settings for different user needs
efnilex.efnil.org
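
The threshold parameters quoted in this overview (a minimum translation probability and a minimum frequency for both lemmata) amount to a simple filter over the candidate pairs. The sketch below is an assumption-laden illustration: pairing min f = 5 with min p(tr) = 0.001 as "relaxed" and min f = 100 with min p(tr) = 0.5 as "strict" is a reading of the slides, the thresholds are treated as inclusive, and all names and the frequency figures are invented. Only the two probabilities for 'faire' are taken from the talk's one-token dictionary example.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    source: str
    target: str
    p_tr: float     # translation probability P(target | source)
    f_source: int   # frequency of the source lemma
    f_target: int   # frequency of the target lemma


class Settings(NamedTuple):
    min_p_tr: float
    min_freq: int   # applied to both f(S) and f(T)


RELAXED = Settings(min_p_tr=0.001, min_freq=5)   # assumed pairing, see lead-in
STRICT = Settings(min_p_tr=0.5, min_freq=100)    # assumed pairing, see lead-in


def keep(pair: Pair, s: Settings) -> bool:
    """True if the pair passes all thresholds (inclusive: an assumption)."""
    return (pair.p_tr >= s.min_p_tr
            and pair.f_source >= s.min_freq
            and pair.f_target >= s.min_freq)


def build_dictionary(pairs, s: Settings):
    """Filter the candidates, then rank each source lemma's translations
    so that the most likely candidate comes first."""
    kept = [p for p in pairs if keep(p, s)]
    return sorted(kept, key=lambda p: (p.source, -p.p_tr))


candidates = [
    Pair("faire", "maken", 0.217, 5000, 4500),   # frequencies invented
    Pair("faire", "doen", 0.253, 5000, 4000),    # frequencies invented
    Pair("src_a", "tgt_a", 0.8, 200, 150),       # fully invented
    Pair("src_b", "tgt_b", 0.9, 4, 3),           # fully invented, too rare
]

for name, settings in (("relaxed", RELAXED), ("strict", STRICT)):
    entries = build_dictionary(candidates, settings)
    print(name, [(p.source, p.target) for p in entries])
```

Running the same candidate list through both settings makes the trade-off visible: the relaxed filter keeps more pairs (coverage), the strict one keeps only the high-probability, high-frequency ones (precision).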
Cross-linguistically uniform morphosyntactic annotation

Sass (2009): An extraction method for collecting salient verbal structures (not strictly subcategorization frames)

The method recognizes whether the lexical head is an inherent part of the structure, e.g.: "make part of sg"

"part": lexically bound
"of sg": only the preposition is an inherent part of the verbal frame

Workflow
Extracting bilingual collocations
Hungarian - Lithuanian (and vice versa)
Hungarian - Slovenian (and vice versa)

Collocations are defined as:

Noun + Noun
Adjective + Noun
Adverb + Verb sequences

Workflow
(1) Monolingual phase:

Automatic extraction of collocations via a suitable algorithm
Detection of the collocations on each side of the parallel corpus
Merging them as if they were one-token expressions

(2) Bilingual phase: Finding the relevant translation pairs via word alignment

Workflow
User Feedback:
English - Hungarian Dictionary
The whole workflow (except for the extraction of parallel verbal structures) was reapplied to an existing parallel corpus (Hunglish 1.0)


To make our results more widely known
To collect user feedback on the dictionaries: improvements and simplifications

Uniform XML corpus representation

Manual evaluation has shown that the results are promising BUT

To find an automatic, language-independent heuristic for deciding whether an MWE translation is interesting for the user
Presupposition: A translation is interesting for a user if it is not transparent.
The one-token dictionaries generated in the first phase should be used.
We say that the translation is transparent if every part of a compound expression is translated with the most likely translation in the one-token dictionary.

Is 'Faire choix' - 'Maken keuze' an interesting translation pair? YES!
One-token dictionary:
'Faire': 'Doen' 25.3%, 'Maken' 21.7%

Dictionary Query System

Future Plans
Instead of Parallel Texts: Comparable Texts
Comparable texts: texts from a similar domain or genre (e.g. Wikipedia articles, news, etc.)
Pros: Greater amount of comparable texts
Cons: Small bilingual 'seed-dictionaries' are needed
Easily producible: based on some suitable parallel corpora
Results are poorer: precision does not rise above 65% (Irimia, 2012)

How to Find Translation Candidates?
Hypothesis: word target1 is a candidate translation of word source1 if the words with which target1 co-occurs within a particular window in the target corpus are translations of the words with which source1 co-occurs within the same window in the source corpus
"Are translations": Based on the seed dictionaries

Related Projects (LREC 2012)
ACCURAT offers methods to gather comparable corpora from the web and a toolkit to obtain translation dictionaries
Cons: the toolkit targets Croatian, Estonian, Greek, Latvian, Lithuanian and Romanian: how much effort is needed to adapt it to other languages?
TTC TermSuite: terminology extraction from comparable corpora for English, French, German, Spanish, Latvian, Chinese, Russian.
PANACEA Platform provides a web service for focused monolingual/bilingual (parallel) crawling for English, Spanish, French, Italian and German

Additional workflow for collocations
Goal: the uniform treatment of collocations across languages

The resulting proto-dictionaries include

Genre-information

Dissemination and Documentation
Our results were presented at:
EACL 2012
LREC 2012
Documentation of the experiments, results and workflow is in progress

Plans for 2012
(1) Efforts to handle MWEs should be extended: automatic treatment of verbal structures and Hungarian-Lithuanian adjective + noun translation pairs
(2) Uniform representation of every parallel corpus (structurally and morphosyntactically) so that they can be processed in a uniform way in the rest of the workflow
(3) Results should be made available online and the methodology carefully documented.
(4) To find optimal parameter settings to be able to detect semantic relations between source and target lemmata
(5) The use of frequency dictionaries might compensate for accidental gaps in coverage and provide a balanced list of lemmas. Automatic translations of the missing entries should be found.

DONE DONE IN PROGRESS IN PROGRESS TODO
Task Solution
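
The transparency criterion for MWE translation pairs used in this overview (a translation is transparent if every part of the expression is rendered by its most likely translation in the one-token dictionary, and an MWE translated by a single token is never transparent) can be sketched as follows. This is only an illustration: the function names are hypothetical, the parts of the two expressions are assumed to be already aligned position by position, and the probability given for 'choix' is invented. The two figures for 'faire' are the ones shown in the talk.

```python
def most_likely_translation(one_token_dict, word):
    """Return the highest-probability translation of `word`, or None if unknown."""
    candidates = one_token_dict.get(word)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def is_transparent(source_parts, target_parts, one_token_dict):
    """A pair is transparent if each source part is translated by its most
    likely one-token translation. Parts are assumed to be aligned by position;
    a length mismatch (e.g. MWE vs. single token) is treated as not transparent."""
    if len(source_parts) != len(target_parts):
        return False
    return all(
        most_likely_translation(one_token_dict, s) == t
        for s, t in zip(source_parts, target_parts)
    )


def is_interesting(source_parts, target_parts, one_token_dict):
    """Presupposition from the talk: interesting means not transparent."""
    return not is_transparent(source_parts, target_parts, one_token_dict)


one_token_dict = {
    "faire": {"doen": 0.253, "maken": 0.217},  # figures from the talk
    "choix": {"keuze": 0.6},                   # invented probability
}

print(is_interesting(["faire", "choix"], ["maken", "keuze"], one_token_dict))
print(is_interesting(["faire", "choix"], ["doen", "keuze"], one_token_dict))
```

With these figures the pair 'faire choix' / 'maken keuze' is flagged as interesting because 'maken' is not the top-ranked translation of 'faire', which matches the example given in the talk.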