We use a POS tagging library to infer the grammar info of numbers and symbols.
Match words with dictionary
Probability
For each word in List A:
- Get word's grammar info
- Find lemma and Forms
- Search word's translation in dictionary
- Add a Score to that word if the same word or a variant has been found*.
- Mark the matching word in the right Selected Words list.
- Use a factor based on Type of Word for scoring words.
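The per-word steps above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the `Word` class, the `score_word` function, the toy dictionary, the 0.5 fallback factor, and the 1.0 / 0.8 match scores for exact and variant hits are all assumptions. The Word Type Factors are the ones used in the worked example in this deck.

```python
from dataclasses import dataclass

# Word Type Factors as used in the worked example of this deck.
TYPE_FACTORS = {"NNP": 1.0, "NN": 0.9, "VB": 0.8, "JJ": 0.7}

# Hypothetical match scores: exact translation vs. inflected variant.
EXACT_SCORE = 1.0
VARIANT_SCORE = 0.8


@dataclass
class Word:
    text: str
    tag: str              # coarse POS tag, e.g. "NN"
    score: float = 0.0
    matched: bool = False


# Toy dictionary (assumption): English lemma -> (translation, known forms).
TOY_DICTIONARY = {
    "democracy": ("democracia", {"democracias"}),
    "control": ("controlar", {"controla", "controlan"}),
}


def score_word(word, lemma, list_b, dictionary=TOY_DICTIONARY):
    """Score one List A word and mark its counterpart in List B."""
    entry = dictionary.get(lemma)
    if entry is None:
        return 0.0
    translation, forms = entry
    for candidate in list_b:
        if candidate.matched or candidate.tag != word.tag:
            continue  # already used, or grammar info does not match
        text = candidate.text.lower()
        if text == translation:
            quality = EXACT_SCORE
        elif text in forms:
            quality = VARIANT_SCORE
        else:
            continue
        candidate.matched = True   # mark the word in the other list
        word.matched = True
        # 0.5 is a hypothetical fallback factor for unlisted word types.
        word.score = quality * TYPE_FACTORS.get(word.tag, 0.5)
        return word.score
    return 0.0


list_b = [Word("democracia", "NN"), Word("controlan", "VB")]
for text, tag in [("democracy", "NN"), ("control", "VB")]:
    print(text, round(score_word(Word(text, tag), text, list_b), 2))
```

Marking the matched List B word prevents the same target word from being counted for two different source words.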
Regarding statistics:
- 1 to 1 alignment covers more than 83% of cases.
- We don't actually need to translate whole documents, only sentences.
Dictionary Match Cases
In all cases, grammar info must match as well.
- 1 of 1 matches
- 1 of many matches
- synonym
- more than one word as a single match
- Don't look up numbers in the dictionary; they should appear unchanged in both sentences.
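One possible reading of these match cases, sketched in code. The function names, the `(text, tag)` data shapes, and the exact meaning given to each case label are assumptions made for illustration; the deck does not define them precisely.

```python
def find_phrase(parts, target, tag):
    """True if the token sequence `parts` occurs in the target sentence."""
    tokens = [tok.lower() for tok, _ in target]
    for i in range(len(tokens) - len(parts) + 1):
        if tokens[i:i + len(parts)] == parts:
            # For single words the target token's tag must agree as well.
            if len(parts) > 1 or target[i][1] == tag:
                return True
    return False


def classify_match(word, tag, translations, target, synonyms=()):
    """Return the match case for one source word, or None if unmatched.

    translations, synonyms: (text, tag) pairs from a dictionary lookup
    target: the other-language sentence as (token, tag) pairs
    """
    if tag == "CD":
        # Numbers are not looked up; they must appear unchanged.
        return "number" if any(tok == word for tok, _ in target) else None
    # Grammar info must match in every case.
    candidates = [text for text, t_tag in translations if t_tag == tag]
    for text in candidates:
        parts = text.lower().split()
        if find_phrase(parts, target, tag):
            if len(parts) > 1:
                return "multi-word"
            return "1 of 1" if len(candidates) == 1 else "1 of many"
    for text, s_tag in synonyms:
        if s_tag == tag and find_phrase(text.lower().split(), target, tag):
            return "synonym"
    return None


target = [("el", "DT"), ("gobierno", "NN"), ("es", "VB"), ("libre", "JJ")]
print(classify_match("government", "NN", [("gobierno", "NN")], target))
print(classify_match("free", "JJ", [("gratis", "JJ"), ("libre", "JJ")], target))
```

Each case could then map to its own Dictionary Match Score, which is one of the values listed later as still needing tuning.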
Alignment Process Logic
- Split sentences into words via NLP library (with correct sentence boundaries)
- Select a word as important (via grammar info)
- Find translation using lemma and Forms
- Match Forms returned in translation result against words in the other language sentence
Underlying principle
As in human translation we use:
- Structured Knowledge (dictionary database, corpora data via NLP libraries)
- Basic linguistic logic (grammar/lemma/Forms matches)
- Querying pre-processed data is cheaper and more likely to be right than computing the result mathematically
English result (List A)
- Aristotle
- democracy
- constitution
- majority
- government
- control
- defined
- being
- free
- poor
Iterating and matching
We calculate a WordScore for each word and a global SentenceScore that takes all WordScores into account:
- [en] Aristotle > [es] Aristóteles: 1.0 (exact match) * 1.0 (NNP Factor) = 1.0
- [en] democracy > [es] democracia: 1.0 (exact match) * 0.9 (NN Factor) = 0.9
- [en] control > [es] controlan: 0.8 (conjugated verb form found) * 0.8 (VB Factor) = 0.64
- [en] free > [es] libres: 0.5 (plural found, many definitions) * 0.7 (JJ Factor) = 0.35
- Continue with the rest of the words in List A and List B...
AWN: Average number of words = (4 + 4) / 2 = 4
SWD: Selected Words Difference = ABS(4 - 4) * 2 = 0
Score = (1.0 + 0.9 + 0.64 + 0.35) / (4 (AWN) + 0 (SWD)) = 2.89 / 4 = 0.7225
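The same arithmetic as a small function, reproducing the worked example. The name `sentence_score` and its parameters are illustrative, and returning 0.0 when no words were selected is an assumption; the formula itself follows the AWN and SWD definitions given here.

```python
def sentence_score(word_scores, selected_a, selected_b):
    """Sum of WordScores divided by AWN + SWD."""
    awn = (selected_a + selected_b) / 2       # Average number of words
    swd = abs(selected_a - selected_b) * 2    # Selected Words Difference
    if awn + swd == 0:
        return 0.0  # assumption: no selected words means no evidence
    return sum(word_scores) / (awn + swd)


score = sentence_score([1.0, 0.9, 0.64, 0.35], selected_a=4, selected_b=4)
print(round(score, 4))  # 0.7225
```

Because SWD is added to the denominator, sentences whose selected-word counts differ are penalized even when every scored word matches.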
More to take into account...
Sentence Score
- Number of Important Words
- Number of words per Type
- Dictionary matches (translation and grammar)
- Matches of synonyms
- Matches of more-than-one-word
Spanish result (List B)
- Aristóteles
- democracia
- constitución
- mayoría
- gobierno
- controlan
- define
- siendo
- libre
- pobres
Sample sentence
On each Sentence
- Get POS tags and GrammarInfo for all words
- Guess the real sentence language via NLP library
Sentence Alignment
Eligibility
Selecting criteria
Most Important Words
We can assume the alignment is correct when the Score is close to 1.0. Alignments that score below a configurable minimum required score are discarded.
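A minimal sketch of that filter. The threshold value 0.6 and the `(source, target, score)` tuple layout are assumptions; the deck only says that a minimum required score setting exists.

```python
MIN_SCORE = 0.6  # hypothetical setting, to be tuned on real data


def keep_alignments(candidates, min_score=MIN_SCORE):
    """Keep only the (source, target, score) triples that reach min_score."""
    return [c for c in candidates if c[2] >= min_score]


pairs = [
    ("Aristotle defined democracy ...", "Aristóteles define la democracia ...", 0.7225),
    ("Some unrelated sentence.", "Otra frase distinta.", 0.10),
]
for source, target, score in keep_alignments(pairs):
    print(score, source)
```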
EN: Aristotle defined democracy as the constitution in which the free and the poor, being in the majority, control government.
ES: Aristóteles define la democracia como la constitución en el que el libre y los pobres, siendo la mayoría, controlan el gobierno.
Based on grammar information, look for:
- Cardinal numbers, symbols and foreign words*
- Proper nouns
- Nouns, adjectives and adverbs
- Verbs
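A sketch of this selection step over already-tagged tokens. It assumes Penn Treebank style tags (CD, SYM, FW, NNP, NN, JJ, RB, VB and their suffixed variants) and treats the order of the list above as a priority order, which is an interpretation rather than something the deck states. The names `PRIORITY`, `importance`, and `select_important` are illustrative.

```python
# Tag prefixes per group, in the order of the list above.
PRIORITY = [
    ("CD", "SYM", "FW"),   # cardinal numbers, symbols, foreign words
    ("NNP",),              # proper nouns (checked before plain "NN")
    ("NN", "JJ", "RB"),    # nouns, adjectives, adverbs
    ("VB",),               # verbs
]


def importance(tag):
    """Return the rank of a tag (0 is most important) or None."""
    for rank, prefixes in enumerate(PRIORITY):
        if tag.startswith(prefixes):
            return rank
    return None


def select_important(tagged):
    """Return the important tokens from (token, tag) pairs, best rank first."""
    ranked = [(importance(tag), token) for token, tag in tagged]
    ranked = [(rank, token) for rank, token in ranked if rank is not None]
    # sorted() is stable, so tokens of equal rank keep sentence order.
    return [token for rank, token in sorted(ranked, key=lambda p: p[0])]


tagged = [("Aristotle", "NNP"), ("defined", "VBD"), ("democracy", "NN"),
          ("as", "IN"), ("the", "DT"), ("free", "JJ")]
print(select_important(tagged))
```

Function words such as "as" and "the" get no rank and are dropped, which keeps the selected lists short enough for dictionary lookups.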
Algorithm development
We still need to:
- Set better Word Type Factors
- Set better Dictionary Match Scores
Thanks!