Tracking Researcher Mobility on the Web Using Snippet Semantic Analysis

An application of natural language processing tools to the sociology of migration.
by

Jorge Garcia Flores

on 21 October 2012

Transcript of Tracking Researcher Mobility on the Web Using Snippet Semantic Analysis

Tracking Researcher Mobility on the Web Using Snippet Semantic Analysis
Jorge J. García Flores, Pierre Zweigenbaum, Zhao Yue and William Turner
LIMSI/CNRS, B.P. 133, 91403, ORSAY, FRANCE

[Overview labels: brain drain; tracking mobility; web people search; "Unoporuno" NLP pipeline; nlp; evaluation (name matching filter evaluation, semantic features evaluation, semi-automatic MTC task (user evaluation))]

Mobility Traces Classification (MTC) task
Among the Latin-American authors who published scientific articles about biotechnology during 2011, how many of them are living abroad?

Sociology of migration
Traditional sources: demographic registers, labour surveys, population census
Alternative source: biographic information extracted from the web ("reality mining")

Manually tracking mobility
[Diagram labels: bibliographical records, names, sociologist, local authors, mobile authors, mobility trace]
Mobile author: someone who has gone abroad for professional or academic reasons for more than one year
Local author: someone who has only spent short periods of time abroad

Web People Search (WePS)
Ambiguous name-only queries
Clustering the results according to homonyms
100 web search results per query
Example: "Michael Jackson" + the artist + the requirements engineering expert + the plumber (Artiles et al., 2010)

Hypothesis
To refine name-only queries with bibliographical information to solve the MTC task

Unoporuno: Mobility Traces Classification task (semi-automatic)
[Diagram labels: bibliographical records, refined name queries, mobility trace snippets, top 5 snippets, sociologist, mobile authors, local authors]

Unoporuno: Mobility Traces Classification task (automatic)
[Diagram labels: bibliographical records, refined name queries, mobility trace snippets, sociologist validation, local authors, mobile authors]

Pre-processing
Extract author names, publication titles, organizations and locations from the ISI export file.
Extract geographical locations and organizations from the author's affiliation
Filter out researchers affiliated to non-Spanish speaking countries, except for those with a Spanish first or last name.
Expand ISI initials
Clean useless information.
[Extracted fields: name, topics, organizations, locations]

Query production
Names of people are combined with noun phrases from the publication's title.
Extract geographical locations and organizations from the author's affiliation
Geographical locations are translated into Spanish or English.
Multilingual queries are generated in English, Spanish and the organization's detected language.
An average of 19 queries per person are produced.

Tools
Freeling bilingual POS tagger.
Google translator.
Home-made gazetteers.

Name filtering
The mass of snippets resulting from Web search queries is then filtered to select those with a valid variation of the person's name.
William A. Turner
WA Turner
Turner, W.A.
William Turner
Turner, William

Semantic feature analysis
Feature analysis consists in searching the snippet content for mobility-related information.
The rationale is that the snippet contents might give clues about mobility traces not directly visible in the snippet, but which are contained in the referred-to document.
To design the multilingual rules, an extensive n-gram based analysis of the 58,220 snippets from the training set and named entities from the JRC-Names list was performed.

Snippet classification & ranking
4 statistical classifiers were compared: SVM, Naive Bayes, Decision trees and NB Trees.
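
The name filtering and feature analysis steps above can be pictured with a short sketch. It builds the five surface forms listed for William A. Turner, keeps the snippets that contain one of them, and turns each kept snippet into a binary feature vector. The helper names, the sample snippets and the three feature rules are invented for illustration; the 15 features and the name grammar actually used are not given in the slides.

```python
import re

def name_variants(first, last, middle=""):
    """Return a set of surface forms for a person's name."""
    variants = {f"{first} {last}", f"{last}, {first}"}
    if middle:
        f, m = first[0], middle[0]
        variants |= {
            f"{first} {m}. {last}",   # William A. Turner
            f"{f}{m} {last}",         # WA Turner
            f"{last}, {f}.{m}.",      # Turner, W.A.
        }
    return variants

def name_pattern(first, last, middle=""):
    """Compile one regexp that matches any variant, not embedded in a longer word."""
    alternatives = "|".join(
        re.escape(v) for v in sorted(name_variants(first, last, middle))
    )
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)

def filter_snippets(snippets, pattern):
    """Keep only the snippets that contain a valid name variation."""
    return [s for s in snippets if pattern.search(s)]

# Hypothetical feature rules (assumption): three illustrative cues only.
FEATURE_RULES = {
    "degree": re.compile(r"\b(?:ph\.?d|doctorado|postdoc\w*)\b", re.IGNORECASE),
    "institution": re.compile(r"\b(?:universit\w+|universidad|institut\w*)\b", re.IGNORECASE),
    "position": re.compile(r"\b(?:professor|profesor|researcher|investigador\w*)\b", re.IGNORECASE),
}

def feature_vector(snippet):
    """Binary vector: 1 if the rule fires somewhere in the snippet, else 0."""
    return [int(bool(rule.search(snippet))) for rule in FEATURE_RULES.values()]

# Made-up snippets, for illustration only.
snippets = [
    "Turner, W.A. is a researcher at LIMSI, Orsay.",
    "Tina Turner live in concert",
    "Prof. William A. Turner, PhD, University of Paris-Sud",
]
pattern = name_pattern("William", "Turner", middle="A")
kept = filter_snippets(snippets, pattern)
vectors = [feature_vector(s) for s in kept]
```

In a full system the vectors would be passed to one of the statistical classifiers named above; that part is left out here.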





A geographical heuristic is used to rank the top-5 snippets.

Person classification
Automatic person classification is performed based on geographical data found in top ranked snippets.
Snippets classified as mobility traces are parsed to extract locations.
Locations are then mapped to countries.
The person is classified according to the most frequent countries in the snippet selection as being mobile, local or unknown.

[Section labels: Unoporuno NLP pipeline; Evaluation data]

Experimental framework
Name matching filter evaluation
Semantic features evaluation
Snippet classifiers comparison
User evaluation of the semi-automatic person classifier.
Evaluation of the automatic person classifier

Name matching filter evaluation
Random selection of 100 snippets containing a valid variation of 10 persons' names (positives).
Random selection of 100 snippets containing no valid variation of 10 persons' names (negatives).
Manually annotate false positives from the first set and false negatives from the second.

Semantic features evaluation
Individual feature evaluation: for each of the 15 features, we made a random selection of 50 snippets with the feature on (positives) and 50 snippets with the feature off (negatives). Then we annotated false positives from the first set and false negatives from the second.
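
One way to turn such annotated samples into scores is sketched below. It assumes (the slides do not say which figures were reported) that the positive sample is used to estimate precision and the negative sample to estimate the share of wrongly rejected snippets. The function name and the counts in the example are hypothetical.

```python
def sample_scores(n_pos, false_pos, n_neg, false_neg):
    """Estimate filter quality from two manually annotated random samples.

    n_pos / n_neg: sizes of the positive and negative samples.
    false_pos: positives annotated as wrong; false_neg: negatives annotated as wrong.
    """
    if n_pos <= 0 or n_neg <= 0:
        raise ValueError("both samples must be non-empty")
    if not (0 <= false_pos <= n_pos and 0 <= false_neg <= n_neg):
        raise ValueError("error counts must lie between 0 and the sample size")
    precision = (n_pos - false_pos) / n_pos  # share of accepted snippets that are correct
    miss_rate = false_neg / n_neg            # share of rejected snippets that should have been kept
    return precision, miss_rate

# Hypothetical annotation counts, for illustration only.
precision, miss_rate = sample_scores(n_pos=100, false_pos=4, n_neg=100, false_neg=10)
```

Recall cannot be read off these two samples alone, because it also depends on how many snippets fall on each side of the filter in the full collection.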
Ablation tests: we trained 15 ablated SVM classifiers by removing one feature at a time from the original 15-feature set. The automatic person classification process was run 15 times, each time with a 14-feature classifier, and the results were compared to the full 15-feature run.

Snippet classifiers comparison
Measure P@5 (precision at the fifth snippet), R@5 (recall at the fifth snippet) and F@5 based on the observed category value of each snippet.
Measure the Oracle Decision Rate (ODR) to simulate whether a sociologist would be able to make a decision based on the top-5 snippets. This would be the case if at least one snippet allowed the sociologist to classify a person in the right category.

User evaluation of the semi-automatic person classifier
We evaluated the ability of sociologists to classify persons given the top-5 snippets produced by the classifier that obtained the best ODR.
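
As a rough sketch of the ranking measures and the ODR mentioned above, assuming the usual definitions (the slides do not spell them out): P@5 is the share of mobility traces among the five top-ranked snippets, R@5 divides the same count by the number of mobility traces available for that person, F@5 is their harmonic mean, and ODR is the share of persons with at least one decisive snippet in their top 5. The function names and the boolean encoding of snippets are illustrative.

```python
def precision_recall_f_at_k(ranked, total_relevant, k=5):
    """ranked: booleans, best-ranked first (True = snippet is a mobility trace).

    total_relevant: number of mobility traces that exist for this person.
    """
    hits = sum(ranked[:k])
    precision = hits / k
    recall = hits / total_relevant if total_relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

def oracle_decision_rate(top_flags_per_person):
    """Share of persons whose top-k list holds at least one decisive snippet.

    top_flags_per_person: one list of booleans per person
    (True = this snippet would let a sociologist pick the right category).
    """
    if not top_flags_per_person:
        return 0.0
    decided = sum(1 for flags in top_flags_per_person if any(flags))
    return decided / len(top_flags_per_person)

# Made-up rankings, for illustration only.
p5, r5, f5 = precision_recall_f_at_k(
    [True, False, True, True, False, True], total_relevant=4
)
odr = oracle_decision_rate([
    [False, True, False, False, False],
    [False, False, False, False, False],
    [True, True, False, False, False],
    [False, False, False, False, True],
])
```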
Three pairs of sociologists classified subsets of 10 persons (5 mobile, 5 local) of the test set. A seventh sociologist was asked to classify the entire test set (50 persons: 25 mobile, 25 local). Precision, Recall and F-measure were computed using the true mobility status of the people. Kappa was computed for each pair of users sharing the same dataset.

Results
We performed an automatic person classification test on a set of 25 mobile and 25 local persons. The test consisted in automatically classifying a person as being mobile or local based on geographical criteria, and comparing Unoporuno results with the real mobility classes.

Conclusions
The top-5 snippets are selected for evaluation by a sociologist, and our automatic selection algorithm works in 80% of the cases: using the snippets selected by our system, sociologists can access documents on the Web which allow them to take clear-cut decisions on a person's mobility status with a moderate level of inter-evaluator agreement (avg. kappa=0.60).
The geographical analysis of locations from snippets classified as mobility traces by an SVM classifier was able to find 78% of the mobile persons of the test set (F=0.71).

[Slide titles: Name matching filter evaluation; Semantic features evaluation; Snippet classifiers comparison; User evaluation of the semi-automatic person classifier; Evaluation of the automatic person classifier]

However, WePS is not appropriate for the MTC task, because:
MTC is interested in one particular individual, while WePS's main concern is to cluster homonyms.
WePS's starting point is a name-only query, while MTC's starting point is the semantically rich context of a bibliographical record.

Regexps generated from the name grammar to check valid name variations.
Context-free grammar for Spanish and English name parsing.

[Diagram labels: manual, semi-automatic, automatic, query refinement]

JRC-Names: EU "Highly multilingual named entity resource"
200,000 person names
7,000 organization names

[Diagram labels: snippet, semantic feature vector, statistical classifier; strong mobility trace, weak mobility trace, no trace; home country, destination country, OR]

[Slide titles: Automatic MTC task evaluation; Snippet classifiers comparison]

Semantic features evaluation
For each feature of the 15 features:
Random selection of 50 snippets with the feature on (positives) and
Random selection of 50 snippets with the feature off (negatives).
Manually annotate false positives and false negatives.

Semi-automatic MTC task evaluation
Evaluating the ability of sociologists to classify persons given the top-5 ranked snippets.
Three pairs of sociologists
Each pair classified a subset of 10 persons (5 mobile, 5 local) of the test set.
A seventh sociologist classified the entire test set (50 persons: 25 mobile, 25 local).
Precision, Recall, F and Kappa (inter-annotator agreement) were measured.

Automatic MTC task evaluation
Automatic person classification on the test set (25 mobile, 25 local authors).
The test consisted in automatically classifying a person as being mobile or local based on geographical criteria, and comparing Unoporuno results with their real mobility classes.
[Figure: Results]

Oracle decision
[Diagram over the top 5 ranked snippets ("if contains", "∈"). Condition labels: 1+ strong mobility traces; 1+ home country weak trace, 1+ destination country weak trace; 0 strong mobility trace, 0 destination country weak trace. Outcome labels: mobile, mobile, local, mobile]

Ablation tests
15 ablated SVM classifiers were trained by removing one feature at a time from the original 15-feature set.
The automatic person classification process was run 15 times, each time with a 14-feature classifier, and the results were compared to the full 15-feature run.

Results
[Figure: Classifier comparison]
Sociologist evaluation: Avg F (E1...E6)=0.79

Thank you!