Entity-Based Filtering and Topic Detection for Online Reputation Monitoring in Twitter

PhD Thesis Defense
by

Damiano Spina

on 2 March 2015

Transcript of Entity-Based Filtering and Topic Detection for Online Reputation Monitoring in Twitter

Filtering
Apple
launches new
Iphone 5s...
Having a delicious
piece of
apple
pie

Entity-Based Scenario

Challenging for NLP:
Little Context
(1 tweet = 140 chars, ~15 words)
Non-standard, SMS-style language
Most popular microblogging service
One of the most relevant sources for ORM
Dynamics, Real-Time
Quick spread of information
Fingerprint representation
Is the tweet related to the company?
Entity-Based Filtering and Topic Detection
for Online Reputation Monitoring in Twitter

Damiano Spina
Dr. Julio Gonzalo Arroyo
Dr. Enrique Amigó Cabrera

PhD Thesis Defense
Madrid - September 25, 2014

Filter Keywords
Active Learning for Filtering
Wikified Tweet Clustering
Cluster Keywords
Current tools do not seem to perfectly match ORM expert needs
Difficult to personalize/customize
Not trivial to predict how the tool will perform on the real data of experts' daily work
Entity of Interest: Organization, Brand, Public Figure
Recall-Oriented Scenario
Topic Detection: Clustering Task
Twitter
Long Tail

Research Questions

Keywords
External Resources
Training Data
RQ 1: Which challenges in monitoring the reputation of an entity in Twitter can be formally modeled as information access tasks? Is it possible to build reusable test beds to investigate these tasks?
Filtering
Topic
Detection
Publications
Peer-Reviewed Journal Papers
Peer-Reviewed Conference Papers
Lab/Workshop Papers
CLEF Lab Overviews
Doctoral Programme in Intelligent Systems
Advisors:
Filter Keywords and Majority Class Strategies for Company Name Disambiguation in Twitter

D. Spina, E. Amigó, J. Gonzalo.
Proceedings of CLEF'11, 2011.
Discovering Filter Keywords for Company Name Disambiguation in Twitter
D. Spina, J. Gonzalo, E. Amigó
Expert Systems With Applications, vol. 40, no. 12, 2013
Q1, Impact Factor: 1.965
Resources
ORMA: Online Reputation Monitoring Assistant
Motivation
ORM Framework
Tasks
Conclusions
Topic
Priority
Summary
Polarity
for
Reputation
Topic
Detection
apple
bla,bla,ba....
apple
...
bla,bla,bla
bla,bla,ba....
apple
...
bla,bla,bla
bla,bla,ba....
apple
...
bla,bla,bla
bla,bla,ba....
apple
...
bla,bla,bla
Tweet Stream
bla,bla,ba....
apple
...
bla,bla,bla
bla,bla,ba....
apple
...
bla,bla,bla
Alert!
Alert!
Alert!
Mildly Important
Sep' 14
Discarded
(unrelated)
Unimportant
OUTPUT
Evaluation Metrics
Reusable
Test Collections

ORM as
Information Access
Tasks

Apple
launches new
Iphone 5s...
Having a delicious
piece of
apple
pie

Author
Profiling
Aspect and Opinion Target Identification
Dimension Classification
Filtering
WePS-3 Dataset (Filtering)
http://nlp.uned.es/weps
RepLab 2013 Dataset
RepLab 2012 Dataset
A Corpus for Entity Profiling
http://bit.ly/profilingTwitter
Reliability & Sensitivity (R&S)
AUC: Area under the R&S Curve
MAR: Mean Average Reliability
[Amigó et al., SIGIR'13]
Clustering: Equivalent to BCubed Precision and Recall
Equivalent to MAP
Accuracy
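For clustering, Reliability & Sensitivity are equivalent to BCubed precision and recall. A minimal pure-Python sketch of BCubed, with a toy example (item and cluster names are illustrative, not from the RepLab data):

```python
def bcubed(pred, gold):
    """BCubed precision/recall for clusterings given as item -> cluster-id dicts."""
    def score(a, b):
        # For each item: of the items sharing its cluster in `a`,
        # the fraction that also share its cluster in `b`; averaged over items.
        total = 0.0
        for i in a:
            same_a = [j for j in a if a[j] == a[i]]
            correct = sum(1 for j in same_a if b[j] == b[i])
            total += correct / len(same_a)
        return total / len(a)
    precision = score(pred, gold)
    recall = score(gold, pred)
    return precision, recall

# Toy example: the system merges two gold topics into one cluster.
gold = {"t1": "A", "t2": "A", "t3": "B", "t4": "B"}
pred = {"t1": 1, "t2": 1, "t3": 1, "t4": 2}
p, r = bcubed(pred, gold)
```

Over-merging hurts BCubed precision (here p ≈ 0.67) while recall stays higher (r = 0.75), which is why R&S rewards systems that keep fine-grained topics apart.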
NED 2 - 1 MEX
KLM Meltdown
What are people saying about a given entity right now?
Is any topic potentially dangerous for reputation?
Need to quickly detect and continuously track potentially dangerous contents
Limited human resources
Online Reputation Expert
Online Reputation Monitoring
Machine
Learning
?
6 months
Random Sampling (RS)
Margin Sampling (MS)
Candidate Selection
Classification
Linear kernel Support Vector Machines (SVM)
Feature Representation
Select instances that may maximize classification performance with minimum effort.
Candidate instance is randomly sampled from the test set.
No informed prior on the instances.
Candidate instance is sampled based on the classification difficulty:
selecting instances where the classifier is less confident.
closest to the margin
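The two candidate selection strategies above can be sketched in a few lines; the toy score table stands in for a linear-kernel SVM's signed distance to the hyperplane (all names and values are illustrative, not the thesis implementation):

```python
import random

def margin_sample(candidates, decision_function):
    # Margin Sampling: pick the instance the classifier is least
    # confident about, i.e. closest to the decision boundary (|score| ~ 0).
    return min(candidates, key=lambda x: abs(decision_function(x)))

def random_sample(candidates, rng=random):
    # Random Sampling baseline: no informed prior on the instances.
    return rng.choice(candidates)

# Toy SVM-style decision scores for four unlabeled tweets.
scores = {"t1": 1.7, "t2": -0.05, "t3": 0.9, "t4": -1.2}
picked = margin_sample(list(scores), scores.__getitem__)
```

Here `t2` is picked: its score is closest to zero, so labeling it is expected to be most informative for the next training round.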
Bag of Words (BoW): tweet content + author

Term weighting: Binary Occurrence
1 if the term appears in the tweet
0 otherwise
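A minimal sketch of this feature representation; treating the author as an extra pseudo-token and the tiny vocabulary are assumptions made for illustration:

```python
def binary_bow(tweet, author, vocabulary):
    # Bag of Words over tweet content plus author, with binary term
    # weighting: 1 if the term appears, 0 otherwise.
    tokens = set(tweet.lower().split()) | {"@" + author.lower()}
    return [1 if term in tokens else 0 for term in vocabulary]

# Toy vocabulary; a real one would be built from the training stream.
vocab = ["apple", "launches", "pie", "@damiano10"]
vec = binary_bow("Apple launches new iPhone", "damiano10", vocab)
```

The resulting vector for this tweet is [1, 1, 0, 1]: "pie" is absent, and the author pseudo-token fires.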
Active Learning
Research Questions
Topic Detection
Filtering: Conclusions
Topic Detection: Conclusions
Lack of a Standard
Evaluation Framework
Problem not clearly defined in the state-of-the-art
WePS-3
RepLab 2012
RepLab 2013
RepLab 2014
Evaluation Campaigns

Entity-Based Topic Detection in Twitter
Bank of America
ATM flaw let man withdraw $1.5 Million- He gambled it all away - http://goo.gl/n4QD...
Man accidentally given 1.5 million dollars by
Bank of America
http://honda-tech.com/showthread.php?p=47507454…
Topic: ATM Withdraw Mistake
Cons Prod Strategy Manager at
Bank of America
(Jacksonville, FL) SAME_URL
Part Time 20 Hours Bartram Lake Village at
Bank of America
(Jacksonville, FL) SAME_URL
Topic: Vacancy
Entity:
INPUT
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
Research Questions
Learning Similarity Functions
Twitter Signals
@damiano10
12:10 AM - 11 Jul 2014
@SIGIR2014 Code and data available at
http://bit.ly/simFunctionsORM

#happyhacking

hashtag
author
timestamp
tweet content
URL
named user
bla bla bla ENTITY NAME bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
Tweets about a given entity
Pairwise Tweet Representation
d1
d2
d3
d4
d5
Pair       f1    f2    f3    f4    label
<d1, d2>   0.8   0.6   0.3   0     true
<d1, d3>   0.2   0.1   0.1   0     false
<d1, d4>   0.9   0.7   0.6   0.4   true
<d1, d5>   0     0.1   0.3   0.1   false
<d2, d3>   0.7   0.1   0.2   0.6   true
<d2, d4>   0.2   0.4   0.1   0.1   false
...
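Building these pairwise instances from gold topic annotations can be sketched as follows; the single Jaccard signal and the toy tweets are illustrative stand-ins for the full feature set:

```python
from itertools import combinations

def pairwise_instances(tweets, topic_of, signals):
    # Turn topic-annotated tweets into pairwise training instances:
    # one feature vector per pair, labeled true iff both tweets
    # belong to the same gold topic.
    instances = []
    for d1, d2 in combinations(tweets, 2):
        features = [f(d1, d2) for f in signals]
        label = topic_of[d1] == topic_of[d2]
        instances.append((features, label))
    return instances

def jaccard(d1, d2):
    # Term overlap signal (Jaccard similarity of token sets).
    a, b = set(d1.split()), set(d2.split())
    return len(a & b) / len(a | b)

tweets = ["apple launches iphone", "apple iphone event", "apple pie recipe"]
topic_of = dict(zip(tweets, ["product", "product", "food"]))
pairs = pairwise_instances(tweets, topic_of, [jaccard])
```

Note the quadratic blow-up: n tweets yield n(n-1)/2 pairs, which is what makes the learned-similarity formulation trainable from relatively few annotated topics.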
Learning
Hierarchical Agglomerative Clustering
Training Set
Binary
Classifier
Similarity Matrix
Tweets about the Target Entity
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
Topic 1
Topic 2
Topic 3
Do the tweets belong to the same topic?
Similarity Signals
Pair of Tweets
Term Overlap
Meta-data Overlap
Semantic Overlap
Time Proximity
maxPWA: maximal Pairwise Accuracy
Theoretical upper bound of the effectiveness of different feature combinations
For each classification instance, a perfect classifier would only listen to the features that give the right information
Machine
Learning
?
Similarity Signals
(Jaccard similarity, Lin's similarity)
(entity linking; tweets represented as bag of Wikipedia entities)
(author, named users, URLs, hashtags)
(milliseconds, hours, days)
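One illustrative value per signal family can be computed as below; the field names and the single representative per family are assumptions of this sketch, not the thesis feature set:

```python
def similarity_signals(t1, t2):
    # One toy signal per family: term overlap, meta-data overlap,
    # author match, and time proximity.
    def jacc(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    return {
        "term_overlap": jacc(set(t1["text"].lower().split()),
                             set(t2["text"].lower().split())),
        "hashtag_overlap": jacc(t1["hashtags"], t2["hashtags"]),
        "same_author": 1.0 if t1["author"] == t2["author"] else 0.0,
        "time_proximity_hours": abs(t1["ts"] - t2["ts"]) / 3600.0,
    }

t1 = {"text": "Apple launches new iPhone", "hashtags": {"#apple"},
      "author": "@a", "ts": 0}
t2 = {"text": "new iPhone launch event", "hashtags": {"#apple", "#iphone"},
      "author": "@b", "ts": 7200}
sig = similarity_signals(t1, t2)
```

Each signal family can be instantiated at several granularities (e.g. time proximity in milliseconds, hours, or days), which is where the feature analysis above comes in.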
Pearson correlation
of 0.93 with maxPWA
0.5=random
SVM results can be generalized to other ML algorithms
[Artiles et al., EMNLP'09]
[Meij et al., WSDM'12]
small but statistically significant improvement
SVM classifier's confidence as similarity metric
+ Hierarchical Agglomerative Clustering (HAC)
Topic Detection: Effect of Twitter Signals
To train, or not to train: that is the question
Effect of the Learning Process
~ Confidence of the tweet pair belonging to the same topic
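The pipeline above (classifier confidence as similarity, then HAC) can be sketched with a small average-linkage implementation; the confidence table and the cut threshold of 0.5 are toy assumptions:

```python
def hac(items, sim, threshold):
    # Hierarchical Agglomerative Clustering (average linkage) over a
    # learned pairwise similarity; merging stops once the best
    # cluster-pair similarity falls below the threshold.
    clusters = [[i] for i in items]
    def avg_sim(c1, c2):
        return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = avg_sim(clusters[i], clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] += clusters.pop(j)
    return clusters

# Toy confidences standing in for the SVM's probability that a
# tweet pair discusses the same topic.
conf = {frozenset(p): s for p, s in [
    (("d1", "d2"), 0.9), (("d1", "d3"), 0.1), (("d2", "d3"), 0.2)]}
topics = hac(["d1", "d2", "d3"], lambda a, b: conf[frozenset((a, b))], 0.5)
```

With these confidences, d1 and d2 merge into one topic and d3 stays alone; the threshold plays the role of the clustering cut-off tuned on training data.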
Evaluation Metrics
Reliability & Sensitivity (R&S)
AUC: Area under the R&S Curve
MAR: Mean Average Reliability
[Amigó et al., SIGIR'13]
Equivalent to BCubed Precision and Recall
Official metric used at RepLab
Results
Unsupervised
Supervised
Equivalent to MAP
Topic Detection Results
1
Topic Detection Results
Failure Analysis
Lower standard deviation for R, S and F than RepLab systems -> More robust across test cases
Hard Topics
Easy Topics
inter-annotator agreement: 0.48 F(R,S)
1
vs.


"Negative Opinion of an Owner"
"Bad Service"
"Hate-Opinions"
"Concern of Customers"
"Fans Tweeting"
"Looking Forward to Own a Car"
"Man Arrested for Racial Abuse during Capital One Cup Game"
"Qatar Selling Warrants"
"Dave Matthews Band at Wells Fargo Center"
organizational
event-oriented
"Calls to Condemn
Uganda's Politics"
1
LDA-based topic model that takes into account time and author distributions
Agglomerative clustering of semanticized tweets
(equivalent to semantic features)
RepLab 2013 Systems
6 months
Pairwise Tweet Representation
Term
Overlap
Author
Overlap
Time
Overlap
URL
Overlap
Twitter Signals are complementary to text

Machine
Learning
?
6 months
label=false
label=true
d1
d2
d3
similarity function,
not topics!
We have only used SVM, can we generalize our results to other classifiers?
=
Very close to
Inter-Annotator Agreement!
0.47 vs. 0.48 F(R,S)

The agreement score is not low, considering:
Strictness of the evaluation metric (R&S)
Difficulty of the task (clustering)
ORMA Demo

Preliminary Experiments
Entity Aspect Identification
Real-Time Summarization of Scheduled Events
q=“ferrari”

WP concept c, n-gram q

Wikification: Commonness Probability
Tweet -> Set of Wikipedia Concepts/Articles
Clustering: word overlap between tweets
Approach
Results
RepLab 2012
RepLab 2013
Step 1: Term Clustering
Learned similarity function (content-based, meta-data, time-aware features)

Hierarchical Agglomerative Clustering

Step 2: Tweet Clustering
Assigns tweets according to maximal term overlap


Translation: Spanish tweets linked to Spanish Wikipedia and then translated by following inter-lingual links.
Assumption: each topic can be represented by a set of keywords that allows the user to understand what the topic is about
Example: 300 tweets about
computers
jobs
e-reader
Goals
Identify scientific challenges in the ORM process
How can IR & Text Mining techniques help make the process more efficient?
How much of the problem can be solved automatically?
Acknowledgements
Contributions & Answers
Sparsity
Tweets Are Short (140 Chars)
Entities Reside in Twitter's Long Tail
Entity-Oriented: Fine-Grained Topics
Difficult to interpret in skewed classification scenarios.
Accuracy = # Correct / # Total
A system predicting that every paper is rejected for a conference has an accuracy equal to 1 minus the acceptance rate, but it is not informative.
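The point can be checked with a two-line computation (a 20% acceptance rate is assumed for illustration):

```python
# A degenerate classifier that rejects every paper: with a 20%
# acceptance rate it scores 0.8 accuracy while being useless.
labels = ["reject"] * 80 + ["accept"] * 20
predictions = ["reject"] * 100
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
```

This is why the skewed filtering scenario is evaluated with R&S rather than accuracy alone.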
Upper Bound of Filter Keywords
ORM Scenarios
Unknown-Entity
Known-Entity
"http://apple.com"

RQ 2: Can we use the notion of filter keywords effectively to solve the filtering task?
RQ 4: Where should we look for filter keywords in order to find them automatically?
RQ 7: When entity-specific training data is available, is it worth looking for filter keywords in external resources, or is it better to learn them automatically from the training data?
RQ 8: In an active learning scenario, what is the impact in terms of effectiveness of an informative sampling over a random sampling? How much of the (initial) annotation effort can be reduced by using active learning?
RQ 3: Can we generalize the idea of "filter keywords" to "cluster keywords", i.e., can we use it for topic detection?
RQ 5: Wikipedia is a knowledge base that is continuously updated, and can be a relevant source to discover filter keywords automatically. Are the topics discussed about an entity in Twitter somehow represented in Wikipedia?
RQ 6: Can Twitter signals be used to improve entity-specific topic detection?
RQ 9: Can previously annotated material be used to learn better topic detection models?
Formalization of the ORM problem from a scientific perspective
Cooperation with reputation experts
Collaboration in Evaluation Campaigns
ORM Scenarios: unknown-entity vs. known-entity
Vanessa Álvarez
Ana Pitart
Adolfo Corujo
Miguel Lucas (Acteo)

Prof. Maarten de Rijke
M.Hendrike Peetz
ILPS Group
Dr. Edgar Meij
Dr. Jorge Carrillo de Albornoz
Dr. Irina Chugur
Dr. Arkaitz Zubiaga
Tamara Martín
Dr. Víctor Fresno
Dr. Raquel Martínez
Dr. Laura Plaza
NLP & IR Group, LSI Department

Joint Work
My Contribution
Definition and formalization of the ORM tasks
Creation of test collections and annotation tools
Technical support to participants
Collaboration on:
Identifying Entity Aspects in Microblog Posts (poster)
D. Spina, E. Meij, M. de Rijke, A. Oghina, M.T. Bui, M. Breuss.
Proceedings of SIGIR'12, 2012.
Learning Similarity Functions for Topic Detection in Online Reputation Monitoring
D. Spina, J. Gonzalo, E. Amigó.
Proceedings of SIGIR'14, 2014.
Towards Real-Time Summarization of Scheduled Events from Twitter Streams (poster)
A. Zubiaga, D. Spina, E. Amigó, J. Gonzalo.
Proceedings of Hypertext'12, 2012.
ORMA: A Semi-Automatic Tool for Online Reputation Monitoring in Twitter
J. Carrillo de Albornoz, E. Amigó, D. Spina, J. Gonzalo.
Proceedings of ECIR'14, 2014.
UNED @ RepLab 2012: Monitoring Task
T. Martín, D. Spina, E. Amigó, J. Gonzalo
CLEF'12 Labs and Workshop Working Notes, 2012.
A Corpus for Entity Profiling in Microblog Posts
D. Spina, E. Meij, A. Oghina, M.T. Bui, M. Breuss, M. de Rijke
LREC 2012 Workshop on Language Engineering for Online Reputation Management, 2012.
UNED Online Reputation Monitoring Team at RepLab 2013
D. Spina, J. Carrillo de Albornoz, T. Martín, E. Amigó, J. Gonzalo, F. Giner
CLEF'13 Labs and Workshop Working Notes, 2013.
Towards an Active Learning System for Company Name Disambiguation in Microblog Streams
M. H. Peetz, D. Spina, M. de Rijke, J. Gonzalo
CLEF'13 Labs and Workshop Working Notes, 2013.
WePS-3 Evaluation Campaign: Overview of the Online Reputation Management Task
E. Amigó, J. Artiles, J. Gonzalo, D. Spina, B. Liu, A. Corujo.
CLEF'10 Labs and Workshop Working Notes, 2010.
Overview of RepLab 2013: Evaluating Online Reputation Monitoring Systems
E. Amigó, J. Carrillo de Albornoz, I. Chugur, A. Corujo, J. Gonzalo, T. Martín, E. Meij, M. de Rijke, D. Spina
Proceedings of CLEF'13, 2013.
Overview of RepLab 2014: Author Profiling and Reputation Dimensions for Online Reputation Management
E. Amigó, J. Carrillo de Albornoz, I. Chugur, A. Corujo, J. Gonzalo, E. Meij, M. de Rijke, D. Spina
Proceedings of CLEF'14, 2014.
Entity-Specific
Training Data
Long Texts
Non-Entity Oriented
Trending Topics
(e.g., Knowledge Base Population at TAC,
WePS)
(e.g., entity linking,
TREC Microblog Track)
Representative URL
Alert
Unknown-Entity Scenario
Known-Entity Scenario
WePS-3
RepLab 2012
Filtering Task
~50 training/test entities
~ 43k labeled tweets (crowdsourcing)
Filtering, Topic Detection
Tasks + others
6 trial entities, 31 test entities
~ 8.5k labeled tweets by experts
RepLab 2013
Filtering, Topic Detection
Tasks + others
61 entities, 4 domains
~ 45k labeled tweets for training
~ 95k labeled tweets for testing
13 annotators supervised by experts
Automatic Discovery of Filter Keywords
No training data for the target entity
Vocabulary
Feature Analysis
Term specificity works: Salient terms in the set of tweets in the entity stream (vs. corpus)
keywords vs. skip terms
positive vs. negative filter keywords
Using Filter Keywords
Filter keywords do not cover all the tweets
Tweets Covered by Filter Keywords
Tweets for target entity
e
Propagation Step
Most frequent class: winner-takes-all, winner-takes-remainder

Bootstrapping (BoW classifier): tweets covered by positive/negative filter keywords are used to train a binary classifier.
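The bootstrapping step can be sketched as follows; the keyword sets and tweets are toy values, and the real system would feed `train` into the BoW classifier described earlier:

```python
def bootstrap_labels(tweets, positive_kw, negative_kw):
    # Propagation step: tweets covered by positive/negative filter
    # keywords become (pseudo-)labeled training instances; the
    # uncovered remainder is left for the trained classifier to label.
    train, unlabeled = [], []
    for t in tweets:
        tokens = set(t.lower().split())
        if tokens & positive_kw:
            train.append((t, True))
        elif tokens & negative_kw:
            train.append((t, False))
        else:
            unlabeled.append(t)
    return train, unlabeled

tweets = ["apple launches iphone", "apple pie recipe", "apple stock rises"]
train, rest = bootstrap_labels(tweets, {"iphone"}, {"pie"})
```

Here the first two tweets seed the classifier (related/unrelated) and the third is left to be classified, which is how keyword coverage gaps are closed.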
Training Data
Test Data
BoW Classifier
Filtering in the Known-Entity Scenario
BoW Classifier vs. Filter Keywords
WePS-3
RepLab 2013
Known-entity scenario simulated with 10-fold cross-validation over the test set
14% higher
27% higher

Entity-specific training data helps to detect filter keywords

Known-entity scenario: the BoW classifier performs significantly better (less error propagation)
Active Learning: Results
RepLab 2013 Data
Simulated Feedback: (test labels)
BoW Classifier, Linear kernel SVM
Accuracy
F(R,S)
Random Sampling (RS)
Margin Sampling (MS)
Cost Reduction of the Initial Training Set
SIGIR'12
HT'12
>0.9
small room for improvement
Margin Sampling (MS) vs. Random Sampling (RS)
10% RS = 2% MS
Red and yellow correspond to lower and higher F(R,S)
MS needs less training data to obtain competitive F(R,S) scores
Inspecting 10% test data: initial cost can be reduced by 90%
Filter Keywords
Known-Entity Scenario
Active Learning
Manual vs. Oracle Keywords
BoW Classifier is cost-effective, when enough entity-oriented training data is available (~700 tweets)
Margin Sampling performs significantly better than Random Sampling
Initial training cost can be reduced by 90% after inspecting only 10% of test data.
COMMONNESS(Ferrari, "ferrari")=4/7
COMMONNESS(Scuderia Ferrari, "ferrari")=2/7
COMMONNESS(Enzo Ferrari, "ferrari")=1/7
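The commonness probability divides, among all Wikipedia links whose anchor text is the query n-gram, the count pointing to each concept by the total. A minimal sketch reproducing the 4/7, 2/7, 1/7 example above (the link counts are toy values chosen to match the ratios):

```python
from collections import Counter

def commonness(anchor_targets, q):
    # COMMONNESS(c, q): of all Wikipedia links whose anchor text is q,
    # the fraction pointing to concept c.
    counts = Counter(anchor_targets[q])
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

# Toy anchor->target link list: 7 links with anchor text "ferrari".
anchor_targets = {"ferrari": ["Ferrari"] * 4 + ["Scuderia Ferrari"] * 2
                  + ["Enzo Ferrari"]}
probs = commonness(anchor_targets, "ferrari")
```

Wikification then represents each tweet as the set of its most probable concepts, so "ferrari" maps to the car maker unless context says otherwise.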
Oracle Cluster Keywords:
Coverage of clustering relationships detected by perfect keywords
Cluster Keywords Upper Bound

Approach
Competitive w.r.t. RepLab systems

Strong baseline:
Simple HAC over text similarity

Entity-Oriented Problem
Topics on the Twitter's long tail
Sparsity
Preliminary Experiments
Term specificity
Cluster Keywords (extension of the notion of filter keywords)
Tweet Wikified Clustering (use of Wikipedia as external resource)

Competitive approaches
Simple, Similar/Better Performance
HAC over text similarity:
Learning Similarity Functions
Twitter Signals
Supervised Approach
Statistically significant improvement
Results are close to inter-annotator agreement
Learning overlap, not vocabulary
Terms occurring in the entity's website/Wikipedia
Expanded by
Co-occurrence
RESULTS
Accuracy: 0.85 (vs. 0.73)
F(R,S): 0.62 (vs. 0.49)
Does Margin Sampling improve effectiveness over Random Sampling?


Using Active Learning, how much can the cost of training the initial model be reduced?
Web vs. Twitter:
Vocabulary Gap
Automatic Filter Keywords
Most useful features
Term specificity to the tweet stream
Association with the company's website
Co-occurrence expansion helps to remove false zeros (sparseness)

Filter Keywords can be used as seeds to achieve a competitive accuracy in the unknown-entity scenario
Useful technique for keeping the filtering model updated

Keywords
External Resources
RQ 2:Can we use the notion of filter keywords effectively to solve the filtering task?
RQ 4: Where should we look for filter keywords in order to find them automatically?
RQ 3:Can we generalize the idea of "filter keywords" to "cluster keywords", i.e., can we use it for topic detection?
RQ 5: Are the topics discussed about an entity in Twitter represented somehow in Wikipedia?
RQ 6: Can Twitter signals be used to improve entity-specific topic detection?
RQ 9: Can previously annotated material be used to learn better topic detection models?
Training Data
Filter Keywords can be used as seeds to achieve a competitive accuracy in the unknown-entity scenario (>0.7 accuracy)
Manual vs. Oracle Keywords
5 optimal keywords -> 30% recall
~10 manual keywords -> 15% recall, 0.85 accuracy
Challenging in the unknown-entity scenario
BoW Classifier is cost-effective, when enough entity-oriented training data is available (~700 tweets)
For instance, initial training cost can be reduced by 90% after inspecting only 10% of test data.
Most useful features
Term specificity to the tweet stream
Association with the company's website
Co-occurrence expansion helps to remove false zeros (sparseness)

RQ 7: When entity-specific training data is available, is it worth looking for filter keywords in external resources or is it better to learn them automatically from the training data?
RQ 8: In an active learning scenario, what is the impact in terms of effectiveness of an informative sampling over a random sampling?
How much of the (initial) annotation effort can be reduced by using active learning?
Margin Sampling performs significantly better than Random Sampling
MS improvement w.r.t. RS
10% RS = 2% MS
Significantly.
Twitter, company's website, Wikipedia, ODP
Keywords
External Resources
Training Data
Cluster Keywords approach is competitive w.r.t. the state-of-the-art

Oracle cluster keywords -> small room for improvement

Simple agglomerative clustering over text similarity has similar/better performance
Linking tweets to Wikipedia articles (wikification) allows identifying shared concepts or entities between semantically related tweets
Wikified tweet clustering performs similarly to or better than RepLab systems
Twitter signals complement the tweet text and can be effectively used to learn similarity functions
Authors, timestamps, hashtags, etc. combined with term overlap yield a small but statistically significant improvement over text-only features
Yes, similarity functions can be learned effectively from manually annotated topics.
Our best results are very close to inter-annotator agreement in RepLab'13 dataset
FPU
Combining all signals gets highest accuracy
WePS-3 Dataset
RepLab Datasets (2012, 2013, 2014)
A Corpus for Entity Profiling
Learning Similarity Functions (Code)
Annotation Tools
Twitter
Reputation Management
To take care of the public image of a brand/organization/public figure
Traditional Media
My pet drank Coke and died! #killercola
Automatic tools for ORM
Online Social Media
Collective Effort
No Match!
Run                               Accuracy   F(R,S)   Rank
Best RepLab 2013 System           0.91       0.49     1
BoW Classifier                    0.86       0.34     19
RepLab 2013 Baseline              0.87       0.32     21
Filter Keywords (known-entity)    0.84       0.25     42
Filter Keywords (unknown-entity)  0.50       0.14     61
~0.7 F(R,S)
>0.8 F(R,S)
Clustering Problem
Opportunity for information technologies to improve the ORM process

bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
Topic 1
Topic 2
Topic 3
Frequent terms in the tweets
Hashtags
Keywords
Terms occurring in the entity's website
or in Wikipedia articles linking to entity's website
Positive Filter Keywords
Some positive keywords might not be present in external resources
But some of their most frequently co-occurring terms might appear
Co-occurrence
Expansion
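Co-occurrence expansion can be sketched as below; the seed keyword, tweets, and `top_n` cutoff are toy assumptions for illustration:

```python
from collections import Counter

def expand_by_cooccurrence(seed_keywords, tweets, top_n=2):
    # Expand seed filter keywords with the terms that most often
    # co-occur with them in the entity's tweet stream, recovering
    # positive keywords absent from external resources.
    cooc = Counter()
    for t in tweets:
        tokens = set(t.lower().split())
        if tokens & seed_keywords:
            cooc.update(tokens - seed_keywords)
    return seed_keywords | {w for w, _ in cooc.most_common(top_n)}

tweets = ["iphone launch by apple", "new iphone apple event",
          "apple pie recipe"]
expanded = expand_by_cooccurrence({"iphone"}, tweets)
```

Here "apple" is pulled in because it co-occurs with the seed "iphone" in two tweets; this is the mechanism that removes false zeros caused by sparseness.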
Pair       f1    f2    f3    f4    confidence(true)
<d1, d2>   0.8   0.6   0.3   0     0.8
<d1, d3>   0.2   0.1   0.1   0     0.1
<d1, d4>   0.9   0.7   0.6   0.4   0.6
<d1, d5>   0     0.1   0.3   0.1   0.5
<d2, d3>   0.7   0.1   0.2   0.6   0.9
<d2, d4>   0.2   0.4   0.1   0.1   0.3
...
by Felisa Verdejo (Donosti, SEPLN'09)
by Jorge Carrillo de Albornoz
(Roma, RepLab 2012)
Madrid, 2013
Istanbul, LREC 2012
System                  AUC    MAR
Text Similarity + HAC   0.40   0.59
All Features + HAC      0.41   0.61*
Unrealistic to look for improvements under these conditions
Lower standard deviation for R,S, and F than RepLab systems
1
More robust
across test cases
Redefine the task
"http://apple.com"

Representative URL
Alert
Clustering Problem
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
bla bla bla ENTITY bla bla bla
Topic 1
Topic 2
Topic 3
Topic Detection
Look at event-oriented topics
ORM is still an
Open Problem
Interactive Evaluation with Users


Active Learning
Future Directions
Linking to other Social Media sources
Event-Oriented Topics?