Online Reputation Monitoring: Keyword-based Approaches for Filtering and Sub-Topic Detection in Microblog Streams

Presentation about the state of my PhD thesis. The talk was given at the Second Doctoral Consortium at LSI-UNED on June 18th, 2013, in Madrid, Spain.
by Damiano Spina on 3 June 2014


Transcript of Online Reputation Monitoring: Keyword-based Approaches for Filtering and Sub-Topic Detection in Microblog Streams

Term Clustering: Lesson learned
High Reliability (BCubed Precision) scores -> there exists a terminology that describes each topic

Low Sensitivity (BCubed Recall) scores -> it's hard to find!
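
For reference, a minimal sketch of the BCubed metrics behind these scores, assuming system and gold map each tweet to a cluster/topic id (names and data are illustrative):

    # BCubed precision/recall: per-item purity of its system cluster vs.
    # per-item completeness of its gold category, averaged over all items.
    def bcubed(system, gold):
        items = list(system)
        def precision(i):
            cluster = [j for j in items if system[j] == system[i]]
            return sum(gold[j] == gold[i] for j in cluster) / len(cluster)
        def recall(i):
            category = [j for j in items if gold[j] == gold[i]]
            return sum(system[j] == system[i] for j in category) / len(category)
        avg = lambda f: sum(f(i) for i in items) / len(items)
        return avg(precision), avg(recall)

    print(bcubed({"t1": 1, "t2": 1, "t3": 2}, {"t1": "a", "t2": "a", "t3": "b"}))  # (1.0, 1.0)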

2010-2011
Aspect Identification
Company Name Disambiguation (Filtering)
Twitter
Online Reputation Monitoring:
Keyword-based Approaches for
Filtering and Sub-Topic Detection in
Microblog Streams

Damiano Spina

UNED NLP & IR Group

@damiano10

Challenging for NLP:
little context
(1 tweet = 140 chars, ~15 words)
non-standard, SMS-style language
Most popular microblogging service
Online Reputation Management
Profiling
Monitoring
Relevant Source for ORM
Dynamics, Real-time information streams
Topic Detection
@ RepLab 2012

Filter keywords strategy
Fingerprint representation
WePS-3 Online Reputation Management Task
CLEF 2010
RepLab
Aspects: "hot topics" discussed in a certain time frame referred to an entity of interest.
Events, key people, products, competitors, etc.
Corpus aspects [+opinion targets]
Available at http://bitly.com/profilingTwitter
– Public image of an entity in Online Media


– Entity = { brand, organization, company, person, product, ... }
Two Scenarios:
Clustering Sub-Task

Is the tweet related to the company?
Profiling
Filtering/Company name disambiguation
Polarity for Reputation
Monitoring
Topic and Alert detection
Trial data
6 companies
~300 annotated tweets
~30k tweets (background)
Test data
31 companies
~300 annotated tweets
~50k background tweets
Multilingual: English & Spanish
Term Clustering
Hypothesis: Each cluster/topic has a terminology that describes it
What do people tweet about @X?
Ranking Task: produce a ranked list of terms (1., 2., 3., ...)
4 Methods
TF.IDF
LLR: Log-likelihood Ratio
PLM: Parsimonious Language Models
OO: Opinion-oriented. Opinion target extraction using topic-specific subjective lexicons
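
As an illustration of the simplest of these methods, a minimal TF.IDF sketch that ranks the terms of an entity's pseudo-document against a background collection (tokenization, names and data are simplified assumptions, not the exact implementation):

    import math
    from collections import Counter

    def tfidf_ranking(entity_tweets, background_docs, top_k=10):
        # Term frequency over the entity's pseudo-document (all its tweets).
        tf = Counter(w for t in entity_tweets for w in t.lower().split())
        # Document frequency over the background collection.
        df = Counter()
        for doc in background_docs:
            df.update(set(doc.lower().split()))
        n = len(background_docs)
        scores = {w: tf[w] * math.log(n / (1 + df[w])) for w in tf}
        return sorted(scores, key=scores.get, reverse=True)[:top_k]

    # What do people tweet about the entity?
    print(tfidf_ranking(
        ["ferrari unveils its new f1 car", "the new ferrari f1 car looks fast"],
        ["i love pizza", "a new phone was released today"]))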
Pooling methodology
Top 10 terms
Dataset
WePS-3 ORM corpus
94 entities, ~17k tweets
~175 "related" tweets/entity
Real-Time Summarization of Scheduled Events
Public debates, Keynotes, etc.
Can we summarize streams of tweets in such a way that:
Users receive a reduced stream that they can follow?
Users do not miss any key sub-event that occurred during the event?
Copa America 2011 (July 1-26, 2011):
26 soccer games.
11k-70k tweets per game.
Tweets written in 30 languages.
Yahoo! Sports summaries as reference (sub-events)
Event summarization: continuous, real-time, detecting alerts
Online reputation: static, at a particular point in time
Example: 300 tweets containing "fujitsu" -> topics: computers, jobs, e-reader
An experiment: Soccer matches
Understand the problem
Research challenges
Filtering
Topic Detection (+alerts)

Provide benchmarks
Reusable test collections
Evaluation campaigns

Explore the challenges
Propose (keyword-based) solutions
How much of the problem can be solved automatically?
@ CLEF
99 companies
52 training set
47 test set
~500 tweets/entity
Manually annotated with Mechanical Turk
2011-2012
2013-2014
2012-2013
Results
Thesis topics
Supervisors:
Dr. Julio Gonzalo
Dr. Enrique Amigó

Identifying Entity Aspects in Microblog Posts.
D. Spina, E. Meij, M. de Rijke, A. Oghina, M.T. Bui, M. Breuss
SIGIR 2012 (poster)
Towards Real-Time Summarization of Scheduled Events from Twitter Streams
A. Zubiaga, D. Spina, E. Amigó, J. Gonzalo.
Hypertext 2012 (poster)
Filter Keywords and Majority Class Strategies for Company Name Disambiguation in Twitter.
D. Spina, E. Amigó, J. Gonzalo.
CLEF 2011
Filter keywords and majority class strategies for company name disambiguation in Twitter.
D. Spina.
Master's thesis, UNED, 2011.
Supervisors: J. Gonzalo, E. Amigó.
Sub-event Detection
Tweet selection
We compare 2 approaches:
Increase: considering only sudden increases with respect to the recent tweeting activity (Zhao & Zhong, 2011)
Outliers: also considering the previous activity seen during a game, so that the system learns from the evolution of the audience
Detection rule: a sub-event is flagged when the tweet rate exceeds 90% of all previously seen tweeting rates
Two methods
TF: number of occurrences within the sub-event
KLD(current || previous)
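
A minimal sketch of the Outliers rule and the KLD-based term scoring (window size, smoothing and tokenization are assumptions, not the exact system):

    import math

    def detect_subevents(rates):
        # Flag window i when its tweet rate exceeds 90% of all previous rates.
        seen, flagged = [], []
        for i, rate in enumerate(rates):
            if seen and sum(r < rate for r in seen) / len(seen) > 0.9:
                flagged.append(i)
            seen.append(rate)
        return flagged

    def kld_terms(current, previous, top_k=5):
        # Rank terms of the current window (a dict of counts) by their
        # contribution to KLD(current || previous); add-one smoothing assumed.
        cur_n, prev_n = sum(current.values()), sum(previous.values())
        def contribution(w):
            p = current[w] / cur_n
            q = (previous.get(w, 0) + 1) / (prev_n + len(current))
            return p * math.log(p / q)
        return sorted(current, key=contribution, reverse=True)[:top_k]

    print(detect_subevents([10, 12, 11, 9, 80, 10, 95]))  # -> [4, 6]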
Real-Time Summarization of Scheduled Events: Lesson learned
Simple text analysis methods such as KLD generate accurate summaries
Precision and recall values > 80% (100% for key sub-events)
In real time, as the game is being played.
Without the need for external data.

Is this approach applicable to the Online Reputation Monitoring scenario?

Evaluation
Manually inspecting summaries
Each tweet:
correct (compared to Y! Sports reference)
novel
noisy
Three languages:
Spanish (es)
English (en)
Portuguese (pt)
Dataset
Recall
Precision
How to combine them (i.e. define weights)?
machine learning
using trial data (6 entities)

positive instance = pair of terms with purity > 0.9 (i.e., 90% of their co-occurrences happen in tweets belonging to the same cluster)
negative instance otherwise

confidence(positive class) ~ similarity function
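
A minimal sketch of this idea with a toy classifier (the two features and the choice of logistic regression are illustrative assumptions; the actual runs combine content-based and meta-data features):

    # Learn a term-pair similarity: pairs with purity > 0.9 are positives,
    # and the classifier's confidence in the positive class is the similarity.
    from sklearn.linear_model import LogisticRegression

    X_train = [[0.8, 0.1], [0.9, 0.05], [0.1, 0.7], [0.2, 0.9]]  # toy pair features
    y_train = [1, 1, 0, 0]                                       # 1 <=> purity > 0.9

    clf = LogisticRegression().fit(X_train, y_train)

    def similarity(pair_features):
        # Probability of the "same cluster" class, fed to HAC as similarity.
        return clf.predict_proba([pair_features])[0][1]

    print(similarity([0.85, 0.08]))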
Systems results
Dataset
Dataset
Annotations provided by experts!
LLORENTE&CUENCA
Online Communications consultancy
Tasks
1. Build a co-occurrence graph
Nodes: Terms
High divergence between entity stream and other streams
(pseudo-document TF.IDF, KLD, Log-likelihood)
Edges: relation of co-occurrence in different dimensions
Terms in the tweets
Meta-data (URLs, hashtags, named users, authors, timestamps)
2. Find clusters of terms
Hierarchical Agglomerative Clustering (HAC)
standard
Cut-off/Threshold:
Empirically defined using trial data
3. Assign tweets to clusters
For each tweet, find the closest term cluster (the one containing the largest number of the tweet's terms)
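
A minimal sketch of step 3, assuming steps 1-2 have already produced the term clusters (tokenization is simplified):

    # Assign each tweet to the term cluster containing the largest number
    # of the tweet's terms (ties broken arbitrarily here).
    def assign_tweet(tweet, term_clusters):
        words = set(tweet.lower().split())
        return max(range(len(term_clusters)),
                   key=lambda i: len(words & term_clusters[i]))

    clusters = [{"ipads", "ios", "testing"}, {"jobs", "hiring", "careers"}]
    print(assign_tweet("Server logs show Apple testing iPads with iOS 6", clusters))  # -> 0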
Co-chairs:
Arkaitz Zubiaga (City University of New York, USA)
Damiano Spina (UNED, Spain)
Maarten de Rijke (University of Amsterdam, The Netherlands)
Markus Strohmaier (Graz University of Technology, Austria)
Mor Naaman (Rutgers University, USA)
2012
2013
Monitoring Scenario
Filtering
Polarity for Reputation
Topic Detection
Alert Detection

Full Task: Filtering + Topic Detection + Alert Detection
Dataset
Discovering Filter Keywords for Company Name Disambiguation in Twitter
D. Spina, J. Gonzalo, E. Amigó
Expert Systems With Applications, vol. 40, no. 12, 2013
Commonness of Wikipedia concept c for n-gram q (e.g. q = "ferrari"):
commonness(c, q) = |L_{q,c}| / Σ_{c'} |L_{q,c'}|, where L_{q,c} is the set of links with anchor text q pointing to c
Clustering strategies:
1. Wikified representation
[Apple Inc., Software testing, Ipad, IOS ]
Server logs show Apple testing iPads with iOS 6 http://bit.ly/...
Tweet -> List of Wikipedia articles



For each n-gram, select the most likely Wikipedia article
based on inter-Wikipedia links
Commonness probability [Meij et al.,WSDM’12]
English + Spanish Wikipedia dumps
Spanish concepts are translated using inter-language links
Clustering: Tweets sharing x% of identified Wikipedia articles are grouped together
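
A minimal sketch of the linking step, assuming a precomputed anchor-text dictionary extracted from the Wikipedia dumps (the counts below are made up):

    # commonness(c, q) = |L_{q,c}| / sum_c' |L_{q,c'}|: for each n-gram, pick
    # the Wikipedia article it most often links to as anchor text.
    ANCHOR_COUNTS = {  # toy link counts per anchor text (assumption)
        "ferrari": {"Ferrari": 300, "Scuderia Ferrari": 120, "Enzo Ferrari": 40},
    }

    def link(ngram):
        counts = ANCHOR_COUNTS.get(ngram.lower())
        if not counts:
            return None
        article = max(counts, key=counts.get)
        return article, counts[article] / sum(counts.values())  # (concept, commonness)

    print(link("ferrari"))  # -> ('Ferrari', 0.652...)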
Clustering strategies:
1. Wikified representation
Two-step approach
Term clustering
Learned similarity function
Content-based features (TF, TFIDF, KLD, Levenshtein distance)
Meta-data features (named users, hashtags, URLs, authors, timestamps)
Hierarchical Agglomerative Clustering
Cut-off learned from trial data
Tweet clustering
Assigns tweets according to maximal term overlap
Clustering strategies:
2. Term clustering
little room for improvement!
RepLab 2012 - Priority Analysis
tweet polarity: negative / neutral / positive
topic polarity: negative / neutral / positive
novelty
centrality
alert vs. not alert
Motivation
RepLab 2012 - Clustering Analysis
Concluding
Exploring
Consolidating
M.Sc. thesis
Goals
Publications
Filtering: Lesson learned
Cluster the most recent tweets
thematically (topics)
ORM LiMoSINe Demonstrator
13 annotators
Co-chairs:
Arkaitz Zubiaga (City University of New York, USA)
Damiano Spina (UNED, Spain)
Maarten de Rijke (University of Amsterdam, The Netherlands)
Markus Strohmaier (Graz University of Technology, Austria)
- Term specificity is the most useful feature to detect filter keywords (either positive or negative)
- Presence in the Wikipedia page is the most useful feature to discriminate positive keywords
- Filter Keywords strategy (semi-supervised): 0.73 accuracy, 0.27 F(R,S)
- 10-fold cross-validation (supervised upper bound): 0.85 accuracy, 0.30 F(R,S)
- Oracle keywords: 5 keywords -> 30% coverage
- Vocabulary gap between Twitter and the Web (ODP, Wikipedia, company's domain)
[Figure: Tweets over Time (Monthly Monitoring Report for Repsol); x-axis: time, y-axis: tweet density, grouped by topic.]
Topics are not sequential!
Aspect Identification: Lesson learned
- Term specificity (e.g. TF.IDF) is a strong baseline for the aspect identification task, significantly outperforming opinion-oriented methods.
- 61 Companies
- Overlapped training and test cases
(i.e. same companies, different tweets)
- ~750 tweets/training case
- ~1500 tweets/test case
- [500, 50k] Background tweets/company
Working plan
July 8th:
RepLab 2012 (monitoring)
Oct 2012-Jan 2013:
Lessons learned from RepLab 2012
Study of factors for priority
Submission to SIGIR
Feb-May 2013:
Aspects identification on a larger dataset
considering n-grams
How do aspects evolve over time?
Submission to CIKM
June 2013:
RepLab 2013
Working plan
Baseline = HAC + Jaccard similarity
Oracle = Tweets sharing an oracle keyword are grouped together.
Step n = considering the n best (purity, then coverage) oracle keywords.
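
For reference, the baseline's Jaccard similarity between two tweets' term sets (a minimal sketch; tokenization is simplified):

    # Jaccard similarity over term sets, the input to the baseline HAC.
    def jaccard(tweet_a, tweet_b):
        a, b = set(tweet_a.lower().split()), set(tweet_b.lower().split())
        return len(a & b) / len(a | b) if a or b else 0.0

    print(jaccard("apple testing ipads", "apple ipads with ios"))  # -> 0.4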

- Is the size of the collection too small for keyword-based approaches?
- Is the task extremely difficult?
Towards a Learning-to-Rank System for Alert Detection
alert vs. mildly important vs. unimportant
Manually annotated variables
Alert Detection: Lesson learned
(preliminary experiment)
Even using manual annotations for centrality, novelty and polarity, alerts are difficult to detect automatically (reasonable precision but low recall).
Let's see how it goes at RepLab 2013...
Topic Detection
Filtering
Two Approaches:
- Filter Keywords (M.Sc. thesis)
- Active Learning (joint submission with UvA)

Same approaches as in RepLab 2012
- Term Clustering
- Wikified Tweet Clustering

Results
Wikified Tweet Clustering
Results
Two approaches
RepLab 2013 Participation
Full Task
Filtering + Topic Detection + Alert Detection (Priority)
Best run in terms of F(R,S)
Best run
RepLab 2013 Annotation Tool

RepLab 2013 Participation (ongoing)
July 2013: Analysis of the results (and fixing)
Sep-Oct '13: Active Learning (ECIR paper)
Sep-Nov'13: RepLab 2013 Collection: Log analysis of the annotation process


Lessons learned from RepLab 2013 -> SIGIR 2014?
Jan'14: Thesis writing
Jun'14: PhD defense
annotations, logs, surveys
better understanding of the problems in ORM
RepLab 2013 Participation: Lesson learned
RepLab Annotation Tool
- Wikified tweets + pseudo TF.IDF
- Naïve Bayes
- Candidate selection = argmax |conf(related) - conf(unrelated)|
- 15 iterations, 1 tweet/iteration => 1% of the test data manually labeled
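
A minimal sketch of this loop (features, the oracle and the seed labels are stand-ins; the candidate selection follows the rule stated above):

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def active_learning(X_pool, oracle, n_iter=15, seed_idx=(0, 1), seed_y=(1, 0)):
        labeled, y = list(seed_idx), list(seed_y)
        clf = MultinomialNB()
        for _ in range(n_iter):
            clf.fit(X_pool[labeled], y)
            conf = clf.predict_proba(X_pool)       # columns: [unrelated, related]
            margin = np.abs(conf[:, 1] - conf[:, 0])
            margin[labeled] = -1.0                 # never re-pick a labeled tweet
            pick = int(np.argmax(margin))          # selection rule from the slide
            labeled.append(pick)
            y.append(oracle(pick))                 # one manual annotation per step
        return clf

    # Toy usage: 6 tweets as term-count vectors; the oracle plays the annotator.
    X = np.array([[3, 0, 1], [0, 2, 0], [1, 1, 0], [2, 0, 2], [0, 3, 1], [1, 0, 0]])
    model = active_learning(X, oracle=lambda i: i % 2, n_iter=3)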