Twitter Topic Modeling

by Roberto di Lallo on 24 February 2014

Transcript of Twitter Topic Modeling

Twitter Spout
Connects to Twitter through its API (library used: twitter4j).
Emits a stream of "status" objects (containing text, geolocation, language, etc.).
Language Bolt
The first Bolt receives a stream of statuses, extracts the language of each one, and emits a stream containing only the statuses written in English.
Text Bolt
Receives a stream of English-language statuses and extracts the text of each one. It therefore emits a stream of English tweets.
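The Language Bolt and Text Bolt steps can be sketched in plain Java, outside Storm. This is only an illustration of the filtering and extraction logic: the `Status` class below is a hypothetical stand-in with two fields, not the twitter4j type, and the `"en"` language code is an assumption about how English is tagged.

```java
import java.util.ArrayList;
import java.util.List;

public class TweetPipelineSketch {

    /** Hypothetical stand-in for a status object: only the two fields used here. */
    public static final class Status {
        final String lang;
        final String text;

        public Status(String lang, String text) {
            this.lang = lang;
            this.text = text;
        }
    }

    /** Language Bolt step: keep only statuses whose language code is "en". */
    public static List<Status> keepEnglish(List<Status> statuses) {
        List<Status> english = new ArrayList<>();
        for (Status s : statuses) {
            if ("en".equals(s.lang)) {
                english.add(s);
            }
        }
        return english;
    }

    /** Text Bolt step: reduce each status to its text. */
    public static List<String> extractText(List<Status> statuses) {
        List<String> texts = new ArrayList<>();
        for (Status s : statuses) {
            texts.add(s.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        List<Status> incoming = new ArrayList<>();
        incoming.add(new Status("en", "Storm makes stream processing fun"));
        incoming.add(new Status("it", "Grazie per l'attenzione"));
        // Only the English tweet text survives both steps.
        System.out.println(extractText(keepEnglish(incoming)));
    }
}
```

In a real topology each method body would sit inside a bolt and handle one tuple at a time instead of a whole list.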
Mallet Bolt
This Bolt receives a stream of tweets and runs on them the LDA algorithm implemented in Mallet, which extracts two topics for each tweet.
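One way to read "two topics for each tweet" is to take the two most probable entries of the per-tweet topic distribution that an LDA model produces. The sketch below assumes that distribution is already available as a `double[]`; it does not call Mallet, and the method name `topTopics` and the sample values are purely illustrative.

```java
import java.util.Arrays;
import java.util.Comparator;

public class TopTopicsSketch {

    /**
     * Returns the indices of the k most probable topics, most probable first.
     * topicProbabilities[i] is the weight the model assigns to topic i for one tweet.
     */
    public static int[] topTopics(double[] topicProbabilities, int k) {
        Integer[] order = new Integer[topicProbabilities.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort topic indices by descending probability.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -topicProbabilities[i]));
        int n = Math.min(k, order.length);
        int[] top = new int[n];
        for (int i = 0; i < n; i++) {
            top[i] = order[i];
        }
        return top;
    }

    public static void main(String[] args) {
        double[] distribution = {0.10, 0.55, 0.05, 0.30}; // made-up values for one tweet
        System.out.println(Arrays.toString(topTopics(distribution, 2))); // prints [1, 3]
    }
}
```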
On-line processing
The bolts yield one or more topics for each tweet. The tuples <tweet, topic1, topic2, ..., topicN> are written to a separate file for each algorithm used.
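A minimal sketch of how such tuples could be appended to one file per algorithm. The tab-separated layout, the `.tsv` file naming and the method names are assumptions made for illustration; the slides do not specify the actual file format.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TopicFileWriterSketch {

    /** Builds one tab-separated record: tweet, topic1, ..., topicN. */
    public static String formatRecord(String tweet, List<String> topics) {
        // Tabs and newlines inside the tweet would break the layout, so flatten them.
        StringBuilder line = new StringBuilder(tweet.replace('\t', ' ').replace('\n', ' '));
        for (String topic : topics) {
            line.append('\t').append(topic);
        }
        return line.toString();
    }

    /** Appends the record to <algorithm>.tsv inside outputDir, creating the file if needed. */
    public static Path append(Path outputDir, String algorithm, String tweet, List<String> topics)
            throws IOException {
        Path file = outputDir.resolve(algorithm + ".tsv");
        Files.write(file,
                Collections.singletonList(formatRecord(tweet, topics)),
                StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("topics");
        Path file = append(dir, "mallet", "Storm is fun", Arrays.asList("storm", "streaming"));
        System.out.println(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

Keeping one file per algorithm means the outputs can later be lined up tweet by tweet when the tools are compared.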
Efficiency Analysis
A questionnaire was created on Google Drive containing, for each tweet, the topics proposed by the various tools.
The questionnaire was then given to a heterogeneous group of people.
The results were analyzed to assess the efficiency of the algorithms used.
Technologies used
APACHE STORM
Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!
TWITTER4J
Twitter4J is an unofficial Java library for the Twitter API.
With Twitter4J, you can easily integrate your Java application with the Twitter service.
MALLET
Topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings.
Thank you for your attention
Wikipedia Miner Bolt
This Bolt receives a stream of tweets and runs the Wikipedia-Miner algorithm on them, which extracts topics based mainly on the contents of Wikipedia.
DBpedia Spotlight Bolt
This Bolt receives a stream of tweets and runs the DBpedia Spotlight algorithm on them, which among other things uses LingPipe's Aho-Corasick algorithm for string matching.
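For readers unfamiliar with Aho-Corasick, the sketch below shows the general technique: a trie of all patterns plus failure links, so that every dictionary entry is found in a single pass over the text. It is a from-scratch illustration, not LingPipe's or DBpedia Spotlight's implementation, and the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class AhoCorasickSketch {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                       // longest proper suffix that is also a trie prefix
        final List<String> outputs = new ArrayList<>();  // patterns ending at this node
    }

    private final Node root = new Node();

    public AhoCorasickSketch(List<String> patterns) {
        // 1. Build the trie of all non-empty patterns.
        for (String p : patterns) {
            if (p.isEmpty()) {
                continue;
            }
            Node node = root;
            for (char c : p.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(p);
        }
        // 2. Breadth-first pass: set failure links and inherit outputs from them.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Returns every match as "pattern@startIndex", in the order the matches end in the text. */
    public List<String> findAll(String text) {
        List<String> matches = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Follow failure links until some suffix of the input can be extended by c.
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            for (String p : node.outputs) {
                matches.add(p + "@" + (i - p.length() + 1));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        AhoCorasickSketch matcher = new AhoCorasickSketch(Arrays.asList("Michael Jordan", "Jordan"));
        System.out.println(matcher.findAll("Michael Jordan plays")); // prints [Michael Jordan@0, Jordan@8]
    }
}
```

Overlapping surface forms are all reported, so a later disambiguation step can decide which mention to keep.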
OpenCalais Bolt
This Bolt receives a stream of tweets and runs the OpenCalais algorithm on them, which relies mainly on the steps of what is commonly called the NLP stack. It is far better suited to long texts than to tweets: more than the other tools, it suffers from the 140-character limit imposed by Twitter, and it returns no topics.
WikipediaMiner is a toolkit for tapping the rich semantics encoded within Wikipedia.
It makes it easy to integrate Wikipedia's knowledge into your own applications, by:
  • providing simplified, object-oriented access to Wikipedia's structure and content
  • measuring how terms and concepts in Wikipedia are connected to each other
  • detecting and disambiguating Wikipedia topics when they are mentioned in documents
DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia. DBpedia Spotlight recognizes that names of concepts or entities have been mentioned (e.g. "Michael Jordan"), and subsequently matches these names to unique identifiers (e.g. dbpedia:Michael_I._Jordan, the machine learning professor or dbpedia:Michael_Jordan the basketball player). It can also be used for building your solution for Named Entity Recognition, Keyphrase Extraction, Tagging, etc. amongst other information extraction tasks.
The OpenCalais Web Service automatically creates rich semantic metadata for the content you submit – in well under a second. Using natural language processing (NLP), machine learning and other methods, Calais analyzes your document and finds the entities within it. But, Calais goes well beyond classic entity identification and returns the facts and events hidden within your text as well.