Literature Survey

by

Rohan Shetty

on 13 March 2015

Transcript of Literature Survey

A SYSTEM TO DETECT INAPPROPRIATE MESSAGES IN ONLINE SOCIAL NETWORKS
Introduction


As social networking grows at a rapid pace, it is vital that we work on improving its management.
Research has shown that the content present on such social networks can be highly influential and hence can have negative consequences if misused.
Text classification will be done using a Machine Learning based text classifier, which applies Machine Learning algorithms to build a predictive model.
Scope
The aim of our project is to build an optimized predictive model using a Machine Learning based text classifier that predicts if a sentence is offensive or not offensive.

Literature Survey
[1] Marco Vanetti, Elisabetta Binaghi, Elena Ferrari, Barbara Carminati, and Moreno Carullo, “A System to Filter Unwanted Messages from OSN User Walls”, IEEE Transactions on Knowledge and Data Engineering, February 2013.
 
Abstract:
One fundamental issue in today’s Online Social Networks (OSNs) is to give users the ability to control the messages posted on their own private space to avoid that unwanted content is displayed. Up to now, OSNs provide little support to this requirement. To fill the gap, in this paper, we propose a system allowing OSN users to have a direct control on the messages posted on their walls. This is achieved through a flexible rule-based system that allows users to customize the filtering criteria to be applied to their walls, and a Machine Learning-based soft classifier automatically labeling messages in support of content-based filtering.
 
From this paper we have referred the following concept:

Machine learning based short text classifier.

A System to Filter Inappropriate Messages in OSN
[2] Ying Chen, Yilu Zhou, “Detecting Offensive Language in Social Media to Protect Adolescent Online Safety”, ASE/IEEE International Conference on Social Computing, 2012.
Abstract:
Since the textual contents on online social media are highly unstructured, informal, and often misspelled, existing research on message-level offensive language detection cannot accurately detect offensive content. Meanwhile, user-level offensiveness detection seems a more feasible approach but it is an under-researched area. To bridge this gap, we propose the Lexical Syntactic Feature (LSF) architecture to detect offensive content and identify potential offensive users in social media. Results from experiments showed that our LSF framework performed significantly better than existing methods in offensive content detection. It achieves precision of 98.24% and recall of 94.34% in sentence offensive detection, as well as precision of 77.9% and recall of 77.8% in user offensive detection.

From this paper we have referred the following concepts:

n-gram method

Literature Survey
[3] Xing Fang, Justin Zhan, “A Computational Framework for Detecting Malicious Actors in Communities”, International Conference on Social Informatics, 2012.

Abstract:
Despite the significant research achievements on the study of communities, how to maintain a benign social environment for a community, as a problem, has not received much attention. Current existing malicious activity detecting mechanisms are subject to the limitation of the underlying online environment. However, we found that information plays an important role in terms of socialization. In this paper, we propose a computational framework for detecting malicious actors in communities from the perspective of information diffusion. We use the term “malicious actors” to represent a group of people who intentionally or unintentionally conduct malicious behaviours to sabotage benign culture on social media.
 
From this paper we have referred the following concept:
 
Identification of malicious actors on social media.

Literature Survey
[4] Ramnath Balasubramanyan, Aleksander Kolcz, “"wOOt! feeling great today!" Chatter in Twitter: Identification and Prevalence”, IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2013.
 
Abstract:
Microblogging services like Twitter are used for a wide variety of purposes and in different modes. Here, we focus on the usage of Twitter for "chatter" i.e., the production and consumption of tweets that are typically non-topical and contain personal status updates or conversational messages which are usually intended and are useful only to the immediate network of the producers of the tweets. We study the prevalence of chatter tweets in Twitter and present techniques to detect them using machine learning techniques that require minimal supervision.
 
From this paper we have referred the following concept:
 
Taking Twitter “chatter” as an example, we can study and classify tweets according to their rank and then filter out the ones that are of high relevance.
Literature Survey
[5] Basu, C. Watters, and M. Shepherd, “Support Vector Machines for Text Categorization”, Science Dalhousie University, IEEE 36th Hawaii International Conference on System Sciences, 2012.
 
Abstract:
Text categorization is the process of sorting text documents into one or more predefined categories or classes of similar documents. Differences in the results of such categorization arise from the feature set chosen to base the association of a given document with a given category. Advocates of text categorization recognize that the sorting of text documents into categories of like documents reduces the overhead required for fast retrieval of such documents and provides smaller domains in which the users may explore similar documents. In this paper we are interested in examining whether automatic classification of news texts can be improved by prefiltering the vocabulary to reduce the feature set used in the computations. First we compare artificial neural network and support vector machine algorithms for use as text classifiers of news items. Secondly, we identify a reduction in feature set that provides improved results.
 
From this paper we have referred the following concept:
 
Support Vector Machine algorithm for text classification.

Proposed System
Working of the System
Modules
1. Data Extraction
2. Feature Extraction
3. Feature Selection
4. Classifier


Data Extraction
Tokenization
Let us consider extracted messages from YouTube comments for text classification

Maybe they change what he is actually saying.

The most stupid person in the entire world.

Separation of words in the sentences

1. ‘Maybe’ ‘they’ ‘change’ ‘what’ ‘he’ ‘is’ ‘actually’ ‘saying’
2. ‘The’ ‘most’ ‘stupid’ ‘person’ ‘in’ ‘the’ ‘entire’ ‘world’
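The word separation above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual tokenizer: the function name `tokenize` and the regular expression are assumptions chosen for the example.

```python
import re

def tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation.

    Assumption: a token is a run of letters, digits, or apostrophes.
    """
    return re.findall(r"[A-Za-z0-9']+", sentence)

# The two example YouTube comments from the slide.
messages = [
    "Maybe they change what he is actually saying.",
    "The most stupid person in the entire world.",
]

tokens = [tokenize(message) for message in messages]
for token_list in tokens:
    print(token_list)
```

Each sentence yields eight tokens, matching the two lists shown on the slide.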

Feature Selection using TF
$\mathrm{IDF}_i = \log \frac{|D|}{|f(i)|}$

Where,
$f(i)$ is the set of documents $D_j$ that contain term $i$
$D$ is the set of documents in the training set

In sentence two, the word “the” receives lower precedence since it is repeated. The other words are automatically favored.
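As a rough sketch of how the IDF formula above can be combined with term frequency, the following computes TF-IDF weights in plain Python. The third message is hypothetical, added only so that “the” occurs in more than one document; the function names are also assumptions made for this example.

```python
import math
from collections import Counter

# Lower-cased tokens of the two example comments, plus one
# HYPOTHETICAL third message so that "the" appears in two documents.
docs = [
    ["maybe", "they", "change", "what", "he", "is", "actually", "saying"],
    ["the", "most", "stupid", "person", "in", "the", "entire", "world"],
    ["the", "video", "is", "great"],  # hypothetical, not from the slides
]

def term_frequency(tokens):
    """Relative frequency of each term within one document."""
    counts = Counter(tokens)
    return {term: n / len(tokens) for term, n in counts.items()}

def inverse_document_frequency(documents):
    """IDF_i = log(|D| / |f(i)|), with f(i) the documents containing term i."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

idf = inverse_document_frequency(docs)
tf = term_frequency(docs[1])
tfidf = {term: tf[term] * idf[term] for term in tf}

# Terms of the second sentence, highest weight first.
for term, weight in sorted(tfidf.items(), key=lambda item: -item[1]):
    print(f"{term:8s} {weight:.3f}")
```

Under these assumptions “the” ends up with a lower weight than “stupid”, because it occurs in two of the three documents.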

Feature Selection using n-gram
The most stupid person in the entire world


Maybe they change what he is actually saying
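The n-gram figure did not survive in the transcript, so here is a minimal sketch of what n-gram extraction over the two sentences above could look like. The helper name `ngrams` and the lower-casing step are assumptions for the example.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The most stupid person in the entire world"
tokens = sentence.lower().split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

An eight-word sentence gives seven bigrams and six trigrams; each tuple can then be treated as one feature.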

Machine Learning Algorithms
Support Vector Machine
A support vector machine constructs a hyperplane or set of hyperplanes in a high or infinite-dimensional space, which can be used for classification.
An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.
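The “gap that is as wide as possible” is usually called the margin. As a small worked example, assuming the common scaling in which the closest points of each class satisfy w·x − b = ±1, the margin width is 2 / ‖w‖. The weight vector below is hypothetical.

```python
import math

def margin_width(w):
    """Width of the gap between the two margin hyperplanes, 2 / ||w||.

    Assumes the canonical scaling w.x - b = +1 and w.x - b = -1
    for the closest points of each class.
    """
    norm = math.sqrt(sum(w_i * w_i for w_i in w))
    return 2.0 / norm

# Hypothetical weight vector, chosen so that ||w|| = 5.
w = [3.0, 4.0]
print(margin_width(w))
```

A smaller ‖w‖ means a wider gap, which is why SVM training minimizes ‖w‖ subject to the examples being classified correctly.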






[Figure legend: one marker denotes class 1, the other denotes class -1]
Classification
Future Scope
The system could be improved to include blacklist management.
Detecting sarcastic comments of any kind could go a long way toward improving the system.
References
[1] Marco Vanetti, Elisabetta Binaghi, Elena Ferrari, Barbara Carminati, and Moreno Carullo, “A System to Filter Unwanted Messages from OSN User Walls”, IEEE Transactions on Knowledge and Data Engineering, February 2013.
[2] Ying Chen, Yilu Zhou, “Detecting Offensive Language in Social Media to Protect Adolescent Online Safety”, ASE/IEEE International Conference on Social Computing, 2012.
[3] Xing Fang, Justin Zhan, “A Computational Framework for Detecting Malicious Actors in Communities”, International Conference on Social Informatics, 2012.
[4] Ramnath Balasubramanyan, Aleksander Kolcz, “"wOOt! feeling great today!" Chatter in Twitter: Identification and Prevalence”, IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2013.
[5] Basu, C. Watters, and M. Shepherd, “Support Vector Machines for Text Categorization”, Science Dalhousie University, IEEE 36th Hawaii International Conference on System Sciences, 2012.
Thank You!!!
Hyperplane Positioning
$w \cdot x - b > 0$
$w \cdot x - b = 0$
$w \cdot x - b < 0$
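The three hyperplane conditions above translate directly into a decision rule: the sign of w·x − b tells which side of the hyperplane a point lies on. The sketch below assumes the positive side maps to label 1 and the negative side to label -1; the weights, bias, and sample points are hypothetical.

```python
def decision_value(w, x, b):
    """Compute w.x - b for weight vector w, input x, and offset b."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) - b

def classify(w, x, b):
    """Return 1 if w.x - b > 0, -1 if w.x - b < 0, 0 if x lies on the hyperplane."""
    value = decision_value(w, x, b)
    if value > 0:
        return 1
    if value < 0:
        return -1
    return 0

# Hypothetical model parameters, for illustration only.
w = [1.0, -1.0]
b = 0.5

print(classify(w, [2.0, 0.0], b))  # positive side
print(classify(w, [0.0, 2.0], b))  # negative side
print(classify(w, [0.5, 0.0], b))  # exactly on the hyperplane
```

In the offensive-message setting, x would be the feature vector of a sentence and the two labels would stand for the two classes.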
Testing of Data
Feature Selection using Term Frequency
Using Chi-Square
Problem Statement
Finding the dependency between ‘an’ and the label is as follows:

where N(‘an’ present, ‘insult’) is the number of rows which have the feature ‘an’ and are labeled as ‘insult’, and N is the total number of rows.

Finding the dependency between ‘idiot’ and the label is as follows:
where N(‘idiot’ present, ‘insult’) is the number of rows which have the feature ‘idiot’ and are labeled as ‘insult’, and N is the total number of rows.
Here, since the chi-square value of ‘idiot’ is higher than that of ‘an’, ‘idiot’ gets higher precedence.
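The chi-square expressions themselves were images and are missing from the transcript. As a hedged illustration, the sketch below uses the standard chi-square statistic for a 2×2 table of feature presence against label. All counts are hypothetical and only show why a word like ‘idiot’ would outrank ‘an’.

```python
def chi_square(present_insult, present_other, absent_insult, absent_other):
    """Chi-square statistic for a 2x2 feature/label contingency table.

    Uses N * (A*D - B*C)^2 / ((A+B) * (C+D) * (A+C) * (B+D)),
    where A..D are the four cell counts and N is their sum.
    """
    a, b, c, d = present_insult, present_other, absent_insult, absent_other
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: the feature or label never varies
    return n * (a * d - b * c) ** 2 / denominator

# HYPOTHETICAL counts over 200 labeled rows (100 insults, 100 others).
chi_idiot = chi_square(40, 5, 60, 95)   # 'idiot' mostly appears in insults
chi_an = chi_square(30, 28, 70, 72)     # 'an' appears about equally in both

print(f"idiot: {chi_idiot:.3f}")
print(f"an:    {chi_an:.3f}")
```

Features can then be ranked by this score, and only the highest-scoring ones are passed on to the classifier.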

[Figure: system pipeline. Labeled Dataset → Feature Extraction → Feature Selection → Machine Learning Algorithm; Input Text → Feature Extraction → Feature Selection → Classifier → Output (Label Selected)]

Shivani Singh
Kalyani Nair
Shantanu Nakhare
Rohan Shetty
Department of Computer Engineering
Rajarshi Shahu College of Engineering
TRAINING
Naive Bayesian
The Naive Bayesian classifier is a probabilistic classifier based on Bayes' theorem.
It works from the naive assumption that features are independent given the class, together with the training documents.
Bayesian learning finds the most probable hypothesis based on prior hypotheses and initial knowledge.
Summary of SVM and Naive Bayesian
Across different experiments, SVM performed better than NB on general classification tasks.
For example, SVM was more accurate than NB on the Reuters-21578 collection.
Both NB and SVM are linear, efficient, and scalable to large document sets.
Bernoulli Naive Bayes
Formula
TESTING
Machine Learning for Text Classification
Finds patterns from previously obtained data
Uses these patterns to classify new and unknown data
Supervised and Unsupervised Learning
Algorithms Used
Naive Bayes Algorithm
Support Vector Machine
$x_i$ = Boolean expression
$C_k$ = class
$p_{ki}$ = probability
In this model, features are independent Booleans (binary values) describing the input.
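The Bernoulli formula is missing from the transcript. A common form that matches the symbols listed above is p(x | C_k) = ∏ p_ki^x_i · (1 − p_ki)^(1 − x_i); the sketch below assumes that form. The two features and all probabilities are hypothetical.

```python
def bernoulli_likelihood(x, p_k):
    """Likelihood of a Boolean feature vector x under one class.

    Assumed form: product over i of p_ki if x_i is true, else (1 - p_ki).
    """
    likelihood = 1.0
    for x_i, p_ki in zip(x, p_k):
        likelihood *= p_ki if x_i else (1.0 - p_ki)
    return likelihood

# HYPOTHETICAL per-class probabilities that each feature is present.
# Feature 0: contains "stupid"; feature 1: contains "maybe".
p_insult = [0.8, 0.3]
p_ok = [0.1, 0.4]

x = [True, False]  # message contains "stupid" but not "maybe"

print(bernoulli_likelihood(x, p_insult))
print(bernoulli_likelihood(x, p_ok))
```

Unlike the multinomial model, the absence of a feature also contributes to the score through the (1 − p_ki) factor.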
Multinomial Naive Bayes
P(w|c) = conditional probability of a given word in a given class
Count(w, c) = count of a word in a given class
Count(c) = total count of words in a given class
w = chosen word
c = class
V = vocabulary
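The multinomial formula is also missing from the transcript. A widely used form consistent with the symbols above is the add-one (Laplace) smoothed estimate P(w|c) = (Count(w, c) + 1) / (Count(c) + |V|). The sketch assumes that form and reads Count(c) as the total number of word tokens seen in class c. The class labels attached to the two example sentences are hypothetical.

```python
from collections import Counter

# The two example comments with HYPOTHETICAL labels.
training = [
    (["maybe", "they", "change", "what", "he", "is", "actually", "saying"], "ok"),
    (["the", "most", "stupid", "person", "in", "the", "entire", "world"], "insult"),
]

# V: every distinct word seen in training.
vocabulary = {word for tokens, _ in training for word in tokens}

# Count(w, c): per-class word counts.
word_counts = {}
for tokens, label in training:
    word_counts.setdefault(label, Counter()).update(tokens)

def word_probability(word, label):
    """P(w|c) = (Count(w, c) + 1) / (Count(c) + |V|), add-one smoothing assumed."""
    count_wc = word_counts[label][word]
    count_c = sum(word_counts[label].values())
    return (count_wc + 1) / (count_c + len(vocabulary))

print(word_probability("stupid", "insult"))
print(word_probability("stupid", "ok"))
```

The added 1 keeps unseen words such as “stupid” in the “ok” class from forcing the whole product of word probabilities to zero.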