
FAKE NEWS DETECTION

ML PROJECT

K.CHAITANYA

K.SRI VISHNU

A.V.ROHIT ROY

FAKE NEWS...?

Fake news is a form of news consisting of deliberate disinformation or hoaxes spread via traditional news media or online social media. Digital news has revived and increased the spread of fake news, also known as yellow journalism.

INTRO

Impact

The impact of fake news today

By now you understand that fake news and other types of false information can take many forms. They can also have major impacts, because information shapes our world view: we make important decisions based on information, and we form ideas about people and situations from the information we obtain. So if the information we see on the Web is invented, false, exaggerated, or distorted, we will not make good decisions.

Examples:

  • Fear
  • Racist ideas
  • Bullying and violence against innocent people
  • Democratic impacts

Is there a solution to this problem?

Solution

To address this problem, our team built a project in which we train a model using several ML classification algorithms, and based on that model we can classify news as real or fake.

Literature Review

For our project we carried out a background study to learn which methods have been used before and what we should use to improve our model beyond those approaches.

These are the papers we reviewed:

Paper 1

https://www.ijeat.org/wp-content/uploads/papers/v9i1/A2633109119.pdf

In this paper the authors used logistic regression. Logistic regression works well for both short and long text, and they obtained an accuracy of around 79%–89%.

To be specific, this model is a classification model, in other words a supervised model, which detects whether news is fake or not. In this project's scenario, the data collected (news articles in the Malay language) was insufficient and not of high enough quality to train the model well. Because the model relies solely on past training data to predict future news, a small dataset can still be trained on, but the resulting model will not predict correctly.

Paper 2 – using N-grams

http://dspace.library.uvic.ca:8080/bitstream/handle/1828/8796/Ahmed_Hadeer_Masc_2017.pdf?sequence

(University of Victoria)

Paper 3

https://www.csustan.edu/sites/default/files/groups/University%20Honors%20Program/Journals/02_stahl.pdf

(California State University, Stanislaus)

This paper includes a discussion of Linguistic Cue approaches, and proposes a three-part method using a Naïve Bayes classifier, Support Vector Machines, and semantic analysis as an accurate way to detect fake news on social media.

Our Project

So far we have looked at the literature related to our project and the methods used in previous work. Now we will look at the methods we used to build our model to classify fake news from real news.

Methodology

We followed the process below for our project.

Data Pre-processing

Nowadays we get raw data from internet sources, so we need to pre-process it into a form suitable for our machine learning algorithms. Our algorithms need purely numerical data, but the data we collect is a mix of text and numbers, so pre-processing is required. In our project we used TF-IDF as the data pre-processing technique.

TF-IDF (Term Frequency–Inverse Document Frequency)

TF-IDF stands for term frequency–inverse document frequency. It is used as a weighting factor for features; in other words, it is mainly used to convert English text into numbers. It can be implemented with the sklearn library using the TfidfVectorizer method. In general, a term's weight increases the more often that term appears in a document. Now let us see how TF-IDF works.

Working of TF-IDF

tf-idf(t,d,D) = tf(t,d) × idf(t,D)

tf(t,d) = frequency of term t in document d

idf(t,D) = log( N / |{d ∈ D : t ∈ d}| )

Here TF= term frequency.

IDF = inverse document frequency

t = term or word

d = document

D = data set

N = number of documents

For example, consider two documents:

D1- the sky is blue.

D2- the sky is not blue.

N = 2
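The example above can be reproduced with sklearn's TfidfVectorizer. Note that sklearn uses a smoothed variant of the idf formula, so the exact weights differ slightly from the textbook definition; this is a minimal sketch, not part of the original project code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# D1 and D2 from the example above
docs = ["the sky is blue", "the sky is not blue"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Vocabulary learned from the two documents (N = 2)
print(sorted(vectorizer.vocabulary_))  # ['blue', 'is', 'not', 'sky', 'the']
print(X.shape)                         # (2, 5): 2 documents, 5 terms
```

The word "not" appears in only one of the two documents, so it receives a higher idf weight than the words shared by both.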

Next we split the data into two parts so that we can test the trained and fitted model. For this purpose we use the train_test_split method, which we import from the sklearn library. In our project we partitioned 70 percent of the data for training the model and 30 percent for testing, so that we can check the accuracy of the model.

TOTAL DATA SET

Training data - 70 percent

Testing data – 30 percent
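The 70/30 split described above can be sketched with sklearn's train_test_split. The features and labels here are placeholders standing in for the real news dataset, which is not included in this transcript.

```python
from sklearn.model_selection import train_test_split

# Placeholder features and labels standing in for the real news dataset
X = list(range(100))
y = [i % 2 for i in range(100)]  # dummy labels: 1 = fake, 0 = real

# 70% training, 30% testing, as in the project
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(len(X_train), len(X_test))  # 70 30
```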

Algorithms Used

In our project we have used:

  • Logistic regression
  • Random forest
  • KNN
  • SVC
  • Passive aggressive classifier

Logistic Regression

Logistic regression is a machine learning algorithm generally used for binary classification. It works by first fitting a linear model to the data and then passing the linear output through a sigmoid function, which gives the probability of an event. We set a threshold such as p = 0.5: if p ≤ 0.5 the sample belongs to class 0, and if p > 0.5 it belongs to class 1. With this technique we can classify whether news is fake or real. The sigmoid function used is shown below.

P(y = 1) = e^(B0 + B1·x) / (1 + e^(B0 + B1·x))
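The sigmoid-plus-threshold rule described above can be sketched as follows. The coefficients b0 and b1 are illustrative values, not coefficients fitted on the project's data.

```python
import math

def sigmoid(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, b0, b1, threshold=0.5):
    """Return class 1 if P(y) > threshold, else class 0."""
    p = sigmoid(b0 + b1 * x)
    return 1 if p > threshold else 0

# Illustrative coefficients, chosen only to demonstrate the rule
print(classify(2.0, b0=-1.0, b1=1.0))  # sigmoid(1.0) ~ 0.73 > 0.5 -> class 1
print(classify(0.0, b0=-1.0, b1=1.0))  # sigmoid(-1.0) ~ 0.27 <= 0.5 -> class 0
```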

Random Forest

Random forest is an ensemble method that builds many decision trees on random subsets of the data and features, and classifies a sample by majority vote of the trees.

K-Nearest Neighbor (KNN)

KNN classifies a sample by the majority label among its k closest training samples in feature space.

Support Vector Classifier (SVC)

SVC finds the hyperplane that separates the classes with the maximum margin; kernel functions (linear, polynomial, RBF) allow non-linear decision boundaries.

Passive Aggressive Classifier (PAC)

This algorithm remains passive on correct classifications but becomes aggressive on any wrong classification: it corrects wrongly classified data by updating the weight vector.
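As a sketch of how PAC could be applied here, the snippet below trains sklearn's PassiveAggressiveClassifier on a tiny invented corpus; the texts and labels are made up purely for illustration and stand in for the project's real news dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Tiny invented corpus; the real project trains on a full news dataset
texts = [
    "aliens built the pyramids last week",
    "stock markets closed slightly higher today",
    "miracle pill cures every disease overnight",
    "city council approves new budget plan",
]
labels = [1, 0, 1, 0]  # 1 = fake, 0 = real

# TF-IDF features, as in the pre-processing step described earlier
X = TfidfVectorizer().fit_transform(texts)

clf = PassiveAggressiveClassifier(max_iter=1000, random_state=42)
clf.fit(X, labels)

print(clf.score(X, labels))  # training accuracy on this toy corpus
```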

Performance

As measures of performance we calculate the following:

  • Accuracy
  • Precision
  • Recall
  • F1 score
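All four metrics can be computed with sklearn.metrics. The true labels and predictions below are invented purely to show the calls; they are not results from the project.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Invented true labels and predictions (1 = fake, 0 = real)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # 6/8 = 0.75
print("Precision:", precision_score(y_true, y_pred))  # 3/4 = 0.75
print("Recall   :", recall_score(y_true, y_pred))     # 3/4 = 0.75
print("F1 score :", f1_score(y_true, y_pred))         # 0.75
```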

Accuracy

Accuracy is the fraction of all predictions that are correct.

Precision

Precision is the fraction of items predicted as fake that are actually fake.

Recall

Recall is the fraction of actually fake items that the model identifies as fake.

F1 Score

The F1 score is the harmonic mean of precision and recall.

Results

From the performance metrics we can say that the passive aggressive classifier and the linear support vector classifier are the best classifiers for fake news detection, as both achieve the highest accuracy. The remaining algorithms (logistic regression, random forest, polynomial SVC, and RBF SVC) also show good accuracy, while KNN gave much lower accuracy than the other algorithms.

Conclusion

Conclusion

The spread of fake news has a great impact in both the short and long run. It can create nervousness among people, and in some cases it causes large revenue losses for individuals and even for a country. Consider a present-day situation: if a fake story spreads saying "our government is letting the army control the lockdown and we are not supposed to come out of our homes", many people will rush out of their homes and buy far more goods than they need. This leads to a loss of control over the public, faster spread of disease, and heavy revenue losses. In these and other ways, the spread of fake news can cause harm to mankind. So in this project we used different algorithms to predict whether a news item is fake or real, and compared to similar projects we chose the algorithm that gives the best prediction.