ML PROJECT
K.CHAITANYA
K.SRI VISHNU
A.V.ROHIT ROY
Fake news is a form of news consisting of deliberate disinformation or hoaxes spread via traditional news media or online social media. Digital news has brought back and increased the usage of fake news, also known as yellow journalism.
You understand by now that fake news and other types of false information can take on different faces. They can also have major impacts, because information shapes our world view: we make important decisions based on information, and we form ideas about people or situations by obtaining information. So if the information we see on the Web is invented, false, exaggerated or distorted, we won't make good decisions.
Examples:
- Fear
- Racist ideas
- Bullying and violence against innocent people
- Democratic impacts
To address this problem, our team built a model using ML classification algorithms, based on which we can classify news as real or fake.
For our project we carried out a background study to learn what methods had been used before and what we needed to use to make our model better than those models.
These are the papers we have checked:
https://www.ijeat.org/wp-content/uploads/papers/v9i1/A2633109119.pdf
In this paper they used logistic regression. Logistic regression works well for both short and long text, and they achieved an accuracy of around 79%-89%.
To be specific, this model is a classification model, or in other words a supervised model, which is able to detect whether news is fake or not. In this project's scenario, the data collected, news articles in the Malay language, was insufficient and not of high enough quality to fit the model during training. This is because the model relies solely on past training data to predict future news. If the dataset is small, the model can be trained, but it will not be able to predict correctly.
Paper 2 - using N-grams
http://dspace.library.uvic.ca:8080/bitstream/handle/1828/8796/Ahmed_Hadeer_Masc_2017.pdf?sequence
(University of Victoria)
https://www.csustan.edu/sites/default/files/groups/University%20Honors%20Program/Journals/02_stahl.pdf
(California State University, Stanislaus)
This paper includes a discussion of Linguistic Cue approaches, and proposes a three-part method using a Naïve Bayes classifier, Support Vector Machines, and semantic analysis as an accurate way to detect fake news on social media.
So far we have seen the literature related to our project and the methods used in those works; now we are going to see the methods we used to build our model to classify fake news from real news.
We have followed the process below for our project.
Nowadays we get raw data from internet sources, so we need to do data pre-processing to convert it into a form suitable for our machine learning algorithms. In our project we need data consisting only of numbers, but the data we get is a mix of letters and numbers, so pre-processing is required. In our project we used TF-IDF as the data pre-processing technique.
TF-IDF means term frequency-inverse document frequency. It is used as a weighting factor for the features; that is, it is mainly used for converting English text into numbers. It can be implemented using the sklearn library, with the TfidfVectorizer method. Generally the weight of a word increases the more often it appears in a document, and decreases the more documents it appears in. Now let us see the working of TF-IDF.
tf-idf(t, d, D) = tf(t, d) * idf(t, D)
tf(t, d) = frequency of the given term in the document
idf(t, D) = log(N / |{d ∈ D : t ∈ d}|)
Here TF = term frequency
IDF = inverse document frequency
t = term or word
d = document
D = data set
N = number of documents
For example, consider two documents:
D1 - the sky is blue.
D2 - the sky is not blue.
N = 2
The word "not" appears only in D2, so idf(not, D) = log(2/1) = log 2 > 0, while a word like "sky" appears in both documents, so idf(sky, D) = log(2/2) = 0. TF-IDF therefore gives "not" a higher weight, because it is the word that distinguishes D2.
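As a minimal sketch of this step (assuming a recent scikit-learn is installed; the exact numbers differ from the raw log(N/df) formula above because sklearn smooths the idf and L2-normalizes each row), the tf-idf weights for the two example documents can be computed like this:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the sky is blue",      # D1
    "the sky is not blue",  # D2
]

# sklearn uses a smoothed idf, idf(t) = ln((1 + N) / (1 + df(t))) + 1,
# and L2-normalizes each document vector.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # ['blue' 'is' 'not' 'sky' 'the']
print(tfidf.toarray())
# "not" occurs only in D2 (df = 1), so it gets the largest weight in D2's row;
# words occurring in both documents (df = 2) get smaller weights.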
Now we need to split the data into two parts so that we can test the model after it is trained and fitted. For that purpose we use the train_test_split method, which we import from the sklearn library. In our project we partitioned 70 percent of the data for training the model and 30 percent for testing, so that we can check the accuracy of the model.
TOTAL DATA SET
Training data - 70 percent
Testing data - 30 percent
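A minimal sketch of this split (the toy texts and labels below are illustrative only, not our actual dataset):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Illustrative toy corpus and labels (0 = real, 1 = fake) -- not the real data.
texts = ["the sky is blue", "the sky is not blue",
         "aliens seized the government", "stocks rose slightly today"]
labels = [0, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts)

# test_size=0.3 reserves 30 percent of the rows for testing;
# a fixed random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)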
In our project we have used the following classification algorithms.
Logistic Regression
It is a machine learning algorithm generally used for binary classification. The working of logistic regression is as follows: first we apply linear regression to the data, and the result is passed through a sigmoid function, which gives the probability of an event. We set a threshold such as p = 0.5; if p <= 0.5 the sample belongs to class 0, and if p > 0.5 it belongs to class 1. With this technique we can classify whether the news is fake or real. The sigmoid function used is shown below.
P(y) = e^(b0·x) / (1 + e^(b0·x))
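A minimal sketch of this thresholding (b0 and the sample scores below are made-up illustrative values, not fitted coefficients):

import numpy as np

# Sigmoid from the formula above: P(y) = e^(b0*x) / (1 + e^(b0*x)).
def sigmoid(z):
    return np.exp(z) / (1.0 + np.exp(z))

b0 = 1.5                                  # illustrative coefficient
x = np.array([-2.0, -0.1, 0.4, 3.0])      # illustrative linear-regression outputs
p = sigmoid(b0 * x)

# Threshold at p = 0.5: class 1 if p > 0.5, otherwise class 0.
labels = (p > 0.5).astype(int)
print(p)        # probabilities
print(labels)   # [0 0 1 1]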
Passive Aggressive Classifier
This algorithm stays passive for a correct classification, but for any wrong classification it becomes aggressive: it corrects the wrongly classified data by updating or changing the weight vector.
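A minimal sketch using sklearn's PassiveAggressiveClassifier (the toy texts and labels are illustrative only, not our dataset):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Illustrative toy data (0 = real, 1 = fake) -- not our actual dataset.
texts = ["scientists publish peer-reviewed study",
         "miracle cure discovered overnight",
         "election results announced officially",
         "celebrity secretly controls the weather"]
labels = [0, 1, 0, 1]

X = TfidfVectorizer().fit_transform(texts)

# On each training example the weights stay unchanged if the prediction is
# correct (passive) and are updated just enough to fix a mistake (aggressive).
clf = PassiveAggressiveClassifier(max_iter=1000, random_state=42)
clf.fit(X, labels)
print(clf.predict(X))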
As a measure of performance we are calculating the accuracy of each model.
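A minimal sketch of this measurement (y_test and y_pred below are illustrative values):

from sklearn.metrics import accuracy_score

# y_test: true labels, y_pred: a model's predictions -- illustrative values.
y_test = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Accuracy = correct predictions / total predictions = 4/5 = 0.8 here.
print(accuracy_score(y_test, y_pred))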
From the performance metrics we can say that the Passive Aggressive classifier and the linear Support Vector classifier are the best classifiers for detection of fake news, as the accuracy of both algorithms is the highest. The remaining algorithms, such as logistic regression, random forest, poly-SVC and RBF-SVC, also show good accuracy. The KNN algorithm gave much lower accuracy compared to the other algorithms.
Spreading fake news creates a great impact in the long run and in the short run. Sometimes it creates nervousness among the people, and in some cases it causes a huge revenue loss for both individuals and the country. Let us see a situation where fake news spreads. Take the present situation: if there is a fake news item like "our government is letting the army control the lockdown and we are not supposed to come out of our homes", many people will come out of their homes and buy far more goods than they require. This leads to a loss of control over the public, the disease spreads faster, and a huge revenue loss occurs. In this way, many kinds of harm to mankind can happen through the spread of fake news. So in this project we used different algorithms to predict whether news is fake or real, and compared with similar projects we used a better algorithm that gives the best prediction.