Machine Learning Aided Differentiation of Real and Fake News

Published on September 14th, 2022

Introduction

Technology has advanced quickly since the start of the current millennium. This has led to the introduction of many news channels across print, television, and online media. The growing number of platforms and channels has set the stage for ever-increasing competition. Sensationalism, sometimes fueled by fake news, has become a common way to attract audience attention, especially in electronic media. For billions of people, the Internet has emerged as the primary medium for information consumption, and what we read and see online shapes our opinions and worldview. Access to accurate information is vital to democracy. Fake news is bleeding democracy with a thousand cuts by constantly distorting the truth.

While the spread of fake news is funded, supported, and encouraged by several vested interests and further fueled by human behavior, the technology that helps create it can also be used to combat it. Can the algorithms that exacerbate the effects of fake news also be used to suppress it and promote critical thinking on a mass scale? Machine learning holds promise in distinguishing between real news and fake news.

What is Fake News?

Fake news, a kind of yellow journalism, is material that may be a hoax and is typically disseminated via social media and other online media. It is often driven by political agendas and is frequently used to advance or impose particular views. Such messages may contain false or exaggerated claims and may be made viral by algorithms, and users may end up in a filter bubble.

About Fake News Detection Using Python

We’ll start by importing NumPy, pandas, and re. The built-in “re” package provides regular expressions, in which a sequence of characters defines a search pattern. Then we will import the stop words. Stop words are words that carry little meaning on their own, like “a” and “an”. We import them from nltk.corpus, where “nltk” is the Natural Language Toolkit and “corpus” is its collection of text resources, including the stop-word lists.
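
The cleaning step described above can be sketched with the standard library alone. This is a minimal illustration, not the exact preprocessing of this project: the function name clean_text and the tiny STOP_WORDS set are assumptions made for the example, and the set only stands in for the much longer list that nltk.corpus.stopwords provides.

```python
import re

# Illustrative stand-in for nltk.corpus.stopwords.words("english");
# the real list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and"}

def clean_text(text):
    """Lowercase the text, keep only letters, and drop stop words."""
    # Replace every character that is not a letter with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

sample = "BREAKING: The senator is in a scandal, 100% confirmed!"
print(clean_text(sample))
```

The regular expression removes digits and punctuation before the stop-word filter runs, so only the content-bearing words remain.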

Lemmatization converts a word into its base form. It is more context-aware than stemming, another procedure that reduces a word to a root form. WordNetLemmatizer is imported from nltk.stem.wordnet for this; it relies on WordNet’s built-in morphy function. nltk.stem is the NLTK package that provides stemmers and lemmatizers. The string module is then imported for its constants and classes.
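
To make the difference between stemming and lemmatization concrete, here is a toy sketch. It does not use NLTK: toy_stem is a crude suffix stripper and TOY_LEMMAS is a hypothetical three-word dictionary standing in for WordNet lookups, both invented for this illustration.

```python
def toy_stem(word):
    """Crude suffix stripping: fast, but it can produce non-words."""
    for suffix in ("ies", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for WordNet lookups.
TOY_LEMMAS = {"studies": "study", "geese": "goose", "running": "run"}

def toy_lemmatize(word):
    """Dictionary lookup: returns a real base form when one is known."""
    return TOY_LEMMAS.get(word, word)

for word in ("studies", "geese", "running"):
    print(word, "->", toy_stem(word), "(stem) /", toy_lemmatize(word), "(lemma)")
```

The stemmer only chops characters, while the lemmatizer maps each word to a dictionary entry, which is why irregular forms such as “geese” are handled only by the latter.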

Next, TfidfVectorizer is imported from sklearn.feature_extraction.text. TF-IDF stands for term frequency-inverse document frequency, and TfidfVectorizer uses it to convert text into a meaningful matrix of numbers. The machine learning algorithm is then fitted on these numbers to make predictions. The sklearn.feature_extraction package extracts features in a format supported by machine learning algorithms. Since this is a binary classification problem, a linear model such as logistic regression can classify real and fake messages; the steps in this article use the PassiveAggressiveClassifier. Next, we import nltk and download the stop words.
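
The idea behind TF-IDF can be worked through by hand. The sketch below is a simplified pure-Python version, assuming already tokenized documents; it uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is the variant scikit-learn documents for its default settings. The function name tfidf and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize each row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

docs = [
    ["election", "fraud", "claim"],
    ["election", "result"],
    ["election", "fraud", "hoax"],
]
weights = tfidf(docs)
print(weights[0])
```

In the first document, “claim” appears in only one document and therefore gets the largest weight, while “election” appears everywhere and gets the smallest.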

This advanced fake news detection Python project deals with fake and real news. Using sklearn, we build a TfidfVectorizer on our dataset. We then initialize a PassiveAggressiveClassifier and fit the model. The accuracy score and confusion matrix will ultimately tell us how well our model is doing.

Fake News Dataset

The dataset we will use for this Python project is a file we will call news.csv. It has a shape of 7796×4. The first column identifies each message, the second and third hold the title and text, and the fourth column has labels indicating whether the message is REAL or FAKE. The dataset takes up 29.2 MB.

Steps To Detect Fake News Using Python

Follow the steps below to detect fake messages and complete your first advanced Python project:

1. Make the necessary imports:

    import numpy as np
    import pandas as pd
    import itertools
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import PassiveAggressiveClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix

2. Now we load the data into a DataFrame and get the shape of the data and the first 5 records.

    # Read the data
    df = pd.read_csv('news.csv')

    # Get the shape and the first 5 records
    print(df.shape)
    print(df.head())

3. Next, get the labels from the DataFrame.

    # Get the labels
    labels = df.label
    print(labels.head())

4. Now split the dataset into training and test sets.

    # Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(df['text'], labels, test_size=0.3, random_state=7)
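
To show what a seeded split does, here is a small standard-library sketch. It is not scikit-learn’s implementation and will not reproduce the same shuffle; shuffle_split and the placeholder article strings are assumptions made for this example. The point is that a fixed seed, like random_state=7 above, gives the same partition on every run.

```python
import random

def shuffle_split(items, test_size=0.3, seed=7):
    """Shuffle indices reproducibly, then take the first part as the test set."""
    rng = random.Random(seed)          # same seed -> same shuffle every run
    indices = list(range(len(items)))
    rng.shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

articles = [f"article {i}" for i in range(10)]   # placeholder texts
train, test = shuffle_split(articles)
print(len(train), len(test))
```

Keeping the test rows strictly separate from the training rows is what makes the later accuracy figure an estimate of performance on unseen news.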

5. Let’s begin by initializing the TfidfVectorizer with English stop words and a maximum document frequency of 0.7 (terms with a higher document frequency will be discarded).

Now fit and transform the vectorizer on the train set, and transform the test set with the fitted vectorizer.

    # Initialize the TfidfVectorizer
    tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

    # Fit and transform the train set, transform the test set
    tfidf_train = tfidf_vectorizer.fit_transform(x_train)
    tfidf_test = tfidf_vectorizer.transform(x_test)
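
The following sketch illustrates two points from this step in plain Python: how a maximum document frequency of 0.7 discards very common terms, and why the vocabulary is learned from the train set only. The helper names build_vocabulary and count_vector and the sample documents are hypothetical, and raw counts are used instead of TF-IDF weights to keep the example short.

```python
from collections import Counter

def build_vocabulary(train_docs, max_df=0.7):
    """Learn a term -> column index map, dropping terms that are too common."""
    n_docs = len(train_docs)
    doc_freq = Counter(term for doc in train_docs for term in set(doc))
    # Keep a term only if its share of documents does not exceed max_df.
    kept = sorted(t for t, df in doc_freq.items() if df / n_docs <= max_df)
    return {term: idx for idx, term in enumerate(kept)}

def count_vector(doc, vocabulary):
    """Map a document onto the learned vocabulary."""
    vec = [0] * len(vocabulary)
    for term in doc:
        if term in vocabulary:   # terms unseen during fitting are ignored
            vec[vocabulary[term]] += 1
    return vec

train_docs = [
    ["news", "election", "fraud"],
    ["news", "weather"],
    ["news", "election"],
    ["news", "sports"],
]
vocab = build_vocabulary(train_docs)
print(vocab)
print(count_vector(["news", "election", "aliens"], vocab))
```

Here “news” occurs in every training document and is dropped, and “aliens” never occurred in training, so neither contributes to the test vector. This mirrors calling fit_transform on the train set and only transform on the test set.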

6. Next, we initialize the PassiveAggressiveClassifier, an online-learning linear classifier that leaves its weights unchanged when an example is classified correctly and updates them when it is not. We will fit it on tfidf_train and y_train.

We then predict on the test set produced by the TfidfVectorizer and calculate the accuracy using accuracy_score() from sklearn.metrics.

    # Initialize the PassiveAggressiveClassifier and fit it
    pac = PassiveAggressiveClassifier(max_iter=50)
    pac.fit(tfidf_train, y_train)

    # Predict on the test set and calculate the accuracy
    y_pred = pac.predict(tfidf_test)
    score = accuracy_score(y_test, y_pred)
    print(f'Accuracy: {round(score*100, 2)}%')
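
For intuition about what the classifier does internally, here is a minimal sketch of a single Passive-Aggressive update with hinge loss (the PA-I variant). It is a simplification under stated assumptions: labels are encoded as -1 and +1 (for example FAKE and REAL), the function name pa_update is invented for this example, and scikit-learn’s implementation adds epochs, shuffling, and other details that are omitted here.

```python
def pa_update(weights, x, y, C=1.0):
    """One Passive-Aggressive (PA-I) step for a label y in {-1, +1}."""
    margin = y * sum(w * xi for w, xi in zip(weights, x))
    loss = max(0.0, 1.0 - margin)            # hinge loss
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return list(weights)                 # passive: nothing to correct
    tau = min(C, loss / sq_norm)             # aggressive step, capped by C
    return [w + tau * y * xi for w, xi in zip(weights, x)]

w = [0.0, 0.0]
x, y = [1.0, 2.0], +1                        # toy feature vector and label
w = pa_update(w, x, y)
print(w)
```

After one update the example is classified with a margin of about 1, so applying the same example again leaves the weights essentially unchanged. That is the “passive” half of the name.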

7. We obtained an accuracy of 92.82% with this model. Finally, let’s print the confusion matrix to get an overview of the number of false and true negatives and positives.

    # Build the confusion matrix
    print(confusion_matrix(y_test, y_pred, labels=['FAKE', 'REAL']))

There are 589 true positives, 587 true negatives, 42 false positives, and 49 false negatives with this model.
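
The four counts quoted above are enough to recompute the accuracy and derive a few related metrics. The short sketch below only does arithmetic on those numbers; treating FAKE as the positive class is an assumption made for this example.

```python
# Counts quoted in the text (FAKE assumed to be the positive class).
tp, tn, fp, fn = 589, 587, 42, 49

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # share of predicted FAKE that is truly FAKE
recall = tp / (tp + fn)         # share of true FAKE that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Test samples: {total}")
print(f"Accuracy:  {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1 score:  {f1:.2%}")
```

The accuracy computed this way matches the 92.82% reported in step 7, and precision and recall show how the errors are distributed between false alarms and missed fake articles.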

Summary

In this blog, we looked at machine learning aided differentiation of real and fake news and discussed how to use Python to identify fake news.
