
I took this problem statement from a Kaggle competition. Here we have two datasets:
- True.csv
- Fake.csv
So let's work with these datasets in Colab. You can access my repository from the link at the bottom of this blog.
Let's jump directly into coding.
Primary Steps
1. Import all the necessary libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import nltk,re,string,unicodedata
from nltk import pos_tag
from nltk.corpus import wordnet,stopwords
from nltk.stem.porter import PorterStemmer
from wordcloud import WordCloud,STOPWORDS
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize,sent_tokenize
from bs4 import BeautifulSoup
import keras
import tensorflow as tf
from keras.preprocessing import text, sequence
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from string import punctuation
from keras.models import Sequential
from keras.layers import Dense,Embedding,LSTM,Dropout
from keras.callbacks import ReduceLROnPlateau
2. Load both the datasets.
true_dataset = pd.read_csv("True.csv")
false_dataset = pd.read_csv("Fake.csv")
3. Check and analyse both the datasets.
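The original post shows the output of these checks as screenshots; a minimal sketch of what such a check could look like (the exact calls are my own choice):
# Quick look at the size, columns and missing values of each dataset
print(true_dataset.shape, false_dataset.shape)
print(true_dataset.head())
print(false_dataset.head())
print(true_dataset.isnull().sum())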

4. Create a new column named "category", assigning 1 to true news and 0 to fake news.
true_dataset['category'] = 1
false_dataset['category'] = 0

5. Now we merge the two datasets into one.
dataset = pd.concat([true_dataset,false_dataset])
6. Let’s analyse the final dataset.
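The analysis output is again shown as screenshots in the original; one way to sanity-check the merged frame (my own sketch):
# Verify the merge and the class balance
print(dataset.shape)
print(dataset['category'].value_counts())
print(dataset.isnull().sum())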

Data Preprocessing
In the final dataset, we don't need the "title", "subject", and "date" columns. We first append the title to the text, then drop all three.
dataset['text'] = dataset['text'] + " " + dataset['title']
del dataset['title']
del dataset['subject']
del dataset['date']
Now we need to download stopwords from NLTK.
nltk.download('stopwords')
stop = set(stopwords.words('english'))
punctuation = list(string.punctuation)
stop.update(punctuation)
Here we are going to create our own functions to preprocess the data.
# Stripping HTML markup
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

# Removing the square brackets
def remove_between_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)

# Removing URLs
def remove_urls(text):
    return re.sub(r'http\S+', '', text)

# Removing the stopwords from text
def remove_stopwords(text):
    final_text = []
    for i in text.split():
        if i.strip().lower() not in stop:
            final_text.append(i.strip())
    return " ".join(final_text)

# Removing the noisy text
def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    text = remove_urls(text)
    text = remove_stopwords(text)
    return text

# Apply the cleaning function on the text column
dataset['text'] = dataset['text'].apply(denoise_text)
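As a quick sanity check (not in the original post), we can peek at one cleaned article:
# Inspect the first cleaned article to confirm HTML, URLs and stopwords are gone
print(dataset['text'].iloc[0][:300])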
Data Visualization
1. Count Plot
sns.set_style("darkgrid")
sns.countplot(dataset.category)

2. Sub Plot
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(12,8))
text_len=dataset[dataset['category']==1]['text'].str.len()
ax1.hist(text_len,color='red')
ax1.set_title('Original text')
text_len=dataset[dataset['category']==0]['text'].str.len()
ax2.hist(text_len,color='green')
ax2.set_title('Fake text')
fig.suptitle('Characters in texts')
plt.show()

fig,(ax1,ax2)=plt.subplots(1,2,figsize=(12,8))
text_len=dataset[dataset['category']==1]['text'].str.split().map(lambda x: len(x))
ax1.hist(text_len,color='red')
ax1.set_title('Original text')
text_len=dataset[dataset['category']==0]['text'].str.split().map(lambda x: len(x))
ax2.hist(text_len,color='green')
ax2.set_title('Fake text')
fig.suptitle('Words in texts')
plt.show()

fig,(ax1,ax2)=plt.subplots(1,2,figsize=(20,10))
word=dataset[dataset['category']==1]['text'].str.split().apply(lambda x : [len(i) for i in x])
sns.distplot(word.map(lambda x: np.mean(x)),ax=ax1,color='red')
ax1.set_title('Original text')
word=dataset[dataset['category']==0]['text'].str.split().apply(lambda x : [len(i) for i in x])
sns.distplot(word.map(lambda x: np.mean(x)),ax=ax2,color='green')
ax2.set_title('Fake text')
fig.suptitle('Average word length in each text')

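The WordCloud import at the top suggests word clouds were part of the original notebook; here is a minimal sketch of one for the true-news subset (the parameters are my own assumptions):
# Word cloud of the most frequent words in true news articles
plt.figure(figsize=(12,8))
wc = WordCloud(max_words=500, width=1200, height=800, stopwords=STOPWORDS).generate(" ".join(dataset[dataset['category']==1]['text']))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()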
Now we'll split our dataset into training and testing sets.
x_train,x_test,y_train,y_test = train_test_split(dataset.text,dataset.category,random_state = 0)
Creating Model
Tokenizing
max_features = 10000
maxlen = 300
tokenizer = text.Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(x_train)
tokenized_train = tokenizer.texts_to_sequences(x_train)
x_train = sequence.pad_sequences(tokenized_train, maxlen=maxlen)
tokenized_test = tokenizer.texts_to_sequences(x_test)
X_test = sequence.pad_sequences(tokenized_test, maxlen=maxlen)
Embedding
EMBEDDING_FILE = 'glove.twitter.27B.100d.txt'
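The model below expects an embed_size and an embedding_matrix built from this GloVe file, but that step isn't shown in the post. Here is a minimal sketch of how such a matrix is commonly assembled; the names embeddings_index, embed_size and embedding_matrix are my own choices, so treat this as an assumption about the missing code rather than the author's exact version.
embed_size = 100  # glove.twitter.27B.100d provides 100-dimensional vectors

# Map each GloVe word to its vector
embeddings_index = {}
with open(EMBEDDING_FILE, encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        embeddings_index[word] = np.asarray(values[1:], dtype='float32')

# Build a matrix aligned with the tokenizer's word index;
# words without a GloVe vector keep a zero row
embedding_matrix = np.zeros((max_features, embed_size))
for word, i in tokenizer.word_index.items():
    if i >= max_features:
        continue
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector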
Model
#LSTM
model = Sequential()
model.add(Embedding(max_features, output_dim=embed_size, weights=[embedding_matrix], input_length=maxlen, trainable=False))
model.add(LSTM(units=128 , return_sequences = True , recurrent_dropout = 0.25 , dropout = 0.25))
model.add(LSTM(units=64 , recurrent_dropout = 0.1 , dropout = 0.1))
model.add(Dense(units = 32 , activation = 'relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer=keras.optimizers.Adam(lr = 0.01), loss='binary_crossentropy', metrics=['accuracy'])
Summary of our model
model.summary()

Fitting the model
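The fit call below relies on batch_size, epochs and a learning_rate_reduction callback that are never defined in the post; the values here are placeholders I'm assuming, using the ReduceLROnPlateau callback imported earlier.
# Assumed hyperparameters -- the original post does not state them
batch_size = 256
epochs = 10

# Halve the learning rate when validation accuracy plateaus
learning_rate_reduction = ReduceLROnPlateau(monitor='val_accuracy', patience=2, factor=0.5, min_lr=0.00001, verbose=1)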
history = model.fit(x_train, y_train, batch_size = batch_size , validation_data = (X_test,y_test) , epochs = epochs , callbacks = [learning_rate_reduction])
Accuracy of the model
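The accuracy itself appears as a screenshot in the original; a minimal sketch for evaluating the trained model on the padded test set, using the metrics already imported, could be:
# Overall accuracy on the padded test sequences
loss, accuracy = model.evaluate(X_test, y_test)
print("Test accuracy:", accuracy)

# Per-class metrics: threshold the sigmoid outputs at 0.5
y_pred = (model.predict(X_test) >= 0.5).astype(int).reshape(-1)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))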

You can clone my repository from here. You can also work with this dataset and come up with another approach.