Fake News and Their Detection
Data Science and Big Data Analysis
Professor: Antonino Nocera
Team Name: 4V’s
Group members:
Arnold Fonkou
Vignesh Kumar Kembu
Ashina Nurkoo
Seyedkourosh Sajjadi
WELFake
The Fake News Detection (WELFake) dataset
contains 72,134 news articles: 35,028 real
and 37,106 fake.
This dataset is part of ongoing research on
"Fake News Prediction on Social Media
Website" within the doctoral degree program
of Mr. Pawan Kumar Verma, and is partially
supported by the ARTICONF project funded
by the European Union's Horizon 2020
research and innovation program.
Columns:
- Serial number (starting from 0)
- Title (the news headline)
- Text (the news content)
- Label (0 = fake and 1 = real)
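To sanity-check the columns and class balance described above, a minimal pandas sketch (the local filename is an assumption; the serial-number column appears as 'Unnamed: 0' later in the pipeline):

import pandas as pd

# Hypothetical filename; the deck does not name the CSV file.
df = pd.read_csv("WELFake_Dataset.csv")
print(df.columns.tolist())         # expected: ['Unnamed: 0', 'title', 'text', 'label']
print(df['label'].value_counts())  # expected: 37,106 fake (0) and 35,028 real (1)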
Architecture
[Diagram: a data stream is ingested with PySpark into Hadoop (HDFS),
processed with MapReduce, stored in a MongoDB sandbox, and analyzed.]
Ingestion
From CSV to JSON
Data Conversion
We converted the CSV file into JSON to
better reflect a realistic data stream
(see the sketch after this list).
Reading Data
Using PySpark
We used the Spark DataFrame API to read
our big data.
Saving to Hadoop
Write into Hadoop
We read from the DataFrame and then
write it to Hadoop.
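A minimal sketch of that CSV-to-JSON conversion, assuming pandas and a placeholder input filename (the deck shows the JSON being read, not the conversion itself); orient="records" produces the single JSON array that the multiline read in the next section expects:

import pandas as pd

# Hypothetical CSV filename; only the JSON name appears in the deck.
df = pd.read_csv("WELFake_Dataset.csv")
df.to_json("project_data_sample.json", orient="records")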
Reading Section
import findspark
findspark.init()
import pyspark
from pyspark.sql import *

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("PySpark Read JSON") \
    .getOrCreate()

# Reading a multiline JSON file
multiline_dataframe = spark.read.option("multiline", "true") \
    .json("project_data_sample.json")
multiline_dataframe.head()
Saving Section
multiline_dataframe.write.save('/usr/local/hadoop/user3/dsba1.json',
format='json')
And the data is shown as below:
sqlContext = SQLContext(spark)
df = sqlContext.read.format('json').load('/usr/local/hadoop/user3/dsba1.json')
df.show()
Hadoop
[Diagram: Hadoop's components, HDFS and MapReduce.]
Mapper (BoW Creation)
Read Lines
Input Data
The data is given as input
lines to the mapper.
Extract Text
Title and Text Extraction
After reading each line as
a JSON object, we
extract the title and the
text related to that piece
of news from it.
Tokenize
Word Extraction
We perform some data
cleaning and then we
extract every single word
from it.
Text Cleaning
import sys
import re
import json

def clean_text(text):
    # Strip URLs, then anything that is not a letter or whitespace.
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    text = re.sub(r'[^a-zA-Z\s]+', '', text)
    return text

Tokenizing
def tokenize(text):
    if not isinstance(text, str):
        text = str(text)
    text = clean_text(text)
    text = text.lower()
    return text.split()

Execution
for line in sys.stdin:
    line = line.strip()
    try:
        json_obj = json.loads(line)
    except:
        continue
    title = json_obj.get("title", "")
    text = json_obj.get("text", "")
    title_words = tokenize(title)
    text_words = tokenize(text)
    # Emit one tab-separated (word, 1) pair per token.
    for word in title_words + text_words:
        print(f"{word}\t1")
Reducer
Read Lines
Input Data
The data is given as input lines, each
containing two elements.
Initialize Counter
Word and Count Extraction
After reading each line, we split it into
a word and its partial count.
Create BoW
Dictionary
Create a dictionary with each word as the
key and its accumulated count as the
value.
Counter Initialization
import sys
from collections import Counter
import json

bag_of_words = Counter()

Execution
for line in sys.stdin:
    line = line.strip()
    try:
        word, count = line.split("\t")
    except:
        continue
    count = int(count)
    bag_of_words[word] += count

with open('bow_data.json', 'w') as f:
    json.dump(bag_of_words, f)
Moving to MongoDB
import json
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
db = client['bow']
collection = db['bow_collection']

with open('bow_data.json', 'r') as f:
    bow_data = json.load(f)
collection.insert_one(bow_data)

Performing MapReduce Operation
In the Terminal:
cat db.json | python3 bow_mapper.py | sort | python3 bow_reducer.py
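The same mapper and reducer could also run on a cluster via Hadoop Streaming; a hedged sketch (the streaming jar location and the HDFS paths are assumptions, not shown in the deck).
In the Terminal:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files bow_mapper.py,bow_reducer.py \
    -mapper "python3 bow_mapper.py" \
    -reducer "python3 bow_reducer.py" \
    -input /user3/db.json \
    -output /user3/bow_output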
HDFS
When dealing with big data, we can
partition the dataset into a number of
batches instead of saving it as a single
file.
Instead of:
multiline_dataframe.write.save('/usr/local/hadoop/user3/dsba1.json',
                               format='json')
Use:
partitioned_df = multiline_dataframe.repartition(4, "Unnamed: 0")
partitioned_df.write.save('/usr/local/hadoop/user3/dsba1.json',
                          format='json')
# Count the rows that landed in each partition.
partition_counts = partitioned_df.rdd.mapPartitions(
    lambda it: [sum(1 for _ in it)]).collect()
print(partition_counts)
[482, 480, 519, 519]
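When reading the partitioned output back, Spark can load the whole directory at once rather than a single part file; a small sketch reusing the session and path above:

# Spark globs every part file under the directory into one DataFrame.
df_all = spark.read.json('/usr/local/hadoop/user3/dsba1.json')
print(df_all.count())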
MongoDB
Create Database
Create a database to contain the data.
Import From Hadoop
Import the JSON file from Hadoop via
PySpark.
View Data & Backup
View the data and, if it was inserted
correctly, create a backup before starting
the modifications.
Clean Data
Remove non-alphanumeric characters.
Display Modified Data
Display the modified content to view the
changes.
Creating Database
Use an existing database or create a new one:
>use dsdb_dev
Viewing Data
>use dsdb_dev
>show collections
>db.fake_real_news.find()
>db.fake_real_news.aggregate([{$group: {_id: "$label", rest_number: {$sum: 1}}}])
Creating a Copy
In the Terminal:
mongodump --db dsdb_dev --collection fake_real_news --out /home/ds/Documents/
Importing From Hadoop
In the Terminal:
mongoimport --db dsdb_dev --collection fake_real_news --file /usr/local/hadoop/user3/dsba1.json/part-00000-d1623440-4fde-4b72-b87d-5943bec596d3-c000.json
Importing from Hadoop Using PySpark
import json
from pymongo import MongoClient

df = spark.read.json("/usr/local/hadoop/user3/dsba1.json/part-00000-d1623440-4fde-4b72-b87d-5943bec596d3-c000.json")
sampled_df = df.sample(fraction=0.8, seed=42)

conn = MongoClient()
db = conn.dsdb_dev
collection = db['sampled_data']

json_data = sampled_df.toJSON().collect()
with open('sampled_data.json', 'w') as file:
    for line in json_data:
        file.write(line + '\n')

with open('sampled_data.json') as file:
    data = file.readlines()
collection.insert_many([json.loads(line) for line in data])
Data Cleaning
>db.fake_real_news.aggregate([
    {'$project': {'_id': 1, 'Unnamed: 0': 1, 'label': 1, 'text': 1, 'title': 1}}
]).forEach(function(doc) {
    if (doc.title) {
        var newTitle = doc.title.replace(/[^a-zA-Z0-9 ]/g, '');
        db.fake_real_news.update({ '_id': doc._id }, { '$set': { 'title': newTitle } });
    }
});
Modified Content Display
>db.fake_real_news.aggregate([
    {'$project': {'_id': 1, 'Unnamed: 0': 1, 'label': 1, 'text': 1, 'title': 1}}
]);
The file is now ready for word occurrence counting,
which can be done using Jupyter Notebook and
PyMongo.
Backup Restoration
In case of any need, restore the initial file:
>db.fake_real_news.drop()
In the Terminal:
mongorestore --db dsdb_dev --collection fake_real_news /home/ds/Documents/dsdb_dev/fake_real_news.bson
Count the Number of Words
db.fake_real_news.aggregate([
    {
        '$match': {
            'label': "0"  # keep only fake news (label = 0)
        }
    },
    {
        '$project': {
            # Split the lowercase version of the title field into an array of words
            'words': {'$split': [{'$toLower': '$title'}, ' ']}
        }
    },
    {
        '$unwind': '$words'  # separate documents for each word
    },
    {
        '$group': {
            '_id': {
                'word': '$words'  # group by word and count
            },
            'count': {'$sum': 1}
        }
    },
    {
        '$project': {
            # Project to return only the word field, the count, and the id
            'word': '$_id.word',
            'count': 1
        }
    },
    {
        '$match': {
            'word': {'$ne': None}  # exclude null or non-existent values
        }
    },
    {
        '$match': {
            '$expr': {'$ne': ['$word', '']}  # exclude empty strings
        }
    },
    {
        '$sort': {'count': -1}
    }
])
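A short usage sketch in PyMongo, assuming `db` is the database handle created earlier and the stages above are stored in a list named `pipeline`:

# Print the ten most frequent words in fake-news titles.
for doc in list(db.fake_real_news.aggregate(pipeline))[:10]:
    print(doc['word'], doc['count'])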
Hypotheses
H1
Fake news is generated with heavy use of
stop words.
Metric: the average number of stop words
in the title should be higher for fake news.
H2
Real news should be short and crisp in
order to deliver its message easily.
Metric: fake news should be longer than
real news.
H1
We used NLTK to extract stop words from
the title column and compared the
averages between fake and real titles.
The hypothesis is false, as shown by the
figure: stop words are less frequent in fake
news titles (0) than in real news (1).
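A hedged sketch of that comparison, assuming the pandas DataFrame `df` from earlier (the deck only states that NLTK stop words were used):

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_set = set(stopwords.words('english'))

def stopword_count(title):
    # Count stop words in a title; non-string titles count as zero.
    if not isinstance(title, str):
        return 0
    return sum(1 for w in title.lower().split() if w in stop_set)

# Average stop words per title, per class (0 = fake, 1 = real).
df['title_stopwords'] = df['title'].apply(stopword_count)
print(df.groupby('label')['title_stopwords'].mean())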
H2
The hypothesis is true, as shown by the figures:
fake news (0) tends to be longer than real news
(1).
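A matching sketch for H2, again assuming the pandas DataFrame `df` (length is measured in characters here; the deck does not say which unit it used):

# Compare the average article length per class (0 = fake, 1 = real).
df['text_len'] = df['text'].astype(str).str.len()
print(df.groupby('label')['text_len'].mean())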
Insights on Data &
Pre-processing
To gain quick insights from the data, we
used word clouds for the titles overall and
for the fake and real subsets.
[Word clouds: Real News, Fake News]
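A hedged sketch of how such a word cloud can be produced (the wordcloud package and the pandas DataFrame `df` are assumptions, not shown in the deck):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Word cloud over the titles of fake news (label = 0).
fake_titles = " ".join(df.loc[df['label'] == 0, 'title'].dropna().astype(str))
wc = WordCloud(width=800, height=400, background_color='white').generate(fake_titles)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()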
Null Values
The title column contains some null
values, which can break text operations
during analysis or processing.
We fill the null values in the title
column to ensure accurate data analysis.
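A one-line fill, assuming the pandas DataFrame `df` from earlier:

# Replace null titles with an empty string so string operations don't fail.
df['title'] = df['title'].fillna('')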
Text Normalization
To further prepare the data, we applied text normalization techniques, including converting
the title and text to lowercase and removing punctuation marks.
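A minimal sketch of that normalization, assuming the pandas DataFrame `df` and taking string.punctuation as the definition of punctuation marks:

import string

def normalize(s):
    # Lowercase the text and strip punctuation marks.
    s = str(s).lower()
    return s.translate(str.maketrans('', '', string.punctuation))

df['title'] = df['title'].apply(normalize)
df['text'] = df['text'].apply(normalize)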
Classification Model
For the binary classification of the news, we chose a
Random Forest classifier.
The data was split into X and y variables, with a 67/33
train/test split.
A bag of words was built from the news text (X_train &
X_test), removing English stop words.
The labels (y_train & y_test) hold the class of the news
(Fake = 0 & Real = 1).
The training data was then fed to a RandomForestClassifier
with 500 trees; the model was tested on the test data, and
its confusion matrix is shown below.
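A minimal scikit-learn sketch of the pipeline as described (only the bag of words, English stop-word removal, the 33% test split, and the 500 trees are stated in the deck; variable names and random_state are assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X = df['text'].astype(str)   # news content
y = df['label']              # 0 = fake, 1 = real

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Bag of words over the news text, dropping English stop words.
vectorizer = CountVectorizer(stop_words='english')
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

clf = RandomForestClassifier(n_estimators=500)
clf.fit(X_train_bow, y_train)
print(confusion_matrix(y_test, clf.predict(X_test_bow)))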
Thank You For Your Attention!