Big Data Training
About me
• I’m Vishal Periyasamy Rajendran
• Senior Data Engineer
• Focused on architecting and developing big data solutions on the AWS cloud.
• 8x AWS certifications + other certifications on
Azure, Snowflake etc.
• You can find me on:
• LinkedIn: https://www.linkedin.com/in/vishal-p-2703a9131/
• Medium: https://medium.com/@vishalrv1904
Agenda
• Big data Overview
• Dimensions of Big data
• Traditional approach and limitations
• Hadoop Overview
• Spark Overview
• Hive Overview
• Other Big data frameworks
Big Data Overview
What is Big Data?
• Smartphone users generate enormous volumes of data; a frequently quoted estimate is roughly 40 exabytes every month.
• According to Forbes, 2.5 Quintillion bytes of data are
created every day.
What is Big Data?
• A collection of data so large and complex that no traditional data management tool can store or process it.
Dimensions of Big Data
The 6 V’s of Big Data
• Volume: the scale of data.
• Velocity: the speed of data.
• Variety: the diversity of data.
• Veracity: the accuracy of data.
• Value: the insights gained from data.
• Variability: how often data can change.
Big Data Phases
• Data collection
• Data Cleansing / Validation
• Data Transformation
• Data Storage
• Data Visualization
Different Pipelines:
• ETL (Extract, Transform, Load)
• ELT (Extract, Load, Transform)
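The difference between the two pipeline styles is only the order of the steps. The following is a minimal plain-Python sketch, not tied to any particular tool; the record layout and the function names (`transform`, `etl`, `elt`) are hypothetical and chosen only for illustration.

```python
# Plain-Python sketch of ETL vs ELT; data and names are hypothetical.

raw_records = [" Delhi,120 ", "kerala,85", "GOA,n/a"]


def transform(line):
    """Clean one raw line into a (state, cases) tuple, or None if it is invalid."""
    state, _, cases = line.strip().partition(",")
    if not cases.isdigit():
        return None
    return state.lower(), int(cases)


def etl(lines):
    """Extract, Transform, Load: only cleaned rows reach the store."""
    store = []
    for line in lines:                 # extract
        row = transform(line)          # transform before loading
        if row is not None:
            store.append(row)          # load
    return store


def elt(lines):
    """Extract, Load, Transform: raw rows are stored first and cleaned afterwards."""
    store = list(lines)                # extract and load the raw data unchanged
    cleaned = [row for row in map(transform, store) if row is not None]
    return store, cleaned


etl_store = etl(raw_records)
elt_store, elt_view = elt(raw_records)
print(etl_store)
print(len(elt_store), elt_view)
```

In the ETL variant the invalid row never reaches storage, while the ELT variant keeps all raw rows and derives the cleaned view later.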
Traditional Approach
• An enterprise uses a single computer to store and process its data.
• Limitations:
• The single processor becomes a bottleneck as the data grows.
• One machine cannot scale to store and handle huge, growing amounts of data.
• Google’s solution:
• Solved the processing bottleneck with a programming model called MapReduce.
• MapReduce divides a task into small parts, assigns them to many computers, and combines their results.
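The idea of splitting a task into parts and combining the results can be sketched as a word count in plain Python. This is an illustration of the map, shuffle, and reduce steps on a single machine, not Hadoop's API; the function names and the sample text are assumptions.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


# Each chunk stands in for the share of input handled by one computer.
chunks = ["big data needs big clusters", "spark and hadoop process big data"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts["big"], word_counts["data"])
```

Because each chunk is mapped independently, the map step could run on different machines; only the shuffle needs to bring matching keys together.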
Hadoop Overview
• Building on the solution published by Google, Doug Cutting and his team developed an open-source project called Hadoop.
Hadoop Overview
• MapReduce: a framework for distributed data processing.
• Maps data to key/value pairs.
• Reduces intermediate results to the final output.
• Largely supplanted by Spark these days.
• YARN (Yet Another Resource Negotiator): manages cluster resources for multiple data processing frameworks.
• HDFS (Hadoop Distributed File System): distributes data blocks across the cluster in a redundant manner.
Spark Overview
• Hadoop MapReduce must persist data back to disk after every Map or Reduce step.
• This disk I/O slows processing down.
• Spark is a distributed processing framework for big data.
• Apache Spark is popular for its speed. It can run up to 100 times faster than Hadoop MapReduce in memory and up to ten times faster on disk, because it keeps intermediate data in memory (RAM).
• Supports Java, Scala, Python, and R.
Spark Components
How Spark Works
• Spark applications run as independent sets of processes on a cluster.
• Executors run computations and store data.
• The Spark Context sends application code and tasks to the executors.
• Cluster manager: YARN.
Spark Context vs SQL Context vs Hive Context vs Spark Session
• Spark 1.x provided three entry points:
• Spark Context:
• The entry point of every Spark application.
• Creating a Spark Context is the first step to using RDDs and connecting to a Spark cluster.
• SQL Context:
• Used for Spark SQL execution and structured data processing.
• Hive Context:
• Used by the application to communicate with Hive.
• Spark 2.x introduced the Spark Session:
• A single entry point that combines the Spark Context, SQL Context, and Hive Context.
Resilient Distributed Dataset (RDD) & DataFrame
• The RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark.
• A DataFrame is organized into named columns.
• DataFrames support APIs such as select, agg, sum, and avg.
• DataFrames support Spark SQL.
• DataFrames are optimized by the Catalyst optimizer.
• Both are fault-tolerant, immutable distributed collections: once created, they cannot be changed.
Different types of Evaluation
• Eager evaluation:
• An expression is evaluated as soon as it is encountered. This is the strategy used by most programming languages.
• Lazy evaluation:
• The evaluation of an expression is delayed until its value is needed.
• In Spark, you can apply as many transformations as you want, but execution does not start until an action is called.
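Lazy evaluation can be observed in plain Python with generators. The sketch below is an analogy only and does not use Spark; the generator functions stand in for transformations, `list(...)` stands in for an action, and the `log` list is a hypothetical device for showing when work actually happens.

```python
log = []


def double(values):
    """A 'transformation': described now, executed only when values are requested."""
    for v in values:
        log.append(f"double({v})")
        yield v * 2


def keep_large(values, threshold):
    """Another lazy 'transformation' chained on top of the first."""
    for v in values:
        log.append(f"check({v})")
        if v > threshold:
            yield v


pipeline = keep_large(double([1, 2, 3]), threshold=3)
steps_before_action = len(log)      # nothing has run yet

result = list(pipeline)             # the 'action' forces evaluation
steps_after_action = len(log)
print(steps_before_action, result, steps_after_action)
```

Building the pipeline records no work at all; every step runs only once the result is requested.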
Transformations & Actions
• Transformations are instructions that derive a new DataFrame from an existing one; they are executed lazily.
• Narrow transformations (each output partition depends on a single input partition):
• select
• filter
• withColumn
• Wide transformations (data is shuffled across partitions):
• groupBy
• repartition
• Actions ask for a result to be computed immediately; they are evaluated eagerly.
• show, collect, save, count.
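The difference between narrow and wide transformations comes down to whether rows must move between partitions. The following plain-Python sketch models partitions as lists; it is not Spark code, and the sample data, the function names, and the toy partitioner are all assumptions made for illustration.

```python
from collections import defaultdict

# Two partitions of (state, cases) rows; the data is made up for illustration.
partitions = [
    [("kerala", 10), ("goa", 0), ("delhi", 7)],
    [("delhi", 5), ("kerala", 3)],
]


def narrow_filter(partition):
    """Narrow: the output partition depends on exactly one input partition."""
    return [row for row in partition if row[1] > 0]


def wide_group_sum(parts, num_partitions=2):
    """Wide: rows with the same key must be moved (shuffled) to the same partition."""
    shuffled = [defaultdict(int) for _ in range(num_partitions)]
    for part in parts:
        for state, cases in part:
            target = sum(map(ord, state)) % num_partitions   # deterministic toy partitioner
            shuffled[target][state] += cases
    return [dict(p) for p in shuffled]


filtered = [narrow_filter(p) for p in partitions]   # no data leaves its partition
grouped = wide_group_sum(filtered)                   # data crosses partitions
totals = {k: v for part in grouped for k, v in part.items()}
print(totals)
```

The filter can run on each partition in isolation, whereas the grouping has to bring both "delhi" rows together before it can add them up.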
Spark’s Catalyst Optimizer
• As transformations are applied, Spark records them in a Directed Acyclic Graph (DAG).
• Once the DAG is constructed, Spark’s Catalyst optimizer applies a set of rule-based and cost-based optimizations to determine a logical plan and then a physical plan of execution.
• The Catalyst optimizer groups operations together, reducing the number of passes over the data and improving performance.
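Grouping operations to reduce passes over the data can be pictured with a small plain-Python analogy. This is not how Catalyst is implemented (it rewrites query plans); the helper `fuse` and the two row functions are hypothetical and only show why one fused pass gives the same answer as two separate passes.

```python
from functools import reduce


def fuse(*steps):
    """Combine several per-row functions into one, so the data is traversed once."""
    return lambda row: reduce(lambda value, step: step(value), steps, row)


def add_one(x):
    return x + 1


def times_ten(x):
    return x * 10


data = [1, 2, 3]

# Unfused: two passes over the data, with an intermediate list in between.
two_pass = [times_ten(y) for y in [add_one(x) for x in data]]

# Fused: one pass, no intermediate list.
plan = fuse(add_one, times_ten)
one_pass = [plan(x) for x in data]
print(one_pass)
```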
© Presidio, Inc. All rights reserved. Proprietary and Confidential.
Spark Hands-on
Spark Assignment
• Input:
• COVID data CSV file
• Expected outputs:
• Convert all state names to lowercase.
• Find the day with the highest number of COVID cases.
• Find the state with the second-largest number of COVID cases.
• Find the Union Territory with the lowest number of deaths.
• Find the state with the lowest ratio of deaths to total confirmed cases.
• Find the month with the most newly recovered cases.
• Display the month by name; for example, 02 should display as February.
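As a starting point, two of the expected outputs (lowercase state names and month names) can be prototyped in plain Python before moving to Spark. The sketch below assumes a hypothetical schema with `date`, `state`, and `confirmed` columns and an ISO date format; the real assignment file is not shown in the slides, so the column names and sample rows are placeholders.

```python
import calendar
import csv
import io

# Hypothetical sample; the real file and its column names are not shown in the slides.
sample_csv = """date,state,confirmed
2021-02-14,Kerala,120
2021-03-01,Delhi,95
"""

rows = []
for record in csv.DictReader(io.StringIO(sample_csv)):
    month_number = record["date"].split("-")[1]           # e.g. "02"
    rows.append({
        "state": record["state"].lower(),                  # state names in lowercase
        "month": calendar.month_name[int(month_number)],   # "02" -> "February"
        "confirmed": int(record["confirmed"]),
    })

print(rows)
```

The same two steps map naturally onto DataFrame column expressions once the file is loaded in Spark.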
Hive Overview
Apache Hive
• Uses familiar SQL syntax (HiveQL).
• Scalable: works with “big data” on a cluster.
• Most appropriate for data warehouse applications.
• Easy OLAP queries, far easier than writing MapReduce in Java.
• Interactive and highly optimized.
Other Big Data Frameworks
Apache Pig:
• Pig introduces Pig Latin, a scripting language that lets you use SQL-like syntax to define your map and reduce steps.
Apache HBase:
• Non-relational, petabyte-scale database.
• In-memory, based on Google’s Bigtable, on top of HDFS.
Presto:
• It can connect to many different “big data” databases and data stores at once, and query across them.
• Interactive queries at the petabyte scale.
Apache Zeppelin:
• Interactively run scripts/code against your data.
Questions
