2. About me
• I’m Vishal Periyasamy Rajendran
• Senior Data Engineer
• Focused on architecting and developing big data solutions on the AWS cloud.
• 8x AWS certified, plus other certifications in Azure, Snowflake, etc.
• You can find me on:
• LinkedIn: https://www.linkedin.com/in/vishal-p-2703a9131/
• Medium: https://medium.com/@vishalrv1904
3. Agenda
• Big data Overview
• Dimensions of Big data
• Traditional approach and limitations
• Hadoop Overview
• Spark Overview
• Hive Overview
• Other Big data frameworks
5. What is Big data?
• Each user with a smartphone generates approximately 40 exabytes of data every month.
• According to Forbes, 2.5 quintillion bytes of data are created every day.
6. What is Big data?
• A collection of data so huge and complex that no traditional data management tool can store or process it.
8. 6 V's of Big data
• Volume
• The scale of data.
• Velocity
• Speed of data.
• Variety
• Diversity of data.
• Veracity
• Accuracy of data.
• Value
• Insights gained from data.
• Variability
• How often data can change.
10. Big Data Phases
• Data collection
• Data Cleansing / Validation
• Data Transformation
• Data Storage
• Data Visualization
Different Pipelines:
• ETL (Extract, Transform, Load)
• ELT (Extract, Load, Transform)
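The phases above can be sketched as a tiny ETL pipeline in plain Python; the city/temperature data and function names here are invented purely for illustration:

```python
import csv
import io

# Hypothetical raw input; the records are made up for this sketch.
RAW = """city,temp_c
Chennai,31
Bengaluru,
Mumbai,29
"""

def extract(text):
    """Collection: parse raw CSV rows into dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def cleanse(rows):
    """Cleansing/validation: drop rows with missing readings."""
    return [r for r in rows if r["temp_c"]]

def transform(rows):
    """Transformation: convert Celsius strings to Fahrenheit floats."""
    return [{"city": r["city"], "temp_f": float(r["temp_c"]) * 9 / 5 + 32}
            for r in rows]

def load(rows, sink):
    """Storage: append the processed rows to a destination (here, a list)."""
    sink.extend(rows)
    return sink

warehouse = []
load(transform(cleanse(extract(RAW))), warehouse)
print(warehouse)
```

An ELT pipeline would simply reorder the calls: land the raw rows in the sink first, then cleanse and transform them there.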
12. Traditional Approach
• An enterprise would use a single computer to store and process big data.
• Limitations:
• The single processor becomes a bottleneck when processing the data.
• A single machine cannot scale to huge amounts of data.
13. Traditional Approach
• Google’s Solution:
• Solved the processor
problem using an
algorithm called
MapReduce.
• Divides the task into small
parts and assigns them to
many computers.
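The idea can be sketched on a single machine in plain Python; the real framework distributes the map, shuffle, and reduce steps across many computers. This toy word count is illustrative only:

```python
from collections import defaultdict
from functools import reduce

def map_phase(line):
    # Map: emit a (key, value) pair for every word.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a final count.
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

lines = ["big data big ideas", "big cluster"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 3, 'data': 1, 'ideas': 1, 'cluster': 1}
```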
15. Hadoop Overview
• Using the solution provided by Google, Doug Cutting and his team developed an open-source project called Hadoop.
16. Hadoop Overview
• MapReduce:
• Framework for distributed data processing.
• Maps data to key/value pairs.
• Reduces intermediate results to final output.
• Largely supplanted by Spark these days.
• YARN (Yet Another Resource Negotiator):
• Manages cluster resources for multiple data processing frameworks.
• HDFS (Hadoop Distributed File System):
• Distributes data blocks across clusters in a redundant manner.
18. Spark Overview
• Hadoop MapReduce must persist data back to disk after every Map or Reduce step, which makes processing slow.
• Spark is a distributed processing framework for big data.
• Apache Spark is popular for its speed: because it processes data in memory (RAM), it can run up to 100 times faster in memory and up to ten times faster on disk than Hadoop MapReduce.
• Supports Java, Scala, Python, and R.
20. How Spark Works
• Spark applications run as independent processes on a cluster.
• Executors run computations and store data.
• The Spark context sends application code and tasks to the executors.
• A cluster manager (e.g., YARN) allocates resources.
21. Spark Context vs SQL Context vs Hive Context vs Spark Session
• Spark 1.x introduced three entry points:
• Spark Context:
• The entry point of every Spark application.
• The first step to use RDDs and connect to a Spark cluster.
• SQL Context:
• Used for Spark SQL execution and structured data processing.
• Hive Context:
• Used by the application to communicate with Hive.
22. Spark Context vs SQL Context vs Hive Context vs Spark Session
• Spark 2.x introduced the Spark session.
• Spark Session:
• A combination of the Spark context, SQL context, and Hive context.
23. Resilient Distributed Dataset (RDD) & DataFrame
• RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark.
• A DataFrame is organized into named columns.
• DataFrames support APIs such as select, agg, sum, avg, etc.
• DataFrames support Spark SQL.
• The Catalyst Optimizer is available for DataFrames.
• Both are fault-tolerant, immutable distributed collections of objects, which means they cannot be changed once created.
24. Different types of Evaluation
• Eager Evaluation:
• The evaluation strategy you are probably most familiar with; used in most programming languages.
• Lazy Evaluation:
• An evaluation strategy that delays the evaluation of an expression until its value is needed.
• Lazy evaluation means you can apply as many transformations as you want, but Spark will not start executing the process until an action is called.
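The eager-vs-lazy contrast can be seen in plain Python (an analogy, not Spark itself): a list comprehension evaluates eagerly, while a generator delays work until a terminal step pulls results through, much like a Spark action:

```python
log = []

def trace(x):
    # Record every evaluation so we can see *when* work happens.
    log.append(x)
    return x * 2

numbers = [1, 2, 3]

# Eager: the work happens immediately.
eager = [trace(n) for n in numbers]
assert log == [1, 2, 3]

log.clear()

# Lazy: building the pipeline does no work yet...
lazy = (trace(n) for n in numbers)
assert log == []            # nothing evaluated so far

# ...until a terminal step (the "action") pulls results through.
result = list(lazy)
assert log == [1, 2, 3]
print(result)  # [2, 4, 6]
```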
25. Transformation & Actions
• Transformations are the instructions you use to modify the DataFrame in the way you want; they are lazily executed.
• Narrow transformations:
• select
• filter
• withColumn
• Wide transformations:
• groupBy
• repartition
• Actions are statements that ask for a value to be computed immediately; they are eager.
• show, collect, save, count.
26. Spark’s Catalyst Optimizer
• When performing different transformations,
Spark will store them in a Directed Acyclic
Graph (or DAG).
• Once the DAG is constructed, Spark’s catalyst
optimizer will perform a set of rule-based
and cost-based optimizations to determine
a logical and then physical plan of execution.
• Spark’s Catalyst optimizer will group
operations together, reducing the number of
passes on data and improving performance.
28. Spark Assignment
• Input:
• COVID data CSV file
• Expected outputs:
• Convert all state names to lowercase.
• The day that had the greatest number of COVID cases.
• The state with the second-largest number of COVID cases.
• The Union Territory with the least number of deaths.
• The state with the lowest death to total-confirmed-cases ratio.
• The month with the most newly recovered cases (if the month is 02, it should display as February).
30. Apache Hive
• Uses familiar SQL syntax (HiveQL).
• Scalable: works with “big data” on a cluster.
• Really most appropriate for data warehouse applications.
• Easy OLAP queries, WAY easier than writing MapReduce in Java.
• Interactive and highly optimized.
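A HiveQL sketch of the kind of OLAP query described above; the table and column names are hypothetical:

```sql
-- Hypothetical warehouse table; names are illustrative only.
CREATE TABLE covid_cases (state STRING, confirmed INT, deaths INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- A typical OLAP-style query: far shorter than hand-written MapReduce.
SELECT state, SUM(confirmed) AS total_confirmed
FROM covid_cases
GROUP BY state
ORDER BY total_confirmed DESC
LIMIT 5;
```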
32. Other Big Data Frameworks
• Apache Pig:
• Introduces Pig Latin, a scripting language that lets you use SQL-like syntax to define your map and reduce steps.
• Apache HBase:
• Non-relational, petabyte-scale database.
• In-memory; based on Google’s Bigtable, on top of HDFS.
• Presto:
• Can connect to many different “big data” databases and data stores at once, and query across them.
• Interactive queries at petabyte scale.
• Apache Zeppelin:
• Interactively run scripts/code against your data.