Big Data Training - Amazon EMR
About me
• I’m Vishal Periyasamy Rajendran
• Senior Data Engineer
• Focused on architecting and developing big data solutions on the AWS cloud.
• 8x AWS certifications, plus other certifications on Azure, Snowflake, etc.
• You can find me on
  • LinkedIn: https://www.linkedin.com/in/vishal-p-2703a9131/
  • Medium: https://medium.com/@vishalrv1904
Amazon EMR
Agenda
• EMR Overview
• EMR Fundamental blocks
• Launch types of EMR
• EMR Storage
• EMR Managed Scaling
• EMR Security
• EMR Pricing
• Hands-on
What is EMR?
EMR stands for Elastic MapReduce.
Managed Hadoop framework that runs on EC2 instances.
Includes Spark, HBase, Presto, Hive, and more.
Integrates with several other AWS services.
Basic blocks of EMR
• Master node:
The master node (called the primary node in current AWS documentation) manages the cluster and typically runs the master components of distributed applications.
Major services such as the Spark History Server, the YARN ResourceManager, and the HDFS NameNode run on the master node. The YARN NodeManager runs on the core and task nodes.
• Core node:
A node with software components
that run tasks and store data in the
Hadoop Distributed File System (HDFS)
on your cluster.
Multi-node clusters have at least one
core node.
• Task node:
A node with software components that only runs tasks. You can use task nodes to add capacity for parallel computation on data, such as Hadoop MapReduce tasks and Spark executors.
Task nodes do not run the DataNode daemon and do not store data in HDFS.
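The three node types above can be sketched as an instance-group layout. This is a minimal illustration, assuming the dictionary shape that boto3's `run_job_flow` accepts under `Instances.InstanceGroups`; the instance type, group names, and the choice of Spot for task nodes are hypothetical and not taken from the slides.

```python
def build_instance_groups(core_count, task_count, instance_type="m5.xlarge"):
    """Return an InstanceGroups list: one master, N core, optional task nodes."""
    if core_count < 1:
        # The slides note that multi-node clusters have at least one core node.
        raise ValueError("a multi-node cluster needs at least one core node")
    groups = [
        {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes hold no HDFS data, so losing one does not lose data
        # (using Spot here is an assumption made for this sketch).
        groups.append(
            {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": instance_type, "InstanceCount": task_count})
    return groups


groups = build_instance_groups(core_count=2, task_count=3)
print([g["InstanceRole"] for g in groups])
```

The list would typically be passed as part of the `Instances` argument when creating a cluster; the sketch stops at building the structure so that it runs without AWS credentials.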
Launch types of EMR
• EMR on EKS.
• EMR Serverless (announced in preview in November 2021).
• EMR on EC2 instances.
  • Instance groups
  • Instance fleets
EMR Storage
HDFS
• Hadoop Distributed File System
• Multiple copies stored across cluster instances
for redundancy
• Files stored as blocks (128MB default size)
• Ephemeral: HDFS data is lost when the cluster is terminated!
• Still useful for caching intermediate results and for workloads with significant random I/O
• Hadoop tries to process data on the node where it is stored in HDFS
Local file system:
• Suitable only for temporary data (buffers, caches, etc.)
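To make the HDFS block and replication points concrete, here is a small arithmetic sketch. The 128 MB block size comes from the slide; the replication factor is an assumption passed in as a parameter, since the actual value depends on the cluster configuration.

```python
import math

BLOCK_SIZE_MB = 128  # default HDFS block size quoted on the slide


def hdfs_footprint(file_size_mb, replication=3):
    """Return (number of blocks, raw storage in MB) for one file.

    `replication` is an assumed value for illustration only.
    """
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb


blocks, raw_mb = hdfs_footprint(1000, replication=2)
print(blocks, raw_mb)
```

A 1000 MB file spans 8 blocks (the last one only partly filled), and with two copies it occupies 2000 MB across the cluster's core nodes.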
EMRFS:
• Access S3 as if it were HDFS
• Allows persistent storage after cluster
termination
• EMRFS Consistent View: optional feature for S3 consistency
  • Uses DynamoDB to track consistency
  • May need to tune read/write capacity on DynamoDB
• Since December 2020, S3 is strongly consistent, so Consistent View is generally no longer needed.
EMR Scaling
EMR Automatic Scaling:
• The older approach
• Custom scaling rules based on CloudWatch metrics
• Supports instance groups only.
EMR Managed Scaling:
• Supports instance groups and instance fleets
• Scales Spot, On-Demand, and Savings Plan instances within the same cluster
• Available for YARN-based workloads such as Spark and Hive
Scale-up Strategy
• First adds core nodes, then task nodes, up to the maximum units specified
Scale-down Strategy
• First removes task nodes, then core nodes, going no lower than the minimum constraints
Spot nodes are always removed before On-Demand instances
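The minimum and maximum limits mentioned in the scaling strategies can be expressed as a managed scaling policy. The sketch below assumes the `ComputeLimits` structure used by boto3's `put_managed_scaling_policy`; the helper name and the numbers are hypothetical.

```python
def managed_scaling_policy(min_units, max_units,
                           max_core_units=None, max_on_demand_units=None):
    """Build a ManagedScalingPolicy dict with basic sanity checks."""
    if not 0 < min_units <= max_units:
        raise ValueError("need 0 < min_units <= max_units")
    limits = {
        "UnitType": "Instances",
        "MinimumCapacityUnits": min_units,   # scale-down never goes below this
        "MaximumCapacityUnits": max_units,   # scale-up never goes above this
    }
    if max_core_units is not None:
        # Caps core nodes; capacity beyond this is added as task nodes.
        limits["MaximumCoreCapacityUnits"] = max_core_units
    if max_on_demand_units is not None:
        # Caps On-Demand capacity; the remainder can be Spot.
        limits["MaximumOnDemandCapacityUnits"] = max_on_demand_units
    return {"ComputeLimits": limits}


policy = managed_scaling_policy(2, 10, max_core_units=4)
print(policy)
```

Capping core capacity is one way to keep most of the elastic capacity in task nodes, which matches the scale-down order described above.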
EMR Security
• EMRFS
  • S3 encryption (SSE or CSE) at rest
  • TLS in transit between EMR nodes and S3
• S3
  • SSE-S3, SSE-KMS
• Local disk encryption
• Spark communication between drivers and executors is encrypted when in-transit encryption is enabled
• Hive communication between the Glue metastore and EMR uses TLS
• Force HTTPS (TLS) in S3 bucket policies with the aws:SecureTransport condition key.
• IAM roles and policies.
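The `aws:SecureTransport` point can be illustrated with a bucket policy that denies any request not made over TLS. This is a minimal sketch; the bucket name and the statement ID are placeholders.

```python
import json


def deny_insecure_transport_policy(bucket_name):
    """Return a bucket policy that rejects requests made without TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",  # hypothetical statement ID
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
            # The condition key is the string "false" when the request is plain HTTP.
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }


policy = deny_insecure_transport_policy("example-emr-data-bucket")
print(json.dumps(policy, indent=2))
```

Because the statement is a Deny, it overrides any Allow granted elsewhere, so EMR nodes reading or writing this bucket have to use HTTPS.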
EMR Pricing
• Amazon EMR on Amazon EC2:
  • The Amazon EMR price is added to the Amazon EC2 price (the price for the underlying servers) and the Amazon Elastic Block Store (Amazon EBS) price (if you attach Amazon EBS volumes). All of these are billed per second, with a one-minute minimum.
• Amazon EMR on Amazon EKS:
  • The Amazon EMR price is added to the Amazon EKS price and the price of any other services used with EKS. You can run EKS on AWS using either EC2 or AWS Fargate.
• Amazon EMR Serverless:
  • There are no upfront costs, and you pay only for the resources you use: the vCPU, memory, and storage consumed by your applications.
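The EMR on EC2 pricing rule (EMR price added to EC2 and EBS, billed per second with a one-minute minimum) can be worked through as simple arithmetic. All rates below are made-up figures, and treating EBS as a flat hourly amount per instance is a simplifying assumption for this sketch.

```python
def emr_on_ec2_cost(runtime_seconds, instance_count,
                    ec2_hourly, emr_hourly, ebs_hourly=0.0):
    """Estimate the cost of a cluster run under per-second billing."""
    billed_seconds = max(runtime_seconds, 60)  # one-minute minimum
    hourly_per_instance = ec2_hourly + emr_hourly + ebs_hourly
    return instance_count * hourly_per_instance * billed_seconds / 3600


# Four instances for 30 minutes at hypothetical rates.
cost = emr_on_ec2_cost(runtime_seconds=1800, instance_count=4,
                       ec2_hourly=0.20, emr_hourly=0.05)
print(round(cost, 4))
```

A run of 10 seconds is charged as 60 seconds, which is why very short jobs cost the same as a one-minute job.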
© Presidio, Inc. All rights reserved. Proprietary and Confidential.
Questions
Amazon EMR
Hands-on
EMR Cluster Hands-on
• EMR portal overview
• EMR cluster creation overview
• SSH into the cluster.
• Running applications
  • Spark shell
  • spark-submit
  • EMR step
  • EMR Notebook
• Logs overview
Spark Deployment Modes
• Client Mode (diagram not included)
• Cluster Mode (diagram not included)
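The two deployment modes differ in where the Spark driver runs: in client mode it runs in the process that invokes `spark-submit` (for example on the master node), while in cluster mode it runs inside the YARN application master on a cluster node. The sketch below only assembles the command line; the helper name and the script path are hypothetical.

```python
def spark_submit_command(app_path, deploy_mode="cluster", app_args=()):
    """Build a spark-submit argument list for a YARN cluster."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return ["spark-submit", "--master", "yarn",
            "--deploy-mode", deploy_mode, app_path, *app_args]


cmd = spark_submit_command("s3://example-bucket/jobs/etl.py",
                           deploy_mode="client", app_args=["2024-01-01"])
print(" ".join(cmd))
```

Client mode is convenient for interactive work because driver output appears in the local terminal; cluster mode is the usual choice for EMR steps, since the job does not depend on the submitting session staying alive.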
Spark Memory Allocation
• Storage Memory:
  • Mainly used to store Spark cache data, such as RDD cache, unroll data, and so on.
• Execution Memory:
  • Mainly used to store temporary data during the computation of shuffles, joins, sorts, aggregations, etc.
• User Memory:
  • Mainly used to store the data needed for RDD conversion operations, such as RDD dependency information.
• Reserved Memory:
  • Reserved for the system and used to store Spark’s internal objects.
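The four regions above can be sized with a short calculation. This sketch assumes Spark's unified memory model with its default settings (300 MB reserved, `spark.memory.fraction` of 0.6, `spark.memory.storageFraction` of 0.5); the heap size in the example is arbitrary.

```python
RESERVED_MB = 300  # fixed reserved memory in Spark's unified memory model


def memory_regions(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Split an executor heap (in MB) into the four regions."""
    usable = heap_mb - RESERVED_MB
    if usable <= 0:
        raise ValueError("heap must be larger than the reserved memory")
    unified = usable * memory_fraction    # storage + execution together
    storage = unified * storage_fraction  # initial storage share
    return {
        "reserved": RESERVED_MB,
        "user": usable - unified,
        "storage": storage,
        "execution": unified - storage,
    }


regions = memory_regions(4396)
for name, size in regions.items():
    print(f"{name}: {size:.1f} MB")
```

The storage and execution figures are starting points rather than hard limits, because each side can borrow unused memory from the other at run time.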
EMR Bootstrap
• Use a bootstrap action to install additional
software or customize the configuration of
cluster instances
• Bootstrap actions are scripts that run on
the cluster after Amazon EMR launches
the instance using the Amazon Linux
Amazon Machine Image (AMI).
• Bootstrap actions run before Amazon EMR
installs the applications that you specify
when you create the cluster and before
cluster nodes begin processing data.
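A bootstrap action is declared as a name plus a script location and optional arguments. The sketch below assumes the `BootstrapActions` entry shape accepted by boto3's `run_job_flow`; the script path, the package names, and the `s3://` check are illustrative choices, not requirements from the slides.

```python
def bootstrap_action(name, script_s3_path, args=()):
    """Build one BootstrapActions entry for cluster creation."""
    if not script_s3_path.startswith("s3://"):
        # Assumption for this sketch: the script is stored in S3.
        raise ValueError("expected an s3:// path to the bootstrap script")
    return {
        "Name": name,
        "ScriptBootstrapAction": {"Path": script_s3_path, "Args": list(args)},
    }


action = bootstrap_action(
    "install-python-libs",
    "s3://example-bucket/bootstrap/install_libs.sh",  # hypothetical script
    args=["pandas", "pyarrow"],
)
print(action)
```

Because bootstrap actions run on every instance before the applications are installed, the script is a good place for operating-system packages and Python libraries that the Spark job imports.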
EMR Spark Configuration
• spark.dynamicAllocation.enabled
• spark.executor.memory
• spark.driver.memory
• spark.driver.memoryOverhead
• spark.executor.memoryOverhead
• spark.driver.cores
• spark.executor.instances
• Spark arguments:
  • --num-executors
  • --executor-memory
  • --executor-cores
  • --py-files
  • --packages
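Several of the properties listed above can be set cluster-wide through the `spark-defaults` configuration classification. This is a minimal sketch assuming the `Configurations` structure EMR accepts at cluster creation; the memory sizes are placeholder values, not recommendations.

```python
def spark_defaults(executor_memory="4g", driver_memory="2g",
                   executor_overhead="512m", dynamic_allocation=True):
    """Build a Configurations list that sets spark-defaults properties."""
    return [{
        "Classification": "spark-defaults",
        "Properties": {
            # EMR configuration values are strings, including booleans.
            "spark.dynamicAllocation.enabled": str(dynamic_allocation).lower(),
            "spark.executor.memory": executor_memory,
            "spark.driver.memory": driver_memory,
            "spark.executor.memoryOverhead": executor_overhead,
        },
    }]


config = spark_defaults(executor_memory="8g")
print(config[0]["Properties"])
```

Settings given this way act as defaults for every job on the cluster, while the `spark-submit` arguments listed above override them for a single run.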
EMR Hands-On
• Write data to S3 using an EMR Spark application.
• Write data to RDS PostgreSQL using an EMR Spark application.
• Write data to S3 using an EMR Spark Streaming application that reads from Kinesis.
EMR Assignments
• Explore different file formats:
  • CSV
  • JSON
  • Avro
  • ORC
  • Parquet
• Explore different compression codecs:
  • ZIP
  • GZIP
  • BZIP2
  • Snappy
EMR Assignments
• Create an S3 bucket and configure a Lambda function as the trigger for every new object creation.
• The Lambda function should receive the event from S3 and submit a step to the EMR cluster with the required arguments.
• The EMR Spark application should read the file from S3 and add metadata columns such as the load datetime.
• After the transformation, the output DataFrame should be stored in a target S3 bucket.
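One possible starting point for the Lambda part of this assignment is sketched below. It assumes the standard S3 event notification layout and the step structure used by boto3's `add_job_flow_steps`; the cluster ID, script path, and target path are placeholders, and the optional `emr_client` parameter exists only so the function can be exercised without AWS access.

```python
from urllib.parse import unquote_plus

SCRIPT_PATH = "s3://example-bucket/jobs/add_metadata.py"  # hypothetical
TARGET_PATH = "s3://example-target-bucket/output/"        # hypothetical
CLUSTER_ID = "j-EXAMPLE"                                  # hypothetical


def steps_from_event(event, script_path=SCRIPT_PATH, target_path=TARGET_PATH):
    """Turn each S3 record in the event into one spark-submit step."""
    steps = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        steps.append({
            "Name": f"process-{key}",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         script_path, f"s3://{bucket}/{key}", target_path],
            },
        })
    return steps


def lambda_handler(event, context, emr_client=None):
    """Submit one EMR step per new object; returns the service response."""
    steps = steps_from_event(event)
    if not steps:
        return {"StepIds": []}
    if emr_client is None:
        import boto3  # available in the Lambda Python runtime
        emr_client = boto3.client("emr")
    return emr_client.add_job_flow_steps(JobFlowId=CLUSTER_ID, Steps=steps)


sample_event = {"Records": [{"s3": {"bucket": {"name": "example-source"},
                                    "object": {"key": "incoming/my+file.csv"}}}]}
steps = steps_from_event(sample_event)
print(steps[0]["HadoopJarStep"]["Args"])
```

The Spark script named in the step would receive the source object path and the target path as its two arguments, which covers the "required arguments" mentioned in the assignment.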
EMR Assignments
• Create a Spark Streaming application with Kinesis as the input.
• Perform real-time inserts, updates, and deletes on the RDS database.
Feedback
