Lessons Learned : Scaling Hadoop and Big Data in Cloud



                                     Vijay Rayapati
                                     @amnigos
Big Data
Data Keeps Growing
What can you do with Data?
Problem : Local Intelligence
High Level Architecture

[Diagram: Multiple Data Sources -> Ingest -> Data Storage -> Processing (Hadoop Jobs) -> Output Datasets -> Location Intelligence -> API]
Our Challenges
•  Multiple data sources – social, retail, events, news, census, location etc.

•  Spatial data analysis and querying – location overlay on data

•  Temporal nature of the input datasets

•  Large input data sets and hundreds of GB compressed inputs for jobs

•  Complex processing and business logic based on use cases

•  Custom output data formats – JSON, XML, XLS, flat files etc.
Why Amazon EMR?



I am interested in using Hadoop to solve business problems
and not in building and managing Hadoop infrastructure!
	
  

Scalable Storage - S3

Flexible Computing – EC2

No Hadoop Management – EMR
Amazon EMR - Service Architecture
How to move existing data to Cloud?

Data size       Options
10's of GB      Direct upload; any S3 tools
100's of GB     Any S3 tools; Tsunami
> Terabyte      AWS Import/Export; Aspera, Tsunami
Solution Architecture

[Diagram: Multiple Data Sources -> Ingest (EC2) -> Data Storage (S3) -> Processing, Hadoop Jobs (EMR) -> Output Datasets (S3) -> Location Intelligence (EC2) -> API (EC2)]
Amazon EMR – Setup
Launching a fully configured 500-node cluster is as simple as executing one command.

       > elastic-mapreduce --create --alive --plain-output \
           --master-instance-type m1.xlarge --slave-instance-type m2.2xlarge \
           --num-instances 500 --name "Site Analytics Cluster" \
           --bootstrap-action s3://com.bcb11.emr/scripts/bootstrap-custom.sh \
           --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
           --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
           --args "--mapred-config-file,s3://com.bcb11.emr/conf/custom-mapred-site.xml"

       > elastic-mapreduce -j ${jobflow} --stream --step-name "Profile Analyzer" \
           --jobconf mapred.task.timeout=0 \
           --mapper s3://com.bcb11.emr/code/mapper.rb \
           --reducer s3://com.bcb11.emr/bin/reducer.rb \
           --cache s3://com.bcb11.emr/cache/customdata.dat#data.txt \
           --input s3://com.bcb11.emr/input/ \
           --output s3://com.bcb11.emr/output
EMR Map Reduce Jobs
	
  
Amazon EMR supports streaming, custom jar, Cascading, Pig and Hive.

Streaming – Write MapReduce jobs in any scripting language.

Custom Jar – Written in Java; good for speed and control.

Cascading, Hive and Pig – Higher level of abstraction.

Use the AWS EMR forums if you need help.
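The streaming option can be sketched as follows. Hadoop Streaming feeds input lines to the mapper on standard input and expects tab-separated key/value lines back; the reducer then receives those lines sorted by key. The deck's own jobs were Ruby scripts (mapper.rb and reducer.rb), so this Python word-count version, including the function names `map_lines` and `reduce_lines`, is an illustrative assumption rather than the author's code.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each key.

    Hadoop Streaming delivers mapper output to the reducer sorted by key,
    so consecutive lines that share a key can be grouped together.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{key}\t{total}"


# Local simulation of map -> sort -> reduce on two sample lines.
sample = ["Big data big", "data"]
for out in reduce_lines(sorted(map_lines(sample))):
    print(out)
```

In a real streaming step each function would be wired to `sys.stdin` and `print` in its own script. Being able to run the same logic locally on a sample also fits the deck's later advice to debug jobs outside EMR.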
  
Hadoop and EMR – Lessons Learned
EMR – Good, Bad and Ugly
Great for bootstrapping large clusters and very cost-effective for transient clusters.

Most patches are applied and Amazon creates new AMIs with improvements – but not for everything.

Intermittent network issues – these can sometimes cause serious degradation of performance.

Network/Disk IO varies by instance type, and streaming jobs will be much more sluggish on EMR than on a dedicated setup.

Be ready to face variable performance in the cloud.
  
Hadoop and EMR – Jobs
Use a local Hadoop setup for debugging your jobs – there is no easy way on EMR.

Capture EMR cluster metrics - always bootstrap with Ganglia.

High JVM memory allocation leads to long GC pauses.

Don't trust EMR's tuned settings for Hadoop configurations.

Benchmark on a small cluster for data points.
  
Hadoop and EMR – Jobs performance
	
  
GC overhead - increase memory and reduce the JVM reuse task count.

Avoid read contention at S3 – have at least as many files in S3 as available mappers.

Use mapred output compression to save storage, processing time and bandwidth costs.

Set the mapred task timeout to 0 if you have long-running jobs (> 10 mins); you can also disable speculative execution.

Always benchmark third party libraries used in your job code before putting them in production – too much sluggish stuff out there.
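The advice on S3 input files (at least one file per mapper, but not a flood of tiny files) can be sketched as a planning step that groups many small inputs into a fixed number of roughly equal bundles. The helper names `plan_bundles` and `choose_bundle_count` and the greedy packing heuristic are assumptions for illustration; the deck does not say how its files were combined.

```python
import heapq


def choose_bundle_count(num_files, available_mappers):
    """Keep at least one input file per mapper, but never plan more
    bundles than there are files to put in them."""
    return max(1, min(num_files, available_mappers))


def plan_bundles(file_sizes, num_bundles):
    """Group files into num_bundles bundles of roughly equal total size.

    file_sizes: dict mapping file name -> size in bytes.
    Greedy heuristic: place the largest remaining file into the bundle
    that is currently the smallest.
    """
    if num_bundles < 1:
        raise ValueError("num_bundles must be at least 1")
    # Heap entries are (total_size, bundle_index); the lightest pops first.
    heap = [(0, i) for i in range(num_bundles)]
    heapq.heapify(heap)
    bundles = [[] for _ in range(num_bundles)]
    for name, size in sorted(file_sizes.items(), key=lambda kv: -kv[1]):
        total, index = heapq.heappop(heap)
        bundles[index].append(name)
        heapq.heappush(heap, (total + size, index))
    return bundles
```

For example, 1000 small files and 100 available mappers would be planned as 100 bundles; actually concatenating and uploading the bundles is left out of the sketch.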
  
	
  
	
  
Hadoop – High Level Tuning

                                                       	
  	
  
Small files problem – avoid too many small files in S3

Tune your settings – JVM Reuse, Sort Buffer, Sort Factor, Map/Reduce Tasks, Parallel Copies, MapRed Output Compression etc.

Know what is limiting you at a node level – CPU, Memory, Disk IO or Network IN/OUT

Good thing is that you can use a small cluster and sample input size for tuning
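The settings named above map onto Hadoop configuration properties. The sketch below renders a few of them as a mapred-site.xml fragment, the kind of file the configure-hadoop bootstrap action is pointed at earlier in the deck. The property names are the Hadoop 1.x-era ones; every value is a placeholder assumption and not a recommendation, since the deck's point is that values should come from benchmarking on a small cluster.

```python
import xml.etree.ElementTree as ET

# Hadoop 1.x-era property names for the knobs listed on the slide.
# The values are placeholders for illustration, not tuned settings.
EXAMPLE_SETTINGS = {
    "mapred.job.reuse.jvm.num.tasks": "1",          # JVM reuse
    "io.sort.mb": "200",                            # sort buffer (MB)
    "io.sort.factor": "50",                         # sort factor
    "mapred.tasktracker.map.tasks.maximum": "4",    # map tasks per node
    "mapred.tasktracker.reduce.tasks.maximum": "2", # reduce tasks per node
    "mapred.reduce.parallel.copies": "10",          # parallel copies
    "mapred.output.compress": "true",               # output compression
}


def render_mapred_site(settings):
    """Render a dict of property name -> value as mapred-site.xml text."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


print(render_mapred_site(EXAMPLE_SETTINGS))
```

Keeping the settings in one dictionary makes it easy to generate a variant per benchmark run and compare job times between them.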
  
Performance Tuning Golden Rules

When you are operating at very large scale,
even 10 ms makes a big difference!


Example: Moving away from Simple-Json to Jackson

JSON Parsing – 600 ms

Optimized Parsing – 500 ms

Number of input JSON records – 3 million

Time saved by simple optimization – 84 hrs of savings
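The savings figure can be checked with a short calculation. This assumes the 600 ms and 500 ms figures are per-record parsing times, which the slide does not state explicitly but which is what makes the total come out near the quoted 84 hours.

```python
def hours_saved(records, before_ms, after_ms):
    """Total time saved, in hours, when the per-record cost drops."""
    saved_ms = records * (before_ms - after_ms)
    return saved_ms / (1000 * 60 * 60)


saved = hours_saved(records=3_000_000, before_ms=600, after_ms=500)
print(f"{saved:.1f} hours")  # prints "83.3 hours"
```

A 100 ms gain per record across 3 million records is about 83.3 hours of compute, which the slide rounds up to 84.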
  
We have seen improvements from 10x to
100x in our production clusters –
significant money savings.
Lesson Learned – Saving Time
Hadoop job with complex business logic operating on 350 MB input size

Job Language                 Cluster Size    Input Files                  Processing Time
Ruby                         6 m1.xlarge     1000                         184 mins
Java                         6 m1.xlarge     1000                         69 mins
Java                         6 m1.xlarge     100 (1000 files combined)    39 mins
Java (EMR tuned)             6 m1.xlarge     100 (1000 files combined)    25 mins
Java (EMR and code tuned)    6 m1.xlarge     100 (1000 files combined)    13 mins
  
Lesson Learned – Saving Cost
A data mining job in production with 50 GB compressed input data

Job Language                 Cluster Size     Processing Time    Each Job Cost    100 Jobs Cost Per Month
Ruby                         50 m2.2xlarge    240 mins           $242             $24200
Java                         20 m1.xlarge     200 mins           $68              $6800
Java (EMR tuned)             20 m1.xlarge     165 mins           $50              $5000
Java (EMR and code tuned)    20 m1.xlarge     50 mins            $17              $1700
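The monthly figures in the table follow directly from the per-job cost and the 100 jobs per month. The sketch below reproduces them and adds the savings against the Ruby baseline. The per-job costs are copied from the table; the function names are hypothetical, and no instance pricing is assumed.

```python
# (label, cost per job in USD) taken from the cost table.
JOB_COSTS = [
    ("Ruby", 242),
    ("Java", 68),
    ("Java (EMR tuned)", 50),
    ("Java (EMR and code tuned)", 17),
]
JOBS_PER_MONTH = 100


def monthly_costs(job_costs, jobs_per_month):
    """Return {label: monthly cost} for each configuration."""
    return {label: cost * jobs_per_month for label, cost in job_costs}


def savings_vs_baseline(job_costs, jobs_per_month):
    """Monthly savings of each configuration relative to the first row."""
    monthly = monthly_costs(job_costs, jobs_per_month)
    baseline = monthly[job_costs[0][0]]
    return {label: baseline - cost for label, cost in monthly.items()}


for label, saving in savings_vs_baseline(JOB_COSTS, JOBS_PER_MONTH).items():
    print(f"{label}: ${saving} saved per month")
```

On these numbers the fully tuned Java job costs $1700 a month against $24200 for the Ruby baseline, a difference of $22500.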
  
EMR Cost Optimization
	
  
Use a small dedicated/transient cluster

Leverage spot instances for Task Nodes

Optimize, profile and tune your code always – code first and config next

Tune EMR configuration based on historical jobs data

Always benchmark third party libraries
  	
  
Q & A
Like what we do? – connect with me
         Kuliza.com | vijay.rayapati@kuliza.com | @kuliza




                             vijay.rayapati@kuliza.com
                             @amnigos

More Related Content

What's hot

Scaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceScaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceDataWorks Summit
 
Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Ted Dunning
 
How to Stop Worrying and Start Caching in Java
How to Stop Worrying and Start Caching in JavaHow to Stop Worrying and Start Caching in Java
How to Stop Worrying and Start Caching in Javasrisatish ambati
 
TriHUG - Beyond Batch
TriHUG - Beyond BatchTriHUG - Beyond Batch
TriHUG - Beyond Batchboorad
 
Database Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureDatabase Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureKyong-Ha Lee
 
Deep Learning Computer Build
Deep Learning Computer BuildDeep Learning Computer Build
Deep Learning Computer BuildPetteriTeikariPhD
 
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...Herman Wu
 
How to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterHow to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterAltoros
 
MyCassandra (Full English Version)
MyCassandra (Full English Version)MyCassandra (Full English Version)
MyCassandra (Full English Version)Shun Nakamura
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceCloudera, Inc.
 
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....Jeffrey Breen
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanNarayana B
 
MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workl...
MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workl...MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workl...
MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workl...Shun Nakamura
 
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaHadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaCloudera, Inc.
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Yahoo Developer Network
 

What's hot (20)

Scaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceScaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter Experience
 
Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Drill at the Chug 9-19-12
Drill at the Chug 9-19-12
 
hadoop_module6
hadoop_module6hadoop_module6
hadoop_module6
 
How to Stop Worrying and Start Caching in Java
How to Stop Worrying and Start Caching in JavaHow to Stop Worrying and Start Caching in Java
How to Stop Worrying and Start Caching in Java
 
TriHUG - Beyond Batch
TriHUG - Beyond BatchTriHUG - Beyond Batch
TriHUG - Beyond Batch
 
Database Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureDatabase Research on Modern Computing Architecture
Database Research on Modern Computing Architecture
 
Deep Learning Computer Build
Deep Learning Computer BuildDeep Learning Computer Build
Deep Learning Computer Build
 
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
 
How to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterHow to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop Cluster
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
 
Exploiting GPUs in Spark
Exploiting GPUs in SparkExploiting GPUs in Spark
Exploiting GPUs in Spark
 
MyCassandra (Full English Version)
MyCassandra (Full English Version)MyCassandra (Full English Version)
MyCassandra (Full English Version)
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
 
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_Plan
 
Hadoop on VMware
Hadoop on VMwareHadoop on VMware
Hadoop on VMware
 
MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workl...
MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workl...MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workl...
MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workl...
 
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaHadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
 

Viewers also liked

Drupal Project Lifecycle
Drupal Project LifecycleDrupal Project Lifecycle
Drupal Project LifecycleAndy Pemberton
 
4 a mdc planificación (1)
4 a mdc planificación (1)4 a mdc planificación (1)
4 a mdc planificación (1)Lizeth Sànchez
 
04 lecture outline 1 semester (first year)
04 lecture outline  1 semester (first year)04 lecture outline  1 semester (first year)
04 lecture outline 1 semester (first year)Dr UAK
 
Mision y vision espoch
Mision y vision espochMision y vision espoch
Mision y vision espochLuis Peralvo
 
Andrea's Letter of Recommendation
Andrea's Letter of RecommendationAndrea's Letter of Recommendation
Andrea's Letter of RecommendationAndrea O'Briant
 
Causing Economic Development Action with Effective Communications
Causing Economic Development Action with Effective CommunicationsCausing Economic Development Action with Effective Communications
Causing Economic Development Action with Effective CommunicationsChrista Ouderkirk Franzi
 
Fuller Center Bicycle Adventure - What's My Risk?
Fuller Center Bicycle Adventure - What's My Risk?Fuller Center Bicycle Adventure - What's My Risk?
Fuller Center Bicycle Adventure - What's My Risk?Ryan Iafigliola
 
Top 10 Software Engineering Practices You Might Not Known
Top 10 Software Engineering Practices You Might Not KnownTop 10 Software Engineering Practices You Might Not Known
Top 10 Software Engineering Practices You Might Not KnownMatt Harasymczuk
 
Economic Development Strategies for Los Angeles
Economic Development Strategies for Los AngelesEconomic Development Strategies for Los Angeles
Economic Development Strategies for Los AngelesHR&A Advisors
 
Growing Local Economies
Growing Local EconomiesGrowing Local Economies
Growing Local EconomiesJim Damicis
 
WSO2Con EU 2016: WSO2 Cloud and Platform as a Service Strategy
WSO2Con EU 2016: WSO2 Cloud and Platform as a Service StrategyWSO2Con EU 2016: WSO2 Cloud and Platform as a Service Strategy
WSO2Con EU 2016: WSO2 Cloud and Platform as a Service StrategyWSO2
 
Santa Monica Civic Center Mixed Use Arts & Cultural District
Santa Monica Civic Center Mixed Use Arts & Cultural DistrictSanta Monica Civic Center Mixed Use Arts & Cultural District
Santa Monica Civic Center Mixed Use Arts & Cultural DistrictHR&A Advisors
 
Sustainability Essay on Costco
Sustainability Essay on CostcoSustainability Essay on Costco
Sustainability Essay on CostcoDominic DeMicco
 

Viewers also liked (16)

Drupal Project Lifecycle
Drupal Project LifecycleDrupal Project Lifecycle
Drupal Project Lifecycle
 
4 a mdc planificación (1)
4 a mdc planificación (1)4 a mdc planificación (1)
4 a mdc planificación (1)
 
04 lecture outline 1 semester (first year)
04 lecture outline  1 semester (first year)04 lecture outline  1 semester (first year)
04 lecture outline 1 semester (first year)
 
Mision y vision espoch
Mision y vision espochMision y vision espoch
Mision y vision espoch
 
Andrea's Letter of Recommendation
Andrea's Letter of RecommendationAndrea's Letter of Recommendation
Andrea's Letter of Recommendation
 
Trabaj de excel
Trabaj de excelTrabaj de excel
Trabaj de excel
 
Causing Economic Development Action with Effective Communications
Causing Economic Development Action with Effective CommunicationsCausing Economic Development Action with Effective Communications
Causing Economic Development Action with Effective Communications
 
Fuller Center Bicycle Adventure - What's My Risk?
Fuller Center Bicycle Adventure - What's My Risk?Fuller Center Bicycle Adventure - What's My Risk?
Fuller Center Bicycle Adventure - What's My Risk?
 
Top 10 Software Engineering Practices You Might Not Known
Top 10 Software Engineering Practices You Might Not KnownTop 10 Software Engineering Practices You Might Not Known
Top 10 Software Engineering Practices You Might Not Known
 
Lectura de plano civilfree.com
Lectura de plano civilfree.comLectura de plano civilfree.com
Lectura de plano civilfree.com
 
Economic Development Strategies for Los Angeles
Economic Development Strategies for Los AngelesEconomic Development Strategies for Los Angeles
Economic Development Strategies for Los Angeles
 
Growing Local Economies
Growing Local EconomiesGrowing Local Economies
Growing Local Economies
 
Oil Sands 101
Oil Sands 101Oil Sands 101
Oil Sands 101
 
WSO2Con EU 2016: WSO2 Cloud and Platform as a Service Strategy
WSO2Con EU 2016: WSO2 Cloud and Platform as a Service StrategyWSO2Con EU 2016: WSO2 Cloud and Platform as a Service Strategy
WSO2Con EU 2016: WSO2 Cloud and Platform as a Service Strategy
 
Santa Monica Civic Center Mixed Use Arts & Cultural District
Santa Monica Civic Center Mixed Use Arts & Cultural DistrictSanta Monica Civic Center Mixed Use Arts & Cultural District
Santa Monica Civic Center Mixed Use Arts & Cultural District
 
Sustainability Essay on Costco
Sustainability Essay on CostcoSustainability Essay on Costco
Sustainability Essay on Costco
 

Similar to Lessons Learned Scaling Hadoop and Big Data in the Cloud

Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRVijay Rayapati
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduceAmazon Web Services
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesVladimir Simek
 
Lighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkLighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkAmazon Web Services
 
Big data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivBig data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivAmazon Web Services
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Sujee Maniyam
 
Re invent 2018 meetup presentation
Re invent 2018 meetup presentationRe invent 2018 meetup presentation
Re invent 2018 meetup presentationEliran Yamin
 
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAmazon Web Services
 
Cloud Architectures - Jinesh Varia - GrepTheWeb
Cloud Architectures - Jinesh Varia - GrepTheWebCloud Architectures - Jinesh Varia - GrepTheWeb
Cloud Architectures - Jinesh Varia - GrepTheWebjineshvaria
 
Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Robert Grossman
 
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...Amazon Web Services
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...Flink Forward
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRIsrael AWS User Group
 
R Jobs on the Cloud
R Jobs on the CloudR Jobs on the Cloud
R Jobs on the CloudJohn Doxaras
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudQubole
 
Getting Started with Amazon Aurora
Getting Started with Amazon AuroraGetting Started with Amazon Aurora
Getting Started with Amazon AuroraAmazon Web Services
 

Similar to Lessons Learned Scaling Hadoop and Big Data in the Cloud (20)

Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutes
 
Lighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkLighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache Spark
 
Big data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivBig data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel Aviv
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2
 
Re invent 2018 meetup presentation
Re invent 2018 meetup presentationRe invent 2018 meetup presentation
Re invent 2018 meetup presentation
 
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
 
Cloud Architectures - Jinesh Varia - GrepTheWeb
Cloud Architectures - Jinesh Varia - GrepTheWebCloud Architectures - Jinesh Varia - GrepTheWeb
Cloud Architectures - Jinesh Varia - GrepTheWeb
 
Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)
 
4K Media Workflows on AWS
4K Media Workflows on AWS4K Media Workflows on AWS
4K Media Workflows on AWS
 
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMR
 
ImpalaToGo use case
ImpalaToGo use caseImpalaToGo use case
ImpalaToGo use case
 
R Jobs on the Cloud
R Jobs on the CloudR Jobs on the Cloud
R Jobs on the Cloud
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
 
Getting Started with Amazon Aurora
Getting Started with Amazon AuroraGetting Started with Amazon Aurora
Getting Started with Amazon Aurora
 

More from Vijay Rayapati

Botmetric Product Design Process
Botmetric Product Design ProcessBotmetric Product Design Process
Lessons Learned Scaling Hadoop and Big Data in the Cloud

  • 1. Lessons Learned: Scaling Hadoop and Big Data in Cloud. Vijay Rayapati, @amnigos
  • 4. What can you do with data?
  • 5. Problem: Local Intelligence
  • 6. High Level Architecture (diagram): Multiple Data Sources, Ingest, Data Storage, Processing (Hadoop Jobs), Output Datasets, Location Intelligence, API
  • 7. Our Challenges
      • Multiple data sources – social, retail, events, news, census, location etc.
      • Spatial data analysis and querying – location overlay on data
      • Temporal nature of the input datasets
      • Large input data sets and hundreds of GB of compressed inputs for jobs
      • Complex processing and business logic based on use cases
      • Custom output data formats – JSON, XML, XLS, flat files etc.
  • 8. Why Amazon EMR? I am interested in using Hadoop to solve business problems and not in building and managing Hadoop infrastructure!
      • Scalable Storage – S3
      • Flexible Computing – EC2
      • No Hadoop Management – EMR
  • 9. Amazon EMR - Service Architecture
  • 10. How to move existing data to Cloud?
      10's of GB        100's of GB       > Terabyte
      Direct Upload     Any S3 tools      AWS Import/Export
      Any S3 tools      Tsunami           Aspera, Tsunami
  • 11. Solution Architecture (diagram): Multiple Data Sources, Ingest (EC2), Data Storage (S3), Processing / Hadoop Jobs (EMR), Output Datasets (S3), Location Intelligence (EC2), API (EC2)
  • 12. Amazon EMR – Setup
      Launching a fully configured 500-node cluster is as simple as executing one command.
      > elastic-mapreduce --create --alive --plain-output --master-instance-type m1.xlarge --slave-instance-type m2.2xlarge --num-instances 500 --name "Site Analytics Cluster" --bootstrap-action s3://com.bcb11.emr/scripts/bootstrap-custom.sh --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "--mapred-config-file,s3://com.bcb11.emr/conf/custom-mapred-site.xml"
      > elastic-mapreduce -j ${jobflow} --stream --step-name "Profile Analyzer" --jobconf mapred.task.timeout=0 --mapper s3://com.bcb11.emr/code/mapper.rb --reducer s3://com.bcb11.emr/bin/reducer.rb --cache s3://com.bcb11.emr/cache/customdata.dat#data.txt --input s3://com.bcb11.emr/input/ --output s3://com.bcb11.emr/output
  • 13. EMR Map Reduce Jobs
      Amazon EMR supports streaming, custom jar, Cascading, Pig and Hive.
      • Streaming – write Map Reduce jobs in any scripting language.
      • Custom Jar – written in Java; good for speed and control.
      • Cascading, Hive and Pig – higher level of abstraction.
      Use the AWS EMR forums if you need help.
  • 14. Hadoop and EMR – Lessons Learned
  • 15. EMR – Good, Bad and Ugly
      • Great for bootstrapping large clusters and very cost-effective for transient clusters.
      • Most patches are applied and Amazon creates new AMIs with improvements, but not for everything.
      • Intermittent network issues can sometimes cause serious performance degradation.
      • Network and disk IO vary by instance type, and streaming jobs will be much more sluggish on EMR than on a dedicated setup.
      • Be ready to face variable performance in the Cloud.
  • 16. Hadoop and EMR – Jobs
      • Use a local Hadoop setup for debugging your jobs – there is no easy way on EMR.
      • Capture EMR cluster metrics – always bootstrap with Ganglia.
      • High JVM memory allocation leads to long GC pauses.
      • Don't trust the EMR-tuned settings of Hadoop configurations.
      • Benchmark on a small cluster for data points.
  • 17. Hadoop and EMR – Jobs performance
      • GC overhead – increase memory and reduce the number of JVM reuse tasks.
      • Avoid read contention at S3 – have at least as many files in S3 as there are available mappers.
      • Use mapred output compression to save storage, processing time and bandwidth costs.
      • Set mapred task timeout to 0 if you have long-running jobs (> 10 mins), and consider disabling speculative execution.
      • Always benchmark third-party libraries used in your job code before putting them in production – there is a lot of sluggish code out there.
  • 18. Hadoop – High Level Tuning
      • Small files problem – avoid too many small files in S3.
      • Tune your settings – JVM Reuse, Sort Buffer, Sort Factor, Map/Reduce Tasks, Parallel Copies, MapRed Output Compression etc.
      • Good thing is that you can use a small cluster and a sample input size for tuning.
      • Know what is limiting you at a node level – CPU, Memory, Disk IO or Network IN/OUT.
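The settings named on slides 17 and 18 map onto properties in a Hadoop 1.x-style mapred-site.xml, such as the custom-mapred-site.xml passed to configure-hadoop on slide 12. Below is a minimal sketch of generating such a file. The property names are the Hadoop 1.x names as best understood here, and every value is a hypothetical placeholder rather than a setting taken from the talk; the helper `render_mapred_site` is also an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hadoop 1.x-era property names for the knobs listed on the slides.
# All values are illustrative placeholders, not recommendations.
TUNING = {
    "mapred.job.reuse.jvm.num.tasks": "1",                 # JVM reuse
    "io.sort.mb": "200",                                   # sort buffer (MB)
    "io.sort.factor": "50",                                # sort factor
    "mapred.reduce.parallel.copies": "20",                 # parallel copies
    "mapred.compress.map.output": "true",                  # compress map output
    "mapred.output.compress": "true",                      # compress job output
    "mapred.task.timeout": "0",                            # no task timeout
    "mapred.map.tasks.speculative.execution": "false",     # speculative maps off
    "mapred.reduce.tasks.speculative.execution": "false",  # speculative reduces off
}


def render_mapred_site(props):
    """Return a Hadoop-style <configuration> document as a string."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = render_mapred_site(TUNING)
print(xml_text)
```

Keeping the values in one dictionary makes it straightforward to benchmark one change at a time on a small cluster, as the slides suggest.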
  • 19. Performance Tuning Golden Rules
      When you are operating at very large scale, even 10 ms makes a big difference!
      Example: moving away from Simple-Json to Jackson
      • JSON parsing – 600 ms
      • Optimized parsing – 500 ms
      • Number of input JSON records – 3 million
      • Time saved by simple optimization – 84 hrs of savings
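The arithmetic behind slide 19 can be checked with a short sketch. It assumes the 600 ms and 500 ms figures are per-record parse times and that the saving is summed over all records as cumulative processing time; the function name is illustrative only.

```python
def hours_saved(records, before_ms, after_ms):
    """Cumulative time saved, in hours, when each record gets faster."""
    saved_ms = (before_ms - after_ms) * records
    return saved_ms / 1000 / 3600  # ms -> seconds -> hours


saved = hours_saved(records=3_000_000, before_ms=600, after_ms=500)
print(f"{saved:.1f} hours saved")
```

Under these assumptions the result is a little over 83 hours, in line with the roughly 84 hours quoted on the slide.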
  • 20. We have seen improvements from 10x to 100x in our production clusters – significant money savings.
  • 21. Lesson Learned – Saving Time
      Hadoop job with complex business logic operating on 350 MB input size
      Job Language                 Cluster Size    Input Files                   Processing Time
      Ruby                         6 m1.xlarge     1000                          184 mins
      Java                         6 m1.xlarge     1000                          69 mins
      Java                         6 m1.xlarge     100 (1000 files combined)     39 mins
      Java (EMR tuned)             6 m1.xlarge     100 (1000 files combined)     25 mins
      Java (EMR and Code tuned)    6 m1.xlarge     100 (1000 files combined)     13 mins
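To put the slide 21 timings in relative terms, the sketch below computes each run's speedup over the Ruby baseline. The processing times are copied from the slide; the row labels are shortened here for illustration.

```python
# (label, processing time in minutes), as listed on slide 21
runs = [
    ("Ruby, 1000 files", 184),
    ("Java, 1000 files", 69),
    ("Java, 100 combined files", 39),
    ("Java, EMR tuned", 25),
    ("Java, EMR and code tuned", 13),
]

baseline = runs[0][1]
# Speedup = baseline time divided by this run's time
speedups = [(label, baseline / minutes) for label, minutes in runs]

for label, factor in speedups:
    print(f"{label:28s} {factor:5.1f}x")
```

On these figures, the fully tuned Java job is about 14 times faster than the original Ruby streaming job on the same six-node cluster.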
  • 22. Lesson Learned – Saving Cost
      A data mining job in production with 50 GB compressed input data
      Job Language                 Cluster Size      Processing Time    Each Job Cost    100 Jobs Cost Per Month
      Ruby                         50 m2.2xlarge     240 mins           $242             $24200
      Java                         20 m1.xlarge      200 mins           $68              $6800
      Java (EMR tuned)             20 m1.xlarge      165 mins           $50              $5000
      Java (EMR and Code tuned)    20 m1.xlarge      50 mins            $17              $1700
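The per-job costs on slide 22 are consistent with a simple model: nodes × billed hours × hourly rate. The sketch below is a reconstruction under stated assumptions, not the author's calculation. It assumes usage is billed in whole instance-hours, and the two hourly rates (121 and 85 cents per node-hour) are hypothetical values chosen only because they reproduce the slide's figures.

```python
import math


def job_cost_cents(nodes, minutes, rate_cents_per_node_hour):
    """Cost of one job, assuming each node is billed in whole hours."""
    billed_hours = math.ceil(minutes / 60)
    return nodes * billed_hours * rate_cents_per_node_hour


# Hypothetical per-node hourly rates in cents (assumption, not from the talk)
RATE_M2_2XLARGE = 121
RATE_M1_XLARGE = 85

jobs = {
    "Ruby": job_cost_cents(50, 240, RATE_M2_2XLARGE),
    "Java": job_cost_cents(20, 200, RATE_M1_XLARGE),
    "Java (EMR tuned)": job_cost_cents(20, 165, RATE_M1_XLARGE),
    "Java (EMR and code tuned)": job_cost_cents(20, 50, RATE_M1_XLARGE),
}

for label, cents in jobs.items():
    per_job = cents / 100
    print(f"{label:28s} ${per_job:7.2f} per job, ${per_job * 100:9.2f} per 100 jobs")
```

Under these assumptions the model gives $242, $68 and $17 for three of the rows and $51 for the EMR-tuned row, slightly above the slide's $50. It also shows why dropping from 200 to 165 minutes saves only one billed hour per node, while getting under 60 minutes cuts the bill to a quarter.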
  • 23. EMR Cost Optimization
      • Use a small dedicated/transient cluster.
      • Leverage spot instances for Task Nodes.
      • Always optimize, profile and tune your code – code first and config next.
      • Tune EMR configuration based on historical jobs data.
      • Always benchmark third-party libraries.
  • 24. Q & A
  • 25. Like what we do? Connect with me: Kuliza.com | @kuliza | vijay.rayapati@kuliza.com | @amnigos