SlideShare a Scribd company logo
1 of 27
Revolution Analytics



Leveraging R in Hadoop
  Environments


November 9, 2011



                         1
In Today’s Presentation:
 About Revolution Analytics
 Why R and Hadoop?
 The Packages (rhdfs, rhbase, rmr)
 Examples
Most advanced statistical
analysis software available
                                 The professor who invented analytic software for

Half the cost of                the experts now wants to take it to the masses


commercial alternatives

2M+ Users
                                                                                    Power
 3,000+ Packages


                 Finance
    Statistics
                 Life Sciences
   Predictive    Manufacturing
    Analytics                                                                               Productivity
                 Retail
  Data Mining    Telecom                                 Enterprise
 Visualization
                 Social Media                            Readiness
                 Government
What’s the Difference Between R and
Revolution R Enterprise?
                                 Revolution R is 100% R and More®
               Multi-Threaded      Web-Based       Web Services     Big Data          Parallel
               Math Libraries         GUI              API          Analysis           Tools


 Technical                                                                                       IDE / Developer
  Support                                                                                              GUI




                     3,000+ Community                                            Build
                         Packages                  R Engine                    Assurance
                                               Language Libraries




             For more information contact: info@revolutionanalytics.com
                                                                                                                   4
Let’s Talk about R and Hadoop




                                5
Why R and Hadoop?
 Hadoop - a scalable infrastructure for
 processing massive amounts of data
   Storage – HDFS, HBASE
   Distributed Computing - MapReduce
 R - a statistical programming language
 Need for more than counts and averages
 Analyze all of the data


                                          6
Motivation for this project

 Make it easy for the R programmer to
 interact with the Hadoop data stores and
 write MapReduce programs
 Run R on a massively distributed system
 without having to understand the underlying
 infrastructure
 Statisticians stay focused on the analysis
 Open source

                                               7
R and Hadoop – The R Packages

                                                Capabilities delivered as individual
                        HBASE                   R packages
              HDFS
                                                       rhdfs - R and HDFS
   R
                                   Thrift              rhbase - R and HBASE
 Map or
 Reduce
                                                       rmr - R and MapReduce
 Task                                        rhbase
                    rhdfs
 Node

                                                      Downloads available from
                                  R Client            Github
            Job
          Tracker           rmr




                                                                                   8
rhdfs
 Manipulate HDFS directly from R
 Mimic as much of the HDFS Java API as
 possible
 Examples:
   Read a HDFS text file into a data frame.
   Serialize/Deserialize a model to HDFS
   Write an HDFS file to local storage
   rhdfs/pkg/inst/unitTests
   rhdfs/pkg/inst/examples


                                              9
rhdfs Functions
 File Manipulations - hdfs.copy, hdfs.move, hdfs.rename,
 hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put,
 hdfs.get
 File Read/Write - hdfs.file, hdfs.write, hdfs.close, hdfs.flush,
 hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader,
 hdfs.read.text.file
 Directory - hdfs.dircreate, hdfs.mkdir
 Utility - hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists
 Initialization – hdfs.init, hdfs.defaults




                                                                10
rhbase
 Manipulate HBASE tables and their content
 Uses Thrift C++ API as the mechanism to
 communicate to HBASE
 Examples
   Create a data frame from a collection of rows
   and columns in an HBASE table
   Update an HBASE table with values from a data
   frame
   rhbase/pkg/inst/unitTests


                                               11
rhbase Functions
 Table Manipulation – hb.new.table, hb.delete.table,
 hb.describe.table, hb.set.table.mode, hb.regions.table
 Row Read/Write - hb.insert, hb.get, hb.delete,
 hb.insert.data.frame, hb.get.data.frame, hb.scan
 Utility - hb.list.tables
 Initialization - hb.defaults, hb.init




                                                          12
Writing MapReduce programs in R




                                  13
rmr - For R Programmers
• A way to access big data sets
• A simple way to write parallel programs –
    everyone will have to
•   Very R-like, building on the functional
    characteristics of R
•   Just a library




                                              14
rmr – For MapReduce Developers
• Much simpler than writing Java
• Not as simple as Hive, Pig at what they do,
    but more general
•   Great for prototyping, can transition to
    production -- optimize instead of rewriting!
    Lower risk, always executable.




                                                   15
rmr mapreduce Function
 mapreduce (input, output, map, reduce, …)

    input – input folder
    output – output folder
    map – R function used as map
    reduce – R function used as reduce

    … - other advanced parameters




                                             16
Some Simple Things
Example showing sampling and counting

    map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v)
    reduce = function(k, vv) keyval(k, length(vv))
    mapreduce(input, output, map, reduce)
More Simple Things
 HIVE
  INSERT OVERWRITE TABLE pv_gender_sum
  SELECT pv_users.gender, count (DISTINCT pv_users.userid)
  FROM pv_users
  GROUP BY pv_users.gender;

 rmr
  mapreduce(input =
    mapreduce(input = "pv_users",
      map = function(k, v) keyval(v['userid'], v['gender']),
      reduce = function(k, vv) keyval(k, vv[[1]]),
   output = "pv_gender_sum",
   map = function(k,v) keyval(v, 1)
   reduce = function(k, vv) keyval(k, sum(unlist(vv)))

 Takeaways
     A language like HIVE makes a class of problems easy to solve, but it is not a general tool
     The cost of doing the same operation in rmr is modest and it provides a broader set of capabilities




                                                                                                      18
Complex Things
k-means Clustering




                     19
k-means - Implementation
 Well known design (MacQueen, 1967)
 Comparison of the k-means in MapReduce
   Pig
     From Hortonworks
     Requires coding in 3 languages (Python-Pig-Java)
     100 lines of code
   rmr
         20 lines of only R code




                                                        22
k-means - Highlights


  map = function(k,v)
   keyval(which.min(distances(centers,v)),v)

  reduce = function(k,vv)
   keyval(NULL, col.average(vv))

  centers = from.dfs(
   mapreduce("data-points", map, reduce))




                                               23
k-means - Optimizations
Slow                       Fast                             Notes

for(i in 1:100)            a=b+c                            light use of R interpreter, use
  a[i] = b[i] + c[i]                                        fast vector primitives, C if
                                                            necessary
[ 1, 2, 3, 4, 5]           [[ 1, 2, 3, 4, 5],[6, 7, 8, 9,   use beefier records, say 1k
                           10],[11, 12, 13, 14, 15]...      points per record
distance(center, point)    norm(center - P)                 compute all distances with
                                                            fast matrix operations
combiner = FALSE           combiner = TRUE                  reduce often and early, use
                                                            combiner
keyval(k, mean(…))         keyval(k,                        replace means with (sum,
                               c(total, count))             count) pairs to enable early
                                                            reduction

 https://github.com/RevolutionAnalytics/RHadoop/wiki/Fast-k-means



                                                                                           24
Final thoughts
 R and Hadoop together offer innovation and
 flexibility needed to meet analytics
 challenges of big data
 We need contributors to this project!
   Developers
   Documentation
   Use cases
   General Feedback


                                              25
Resources

 RHadoop Open source project:
 https://github.com/RevolutionAnalytics/RHa
 doop/wiki

 Revolution R Enterprise: bit.ly/Enterprise-R

 Cloudera CDH:
 http://www.cloudera.com/hadoop/

 Email: rhadoop@revolutionanalytics.com


                                                26
Thank you.




            The leading commercial provider of software and support for the popular
                             open source R statistics language.




  www.revolutionanalytics.com            650.330.0553                  Twitter: @RevolutionR




                                                                                               27

More Related Content

What's hot

Introduction to Pig | Pig Architecture | Pig Fundamentals
Introduction to Pig | Pig Architecture | Pig FundamentalsIntroduction to Pig | Pig Architecture | Pig Fundamentals
Introduction to Pig | Pig Architecture | Pig FundamentalsSkillspeed
 
Intro to Cascading (SpringOne2GX)
Intro to Cascading (SpringOne2GX)Intro to Cascading (SpringOne2GX)
Intro to Cascading (SpringOne2GX)Paco Nathan
 
Data Hacking with RHadoop
Data Hacking with RHadoopData Hacking with RHadoop
Data Hacking with RHadoopEd Kohlwey
 
Iaetsd multioperand redundant adders on fpg as
Iaetsd multioperand redundant adders on fpg asIaetsd multioperand redundant adders on fpg as
Iaetsd multioperand redundant adders on fpg asIaetsd Iaetsd
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-servicesSreenu Musham
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labImpetus Technologies
 
Backstage with Drupal localization - Part 1
Backstage with Drupal localization - Part 1Backstage with Drupal localization - Part 1
Backstage with Drupal localization - Part 1Gábor Hojtsy
 

What's hot (8)

Introduction to Pig | Pig Architecture | Pig Fundamentals
Introduction to Pig | Pig Architecture | Pig FundamentalsIntroduction to Pig | Pig Architecture | Pig Fundamentals
Introduction to Pig | Pig Architecture | Pig Fundamentals
 
Intro to Cascading (SpringOne2GX)
Intro to Cascading (SpringOne2GX)Intro to Cascading (SpringOne2GX)
Intro to Cascading (SpringOne2GX)
 
Data Hacking with RHadoop
Data Hacking with RHadoopData Hacking with RHadoop
Data Hacking with RHadoop
 
Iaetsd multioperand redundant adders on fpg as
Iaetsd multioperand redundant adders on fpg asIaetsd multioperand redundant adders on fpg as
Iaetsd multioperand redundant adders on fpg as
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
Chtp415
Chtp415Chtp415
Chtp415
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
Backstage with Drupal localization - Part 1
Backstage with Drupal localization - Part 1Backstage with Drupal localization - Part 1
Backstage with Drupal localization - Part 1
 

Similar to The Powerful Marriage of Hadoop and R (David Champagne)

Revolution Analytics
Revolution AnalyticsRevolution Analytics
Revolution Analyticstempledf
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
Big Data Analysis With RHadoop
Big Data Analysis With RHadoopBig Data Analysis With RHadoop
Big Data Analysis With RHadoopDavid Chiu
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopDataWorks Summit
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAlex Palamides
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
How to use hadoop and r for big data parallel processing
How to use hadoop and r for big data  parallel processingHow to use hadoop and r for big data  parallel processing
How to use hadoop and r for big data parallel processingBryan Downing
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analyticsAvinash Pandu
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopVictoria López
 
Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 

Similar to The Powerful Marriage of Hadoop and R (David Champagne) (20)

Revolution Analytics
Revolution AnalyticsRevolution Analytics
Revolution Analytics
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
Big Data Analysis Starts with R
Big Data Analysis Starts with RBig Data Analysis Starts with R
Big Data Analysis Starts with R
 
Using R with Hadoop
Using R with HadoopUsing R with Hadoop
Using R with Hadoop
 
Hadoop MapReduce
Hadoop MapReduceHadoop MapReduce
Hadoop MapReduce
 
Big Data Analysis With RHadoop
Big Data Analysis With RHadoopBig Data Analysis With RHadoop
Big Data Analysis With RHadoop
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using R
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
MapReduce and NoSQL
MapReduce and NoSQLMapReduce and NoSQL
MapReduce and NoSQL
 
Hackathon bonn
Hackathon bonnHackathon bonn
Hackathon bonn
 
R and-hadoop
R and-hadoopR and-hadoop
R and-hadoop
 
How to use hadoop and r for big data parallel processing
How to use hadoop and r for big data  parallel processingHow to use hadoop and r for big data  parallel processing
How to use hadoop and r for big data parallel processing
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analytics
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 

More from Revolution Analytics

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudRevolution Analytics
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureRevolution Analytics
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudRevolution Analytics
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondRevolution Analytics
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source CommunitiesRevolution Analytics
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with RRevolution Analytics
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceRevolution Analytics
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudRevolution Analytics
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorRevolution Analytics
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalRevolution Analytics
 
Simple Reproducibility with the checkpoint package
Simple Reproducibilitywith the checkpoint packageSimple Reproducibilitywith the checkpoint package
Simple Reproducibility with the checkpoint packageRevolution Analytics
 

More from Revolution Analytics (20)

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the Cloud
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to Azure
 
R in Minecraft
R in Minecraft R in Minecraft
R in Minecraft
 
The case for R for AI developers
The case for R for AI developersThe case for R for AI developers
The case for R for AI developers
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R Then and Now
R Then and NowR Then and Now
R Then and Now
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per Second
 
Reproducible Data Science with R
Reproducible Data Science with RReproducible Data Science with R
Reproducible Data Science with R
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source Communities
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data Science
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the Cloud
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductor
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
 
Simple Reproducibility with the checkpoint package
Simple Reproducibilitywith the checkpoint packageSimple Reproducibilitywith the checkpoint package
Simple Reproducibility with the checkpoint package
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 

Recently uploaded

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

The Powerful Marriage of Hadoop and R (David Champagne)

  • 1. Revolution Analytics Leveraging R in Hadoop Environments November 9, 2011 1
  • 2. In Today’s Presentation: About Revolution Analytics Why R and Hadoop? The Packages (rhdfs, rhbase, rmr) Examples
  • 3. Most advanced statistical analysis software available The professor who invented analytic software for Half the cost of the experts now wants to take it to the masses commercial alternatives 2M+ Users Power  3,000+ Packages Finance Statistics Life Sciences Predictive Manufacturing Analytics Productivity Retail Data Mining Telecom Enterprise Visualization Social Media Readiness Government
  • 4. What’s the Difference Between R and Revolution R Enterprise? Revolution R is 100% R and More® Multi-Threaded Web-Based Web Services Big Data Parallel Math Libraries GUI API Analysis Tools Technical IDE / Developer Support GUI 3,000+ Community Build Packages R Engine Assurance Language Libraries For more information contact: info@revolutionanalytics.com 4
  • 5. Let’s Talk about R and Hadoop 5
  • 6. Why R and Hadoop? Hadoop - a scalable infrastructure for processing massive amounts of data Storage – HDFS, HBASE Distributed Computing - MapReduce R - a statistical programming language Need for more than counts and averages Analyze all of the data 6
  • 7. Motivation for this project Make it easy for the R programmer to interact with the Hadoop data stores and write MapReduce programs Run R on a massively distributed system without having to understand the underlying infrastructure Statisticians stay focused on the analysis Open source 7
  • 8. R and Hadoop – The R Packages Capabilities delivered as individual HBASE R packages HDFS rhdfs - R and HDFS R Thrift rhbase - R and HBASE Map or Reduce rmr - R and MapReduce Task rhbase rhdfs Node Downloads available from R Client Github Job Tracker rmr 8
  • 9. rhdfs Manipulate HDFS directly from R Mimic as much of the HDFS Java API as possible Examples: Read a HDFS text file into a data frame. Serialize/Deserialize a model to HDFS Write an HDFS file to local storage rhdfs/pkg/inst/unitTests rhdfs/pkg/inst/examples 9
  • 10. rhdfs Functions File Manipulations - hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get File Read/Write - hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file Directory - hdfs.dircreate, hdfs.mkdir Utility - hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists Initialization – hdfs.init, hdfs.defaults 10
  • 11. rhbase Manipulate HBASE tables and their content Uses Thrift C++ API as the mechanism to communicate to HBASE Examples Create a data frame from a collection of rows and columns in an HBASE table Update an HBASE table with values from a data frame rhbase/pkg/inst/unitTests 11
  • 12. rhbase Functions Table Manipulation – hb.new.table, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table Row Read/Write - hb.insert, hb.get, hb.delete, hb.insert.data.frame, hb.get.data.frame, hb.scan Utility - hb.list.tables Initialization - hb.defaults, hb.init 12
  • 14. rmr - For R Programmers • A way to access big data sets • A simple way to write parallel programs – everyone will have to • Very R-like, building on the functional characteristics of R • Just a library 14
  • 15. rmr – For MapReduce Developers • Much simpler than writing Java • Not as simple as Hive, Pig at what they do, but more general • Great for prototyping, can transition to production -- optimize instead of rewriting! Lower risk, always executable. 15
  • 16. rmr mapreduce Function mapreduce (input, output, map, reduce, …) input – input folder output – output folder map – R function used as map reduce – R function used as reduce … - other advanced parameters 16
  • 17. Some Simple Things Example showing sampling and counting map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v) reduce = function(k, vv) keyval(k, length(vv)) mapreduce(input, output, map, reduce)
  • 18. More Simple Things HIVE INSERT OVERWRITE TABLE pv_gender_sum SELECT pv_users.gender, count (DISTINCT pv_users.userid) FROM pv_users GROUP BY pv_users.gender; rmr mapreduce(input = mapreduce(input = "pv_users", map = function(k, v) keyval(v['userid'], v['gender']), reduce = function(k, vv) keyval(k, vv[[1]]), output = "pv_gender_sum", map = function(k,v) keyval(v, 1) reduce = function(k, vv) keyval(k, sum(unlist(vv))) Takeaways A language like HIVE makes a class of problems easy to solve, but it is not a general tool The cost of doing the same operation in rmr is modest and it provides a broader set of capabilities 18
  • 20.
  • 21.
  • 22. k-means - Implementation Well known design (MacQueen, 1967) Comparison of the k-means in MapReduce Pig From Hortonworks Requires coding in 3 languages (Python-Pig-Java) 100 lines of code rmr 20 lines of only R code 22
  • 23. k-means - Highlights map = function(k,v) keyval(which.min(distances(centers,v)),v) reduce = function(k,vv) keyval(NULL, col.average(vv)) centers = from.dfs( mapreduce("data-points", map, reduce)) 23
  • 24. k-means - Optimizations Slow Fast Notes for(i in 1:100) a=b+c light use of R interpreter, use a[i] = b[i] + c[i] fast vector primitives, C if necessary [ 1, 2, 3, 4, 5] [[ 1, 2, 3, 4, 5],[6, 7, 8, 9, use beefier records, say 1k 10],[11, 12, 13, 14, 15]... points per record distance(center, point) norm(center - P) compute all distances with fast matrix operations combiner = FALSE combiner = TRUE reduce often and early, use combiner keyval(k, mean(…)) keyval(k, replace means with (sum, c(total, count)) count) pairs to enable early reduction https://github.com/RevolutionAnalytics/RHadoop/wiki/Fast-k-means 24
  • 25. Final thoughts R and Hadoop together offer innovation and flexibility needed to meet analytics challenges of big data We need contributors to this project! Developers Documentation Use cases General Feedback 25
  • 26. Resources RHadoop Open source project: https://github.com/RevolutionAnalytics/RHa doop/wiki Revolution R Enterprise: bit.ly/Enterprise-R Cloudera CDH: http://www.cloudera.com/hadoop/ Email: rhadoop@revolutionanalytics.com 26
  • 27. Thank you. The leading commercial provider of software and support for the popular open source R statistics language. www.revolutionanalytics.com 650.330.0553 Twitter: @RevolutionR 27