SlideShare a Scribd company logo
1 of 27
Revolution Analytics



Leveraging R in Hadoop
  Environments


November 9, 2011



                         1
In Today’s Presentation:
 About Revolution Analytics
 Why R and Hadoop?
 The Packages (rhdfs, rhbase, rmr)
 Examples
Most advanced statistical
analysis software available
                                 The professor who invented analytic software for

Half the cost of                the experts now wants to take it to the masses


commercial alternatives

2M+ Users
                                                                                    Power
 3,000+ Applications


                 Finance
    Statistics
                 Life Sciences
   Predictive    Manufacturing
    Analytics                                                                               Productivity
                 Retail
  Data Mining    Telecom                                 Enterprise
 Visualization
                 Social Media                            Readiness
                 Government
What’s the Difference Between R and
Revolution R Enterprise?
                                 Revolution R is 100% R and More®
               Multi-Threaded      Web-Based       Web Services     Big Data          Parallel
               Math Libraries         GUI              API          Analysis           Tools


 Technical                                                                                       IDE / Developer
  Support                                                                                              GUI




                     4,000+ Community                                            Build
                         Packages                  R Engine                    Assurance
                                               Language Libraries




             For more information contact: info@revolutionanalytics.com
                                                                                                                   4
Let’s Talk about R and Hadoop




                                5
Why R and Hadoop?
 Hadoop - a scalable infrastructure for
 processing massive amounts of data
   Storage – HDFS, HBASE
   Distributed Computing - MapReduce
 R - a statistical programming language
 Need for more than counts and averages
 Analyze all of the data


                                          6
Motivation for this project

 Make it easy for the R programmer to
 interact with the Hadoop data stores and
 write MapReduce programs
 Run R on a massively distributed system
 without having to understand the underlying
 infrastructure
 Statisticians stay focused on the analysis
 Open source

                                               7
R and Hadoop – The R Packages

                                                Capabilities delivered as individual
                        HBASE                   R packages
              HDFS
                                                       rhdfs - R and HDFS
   R
                                   Thrift              rhbase - R and HBASE
 Map or
 Reduce
                                                       rmr - R and MapReduce
 Task                                        rhbase
                    rhdfs
 Node

                                                      Downloads available from
                                  R Client            Github
            Job
          Tracker           rmr




                                                                                   8
rhdfs
 Manipulate HDFS directly from R
 Mimic as much of the HDFS Java API as
 possible
 Examples:
   Read a HDFS text file into a data frame.
   Serialize/Deserialize a model to HDFS
   Write an HDFS file to local storage
   rhdfs/pkg/inst/unitTests
   rhdfs/pkg/inst/examples


                                              9
rhdfs Functions
 File Manipulations - hdfs.copy, hdfs.move, hdfs.rename,
 hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put,
 hdfs.get
 File Read/Write - hdfs.file, hdfs.write, hdfs.close, hdfs.flush,
 hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader,
 hdfs.read.text.file
 Directory - hdfs.dircreate, hdfs.mkdir
 Utility - hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists
 Initialization – hdfs.init, hdfs.defaults




                                                                10
rhbase
 Manipulate HBASE tables and their content
 Uses Thrift C++ API as the mechanism to
 communicate to HBASE
 Examples
   Create a data frame from a collection of rows
   and columns in an HBASE table
   Update an HBASE table with values from a data
   frame
   rhbase/pkg/inst/unitTests


                                               11
rhbase Functions
 Table Manipulation – hb.new.table, hb.delete.table,
 hb.describe.table, hb.set.table.mode, hb.regions.table
 Row Read/Write - hb.insert, hb.get, hb.delete,
 hb.insert.data.frame, hb.get.data.frame, hb.scan
 Utility - hb.list.tables
 Initialization - hb.defaults, hb.init




                                                          12
Writing MapReduce programs in R




                                  13
rmr - For R Programmers
• A way to access big data sets
• A simple way to write parallel programs –
    everyone will have to
•   Very R-like, building on the functional
    characteristics of R
•   Just a library




                                              14
rmr – For MapReduce Developers
• Much simpler than writing Java
• Not as simple as Hive, Pig at what they do,
    but more general
•   Great for prototyping, can transition to
    production -- optimize instead of rewriting!
    Lower risk, always executable.




                                                   15
rmr mapreduce Function
 mapreduce (input, output, map, reduce, …)

    input – input folder
    output – output folder
    map – R function used as map
    reduce – R function used as reduce

    … - other advanced parameters




                                             16
Some Simple Things
Example showing sampling and counting

    map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v)
    reduce = function(k, vv) keyval(k, length(vv))
    mapreduce(input, output, map, reduce)
More Simple Things
 HIVE
  INSERT OVERWRITE TABLE pv_gender_sum
  SELECT pv_users.gender, count (DISTINCT pv_users.userid)
  FROM pv_users
  GROUP BY pv_users.gender;

 rmr
  mapreduce(input =
    mapreduce(input = "pv_users",
      map = function(k, v) keyval(v['userid'], v['gender']),
      reduce = function(k, vv) keyval(k, vv[[1]]),
   output = "pv_gender_sum",
   map = function(k,v) keyval(v, 1)
   reduce = function(k, vv) keyval(k, sum(unlist(vv)))

 Takeaways
     A language like HIVE makes a class of problems easy to solve, but it is not a general tool
     The cost of doing the same operation in rmr is modest and it provides a broader set of capabilities




                                                                                                      18
Complex Things
k-means Clustering




                     19
k-means - Implementation
 Well known design (MacQueen, 1967)
 Comparison of the k-means in MapReduce
   Pig
     From Hortonworks
     Requires coding in 3 languages (Python-Pig-Java)
     100 lines of code
   rmr
         20 lines of only R code




                                                        22
k-means - Highlights


  map = function(k,v)
   keyval(which.min(distances(centers,v)),v)

  reduce = function(k,vv)
   keyval(NULL, col.average(vv))

  centers = from.dfs(
   mapreduce("data-points", map, reduce))




                                               23
k-means - Optimizations
Slow                       Fast                             Notes

for(i in 1:100)            a=b+c                            light use of R interpreter, use
  a[i] = b[i] + c[i]                                        fast vector primitives, C if
                                                            necessary
[ 1, 2, 3, 4, 5]           [[ 1, 2, 3, 4, 5],[6, 7, 8, 9,   use beefier records, say 1k
                           10],[11, 12, 13, 14, 15]...      points per record
distance(center, point)    norm(center - P)                 compute all distances with
                                                            fast matrix operations
combiner = FALSE           combiner = TRUE                  reduce often and early, use
                                                            combiner
keyval(k, mean(…))         keyval(k,                        replace means with (sum,
                               c(total, count))             count) pairs to enable early
                                                            reduction

 https://github.com/RevolutionAnalytics/RHadoop/wiki/Fast-k-means



                                                                                           24
Final thoughts
 R and Hadoop together offer innovation and
 flexibility needed to meet analytics
 challenges of big data
 We need contributors to this project!
   Developers
   Documentation
   Use cases
   General Feedback


                                              25
Resources

 RHadoop Open source project:
 https://github.com/RevolutionAnalytics/RHa
 doop/wiki

 Revolution R Enterprise: bit.ly/Enterprise-R

 Cloudera CDH:
 http://www.cloudera.com/hadoop/

 Email: rhadoop@revolutionanalytics.com


                                                26
Thank you.




            The leading commercial provider of software and support for the popular
                             open source R statistics language.




  www.revolutionanalytics.com            650.330.0553                  Twitter: @RevolutionR




                                                                                               27

More Related Content

What's hot

Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-servicesSreenu Musham
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questionsKalyan Hadoop
 
Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Shivkumar Babshetty
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceHortonworks
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelTakahiro Inoue
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2Giovanna Roda
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)PyData
 
Database Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureDatabase Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureKyong-Ha Lee
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupCloudera, Inc.
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkeldariof
 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.Kyong-Ha Lee
 

What's hot (20)

Using R with Hadoop
Using R with HadoopUsing R with Hadoop
Using R with Hadoop
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduce
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)
 
RHadoop
RHadoopRHadoop
RHadoop
 
Database Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureDatabase Research on Modern Computing Architecture
Database Research on Modern Computing Architecture
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's Group
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce framework
 
Unit 1
Unit 1Unit 1
Unit 1
 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.
 

Viewers also liked

Rユーザのためのspark入門
Rユーザのためのspark入門Rユーザのためのspark入門
Rユーザのためのspark入門Shintaro Fukushima
 
「初めてのWatson」ハンズオン
「初めてのWatson」ハンズオン「初めてのWatson」ハンズオン
「初めてのWatson」ハンズオンKenichi Inoue
 
LTS & ローリングリリース! ― ランキング4位のLinuxディストリビューション openSUSE を Azure で使ってみる
LTS & ローリングリリース! ― ランキング4位のLinuxディストリビューション openSUSE を Azure で使ってみるLTS & ローリングリリース! ― ランキング4位のLinuxディストリビューション openSUSE を Azure で使ってみる
LTS & ローリングリリース! ― ランキング4位のLinuxディストリビューション openSUSE を Azure で使ってみるFuminobu Takeyama
 
今さら聞けない― Linuxコマンドラインツールテクニック (導入編)
今さら聞けない― Linuxコマンドラインツールテクニック (導入編)今さら聞けない― Linuxコマンドラインツールテクニック (導入編)
今さら聞けない― Linuxコマンドラインツールテクニック (導入編)Fuminobu Takeyama
 
そろそろRStudioの話
そろそろRStudioの話そろそろRStudioの話
そろそろRStudioの話Kazuya Wada
 
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料) 40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料) hamaken
 
Transformation or Transition
Transformation or TransitionTransformation or Transition
Transformation or TransitionMike Pounsford
 
Innovative Uses of Technology in International Education
Innovative Uses of Technology in International Education Innovative Uses of Technology in International Education
Innovative Uses of Technology in International Education Marty Bennett
 
モバイルとソーシャルによる「学びの進化」を考える
モバイルとソーシャルによる「学びの進化」を考えるモバイルとソーシャルによる「学びの進化」を考える
モバイルとソーシャルによる「学びの進化」を考えるwebcampusschoo
 
這一打‧好貴
這一打‧好貴這一打‧好貴
這一打‧好貴fudy9015
 
PHP 7 Crash Course
PHP 7 Crash CoursePHP 7 Crash Course
PHP 7 Crash CourseColin O'Dell
 
Understanding Product/Market Fit
Understanding Product/Market FitUnderstanding Product/Market Fit
Understanding Product/Market FitGabor Papp
 
CANENERO Advertising - Gilberto Chiacchiera
CANENERO Advertising - Gilberto ChiacchieraCANENERO Advertising - Gilberto Chiacchiera
CANENERO Advertising - Gilberto Chiacchierabnioceanoblu
 
Unityを使ったVRアプリ作成入門 ABCD2015金沢編
Unityを使ったVRアプリ作成入門 ABCD2015金沢編Unityを使ったVRアプリ作成入門 ABCD2015金沢編
Unityを使ったVRアプリ作成入門 ABCD2015金沢編kinneko
 
[KGC 2013] Online Game Security in China
[KGC 2013] Online Game Security in China[KGC 2013] Online Game Security in China
[KGC 2013] Online Game Security in ChinaSeungmin Shin
 

Viewers also liked (20)

Rユーザのためのspark入門
Rユーザのためのspark入門Rユーザのためのspark入門
Rユーザのためのspark入門
 
Rの高速化
Rの高速化Rの高速化
Rの高速化
 
「初めてのWatson」ハンズオン
「初めてのWatson」ハンズオン「初めてのWatson」ハンズオン
「初めてのWatson」ハンズオン
 
LTS & ローリングリリース! ― ランキング4位のLinuxディストリビューション openSUSE を Azure で使ってみる
LTS & ローリングリリース! ― ランキング4位のLinuxディストリビューション openSUSE を Azure で使ってみるLTS & ローリングリリース! ― ランキング4位のLinuxディストリビューション openSUSE を Azure で使ってみる
LTS & ローリングリリース! ― ランキング4位のLinuxディストリビューション openSUSE を Azure で使ってみる
 
今さら聞けない― Linuxコマンドラインツールテクニック (導入編)
今さら聞けない― Linuxコマンドラインツールテクニック (導入編)今さら聞けない― Linuxコマンドラインツールテクニック (導入編)
今さら聞けない― Linuxコマンドラインツールテクニック (導入編)
 
R高速化
R高速化R高速化
R高速化
 
RをAWSで使おう
RをAWSで使おうRをAWSで使おう
RをAWSで使おう
 
そろそろRStudioの話
そろそろRStudioの話そろそろRStudioの話
そろそろRStudioの話
 
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料) 40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
 
Transformation or Transition
Transformation or TransitionTransformation or Transition
Transformation or Transition
 
Innovative Uses of Technology in International Education
Innovative Uses of Technology in International Education Innovative Uses of Technology in International Education
Innovative Uses of Technology in International Education
 
モバイルとソーシャルによる「学びの進化」を考える
モバイルとソーシャルによる「学びの進化」を考えるモバイルとソーシャルによる「学びの進化」を考える
モバイルとソーシャルによる「学びの進化」を考える
 
這一打‧好貴
這一打‧好貴這一打‧好貴
這一打‧好貴
 
PHP 7 Crash Course
PHP 7 Crash CoursePHP 7 Crash Course
PHP 7 Crash Course
 
Understanding Product/Market Fit
Understanding Product/Market FitUnderstanding Product/Market Fit
Understanding Product/Market Fit
 
CANENERO Advertising - Gilberto Chiacchiera
CANENERO Advertising - Gilberto ChiacchieraCANENERO Advertising - Gilberto Chiacchiera
CANENERO Advertising - Gilberto Chiacchiera
 
Unityを使ったVRアプリ作成入門 ABCD2015金沢編
Unityを使ったVRアプリ作成入門 ABCD2015金沢編Unityを使ったVRアプリ作成入門 ABCD2015金沢編
Unityを使ったVRアプリ作成入門 ABCD2015金沢編
 
[KGC 2013] Online Game Security in China
[KGC 2013] Online Game Security in China[KGC 2013] Online Game Security in China
[KGC 2013] Online Game Security in China
 
Casa Duplex
Casa DuplexCasa Duplex
Casa Duplex
 
ΒΑΚΑΛΟΥΔΗ - ΠΡΟΣΚΛΗΣΗ
ΒΑΚΑΛΟΥΔΗ - ΠΡΟΣΚΛΗΣΗΒΑΚΑΛΟΥΔΗ - ΠΡΟΣΚΛΗΣΗ
ΒΑΚΑΛΟΥΔΗ - ΠΡΟΣΚΛΗΣΗ
 

Similar to Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, Revolution Analytics

Revolution Analytics
Revolution AnalyticsRevolution Analytics
Revolution Analyticstempledf
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on HadoopCarol McDonald
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopDataWorks Summit
 
Integrating R & Hadoop - Text Mining & Sentiment Analysis
Integrating R & Hadoop - Text Mining & Sentiment AnalysisIntegrating R & Hadoop - Text Mining & Sentiment Analysis
Integrating R & Hadoop - Text Mining & Sentiment AnalysisAravind Babu
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAlex Palamides
 
How to use hadoop and r for big data parallel processing
How to use hadoop and r for big data  parallel processingHow to use hadoop and r for big data  parallel processing
How to use hadoop and r for big data parallel processingBryan Downing
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansattilacsordas
 

Similar to Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, Revolution Analytics (20)

Revolution Analytics
Revolution AnalyticsRevolution Analytics
Revolution Analytics
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Big Data Analysis Starts with R
Big Data Analysis Starts with RBig Data Analysis Starts with R
Big Data Analysis Starts with R
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on Hadoop
 
Hadoop MapReduce
Hadoop MapReduceHadoop MapReduce
Hadoop MapReduce
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Integrating R & Hadoop - Text Mining & Sentiment Analysis
Integrating R & Hadoop - Text Mining & Sentiment AnalysisIntegrating R & Hadoop - Text Mining & Sentiment Analysis
Integrating R & Hadoop - Text Mining & Sentiment Analysis
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Hackathon bonn
Hackathon bonnHackathon bonn
Hackathon bonn
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using R
 
R and-hadoop
R and-hadoopR and-hadoop
R and-hadoop
 
How to use hadoop and r for big data parallel processing
How to use hadoop and r for big data  parallel processingHow to use hadoop and r for big data  parallel processing
How to use hadoop and r for big data parallel processing
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
MapReduce and NoSQL
MapReduce and NoSQLMapReduce and NoSQL
MapReduce and NoSQL
 
R on Hadoop
R on HadoopR on Hadoop
R on Hadoop
 
Hadoop workshop
Hadoop workshopHadoop workshop
Hadoop workshop
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticians
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 

Recently uploaded (20)

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 

Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, Revolution Analytics

  • 1. Revolution Analytics Leveraging R in Hadoop Environments November 9, 2011 1
  • 2. In Today’s Presentation: About Revolution Analytics Why R and Hadoop? The Packages (rhdfs, rhbase, rmr) Examples
  • 3. Most advanced statistical analysis software available The professor who invented analytic software for Half the cost of the experts now wants to take it to the masses commercial alternatives 2M+ Users Power  3,000+ Applications Finance Statistics Life Sciences Predictive Manufacturing Analytics Productivity Retail Data Mining Telecom Enterprise Visualization Social Media Readiness Government
  • 4. What’s the Difference Between R and Revolution R Enterprise? Revolution R is 100% R and More® Multi-Threaded Web-Based Web Services Big Data Parallel Math Libraries GUI API Analysis Tools Technical IDE / Developer Support GUI 4,000+ Community Build Packages R Engine Assurance Language Libraries For more information contact: info@revolutionanalytics.com 4
  • 5. Let’s Talk about R and Hadoop 5
  • 6. Why R and Hadoop? Hadoop - a scalable infrastructure for processing massive amounts of data Storage – HDFS, HBASE Distributed Computing - MapReduce R - a statistical programming language Need for more than counts and averages Analyze all of the data 6
  • 7. Motivation for this project Make it easy for the R programmer to interact with the Hadoop data stores and write MapReduce programs Run R on a massively distributed system without having to understand the underlying infrastructure Statisticians stay focused on the analysis Open source 7
  • 8. R and Hadoop – The R Packages Capabilities delivered as individual HBASE R packages HDFS rhdfs - R and HDFS R Thrift rhbase - R and HBASE Map or Reduce rmr - R and MapReduce Task rhbase rhdfs Node Downloads available from R Client Github Job Tracker rmr 8
  • 9. rhdfs Manipulate HDFS directly from R Mimic as much of the HDFS Java API as possible Examples: Read a HDFS text file into a data frame. Serialize/Deserialize a model to HDFS Write an HDFS file to local storage rhdfs/pkg/inst/unitTests rhdfs/pkg/inst/examples 9
  • 10. rhdfs Functions File Manipulations - hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get File Read/Write - hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file Directory - hdfs.dircreate, hdfs.mkdir Utility - hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists Initialization – hdfs.init, hdfs.defaults 10
  • 11. rhbase Manipulate HBASE tables and their content Uses Thrift C++ API as the mechanism to communicate to HBASE Examples Create a data frame from a collection of rows and columns in an HBASE table Update an HBASE table with values from a data frame rhbase/pkg/inst/unitTests 11
  • 12. rhbase Functions Table Manipulation – hb.new.table, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table Row Read/Write - hb.insert, hb.get, hb.delete, hb.insert.data.frame, hb.get.data.frame, hb.scan Utility - hb.list.tables Initialization - hb.defaults, hb.init 12
  • 14. rmr - For R Programmers • A way to access big data sets • A simple way to write parallel programs – everyone will have to • Very R-like, building on the functional characteristics of R • Just a library 14
  • 15. rmr – For MapReduce Developers • Much simpler than writing Java • Not as simple as Hive, Pig at what they do, but more general • Great for prototyping, can transition to production -- optimize instead of rewriting! Lower risk, always executable. 15
  • 16. rmr mapreduce Function mapreduce (input, output, map, reduce, …) input – input folder output – output folder map – R function used as map reduce – R function used as reduce … - other advanced parameters 16
  • 17. Some Simple Things Example showing sampling and counting map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v) reduce = function(k, vv) keyval(k, length(vv)) mapreduce(input, output, map, reduce)
  • 18. More Simple Things HIVE INSERT OVERWRITE TABLE pv_gender_sum SELECT pv_users.gender, count (DISTINCT pv_users.userid) FROM pv_users GROUP BY pv_users.gender; rmr mapreduce(input = mapreduce(input = "pv_users", map = function(k, v) keyval(v['userid'], v['gender']), reduce = function(k, vv) keyval(k, vv[[1]]), output = "pv_gender_sum", map = function(k,v) keyval(v, 1) reduce = function(k, vv) keyval(k, sum(unlist(vv))) Takeaways A language like HIVE makes a class of problems easy to solve, but it is not a general tool The cost of doing the same operation in rmr is modest and it provides a broader set of capabilities 18
  • 20.
  • 21.
  • 22. k-means - Implementation Well known design (MacQueen, 1967) Comparison of the k-means in MapReduce Pig From Hortonworks Requires coding in 3 languages (Python-Pig-Java) 100 lines of code rmr 20 lines of only R code 22
  • 23. k-means - Highlights map = function(k,v) keyval(which.min(distances(centers,v)),v) reduce = function(k,vv) keyval(NULL, col.average(vv)) centers = from.dfs( mapreduce("data-points", map, reduce)) 23
  • 24. k-means - Optimizations Slow Fast Notes for(i in 1:100) a=b+c light use of R interpreter, use a[i] = b[i] + c[i] fast vector primitives, C if necessary [ 1, 2, 3, 4, 5] [[ 1, 2, 3, 4, 5],[6, 7, 8, 9, use beefier records, say 1k 10],[11, 12, 13, 14, 15]... points per record distance(center, point) norm(center - P) compute all distances with fast matrix operations combiner = FALSE combiner = TRUE reduce often and early, use combiner keyval(k, mean(…)) keyval(k, replace means with (sum, c(total, count)) count) pairs to enable early reduction https://github.com/RevolutionAnalytics/RHadoop/wiki/Fast-k-means 24
  • 25. Final thoughts R and Hadoop together offer innovation and flexibility needed to meet analytics challenges of big data We need contributors to this project! Developers Documentation Use cases General Feedback 25
  • 26. Resources RHadoop Open source project: https://github.com/RevolutionAnalytics/RHa doop/wiki Revolution R Enterprise: bit.ly/Enterprise-R Cloudera CDH: http://www.cloudera.com/hadoop/ Email: rhadoop@revolutionanalytics.com 26
  • 27. Thank you. The leading commercial provider of software and support for the popular open source R statistics language. www.revolutionanalytics.com 650.330.0553 Twitter: @RevolutionR 27