(not so) Big Data with R
GUR Fltaur
Matthieu Cornec
Matthieu.cornec@cdiscount.com
10/09/2013
Cdiscount.com - Commark
Outline
• A- Intro
• B- Problem setup
• C- 3 strategies
• D- Packages: RSQLite, ff and biglm, data.sample
• E- Conclusion
1 – Intro
Problem setup
- Your CSV file is too big to import into R, say several tens of GB.
- Typically, your first read.table call fails with the error message
"cannot allocate vector of size XXX"
How to fix it?
It depends on:
- What you want to do (data management with SQL-like queries, data mining, ...)
- Your environment (corporate, with a data warehouse?)
- The size of your data
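A back-of-the-envelope estimate helps explain why read.table fails. The sketch below assumes that every column is stored as a 4-byte integer (or an 8-byte double) and uses round figures of the order of the airline dataset used in this talk (100 million rows, 29 columns). The helper name estimate_gb is hypothetical, and R's per-object overhead is ignored.

```r
# Rough in-memory size of a table, ignoring R's per-object overhead.
# Assumption: all columns share one storage type
# (4 bytes per value for integers, 8 bytes for doubles).
estimate_gb <- function(n_rows, n_cols, bytes_per_value = 4) {
  n_rows * n_cols * bytes_per_value / 1e9
}

int_gb <- estimate_gb(1e8, 29)                       # all-integer storage
dbl_gb <- estimate_gb(1e8, 29, bytes_per_value = 8)  # all-double storage
cat(sprintf("integers: %.1f GB, doubles: %.1f GB\n", int_gb, dbl_gb))
```

Under these assumptions the table alone needs more than 10 GB, before any copy made while parsing or modelling.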
Three basic strategies
• Buy memory in a cloud environment
- Can handle several tens of GB
- Cheap (1.5 euros per hour for 60 GB)
- No need to rewrite all your code
But you need to configure it (see for example )
→ Preferred strategy in most cases
• Try packages for SQL-like needs: ff, RSQLite
- Not limited to RAM (several tens of GB)
But no advanced data mining libraries
And you need to rewrite your code...
• Sampling: data.sample package
Dataset
• http://stat-computing.org/dataexpo/2009/the-data.html
• More than 100 million observations, 12 GB
The data comes originally from RITA where it is described in detail. You can
download the data there, or from the bzipped csv files listed below. These
files have derivable variables removed, are packaged in yearly chunks and
have been more heavily compressed than the originals.
Download individual years:
1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008
29 variables
    Name         Description
1   Year         1987-2008
2   Month        1-12
3   DayofMonth   1-31
4   DayOfWeek    1 (Monday) - 7 (Sunday)
5   DepTime      actual departure time (local, hhmm)
6   ...
1 Import the data files and create a single large CSV file
## Import the data from http://stat-computing.org/dataexpo/2009/the-data.html
for (year in 1987:2008) {
  file.name <- paste(year, "csv.bz2", sep = ".")
  if (!file.exists(file.name)) {
    url.text <- paste("http://stat-computing.org/dataexpo/2009/",
                      year, ".csv.bz2", sep = "")
    cat("Downloading missing data file ", file.name, "\n", sep = "")
    download.file(url.text, file.name, mode = "wb")  # binary mode keeps the archive intact on Windows
  }
}
## Create a single large data file named airlines.csv by appending the yearly files
first <- TRUE
csv.file <- "airlines.csv"  # Write combined integer-only data to this file
csv.con <- file(csv.file, open = "w")
system.time(
  for (year in 1987:2008) {
    file.name <- paste(year, "csv.bz2", sep = ".")
    cat("Processing ", file.name, "\n", sep = "")
    d <- read.csv(file.name)  # read.csv decompresses .bz2 files transparently
    ## Convert the strings to integers (conversion code not shown on the slide)
    write.table(d, file = csv.con, sep = ",",
                row.names = FALSE, col.names = first)
    first <- FALSE  # write the header only once
  }
)
close(csv.con)
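The loop above never holds more than one year of data in memory. The same one-piece-at-a-time idea can be used when reading the combined file back. The following is a minimal base-R sketch, not part of the original talk: chunked_mean is a hypothetical helper, and a small temporary file stands in for airlines.csv.

```r
# Compute the mean of one column while reading a CSV file chunk by chunk,
# so that at most `chunk_size` rows are held in memory at any time.
# Hypothetical helper; the toy file below stands in for airlines.csv.
chunked_mean <- function(path, column, chunk_size = 1000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)          # keep the header for each chunk
  total <- 0
  n <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break          # end of file reached
    chunk <- read.csv(text = c(header, lines))
    values <- chunk[[column]]
    total <- total + sum(values, na.rm = TRUE)
    n <- n + sum(!is.na(values))
  }
  total / n
}

toy <- tempfile(fileext = ".csv")
write.csv(data.frame(Year = rep(1987:1988, each = 1250), ArrDelay = 1:2500),
          toy, row.names = FALSE)
m <- chunked_mean(toy, "ArrDelay", chunk_size = 1000)
print(m)
```

This only works for statistics that can be accumulated across chunks (sums, counts, means); it is the limitation that motivates the packages discussed next.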
BigMemory Package
## 09/09/2013: does not seem to exist on Windows for R 3.0.0
install.packages("bigmemory", repos = "http://R-Forge.R-project.org")
install.packages("biganalytics", repos = "http://R-Forge.R-project.org")
#library(bigmemory)
#x <- read.big.matrix("airlines.csv", type = "integer", header = TRUE,
#                     backingfile = "airline.bin",
#                     descriptorfile = "airline.desc", extraCols = "Age")
#library(biganalytics)
#blm <- biglm.big.matrix(ArrDelay ~ Age + Year, data = x)
ff package
library(ffbase)
system.time(hhp <- read.table.ffdf(file = "airlines.csv",
                                   FUN = "read.csv", na.strings = "NA",
                                   nrows = 10000000))
# takes 1 min 40 s
# without the nrows argument, an error message is raised:
# ffbase does not support the character type
class(hhp)
dim(hhp)
str(hhp[1:10,])
result <- list()
## Some basic showoff
result$UniqueCarrier <- unique(hhp$UniqueCarrier)
#15 sec
## Basic example of operators is.na.ff, the ! operator and sum.ff
sum(!is.na(hhp$ArrDelay ))
## all and any
any(is.na(hhp$ArrDelay))
all(!is.na(hhp$ArrDelay))
ff package and Biglm
##
## Make a linear model using biglm
##
require(biglm)
mymodel <- bigglm(ArrDelay ~ -1+DayOfWeek,
data =hhp)
#takes 30 sec for 10M rows
summary(mymodel)
predict(mymodel,newdata=hhp)
RSQLite
library(RSQLite)
library(sqldf)
library(foreign)
# create an empty database.
# can skip this step if the database already exists.
sqldf("attach testingdb as new")
# read the CSV file into a table called baseflux in the testingdb SQLite database
read.csv.sql("airlines.csv", sql = "create table baseflux as select * from file",
             dbname = "testingdb", row.names = FALSE, eol = "\n")
# on Windows, specify eol = "\n"
# takes 2.5 hours
# look at the first ten lines
sqldf("select * from baseflux limit 10", dbname = "testingdb")
# takes 1 minute?
# count the number of flights whose distance is greater than 500, departing from SFO
sqldf("select count(*) as nb
       from baseflux
       where distance > 500
       and Origin = 'SFO'",
      dbname = "testingdb")
RSQLite
## If your intention was to read the file into R immediately after
## reading it into the database,
## and you don't really need the database after that, then use:
airlines <- read.csv.sql("airlines.csv", sql = "select * from file", eol = "\n")
######
# NB: the package does not handle missing values.
# Translate the empty fields to some number
# that will represent NA, and then fix it up on the R end.
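The "fix it up on the R end" step can be sketched in base R as follows. This is an illustration only: the sentinel value -9999 and the toy data frame are assumptions, and the sentinel must be a value that is known never to occur as genuine data.

```r
# Assumption: empty fields were replaced by the sentinel -9999 before loading;
# the value is assumed never to occur as a genuine delay or distance.
na_code <- -9999
flights <- data.frame(ArrDelay = c(12, -9999, 3, -9999),
                      Distance = c(500, 800, -9999, 1200))

# Turn every occurrence of the sentinel back into a proper NA.
flights[flights == na_code] <- NA

print(colSums(is.na(flights)))             # number of missing values per column
print(mean(flights$ArrDelay, na.rm = TRUE))
```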
Sampling is bad for...
• Reporting
The boss wants to know the exact growth rate, not a statistical
estimate...
• Data management
You will not be able to access the row of a particular customer
Sampling is good for analysis
Because
1. What matters is the order of magnitude, not the exact result.
2. Sampling error is very small compared to model error,
measurement error, estimation error, model noise, ...
3. Sampling error depends on the size of the sample, not on the
size of the whole dataset.
4. Everything is a sample in the end.
5. When sampling works very badly, your conclusions are not
robust.
6. Anyway, how will we deal with nonlinear complexity, even in
the cloud?
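Point 3 can be checked with a small simulation. The sketch below is an illustration with made-up numbers, not data from the talk: it draws repeated samples of the same size n from two simulated populations whose sizes differ by a factor of ten, and compares the spread of the sample means with the textbook value sd / sqrt(n).

```r
set.seed(42)
n    <- 1000   # sample size, identical for both populations
reps <- 200    # number of repeated samples

# Empirical sampling error: standard deviation of the sample mean
# over `reps` independent samples of size n.
sampling_error <- function(population) {
  sd(replicate(reps, mean(sample(population, n))))
}

small_pop <- rnorm(1e5, mean = 7, sd = 30)   # 100 thousand records
large_pop <- rnorm(1e6, mean = 7, sd = 30)   # 1 million records

se_small <- sampling_error(small_pop)
se_large <- sampling_error(large_pop)
theory   <- 30 / sqrt(n)                     # sd / sqrt(n), about 0.95

cat(sprintf("N = 1e5: %.2f   N = 1e6: %.2f   theory: %.2f\n",
            se_small, se_large, theory))
```

Both empirical values should land close to the theoretical one: making the population ten times larger does not make a sample of 1000 rows less precise.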
data.sample
Features of data.sample
• it works on your laptop, whatever your RAM is; it just takes
time
• no need to install other Big Data software (RDBMS, NoSQL) on top
of R
• no need to rewrite all your code, just change one single line
data.sample takes the same arguments as read.table: nothing
to learn
Simulations
Model: $Y = 3X + 1 \cdot \mathbf{1}_{\{G=A\}} + 2 \cdot \mathbf{1}_{\{G=B\}} + 3 \cdot \mathbf{1}_{\{G=C\}} + e$
$X = 1, \ldots, N$; $G$ a discrete random variable; $e$ some noise
Simulate 100 million observations: 2.3 GB
Code
dataset <- data.sample("simulations.csv", sep = ",", header = T)
# takes 12 min on my laptop
t <- lm(y ~ -1 + x + g, data = dataset)
summary(t)
Call: lm(formula = y ~ -1 + x + g, data = dataset)
Coefficients:
     x      gA      gB      gC
3.0000  0.9984  1.9996  2.9963
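The simulation can be reproduced at a much smaller scale in base R. The sketch below is an illustration under stated assumptions, not the author's code: it uses 100,000 observations instead of 100 million, standard normal noise, a plain call to sample() in place of data.sample, and a sample of 5,000 rows.

```r
set.seed(1)
N <- 100000                                   # assumption: small-scale version
x <- seq_len(N)
g <- factor(sample(c("A", "B", "C"), N, replace = TRUE))
effect <- c(A = 1, B = 2, C = 3)              # group effects from the model above
y <- 3 * x + unname(effect[as.character(g)]) + rnorm(N)
full <- data.frame(y = y, x = x, g = g)

# Fit the model on a simple random sample of the rows only.
sampled <- full[sample(N, 5000), ]
fit <- lm(y ~ -1 + x + g, data = sampled)
print(round(coef(fit), 3))
```

Even with 5% of the rows, the estimated coefficients should come out close to the true values 3, 1, 2 and 3.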
data.sample package
install.packages("D:/U/Data.sample/data.sample_1.0.zip", repos = NULL)
library(data.sample)
system.time(resultsample <- data.sample(file = "airlines.csv", header = T, sep = ",")$df)
# takes 52 minutes on my laptop if you don't know the number of records
# this step is done only once!
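One standard way to draw a uniform sample from a file whose number of records is unknown is reservoir sampling, which needs a single pass and keeps only the sample in memory. The sketch below illustrates that general technique in base R; it is not claimed to be how data.sample works internally, and reservoir_sample_csv is a hypothetical helper tested on a small temporary file.

```r
# Reservoir sampling (Algorithm R): draw k lines uniformly at random from a
# file in a single pass, without knowing the number of records in advance.
# Hypothetical helper, not the data.sample implementation.
reservoir_sample_csv <- function(path, k, chunk_size = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)
  reservoir <- character(0)
  seen <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break              # end of file reached
    for (line in lines) {
      seen <- seen + 1
      if (seen <= k) {
        reservoir[seen] <- line                # fill the reservoir first
      } else {
        j <- sample.int(seen, 1)               # uniform on 1..seen
        if (j <= k) reservoir[j] <- line       # keep with probability k / seen
      }
    }
  }
  read.csv(text = c(header, reservoir))
}

set.seed(123)
toy <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:20000, ArrDelay = rnorm(20000, mean = 7, sd = 30)),
          toy, row.names = FALSE)
smp <- reservoir_sample_csv(toy, k = 500)
print(dim(smp))
```

The result is an ordinary data frame, so any modelling function can be applied to it unchanged.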
data.sample package
# fit your linear model
mymodelsample <- lm(ArrDelay ~ -1 + as.factor(DayOfWeek), data = resultsample)
summary(mymodelsample)
                      Estimate Std. Error t value Pr(>|t|)
as.factor(DayOfWeek)1  6.58383    0.08041   81.88   <2e-16 ***
as.factor(DayOfWeek)2  6.04881    0.08054   75.10   <2e-16 ***
as.factor(DayOfWeek)3  6.80039    0.08037   84.61   <2e-16 ***
as.factor(DayOfWeek)4  8.96406    0.08045  111.42   <2e-16 ***
as.factor(DayOfWeek)5  9.45303    0.08015  117.94   <2e-16 ***
as.factor(DayOfWeek)6  4.15234    0.08535   48.65   <2e-16 ***
as.factor(DayOfWeek)7  6.40236    0.08222   77.87   <2e-16 ***
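The standard errors in this output quantify the sampling error directly. As a worked example, the sketch below builds approximate 95% confidence intervals (estimate ± 1.96 standard errors, a normal approximation) for two of the coefficients, using only figures copied from the output above; day 5 is Friday and day 6 is Saturday under the dataset's coding.

```r
# Estimates and standard errors copied from the regression output above.
est <- c(Fri = 9.45303, Sat = 4.15234)
se  <- c(Fri = 0.08015, Sat = 0.08535)

z  <- qnorm(0.975)                  # about 1.96 for a 95% interval
ci <- cbind(lower = est - z * se, upper = est + z * se)
print(round(ci, 2))
```

Each interval is only about ±0.16 minutes wide, far smaller than the gap of roughly five minutes between the two days, so the ranking of the days does not hinge on the sampling.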
Conclusion
Strategies    SQL-like  Data mining           Beyond the RAM  Pros                                            Cons
cloud         OK        OK                    OK              No rewrite, cheap                               Cloud configuration
ff, biglm     OK        No (regression only)  OK              Not limited to RAM                              Rewrite, very limited for data mining
RSQLite       OK        No                    OK              Not limited to RAM                              Rewrite, no data mining
data.sample   OK        OK                    OK              No rewrite, fast coding, can use all libraries  No reporting, lack of theoretical results
data.table    OK        No                    No              Fast (index)                                    Limited to RAM, no data mining

 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceMartin Humpolec
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 

Recently uploaded (20)

UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your Salesforce
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 

Gur1009

  • 1. (not so) Big Data with R GUR Fltaur Matthieu Cornec Matthieu.cornec@cdiscount.com 10/09/2013 Cdiscount.com - Commark
  • 2. Outline
    • A- Intro
    • B- Problem setup
    • C- 3 strategies
    • D- Packages: RSQLite, ff and biglm, data.sample
    • E- Conclusion
  • 3. 1 – Intro: Problem setup
    - Your CSV file is too big to import into R, say several tens of GB.
    - Typically, your first read.table ends with the error message « Cannot allocate a vector of size XXX ».
    How to fix it? It depends on:
    - What you want to do (data management with SQL-like queries, data mining, …)
    - Your environment (corporate, with a data warehouse?)
    - The size of your data
  • 4. Three basic strategies
    • Buy memory in a cloud environment
      - Can handle several tens of GB
      - Cheap (1.5 euro per hour for 60 GB)
      - No need to rewrite all your code
      But you need to configure it (see for example )
      => Preferred strategy in most cases
    • Try packages: for SQL-like needs, try ff, RSQLite
      - Not limited to RAM (several tens of GB)
      But no advanced data mining libraries, and you need to rewrite your code.
    • Sampling: data.sample package
  • 5. Dataset
    • http://stat-computing.org/dataexpo/2009/the-data.html
    • More than 100 million observations, 12 GB
    The data comes originally from RITA where it is described in detail. You can download the data there, or from the bzipped csv files listed below. These files have derivable variables removed, are packaged in yearly chunks and have been more heavily compressed than the originals.
    Download individual years: 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008
    29 variables
        Name        Description
      1 Year        1987-2008
      2 Month       1-12
      3 DayofMonth  1-31
      4 DayOfWeek   1 (Monday) - 7 (Sunday)
      5 DepTime     actual departure time (local, hhmm)
      6 ….
  • 6. 1 Import the data files and create one unique large csv file
      ## import the data from http://stat-computing.org/dataexpo/2009/the-data.html
      for (year in 1987:2008) {
        file.name <- paste(year, "csv.bz2", sep = ".")
        if (!file.exists(file.name)) {
          url.text <- paste("http://stat-computing.org/dataexpo/2009/", year, ".csv.bz2", sep = "")
          cat("Downloading missing data file ", file.name, "\n", sep = "")
          # mode = "wb" keeps the compressed file intact on Windows
          download.file(url.text, file.name, mode = "wb")
        }
      }
      ## create a single large data file named airlines.csv
      first <- TRUE                 # write the header only for the first year
      csv.file <- "airlines.csv"    # write the combined data to this file
      csv.con <- file(csv.file, open = "w")
      system.time(
        for (year in 1987:2008) {
          file.name <- paste(year, "csv.bz2", sep = ".")
          cat("Processing ", file.name, "\n", sep = "")
          d <- read.csv(file.name)  # read.csv decompresses the bzipped file transparently
          ## note: string columns are written unchanged, no conversion to integers is done here
          write.table(d, file = csv.con, sep = ",", row.names = FALSE, col.names = first)
          first <- FALSE
        }
      )
      close(csv.con)
  • 7. BigMemory Package
      ## 09/09/2013: does not seem to exist on Windows for R 3.0.0
      install.packages("bigmemory", repos = "http://R-Forge.R-project.org")
      install.packages("biganalytics", repos = "http://R-Forge.R-project.org")
      #library(bigmemory)
      #x <- read.big.matrix("airlines.csv", type = "integer", header = TRUE, backingfile = "airline.bin",
      #                     descriptorfile = "airline.desc", extraCols = "Age")
      #library(biganalytics)
      #blm <- biglm.big.matrix(ArrDelay ~ Age + Year, data = x)
  • 8. ff package
      library(ffbase)
      system.time(hhp <- read.table.ffdf(file = "airlines.csv", FUN = "read.csv", na.strings = "NA", nrows = 10000000))
      # takes 1min40sec
      # with no nrows argument, you get an error message
      # ffbase does not support char type
      class(hhp)
      dim(hhp)
      str(hhp[1:10, ])
      result <- list()
      ## Some basic showoff
      result$UniqueCarrier <- unique(hhp$UniqueCarrier) # 15 sec
      ## Basic example of operators is.na.ff, the ! operator and sum.ff
      sum(!is.na(hhp$ArrDelay))
      ## all and any
      any(is.na(hhp$ArrDelay))
      all(!is.na(hhp$ArrDelay))
  • 9. ff package and biglm
      ##
      ## Make a linear model using biglm
      ##
      require(biglm)
      mymodel <- bigglm(ArrDelay ~ -1 + DayOfWeek, data = hhp) # takes 30 sec for 10M rows
      summary(mymodel)
      predict(mymodel, newdata = hhp)
  • 10. RSQLite
      library(RSQLite)
      library(sqldf)
      library(foreign)
      # create an empty database.
      # can skip this step if database already exists.
      sqldf("attach testingdb as new")
      # read into table called baseflux in the testingdb sqlite database
      read.csv.sql("airlines.csv", sql = "create table baseflux as select * from file",
                   dbname = "testingdb", row.names = FALSE, eol = "\n") # on Windows, specify eol = "\n"
      # takes 2.5 hours
      # look at first ten lines
      sqldf("select * from baseflux limit 10", dbname = "testingdb") # takes 1 minute ?
      # count the number of flights whose distance is greater than 500, departing from SFO
      sqldf("select count(*) as nb from baseflux where distance>500 and Origin='SFO'", dbname = "testingdb")
  • 11. RSQLite
      ## If your intention was to read the file into R immediately after
      ## reading it into the database,
      ## and you don't really need the database after that, then see
      airlines <- read.csv.sql("airlines.csv", sql = "select * from file", eol = "\n")
      ######
      # NB: the package does not handle missing values.
      # Translate the empty fields to some number
      # that will represent NA and then fix it up on the R end.
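The workaround mentioned in the note above (encode missing fields as a number, then restore NA on the R side) could look like the following minimal sketch. The sentinel value -9999 and the helper name restore_na are assumptions chosen for illustration; they do not come from the slides, and the sentinel must be a value that cannot occur in the real data.

```r
# Hypothetical sentinel standing in for missing values (an assumption, not from the slides).
NA_SENTINEL <- -9999

# Replace the sentinel by NA in every numeric column of a data frame.
restore_na <- function(df, sentinel = NA_SENTINEL) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]][which(df[[col]] == sentinel)] <- NA
    }
  }
  df
}

# Small stand-in for a table read back from SQLite.
flights <- data.frame(ArrDelay = c(12, -9999, 3),
                      Origin = c("SFO", "JFK", "SFO"),
                      stringsAsFactors = FALSE)
flights <- restore_na(flights)
```

Character columns are left untouched, so only numeric fields that were encoded with the sentinel are affected.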
  • 12. Sampling is bad for...
    • Reporting: the boss wants to know the accurate growth rate, not a statistical estimate.
    • Data management: you will not be able to access the role of this particular customer.
  • 13. Sampling is good for analysis
    Because:
    1. what matters is the order of magnitude, not the exact results
    2. sampling error is very small compared to model error, measurement error, estimation error, model noise, ...
    3. sampling error depends on the size of the sample, not on the size of the whole dataset
    4. everything is a sample in the end
    5. when sampling works very badly, your conclusions are not robust
    6. anyway, how will we deal with non-linear complexity, even in the cloud?
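The third point above (sampling error depends on the sample size, not on the size of the dataset) can be checked with a small simulation. This is only a sketch in base R: the population sizes, the normal distribution and the helper name se_of_sample_mean are assumptions made for illustration, not part of the slides.

```r
set.seed(1)

# Empirical standard error of the mean of a sample of size n drawn
# without replacement from `population`, estimated over `reps` repetitions.
se_of_sample_mean <- function(population, n, reps = 200) {
  sd(replicate(reps, mean(sample(population, n))))
}

# Two simulated "datasets" of very different sizes, same distribution.
small_pop <- rnorm(1e5, mean = 7, sd = 30)
large_pop <- rnorm(1e6, mean = 7, sd = 30)

# Same sample size in both cases.
se_small <- se_of_sample_mean(small_pop, n = 1000)
se_large <- se_of_sample_mean(large_pop, n = 1000)

# Textbook approximation sd / sqrt(n), which ignores the population size.
se_theory <- 30 / sqrt(1000)
```

Both empirical values should land close to se_theory, even though one population is ten times larger than the other.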
  • 14. data.sample
    Features of data.sample
    • it works on your laptop, whatever your RAM is; it just takes time
    • no need to install other Big Data software (RBD, NoSQL) on top of R
    • no need to rewrite all your code, just change one single line
    data.sample takes the same arguments as read.table: nothing to learn
    Simulations
    Model: $Y = 3X + 1\cdot\mathbf{1}_{\{G=A\}} + 2\cdot\mathbf{1}_{\{G=B\}} + 3\cdot\mathbf{1}_{\{G=C\}} + e$
    with $X = 1, \dots, N$, $G$ a discrete random variable, $e$ some noise
    Simulate 100 million observations: 2.3 GB
    Code
      dataset <- data.sample("simulations.csv", sep = ",", header = T) # takes 12min on my laptop
      t <- lm(y ~ -1 + x + g, data = dataset)
      summary(t)
    Call: lm(formula = y ~ -1 + x + g, data = dataset)
    Coefficients:
           x      gA      gB      gC
      3.0000  0.9984  1.9996  2.9963
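The simulation model on the slide above can be reproduced at a much smaller scale to see which coefficients lm should recover. This is a sketch under stated assumptions: 100,000 rows instead of 100 million, G drawn uniformly from A, B, C, and standard normal noise. None of these choices are specified in the slides, and the data.sample package itself is not used here.

```r
set.seed(42)

n <- 100000                         # far fewer rows than on the slide (assumption)
x <- seq_len(n)                     # X = 1, ..., N
g <- factor(sample(c("A", "B", "C"), n, replace = TRUE))  # uniform G is an assumption
e <- rnorm(n)                       # standard normal noise is an assumption

# Group effect: 1 for A, 2 for B, 3 for C, as in the model on the slide.
group_effect <- c(A = 1, B = 2, C = 3)[as.character(g)]
y <- 3 * x + group_effect + e

dataset <- data.frame(y = y, x = x, g = g)

# Same formula as in the Call shown on the slide.
fit <- lm(y ~ -1 + x + g, data = dataset)
coef(fit)
```

The fitted coefficients are named x, gA, gB and gC, matching the layout of the output on the slide, and should be close to 3, 1, 2 and 3.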
  • 15. data.sample package
      install.packages("D:/U/Data.sample/data.sample_1.0.zip", repos = NULL)
      library(data.sample)
      system.time(resultsample <- data.sample(file = "airlines.csv", header = T, sep = ",")$df)
      # takes 52 minutes on my laptop if you don't know the number of records
      # this step is done only once!
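The slides do not describe how data.sample draws its sample. One standard way to sample rows from a file that does not fit in RAM, without knowing the number of records in advance, is reservoir sampling. The sketch below uses base R only; the function name reservoir_sample_csv and its parameters are hypothetical, and it should not be read as the package's implementation.

```r
# Reservoir sampling (Algorithm R) over the data lines of a CSV file.
# The file is read in chunks, so only `k` lines plus one chunk are held in memory.
reservoir_sample_csv <- function(path, k, chunk_lines = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)
  reservoir <- character(0)
  seen <- 0
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0) break
    for (line in lines) {
      seen <- seen + 1
      if (seen <= k) {
        # Fill the reservoir with the first k lines.
        reservoir[seen] <- line
      } else {
        # Keep the new line with probability k / seen, replacing a random slot.
        j <- sample.int(seen, 1)
        if (j <= k) reservoir[j] <- line
      }
    }
  }
  # Parse the sampled lines as a regular data frame.
  read.csv(text = c(header, reservoir), header = TRUE)
}

# Usage on a small temporary file standing in for airlines.csv.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5000, value = rnorm(5000)), tmp, row.names = FALSE)
set.seed(123)
smp <- reservoir_sample_csv(tmp, k = 200, chunk_lines = 1000)
```

A single pass over the file is enough, which is consistent with the "done only once" remark above, although a line-by-line loop in plain R would be slow on 100 million rows.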
  • 16. data.sample package
      # fit your linear model
      mymodelsample <- lm(ArrDelay ~ -1 + as.factor(DayOfWeek), data = resultsample)
      summary(mymodelsample)

                             Estimate  Std. Error  t value  Pr(>|t|)
      as.factor(DayOfWeek)1   6.58383     0.08041    81.88    <2e-16 ***
      as.factor(DayOfWeek)2   6.04881     0.08054    75.10    <2e-16 ***
      as.factor(DayOfWeek)3   6.80039     0.08037    84.61    <2e-16 ***
      as.factor(DayOfWeek)4   8.96406     0.08045   111.42    <2e-16 ***
      as.factor(DayOfWeek)5   9.45303     0.08015   117.94    <2e-16 ***
      as.factor(DayOfWeek)6   4.15234     0.08535    48.65    <2e-16 ***
      as.factor(DayOfWeek)7   6.40236     0.08222    77.87    <2e-16 ***
  • 18. Conclusion
    Strategy    | SQL-like | Data mining       | Beyond the RAM | Pros                                            | Cons
    ------------+----------+-------------------+----------------+-------------------------------------------------+------------------------------------------
    Cloud       | OK       | OK                | OK             | No rewrite, cheap                               | Cloud configuration
    ff, biglm   | OK       | KO but regression | OK             | Not limited to RAM                              | Rewrite, very limited for data mining
    RSQLite     | OK       | KO                | OK             | Not limited to RAM                              | Rewrite, no data mining
    data.sample | OK       | OK                | OK             | No rewrite, fast coding, can use all libraries  | No reporting, lack of theoretical results
    data.table  | OK       | KO                | KO             | Fast (index)                                    | Limited to RAM, no data mining