Introduction To Big Data Pipelining
with Docker, Cassandra, Spark,
Spark-Notebook & Akka
Apache Cassandra and DataStax enthusiast who enjoys explaining to customers
that the traditional approaches to data management just don’t cut it anymore in
the new always on, no single point of failure, high volume, high velocity, real time
distributed data management world.
Previously 25 years designing, building, implementing and supporting complex
data management solutions with traditional RDBMS technology, including Oracle
Hyperion & E-Business Suite deployments at clients such as the Financial
Services Authority, Olympic Delivery Authority, BT, RBS, Virgin Entertainment,
HP, Sun and Oracle.
Oracle certified in Exadata, Oracle Cloud, Oracle Essbase, Oracle Linux and
OBIEE, and worked extensively with Oracle Hyperion, Oracle E-Business Suite,
Oracle Virtual Machine and Oracle Exalytics.
simon.ambridge@datastax.com
@stratman1958
Simon Ambridge
Pre-Sales Solution Engineer, Datastax UK
Big Data Pipelining: Outline
•  1-Hour introduction to Big Data Pipelining and a working sandbox
•  Presented at a half-day workshop at Devoxx, November 2015
•  Uses Data Pipeline environment from Data Fellas
•  Contributors from Typesafe, Mesos, Datastax
•  Demonstrates how to use scalable, distributed technologies
•  Docker
•  Spark
•  Spark-Notebook
•  Cassandra
•  Objective is to introduce the demo environment
•  Key takeaway – understanding how to build a reactive, repeatable Big Data
pipeline
Big Data Pipelining: Devoxx & Data Fellas
Andy Petrella
•  Co-founder of Data Fellas
•  Certified Scala/Spark trainer and author of the Learning Play! Framework 2 book
•  Creator of Spark-Notebook, one of the top projects on GitHub related to Apache Spark and Scala
Xavier Tordoir
•  Co-founder of Data Fellas
•  Ph.D in experimental atomic physics
•  Specialist in prediction of biological molecular structures and interactions, and applied Machine Learning
methodologies
Iulian Dragos
•  Key member of Martin Odersky’s Scala team at Typesafe
•  For the last six years the main contributor to many critical Scala components, including the compiler
backend, its optimizer and the Eclipse build manager
Simon Ambridge
•  Datastax Solutions Engineer
•  Prior to Datastax, extensive experience with traditional RDBMS technologies at Oracle, Sun, Compaq, DEC
etc.
Big Data Pipelining: Legacy
Sampling → Data Modeling → Tuning → Report → Interpret
•  Sampling and analysis often run on a single machine
•  CPU and memory limitations
•  Frequently dictates limited sampling because of data size limitations
•  Multiple iterations over large datasets
Repeated iterations
Big Data Pipelining: Big Data Problems
•  Data is getting bigger or, more accurately, the number of
available data sources is exploding
•  Sampling the data is becoming more difficult
•  The validity of the analysis becomes obsolete faster
•  Analysis becomes too slow to get any ROI from the data
Big Data Pipelining: Big Data Needs
•  Scalable infrastructure + distributed technologies
•  Allow data volumes to be scaled
•  Faster processing
•  More complex processing
•  Constant data flow
•  Visible, reproducible analysis
•  For example, SHAR3 from Data Fellas
Big Data Pipelining: Pipeline Flow
ADAM
Intro To Docker: Quick History
What is Docker?
•  Open source project started in 2013
•  Easy to build, deploy, copy containers
•  Great for packaging and deploying applications
•  Similar resource isolation to VMs, but different architecture
•  Lightweight
•  Containers share the OS kernel
•  Fast start
•  Layered filesystems – share underlying OS files, directories
“Each virtual machine includes
the application, the necessary
binaries and libraries and an
entire guest operating system -
all of which may be tens of GBs
in size.”
“Containers include the application and all of its
dependencies, but share the kernel with other
containers. They run as an isolated process in
userspace on the host operating system. They’re
also not tied to any specific infrastructure – Docker
containers run on any computer, on any
infrastructure and in any cloud.”
Intro To ADAM: Quick History
What is ADAM?
•  Started at UC Berkeley in 2012
•  Open-source library for bioinformatics analysis, written for Spark
•  Spark’s ability to parallelize an analysis pipeline is a natural fit for genomics
methods
•  A set of formats, APIs, and processing stage implementations for genomic
data
•  Fully open source under the Apache 2 license
•  Implemented on top of Avro and Parquet for data storage
•  Compatible with Spark up to 1.5.1
Intro To Spark: Quick History
What is Apache Spark?
•  Started at UC Berkeley in 2009
•  Apache Project since 2010
•  Fast - 10x-100x faster than Hadoop MapReduce
•  Distributed in-memory processing
•  Rich Scala, Java and Python APIs
•  2x-5x less code than R
•  Batch and streaming analytics
•  Interactive shell (REPL)
Intro To Spark-Notebook: Quick History
What is Spark-Notebook?
•  Drive your data analysis from the browser
•  Can be deployed on a single host or a large cluster, e.g. Mesos, EC2, GCE
•  Features tight integration with Apache Spark and offers handy tools to
analysts:
•  Reproducible visual analysis
•  Charting
•  Widgets
•  Dynamic forms
•  SQL support
•  Extensible with custom libraries
Intro To Parquet: Quick History
What is Parquet?
•  Started at Twitter and Cloudera in 2013
•  Databases traditionally store information in rows and are optimized for
working with one record at a time
•  Columnar storage systems optimised to store data by column
•  Netflix big user - 7 PB of warehoused data in Parquet format
•  A compressed, efficient columnar data representation
•  Allows complex data to be encoded efficiently
•  Compression schemes can be specified on a per-column level
•  Not as compressed as ORC (Hortonworks) but faster read/analysis
Intro To Cassandra: Quick History
What is Apache Cassandra?
•  Originally started at Facebook in 2008
•  Top level Apache project since 2010
•  Open source distributed database
•  Handles large amounts of data
•  At high velocity
•  Across multiple data centres
•  No single point of failure
•  Continuous Availability
•  Disaster avoidance
•  Enterprise Cassandra from Datastax
Intro To Akka: Quick History
What is Akka?
•  Open source toolkit first released in 2009
•  Simplifies the construction of concurrent and distributed Java applications
•  Primarily designed for actor-based concurrency
•  Akka enforces parental supervision
•  Actors are arranged hierarchically
•  Each actor is created and supervised by its parent actor
•  Program failures are treated as events handled by an actor's supervisor
•  Message-based and asynchronous; typically no mutable data is shared
•  Language bindings exist for both Java and Scala
Spark: RDD
What Is A Resilient Distributed Dataset?
•  RDD - a distributed, memory abstraction for parallel in-memory
computations
•  RDD represents a dataset consisting of objects and records
•  Such as Scala, Java or Python objects
•  RDD is distributed across nodes in the Spark cluster
•  Nodes hold partitions and partitions hold records
•  RDD is read-only or immutable
•  RDD can be transformed into a new RDD
•  Operations
•  Transformations (e.g. map, filter, groupBy)
•  Actions (e.g. count, collect, save)
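The transformation/action split above can be sketched in the Spark shell (Scala; dataset contents are illustrative):

```scala
// Build an RDD from a local collection (sc is the Spark shell's SparkContext)
val nums = sc.parallelize(1 to 100)

// Transformations are lazy – nothing executes yet
val evens = nums.filter(_ % 2 == 0)

// Actions trigger the distributed computation
val total = evens.count()   // 50
```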
Spark: DataFrames
What Is A DataFrame?
•  Inspired by data frames in R and Python
•  Data is organized into named columns
•  Conceptually equivalent to a table in a relational database
•  Can be constructed from a wide array of sources
•  structured data files - JSON, Parquet
•  tables in Hive
•  relational database systems via JDBC
•  existing RDDs
•  Can be extended to support any third-party data formats or sources
•  Existing third-party extensions already include Avro, CSV, ElasticSearch,
and Cassandra
•  Enables applications to easily combine data from disparate sources
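As a minimal sketch (Spark 1.x-era API; the case class and values are illustrative), a DataFrame can be built from an existing RDD:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
case class Person(name: String, age: Int)

// From an existing RDD of case classes
val peopleDF = sqlContext.createDataFrame(
  sc.parallelize(Seq(Person("Ann", 34), Person("Bob", 29))))

peopleDF.select("name").show()
```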
Spark & Cassandra: How?
How Does Spark Access Cassandra?
•  DataStax Spark Cassandra Connector – open source:
•  https://github.com/datastax/spark-cassandra-connector
•  Compatible with
•  Spark 0.9+
•  Cassandra 2.0+
•  DataStax Enterprise 4.5+
•  Scala 2.10 and 2.11
•  Java and Python
•  Exposes Cassandra tables as Spark RDDs
•  Executes arbitrary CQL queries in Spark applications
•  Saves RDDs back to Cassandra via the saveToCassandra call
Spark: How Do You Access RDDs?
Create A ‘Spark Context’
•  To create an RDD you need a Spark Context object
•  A Spark Context represents a connection to a Spark Cluster
•  In the Spark shell the sc object is created automatically
•  In a standalone application a Spark Context must be constructed
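In a standalone application the construction might look like this (app name and master URL are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Configure the connection to the cluster
val conf = new SparkConf()
  .setAppName("MyPipelineApp")
  .setMaster("local[*]")        // or e.g. spark://host:7077 for a cluster

// The Spark Context is the entry point for creating RDDs
val sc = new SparkContext(conf)
```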
Spark: Architecture
Spark Architecture
•  Master-worker architecture
•  One master
•  Spark Workers run on all nodes
•  Executors belonging to different clients/SparkContexts are isolated
•  Executors belonging to the same client/SparkContext can communicate
•  Client jobs are divided into tasks, executed by multiple threads
•  First Spark node promoted as Spark Master
•  Master HA feature available in DataStax Enterprise
•  Standby Master promoted on failure
•  Workers are resilient by default
Open Source: Analytics Integration
•  Apache Spark for Real-Time Analytics 
•  Analytics nodes separate from data nodes
•  ETL required
Cassandra Cluster
ETL
Spark Cluster
•  Loose integration
•  Data separate from processing
•  Millisecond response times
Solr Cluster
ES Cluster
10 core 16GB minimum
DataStax Enterprise: Analytics Integration
Cassandra Cluster
Spark, Solr Cluster
ETL
Spark Cluster
•  Tight integration
•  Data locality
•  Microsecond response times
•  Integrated Apache Spark for Real-Time Analytics 
•  Integrated Apache Solr for Enterprise Search
•  Search and analytics nodes close to data
•  No ETL required
Solr Cluster
ES Cluster
12+ core 32GB+
Big Data Pipelining: Demo
Build & Run Steps
1.  Provision a 64-bit Linux environment
2.  Pre-requisites (5 mins)
3.  Install Docker (5 mins)
4.  Clone the Pipeline Repo from GitHub (2 mins)
5.  Pull the Docker image from Docker Hub (20 mins)
6.  Run the image as a container (5 mins)
7.  Run the demo setup script - inside the container (2 mins)
8.  Run the demo from a browser - on the host (30 mins)
Big Data Pipelining: Demo
Steps
1.  Provision a host
Required machine spec: 3 cores, 5GB
•  Linux machine
http://www.ubuntu.com/download/desktop
•  Create a VM (e.g. Ubuntu)
http://virtualboxes.org/images/ubuntu/
http://www.osboxes.org/ubuntu/
Big Data Pipelining: Demo
Steps
2.  Pre-requisites
https://docs.docker.com/installation/ubuntulinux/
•  Updates to apt-get sources and gpg key
•  Check kernel version
Big Data Pipelining: Demo
Steps
3.  Install Docker
$ sudo apt-get update
$ sudo apt-get install docker
$ sudo usermod -aG docker <myuserid>
Log out/in
$ docker run hello-world
Big Data Pipelining: Demo
Steps
4.  Clone the Pipeline repo
$ mkdir ~/pipeline
$ cd ~/pipeline
$ git clone https://github.com/distributed-freaks/pipeline.git
Big Data Pipelining: Demo
Steps
5.  Pull the Pipeline image
$ docker pull xtordoir/pipeline
Big Data Pipelining: Demo
Steps
6.  Run the Pipeline image as a container
$ docker run -it -m 8g \
    -p 30080:80 -p 34040-34045:4040-4045 -p 9160:9160 -p 9042:9042 \
    -p 39200:9200 -p 37077:7077 -p 36060:6060 -p 36061:6061 \
    -p 32181:2181 -p 38090:8090 -p 38099:8099 -p 30000:10000 \
    -p 30070:50070 -p 30090:50090 -p 39092:9092 -p 36066:6066 \
    -p 39000:9000 -p 39999:19999 -p 36081:6081 -p 35601:5601 \
    -p 37979:7979 -p 38989:8989 \
    xtordoir/pipeline bash
Big Data Pipelining: Demo
Steps
7.  Run the demo setup script in the container
$ cd pipeline
$ source devoxx-setup.sh      # ignore Cassandra errors
Run cqlsh
Big Data Pipelining: Demo
Steps
8.  Run the demo in the host browser
http://localhost:39000/tree/pipeline
Thank you!
Big Data Pipelining: Appendix
RDD/Cassandra Reference
Spark: RDD
How Do You Create An RDD?
1.  From an existing collection:
‘action’
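A minimal sketch of this step (Spark shell, Scala; the collection is illustrative):

```scala
// Distribute a local collection across the cluster
val data = List(1, 2, 3, 4, 5)
val rdd  = sc.parallelize(data)

rdd.count()   // action – returns 5
```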
Spark: RDD
How Do You Create An RDD?
2.  From a text file:
‘action’
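A minimal sketch of this step (the file path is a placeholder):

```scala
// Each line of the file becomes a record in the RDD
val lines = sc.textFile("hdfs://.../input.txt")

lines.count()   // action – triggers the read
```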
Spark: RDD
How Do You Create An RDD?
3.  From data in a Cassandra table:
‘action’
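A sketch using the DataStax connector (keyspace and table names are assumptions):

```scala
import com.datastax.spark.connector._

// Expose a Cassandra table as a Spark RDD
val rdd = sc.cassandraTable("my_keyspace", "my_table")

rdd.first   // action – fetches a row
```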
Spark: RDD
How Do You Create An RDD?
4.  From an existing RDD:
‘action’
‘transformation’
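A minimal sketch of deriving a new RDD from an existing one:

```scala
val rdd     = sc.parallelize(1 to 10)
val doubled = rdd.map(_ * 2)   // transformation – returns a new, immutable RDD

doubled.collect()              // action – materializes the results
```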
Spark: RDDs & Cassandra
Accessing Data As An RDD
‘action’
RDD method
Spark: Filtering Data In Cassandra
Server-side Selection
•  Reduce the amount of data transferred
•  Selecting rows (by clustering columns and/or secondary indexes)
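With the connector this is expressed via .where(), which pushes the CQL predicate down to Cassandra so only matching rows are transferred (keyspace, table and column names are assumptions):

```scala
import com.datastax.spark.connector._

// The predicate is evaluated server-side in Cassandra, not in Spark
val rows = sc.cassandraTable("my_keyspace", "events")
  .where("event_time > ?", "2015-01-01")
```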
Spark: Saving Data In Cassandra
Saving Data
•  saveToCassandra
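A minimal sketch (keyspace, table and column names are assumptions):

```scala
import com.datastax.spark.connector._

val scores = sc.parallelize(Seq(("ann", 10), ("bob", 20)))

// Map the tuple fields onto the named table columns
scores.saveToCassandra("my_keyspace", "scores", SomeColumns("name", "value"))
```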
Spark: Using SparkSQL & Cassandra
You Can Also Access Cassandra Via SparkSQL!
•  Spark Conf object can be used to create a Cassandra-aware Spark SQL context
object
•  Use regular CQL syntax
•  Cross table operations - joins, unions etc!
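A sketch using the connector's Cassandra-aware SQL context (the 1.x-era CassandraSQLContext; keyspace, table and column names are assumptions):

```scala
import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)

// Regular SQL syntax, including a cross-table join over Cassandra tables
val df = cc.sql(
  """SELECT u.name, o.total
     FROM my_keyspace.users u
     JOIN my_keyspace.orders o ON u.id = o.user_id""")
df.show()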
Spark: Streaming Data
Spark Streaming
•  High velocity data – IoT, sensors, Twitter etc
•  Micro batching
•  Each batch represented as RDD
•  Fault tolerant
•  Exactly-once processing
•  Represents a unified stream and batch processing framework
Spark: Streaming Data Into Cassandra
Streaming Example
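A sketch of streaming micro-batches into Cassandra (source host/port, keyspace and table are placeholders):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector.streaming._

val ssc    = new StreamingContext(sc, Seconds(5))      // 5-second micro-batches
val stream = ssc.socketTextStream("localhost", 9999)   // high-velocity source

// Each micro-batch is an RDD, saved to Cassandra as it arrives
stream.map(line => (line, 1))
      .saveToCassandra("my_keyspace", "word_events")

ssc.start()
ssc.awaitTermination()
```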

Apache Spark CoreApache Spark Core
Apache Spark Core
 
NoSQL_Night
NoSQL_NightNoSQL_Night
NoSQL_Night
 
Apache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and BasicsApache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and Basics
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 

Recently uploaded

Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 

Recently uploaded (20)

Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 

Introduction to Big Data Pipelining with Cassandra & Spark (Westminster Meetup, 2015)

  • 1. Introduction To Big Data Pipelining with Docker, Cassandra, Spark, Spark-Notebook & Akka
  • 2. Apache Cassandra and DataStax enthusiast who enjoys explaining to customers that traditional approaches to data management just don’t cut it anymore in the new always-on, no-single-point-of-failure, high-volume, high-velocity, real-time distributed data management world. Previously spent 25 years designing, building, implementing and supporting complex data management solutions with traditional RDBMS technology, including Oracle Hyperion & E-Business Suite deployments at clients such as the Financial Services Authority, Olympic Delivery Authority, BT, RBS, Virgin Entertainment, HP, Sun and Oracle. Oracle certified in Exadata, Oracle Cloud, Oracle Essbase, Oracle Linux and OBIEE, and has worked extensively with Oracle Hyperion, Oracle E-Business Suite, Oracle Virtual Machine and Oracle Exalytics. simon.ambridge@datastax.com @stratman1958 Simon Ambridge Pre-Sales Solution Engineer, DataStax UK
  • 3. Big Data Pipelining: Outline •  1-hour introduction to Big Data pipelining and a working sandbox •  Presented at a half-day workshop at Devoxx, November 2015 •  Uses the Data Pipeline environment from Data Fellas •  Contributors from Typesafe, Mesos and DataStax •  Demonstrates how to use scalable, distributed technologies •  Docker •  Spark •  Spark-Notebook •  Cassandra •  Objective is to introduce the demo environment •  Key takeaway – understanding how to build a reactive, repeatable Big Data pipeline
  • 4. Big Data Pipelining: Devoxx & Data Fellas •  Co-founder of Data Fellas •  Certified Scala/Spark trainer and author of the Learning Play! Framework 2 book •  Creator of Spark-Notebook, one of the top projects on GitHub related to Apache Spark and Scala •  Co-founder of Data Fellas •  Ph.D. in experimental atomic physics •  Specialist in prediction of biological molecular structures and interactions, and applied Machine Learning methodologies •  Iulian Dragos is a key member of Martin Odersky’s Scala team at Typesafe •  For the last six years he has been the main contributor to many critical Scala components, including the compiler backend, its optimizer and the Eclipse build manager •  DataStax Solutions Engineer •  Prior to DataStax, Simon gained extensive experience with traditional RDBMS technologies at Oracle, Sun, Compaq, DEC and others Andy Petrella Xavier Tordoir Iulian Dragos Simon Ambridge
  • 5. Big Data Pipelining: Legacy Sampling → Data Modeling → Tuning → Report → Interpret (repeated iterations) •  Sampling and analysis often run on a single machine •  CPU and memory limitations •  Frequently dictates limited sampling because of data size limitations •  Multiple iterations over large datasets
  • 6. Big Data Pipelining: Big Data Problems •  Data is getting bigger or, more accurately, the number of available data sources is exploding •  Sampling the data is becoming more difficult •  The validity of the analysis becomes obsolete faster •  Analysis becomes too slow to get any ROI from the data
  • 7. Big Data Pipelining: Big Data Needs •  Scalable infrastructure + distributed technologies •  Allow data volumes to be scaled •  Faster processing •  More complex processing •  Constant data flow •  Visible, reproducible analysis •  For example, SHAR3 from Data Fellas
  • 8. Big Data Pipelining: Pipeline Flow ADAM
  • 9. Intro To Docker: Quick History What is Docker? •  Open source project started in 2013 •  Easy to build, deploy, copy containers •  Great for packaging and deploying applications •  Similar resource isolation to VMs, but different architecture •  Lightweight •  Containers share the OS kernel •  Fast start •  Layered filesystems – share underlying OS files, directories “Each virtual machine includes the application, the necessary binaries and libraries and an entire guest operating system - all of which may be tens of GBs in size.” “Containers include the application and all of its dependencies, but share the kernel with other containers. They run as an isolated process in userspace on the host operating system. They’re also not tied to any specific infrastructure – Docker containers run on any computer, on any infrastructure and in any cloud.”
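The packaging and isolation points above can be made concrete with a minimal session (the image name is hypothetical; assumes Docker is installed and a Dockerfile exists in the current directory):

```shell
# package an application and its dependencies into an image, then run it
# as an isolated process that shares the host kernel with other containers
docker build -t demo-pipeline .                  # build an image from ./Dockerfile
docker run -it --rm -p 8080:80 demo-pipeline     # map host port 8080 to container port 80
```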
  • 10. Intro To ADAM: Quick History What is ADAM? •  Started at UC Berkeley in 2012 •  Open-source library for bioinformatics analysis, written for Spark •  Spark’s ability to parallelize an analysis pipeline is a natural fit for genomics methods •  A set of formats, APIs, and processing stage implementations for genomic data •  Fully open source under the Apache 2 license •  Implemented on top of Avro and Parquet for data storage •  Compatible with Spark up to 1.5.1
  • 11. Intro To Spark: Quick History What is Apache Spark? •  Started at UC Berkeley in 2009 •  Apache Project since 2010 •  Fast - 10x-100x faster than Hadoop MapReduce •  Distributed in-memory processing •  Rich Scala, Java and Python APIs •  2x-5x less code than R •  Batch and streaming analytics •  Interactive shell (REPL)
  • 12. Intro To Spark-Notebook: Quick History What is Spark-Notebook? •  Drive your data analysis from the browser •  Can be deployed on a single host or large cluster e.g. Mesos, ec2, GCE etc. •  Features tight integration with Apache Spark and offers handy tools to analysts: •  Reproducible visual analysis •  Charting •  Widgets •  Dynamic forms •  SQL support •  Extensible with custom libraries
  • 13. Intro To Parquet: Quick History What is Parquet? •  Started at Twitter and Cloudera in 2013 •  Databases traditionally store information in rows and are optimized for working with one record at a time •  Columnar storage systems are optimized to store data by column •  Netflix is a big user – 7 PB of warehoused data in Parquet format •  A compressed, efficient columnar data representation •  Allows complex data to be encoded efficiently •  Compression schemes can be specified on a per-column level •  Not as compressed as ORC (Hortonworks) but faster for reads and analysis
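The columnar benefit can be sketched with the Spark 1.x SQLContext API (file names are hypothetical; assumes an existing SparkContext `sc`):

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("events.json")
df.write.parquet("events.parquet")            // compressed, column-oriented on disk
// a query touching one column only reads that column's data from disk
sqlContext.read.parquet("events.parquet").select("user_id").show()
```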
  • 14. Intro To Cassandra: Quick History What is Apache Cassandra? •  Originally started at Facebook in 2008 •  Top level Apache project since 2010 •  Open source distributed database •  Handles large amounts of data •  At high velocity •  Across multiple data centres •  No single point of failure •  Continuous Availability •  Disaster avoidance •  Enterprise Cassandra from Datastax
  • 15. Intro To Akka: Quick History What is Akka? •  Open source toolkit first released in 2009 •  Simplifies the construction of concurrent and distributed Java applications •  Primarily designed for actor-based concurrency •  Akka enforces parental supervision •  Actors are arranged hierarchically •  Each actor is created and supervised by its parent actor •  Program failures treated as events handled by an actor's supervisor •  Message-based and asynchronous; typically no mutable data are shared •  Language bindings exist for both Java and Scala
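A minimal classic-actor sketch of the message-based, asynchronous model described above (assumes the akka-actor library on the classpath):

```scala
import akka.actor.{Actor, ActorSystem, Props}

class Greeter extends Actor {
  def receive = {
    case name: String => println(s"Hello, $name")   // actor state stays private; no shared mutable data
  }
}

val system  = ActorSystem("demo")
val greeter = system.actorOf(Props[Greeter], "greeter")  // created and supervised within the hierarchy
greeter ! "Cassandra"   // fire-and-forget: the send never blocks the caller
```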
  • 16. Spark: RDD What Is A Resilient Distributed Dataset? •  RDD - a distributed, memory abstraction for parallel in-memory computations •  RDD represents a dataset consisting of objects and records •  Such as Scala, Java or Python objects •  RDD is distributed across nodes in the Spark cluster •  Nodes hold partitions and partitions hold records •  RDD is read-only or immutable •  RDD can be transformed into a new RDD •  Operations •  Transformations (e.g. map, filter, groupBy) •  Actions (e.g. count, collect, save)
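The transformation/action split can be sketched as follows (assumes an existing SparkContext `sc`; the data is illustrative):

```scala
val lines = sc.parallelize(Seq("a,1", "b,2", "c,3"))   // RDD distributed across the cluster
val pairs = lines.map(_.split(","))                    // transformation: lazy, returns a new immutable RDD
val big   = pairs.filter(_(1).toInt > 1)               // transformation: still nothing computed
big.count()                                            // action: triggers the distributed job
```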
  • 17. Spark: DataFrames What Is A DataFrame? •  Inspired by data frames in R and Python •  Data is organized into named columns •  Conceptually equivalent to a table in a relational database •  Can be constructed from a wide array of sources •  structured data files - JSON, Parquet •  tables in Hive •  relational database systems via JDBC •  existing RDDs •  Can be extended to support any third-party data formats or sources •  Existing third-party extensions already include Avro, CSV, ElasticSearch, and Cassandra •  Enables applications to easily combine data from disparate sources
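A short sketch of constructing a DataFrame from a structured file (hypothetical file name; Spark 1.x API, assumes `sc`):

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val people = sqlContext.read.json("people.json")          // named columns inferred from the JSON
people.filter(people("age") > 21).select("name").show()   // relational-style operations on the columns
```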
  • 18. Spark & Cassandra: How? How Does Spark Access Cassandra? •  DataStax Cassandra Spark driver – open source! •  Open source: •  https://github.com/datastax/spark-cassandra-connector •  Compatible with •  Spark 0.9+ •  Cassandra 2.0+ •  DataStax Enterprise 4.5+ •  Scala 2.10 and 2.11 •  Java and Python •  Expose Cassandra tables as Spark RDDs •  Execute arbitrary CQL queries in Spark applications •  Saves RDDs back to Cassandra via saveToCassandra call
  • 19. Spark: How Do You Access RDDs? Create A ‘Spark Context’ •  To create an RDD you need a Spark Context object •  A Spark Context represents a connection to a Spark Cluster •  In the Spark shell the sc object is created automatically •  In a standalone application a Spark Context must be constructed
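In a standalone application the construction looks roughly like this (the master URL and Cassandra host are assumptions for a local sandbox):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("pipeline-demo")
  .setMaster("spark://127.0.0.1:7077")                    // or "local[*]" on a single host
  .set("spark.cassandra.connection.host", "127.0.0.1")    // lets the Cassandra connector find the cluster
val sc = new SparkContext(conf)
```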
  • 20. Spark: Architecture Spark Architecture •  Master-worker architecture •  One master •  Spark Workers run on all nodes •  Executors belonging to different clients/SCs are isolated •  Executors belonging to the same client/SCs can communicate •  Client jobs are divided into tasks, executed by multiple threads •  First Spark node promoted as Spark Master •  Master HA feature available in DataStax Enterprise •  Standby Master promoted on failure •  Workers are resilient by default
  • 21. Open Source: Analytics Integration •  Apache Spark for Real-Time Analytics •  Analytics nodes separate from data nodes •  ETL required Cassandra Cluster ETL Spark Cluster •  Loose integration •  Data separate from processing •  Millisecond response times Solr Cluster ES Cluster 10 core 16GB minimum
  • 22. DataStax Enterprise: Analytics Integration Cassandra Cluster Spark, Solr Cluster ETL Spark Cluster •  Tight integration •  Data locality •  Microsecond response times X •  Integrated Apache Spark for Real-Time Analytics •  Integrated Apache Solr for Enterprise Search •  Search and analytics nodes close to data •  No ETL required X Solr Cluster ES Cluster 12+ core 32GB+
  • 23. Big Data Pipelining: Demo Build & Run Steps 1.  Provision a 64-bit Linux environment 2.  Pre-requisites (5 mins) 3.  Install Docker (5 mins) 4.  Clone the Pipeline Repo from GitHub (2 mins) 5.  Pull the Docker image from Docker Hub (20 mins) 6.  Run the image as a container (5 mins) 7.  Run the demo setup script - inside the container (2 mins) 8.  Run the demo from a browser - on the host (30 mins)
  • 24. Big Data Pipelining: Demo Steps 1.  Provision a host Required machine spec: 3 cores, 5GB •  Linux machine http://www.ubuntu.com/download/desktop •  Create a VM (e.g. Ubuntu) http://virtualboxes.org/images/ubuntu/ http://www.osboxes.org/ubuntu/
  • 25. Big Data Pipelining: Demo Steps 2.  Pre-requisites https://docs.docker.com/installation/ubuntulinux/ •  Updates to apt-get sources and gpg key •  Check kernel version
  • 26. Big Data Pipelining: Demo Steps 3.  Install Docker $ sudo apt-get update $ sudo apt-get install docker $ sudo usermod -aG docker <myuserid> Log out/in $ docker run hello-world
  • 27. Big Data Pipelining: Demo Steps 4.  Clone the Pipeline repo $ mkdir ~/pipeline $ cd ~/pipeline $ git clone https://github.com/distributed-freaks/pipeline.git
  • 28. Big Data Pipelining: Demo Steps 5.  Pull the Pipeline image $ docker pull xtordoir/pipeline
  • 29. Big Data Pipelining: Demo Steps 6.  Run the Pipeline image as a container $ docker run -it -m 8g -p 30080:80 -p 34040-34045:4040-4045 -p 9160:9160 -p 9042:9042 -p 39200:9200 -p 37077:7077 -p 36060:6060 -p 36061:6061 -p 32181:2181 -p 38090:8090 -p 38099:8099 -p 30000:10000 -p 30070:50070 -p 30090:50090 -p 39092:9092 -p 36066:6066 -p 39000:9000 -p 39999:19999 -p 36081:6081 -p 35601:5601 -p 37979:7979 -p 38989:8989 xtordoir/pipeline bash
  • 30. Big Data Pipelining: Demo Steps 7.  Run the demo setup script in the container $ cd pipeline $ source devoxx-setup.sh   # ignore Cassandra errors Run cqlsh
  • 31. Big Data Pipelining: Demo Steps 8.  Run the demo in the host browser http://localhost:39000/tree/pipeline
  • 34. Big Data Pipelining: Appendix RDD/Cassandra Reference
  • 35. Spark: RDD How Do You Create An RDD? 1.  From an existing collection: ‘action’
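The collection case is along these lines (assumes an existing SparkContext `sc`):

```scala
val words = sc.parallelize(Seq("spark", "cassandra", "akka"))  // distribute a local collection
words.count()                                                  // 'action': returns 3
```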
  • 36. Spark: RDD How Do You Create An RDD? 2.  From a text file: ‘action’
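The text-file case looks like this (hypothetical path; `textFile` also accepts local `file://` paths):

```scala
val lines = sc.textFile("hdfs:///data/sample.txt")   // one RDD element per line
lines.first()                                        // 'action': fetch the first line
```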
  • 37. Spark: RDD How Do You Create An RDD? 3.  From a data in a Cassandra database: ‘action’
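The Cassandra case uses the connector's implicit extension of the SparkContext (keyspace and table names are hypothetical):

```scala
import com.datastax.spark.connector._               // adds cassandraTable to the SparkContext
val rows = sc.cassandraTable("demo_ks", "readings") // RDD of CassandraRow objects
rows.count()                                        // 'action': distributed count over the table
```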
  • 38. Spark: RDD How Do You Create An RDD? 4.  From an existing RDD: ‘action’ ‘transformation’
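Deriving an RDD from an existing RDD pairs a lazy transformation with an action:

```scala
val nums    = sc.parallelize(1 to 100)
val squares = nums.map(n => n * n)   // 'transformation': a new RDD, computed lazily
squares.take(5)                      // 'action': returns Array(1, 4, 9, 16, 25)
```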
  • 39. Spark: RDDs & Cassandra Accessing Data As An RDD ‘action’ RDD method
  • 40. Spark: Filtering Data In Cassandra Server-side Selection •  Reduce the amount of data transferred •  Selecting rows (by clustering columns and/or secondary indexes)
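A sketch of pushing selection down to Cassandra (hypothetical schema; assumes the connector import and `sc`):

```scala
import com.datastax.spark.connector._
// select() and where() are executed by Cassandra, so only matching
// rows and columns cross the network into Spark
val recent = sc.cassandraTable("demo_ks", "readings")
  .select("sensor_id", "ts", "value")
  .where("ts > ?", "2015-11-01 00:00:00")   // ts assumed to be a clustering column
```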
  • 41. Spark: Saving Data In Cassandra Saving Data •  saveToCassandra
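The write path is symmetrical (hypothetical keyspace, table and data):

```scala
import com.datastax.spark.connector._
val readings = sc.parallelize(Seq(("s1", 21.5), ("s2", 19.8)))
// tuple fields are mapped positionally onto the named columns
readings.saveToCassandra("demo_ks", "readings", SomeColumns("sensor_id", "value"))
```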
  • 42. Spark: Using SparkSQL & Cassandra You Can Also Access Cassandra Via SparkSQL! •  Spark Conf object can be used to create a Cassandra-aware Spark SQL context object •  Use regular CQL syntax •  Cross table operations - joins, unions etc!
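A sketch of the SparkSQL route, including a cross-table join (hypothetical schema; uses the connector's CassandraSQLContext from the 1.x releases):

```scala
import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)
// regular SQL syntax over Cassandra tables, with operations CQL alone
// does not support, such as joins
val df = cc.sql(
  "SELECT r.sensor_id, s.location, r.value " +
  "FROM demo_ks.readings r JOIN demo_ks.sensors s ON r.sensor_id = s.sensor_id")
df.show()
```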
  • 43. Spark: Streaming Data Spark Streaming •  High velocity data – IoT, sensors, Twitter etc •  Micro batching •  Each batch represented as RDD •  Fault tolerant •  Exactly-once processing •  Represents a unified stream and batch processing framework
  • 44. Spark: Streaming Data Into Cassandra Streaming Example
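A streaming-into-Cassandra sketch along the lines of this slide (hypothetical socket source and schema; assumes `sc` and the connector's streaming support):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._   // adds saveToCassandra to DStreams

val ssc = new StreamingContext(sc, Seconds(5))          // 5-second micro-batches, each an RDD
val stream = ssc.socketTextStream("localhost", 9999)    // e.g. lines of "sensor_id,value"
stream
  .map(_.split(","))
  .map(a => (a(0), a(1).toDouble))
  .saveToCassandra("demo_ks", "readings", SomeColumns("sensor_id", "value"))
ssc.start()
ssc.awaitTermination()
```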