1. Introduction To Big Data Pipelining
with Docker, Cassandra, Spark,
Spark-Notebook & Akka
2. Apache Cassandra and DataStax enthusiast who enjoys explaining to customers
that the traditional approaches to data management just don’t cut it anymore in
the new always on, no single point of failure, high volume, high velocity, real time
distributed data management world.
Previously 25 years designing, building, implementing and supporting complex
data management solutions with traditional RDBMS technology including Oracle
Hyperion & E-Business Suite deployments at clients such as the Financial
Services Authority, Olympic Delivery Authority, BT, RBS, Virgin Entertainment,
HP, Sun and Oracle.
Oracle certified in Exadata, Oracle Cloud, Oracle Essbase, Oracle Linux and
OBIEE, and worked extensively with Oracle Hyperion, Oracle E-Business Suite,
Oracle Virtual Machine and Oracle Exalytics.
simon.ambridge@datastax.com
@stratman1958
Simon Ambridge
Pre-Sales Solution Engineer, Datastax UK
3. Big Data Pipelining: Outline
• 1-Hour introduction to Big Data Pipelining and a working sandbox
• Presented at a half-day workshop at Devoxx, November 2015
• Uses Data Pipeline environment from Data Fellas
• Contributors from Typesafe, Mesos, Datastax
• Demonstrates how to use scalable, distributed technologies
• Docker
• Spark
• Spark-Notebook
• Cassandra
• Objective is to introduce the demo environment
• Key takeaway – understanding how to build a reactive, repeatable Big Data
pipeline
4. Big Data Pipelining: Devoxx & Data Fellas
Andy Petrella
• Co-founder of Data Fellas
• Certified Scala/Spark trainer and author of the Learning Play! Framework 2 book
• Creator of Spark-Notebook, one of the top projects on GitHub related to Apache Spark and Scala
Xavier Tordoir
• Co-founder of Data Fellas
• Ph.D. in experimental atomic physics
• Specialist in the prediction of biological molecular structures and interactions, and in applied Machine Learning methodologies
Iulian Dragos
• Key member of Martin Odersky’s Scala team at Typesafe
• For the last six years the main contributor to many critical Scala components, including the compiler backend, its optimizer and the Eclipse build manager
Simon Ambridge
• DataStax Solutions Engineer
• Prior to DataStax, extensive experience with traditional RDBMS technologies at Oracle, Sun, Compaq, DEC etc.
5. Big Data Pipelining: Legacy
Sampling → Data Modeling → Tuning → Report → Interpret (repeated iterations)
• Sampling and analysis often run on a single machine
• CPU and memory limitations
• Frequently dictates limited sampling because of data size limitations
• Multiple iterations over large datasets
6. Big Data Pipelining: Big Data Problems
• Data is getting bigger or, more accurately, the number of
available data sources is exploding
• Sampling the data is becoming more difficult
• The validity of the analysis becomes obsolete faster
• Analysis becomes too slow to get any ROI from the data
7. Big Data Pipelining: Big Data Needs
• Scalable infrastructure + distributed technologies
• Allow data volumes to be scaled
• Faster processing
• More complex processing
• Constant data flow
• Visible, reproducible analysis
• For example, SHAR3 from Data Fellas
9. Intro To Docker: Quick History
What is Docker?
• Open source project started in 2013
• Easy to build, deploy, copy containers
• Great for packaging and deploying applications
• Similar resource isolation to VMs, but different architecture
• Lightweight
• Containers share the OS kernel
• Fast start
• Layered filesystems – share underlying OS files, directories
“Each virtual machine includes the application, the necessary binaries and libraries and an entire guest operating system - all of which may be tens of GBs in size.”
“Containers include the application and all of its dependencies, but share the kernel with other containers. They run as an isolated process in userspace on the host operating system. They’re also not tied to any specific infrastructure – Docker containers run on any computer, on any infrastructure and in any cloud.”
10. Intro To ADAM: Quick History
What is ADAM?
• Started at UC Berkeley in 2012
• Open-source library for bioinformatics analysis, written for Spark
• Spark’s ability to parallelize an analysis pipeline is a natural fit for genomics
methods
• A set of formats, APIs, and processing stage implementations for genomic
data
• Fully open source under the Apache 2 license
• Implemented on top of Avro and Parquet for data storage
• Compatible with Spark up to 1.5.1
11. Intro To Spark: Quick History
What is Apache Spark?
• Started at UC Berkeley in 2009
• Apache Project since 2010
• Fast - 10x-100x faster than Hadoop MapReduce
• Distributed in-memory processing
• Rich Scala, Java and Python APIs
• 2x-5x less code than R
• Batch and streaming analytics
• Interactive shell (REPL)
12. Intro To Spark-Notebook: Quick History
What is Spark-Notebook?
• Drive your data analysis from the browser
• Can be deployed on a single host or a large cluster, e.g. Mesos, EC2, GCE
• Features tight integration with Apache Spark and offers handy tools to
analysts:
• Reproducible visual analysis
• Charting
• Widgets
• Dynamic forms
• SQL support
• Extensible with custom libraries
13. Intro To Parquet: Quick History
What is Parquet?
• Started at Twitter and Cloudera in 2013
• Databases traditionally store information in rows and are optimized for
working with one record at a time
• Columnar storage systems optimised to store data by column
• Netflix is a big user - 7 PB of warehoused data in Parquet format
• A compressed, efficient columnar data representation
• Allows complex data to be encoded efficiently
• Compression schemes can be specified on a per-column level
• Not as compressed as ORC (Hortonworks) but faster read/analysis
14. Intro To Cassandra: Quick History
What is Apache Cassandra?
• Originally started at Facebook in 2008
• Top level Apache project since 2010
• Open source distributed database
• Handles large amounts of data
• At high velocity
• Across multiple data centres
• No single point of failure
• Continuous Availability
• Disaster avoidance
• Enterprise Cassandra from Datastax
15. Intro To Akka: Quick History
What is Akka?
• Open source toolkit first released in 2009
• Simplifies the construction of concurrent and distributed applications on the JVM
• Primarily designed for actor-based concurrency
• Akka enforces parental supervision
• Actors are arranged hierarchically
• Each actor is created and supervised by its parent actor
• Program failures treated as events handled by an actor's supervisor
• Message-based and asynchronous; typically no mutable data are shared
• Language bindings exist for both Java and Scala
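The actor model above can be sketched with Akka’s classic (untyped) Scala API; the `Greeter` actor and the system/actor names are illustrative, and the snippet assumes the akka-actor library is on the classpath.

```scala
import akka.actor.{Actor, ActorSystem, Props}

// A minimal actor: state is private, interaction is message-based and asynchronous
class Greeter extends Actor {
  def receive = {
    case name: String => println(s"Hello, $name")
  }
}

val system  = ActorSystem("demo")                       // root of the actor hierarchy
val greeter = system.actorOf(Props[Greeter], "greeter") // created and supervised by the system
greeter ! "Cassandra"                                   // fire-and-forget, no shared mutable state
```

If `Greeter` threw an exception while processing a message, that failure would be handled by its supervisor rather than crashing the caller.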
16. Spark: RDD
What Is A Resilient Distributed Dataset?
• RDD - a distributed memory abstraction for parallel in-memory computations
• RDD represents a dataset consisting of objects and records
• Such as Scala, Java or Python objects
• RDD is distributed across nodes in the Spark cluster
• Nodes hold partitions and partitions hold records
• RDD is read-only (immutable)
• RDD can be transformed into a new RDD
• Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
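A minimal sketch of the transformation/action split, assuming `sc` is an existing Spark Context (e.g. the one the Spark shell creates automatically):

```scala
val numbers = sc.parallelize(1 to 100)     // RDD partitioned across the cluster
val evens   = numbers.filter(_ % 2 == 0)   // transformation - lazy, returns a new RDD
val doubled = evens.map(_ * 2)             // transformation - still nothing has executed
val total   = doubled.count()              // action - triggers the whole computation
```

Nothing runs until `count()` is called; the two transformations only record how the new RDDs are derived from `numbers`.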
17. Spark: DataFrames
What Is A DataFrame?
• Inspired by data frames in R and Python
• Data is organized into named columns
• Conceptually equivalent to a table in a relational database
• Can be constructed from a wide array of sources
• structured data files - JSON, Parquet
• tables in Hive
• relational database systems via JDBC
• existing RDDs
• Can be extended to support any third-party data formats or sources
• Existing third-party extensions already include Avro, CSV, ElasticSearch,
and Cassandra
• Enables applications to easily combine data from disparate sources
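A sketch of constructing DataFrames from two of the sources listed, using the Spark 1.x `SQLContext` API that matches this deck’s era; the file names are illustrative:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // sc is an existing SparkContext

val people = sqlContext.read.json("people.json")        // structured data file
val events = sqlContext.read.parquet("events.parquet")  // columnar Parquet file

people.select("name").show()   // named columns, as in a relational table
```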
18. Spark & Cassandra: How?
How Does Spark Access Cassandra?
• DataStax Cassandra Spark driver – open source!
• Open source:
• https://github.com/datastax/spark-cassandra-connector
• Compatible with
• Spark 0.9+
• Cassandra 2.0+
• DataStax Enterprise 4.5+
• Scala 2.10 and 2.11
• Java and Python
• Expose Cassandra tables as Spark RDDs
• Execute arbitrary CQL queries in Spark applications
• Save RDDs back to Cassandra via the saveToCassandra call
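In sketch form (keyspace, table and column names are illustrative; the connector jar must be on the classpath and `spark.cassandra.connection.host` set in the Spark configuration):

```scala
import com.datastax.spark.connector._   // adds cassandraTable and saveToCassandra

// Expose a Cassandra table as a Spark RDD
val tracks = sc.cassandraTable("music", "tracks")

// Save an RDD of tuples back to a Cassandra table
val counts = sc.parallelize(Seq(("artist1", 10L), ("artist2", 7L)))
counts.saveToCassandra("music", "play_counts", SomeColumns("artist", "plays"))
```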
19. Spark: How Do You Access RDDs?
Create A ‘Spark Context’
• To create an RDD you need a Spark Context object
• A Spark Context represents a connection to a Spark Cluster
• In the Spark shell the sc object is created automatically
• In a standalone application a Spark Context must be constructed
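A standalone application would build the context roughly like this (the app name and master URL are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("spark://master-host:7077")   // connection to the Spark cluster
val sc = new SparkContext(conf)            // equivalent of the shell's automatic sc
```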
20. Spark: Architecture
Spark Architecture
• Master-worker architecture
• One master
• Spark Workers run on all nodes
• Executors belonging to different clients/SCs are isolated
• Executors belonging to the same client/SCs can communicate
• Client jobs are divided into tasks, executed by multiple threads
• First Spark node promoted as Spark Master
• Master HA feature available in DataStax Enterprise
• Standby Master promoted on failure
• Workers are resilient by default
21. Open Source: Analytics Integration
• Apache Spark for Real-Time Analytics
• Analytics nodes separate from data nodes
• ETL required
• Loose integration
• Data separate from processing
• Millisecond response times
[Diagram: Cassandra cluster linked by ETL to separate Spark, Solr and Elasticsearch clusters - 10-core, 16GB minimum per node]
22. DataStax Enterprise: Analytics Integration
• Integrated Apache Spark for Real-Time Analytics
• Integrated Apache Solr for Enterprise Search
• Search and analytics nodes close to data
• No ETL required
• Tight integration
• Data locality
• Microsecond response times
[Diagram: a single DataStax Enterprise cluster combining Cassandra, Spark and Solr - no ETL to external Spark, Solr or Elasticsearch clusters - 12+ cores, 32GB+ per node]
23. Big Data Pipelining: Demo
Build & Run Steps
1. Provision a 64-bit Linux environment
2. Pre-requisites (5 mins)
3. Install Docker (5 mins)
4. Clone the Pipeline Repo from GitHub (2 mins)
5. Pull the Docker image from Docker Hub (20 mins)
6. Run the image as a container (5 mins)
7. Run the demo setup script - inside the container (2 mins)
8. Run the demo from a browser - on the host (30 mins)
24. Big Data Pipelining: Demo
Steps
1. Provision a host
Required machine spec: 3 cores, 5GB
• Linux machine
http://www.ubuntu.com/download/desktop
• Create a VM (e.g. Ubuntu)
http://virtualboxes.org/images/ubuntu/
http://www.osboxes.org/ubuntu/
25. Big Data Pipelining: Demo
Steps
2. Pre-requisites
https://docs.docker.com/installation/ubuntulinux/
• Updates to apt-get sources and gpg key
• Check kernel version
26. Big Data Pipelining: Demo
Steps
3. Install Docker
$ sudo apt-get update
$ sudo apt-get install docker
$ sudo usermod -aG docker <myuserid>
Log out/in
$ docker run hello-world
27. Big Data Pipelining: Demo
Steps
4. Clone the Pipeline repo
$ mkdir ~/pipeline
$ cd ~/pipeline
$ git clone https://github.com/distributed-freaks/pipeline.git
28. Big Data Pipelining: Demo
Steps
5. Pull the Pipeline image
$ docker pull xtordoir/pipeline
29. Big Data Pipelining: Demo
Steps
6. Run the Pipeline image as a container
$ docker run -it -m 8g \
    -p 30080:80 -p 34040-34045:4040-4045 -p 9160:9160 -p 9042:9042 \
    -p 39200:9200 -p 37077:7077 -p 36060:6060 -p 36061:6061 \
    -p 32181:2181 -p 38090:8090 -p 38099:8099 -p 30000:10000 \
    -p 30070:50070 -p 30090:50090 -p 39092:9092 -p 36066:6066 \
    -p 39000:9000 -p 39999:19999 -p 36081:6081 -p 35601:5601 \
    -p 37979:7979 -p 38989:8989 \
    xtordoir/pipeline bash
30. Big Data Pipelining: Demo
Steps
7. Run the demo setup script in the container
$ cd pipeline
$ source devoxx-setup.sh    # ignore Cassandra errors
Run cqlsh
31. Big Data Pipelining: Demo
Steps
8. Run the demo in the host browser
http://localhost:39000/tree/pipeline
37. Spark: RDD
How Do You Create An RDD?
3. From data in a Cassandra database
[Notebook code example, annotating the ‘action’ used]
38. Spark: RDD
How Do You Create An RDD?
4. From an existing RDD
[Notebook code example, annotating the ‘transformation’ and the ‘action’ used]
39. Spark: RDD’s & Cassandra
Accessing Data As An RDD
[Notebook code example, annotating the RDD method and the ‘action’ used]
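A sketch of what these two creation paths look like with the DataStax connector (keyspace, table and column names are illustrative, and the connector is assumed on the classpath):

```scala
import com.datastax.spark.connector._

// 3. From data in a Cassandra table
val tracks = sc.cassandraTable("music", "tracks")
val n      = tracks.count()                       // 'action'

// 4. From an existing RDD
val titles = tracks.map(_.getString("title"))     // 'transformation' - a new RDD
titles.take(5).foreach(println)                   // 'action'
```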
40. Spark: Filtering Data In Cassandra
Server-side Selection
• Reduce the amount of data transferred
• Selecting rows (by clustering columns and/or secondary indexes)
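With the connector this pushdown is expressed via `where` and `select`: the predicate is evaluated server-side in Cassandra, so only the matching rows and columns cross the network. Table and column names below are illustrative:

```scala
import com.datastax.spark.connector._

val recent = sc.cassandraTable("music", "plays_by_day")
  .where("day = ?", "2015-11-10")   // clustering-column predicate, runs in Cassandra
  .select("artist", "plays")        // transfer only the columns needed
```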
42. Spark: Using SparkSQL & Cassandra
You Can Also Access Cassandra Via SparkSQL!
• Spark Conf object can be used to create a Cassandra-aware Spark SQL context
object
• Use regular CQL syntax
• Cross table operations - joins, unions etc!
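In the connector 1.x API of this era, the Cassandra-aware context was a `CassandraSQLContext`; a sketch with illustrative keyspace, table and column names:

```scala
import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)   // Cassandra-aware Spark SQL context
val top = cc.sql(
  """SELECT t.artist, SUM(p.plays) AS total
    |FROM music.tracks t JOIN music.plays p ON t.track_id = p.track_id
    |GROUP BY t.artist""".stripMargin)  // cross-table join, not possible in plain CQL
```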
43. Spark: Streaming Data
Spark Streaming
• High velocity data – IoT, sensors, Twitter etc
• Micro batching
• Each batch represented as RDD
• Fault tolerant
• Exactly-once processing
• Represents a unified stream and batch processing framework
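A micro-batching sketch: each batch interval produces an RDD that is processed with the same API as batch jobs. The socket source, host/port and 5-second interval are illustrative:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc   = new StreamingContext(sc, Seconds(5))      // each 5s batch becomes an RDD
val lines = ssc.socketTextStream("localhost", 9999)   // high-velocity text source
lines.flatMap(_.split(" "))
     .map(word => (word, 1))
     .reduceByKey(_ + _)                              // same operations as batch RDDs
     .print()
ssc.start()
ssc.awaitTermination()
```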