The requirements for building today’s online applications have changed. Implementing legacy technology hinders your ability to innovate, ensure application performance, and meet the demands of your customers. So how do you determine what underlying systems are the right fit for your needs?
Join us as we review the following to help you get started with DataStax Enterprise:
- What is Cassandra and why should you care?
- What is DataStax Enterprise and how does it differ from Cassandra?
- What are the steps to evaluating DataStax Enterprise?
- Valuable resources to get up to speed on Cassandra and DataStax Enterprise
5. What is Apache Cassandra?
Apache Cassandra™ is a massively scalable NoSQL database.
• Continuous availability
• High performing writes and reads
• Linear scalability
• Multi-data center support
6. Cassandra is Fault Tolerant
[Ring diagram: eight nodes placed at tokens 10 through 80 on the ring, with two clients writing to it. Replication Factor = 3. If a node fails or goes down temporarily, we could still retrieve the data from the other 2 nodes.]
Token  Order_id  Qty  Sale
70     1001      10   100
44     1002      5    50
15     1003      30   200
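The fault-tolerance story in the diagram can be sketched in code. This is a toy model, not Cassandra's actual placement logic: the node tokens (10-80) and RF=3 come from the diagram, and the clockwise placement rule is simplified (real Cassandra uses a partitioner hash and snitch-aware replication strategies).

```java
import java.util.*;

public class TokenRingDemo {
    // Node tokens from the ring diagram above.
    static final List<Integer> NODE_TOKENS = Arrays.asList(10, 20, 30, 40, 50, 60, 70, 80);
    static final int RF = 3; // Replication Factor = 3

    // A partition is owned by the first node walking clockwise whose token is
    // >= the partition's token, and replicated to the next RF-1 nodes.
    static List<Integer> replicasFor(int partitionToken) {
        List<Integer> sorted = new ArrayList<>(NODE_TOKENS);
        Collections.sort(sorted);
        int start = 0;
        while (start < sorted.size() && sorted.get(start) < partitionToken) start++;
        List<Integer> replicas = new ArrayList<>();
        for (int i = 0; i < RF; i++) {
            replicas.add(sorted.get((start + i) % sorted.size()));
        }
        return replicas;
    }

    // Which of a partition's replicas are still reachable, given the live nodes.
    static List<Integer> liveReplicas(int partitionToken, Set<Integer> liveNodes) {
        List<Integer> out = new ArrayList<>();
        for (int node : replicasFor(partitionToken)) {
            if (liveNodes.contains(node)) out.add(node);
        }
        return out;
    }

    public static void main(String[] args) {
        // Order 1001 hashes to token 70, so it lives on nodes 70, 80, and 10.
        System.out.println(replicasFor(70));
        // If node 70 fails, the row can still be read from nodes 80 and 10.
        Set<Integer> live = new HashSet<>(NODE_TOKENS);
        live.remove(70);
        System.out.println(liveReplicas(70, live));
    }
}
```

With RF=3, any single node failure still leaves two live replicas of every partition, which is exactly why the cluster stays readable and writable.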
7. The NoSQL Performance Leader
Netflix Cloud Benchmark:
"In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput."
Source: Netflix Tech Blog
End Point Independent NoSQL Benchmark: highest in throughput, lowest in latency.
Source: Solving Big Data Challenges for Enterprise Application Performance Management, benchmark paper presented at the Very Large Database Conference, 2013.
14. Why DataStax?
DataStax supports both the open source community and modern business enterprises.
Open Source:
• Apache Cassandra (DataStax employs the Cassandra Chair and 30% of committers)
• Community Edition
• OpsCenter Standard
• DevCenter
• Drivers/Connectors
• Online Documentation
• Online Training
• Mailing Lists and Forums
• Standard Security
DataStax Enterprise:
• Enterprise Edition (Tested & Certified for Production)
• OpsCenter Enterprise (Alerts, Automated Management Services, Cluster Management)
• Enterprise Security (Kerberos Authentication & SSL Encryption)
• Built-in Real-time Analytics
• Built-in Enterprise Search
• In-Memory Database Option
• Expert Support (24x7x365)
• Consultative Support
• Onsite Training
16. Cassandra Query Language (CQL)
DataStax DevCenter – a free, visual query tool for creating and running CQL statements against Cassandra and DataStax Enterprise.
17. Cassandra Security
Internal Authentication
• Internal validation of authorized users
• Simple to implement & easy to understand
• No learning curve
Object Permission Management
• Deep control over who can add/change/delete/read data
• Uses familiar GRANT/REVOKE from the relational world
• No learning curve
Client-to-Node Encryption
• Ensures data cannot be captured/stolen en route to a server
• Data is safe both in flight from/to a database and on the database
• Complete coverage is ensured
18. DataStax Enterprise Security
External Authentication
• External validation of authorized users
• Leverages Kerberos & LDAP
• Single sign-on to all data domains
Transparent Data Encryption
• Protects sensitive data at rest
• No changes needed at application level
• Encrypts both Cassandra and Hadoop data
Data Auditing
• Audit trail of all accesses and changes
• Control to audit only what's needed
• Uses the log4j interface to ensure performant & efficient audit operations
19. Built-in Enterprise Search
• Delivers Solr integration
• Very fast performance
• Search indexes span multiple data centers (regular Solr cannot)
• Online scalability via adding new nodes
• Built-in failover; continuously available
[Diagram: a ring of four nodes, each running Cassandra and Solr together.]
20. Built-In Enterprise Analytics
• Real-time analytics on Cassandra hot data
• MapReduce, Hive, Pig, Sqoop, and Mahout
• No single points of failure
• Continuous availability
• Integrated big data platform
[Diagram: a ring of four nodes, each running Cassandra and Hadoop together.]
21. Agenda
Confidential 21
Why Cassandra?
Why DataStax Enterprise?
• Scale with ease
• Always on
• Deploy across data centers
• Enterprise-ready capabilities
• 24x7x365 support
22. Agenda
Why Cassandra?
Why DataStax Enterprise?
How to Evaluate?
24. A Typical POC Environment
• Ideally at least 4 nodes, RF=3
• Hardware per node:
• At least 8 cores
• At least 16 GB RAM (the more the better)
• SSDs, physically attached
• Linux (ideally a 3.x kernel for improved buffered cache)
• Each environment has its own steps/requirements:
• EC2, Rackspace, Google Compute, or other cloud providers
• In-house servers
• In-house server VMs
25. Tailored to Meet Your Needs
FREE Resources:
• DSE Sandbox
• DSE for Non-Production
• OpsCenter (Standard)
• DevCenter
• DataStax Academy
• Community Forums
• White Papers & Documentation
PAID Services:
• Onsite Consulting
• Remote Consulting
• Onsite Training
• Public Training
PAID Subscriptions:
• DSE Standard (Production and Non-Production)
• DSE Pro (Production and Non-Production)
• DSE Max (Production and Non-Production)
PAID Bundles:
• Quick Start Standard
• Quick Start Enterprise
Customer Benefits:
• Customer Success Manager
• Proactive Guidance
• Free Health Check
• Free Migration Assessment
• Monthly Best Practices Bulletin
26. The Right Mix of Support Resources
Training (Education & Training):
• How to use DataStax Enterprise
• Learn DataStax admin features
• How to use integrated search
• How to use integrated analytics
Consulting (Planning & Design):
• DataStax Enterprise architecture
• Data modeling with DataStax
• Cluster tuning and performance
• Best practices and planning
Support (Develop & Test, Production Support):
• Troubleshooting errors
• Experiencing unexpected results
• Clarification on documentation
• Critical issue support
27. Available Online Resources
• Patrick McFadin's data modeling series
• CQL/data modeling on DataStax
• Virtual training
• Java driver sample code
• Solr documentation and tutorial on DataStax
• Analytics documentation
• GitHub code samples
• Advanced time series best practices
28. Agenda
Why Cassandra?
Why DataStax Enterprise?
How to Evaluate?
• Evaluate efficiently
29. Q&A and Next Steps
Want to learn more about the evaluation process?
• Contact your account manager or email us at
sales@datastax.com
Want access to more Cassandra resources?
• Visit Planet Cassandra at www.planetcassandra.com
31. EC2 Install Process with Linux AMIs
• Read through the EC2 production planning guide: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningEC2_c.html
• Go for i2.2xlarge to i2.4xlarge instances
• Create a security group: http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/install/installAMIsecurity.html
• Pick a reputable, reliable Linux image to start with - preferably an image with a 3.x kernel on it
• Run through the wizard and start the AMIs up
• Install the prerequisites: http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installJREJNAabout_c.html
• Install the DSE nodes (steps depend on OS): http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/install/installTOC.html
• Follow the "What's next" section at the bottom of the installation instructions, including configuring the DSE nodes for a single or multiple data centers (topology should be planned for): http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/deploy/deploySingleDC.html#deploySingleDC or http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/deploy/deployMultiDC.html#deployMultiDC
• Follow and set the recommended production settings: http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html
32. Cassandra Architecture Basics – One Node
• Data is organized in partitions
• Inserted data is written to a commit log, as well as to a MemTable
• MemTables are flushed to disk in an SSTable, based on size
• SSTables are immutable; changes to a partition are written to additional SSTables
• Deletes write tombstones
[Diagram: a single node (Node 1) holding two partitions, with partition keys 75 and 9, each pointing to its row data.]
34. Cassandra Architecture Basics – Multi Data Center
• Nodes can be arranged in multiple data centers
• Cassandra replicates data efficiently between remote data centers
• Each data center can have a different RF
• Use data centers to segment nodes for different query patterns
[Diagram: two data centers, Boston and San Francisco, each a five-node ring holding copies of rows 23 and 76; one data center serves real-time traffic and the other analytics.]
41. Hardware
• Ideal node:
• Processor: 8 CPU cores
• Memory: 16-64 GB RAM, with 8 GB of heap
• Network: at least a gigabit card
• Disks: lots of small disks using JBOD or basic RAID (0 or 10), but prefer SSDs
• Exact needs vary by use case
• Production planning: http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/architecture/architecturePlanningHardware_c.html
42. Cassandra Query Language (CQL)
• Very similar to RDBMS SQL syntax
• Create objects via DDL (e.g. CREATE ...)
• Core DML commands supported: INSERT, UPDATE, DELETE
• Query data with SELECT
• Leverage the Java driver to execute queries via PreparedStatements and ResultSets
SELECT *
FROM USERS
WHERE STATE = 'TX';
43. Cassandra is Durable
• Data is organized into partitions
• Inserted data is written to a commit log for the node, as well as to a MemTable
• MemTables are flushed to disk in an SSTable, based on size
• SSTables are immutable
[Diagram: a client write goes to the commit log and to an in-memory MemTable; the MemTable is flushed to disk as an SSTable.]
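The write path above can be modeled in a few lines. This is a toy sketch (the class, the flush threshold, and the tombstone marker are invented for illustration), not the real storage engine:

```java
import java.util.*;

// Toy model of the Cassandra write path: writes land in a commit log and a
// memtable; a "full" memtable is flushed as an immutable SSTable; deletes
// write tombstones rather than removing data in place.
public class WritePathDemo {
    static final int MEMTABLE_LIMIT = 2;          // flush after this many entries
    static final String TOMBSTONE = "<tombstone>";

    final List<String> commitLog = new ArrayList<>();
    final Map<String, String> memtable = new LinkedHashMap<>();
    final List<Map<String, String>> sstables = new ArrayList<>(); // immutable once flushed

    void write(String key, String value) {
        commitLog.add(key + "=" + value);          // durability first: log the write
        memtable.put(key, value);                  // then update the in-memory table
        if (memtable.size() >= MEMTABLE_LIMIT) flush();
    }

    void delete(String key) { write(key, TOMBSTONE); }

    void flush() {
        // The flushed table can never be modified again.
        sstables.add(Collections.unmodifiableMap(new LinkedHashMap<>(memtable)));
        memtable.clear();
    }

    // Reads merge the memtable and SSTables; the newest value wins.
    String read(String key) {
        if (memtable.containsKey(key)) return checkTombstone(memtable.get(key));
        for (int i = sstables.size() - 1; i >= 0; i--) {
            if (sstables.get(i).containsKey(key)) return checkTombstone(sstables.get(i).get(key));
        }
        return null;
    }

    private static String checkTombstone(String v) { return TOMBSTONE.equals(v) ? null : v; }
}
```

Note how an update never touches an existing SSTable: it simply lands in a newer one, and reads resolve the conflict by preferring the most recent write.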
44. Overview of Replication in Cassandra
• Replication is controlled by the replication factor (RF). An RF of 1 means there is only one copy of each row in the cluster; an RF of 2 means there are two copies of each row stored in the cluster.
• Replication is controlled at the keyspace level in Cassandra.
[Diagram: the original row plus copies on additional nodes; the replication factor determines how many additional nodes get a copy of the partition, e.g. RF=3.]
45. Data Model
• The schema used in Cassandra is modeled after Google Bigtable: a row-oriented, columnar structure
• A keyspace is akin to a database in the RDBMS world
• A column family is similar to an RDBMS table but is more flexible/dynamic
• A row in a column family is indexed by its key
[Example: a Portfolio keyspace containing a Customer column family with columns ID, Name, SSN, DOB.]
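As a hedged sketch, the Portfolio/Customer model above might look like this in CQL (the keyspace, table, and column names follow the slide; the replication settings and column types are illustrative assumptions):

```sql
-- A keyspace is akin to a database; replication is set here, per keyspace.
CREATE KEYSPACE portfolio
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- A column family (table) holds rows indexed by their key.
CREATE TABLE portfolio.customer (
  id   uuid PRIMARY KEY,  -- the row key
  name text,
  ssn  text,
  dob  timestamp
);
```

Note that the keyspace-level replication setting is where the RF discussed on the previous slide is actually declared.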
46. Tunable Data Consistency
• Choose between strong and eventual consistency (one to all replicas responding) depending on the need
• Can be set on a per-operation basis, for both reads and writes
• Handles multi-data center operations
Write consistency levels: ANY, ONE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL
Read consistency levels: ONE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL
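The consistency levels above reduce to simple replica arithmetic. A small sketch (the class and method names are invented for illustration) of how QUORUM and strong consistency work out:

```java
public class ConsistencyMath {
    // QUORUM is a strict majority of the replicas: floor(RF / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // An operation at a given level succeeds only if that many replicas respond.
    static boolean canSucceed(int requiredReplicas, int liveReplicas) {
        return liveReplicas >= requiredReplicas;
    }

    // Reads are strongly consistent when the read and write replica sets must
    // overlap: R + W > RF guarantees some replica has the latest write.
    static boolean stronglyConsistent(int readReplicas, int writeReplicas, int rf) {
        return readReplicas + writeReplicas > rf;
    }

    public static void main(String[] args) {
        int rf = 3;
        System.out.println(quorum(rf));                   // 2
        System.out.println(canSucceed(quorum(rf), 2));    // QUORUM survives 1 node down
        System.out.println(canSucceed(rf, 2));            // ALL does not
        System.out.println(stronglyConsistent(2, 2, rf)); // QUORUM reads + QUORUM writes
    }
}
```

This is why QUORUM reads plus QUORUM writes at RF=3 give strong consistency while still tolerating a single node failure.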
Today, we are going to cover the technical basics of Cassandra and DataStax Enterprise and then discuss the typical evaluation process.
Count of current companies/groups: over 1000 using Cassandra, over 500 using DataStax
This presentation will focus on three practical topics for getting started with DataStax Enterprise: (1) why Cassandra, (2) why DSE, and (3) how clients typically run the evaluation process, with recommended resources along the way.
Always on:
Peer-to-peer architecture - all nodes are equal; each node is responsible for an assigned range (or ranges) of data.
Clients can write (or read) data to any node in the ring - native drivers can round-robin across a DC and distribute load to a coordinator node.
The coordinator node writes (or reads) copies of the data to/from the nodes that own each copy.
In the case of a failure (such as a drive going down), 2 out of the 3 replica nodes are still up, so the ability to write and read data still works - and therefore C* is always on.
Independent benchmarks prove out linear scalability - Netflix and the University of Toronto; at any node count, this is what we see for a read/write mix.
Source: Solving Big Data Challenges for Enterprise Application Performance Management, Tilman Rable, et al., August 2013, p. 10. Benchmark paper presented at the Very Large Database Conference, 2013. http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2013.pdf
Source: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Need to speed up your reads and writes? It is very simple to add nodes. The improvement is truly linear, as a result of the peer-to-peer architecture of sharing the data. Netflix runs around 3,000 nodes and brings up or down 500 nodes to manage anticipated spikes in load.
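The linear-scalability claim lends itself to a back-of-envelope projection. The numbers below are illustrative, not benchmark results:

```java
public class ScaleProjection {
    // Under linear scaling, throughput grows in proportion to node count.
    static double projectedOpsPerSec(double measuredOps, int measuredNodes, int targetNodes) {
        return measuredOps * targetNodes / measuredNodes;
    }

    public static void main(String[] args) {
        // If a 4-node PoC sustains 40,000 ops/s, a 12-node cluster should
        // sustain roughly three times that under linear scaling.
        System.out.println(projectedOpsPerSec(40_000, 4, 12)); // 120000.0
    }
}
```

This is the same projection logic used later in the evaluation notes: measure a small cluster, then extrapolate (and verify) at larger node counts.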
Multi-DC is very, very easy to configure with Cassandra.
Datacenters are active-active: write to either DC and the other one will get a copy.
In the case of a datacenter outage, applications can carry on via a retry policy that fails over to the other datacenter, which also has a copy of the data.
Outbrain story - Hurricane Sandy.
It is the choice for today's modern online applications - architects know that these types of applications must always stay on and therefore need to easily scale to handle load.
We’ve covered the benefits of using Cassandra: (1) high availability, (2) linear scalability, and (3) ease of multi-DC configuration
Now, we’ll cover the value of DSE – what does DataStax Enterprise bring to the table?
DataStax is the company that delivers Cassandra to the enterprise. First, we take the open source software and put it through rigorous quality assurance tests, including a 1,000-node scalability test. We certify it and provide the world's most comprehensive support, training, and consulting for Cassandra so that you can get up and running quickly. But that isn't all DataStax does. We also build additional software features on top of Cassandra - security, search, analytics, and an in-memory option - that don't come with the open source Cassandra product. We also provide management services to help you visualize your nodes, plan your capacity, and repair issues automatically. Finally, we provide developer tools and drivers as well as monitoring tools. DataStax is the commercial company behind Apache Cassandra, plus a whole host of additional software and services.
Side-by-side comparison of what C* open source offers compared to DSE; note the tested and certified version of the binaries for production, plus the product features and support.
Visual, browser-based user interface negates need to install client software
Administration tasks carried out in point-and-click fashion
Allows for visual rebalance of data across a cluster when new nodes are added
Contains proactive alerts that warn of impending issues.
Built-in external notification abilities
Visually perform and schedule backup operations
CQL is served up using DevCenter, which works with the community edition too; worth mentioning given the ease of working with CQL and its similarities to SQL.
Internal Authentication Manages login IDs and passwords inside the database
Ensures only authorized users can access a database system using internal validation
Simple to implement and easy to understand
No learning curve from the relational world
Object Permission Management
controls who has access to what and who can do what in the database
Provides granular based control over who can add/change/delete/read data
Uses familiar GRANT/REVOKE from relational systems
No learning curve
Client-to-Node Encryption protects data in flight to and from a database cluster
Ensures data cannot be captured/stolen en route to a server
Data is safe both in flight from/to a database and on the database; complete coverage is ensured
External Authentication uses external security software packages to control security
Only authorized users have access to a database system using external validation
Uses most trusted external security packages (Kerberos, LDAP), mainstays in government and finance
Single sign on to all data domains
Transparent Data Encryption encrypts data at rest
Protects sensitive data at rest from theft and from being read at the file system level
No changes needed at application level
Can encrypt both Cassandra and Hadoop data
Data Auditing provides a trail of who did and looked at what, and when
Supplies admins with an audit trail of all accesses and changes
Granular control to audit only what’s needed
Uses log4j interface to ensure performance and efficient audit operations
Built-in enterprise search on Cassandra data via Solr integration
Very fast performance
Search indexes can span multiple data centers (regular Solr cannot)
Online scalability via adding new nodes
Built-in failover; continuously available
Same concepts apply for Hadoop in analytics nodes as compared with SOLR nodes: a great way to run reporting on your data in your database without having to worry about porting over to a separate Hadoop environment – not a substitute for Hadoop, but perfect for a great deal of use cases
Here is a diagram of the typical process clients run through when trying out DataStax. Often, a developer or DBA downloads and installs the sandbox on a local laptop in a Linux environment (such as a VM), or on a dev box, just to try it out. Along the way, use cases are evaluated for fit and data models are designed. At a certain point, there will be a desire to test how Cassandra and DSE as a whole work within a multi-node cluster. Sample data is loaded using a given data model and benchmarks are performed: how hard can you hit the typical 4 nodes with 3 copies of the data before the read/write load saturates the boxes? The cassandra-stress tool or the drivers are used to create the read/write mix. Based on the behavior of 4 nodes, for example, load can be linearly projected (or tested, for that matter) for more nodes.
Pertinent links are provided below:
Sandbox download – http://www.datastax.com/download#dl-sandbox
Binaries download – http://www.datastax.com/download#dl-enterprise
Typical use cases on Planet Cassandra – http://planetcassandra.org/functional-use-cases/ (by function) and http://planetcassandra.org/industry-use-cases/ (by industry)
SOLR Tutorial and Overview - http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/srch/srchTOC.html
Hadoop Overview - http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/ana/anaTOC.html
Data Modeling – http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddlCQLDataModelingTOC.html
http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_intro_c.html http://www.datastax.com/documentation/cql/3.1/cql/cql_using/about_cql_c.html
Copy command – http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/copy_r.html
Java driver - http://www.datastax.com/documentation/developer/java-driver/2.0/common/drivers/introduction/introArchOverview_c.html
Cassandra Stress Tool - http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsCStress_t.html
Here are some of the recommended settings for your PoC environment. Again, we highly recommend starting with at least 3 copies of the data across 4 nodes. SSDs are by far the preferred drives: you will save on the number of servers needed and the electricity paid, and their response time is on the order of 100 times faster for reads and writes. With the latest 3.x versions of Linux, buffered caching is optimized, which helps with performance, since the buffered cache is another way of caching data - the more RAM the better, especially for caching. RAM should be at least 16 GB per box. We have no preference as to which cloud environment is used. There are Amazon AMIs already set up to get folks jump-started on DSE; they can be found by searching for DataStax in the EC2 marketplace. VM images on hosted boxes work fine, but you will lose around 10% efficiency due to resource sharing; if going with VMs, be certain to use directly, physically mounted drives per image. SAN is highly discouraged.
Hardware Recommendations - http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningHardware_c.html
Standard Install Instructions - http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/install/installTOC.html
EC2 Install with template DSE AMI’s - http://vimeo.com/89539972
EC2 Planning Out a Cluster - http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningHardware_c.html
Reference Architecture - http://www.datastax.com/wp-content/uploads/2014/01/WP-DataStax-Enterprise-Reference-Architecture.pdf
See Appendix for EC2 Install with Linux AMI’s (Slide #27)
There are lots of free resources available for both education and evaluation. Most of the items listed on the left of this slide are reachable through the datastax.com website. In this discussion, we are focusing more on those free items; however, there are places where paid-for items make a lot of sense. For example, public training events are listed on datastax.com and can be registered for there, and some clients opt for in-person specialized training for a day or two with an architect. Your account rep can walk you through the options in terms of each of the three engagement models we provide, tailored to meet your needs.
There are also helpful starter packages which you can discuss with the account managers.
With respect to assistance, DataStax provides three categories of people support. For example, learning how Solr and Hadoop nodes work is covered in training. Specific questions, best practices, or performance tuning would be more along the lines of consulting. Support addresses bugs for clients.
Here are some links that we have found ourselves providing to lots of clients along the way and felt were worth sharing, starting with Patrick McFadin's four recorded videos on data modeling.
Patrick McFadin’s Data Modeling Series - http://wiki.apache.org/cassandra/DataModel
Advance Time Series Best Practices - http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling/
CQL/Data Modeling on DataStax - http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_intro_c.html http://www.datastax.com/documentation/cql/3.1/cql/cql_using/about_cql_c.html
Virtual Training - http://www.datastax.com/what-we-offer/products-services/training/virtual-training#tab
Public Training Signup - http://www.datastax.com/what-we-offer/products-services/training
Sample Projects (Java driver code, etc) - https://github.com/DataStaxCodeSamples/
SOLR Documentation and Tutorial on DataStax - http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/srch/srchTOC.html
Analytics documentation - http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/ana/anaTOC.html
Github code samples - https://github.com/DataStaxCodeSamples?query=+only%3Apublic+
There are lots of readily available resources, as you can see, so hopefully this will make your evaluation process as efficient as possible.
Learning Objective: Describe how to read data
This slide demonstrates how to check for a "row not found" condition - it is a best practice to check.
It also demonstrates the use of the one() method, where just one row (or possibly none) is expected.
Learning Objective: Describe what prepared statements are and when to use them
This is an example of using prepared statements.
Prepared statements can be used for inserts or queries typically in a loop (not shown).
Focus on the exceptions here; you don't need to catch all of these, but the strings point out the type of error.
Conserving white space where possible here.
PreparedStatement statement = session.prepare(
    "INSERT INTO user (username, password) " +
    "VALUES (?, ?);");
BoundStatement boundStatement = new BoundStatement(statement);
try {
    session.execute(boundStatement.bind("user4", "user4password"));
} catch (NoHostAvailableException ex) {
    System.out.println("Host not available");
} catch (QueryExecutionException ex) {
    System.out.println("Query could not be executed, e.g. requested consistency level not met");
} catch (QueryValidationException ex) {
    System.out.println("Invalid query: syntax error or not authorized");
}