The requirements for building today’s online applications have changed. Implementing legacy technology hinders your ability to innovate, ensure application performance, and meet the demands of your customers. So how do you determine what underlying systems are the right fit for your needs?
Join us as we review the following to help you get started with DataStax Enterprise:
- What is Cassandra and why should you care?
- What is DataStax Enterprise and how does it differ from Cassandra?
- What are the steps to evaluating DataStax Enterprise?
- Valuable resources to get up to speed on Cassandra and DataStax Enterprise
5. What is Apache Cassandra?
Apache Cassandra™ is a massively scalable NoSQL database.
• Continuous availability
• High performing writes and reads
• Linear scalability
• Multi-data center support
6. Cassandra is Fault Tolerant
[Ring diagram: eight nodes placed at tokens 10 through 80 on the ring, with two clients writing to it. Replication Factor = 3. If a node fails or goes down temporarily, we could still retrieve the data from the other 2 nodes.]
Token  Order_id  Qty  Sale
70     1001      10   100
44     1002      5    50
15     1003      30   200
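The fault-tolerance story in the diagram can be sketched in code. This is a toy model, not Cassandra's actual placement logic: the node tokens (10-80) and RF=3 come from the diagram, and the clockwise placement rule is simplified (real Cassandra uses a partitioner hash and snitch-aware replication strategies).

```java
import java.util.*;

public class TokenRingDemo {
    // Node tokens from the ring diagram above.
    static final List<Integer> NODE_TOKENS = Arrays.asList(10, 20, 30, 40, 50, 60, 70, 80);
    static final int RF = 3; // Replication Factor = 3

    // A partition is owned by the first node walking clockwise whose token is
    // >= the partition's token, and replicated to the next RF-1 nodes.
    static List<Integer> replicasFor(int partitionToken) {
        List<Integer> sorted = new ArrayList<>(NODE_TOKENS);
        Collections.sort(sorted);
        int start = 0;
        while (start < sorted.size() && sorted.get(start) < partitionToken) start++;
        List<Integer> replicas = new ArrayList<>();
        for (int i = 0; i < RF; i++) {
            replicas.add(sorted.get((start + i) % sorted.size()));
        }
        return replicas;
    }

    // Which of a partition's replicas are still reachable, given the live nodes.
    static List<Integer> liveReplicas(int partitionToken, Set<Integer> liveNodes) {
        List<Integer> out = new ArrayList<>();
        for (int node : replicasFor(partitionToken)) {
            if (liveNodes.contains(node)) out.add(node);
        }
        return out;
    }

    public static void main(String[] args) {
        // Order 1001 hashes to token 70, so it lives on nodes 70, 80, and 10.
        System.out.println(replicasFor(70));
        // If node 70 fails, the row can still be read from nodes 80 and 10.
        Set<Integer> live = new HashSet<>(NODE_TOKENS);
        live.remove(70);
        System.out.println(liveReplicas(70, live));
    }
}
```

With RF=3, any single node failure still leaves two live replicas of every partition, which is exactly why the cluster stays readable and writable.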
7. The NoSQL Performance Leader
Netflix Cloud Benchmark:
"In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput."
Source: Netflix Tech Blog
End Point Independent NoSQL Benchmark: highest in throughput, lowest in latency.
Source: Solving Big Data Challenges for Enterprise Application Performance Management, benchmark paper presented at the Very Large Database Conference, 2013.
14. Why DataStax?
DataStax supports both the open source community and modern business enterprises.
Open Source:
• Apache Cassandra (DataStax employs the Cassandra Chair and 30% of committers)
• Community Edition
• OpsCenter Standard
• DevCenter
• Drivers/Connectors
• Online Documentation
• Online Training
• Mailing Lists and Forums
• Standard Security
DataStax Enterprise:
• Enterprise Edition (Tested & Certified for Production)
• OpsCenter Enterprise (Alerts, Automated Management Services, Cluster Management)
• Enterprise Security (Kerberos Authentication & SSL Encryption)
• Built-in Real-time Analytics
• Built-in Enterprise Search
• In-Memory Database Option
• Expert Support (24x7x365)
• Consultative Support
• Onsite Training
16. Cassandra Query Language (CQL)
DataStax DevCenter – a free, visual query tool for creating and running CQL statements against Cassandra and DataStax Enterprise.
17. Cassandra Security
Internal Authentication
• Internal validation of authorized users
• Simple to implement & easy to understand
• No learning curve
Object Permission Management
• Deep control over who can add/change/delete/read data
• Uses familiar GRANT/REVOKE from the relational world
• No learning curve
Client-to-Node Encryption
• Ensures data cannot be captured/stolen en route to a server
• Data is safe both in flight from/to a database and on the database
• Complete coverage is ensured
18. DataStax Enterprise Security
External Authentication
• External validation of authorized users
• Leverages Kerberos & LDAP
• Single sign-on to all data domains
Transparent Data Encryption
• Protects sensitive data at rest
• No changes needed at application level
• Encrypts both Cassandra and Hadoop data
Data Auditing
• Audit trail of all accesses and changes
• Control to audit only what's needed
• Uses the log4j interface to ensure performant & efficient audit operations
19. Built-in Enterprise Search
• Delivers Solr integration
• Very fast performance
• Search indexes span multiple data centers (regular Solr cannot)
• Online scalability via adding new nodes
• Built-in failover; continuously available
[Diagram: a ring of four nodes, each running Cassandra and Solr together.]
20. Built-In Enterprise Analytics
• Real-time analytics on Cassandra hot data
• MapReduce, Hive, Pig, Sqoop, and Mahout
• No single points of failure
• Continuous availability
• Integrated big data platform
[Diagram: a ring of four nodes, each running Cassandra and Hadoop together.]
21. Agenda
Confidential 21
Why Cassandra?
Why DataStax Enterprise?
• Scale with ease
• Always on
• Deploy across data centers
• Enterprise-ready capabilities
• 24x7x365 support
22. Agenda
Why Cassandra?
Why DataStax Enterprise?
How to Evaluate?
24. A Typical POC Environment
• Ideally at least 4 nodes, RF=3
• Hardware per node:
• At least 8 cores
• At least 16 GB RAM (the more the better)
• SSDs, physically attached
• Linux (ideally a 3.x kernel for improved buffered cache)
• Each environment has its own steps/requirements:
• EC2, Rackspace, Google Compute, or other cloud providers
• In-house servers
• In-house server VMs
25. Tailored to Meet Your Needs
FREE Resources:
• DSE Sandbox
• DSE for Non-Production
• OpsCenter (Standard)
• DevCenter
• DataStax Academy
• Community Forums
• White Papers & Documentation
PAID Services:
• Onsite Consulting
• Remote Consulting
• Onsite Training
• Public Training
PAID Subscriptions:
• DSE Standard (Production and Non-Production)
• DSE Pro (Production and Non-Production)
• DSE Max (Production and Non-Production)
PAID Bundles:
• Quick Start Standard
• Quick Start Enterprise
Customer Benefits:
• Customer Success Manager
• Proactive Guidance
• Free Health Check
• Free Migration Assessment
• Monthly Best Practices Bulletin
26. The Right Mix of Support Resources
Training (Education & Training):
• How to use DataStax Enterprise
• Learn DataStax admin features
• How to use integrated search
• How to use integrated analytics
Consulting (Planning & Design):
• DataStax Enterprise architecture
• Data modeling with DataStax
• Cluster tuning and performance
• Best practices and planning
Support (Develop & Test, Production Support):
• Troubleshooting errors
• Experiencing unexpected results
• Clarification on documentation
• Critical issue support
27. Available Online Resources
• Patrick McFadin's data modeling series
• CQL/data modeling on DataStax
• Virtual training
• Java driver sample code
• Solr documentation and tutorial on DataStax
• Analytics documentation
• GitHub code samples
• Advanced time series best practices
28. Agenda
Why Cassandra?
Why DataStax Enterprise?
How to Evaluate?
• Evaluate efficiently
29. Q&A and Next Steps
Want to learn more about the evaluation process?
• Contact your account manager or email us at
sales@datastax.com
Want access to more Cassandra resources?
• Visit Planet Cassandra at www.planetcassandra.com
31. EC2 Install Process with Linux AMIs
• Read through the EC2 production planning guide: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningEC2_c.html
• Go for i2.2xlarge to i2.4xlarge instances
• Create a security group: http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/install/installAMIsecurity.html
• Pick a reputable, reliable Linux image to start with - preferably an image with a 3.x kernel on it
• Run through the wizard and start the AMIs up
• Install the prerequisites: http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installJREJNAabout_c.html
• Install the DSE nodes (steps depend on OS): http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/install/installTOC.html
• Follow the "What's next" section at the bottom of the installation instructions, including configuring the DSE nodes for a single or multiple data centers (topology should be planned for): http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/deploy/deploySingleDC.html#deploySingleDC or http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/deploy/deployMultiDC.html#deployMultiDC
• Follow and set the recommended production settings: http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html
32. Cassandra Architecture Basics – One Node
• Data is organized in partitions
• Inserted data is written to a commit log, as well as to a MemTable
• MemTables are flushed to disk in an SSTable, based on size
• SSTables are immutable; changes to a partition are written to additional SSTables
• Deletes write tombstones
[Diagram: a single node (Node 1) holding two partitions, with partition keys 75 and 9, each pointing to its row data.]
34. Cassandra Architecture Basics – Multi Data Center
• Nodes can be arranged in multiple data centers
• Cassandra replicates data efficiently between remote data centers
• Each data center can have a different RF
• Use data centers to segment nodes for different query patterns
[Diagram: two data centers, Boston and San Francisco, each a five-node ring holding copies of rows 23 and 76; one data center serves real-time traffic and the other analytics.]
41. Hardware
• Ideal node:
• Processor: 8 CPU cores
• Memory: 16-64 GB RAM, with 8 GB of heap
• Network: at least a gigabit card
• Disks: lots of small disks using JBOD or basic RAID (0 or 10), but prefer SSDs
• Exact needs vary by use case
• Production planning: http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/architecture/architecturePlanningHardware_c.html
42. Cassandra Query Language (CQL)
• Very similar to RDBMS SQL syntax
• Create objects via DDL (e.g. CREATE ...)
• Core DML commands supported: INSERT, UPDATE, DELETE
• Query data with SELECT
• Leverage the Java driver to execute queries via PreparedStatements and ResultSets
SELECT *
FROM USERS
WHERE STATE = 'TX';
43. Cassandra is Durable
• Data is organized into partitions
• Inserted data is written to a commit log for the node, as well as to a MemTable
• MemTables are flushed to disk in an SSTable, based on size
• SSTables are immutable
[Diagram: a client write goes to the commit log and to an in-memory MemTable; the MemTable is flushed to disk as an SSTable.]
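The write path above can be modeled in a few lines. This is a toy sketch (the class, the flush threshold, and the tombstone marker are invented for illustration), not the real storage engine:

```java
import java.util.*;

// Toy model of the Cassandra write path: writes land in a commit log and a
// memtable; a "full" memtable is flushed as an immutable SSTable; deletes
// write tombstones rather than removing data in place.
public class WritePathDemo {
    static final int MEMTABLE_LIMIT = 2;          // flush after this many entries
    static final String TOMBSTONE = "<tombstone>";

    final List<String> commitLog = new ArrayList<>();
    final Map<String, String> memtable = new LinkedHashMap<>();
    final List<Map<String, String>> sstables = new ArrayList<>(); // immutable once flushed

    void write(String key, String value) {
        commitLog.add(key + "=" + value);          // durability first: log the write
        memtable.put(key, value);                  // then update the in-memory table
        if (memtable.size() >= MEMTABLE_LIMIT) flush();
    }

    void delete(String key) { write(key, TOMBSTONE); }

    void flush() {
        // The flushed table can never be modified again.
        sstables.add(Collections.unmodifiableMap(new LinkedHashMap<>(memtable)));
        memtable.clear();
    }

    // Reads merge the memtable and SSTables; the newest value wins.
    String read(String key) {
        if (memtable.containsKey(key)) return checkTombstone(memtable.get(key));
        for (int i = sstables.size() - 1; i >= 0; i--) {
            if (sstables.get(i).containsKey(key)) return checkTombstone(sstables.get(i).get(key));
        }
        return null;
    }

    private static String checkTombstone(String v) { return TOMBSTONE.equals(v) ? null : v; }
}
```

Note how an update never touches an existing SSTable: it simply lands in a newer one, and reads resolve the conflict by preferring the most recent write.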
44. Overview of Replication in Cassandra
• Replication is controlled by the replication factor (RF). An RF of 1 means there is only one copy of each row in the cluster; an RF of 2 means there are two copies of each row stored in the cluster.
• Replication is controlled at the keyspace level in Cassandra.
[Diagram: the original row plus copies on additional nodes; the replication factor determines how many additional nodes get a copy of the partition, e.g. RF=3.]
45. Data Model
• The schema used in Cassandra is modeled after Google Bigtable: a row-oriented, columnar structure
• A keyspace is akin to a database in the RDBMS world
• A column family is similar to an RDBMS table but is more flexible/dynamic
• A row in a column family is indexed by its key
[Example: a Portfolio keyspace containing a Customer column family with columns ID, Name, SSN, DOB.]
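As a hedged sketch, the Portfolio/Customer model above might look like this in CQL (the keyspace, table, and column names follow the slide; the replication settings and column types are illustrative assumptions):

```sql
-- A keyspace is akin to a database; replication is set here, per keyspace.
CREATE KEYSPACE portfolio
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- A column family (table) holds rows indexed by their key.
CREATE TABLE portfolio.customer (
  id   uuid PRIMARY KEY,  -- the row key
  name text,
  ssn  text,
  dob  timestamp
);
```

Note that the keyspace-level replication setting is where the RF discussed on the previous slide is actually declared.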
46. Tunable Data Consistency
• Choose between strong and eventual consistency (one to all replicas responding) depending on the need
• Can be set on a per-operation basis, for both reads and writes
• Handles multi-data center operations
Write consistency levels: ANY, ONE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL
Read consistency levels: ONE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL
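The consistency levels above reduce to simple replica arithmetic. A small sketch (the class and method names are invented for illustration) of how QUORUM and strong consistency work out:

```java
public class ConsistencyMath {
    // QUORUM is a strict majority of the replicas: floor(RF / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // An operation at a given level succeeds only if that many replicas respond.
    static boolean canSucceed(int requiredReplicas, int liveReplicas) {
        return liveReplicas >= requiredReplicas;
    }

    // Reads are strongly consistent when the read and write replica sets must
    // overlap: R + W > RF guarantees some replica has the latest write.
    static boolean stronglyConsistent(int readReplicas, int writeReplicas, int rf) {
        return readReplicas + writeReplicas > rf;
    }

    public static void main(String[] args) {
        int rf = 3;
        System.out.println(quorum(rf));                   // 2
        System.out.println(canSucceed(quorum(rf), 2));    // QUORUM survives 1 node down
        System.out.println(canSucceed(rf, 2));            // ALL does not
        System.out.println(stronglyConsistent(2, 2, rf)); // QUORUM reads + QUORUM writes
    }
}
```

This is why QUORUM reads plus QUORUM writes at RF=3 give strong consistency while still tolerating a single node failure.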
Today, we are going to cover the technical basics of Cassandra and DataStax Enterprise and then discuss the typical evaluation process.
Count of current companies/groups: over 1000 using Cassandra, over 500 using DataStax
This presentation will focus on three practical topics for getting started with DataStax Enterprise: (1) why Cassandra, (2) why DSE, and (3) how clients typically run the evaluation process, with recommended resources along the way.
Always on:
Peer-to-peer architecture - all nodes are equal; each node is responsible for an assigned range (or ranges) of data.
Clients can write (or read) data to any node in the ring - native drivers can round-robin across a DC and distribute load to a coordinator node.
The coordinator node writes (or reads) copies of the data to/from the nodes that own each copy.
In the case of a failure (such as a drive going down), 2 out of the 3 replica nodes are still up, so the ability to write and read data still works - and therefore C* is always on.
Independent benchmarks prove out linear scalability - Netflix and the University of Toronto; at any node count, this is what we see for a read/write mix.
Source: Solving Big Data Challenges for Enterprise Application Performance Management, Tilman Rable, et al., August 2013, p. 10. Benchmark paper presented at the Very Large Database Conference, 2013. http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2013.pdf
Source: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Need to speed up your reads and writes? It is very simple to add nodes. The improvement is truly linear, as a result of the peer-to-peer architecture of sharing the data. Netflix runs around 3,000 nodes and brings up or down 500 nodes to manage anticipated spikes in load.
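The linear-scalability claim lends itself to a back-of-envelope projection. The numbers below are illustrative, not benchmark results:

```java
public class ScaleProjection {
    // Under linear scaling, throughput grows in proportion to node count.
    static double projectedOpsPerSec(double measuredOps, int measuredNodes, int targetNodes) {
        return measuredOps * targetNodes / measuredNodes;
    }

    public static void main(String[] args) {
        // If a 4-node PoC sustains 40,000 ops/s, a 12-node cluster should
        // sustain roughly three times that under linear scaling.
        System.out.println(projectedOpsPerSec(40_000, 4, 12)); // 120000.0
    }
}
```

This is the same projection logic used later in the evaluation notes: measure a small cluster, then extrapolate (and verify) at larger node counts.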
Multi-DC is very, very easy to configure with Cassandra.
Datacenters are active-active: write to either DC and the other one will get a copy.
In the case of a datacenter outage, applications can carry on via a retry policy that fails over to the other datacenter, which also has a copy of the data.
Outbrain story - Hurricane Sandy.
It is the choice for today's modern online applications - architects know that these types of applications must always stay on and therefore need to easily scale to handle load.
We’ve covered the benefits of using Cassandra: (1) high availability, (2) linear scalability, and (3) ease of multi-DC configuration
Now, we’ll cover the value of DSE – what does DataStax Enterprise bring to the table?
DataStax is the company that delivers Cassandra to the enterprise. First, we take the open source software and put it through rigorous quality assurance tests, including a 1,000-node scalability test. We certify it and provide the world's most comprehensive support, training, and consulting for Cassandra so that you can get up and running quickly. But that isn't all DataStax does. We also build additional software features on top of Cassandra - security, search, analytics, and an in-memory option - that don't come with the open source Cassandra product. We also provide management services to help you visualize your nodes, plan your capacity, and repair issues automatically. Finally, we provide developer tools and drivers as well as monitoring tools. DataStax is the commercial company behind Apache Cassandra, plus a whole host of additional software and services.
Side-by-side comparison of what C* open source offers compared to DSE; note the tested and certified version of the binaries for production, plus the product features and support.
Visual, browser-based user interface negates need to install client software
Administration tasks carried out in point-and-click fashion
Allows for visual rebalance of data across a cluster when new nodes are added
Contains proactive alerts that warn of impending issues.
Built-in external notification abilities
Visually perform and schedule backup operations
CQL is served up using DevCenter, which works with the community edition too; worth mentioning given the ease of working with CQL and its similarities to SQL.
Internal Authentication Manages login IDs and passwords inside the database
Ensures only authorized users can access a database system using internal validation
Simple to implement and easy to understand
No learning curve from the relational world
Object Permission Management
controls who has access to what and who can do what in the database
Provides granular based control over who can add/change/delete/read data
Uses familiar GRANT/REVOKE from relational systems
No learning curve
Client-to-Node Encryption protects data in flight to and from a database cluster
Ensures data cannot be captured/stolen en route to a server
Data is safe both in flight from/to a database and on the database; complete coverage is ensured
External Authentication uses external security software packages to control security
Only authorized users have access to a database system using external validation
Uses most trusted external security packages (Kerberos, LDAP), mainstays in government and finance
Single sign on to all data domains
Transparent Data Encryption encrypts data at rest
Protects sensitive data at rest from theft and from being read at the file system level
No changes needed at application level
Can encrypt both Cassandra and Hadoop data
Data Auditing provides a trail of who did and looked at what, and when
Supplies admins with an audit trail of all accesses and changes
Granular control to audit only what’s needed
Uses log4j interface to ensure performance and efficient audit operations
Built-in enterprise search on Cassandra data via Solr integration
Very fast performance
Search indexes can span multiple data centers (regular Solr cannot)
Online scalability via adding new nodes
Built-in failover; continuously available
Same concepts apply for Hadoop in analytics nodes as compared with SOLR nodes: a great way to run reporting on your data in your database without having to worry about porting over to a separate Hadoop environment – not a substitute for Hadoop, but perfect for a great deal of use cases
Here is a diagram of the typical process clients run through when trying out DataStax. Often, a developer or DBA downloads and installs the sandbox on a local laptop in a Linux environment (such as a VM), or on a dev box, just to try it out. Along the way, use cases are evaluated for fit and data models are designed. At a certain point, there will be a desire to test how Cassandra and DSE as a whole work within a multi-node cluster. Sample data is loaded using a given data model and benchmarks are performed: how hard can you hit the typical 4 nodes with 3 copies of the data before the read/write load saturates the boxes? The cassandra-stress tool or the drivers are used to create the read/write mix. Based on the behavior of 4 nodes, for example, load can be linearly projected (or tested, for that matter) for more nodes.
Pertinent links are provided below:
Sandbox download – http://www.datastax.com/download#dl-sandbox
Binaries download – http://www.datastax.com/download#dl-enterprise
Typical use cases on Planet Cassandra – http://planetcassandra.org/functional-use-cases/ (by function) and http://planetcassandra.org/industry-use-cases/ (by industry)
SOLR Tutorial and Overview - http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/srch/srchTOC.html
Hadoop Overview - http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/ana/anaTOC.html
Data Modeling – http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddlCQLDataModelingTOC.html
http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_intro_c.html http://www.datastax.com/documentation/cql/3.1/cql/cql_using/about_cql_c.html
Copy command – http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/copy_r.html
Java driver - http://www.datastax.com/documentation/developer/java-driver/2.0/common/drivers/introduction/introArchOverview_c.html
Cassandra Stress Tool - http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsCStress_t.html
Here are some of the recommended settings for your PoC environment. Again, we highly recommend starting with at least 3 copies of the data across 4 nodes. SSDs are by far the preferred drives: you will save on the number of servers needed and the electricity paid, and their response time is on the order of 100 times faster for reads and writes. With the latest 3.x versions of Linux, buffered caching is optimized, which helps with performance, since the buffered cache is another way of caching data - the more RAM the better, especially for caching. RAM should be at least 16 GB per box. We have no preference as to which cloud environment is used. There are Amazon AMIs already set up to get folks jump-started on DSE; they can be found by searching for DataStax in the EC2 marketplace. VM images on hosted boxes work fine, but you will lose around 10% efficiency due to resource sharing; if going with VMs, be certain to use directly, physically mounted drives per image. SAN is highly discouraged.
Hardware Recommendations - http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningHardware_c.html
Standard Install Instructions - http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/install/installTOC.html
EC2 Install with template DSE AMI’s - http://vimeo.com/89539972
EC2 Planning Out a Cluster - http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningHardware_c.html
Reference Architecture - http://www.datastax.com/wp-content/uploads/2014/01/WP-DataStax-Enterprise-Reference-Architecture.pdf
See Appendix for EC2 Install with Linux AMI’s (Slide #27)
There are lots of free resources available for both education and evaluation. Most of the items listed on the left of this slide are reachable through the datastax.com website. In this discussion, we are focusing more on those free items; however, there are places where paid-for items make a lot of sense. For example, public training events are listed on datastax.com and can be registered for there, and some clients opt for in-person specialized training for a day or two with an architect. Your account rep can walk you through the options in terms of each of the three engagement models we provide, tailored to meet your needs.
There are also helpful starter packages which you can discuss with the account managers.
With respect to assistance, DataStax provides three categories of people support. For example, learning how Solr and Hadoop nodes work is covered in training. Specific questions, best practices, or performance tuning would be more along the lines of consulting. Support addresses bugs for clients.
Here are some links that we have found ourselves providing to lots of clients along the way and felt were worth sharing, starting with Patrick McFadin's four recorded videos on data modeling.
Patrick McFadin’s Data Modeling Series - http://wiki.apache.org/cassandra/DataModel
Advance Time Series Best Practices - http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling/
CQL/Data Modeling on DataStax - http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_intro_c.html http://www.datastax.com/documentation/cql/3.1/cql/cql_using/about_cql_c.html
Virtual Training - http://www.datastax.com/what-we-offer/products-services/training/virtual-training#tab
Public Training Signup - http://www.datastax.com/what-we-offer/products-services/training
Sample Projects (Java driver code, etc) - https://github.com/DataStaxCodeSamples/
SOLR Documentation and Tutorial on DataStax - http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/srch/srchTOC.html
Analytics documentation - http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/ana/anaTOC.html
Github code samples - https://github.com/DataStaxCodeSamples?query=+only%3Apublic+
There are lots of readily available resources, as you can see, so hopefully this will make your evaluation process as efficient as possible.
Learning Objective: Describe how to read data
This slide demonstrates how to check for a "row not found" condition - it is a best practice to check.
It also demonstrates the use of the one() method, where just one row (or possibly none) is expected.
Learning Objective: Describe what prepared statements are and when to use them
This is an example of using prepared statements.
Prepared statements can be used for inserts or queries typically in a loop (not shown).
Focus on the exceptions here; you don't need to catch all of these, but the strings point out the type of error.
Conserving white space where possible here.
PreparedStatement statement = session.prepare(
    "INSERT INTO user (username, password) " +
    "VALUES (?, ?);");
BoundStatement boundStatement = new BoundStatement(statement);
try {
    session.execute(boundStatement.bind("user4", "user4password"));
} catch (NoHostAvailableException ex) {
    System.out.println("Host not available");
} catch (QueryExecutionException ex) {
    System.out.println("Query could not be executed, e.g. requested consistency level not met");
} catch (QueryValidationException ex) {
    System.out.println("Invalid query: syntax error or not authorized");
}