Intro to Cassandra

Tyler Hobbs

History

Dynamo (clustering) + BigTable (data model) = Cassandra

Clustering

Every node plays the same role
– No masters, slaves, or special nodes
– No single point of failure

Consistent Hashing

[Figure: a ring with nodes at tokens 0, 10, 20, 30, 40, and 50]

Key: "www.google.com"

md5("www.google.com") maps the key to position 14 on the ring, between the nodes at 10 and 20.

Replication Factor = 3
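
The ring lookup above can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the 60-position ring, the names `toy_token` and `replicas_for_token`, and the rule "first node at or after the token, then the next nodes clockwise" are assumptions made for illustration. Position 14 is the value shown on the slide; a real MD5 digest reduced onto a toy ring will generally land elsewhere.

```python
import bisect
import hashlib

RING_SIZE = 60                          # toy token space 0..59 (assumption)
NODE_TOKENS = [0, 10, 20, 30, 40, 50]   # one token per node, kept sorted


def toy_token(key):
    """Map a key onto the toy ring by reducing its MD5 digest modulo RING_SIZE."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % RING_SIZE


def replicas_for_token(token, rf, nodes=NODE_TOKENS):
    """Return rf node tokens: the first node at or after `token`,
    followed by the next nodes clockwise (wrapping past the end of the list)."""
    start = bisect.bisect_left(nodes, token) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]


# The slide's example: the key lands at position 14, with Replication Factor = 3.
print(replicas_for_token(14, rf=3))  # [20, 30, 40]
```

Under these assumptions, a key at position 14 is stored on the nodes at 20, 30, and 40. A token past the last node wraps around to the node at 0.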

Clustering

Client can talk to any node

Scaling

RF = 2

[Figure: a ring with nodes at 0, 10, 20, 30, and 50. The node at 50 owns the red portion of the ring.]

[Figure: the same ring after adding a new node at 40.]
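
A rough way to see why adding a node is cheap: only one primary range is split. The sketch below assumes each node owns the range from the previous node's token (exclusive) up to its own token (inclusive). The helper `primary_ranges` is hypothetical, and replica placement (RF = 2) is left out for brevity.

```python
def primary_ranges(tokens):
    """Map each node token to the (start, end] range it owns.
    `start` is the previous node's token; the first node wraps around to the last."""
    ordered = sorted(tokens)
    return {tok: (ordered[i - 1], tok) for i, tok in enumerate(ordered)}


before = primary_ranges([0, 10, 20, 30, 50])
after = primary_ranges([0, 10, 20, 30, 40, 50])

print(before[50])  # (30, 50)
print(after[40])   # (30, 40)
print(after[50])   # (40, 50)
```

In this model, the new node at 40 takes over part of the range previously owned by the node at 50, and the ranges of the other nodes are unchanged.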

Node Failures

RF = 2

[Figure: a ring with nodes at 0, 10, 20, 30, 40, and 50, with the replicas labeled.]

Consistency, Availability

Consistency
– Can I read stale data?

Availability
– Can I write/read at all?

Tunable Consistency

Consistency

N = Total number of replicas

R = Number of replicas read from
– (before the response is returned)

W = Number of replicas written to
– (before the write is considered a success)

W + R > N gives strong consistency

Consistency

N=3
W=2
R=2

2 + 2 > 3 ==> strongly consistent

Only 2 of the 3 replicas must be available.
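
The W + R > N rule is simple enough to check mechanically. The following Python sketch uses hypothetical helper names and only restates the arithmetic from the slides:

```python
def is_strongly_consistent(n, r, w):
    """True when R + W > N, so every read set overlaps every write set."""
    return r + w > n


def replicas_needed(r, w):
    """Number of replicas that must be up for both reads and writes to succeed."""
    return max(r, w)


print(is_strongly_consistent(n=3, r=2, w=2))  # True
print(is_strongly_consistent(n=3, r=1, w=1))  # False
print(replicas_needed(r=2, w=2))              # 2
```

With N = 3 and R = W = 1, a read may hit a replica that the write has not reached yet, which is why that setting is not strongly consistent.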

Consistency

Tunable Consistency
– Specify N (Replication Factor) per data set
– Specify R, W per operation
– Quorum: N/2 + 1
• R = W = Quorum
• Strong consistency
• Tolerate the loss of N – Quorum replicas
– R, W can also be 1 or N

Availability

Can tolerate the loss of:
– N – R replicas for reads
– N – W replicas for writes
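
As a worked example of the quorum and availability arithmetic, here is a small Python sketch. The function names are illustrative, and N/2 is taken to mean integer division:

```python
def quorum(n):
    """Quorum size for n replicas: floor(n / 2) + 1."""
    return n // 2 + 1


def tolerated_losses(n, level):
    """Replicas that may be down while an operation needing `level` replicas still succeeds."""
    return n - level


for n in (3, 5):
    q = quorum(n)
    # Columns: N, quorum, tolerated losses at quorum, whether R = W = quorum is strongly consistent.
    print(n, q, tolerated_losses(n, q), q + q > n)
# 3 2 1 True
# 5 3 2 True
```

Reading and writing at quorum always satisfies W + R > N, at the cost of tolerating fewer failures than R = 1 or W = 1 would.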

CAP Theorem
During node or network failure:

[Figure: plot of Availability against Consistency, each axis running to 100%. The corner where both reach 100% is marked "Not Possible"; the remaining region is marked "Possible". Cassandra is labeled on the plot.]

Clustering

No single point of failure

Replication that works

Scales linearly
– 2x nodes = 2x performance
• For both writes and reads
– Up to hundreds of nodes

Operationally simple

Multi-Datacenter Replication

Data Model

Comes from Google BigTable

Goals
– Minimize disk seeks
– High throughput
– Low latency
– Durable

Data Model

Keyspace
– A collection of Column Families
– Controls replication settings

Column Family
– Kinda resembles a table

Column Families

Static
– Object data
– Similar to a table in a relational database

Dynamic
– Pre-calculated query results
– Materialized views

Static Column Families
Users
zznate    password: *    name: Nate
driftx    password: *    name: Brandon
thobbs    password: *    name: Tyler
jbellis   password: *    name: Jonathan    site: riptano.com

Dynamic Column Families

Rows
– Each row has a unique primary key
– Sorted list of (name, value) tuples
• Like a sorted map or dictionary
– The (name, value) tuple is called a “column”

Following
zznate    driftx:    thobbs:
driftx
thobbs    zznate:
jbellis   driftx:    mdennis:    pcmanus:    thobbs:    xedin:    zznate:


Column Timestamps
– Each column (tuple) has a timestamp
– In the case of a collision, the latest timestamp wins
– Client specifies timestamp with write
– Writes are idempotent
• Infinite retries allowed


Other Examples:
– Timeline of tweets by a user
– Timeline of tweets by all of the people a user is following
– List of comments sorted by score
– List of friends grouped by state

The Data API

Two choices
– RPC-based API
– CQL
• Cassandra Query Language

Inserting Data
INSERT INTO users (KEY, 'name', 'age')
VALUES ('thobbs', 'Tyler', 24);

Updating Data
Updates are the same as inserts:
INSERT INTO users (KEY, 'age')
VALUES ('thobbs', 34);

Or
UPDATE users SET 'age' = 34
WHERE KEY = 'thobbs';

Fetching Data
Whole row select:
SELECT * FROM users WHERE KEY = 'thobbs';

Fetching Data
Explicit column select:
SELECT 'name', 'age' FROM users;

Fetching Data
Get a slice of columns
UPDATE letters SET 1='a', 2='b', 3='c', 4='d', 5='e'
WHERE KEY = 'key';

SELECT 1..3 FROM letters WHERE KEY = 'key';

Returns [(1, a), (2, b), (3, c)]

Fetching Data
SELECT FIRST 2 FROM letters WHERE KEY = 'key';

Returns [(1, a), (2, b)]

SELECT FIRST 2 REVERSED FROM letters WHERE KEY = 'key';

Returns [(5, e), (4, d)]

Fetching Data
SELECT 3..'' FROM letters WHERE KEY = 'key';

Returns [(3, c), (4, d), (5, e)]

SELECT FIRST 2 REVERSED 4..'' FROM letters WHERE KEY = 'key';

Returns [(4, d), (3, c)]
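
The slice queries above can be modeled in Python by treating a row as a map sorted by column name. This is only a sketch of the semantics shown on the slides: `get_slice` and its parameters are hypothetical names, and an empty bound ('') is represented as `None`.

```python
def get_slice(columns, start=None, finish=None, first=None, reverse=False):
    """Return (name, value) pairs in column-name order, bounds inclusive.
    With reverse=True the scan runs from high names to low, so `start`
    is the upper bound and `finish` the lower bound."""
    names = sorted(columns, reverse=reverse)

    def in_range(name):
        if reverse:
            return (start is None or name <= start) and (finish is None or name >= finish)
        return (start is None or name >= start) and (finish is None or name <= finish)

    selected = [(n, columns[n]) for n in names if in_range(n)]
    return selected if first is None else selected[:first]


letters = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}

print(get_slice(letters, start=1, finish=3))                # [(1, 'a'), (2, 'b'), (3, 'c')]
print(get_slice(letters, first=2))                          # [(1, 'a'), (2, 'b')]
print(get_slice(letters, first=2, reverse=True))            # [(5, 'e'), (4, 'd')]
print(get_slice(letters, start=3))                          # [(3, 'c'), (4, 'd'), (5, 'e')]
print(get_slice(letters, start=4, first=2, reverse=True))   # [(4, 'd'), (3, 'c')]
```

Each call mirrors one of the SELECT statements on the letters row and produces the same result the slides list.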

Deleting Data
Delete a whole row:
DELETE FROM users WHERE KEY = 'thobbs';

Delete specific columns:
DELETE 'age' FROM users
WHERE KEY = 'thobbs';

Secondary Indexes
Builtin basic indexes
CREATE INDEX ageIndex ON users (age);

SELECT name FROM users
WHERE age = 24 AND state = 'TX';

Performance

Writes
– 10k – 30k per second per node
– Sub-millisecond latency

Reads
– 1k – 10k per second per node
– Depends on data set, caching
– Usually 0.1 to 10ms latency

Other Features

Distributed Counters
– Can support millions of high-volume counters

Excellent Multi-datacenter Support
– Disaster recovery
– Locality

Hadoop Integration
– Isolation of resources
– Hive and Pig drivers

Compression

What Cassandra Can't Do

Transactions
– Unless you use a distributed lock
– Atomicity, Isolation
– These aren't needed as often as you'd think

Limited support for ad-hoc queries
– Know what you want to do with the data

Not One-size-fits-all

Use alongside an RDBMS
– Use the RDBMS for highly-transactional or highly-relational data
• Usually a small set of data
– Let Cassandra scale to handle the rest

Language Support

Good:
– Java
– Python
– Ruby
– PHP
– C#

Coming Soon:
– Everything else, now that we have CQL

Questions?

Tyler Hobbs
@tylhobbs
tyler@datastax.com
