An introduction to Apache Cassandra, covering the clustering model and the data model.
Presented by Tyler Hobbs at the October 2011 Austin NoSQL meetup.
18. Consistency, Availability
Consistency
– Can I read stale data?
Availability
– Can I write/read at all?
Tunable Consistency
19. Consistency
N = Total number of replicas
R = Number of replicas read from
– (before the response is returned)
W = Number of replicas written to
– (before the write is considered a success)
W + R > N gives strong consistency
21. Consistency
W + R > N gives strong consistency
N=3
W=2
R=2
2 + 2 > 3 ==> strongly consistent
Only 2 of the 3 replicas must be available.
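The W + R > N rule can be checked by brute force. The sketch below is illustrative only: the function names are made up for this example and nothing here talks to Cassandra. It enumerates every possible read set and write set and confirms that they always share at least one replica exactly when W + R > N.

```python
from itertools import combinations


def is_strongly_consistent(n, r, w):
    """The rule from the slide: reads see the latest write when W + R > N."""
    return r + w > n


def always_overlaps(n, r, w):
    """Brute force: does every R-replica read set share at least one
    replica with every W-replica write set?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


print(is_strongly_consistent(3, 2, 2))  # True
print(always_overlaps(3, 2, 2))         # True
print(always_overlaps(3, 1, 1))         # False
```

With N=3 and R=W=1, a read can land on a replica the write never reached, which is why that combination can return stale data.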
23. Consistency
Tunable Consistency
– Specify N (Replication Factor) per data set
– Specify R, W per operation
– Quorum: floor(N/2) + 1
• R = W = Quorum
• Strong consistency
• Tolerate the loss of N – Quorum replicas
– R, W can also be 1 or N
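As a quick worked example of the quorum arithmetic, assuming the division is integer division (the helper name `quorum` is just for illustration):

```python
def quorum(n):
    """Smallest majority of n replicas, using integer division."""
    return n // 2 + 1


for n in (1, 2, 3, 5, 6):
    q = quorum(n)
    # With R = W = quorum, R + W > N always holds.
    print(f"N={n}: quorum={q}, R+W={2 * q}, tolerates loss of {n - q}")
```

Note that an even N tolerates no more failures than N - 1 does: N=6 needs 4 replicas for a quorum and tolerates the loss of 2, the same as N=5.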
25. Availability
Can tolerate the loss of:
– N – R replicas for reads
– N – W replicas for writes
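The same arithmetic as a small sketch. The function names are hypothetical, and the labels ONE, QUORUM, and ALL are used here only as names for R or W values of 1, a quorum, and N.

```python
def tolerated_losses(n, required):
    """How many replicas can be down while `required` can still respond."""
    return n - required


def can_serve(n, required, down):
    """True if an operation needing `required` replica responses can
    still succeed with `down` of the n replicas unavailable."""
    return n - down >= required


N = 3
for label, required in (("ONE", 1), ("QUORUM", N // 2 + 1), ("ALL", N)):
    print(label, "tolerates", tolerated_losses(N, required), "lost replicas")
```

Because R and W are chosen per operation, a single application can mix them, for example writing at a quorum while serving some reads from one replica.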
26. CAP Theorem
During node or network failure:
[Figure: Availability plotted against Consistency, each axis running to 100%, with regions labeled "Possible" and "Not Possible" and Cassandra marked on the plot.]
28. Clustering
No single point of failure
Replication that works
Scales linearly
– 2x nodes = 2x performance
• For both writes and reads
– Up to hundreds of nodes
Operationally simple
Multi-Datacenter Replication
29. Data Model
Comes from Google BigTable
Goals
– Minimize disk seeks
– High throughput
– Low latency
– Durable
30. Data Model
Keyspace
– A collection of Column Families
– Controls replication settings
Column Family
– Kinda resembles a table
31. Column Families
Static
– Object data
– Similar to a table in a relational database
Dynamic
– Pre-calculated query results
– Materialized views
33. Dynamic Column Families
Rows
– Each row has a unique primary key
– Sorted list of (name, value) tuples
• Like a sorted map or dictionary
– The (name, value) tuple is called a “column”
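A toy model of such a row, as a rough sketch only: the class and method names are invented for illustration and say nothing about how Cassandra stores rows internally. Columns stay sorted by name no matter what order they are inserted in.

```python
import bisect


class Row:
    """A toy row: (name, value) columns kept sorted by column name."""

    def __init__(self):
        self._names = []   # column names, always in sorted order
        self._values = {}  # column name -> value

    def insert(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)  # keep names sorted
        self._values[name] = value

    def columns(self):
        return [(name, self._values[name]) for name in self._names]


row = Row()
for name, value in [(3, "c"), (1, "a"), (2, "b")]:
    row.insert(name, value)
print(row.columns())  # [(1, 'a'), (2, 'b'), (3, 'c')]
```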
35. Dynamic Column Families
Column Timestamps
– Each column (tuple) has a timestamp
– In the case of a collision, the latest timestamp wins
– Client specifies timestamp with write
– Writes are idempotent
• Infinite retries allowed
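A minimal sketch of why client-supplied timestamps make retries safe. The function name and the dict-based row are assumptions for illustration, and the tie-breaking rule (keep the stored value on an equal timestamp) is a simplification chosen for this sketch.

```python
def apply_write(row, name, value, timestamp):
    """Apply a column write to `row`, a dict of name -> (value, timestamp).

    The write takes effect only if its timestamp is newer than the one
    already stored. Assumption: on an exact tie the stored value is kept.
    """
    current = row.get(name)
    if current is None or timestamp > current[1]:
        row[name] = (value, timestamp)
    return row


row = {}
apply_write(row, "age", 24, timestamp=1)
apply_write(row, "age", 34, timestamp=2)
apply_write(row, "age", 24, timestamp=1)  # late retry of the first write
print(row)  # {'age': (34, 2)}
```

Replaying a write with its original timestamp never changes the outcome, so a client that is unsure whether a write succeeded can simply send it again.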
36. Dynamic Column Families
Other Examples:
– Timeline of tweets by a user
– Timeline of tweets by all of the people a user is following
– List of comments sorted by score
– List of friends grouped by state
37. The Data API
Two choices
– RPC-based API
– CQL
• Cassandra Query Language
38. Inserting Data
INSERT INTO users (KEY, 'name', 'age')
VALUES ('thobbs', 'Tyler', 24);
39. Updating Data
Updates are the same as inserts:
INSERT INTO users (KEY, 'age')
VALUES ('thobbs', 34);
Or
UPDATE users SET 'age' = 34
WHERE KEY = 'thobbs';
41. Fetching Data
Explicit column select:
SELECT 'name', 'age' FROM users
WHERE KEY = 'thobbs';
42. Fetching Data
Get a slice of columns
UPDATE letters SET 1='a', 2='b', 3='c', 4='d', 5='e'
WHERE KEY = 'key';
SELECT 1..3 FROM letters WHERE KEY = 'key';
Returns [(1, a), (2, b), (3, c)]
43. Fetching Data
Get a slice of columns
SELECT FIRST 2 FROM letters WHERE KEY = 'key';
Returns [(1, a), (2, b)]
SELECT FIRST 2 REVERSED FROM letters
WHERE KEY = 'key';
Returns [(5, e), (4, d)]
44. Fetching Data
Get a slice of columns
SELECT 3..'' FROM letters WHERE KEY = 'key';
Returns [(3, c), (4, d), (5, e)]
SELECT FIRST 2 REVERSED 4..'' FROM letters
WHERE KEY = 'key';
Returns [(4, d), (3, c)]
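The slice queries on the last three slides can be modeled on a plain sorted list. This is only a sketch of the semantics the slides show: `get_slice` and its parameters are hypothetical names, and an empty bound ('' in the queries) is represented as None.

```python
def get_slice(columns, start=None, finish=None, count=None, reverse=False):
    """Return a slice of `columns`, a list of (name, value) sorted by name.

    start/finish are inclusive bounds in scan order; None means unbounded.
    With reverse=True the scan runs from high names to low names, so
    `start` is the highest name returned.
    """
    ordered = list(reversed(columns)) if reverse else list(columns)
    result = []
    for name, value in ordered:
        if start is not None:
            before_start = name > start if reverse else name < start
            if before_start:
                continue  # have not reached the start of the slice yet
        if finish is not None:
            past_finish = name < finish if reverse else name > finish
            if past_finish:
                break  # scanned past the end of the slice
        result.append((name, value))
    return result[:count]  # count=None keeps everything


letters = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
print(get_slice(letters, start=1, finish=3))               # 1..3
print(get_slice(letters, count=2))                         # FIRST 2
print(get_slice(letters, count=2, reverse=True))           # FIRST 2 REVERSED
print(get_slice(letters, start=3))                         # 3..''
print(get_slice(letters, start=4, count=2, reverse=True))  # FIRST 2 REVERSED 4..''
```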
45. Deleting Data
Delete a whole row:
DELETE FROM users WHERE KEY = 'thobbs';
Delete specific columns:
DELETE 'age' FROM users
WHERE KEY = 'thobbs';
46. Secondary Indexes
Built-in basic indexes
CREATE INDEX ageIndex ON users (age);
SELECT name FROM users
WHERE age = 24 AND state = 'TX';
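One way to picture a query like this, as a rough sketch rather than a description of Cassandra's implementation: look up candidate rows through the index on age, then filter those candidates on state. The rows other than thobbs, and all helper names, are invented sample data for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: row key -> columns.
users = {
    "thobbs": {"name": "Tyler", "age": 24, "state": "TX"},
    "jdoe": {"name": "Jane", "age": 24, "state": "CA"},
    "asmith": {"name": "Al", "age": 31, "state": "TX"},
}


def build_index(rows, column):
    """Map each value of `column` to the set of row keys holding it."""
    index = defaultdict(set)
    for key, columns in rows.items():
        if column in columns:
            index[columns[column]].add(key)
    return index


age_index = build_index(users, "age")


def names_where(age, state):
    """Use the age index to find candidates, then filter on state."""
    candidates = age_index.get(age, set())
    return sorted(
        users[key]["name"]
        for key in candidates
        if users[key].get("state") == state
    )


print(names_where(24, "TX"))  # ['Tyler']
```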
47. Performance
Writes
– 10k – 30k per second per node
– Sub-millisecond latency
Reads
– 1k – 10k per second per node
– Depends on data set, caching
– Usually 0.1 to 10ms latency
48. Other Features
Distributed Counters
– Can support millions of high-volume counters
Excellent Multi-datacenter Support
– Disaster recovery
– Locality
Hadoop Integration
– Isolation of resources
– Hive and Pig drivers
Compression
49. What Cassandra Can't Do
Transactions
– Unless you use a distributed lock
– Atomicity, Isolation
– These aren't needed as often as you'd think
Limited support for ad-hoc queries
– Know what you want to do with the data
50. Not One-size-fits-all
Use alongside an RDBMS
– Use the RDBMS for highly transactional or highly relational data
• Usually a small set of data
– Let Cassandra scale to handle the rest
51. Language Support
Good:
– Java
– Python
– Ruby
– PHP
– C#
Coming Soon:
– Everything else, now that we have CQL