CFS: Cassandra backed storage for Hadoop
- 8. ©2012 DataStax
The Solution
• InputFormat/OutputFormat
• Unfortunately, still need a DFS
• Run tasktrackers/datanodes locally
• Data Locality FTW!
• Run namenode/jobtracker somewhere
• Since Cassandra 0.6 (the dark ages)
8
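The locality idea above can be sketched as a tiny scheduling helper: with a tasktracker running on every Cassandra node, a map task is preferably assigned to a node that already holds a replica of the block it reads. This is purely illustrative (the names and addresses are made up, not CFS API):

```python
# Hypothetical sketch of data-locality-aware task placement. A tasktracker
# co-located with a replica can read the block without network transfer.

def pick_tasktracker(block_replicas, tasktrackers):
    """Return a tasktracker co-located with a replica, if any."""
    for node in tasktrackers:
        if node in block_replicas:
            return node          # local read: no network transfer needed
    return tasktrackers[0]       # fall back to any node (remote read)

replicas = {"10.0.0.2", "10.0.0.5"}
trackers = ["10.0.0.1", "10.0.0.5", "10.0.0.9"]
print(pick_tasktracker(replicas, trackers))  # → 10.0.0.5
```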
- 14. ©2012 DataStax
Static - Users Column Family
Row Key       Columns
nickmbailey   password: *   name: Nick
zznate        password: *   name: Nate   phone: 512-7777
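The table above can be modeled as a simple mapping: each row key holds a sparse set of named columns, and rows need not share the same columns (this dict layout is illustrative, not how Cassandra stores data):

```python
# Illustrative model of the static "Users" column family.
users = {
    "nickmbailey": {"password": "*", "name": "Nick"},
    "zznate":      {"password": "*", "name": "Nate", "phone": "512-7777"},
}

# Sparse columns: only zznate has a phone number.
print(users["zznate"]["phone"])         # → 512-7777
print("phone" in users["nickmbailey"])  # → False
```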
- 22. ©2012 DataStax
CF: inode
• Row Key = UUID
• Allows for file renames
• Secondary indexes for file browsing
• Columns:
Column        Value
filename      /home/nick/data.txt
parent_path   /home/nick/
attributes    nick:nick:777
TimeUUID1     <block metadata>
TimeUUID2     <block metadata>
TimeUUID3     <block metadata>
...
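The inode layout above can be sketched as follows. Because the row key is a random UUID, a rename only rewrites the filename/parent_path columns, never the row key; time-based UUIDs keep the block-metadata columns in write order. The helper name and dict layout are illustrative, not CFS code:

```python
import uuid

# Sketch of building an inode row: UUID row key plus metadata columns
# and one TimeUUID column per block.

def make_inode(filename, parent_path, attributes, num_blocks):
    row_key = str(uuid.uuid4())              # random UUID: renames are cheap
    columns = {
        "filename": filename,
        "parent_path": parent_path,
        "attributes": attributes,
    }
    for _ in range(num_blocks):
        columns[str(uuid.uuid1())] = "<block metadata>"  # time-ordered keys
    return row_key, columns

key, cols = make_inode("/home/nick/data.txt", "/home/nick/", "nick:nick:777", 3)
print(cols["filename"])  # → /home/nick/data.txt
```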
- 24. ©2012 DataStax
CF: sblocks
• Essentially, a datanode replacement
• Stores the actual contents of files
• Each row is an HDFS block
• Row Key = Block ID
Column      Value
TimeUUID1   <compressed file data>
TimeUUID2   <compressed file data>
TimeUUID3   <compressed file data>
...
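An sblocks row can be sketched the same way: the row key is the block ID and each column holds one compressed subblock, keyed by a time-based UUID so the columns sort in write order. This is a stand-in for illustration (real CFS chooses its own compression and encoding):

```python
import uuid
import zlib

# Sketch of an sblocks row: block ID → {TimeUUID: compressed subblock}.

def make_sblock_row(block_id, subblocks):
    return block_id, {str(uuid.uuid1()): zlib.compress(data)
                      for data in subblocks}

block_id, columns = make_sblock_row(
    "block-42", [b"hello " * 100, b"world " * 100])
first = next(iter(columns.values()))          # first subblock written
print(zlib.decompress(first)[:6])  # → b'hello '
```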
- 26. ©2012 DataStax
Writes
• Write file metadata
• Split into blocks
• Still controlled by ‘dfs.block.size’
• also ‘cfs.local.subblock.size’
• Read in a block
• split into subblocks
• Update inode, sblocks
• rinse, repeat
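The write path above can be sketched end to end: split the file into blocks of `dfs.block.size`, split each block into subblocks of `cfs.local.subblock.size`, and record the layout in inode/sblocks-style structures. The sizes and in-memory dicts are stand-ins; real CFS writes these rows to Cassandra:

```python
# Sketch of the CFS write path with toy sizes.
BLOCK_SIZE = 64      # stand-in for dfs.block.size
SUBBLOCK_SIZE = 16   # stand-in for cfs.local.subblock.size

def write_file(data):
    inode, sblocks = [], {}
    for b in range(0, len(data), BLOCK_SIZE):
        block = data[b:b + BLOCK_SIZE]              # read in a block
        block_id = f"block-{b // BLOCK_SIZE}"
        inode.append(block_id)                      # update inode
        sblocks[block_id] = [block[s:s + SUBBLOCK_SIZE]  # split into subblocks
                             for s in range(0, len(block), SUBBLOCK_SIZE)]
    return inode, sblocks                           # rinse, repeat per block

inode, sblocks = write_file(b"x" * 150)
print(len(inode), len(sblocks["block-0"]))  # → 3 4
```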
- 28. ©2012 DataStax
Reads
• Check for file in inode
• Determine appropriate blocks
• Request blocks via Thrift
• If data is local...
• ...get its location on the local filesystem
• If data is remote...
• ...get the actual file content via Thrift
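The local-vs-remote branch in the read path can be sketched like this. The node address, stores, and fetch callback are illustrative stand-ins; in real CFS the remote fetch goes over Thrift:

```python
# Sketch of reading one block: local replicas are read straight from the
# local filesystem, remote ones are fetched over the wire.
LOCAL_NODE = "10.0.0.5"   # hypothetical address of this node

def read_block(block_id, replicas, local_store, fetch_remote):
    if LOCAL_NODE in replicas:
        return local_store[block_id]   # local: read from local filesystem
    return fetch_remote(block_id)      # remote: pull content over the wire

local_store = {"block-0": b"local bytes"}
data = read_block("block-0", {"10.0.0.5"}, local_store,
                  lambda bid: b"remote bytes")
print(data)  # → b'local bytes'
```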
- 29. ©2012 DataStax
What Else?
• Current Implementation: 1.0.4
• <property>
<name>fs.cfs.impl</name>
<value>com.datastax.bdp.hadoop.cfs.CassandraFileSystem</value>
</property>
• Supports HDFS append()
• Immutability makes things easy
• See the first incarnation
• https://github.com/riptano/brisk