SCALABLE DATA STORAGE
  GETTING YOU DOWN?
    TO THE CLOUD!
                Web 2.0 Expo SF 2011

   Mike Malone, Mike Panchenko, Derek Smith, Paul Lathrop
THE CAST
   MIKE MALONE
   INFRASTRUCTURE ENGINEER
   @MJMALONE


   MIKE PANCHENKO
   INFRASTRUCTURE ENGINEER
   @MIHASYA


   DEREK SMITH
   INFRASTRUCTURE ENGINEER
   @DSMITTS


   PAUL LATHROP
   OPERATIONS
   @GREYTALYN
SIMPLEGEO

         We originally began as a mobile
            gaming startup, but quickly
       discovered that the location services
       and infrastructure needed to support
        our ideas didn’t exist. So we took
         matters into our own hands and
            began building it ourselves.


         Mt Gaig          Joe Stump
          CSO & co-founder   CTO & co-founder
THE STACK
   [Architecture diagram: www and gnop front ends hit the AWS ELB, which
   forwards to an auth/proxy layer speaking HTTP to API servers across data
   centers. Reads fan out to the geocoder, reverse geocoder, GeoIP, and
   pushpin services; writes flow through queues to record storage and the
   index, both built on Apache Cassandra. AWS RDS sits alongside.]
BUT WHY NOT
  POSTGIS?
DATABASES
WHAT ARE THEY GOOD FOR?
DATA STORAGE
Durably persist system state
CONSTRAINT MANAGEMENT
Enforce data integrity constraints
EFFICIENT ACCESS
Organize data and implement access methods for efficient
retrieval and summarization
DATA INDEPENDENCE
Data independence shields clients from the details
of the storage system, and data structure

LOGICAL DATA INDEPENDENCE
 Clients that operate on a subset of the attributes in a data set should
 not be affected later when new attributes are added
PHYSICAL DATA INDEPENDENCE
 Clients that interact with a logical schema remain the same despite
 physical data structure changes like
 • File organization
 • Compression
 • Indexing strategy
TRANSACTIONAL RELATIONAL
DATABASE SYSTEMS
HIGH DEGREE OF DATA INDEPENDENCE
 Logical structure: SQL Data Definition Language
 Physical structure: Managed by the DBMS
OTHER GOODIES
 They’re theoretically pure, well understood, and mostly
 standardized behind a relatively clean abstraction
 They provide robust contracts that make it easy to reason
 about the structure and nature of the data they contain
 They’re ubiquitous, battle hardened, robust, durable, etc.
ACID
These terms are not formally defined - they’re a
framework, not mathematical axioms

ATOMICITY
 Either all of a transaction’s actions are visible to another transaction, or none are

CONSISTENCY
 Application-specific constraints must be met for transaction to succeed

ISOLATION
 Two concurrent transactions will not see one another’s changes while “in flight”

DURABILITY
 The updates made to the database in a committed transaction will be visible to
 future transactions
ACID HELPS
ACID is a sort-of-formal contract that makes it
easy to reason about your data, and that’s good

IT DOES SOMETHING HARD FOR YOU
 With ACID, you’re guaranteed to maintain a persistent global
 state as long as you’ve defined proper constraints and your
 logical transactions result in a valid system state
CAP THEOREM
At PODC 2000 Eric Brewer told us there were three
desirable DB characteristics. But we can only have two.

CONSISTENCY
 Every node in the system contains the same data (e.g., replicas are
 never out of date)

AVAILABILITY
 Every request to a non-failing node in the system returns a response

PARTITION TOLERANCE
 System properties (consistency and/or availability) hold even when
 the system is partitioned and data is lost
CAP THEOREM IN 30 SECONDS

   [Sequence diagrams: CLIENT, SERVER, REPLICA]

   Happy path: the client sends a write to the server; the server replicates
   it to the replica; the replica acks; the server accepts the write.

   Failure path: replication to the replica FAILS. The server must either
   deny the write (UNAVAILABLE!) or accept it anyway (INCONSISTENT!).
ACID HURTS
Certain aspects of ACID encourage (require?)
implementors to do “bad things”

Unfortunately, ANSI SQL’s definition of isolation...
 relies in subtle ways on an assumption that a locking scheme is
 used for concurrency control, as opposed to an optimistic or
 multi-version concurrency scheme. This implies that the
 proposed semantics are ill-defined.
                            Joseph M. Hellerstein and Michael Stonebraker
                                           Anatomy of a Database System
BALANCE
IT’S A QUESTION OF VALUES
 For traditional databases CAP consistency is the holy grail: it’s
 maximized at the expense of availability and partition
 tolerance
 At scale, failures happen: when you’re doing something a
 million times a second a one-in-a-million failure happens every
 second
 We’re witnessing the birth of a new religion...
 • CAP consistency is a luxury that must be sacrificed at scale in order to
   maintain availability when faced with failures
NETWORK INDEPENDENCE
A distributed system must also manage the
network - if it doesn’t, the client has to

CLIENT APPLICATIONS ARE LEFT TO HANDLE
 Partitioning data across multiple machines
 Working with loosely defined replication semantics
 Detecting, routing around, and correcting network and
 hardware failures
WHAT’S WRONG
WITH MYSQL..?
TRADITIONAL RELATIONAL DATABASES
 They are from an era (er, one of the eras) when Big Iron was
 the answer to scaling up
 In general, the network was not considered part of the system
NEXT GENERATION DATABASES
 Deconstructing and decoupling the beast
 Trying to create a loosely coupled structured storage system
 • Something that the current generation of database systems never
   quite accomplished
UNDERSTANDING
  CASSANDRA
APACHE CASSANDRA
A DISTRIBUTED STRUCTURED STORAGE SYSTEM
EMPHASIZING
 Extremely large data sets
 High transaction volumes
 High value data that necessitates high availability


TO USE CASSANDRA EFFECTIVELY IT HELPS TO
UNDERSTAND WHAT’S GOING ON BEHIND THE SCENES
APACHE CASSANDRA
A DISTRIBUTED HASH TABLE WITH SOME TRICKS
 Peer-to-peer architecture with no distinguished nodes, and
 therefore no single points of failure
 Gossip-based cluster management
 Generic distributed data placement strategy maps data to nodes
 • Pluggable partitioning
 • Pluggable replication strategy
 Quorum based consistency, tunable on a per-request basis
 Keys map to sparse, multi-dimensional sorted maps
 Append-only commit log and SSTables for efficient disk utilization
NETWORK MODEL
DYNAMO INSPIRED
CONSISTENT HASHING
 Simple random partitioning mechanism for distribution
 Low fuss online rebalancing when operational requirements
 change
GOSSIP PROTOCOL
 Simple decentralized cluster configuration and fault detection
 Core protocol for determining cluster membership and
 providing resilience to partial system failure
CONSISTENT HASHING
Improves on the shortcomings of modulo-based hashing

    hash(alice) % 3
      => 23 % 3
      => 2            [Diagram: three nodes; the key maps to node 2]

CONSISTENT HASHING
With modulo hashing, a change in the number of
nodes reshuffles the entire data set

    hash(alice) % 4
      => 23 % 4
      => 3            [Diagram: four nodes; the same key now maps to node 3]

CONSISTENT HASHING
Instead, the range of the hash function is mapped to a
ring, with each node responsible for a segment

    hash(alice) => 23      [Ring diagram: node tokens 0, 42, 84]

CONSISTENT HASHING
When nodes are added (or removed) most of the data
mappings remain the same

    hash(alice) => 23      [Ring diagram: node tokens 0, 42, 64, 84]

CONSISTENT HASHING
Rebalancing the ring requires a minimal amount of
data shuffling

    hash(alice) => 23      [Ring diagram: node tokens 0, 32, 64, 96]
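To make the ring concrete, here is a minimal Python sketch (an illustration of the idea, not Cassandra's partitioner; the toy 0-99 hash space and tokens match the diagrams above):

    import bisect, hashlib

    RING = {0: "node-a", 42: "node-b", 84: "node-c"}   # token -> node, as diagrammed
    TOKENS = sorted(RING)

    def hash100(key: str) -> int:
        """Toy hash onto the 0..99 ring used on the slides."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % 100

    def node_for(key: str) -> str:
        # Walk clockwise to the first token >= hash(key), wrapping past the top.
        i = bisect.bisect_left(TOKENS, hash100(key)) % len(TOKENS)
        return RING[TOKENS[i]]

    # On the slides, hash(alice) => 23 lands between tokens 0 and 42, so the
    # node at token 42 owns it. Adding a node at token 64 only moves the keys
    # in (42, 64]; every other mapping stays put.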
GOSSIP
DISSEMINATES CLUSTER MEMBERSHIP AND
RELATED CONTROL STATE
 Gossip is initiated by an interval timer
 At each gossip tick a node will
 • Randomly select a live node in the cluster, sending it a gossip message
 • Attempt to contact cluster members that were previously marked as
   down
 If the gossip message is unacknowledged for some period of
 time (statistically adjusted based on the inter-arrival time of
 previous messages) the remote node is marked as down
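A hedged sketch of that tick loop (the node fields, the digest message, and the timeout threshold are all assumptions for illustration):

    import random, time

    def gossip_tick(state, send, now=time.time):
        """One timer tick: gossip to a random live node, retry the dead ones."""
        if state.live:
            peer = random.choice(list(state.live))   # random live cluster member
            send(peer, state.digest())               # gossip message
        for node in state.dead:                      # re-probe nodes marked down
            send(node, state.digest())
        # Failure detection: mark a node down when its acks are overdue relative
        # to the observed inter-arrival time of its previous messages.
        for node in list(state.live):
            if now() - node.last_heard > state.threshold * node.mean_interval:
                state.live.remove(node)
                state.dead.add(node)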
REPLICATION
REPLICATION FACTOR Determines how many
copies of each piece of data are created in the
system
    RF=3    hash(alice) => 23    [Ring diagram: tokens 0, 32, 64, 96; the write
            lands on the owning node and is copied to the next nodes clockwise]
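Extending the ring sketch above: one simple placement strategy puts a key's RF copies on the owning node and the next nodes clockwise (again an illustration; Cassandra's replication strategy is pluggable):

    def replicas_for(key: str, rf: int = 3):
        i = bisect.bisect_left(TOKENS, hash100(key)) % len(TOKENS)
        return [RING[TOKENS[(i + j) % len(TOKENS)]] for j in range(rf)]

    # With RF=3 on the three-node toy ring, every node holds a copy of "alice".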
CONSISTENCY MODEL
DYNAMO INSPIRED
QUORUM-BASED CONSISTENCY

    hash(alice) => 23    [Ring diagram: the write hits W=2 replicas,
                          the read hits R=2 replicas]

    W + R > N   =>   Consistent
TUNABLE CONSISTENCY
WRITES

  ZERO DON’T BOTHER WAITING FOR A RESPONSE

   ANY WAIT FOR SOME NODE (NOT NECESSARILY A
       REPLICA) TO RESPOND

   ONE WAIT FOR ONE REPLICA TO RESPOND

 QUORUM WAIT FOR A QUORUM (N/2+1) TO RESPOND

   ALL WAIT FOR ALL N REPLICAS TO RESPOND
TUNABLE CONSISTENCY
READS

   ONE WAIT FOR ONE REPLICA TO RESPOND

 QUORUM WAIT FOR A QUORUM (N/2+1) TO RESPOND

   ALL WAIT FOR ALL N REPLICAS TO RESPOND
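The guarantee falls out of simple arithmetic: a read is sure to see the latest write whenever W + R > N. A tiny sketch of the levels above (a model of the rule, not Cassandra's API):

    def required_acks(level: str, n: int) -> int:
        return {"ZERO": 0, "ANY": 1, "ONE": 1,
                "QUORUM": n // 2 + 1, "ALL": n}[level]

    def overlap_guaranteed(write_level: str, read_level: str, n: int = 3) -> bool:
        # W + R > N means every read quorum intersects every write quorum.
        # (NB: ANY may be acked by a non-replica, so it never really
        # contributes to overlap; it's listed for completeness.)
        return required_acks(write_level, n) + required_acks(read_level, n) > n

    print(overlap_guaranteed("QUORUM", "QUORUM"))  # True:  2 + 2 > 3
    print(overlap_guaranteed("ONE", "ONE"))        # False: 1 + 1 <= 3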
CONSISTENCY MODEL
DYNAMO INSPIRED
READ REPAIR
HINTED HANDOFF
ANTI-ENTROPY
    [Diagram: a write at W=2 succeeds on two replicas; the write to the
    third replica fails]
CONSISTENCY MODEL
DYNAMO INSPIRED
READ REPAIR Asynchronously checks replicas during
reads and repairs any inconsistencies
HINTED HANDOFF
ANTI-ENTROPY
    [Diagram: a later read at W=2 notices the stale replica and fixes it]
CONSISTENCY MODEL
DYNAMO INSPIRED
READ REPAIR
HINTED HANDOFF Sends failed writes to another node
with a hint to re-replicate when the failed node returns
ANTI-ENTROPY
                         wre
                                plica
CONSISTENCY MODEL
DYNAMO INSPIRED
READ REPAIR
HINTED HANDOFF Sends failed writes to another node
with a hint to re-replicate when the failed node returns
ANTI-ENTROPY
    [Diagram: the failed node comes back; the hint is replayed to repair it]
CONSISTENCY MODEL
DYNAMO INSPIRED
READ REPAIR
HINTED HANDOFF
ANTI-ENTROPY Manual repair process where nodes
generate Merkle trees (hash trees) to detect and
repair data inconsistencies

    [Diagram: replicas exchange Merkle trees and repair divergent ranges]
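A toy Merkle-tree comparison in Python (an assumed illustration, not Cassandra's implementation; real anti-entropy descends into the trees to find the exact divergent ranges rather than stopping at the roots):

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def merkle_root(leaves):
        level = [h(leaf) for leaf in leaves]
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])   # duplicate last node on odd levels
            level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    a = [b"alice:st-louis", b"bob:chicago"]
    b = [b"alice:st-louis", b"bob:denver"]    # one divergent row
    print(merkle_root(a) == merkle_root(b))   # False -> repair that range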
DATA MODEL
BIGTABLE INSPIRED
SPARSE MATRIX it’s a hash-map (associative array):
a simple, versatile data structure
SCHEMA-FREE data model introduces new freedom
and new responsibilities
COLUMN FAMILIES blend row-oriented and column-
oriented structure, providing a high level mechanism
for clients to manage on-disk and inter-node data
locality
DATA MODEL
TERMINOLOGY
KEYSPACE A named collection of column families
(similar to a “database” in MySQL) you only need one and
you can mostly ignore it
COLUMN FAMILY A named mapping of keys to rows
ROW A named sorted map of columns or supercolumns
COLUMN A <name, value, timestamp> triple

SUPERCOLUMN A named collection of columns, for
people who want to get fancy
DATA MODEL
  {
      “users”: {                                     ← column family
          “alice”: {                                 ← key; the value is a row
              “city”: [“St. Louis”, 1287040737182],    ← columns:
              “name”: [“Alice”, 1287080340940],          (name, value, timestamp)
          },
          ...
      },
      “locations”: {
      },
      ...
  }
IT’S A DISTRIBUTED HASH TABLE
WITH A TWIST...
COLUMNS IN A ROW ARE STORED TOGETHER ON ONE NODE,
IDENTIFIED BY <keyspace, key>

{
    “users”: {                                     ← column family
        “alice”: {                                 ← key
            “city”: [“St. Louis”, 1287040737182],    ← columns
            “name”: [“Alice”, 1287080340940],
        },
        ...
    },
    ...
}

    [Ring diagram: “alice” hashes to one node, “bob” to another]
HASH TABLE
SUPPORTED QUERIES

  ✓ EXACT MATCH
  ✗ RANGE
  ✗ PROXIMITY
  ✗ ANYTHING THAT’S NOT
    EXACT MATCH
COLUMNS
SUPPORTED QUERIES

  ✓ EXACT MATCH
  ✓ RANGE (column names within a row are sorted)
  ✗ PROXIMITY

  {
      “users”: {
          “alice”: {
              “city”: [“St. Louis”, 1287040737182],
              “friend-1”: [“Bob”, 1287080340940],   ┐
              “friend-2”: [“Joe”, 1287080340940],   ├ friends
              “friend-3”: [“Meg”, 1287080340940],   ┘
              “name”: [“Alice”, 1287080340940],
          },
          ...
      }
  }
LOG-STRUCTURED MERGE
MEMTABLES are in memory data structures that
contain newly written data

COMMIT LOGS are append only files where new
data is durably written

SSTABLES are serialized memtables, persisted to
disk

COMPACTION periodically merges multiple
memtables to improve system performance
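A toy model of those moving parts (heavily simplified and assumed for illustration; it is not Cassandra's storage engine):

    import json

    class ToyLSM:
        def __init__(self, log_path="commit.log", flush_at=2):
            self.log = open(log_path, "a")
            self.memtable = {}
            self.sstables = []          # newest last; each is a sorted dict
            self.flush_at = flush_at

        def put(self, key, value):
            self.log.write(json.dumps([key, value]) + "\n")
            self.log.flush()            # durable in the commit log first
            self.memtable[key] = value
            if len(self.memtable) >= self.flush_at:
                self.sstables.append(dict(sorted(self.memtable.items())))
                self.memtable = {}      # flushed memtable became an SSTable

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for table in reversed(self.sstables):   # newest SSTable wins
                if key in table:
                    return table[key]
            return None

        def compact(self):
            merged = {}
            for table in self.sstables:   # older first, newer overwrites
                merged.update(table)
            self.sstables = [dict(sorted(merged.items()))]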
CASSANDRA
CONCEPTUAL SUMMARY...
IT’S A DISTRIBUTED HASH TABLE
 Gossip based peer-to-peer “ring” with no distinguished nodes and no
 single point of failure
 Consistent hashing distributes workload and simple replication
 strategy for fault tolerance and improved throughput
WITH TUNABLE CONSISTENCY
 Based on quorum protocol to ensure consistency
 And simple repair mechanisms to stay available during partial system
 failures
AND A SIMPLE, SCHEMA-FREE DATA MODEL
 It’s just a key-value store
 Whose values are multi-dimensional sorted maps
ADVANCED CASSANDRA
     - A case study -
SPATIAL DATA IN A DHT
A FIRST PASS
THE ORDER PRESERVING PARTITIONER
CASSANDRA’S PARTITIONING
STRATEGY IS PLUGGABLE
 Partitioner maps keys to nodes
 Random partitioner destroys locality by hashing
 Order preserving partitioner retains locality, storing
 keys in natural lexicographic order around the ring

    [Ring diagram: keys alice, bob, …, sam placed in lexicographic
    order around the ring]
ORDER PRESERVING PARTITIONER

  ✓ EXACT MATCH
  ✓ RANGE        on a single dimension
  ? PROXIMITY
SPATIAL DATA
IT’S INHERENTLY MULTIDIMENSIONAL



    [Diagram: point x at (2, 2) in a two-dimensional coordinate grid]
DIMENSIONALITY REDUCTION
WITH SPACE-FILLING CURVES

    [Diagram: a 2×2 grid of quadrants numbered 1-4, visited in Z order]
Z-CURVE
SECOND ITERATION

    [Diagram: the Z curve recursing one level into each quadrant]

Z-VALUE

    [Diagram: the point x from before falls at Z-value 14 along the curve]
GEOHASH
SIMPLE TO COMPUTE
 Interleave the bits of decimal coordinates
 (equivalent to binary encoding of pre-order
 traversal!)
 Base32 encode the result
AWESOME CHARACTERISTICS
 Arbitrary precision
 Human readable
 Sorts lexicographically


    [Example: the five bits 01101 (decimal 13) base32-encode to “e”]
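A minimal geohash encoder sketch (the standard algorithm, written out for illustration; not SimpleGeo's production code):

    BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

    def geohash(lat, lon, precision=8):
        lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
        bits, bit_count, chars = 0, 0, []
        even = True                       # geohash starts with a longitude bit
        while len(chars) < precision:
            rng, val = (lon_range, lon) if even else (lat_range, lat)
            mid = (rng[0] + rng[1]) / 2
            bits <<= 1
            if val >= mid:                # 1 bit: value in the upper half
                bits |= 1
                rng[0] = mid
            else:                         # 0 bit: value in the lower half
                rng[1] = mid
            even = not even
            bit_count += 1
            if bit_count == 5:            # every 5 bits become one base32 char
                chars.append(BASE32[bits])
                bits, bit_count = 0, 0
        return "".join(chars)

    print(geohash(38.6554420, -90.2992910))  # "9yzgcjn0", the key on the next slide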
DATA MODEL
{
    “record-index”: {
                                     ← key is <geohash>:<id>
        “9yzgcjn0:moonrise hotel”: {
            “”: [“”, 1287040737182],
        },
        ...
    },
    “records”: {
        “moonrise hotel”: {
            “latitude”: [“38.6554420”, 1287040737182],
            “longitude”: [“-90.2992910”, 1287040737182],
            ...
        }
    }
}
BOUNDING BOX
E.G., MULTIDIMENSIONAL RANGE

 Gie stuff  bg box!   Gie 2  3


          1       2


          3       4


                          Gie 4  5
SPATIAL DATA
STILL MULTIDIMENSIONAL
DIMENSIONALITY REDUCTION ISN’T PERFECT
 Clients must
 • Pre-process to compose multiple queries
 • Post-process to filter and merge results
 Degenerate cases can be bad, particularly for nearest-neighbor
 queries
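A hedged sketch of that pre/post-processing for a bounding box (reusing the geohash() sketch from earlier; range_scan is a hypothetical DHT call, and a real planner decomposes the box into several runs to handle the curve's discontinuities):

    def bbox_query(store, south, west, north, east, precision=5):
        lo = geohash(south, west, precision)   # SW corner prefix
        hi = geohash(north, east, precision)   # NE corner prefix
        # Pre-process: a lexicographic range scan over order-preserved keys.
        # (Simplified to one scan; degenerate boxes need multiple.)
        candidates = store.range_scan(lo, hi)  # hypothetical DHT range-scan call
        # Post-process: the curve overshoots the box, so filter false positives.
        return [r for r in candidates          # r.lat / r.lon assumed fields
                if south <= r.lat <= north and west <= r.lon <= east]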
Z-CURVE LOCALITY

    [Diagram sequence: two points marked x sit close together in space, yet
    the curve segment connecting them (the o’s) takes a long detour: points
    that are near one another in space can be far apart on the curve]
THE WORLD
IS NOT BALANCED




          Credit: C. Mayhew & R. Simmon (NASA/GSFC), NOAA/NGDC, DMSP Digital Archive
TOO MUCH LOCALITY

    [Diagram: a 2×2 grid of cells; all of SAN FRANCISCO lands in cell 1]

    Cell 1: “I’m sad.”
    The idle cells: “I’m bored.” “Me too.” “Let’s play xbox.”
A TURNING POINT
HELLO, DRAWING BOARD
SURVEY OF DISTRIBUTED P2P INDEXING
 An overlay-dependent index works directly with nodes of the
 peer-to-peer network, defining its own overlay
 An over-DHT index overlays a more sophisticated data
 structure on top of a peer-to-peer distributed hash table
ANOTHER LOOK AT POSTGIS
MIGHT WORK, BUT
 The relational transaction management system (which we’d
 want to change) and access methods (which we’d have to
 change) are tightly coupled (necessarily?) to other parts of
 the system
 Could work at a higher level and treat PostGIS as a black box
 • Now we’re back to implementing a peer-to-peer network with failure
   recovery, fault detection, etc... and Cassandra already had all that.
  • It’s probably clear by now that I think these problems are more
    difficult than actually storing structured data on disk
LET’S TAKE A STEP BACK
EARTH, TREE, RING

    [Progressive diagram: the earth, recursively split into a tree of cells,
    mapped onto the ring]
DATA MODEL
{
    “record-index”: {
        “layer-name:37.875,-90:40.25,-101.25”: {
            “38.6554420,-90.2992910:moonrise hotel”: [“”, 1287040737182],
            ...
        },
    },
    “record-index-meta”: {
        “layer-name:37.875,-90:40.25,-101.25”: {
            “split”: [“false”, 1287040737182],
        }
        “layer-name:37.875,-90:42.265,-101.25”: {
            “split”: [“true”, 1287040737182],
            “child-left”: [“layer-name:37.875,-90:40.25,-101.25”, 1287040737182],
            “child-right”: [“layer-name:40.25,-90:42.265,-101.25”, 1287040737182],
        }
    }
}
SPLITTING
IT’S PRETTY MUCH JUST A CONCURRENT TREE
 Splitting shouldn’t lock the tree for reads or writes and failures
 shouldn’t cause corruption
 • Splits are optimistic, idempotent, and fail-forward
 • Instead of locking, writes are replicated to the splitting node and the
   relevant child[ren] while a split operation is taking place
   • Cleanup occurs after the split is completed and all interested nodes are
     aware that the split has occurred
   • Cassandra writes are idempotent, so splits are too - if a split fails, it is
     simply retried
 Split size: a tunable knob for balancing locality and distributedness
 The other hard problem with concurrent trees is rebalancing - we
 just don’t do it! (more on this later)
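A hedged sketch of that fail-forward flow (the helper names and ordering are assumptions; the real sequence is driven by the split / child-left / child-right columns shown in the data model above):

    def split(node):
        left, right = make_children(node)         # hypothetical: halve the cell
        # Idempotent Cassandra writes: re-running any step is harmless.
        write_child_pointers(node, left, right)   # child-left / child-right columns
        for record in node.records():
            (left if left.contains(record) else right).insert(record)
        mark_split(node)                          # flip the "split" column last
        # While this runs, incoming writes go to BOTH the splitting node and
        # the relevant child, so readers never need a lock; duplicates are
        # cleaned up once every interested node has seen the completed split.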
THE ROOT IS HOT
MIGHT BE A DEAL BREAKER
 For a tree to be useful, it has to be traversed
 • Typically, tree traversal starts at the root
 • Root is the only discoverable node in our tree
 Traversing through the root meant reading the root for every
 read or write below it - unacceptable
 • Lots of academic solutions - most promising was a skip graph, but
   that required O(n log(n)) data - also unacceptable
  • Minimum tree depth was proposed, but then you just get multiple
    hot-spots at your minimum depth nodes
BACK TO THE BOOKS
LOTS OF ACADEMIC WORK ON THIS TOPIC
 But academia is obsessed with provable, deterministic,
 asymptotically optimal algorithms
 And we only need something that is probably fast enough
 most of the time (for some value of “probably” and “most of
 the time”)
 • And if the probably good enough algorithm is, you know... tractable...
   one might even consider it qualitatively better!
- THE ROOT -
A HOT SPOT AND A SPOF
    [Diagrams: “We have” a tree where every traversal starts at the root;
    “we want” traversals that can start anywhere]
THINKING HOLISTICALLY
WE OBSERVED THAT
 Once a node in the tree exists, it doesn’t go away
 Node state may change, but that state only really matters
 locally - thinking a node is a leaf when it really has children is
 not fatal
SO... WHAT IF WE JUST CACHED NODES THAT
WERE OBSERVED IN THE SYSTEM!?
CACHE IT
STUPID SIMPLE SOLUTION
 Keep an LRU cache of nodes that have been traversed
 Start traversals at the most selective relevant node
 If that node doesn’t satisfy you, traverse up the tree
 Along with your result set, return a list of nodes that were
 traversed so the caller can add them to its cache
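A minimal sketch of that cache-then-walk-up idea (the cache lookup helper and the node methods are assumptions for illustration):

    from collections import OrderedDict

    class NodeCache:
        """Tiny LRU of tree nodes seen in earlier traversals (illustrative)."""
        def __init__(self, capacity=1024):
            self.cap, self.nodes = capacity, OrderedDict()

        def add(self, node):
            self.nodes[node.key] = node
            self.nodes.move_to_end(node.key)
            if len(self.nodes) > self.cap:
                self.nodes.popitem(last=False)     # evict least recently used

    def query(cache, root, box):
        node = most_selective(cache, box) or root  # hypothetical cache lookup
        while not node.covers(box):                # not satisfied? walk UP the tree
            node = node.parent
        results, traversed = node.search(box)      # search also reports nodes seen
        for n in traversed:
            cache.add(n)                           # the caller caches what it saw
        return results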
TRAVERSAL
NEAREST NEIGHBOR

    [Progressive diagram: starting from the cell containing x, the search
    expands outward through neighboring cells until enough neighbors
    (the o’s) are found]
KEY CHARACTERISTICS
PERFORMANCE
 Best case on the happy path (everything cached) has zero
 read overhead
 Worst case, with nothing cached, O(log(n)) read overhead
RE-BALANCING SEEMS UNNECESSARY!
 Makes worst case more worser, but so far so good
DISTRIBUTED TREE
SUPPORTED QUERIES

  ✓ EXACT MATCH
  ✓ RANGE
  ✓ PROXIMITY
  ✓ SOMETHING ELSE I HAVEN’T
    EVEN HEARD OF

  …IN MULTIPLE DIMENSIONS!
LIFE OF A REQUEST
THE BIRDS ‘N THE BEES

    [Request-flow diagram, annotated]

    ELB            load balancing; AWS service
    gate           authentication; forwarding
    service        business logic - basic validation
    cass           record storage
    worker pool    business logic - storage/indexing
    index          awesome sauce for querying
ELB
•Traffic management
 •Control which AZs are serving traffic
 •Upgrades without downtime
  •Able to remove an AZ, upgrade, test,
  replace

•API-level failure scenarios
 •Periodically runs healthchecks on nodes
 •Removes nodes that fail
GATE
•Basic auth
•HTTP proxy to specific services
 •Services are independent of one another
 •Auth is decoupled from business logic
•First line of defense
 •Very fast, very cheap
 •Keeps services from being overwhelmed
 by poorly authenticated requests
RABBITMQ
•Decouple accepting writes from
performing the heavy lifting
 •Don’t block client while we write to db/
 index
•Flexibility in the event of degradation
further down the stack
 •Queues can hold a lot, and can keep
 accepting writes throughout incident
•Heterogeneous consumers - pass the same
message through multiple code paths easily
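A hedged sketch of that decoupling with RabbitMQ's Python client, pika (the queue name and message shape are made up for illustration; this is not SimpleGeo's code):

    import json, pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.queue_declare(queue="record-writes", durable=True)

    def accept_write(record):
        # API path: enqueue and return immediately; don't block on db/index.
        ch.basic_publish(exchange="", routing_key="record-writes",
                         body=json.dumps(record),
                         properties=pika.BasicProperties(delivery_mode=2))

    def on_message(channel, method, properties, body):
        record = json.loads(body)
        # Worker path: the heavy lifting (storage write, index update) goes here.
        channel.basic_ack(delivery_tag=method.delivery_tag)

    ch.basic_consume(queue="record-writes", on_message_callback=on_message)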
AWS
EC2
• Security groups
• Static IPs
• Choose your data center
• Choose your instance type
• On-demand vs. Reserved
ELASTIC BLOCK STORE
• Storage devices that can be anywhere from 1GB to 1TB
• Snapshotting
• Automatic replication
• Mobility
ELASTIC LOAD BALANCING
• Distribute incoming traffic
• Automatic scaling
• Health checks
SIMPLE STORAGE SERVICE
• File sizes can be up to 5 TB
• Unlimited amount of files
• Individual read/write access credentials
RELATIONAL DATABASE SERVICE
• MySQL in the cloud
• Manages replication
• Specify instance types based on need
• Snapshots
CONFIGURATION
 MANAGEMENT
WHY IS THIS NECESSARY?
• Easier DevOps integration
• Reusable modules
• Infrastructure as code
• One word: WINNING
POSSIBLE OPTIONS
SAMPLE MANIFEST
     # /root/learning-manifests/apache2.pp
 package { 'apache2':
   ensure => present,
 }

 file { '/etc/apache2/apache2.conf':
   ensure => file,
   mode   => '600',
   notify => Service['apache2'],
   source => '/root/learning-manifests/apache2.conf',
 }

 service { 'apache2':
   ensure    => running,
   enable    => true,
   subscribe => File['/etc/apache2/apache2.conf'],
 }
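Applying it locally is a one-liner: puppet apply /root/learning-manifests/apache2.pp. The notify/subscribe pair means a changed config file restarts Apache automatically.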
AN EXAMPLE



       TERMINAL
CONTINUOUS
INTEGRATION
GET ‘ER DONE
• Revision control
• Automate build process
• Automate testing process
• Automate deployment




  Local code changes should result in production
                  deployments.
DON’T FORGET TO DEBIANIZE
• All codebases must be debianized
• If an open source project isn’t debianized yet, fork the repo and
  do it yourself!
• Take the time to teach others
• Debian directories can easily be reused after a simple search
  and replace
REPOMAN
https://github.com/synack/repoman.git


   repoman upload myrepo sg-example.tar.gz

   repoman promote development/sg-example staging
HUDSON
MAINTAINING MULTIPLE
ENVIRONMENTS
• Run unit tests in a development environment
• Promote to staging
• Run system tests in a staging environment
• Run consumption tests in a staging environment
• Promote to production




  Congratz, you have now just automated yourself
                   out of a job.
THE MONEY MAKER




       Github Plugin
TYING IT ALL TOGETHER



        TERMINAL
FLUME
Flume is a distributed, reliable and available
service for efficiently collecting, aggregating and
moving large amounts of log data.




             syslog on steroids
DATA-FLOWS
AGENTS
Physical Host   Logical Node     Source → Sink

                twitter_stream   twitter("dsmitts", "mypassword", "url")
                                   → agentSink(35853)

i-192df98       tail             tail("/var/log/nginx/access.log")
                                   → agentSink(35853)

                hdfs_writer      collectorSource(35853)
                                   → collectorSink("hdfs://namenode.sg.com/bogus/", "logs")
RELIABILITY


         END-TO-END

       STORE ON FAILURE

         BEST EFFORT
GETTIN’ JIGGY WIT IT
Custom Decorators
HOW DO WE USE IT?
PERSONAL EXPERIENCES
• #flume
• Automation was gnarly
• It’s never a good day when Eclipse is involved
• Resource hog (at first)
We’ buildg kick-s ols f vops  x,
tpt,  csume da cnect  a locn




                  MARCH 2011

More Related Content

Similar to Scalable Data Storage Getting You Down? To The Cloud!

Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for SysadminsNathan Milford
 
Modeling data and best practices for the Azure Cosmos DB.
Modeling data and best practices for the Azure Cosmos DB.Modeling data and best practices for the Azure Cosmos DB.
Modeling data and best practices for the Azure Cosmos DB.Mohammad Asif
 
Cassandra presentation
Cassandra presentationCassandra presentation
Cassandra presentationSergey Enin
 
SQL and NoSQL in SQL Server
SQL and NoSQL in SQL ServerSQL and NoSQL in SQL Server
SQL and NoSQL in SQL ServerMichael Rys
 
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...DataStax Academy
 
Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Russell Spitzer
 
NewSQL - Deliverance from BASE and back to SQL and ACID
NewSQL - Deliverance from BASE and back to SQL and ACIDNewSQL - Deliverance from BASE and back to SQL and ACID
NewSQL - Deliverance from BASE and back to SQL and ACIDTony Rogerson
 
Azure Cosmos DB - Technical Deep Dive
Azure Cosmos DB - Technical Deep DiveAzure Cosmos DB - Technical Deep Dive
Azure Cosmos DB - Technical Deep DiveAndre Essing
 
Tech Talk Series, Part 2: Why is sharding not smart to do in MySQL?
Tech Talk Series, Part 2: Why is sharding not smart to do in MySQL?Tech Talk Series, Part 2: Why is sharding not smart to do in MySQL?
Tech Talk Series, Part 2: Why is sharding not smart to do in MySQL?Clustrix
 
Amazon Aurora TechConnect
Amazon Aurora TechConnect Amazon Aurora TechConnect
Amazon Aurora TechConnect LavanyaMurthy9
 
Oracle rac 10g best practices
Oracle rac 10g best practicesOracle rac 10g best practices
Oracle rac 10g best practicesHaseeb Alam
 
Sql azure introduction
Sql azure introductionSql azure introduction
Sql azure introductionSuherman .
 
Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...
Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...
Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...Andre Essing
 
AWS January 2016 Webinar Series - Amazon Aurora for Enterprise Database Appli...
AWS January 2016 Webinar Series - Amazon Aurora for Enterprise Database Appli...AWS January 2016 Webinar Series - Amazon Aurora for Enterprise Database Appli...
AWS January 2016 Webinar Series - Amazon Aurora for Enterprise Database Appli...Amazon Web Services
 

Similar to Scalable Data Storage Getting You Down? To The Cloud! (20)

Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
 
Cassandra
CassandraCassandra
Cassandra
 
Modeling data and best practices for the Azure Cosmos DB.
Modeling data and best practices for the Azure Cosmos DB.Modeling data and best practices for the Azure Cosmos DB.
Modeling data and best practices for the Azure Cosmos DB.
 
Cassandra presentation
Cassandra presentationCassandra presentation
Cassandra presentation
 
Clustering van IT-componenten
Clustering van IT-componentenClustering van IT-componenten
Clustering van IT-componenten
 
SQL and NoSQL in SQL Server
SQL and NoSQL in SQL ServerSQL and NoSQL in SQL Server
SQL and NoSQL in SQL Server
 
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
 
Azure and cloud design patterns
Azure and cloud design patternsAzure and cloud design patterns
Azure and cloud design patterns
 
Technical overview of Azure Cosmos DB
Technical overview of Azure Cosmos DBTechnical overview of Azure Cosmos DB
Technical overview of Azure Cosmos DB
 
Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0
 
NewSQL - Deliverance from BASE and back to SQL and ACID
NewSQL - Deliverance from BASE and back to SQL and ACIDNewSQL - Deliverance from BASE and back to SQL and ACID
NewSQL - Deliverance from BASE and back to SQL and ACID
 
Azure Cosmos DB - Technical Deep Dive
Azure Cosmos DB - Technical Deep DiveAzure Cosmos DB - Technical Deep Dive
Azure Cosmos DB - Technical Deep Dive
 
Tech Talk Series, Part 2: Why is sharding not smart to do in MySQL?
Tech Talk Series, Part 2: Why is sharding not smart to do in MySQL?Tech Talk Series, Part 2: Why is sharding not smart to do in MySQL?
Tech Talk Series, Part 2: Why is sharding not smart to do in MySQL?
 
Amazon Aurora TechConnect
Amazon Aurora TechConnect Amazon Aurora TechConnect
Amazon Aurora TechConnect
 
Deep Dive on Amazon Aurora
Deep Dive on Amazon AuroraDeep Dive on Amazon Aurora
Deep Dive on Amazon Aurora
 
Oracle rac 10g best practices
Oracle rac 10g best practicesOracle rac 10g best practices
Oracle rac 10g best practices
 
Sql azure introduction
Sql azure introductionSql azure introduction
Sql azure introduction
 
Cassandra Database
Cassandra DatabaseCassandra Database
Cassandra Database
 
Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...
Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...
Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...
 
AWS January 2016 Webinar Series - Amazon Aurora for Enterprise Database Appli...
AWS January 2016 Webinar Series - Amazon Aurora for Enterprise Database Appli...AWS January 2016 Webinar Series - Amazon Aurora for Enterprise Database Appli...
AWS January 2016 Webinar Series - Amazon Aurora for Enterprise Database Appli...
 

Recently uploaded

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 

Recently uploaded (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 

Scalable Data Storage Getting You Down? To The Cloud!

  • 1. SCALABLE DATA STORAGE GETTING YOU DOWN? TO THE CLOUD! Web 2.0 Expo SF 20 Mike Male, Mike Pcnko, Dek Smh, Paul Lhrop
  • 2. THE CAST MIKE MALONE INFRASTRUCTURE ENGINEER @MJMALONE MIKE PANCHENKO INFRASTRUCTURE ENGINEER @MIHASYA DEREK SMITH INFRASTRUCTURE ENGINEER @DSMITTS PAUL LATHROP OPERATIONS @GREYTALYN
  • 3. SIMPLEGEO We originally began as a mobile gaming startup, but quickly discovered that the location services and infrastructure needed to support our ideas didn’t exist. So we took matters into our own hands and began building it ourselves. Mt Gaig Joe Stump CSO & co-founder CTO & co-founder
  • 4. THE STACK www gnop AWS RDS AWS auth/proxy ELB HTTP data centers ... ... api servers record storage reads geocoder queues reverse geocoder GeoIP pushpin writes index storage Apache Cassandra
  • 5. BUT WHY NOT POSTGIS?
  • 6. DATABASES WHAT ARE THEY GOOD FOR? DATA STORAGE Durably persist system state CONSTRAINT MANAGEMENT Enforce data integrity constraints EFFICIENT ACCESS Organize data and implement access methods for efficient retrieval and summarization
  • 7. DATA INDEPENDENCE Data independence shields clients from the details of the storage system, and data structure LOGICAL DATA INDEPENDENCE Clients that operate on a subset of the attributes in a data set should not be affected later when new attributes are added PHYSICAL DATA INDEPENDENCE Clients that interact with a logical schema remain the same despite physical data structure changes like • File organization • Compression • Indexing strategy
  • 8. TRANSACTIONAL RELATIONAL DATABASE SYSTEMS HIGH DEGREE OF DATA INDEPENDENCE Logical structure: SQL Data Definition Language Physical structure: Managed by the DBMS OTHER GOODIES They’re theoretically pure, well understood, and mostly standardized behind a relatively clean abstraction They provide robust contracts that make it easy to reason about the structure and nature of the data they contain They’re ubiquitous, battle hardened, robust, durable, etc.
  • 9. ACID These terms are not formally defined - they’re a framework, not mathematical axioms ATOMICITY Either all of a transaction’s actions are visible to another transaction, or none are CONSISTENCY Application-specific constraints must be met for transaction to succeed ISOLATION Two concurrent transactions will not see one another’s transactions while “in flight” DURABILITY The updates made to the database in a committed transaction will be visible to future transactions
  • 10. ACID HELPS ACID is a sort-of-formal contract that makes it easy to reason about your data, and that’s good IT DOES SOMETHING HARD FOR YOU With ACID, you’re guaranteed to maintain a persistent global state as long as you’ve defined proper constraints and your logical transactions result in a valid system state
  • 11. CAP THEOREM At PODC 2000 Eric Brewer told us there were three desirable DB characteristics. But we can only have two. CONSISTENCY Every node in the system contains the same data (e.g., replicas are never out of date) AVAILABILITY Every request to a non-failing node in the system returns a response PARTITION TOLERANCE System properties (consistency and/or availability) hold even when the system is partitioned and data is lost
  • 12. CAP THEOREM IN 30 SECONDS CLIENT SERVER REPLICA
  • 13. CAP THEOREM IN 30 SECONDS CLIENT SERVER REPLICA wre
  • 14. CAP THEOREM IN 30 SECONDS CLIENT SERVER plice REPLICA wre
  • 15. CAP THEOREM IN 30 SECONDS CLIENT SERVER plice REPLICA wre ack
  • 16. CAP THEOREM IN 30 SECONDS CLIENT SERVER plice REPLICA wre aept ack
  • 17. CAP THEOREM IN 30 SECONDS CLIENT SERVER REPLICA wre FAIL! ni UNAVAILAB!
  • 18. CAP THEOREM IN 30 SECONDS CLIENT SERVER REPLICA wre FAIL! aept CSTT!
  • 19. ACID HURTS Certain aspects of ACID encourage (require?) implementors to do “bad things” Unfortunately, ANSI SQL’s definition of isolation... relies in subtle ways on an assumption that a locking scheme is used for concurrency control, as opposed to an optimistic or multi-version concurrency scheme. This implies that the proposed semantics are ill-defined. Joseph M. Hellerstein and Michael Stonebraker Anatomy of a Database System
  • 20. BALANCE IT’S A QUESTION OF VALUES For traditional databases CAP consistency is the holy grail: it’s maximized at the expense of availability and partition tolerance At scale, failures happen: when you’re doing something a million times a second a one-in-a-million failure happens every second We’re witnessing the birth of a new religion... • CAP consistency is a luxury that must be sacrificed at scale in order to maintain availability when faced with failures
  • 21. NETWORK INDEPENDENCE A distributed system must also manage the network - if it doesn’t, the client has to CLIENT APPLICATIONS ARE LEFT TO HANDLE Partitioning data across multiple machines Working with loosely defined replication semantics Detecting, routing around, and correcting network and hardware failures
  • 22. WHAT’S WRONG WITH MYSQL..? TRADITIONAL RELATIONAL DATABASES They are from an era (er, one of the eras) when Big Iron was the answer to scaling up In general, the network was not considered part of the system NEXT GENERATION DATABASES Deconstructing, and decoupling the beast Trying to create a loosely coupled structured storage system • Something that the current generation of database systems never quite accomplished
  • 24. APACHE CASSANDRA A DISTRIBUTED STRUCTURED STORAGE SYSTEM EMPHASIZING Extremely large data sets High transaction volumes High value data that necessitates high availability TO USE CASSANDRA EFFECTIVELY IT HELPS TO UNDERSTAND WHAT’S GOING ON BEHIND THE SCENES
  • 25. APACHE CASSANDRA A DISTRIBUTED HASH TABLE WITH SOME TRICKS Peer-to-peer architecture with no distinguished nodes, and therefore no single points of failure Gossip-based cluster management Generic distributed data placement strategy maps data to nodes • Pluggable partitioning • Pluggable replication strategy Quorum based consistency, tunable on a per-request basis Keys map to sparse, multi-dimensional sorted maps Append-only commit log and SSTables for efficient disk utilization
  • 26. NETWORK MODEL DYNAMO INSPIRED CONSISTENT HASHING Simple random partitioning mechanism for distribution Low fuss online rebalancing when operational requirements change GOSSIP PROTOCOL Simple decentralized cluster configuration and fault detection Core protocol for determining cluster membership and providing resilience to partial system failure
  • 27. CONSISTENT HASHING Improves shortcomings of modulo-based hashing 1 sh(alice) % 3 2 => 23 % 3 => 2 3
  • 28. CONSISTENT HASHING With modulo hashing, a change in the number of nodes reshuffles the entire data set 1 sh(alice) % 4 2 => 23 % 4 => 3 3 4
  • 29. CONSISTENT HASHING Instead the range of the hash function is mapped to a ring, with each node responsible for a segment 0 sh(alice) => 23 84 42
  • 30. CONSISTENT HASHING When nodes are added (or removed) most of the data mappings remain the same 0 sh(alice) => 23 84 42 64
  • 31. CONSISTENT HASHING Rebalancing the ring requires a minimal amount of data shuffling 0 sh(alice) => 23 96 32 64
  • 32. GOSSIP DISSEMINATES CLUSTER MEMBERSHIP AND RELATED CONTROL STATE Gossip is initiated by an interval timer At each gossip tick a node will • Randomly select a live node in the cluster, sending it a gossip message • Attempt to contact cluster members that were previously marked as down If the gossip message is unacknowledged for some period of time (statistically adjusted based on the inter-arrival time of previous messages) the remote node is marked as down
  • 33. REPLICATION REPLICATION FACTOR Determines how many copies of each piece of data are created in the system RF=3 0 sh(alice) => 23 96 32 64
  • 34. CONSISTENCY MODEL DYNAMO INSPIRED QUORUM-BASED CONSISTENCY W=2 0 wre sh(alice) => 23 ad 96 32 W+R>N R=2 64 Cstt
  • 35. TUNABLE CONSISTENCY WRITES ZERO DON’T BOTHER WAITING FOR A RESPONSE ANY WAIT FOR SOME NODE (NOT NECESSARILY A REPLICA) TO RESPOND ONE WAIT FOR ONE REPLICA TO RESPOND QUORUM WAIT FOR A QUORUM (N/2+1) TO RESPOND ALL WAIT FOR ALL N REPLICAS TO RESPOND
  • 36. TUNABLE CONSISTENCY READS ONE WAIT FOR ONE REPLICA TO RESPOND QUORUM WAIT FOR A QUORUM (N/2+1) TO RESPOND ALL WAIT FOR ALL N REPLICAS TO RESPOND
  • 37. CONSISTENCY MODEL DYNAMO INSPIRED READ REPAIR HINTED HANDOFF ANTI-ENTROPY W=2 wre fail
  • 38. CONSISTENCY MODEL DYNAMO INSPIRED READ REPAIR Asynchronously checks replicas during reads and repairs any inconsistencies HINTED HANDOFF ANTI-ENTROPY W=2 wre ad + fix
  • 39. CONSISTENCY MODEL DYNAMO INSPIRED READ REPAIR HINTED HANDOFF Sends failed writes to another node with a hint to re-replicate when the failed node returns ANTI-ENTROPY wre plica
  • 40. CONSISTENCY MODEL DYNAMO INSPIRED READ REPAIR HINTED HANDOFF Sends failed writes to another node with a hint to re-replicate when the failed node returns ANTI-ENTROPY * ck * pair
  • 41. CONSISTENCY MODEL DYNAMO INSPIRED READ REPAIR HINTED HANDOFF ANTI-ENTROPY Manual repair process where nodes generate Merkle trees (hash trees) to detect and repair data inconsistencies pair
• 42. DATA MODEL BIGTABLE INSPIRED SPARSE MATRIX It’s a hash-map (associative array): a simple, versatile data structure SCHEMA-FREE data model introduces new freedom and new responsibilities COLUMN FAMILIES blend row-oriented and column-oriented structure, providing a high-level mechanism for clients to manage on-disk and inter-node data locality
  • 43. DATA MODEL TERMINOLOGY KEYSPACE A named collection of column families (similar to a “database” in MySQL) you only need one and you can mostly ignore it COLUMN FAMILY A named mapping of keys to rows ROW A named sorted map of columns or supercolumns COLUMN A <name, value, timestamp> triple SUPERCOLUMN A named collection of columns, for people who want to get fancy
• 44. DATA MODEL
{
  "users": {                                  # column family
    "alice": {                                # key
      "city": ["St. Louis", 1287040737182],   # column: (name, value, timestamp)
      "name": ["Alice", 1287080340940],
    },                                        # a row: sorted map of columns
    ...
  },
  "locations": {
  },
  ...
}
• 45. IT’S A DISTRIBUTED HASH TABLE WITH A TWIST... COLUMNS IN A ROW ARE STORED TOGETHER ON ONE NODE, IDENTIFIED BY <keyspace, key>
{
  "users": {                                  # column family
    "alice": {                                # key
      "city": ["St. Louis", 1287040737182],   # columns
      "name": ["Alice", 1287080340940],
    },
    ...
  },
}
(ring diagram: the row keys alice and bob each hash to their own position on the ring)
• 46. HASH TABLE SUPPORTED QUERIES EXACT MATCH: yes. RANGE: no. PROXIMITY: no. ANYTHING THAT’S NOT EXACT MATCH: no.
• 47. COLUMNS SUPPORTED QUERIES EXACT MATCH: yes. RANGE: yes, over the sorted column names within a row (e.g. the "friend-*" columns below). PROXIMITY: no.
{
  "users": {
    "alice": {
      "city": ["St. Louis", 1287040737182],
      "friend-1": ["Bob", 1287080340940],     # friends
      "friend-2": ["Joe", 1287080340940],
      "friend-3": ["Meg", 1287080340940],
      "name": ["Alice", 1287080340940],
    },
    ...
  }
}
• 48. LOG-STRUCTURED MERGE MEMTABLES are in-memory data structures that contain newly written data COMMIT LOGS are append-only files where new data is durably written SSTABLES are serialized memtables, persisted to disk COMPACTION periodically merges multiple SSTables to improve system performance
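A minimal sketch of this write path (the file name and flush threshold are invented for illustration): append to the commit log for durability, buffer in the memtable, and flush to an immutable sorted segment.

import json

class LSMStore:
    def __init__(self, log_path="commit.log", flush_at=4):
        self.log = open(log_path, "a")   # append-only commit log
        self.memtable = {}               # newly written data, in memory
        self.sstables = []               # immutable, sorted segments
        self.flush_at = flush_at

    def write(self, key, value):
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()                 # durable before we acknowledge
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_at:
            self.flush()

    def flush(self):
        # Serialize the memtable to a sorted, immutable SSTable; compaction
        # would later merge these segments to keep reads cheap.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):    # newest segment first
            for k, v in table:
                if k == key:
                    return v
        return None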
• 49. CASSANDRA CONCEPTUAL SUMMARY... IT’S A DISTRIBUTED HASH TABLE Gossip based peer-to-peer “ring” with no distinguished nodes and no single point of failure Consistent hashing distributes workload, and a simple replication strategy provides fault tolerance and improved throughput WITH TUNABLE CONSISTENCY Based on a quorum protocol to ensure consistency And simple repair mechanisms to stay available during partial system failures AND A SIMPLE, SCHEMA-FREE DATA MODEL It’s just a key-value store Whose values are multi-dimensional sorted maps
  • 50. ADVANCED CASSANDRA - A case study - SPATIAL DATA IN A DHT
• 51. A FIRST PASS THE ORDER PRESERVING PARTITIONER CASSANDRA’S PARTITIONING STRATEGY IS PLUGGABLE Partitioner maps keys to nodes Random partitioner destroys locality by hashing Order preserving partitioner retains locality, storing keys in natural lexicographical order around the ring (ring diagram: with tokens a, h, m, u, and z, the keys alice, bob, and sam sit in lexicographic order)
• 52. ORDER PRESERVING PARTITIONER EXACT MATCH: yes. RANGE: yes, on a single dimension. PROXIMITY: ?
• 53. SPATIAL DATA IT’S INHERENTLY MULTIDIMENSIONAL (diagram: a point x at coordinates (2, 2) needs both dimensions to locate it)
• 56. Z-VALUE (diagram: a point x on the Z-order curve; interleaving its coordinate bits gives z-value 14)
• 57. GEOHASH SIMPLE TO COMPUTE Interleave the bits of the two coordinates (equivalent to a binary encoding of a pre-order traversal!) Base32 encode the result AWESOME CHARACTERISTICS Arbitrary precision Human readable Sorts lexicographically (example: the bits 01101 encode to “e”)
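A sketch of the encoding in Python, assuming the standard geohash base32 alphabet: each bit records which half of the longitude or latitude range the point falls in, and every run of 5 bits becomes one character.

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=8):
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits = []
    while len(bits) < precision * 5:
        # Bisect longitude, then latitude, emitting one bit per bisection
        for rng, value in ((lon_rng, lon), (lat_rng, lat)):
            mid = (rng[0] + rng[1]) / 2
            bits.append(1 if value >= mid else 0)
            rng[0 if value >= mid else 1] = mid
    chars = [bits[i:i + 5] for i in range(0, precision * 5, 5)]
    return "".join(BASE32[int("".join(map(str, b)), 2)] for b in chars)

print(geohash(38.6554420, -90.2992910))  # "9yzgcjn0", the key on the next slide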
• 58. DATA MODEL
{
  "record-index": {
    "9yzgcjn0:moonrise hotel": {          # key is <geohash>:<id>
      "": ["", 1287040737182],
    },
    ...
  },
  "records": {
    "moonrise hotel": {
      "latitude": ["38.6554420", 1287040737182],
      "longitude": ["-90.2992910", 1287040737182],
      ...
    }
  }
}
• 59. BOUNDING BOX E.G., MULTIDIMENSIONAL RANGE (diagram: “Gimme stuff in the big box!” becomes multiple range queries over the z-ordered cells, e.g. “gimme 2 to 3” and “gimme 4 to 5”)
  • 60. SPATIAL DATA STILL MULTIDIMENSIONAL DIMENSIONALITY REDUCTION ISN’T PERFECT Clients must • Pre-process to compose multiple queries • Post-process to filter and merge results Degenerate cases can be bad, particularly for nearest-neighbor queries
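To make the pre/post-processing concrete, here is a toy sketch. It reuses the geohash() function sketched above, covers the box with the corners' common geohash prefix (coarse; a real planner emits several covering cells), and post-filters the bleed-over. The Index class is an invented stand-in for the record-index column family:

import bisect

class Index:
    def __init__(self, rows):
        self.rows = sorted(rows)   # [(geohash_key, record)], distinct keys

    def range_query(self, start, end):
        lo = bisect.bisect_left(self.rows, (start,))
        hi = bisect.bisect_left(self.rows, (end,))
        return [record for _, record in self.rows[lo:hi]]

def bbox_query(index, south, west, north, east):
    # Pre-process: common prefix of the corner geohashes covers the box
    sw, ne = geohash(south, west), geohash(north, east)
    prefix = ""
    for a, b in zip(sw, ne):
        if a != b:
            break
        prefix += a
    rows = index.range_query(prefix, prefix + "~")  # one lexicographic slice
    # Post-process: the covering cell bleeds outside the box, so filter
    return [r for r in rows
            if south <= r["lat"] <= north and west <= r["lon"] <= east]

index = Index([
    ("9yzgcjn0:moonrise hotel", {"lat": 38.6554420, "lon": -90.2992910}),
    ("dr5regw3:some nyc spot",  {"lat": 40.7410861, "lon": -73.9896297}),
])
# Corners share a short prefix covering St. Louis, so one slice suffices here
print(bbox_query(index, 38.0, -91.0, 39.0, -90.2))   # only the moonrise hotel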
• 64. Z-CURVE LOCALITY (diagram: points adjacent on the Z-curve are usually, but not always, adjacent in space)
  • 65. THE WORLD IS NOT BALANCED Credit: C. Mayhew & R. Simmon (NASA/GSFC), NOAA/NGDC, DMSP Digital Archive
• 66-69. TOO MUCH LOCALITY (diagram sequence: with an order preserving partitioner, the nodes holding San Francisco are overloaded (“I’m sad.”) while the other nodes sit idle (“I’m bored.” “Me too.” “Let’s play xbox.”))
  • 71. HELLO, DRAWING BOARD SURVEY OF DISTRIBUTED P2P INDEXING An overlay-dependent index works directly with nodes of the peer-to-peer network, defining its own overlay An over-DHT index overlays a more sophisticated data structure on top of a peer-to-peer distributed hash table
  • 72. ANOTHER LOOK AT POSTGIS MIGHT WORK, BUT The relational transaction management system (which we’d want to change) and access methods (which we’d have to change) are tightly coupled (necessarily?) to other parts of the system Could work at a higher level and treat PostGIS as a black box • Now we’re back to implementing a peer-to-peer network with failure recovery, fault detection, etc... and Cassandra already had all that. • It’s probably clear by now that I think these problems are more difficult than actually storing structured data on disk
  • 73. LET’S TAKE A STEP BACK
• 74-77. EARTH (image sequence: the Earth, recursively subdivided into smaller and smaller bounding boxes)
• 79-81. DATA MODEL
{
  "record-index": {
    "layer-name:37.875, -90:40.25, -101.25": {
      "38.6554420, -90.2992910:moonrise hotel": ["", 1287040737182],
      ...
    },
  },
  "record-index-meta": {
    "layer-name:37.875, -90:40.25, -101.25": {
      "split": ["false", 1287040737182],
    },
    "layer-name:37.875, -90:42.265, -101.25": {
      "split": ["true", 1287040737182],
      "child-left": ["layer-name:37.875, -90:40.25, -101.25", 1287040737182],
      "child-right": ["layer-name:40.25, -90:42.265, -101.25", 1287040737182],
    }
  }
}
• 82. SPLITTING IT’S PRETTY MUCH JUST A CONCURRENT TREE Splitting shouldn’t lock the tree for reads or writes, and failures shouldn’t cause corruption • Splits are optimistic, idempotent, and fail-forward • Instead of locking, writes are replicated to the splitting node and the relevant child[ren] while a split operation is taking place • Cleanup occurs after the split is completed and all interested nodes are aware that the split has occurred • Cassandra writes are idempotent, so splits are too - if a split fails, it is simply retried Split size: a tunable knob for balancing locality and distributedness The other hard problem with concurrent trees is rebalancing - we just don’t do it! (more on this later)
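A toy, in-memory sketch of that split protocol; the dicts stand in for the record-index and record-index-meta column families shown earlier, whereas real splits happen via idempotent Cassandra writes:

def split(index, meta, node, midpoint):
    # 1. Optimistically write both children; the writes are idempotent,
    #    so a failed split is simply retried (fail-forward, no locks).
    rows = index[node]
    index[node + ":L"] = {k: v for k, v in rows.items() if k < midpoint}
    index[node + ":R"] = {k: v for k, v in rows.items() if k >= midpoint}
    # 2. Publish the split. A reader that still believes `node` is a leaf
    #    is fine: writes go to the parent and children during the split.
    meta[node] = {"split": "true",
                  "child-left": node + ":L", "child-right": node + ":R"}
    # 3. Clean up only after all interested nodes have seen the split.
    index[node] = {}

index = {"root": {"alice": 1, "meg": 2, "zed": 3}}
meta = {}
split(index, meta, "root", "meg")
print(index["root:L"], index["root:R"])   # {'alice': 1} {'meg': 2, 'zed': 3}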
• 83. THE ROOT IS HOT MIGHT BE A DEAL BREAKER For a tree to be useful, it has to be traversed • Typically, tree traversal starts at the root • The root is the only discoverable node in our tree Traversing through the root meant reading the root for every read or write below it - unacceptable • Lots of academic solutions - most promising was a skip graph, but that required O(n log(n)) data - also unacceptable • Minimum tree depth was proposed, but then you just get multiple hot-spots at your minimum-depth nodes
  • 84. BACK TO THE BOOKS LOTS OF ACADEMIC WORK ON THIS TOPIC But academia is obsessed with provable, deterministic, asymptotically optimal algorithms And we only need something that is probably fast enough most of the time (for some value of “probably” and “most of the time”) • And if the probably good enough algorithm is, you know... tractable... one might even consider it qualitatively better!
  • 85. - THE ROOT - A HOT SPOT AND A SPOF
• 86. (diagram: the access pattern we have - everything through the root - versus the one we want - traversals spread across the tree)
  • 87. THINKING HOLISTICALLY WE OBSERVED THAT Once a node in the tree exists, it doesn’t go away Node state may change, but that state only really matters locally - thinking a node is a leaf when it really has children is not fatal SO... WHAT IF WE JUST CACHED NODES THAT WERE OBSERVED IN THE SYSTEM!?
  • 88. CACHE IT STUPID SIMPLE SOLUTION Keep an LRU cache of nodes that have been traversed Start traversals at the most selective relevant node If that node doesn’t satisfy you, traverse up the tree Along with your result set, return a list of nodes that were traversed so the caller can add them to its cache
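A sketch of that cache in Python; the bounding-box layout and helper functions are invented for illustration:

from collections import OrderedDict

def contains(outer, inner):
    # Boxes are (south, west, north, east)
    return (outer[0] <= inner[0] and outer[1] <= inner[1] and
            outer[2] >= inner[2] and outer[3] >= inner[3])

def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

class NodeCache:
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.nodes = OrderedDict()    # node key -> bounding box, in LRU order

    def best_start(self, query_box, root):
        # Start at the most selective (smallest) cached node covering the
        # query; the caller walks up the tree if it isn't satisfied there.
        hits = [n for n, box in self.nodes.items() if contains(box, query_box)]
        return min(hits, key=lambda n: area(self.nodes[n])) if hits else root

    def remember(self, traversed):
        # Results come back with the list of nodes the query touched
        for node, box in traversed:
            self.nodes[node] = box
            self.nodes.move_to_end(node)
        while len(self.nodes) > self.capacity:
            self.nodes.popitem(last=False)    # evict least-recently used

cache = NodeCache()
cache.remember([("root", (-90, -180, 90, 180)),
                ("stl",  (37.875, -101.25, 40.25, -90))])
print(cache.best_start((38.5, -91, 39, -90.2), "root"))   # "stl"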
  • 92. KEY CHARACTERISTICS PERFORMANCE Best case on the happy path (everything cached) has zero read overhead Worst case, with nothing cached, O(log(n)) read overhead RE-BALANCING SEEMS UNNECESSARY! Makes worst case more worser, but so far so good
• 93-94. DISTRIBUTED TREE SUPPORTED QUERIES EXACT MATCH, RANGE, PROXIMITY, SOMETHING ELSE I HAVEN’T EVEN HEARD OF... IN MULTIPLE DIMENSIONS!
  • 95. LIFE OF A REQUEST
• 96-103. THE BIRDS ‘N THE BEES (diagram, built up over several slides: ELB -> gate -> service -> worker pool -> cass/index)
ELB: load balancing; AWS service
gate: authentication; forwarding
service: business logic - basic validation
cass: record storage
worker pool: business logic - storage/indexing
index: awesome sauce for querying
  • 104. ELB •Traffic management •Control which AZs are serving traffic •Upgrades without downtime •Able to remove an AZ, upgrade, test, replace •API-level failure scenarios •Periodically runs healthchecks on nodes •Removes nodes that fail
• 105. GATE •Basic auth •HTTP proxy to specific services •Services are independent of one another •Auth is decoupled from business logic •First line of defense •Very fast, very cheap •Keeps services from being overwhelmed by poorly authenticated requests
• 106. RABBITMQ •Decouple accepting writes from performing the heavy lifting •Don’t block the client while we write to db/ index •Flexibility in the event of degradation further down the stack •Queues can hold a lot, and can keep accepting writes throughout an incident •Heterogeneous consumers - pass the same message through multiple code paths easily
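A sketch of this pattern with the pika client; the fanout exchange, queue names, and record are all made up, but they show how one durable message can feed independent storage and indexing consumers:

import json
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# Fanout exchange: every bound queue gets a copy of each write, so the
# storage and indexing consumers run independent code paths.
ch.exchange_declare(exchange="writes", exchange_type="fanout")
for queue in ("record-storage", "record-indexing"):
    ch.queue_declare(queue=queue, durable=True)
    ch.queue_bind(queue=queue, exchange="writes")

# Accepting the write is decoupled from the heavy lifting: the client can
# be acknowledged as soon as the message is durably queued.
record = {"name": "moonrise hotel", "lat": 38.6554420, "lon": -90.2992910}
ch.basic_publish(exchange="writes", routing_key="",
                 body=json.dumps(record),
                 properties=pika.BasicProperties(delivery_mode=2))  # persist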
  • 107. AWS
  • 108. EC2 • Security groups • Static IPs • Choose your data center • Choose your instance type • On-demand vs. Reserved
• 109. ELASTIC BLOCK STORE • Storage devices that can be anywhere from 1GB to 1TB • Snapshotting • Automatic replication • Mobility
  • 110. ELASTIC LOAD BALANCING • Distribute incoming traffic • Automatic scaling • Health checks
• 111. SIMPLE STORAGE SERVICE • File sizes can be up to 5TB • Unlimited number of files • Individual read/write access credentials
  • 112. RELATIONAL DATABASE SERVICE • MySQL in the cloud • Manages replication • Specify instance types based on need • Snapshots
  • 114. WHY IS THIS NECESSARY? • Easier DevOps integration • Reusable modules • Infrastructure as code • One word: WINNING
• 116. SAMPLE MANIFEST
# /root/learning-manifests/apache2.pp
package { 'apache2':
  ensure => present;
}
file { '/etc/apache2/apache2.conf':
  ensure => file,
  mode   => 600,
  notify => Service['apache2'],
  source => '/root/learning-manifests/apache2.conf',
}
service { 'apache2':
  ensure    => running,
  enable    => true,
  subscribe => File['/etc/apache2/apache2.conf'],
}
  • 117. AN EXAMPLE TERMINAL
  • 119. GET ‘ER DONE • Revision control • Automate build process • Automate testing process • Automate deployment Local code changes should result in production deployments.
• 120. DON’T FORGET TO DEBIANIZE • All codebases must be debianized • Open source project not debianized yet? Fork the repo and do it yourself! • Take the time to teach others • Debian directories can easily be reused after a simple search and replace
  • 121. REPOMAN HTTPS://GITHUB.COM/SYNACK/REPOMAN.GIT repoman upload myrepo sg-example.tar.gz repoman promote development/sg-example staging
  • 122. HUDSON
• 123. MAINTAINING MULTIPLE ENVIRONMENTS • Run unit tests in a development environment • Promote to staging • Run system tests in a staging environment • Run consumption tests in a staging environment • Promote to production Congratz, you’ve now automated yourself out of a job.
  • 124. THE MONEY MAKER Github Plugin
  • 125. TYING IT ALL TOGETHER TERMINAL
• 126. FLUME Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. syslog on steroids
• 128. AGENTS (physical host i-192df98; each logical node has a source and a sink)
twitter_stream: twitter("dsmitts", "mypassword", "url") -> agentSink(35853)
tail: tail("/var/log/nginx/access.log") -> agentSink(35853)
hdfs_writer: collectorSource(35853) -> collectorSink("hdfs://namenode.sg.com/bogus/", "logs")
• 129. RELIABILITY Three delivery modes: END-TO-END, STORE ON FAILURE, BEST EFFORT
  • 130. GETTIN’ JIGGY WIT IT Custom Decorators
  • 131. HOW DO WE USE IT?
• 132. PERSONAL EXPERIENCES • #flume • Automation was gnarly • It’s never a good day when Eclipse is involved • Resource hog (at first)
  • 133. We’ buildg kick-s ols f vops  x, tpt,  csume da cnect  a locn MARCH 2011