Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
Anti-social Databases
1. Open source, high performance database
Anti-social Databases: NoSQL and MongoDB
Will LaForest
Senior Director of 10gen Federal
will@10gen.com
@WLaForest
1
2. SQL Dynamic
invented Web Content released
10gen
Web applications founded
Oracle Client
IBM’s IMS founded Server SOA
BigTable
1965 1970 1975 1980 1985 1990 1995 2000 2005 2010
Codd publishes PC’s gain 3 tier Cloud
relational model paper traction architecture Computing
in 1970
Brewer’s Cap NoSQL
WWW born Movement
born
2
4. Category BlogCategory
N 1
N
1
N 1 N 1
Author Blog BlogTag
N
N
Content Tag
4
5. • Data stored in a RDBMS is very compact (disk was
more expensive)
• SQL and RDBMS made queries flexible with rigid
schemas
• Rigid schemas helps optimize joins and storage
• Massive ecosystem of tools, libraries and integrations
• Been around 40 years!
5
6. • Gartner uses the 3Vs to define
• Volume - Big/difficult/extreme volume is relative
• Variety
– Changing or evolving data
– Uncontrolled formats
– Does not easily adhere to a single schema
– Unknown at design time
• Velocity
– High or volatile inbound data
– High query and read operations
– Low latency
6
7. VOLUME/VELOCITY &
NEW ARCHITECTURES
• Systems need to scale horizontal
not vertically
• Commodity servers
• Cloud Computing
DATA VARIETY &
VOLATILITY
• Extremely difficult to
find a single fixed
schema
• Don’t know data
schema a-priori AGILE DEVELOPMENT
• Iterative & continuous
• New and emerging Apps
7
8. • Non-relational been hanging around (heard of
MUMPS?)
• Modern NoSQL theory and offerings started in early
2000s
• NoSQL = Not Only SQL
• A collection of very different products
• Alternatives to relational databases when a bad fit
• Common motivations
– Horizontally scalable (commodity server/cloud computing)
– Schema Flexibility
8
9. • I group by data model
– Some people use or include data arangement
• Data arrangement is important and cuts across data
models (see my talk at BigData DC next week)
– Column/Document
– Consistent hashing partitioning
– Range based partitioning
• Key/Value
• Big Table Descendents
• Document Oriented
• Graph
9
10. • Value (Data) mapped to a key (think primary)
• Some typed, some just BLOBs
• Fast hash based queries
• No range queries, no ordering, simple data model
“Will” • “will@10gen.com”
“Chris” • “chris@10gen.com”
Will-obj • [4e61 6d65 3a57 …
10
11. • Data stored on disk in a column oriented fashion
• Predominantly hash based indexing
• Rudimentary or no secondary indexes
• Range queries and ordering on one dimension (row key)
• Some consistent hashing some range based
row keys column family column family
“contact” “personal”
“twitter”: “wlaforest”
“bio”: “Will attended … ”
row Will “email”: “will@10gen.com
“picture”: …
“phone”: “555-555-5555”
“bio”: “ … ”
“email”: “chris@10gen.com”
row Chris “picture”: …
“phone”: “555-555-5555”
“hobby”: “golf”
11
12. • Data modeled as documents or objects (XML and JSON)
• No fixed schema
• Richest data model
• Consistent hashing and range based partitioning
{name: “will”, name: “jeff”, {name: “brendan”,
eyes: “blue”, eyes: “blue”, aliases: [“el diablo”]}
birthplace: “NY”, height: 72,
aliases: [“bill”, “la boss: “ben”}
ciacco”], {name: “matt”,
gender: ”???”, pizza: “DiGiorno”,
boss: ”ben”} name: “ben”, height: 72,
hat: ”yes”} boss: 555.555.1212}
12
13. • Not a database!
• Map Reduce on HDFS or other data source
• Great for grinding through data
• When you want to use it
– Can’t use a index
– Distributing custom algorithms
– ETL
• Many NoSQL offerings have native map reduce
functionality
13
15. • 2007 founded
• 2009 first release of MongoDB
• MongoDB is open source
• 10gen has a Redhat-like business model
– Subscriptions (subscriber build, support)
– Training
– Consulting
• ~80M in funding
– Sequoia, NEA, In-Q-Tel, Union Square Ventures, Flybridge
Capital,
15
16. #2 on Indeed’s Fastest Growing Jobs Jaspersoft BigData Index
Demand for
MongoDB, the
document-oriented
NoSQL database, saw
the biggest spike
with over 200%
growth in 2011.
451 Group
Google Searches “MongoDB increasing its dominance”
16
19. • Scale horizontally over commodity hardware
• Agility essential (schema free/heterogeneous
interface)
• RDBMSs great so keep what works
– Rich data models
– Adhoc queries
– Fully featured indexes
• What doesn’t distribute well?
– Long running multi-row transactions
– Join
– Both artifacts of the relational data model
19
20. • Data stored as documents (JSON)
• Schema free
• CRUD operations – (Create Read Update Delete)
• Atomic document operations
• Consistent (but is tunable… advanced topic)
• Rich indexing (secondary, geospatial, covered)
• Ad hoc Queries like SQL
– Equality
– Ranges
– Regular expression searches
– Geospatial
• Replication – HA, read scalability, geo centric reads
• Sharding (sometimes called partitioning) for scalability
20
23. var p = { author: “roger”,
date: new Date(),
text: “Spirited Away”,
tags: *“Tezuka”, “Manga”+-
> db.posts.save(p)
23
24. >db.posts.find()
{ _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),
author : "roger",
date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)",
text : "Spirited Away",
tags : [ "Tezuka", "Manga" ] }
Notes:
- _id is unique, but can be anything you’d like
24
25. Create index on any Field in Document
// 1 means ascending, -1 means descending
>db.posts.ensureIndex({author: 1})
>db.posts.find({author: 'roger'})
{ _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),
author : "roger",
... }
25
26. • Conditional Operators
– $all, $exists, $mod, $ne, $in, $nin, $nor, $or, $size, $type
– $lt, $lte, $gt, $gte
// find posts with any tags
> db.posts.find( {tags: {$exists: true }} )
// find posts matching a regular expression
> db.posts.find( {author: /^rog*/i } )
// count posts by an author before a certain date
> db.posts.find( {author: ‘roger’, date:, $lt: Sat… -- ).count()
26
31. • Do them client side (Python or R)
• Native Map/Reduce in JS in MongoDB
– Distributes across the cluster with good data locality
• New aggregation framework
– Declarative (no JS required)
– Pipeline approach (like Unix ps -ax | tee processes.txt | more)
• Hadoop
– Intersect the indexing of MongoDB with the brute force parallelization
of hadoop
– Hadoop MongoDB connector
31
33. • Prod clusters on order of
• 100B objects
• 50k qps per server
• ~1400 nodes
Better data locality In-Memory Auto-Sharding
Caching
Replication /HA
Horizontal Scaling
33
34. • Sharding Details
• Replica Set Details
• Consistency Details
• Common Deployment Scenarios
• Citations
34
45. Value Meaning
<n:integer> Replicate to N members of
replica set
“majority” Replicate to a majority of
replica set members
<m:modeName> Use cutom error mode
name
45
61. 1 1.Query arrives at
mongos
4
mongos
2.mongos routes query
to a single shard
3.Shard returns results
2
of query
3
4.Results returned to
client
Shard 1 Shard 2 Shard 3
61
62. 1 1.Query arrives at
mongos
4
mongos 2.mongos broadcasts
query to all shards
3.Each shard returns
2 results for query
2 2
3
3 3 4.Results combined
and returned to client
Shard 1 Shard 2 Shard 3
62
63. 1 1.Query arrives at mongos
6 2.mongos broadcasts query
mongos to all shards
5 3.Each shard locally sorts
results
2 4.Results returned to
2 mongos
2
4 4 4
5.mongos merge sorts
individual results
3 3 3
6.Combined sorted result
Shard 1 Shard 2 Shard 3 returned to client
63
65. By Shard Routed db.users.find(
{email: “jsr@10gen.com”})
Key
Sorted by Routed in order db.users.find().sort({email:-1})
shard key
Find by non Scatter Gather db.users.find({state:”CA”})
shard key
Sorted by Distributed merge db.users.find().sort({state:1})
sort
non shard
key
65
72. • History of Database Management (http://bit.ly/w3r0dv)
• EMC IDC Study (http://bit.ly/y1mJgJ)
• Gartner & Big Data (http://bit.ly/xvRP3a)
• SQL (http://en.wikipedia.org/wiki/SQL)
• Database Management Systems
http://en.wikipedia.org/wiki/Dbms)
• Dynamo: Amazon’s Highly Available Key-value Store
(http://bit.ly/A8F8oy)
• CAP Theorem (http://bit.ly/zvA6O6)
• NoSQL Google File System and BigTable
(http://oreil.ly/wOXliP)
• NoSQL Movement whitepaper (http://bit.ly/A8RBuJ)
• Sample ERD diagram (http://bit.ly/xV30v)
72
77. Input data MAP Intermediate data REDUCE Output data
1 1
2 2
3 3
77
78. Input data MAP Intermediate data REDUCE Output data
1 1
2 2
3 3
78
79. • Impossible for a distributed computer system to
simultaneously provide all three of the following
guarantees
– Consistency - All nodes see the same data at the same
time.
– Availability - A guarantee that every request receives a
response about whether it was successful or failed.
– Partition tolerance - No set of failures less than total
network failure is allowed to cause the system to respond
incorrectly
79
Editor's Notes
Database requirements are changing … because of i) volume ii) Type of data iii) Agile Development, iv) New architectures.. V) New Apps