Demystifying data structures and algorithms used by database storage engines
Adewumi Sunkanmi D.
Senior Software Engineer at Acronis
working on Advanced Automation, one
of the cloud services offered by Acronis
Cyber Cloud.
Outline
1. Overview of a three-tier application
2. Criteria for selecting the best database for an application
3. Overview of database architecture
4. Types of database storage engines and their tradeoffs
5. Q/A
[Diagram: a client sends POST and GET requests to a server, and the server sends WRITE and READ operations to a Database.]
Which database should we use🤔?
BigTable
Neo4J
How do we select the best database for an application?
1. Structure of the data we want to store: structured, unstructured, graph?
2. Scalability of the database
- Horizontal or vertical scaling
- Sharding (partition data across nodes)
- Replication (copies of data on multiple nodes)
3. Support and familiarity of developers with the database
4. Rate of writes and reads, and how EXACTLY are these operations handled at the hardware level?
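To make sharding and replication concrete, here is a minimal sketch. The placement scheme (hash the key, then put copies on consecutive nodes) and the names `shard_for` and `replicas_for` are assumptions for illustration, not how any particular database does it.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def replicas_for(key: str, num_nodes: int, copies: int) -> list:
    """Place `copies` replicas on consecutive nodes, starting at the key's shard."""
    first = shard_for(key, num_nodes)
    return [(first + i) % num_nodes for i in range(copies)]

# The same key always lands on the same nodes.
print(replicas_for("ben", num_nodes=5, copies=3))
```

Sharding spreads different keys over different nodes, while replication stores the same key on several nodes.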
[Figure: database architecture diagram. Source: https://www.oreilly.com/library/view/high-performance-mysql/9781449332471/ch01.html]
[Figure: parse tree of the query below. Nodes: SELECT, COLS, FROM, WHERE, COL_ID, students, >, score, 70, firstname, lastname]
"SELECT firstname, lastname FROM students WHERE score > 70;"
[Figure: the same architecture diagram, extended with Disk]
Types of storage engines
- Log Structured Merge (LSM) Tree
- Page Oriented (B-Tree)
https://www.cs.umb.edu/~poneil/lsmtree.pdf
Log Structured Merge Tree Storage Engine
The LSM tree is a disk-resident data structure built from immutable files. It is optimized for sequential writes while maintaining acceptable read performance.
Three components
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
[Figure: the incoming writes, in order: ben 300, josh 500, bin 220, ben 177, mia 220, eve 173.]
write ben: 300
Memtable, e.g. red-black tree in RAM
[Figure: ben: 300, then josh: 500, then bin: 220 are inserted into the memtable, which keeps them sorted by key.]
Threshold reached!
flush
[Figure: the memtable is flushed to an SSD/HDD file (SSTable file) T1 that holds ben: 300, bin: 220, josh: 500 in sorted order.]
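The write path above can be sketched as follows. This is a simplified illustration: a sorted list stands in for the red-black tree, the threshold counts keys rather than bytes, and the `Memtable` class is a hypothetical name.

```python
import bisect

class Memtable:
    """Sorted in-memory table. A sorted list stands in for the red-black tree."""

    def __init__(self, threshold: int):
        self.threshold = threshold      # max number of keys before a flush
        self.keys = []                  # keys, kept in sorted order
        self.values = {}                # key -> latest value

    def put(self, key, value):
        if key not in self.values:
            bisect.insort(self.keys, key)
        self.values[key] = value

    def is_full(self) -> bool:
        return len(self.keys) >= self.threshold

    def flush(self):
        """Return the sorted contents (one SSTable) and reset the memtable."""
        sstable = [(k, self.values[k]) for k in self.keys]
        self.keys, self.values = [], {}
        return sstable


memtable = Memtable(threshold=3)
sstables = []                           # oldest first: T1, T2, ...
for key, value in [("ben", 300), ("josh", 500), ("bin", 220)]:
    memtable.put(key, value)
    if memtable.is_full():
        sstables.append(memtable.flush())

print(sstables[0])  # [('ben', 300), ('bin', 220), ('josh', 500)]
```

Because the memtable is already sorted, the flush is a single sequential write of the file.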
[Figure: an SSD/HDD file (SSTable file) of 40MB, divided into four 10MB blocks of sorted key-value pairs:
alexandar: 10, andreas: 50, ...
arik: 500, erling: 200, ...
jan: 11, johan: 300, ...
robert: 499, roy: 200, ...]
In-memory index
The in-memory index keeps only the first key of each block and its byte offset in the file.
Key        Byte offset
alexandar  0
arik       303
jan        400
robert     500
Find(apa)
[Figure: apa sorts after alexandar (byte offset 0) and before arik (byte offset 303) in the in-memory index, so only the block between those two offsets in the segment file is scanned.]
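A lookup such as Find(apa) through the in-memory index can be sketched with a binary search. The keys and offsets follow the slide's table; the offset 400 for jan is an assumption, and `block_range` is a hypothetical helper name.

```python
import bisect
from typing import Optional, Tuple

# Sparse index: first key of each block -> byte offset in the SSTable file.
# The offset for "jan" (400) is assumed for this sketch.
index_keys = ["alexandar", "arik", "jan", "robert"]
index_offsets = [0, 303, 400, 500]

def block_range(key: str) -> Tuple[int, Optional[int]]:
    """Return (start, end) byte offsets of the only block that can hold `key`.

    `end` is None when the block runs to the end of the file.
    """
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        raise KeyError(f"{key!r} sorts before the first key in the file")
    start = index_offsets[i]
    end = index_offsets[i + 1] if i + 1 < len(index_offsets) else None
    return start, end

print(block_range("apa"))  # (0, 303)
```

Only the bytes in that range need to be read from disk and scanned for the key.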
[Figure: the writes continue. ben: 177, mia: 220 and eve: 173 go into the memtable, and when the threshold is reached again it is flushed to a second SSD/HDD file (SSTable file) T2 holding ben: 177, eve: 173, mia: 220.]
Read(ben)
[Figure: the read checks the newest data first. Step 1: the memtable. Step 2: the most recent SSTable file, T2. Step 3: the older SSTable file, T1.]
How do we handle update?
Since we return from the most recent memtable or segment file, we just insert the key with the new value. ben will be returned from T2, not T1.
How do we handle delete?
Insert the key with a delete marker called a tombstone. Since this will be the most recent entry for the key, we can tell it has been deleted, e.g. ben->null
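Reads, updates and deletes can be sketched together. Plain dicts stand in for the memtable and the SSTable files here, and the `TOMBSTONE` sentinel and `read` function are hypothetical names for illustration.

```python
TOMBSTONE = object()   # delete marker, playing the role of the slide's ben->null

def read(key, memtable, sstables):
    """Look in the memtable first, then in SSTables from newest to oldest."""
    for table in [memtable] + sstables[::-1]:
        if key in table:
            value = table[key]
            return None if value is TOMBSTONE else value
    return None   # the key was never written

t1 = {"ben": 300, "bin": 220, "josh": 500}   # older SSTable file
t2 = {"ben": 177, "eve": 173, "mia": 220}    # newer SSTable file
memtable = {}

print(read("ben", memtable, [t1, t2]))    # 177, the update in T2 wins over T1

memtable["ben"] = TOMBSTONE               # a delete is just another write
print(read("ben", memtable, [t1, t2]))    # None, the tombstone hides older values
print(read("josh", memtable, [t1, t2]))   # 500
```

Neither T1 nor T2 is modified by the update or the delete, which is what keeps the files immutable.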
But now we have duplicates, space wastage :(
Yes, but compaction will help.
Compaction
Compaction merges SSTable files and keeps only the most recent value for each key, e.g. ben: 177 from T2 replaces ben: 300 from T1.
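A minimal sketch of compaction, assuming each SSTable is a sorted list of (key, value) pairs and `None` marks a tombstone. Real engines typically stream a k-way merge over the files instead of building a dict in memory; the `compact` function is a hypothetical name.

```python
def compact(sstables, tombstone=None):
    """Merge SSTables (oldest first) into one sorted table.

    Later tables overwrite earlier ones, so each key keeps only its most
    recent value. Keys whose latest value is the tombstone are dropped.
    """
    merged = {}
    for table in sstables:               # oldest -> newest
        for key, value in table:
            merged[key] = value
    return [(k, merged[k]) for k in sorted(merged) if merged[k] is not tombstone]

t1 = [("ben", 300), ("bin", 220), ("josh", 500)]
t2 = [("ben", 177), ("eve", 173), ("mia", 220)]
print(compact([t1, t2]))
# [('ben', 177), ('bin', 220), ('eve', 173), ('josh', 500), ('mia', 220)]
```

After compaction the duplicate ben: 300 is gone and the two files have become one.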
What if we don't find the key? Do we search all the SSTable files?
Optimize reads with Bloom filters
Key present? A Bloom filter gives a strict NO if the key is not in the file; otherwise the answer is maybe or maybe not (99% accurate).
https://brilliant.org/wiki/bloom-filter/
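A small Bloom filter sketch follows. The bit-array size, the number of hashes and the use of salted SHA-256 are assumptions chosen for readability; production engines use faster non-cryptographic hashes and pack the bits.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive several bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        """False means definitely absent. True means maybe present."""
        return all(self.bits[pos] for pos in self._positions(key))


t2_filter = BloomFilter()
for key in ["ben", "eve", "mia"]:
    t2_filter.add(key)

print(t2_filter.might_contain("ben"))   # True
```

A key that was added always answers True. A key that was never added usually answers False, so the read can skip that SSTable file without touching the disk.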
What if power failure happens before data is flushed to disk?
1. Persist each write in an append-only log file, the write-ahead log (WAL), before writing to the in-memory table.
2. Recreate the memtable from the last Log Sequence Number.
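The two recovery steps can be sketched like this. The JSON-lines log format and the names `WriteAheadLog` and `replay` are assumptions, and for simplicity the sketch replays the whole log rather than starting from a Log Sequence Number.

```python
import json
import os
import tempfile

class WriteAheadLog:
    def __init__(self, path: str):
        self.file = open(path, "a", encoding="utf-8")   # append-only

    def append(self, key, value) -> None:
        """Persist the write before it is applied to the memtable."""
        self.file.write(json.dumps({"key": key, "value": value}) + "\n")
        self.file.flush()
        os.fsync(self.file.fileno())   # ask the OS to push the bytes to disk

    def close(self) -> None:
        self.file.close()

def replay(path: str) -> dict:
    """Rebuild the memtable by re-applying every logged write in order."""
    memtable = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            memtable[record["key"]] = record["value"]
    return memtable

log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = WriteAheadLog(log_path)
memtable = {}
for key, value in [("ben", 177), ("mia", 220), ("eve", 173)]:
    wal.append(key, value)     # 1. log first
    memtable[key] = value      # 2. then update the memtable
wal.close()

# After a crash the memtable is lost, but the log file survives.
recovered = replay(log_path)
print(recovered == memtable)   # True
```

Once a memtable has been flushed to an SSTable file, the part of the log that covers it is no longer needed.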
Where is the LSM tree storage engine used?
1. Apache Cassandra
2. WiredTiger
3. InfluxDB
4. Yugabyte DB
5. ScyllaDB
6. CockroachDB
7. Google’s BigTable
8. RocksDB
Types of storage engines
- Log Structured Merge (LSM) Tree
- Page Oriented (B-Tree)
https://carlosproal.com/ir/papers/p121-comer.pdf
B-Trees
B trees are page-oriented indexing structures
Important notes on B-tree
1. Store key value pairs (sorted by key)
2. Self balancing
3. Often used for indexing
4. Mutable data structure (in-place update)
5. Each node is a fixed-size block/page, e.g. 4KB
6. Can only read or write one page at a time
Anatomy of B-Tree
[Figure: a B-tree whose root page holds the keys 12 30 50 120. Each gap between keys points to a child page covering one key range: key < 12, key [12, 30), key [30, 50), key [50, 120), key [120, inf). Child pages shown: 13 23 25 27; 31 37 42 49; 52 60 69 90. Below the last of these, the leaf page for key [69, 90) holds 69 70 78 85, each with its value.]
[Figure: a database page, from https://sqlbak.com/academy/database-page]
NOTE: Leaf page contains both the key and value
Branching factor = 5
Depth = 3
READ(78)
[Figure: the search starts at the root page, follows the pointer for key [50, 120) to the page 52 60 69 90, then follows the pointer for key [69, 90) to the leaf page 69 70 78 85.]
found!
Searching for a key is fast because we do not scan all keys, only the keys within the matching range at each level. It takes O(log n), where n is the total number of keys.
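The descent for READ(78) can be sketched as below. The tree is a smaller, hypothetical one (not the exact pages on the slides), and `Page`, `leaf` and `search` are names made up for this illustration.

```python
import bisect

class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted keys in this page
        self.children = children    # child pages for an internal page, else None
        self.values = values        # values for a leaf page, else None

def leaf(keys):
    return Page(keys, values=[f"val{k}" for k in keys])

def search(root, key):
    """Return (value, pages_visited); value is None if the key is absent."""
    page, visited = root, 1
    while page.children is not None:
        # Child i covers the key range [keys[i-1], keys[i]).
        page = page.children[bisect.bisect_right(page.keys, key)]
        visited += 1
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i], visited
    return None, visited

root = Page([50], children=[
    Page([30], children=[leaf([12, 13, 23]), leaf([30, 31, 42])]),
    Page([69, 90], children=[leaf([52, 60]), leaf([69, 70, 78, 85]), leaf([90, 120])]),
])

print(search(root, 78))   # ('val78', 3)
print(search(root, 77))   # (None, 3)
```

Each lookup reads one page per level, so the number of page reads equals the depth of the tree.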
INSERT(87)
Branching factor = 5
[Figure: the parent page holds 60 69 90. The leaf page for key [69, 90) holds 69 70 78 85 86 with their values, and inserting 87 gives it six keys.]
Branching factor exceeded! > 5
Create new page
[Figure: the leaf is split into two pages, 69 70 78 and 85 86 87.]
Add 85 to parent page
[Figure: the parent page becomes 60 69 85 90.]
What if the parent page is full?
Split it
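The leaf split for INSERT(87) can be sketched as follows. Pages are plain sorted lists of keys, values are omitted, and the limit of 5 is treated as the maximum number of keys per page, as on the slides. `insert_into_leaf` is a hypothetical helper, and splitting a full parent is left out.

```python
import bisect

BRANCHING_FACTOR = 5   # max keys per page in this sketch

def insert_into_leaf(leaf, parent_keys, key):
    """Insert `key` into a sorted leaf and split the leaf if it overflows.

    Returns the list of leaf pages that replace the original leaf.
    `parent_keys` is updated in place with the new separator key.
    """
    bisect.insort(leaf, key)
    if len(leaf) <= BRANCHING_FACTOR:
        return [leaf]
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    bisect.insort(parent_keys, right[0])   # first key of the new page
    return [left, right]

parent = [60, 69, 90]
pages = insert_into_leaf([69, 70, 78, 85, 86], parent, 87)
print(pages)    # [[69, 70, 78], [85, 86, 87]]
print(parent)   # [60, 69, 85, 90]
```

If the parent itself overflows after receiving the separator, the same split is applied one level up.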
How does update work?
1. Find the leaf page with key
2. Edit the row
3. Overwrite the page
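The three update steps can be illustrated with fixed-size pages in a file. This is only a sketch: the file layout, the page number and the helper names `read_page` and `write_page` are assumptions, and the B-tree lookup that finds the page is skipped.

```python
import os
import tempfile

PAGE_SIZE = 4096   # fixed page size, 4KB

def read_page(f, page_no: int) -> bytearray:
    f.seek(page_no * PAGE_SIZE)
    return bytearray(f.read(PAGE_SIZE))

def write_page(f, page_no: int, page: bytes) -> None:
    if len(page) != PAGE_SIZE:
        raise ValueError("a page must keep its fixed size")
    f.seek(page_no * PAGE_SIZE)
    f.write(page)

path = os.path.join(tempfile.mkdtemp(), "btree.db")
with open(path, "wb") as f:
    f.write(bytes(PAGE_SIZE * 3))      # three empty pages

with open(path, "r+b") as f:
    page = read_page(f, 1)             # 1. find the leaf page (page 1 here)
    page[0:5] = b"hello"               # 2. edit the row inside the page
    write_page(f, 1, page)             # 3. overwrite the page in place

print(os.path.getsize(path))           # 12288, the file did not grow
```

Unlike the LSM tree, the old bytes are overwritten at the same position instead of a new version being appended.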
LSM trees Vs B-Trees storage engine
LSM Tree | B-Tree
Optimized for write | Optimized for read
Compressed better (no fragmentation) | Fragmentation wastes space
There can be duplicates before compaction | Each key exists in exactly one place
Spikes in write can cause slow compaction due to many SSTable files. Can cause Out of Memory Error (OOM) | Strong transaction support
Space optimization in B-tree
[Figure: two B-trees over the same data, a primary index (primary key index) and a secondary index.]
If the leaf pages of both the primary index and the secondary index contain the key and the value, the value is stored twice. DUPLICATE!
Instead, both indexes can store a value offset (smaller in size) in their leaf pages. The offset points into a heap file that holds the values: val1, val2, val3, val4, val5, ...
The cost is extra disk I/O to fetch the value from the heap file. So you can store important columns in the leaf page and less important columns in the heap file.
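A minimal sketch of the heap-file idea, with a list standing in for the heap file and dicts standing in for the two B-tree indexes. The student rows and all names here are made up for illustration.

```python
heap_file = []   # stand-in for the on-disk heap file

def heap_append(row: dict) -> int:
    """Store a row once and return its offset (here, a list position)."""
    heap_file.append(row)
    return len(heap_file) - 1

primary_index = {}     # id -> offset
secondary_index = {}   # lastname -> list of offsets

students = [
    {"id": 1, "firstname": "Ben", "lastname": "Stone", "score": 77},
    {"id": 2, "firstname": "Mia", "lastname": "Stone", "score": 91},
    {"id": 3, "firstname": "Eve", "lastname": "Park", "score": 64},
]
for row in students:
    offset = heap_append(row)
    primary_index[row["id"]] = offset
    secondary_index.setdefault(row["lastname"], []).append(offset)

def by_id(student_id):
    return heap_file[primary_index[student_id]]          # extra hop to the heap

def by_lastname(lastname):
    return [heap_file[o] for o in secondary_index.get(lastname, [])]

print(by_id(2)["firstname"])                              # Mia
print([r["firstname"] for r in by_lastname("Stone")])     # ['Ben', 'Mia']
```

Each row is stored once, and both indexes hold only small offsets. The price is the extra read from the heap file on every lookup.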
@gifted_dl
Adewumi Sunkanmi D.

A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 

Recently uploaded (20)

How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 

Database Storage Engine Internals

  • 1. Demystifying data structures and algorithms adopted by database storage engine. Adewumi Sunkanmi D.
  • 2. Demystifying data structures and algorithms used by database storage engine
  • 3. Adewumi Sunkanmi D. Senior Software Engineer at Acronis working on Advanced Automation, one of the cloud services offered by Acronis Cyber Cloud.
  • 4. Outline 1. Overview of a three-tier application 2. Criteria for selecting the best database for an application 3. Overview of database architecture 4. Types of database storage engines and their tradeoffs 5. Q/A
  • 21. How do we select the best database for an application? 1. Structure of the data we want to store: structured, unstructured, or graph? 2. Scalability of the database: horizontal or vertical scaling, sharding (partitioning data across nodes), replication (copies of data on multiple nodes). 3. Support for the database and developers' familiarity with it. 4. Rate of writes and reads, and how EXACTLY are these operations handled at the hardware level?
  • 23. Query example: “SELECT firstname, lastname FROM students WHERE score > 70;” broken into its parts: SELECT (columns firstname, lastname), FROM (students), WHERE (score > 70).
  • 26. Types of storage engines - Log Structured Merge (LSM) Tree - Page Oriented (B-Tree)
  • 28. Log Structured Merge Tree Storage Engine. The LSM tree is an immutable, disk-resident data structure optimized for sequential writes while maintaining acceptable read performance.
  • 29. Log Structured Merge Tree Storage Engine Three components 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table)
  • 30. Log Structured Merge Tree Storage Engine. Example writes, in arrival order: ben 300, josh 500, bin 220, ben 177, mia 220, eve 173.
  • 31. write ben: 300 goes into the memtable (e.g. a red-black tree in RAM).
  • 32. write josh: 500. The memtable now holds ben: 300, josh: 500.
  • 33. write bin: 220. The memtable now holds ben: 300, bin: 220, josh: 500. Threshold reached!
  • 34. flush: the memtable is written, sorted by key, to an SSTable file T1 on SSD/HDD: ben: 300, bin: 220, josh: 500.
  • 35. Example of a larger SSD/HDD file (SSTable file): 40MB.
  • 36. The 40MB SSTable file shown as four sorted 10MB blocks: alexandar: 10, andreas: 50, … | arik: 500, erling: 200, … | jan: 11, johan: 300, … | robert: 499, roy: 200, …
  • 37. In-memory index (key, byte offset), one entry for the first key of each block: alexandar 0, arik 303, jan (offset unclear in the transcript), robert 500.
  • 38. Find(apa): apa sorts between the index keys alexandar (offset 0) and arik (offset 303), so only that byte range of the SSD/HDD file (segment file) has to be scanned.
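
The Find(apa) lookup can be sketched as a binary search over the in-memory index. This is a minimal illustration, not the engine's actual code: the function name `block_range` is hypothetical, and the byte offset for "jan" is not legible in the transcript, so the value 400 below is an assumption.

```python
import bisect

# Sparse in-memory index: first key of each block and its byte offset.
# Keys follow the slide; the offset 400 for "jan" is an assumption.
index_keys = ["alexandar", "arik", "jan", "robert"]
index_offsets = [0, 303, 400, 500]

def block_range(key):
    """Return (start, end) byte offsets of the only block that could hold key.

    end is None when the block runs to the end of the file.
    Returns None when key sorts before every indexed key.
    """
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return None
    start = index_offsets[i]
    end = index_offsets[i + 1] if i + 1 < len(index_offsets) else None
    return start, end

print(block_range("apa"))  # (0, 303): scan only the first block
```

Because the index holds one entry per block rather than one per key, it stays small enough to keep in RAM while still bounding each disk read to a single block.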
  • 39. New writes go to the memtable: ben: 177, mia: 220, eve: 173.
  • 40. flush: the memtable is written to a second SSD/HDD file (SSTable file) T2: ben: 177, eve: 173, mia: 220.
  • 41. Read(ben)
  • 42. Read(ben): Step 1.
  • 43. Read(ben): Step 1, Step 2.
  • 44–45. Read(ben): Step 1, Step 2, Step 3.
  • 46. How do we handle update? Since reads return the value from the most recent memtable or segment file, we just insert the key with the new value. ben will be returned from T2, not T1.
  • 47. How do we handle delete? Insert the key with a delete marker called a tombstone. Since this will be the most recent entry, we can tell the key has been deleted, e.g. ben->null.
  • 48. But now we have duplicates, which wastes space :(
  • 49. Yes, but compaction will help.
  • 50. Compaction.
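
The write, flush, read, update, delete and compaction steps from the slides above can be sketched in a few lines. This is a toy model under stated assumptions: the class `TinyLSM` and its method names are hypothetical, a plain dict stands in for the red-black tree memtable, SSTables are in-memory sorted lists rather than files, and lookups scan linearly instead of using the in-memory index.

```python
TOMBSTONE = None  # delete marker, as in the slide's ben->null

class TinyLSM:
    def __init__(self, threshold=3):
        self.threshold = threshold  # memtable size that triggers a flush
        self.memtable = {}
        self.sstables = []          # oldest first; each is a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            self.flush()

    def delete(self, key):
        # A delete is just a write of the tombstone marker.
        self.put(key, TOMBSTONE)

    def flush(self):
        # Write the memtable out as an immutable, key-sorted table.
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all tables, keep the newest value per key, drop tombstones.
        merged = {}
        for table in self.sstables:  # oldest first, so newer entries overwrite
            merged.update(table)
        live = sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)
        self.sstables = [live] if live else []

db = TinyLSM(threshold=3)
for key, value in [("ben", 300), ("josh", 500), ("bin", 220),
                   ("ben", 177), ("mia", 220), ("eve", 173)]:
    db.put(key, value)
print(db.get("ben"))  # 177, read from T2 rather than T1
```

After the six writes there are two tables, T1 and T2, both containing ben. Calling `db.compact()` merges them into one table in which ben appears once with the newer value.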
  • 51. What if we don’t find the key? Do we search all the SSTable files?
  • 52. Optimize reads with Bloom filters. Key present? A strict NO if it is not; otherwise maybe or maybe not (99% accurate). https://brilliant.org/wiki/bloom-filter/
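
A Bloom filter gives exactly the answer pattern on the slide: a definite "no" or a "maybe". The sketch below is illustrative only; the class name, the bit-array size, the number of hash functions and the use of SHA-256 with a seed prefix are all assumptions, not what any particular engine does.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key was definitely never added.
        # True means "maybe": the bits could have been set by other keys.
        return all(self.bits[pos] for pos in self._positions(key))

bloom = BloomFilter()
for key in ["ben", "eve", "mia"]:
    bloom.add(key)
print(bloom.might_contain("ben"))  # True: added keys are never reported absent
```

A read path would keep one such filter per SSTable file and skip any file whose filter answers "no", so a missing key does not force a scan of every file.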
  • 53. What if a power failure happens before data is flushed to disk? 1. Persist each write in an append-only log file, the write-ahead log (WAL), before writing to the in-memory table. 2. Recreate the memtable from the last Log Sequence Number.
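
The two recovery steps can be sketched as follows. This is a simplified illustration: the class `WalMemtable` and the JSON-lines log format are assumptions, the whole log is replayed rather than starting from a Log Sequence Number, and torn (partially written) records are not handled.

```python
import json
import os
import tempfile

class WalMemtable:
    def __init__(self, log_path):
        self.log_path = log_path
        self.memtable = {}
        self._replay()  # rebuild the memtable from any existing log
        self.log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                self.memtable[record["key"]] = record["value"]

    def put(self, key, value):
        # 1. Append to the log and force it to disk first...
        self.log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. ...then apply the write to the in-memory table.
        self.memtable[key] = value

    def close(self):
        self.log.close()

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "wal.log")
    before = WalMemtable(path)
    before.put("ben", 300)
    before.put("ben", 177)
    before.close()               # stands in for a crash before the flush
    after = WalMemtable(path)    # restart: the memtable is rebuilt from the log
    print(after.memtable)
    after.close()
```

Once a memtable has been flushed to an SSTable file, the corresponding part of the log is no longer needed for recovery and can be discarded.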
  • 54. Log Structured Merge Tree Storage Engine. Where is the LSM tree storage engine used? 1. Apache Cassandra 2. WiredTiger 3. InfluxDB 4. Yugabyte DB 5. ScyllaDB 6. CockroachDB 7. Google’s BigTable 8. RocksDB
  • 55. Types of storage engines - Log Structured Merge (LSM) Tree - Page Oriented (B-Tree)
  • 58. B-Trees (https://carlosproal.com/ir/papers/p121-comer.pdf). Important notes on B-trees: 1. Store key-value pairs (sorted by key) 2. Self-balancing 3. Often used for indexing 4. Mutable data structure (in-place update) 5. Each node is a fixed-size block/page (4KB) 6. Can only read or write one page at a time
  • 59. Anatomy of B-Tree. Root page keys: 12, 30, 50, 120, giving the child ranges key < 12, key [12, 30), key [30, 50), key [50, 120), key [120, inf). Page for key [12, 30): 13, 23, 25, 27. Page for key [30, 50): 31, 37, 42, 49. Page for key [50, 120): 52, 60, 69, 90. Leaf page for key [69, 90): 69, 70, 78, 85, each stored with its val.
  • 61. NOTE: Leaf page contains both the key and value.
  • 62. Branching factor = 5, Depth = 3.
  • 63–65. READ(78): from the root page, follow key [50, 120), then key [69, 90), down to the leaf page.
  • 66. READ(78): found!
  • 67. READ(78): found! Searching for a key is fast because we do not scan all keys, only the keys in the matching range at each level. It takes O(log n), where n is the total number of keys.
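
The READ(78) walk can be sketched with the keys from the slide. This is a minimal illustration: the `Page` class, the `leaf` helper and the `val-…` values are hypothetical, and subtrees the slide does not show are filled with empty leaves, so the tree here is not balanced the way a real B-tree would be.

```python
import bisect

class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # sorted keys in this page
        self.children = children  # child pages; None for a leaf page
        self.values = values      # values, only present in leaf pages

def leaf(keys):
    # Hypothetical values: the slide only shows "val".
    return Page(keys, values=[f"val-{k}" for k in keys])

def search(page, key):
    """Return (value or None, number of pages visited)."""
    visited = 1
    while page.children is not None:
        # Child i covers the range [keys[i-1], keys[i]).
        page = page.children[bisect.bisect_right(page.keys, key)]
        visited += 1
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i], visited
    return None, visited

# Pages from the slide; empty leaves stand in for subtrees that are not shown.
page_50_120 = Page([52, 60, 69, 90],
                   children=[leaf([]), leaf([]), leaf([]),
                             leaf([69, 70, 78, 85]), leaf([])])
root = Page([12, 30, 50, 120],
            children=[leaf([]), leaf([13, 23, 25, 27]),
                      leaf([31, 37, 42, 49]), page_50_120, leaf([])])

print(search(root, 78))  # ('val-78', 3): three pages read, matching depth 3
```

Each step reads one page and discards every other range at that level, which is where the O(log n) cost comes from.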
  • 68. INSERT(87). Parent page keys: 60, 69, 90. Leaf page: 69, 70, 78, 85, 86, each with its val. Branching factor: 5.
  • 69. INSERT(87): the leaf page would hold 69, 70, 78, 85, 86, 87. Branching factor exceeded (> 5)! Create a new page.
  • 70. INSERT(87): the leaf page is split into two pages: 69, 70, 78 and 85, 86, 87.
  • 71. INSERT(87): add 85 to the parent page, which becomes 60, 69, 85, 90.
  • 72. What if the parent page is full? Split it.
  • 73. How does update work? 1. Find the leaf page with the key 2. Edit the row 3. Overwrite the page
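
The INSERT(87) split can be sketched on plain sorted lists. This is a simplified illustration: the function name and `MAX_KEYS` are hypothetical, the limit of 5 entries per page is read from the slide's "Branching factor exceeded! > 5", values are omitted, and splitting a full parent page is not handled.

```python
import bisect

MAX_KEYS = 5  # assumed page capacity, taken from the slide's "> 5"

def insert_into_leaf(parent_keys, leaf_keys, key):
    """Insert key into a sorted leaf page, splitting it on overflow.

    Returns (new parent keys, list of resulting leaf pages).
    """
    page = list(leaf_keys)
    bisect.insort(page, key)
    if len(page) <= MAX_KEYS:
        return list(parent_keys), [page]
    # Overflow: split in half and promote the first key of the right page.
    mid = len(page) // 2
    left, right = page[:mid], page[mid:]
    parent = list(parent_keys)
    bisect.insort(parent, right[0])
    return parent, [left, right]

parent, pages = insert_into_leaf([60, 69, 90], [69, 70, 78, 85, 86], 87)
print(parent)  # [60, 69, 85, 90]
print(pages)   # [[69, 70, 78], [85, 86, 87]]
```

If the promoted key overflows the parent as well, the same split is applied one level up, which is how the tree stays balanced as it grows.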
  • 74. LSM trees vs B-trees storage engine. LSM tree: optimized for writes; compresses better (no fragmentation); there can be duplicates before compaction; spikes in writes can cause slow compaction due to many SSTable files, which can cause an Out of Memory (OOM) error. B-tree: optimized for reads; fragmentation wastes space; each key exists in exactly one place; strong transaction support.
  • 75. Space optimization in B-tree: primary index (primary key index) and secondary index.
  • 76. Space optimization in B-tree: in both the primary index and the secondary index, the leaf page contains both key and value. DUPLICATE!
  • 77. Space optimization in B-tree: both the primary index and the secondary index store a value offset (smaller in size) that points into a heap file: val1, val2, val3, val4, val5, …
  • 78. Space optimization in B-tree: following the offset into the heap file costs an extra disk I/O. So you can store important columns in the leaf page and less important columns in the heap file.
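
The heap-file layout can be sketched as two indexes that share one copy of each value. This is an illustration under assumptions: plain dicts stand in for the two B-tree indexes, an in-memory `io.BytesIO` buffer stands in for the heap file, the function names are hypothetical, and the secondary key is assumed to be unique.

```python
import io

heap = io.BytesIO()   # stands in for the heap file on disk
primary_index = {}    # primary key -> (offset, length) in the heap file
secondary_index = {}  # secondary key (here: a name) -> (offset, length)

def insert_row(row_id, name, payload):
    # Append the value once; both indexes store only its location.
    offset = heap.seek(0, io.SEEK_END)
    data = payload.encode("utf-8")
    heap.write(data)
    location = (offset, len(data))
    primary_index[row_id] = location
    secondary_index[name] = location

def fetch(location):
    # The extra read: follow the offset into the heap file.
    offset, length = location
    heap.seek(offset)
    return heap.read(length).decode("utf-8")

insert_row(1, "ben", "val1")
insert_row(2, "mia", "val2")
print(fetch(primary_index[2]))        # val2
print(fetch(secondary_index["ben"]))  # val1
```

Each lookup now costs one index read plus one heap read, which is the extra disk I/O the slide mentions in exchange for storing every value only once.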