1. Making KVS 10x Scalable
How we optimized our real-time delivery server by 10x
Sadayuki Furuhashi
Senior Principal Engineer
@frsyuki
PLAZMA TD Tech Talk 2018 at Shibuya
2. About Sadayuki Furuhashi
A founder of Treasure Data, Inc.
Located in Silicon Valley, USA.
OSS Hacker. GitHub: @frsyuki
OSS projects I initially designed:
4. What's CDP KVS?
✓ Streaming data collection
✓ Bulk data collection
✓ On-demand data delivery: served by CDP KVS (today's topic)
5. What's CDP KVS?
[Figure: data flow around CDP KVS]
Data collection in many ways -> Source data tables -> Preprocess workflows -> Audience data set (customers, behaviors) -> Segmentation workflows -> Segment data sets -> CDP KVS.
CDP KVS is deployed in US, JP, (EU) and serves lookups from Mobile, PC, and other devices.
8. Architecture (Old)
[Figure: old architecture]
In each region (AWS JP and AWS US), a CDP KVS Server sits in front of DynamoDB with DAX (DynamoDB's write-through cache) and Ignite (a distributed cache).
Presto performs bulk writes; browsers / mobile clients do random lookups.
11. Expensive Write Capacity Cost
[Figure: provisioned write capacity vs. bursty bulk writes]
When bulk writes exceed the provisioned write capacity, requests fail.
The provisioned capacity is already too expensive, and a bigger safety margin would be even more expensive.
12. Workload analysis
Read:  API = random lookup by ID; temporal locality = high (repeating visitors); spatial locality = moderate (hot & cold data sets)
Write: API = bulk write & append, no delete; temporal locality = low (daily or hourly batch); spatial locality = high (data sets rewritten by batch)
Number of records: 300 billion
Size of a record: 10 bytes
Size of total records: 3 TB
Read traffic: 50 requests/sec
15. (A) Alternative Distributed KVS
[Figure: option A]
Replace DynamoDB + DAX and Ignite behind the CDP KVS Servers with a cluster of Aerospike nodes; Presto bulk-writes directly into the Aerospike cluster.
16. Aerospike: Pros & Cons
• Good: Very fast lookup
• In-memory index + Direct IO on SSDs
• Bad: Expensive (hardware & operation)
• Same cost for both cold & hot data
(Large memory overhead for cold data)
• No spatial locality for writes
(a batch write becomes random writes)
17. Aerospike: Storage Architecture
[Figure: an Aerospike node keeps a hash index in DRAM and record data on an SSD (/dev/sdb)]
DRAM index: hash(k01) -> (addr 01, size 3), hash(k02) -> (addr 09, size 3), hash(k03) -> (addr 76, size 3)
SSD: k01 = v01 at addr 01, k02 = v02 at addr 09, k03 = v03 at addr 76
A GET hashes the key, finds (addr, size) in the DRAM index, and reads the value from the SSD; the index is loaded at startup (cold start). A minimal sketch follows below.
✓ Primary keys (hashes) are always in memory => always fast lookups
✓ Data is always on SSD => always durable
✓ IO on SSD is direct IO (no filesystem cache) => consistently fast without warm-up
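To make the figure concrete, here is a minimal sketch of the same idea in Python: an in-memory dict maps a key hash to an (offset, size) pair, and a GET does a single positioned read from the data file. File handling, hash choice, and names are my own assumptions for illustration, not Aerospike's implementation, which additionally bypasses the filesystem cache with direct IO.

import hashlib
import os

class TinyIndexStore:
    # DRAM-resident index: hash(key) -> (offset, size); values live on disk.
    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT)
        self.index = {}
        self.write_offset = 0

    def put(self, key, value):
        os.pwrite(self.fd, value, self.write_offset)
        self.index[hashlib.sha1(key).digest()] = (self.write_offset, len(value))
        self.write_offset += len(value)

    def get(self, key):
        entry = self.index.get(hashlib.sha1(key).digest())
        if entry is None:
            return None
        offset, size = entry
        # Exactly one positioned read; the index lookup itself never touches disk.
        return os.pread(self.fd, size, offset)

store = TinyIndexStore("/tmp/data.bin")
store.put(b"k01", b"v01")
assert store.get(b"k01") == b"v01"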
18. Aerospike: System Architecture
[Figure: records are placed across nodes by hash(key) = node ID]
Bulk write 1 ({ k01: v01, ..., k05: v05 }) and bulk write 2 ({ k06: v06, ..., k0a: v0a }) are both scattered across all Aerospike nodes, because every key is routed independently by its hash (sketched below).
Batch write => random writes: no locality, no compression, more overhead.
Note: compressing 10-byte records individually isn't effective.
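A tiny sketch of why that happens, assuming 6 nodes and MD5-based placement (both arbitrary choices for illustration, not Aerospike's actual partitioning scheme): every key in one bulk write picks its node independently, so a single batch ends up touching almost every node.

import hashlib

NUM_NODES = 6  # assumed cluster size for illustration

def node_of(key):
    # hash(key) = node ID: each key is routed independently of the others.
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NUM_NODES

bulk_write_1 = [b"k01", b"k02", b"k03", b"k04", b"k05"]
nodes_touched = {node_of(k) for k in bulk_write_1}
print("bulk write of %d keys hits %d of %d nodes"
      % (len(bulk_write_1), len(nodes_touched), NUM_NODES))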
19. Aerospike: Cost estimation
• 1 record needs 64 bytes of DRAM for primary key indexing.
• Storing 100 billion records (our use case) needs 6.4 TB of DRAM.
• With replication-factor = 3, our system needs 19.2 TB of DRAM.
• It needs r5.24xlarge × 26 instances on EC2 (arithmetic sketched below).
• It costs $89,000/month (1-year reserved, convertible).
• Cost structure:
• Very high DRAM cost per GB
• Moderate IOPS cost
• Low storage & CPU cost
• High operational cost
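A quick back-of-the-envelope check of the numbers above (a sketch; 768 GiB is the published DRAM size of an r5.24xlarge, and the gap between the raw quotient and the 26 instances quoted presumably covers headroom, data storage, and failover):

records = 100e9                  # 100 billion records
index_bytes = 64                 # DRAM per record for the primary key index
replication_factor = 3
r5_24xlarge_dram = 768 * 2**30   # 768 GiB of DRAM per instance

one_copy = records * index_bytes              # ~6.4 TB
all_copies = one_copy * replication_factor    # ~19.2 TB
instances = all_copies / r5_24xlarge_dram     # ~23.3, rounded up with headroom to 26
print(one_copy / 1e12, all_copies / 1e12, instances)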
25. Storage Hierarchy on KVS: Pros & Cons
• Good: Very scalable write & storage cost
• Data compression (10x less write & storage cost)
• Far fewer primary keys
(1 / 100,000 with 100k records in a partition)
• Bad: Complex to understand & use
• More difficult to understand
• The writer (Presto) must partition data by partition id (see the sketch below)
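The writer-side partitioning mentioned in the last bullet could look like the following sketch. The partition count, hash function, and names are my own assumptions for illustration; the point is that each storage item holds a whole partition of records, so the KVS keeps roughly 1 primary key per 100,000 records instead of one per record.

import hashlib
from collections import defaultdict

NUM_PARTITIONS = 1024  # hypothetical; in practice sized so a partition holds ~100k records

def partition_id(key):
    # Stable hash so the reader can recompute the same partition id at lookup time.
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big") % NUM_PARTITIONS

def group_by_partition(records):
    # Group (key, value) pairs so each partition is written as one item.
    partitions = defaultdict(dict)
    for key, value in records:
        partitions[partition_id(key)][key] = value
    return partitions

partitions = group_by_partition([(b"k01", b"v01"), (b"k02", b"v02")])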
31. Pros & Cons
• Good: Very scalable write & storage cost
• Data compression (10x less write & storage cost)
• Bad: Expensive to implement & operate
• Implementing 3 custom server components
(Stateless: Writer, Reader. Stateful: Storage)
• Operating stateful servers means more work to implement
backup, restore, monitoring, alerting, etc.
• Others:
• Flexible indexing
• Eventually-consistent
32. Our decision: Storage Hierarchy on DynamoDB
• Operating stateful servers is harder than you think!
• Note: almost all Treasure Data components are
stateless (or cache or temporary buffer)
• Even if the data format becomes more complicated, stateless
servers on DynamoDB are a better option for us.
33. Appendix: Split format
[Figure: split format]
A DynamoDB item is addressed by a PK (e.g. 71, 69) plus a split number (Split 1, Split 2, Split 3, ...).
A split is a hash table of N buckets, where each bucket holds a group of records such as { k03: v03, k06: v06, ... }.
Each bucket is serialized by MessagePack as:
msgpack( [
  [keyLen 1, keyLen 2, keyLen 3, ...],
  "key1key2key3...",
  [valLen 1, valLen 2, valLen 3, ...],
  "val1val2val3...",
] )
The whole split is the zstd-compressed MessagePack array of the serialized buckets:
zstd( msgpack( [ msgpack(bucket 1), msgpack(bucket 2), ..., msgpack(bucket N) ] ) )
Size of a split: approx. 200 KB (100,000 records).
The nested MessagePack lets a lookup deserialize only the bucket that can contain the record, omitting unnecessary deserialization.
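The split format above can be sketched in a few lines of Python using the msgpack and zstandard packages. The bucket count, hash function, and helper names are assumptions for illustration; only the layout itself (length arrays plus concatenated keys/values, nested MessagePack, zstd on the outside) follows the slide.

import hashlib
import msgpack
import zstandard as zstd

NUM_BUCKETS = 256  # hypothetical bucket count per split

def bucket_of(key):
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NUM_BUCKETS

def encode_split(records):
    # Group records into buckets, serialize each bucket with MessagePack,
    # then wrap the bucket blobs in an outer MessagePack array and compress.
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, value in records.items():
        buckets[bucket_of(key)].append((key, value))
    encoded_buckets = []
    for pairs in buckets:
        keys = b"".join(k for k, _ in pairs)
        vals = b"".join(v for _, v in pairs)
        encoded_buckets.append(msgpack.packb(
            [[len(k) for k, _ in pairs], keys, [len(v) for _, v in pairs], vals]))
    return zstd.ZstdCompressor().compress(msgpack.packb(encoded_buckets))

def lookup(split, key):
    encoded_buckets = msgpack.unpackb(zstd.ZstdDecompressor().decompress(split))
    # Deserialize only the one bucket that can contain the key.
    key_lens, keys, val_lens, vals = msgpack.unpackb(encoded_buckets[bucket_of(key)])
    k_off = v_off = 0
    for k_len, v_len in zip(key_lens, val_lens):
        if keys[k_off:k_off + k_len] == key:
            return vals[v_off:v_off + v_len]
        k_off += k_len
        v_off += v_len
    return None

split = encode_split({b"k01": b"v01", b"k02": b"v02", b"k03": b"v03"})
assert lookup(split, b"k02") == b"v02"

Because zstd compresses a whole split at once, even 10-byte records compress well, which per-record compression (as noted on slide 18) cannot achieve.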
38. A possible future work
Read:  API = random lookup by ID; temporal locality = high (repeating visitors); spatial locality = moderate (hot & cold data sets)
Write: API = random write; temporal locality = low => high?; spatial locality = high => low?
Extension for streaming computation
(=> an on-demand read operation updates a value)