1. Making KVS 10x Scalable
How we optimized our real-time delivery server by 10x
Sadayuki Furuhashi
Senior Principal Engineer
@frsyuki
PLAZMA TD Tech Talk 2018 at Shibuya
2. About Sadayuki Furuhashi
A founder of Treasure Data, Inc.
Located in Silicon Valley, USA.
OSS Hacker. GitHub: @frsyuki
OSS projects I initially designed:
4. What's CDP KVS?
✓ Streaming data collection
✓ Bulk data collection
✓ On-demand data delivery: served by CDP KVS (today's topic)
5. What's CDP KVS?
[Figure: data flow around CDP KVS]
Data collection in many ways -> Source data tables -> Preprocess workflows -> Audience data set (customers, behaviors) -> Segmentation workflows -> Segment data sets -> CDP KVS.
CDP KVS is deployed in US, JP, (EU) and serves lookups from Mobile, PC, and other devices.
8. Architecture (Old)
[Figure: old architecture]
In each region (AWS JP and AWS US), a CDP KVS Server sits in front of DynamoDB with DAX (DynamoDB's write-through cache) and Ignite (a distributed cache).
Presto performs bulk writes; browsers / mobile clients do random lookups.
11. Expensive Write Capacity Cost
[Figure: provisioned write capacity vs. bursty bulk writes]
When bulk writes exceed the provisioned write capacity, requests fail.
The provisioned capacity is already too expensive, and a bigger safety margin would be even more expensive.
12. Workload analysis
Read:  API = random lookup by ID; temporal locality = high (repeating visitors); spatial locality = moderate (hot & cold data sets)
Write: API = bulk write & append, no delete; temporal locality = low (daily or hourly batch); spatial locality = high (data sets rewritten by batch)
Number of records: 300 billion
Size of a record: 10 bytes
Size of total records: 3 TB
Read traffic: 50 requests/sec
15. (A) Alternative Distributed KVS
[Figure: option A]
Replace DynamoDB + DAX and Ignite behind the CDP KVS Servers with a cluster of Aerospike nodes; Presto bulk-writes directly into the Aerospike cluster.
16. Aerospike: Pros & Cons
• Good: Very fast lookup
• In-memory index + Direct IO on SSDs
• Bad: Expensive (hardware & operation)
• Same cost for both cold & hot data
(Large memory overhead for cold data)
• No spatial locality for writes
(a batch write becomes random writes)
17. Aerospike: Storage Architecture
[Figure: an Aerospike node keeps a hash index in DRAM and record data on an SSD (/dev/sdb)]
DRAM index: hash(k01) -> (addr 01, size 3), hash(k02) -> (addr 09, size 3), hash(k03) -> (addr 76, size 3)
SSD: k01 = v01 at addr 01, k02 = v02 at addr 09, k03 = v03 at addr 76
A GET hashes the key, finds (addr, size) in the DRAM index, and reads the value from the SSD; the index is loaded at startup (cold start). A minimal sketch follows below.
✓ Primary keys (hashes) are always in memory => always fast lookups
✓ Data is always on SSD => always durable
✓ IO on SSD is direct IO (no filesystem cache) => consistently fast without warm-up
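To make the figure concrete, here is a minimal sketch of the same idea in Python: an in-memory dict maps a key hash to an (offset, size) pair, and a GET does a single positioned read from the data file. File handling, hash choice, and names are my own assumptions for illustration, not Aerospike's implementation, which additionally bypasses the filesystem cache with direct IO.

import hashlib
import os

class TinyIndexStore:
    # DRAM-resident index: hash(key) -> (offset, size); values live on disk.
    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT)
        self.index = {}
        self.write_offset = 0

    def put(self, key, value):
        os.pwrite(self.fd, value, self.write_offset)
        self.index[hashlib.sha1(key).digest()] = (self.write_offset, len(value))
        self.write_offset += len(value)

    def get(self, key):
        entry = self.index.get(hashlib.sha1(key).digest())
        if entry is None:
            return None
        offset, size = entry
        # Exactly one positioned read; the index lookup itself never touches disk.
        return os.pread(self.fd, size, offset)

store = TinyIndexStore("/tmp/data.bin")
store.put(b"k01", b"v01")
assert store.get(b"k01") == b"v01"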
18. Aerospike: System Architecture
[Figure: records are placed across nodes by hash(key) = node ID]
Bulk write 1 ({ k01: v01, ..., k05: v05 }) and bulk write 2 ({ k06: v06, ..., k0a: v0a }) are both scattered across all Aerospike nodes, because every key is routed independently by its hash (sketched below).
Batch write => random writes: no locality, no compression, more overhead.
Note: compressing 10-byte records individually isn't effective.
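A tiny sketch of why that happens, assuming 6 nodes and MD5-based placement (both arbitrary choices for illustration, not Aerospike's actual partitioning scheme): every key in one bulk write picks its node independently, so a single batch ends up touching almost every node.

import hashlib

NUM_NODES = 6  # assumed cluster size for illustration

def node_of(key):
    # hash(key) = node ID: each key is routed independently of the others.
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NUM_NODES

bulk_write_1 = [b"k01", b"k02", b"k03", b"k04", b"k05"]
nodes_touched = {node_of(k) for k in bulk_write_1}
print("bulk write of %d keys hits %d of %d nodes"
      % (len(bulk_write_1), len(nodes_touched), NUM_NODES))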
19. Aerospike: Cost estimation
• 1 record needs 64 bytes of DRAM for primary key indexing.
• Storing 100 billion records (our use case) needs 6.4 TB of DRAM.
• With replication-factor = 3, our system needs 19.2 TB of DRAM.
• It needs r5.24xlarge × 26 instances on EC2 (arithmetic sketched below).
• It costs $89,000/month (1-year reserved, convertible).
• Cost structure:
• Very high DRAM cost per GB
• Moderate IOPS cost
• Low storage & CPU cost
• High operational cost
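A quick back-of-the-envelope check of the numbers above (a sketch; 768 GiB is the published DRAM size of an r5.24xlarge, and the gap between the raw quotient and the 26 instances quoted presumably covers headroom, data storage, and failover):

records = 100e9                  # 100 billion records
index_bytes = 64                 # DRAM per record for the primary key index
replication_factor = 3
r5_24xlarge_dram = 768 * 2**30   # 768 GiB of DRAM per instance

one_copy = records * index_bytes              # ~6.4 TB
all_copies = one_copy * replication_factor    # ~19.2 TB
instances = all_copies / r5_24xlarge_dram     # ~23.3, rounded up with headroom to 26
print(one_copy / 1e12, all_copies / 1e12, instances)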
25. Storage Hierarchy on KVS: Pros & Cons
• Good: Very scalable write & storage cost
• Data compression (10x less write & storage cost)
• Far fewer primary keys
(1 / 100,000 with 100k records in a partition)
• Bad: Complex to understand & use
• More difficult to understand
• The writer (Presto) must partition data by partition id (see the sketch below)
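The writer-side partitioning mentioned in the last bullet could look like the following sketch. The partition count, hash function, and names are my own assumptions for illustration; the point is that each storage item holds a whole partition of records, so the KVS keeps roughly 1 primary key per 100,000 records instead of one per record.

import hashlib
from collections import defaultdict

NUM_PARTITIONS = 1024  # hypothetical; in practice sized so a partition holds ~100k records

def partition_id(key):
    # Stable hash so the reader can recompute the same partition id at lookup time.
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big") % NUM_PARTITIONS

def group_by_partition(records):
    # Group (key, value) pairs so each partition is written as one item.
    partitions = defaultdict(dict)
    for key, value in records:
        partitions[partition_id(key)][key] = value
    return partitions

partitions = group_by_partition([(b"k01", b"v01"), (b"k02", b"v02")])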
31. Pros & Cons
• Good: Very scalable write & storage cost
• Data compression (10x less write & storage cost)
• Bad: Expensive to implement & operate
• Implementing 3 custom server components
(Stateless: Writer, Reader. Stateful: Storage)
• Operating stateful servers means more work to implement
backup, restore, monitoring, alerting, etc.
• Others:
• Flexible indexing
• Eventually-consistent
32. Our decision: Storage Hierarchy on DynamoDB
• Operating stateful servers is harder than you think!
• Note: almost all Treasure Data components are
stateless (or cache or temporary buffer)
• Even if the data format becomes more complicated, stateless
servers on DynamoDB are a better option for us.
33. Appendix: Split format
[Figure: split format]
A DynamoDB item is addressed by a PK (e.g. 71, 69) plus a split number (Split 1, Split 2, Split 3, ...).
A split is a hash table of N buckets, where each bucket holds a group of records such as { k03: v03, k06: v06, ... }.
Each bucket is serialized by MessagePack as:
msgpack( [
  [keyLen 1, keyLen 2, keyLen 3, ...],
  "key1key2key3...",
  [valLen 1, valLen 2, valLen 3, ...],
  "val1val2val3...",
] )
The whole split is the zstd-compressed MessagePack array of the serialized buckets:
zstd( msgpack( [ msgpack(bucket 1), msgpack(bucket 2), ..., msgpack(bucket N) ] ) )
Size of a split: approx. 200 KB (100,000 records).
The nested MessagePack lets a lookup deserialize only the bucket that can contain the record, omitting unnecessary deserialization.
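The split format above can be sketched in a few lines of Python using the msgpack and zstandard packages. The bucket count, hash function, and helper names are assumptions for illustration; only the layout itself (length arrays plus concatenated keys/values, nested MessagePack, zstd on the outside) follows the slide.

import hashlib
import msgpack
import zstandard as zstd

NUM_BUCKETS = 256  # hypothetical bucket count per split

def bucket_of(key):
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NUM_BUCKETS

def encode_split(records):
    # Group records into buckets, serialize each bucket with MessagePack,
    # then wrap the bucket blobs in an outer MessagePack array and compress.
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, value in records.items():
        buckets[bucket_of(key)].append((key, value))
    encoded_buckets = []
    for pairs in buckets:
        keys = b"".join(k for k, _ in pairs)
        vals = b"".join(v for _, v in pairs)
        encoded_buckets.append(msgpack.packb(
            [[len(k) for k, _ in pairs], keys, [len(v) for _, v in pairs], vals]))
    return zstd.ZstdCompressor().compress(msgpack.packb(encoded_buckets))

def lookup(split, key):
    encoded_buckets = msgpack.unpackb(zstd.ZstdDecompressor().decompress(split))
    # Deserialize only the one bucket that can contain the key.
    key_lens, keys, val_lens, vals = msgpack.unpackb(encoded_buckets[bucket_of(key)])
    k_off = v_off = 0
    for k_len, v_len in zip(key_lens, val_lens):
        if keys[k_off:k_off + k_len] == key:
            return vals[v_off:v_off + v_len]
        k_off += k_len
        v_off += v_len
    return None

split = encode_split({b"k01": b"v01", b"k02": b"v02", b"k03": b"v03"})
assert lookup(split, b"k02") == b"v02"

Because zstd compresses a whole split at once, even 10-byte records compress well, which per-record compression (as noted on slide 18) cannot achieve.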
38. A possible future work
Read:  API = random lookup by ID; temporal locality = high (repeating visitors); spatial locality = moderate (hot & cold data sets)
Write: API = random write; temporal locality = low => high?; spatial locality = high => low?
Extension for streaming computation
(=> an on-demand read operation updates a value)