Apache Pegasus (incubating): A distributed key-value storage system

Yuchen He & Shuo Jia
Software Engineers at XiaoMi, Apache Pegasus PPMC members

Speakers
Yuchen He
• Graduated from Renmin University of China
• Software engineer at XiaoMi
• Pegasus project leader at XiaoMi
• Apache Pegasus PPMC member
Shuo Jia
• Graduated from Beijing Jiaotong University
• Software engineer at XiaoMi
• Apache Pegasus PPMC member
• Has worked on Pegasus development for 2 years

Outline
• Basic Introduction
– Architecture, Data Model, Dual WAL, Performance
• New Features
– Duplication, Bulk load, Access control, Partition split
• Surrounding Ecosystems
– Pegasus-Spark, Meta proxy, Disk Migration tools
• Community

Introduction
• Redis or HBase
– Non-Volatile vs Consistent
– Remote Access
• Pegasus
– C++
– Local persistent storage
– Strongly consistent
– High performance
– Horizontally scalable

Architecture
Meta server
• Cluster controller
• Configuration manager
Replica server
• Data node
• Hash partitioning
• PacificA (strongly consistent)
• RocksDB instance for each replica
Zookeeper
• Meta server election
• Metadata storage
ClientLib
• Caches the data routing table
• Accesses replica servers directly
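
The client-side routing described above (hash partitioning plus a cached routing table) can be sketched roughly as follows. This is an illustration only, not Pegasus client code: `RoutingTable`, `partition_of`, and the server names are hypothetical, and `zlib.crc32` is a stand-in for whatever hash function the real client uses.

```python
import zlib


def partition_of(hash_key: bytes, partition_count: int) -> int:
    """Map a hash key to a partition index (crc32 is a stand-in hash)."""
    return zlib.crc32(hash_key) % partition_count


class RoutingTable:
    """Hypothetical client-side cache: partition index -> primary replica server."""

    def __init__(self, primaries):
        self.primaries = list(primaries)

    def locate(self, hash_key: bytes) -> str:
        # The cached table answers directly, so no meta server round trip
        # is needed on the request path.
        index = partition_of(hash_key, len(self.primaries))
        return self.primaries[index]


# Four partitions spread over three made-up replica servers.
table = RoutingTable(["server-a", "server-b", "server-c", "server-a"])
server = table.locate(b"user_42")
```

In this sketch the client would only need to refresh the table from the meta server when the cached entry turns out to be stale.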

Dual WAL
Traditional solution
[Diagram: a client writes to Replica1, Replica2, and Replica3; each replica keeps its Data and Log on the same Disk]
• Background data compaction may strongly affect WAL sync performance

Dual WAL
[Diagram: a client writes to Replica1, Replica2, and Replica3; each replica keeps its Data and Private Log on the Data Disk, and the replicas share one Shared Log on the Log Disk]
• WAL and data are separated: the shared log is written synchronously, the private log asynchronously
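
The write path on this slide can be approximated with the sketch below. It is a simplified model under stated assumptions, not the Pegasus implementation: the `DualWal` class, the file names, and the record format are all hypothetical, and two temporary directories stand in for the log disk and the data disk.

```python
import os
import queue
import tempfile
import threading


class DualWal:
    """Hypothetical model: one synced shared log, per-replica async private logs."""

    def __init__(self, log_dir, data_dir, replica_ids):
        self.shared = open(os.path.join(log_dir, "shared.log"), "ab")
        self.private = {
            rid: open(os.path.join(data_dir, f"private.{rid}.log"), "ab")
            for rid in replica_ids
        }
        self.pending = queue.Queue()
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def write(self, replica_id: str, record: bytes) -> None:
        # Sync path: return only after the shared log is on disk.
        self.shared.write(replica_id.encode() + b"\t" + record + b"\n")
        self.shared.flush()
        os.fsync(self.shared.fileno())
        # Async path: the private log catches up in the background.
        self.pending.put((replica_id, record))

    def _drain(self) -> None:
        while True:
            item = self.pending.get()
            if item is None:  # shutdown marker
                break
            replica_id, record = item
            log = self.private[replica_id]
            log.write(record + b"\n")
            log.flush()

    def close(self) -> None:
        self.pending.put(None)
        self.worker.join()
        self.shared.close()
        for log in self.private.values():
            log.close()


log_dir, data_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
wal = DualWal(log_dir, data_dir, ["replica1", "replica2"])
wal.write("replica1", b"set k1 v1")
wal.write("replica2", b"set k2 v2")
wal.close()
```

Because only the shared log is synced on the request path, compaction traffic on the data disk does not sit between the client and the acknowledgement in this model.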

Performance
Read:Write  Client*Thread  Op     QPS     AvgLatency  P99Latency(us)
0:1         3*15           read   ---     ---         ---
                           write  46128   972         5591
1:0         3*50           read   282648  542         1674
                           write  ---     ---         ---
1:1         3*30           read   36014   1068        15345
                           write  36016   1421        8197
1:3         3*15           read   11622   779         10417
                           write  34989   1021        5467
2.2.0 (newest release) benchmark

Duplication
Basic introduction
[Diagram: a Table in Region1 and a Table in Region2, linked by async-duplication]
• Designed for cross-region online backup
• Transfers logs and writes asynchronously
• Supports single-master and multi-master

Duplication
Case1: Online Migration
[Diagram: client, Source Cluster (Table), Remote storage, Target Cluster (Table)]
1. Reserve logs
2. Cold backup
3. Restore
4. Duplication
5. Switch

Duplication
Case2: Master-Slave cluster
[Diagram: clients; Master region (Table) -> duplication -> Slave region (Table); eventually-consistent read; Region1, Region2]

Duplication
Enhancement in future
• Master-master in practice
• Duplication across more than two regions in practice
• Facilities for supporting a remote disaster-tolerant system
• Automatic master-slave switch
• Better user experience
• Extension:
• CDC support on demand
• e.g. ES, MQ…

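Bulk Load
Quickly imports large amounts of data offline
[Diagram: original data is turned into sst files on a File provider; Replica servers download the sst files and ingest them into the Table]
1. Generate Files
2. Download Files
3. Ingest Files
• Client R/W continues during the load; writes are rejected during ingestion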

Access Control
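Authentication: Kerberos
Authorization: whitelist-based, coarse-grained, table-level access control
[Diagram: inside the Cluster, TableA is associated with KeytabA and TableB with KeytabB; a client holding KeytabA reaches TableA and is rejected (X) by TableB]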
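
A table-level whitelist check of the kind described here could look like the sketch below. It assumes authentication has already produced a user name (the Kerberos step is not modeled), and `TABLE_WHITELIST`, `is_allowed`, and the user names are hypothetical, not Pegasus configuration keys or APIs.

```python
# Hypothetical per-table whitelists of already-authenticated users.
TABLE_WHITELIST = {
    "TableA": {"userA"},
    "TableB": {"userB"},
}


def is_allowed(user: str, table: str) -> bool:
    """Coarse-grained check: a user may access a table only if whitelisted."""
    return user in TABLE_WHITELIST.get(table, set())


# userA is on TableA's whitelist but not on TableB's.
can_read_a = is_allowed("userA", "TableA")
can_read_b = is_allowed("userA", "TableB")
```

The check is deliberately all-or-nothing per table, which is what "coarse-grained" suggests; it does not distinguish reads from writes.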

Partition Split
Basic introduction
• Each replica divides into two replicas
• Replica[i] -> Replica[i], Replica[i+original_partition_count]
[Diagram: original_partition_count = 4]
Replica group0: Replica0 -> Replica0, Replica4
Replica group1: Replica1 -> Replica1, Replica5
Replica group2: Replica2 -> Replica2, Replica6
Replica group3: Replica3 -> Replica3, Replica7
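
The split rule works because, with modulo-based hash partitioning, doubling the partition count sends every key in partition i either to i or to i + original_partition_count. The sketch below checks this property; `partition_of` and `children_of` are hypothetical helpers, and `zlib.crc32` is a stand-in hash, not necessarily the one Pegasus uses.

```python
import zlib


def partition_of(hash_key: bytes, partition_count: int) -> int:
    """Modulo-based hash partitioning (crc32 is a stand-in hash)."""
    return zlib.crc32(hash_key) % partition_count


def children_of(index: int, original_count: int):
    """Replica[i] -> Replica[i], Replica[i + original_partition_count]."""
    return index, index + original_count


original_count = 4
keys = [f"key{i}".encode() for i in range(1000)]

# Every key must land in one of the two children of its old partition.
all_keys_stay_in_family = all(
    partition_of(key, original_count * 2)
    in children_of(partition_of(key, original_count), original_count)
    for key in keys
)
```

This is why a child only ever needs data from its own parent: no key moves between unrelated partitions during a split.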

Partition Split
Stage1: async-learn
[Diagram: client; three Replica servers holding the primary and two secondaries, each with a child that copies data]
• parent (old replica), child (new replica)
• The child replica copies data from the parent
• The client only knows the parent replica

Partition Split
Stage2: register
[Diagram: client (requests blocked, X); three Replica servers holding the primary and two secondaries, each with a child; "register child" is sent to the meta server]
• Starts when the child has copied all parent data
• R/W is rejected while registering

Partition Split
Partition split succeeds
[Diagram: client; three Replica servers, one now holding two primaries and the other two holding two secondaries each]
• Merged into master, will be released in 2.3.0
• Duplicated data is garbage-collected by compaction
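
After a split, parent and child both start with the same copy of the data, so each holds keys that now belong to the other. One way compaction could discard them is sketched below. The slide does not describe the filter, so the rule used here (keep a key only if it hashes to this partition under the new partition count) is an assumption, and `partition_of` and `compact` are hypothetical names with `zlib.crc32` as a stand-in hash.

```python
import zlib


def partition_of(hash_key: bytes, partition_count: int) -> int:
    """Modulo-based hash partitioning (crc32 is a stand-in hash)."""
    return zlib.crc32(hash_key) % partition_count


def compact(records: dict, partition_index: int, partition_count: int) -> dict:
    """Keep only records that still belong to this partition after the split."""
    return {
        key: value
        for key, value in records.items()
        if partition_of(key, partition_count) == partition_index
    }


original_count = 4
parent_index = 1
all_keys = [f"key{i}".encode() for i in range(200)]
parent_data = {
    key: b"v" for key in all_keys
    if partition_of(key, original_count) == parent_index
}
child_data = dict(parent_data)  # the child starts as a full copy of the parent

new_count = original_count * 2
parent_after = compact(parent_data, parent_index, new_count)
child_after = compact(child_data, parent_index + original_count, new_count)
```

Together the two compacted sets contain every original key exactly once, so no data is lost and none is stored twice.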

Pegasus-Spark
Best practices
• Large offline data analysis (SQL)
• Large offline data load (BulkLoad)

Pegasus-Spark
Offline Analysis
• Convert into Hive (parquet)
• Use SparkSQL for analysis
[Diagram: Replica servers, HDFS, Schema RDD, Hive]

Pegasus-Spark
Convert to SST file for Bulk load
[Diagram: original data -> Transform (Pegasus-Spark) on a cluster of nodes: Distinct, Repartition, Sort -> HDFS (sst file)]
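
The three transform steps named on this slide (Distinct, Repartition, Sort) can be illustrated in plain Python without Spark. This is a sketch of the idea only: `prepare_sst_batches` is a hypothetical function, `zlib.crc32` is a stand-in hash, and keeping the last value for a duplicated key is an assumption. The sort step matters because SST files store keys in sorted order.

```python
import zlib
from collections import defaultdict


def prepare_sst_batches(records, partition_count):
    """Return {partition index: sorted list of (key, value)} ready to be written out."""
    # Distinct: keep one value per key (last one wins; an assumption).
    distinct = dict(records)
    # Repartition: group rows by the partition that will ingest them.
    partitions = defaultdict(list)
    for key, value in distinct.items():
        partitions[zlib.crc32(key) % partition_count].append((key, value))
    # Sort: order rows by key within each partition.
    return {index: sorted(rows) for index, rows in partitions.items()}


records = [
    (b"user3", b"c"),
    (b"user1", b"a"),
    (b"user2", b"b"),
    (b"user1", b"a2"),  # duplicate key
]
batches = prepare_sst_batches(records, partition_count=4)
```

Each batch could then be written as one file per partition, which matches the per-partition download and ingest steps of Bulk Load.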

Meta Proxy
Basic introduction
• Unified access
• Manages primary and standby clusters
[Diagram: without MetaProxy, each client connects to the meta servers of Cluster A, Cluster B, or Cluster C; with MetaProxy, clients connect to MetaProxy, which sits in front of the meta servers of Cluster A, Cluster B, and Cluster C]

Meta Proxy
Switch primary and standby cluster
[Diagram: MetaProxy in front of a primary cluster and a secondary cluster linked by duplication; after the switch, the two clusters swap roles]

Disk migration tool
Balances disk usage on a replica server
[Diagram: Disk migrator loop: Select Disk -> Select Replica -> Migrate Replica, looping until balanced]
Disk    Before  After (balanced)
Disk1   70%     70%
Disk2   75%     65%
Disk3   85%     70%
Disk4   40%     65%
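
The select-disk, select-replica, migrate loop can be sketched as a simple greedy procedure. This is not the tool's actual algorithm: the `balance` function, the threshold, and the replica sizes are hypothetical, chosen so that the starting usage matches the "before" figures on the slide.

```python
def balance(disks, threshold=10):
    """disks: {disk: {replica: usage_percent}}. Mutates disks, returns the moves made."""
    moves = []
    while True:
        usage = {disk: sum(replicas.values()) for disk, replicas in disks.items()}
        # Select disk: the fullest disk is the source, the emptiest the target.
        src = max(usage, key=usage.get)
        dst = min(usage, key=usage.get)
        gap = usage[src] - usage[dst]
        if gap <= threshold:
            break  # balanced enough
        # Select replica: the largest one that still narrows the gap.
        candidates = [r for r, size in disks[src].items() if size < gap]
        if not candidates:
            break  # no move would help
        replica = max(candidates, key=disks[src].get)
        # Migrate replica.
        disks[dst][replica] = disks[src].pop(replica)
        moves.append((replica, src, dst))
    return moves


# Made-up replica sizes that add up to 70%, 75%, 85%, and 40%.
disks = {
    "Disk1": {"r1": 35, "r2": 35},
    "Disk2": {"r3": 10, "r4": 65},
    "Disk3": {"r5": 15, "r6": 70},
    "Disk4": {"r7": 40},
}
moves = balance(disks)
final_usage = {disk: sum(replicas.values()) for disk, replicas in disks.items()}
```

With these sizes the loop makes two moves and stops at 70%, 65%, 70%, and 65%; other replica sizes would lead to a different end state.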

Process
2015    Start
2016    Release 1.0.0
2017.9  Open GitHub
2020.6  Join Apache
2020.9  Release 2.0.0
2021.8  Meet UP

Tools
Start contributing with the API and tools
[Diagram: Pegasus core surrounded by user-cli, client, HTTP API, RPC API, monitoring, admin-cli, deploy tools, other tools …]

In future
Issues, Roadmap, RFC
• New Features
– Cluster load balance
– Table Migrator Tools
– Read throughput throttling
– Support K8S
– ...
• Feature enhancement
– Duplication
– Bulk load
– Hot partition detection
– …
• Tests
• Documents

Activities
• August 21st, Beijing: the first offline meetup is coming soon

THANK YOU
QUESTIONS?
https://github.com/apache/incubator-pegasus
https://pegasus.apache.org/
Apache Pegasus
