A presentation at ApacheCon Asia 2021 by Yuchen He and Shuo Jia.
Apache Pegasus is a horizontally scalable, strongly consistent and high-performance key-value store.
Learn more about Pegasus: https://pegasus.apache.org, https://github.com/apache/incubator-pegasus
Apache Pegasus (incubating): A distributed key-value storage system
1. APACHE PEGASUS (INCUBATING) - A DISTRIBUTED KEY-VALUE STORAGE SYSTEM
Yuchen He & Shuo Jia
Software engineers at Xiaomi, Apache Pegasus PPMC members
2. Speakers
Yuchen He
• Graduated from Renmin University of China
• Software engineer at Xiaomi
• Pegasus project leader at Xiaomi
• Apache Pegasus PPMC
Shuo Jia
• Graduated from Beijing Jiaotong University, China
• Software engineer at Xiaomi
• Apache Pegasus PPMC
• Participated in the development of Pegasus for 2 years
3. Outline
• Basic Introduction
– Architecture, Data Model, Dual WAL, Performance
• New Features
– Duplication, Bulk load, Access control, Partition split
• Surrounding Ecosystems
– Pegasus-Spark, Meta proxy, Disk Migration tools
• Community
5. Introduction
• Redis or HBase
– Non-Volatile vs Consistent
– Remote Access
• Pegasus
– C++
– Local persistent storage
– Strongly consistent
– High performance
– Horizontally scalable
6. Architecture
Meta server
• Cluster controller
• Configuration manager
Replica server
• Data node
• Hash partitioning
• PacificA (strongly consistent)
• RocksDB instance for each replica
Zookeeper
• Meta server election
• Metadata storage
ClientLib
• Caches the data routing table
• Accesses replica servers directly
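The routing path described above (hash partitioning plus a client-side routing cache) can be sketched roughly as follows. This is a minimal illustration, not the Pegasus client: `ClientLib`, `toy_meta_lookup` and the server names are hypothetical, and CRC32 only stands in for whatever hash function Pegasus actually uses.

```python
import zlib


def partition_index(hash_key: bytes, partition_count: int) -> int:
    """Map a key to a partition by hashing (CRC32 is a stand-in hash, an assumption)."""
    return zlib.crc32(hash_key) % partition_count


class ClientLib:
    """Hypothetical client that caches the partition -> primary routing table."""

    def __init__(self, meta_lookup, partition_count: int):
        self._meta_lookup = meta_lookup        # callable: partition index -> address
        self._partition_count = partition_count
        self._routing_cache = {}               # partition index -> address
        self.meta_queries = 0                  # how often the meta server was asked

    def primary_for(self, hash_key: bytes) -> str:
        idx = partition_index(hash_key, self._partition_count)
        if idx not in self._routing_cache:
            # Cache miss: ask the meta server once, then remember the answer.
            self.meta_queries += 1
            self._routing_cache[idx] = self._meta_lookup(idx)
        # Cache hit: the request can go straight to the replica server.
        return self._routing_cache[idx]


def toy_meta_lookup(idx: int) -> str:
    """Toy stand-in for the meta server's routing answer."""
    return f"replica-server-{idx % 3}"


client = ClientLib(toy_meta_lookup, partition_count=8)
first = client.primary_for(b"user_42")
second = client.primary_for(b"user_42")   # served from the cache
```

After the first lookup the meta server is out of the request path, which is the point of caching the routing table in the client.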
9. Dual WAL
[Diagram: a client writes to a replica server. On the data disk, Replica1, Replica2 and Replica3 each keep their own data and private log; on the log disk there is a single shared log.]
• Separate WAL and data, sync-write shared log, async-write private log
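The write path implied by this slide can be sketched as below. It is a toy in-memory model under stated assumptions: `ReplicaServerSketch` and its methods are hypothetical names, lists stand in for log files, and the "asynchronous" private-log write is modelled as an explicit flush call rather than a background thread.

```python
from collections import deque


class ReplicaServerSketch:
    """Toy model: one shared log per server, one private log per replica."""

    def __init__(self, replica_ids):
        self.shared_log = []                                  # lives on the log disk
        self.private_logs = {r: [] for r in replica_ids}      # live on the data disk
        self._pending = deque()                               # not yet in a private log

    def write(self, replica_id, mutation):
        if replica_id not in self.private_logs:
            raise KeyError(f"unknown replica {replica_id}")
        # Synchronous step: the mutation is in the shared log before we acknowledge.
        self.shared_log.append((replica_id, mutation))
        # Asynchronous step: queue the mutation for the replica's private log.
        self._pending.append((replica_id, mutation))
        return "ack"

    def flush_private_logs(self):
        """Drain queued mutations into the per-replica private logs."""
        while self._pending:
            replica_id, mutation = self._pending.popleft()
            self.private_logs[replica_id].append(mutation)


server = ReplicaServerSketch(["Replica1", "Replica2", "Replica3"])
server.write("Replica1", "set a=1")
server.write("Replica2", "set b=2")
server.write("Replica1", "set c=3")
pending_before_flush = [len(log) for log in server.private_logs.values()]
server.flush_private_logs()
```

The shared log interleaves mutations from all replicas in arrival order, while each private log only ever contains its own replica's mutations.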
15. Duplication
Future enhancements
• Master-master duplication in practice
• Duplication across more than two regions in practice
• Facilities for supporting a remote disaster-tolerant system
– Automatic master/slave switchover
– Better user experience
• Extension:
– Supporting CDC on demand
– e.g. ES, MQ…
16. Bulk Load
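[Diagram: original data is converted into sst files and stored with a file provider; the replica servers of the table download the sst files and ingest them.]
1. Generate files
2. Download files
3. Ingest files
• Clients can read and write as usual, except that writes are rejected during ingestion
• Quickly imports large volumes of data offline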
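The three steps and the write rejection during ingestion can be sketched as follows. This is a simplified in-memory model, not Pegasus code: `TableSketch`, `generate_files` and `bulk_load` are hypothetical names, and sorted lists of key-value pairs stand in for real sst files.

```python
class TableSketch:
    """Toy table that rejects writes while files are being ingested."""

    def __init__(self):
        self.data = {}
        self.ingesting = False

    def put(self, key, value):
        if self.ingesting:
            raise RuntimeError("write rejected during ingestion")
        self.data[key] = value

    def get(self, key):
        # Reads are always served in this sketch.
        return self.data.get(key)


def generate_files(original, file_size=2):
    """Step 1: sort the original data and cut it into 'sst files'."""
    items = sorted(original.items())
    return [items[i:i + file_size] for i in range(0, len(items), file_size)]


def bulk_load(table, file_provider):
    """Steps 2 and 3: download the files, then ingest them."""
    downloaded = list(file_provider)      # step 2: download
    table.ingesting = True                # step 3: writes are rejected from here
    try:
        for sst_file in downloaded:
            table.data.update(sst_file)
    finally:
        table.ingesting = False           # writes are accepted again


files = generate_files({"b": 2, "a": 1, "c": 3})
table = TableSketch()
table.put("x", 0)                         # normal write before the bulk load
bulk_load(table, files)
```

Only the ingest step needs to block writes in this model; generating and downloading files happen without touching the live table.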
19. Partition Split
Stage1: async-learn
[Diagram: three replica servers hold the parent replicas (one primary, two secondaries); each also hosts a child replica that copies data from its parent. The client talks only to the parents.]
• Parent (old replica), child (new replica)
• Child replicas copy data from their parents
• The client only knows about the parent replicas
20. Partition Split
Stage2: register
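[Diagram: three replica servers, each with a parent replica (primary or secondary) and its child; the child replicas are registered with the meta server ("register child") and client access is blocked (X) in the meantime.]
• Starts when the children have copied all of the parent data
• Reads and writes are rejected while registering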
21. Partition Split
Partition split succeeds
[Diagram: each replica server now serves two replicas with the same role: two primaries on one server and two secondaries on each of the other two, with the client connected.]
• Merged in master, will be released in 2.3.0
• Duplicate data is garbage-collected by compaction
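Why the child can start as a full copy of its parent, and why compaction can later drop the duplicates, can be illustrated with a small model. It assumes that a split doubles the partition count and that keys are placed by `hash % partition_count`; CRC32 is a stand-in hash, and `split` and `compact` are hypothetical helpers, not Pegasus functions.

```python
import zlib


def partition_of(key: bytes, partition_count: int) -> int:
    """Stand-in placement rule: hash modulo partition count (an assumption)."""
    return zlib.crc32(key) % partition_count


def split(parents):
    """Double the partition count: child i + N starts as a full copy of parent i."""
    return [dict(p) for p in parents] + [dict(p) for p in parents]


def compact(partitions):
    """GC step: each partition drops the keys that no longer hash to it."""
    count = len(partitions)
    return [
        {k: v for k, v in part.items() if partition_of(k, count) == i}
        for i, part in enumerate(partitions)
    ]


# Fill 4 parent partitions with 100 keys.
keys = [f"key{i}".encode() for i in range(100)]
parents = [{} for _ in range(4)]
for k in keys:
    parents[partition_of(k, 4)][k] = b"v"

after_split = split(parents)     # 8 partitions, every key stored twice
after_gc = compact(after_split)  # 8 partitions, every key stored once
```

Because `h % 2N` is always either `h % N` or `h % N + N`, every key ends up in its old partition or in that partition's child, so no data has to move between unrelated partitions.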
25. Pegasus-Spark
Converts original data to sst files for bulk load
[Diagram: original data is spread across several nodes; Transform (Pegasus-Spark) applies Distinct, Repartition and Sort, and the result is written to HDFS as sst files.]
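The Distinct, Repartition and Sort stages can be illustrated in plain Python. This is not the Pegasus-Spark API and does not use Spark at all: `to_sorted_partitions` is a hypothetical helper, CRC32 stands in for the real partitioning hash, and "last value wins" for duplicate keys is an assumption made for the example.

```python
import zlib


def to_sorted_partitions(records, partition_count):
    """Distinct -> Repartition -> Sort, yielding one sorted 'file' per partition."""
    # Distinct: keep a single value per key (the last one seen, by assumption).
    distinct = dict(records)

    # Repartition: route each key to the partition that will own it.
    partitions = [[] for _ in range(partition_count)]
    for key, value in distinct.items():
        partitions[zlib.crc32(key.encode()) % partition_count].append((key, value))

    # Sort: sst files must hold keys in order, so sort within each partition.
    return [sorted(part) for part in partitions]


records = [("k3", "c"), ("k1", "a"), ("k2", "b"), ("k1", "a2"), ("k4", "d")]
sst_like_files = to_sorted_partitions(records, partition_count=2)
```

Sorting after repartitioning means each output file is already ordered and belongs to exactly one partition, which is what a later ingest step needs.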
26. Meta Proxy
Basic introduction
• Unified access point
• Primary and standby cluster management
[Diagram: without MetaProxy, every client connects to the meta servers of Cluster A, Cluster B and Cluster C directly; with MetaProxy, clients connect to MetaProxy, which sits in front of the meta servers of all three clusters.]
27. Meta Proxy
Switch primary and standby cluster
[Diagram: clients reach the clusters through MetaProxy while duplication runs between the primary and secondary clusters; after the switch, the two clusters have exchanged roles.]
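The two roles of the proxy, unified access and primary/standby switching, can be sketched as a small lookup table. Everything here is hypothetical (the class `MetaProxySketch`, its methods, the table name and the meta server addresses); the slides do not describe the real MetaProxy interface.

```python
class MetaProxySketch:
    """Toy proxy: clients ask it which meta servers to use for a table."""

    def __init__(self):
        # table name -> {"primary": [meta addresses], "standby": [meta addresses]}
        self._routes = {}

    def register(self, table, primary, standby):
        self._routes[table] = {"primary": list(primary), "standby": list(standby)}

    def resolve(self, table):
        """Unified access: clients never need to know cluster addresses themselves."""
        return self._routes[table]["primary"]

    def switch(self, table):
        """Swap the primary and standby clusters for a table."""
        route = self._routes[table]
        route["primary"], route["standby"] = route["standby"], route["primary"]


proxy = MetaProxySketch()
proxy.register(
    "orders",
    primary=["cluster-a-meta1", "cluster-a-meta2"],
    standby=["cluster-b-meta1", "cluster-b-meta2"],
)
before = proxy.resolve("orders")
proxy.switch("orders")
after = proxy.resolve("orders")
```

Clients keep calling `resolve` with the same table name before and after the switch, so a failover needs no client-side configuration change in this model.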
28. Disk migration tool
Balances disk usage on a replica server
[Diagram: the disk migrator loops over Select Disk, Select Replica, Migrate Replica until the disks are balanced.]
Before: Disk1 70%, Disk2 75%, Disk3 85%, Disk4 40%
After (balanced): Disk1 70%, Disk2 65%, Disk3 70%, Disk4 65%
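The select-disk, select-replica, migrate loop can be sketched as a greedy balancer. This is one possible policy, not the tool's actual algorithm: `balance_disks`, the 10-point threshold and the "replica closest to half the gap" rule are assumptions, and the per-replica sizes are made up so that the totals match the disk usages on the slide (equal-capacity disks are assumed).

```python
def balance_disks(disks, threshold=10):
    """Move replicas from the fullest to the emptiest disk until usage is balanced.

    disks: dict of disk name -> list of replica sizes (percent of disk capacity).
    Returns the list of (source, destination, replica size) moves performed.
    """
    moves = []
    while True:
        used = {name: sum(replicas) for name, replicas in disks.items()}
        src = max(used, key=used.get)      # Select Disk: the fullest one
        dst = min(used, key=used.get)      # and the emptiest one as the target
        gap = used[src] - used[dst]
        if gap <= threshold:
            break                          # balanced enough, stop looping
        # Select Replica: only replicas smaller than the gap narrow it.
        candidates = [r for r in disks[src] if 0 < r < gap]
        if not candidates:
            break                          # nothing movable would help
        replica = min(candidates, key=lambda r: abs(r - gap / 2))
        # Migrate Replica.
        disks[src].remove(replica)
        disks[dst].append(replica)
        moves.append((src, dst, replica))
    return moves


# Hypothetical replica sizes chosen to add up to 70%, 75%, 85% and 40%.
disks = {
    "Disk1": [35, 35],
    "Disk2": [40, 25, 10],
    "Disk3": [35, 35, 15],
    "Disk4": [40],
}
moves = balance_disks(disks)
usage_after = {name: sum(replicas) for name, replicas in disks.items()}
```

Each move involves a replica smaller than the current gap, so the fullest and emptiest disks always get closer and the loop terminates.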