Apache Pegasus
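Construction and Practice of Offline-Online Integration
Wang Wei
2022.11.05
Agenda
• Background
• Implementation
• Applications in practice
• Future outlook
Background
Requirements
• Transaction
– Random reads and writes, ACID transactions, locking; aimed at DBAs
– MySQL, PostgreSQL
• Analytics
– Large-scale scan, filter, and aggregation; distributed; columnar storage; weak support for data updates; aimed at analysts
– Hive, Spark, ClickHouse, Doris
• Serving
– High concurrency, simple queries, updatable data; aimed at online applications
– Cassandra, Redis, Pegasus, HBase
Requirements
• Online -> offline
– Traces and user-behavior data collected by the online engine need offline analysis
• Offline -> online
– For businesses such as advertising and user profiling, results computed offline need to be queried online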
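Read and write interfaces
Interface     Description
Read
  get         Read a single record
  multiGet    Read multiple records under the same HashKey
  batchGet    Read multiple records across HashKeys
  hashScan    Range scan under the same HashKey
  fullScan    Full table scan
Write
  set         Write a single record
  multiSet    Atomically write multiple records under the same HashKey
  batchSet    Write multiple records across HashKeys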
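To make the HashKey/SortKey semantics of these interfaces concrete, here is a minimal in-memory sketch in Python. It is a toy model and not the Pegasus client: the class `ToyTable` and its method names are hypothetical and only mirror the interface names listed above.

```python
from collections import defaultdict


class ToyTable:
    """Toy in-memory model of a HashKey/SortKey table (hypothetical, not the Pegasus client)."""

    def __init__(self):
        # hash_key -> {sort_key: value}
        self._rows = defaultdict(dict)

    def set(self, hash_key, sort_key, value):
        self._rows[hash_key][sort_key] = value

    def multi_set(self, hash_key, kvs):
        # All records share one HashKey, so they are applied together.
        self._rows[hash_key].update(kvs)

    def get(self, hash_key, sort_key):
        return self._rows.get(hash_key, {}).get(sort_key)

    def multi_get(self, hash_key, sort_keys):
        row = self._rows.get(hash_key, {})
        return {sk: row[sk] for sk in sort_keys if sk in row}

    def batch_get(self, keys):
        # keys: iterable of (hash_key, sort_key) pairs, possibly across HashKeys
        return {(hk, sk): self.get(hk, sk) for hk, sk in keys}

    def hash_scan(self, hash_key, start, stop):
        # Range scan over sort keys in [start, stop) under one HashKey
        row = self._rows.get(hash_key, {})
        return [(sk, row[sk]) for sk in sorted(row) if start <= sk < stop]
```

In this model, everything that touches a single HashKey (multiGet, multiSet, hashScan) works on one inner dictionary, while batchGet has to visit several of them.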
[Figure: Data structure. A Replica holds a MemoryTable in memory and an LSM tree of SSTables as its data.]
[Figure: Compaction flow. Labels: Write, MemoryTable, ImmuMemoryTables (in memory), SSTables, Minor, Major.]
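Read/write pain points
• Read amplification
– A read has to search level by level, from newest to oldest (top to bottom), until it finds the data it wants. This can take more than one I/O, and the effect is especially noticeable for range queries
• Space amplification
– Writes are sequential (append-only) rather than in-place updates, so expired data is not cleaned up right away
• Write amplification
– RocksDB uses background compaction to reduce read amplification (fewer SST files) and space amplification (expired data is cleaned up), but this in turn causes write amplification
• Throughput
– Limited by RPC and 2PC; the per-node throughput ceiling is about 10M/s for writes and 100M/s for reads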
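The read and space amplification described above can be seen in a toy LSM tree. The sketch below is illustrative only: `ToyLSM` is a hypothetical class, plain dictionaries stand in for SSTables, and it does not model RocksDB levels.

```python
class ToyLSM:
    """Toy LSM tree: one mutable memtable plus immutable runs, newest first."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []  # dicts standing in for SSTables, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: the memtable becomes a new immutable run.
            self.runs.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        """Return (value, probes); probes counts how many structures were searched."""
        probes = 1
        if key in self.memtable:
            return self.memtable[key], probes
        for run in self.runs:  # newest to oldest
            probes += 1
            if key in run:
                return run[key], probes
        return None, probes

    def compact(self):
        """Merge all runs into one, keeping only the newest value per key."""
        merged = {}
        for run in reversed(self.runs):  # oldest first, newer values overwrite
            merged.update(run)
        self.runs = [dict(sorted(merged.items()))] if merged else []
```

A key that lives only in the oldest run costs one probe per newer run before it is found, and stale versions of overwritten keys keep occupying space until `compact` rewrites everything, which is the write cost paid to reduce the other two.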
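BulkLoad mode
• Principle
– Tune RocksDB to raise QPS and throughput for bulk data loading
• Disable automatic compactions
• Allow an unlimited number of level-0 SST files
• Remove the write slowdown and write stall triggered by slow compactions
• Increase the size and number of write buffers
• Practice
– Switch to BulkLoad mode -> write data through the API -> writing completes -> Manual Compact -> switch back to Normal mode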
BulkLoad mode
• Tune RocksDB to raise QPS and throughput for bulk data loading
Parameter                              Value
disable_auto_compactions               true
level0_file_num_compaction_trigger     ∞
level0_slowdown_writes_trigger         ∞
level0_stop_writes_trigger             ∞
soft_pending_compaction_bytes_limit    no limit
hard_pending_compaction_bytes_limit    no limit
max_compaction_bytes                   ∞
write_buffer_size                      raised from 64M to 256M
max_write_buffer_number                raised from 4 to 6
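The parameter table can be written down as a small set of overrides applied when switching modes. This is a sketch under assumptions: the option names come from the table, but `options_for` is a hypothetical helper, "M" is read as MiB, and `UNLIMITED` is a placeholder because the slides give "∞" / "no limit" rather than concrete numbers.

```python
MB = 1024 * 1024
UNLIMITED = None  # placeholder: the slides say "∞" / "no limit", not a concrete value

# Only the normal-mode values stated on the slide are listed here.
NORMAL_MODE = {
    "write_buffer_size": 64 * MB,
    "max_write_buffer_number": 4,
}

BULKLOAD_OVERRIDES = {
    "disable_auto_compactions": True,
    "level0_file_num_compaction_trigger": UNLIMITED,
    "level0_slowdown_writes_trigger": UNLIMITED,
    "level0_stop_writes_trigger": UNLIMITED,
    "soft_pending_compaction_bytes_limit": UNLIMITED,
    "hard_pending_compaction_bytes_limit": UNLIMITED,
    "max_compaction_bytes": UNLIMITED,
    "write_buffer_size": 256 * MB,
    "max_write_buffer_number": 6,
}


def options_for(mode):
    """Return the option set for 'normal' or 'bulkload' (hypothetical helper)."""
    if mode == "normal":
        return dict(NORMAL_MODE)
    if mode == "bulkload":
        return {**NORMAL_MODE, **BULKLOAD_OVERRIDES}
    raise ValueError(f"unknown mode: {mode}")
```

Keeping the overrides separate from the normal settings makes the switch back to Normal mode a matter of dropping the overrides again.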
Implementation
System architecture
[Figure: System architecture. Three MetaServers (one master, two slaves) coordinated through Zookeeper, and three replica servers, each hosting a mix of primary and secondary replicas of partitions 0 to 3.]
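Online -> offline
• Data snapshot: generate a data snapshot through the RocksDB CheckPoint interface
• Data upload: upload the data snapshot from the ReplicaServer to HDFS
• Data parsing: parse the snapshot data for offline analysis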
Online -> offline
[Figure: Upload. Under MetaServer and Zookeeper, the replicas (Primary 0, Secondary 1, Secondary 2) upload their SST files to HDFS.]
Data parsing: Pegasus-Spark
Offline Analysis
• Convert into Hive (parquet)
• Use SparkSQL for analysis
[Figure: Replica servers -> HDFS -> RDD -> Hive, with a Schema applied.]
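Offline -> online
• Data generation: generate RocksDB's underlying data files through ETL in the offline system
• Data download: download the data files from HDFS to the ReplicaServer
• Engine import: load the data files through the RocksDB IngestSST interface
• Data compaction: run ManualCompaction to optimize read performance and space usage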
Data generation
[Figure: Executors 0 to 3 take the original data and schema, partition and sort it, remove duplicate keys, and each write an SST file.]
Data generation: Pegasus-Spark
Convert to SST file for Bulk load
[Figure: Transform (Pegasus-Spark). Original data on multiple nodes goes through Distinct, Repartition, and Sort, and is written to HDFS as SST files.]
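The Distinct, Repartition, and Sort steps matter because RocksDB's SST file writer expects keys in ascending order without duplicates. A minimal single-process sketch of the three steps follows; it is an illustration, not the Pegasus-Spark code. The hash in `partition_of` is a stand-in (an assumption, not the hash Pegasus uses), and sorting by the `(hash_key, sort_key)` tuple simplifies the engine's real encoded-key order.

```python
import zlib
from collections import defaultdict


def partition_of(hash_key: bytes, partition_count: int) -> int:
    # Stand-in hash for illustration only; not the hash Pegasus uses.
    return zlib.crc32(hash_key) % partition_count


def prepare_for_sst(records, partition_count):
    """records: iterable of (hash_key, sort_key, value) byte tuples; later records win.

    Returns {partition_index: [(hash_key, sort_key, value), ...]}, where each list
    is deduplicated and sorted by (hash_key, sort_key).
    """
    latest = {}
    for hash_key, sort_key, value in records:
        latest[(hash_key, sort_key)] = value  # distinct: one value per key

    partitions = defaultdict(list)
    for (hash_key, sort_key), value in latest.items():
        index = partition_of(hash_key, partition_count)  # repartition by hash key
        partitions[index].append((hash_key, sort_key, value))

    for rows in partitions.values():
        rows.sort(key=lambda r: (r[0], r[1]))  # sort within each partition
    return dict(partitions)
```

Each resulting list corresponds to the input of one SST file for one table partition, so a replica only has to download and ingest the files for the partitions it hosts.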
Data download
[Figure: Download. Under MetaServer and Zookeeper, the replicas (Primary 0, Secondary 1, Secondary 2) download SST files from HDFS.]
Data import
[Figure: Under MetaServer and Zookeeper, the replicas (Primary 0, Secondary 1, Secondary 2) import the data while clients continue to read and write.]
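Data compaction
• After data is imported, a full compaction must be run
• BulkLoad and compaction are decoupled: several BulkLoads can be followed by a single compaction
• To guard against users forgetting to compact after a BulkLoad, one compaction is forced in the early morning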
Integration overview
[Figure: integration overview across Spark, HDFS, and the Replica server (Table, Replica).
Import: 1. SSTWriter -> 2. Download SST -> 3. Ingest SST
Export: 1. CheckPoint -> 2. Upload SST -> 3. SSTReader
Schema change via Spark task: original schema data -> new schema data]
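Consistency issues
• Export
– The partitions of a table cannot be guaranteed to run CheckPoint at the same instant
– Suppose the export starts at T1 and the last partition finishes at T4. Each partition of the table is then a snapshot taken at some moment between T1 and T4
[Figure: timeline from Start, with Partitions 0 to 3 taking their checkpoints at different times T1 to T4.]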
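Consistency issues
• Import
– Ingestion on a single partition is atomic and cannot be rolled back
– Multiple partitions cannot be guaranteed to ingest at the same instant
– Ingestion of the whole table is not atomic and cannot be rolled back
– Writes to the cluster are blocked at the moment of ingestion
[Figure: import state flow. States: Downloading, Downloaded, Ingesting, Ingested, Success. Actions: Start Download, Start Ingest, Start CleanUp.]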
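The import flow can be pictured as a per-partition state machine in which the table as a whole is only as far along as its slowest partition. The sketch below is an assumption-laden illustration: the stage names follow the slide, but their strict linear order, the `advance` and `table_stage` helpers, and the absence of failure states are simplifications of this sketch, not the Pegasus implementation.

```python
from enum import IntEnum


class Stage(IntEnum):
    # Names follow the slide; the numeric order is an assumption of this sketch.
    DOWNLOADING = 1
    DOWNLOADED = 2
    INGESTING = 3
    INGESTED = 4
    SUCCESS = 5


def table_stage(partition_stages):
    """The table is only as far along as its slowest partition."""
    return min(partition_stages)


def advance(partition_stages, index):
    """Move one partition to its next stage; there is no transition backwards."""
    current = partition_stages[index]
    if current != Stage.SUCCESS:
        partition_stages[index] = Stage(current + 1)
    return partition_stages
```

Because each partition advances on its own and never moves backwards, there is a window in which some partitions already serve ingested data while others do not, which is the non-atomic, non-rollbackable behavior listed above.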
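Performance optimization
• Verification
– Remove redundant data verification
• RocksDB tuning
– move_files parameter
• DirectIO
– Enable DirectIO for file writes during the download phase
• Rate limiting
– Rate-limit data upload and download
– Limit the concurrency of download and ingest operations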
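The slides do not say how the upload and download limits are implemented. A token bucket is one common way to cap transfer rate, shown here purely as an illustration: `TokenBucket` and its parameters are hypothetical, and time is passed in explicitly so the logic stays deterministic.

```python
class TokenBucket:
    """Token-bucket limiter; the caller supplies the current time in seconds."""

    def __init__(self, rate_bytes_per_sec, burst_bytes, now=0.0):
        self.rate = rate_bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = now

    def try_consume(self, nbytes, now):
        """Return True and deduct tokens if nbytes may be transferred at time `now`."""
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to elapsed time, never above the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

A downloader would call `try_consume` before writing each chunk and wait when it returns False, which keeps bulk transfers from competing with online reads for disk and network bandwidth.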
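Applications in practice
International advertising algorithm
• Business background
– Static features: age, gender, occupation
– Behavior data: app installs, uninstalls, usage, logins, and so on
– Fed online to the recommendation engine, which predicts scores and recommends a list of apps
– User behavior over the past 30 days is computed offline every day and fully loaded into the store
– 300 million HashKeys, 41 SortKeys, 100+ features, 2.2TB of data in total
• Solution
– When the offline computation finishes, the data is handed to PegasusSpark to generate data files
– BulkLoad is used to load the data during the daily off-peak period, followed by compaction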
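International advertising algorithm
• Benefits
– Load time reduced from 12 hours to 1 hour (20 nodes, 100M rate limit)
– During loading, P99 latency of multi_get requests stays within 30ms
[Chart: multi_get latency]
International advertising algorithm
• Benefits
– Compared with random writes, disk IOPS and throughput drop noticeably
[Charts: disk write throughput per minute; disk read throughput per minute; disk read requests per minute; disk write requests per minute]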
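Zili data migration
• Background
– The feedsprofile table is migrated to a newly built, self-hosted cluster in the India data center; total data size 2.3T
– The source cluster runs version 2.0-wirte-optim, with data format V1
– To support HDFS, the new cluster must use V2
– The business allows writes to be stopped during off-peak hours for the migration
• Process
– Upgrade the source cluster from 2.0-wirte-optim to 2.2.2, which supports backup to HDFS
– The source cluster generates a data snapshot and uploads it to HDFS
– Spark offline computation
• Read and parse the data -> convert V1 to V2 -> sort and deduplicate -> partition the data -> write to HDFS
– BulkLoad loads the data on HDFS into the new cluster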
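Future outlook
• Engine tuning
– Disable seqno
• ReadOnly clusters
– Write data through periodic BulkLoad and tune only for reads
• Data migration
– Lossless data migration that is transparent to the business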
Data migration
[Figure: Clients read and write through MetaProxy (backed by Zookeeper) to a Source and a Destination cluster, each with a MetaServer and Replicas 0 to 3.
① duplication
② BulkLoad (ingest_behind)
③ stop R/W on the old cluster
④ start R/W on the new cluster]
https://pegasus.apache.org
Apache Pegasus
https://github.com/apache/incubator-pegasus
Thank you

The Construction and Practice of Apache Pegasus in Offline and Online Scenarios Integration

Editor's Notes

  1. https://juejin.cn/post/7098585141953429512
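  2. WAL + Compaction: scattered random write requests are all turned into batched sequential writes, which improves write performance
  3. Expired data is deleted during compaction
  4. Tuning of the move parameter is complete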