Yun Zhang
Track 2: Ecology and Solutions
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
4. Overview
[Architecture diagram, by layer]
• Control Terminal: Cluster Manage, Backup Recovery, Job Manage, Workflow Manage, Cloud Monitor, …
• Ecosystem: BulkLoad, BDS, DataX, Flink, XPack Spark, Kafka, MR
• Phoenix features: Paged Queries, Global Index, Search Index, Salt Table, UDF, Statistics Collection, Dynamic Columns, Local Index, Transactions
• Storage: HBase, Solr
• FileSystem: HDFS, Cloud Disk, OSS, Local Disk
• 200+ instances
• Maximum daily increment: 4 TB (single instance)
• Largest instance: 200 TB
• Largest table: 80 TB
5. About Ecosystem
• Data Migration Tools
BulkLoad
  Use scenario: > 100 million rows; large history/increment data
  Transmission method: MR/API reads the source and generates HFiles
  Data sources: Phoenix/Text/JSON/CSV/HBase
  Address: https://phoenix.apache.org/bulk_dataload.html
DataX
  Use scenario: < 100 million rows; small history/increment data
  Transmission method: API reads the source and writes the target
  Data sources: Phoenix/MySQL/PG/Hive/HBase/CSV…etc
  Address: https://github.com/alibaba/DataX
BDS
  Use scenario: history/increment/real-time data
  Transmission method: Copy HFile + WAL sync
  Data sources: Phoenix/HBase/MySQL
  Address: (provided by ApsaraDB Phoenix)
• X-Connector
Spark / Kafka / Flume / Hive / Pig / Flink / MapReduce
7. Use Cases Summary
Business Requirements
• Query: millisecond-to-second latency
• Write: high throughput
• Scale out
• Scale up (an advantage of cloud products)
Business Features
• Well-known and few query patterns
• The filters of the WHERE clause hit a result set of fewer than 1 million rows
• Non-transactional, or cross-table/cross-row transactions
• Online/offline business
Some typical reasons why users choose Phoenix
• An RDBMS (MySQL) slows down when the data size grows to the TB level
• Sharding an RDBMS makes the business logic complex
• The latency of some operational queries is too high on a data warehouse (Hive/ODPS)
17. Global Index Query
When some projected columns are not indexed:
1. Use the filters to retrieve the primary keys (pk) from the index table
2. Generate a new SQL statement: select * from dataTable where pk in (x1,x2,x3…)
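The two-step lookup can be sketched with in-memory stand-ins. This is a simulation, not Phoenix code: the table contents, the column names col_1/col_2 and the helper names are hypothetical, and bind placeholders (?) are used where the slide writes x1,x2,x3.

```python
# Hypothetical data table keyed by pk, and a global index on col_1
# stored as (indexed value, pk) pairs.
data_table = {
    1: {"pk": 1, "col_1": "28", "col_2": "a"},
    2: {"pk": 2, "col_1": "30", "col_2": "b"},
    3: {"pk": 3, "col_1": "28", "col_2": "c"},
}
index_table = [("28", 1), ("28", 3), ("30", 2)]


def lookup_pks(value):
    """Step 1: use the filter to retrieve pks from the index table."""
    return [pk for indexed_value, pk in index_table if indexed_value == value]


def build_sql(pks):
    """Step 2: generate the follow-up query against the data table."""
    placeholders = ",".join("?" for _ in pks)
    return f"select * from dataTable where pk in ({placeholders})"


def global_index_query(value):
    pks = lookup_pks(value)
    sql = build_sql(pks)
    # Stand-in for executing `sql` with `pks` bound: fetch the full rows,
    # including col_2, which the index does not cover.
    rows = [data_table[pk] for pk in pks]
    return sql, rows


sql, rows = global_index_query("28")
```

The size of the pk list grows with the number of index hits, which is the limitation the next slide addresses.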
18. Problems & Solutions
• Problems
1. The result set hit in the index table has a size limit
2. Querying the primary table is inefficient, especially for big tables
• Solutions
1. Push the filters of the primary table down to the server side
2. Batch query (multi get) the primary table on the server side while scanning the filtered data from the index table
3. Return tuples of the projected primary-table columns to the client
4. The client performs the merge sort and top N
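A rough simulation of the four solution steps, assuming each region returns its partial result already sorted. The region layout, the filter, batch_size and all function names are hypothetical; the real work happens inside the region servers and the Phoenix client.

```python
import heapq
from itertools import islice


def region_scan(index_rows, primary_table, index_value, row_filter, batch_size=2):
    """Simulated server side of one region."""
    pks = [pk for value, pk in index_rows if value == index_value]
    results = []
    for i in range(0, len(pks), batch_size):
        batch = pks[i:i + batch_size]
        rows = [primary_table[pk] for pk in batch]  # one multi get per batch
        # Pushed-down filter, then projection to the requested columns.
        results.extend((r["pk"], r["col_2"]) for r in rows if row_filter(r))
    return sorted(results)


def client_top_n(region_results, limit, offset=0):
    """Client side: merge the sorted partial results, then apply offset/limit."""
    merged = heapq.merge(*region_results)
    return list(islice(merged, offset, offset + limit))


# Hypothetical primary table: even pks have col_1 = '28'.
primary_table = {
    pk: {"pk": pk, "col_1": "28" if pk % 2 == 0 else "30", "col_2": pk * 10}
    for pk in range(1, 9)
}
region_a = [("28", 2), ("28", 6), ("30", 1)]
region_b = [("28", 4), ("28", 8), ("30", 3)]

partials = [
    region_scan(region, primary_table, "28", lambda r: r["col_2"] > 20)
    for region in (region_a, region_b)
]
top = client_top_n(partials, limit=2, offset=1)
```

Only filtered, projected tuples cross the network, and the client never has to build a pk list.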
19. Performance Improvement
• Rows: 5 million rows
• Query: select /*+INDEX(TT IDXTT)*/ * from Test where col_1 = '28' limit 500 offset 50
[Bar chart: latency (ms), y-axis 0 to 60000, for three cases: original, optimization + non bloomfilter, optimization + bloomfilter; bars annotated 8X and 4X]
Average 10X performance improvement
21. Query Server Tips
1. The default format of the Date type differs between the thick client and the thin client: yyyy-MM-dd hh:mm:ss.SSS and yyyy-MM-dd, respectively
2. Date columns cannot be used in aggregation or GROUP BY when the format is yyyy-MM-dd
3. When using a round-robin HTTP load balancer, set mode = TCP
4. Query Server query throughput (OPS) is mainly determined by the number of regions scanned per query
5. Protocol Buffers is the recommended serialization option
6. The thin client uses the JVM time zone by default; the thick client uses the GMT time zone
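Tips 1 and 6 combined mean that the same stored instant can show up as different dates in the two clients. A small Python illustration, not Phoenix code: the UTC+8 zone standing in for the JVM default is an assumption, and Python format codes replace the Java-style patterns.

```python
from datetime import datetime, timedelta, timezone

GMT = timezone.utc
ASSUMED_JVM_ZONE = timezone(timedelta(hours=8))  # assumption: JVM runs in UTC+8


def render(instant, zone, pattern):
    """Format one instant as seen from the given time zone."""
    return instant.astimezone(zone).strftime(pattern)


# One instant, stored once.
instant = datetime(2019, 7, 19, 16, 30, tzinfo=timezone.utc)

# Thick-client style: GMT, date plus time.
thick_style = render(instant, GMT, "%Y-%m-%d %H:%M:%S")
# Thin-client style: JVM time zone, date only.
thin_style = render(instant, ASSUMED_JVM_ZONE, "%Y-%m-%d")
```

Here the thick-client style shows 19 July while the thin-client style shows 20 July, so comparing results across clients needs an explicit format and time zone.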
22. Avoid Usage Pitfalls
1. When bulk loading text data into a primary table that has index tables, the row keys must be unique; otherwise the index data will be out of sync
2. VARCHAR fields:
• An empty string is stored as NULL
• ‘\0’ is a reserved value that should not appear in actual data
3. Avoid DESC on indexed columns in the CREATE INDEX clause. The indexed data is converted to a variable-length data type for storage, so queries on these fields may return incorrect results
23. Best practices
1. For big data scenarios, a pre-split table is a better choice than a salted table
2. Use secondary indexes or the primary key to accelerate ORDER BY and GROUP BY queries
3. Keep redundant indexed columns and the number of index tables to a minimum
4. Set autocommit = true before executing delete from … where…
5. Set the UPDATE_CACHE_FREQUENCY parameter when creating a view
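One reason often given for practice 1 (the slide itself does not state it) is that a salt prefix breaks the global ordering of row keys, so a range scan has to visit every bucket, while a pre-split table only touches the regions that overlap the range. A minimal sketch; the bucket count, split points and the CRC32 hash are stand-ins, not Phoenix's actual salting function.

```python
import bisect
import zlib

SALT_BUCKETS = 4                # hypothetical bucket count
SPLIT_POINTS = ["g", "n", "t"]  # hypothetical split keys, giving 4 regions


def salted_bucket(row_key):
    """Bucket chosen from a hash of the row key (stand-in hash)."""
    return zlib.crc32(row_key.encode()) % SALT_BUCKETS


def regions_for_range_salted(start, stop):
    """Keys in [start, stop) may hash to any bucket: scan all of them."""
    return list(range(SALT_BUCKETS))


def regions_for_range_presplit(start, stop):
    """Only the regions overlapping [start, stop) are scanned (start < stop)."""
    first = bisect.bisect_right(SPLIT_POINTS, start)
    last = bisect.bisect_left(SPLIT_POINTS, stop)
    return list(range(first, last + 1))


salted = regions_for_range_salted("h", "p")
presplit = regions_for_range_presplit("h", "p")
```

For the range h to p the salted layout scans all four buckets, while the pre-split layout scans only the two regions that contain those keys.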
25. • Search Index
• Supports native SQL
• CBO
• Index merge
• Support cancelling full-scan or slow queries
• Query Server memory management
• Continue contributing to the community