Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Jingwen Ouyang (Product Manager, @Alluxio)
In this session, Jingwen presents an overview of using Alluxio Edge caching to accelerate Trino or Presto queries. She shares best practices for using distributed caching with compute engines and draws on real-world examples.
5. The Evolution of the Modern Data Stack
10 Years Ago: Tightly-Coupled
MapReduce & HDFS
On-Prem HDFS, YARN
Today: Compute-Storage Separation
Cloud Data Lake, K8s/Containerization
More Elastic, Easier to Manage, More Scalable
Loses Data Locality
6. The Challenges of Losing Data Locality
Slow and inconsistent data access performance
● Longer queries, slower insight
● More cost on the cluster
Fast-growing cloud storage costs
● API call costs
● Data egress costs
High data operation costs when migrating to the cloud
● Data copy pipeline maintenance
● Error prone
7. A Framework to Fit Different Needs
L1: Alluxio Edge
Run as a library in the compute worker process to leverage local disks (typically NVMe)
L2: Alluxio Enterprise Data
Standalone cache service across applications for virtualization
Alluxio Enterprise Data
Caching, Virtualization, Data Management
Alluxio Edge
9. Large Scale Analytics with Trino / PrestoDB
ALLUXIO'S SOLUTION
RESULTS ACHIEVED
Real-time responses & analysis, while saving costs on S3 storage
End to End Query Performance Improvement
I/O Speed-up
Cloud Storage Cost Reduction
PUBLIC CLOUD / ON PREM
1.5-10x
Trino Node
Alluxio Edge
Reduced Network Congestion
Off Load Under Storage
50-90%
10-50x
>10%
Alluxio Edge Dashboard
Cluster summary
Cost saving
Resource status
10. A Deeper Dive - How Does Alluxio Edge Work?
Alluxio Edge Cache File System
Alluxio Edge Cache Manager
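The slide names only two components, a cache file system and a cache manager. As an illustration of how such a pair is commonly arranged (an assumption, not a description of Alluxio Edge internals), the sketch below shows a read-through page cache: the file-system wrapper splits each read into fixed-size pages, and the manager serves each page locally or loads it once from the under storage. All class names, the page size, and the `under_read` callback are hypothetical.

```python
from typing import Callable, Dict, Tuple

PAGE_SIZE = 4  # hypothetical, tiny so the example is easy to trace


class CacheManager:
    """Hypothetical page store keyed by (path, page_index)."""

    def __init__(self) -> None:
        self.pages: Dict[Tuple[str, int], bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, path: str, index: int, load: Callable[[], bytes]) -> bytes:
        key = (path, index)
        if key in self.pages:
            self.hits += 1
        else:
            # Cache miss: fetch the page once and keep it locally.
            self.misses += 1
            self.pages[key] = load()
        return self.pages[key]


class CacheFileSystem:
    """Hypothetical wrapper that turns byte-range reads into page lookups."""

    def __init__(self, under_read: Callable[[str, int, int], bytes],
                 manager: CacheManager) -> None:
        self.under_read = under_read  # reads (path, offset, length) from storage
        self.manager = manager

    def read(self, path: str, offset: int, length: int) -> bytes:
        if length <= 0:
            return b""
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        chunks = []
        for index in range(first, last + 1):
            loader = lambda i=index: self.under_read(path, i * PAGE_SIZE, PAGE_SIZE)
            chunks.append(self.manager.get(path, index, loader))
        data = b"".join(chunks)
        start = offset - first * PAGE_SIZE
        return data[start:start + length]


# Usage with an in-memory stand-in for the under storage.
store = {"s3://bucket/t.parquet": b"0123456789abcdef"}
calls = []


def under_read(path: str, offset: int, length: int) -> bytes:
    calls.append((offset, length))
    return store[path][offset:offset + length]


fs = CacheFileSystem(under_read, CacheManager())
first_read = fs.read("s3://bucket/t.parquet", 2, 6)   # touches pages 0 and 1
second_read = fs.read("s3://bucket/t.parquet", 4, 4)  # page 1 only, already cached
```

In this sketch the second read never reaches the under storage, which is the effect the deck describes as offloading the under storage and reducing network congestion.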
11. Key Capabilities of Alluxio Edge
Caching Data
Local SSD
Memory
Connector Support
Iceberg
Hudi
Delta Lake
Hive
Data Formats
Parquet
ORC
CSV
Txt
JSON
Avro
Flexible Cache Eviction / Admission
LRU and FIFO eviction
Customized admission
TTL
Data quota
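To make the eviction and admission options listed above concrete, here is a minimal sketch of a local cache that combines LRU eviction, a custom admission filter, a TTL, and a byte quota. It is a generic illustration, not Alluxio Edge code: the `LocalCache` class, its parameters, and the example admission rule are all hypothetical.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class LocalCache:
    """Hypothetical cache with LRU eviction, admission filter, TTL and quota."""

    def __init__(self, quota_bytes: int, ttl_seconds: float,
                 admit: Callable[[str, bytes], bool] = lambda key, value: True,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.quota = quota_bytes
        self.ttl = ttl_seconds
        self.admit = admit
        self.clock = clock
        self.entries: "OrderedDict[str, tuple]" = OrderedDict()  # key -> (value, expires_at)
        self.used = 0

    def put(self, key: str, value: bytes) -> bool:
        # Admission: reject oversized values and anything the filter refuses.
        if len(value) > self.quota or not self.admit(key, value):
            return False
        if key in self.entries:
            self._remove(key)
        # Quota: evict from the least recently used end until the value fits.
        while self.used + len(value) > self.quota:
            self._remove(next(iter(self.entries)))
        self.entries[key] = (value, self.clock() + self.ttl)
        self.used += len(value)
        return True

    def get(self, key: str) -> Optional[bytes]:
        item = self.entries.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self.clock() >= expires_at:  # TTL expired
            self._remove(key)
            return None
        self.entries.move_to_end(key)  # LRU touch; drop this line for FIFO
        return value

    def _remove(self, key: str) -> None:
        value, _ = self.entries.pop(key)
        self.used -= len(value)


# Usage with a fake clock so TTL behaviour is deterministic.
now = [0.0]
cache = LocalCache(quota_bytes=8, ttl_seconds=10,
                   admit=lambda key, value: key.endswith(".parquet"),
                   clock=lambda: now[0])
cache.put("a.parquet", b"aaaa")
cache.put("b.parquet", b"bbbb")
rejected = cache.put("c.csv", b"cc")   # refused by the admission filter
cache.get("a.parquet")                 # makes b.parquet the LRU entry
cache.put("d.parquet", b"dddd")        # quota reached, evicts b.parquet
```

Removing the `move_to_end` call turns the same structure into FIFO eviction, since entries are then evicted purely in insertion order.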
13. Local Dashboard for Insight
Content
● Cluster summary
● Cost saving
● Resource status
Value
● Easy ROI demonstration
● Monitor the cluster
● Tuning advice
15. User Case 1: Trino in the cloud - BI Query
Execution time:
○ 4X faster performance
ROI at scale:
○ Freed up 30% compute capacity to give to other applications while increasing Trino traffic by 20%
16. User Case 2: Presto w/ Alluxio Edge On Cloud
Deployed in the cloud:
● Scale: 1500 nodes, 500k queries/day, 90PB data read/day
Cost reduction and performance improvement:
● Up to 80% reduction in the number of read requests to the cloud
● P90 query latency reduced from 228s to 50s
[Charts: # of read requests and query latency, without cache vs. with cache]
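As a quick check on what the quoted numbers imply, the arithmetic below derives the P90 speedup, the relative latency reduction, and the average data read per query. Only the four input figures come from the slide; the variable names are illustrative, and the per-query average assumes decimal units (1 PB = 1,000,000 GB).

```python
# Figures quoted on the slide.
p90_before_s = 228
p90_after_s = 50
queries_per_day = 500_000
data_read_pb_per_day = 90

speedup = p90_before_s / p90_after_s
latency_cut_pct = (p90_before_s - p90_after_s) / p90_before_s * 100
# Assumption: decimal units, 1 PB = 1,000,000 GB.
avg_read_gb_per_query = data_read_pb_per_day * 1_000_000 / queries_per_day

print(f"P90 speedup: {speedup:.2f}x")                # 4.56x
print(f"P90 latency cut: {latency_cut_pct:.0f}%")    # 78%
print(f"Average data read per query: {avg_read_gb_per_query:.0f} GB")  # 180 GB
```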
17. User Case 3: Trino w/ Alluxio Edge On Local HDFS
Key pain point solved: Unstable HDFS
● Reduced traffic to HDFS
↓ 40% Query Latency (Second)
10x IO throughput (MB)
19. Alluxio Edge: A Framework to Fit a Different Need
L1: Alluxio Edge
… when hot data fits within disks on a single server
How? Run as a library in the compute worker process to leverage local disks (typically NVMe)
No data sharing or communication across workers
L2: Alluxio Enterprise Data
… when requiring
(a) cache capacity that scales out horizontally, or
(b) cross-region access
How? Standalone cache service across applications
What else is different from Edge?
(a) Virtualization
(b) Data Management
Alluxio Enterprise Data
Caching, Virtualization, Data Management
Alluxio Edge